Mathematical Problems in Engineering

Special Issue: Computational Intelligence Approaches to Robotics, Automation, and Control

Research Article | Open Access

Volume 2014 | Article ID 678210 | 20 pages | https://doi.org/10.1155/2014/678210

Cooperative Behaviours with Swarm Intelligence in Multirobot Systems for Safety Inspections in Underground Terrains

Academic Editor: Leo Chen
Received: 06 Feb 2014
Revised: 16 May 2014
Accepted: 26 May 2014
Published: 20 Jul 2014

Abstract

Underground mining operations are carried out in hazardous environments. To prevent disasters from occurring as often as they do in underground mines, and to shield the personnel who perform routine safety checks from those hazards, multirobots are suggested to carry out the safety inspections instead of human beings or single robots. Multirobots are preferred because the inspection task can then be completed in the minimum amount of time. This paper proposes a cooperative behaviour for a multirobot system (MRS) to achieve preentry safety inspection in underground terrains. A hybrid QLACS swarm intelligence model based on Q-learning (QL) and the ant colony system (ACS) is proposed to achieve this cooperative behaviour in MRS. The intelligent model was developed by harnessing the strengths of both the QL and ACS algorithms: ACS optimizes the route used by each robot, while QL enhances the cooperation between the autonomous robots. A description of a communicating variant within the QLACS model for cooperative behavioural purposes is presented. The performance of the algorithms was evaluated by simulation, comparing operation with and without communication and measuring computation time, path costs, and the number of robots used. Simulation results show that cooperative behaviour between robots is achieved.

1. Introduction

Multirobot systems share the need to cooperate, which creates the problem of modelling behaviour. When dealing with multiple robots, where randomness is involved, the dynamic and unpredictable nature of the environment has to be considered, so the behavioural modelling system must cope with that randomness. This cooperative and coordination problem of multirobot systems (MRS) has captivated researchers in recent times. Literature on multiple robots’ cooperation implemented in space was reviewed in [1]. Using multiple robots to achieve tasks has proved more effective than using a single robot; see for instance [2, 3] (and all references therein) for some specific robotic tasks.

For a full discussion of underground mines, benefits, and disaster rates, see [4]. Safety is a major element in the underground mine; despite a significant reduction through safety research measures, the number of disasters in underground mines in South Africa and the world at large remains high [5]. To contribute to underground mine safety, that is, to prevent disasters from occurring, interacting autonomous multirobots are sent before mine operations resume to inspect how safe the underground mine is. This is achieved by assessing the status of the rocks, roofs, and level of toxic gases.

A multirobot planning algorithm for tunnel and corridor environments is implemented in [6]. The overall problem formulation is implemented using a topological graph and spanning representation. Activities, such as a single robot drive and differential robot drives that can rotate in place, are carried out at different positions of the graph. Different methods have been used to tackle coordination of MRS in different domains [7]. Complete task coordination by multirobots was handled in [3, 8] using an extension of a market-based approach. This was achieved by generalizing task descriptions into task trees, thereby allowing tasks to be traded in a market setting at a variable level of abstraction. Figure 1 shows an overview of the inspection environment.

We consider a scenario in which an MRS cooperates to achieve a common goal of safety inspection in a partially observable environment such as the underground terrain. Each robot needs to be guided using the proposed cooperative behavioural model. Finding a plan to achieve this cooperative model often involves solving an NP-hard problem, because multiple interacting robots are involved. The interaction, which arises from minimizing the time needed to achieve the underground inspection, requires reasoning among the robots. In this case, we use the QL technique to achieve the reasoning aspect of this work and combine it with optimal route finding using ACS. The two major questions answered in this problem are the following: which state/room should I inspect now without repetition and without colliding with another robot? How do I get there using the closest link? All robots have partial knowledge of the environment, and they are all equipped with the sensors required for the inspection. The major contributions of this paper are as follows:
(i) development of a hybrid QLACS swarm intelligence model based on Q-learning and the ant colony system for addressing cooperative behaviours in MRS achieving underground terrain safety inspections;
(ii) detailed experimental evaluation of the proposed QLACS model conducted on a simulated, publicly available underground tunnel, with the proposed model benchmarked against related methods;
(iii) systematic description of worked scenarios on an optimal route finder and MRS cooperative inspection using QLACS, for ease of implementation by system engineers, robotics researchers, and practitioners.

We do not know of any study that analytically describes how a swarm intelligent model can easily be implemented and applied to a multirobot system in a heterogeneous environment. In the next section, we discuss related work in the area of multirobot systems for underground terrains’ safety, reinforcement learning (RL), and ant colony optimization (ACO). We then detail our proposed cooperative behaviour framework and approach. The experimental results and evaluations of the safety inspections follow. Finally, we conclude by summarizing and discussing other future extensions.

2. Theoretical Background

In this section, we review some of the work related to MRS for underground terrain safety, which is the foundational domain for our research. A look at RL algorithms and their paradigms follows in Section 2.1. ACO and its paradigm applied in MRS conclude Section 2.

2.1. Related Multirobot Systems for Underground Terrains Safety

Effective, exhaustive, quick, and safe navigation of an MRS is subject to its ability to perceive, interpret, and interact (cooperate) within its environment. MRS can be used to achieve different tasks autonomously in structured and unstructured environments. In these environments, the automation of operations is effected in different areas of MRS research: biologically inspired robot teams, communication, architectures and task planning, localization and mapping, object transportation and manipulation, learning, and so forth [9]. However, an environment such as the underground terrain has remained a grey application domain in MRS research.

As indicated in Section 1, Peasgood et al. [6] implemented a multiphase algorithm for multirobot planning in tunnel environments. The algorithm as presented assumed a centralized planning architecture. They further compared the multirobot planning with a sequential planner. Their future work was to consider a decentralized planner architecture and might explore hybridizing the two planning algorithms.

Thorp and Durrant-Whyte gave an introduction to field robots [10]. From their discussion, field robotics involves the automation of platforms such as air, sea, and land vehicles in harsh unstructured environments such as underwater exploration, mining, agriculture, and highways. Field robots are made up of three parts: navigation and sensing, planning and control, and safety. Their work also discussed the challenges and progress of field robots. One of the major challenges of field robots in harsh environments such as an underground mine is the problem of position determination (localization), because the global positioning system (GPS) can only help in environments where sky views are guaranteed. However, progress has been made in automating some of the environments that are cooperatively constrained.

The proposed model presented in [4] promised to handle the safety part of field robots presented by Thorp and Durrant-Whyte [10] in underground mines. Their model architecture had three layers; the first layer handles the cooperative behaviour of the model, the second layer deals with the scalability degree of the model, and the last layer handles the applicability of the model. This paper is building on what has already been proposed in [4].

An investigation into automating the underground mine environment after blasting, called “making safe,” was carried out in [11] to ensure the safety of the environment after blasting. Blasting in an underground mine is the controlled use of explosives to excavate, break down, or remove rock. The need to investigate the stability of the environment after blasting, before any mining operation takes place, is of the highest priority; hence the reason for automation. The automation was centred on a persistent area of concern in South African underground mine operations called hanging walls, which are caused by rock bursts and falls of ground. There are also other persistent concerns, such as the levels of toxic gases, which pose great disaster threats to the lives of miners, for instance, heat sickness, explosions, and pneumoconiosis (occupational safety and health-fall of ground management in South Africa, SAMRASS-code book for mines). Some of these disasters might result in fatalities and/or disabilities. Again, when an accident happens during mining operations, rescuers find it difficult to respond immediately. Given the aforementioned concerns, there is a need to create models for safety inspection of underground mine operations. For instance, monitoring the underground mine environment for hazardous gases and/or smoke should be one of the important safety measures. Continuous monitoring of workers and equipment is another crucial safety measure [5]. Picking up the sound of roof cracking, to monitor when a roof is about to fall, is also a safety item.

2.2. Reinforcement Learning

For a robot to operate in a harsh unstructured environment, considering every possible event when defining its behaviour is intricate [12]. It is, however, essential to develop robots that can adapt to changes in their environment. RL is one of the artificial intelligence (AI) approaches that can achieve learning by experience. This enhances robots’ interaction with the environment. We investigate the use of RL to assist the behaviours of an MRS in safety inspection of underground mines. RL proceeds as a sequence of actions and state transitions with some associated rewards. What is being learned in RL is an optimal policy: what is the right thing to do in any state s [13].

At any time step, each robot is in a specific state s in relation to the environment and can take one of the following actions: inspect, ignore, or shutdown. Each robot receives feedback after performing an action, which explains the impact of the action in relation to achieving the goal. The effect of the action can be a good or a bad reward, measured in terms of values. Therefore, the value of taking an action a in any state s of the underground terrain is measured using the action-value function called the Q-value, Q(s, a). When a robot starts from state s, takes action a, and follows a policy π, it achieves an expected return defined as the sum of the discounted rewards.
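In standard RL notation (our transcription, since the paper's inline symbols were lost in extraction), the expected return described above is:

```latex
Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a \right],
```

where γ ∈ [0, 1) is the discount rate and r_{t+k+1} is the reward received k + 1 steps after time t.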

In this research, Q-learning, a method of RL, is explored. The purpose of RL methods is to learn values so as to take optimal actions in the states. QL is an online RL method that requires no model of the environment and stores the resulting reinforcement values in a look-up table. The QL architecture used in this work consists of learning threads, one per robot involved in the inspection behavioural task. Each robot in its learning thread carries out Q-learning in the environment. Algorithm 1 shows the QL algorithm used in the behavioural model.

Steps:
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
   Initialize s
   Repeat (for each step of episode):
    Choose a from s using a policy derived from Q
    Take action a, observe r, s′
    Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
    s ← s′
   until s is terminal

α is the learning rate, set between 0 and 1. At 0, Q-values are never updated, hence nothing is learned; learning occurs most quickly at 1. γ is the discount rate, also set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards. max_a′ Q(s′, a′) is the maximum value attainable in the state following the current state, that is, the value of taking the optimal action thereafter.
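The update rule above can be sketched in a few lines of Python; the state space, reward, and ε-greedy parameters below are illustrative placeholders, not the paper's configuration:

```python
# Minimal tabular Q-learning sketch of the update in Algorithm 1.
# The 8-state space, reward values, and epsilon are illustrative assumptions.
import random

random.seed(0)

N_STATES, ACTIONS = 8, ["inspect", "ignore", "shutdown"]
ALPHA, GAMMA, EPSILON = 0.5, 0.8, 0.1   # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def choose_action(s):
    """Epsilon-greedy policy derived from Q."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# One illustrative step: inspecting an uninspected state yields reward 50.
update(0, "inspect", 50, 1)
```

With all Q-values initialized to zero, this single step raises Q(0, inspect) to 0.5 × 50 = 25.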

QL is a competitive, search-based algorithm inspired by computational learning theory. It is not necessarily a multiagent algorithm but can be adapted to a multiagent or multigoal scenario. The success of the algorithm relies on the value and policy iterations, which can be biased by heuristics to fit the problem scenario at hand. The most competitive action is selected by its value, and each action leads to another state or condition; both value and policy are updated after each decision. Harnessing plain QL for an MRS scenario increases cost exponentially, and overall performance drops correspondingly: as the number of robots grows, costs such as completion time, memory usage, and the awareness factor (tracking other robots in the environment) increase, and search time increases with them. However, following our heuristic variant of QL, which is mainly determined by the policy we set to achieve our goal, our QL performs well above traditional QL.

2.3. Ant Colony Optimization

ACO is a type of swarm intelligence (SI). Bonabeau et al. [14] defined SI as any attempt to design algorithms or distributed problem-solving devices inspired by the collective behaviour of social insect colonies and other animal societies; in other words, any algorithm inspired by the collective behaviour of swarms qualifies as swarm intelligence. Researchers have been thrilled by swarms because of some of their fascinating features. For instance, the coordinated manner in which insect colonies work although no single member of the swarm is in control, the coordinated way in which termites build giant structures and flocks move as one body, and the harmonized way in which ants quickly and efficiently search for food can only be attributed to an emergent phenomenon [15, 16].

Ants, an example of a social insect colony, achieve their self-organizing, robust, and flexible nature not by central control but by stigmergy. Stigmergy, also known as a pheromone trail, describes the indirect communication that exists within the ant’s colony. The indirect communication which is triggered by pheromone trails helps in recruitment of more ants to the system. Ants also use pheromones to find the shortest paths to food sources. Pheromone evaporation prevents stagnation, which also helps to avoid premature convergence on a less than optimal path [17].

ACO is necessarily a multiagent-based algorithm. The first implementation of this optimization algorithm was on a classical combinatorial search problem, the travelling salesman problem, giving the shortest path to a specified destination. After this feat, many researchers have used it or its variants to model different problems. In this work, a variant of ACO is used to find the optimal path for the MRS. Table 1 lists variants of ACO [18]; see also [19] and all references therein.


S/N    Year    Algorithm

1      1991    Ant system (AS)
2      1992    Elitist AS
3      1995    Ant-Q
4      1996    Ant colony system (ACS)
5      1996    Max-Min AS (MMAS)
6      1997    Rank-based AS
7      1999    ANTS
8      2000    BWAS
9      2001    Hypercube AS

In AS, the pheromones are updated by all the ants that complete a tour. ACS is a modified version of AS which introduces the pseudorandom proportional rule. In elitist AS, edges that belong to the global best tour receive an additional amount of pheromone during the pheromone update. MMAS introduces an upper and a lower bound on the values of the pheromone trails. In rank-based AS, all solutions are ranked according to their tour lengths [20].
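As an illustration of the pseudorandom proportional rule that distinguishes ACS from AS, the sketch below chooses the next node either greedily (with probability q0) or by pheromone-biased sampling; the parameter values, dictionaries, and function name are our own illustrative choices, not the paper's:

```python
import random

random.seed(1)

Q0 = 0.9                  # exploitation threshold of the pseudorandom proportional rule
ALPHA_, BETA = 1.0, 2.0   # pheromone and visibility influence

def acs_next_node(current, unvisited, tau, eta):
    """ACS state transition: exploit the best edge with probability Q0,
    otherwise sample proportionally to tau^alpha * eta^beta."""
    weight = lambda j: (tau[(current, j)] ** ALPHA_) * (eta[(current, j)] ** BETA)
    if random.random() < Q0:
        return max(unvisited, key=weight)        # exploitation
    total = sum(weight(j) for j in unvisited)     # biased exploration
    r, acc = random.random() * total, 0.0
    for j in unvisited:
        acc += weight(j)
        if acc >= r:
            return j
    return unvisited[-1]

# Illustrative pheromone (tau) and visibility (eta) values on two edges.
tau = {("A", "B"): 0.5, ("A", "C"): 0.1}
eta = {("A", "B"): 1.0, ("A", "C"): 1.0}
nxt = acs_next_node("A", ["B", "C"], tau, eta)
```

Here the edge to B carries more pheromone, so the exploitation branch picks it.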

Our interest is in implementing ACS to find the best possible round trip inspection path in our model environment. This best possible inspection path will be supplied as input or made available for the robots using QL for inspection decisions. In most of the route finding scenarios, the nodes are linearly joined and are not largely distributed. This is a scenario where ants can forage along at least two paths (multigoal path environment, though in our case there is a path exposed to four different goals; see Section 3).

3. Proposed Cooperative MRS Behaviour Framework

The following are some of the factors we are considering in the course of building the model. Each robot has to learn to adjust to the unknown underground mine environment. In this case, each robot requires intelligence and a suitable machine learning algorithm is adopted. No robot has global information of the environment because of its limited sensing capabilities. Each robot has limited communication capabilities; therefore, each robot has to keep track of the information of others to remain in a team. Figure 2 describes the proposed framework as an approach for building the distributed, coordinated inspecting model for MRS.

The framework indicates three layers of the model: the bottom layer, the middle layer, and the topmost layer. The learning capability of the MRS is achieved in the bottom layer, with the reinforcement learning algorithm and swarm intelligence technique. This intelligence enables robot A to take action knowing the action of robot B and vice versa. At the middle layer, scalability in terms of the number of robots the system can accommodate is achieved. This is expedient because a team of robots tends to achieve tasks more quickly and effectively than single robots. This scalability is handled with some memory management techniques, as indicated in Figure 2. The real life implementation is achieved by using the information acquired from the topmost layer. Figure 3 is the breakdown of the framework in Figure 2.

There is a base station or server that serves as a backup for the information captured and analysed from individual robots. The model proposed in this research deals with the way robots find their way within communication range and uses a broadcast approach for effective communication of navigation status. A team of robots cooperatively inspecting an area in the underground mine needs to know where its members are and where to go next, so this is a continuing problem; our contribution here is that before robot R1 takes an action, it broadcasts its location and inspection status to the other robots (R2, R3, and so forth), and vice versa. An unreachable robot receives packets of information, based on the destination address, through rerouting from nearer robots. The reliability of this broadcast method lies in the ability to determine the extent of the task already executed by looking at the memory of any leading robot in the team.

3.1. Problem Formulations

Suppose we have seven rooms/states connected by doors/links representing underground mine regions, as shown in Figure 4 and labeled as shown in Table 2. We label the rooms A through H; the outside of the mine can be thought of as one big room, H. Notice that two of the doors lead outside the mine to H. We put the two robots in their respective starting rooms as starting states. The robots inspect one state at a time, considering the obstacles encountered in the process. The robots can change direction freely, having links/doors to other rooms/states. In each of the states in Figure 4, two actions are possible: inspection of roof cracks (RC) and the level of toxic gases (TG), or ignoring the state because it has been inspected earlier. According to our proposed model, a robot must inspect at least half of the given underground mine region to attain its maximum performance, which in turn attracts a good reward. When the current state of a robot is the end point of the inspection task, the robots exit the mine using their respective exit points. The possible transition states of the robots are displayed in the state diagram in Figure 5. The state diagram and the transition matrix are formed using the QL algorithm.


State                              Possible actions

1  A (lower left part (LLP))       Inspect, ignore
2  B (lower middle part (LMP))     Inspect, ignore
3  C (lower right part (LRP))      Inspect, ignore
4  D (middle left part (MLP))      Inspect, ignore
5  E (central part (MCP))          Inspect, ignore
6  F (upper left part (ULP))       Inspect, ignore
7  G (upper right part (URP))      Inspect, ignore
8  H (outside mine part (OMP))     Shutdown

The global map is assumed to be known by the robots but there is no prior knowledge of the local map. Robots only know what they have sensed themselves and what their teammates communicate to them. Not only does our improved QL depend on the map of the environment but also each robot learns through experience about local changes. They explore and inspect from state to state until they get to their goal states. Our proposed model QLACS achieves an offline mode for the route finding algorithm (ACS). This means that the global view of the map would have been provided before the learning robots start.

The model is explained using simulations; the results are presented in Section 4. Some of the results also have graphical interpretations. An analysis by state transition diagram is presented in Figure 5, and the possible state actions of the robots are presented in Table 3. The robots’ activities reduce to the following: searching or inspecting each state or room for roof cracks and toxic gas levels, and recording the outcome of the inspection in the robots’ memories. The processes involved in achieving good communication while the robots inspect the states are broadcasting inspected states to the other robots, ignoring already inspected/broadcasted states, avoiding collisions with obstacles and other robots, and finally moving to the next available state that has not been inspected.


Robot’s state–action rewards (rows: current state; entries: reward pairs for the permitted transitions, 50 for inspect / 100 for ignore, and 150 on reaching the goal)

A   50, 100 | 50, 100
B   50, 100 | 50, 100 | 50, 100
C   50, 100 | 50, 100 | 150
D   50, 100 | 50, 100 | 50, 100
E   50, 100 | 50, 100 | 50, 100 | 50, 100
F   50, 100 | 50, 100 | 50, 100 | 150
G   50, 100 | 50, 100 | 50, 100
H   50, 100 | 50, 100

The Q-learning algorithm is used to determine which action is to be selected in any particular state. Let the set of actions = {inspect, ignore, shutdown}, the state space = {dimensions in the topological map}, the sensor readings = {readings for each hazardous condition}, and the hazardous conditions = {roof crack (RC), toxic gas level (TG)}. Each robot is configured with an aerial scanner and a chemical-sensitive sensor that provide the readings used to determine whether a hazardous condition is present in the environment. The selection of an action by a robot through the Q-learning algorithm is based on the information from the broadcast. The whole framework uses a decentralized Q-learning scheme such that each robot has its own thread with its own Q-learning and Q-table. All Q-values in the tables of the robots in the team are initialized to zero; that is, the states corresponding to such positions have not been inspected. When a robot enters a new state, it broadcasts its information for the current state to all other robots in the environment. The broadcast indicates whether the current state has been inspected or not; it is implemented as a search of the corresponding positions in the memories of all visible robots. If the resultant search returns zero, the broadcast indicates that the state has not been inspected; if it returns a value greater than zero, the state is not available for inspection. All robots must receive a broadcast before they act in any particular state. When a robot gets to a state, it receives a broadcast and passes its value to the QL algorithm to make a decision, and the robot then carries out the decision of the QL algorithm. The policy of the Q-learning algorithm makes a robot carry out an inspection if the resultant search of the broadcast returns zero and ignore the state if the resultant search is greater than zero. A robot only shuts down if all states have been visited and inspected. As the broadcast information is passed to the Q-learning algorithm, the policy is iterated towards an optimal value and condition.
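The broadcast-then-decide scheme described above can be sketched as follows; the class and method names are hypothetical, not from the paper's implementation:

```python
class InspectionRobot:
    """Each robot keeps its own Q-table; a 'broadcast' is modelled as
    searching the corresponding state entry in every visible robot's memory."""

    def __init__(self, name, n_states):
        self.name = name
        self.q_memory = [0.0] * n_states  # 0.0 => state not yet inspected

    def broadcast_search(self, state, team):
        """Return the maximum Q-value recorded for `state` across the team."""
        return max(robot.q_memory[state] for robot in team)

    def decide(self, state, team, all_done=False):
        """Policy: inspect if nobody has inspected the state, ignore otherwise,
        and shut down only when every state has been visited and inspected."""
        if all_done:
            return "shutdown"
        return "ignore" if self.broadcast_search(state, team) > 0 else "inspect"

r1, r2 = InspectionRobot("R1", 8), InspectionRobot("R2", 8)
team = [r1, r2]
r2.q_memory[3] = 50.0        # R2 has already inspected state 3
action = r1.decide(3, team)  # R1's broadcast finds state 3 inspected
```

Because R2's memory holds a nonzero value for state 3, R1's broadcast search returns a value greater than zero and R1 ignores that state.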

The broadcast avoids the cost of multiple inspections, and the QL ensures that robots take appropriate actions. A state in which inspection is carried out is the only one for which sensor readings, indicating the hazardous conditions, are recorded. For taking an action in state s, the robot gets a reward r, which is used in (10) to compute the Q-value for the current state, and it can then send a broadcast to other robots. Figure 5 and Table 3 show the restrictions and possible transitions among node points, indicating the possible rewards for any particular action in any state s. Every transition besides reaching the goal state results in a reward of 50 or 100, and at completion the reward is the maximum, 150. We consider it wasteful to make the robots scan first before sharing intelligence, because such action would slow them down and make them expend more energy. Robots are to provide the inspection results, showing the actions taken in different states and the nature of the conditions detected in any particular state. The introduction of the broadcast approach to determine the team’s exploration reduces the execution time and energy cost of the whole team and makes the collaboration effective. So, in a multirobot scenario, the task can be effectively executed if the robots are made to share intelligence as they progress; robots do not have to waste time and energy doing a task already carried out by another robot in the team.

GIVEN. As a mathematical formulation of the above, suppose a safety preinspection of toxic gases, rock falls, or some combination of the two is being carried out on a piece of complex underground terrain such as that in Figure 1. There is a limited number of MRS robots with different capacities, n, and precious inspection time, k minutes. Every region/state in the terrain requires a certain robot capacity of the MRS and a limited inspection time. Let R be the positive reward of QL for correct behaviour on a state. This research aims to maximise the positive rewards by choosing optimal values for the states as follows:
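The objective can be written in a knapsack-style form; the formulation below is our reconstruction, and the symbols R_i, c_i, t_i, and x_i are our own notation for the state rewards, robot capacities, inspection times, and selection variables, since the paper's equation was not recoverable:

```latex
\max \sum_{i \in S} R_i\, x_i
\quad \text{subject to} \quad
\sum_{i \in S} c_i\, x_i \le n,
\qquad
\sum_{i \in S} t_i\, x_i \le k,
\qquad
x_i \in \{0, 1\},
```

where S is the set of regions/states and x_i = 1 when state i is selected for inspection.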

In terms of solving this optimization problem, we use the proposed QLACS model to compare the time taken and the number of states inspected. The number of robots used is also compared. The graphed results in Section 4 give more insight into the solution of the problem.

3.2. Basic Navigation and Cooperative Behaviours Using QLACS

QLACS has two components. The first component is formed by an improved ACS and the second component is formed by an improved QL. The improvement occurs because some heuristics were added to the ordinary QL and ACS to achieve the hybrid QLACS. However, the second component of QLACS, which is an improved QL, was initially used to solve the proposed problem. After much analysis, we realized that the system needs to be optimized for effective cooperation and communication.

Using the second component of QLACS to solve the basic navigation and cooperative behaviour, the possible actions were set for each robot as inspect, ignore, and shutdown (after reaching the goal state H). Also, a reward system reflecting the possible actions of the robots was chosen. In other words, a robot gets 150 points only when the goal is achieved (shutdown), 100 points for ignoring an already inspected area (ignore), and 50 points for inspecting an uninspected area (inspect). Figure 5 shows the transition events of the model, and Table 3 displays the possible state actions for each robot. The way the QLACS second component works here is based on both navigation and communication behaviours.

Achieving navigational behaviour with the second component of QLACS has some cost associated with it. In our scenario, because we want the robots to share intelligence by broadcasting results (robots search through other robots’ memories), our problem is not solely navigational but also cooperative. Our behavioural actions are inspect, ignore, and shutdown. We noted that these actions of interest are not navigation oriented; there is no way we could use them in making decisions on transitions, so a separate decision function had to be integrated to assist navigation. Therefore, our behavioural model is an integration of two behaviours: navigational behaviour and cooperating behaviour through decision making. The integration works with a route finding method called RandomStateSelector, which we introduced in the second component of QLACS to help determine where a robot goes from the starting point to the exit point. Two variants of the RandomStateSelector method are introduced in this work: RandomStateSelector_C_H, which is used to transit from state C to H, and RandomStateSelector_F_H, which transits from state F to H. This method works, but not effectively, because some states are repeated several times owing to the random selection. However, the decision part of this second component of QLACS, handled by a method called CheckInspection, worked efficiently. CheckInspection is responsible for sharing the broadcast among the agents: the broadcast looks at all the stored Q-values on all the robots and returns a signal for the action that needs to be taken. Therefore, we concluded that the heuristically accelerated component of QLACS has proven to be effective by showing evidence of effective communication in making inspection decisions using our model. It did not guarantee the shortest possible inspection time because of the repeated state decisions.
In this light, we only harnessed the communication strength of the second component of QLACS for communication and cooperation. Figure 5 and Table 3 form the basis for the QLACS second component.

To take care of the random state selector problem encountered in implementing the algorithm used for the second part of QLACS, we introduced an optimized route finder algorithm. This route finder algorithm, which forms the first component of QLACS, is a swarm intelligence technique. Figure 6 and Table 4 form the basis of the exploration space of the agents (ants), which achieved the optimum route finding. The weighted graph in Figure 6 is based on the number of obstacles the ants encounter from the two points of entry. The combined table in Table 4 contains the weights of the obstacles and evidence of an edge between any two vertices (states); it shows whether there is a connection between any two vertices in the graph. A “1” or “2” denotes the existence of an edge (and its obstacle weight), while a “0” represents the absence of an edge, that is, no transition between those vertices. The constructed graph is an undirected multigraph, providing for agents entering from either end of the mine (logical space). It is bidirectional in the sense that agents can move in either direction along an edge, so the same weights on the edges apply to both directions. The graph does not show H; we assume that once the agents reach an exit node, they exit if all inspections have been done.



    A  B  C  D  E  F  G  H
A   0  1  1  1  0  0  0  1
B   1  0  2  0  0  0  1  0
C   1  2  0  2  0  2  0  0
D   1  0  2  0  2  0  0  0
E   0  0  0  2  0  2  0  0
F   0  0  2  0  2  0  1  0
G   0  1  0  0  0  1  0  1
H   1  0  0  0  0  0  1  0
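Table 4's weighted adjacency data can be encoded directly, and the undirected (symmetric-weight) property claimed in the text can be checked programmatically; the helper names below are our own:

```python
# Weighted adjacency matrix from Table 4 (states A..H): 0 means no edge,
# 1 or 2 is the obstacle weight on the edge.
STATES = "ABCDEFGH"
ADJ = [
    [0, 1, 1, 1, 0, 0, 0, 1],  # A
    [1, 0, 2, 0, 0, 0, 1, 0],  # B
    [1, 2, 0, 2, 0, 2, 0, 0],  # C
    [1, 0, 2, 0, 2, 0, 0, 0],  # D
    [0, 0, 0, 2, 0, 2, 0, 0],  # E
    [0, 0, 2, 0, 2, 0, 1, 0],  # F
    [0, 1, 0, 0, 0, 1, 0, 1],  # G
    [1, 0, 0, 0, 0, 0, 1, 0],  # H
]

def neighbours(state):
    """States reachable from `state` in one transition."""
    i = STATES.index(state)
    return [STATES[j] for j, w in enumerate(ADJ[i]) if w > 0]

# Undirected multigraph: the same weight applies in both directions,
# so the matrix must be symmetric.
is_symmetric = all(ADJ[i][j] == ADJ[j][i]
                   for i in range(8) for j in range(8))
```

This confirms the bidirectional-edge claim: every weight appears identically in both directions.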

3.3. Hybrid Model Development

The parameters for building the analytical hybrid QLACS model (equations (2) through (11)) are presented in Tables 5 and 6.


ACO parameter | Meaning

α | Pheromone influence factor
β | Influence of adjacent node distance
ρ | Pheromone evaporation coefficient
Q | Attractiveness constant
(i, j) ∈ T^k | Visited edge
(i, j) ∉ T^k | Edge not visited
L_k | Length of tour of ant k
τ_ij | Pheromone concentration (amount)
η_ij | Specific visibility function (attractiveness)
Δτ_ij^k | Pheromone concentration by kth ant
p_ij | Probability of moving from i to j
d_ij | Visibility or distance between i and j
f_i | Fitness of individual i in a population
p_i | Probability of being selected among N
N | Number of individuals in the population
i, j | Denotes any two adjacent nodes in the graph
J_k | Set of unvisited nodes


QL parameter | Meaning

Q(s, a) | Q-value update
s | State
a | Action
r | Reward
γ | Learning rate

3.3.1. Algorithmic and Mathematical Analysis

ACS Starts. Computation of edge attractiveness:

 η_ij = 1/d_ij. (2)

Computation of instantaneous pheromone by ant k:

 Δτ_ij^k = Q/L_k if ant k used edge (i, j), and Δτ_ij^k = 0 otherwise. (3)

Update of pheromone:

 τ_ij = (1 − ρ)·τ_ij + Σ_k Δτ_ij^k. (4)

Computation of edge probability:

 p_ij^k = (τ_ij^α · η_ij^β) / Σ_{l ∈ J_k} (τ_il^α · η_il^β). (5)

Adoption of roulette wheel:

 cum_j = Σ_{m ≤ j} p_im, (6)
 r = Rand(0, 1), (7)
 next state = the first j with r ≤ cum_j. (8)
Equations (2) through (8) build the complete route finding model. Equations (2) through (4) are prerequisites to (5), and (5) is a prerequisite to the roulette wheel selection. At the end of (8), new states are selected and the trail is updated. The best path from both directions is selected and used as input to Q-learning. Note that d_ij is the weight on edge (i, j), p_ij is the chance of moving to node j, and L_k is the sum over the nodes visited by ant k.
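The transition step in (5) through (8) can be sketched as follows. This is a minimal illustration under the standard Ant System forms, with illustrative parameter values rather than the exact implementation used in the experiments:

```python
import random

# Sketch of one ACS transition: edge probabilities from pheromone and
# visibility (equation (5)), then roulette-wheel selection of the next
# state (equations (6)-(8)).  ALPHA and BETA follow Table 5; the values
# here are illustrative.

ALPHA, BETA = 1.0, 1.0   # pheromone influence, visibility influence

def edge_probabilities(tau, eta, current, unvisited):
    """p_ij = tau_ij^alpha * eta_ij^beta, normalised over unvisited j."""
    weights = {j: (tau[current][j] ** ALPHA) * (eta[current][j] ** BETA)
               for j in unvisited}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def roulette_select(probs, rand=None):
    """Pick the state whose cumulative-probability band contains rand."""
    r = random.random() if rand is None else rand
    cumulative = 0.0
    for state, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return state
    return state   # guard against floating-point shortfall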

QL Starts. Each robot in its QL thread

computes its learning rate:

Q-values are updated:

 Q(s, a) = r(s, a) + γ · max_{a′} Q(s′, a′), (10)

making a broadcast (Decision = Inspect/Ignore/Shutdown):

 Decision(s′) = Inspect, if Q(s′) = 0 and s′ ≠ goal; Ignore, if Q(s′) > 0; Shutdown, if Q(s′) = 0 and s′ = goal. (11)
Equation (9) gives gamma, the learning rate, which always lies between zero and one and is calculated from the frequency with which each robot inspects states. Equations (9) to (11) are state-dependent: the states are kept in a buffer and accessed at run time. ACS and QL do not work simultaneously; ACS runs to completion, and QL takes its final output as input. ACS is not called again while QL is working.
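The Q-learning side can be sketched similarly. The exact frequency-based form of (9) is not reproduced here, so a simple 1/(1 + visits) decay stands in for it; the update mirrors (10), and the decision mirrors the three IF branches of Algorithm 2:

```python
# Sketch of the Q-learning component (equations (9)-(11)).  learning_rate
# is a stand-in for the paper's frequency-based gamma; q_update and
# broadcast_decision follow (10) and (11) respectively.

def learning_rate(visits):
    """Gamma of (9): always in (0, 1], falling as a state is revisited."""
    return 1.0 / (1.0 + visits)

def q_update(Q, state, action, reward, next_state, gamma):
    """Q(s, a) <- r(s, a) + gamma * max_a' Q(s', a')   (equation (10))."""
    best_next = max(Q.get(next_state, {}).values(), default=0.0)
    Q.setdefault(state, {})[action] = reward + gamma * best_next

def broadcast_decision(q_value, at_goal):
    """Equation (11): the action signalled after checking shared Q-values."""
    if q_value == 0 and not at_goal:
        return "Inspect"
    if q_value > 0:
        return "Ignore"
    return "Shutdown"    # q_value == 0 at the goal state
```

A fresh state has Q = 0, so the first robot to reach it inspects; any later robot sees Q > 0 and ignores it, which is exactly the broadcast behaviour described above.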

Our behavioural model is an integration of two algorithms: a route finding algorithm and a communication/cooperation algorithm. The integration works as follows: the optimal route finder (ACS) determines where a robot goes from the starting point to the destination, while QL determines what action it takes at each state. This collaboration works effectively because the optimal route finder has been proven to give the best possible transitions, and the heuristically accelerated QL has proven effective by showing evidence of effective communication in making inspection decisions. Consequently, it guarantees the shortest possible inspection time in the absence of wasteful inspection decisions. This framework forms the basis of our cooperative behaviour model for MRS (QLACS). The pseudocode for the implementation of QLACS is outlined in Algorithm 2.

INPUT: Edge distances (obstacles), pheromones, ants' trail, associated probabilities,
starting and terminating indexes, that is, from F or C
OUTPUT: Effective cooperation, inspection and navigation
(1) Boolean CompletedFlag = False //Boolean variable indicates completion for all the threads
(2) Declare CompletedThreadBuffer //Data structure stores info about completed threads
(3) Initialize all Q-value positions to zero
(4) Initialize Best Paths from ACO algorithm //starting from F and C
(5) While (CompletedFlag <> True) //loops until all robots have finished and the flag is true
Begin
  Create N number of Threads in Parallel
  Threads get Current States in Parallel from ACO algorithm
  Threads get Next States and Indexes in Parallel from ACO algorithm
  Threads compare Q-values of all corresponding positions of Current States (each Robot Broadcasts
   its Q-value info)
  IF ((Q == 0) & (ThreadNextState <> GoalState)) //checks if a particular state is available
  Begin
   State is Available, Robot with the CurrentThreadID Inspects
   Compute and Update Q-value
  End
  IF (Q > 0) //checks if a state is not available; an already inspected state has Q > 0
  Begin
   State is already inspected, Robot with the CurrentThreadID Ignores
   Compute and Update Q-value
  End
  IF ((Q == 0) & (ThreadNextState == GoalState)) //checks for the goal state and shuts down
  Begin
   Compute and Update Q-value
   Goal state is reached and Inspection Completed
   Thread with the CurrentThreadID Shuts down
   Store CurrentThreadID in CompletedThreadBuffer
  End
  IF (Count[CompletedThreadBuffer] == NumberOfRobots) //learning stops when this happens
  Begin
   CompletedFlag = True
  End
End of While Loop.
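Putting the two components together, the while-loop of Algorithm 2 can be sketched as a compact single-process simulation: the robots walk their ACS-optimised routes in lockstep and share one Q-value table, which is what makes the broadcast work. The reward scheme (Inspect = 50, Ignore = 100, Shutdown = 150) follows the second worked example; the concrete routes below are illustrative, not the exact ACS output:

```python
# Compact sketch of Algorithm 2.  Two robots walk their routes in lockstep
# and share one Q-value table, so each broadcast sees what the other robot
# has already inspected.  The goal state is checked first, matching the
# worked example in which both robots shut down at H; Algorithm 2 states
# the Shutdown branch as Q == 0 at the goal.

REWARD = {"Inspect": 50, "Ignore": 100, "Shutdown": 150}
GOAL = "H"

def run(routes):
    """routes: robot name -> list of states ending at the goal state."""
    Q = {}            # shared memory: state -> accumulated Q-value
    log = []          # (robot, state, action) decisions, in order
    done = set()
    step = 0
    while len(done) < len(routes):
        for robot, route in routes.items():
            if robot in done:
                continue
            state = route[step]
            if state == GOAL:
                action = "Shutdown"   # goal reached: robot exits
                done.add(robot)
            elif Q.get(state, 0) == 0:
                action = "Inspect"    # nobody has inspected it yet
            else:
                action = "Ignore"     # already inspected: Q > 0
            Q[state] = Q.get(state, 0) + REWARD[action]
            log.append((robot, state, action))
        step += 1
    return log
```

With the illustrative routes `{"R1": list("FEDABCGH"), "R2": list("CBADFEGH")}`, each of the seven internal states is inspected exactly once between the two robots, every revisit is ignored, and both robots shut down at H.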

3.3.2. QLACS Hybrid Approach Evolution

Figure 7 shows the architecture of the hybrid QLACS, explaining the handshake between its two components.

(i) Graph Construct Module. This is the interconnection of states in our model environment (see Figure 4). It encompasses all possible transitions from F to C and vice versa. Thus, from the model we construct an adjacency/weight matrix that can be traversed by any graph-oriented algorithm. In our case, there are eight states: seven internal states and a common terminating state. Since the states are not just linear, the environment allows for multiple options; that is, an agent/ant can choose from any present state. This type of scenario is also called a multigoal scenario.

(ii) State Selection Module. Here, the agents select the start and end states, which according to our model in Figure 4 are F and C. These states are selected based on the cumulative probability of two adjacent nodes in (6).

(iii) Transition Module. This module takes care of the transition rules of the agents by calculating the pheromone concentrations, distances, and probabilities using (2) through (4).

(iv) Update Module. After transition from one state to another, an update of the pheromone is computed, after which multipath planning with shortest path is achieved.

(v) Convergence Decision Module. This is where the best trail/path decision is taken. This is the end product of the first component of QLACS, which is then moved to the second component of QLACS for cooperative behaviour.

(vi) Cooperative Inspection Module. This is where the robots are deployed to start inspection. The robots use the acquired best paths, starting from F and C, respectively, as input for the second component of QLACS. The two robots use the learning rate from (9) to learn the environment and use (10) for cooperation.

(vii) State Broadcasting Module. This module handles the broadcasting behaviours of the two robots, which are achieved by using (10). Each robot checks its memory, represented by Q-values, before taking any decision.

(viii) State Action Module. State broadcasting by each robot is immediately followed by action selection. In other words, the state to inspect or ignore is achieved here using (11).

(ix) QLACS Update Module. After each action selection, the Q-values of each robot are updated using (10).

(x) New State Module. This module takes care of the robot’s new state after updating the -values. Each robot runs in its own threads, managing its memory, yet sharing information.

(xi) Final Decision Module. This decision module determines if the robot should exit the environment or still do more inspections. It is also controlled by (11).

(xii) Goal State Module. The completion of the second component of QLACS is getting to the goal state after successful inspection of states. This goal state, according to our model in Figure 4, is called H.

3.3.3. Analytical Scenario

For the benefit of robotic and system engineers, practitioners, and researchers, this paper uniquely presents scenarios on the ease of implementation of the proposed QLACS, as described in Algorithms 3 and 4 and Tables 7 and 10. The first component of QLACS is explained in Example I, shown in Algorithm 3 and Table 7. This component, which achieves the optimal route paths for the robots, uses the parameters listed in Algorithm 3. The calculations for the different state transitions of the first component of QLACS are analysed in Table 7.


Starting state:
Potential states: , ,
(1) Use (2), , = 1, = 1
(2) Use (3), = 1, , ,
(3) Use (4), , ,
First pheromone update

(4) Use (5) , ,
 Sum =
, ,
Probabilities

(5) Use (6)

Call Rand
Rand = 0.07, Rand falls in state . Roulette wheel selects as the next state, end of roulette wheel.
(6) Update trail: ,
Current state:
Potential states: , ,
(1) Use (2), , ,
(2) Use (3), = , ,

(3) Use (4), , ,          
Second pheromone update  

(4) Use (5) , ,
Sum =          
, ,
Probabilities

(5) Use (6)

Call Rand
Rand = 0.9, Rand falls in state . Roulette wheel selects as the next state, end of roulette wheel.
(6) Update trail: , ,

Example I of QLACS (Using Figure 5 for optimized route finding)
Parameters used in the first component of QLACS
Starting State:
Terminating State:
Terminating condition: When all states have been visited at least once and it is at terminating state
State Space = ,
Initialize pheromones positions to 0.01
Rand = (0, 1) returns a random value between 0 and 1
Equations
(i) Computation of attractiveness using (2)
(ii) Computation of instantaneous pheromones by ant for all potential states using (3)
(iii) Pheromone update using (4)
(iv) Computation of probabilities for the potential states using (5)
(v) Adaptation of roulette wheel using (6)
Equation (5) can be reduced to Let =
Sum =
So =

Example II of QLACS (For good cooperation and communication between robots)
Parameters used in the second component of QLACS (Using output from the first component of QLACS)
Reward Scheme: Inspect = 50, Ignore = 100, Shutdown = 150
State space: Optimized path from QLACS_R1 = and QLACS_R2 =
Starting State:
= Terminating State: then
Terminating condition: When all states have been visited.
Initialize value positions to zeros
Equations
(i) Compute learning rate using (9)
(ii) Compute update on using (10)

Repeat steps 1–6 for subsequent current states until the termination condition and state are reached. At the end of seven updates we have Tables 8 and 9. The total length shown in Table 8 represents the total number of obstacles encountered by each agent while trailing to achieve the optimal route for the robots. The number of obstacles, shown as 10, is calculated using the trail result in Table 8. Table 9 displays the probability update table for a full cycle of route finding. In the case of this first example, the first robot will terminate its inspection through and then .


Pheromone update (one row per update; columns are the current states)

0.01 2.01 2.01 2.01 0.01 0.01 0.01   (1st update)
0.01 2.66 2.01 2.66 0.01 2.66 0.01   (2nd update)
3.13 0.01 3.03 0.01 3.03 0.01 0.01   (3rd update)
0.01 0.01 0.01 0.45 0.01 0.45 0.01   (4th update)
0.01 0.01 0.67 0.01 0.67 0.01 0.70   (5th update)
0.01 0.91 0.01 0.01 0.01 0.89 0.01   (6th update)
0.71 0.01 0.69 0.01 0.01 0.01 0.71   (7th update)

= terminating state and ant has moved through all states at least once.
Trail: .
Number of obstacles = 1 + 2 + 2 + 2 + 1 + 1 + 1 + 1 = 10.
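The obstacle count above is simply the sum of the edge weights along the trail. The following sketch shows that computation; both the weights and the trail are illustrative stand-ins, since the state labels of the actual trail are not given here:

```python
# Sketch of the obstacle count reported under Table 8: a trail's cost is
# the sum of the edge weights (obstacle counts) between its consecutive
# states.  The weight table and trail below are illustrative, not the
# paper's actual trail.
def trail_obstacles(trail, weight):
    """Sum the edge weights between consecutive states of the trail."""
    return sum(weight[(a, b)] for a, b in zip(trail, trail[1:]))

# Illustrative edge weights whose total trail cost is 10:
weight = {("F", "E"): 2, ("E", "D"): 2, ("D", "A"): 1, ("A", "B"): 1,
          ("B", "C"): 2, ("C", "G"): 1, ("G", "H"): 1}
print(trail_obstacles(list("FEDABCGH"), weight))   # 10
```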

Probability table (one row per current state)

0    0.33 0.33 0.33 0    0    0
0    0.67 0    0.17 0    0.17 0
0.69 0    0.16 0    0.16 0    0
0    0    0    0.50 0    0.50 0
0    0    0.16 0    0.16 0    0.68
0    0.52 0    0    0    0.49 0
0.45 0    0    0    0    0    0.45

= terminating state.

Starting simulation Robot 1
Starting state: F
(1) Use (9)

(2) Check the Q-value for state (use (11))

Selected action, = inspect
(3) use (10)

End of value iteration
Current state: E, Robot 1
(1) Use (9)

(2) Check the Q-value for state (Use (11))

Selected action, = inspect
(3) Use (10)
  
End of value iteration
Current state: C, Robot 2
(1) Use (9)

(2) Check the Q-value for state (use (11))

Selected action, = inspect
(3) Use (10)

End of value iteration

Current state: B, Robot 2
(1) Use (9)

(2) Check the Q-value for state (use (11))

Selected action, = inspect
(3) Use (10)

End of value iteration
Current state: D, Robot 1
(1) Use (9)

(2) Broadcast (use (11))

Selected action, = inspect
(3) Use (10)

End of value iteration
Current State: A, Robot 2
(1) Use (9)

(2) Broadcast (use (11))

Selected action, = inspect
(3) Use (10)

End of value iteration

Current state: A, Robot 1
(1) Use (9)

(2) Broadcast (use (11))
, that is,
Selected action, = Ignore
(3) Use (10)

End of value iteration
Current state: D, Robot 2
(1) Use (9)

(2) Broadcast (use (11))
, that is,
Selected action, = ignore
(3) Use (10)

End of value iteration
Current state: B, Robot 1
(1) Use (9)

(2) Broadcast (use (11))
, that is,
 Selected action, = Ignore
(3) Use (10)

End of value iteration

Current state: F, Robot 2
(1) Use (9)

(2) Broadcast (use (11))
, that is,
Selected action, = ignore
(3) Use (10)

End of value iteration
Current state: C, Robot 1
(1) Use (9)

(2) Broadcast (use (11))
, that is,
Selected action, = Ignore
(3) Use (10)

End of value iteration
Current state: E, Robot 2
(1) Use (9)

(2) Broadcast (use (11))
, that is,
Selected action, = ignore
(3) Use (10)

End of value iteration

Current state: G, Robot 1
(1) Use (9)

(2) Broadcast (Use (11))
   
Selected action, = Inspect
(3) Use (10)
End of value iteration
Current state: G, Robot 2
(1) Use (9)

(2) Broadcast (use (11))
, that is,
Selected action, = ignore
(3) Use (10)


End of value iteration
Current state: C, Robot 1
(1) Use (9)

(2) Broadcast (Use (11))

Selected action, = Ignore
(3) Use (10)


End of value iteration

Current state: C, Robot 2
(1) Use (9)

(2) Broadcast (Use (11))
, that is,
Selected action, = ignore
(3) Use (10)


End of value iteration
All QLACS_R1 states exhausted
Goal state: H, Robot  1
(1) Use (9)

(2) Broadcast (use (11))

Selected action, = Shutdown
(3) Use (10)

End of value iteration
All QLACS_R2 states exhausted
Goal state: H, Robot  2
(1) Use (9)

(2) Broadcast (use (11))

Selected action, = Shutdown
(3) Use (10)