Abstract

We propose a cooperative multiagent Q-learning algorithm called exploring actions according to Q-value ratios (EAQR). Our aim is to design a multiagent reinforcement learning algorithm for cooperative tasks where multiple agents need to coordinate their behavior to achieve the best system performance. In EAQR, the Q-value represents the probability of obtaining the maximal reward, and each action is selected according to the ratio of its Q-value to the sum of all actions' Q-values together with the exploration rate. Seven cooperative repeated games are used as cases to study the dynamics of EAQR. Theoretical analyses show that in some cases the optimal joint strategies correspond to the stable critical points of EAQR. Moreover, comparison experiments on stochastic games with finite steps are conducted: one is the box-pushing problem, and the other is the distributed sensor network problem. Experimental results show that EAQR outperforms the other algorithms in the box-pushing problem and achieves the theoretical optimal performance in the distributed sensor network problem.

1. Introduction

Reinforcement learning (RL) uses a scalar numeric feedback from the environment to improve the behavior of the learner. In the single-agent case, RL is an effective unsupervised learning method for solving problems with the Markov property [1, 2]. Many researchers have been trying to extend RL to optimize performance indices in circumstances where multiple agents exist, and many multiagent reinforcement learning (MARL) algorithms and their applications have been proposed [3-5]. In a multiagent system (MAS), on the one hand, the state transition distribution and the local immediate reward received by each agent are determined not by the behavior of any single agent but by the behavior of all the agents in the system. Thus, each agent has to adapt to the environment and the other agents at the same time, which invalidates the Markov property. On the other hand, if all the agents in the system are viewed as a single one, the joint action space grows exponentially, which deteriorates the scalability of MARL algorithms.

This paper investigates methods for coordinating multiple agents through MARL techniques. In recent years, many MARL algorithms with different assumptions and goals have been presented to solve coordination issues in MAS. Some algorithms require sharing each agent's local immediate reward; others require sharing each agent's selected actions and even value functions or Q-value functions as well. The learning goal depends on the problem at hand. Nash equilibria have been used in optimal control [6, 7] and are also adopted as the learning goal by many MARL algorithms. Hu and Wellman [8] proposed Nash-Q, which can converge to a Nash equilibrium in some repeated games; however, Nash-Q requires Q-value functions to be shared as well. Infinitesimal gradient ascent (IGA) [9] guarantees that the agents' strategies converge to a Nash equilibrium, or that average rewards converge to the expected rewards of a Nash equilibrium, in two-player two-action repeated games. Win-or-learn fast policy with IGA (WoLF-IGA) [10] was proposed to address the issue that IGA does not converge to any Nash equilibrium in some repeated games. For IGA and WoLF-IGA, each agent has to know its own payoff matrix and the other agent's strategy. Besides, Nash-Q, IGA, and WoLF-IGA suffer from the curse of dimensionality of the joint action space.

To mitigate the above problems, some algorithms with fewer sharing requirements have been studied. WoLF-policy hill-climbing (WoLF-PHC) [10] only needs states and each agent's local immediate reward to be shared, but its convergence is no longer guaranteed. The exponential moving average (EMA) Q-learning [11] and the weighted policy learner (WPL) [12] empirically converge to a Nash equilibrium in some typical repeated games. Our motivation is to design scalable MARL algorithms that can attain the optimal total reward in fully cooperative games.

New MARL algorithms can be obtained by designing new action exploration methods. Babes et al. [13] pointed out that more robust algorithms could be produced by inserting tools from nonlinear dynamics into Q-learning to modify the exploration or learning rate. So far, the dynamics of independent Q-learning (IQL) in two-player two-action repeated games have been extensively studied. Tuyls and Nowé, Tuyls and Parsons, and Bloembergen et al. [14-16] first built the model of IQL with Boltzmann exploration in three typical repeated games. They pointed out that the IQL model is similar to the replicator dynamics equations, and they presented a graphical representation of the relation between the temperature parameter and the critical points. Kianercy and Galstyan [17] further studied the dynamics of IQL; they analyzed the position and the stability of the critical points of IQL in several types of two-player two-action repeated games. Babes et al. [13] analyzed the dynamics of IQL with ε-greedy exploration. For ε-greedy exploration, since the action with the maximal Q-value is selected for exploitation, the Q-value can be viewed as a switching signal. Thus, results on stability analysis for switching systems [18, 19] might benefit the analysis of IQL with ε-greedy exploration. Awheda and Schwartz [11] proposed EMA Q-learning and proved its ability to converge to a Nash equilibrium in two-player two-action games.

Nash equilibria are important when analyzing the interaction between agents. Some multiagent reinforcement learning (MARL) algorithms do focus on convergence to Nash equilibria, and most of these algorithms consider general-sum games. In contrast, for cooperative tasks, reaching better performance indices is more important than converging to a Nash equilibrium and becomes the prime concern for MARL algorithms.

To obtain the maximal expected total cumulative reward, this paper proposes a multiagent Q-learning algorithm called exploring actions according to Q-value ratios (EAQR). In standard fictitious play [20], each player's strategy is a function of the other players' empirical frequencies, whereas in EAQR, each agent selects an action according to the Q-value function of its own actions and updates its Q-value function only according to the frequency of its own action selection. The maximal total immediate reward can still be achieved in some cooperative repeated games, which is the first contribution. The second contribution is that EAQR can be naturally extended to stochastic games. Simulation results show that EAQR outperforms the other algorithms in the box-pushing problem and achieves the theoretical optimal performance in the distributed sensor network problem.

The remainder of this paper is organized as follows. Section 2 introduces stochastic games and repeated games. Section 3 proposes EAQR for repeated games. Section 4 studies the dynamics of EAQR in seven different repeated games. Section 5 compares EAQR with EMA Q-learning, WoLF-PHC, and single-agent RL in two stochastic games: box-pushing and the distributed sensor network (DSN) problem. Section 6 summarizes the conclusions.

2. Preliminaries of Stochastic Games and Repeated Games

2.1. Stochastic Games

A stochastic game [5] is a tuple $\langle N, S, A_1, \ldots, A_N, T, R_1, \ldots, R_N \rangle$, where $N$ is the number of agents in the game; $S$ is the set of environment states; $A_i$ is the set of agent $i$'s available actions, and the $A_i$ of all agents constitute the joint action set $A = A_1 \times \cdots \times A_N$; the state transition function $T(s' \mid s, \boldsymbol{a})$ is a conditional probability determining the probability of transiting to the next state $s'$ if the joint action $\boldsymbol{a}$ has been executed in the current state $s$; and $R_i$ is the local immediate reward function of agent $i$. The global immediate reward function is the sum of the local immediate reward functions of all agents and is defined as $R = \sum_{i=1}^{N} R_i$. In a cooperative MAS, the learning objective is to maximize the discounted global cumulative reward $G_t = \sum_{k=t}^{T_e} \gamma^{\,k-t} r_k$ at each time $t$, where $\gamma$ is the discount factor within (0, 1) (smaller values correspond to a greater importance of near-future rewards), $T_e$ is the ending time of an episode, and $r_k$ is the global immediate reward received at time $k$.
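To make the objective concrete, the following minimal Python sketch computes the global immediate reward and the discounted global cumulative return from local rewards; the function names and numeric values are illustrative only, not part of the original paper.

```python
from typing import List

def global_reward(local_rewards: List[float]) -> float:
    """Global immediate reward R: the sum of all agents' local rewards."""
    return sum(local_rewards)

def discounted_global_return(global_rewards: List[float], gamma: float) -> float:
    """Discounted global cumulative reward G_t = sum_k gamma^(k-t) * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(global_rewards))

# Example: three agents, a two-step episode, gamma = 0.9 (illustrative values).
episode = [global_reward([1.0, 0.0, -1.0]), global_reward([2.0, 2.0, 2.0])]
print(discounted_global_return(episode, gamma=0.9))  # 0.0 + 0.9 * 6.0 = 5.4
```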

2.2. Repeated Games

Although this paper focuses on an optimization problem, interaction between agents still exists. The repeated game is an ideal tool to depict this interaction and to build the model of EAQR. In a repeated game, the state set is empty, and each agent's local immediate reward depends solely on the joint action. In a fully cooperative repeated game, we are concerned only with the global immediate reward representing the team benefit. Figure 1 shows the payoff matrix of a two-player two-action game. Each row represents an action of agent 1, and each column represents an action of agent 2. Each element of the payoff matrix is a numerical global immediate reward; for example, if agent 1 and agent 2 select the joint action whose entry is 2, both agents receive a global immediate reward of 2. The optimal global immediate reward of 6 is marked with parentheses.

3. EAQR: A Multiagent Q-Learning Algorithm for Coordination of Multiple Agents

EAQR is designed for optimizing the performance indices of a fully cooperative MAS. EAQR requires each agent to have full observation of the states and the local immediate rewards of all agents. One merit of EAQR is that no agent needs to observe any other agent's action; thus the size of the Q-table maintained by agent $i$ is $|S| \times |A_i|$. Local immediate rewards are shared so that the optimal global immediate reward can be identified. EAQR manages to converge to the optimal joint action in Figure 1 through the procedures depicted in Algorithm 1. For agent $i$, the probability of selecting action $a$ is
$$\pi_i(a) = (1 - \epsilon)\,\frac{Q_i(a)}{\sum_{b \in A_i} Q_i(b)} + \frac{\epsilon}{|A_i|},$$
where $Q_i(a)$ is the probability of obtaining the maximum global immediate reward by taking action $a$ at time $t$; the exploration rate $\epsilon$ is within (0, 1); $A_i$ is the action set of agent $i$; and $|A_i|$ is the number of available actions of agent $i$. The nonnegativity of $Q_i(a)$ can be guaranteed by setting the learning rate and the initial value of $Q_i(a)$ to positive values within (0, 1). $Q_i(a)$ will be strictly greater than zero if each action is visited infinitely often. To avoid division by zero in practical applications, we randomly select an action if $\sum_{b \in A_i} Q_i(b) = 0$. In EAQR, the exploration rate $\epsilon$ balances exploration and exploitation: with probability $1 - \epsilon$, the selection probability of action $a$ equals the ratio of $Q_i(a)$ to $\sum_{b \in A_i} Q_i(b)$; with probability $\epsilon$, a random action is selected according to the uniform distribution.

1: for each agent i do
2:   initialize Q_i(a) with a number within (0, 1) for each action a ∈ A_i
3:   initialize the exploration rate ε with a number within (0, 1)
4:   f_i(a) ← 0: frequency of getting the maximum global immediate reward after selecting action a
5:   m ← 0: number of sample games played
6:   repeat for each game
7:     select an action a with the probability of
         π_i(a) = (1 − ε) Q_i(a) / Σ_{b ∈ A_i} Q_i(b) + ε / |A_i|
8:     m ← m + 1
9:     execute action a, update information about reward
10:    if m equals the number of sample games then
11:      for each action a ∈ A_i do
12:        evaluate f_i(a) according to (4)
13:        Q_i(a) ← Q_i(a) + α (f_i(a) − Q_i(a))
14:      end for each action
15:      m ← 0
16:    end if
17:  until the predefined number of games have been played
18: end for each agent
19: return the Q-value function Q_i of each agent
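As a complement to Algorithm 1, the following minimal Python sketch implements the action-selection step under the selection rule reconstructed above; the function and variable names are ours, not the authors'.

```python
import random
from typing import Dict, Hashable

def select_action(q: Dict[Hashable, float], epsilon: float) -> Hashable:
    """EAQR-style selection sketched above: with probability (1 - epsilon),
    choose an action with probability proportional to its Q-value ratio;
    otherwise choose uniformly at random."""
    actions = list(q.keys())
    total = sum(q.values())
    if total <= 0.0 or random.random() < epsilon:
        # Uniform exploration, doubling as a guard against division by zero.
        return random.choice(actions)
    weights = [q[a] / total for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```

Unlike ε-greedy exploitation, this rule keeps a nonzero selection probability for every action whose Q-value is positive, even when exploitation dominates.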

In a repeated game, all agents keep their strategies unchanged and play the game for a fixed number of sample games. Then they update the Q-value of each action according to
$$Q_i(a) \leftarrow Q_i(a) + \alpha\,\big(f_i(a) - Q_i(a)\big),$$
where $\alpha$ is the learning rate and $f_i(a)$ is the frequency of obtaining the maximum global immediate reward by taking action $a$. It is evaluated according to
$$f_i(a) = \frac{m_i(a)}{n_i(a)},$$
where $n_i(a)$ is the number of times agent $i$ selected action $a$ during the previous sample games, and $m_i(a)$ is the number of times agent $i$ achieved the maximum global immediate reward in history when selecting action $a$ during the previous sample games. Before playing the next block of sample games, all agents need to update their strategies according to (3).
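A matching sketch of the batch Q-value update after each block of sample games, assuming the update and frequency formulas reconstructed above; the counter names are illustrative.

```python
def update_q(q, n_selected, n_max_hits, alpha):
    """Batch update following the reconstructed rule Q <- Q + alpha * (f - Q).
    q          : dict action -> Q-value
    n_selected : dict action -> times the action was selected in the block
    n_max_hits : dict action -> times the maximum global reward was obtained
                 when the action was selected in the block
    """
    for a in q:
        if n_selected.get(a, 0) > 0:
            f = n_max_hits.get(a, 0) / n_selected[a]  # frequency estimate f_i(a)
            q[a] += alpha * (f - q[a])
```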

In stochastic games with deterministic state transitions, a state can be viewed as a repeated game whose payoff matrix entries are cumulative rewards, provided that EAQR can converge to a joint action at each of its subsequent states. In this situation, the frequency of obtaining the maximum global cumulative reward by taking action $a$ in each state is evaluated to update the Q-value functions. In stochastic games with nondeterministic state transitions, each state cannot simply be regarded as a repeated game. Nevertheless, we can treat each state in an optimistic way (the frequency of the maximal global cumulative reward, rather than the average global reward, is of concern) and employ the same Q-value updating rule.
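A rough per-state sketch of this optimistic treatment: for each visited state, count how often the episode's global cumulative return from that state matched the best return observed so far, and reuse the same Q-value update with that frequency. All names are hypothetical, and the handling of a newly discovered maximum is an implementation choice not specified in the text.

```python
from collections import defaultdict

best_return = defaultdict(lambda: float("-inf"))   # state -> best global return seen
n_selected = defaultdict(int)                      # (state, action) -> visits
n_best = defaultdict(int)                          # (state, action) -> best-return hits

def record_episode(trajectory):
    """trajectory: list of (state, own_action, global_return_from_state)."""
    for state, action, ret in trajectory:
        if ret > best_return[state]:
            best_return[state] = ret   # a new maximum redefines "best" for this state
        n_selected[(state, action)] += 1
        if ret >= best_return[state]:
            n_best[(state, action)] += 1

def frequency(state, action):
    """Optimistic frequency used in place of the repeated-game frequency f_i(a)."""
    n = n_selected[(state, action)]
    return n_best[(state, action)] / n if n else 0.0
```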

4. Dynamics of EAQR in Cooperative Repeated Games

In this section, the dynamics of EAQR in seven cooperative repeated games are analyzed. A theorem about the dynamics of EAQR is presented first, followed by seven cases of repeated games. If the updating of the Q-value function is regarded as a continuous process, EAQR can be modeled with differential equations. According to [14, 17], the continuous-time form of the Q-value updating rule of EAQR can be obtained as follows:

After rescaling time, we can obtain the following:
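The displayed equations are not reproduced here. For orientation only, a plausible continuous-time form consistent with the discrete update $Q_i(a) \leftarrow Q_i(a) + \alpha\,(f_i(a) - Q_i(a))$ and with the time rescaling mentioned above is sketched below; it is an assumption, not the authors' numbered equations:
$$\frac{dQ_i(a)}{dt} = \alpha\,\big(f_i(a; \boldsymbol{\pi}_{-i}) - Q_i(a)\big), \qquad \pi_i(a) = (1-\epsilon)\,\frac{Q_i(a)}{\sum_{b \in A_i} Q_i(b)} + \frac{\epsilon}{|A_i|},$$
and, after rescaling time by $\tau = \alpha t$,
$$\frac{dQ_i(a)}{d\tau} = f_i(a; \boldsymbol{\pi}_{-i}) - Q_i(a),$$
where $f_i(a; \boldsymbol{\pi}_{-i})$ denotes the probability of obtaining the maximum global immediate reward when agent $i$ plays $a$ and every other agent $j$ plays according to its current strategy $\pi_j$. Under this assumed form, an action that can never attain the maximum reward has $f_i(a; \boldsymbol{\pi}_{-i}) = 0$, so its Q-value decays to zero, in line with the argument used in the proof below.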

Theorem 1. For a cooperative repeated game with () players and optimal pure joint strategies, if for any optimal pure joint strategy each of its component actions is different from the corresponding component action of the other optimal pure joint strategies, then only the optimal pure joint strategies are the stable critical points of the model of EAQR with .

Proof 1. is used to denote player $i$'s component action of the optimal pure joint strategy , and is used to denote the Q-value of for and . For any optimal pure joint strategy, each of its component actions is different from the corresponding component action of the other optimal pure joint strategies, which means that and should not be the same action for player for . According to (6), the Q-value of actions that can never reach the optimal global reward will decrease to zero. Then the model of EAQR with can be expressed by the following equations. for and . If , it can be proved trivially that there is only one stable critical point, which is the optimal pure joint strategy. It can be obtained from (7) that the critical points have to satisfy for and . It can be further obtained that for at the critical point. Suppose for at the critical point; then the following can be obtained according to (8): for . It can be seen that the value of can only be 1, 0, or . and correspond to the optimal pure joint strategies, while corresponds to a mixed strategy. Thus the critical points include all the optimal pure joint strategies and the strategy equally choosing an action that has reached the optimal global reward, namely, for and . The stability of the critical points can be judged by the eigenvalues of the Jacobian matrix.
For the optimal pure joint strategies, the determinant of can be expanded according to rows and columns step by step. The following can be obtained: All eigenvalues are −1 which is negative. Thus the optimal pure joint strategies are stable critical points.
For the mixed strategy, we just need to transform the determinant of and extract a common factor of it for and , , respectively. Although the transformation processes are different in the two cases, the following can be obtained for both the cases: where is a polynomial of of degree . Thus, there always exists at least one positive eigenvalue when , which means the mixed strategy is unstable. Thus, only the optimal pure joint strategies are the stable critical points of the model of EAQR.

Cases 1-4 are two-player two-action repeated games, Cases 5 and 6 are two-player three-action repeated games, and Case 7 is a three-player two-action repeated game. The corresponding payoff matrices are displayed in Figures 2-4, respectively. Each number represents the global immediate reward, and the optimal global immediate reward is displayed in parentheses. In all cases, represents the probability of obtaining the maximum global immediate reward when player chooses action . , , and represent the Q-values of action for players 1, 2, and 3, respectively.

In Cases 1-4, player 1 and player 2 are the row player and the column player, respectively. We assume that a matrix exists and that each element of is strictly smaller than a scalar (). Cases 1-4 are examined first.

Case 1. There is only one optimal global immediate reward.

We can see that , , , and . Thus, we arrive at the following equations from (6):

It can be seen from (12) and (14) that and will be stable at zero after an infinitely long time. Suppose that at time , and are both very close to zero. Then we can obtain the following from (13) and (15) when :

If we let and , then (16) and (17) can be transformed into a set of linear differential equations, and it is easy to see that , namely, is a globally stable node. To sum up, there is only one globally stable critical point in Case 1. This point corresponds to the strategy , which corresponds to the optimal global immediate reward. In Case 1, this conclusion is also valid for and .

To validate our analysis, we plot the learning process of EAQR in Case 1 in Figures 5-10. The learning rate is 0.1, and the number of samples is 200. Twelve different points (, , , ) are used as initial conditions and marked with solid circles. It can be seen in Figures 5 and 8 that the learning trajectories converge to the point when . It can also be seen in Figures 6 and 9 that the learning trajectories converge to the point when . Both points are exactly the critical point obtained earlier. The joint strategy (, ) is illustrated in Figures 7 and 10; it converges to (0, 0). This indicates that our analysis is reasonable.

Case 2. There are two optimal global immediate rewards in diagonal positions.

We arrive at the following equations from (6):

If we let and , then the following system can be derived from (18)–(21):

It is obvious that is a globally stable node of the above system. Suppose at time , and are both very close to 1. Then the system described by (18)–(21) degenerates to the following one when :

The interior critical point must satisfy

Thus, we have the critical point when . Then we examine the stability of this critical point. The Jacobian matrix of the system described by (24) and (25) is , of which the eigenvalues are . When , we have , . According to the theorem of stability of almost linear systems, this critical point is stable. To sum up, there is only one stable critical point in Case 2 when . This point corresponds to the strategy . The greedy joint action may not correspond to either of the optimal global immediate rewards.
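For reference, the criterion invoked here is the standard result on almost linear systems (a textbook fact, not specific to this paper): if the dynamics near a critical point $\boldsymbol{z}^*$ can be written as
$$\dot{\boldsymbol{z}} = J(\boldsymbol{z} - \boldsymbol{z}^*) + \boldsymbol{h}(\boldsymbol{z}), \qquad \frac{\lVert \boldsymbol{h}(\boldsymbol{z}) \rVert}{\lVert \boldsymbol{z} - \boldsymbol{z}^* \rVert} \to 0 \ \text{as}\ \boldsymbol{z} \to \boldsymbol{z}^*,$$
then $\boldsymbol{z}^*$ is asymptotically stable when every eigenvalue of the Jacobian $J$ has a negative real part, and unstable when at least one eigenvalue has a positive real part.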

The above conclusion does not hold when . This is because we use the condition when determining the position and the stability of the critical point of the system described by (24) and (25). When the system described by (22) and (23) is stable, the point is on the line ; the point is on the line , and . The converged strategy is determined by initial conditions.

Case 3. There are two optimal global immediate rewards in the same row.

The system is described by the following differential equations:

The analysis process is similar to that in Case 1. There is only one stable critical point for . This point corresponds to the strategy , which corresponds to one of the optimal global immediate rewards.

Case 4. There are three optimal global immediate rewards.

The system is described by the following differential equations:

It is obvious that is a globally stable node. Suppose at time , and are both very close to 1. Then the system degenerates to the following one when :

The interior critical point must satisfy

It can be derived that the critical point is where . We can further get when . Thus at the critical point, , and they are both within [0.618, 0.667] when . The examination of stability follows the approach in Case 2. It can be determined that the critical point is a stable node. This means that the converged greedy joint action corresponds to the top-left optimal global immediate reward. This conclusion holds for .

We also want to examine cases with more than two actions. Thus, Cases 5 and 6, repeated games with two agents and three actions, are given. In Cases 5 and 6, and represent the probability of selecting action for players 1 and 2, respectively.

Case 5. There are three optimal global immediate rewards on the diagonal.

The system is described by the following differential equations:

If we let and and follow the approach in Case 2, then it can be obtained that there is only one stable node when . This critical point corresponds to the strategy . The greedy joint action may not correspond to any optimal global immediate reward.

As in Case 2, the above conclusion does not hold when . In this situation, when the system is stable, the point is on the plane , the point is on the plane , , and . The converged strategy is determined by initial conditions.

Case 6. There are four optimal global immediate rewards.

The system is described by the following differential equations:

If we let , , , and , the following system can be derived from (33)–(38):

It is obvious that is a globally stable node of the above system. Suppose at time , the state of the above system is very close to the stable state (1, 1, 1, 1), that is

Then the system described by (33)–(38) degenerates to the following one when :

The critical point has to satisfy

It can be obtained that there is only one critical point where . It can be further determined that for . The Jacobian matrix of the system described by (44) and (45) is , of which the eigenvalues are . There are two repeated roots when . In this situation, the system described by (33)–(38) will be stable at the point which corresponds to the point . When , let , ; then the eigenvalues can be rewritten as . We want to show that in this situation there are two different negative real eigenvalues. The following condition will suffice:

It is trivial to prove (48). To sum up, the system described by (33)–(38) has a stable node for , and the greedy action for both players is the second action. Unfortunately, this joint action does not correspond to any optimal global immediate reward.

When , there is only one stable node that satisfies (40)–(43) and

The converged strategy is determined by initial conditions, and the greedy joint action does not necessarily correspond to any optimal global immediate reward.

Case 7. There is only one optimal global immediate reward in a three-player two-action game.

Player 1 and player 2 are the row player and the column player, respectively. Player 3 can be viewed as the matrix player: if player 3 chooses the first action, the left payoff matrix is adopted; otherwise, the right payoff matrix is adopted. We assume that there are matrices and and that each element of and is strictly smaller than a scalar (). Let , , and denote the Q-values of action for players 1, 2, and 3, respectively, and let , , and denote the probabilities of selecting the first action for players 1, 2, and 3, respectively. The system is described by the following differential equations:

The analysis process is similar to that in Case 1. There is only one stable node for . This critical point corresponds to the strategy , which corresponds to the optimal global immediate reward.

To sum up, the optimal global immediate reward can be achieved in Cases 1, 3, 4, and 7 but may not be achieved in Cases 2, 5, and 6. In the next section, we show the performance of EAQR in two stochastic games: box-pushing and the DSN problem.

5. Simulations on Stochastic Games

Case A. Box-pushing

The box-pushing problem is illustrated in Figure 11. Four boxes are represented by grey solid circles, and empty positions are represented by white circles. Four agents (not shown in the figure) need to collaborate to make the boxes distribute uniformly around the circle. Each agent is responsible for moving one box and has three actions: pushing the box to the adjacent clockwise position, pushing the box to the adjacent anticlockwise position, or doing nothing. At the beginning of an episode, the four boxes are located in random positions. Each agent selects an action, and the boxes are pushed to their new positions. An episode ends when the number of empty positions between any two adjacent boxes is the same, or when 100 steps have elapsed. When an episode ends, the positions of the four boxes, the reward, and the step counter of each agent are reset for the next episode (but the Q-value function of each agent is retained until the next run). Each agent receives a reward of −1 at each step and a reward of 10 at the end of an episode.
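As a concrete illustration of the termination condition, the following minimal sketch checks whether the four boxes are uniformly distributed on the ring; the actual number of positions comes from Figure 11 and is treated here as a parameter, and the example values are illustrative only.

```python
def uniformly_distributed(box_positions, ring_size):
    """Return True if the distances between adjacent boxes on the ring are
    all equal (equivalently, the numbers of empty positions between any two
    adjacent boxes are the same)."""
    boxes = sorted(box_positions)
    gaps = [(boxes[(i + 1) % len(boxes)] - boxes[i]) % ring_size
            for i in range(len(boxes))]
    return len(set(gaps)) == 1

# Example on a ring of 12 positions with 4 boxes (illustrative ring size):
print(uniformly_distributed([0, 3, 6, 9], ring_size=12))  # True
print(uniformly_distributed([0, 1, 6, 9], ring_size=12))  # False
```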

The rules of the box-pushing problem are as follows. First, all agents push their boxes simultaneously. Second, if a conflict occurs, the boxes involved in the conflict stay still. A conflict occurs in the following cases: a box is pushed into a static box; two boxes are pushed to the same empty position; two adjacent boxes are pushed in opposite directions; or a string of adjacent boxes is pushed in the same direction while the head box is in a conflict. Third, a box is pushed successfully if it is not involved in a conflict.

In the experiment, EMA (exponential moving average) Q-learning [11], WoLF-PHC [10], and SARSA (state-action-reward-state-action) [21] are chosen as comparison algorithms. EMA Q-learning and WoLF-PHC are MARL algorithms, while SARSA is a single-agent RL algorithm corresponding to centralized learning in the context of multiple agents.

The parameters were fine-tuned after many trials. For EAQR, the number of sample games is . The learning rate follows (51), where the initial learning rate is , is the predefined number of learning episodes, and is the number of experienced learning episodes. The exploration rate follows (52).
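The exact forms of (51) and (52) are given by the original equations and are not reproduced here. One decay schedule consistent with the wording above (the rate shrinks as the fraction of experienced learning episodes approaches the predefined total) is sketched below; the linear form and the names are assumptions, not the published schedule.

```python
def decayed_rate(initial_rate: float, episode: int, total_episodes: int) -> float:
    """Assumed linear decay consistent with the description of (51):
    starts at initial_rate and shrinks as the number of experienced
    learning episodes approaches the predefined total. The published
    schedule may use a different functional form."""
    return initial_rate * (1.0 - episode / total_episodes)
```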

For EMA Q-learning, , , and the discount factor . The learning rate follows (51) with ; and follows (53):

For WoLF-PHC, , , , , and the learning rate follows (51), with . For SARSA, , , and are the same as those in WoLF-PHC.

The prime performance metric is the average number of steps per episode, which should be minimized. The second performance metric is the success rate, which reflects the stability of the algorithm. A success means that the minimum number of steps is used in an episode. The theoretical minimum number of steps of an episode is determined by a specially designed program; thus, the success rate over a number of episodes can be evaluated. The experimental results in Tables 1 and 2 are averaged over 100 runs. The standard deviation is also presented in Table 1. Table 3 shows the worst run for each algorithm. Each run consists of a number of learning episodes and 50,000 evaluation episodes. During the learning episodes, the agents update their strategies to try to obtain more cumulative reward; during the evaluation episodes, the agents do not update their strategies. For the sake of fairness, in the same columns of Tables 1 and 3, the initial positions of the boxes for the evaluation episodes are the same.

Tables 1 and 2 show that all algorithms perform better as the number of learning episodes grows. "Optimal" represents the theoretical minimum number of steps. EAQR presents the best performance for every number of learning episodes, which means that EAQR learns faster than any of the other algorithms. Besides, EAQR obtains an average success rate of 99.6% when the number of learning episodes is 1,000,000, which means that it can complete the box-pushing task in the minimum number of steps with a probability of 99.6%. This result is sufficiently good to complete the task satisfactorily. Single-agent RL performs poorly in the beginning, but it performs fairly well in terms of average success rate when the number of learning episodes is 1,000,000. Still, single-agent RL is outperformed by EAQR. Table 3 shows that EAQR also has a better worst run than the other algorithms.

Case B. Distributed sensor network

The DSN problem is used as the second test bed for MARL algorithms. It was part of the NIPS 2005 benchmarking workshop [22]. Figure 12 shows a DSN composed of eight sensors, each of which is viewed as an agent. The sensors have to cooperate to capture two targets wandering in a grid of three cells. At each time step, each target moves to its left side, moves to its right side, or keeps still with equal probability. Each cell can be occupied by at most one target at any time. The targets move sequentially; thus, if a target would move out of the grid or into a cell already occupied by the other target, it stays where it is. Each sensor also has three actions: focus on its left side, focus on its right side, or do not focus at all. For example, sensor 4 can focus on cell 1 on its left side, focus on cell 0 on its right side, or make no focus. Although there is only one cell adjacent to sensors 0, 1, 2, and 3, these four sensors can still focus on the side with no cell. To capture a target, the sensors must accomplish three hits on the same target. One hit happens if at least three sensors focus on the cell occupied by a target. A target is removed from the grid once it is captured. The reward allocation rules follow [23]: if a target is captured by four sensors, the sensor with the minimum index gets no reward, and the other three sensors each receive a reward of 10. The action of focusing incurs a local immediate reward of −1, and no focus yields a local immediate reward of 0.
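As an illustration of the target dynamics described above, the following sketch moves the targets sequentially; the three-cell grid and the equal-probability moves come from the text, while the function and variable names are ours.

```python
import random

GRID_CELLS = 3  # cells 0, 1, 2

def move_targets(positions):
    """positions: dict target_id -> cell index, or None if captured.
    Targets move sequentially; each moves left, right, or stays with equal
    probability, and remains in place if the move would leave the grid or
    enter a cell occupied by the other target."""
    for tid, pos in positions.items():
        if pos is None:              # captured targets are off the grid
            continue
        step = random.choice([-1, 0, 1])
        new_pos = pos + step
        occupied = {p for t, p in positions.items() if t != tid and p is not None}
        if 0 <= new_pos < GRID_CELLS and new_pos not in occupied:
            positions[tid] = new_pos
    return positions
```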

The goal of the DSN problem is to capture the targets with as much cumulative reward as possible in an episode. At the beginning of each episode, both targets are randomly located in the grid. At each time step, all sensors take actions simultaneously. The judgments of focus, no focus, hit, and capture are made, and the local immediate rewards are fed back to each sensor. Then the targets move, and the new state is fed back to each sensor. An episode ends when both targets are captured or 1000 time steps have elapsed. Each sensor can perceive the state and the local immediate rewards. However, the sensors do not have any a priori knowledge such as what constitutes a hit, what constitutes a capture, or the goal of the problem; they do not know the reward allocation rules either.

There are 37 states and $3^8 = 6{,}561$ joint actions in the DSN problem. A single-agent RL algorithm needs to store and learn the Q-values of $37 \times 6561 = 242{,}757$ state-action pairs, and this number grows exponentially as the number of sensors increases. By learning over each agent's own actions instead of joint actions, the number of state-action pairs is reduced to $37 \times 3 = 111$ per agent (888 in total), and the total grows linearly as the number of sensors increases.

The optimal strategy for the DSN problem is for three sensors to focus on each target while the remaining sensors make no focus at all. The optimal global cumulative reward is therefore 42, and the optimal number of steps is 3. According to the credit assignment rules in [23], there is no punishment if no agent focuses at all, which can lead to more steps in an episode. Thus, we select the average global cumulative reward per episode as the main performance metric and the average number of steps as the secondary performance metric. A success occurs if a global cumulative reward of 42 is obtained in an episode.
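The value of 42 can be checked directly from the reward rules above: each target needs three hits, one per step, with exactly three sensors focusing on it at each step, which gives
$$2 \times \Big( \underbrace{3 \times 3 \times (-1)}_{\text{focus costs over 3 steps}} + \underbrace{3 \times 10}_{\text{capture reward}} \Big) = 2 \times (30 - 9) = 42 .$$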

For EAQR, the learning rate is constant and set to 0.2; the number of sample games is set to 50, and the exploration rate follows (52). For EMA Q-learning, the learning rate follows (51) with ; the exploration rate is constant and set to 0.8; , , ; and follows (53). For WoLF-PHC, the parameters are , ; the learning rate follows (51) with ; the exploration rate is constant and set to 0.2; and . For single-agent RL, , , and are the same as those of WoLF-PHC.

Tables 4 and 5 show that after 100,000 learning episodes the success rate of EAQR is 100%, and it attains an average global cumulative reward of 42, which is exactly the theoretical optimal cumulative reward in an episode. The standard deviation is also presented in Table 5. EAQR has great advantages over the other algorithms in terms of success rate. Table 5 shows that for WoLF-PHC and single-agent RL, higher average cumulative rewards might be achieved if more learning episodes were given. However, there is no such trend for EMA Q-learning; more learning would probably not improve its performance. Table 6 shows the worst run of cumulative reward for each algorithm. It indicates that EAQR also has a better worst cumulative reward than the other algorithms.

Table 7 shows that EAQR consumes fewer steps to capture both targets than the other algorithms. The standard deviation is also presented in Table 7. Table 8 shows the worst run of steps for each algorithm. Due to the large joint action space of the DSN problem, single-agent RL shows the worst performance among all algorithms. This experiment shows that solving a multiagent reinforcement learning problem from a single-agent view is inadvisable.

EAQR shows good performance in both stochastic games, which indicates that most of the time EAQR converges to one of the optimal global cumulative rewards under any initial strategy; otherwise, EAQR would not attain a success rate of 99.6% in Case A and 100% in Case B. EAQR also alleviates the curse of dimensionality of the joint action space, yet the same problem for the joint state space remains to be addressed. For some stochastic games, such as box-pushing [24, 25] and the hunting game [26, 27], the circle and the grid can be viewed as images, and many states are actually the same if a translation operation is performed on these "images". Thus, the structure of convolutional neural networks [28, 29] can be employed to realize an autoencoder [30] that automatically extracts features of the original state space and uses these features to construct a compressed state space.

6. Conclusions

In this paper, we deal with the problem of how to achieve optimal coordination in fully cooperative multiagent systems. Firstly, we propose a cooperative multiagent Q-learning algorithm called EAQR and analyze its dynamics in seven repeated games. The results in these games show that if there is only one optimal global immediate reward, then EAQR can converge to it; however, if more than one optimal global immediate reward exists, then EAQR may not converge to any of them. Secondly, we test EAQR in two stochastic games, one with four agents and the other with eight agents. EAQR shows excellent performance in both tasks and achieves the theoretical optimal cumulative reward in the DSN problem.

We will carry on our work in three directions in the future. Firstly, we need to find a way to depict the learning process in stochastic games to help us understand why EAQR works well in these tasks. Secondly, we will draw on solutions to consensus problems [31, 32] to design new action exploration methods that can be analyzed more easily and have rigorous theoretical proofs in general cases. Thirdly, we will employ convolutional neural networks and autoencoders to alleviate the curse of dimensionality of the state space in some collaborative tasks.

Data Availability

All underlying data related to this paper are available by sending emails to the corresponding author Zhen Zhang with email address [email protected].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors thank Professor Dongbin Zhao for his suggestions on the convergence analysis in this paper. This work was supported by the Shandong Provincial Natural Science Foundation of China under Grants ZR2017PF005 and ZR2015FM017 and by the National Natural Science Foundation of China under Grant 61573205.