Abstract

The exponential explosion of joint actions and massive data collection are two main challenges for multiagent reinforcement learning algorithms with centralized training. To overcome these problems, this paper proposes a model-free and fully decentralized actor-critic multiagent reinforcement learning algorithm based on message diffusion. To this end, the agents are assumed to be placed in a time-varying communication network. Each agent makes only limited observations of the global state and joint actions; therefore, it needs to obtain and share information with others over the network. In the proposed algorithm, agents hold local estimations of the global state and joint actions and update them with local observations and the messages received from neighbors. Under the hypothesis of global value decomposition, the gradient of the global objective function with respect to an individual agent's policy is derived. The convergence of the proposed algorithm with linear function approximation is guaranteed by the stochastic approximation theory. In the experiments, the proposed algorithm was applied to a multiagent passive location environment and achieved superior performance compared with state-of-the-art algorithms.

1. Introduction

A reinforcement learning (RL) agent masters skills through trial and error. The agent is placed in an environment that returns responses and rewards corresponding to its actions over discrete time steps. The learning process is modeled as a Markov decision process (MDP) [1, 2]. The goal of the agent is to find an optimal policy that maximizes the expectation of long-term gain without knowledge of the world model (model-free [3]). Traditional tabular RL fails to handle situations with large or continuous state spaces, which limits its wider application. Recent theoretical and practical developments have revealed that deep learning (DL) techniques, with their powerful representation ability, can deal with such situations efficiently. Deep reinforcement learning (DRL), a combination of RL and DL, has made remarkable achievements in the fields of chess [4], video games [5], and physical control tasks [6]. At the same time, the use of DRL to solve multiagent problems [7, 8] has gradually broadened into a new research field, called multiagent deep reinforcement learning (MARL) [9]. A series of recent studies have indicated that MARL algorithms have reached top human levels in multiplayer real-time strategy games [10]. An important problem in MARL is to learn cooperation on team tasks [7], i.e., the MARL cooperation problem. So far, the most popular solution to this problem is centralized training with decentralized execution [11, 12]. Methods in this fashion need a global value function [13, 14] or a global critic [11] in the training stage and assume the existence of a control center that is able to access the global state [15]. However, due to the constraints of real-world factors such as energy, geographical limitations, and communication ability, it is hard to collect the data of all agents at a center when the number of agents is large.

Fully decentralized MARL algorithms yield promising results in large-scale multiagent cooperation problems, where agents learn and execute actions based on local observations. Independent Q-learning (IQL) [16, 17], a simple and scalable algorithm, is a typical fully decentralized MARL algorithm. An IQL agent is trained through its local observations and executes an independent local policy. However, from the perspective of an individual agent, the other agents' actions are nonstationary [18], which makes IQL fail in many tasks. In [19], the authors proposed a decentralized multiagent version of the tabular Q-learning algorithm called QD-learning, where each agent is only aware of its own local reward and exchanges information with others, but observes the global state and joint actions. In [20], all agents are assumed to be placed in a time-varying network, and a fully distributed actor-critic MARL algorithm is proposed, in which each agent has its own parameterized value function and updates it with a weighted sum of other agents' parameters. The main limitation of this algorithm is that it also assumes that the global state and the joint actions of all agents can be obtained directly, which is still difficult to satisfy in many actual scenarios.

The problem addressed in this paper is to relieve the difficulties of MARL algorithms caused by centralized training. To this end, this paper proposes a model-free and completely decentralized MARL algorithm based on message diffusion. In this method, all agents are assumed to be in a time-varying communication network, where each agent obtains information from its neighbors and spreads its local observation and action in a diffusion fashion [21]. Each agent has its own reward function, and the global reward is computed as the mean of the local rewards. Each agent is designed as an actor-critic reinforcement learner and maintains a local estimation of the global state and joint actions. These estimates are updated with the agent's own observations as well as messages received from neighbors, namely, in a diffusion style. We leverage stochastic approximation to analyze the convergence of the update process in the proposed algorithm with linear function approximation, and the convergence is guaranteed under reasonable assumptions. The proposed algorithm is evaluated on multiagent passive location tasks, and the results demonstrate its convergence and effectiveness.

2. Actor-Critic for a Single Agent

The reinforcement learning process can be described by an MDP $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ represents the set of states and $\mathcal{A}$ is the set of actions. $P(s' \mid s, a)$ denotes the state transition probability. The reward function is denoted as $R(s, a)$, and the instant reward at time $t$ is $r_t$. The agent's policy is represented as $\pi(a \mid s)$, which is parameterized as $\pi_\theta(a \mid s)$, or $\pi_\theta$ for short, with $\theta$ representing the parameters. The task of the agent is to learn a policy that maximizes the expectation of the cumulative reward, i.e., the objective function $J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[R(s, a)\right]$, where $d^{\pi_\theta}$ represents the stationary state distribution of the Markov chain induced by policy $\pi_\theta$. For policy $\pi_\theta$, the action value function $Q^{\pi_\theta}(s, a)$ represents the expectation of the future cumulative reward after visiting state $s$ and taking action $a$. $Q^{\pi_\theta}$ is parameterized as $Q_\omega(s, a)$, or $Q_\omega$ for short, with $\omega$ denoting the parameters of the action value function. The state value function is defined based on $Q^{\pi_\theta}$, i.e., $V^{\pi_\theta}(s) = \mathbb{E}_{a \sim \pi_\theta}\left[Q^{\pi_\theta}(s, a)\right]$, and the advantage function is $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, which can be regarded as a measure of the advantage of taking action $a$ over other actions. According to [22], the gradient of the objective function can be computed by
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right].$$

The policy parameters are updated with the policy gradient, $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)$, where $\alpha$ is the step size, and $\nabla_\theta \log \pi_\theta(a \mid s)$ is referred to as the score function of policy $\pi_\theta$.
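As a concrete illustration of this update, the following sketch performs one actor-critic step with a linear softmax policy and a linear state-value critic; the discount factor, feature-based parameterization, and all names are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def softmax_policy(theta, s_feat):
    """Action probabilities of a softmax policy with linear preferences."""
    prefs = theta @ s_feat                      # one preference per action
    prefs -= prefs.max()                        # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def actor_critic_step(theta, w, s_feat, a, reward, next_s_feat,
                      alpha=1e-2, beta=1e-1, gamma=0.99):
    """One actor-critic update: TD error as advantage estimate, score-function ascent."""
    v, v_next = w @ s_feat, w @ next_s_feat     # linear state-value estimates
    delta = reward + gamma * v_next - v         # TD error used as advantage A(s, a)
    w = w + beta * delta * s_feat               # critic: semi-gradient TD(0)
    probs = softmax_policy(theta, s_feat)
    score = -np.outer(probs, s_feat)            # d log pi / d theta, normalizer part
    score[a] += s_feat                          # plus the feature term of the taken action
    theta = theta + alpha * delta * score       # actor: policy-gradient ascent
    return theta, w
```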

3. Agents in Time-Varying Networks

Definition 1 (Networked agents). Consider a finite directed graph $\mathcal{G}_t = (\mathcal{N}, \mathcal{E}_t)$, where $\mathcal{N}$ and $\mathcal{E}_t$ represent the set of agents and the set of communication relations at time $t$, respectively. The number of agents is $N = |\mathcal{N}|$, and the set of communication pairs is denoted as $\mathcal{E}_t \subseteq \mathcal{N} \times \mathcal{N}$, with $C_t \in \mathbb{R}^{N \times N}$ denoting the connection matrix of the agents. The multiagent Markov decision process with networked agents is denoted by $(\mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, P, \{R_i\}_{i \in \mathcal{N}}, \mathcal{G}_t)$. There is at least one path between any pair of agents, and each agent can obtain the observations and actions of its neighbors.

At time $t$, the observation of agent $i$ is $o_t^i \in \mathcal{O}_i \subseteq \mathbb{R}^{d_o}$, where $\mathcal{O}_i$ represents the observation space and $d_o$ refers to the dimension of the observation vector. The observations of all agents constitute the global state $s_t = (o_t^1, \dots, o_t^N)$, $s_t \in \mathcal{S}$. The action of agent $i$ is defined as $a_t^i \in \mathcal{A}_i \subseteq \mathbb{R}^{d_a}$, where $d_a$ represents the dimension of the action space. The joint action of all agents can be expressed as $a_t = (a_t^1, \dots, a_t^N)$, $a_t \in \mathcal{A}$. Agent $i$ holds a local estimation $\hat{s}_t^i$ of the global state and a local estimation $\hat{a}_t^i$ of the joint action (if $t$ is not emphasized, they are denoted as $\hat{s}^i$ and $\hat{a}^i$, respectively). The reward obtained by agent $i$ is denoted by $r_t^i$, which is assumed to be bounded. The joint policy, $\pi_\theta(a \mid s)$, maps the global state onto the joint action of all agents. Actions among agents are independent, and the relationship between the joint policy and the local policies satisfies $\pi_\theta(a \mid s) = \prod_{i \in \mathcal{N}} \pi_{\theta_i}(a^i \mid s)$, where $\pi_{\theta_i}$ is the local policy of agent $i$, which is parameterized by $\theta_i \in \mathbb{R}^{m_i}$, with $m_i$ being the dimension of agent $i$'s parameter space. The policy parameters of all agents are represented as $\theta = (\theta_1^\top, \dots, \theta_N^\top)^\top$, where $\theta \in \mathbb{R}^{\sum_i m_i}$. The set of agent $i$'s neighbors is defined by $\mathcal{N}_t^i = \{j \in \mathcal{N} : (j, i) \in \mathcal{E}_t\}$.

Assumption 2. For any $i \in \mathcal{N}$, $s \in \mathcal{S}$, and $a^i \in \mathcal{A}_i$, the policy function $\pi_{\theta_i}(a^i \mid s)$ is continuously differentiable with respect to $\theta_i$. For any $\theta$, the Markov chain induced by $\pi_\theta$ is irreducible and aperiodic.

The policy is required to be differentiable with respect to its parameters so that deep neural networks can be used. Furthermore, the Markov chain induced by policy $\pi_\theta$, being aperiodic and irreducible [23], has a stationary distribution.

The task in the multiagent cooperation problem is to find a joint policy $\pi_\theta$ that maximizes the expectation of the long-term reward averaged over all agents:
$$J(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} \bar{r}_{t+1}\right],$$
where $\bar{R}(s, a) = \frac{1}{N}\sum_{i \in \mathcal{N}} R_i(s, a)$ represents the global averaged reward function, and its value at time $t$ is $\bar{r}_t = \frac{1}{N}\sum_{i \in \mathcal{N}} r_t^i$. Then, the global action value function is defined by
$$Q^{\pi_\theta}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty}\left(\bar{r}_{t+1} - J(\theta)\right) \,\middle|\, s_0 = s,\, a_0 = a,\, \pi_\theta\right].$$

The local action value function of agent $i$ is $Q_i^{\pi_\theta}(s^i, a^i)$, which is parameterized as $Q_{\omega_i}$. Since the agents choose their actions independently, the local advantage function is defined as
$$A_i^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a^i, a^{-i}) - \sum_{a^i \in \mathcal{A}_i} \pi_{\theta_i}(a^i \mid s)\, Q^{\pi_\theta}(s, a^i, a^{-i}),$$
where $a^{-i}$ represents the joint action of all agents except agent $i$.

Assumption 3. The global value function can be decomposed into a weighted sum of the local value functions over agents, namely,
$$Q^{\pi_\theta}(s, a) = \sum_{i \in \mathcal{N}} c_i\, Q_i^{\pi_\theta}(s^i, a^i),$$
where $c_i \ge 0$ reflects the importance of the individual action value function $Q_i^{\pi_\theta}$. Note that the action value function of agent $i$ depends only on its local observation $s^i$ and action $a^i$.

Assumption 3 is similar to the value decomposition in [14]; under it, the following theorem about the gradient of the global objective function with respect to a single agent's policy holds.

Theorem 4. Under the conditions of Definition 1, Assumption 2, and Assumption 3, the gradient of the global objective function with respect to agent $i$'s policy is computed by
$$\nabla_{\theta_i} J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a^i \mid s)\, A_i^{\pi_\theta}(s, a)\right],$$
where $\nabla_{\theta_i} \log \pi_{\theta_i}(a^i \mid s)$ is the score function, and $A_i^{\pi_\theta}$ is the local advantage function defined by (9).

The proof of Theorem 4 follows a scheme similar to [22], and the complete proof is provided in Appendix A. Theorem 4 indicates that the gradient of the global objective function can be estimated from the agent's local score function and local advantage function. Although there is a certain deviation between the local observation of a single agent and the global state, in a time-varying network an agent can improve the accuracy of its global state estimation by exchanging messages. In the next section, a completely decentralized multiagent reinforcement learning algorithm based on message diffusion over a communication network is proposed.

4. Distributed Actor-Critic Based on Message Diffusion

Consider a group of intelligent agents with communication abilities. Each agent would like to obtain more information from the other agents so as to estimate the value function and optimize its policy more accurately, but it can only make partial observations and receive messages via the communication network. In this section, a message diffusion-based distributed MARL algorithm is proposed to let agents gather global information and spread their observations efficiently.

In the algorithm, at time step $t$, each agent $i$ holds a global state estimation $\hat{s}_t^i$ and a joint action estimation $\hat{a}_t^i$, which are updated by (13)–(15), where $\mu_t^i$ is a bootstrapped estimation of the agent's long-term return, and $\beta_t$ is the step size. Agent $i$'s global state estimation is updated in two steps. In step one, the part of the global state estimation corresponding to agent $i$ is replaced by its new local observation according to (13), where $e_i$ is a unit vector of dimension $N$ whose only nonzero element, the $i$th, is 1, $\otimes$ represents the Kronecker product, $\mathbf{1}_{d_o}$ is a $d_o$-dimensional vector with all elements equal to 1, and $\mathbf{1}_{d_a}$ is defined similarly. The operation "$\circ$" in (13) and (14) is the element-wise product. In step two, the global state estimation is updated with message diffusion according to (15), i.e., by combining the estimations received from neighbors. The parameters of the value function of agent $i$ are updated through (16), where $\delta_t^i$ denotes the local temporal difference error of agent $i$. According to Theorem 4, the policy of agent $i$ is improved via gradient ascent (17), where $\alpha_t$ is the step size, $A_i$ is the local advantage function defined by (9), and $\nabla_{\theta_i} \log \pi_{\theta_i}(a_t^i \mid \hat{s}_t^i)$ is the score function. Algorithm 1 summarizes the steps of the proposed approach. The algorithm works in an on-policy fashion; i.e., transitions are discarded once they have been used to update the policy parameters.
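A minimal sketch of the two-step estimation update described above is given below, assuming uniform, row-stochastic combination weights over each agent's current neighborhood; the function names and weighting scheme are illustrative and do not reproduce the paper's exact equations (13)–(15).

```python
import numpy as np

def local_replacement(global_est, agent_idx, local_obs, obs_dim):
    """Step one: overwrite agent i's own block of its global-state estimate."""
    est = global_est.copy()
    est[agent_idx * obs_dim:(agent_idx + 1) * obs_dim] = local_obs
    return est

def diffusion_update(estimates, neighbors):
    """Step two: each agent averages its estimate with those received from neighbors."""
    new_estimates = []
    for i in range(len(estimates)):
        peers = [i] + list(neighbors[i])
        weight = 1.0 / len(peers)                          # uniform row-stochastic weights
        new_estimates.append(sum(weight * estimates[j] for j in peers))
    return new_estimates
```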

Input: Initialize the policy parameters $\theta_i$, the value parameters $\omega_i$, and the step sizes $\alpha_t$ and $\beta_t$. Each agent $i$ initializes its return estimation $\mu_0^i$ and obtains the observation $o_0^i$, then initializes the estimations $\hat{s}_0^i$ and $\hat{a}_0^i$. Initialize the counter $t = 0$.
1: repeat
2: for each agent $i \in \mathcal{N}$ do
3:  Sample and execute an action $a_t^i$ from the policy function $\pi_{\theta_i}$, and obtain $r_{t+1}^i$ and $o_{t+1}^i$;
4:  Update the return estimation $\mu_t^i$;
5:  Update the local part of the state estimation according to (13);
6:  Update the local part of the action estimation according to (14);
7:  Send the updated estimations to the neighbors.
8: end for
9: for each agent $i \in \mathcal{N}$ do
10:  Update the estimation of the global state $\hat{s}_t^i$ by message diffusion according to (15);
11:  Update the estimation of the joint action $\hat{a}_t^i$ by message diffusion;
12: end for
13: for each agent $i \in \mathcal{N}$ do
14:  Calculate the temporal difference error $\delta_t^i$;
15:  Update the critic parameters $\omega_i$ according to (16);
16:  Update the advantage function $A_i$;
17:  Update the score function $\nabla_{\theta_i} \log \pi_{\theta_i}(a_t^i \mid \hat{s}_t^i)$;
18:  Update the actor parameters $\theta_i$ according to (17);
19: end for
20: $t \leftarrow t + 1$;
21: until the algorithm converges.
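The learning steps of Algorithm 1 (temporal difference error, critic update, and actor update) can be sketched for a single agent as follows, assuming the average-reward temporal difference error used in the related algorithm of [20] and a linear critic; the symbols mu, omega, and theta follow the notation above, and the exact update rules (16) and (17) are not reproduced literally.

```python
import numpy as np

def agent_update(mu, omega, theta, phi_t, phi_next, r_next, score_fn, alpha, beta):
    """One agent's critic/actor update with an average-reward TD error (a sketch).

    phi_t, phi_next : feature vectors of the current and next estimated state-action pairs
    r_next          : the agent's local instant reward
    score_fn        : callable returning grad_theta log pi(a_t | s_hat_t)
    """
    q_t, q_next = omega @ phi_t, omega @ phi_next
    mu = (1 - beta) * mu + beta * r_next            # bootstrapped long-term return estimate
    delta = r_next - mu + q_next - q_t              # local temporal difference error
    omega = omega + beta * delta * phi_t            # critic update, cf. (16)
    advantage = delta                               # TD error as an advantage estimate
    theta = theta + alpha * advantage * score_fn()  # actor update via gradient ascent, cf. (17)
    return mu, omega, theta
```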

5. Convergence Analysis

To analyze the convergence of Algorithm 1, we make several assumptions about the policy update and the step sizes. Under these assumptions and with linear function approximation, it is shown that both the value function and the policy in Algorithm 1 converge almost surely.

Assumption 5. For each agent $i$, the update of the policy parameter $\theta_i$ is confined by a projection operator $\Gamma_i$ onto a compact set $\Theta_i \subset \mathbb{R}^{m_i}$. Moreover, $\Theta_i$ includes at least one local maximum of $J(\theta)$.

This assumption is commonly used in the analysis of transient behavior with stochastic approximation [24, 25]. In fact, it is made only for the convenience of analysis and is not required in the experiments.

Assumption 6. The step sizes $\beta_t$ and $\alpha_t$ satisfy $\sum_t \beta_t = \infty$, $\sum_t \beta_t^2 < \infty$, $\sum_t \alpha_t = \infty$, and $\sum_t \alpha_t^2 < \infty$.

Assumption 6 is essential for stochastic approximation and other stochastic bootstrapping algorithms. It ensures that the correction at each step becomes smaller and smaller, yet does not shrink so fast that the iterates get stuck near their initial values instead of converging.

The action value function of agent $i$ is approximated by the linear function family $Q_{\omega_i}(s, a) = \phi(s, a)^\top \omega_i$, where $\phi(s, a) \in \mathbb{R}^K$ is the feature vector corresponding to the state-action pair $(s, a)$ and is uniformly bounded. The feature matrix $\Phi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times K}$ has full column rank, and its $k$th column is denoted by $\phi_k$. For convenience, $\phi(s_t, a_t)$ is abbreviated as $\phi_t$.
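To make the linear approximation concrete, the following sketch builds a feature matrix over a small state-action space and checks the full-column-rank condition assumed above; the random feature construction is purely illustrative.

```python
import numpy as np

def build_feature_matrix(n_states, n_actions, n_features, rng):
    """Stack one bounded feature vector phi(s, a) per state-action pair."""
    return rng.uniform(-1.0, 1.0, size=(n_states * n_actions, n_features))

rng = np.random.default_rng(0)
Phi = build_feature_matrix(n_states=6, n_actions=3, n_features=4, rng=rng)
assert np.linalg.matrix_rank(Phi) == Phi.shape[1]   # full column rank, as assumed above

def q_value(omega, s, a, n_actions, Phi):
    """Linear approximation Q_omega(s, a) = phi(s, a)^T omega."""
    return Phi[s * n_actions + a] @ omega
```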

Theorem 7. Under Assumptions 2, 3, 5, and 6, for any policy $\pi_\theta$, the value parameter sequence $\{\omega_t^i\}$ generated by (16) converges to a point $\omega_\pi^i$ with probability 1, where $\omega_\pi^i$ depends on the policy $\pi_\theta$.

Proof. With linear function approximation, consider the update process of the parameter $\omega_t^i$ given by (16). The gradient term in the update can be seen as the error caused by the current parameter; we take its square as an objective, and when this objective attains its minimum, its gradient vanishes. Hence, the convergence point of the update can be found by optimizing this squared error. Rewrite (16) by introducing a random variable that estimates the long-term return, which yields (19), where the update step size satisfies Assumption 6; we assume that the second-order moments of the involved noise terms are bounded. Collecting the iterates and noise terms, the update process of (19) can be written in the standard stochastic approximation form (20), and we analyze its convergence with the stochastic approximation theory [26]. Rewriting (20) further as (21), the stochastic approximation theory requires the following:
(1) the driving function $h(\cdot)$ is Lipschitz continuous, and the limit $h_\infty(x) = \lim_{c \to \infty} h(cx)/c$ exists;
(2) $\{M_{t+1}\}$ is a martingale difference sequence, and there exists $K > 0$ such that $\mathbb{E}\left[\|M_{t+1}\|^2 \mid \mathcal{F}_t\right] \le K\left(1 + \|x_t\|^2\right)$ for any $t$;
(3) the step size satisfies Assumption 6;
(4) the ordinary differential equation $\dot{x} = h_\infty(x)$ has the origin as its unique globally asymptotically stable equilibrium.
The update process described by (20) obviously satisfies conditions (2) and (3). To verify condition (1), it only needs to be checked that the involved quantities are bounded. The feature vectors are bounded by assumption; the estimations updated by message diffusion are bounded because the connection matrix is a stochastic matrix; and, according to Definition 1, the instant reward is bounded. The convergence analysis of the estimated reward $\mu_t^i$, which implies that it is bounded, is given in Appendix B.
Since the feature matrix $\Phi$ has full column rank, the matrix governing the mean ODE is nonsingular, and its eigenvalues are nonzero. Denote an eigenvalue of this matrix by $\lambda$ and the corresponding eigenvector by $v$. Because the matrix is real, its complex eigenvalues appear in conjugate pairs, and the real part of each eigenvalue is negative. Hence, the associated ODE has a globally asymptotically stable equilibrium. According to the stochastic approximation theory [26], the update described by (21) converges; thus, $\omega_t^i$ converges almost surely.

Theorem 8. Under Assumptions 2, 5, and 6, the policy parameter $\theta_i$ updated through (17) converges with probability 1 to an asymptotically stable equilibrium of the ordinary differential equation (ODE) (24), whose right-hand side is given by the projected policy gradient and is a continuous function of $\theta_i$.

Proof. Denote the $\sigma$-field generated by the history of the algorithm up to time $t$ as $\mathcal{F}_t$, and define two auxiliary random variables. The update process of (17) is represented in projected form as (27). Due to the convergence of the policy evaluation step, the critic error vanishes asymptotically. The resulting noise term is a martingale difference sequence, and it is bounded because the features, the score function, and the temporal difference error are bounded. Based on Assumption 6, the sum of the squared increments is finite; furthermore, the accumulated noise converges according to the convergence theory of martingale difference sequences [23]. Thus, for any $\epsilon > 0$, the probability of large deviations of the accumulated noise vanishes. Therefore, (27) satisfies the conditions of the Kushner-Clark lemma [28, 29], and the policy parameter will converge to the asymptotically stable equilibrium of ODE (24).

6. Experiments

In this section, the proposed method is evaluated from two aspects:
(i) An ablation study, which investigates the influence of the number of agents on the performance.
(ii) A comparison study, which examines the advantages of the proposed method over existing methods.

6.1. Multiagent RL for Passive Location Tasks

The experiments are performed in a passive location task environment, a reinforcement learning environment where agents need to automatically find an optimal geometry to improve the positioning precision. The environment was introduced in [27], where all the agents are controlled by a single brain that maps the global observation into joint actions. We modified the environment into a multiagent one by limiting the observations so that each agent can only access its own radio signals and position. Furthermore, each agent has a distinct brain that consists of an actor and a critic, learning and making decisions independently.

The whole scheme of the environment is shown in Figure 1. Consider a circular region with a radius of 6 km, at the center of which is a transmitter that emits radio signals all the time. The area within 1 km of the transmitter is a forbidden area, filled with the blue grid. Each agent is equipped with a radio receiver that can intercept wireless signals to estimate the position of the transmitter. All the radio receivers are assumed to have the same signal-processing performance. According to the sensitivity of the radio receiver, when an agent goes beyond a distance of 4 km from the transmitter, nothing can be received. So it is better for the agents to optimize the geometry within a region closer to the transmitter, shown as the dark gray area in Figure 1. Considering the multipath and interference of electromagnetic propagation, there are three low signal-to-noise ratio (SNR) regions, where the signals received by the agents are contaminated by strong noise, leading to low positioning precision. The position of the transmitter is estimated in two steps: first, the time lag of signal propagation is computed for each pair of agents; second, the transmitter's position that best satisfies these time lags is estimated with a least squares algorithm. The task of the agents is to navigate step-by-step to an optimal geometry configuration, avoiding the low SNR and forbidden regions, so as to improve the positioning precision.
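The two-step estimation procedure can be sketched as follows, assuming ideal time-difference-of-arrival measurements and a generic nonlinear least squares solver; the noise model, solver choice, and function names are assumptions, not the environment's actual implementation.

```python
import numpy as np
from scipy.optimize import least_squares

C = 3e8  # propagation speed (m/s)

def tdoa_residuals(p, receivers, time_lags):
    """Residuals between measured and predicted pairwise time differences."""
    d = np.linalg.norm(receivers - p, axis=1)             # distance to each receiver
    return [(d[i] - d[j]) / C - tau for (i, j), tau in time_lags.items()]

def estimate_transmitter(receivers, time_lags, init=np.zeros(2)):
    """Step two: least squares fit of the transmitter position to the time lags."""
    return least_squares(tdoa_residuals, init, args=(receivers, time_lags)).x
```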

6.2. Setup

We model the passive location task as a multiagent decision-making problem, on which we evaluate our proposed method. According to the key components of reinforcement learning, the observation, action, and reward function are defined as follows.

6.2.1. Observations

The observation of agent $i$ comprises the features of the received signals and its own position $p^i$. The state of agent $i$ is represented by $o^i = \left(f(x^i), p^i\right)$, where $f(\cdot)$ is a function that extracts features from the received signals $x^i$. In the experiment, the features of the received signals are the SNRs.

6.2.2. Action

The action of agent $i$ is its own position adjustment $\Delta p^i$. Hence, the agent will move to $p^i + \Delta p^i$ in the next time step. In the experiment, the actions are clipped to a bounded interval.

6.2.3. Reward

All the agents share the same reward at each time step. The reward function reflects the positioning precision and is defined in terms of the estimation error, where $p^\ast$ is the position of the transmitter, $\hat{p}_m$ is the $m$th estimate of $p^\ast$, and $M$ denotes the number of estimations performed in each time step; $M$ is fixed in the experiment.
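As a sketch only, a shared reward that decreases with the mean estimation error over the $M$ estimates of a time step could be computed as follows; the specific functional form is an assumption and not the paper's exact reward.

```python
import numpy as np

def positioning_reward(estimates, transmitter_pos):
    """Reward shared by all agents: higher when the M position estimates are
    closer to the true transmitter position (the negative-mean-error form is assumed)."""
    errors = np.linalg.norm(np.asarray(estimates) - transmitter_pos, axis=1)
    return -float(errors.mean())
```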

For each agent, both the actor and the critic are designed as fully connected neural networks with two hidden layers of 256 units, each followed by a tanh activation. The actor maps the observation into actions: it takes in the observation and outputs a two-dimensional Gaussian distribution, from which the action is sampled. The structure of the critic is similar to that of the actor, but its output is the value function, which is used to optimize the actor according to the policy gradient theory.
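A possible realization of the described networks in PyTorch is sketched below: two hidden layers of 256 units with tanh activations, a Gaussian actor, and a scalar critic. The layer sizes follow the text; everything else (class names, state-independent log standard deviation) is an assumption.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Maps an observation to a 2-D Gaussian from which the action is sampled."""
    def __init__(self, obs_dim, act_dim=2, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mu = self.mean(self.body(obs))
        return torch.distributions.Normal(mu, self.log_std.exp())

class Critic(nn.Module):
    """Same body as the actor but outputs a scalar value estimate."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs)
```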

The diffusion processes are completed one by one among agents in a random order, and each agent updates its global state estimation from its neighbors only once at each time step. Agents within 2 km of each other are neighbors. As shown in Figure 1, agent 2 is a neighbor of agent 1. Under the setting of Figure 1, agent 1 is able to access the observation and last action of agent 2 but cannot obtain this information about agent 3.
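The time-varying neighborhood can be computed as sketched below, with the 2 km radius taken from the text and the rest illustrative.

```python
import numpy as np

def neighbor_sets(positions, radius_km=2.0):
    """Return, for each agent, the indices of agents within the communication radius."""
    positions = np.asarray(positions)                     # shape (N, 2), in km
    diffs = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    return [
        [j for j in range(len(positions)) if j != i and dist[i, j] <= radius_km]
        for i in range(len(positions))
    ]
```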

6.3. Ablation Study

To understand the influence of the number of agents on the performance, we performed a series of experiments with different numbers of agents on the passive location task described above. Concretely, we focus on the following indicators, which reflect the training process across all the agents: averaged episode return, averaged episode length, policy entropy, and averaged number of neighbors. The maximum number of steps in an epoch is 800, with each episode executing no more than 100 steps.

Figure 2 shows the training curves after 1000 epochs across 5 random seeds that initialize the agents. The networks are trained with the Adam optimizer and a fixed learning rate. The averaged episode return is a key indicator that reflects the ability of the agents to accomplish the given task. From Figure 2(a), it can be seen that with few agents, the agents cannot master useful skills to obtain a higher episode return, but the performance improves as the number of agents increases. This is consistent with Figure 2(b), in which the averaged number of neighbors increases as the scale of the agents becomes larger. The averaged number of neighbors reflects the connectivity of the agents, which determines the diffusion efficiency of messages between agents. When the number of agents is small, the agents may be scattered at distances beyond their ability to exchange information. In that case, our message diffusion-based method reduces to independent agents, which cannot handle the cooperative task in a partially observed environment.

Figures 2(c) and 2(d) demonstrate the trends of the mean policy entropy and the averaged episode length over agents across 1000 epochs of training. Both indicators drop as more agents take part in the task. The decrease in episode length indicates that agents can accomplish the task in fewer steps and collaboratively find more elegant paths to optimal geometries. The decline in policy entropy suggests that the agents become more confident in decision-making.

6.4. Comparison Study

In the comparison study, we investigate the advantages of our proposed method over two kinds of algorithms that are popular in the field of decentralized multiagent reinforcement learning. The first is independent learning, exemplified by IQL [16]. Since the action space in the passive location task is continuous, we instead let every agent learn with its own actor-critic structure under partial observation, which we call independent actor-critic (IAC). The second decentralized baseline is the algorithm proposed in [20], in which the agents update their neural networks by directly combining the parameters of their neighbors; we refer to this method as weight sharing (WS).

The method proposed in this paper facilitates the collaboration of agents by message diffusion, which makes each individual estimation of the global state more accurate. We ran our method (Diffu for short) and the baseline methods (IAC and WS) with different numbers of agents, and the results are shown in Figure 3. In general, our method achieves superior performance in the experiments; for the proposed method, the more agents there are, the larger the advantage obtained. IAC failed in all the experiments due to the inherent challenge of multiagent environments, i.e., from the perspective of an individual agent, the actions of the other agents are nonstationary. WS attempts to address this challenge by introducing weight sharing among agents; it uses the information of neighbors indirectly by combining their neural network parameters, but this is not very effective in the experiments. In the passive location task, with more agents, message diffusion becomes easier, so that each agent has a more accurate estimation of the global state, which helps to tackle the nonstationarity problem.

The learned agents are able to perform passive location tasks effectively. Figure 4 shows the trajectories along which different numbers of agents navigate to an optimal geometry. In these two scenarios, the agents adjust the geometry collaboratively to improve the positioning precision and even master skills such as taking a detour to avoid the low SNR and forbidden regions, or sacrificing immediate reward for a better geometry configuration.

7. Conclusion

This paper investigated the multiagent problem in a time-varying network, where agents only observe partial information and exchange information with their neighbors. A fully decentralized actor-critic multiagent reinforcement learning algorithm based on message diffusion is proposed. Each agent is trained to make decisions depending on its own local observations and the messages received from neighbors. This completely decentralized training and execution method overcomes the data collection challenge, especially when both the state space and the action space are massive. The convergence of the proposed algorithm with linear function approximation is guaranteed. Experimental results confirmed the convergence and effectiveness of the proposed algorithm. This decentralized method can be applied in many other areas, such as packet routing in computer and wireless communications. In future work, more general function approximations will be employed to analyze the convergence of the algorithm.

Appendix

A. Proof of Theorem 4

Theorem A.1. Under the conditions of Definition 1, Assumption 2, and Assumption 3, the gradient of the global objective function with respect to agent $i$'s policy is computed by
$$\nabla_{\theta_i} J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a^i \mid s)\, A_i^{\pi_\theta}(s, a)\right],$$
where $\nabla_{\theta_i} \log \pi_{\theta_i}(a^i \mid s)$ is the score function, and $A_i^{\pi_\theta}(s, a)$ is the local advantage function:
$$A_i^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a^i, a^{-i}) - \sum_{a^i \in \mathcal{A}_i} \pi_{\theta_i}(a^i \mid s)\, Q^{\pi_\theta}(s, a^i, a^{-i}).$$

Proof. The gradient of the global value function is computed first, and from it the gradient of the global objective follows. Taking the summation over states, weighted by the stationary distribution, on both sides of (A.4), and noticing that the state distribution is stationary, the value terms cancel. Based on Assumption 2 in the paper, the expectation can then be rewritten in terms of the score function and the local advantage function, which completes the proof.

B. Proof of Convergence of Estimated Reward Function

Theorem B.1. The estimated reward $\mu_t^i$ is bounded, namely, $\sup_t \left|\mu_t^i\right| < \infty$, $\forall i \in \mathcal{N}$.

Proof. The update process of the estimated value of the reward function can be written as (B.1), and its asymptotics can be described by the ordinary differential equation (B.2). Using $h(\cdot)$ to represent the right-hand side of (B.2), it is obvious that $h$ is continuous. Since the involved distributions are probability functions and the reward is bounded, the limit $h_\infty(x) = \lim_{c \to \infty} h(cx)/c$ exists; therefore, the differential equation has a unique asymptotically stable equilibrium point. Meanwhile, since the instant reward is uniformly bounded, there exists a constant such that the martingale difference condition holds. According to the stochastic approximation theory in Appendix C, $\mu_t^i$ converges.

C. Stochastic Approximation Theory

Consider a stochastic approximation process for a $d$-dimensional random variable $x_t$:
$$x_{t+1} = x_t + \beta_t\left[h(x_t, Y_t) + M_{t+1}\right],$$
where $\beta_t$ is the iteration step size, and $\{Y_t\}$ is a finite Markov chain.

Assumption C.1. For the above stochastic approximation process, the following assumptions are made:
(1) $h(x, y)$ is Lipschitz continuous in its first argument.
(2) $\{Y_t\}$ is an irreducible Markov chain whose stationary distribution is $d(\cdot)$.
(3) The step size satisfies $\beta_t > 0$, $\sum_t \beta_t = \infty$, and $\sum_t \beta_t^2 < \infty$.
(4) $\{M_{t+1}\}$ is a martingale difference sequence, namely, $\mathbb{E}\left[M_{t+1} \mid \mathcal{F}_t\right] = 0$, and it satisfies $\mathbb{E}\left[\|M_{t+1}\|^2 \mid \mathcal{F}_t\right] \le K\left(1 + \|x_t\|^2\right)$ for some $K > 0$.
(5) The sequence $\{x_t\}$ is a bounded random sequence, i.e., $\sup_t \|x_t\| < \infty$ almost surely.

The iteration of the stochastic approximation is presented in (C.1), and its asymptotic property can be captured by the following differential equation:
$$\dot{x}(t) = \bar{h}\left(x(t)\right), \qquad \bar{h}(x) = \sum_{y} d(y)\, h(x, y).$$

Assuming that (C.3) has a globally asymptotically stable equilibrium point, the following two theorems hold.

Theorem C.2. Based on Assumption C.1 (1)–(4), if $\sup_t \|x_t\| < \infty$ almost surely, then $x_t$ converges almost surely to the globally asymptotically stable equilibrium point of (C.3).

Theorem C.3. Based on Assumption C.1 (1)–(4), suppose the limit $h_\infty(x) = \lim_{c \to \infty} h(cx)/c$ exists. If the differential equation $\dot{x} = h_\infty(x)$ has the origin as its unique globally asymptotically stable equilibrium, then $\sup_t \|x_t\| < \infty$ almost surely.
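A toy Robbins-Monro iteration illustrating the kind of convergence that Theorems C.2 and C.3 formalize is given below; the driving function $h(x) = x^\ast - x$, the noise scale, and the step-size schedule are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(1)
x, target = 0.0, 2.5
for t in range(1, 20001):
    beta = 1.0 / t                      # satisfies sum beta = inf, sum beta^2 < inf
    noise = rng.normal(scale=0.5)       # martingale difference term
    x += beta * ((target - x) + noise)  # h(x) = target - x, ODE dx/dt = target - x
print(round(x, 3))                      # approaches the equilibrium x* = target
```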

Data Availability

The data (experiment environment and source code) were developed by our research team, and we will release the source code at an appropriate time. Requests for access to these data should be made to Shengxiang Li, [email protected].

Conflicts of Interest

The authors declare that they have no conflicts of interest.