Special Issue: Theory and Methods of Wireless Cooperative Localization in Complex Scenarios
A Multiagent Reinforcement Learning Solution for Geometric Configuration Optimization in Passive Location Systems
Passive location systems locate transmitters by receiving their electromagnetic waves at one or multiple base stations, and they are widely used in security fields. However, the geometric configuration of the stations can greatly affect the positioning precision. In the literature, the geometry of a passive location system is mainly designed based on empirical models. These empirical models struggle to capture the sophisticated electromagnetic environment of the real world, which results in suboptimal geometric configurations and low positioning precision. To capture the characteristics of complicated electromagnetic environments and improve positioning performance, this paper proposes a novel geometry optimization method based on multiagent reinforcement learning. In the proposed method, agents learn to optimize the geometry cooperatively by factorizing the team value function into agentwise value functions. To facilitate cooperation and cope with data transmission challenges, a constraint is imposed on the data sent from the central station to the vice stations, ensuring that communications remain concise and effective. Empirical results on direct position determination systems show that the agents find better geometric configurations than existing methods in complicated electromagnetic environments.
Passive location techniques are used in various scenarios, such as telecommunication pseudo-base-station discovery and aviation interference investigation. Traditional passive location algorithms mainly include angle of arrival (AOA), time difference of arrival (TDOA), and frequency difference of arrival (FDOA) methods. These algorithms first estimate signal parameters and then locate the transmitter, and are therefore called two-step positioning methods. Direct position determination (DPD) [5, 6] uses the observations from all stations to locate the transmitter without estimating intermediate signal parameters, and it outperforms two-step methods in low signal-to-noise ratio (SNR) scenarios.
The geometric configuration of the stations can significantly affect the positioning precision in both two-step and DPD positioning algorithms. Some existing studies have tried to derive general principles for geometric configurations from massive experiments [9, 10], but only rough conclusions have been drawn: for instance, the stations should not line up, or they should form a triangle surrounding the transmitter. Several other studies have employed heuristic methods, such as the genetic algorithm (GA) or particle swarm optimization (PSO), to search for the optimal geometry. These methods rely on empirical models in which signals are assumed to propagate ideally. In the real world, however, the electromagnetic environment changes abruptly with the positions of the stations due to factors such as signal frequency, interference, attenuation, multipath, obstacles, and noise. These factors can hardly be described fully by empirical models, which leads to suboptimal geometric configurations and low positioning precision. It is therefore vital to adjust the geometric configuration to fit the sophisticated electromagnetic environment and improve the precision of a passive positioning task. We regard this problem as a sequential decision-making problem in a real-world complex electromagnetic environment, rather than as an optimal geometry search based on empirical models.
Reinforcement learning (RL) is a viable and elegant approach to obtaining an optimal policy for sequential decision-making problems. The intractable electromagnetic environment can be tracked by RL in a trial-and-error paradigm. A deep neural network (DNN), as a nonlinear parameterized model providing a compact and powerful representation of experience, can fit the complicated electromagnetic spatial distribution accurately. Therefore, this paper addresses the problem of finding the optimal geometric configuration of a passive location system through deep reinforcement learning (DRL).
Under the DRL framework, each station is treated as a mobile agent that can receive signals and decide where to go; the terms "station" and "agent" are hereafter used interchangeably. The agents need to optimize the geometric configuration collaboratively to improve the positioning precision, and they can share information via communication channels to facilitate this collaboration. However, the communication traffic becomes a concern as the number of agents increases, especially under adverse communication conditions.
This paper proposes an efficient multiagent reinforcement learning algorithm to optimize the geometric configuration of passive location systems. Each station is regarded as a mobile agent, and all agents share the collective objective of finding an optimal geometry that improves the positioning precision. To facilitate collaboration, the agents are trained based on value function decomposition, which implicitly solves the credit assignment problem among agents. A vice station needs to obtain information from the other stations to improve its evaluation of the situation and the quality of its decisions about where to go; meanwhile, the communication traffic must be reduced due to transmission and processing challenges. A mutual information objective is therefore employed to constrain the messages sent to the vice stations, ensuring their expressiveness and conciseness. The proposed method is evaluated on simulated DPD positioning tasks in a complicated electromagnetic environment. The results demonstrate that the agents can find better geometric configurations than existing methods.
This section introduces the relevant background on passive location (concretely, DPD) and multiagent reinforcement learning (MARL).
2.1. Passive Location with DPD
Consider $Q$ transmitters and $L$ stations intercepting the transmitted signals, as shown in Figure 1. Each station is equipped with an antenna array consisting of $M$ elements. The $q$th transmitter's position is denoted by $\mathbf{p}_q$. The complex envelopes of the signals observed by the $l$th station are given by the following equation:

$$\mathbf{r}_l(t) = \sum_{q=1}^{Q} b_{lq}\,\mathbf{a}_l(\mathbf{p}_q)\,s_q\big(t - \tau_l(\mathbf{p}_q)\big) + \mathbf{n}_l(t), \quad 0 \le t \le T, \tag{1}$$

where $\mathbf{r}_l(t)$ is a complex time-dependent observation vector and $b_{lq}$ is an unknown complex scalar representing the channel attenuation between the $q$th transmitter and the $l$th station. Moreover, $\mathbf{a}_l(\mathbf{p}_q)$ is the $l$th array response to the signal transmitted from position $\mathbf{p}_q$, and $s_q(t - \tau_l(\mathbf{p}_q))$ is the $q$th signal waveform transmitted at time $t$ and delayed by $\tau_l(\mathbf{p}_q)$. The vector $\mathbf{n}_l(t)$ represents noise, interference, and multipath effects on the signals.
For brevity, we use $\mathbf{a}_{lq}$ and $\tau_{lq}$ instead of $\mathbf{a}_l(\mathbf{p}_q)$ and $\tau_l(\mathbf{p}_q)$. The observed signal can be partitioned into $J$ sections of length $T_s$. Taking the Fourier transform of each section, we obtain

$$\bar{\mathbf{r}}_{lj}(k) = \sum_{q=1}^{Q} b_{lq}\,\mathbf{a}_{lq}\,\bar{s}_{qj}(k)\,e^{-i\omega_k \tau_{lq}} + \bar{\mathbf{n}}_{lj}(k), \tag{2}$$

where $k$ is the index of Fourier coefficients and $j$ is the time section index, $1 \le j \le J$.
In (2), we have $\omega_k = 2\pi k / T_s$, and $\bar{s}_{qj}(k)$ and $\bar{\mathbf{n}}_{lj}(k)$ denote the $k$th Fourier coefficients of the $q$th signal waveform and of the noise in the $j$th section, respectively.
Then, the vector $\mathbf{a}_{lq}\,e^{-i\omega_k \tau_{lq}}$ contains all the information about the transmitter's position. Furthermore, the phase shift caused by the unknown transmit time is cancelled out when the Fourier coefficients are used by the DPD method.
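The sectioning-and-FFT step above can be sketched in a few lines. This is a single-channel illustration; the sampling rate, tone frequency, and section count are assumptions for the demo, not values from the paper:

```python
import numpy as np

def sectioned_fft(signal, num_sections):
    """Partition a time series into equal-length sections and take the
    Fourier transform of each, yielding coefficients indexed by
    (section j, frequency k) as in the DPD formulation."""
    section_len = len(signal) // num_sections
    sections = signal[:section_len * num_sections].reshape(num_sections, section_len)
    return np.fft.fft(sections, axis=1)  # shape: (J, K)

# A delayed copy of a waveform differs per frequency bin only by a
# phase factor exp(-i * omega_k * tau), which DPD exploits.
fs = 1000.0                        # assumed sampling rate (Hz)
t = np.arange(2048) / fs
s = np.exp(2j * np.pi * 50.0 * t)  # simple complex tone
coeffs = sectioned_fft(s, num_sections=8)
print(coeffs.shape)  # (8, 256)
```

The per-section coefficients are the $\bar{s}$-style quantities that the matrix formulation below stacks across stations.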
In (3), the received signal is presented in matrix notation:

$$\bar{\mathbf{r}}_{lj}(k) = \mathbf{A}_l(k)\,\bar{\mathbf{s}}_j(k) + \bar{\mathbf{n}}_{lj}(k), \tag{3}$$

with $\mathbf{A}_l(k) = \big[\,b_{l1}\mathbf{a}_{l1}e^{-i\omega_k\tau_{l1}}, \ldots, b_{lQ}\mathbf{a}_{lQ}e^{-i\omega_k\tau_{lQ}}\big]$ and $\bar{\mathbf{s}}_j(k) = [\bar{s}_{1j}(k), \ldots, \bar{s}_{Qj}(k)]^T$.
Since the vector $\bar{\mathbf{s}}_j(k)$ is the same at all stations, the observed vectors of all stations can be concatenated together as

$$\bar{\mathbf{r}}_j(k) = \mathbf{A}(k)\,\bar{\mathbf{s}}_j(k) + \bar{\mathbf{n}}_j(k),$$

where $\bar{\mathbf{r}}_j(k) = [\bar{\mathbf{r}}_{1j}^T(k), \ldots, \bar{\mathbf{r}}_{Lj}^T(k)]^T$ and $\mathbf{A}(k) = [\mathbf{A}_1^T(k), \ldots, \mathbf{A}_L^T(k)]^T$.
Assume the $q$th column of $\mathbf{A}(k)$, corresponding to the $q$th emitter, is denoted by $\boldsymbol{\alpha}_q(k)$; it can be factored as

$$\boldsymbol{\alpha}_q(k) = \mathbf{D}_q(k)\,(\mathbf{I}_L \otimes \mathbf{1}_M)\,\mathbf{b}_q,$$

where $\mathbf{D}_q(k)$ is a diagonal matrix whose elements are the responses of the arrays at all stations, $\mathbf{D}_q(k) = \operatorname{diag}\big\{[\mathbf{a}_{1q}^T e^{-i\omega_k\tau_{1q}}, \ldots, \mathbf{a}_{Lq}^T e^{-i\omega_k\tau_{Lq}}]\big\}$, $\mathbf{b}_q = [b_{1q}, \ldots, b_{Lq}]^T$, $\mathbf{I}_L$ stands for the identity matrix of size $L \times L$, $\mathbf{1}_M$ stands for the $M \times 1$ column vector of ones, and $\otimes$ stands for the Kronecker product.
The additive noise vector $\bar{\mathbf{n}}_j(k)$ is assumed to be a realization of a circularly symmetric complex Gaussian process with zero mean. Its second-order moments are given by

$$\mathbb{E}\big\{\bar{\mathbf{n}}_j(k)\,\bar{\mathbf{n}}_{j'}^{H}(k')\big\} = \mathbf{C}(k)\,\delta_{jj'}\,\delta_{kk'}, \qquad \mathbb{E}\big\{\bar{\mathbf{n}}_j(k)\,\bar{\mathbf{n}}_{j'}^{T}(k')\big\} = \mathbf{0}.$$
The covariance matrix $\mathbf{C}(k)$ represents the thermal noise as well as interference. In the case of spatially white noise, $\mathbf{C}(k)$ is a block diagonal matrix given by

$$\mathbf{C}(k) = \operatorname{blkdiag}\big\{\sigma_1^2(k)\,\mathbf{I}_M, \ldots, \sigma_L^2(k)\,\mathbf{I}_M\big\}. \tag{10}$$
Assume that the signals and the noise are uncorrelated, so that $\mathbb{E}\big\{\bar{\mathbf{s}}_j(k)\,\bar{\mathbf{n}}_{j'}^{H}(k')\big\} = \mathbf{0}$.
A matrix is then defined from the Fourier coefficients of the $J$ sections to construct the DPD estimator; this matrix becomes diagonal for large $J$ if the signals are uncorrelated. The DPD estimator for a general noise covariance maximizes, over candidate transmitter positions, a cost function built from the columns of this matrix, their per-station subvectors, and the associated matrix pseudoinverse.
In the case of spatially white noise with the spectral density matrix defined in (10), the DPD estimator simplifies accordingly.
The Cramér–Rao lower bound (CRLB) on the covariance of any unbiased estimator of the position vector $\mathbf{p}$, assuming no model errors, is

$$\text{CRLB}(\mathbf{p}) = \mathbf{F}^{-1}(\mathbf{p}), \tag{15}$$

where $\mathbf{F}(\mathbf{p})$ is the Fisher information matrix determined by the array responses, the propagation delays, and the noise covariance.
The CRLB obtained from (15) is determined by the received signals and the locations of the stations, and it is utilized as the reward function. The CRLB thus plays a major role in developing the passive location agents through MARL.
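To make the role of a CRLB-style reward concrete, the following sketch computes the trace of the bound for a deliberately simplified measurement model (independent range measurements with Gaussian noise) rather than the full DPD bound of (15). The function name, station layouts, and noise level are illustrative assumptions:

```python
import numpy as np

def crlb_trace(stations, target, sigma=1.0):
    """Trace of the CRLB for a 2-D target position under a simplified
    model where each station measures its range to the target with
    independent Gaussian noise of std `sigma`. The CRLB is the inverse
    of the accumulated Fisher information matrix."""
    J = np.zeros((2, 2))  # Fisher information matrix
    for s in stations:
        d = np.asarray(s, float) - target
        u = d / np.linalg.norm(d)           # unit direction to the station
        J += np.outer(u, u) / sigma**2
    return np.trace(np.linalg.inv(J))

target = np.array([0.0, 0.0])
collinear = [(1.0, 0.01), (2.0, 0.0), (3.0, -0.01)]  # nearly lined up
spread = [(1.0, 0.0), (-0.5, 0.87), (-0.5, -0.87)]   # surrounding triangle
print(crlb_trace(spread, target) < crlb_trace(collinear, target))  # True
```

The comparison reproduces the rule of thumb from Section 1: nearly collinear stations yield a far worse (larger) bound than stations surrounding the transmitter.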
2.2. Multiagent Reinforcement Learning
In reinforcement learning, an agent interacts with the environment to achieve a given goal. At time $t$, it observes state $s_t \in \mathcal{S}$, with $\mathcal{S}$ denoting the state space, takes action $a_t \in \mathcal{A}$, with $\mathcal{A}$ representing the action space, receives reward $r_t$, and moves to the new state $s_{t+1}$. The agent aims to learn a policy $\pi$ that maximizes the long-term reward. The action-value function of starting from state $s$, taking action $a$, and thereafter following policy $\pi$ is denoted by

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right],$$

where $\gamma \in [0, 1)$ is the discount factor that determines the importance of future rewards.
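As a minimal illustration of these definitions, the snippet below computes a discounted return and performs one tabular Q-learning update toward the bootstrapped target $r + \gamma \max_{a'} Q(s', a')$. The toy states, actions, and step sizes are hypothetical:

```python
def discounted_return(rewards, gamma=0.9):
    """Monte-Carlo estimate of the return sum_t gamma^t * r_t that the
    action-value function Q(s, a) is defined over."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75

Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 0.0}}
q_update(Q, "s0", "right", 0.5, "s1")
print(round(Q["s0"]["right"], 3))  # 0.14
```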
In multiagent reinforcement learning (MARL) [15, 16], multiple agents (robots, UAVs, sensors, etc.) interact with a shared environment to complete given tasks. The agents are learnable units that aim to learn policies maximizing the long-term reward through interactions with the environment. Most MARL problems are NP-hard because of the sophisticated environments and the combinatorial nature of the joint action space.
In a cooperative MARL problem, agents must jointly optimize an accumulated scalar team reward over time. A centralized RL approach can be employed to solve the cooperation problem: all state observations are merged and the problem reduces to a single-agent problem with a combinatorial action space. However, according to Sunehag et al., naive centralized RL methods fail to find the global optimum even when problems with such huge state and action spaces can be solved in principle. The challenge lies in the fact that some agents may become lazy, failing to learn and cooperate as they should, which can cause the whole system to fail. They addressed these problems by training individual agents with a value decomposition network (VDN) architecture. The agents learn to decompose the team value function into agentwise value functions as follows:

$$Q_{\text{tot}}(\boldsymbol{\tau}, \mathbf{a}) = \sum_{i=1}^{N} Q_i(\tau_i, a_i; \theta_i),$$

where $\boldsymbol{\tau}$ and $\mathbf{a}$ represent the joint observation-action history and the joint action, respectively, and $\theta_i$ denotes the value function parameters of agent $i$. VDN aims to learn an optimal linear value decomposition from the team reward signal by backpropagating the total gradient through the deep neural networks representing the summation of individual value functions. VDN thus solves the credit assignment among agents implicitly, without any specific reward for individual agents. Rashid et al. treated the cooperative MARL problem as VDN does, but added a constraint on the objective:

$$\frac{\partial Q_{\text{tot}}}{\partial Q_i} \ge 0, \quad \forall i,$$

which makes the weights of the mixing network positive and ensures a monotonic relationship between the team value and each individual value.
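The decomposition and the QMIX monotonicity constraint can be illustrated with toy numbers. Random per-agent values stand in for trained networks, and the mixing weights here are simple positive scalars rather than a learned state-conditioned network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-agent value estimates Q_i(tau_i, a_i): 3 agents, 4 actions each.
q_values = rng.normal(size=(3, 4))

# VDN: the team value is the plain sum of the chosen per-agent values,
# so the argmax of Q_tot decomposes into per-agent argmaxes.
greedy_actions = q_values.argmax(axis=1)
chosen = q_values[np.arange(3), greedy_actions]
q_tot_vdn = chosen.sum()

# QMIX replaces the sum with a mixing step whose weights are forced
# positive (e.g. via abs), keeping dQ_tot/dQ_i >= 0 and preserving the
# decentralized greedy action selection.
mix_w = np.abs(rng.normal(size=3))
q_tot_qmix = mix_w @ chosen

# Monotonicity check: raising any individual Q cannot lower Q_tot.
bumped = chosen.copy()
bumped[1] += 1.0
assert mix_w @ bumped >= q_tot_qmix
print(greedy_actions)
```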
3. MARL-Based Geometry Optimization
This section proposes a MARL-based geometric configuration optimization method for passive location systems.
3.1. Model Framework
In this paper, a DPD location system is considered with $N$ mobile stations (e.g., UAVs equipped with positioning equipment), i.e., $N$ DPD agents. Each agent transfers the intercepted signals to a central processing agent, where the emitter's position is estimated. The agents have no prior knowledge of the emitter or of the electromagnetic environment. Owing to multipath and noise, the signals received by different agents vary. To adapt accurately to the complicated electromagnetic spatial distribution, a MARL-based method with the positioning error as the reward function is considered. The key elements of the MARL scheme are defined as follows:
(i) States. At each time step $t$, agent $i$ intercepts the signals $\mathbf{r}_i^t$ emitted by the transmitter. The messages it receives from the other agents are denoted by $m_i^t$. The state of agent $i$ is denoted by $s_i^t = (\mathbf{r}_i^t, m_i^t, u_i^t)$, where $u_i^t$ is the position of agent $i$ at time $t$. The global state is then represented as $\mathbf{s}^t = (s_1^t, \ldots, s_N^t)$.
(ii) Actions. Actions represent decisions regarding where to receive signals at the next step. Let $a_i^t = (\phi_i^t, d_i^t)$ denote the action of agent $i$, where $\phi_i^t$ and $d_i^t$ represent its moving direction and distance, respectively. The joint action of all agents is denoted by $\mathbf{a}^t = (a_1^t, \ldots, a_N^t)$.
(iii) Rewards. This paper aims to develop agents that adjust the geometry automatically and properly to improve the positioning precision. To this end, agents' behavior is evaluated by positioning errors. Two types of positioning errors are considered:
(a) The CRLB, an effective index for evaluating the precision of a passive location system. Let the ground-truth position of the transmitter be $\mathbf{p}_0$. Then, the CRLB is a function of the state $\mathbf{s}^t$ and of $\mathbf{p}_0$ according to (15).
(b) Statistical errors, a popular class of position errors, such as the mean error (ME) and the mean square error (MSE).
Among the errors listed above, only the CRLB can assess the geometry without estimating the target position, which saves considerable time and computation in training. Therefore, the CRLB is used as the reward function when training the DPD agents.
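A minimal environment loop matching the state, action, and reward definitions above might look as follows. The collinearity-based reward is a hypothetical stand-in for the CRLB, and all class and parameter names are illustrative:

```python
import numpy as np

class GeometryEnv:
    """Sketch of the MARL geometry-optimization loop: agents move by
    (direction, distance) actions and receive a shared team reward that
    favors well-spread geometries. The reward here is a hypothetical
    stand-in for the negative CRLB of the DPD system."""

    def __init__(self, n_agents=3, target=(0.0, 0.0)):
        self.target = np.asarray(target)
        self.pos = np.random.default_rng(1).uniform(-5.0, 5.0, size=(n_agents, 2))

    def step(self, actions):
        # actions: per-agent (direction in radians, distance)
        for i, (theta, dist) in enumerate(actions):
            self.pos[i] += dist * np.array([np.cos(theta), np.sin(theta)])
        return self.pos.copy(), self._reward()

    def _reward(self):
        # Placeholder team reward: penalize near-collinear geometries via
        # the smallest singular value of the centered station matrix.
        centered = self.pos - self.pos.mean(axis=0)
        return float(np.linalg.svd(centered, compute_uv=False).min())

env = GeometryEnv()
state, reward = env.step([(0.0, 1.0)] * 3)
print(state.shape)  # (3, 2)
```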
3.2. Learn to Optimize the Geometry
This section presents an efficient multiagent actor-critic algorithm for geometric configuration optimization in passive location tasks. The overall architecture of the proposed method is illustrated in Figure 2. It is developed based on two main considerations: (i) factorizing the global value function into individual value functions with local observations for better collaboration, and (ii) constraining the information in the messages to keep communications efficient and tackle the transmission challenges.
3.2.1. Value Decomposition
As shown in Figure 2, the global value function is factorized into a linear combination of individual value functions as follows:

$$Q_{\text{tot}}(\boldsymbol{\tau}, \mathbf{a}; \boldsymbol{\theta}) = \sum_{i=1}^{N} Q_i(\tau_i, a_i; \theta_i),$$

where $\tau_i$ is the history of local observations, actions, and received messages of agent $i$, and the local value functions are parameterized by $\theta_i$. The policy of each station maps the history of observations and actions to the next action, $\pi_i: \tau_i \mapsto a_i$, and the joint policy of the location system is denoted by $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)$. Both the actor and the critic of each agent use a gated recurrent unit (GRU) to process the input observation history; the GRU is a recurrent neural network variant able to capture long-range dependencies between states. The mixing network and the individual value functions are trained in an end-to-end manner by minimizing the TD loss:

$$\mathcal{L}_{\text{TD}}(\boldsymbol{\theta}) = \mathbb{E}\left[\Big(r + \gamma \max_{\mathbf{a}'} Q_{\text{tot}}(\boldsymbol{\tau}', \mathbf{a}'; \boldsymbol{\theta}^{-}) - Q_{\text{tot}}(\boldsymbol{\tau}, \mathbf{a}; \boldsymbol{\theta})\Big)^2\right],$$

where $\boldsymbol{\theta}^{-}$ denotes the parameters of a target network.
3.2.2. Information Constraint
The central station must collect observations from all the stations to estimate the transmitter's position. A vice station, by contrast, only needs data that help it make better decisions. Therefore, the central station must learn to send messages that are as short as possible yet sufficient for the vice stations to act well. A natural solution is to add information constraints. In practice, to improve the effectiveness of the messages sent to the vice stations, the mutual information between the messages and the stations' actions is maximized:

$$I(M; A) = \mathbb{E}_{p(m, a)}\left[\log \frac{p(m, a)}{p(m)\,p(a)}\right],$$

where $M$ represents the message sent from the central station to a vice station and $A$ is the joint action of all vice stations.
However, if this were the only objective, the agents could always obtain a maximally informative representation by using the identity encoding of the raw data $X$, which contradicts the goal of reducing transmissions. To increase the conciseness of the messages, their complexity is limited by a constraint on $I(M; X)$. It then becomes possible to learn an encoding that is maximally expressive about $A$ in addition to being maximally compressive about $X$. With the Lagrange multiplier $\beta$, the information bottleneck (IB) objective is defined as follows:

$$\mathcal{L}_{\text{IB}}(\phi, \psi) = -I(M; A) + \beta\, I(M; X),$$

where $\phi$ and $\psi$ represent the parameters of the encoder and decoder networks, respectively.
The value networks are then trained together with the encoder and the decoder by minimizing an overall objective:

$$\mathcal{L} = \mathcal{L}_{\text{TD}} + \lambda\, \mathcal{L}_{\text{IB}},$$

which consists of $\mathcal{L}_{\text{TD}}$ and $\mathcal{L}_{\text{IB}}$, where $\lambda$ is the weight that trades off between these two subobjectives.
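For discrete message and action alphabets, the mutual information term being maximized can be computed directly from a joint distribution, as the following sketch shows (the two example distributions are illustrative extremes):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(M; A) in nats from a discrete joint
    distribution p(m, a), via sum p(m,a) * log(p(m,a) / (p(m) p(a)))."""
    pm = joint.sum(axis=1, keepdims=True)   # marginal p(m)
    pa = joint.sum(axis=0, keepdims=True)   # marginal p(a)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pm @ pa)[mask])).sum())

# Messages that fully determine the actions carry maximal information...
identical = np.array([[0.5, 0.0], [0.0, 0.5]])
# ...while messages independent of the actions carry none.
independent = np.array([[0.25, 0.25], [0.25, 0.25]])

print(round(mutual_information(identical), 4))    # 0.6931 (= ln 2)
print(round(mutual_information(independent), 4))  # 0.0
```

In the proposed method this quantity is estimated with neural encoders and decoders rather than enumerated, but the two extremes show why the IB trade-off is needed: expressiveness about actions must be bought without simply copying the raw data.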
The policy gradient of station $i$ is defined as

$$\nabla_{\theta_i^{\pi}} J = \mathbb{E}\left[\nabla_{\theta_i^{\pi}} \log \pi_i(a_i \mid \tau_i)\; Q_i(\tau_i, a_i)\right].$$
The policy of station $i$ is optimized through gradient ascent:

$$\theta_i^{\pi} \leftarrow \theta_i^{\pi} + \alpha\, \nabla_{\theta_i^{\pi}} J,$$

where $\theta_i^{\pi}$ refers to the parameters of station $i$'s policy and $\alpha$ is the step size. The details of the training process are shown in Algorithm 1.
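One ascent step for a softmax policy over a discrete action set can be sketched as follows. The four actions, learning rate, and value estimate are illustrative; the actual method conditions the policy on the observation history through a GRU:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(theta, action, q_value, lr=0.1):
    """One policy-gradient ascent step for a softmax policy:
    theta <- theta + lr * Q * grad log pi(action)."""
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0  # derivative of log-softmax w.r.t. theta
    return theta + lr * q_value * grad_log_pi

theta = np.zeros(4)                 # uniform policy over 4 moves
new_theta = reinforce_step(theta, action=2, q_value=1.0)
# A positively valued action becomes more probable than 1/4.
print(softmax(new_theta)[2] > 0.25)  # True
```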
In this section, we develop a simulated electromagnetic environment for passive location tasks, based on which the agents are evaluated.
In the experiment, the simulator's geographical coverage is 10 km × 10 km, as shown in Figure 3. The transmitter is located at the center of the map and is equipped with an isotropically radiating antenna. The signal model defined by (1) is employed with some modifications. The channel attenuation is a function of the receiver's position and follows the free-space path loss. The noise and interference, as well as the multipath effect, are all encompassed in the noise term, which is modeled by the spatially white noise of (10). There are some low SNR regions, highlighted in green in Figure 3, where the noise is stronger than in other areas. Due to these low SNR regions, the contours of the SNR become irregular concentric rings. Furthermore, since in the real world it is impossible to approach the transmitter too closely, a forbidden area of 1 km radius surrounds the transmitter in the simulator.
Consider one central station and $N - 1$ vice stations tasked with cooperatively optimizing the geometric configuration in an area consisting of free-propagation regions, low SNR regions, and forbidden regions. At each time step $t$, the stations observe the environment to obtain the state $\mathbf{s}^t$ and decide to move in direction $\phi$ over distance $d$. While moving, the stations shut off their positioning and communication devices until arriving at the next positions. The location task ends when the time step reaches the maximum $T$. Figure 4 demonstrates the process of executing a passive location task, with the training and execution modes shown in different branches. With the geometry formed by the stations at each time step $t$, the reward is given by the theoretical error bound, the CRLB:

$$r_t = -\operatorname{tr}\big(\text{CRLB}(\mathbf{r}^t, \mathbf{u}^t)\big),$$

where $\mathbf{r}^t$ is the received signals and $\mathbf{u}^t$ refers to the positions of all stations. The root mean square error (RMSE) is also calculated to describe the positioning error more intuitively:

$$\text{RMSE} = \sqrt{\frac{1}{N_e}\sum_{n=1}^{N_e} \big\|\hat{\mathbf{p}}_n - \mathbf{p}_0\big\|^2},$$

where $\hat{\mathbf{p}}_n$ is the $n$th estimation of $\mathbf{p}_0$ and $N_e$ denotes the number of estimations for each geometric configuration.
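The free-space attenuation and the RMSE metric used by the simulator can be sketched as follows. The unit-distance clip (standing in for the forbidden zone) and the proportionality constant are assumptions of this demo:

```python
import numpy as np

def free_space_attenuation(station, transmitter):
    """Amplitude attenuation proportional to 1/d under free-space path
    loss; distances are clipped at 1 to mimic the forbidden zone around
    the transmitter (constant chosen arbitrarily for the demo)."""
    d = np.linalg.norm(np.asarray(station, float) - np.asarray(transmitter, float))
    return 1.0 / max(d, 1.0)

def rmse(estimates, truth):
    """Root mean square positioning error over repeated estimations."""
    err = np.asarray(estimates, float) - np.asarray(truth, float)
    return float(np.sqrt((np.linalg.norm(err, axis=1) ** 2).mean()))

print(round(free_space_attenuation((3.0, 4.0), (0.0, 0.0)), 2))  # 0.2

est = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.2)]
print(round(rmse(est, (0.0, 0.0)), 4))  # 0.1414
```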
4.3. Results and Analysis
The agents are trained on the passive positioning task described above with the maximum time step set to 100. For comparison, a basic version of the proposed method is also evaluated, in which the central station sends nothing except the reward (naive DPD agents).
The top segment of Figure 5 shows the learning curves, in terms of averaged reward, of the DPD agents with communications versus the naive DPD agents. The DPD agents with communications converge to a much higher return than the naive DPD agents, indicating that, with the messages sent by the central station, the vice stations can estimate the value function more accurately. In other words, communication is essential to geometry optimization in DPD location tasks. The bottom segment of Figure 5 illustrates the information bottleneck loss against the training epochs; the IB loss declines quickly during training. The proposed agents thus achieve higher positioning precision with lighter communication overhead.
To show the learned decomposition of value functions, Figure 6 presents the error curve, the normalized value functions, and the agents' situations as the trained agents perform a particular DPD positioning task. According to the top segment of Figure 6, both the CRLB and the RMSE decline as the agents take more steps. Furthermore, the RMSE of positioning converges to the CRLB over the optimization steps, meaning that the agents find geometric configurations whose estimation error approaches the CRLB, the best achievable performance for a passive location system. The middle and bottom segments of Figure 6 show that when the agents are in the low SNR area, their value functions decrease and the positioning error increases, which is consistent with intuition.
Figure 7 shows the final geometric configuration found by the proposed agents as well as the one optimized by the GA. In the geometry yielded by the GA, one station lies in a low SNR region, making the configuration suboptimal: the GA optimizes the geometry on the empirical model, which cannot identify the low SNR regions in the simulator. By contrast, the trained agents avoid the low SNR regions and successfully find the optimal geometry.
This paper analyzed the geometry optimization problem of passive location systems in a complex electromagnetic environment and proposed a MARL method that addresses it in a trial-and-error fashion. In this method, by factorizing the global value function into agentwise value functions, the agents learn to optimize the geometric configuration cooperatively. Moreover, by adding the mutual information constraint, the communication traffic from the central station to the vice stations can be greatly reduced while its effectiveness is preserved. A simulator with a sophisticated electromagnetic environment for passive location tasks was also developed, and the results on it showed that the agents find better geometric configurations than existing methods.
This paper should be seen as a first attempt at learning geometric configuration optimization through MARL in a passive location task. Although DPD is used in the proposed method, it can be replaced by any other passive location algorithm (e.g., TDOA or AOA) to enhance the flexibility of the approach in various location scenarios.
Data Availability
The data used to support the findings of this study are available from Shengxiang Li ([email protected]) upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
D. J. Torrieri, “Statistical theory of passive location systems,” IEEE Transactions on Aerospace and Electronic Systems, vol. AES-20, no. 2, pp. 183–198, 1984.
A. J. Weiss, “Direct position determination of narrowband radio transmitters,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, Montreal, Quebec, Canada, May 2004.
K. Bronk and J. Stefanski, “Bad geometry effect in the TDOA systems,” Polish Journal of Environmental Studies, vol. 16, pp. 11–13, 2007.
B. Sun, “Analysis of the influence of station placement on the position precision of passive area positioning system based on TDOA,” Fire Control & Command Control, vol. 36, pp. 129–132, 2011.
B. Wang and L. Xue, “Station arrangement strategy of TDOA location system based on genetic algorithm,” Systems Engineering and Electronics, vol. 31, pp. 2125–2128, 2009.
G. Zhou et al., “Analysis of the influence of base station layout on location accuracy based on TDOA,” Command Control and Simulation, vol. 39, pp. 119–126, 2017.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 2018.
L. Busoniu, R. Babuska, and B. De Schutter, “Multi-agent reinforcement learning: a survey,” in Proceedings of the 9th International Conference on Control, Automation, Robotics and Vision, Singapore, December 2006.
P. Sunehag et al., “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 2085–2087, Stockholm, Sweden, 2018.
T. Rashid et al., “QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning,” in Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 6846–6859, Stockholm, Sweden, July 2018.
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063, 2000.