Abstract
Multiagent goal recognition is important in many simulation systems. Many existing modeling methods need detailed domain knowledge of agents’ cooperative behaviors and a training dataset to estimate policies. To solve these problems, we propose a novel decentralized partially observable Markov decision model (DecPOMDM), which models cooperative behaviors by joint policies. In this compact representation, we only need to specify the distribution of joint policies. Additionally, a model-free algorithm, cooperative co-learning based on Sarsa, is used to estimate agents’ policies under the assumption of rationality, which makes the training dataset unnecessary. For inference, because the DecPOMDM is discrete and its state space is large, we implement a marginal filter (MF) under the framework of the DecPOMDM, where the initial world states and the results of actions are uncertain. In the experiments, a new scenario is designed based on the standard predator-prey problem: we increase the number of preys, and our aim is to recognize the real target of the predators. Experimental results show that (a) our method recognizes goals well even when they change dynamically; (b) the DecPOMDM outperforms supervised trained HMMs in terms of precision, recall, and F-measure; and (c) the MF infers goals more efficiently than the particle filter under the framework of the DecPOMDM.
1. Introduction
With the fast development of computational software and artificial intelligence techniques, agent-based simulation systems have become increasingly popular for staff training, policy analysis and evaluation, and even entertainment. In developing these systems, people need to create human-like agents who can make decisions and interact with other agents or humans autonomously. For example, in the famous real-time strategy game StarCraft, the AI players have to construct buildings, collect resources, produce units, and defeat their enemies [1]. Unfortunately, even though many decision and planning algorithms have been applied to improve the intelligence of these agents, they are still easily defeated, especially when human players play against them in the same scenario several times. One important reason is that these agents are unable to recognize the goals of their opponents or friends, whereas human players usually do so in the game [2]. Obviously, if agents know the goals of others, they can make counter decisions more efficiently.
Because goal recognition is significant for creating human-like agents and for decision support, many related models and algorithms have been proposed and applied in different fields, such as hidden Markov models (HMMs) [3], partially observable Markov decision processes (POMDPs) [4], Markov logic networks (MLNs) [5], and particle filtering (PF) [6, 7]. However, most of the existing research focuses on single-agent scenarios. In some scenarios, missions are so complex that a number of agents have to form a group and achieve their joint goal through cooperation, and our aim is to identify the joint goal of the group rather than that of one member. In most cases, methods for recognizing the goal of a single agent cannot be applied directly to multiagent goal recognition, because we have to consider the relations or interactions between agents, and the state space is usually very large.
There are three fundamental components in the framework of multiagent goal recognition: (a) modeling the agents’ behaviors, the environment, and the observations for the recognizer; (b) estimating the parameters of the model through learning or other methods; and (c) inferring the goals from the observations. Previous work has addressed all of these aspects. However, we still have some difficulties in recognizing multiagent goals in simulation systems:
(a) For modeling behaviors, we usually have little knowledge about the details of agents’ cooperation, such as the decomposition of the complex task, the allocation of the subtasks, the communication, and other details. Even when this information is available, it is hard in practice to represent all of it formally in a model.
(b) For learning parameters, sometimes a training dataset for supervised or unsupervised learning cannot be provided. Even if we have a training set, unsupervised learning is still infeasible, because the state space in multiagent scenarios is always very large. Additionally, supervised learning may suffer from the overfitting problem, which will be shown in our experiments.
(c) For inferring goals, traditional exact filters such as an HMM filter are infeasible because the state space is large. The widely applied PF can compute the posterior distribution of goals, but it may fail when there are not sufficient particles, and increasing the number of particles consumes much more computing time.
To solve the problems above, we present a solution for recognizing multiagent goals in simulation systems. The core of our method is a novel decentralized partially observable Markov decision model (DecPOMDM). After modeling the agents’ behaviors, the environment, and the observations for the recognizer by the DecPOMDM, we use an existing multiagent reinforcement learning (MARL) algorithm to estimate the behavior parameters and a marginal filter (MF) to infer the joint goal of agents. Our method has several advantages considering the above problems:
(a) For the modeling problem, the DecPOMDM represents the agents’ behaviors in a compact way. The DecPOMDM can be regarded as an extension of the well-known decentralized partially observable Markov decision process (DecPOMDP) [8]. As in the DecPOMDP, all details of cooperation are hidden in joint policies in the DecPOMDM. In this implicit way of behavior modeling, we only need to concern ourselves with the selection of primitive actions with given goals and situations. Further knowledge on interactions between agents is unnecessary. Another advantage of the DecPOMDM is that it can make use of the large number of existing algorithms for the DecPOMDP, which will be explained later.
(b) For the problem of estimating the agents’ joint policies, the MARL algorithm does not need a training dataset. We borrow the definition of goals from the domain of planning under uncertainty and associate each goal with a reward function. Then, we assume that agents achieve their joint goal by executing optimal policies, which bring them the maximum cumulative reward. The optimal policies define cooperative behaviors well and can be computed accurately or approximated by any algorithm for the DecPOMDP. In this way, the training dataset is unnecessary. In this paper, cooperative co-learning based on the Sarsa algorithm is used, because it does not need information about the model, which may be difficult to obtain in complex scenarios [9]. Heuristic algorithms such as memory-bounded dynamic programming (MBDP) and joint equilibrium-based search for policies (JESP) may also work if we have enough information about the environment model [10, 11].
(c) For the inference, the MF outperforms the PF when the state space is discrete and large, as has been shown in [12, 13]. Additionally, we will also show that the MF solves the inference failure problem of the PF in our research. Another contribution is that we implement the MF under the framework of the DecPOMDM, in which the filtering process is different from the work in [12, 13].
To validate the DecPOMDM together with the MARL and MF in multiagent goal recognition, we modify the classic predator-prey problem and design a new scenario [9]. In this scenario, there is more than one prey, and predators have to choose one prey and capture it by moving on a grid map. Additionally, predators may change their goal partway through an episode. Our method is applied to recognize the real goal of predators based on the observed noisy traces. We also use the simulation model to generate different training datasets, which consist of different numbers of labeled traces. With these datasets, we compare the performance of our method with that of the widely applied hidden Markov model (HMM), whose transition matrix is obtained through supervised learning (the MF is used for inference in both models). In the inference part, results computed by the MF are compared to those of the PF with different numbers of particles. All performances are evaluated in terms of precision, recall, and F-measure.
The rest of the paper is organized as follows: Section 2 introduces related work. Section 3 gives the formal definition of the DecPOMDM, the model’s DBN representation, the cooperative co-learning algorithm, and the baseline used for comparison. Section 4 introduces how to use the MF to infer the joint goal. Section 5 presents the scenarios, settings, and results of our experiments. Finally, we draw conclusions and discuss future work in Section 6.
2. Related Work
In many simulation systems such as training systems and commercial digital games, effects of actions are uncertain, which is usually caused by two reasons: (a) agents execute erroneous actions with a given probability; (b) environment states are impacted by some events, which are not under control of the agents. Because of this uncertainty in simulation systems, a goal cannot be achieved by a certain sequence of primitive actions, as what we do in the classical planning. Instead, we have to find a set of policies, which define the distribution of selecting actions within given situations. Since policies essentially reveal the goal of planning, goals can be inferred as discrete parameters or hidden states, after knowing their corresponding policies, and this process is usually implemented under the Markov decision framework.
2.1. Recognizing Goals of a Single Agent
People have proposed many methods for recognizing goals of a single agent; some of them are foundations of methods for multiagent goal recognition. Baker et al. proposed a computational framework based on Bayesian inverse planning for recognizing mental states such as goals [14]. They assumed that the agent is rational: actions are selected based on an optimal or approximately optimal value function, given the beliefs about the world, and the posterior distribution of goals is computed by Bayesian inference. The core of Baker’s method is that policies are computed by the planner based on the standard Markov decision process (MDP), which does not model the observing process of the agent. Thus, Ramírez and Geffner extended Baker’s work by applying the goalPOMDP in formalizing the problem [4]. Compared to the MDP, the POMDP models the relation between the real world state and the observation of the agent explicitly; compared to the POMDP, the goalPOMDP defines the set of goal states. Besides, Ramírez and Geffner also solved the inference problem even when observations are incomplete. The works in [4, 14] are very promising, but both suffer from two limitations: (a) the input for goal recognition is an action sequence; however, sometimes we only have observations of environment states from real or virtual sensors, and translating observations of states to actions is not easy; (b) the goal is estimated as a static parameter; however, it may be interrupted and changed in one episode.
Recently, the computational state space model (CSSM) became more and more popular for human behavior modeling [15, 16]. In the CSSM, transition models of the underlying dynamic system can be described by any computable function using compact algorithmic representations. Krüger et al. also discussed the performances of applying the CSSM on intention recognition in different real scenarios [16, 17]. In this research, (a) intentions as well as actions and environment states are modeled as hidden states, which can be inferred by online filtering algorithm; (b) observations reflect not only primitive actions, but also environment states. The limitation of the research in [17] is that goal inference is not implemented in scenarios where results of actions are uncertain. Another related work on human behavior modeling under the MDP framework was done by Tastan et al. [18]. By making use of the inverse reinforcement learning (IRL) and the PF, they learned the opponent’s motion model and tracked it in the game Unreal 2004. This work made the following contributions: (a) the features for decision in the pursuit problem were abstracted; (b) IRL was used to learn the reward function of the opponent; (c) the solved decision model was regarded as the motion function in the PF. However, IRL relies on a large dataset, and Tastan’s method is proposed for tracking but not for goal recognition.
2.2. Multiagent Goal Recognition Based on Inverse Planning
The inverse planning theory can also be used in the multiagent domain. Baker et al. inferred relational goals between agents (such as chasing and fleeing) by using a multiagent MDP framework to model interactions between agents and the environment [19]. In this model, each agent selects actions based on the world state, its goal, and its beliefs about other agents’ goals. Mental state inference is done by inverse planning, under the assumption that all agents are approximately rational. Ullman et al. also successfully applied this theory to more complex social goals, such as helping and hindering, where an agent’s goals depend on the goals of other agents [20]. In the military domain, Riordan et al. borrowed Baker’s idea and applied Bayesian inverse planning to infer intents in multi-Unmanned Aerial Systems (UASs) [21]. Additionally, IRL was also used to learn the reward function. Even though Baker’s theory is quite promising, it can only work when every agent has accurate knowledge of the world state, because the multiagent MDP does not model the observing process. Besides, Bayesian inverse planning does not allow the goal to change. Another related work under the Markov decision framework in multiagent settings was done by Doshi et al. [22]. Although their main aim is to learn the agents’ behavior models, without recognizing goals, the process of estimating mental states is very similar to Bayesian approaches for probabilistic plan recognition. In Doshi’s work, the interactive partially observable Markov decision process (IPOMDP) was used to model interactions between agents. IPOMDP is an extension of POMDP for multiagent settings. Compared to POMDP, IPOMDP defines an interactive state space, which combines the traditional physical state space with explicit models of other agents sharing the environment in order to predict their behavior. Thus, IPOMDP is applicable in situations where agents may have identical or conflicting objectives. 
However, IPOMDP has to deal with the problem “what do you think that I think that you think,” which makes finding optimal or approximately optimal policies very hard [23]. Actually, in many multiagent scenarios such as the football game or the first person shooting game, the agents being recognized share a common goal. This makes the DecPOMDP framework sufficient for modeling cooperation behaviors. Additionally, the increasing interests/number of works of planning theory based on DecPOMDP can provide us with a large number of planners [24, 25].
2.3. Multiagent Goal Recognition Based on DBN Filtering
If all actions and world states (the agent and the environment) are defined as variables with time labels, the MDP can be regarded as a special case of directed probabilistic graphical models (PGMs). With this idea, some researchers ignore the reward function, consider only the policies, and infer goals under the dynamic Bayesian framework. For example, Saria and Mahadevan presented a theoretical framework for online probabilistic plan recognition in cooperative multiagent systems. This model extends the Abstract Hidden Markov Model and consists of a Hierarchical Multiagent Markov Process that allows reasoning about the interaction among multiple cooperating agents. Rao-Blackwellized particle filtering (RBPF) is also used for the inference [26, 27]. Pfeffer et al. [28] studied the problem of monitoring goals, team structure, and state in a dynamic environment: an urban warfare field, where uncoordinated or loosely coordinated units attempt to attack a target. They used the standard DBN to model cooperation behaviors (communication, group constitution) and the world states. An extension of the PF named factored particle filtering is also used in the inference. We also proposed a Logical Hierarchical Hidden Semi-Markov Model (LHHSMM) to recognize goals as well as cooperation modes of a team in a complex environment, where the observation was noisy and partially missing and the goal was changeable. The LHHSMM is a branch of the Statistical Relational Learning (SRL) method, which combines the PGM and first-order logic; it also represents the team behaviors in a hierarchical way. The inference for the LHHSMM was done by a logical particle filter. These works based on directed PGM theory have the advantage that they can use filtering algorithms. 
However, they suffer from some problems: (a) constructing the graph model needs a lot of domain knowledge, and we have to vary the structure in different applications; (b) the graph structure will be very complex when the number of agents is large, which makes parameter learning and goal inference time consuming, sometimes even infeasible; (c) they need a training dataset. Other models based on data-driven training, such as Markov logic networks and deep learning, have the same problems listed above [29, 30].
3. The Model
We propose the DecPOMDM for formalizing the world states, behaviors, and goals in the problem. In this section, we first introduce the formal definition of the DecPOMDM and explain relations among variables in this model by a DBN representation. Then, the planning algorithm for finding out the policies is given.
3.1. Formalization
One of the foundations of our DecPOMDM is the widely applied DecPOMDP. However, the DecPOMDP is proposed for solving decision problems, and it contains no definition of the goal or of the observation model for the recognizer. Additionally, the multiagent joint goal may be terminated because of achievement or interruption. Thus, we design the DecPOMDM as a combination of three parts: (a) the standard DecPOMDP; (b) the joint goal and goal termination variable; (c) the observation model for the recognizer.
A DecPOMDM is a tuple $\langle D, S, A, T, R, \Omega, O, \lambda, h, I, G, E, \Upsilon, Y, Z, C, B \rangle$, where
(i) $D = \{1, \ldots, n\}$ is the set of $n$ agents;
(ii) $S$ is a finite set of world states $s$, which contains all necessary information for making a decision;
(iii) $A$ is the finite set of joint actions;
(iv) $T$ is the state transition function;
(v) $R$ is the reward function;
(vi) $\Omega$ is the finite set of joint observations for agents making a decision;
(vii) $O$ is the observation probability function for agents making a decision;
(viii) $\lambda$ is the discount factor;
(ix) $h$ is the horizon of the problem;
(x) $I$ is the initial state distribution at stage $t = 0$;
(xi) $G$ is the set of possible joint goals;
(xii) $E$ is the set of goal termination variables;
(xiii) $\Upsilon$ is the observation function for the recognizer;
(xiv) $Y$ is the finite set of joint observations for the recognizer;
(xv) $Z$ is the goal selection function;
(xvi) $C$ is the goal termination function;
(xvii) $B$ is the initial goal distribution at stage $t = 0$.
The symbols that the DecPOMDM inherits from the DecPOMDP have the same meanings as in the DecPOMDP. More definition details and explanations can be found in [9, 27]. The reward function is defined as $R: S \times A \times G \to \mathbb{R}$, which shows that the reward depends on the joint goal; the goal set $G$ consists of all possible joint goals; the goal termination variable $e \in E = \{0, 1\}$ indicates whether the current goal will be continued in the next step (if $e$ is 0, the goal will be continued; otherwise, a new goal will be selected in the next step); the observation function for the recognizer is defined as $\Upsilon: S \times Y \to [0, 1]$: $\Upsilon(y, s) = p(y \mid s)$ is the probability that the recognizer observes $y$ while the real world state is $s$; the goal selection function is defined as $Z: S \times G \to [0, 1]$: $Z(s, goal) = p(goal \mid s)$ is the conditional probability that agents select $goal$ as the new goal while the world state is $s$; the goal termination function is defined as $C: S \times G \to [0, 1]$: $C(s, goal) = p(e = 1 \mid s, goal)$ is the conditional probability that agents terminate their $goal$ while the world state is $s$.
In the DecPOMDP, the policy of the $i$th agent is defined as $\pi_i: \bar{\Omega}_i \times A_i \to [0, 1]$, where $A_i$ is the set of possible actions of agent $i$ and $\bar{\Omega}_i$ is the set of observation sequences. Thus, given an observation sequence $\bar{o}_i$, the agent selects an action with a probability defined by $\pi_i$. Since the selection of actions depends on the history of the observations, the DecPOMDP does not satisfy the Markov assumption. This attribute makes inferring goals online very hard: (a) if we precompute policies and store them offline, it will require a very large memory because of the combination of possible observations; (b) if we compute policies online, the filtering algorithm is infeasible because weights of possible states cannot be updated with only the last observation. One possible solution is to define an interior belief variable for each agent to filter the world state, but it will make the inferring process much more complex. In this paper, we simply assume that all agents are memoryless, as in [9]. Then, the policy of the $i$th agent is defined as $\pi_i: G \times \Omega_i \times A_i \to [0, 1]$, where $\Omega_i$ is the set of possible observations of agent $i$. The definition of policies in the DecPOMDM shows that (a) an agent does not need the history of observations for making decisions; (b) the selection of actions depends on the goal at that stage. In the next section, we will further explain the relations of variables in the DecPOMDM by its DBN representation.
3.2. The DBN Representation
After estimating the agents’ policies by a multiagent reinforcement learning algorithm, we do not need the reward function in the inference process. Thus, in the DBN representation of the DecPOMDM, there are six sorts of variables in total: the goal, the goal termination, the action, observations for agents making decision, state, and observations for the recognizer. In this section, we first analyze how these variables are affected by other factors. Then, we give the DBN representation in two adjacent time slices.
(A) Goal ($goal_t$). The goal $goal_t$ at time $t$ depends on the previous goal $goal_{t-1}$, the previous goal termination variable $e_{t-1}$, and the world state. If $e_{t-1} = 0$, agents keep their goal at time $t$; otherwise, $goal_t$ is selected depending on the world state according to the goal selection function $Z$.
(B) Goal Termination Variable ($e_t$). The goal termination variable $e_t$ at time $t$ depends on the goal $goal_t$ and state $s_t$ at the same time. The goal $goal_t$ is terminated with the probability $p(e_t = 1 \mid s_t, goal_t)$.
(C) Action ($a_t^i$). The action $a_t^i$ selected by agent $i$ at time $t$ depends on the goal $goal_t$ and its observation $o_t^i$ at the same time, and the distribution of actions is defined by $\pi_i$.
(D) Observations for Agents Making Decision ($o_t^i$). The observation $o_t^i$ of agent $i$ at time $t$ reflects the real world state, and the agent observes $o_t^i$ with the probability given by the observation function $O$.
(E) State ($s_t$). The world state $s_t$ at time $t$ depends on the preceding state and the actions of all agents. The distribution of the updated states can be computed by the state transition function $T$.
(F) Observations for the Recognizer ($y_t$). The observation $y_t$ for the recognizer at time $t$ reflects the real world state $s_t$ at the same time, and the recognizer observes $y_t$ with the probability $\Upsilon(y_t, s_t) = p(y_t \mid s_t)$.
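The dependencies (A) to (F) can be read as a generative procedure. The following is a minimal sketch of sampling one time slice, not the authors' implementation: the dictionary keys, the toy model, and the exact time indexing of the state used for observation and goal selection are all assumptions made for illustration.

```python
import random

def sample_slice(goal, state, model, rng):
    """Sample one slice; returns (e, observations, actions, y, next_goal, next_state)."""
    # (B) goal termination depends on the current goal and state
    e = int(rng.random() < model["terminate_prob"](state, goal))
    # (D) each agent receives its own observation of the world state
    observations = [model["observe"](i, state, rng) for i in range(model["n_agents"])]
    # (C) memoryless policies: the action depends only on the goal and the observation
    actions = [model["policy"](goal, o, rng) for o in observations]
    # (F) the recognizer receives a separate observation of the state
    y = model["recognizer_observe"](state, rng)
    # (E) the joint action drives the state transition
    next_state = model["transition"](state, actions, rng)
    # (A) keep the goal unless it terminated; otherwise select a new one
    next_goal = goal if e == 0 else model["select_goal"](state, rng)
    return e, observations, actions, y, next_goal, next_state

# Hypothetical toy model: two agents walk on the integers toward a goal cell.
toy_model = {
    "n_agents": 2,
    "terminate_prob": lambda s, g: 1.0 if s == g else 0.0,
    "observe": lambda i, s, rng: s,
    "policy": lambda g, o, rng: (g > o) - (g < o),
    "recognizer_observe": lambda s, rng: s,
    "transition": lambda s, acts, rng: s + max(-1, min(1, sum(acts))),
    "select_goal": lambda s, rng: rng.choice([0, 4]),
}

result = sample_slice(goal=4, state=2, model=toy_model, rng=random.Random(0))
```

In this toy model the reward function never appears, which mirrors the remark that the reward is not needed once the policies have been estimated.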
The DBN representation of the DecPOMDM in two adjacent time slices presents all dependencies among variables discussed above, as is shown in Figure 1.
In Figure 1, only the actions and observations of two agents are presented for simplicity, and each agent has no knowledge about the others and can only make decisions based on its own observations. Although the DecPOMDM has a hierarchical structure, it models the task decomposition and allocation implicitly: all information about cooperation is hidden in the joint policies. From the point of view of filtering theory, the joint policies actually play the role of the motion function. Thus, estimating policies is a key problem for goal inference.
3.3. Learning the Policy
Because the DecPOMDP is essentially a DBN, we can simply use some data-driven methods to learn parameters from a training dataset. However, a training dataset is not always available. Besides, when the number of agents is large, the DBN structure will become large and complex, which makes supervised or unsupervised learning time consuming. To solve these problems, we assume that the agents to be recognized are rational, which is reasonable when there is no history of agents. Then, we can use an existing planner based on the DecPOMDP framework to find the optimal or approximately optimal policies for each possible goal.
Various algorithms have been proposed for solving DecPOMDP problems. Roughly, these algorithms can be divided into two categories: (a) model-based algorithms, under the general name of dynamic programming; (b) model-free algorithms, under the general name of reinforcement learning [8]. In this paper, we select a multiagent reinforcement learning algorithm, cooperative co-learning based on Sarsa [9], because it does not need a state transition function of the world.
The main idea of cooperative co-learning is that at each step one chooses a subgroup of agents and updates their policies to optimize the task, given that the rest of the agents have fixed plans; then, after a number of iterations, the joint policies can converge to a Nash equilibrium. In this paper, we only consider settings where agents are homogeneous. All agents share the same observation model and policies. Thus, we only need to define one POMDP $M$ for all agents. All parameters of $M$ can be obtained directly from the given DecPOMDP, except for the transition function $T'$. Later, we will show how to compute $T'$ from $T$, which is the transition function of the DecPOMDP. The DecPOMDP problem can be solved by the following steps [9].
Step 1. We set $k = 0$ and start from an arbitrary policy $\pi^0$.
Step 2. We select an arbitrary agent and compute the state transition function $T'$ of $M$ considering that policies of all agents are $\pi^k$, except the selected agent that is refining the plan.
Step 3. We compute $\pi^*$, which is the optimal policy of $M$, and set $\pi^{k+1} = \pi^*$.
Step 4. We update the policy of each agent to $\pi^{k+1}$, set $k = k + 1$, and return to Step 2.
In Step 2, the state transition function of the POMDP for any agent can be computed by formula (1) (we assume that we refine the plan for the first agent):
$$T'(s, a_1, s') = \sum_{(a_2, \ldots, a_n)} \sum_{(o_2, \ldots, o_n)} T(s, \langle a_1, \ldots, a_n \rangle, s') \prod_{i=2}^{n} O'(s, o_i)\, \pi^k(o_i, a_i), \quad (1)$$
where $O'$ is the observation function defined in $M$, $a_i$ is the action of the $i$th agent, $o_i$ is the observation of the $i$th agent, and $T'(s, a_1, s')$ is the probability that the state transits from $s$ to $s'$ while the action of the first agent is $a_1$ and the other agents choose actions based on their observations and policy $\pi^k$.
Unfortunately, computing formula (1) is always very difficult in complex scenarios. Thus, we apply the Sarsa algorithm to find the optimal policy of $M$ (in Steps 2 and 3) [8]. In this process, the POMDP problem is mapped to the MDP problem by regarding the observation as the world state. Then, agents get feedback from the environment and we do not need to compute the updated reward function and state transition function.
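The loop of Steps 1 to 4 combined with Sarsa can be sketched as follows. This is an illustrative toy under stated assumptions, not the setup of [9]: the one-dimensional world, the constants, all function names, and the epsilon-greedy exploration (used here in place of Boltzmann selection for brevity) are hypothetical. The learner treats its observation as the state, and the other agent follows the frozen shared policy of the previous iteration.

```python
import random

ACTIONS = (-1, 0, 1)   # move left, stay, move right
TARGET, SIZE = 2, 5    # hypothetical line world with one target cell

def observe(pos):
    """An agent only sees on which side of it the target lies."""
    return (TARGET > pos) - (TARGET < pos)

def choose(q, o, eps, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(q.get((o, a), 0.0) for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q.get((o, a), 0.0) == best])

def clip(p):
    return min(SIZE - 1, max(0, p))

def sarsa_iteration(q_fixed, episodes, rng, alpha=0.2, gamma=0.9, eps=0.2):
    """Steps 2-3: agent 0 refines its policy while agent 1 follows q_fixed."""
    q = dict(q_fixed)
    for _ in range(episodes):
        pos = [rng.randrange(SIZE), rng.randrange(SIZE)]
        o = observe(pos[0])
        a = choose(q, o, eps, rng)
        for _ in range(50):
            a_other = choose(q_fixed, observe(pos[1]), eps, rng)
            pos = [clip(pos[0] + a), clip(pos[1] + a_other)]
            done = pos[0] == TARGET and pos[1] == TARGET
            reward = 1.0 if done else 0.0
            o2 = observe(pos[0])
            a2 = choose(q, o2, eps, rng)
            # on-policy (Sarsa) target uses the action actually chosen next
            target = reward if done else reward + gamma * q.get((o2, a2), 0.0)
            old = q.get((o, a), 0.0)
            q[(o, a)] = old + alpha * (target - old)
            if done:
                break
            o, a = o2, a2
    return q

def co_learn(iterations=5, episodes=100, seed=0):
    """Step 1 and Step 4: start arbitrary, then share the refined policy."""
    rng = random.Random(seed)
    q = {}
    for _ in range(iterations):
        q = sarsa_iteration(q, episodes, rng)
    return q

q_table = co_learn()
```

Because the agents are homogeneous, copying the refined table to every agent at the end of each iteration plays the role of Step 4; no transition function is ever computed.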
4. Inference
In the DecPOMDM, the goal is defined as a hidden state indirectly reflected by observations. In this case, many filtering algorithms can be used to infer the goals. However, in the multiagent setting, the world state space is large because of combinations of agents’ states and goals. Besides, the DecPOMDM models a discrete system. To solve this problem, we use a MF algorithm to infer multiagent joint goals under the framework of the DecPOMDM. It has been proved efficient when the state space is large and discrete.
Nyolt et al. discussed the theory of the MF and how to apply it in a causal model in [12, 13]. Their main idea still applies in our paper, but there are differences between the DecPOMDM and the causal model: (a) the initial distribution of states in our model is uncertain; (b) the results of actions in our model are uncertain; (c) we do not model the duration of actions, for simplicity. Thus, we have to modify the MF for causal models and adapt it to the DecPOMDM.
When the MF is applied to infer goals under the framework of the DecPOMDM, the set of particles is defined as $\{x_t^i\}_{i=1}^{N_t}$, where $x_t^i = \langle goal_t^i, e_t^i, s_t^i \rangle$. $N_t$ is the number of particles at time $t$, and the weight of the $i$th particle is $w_t^i$. The detailed procedures of goal inference are given in Algorithm 1.

$W_t$ is the set of $\langle x_t^i, w_t^i \rangle$, which contains all particles and their weights at each stage. When we initialize $W_0$, the weights are computed by $w_0^i = p(s_0^i)\, p(goal_0^i)\, p(e_0^i \mid s_0^i, goal_0^i)$, where $p(s_0^i)$ and $p(goal_0^i)$ are provided by the model, and $p(e_0^i \mid s_0^i, goal_0^i)$ is computed by the goal termination function $C$. $W_{tmp}$ is a temporary memory of particles and their weights which are transited from one specific particle. Because the same world state may be reached after the execution of different actions from a given particle, we have to use a LOOKUP operation to query the record in $W_{tmp}$ which covers the new particle $x'$. The operation LOOKUP searches for $x'$ in $W_{tmp}$; if there is a record in $W_{tmp}$ which covers $x'$, the operation returns this record; otherwise, it returns empty. This process is time consuming if we scan $W_{tmp}$ for every query. One alternative method is to build a matrix for each $W_{tmp}$, which records the indices of all reached particles. Then, if the index of $x'$ is null, we add a new record in $W_{tmp}$ and update the index matrix; otherwise, we can read the index of $x'$ from the matrix and merge the weight directly. After we finish building $W_{tmp}$, its index matrix can be deleted to release memory. Note that this solution saves computing time but needs extra memory.
The operation PUT adds a new record in $W_{tmp}$ and indexes this new record. The generated $W_{tmp}$ contains a group of particles and corresponding weights. Some of these particles may already be covered in $W_t$. Thus, we have to use a MERGE operation to get a new $W_t$ by merging $W_{tmp}$ and the existing $W_t$. In this process, if a particle $x'$ in $W_{tmp}$ has not appeared in $W_t$, we directly put $x'$ and its corresponding weight into $W_t$; otherwise, we need to add its weight to the weight of the record in $W_t$ which covers $x'$. Similarly, an index matrix can also be used to save computing time in the MERGE operation.
Under the framework of the DecPOMDM, we update $w^i$ by $w^i \leftarrow w^i \cdot p(y_t \mid x^i) = w^i \cdot \Upsilon(y_t, s^i)$, where $w^i$ and $x^i$ are the weight and particle, respectively, of the $i$th record in $W_t$, $s^i$ is the world state contained in $x^i$, $y_t$ is the observation of the recognizer, and $\Upsilon$ is the observation function for the recognizer in the model. The details of the PRUNE operation can be found in [13], and we can use the existing pruning technique directly in our paper.
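The predict, merge, weight, and prune cycle described above can be sketched in a few lines. This is a simplified illustration under assumptions, not Algorithm 1 itself: a Python dictionary keyed by the particle stands in for the LOOKUP, PUT, and MERGE operations and the index matrix, plain truncation to the heaviest records stands in for the pruning technique of [13], and the function names and the toy model are hypothetical.

```python
from collections import defaultdict

def mf_step(particles, successors, likelihood, y, max_particles):
    """One marginal-filter step over {particle: weight}; returns a normalized dict."""
    predicted = defaultdict(float)
    for x, w in particles.items():
        for x_next, p in successors(x):
            # identical successor particles share one record, so weights are merged
            predicted[x_next] += w * p
    # weight update with the recognizer's observation y
    weighted = {x: w * likelihood(y, x) for x, w in predicted.items()}
    weighted = {x: w for x, w in weighted.items() if w > 0.0}
    if not weighted:
        raise ValueError("observation has zero likelihood under all particles")
    # simple pruning stand-in: keep only the heaviest records
    kept = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)[:max_particles]
    total = sum(w for _, w in kept)
    return {x: w / total for x, w in kept}

def goal_posterior(particles):
    """Marginalize the particle weights over everything except the goal."""
    posterior = defaultdict(float)
    for (goal, _state), w in particles.items():
        posterior[goal] += w
    return dict(posterior)

# Hypothetical toy model: a particle is (goal, position) on a line.
def toy_successors(x):
    goal, s = x
    step = 1 if goal == "R" else -1
    return [((goal, s + step), 0.8), ((goal, s), 0.2)]

def toy_likelihood(y, x):
    gap = abs(x[1] - y)
    return 0.5 if gap == 0 else 0.25 if gap == 1 else 0.0

prior = {("L", 0): 0.5, ("R", 0): 0.5}
posterior = mf_step(prior, toy_successors, toy_likelihood, y=1, max_particles=100)
```

Unlike a particle filter, each distinct state appears at most once, so the number of records is bounded by the number of reachable states rather than by a sample count.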
5. Experiments
5.1. The Predator-Prey Problem
The predator-prey problem is a classic problem for evaluating multiagent decision algorithms [9]. However, it cannot be used directly in this paper because predators have only one possible goal. Thus, we modify the standard problem by placing more than one prey on the map, and our aim is to recognize the real target of the predators at each time. Figure 2 shows the grid-based map in our experiments.
In our scenario, the map is a grid containing two predators (red diamonds: Predator PX and Predator PY) and two preys (blue stars: Prey PA and Prey PB). The two predators select one of the preys as their common goal and move around to capture it. As is shown in Figure 2, the observation of the predators is not perfect: a predator only knows the exact position of itself and of the agents that are in its 8 nearest grids. If another agent is out of its observing range, the predator only knows its direction (8 possible directions). For the situation in Figure 2, Predator PX observes that no agent is nearby, Prey PB is in the north, Prey PA is in the southeast, and Predator PY is in the south; Predator PY observes that no agent is nearby, Prey PB and Predator PX are in the north, and Prey PA is in the east. In each step, all predators and preys can move into one of the four adjacent grids (north, south, east, and west) or stay at the current grid. When two or more agents try to move into the same grid or try to leave the map, they have to stay in the current grid. The predators achieve their goal if and only if both of them are adjacent to their target. Additionally, the predators may also change their goal before they capture a prey. The recognizer can get the exact positions of the two preys, but its observations of the predators are noisy. We need to compute the posterior distribution of the predators’ goals given the observation trace.
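A predator's observation of another agent can be sketched as follows. The paper does not state how a far-away offset is mapped to one of the 8 directions, so the 45-degree sectors used here are an assumption, as are the function name, the coordinate convention (y grows toward the north), and the return format.

```python
import math

# Sector labels in counter-clockwise order starting from east (assumed convention).
DIRECTIONS = ("E", "NE", "N", "NW", "W", "SW", "S", "SE")

def observe_agent(own, other):
    """Return ("near", offset) inside the 8 neighboring grids, else ("far", direction)."""
    dx, dy = other[0] - own[0], other[1] - own[1]
    if max(abs(dx), abs(dy)) <= 1:
        # exact relative position is known for agents in the nearest grids
        return ("near", (dx, dy))
    # otherwise only the direction is known, discretized into 45-degree sectors
    sector = round(math.atan2(dy, dx) / (math.pi / 4)) % 8
    return ("far", DIRECTIONS[sector])

north = observe_agent((2, 2), (2, 4))
close = observe_agent((2, 2), (3, 3))
southeast = observe_agent((0, 0), (3, -3))
```

Aggregating such per-agent observations gives the observation on which the memoryless policy of a predator is conditioned.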
The modified predator-prey problem can be modeled by the DecPOMDM. Some important elements are as follows: (a) agents: the two predators; (b) states: the positions of predators and preys; (c) actions: five actions for each predator, moving in one of four directions or staying; (d) goals: Prey PA or Prey PB; (e) observations of the predators: the directions of agents far away and the real positions of agents nearby; (f) observations of the recognizer: the real positions of preys and the noisy positions of predators; (g) reward: predators get a reward of +1 once they achieve their goal; otherwise, the immediate reward is 0; (h) horizon: infinite.
With the definition above, the effects of the predators’ actions are uncertain, and the state transition function depends on the distribution of the preys’ actions. Thus, the actions of the preys play the role of events in discrete dynamic systems.
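The predators' partial observation described in this section can be sketched in a few lines of Python. This is only an illustration under assumptions the paper does not state: cells are addressed as (row, column) with row 0 at the north edge, and the direction of a far-away agent is taken as the nearest of the 8 compass sectors. The function name `observe` and the direction labels are hypothetical.

```python
import math

# Compass labels ordered counterclockwise, starting from east (assumed convention).
DIRECTIONS = ["east", "northeast", "north", "northwest",
              "west", "southwest", "south", "southeast"]


def observe(me, other):
    """Return what a predator at `me` perceives about an agent at `other`.

    Positions are (row, col) tuples; row 0 is the north edge (assumption).
    """
    d_row = other[0] - me[0]
    d_col = other[1] - me[1]
    # Within the 8 surrounding cells: the exact position is known.
    if max(abs(d_row), abs(d_col)) <= 1:
        return ("pos", other)
    # Otherwise only a coarse direction is known. Rows grow southward,
    # so the vertical component is negated before taking the angle.
    angle = math.atan2(-d_row, d_col)
    sector = round(angle / (math.pi / 4)) % 8
    return ("dir", DIRECTIONS[sector])
```

For example, `observe((2, 2), (2, 3))` reports an exact position, while `observe((2, 2), (0, 2))` reports only a direction. A different sector rule (for instance, sign-based) would change only the last three lines.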
5.2. Settings
In this section, we provide additional details for the scenario and the parameters used in the policy learning and inference algorithm.
(A) The Scenario. The preys are senseless: they select each action with equal probability. Initial positions of agents are uniform. The initial goal distribution is that and .
We simplify the goal termination function in the following way: if the predators capture their target, the goal is achieved; otherwise, the predators change their goal with a probability of 0.05 in every state. The observation for the recognizer reveals the real position of each predator with a probability of 0.5, and it falls on each of the 8 neighbors of the current grid with a probability of 0.5/8. When the predator is on the edge of the map, the observation may be outside the map. The observations of different agents are independent of each other.
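The recognizer's observation noise and the simplified goal termination rule can be written down directly from the numbers above. The following is a minimal sketch; the function names and the (row, column) coordinates are illustrative assumptions, and off-map cells are deliberately kept because the text allows observations outside the map.

```python
import random


def recognizer_obs_dist(pos):
    """Distribution over observed cells for a predator whose true cell is `pos`."""
    row, col = pos
    dist = {(row, col): 0.5}  # true position is revealed with probability 0.5
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if (d_row, d_col) != (0, 0):
                # Each of the 8 neighbors gets 0.5 / 8, even if it is off the map.
                dist[(row + d_row, col + d_col)] = 0.5 / 8
    return dist


def goal_terminates(captured, rng):
    """Goal ends on capture; otherwise it is dropped with probability 0.05."""
    return captured or rng.random() < 0.05


if __name__ == "__main__":
    dist = recognizer_obs_dist((0, 0))
    print(len(dist), sum(dist.values()))
    print(goal_terminates(False, random.Random(0)))
```

The nine probabilities sum to one, which is why the 0.5/8 figure has to apply to each neighbor separately.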
(B) Multiagent Reinforcement Learning. The discount factor is . The predator selects an action given the observation with a probabilitywhere is the Boltzmann temperature. We set as a constant, which means that predators always select approximately optimal actions. In our scenario, the value converges after 750 iterations. In the learning process, if predators cannot achieve their goal in 10000 steps, we will reinitialize their positions and begin next episode.
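The action-selection rule in Part (B) is Boltzmann (softmax) exploration. A minimal sketch follows, assuming the learned action values for one observation are available as a list; the names `q_values` and `temperature` are illustrative and are not the paper's notation, and the numeric values in the example are made up.

```python
import math


def boltzmann_policy(q_values, temperature):
    """Return action probabilities proportional to exp(Q / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    best = max(q_values)
    exps = [math.exp((q - best) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Five actions (four moves and staying); the second one has the best value.
    print(boltzmann_policy([0.0, 1.0, 0.0, 0.0, 0.0], temperature=0.1))
```

With a small constant temperature, almost all probability mass falls on the highest-valued action, which matches the statement that the predators select approximately optimal actions.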
(C) The Marginal Filter. In the MF inference, a particle consists of the joint goal, the goal termination variable, and the positions of predators and preys. Although there are 25 possible positions for each predator or prey, after getting a new observation, we can identify the positions of the preys, and there are only 9 possible positions for each predator. Thus, the number of possible values of particles at one step never exceeds 9 × 9 × 2 × 2 = 324 after the UPDATE operation. In our experiments, we simply set the maximum number of particles to 324; then we do not need to prune any possible particles, which means that we perform exact inference. We also use the real settings of the DecPOMDM and the real policies of the predators in the MF inference.
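One step of a marginal filter over a discrete state space can be sketched as follows. This is an illustration of the general idea, not the paper's implementation: `transition` and `likelihood` are hypothetical callables supplied by the caller, and the pruning shown here (keep the highest-weight states) is a simple stand-in for the PRUNE operation of [13].

```python
from collections import defaultdict


def mf_step(belief, transition, likelihood, observation, max_particles):
    """Advance a {state: weight} belief by one step.

    transition(state) -> {next_state: probability}
    likelihood(observation, state) -> probability of the observation
    """
    # PREDICT: enumerate all successors; identical states are merged,
    # so no state is ever stored twice.
    predicted = defaultdict(float)
    for state, weight in belief.items():
        for next_state, prob in transition(state).items():
            predicted[next_state] += weight * prob
    # UPDATE: weight every state by the observation likelihood.
    updated = {s: w * likelihood(observation, s) for s, w in predicted.items()}
    updated = {s: w for s, w in updated.items() if w > 0.0}
    if not updated:
        raise ValueError("observation has zero probability under the model")
    # PRUNE (stand-in): keep the max_particles heaviest states, then normalize.
    kept = sorted(updated.items(), key=lambda kv: kv[1], reverse=True)
    kept = kept[:max_particles]
    total = sum(w for _, w in kept)
    return {s: w / total for s, w in kept}
```

When `max_particles` is at least the number of states with nonzero weight, nothing is pruned and the step is exact, which is the setting used in the experiments.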
5.3. Results and Discussion
With the learned policy, we ran the simulation model repeatedly and generated a test dataset consisting of 100 traces. A trace has 11.83 steps on average, and the number of steps in one trace varies from 6 to 34 with a standard deviation of 5.36. We also found that the predators changed their goals at least once in 35% of the test traces. To validate our method and compare it with baselines, we did experiments on three aspects: (a) to discuss the details of the recognition results obtained with our method, we computed the recognition results of two specific traces; (b) to show the advantages of the DecPOMDM in modeling, we compared the recognition performances when the problem was modeled as a DecPOMDM and as an HMM; (c) to show the efficiency of the MF under the framework of the DecPOMDM, we compared the recognition performances when the goal was inferred by the MF and by the PF.
In the second and the third parts, performances were evaluated by statistical metrics: precision, recall, and F-measure. Their meanings and computation details can be found in [31]. The value of each of the three metrics is between 0 and 1; a higher value means a better performance. Since these metrics can only evaluate the recognition results at a single step, and traces in the dataset had different time lengths, we defined a positive integer . The metric with means that the corresponding observation sequences are . Here, is the observation sequence from time 1 to time of the th trace, and is the length of the th trace. And we need to recognize for each observation sequence. Thus, metrics with different show the performances of algorithms in different phases of the simulation. Additionally, we regarded the destination with the largest probability as the final recognition result.
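The evaluation at a given step can be illustrated with a short sketch. Two points here are assumptions rather than statements from the paper: a trace shorter than the evaluation step is used in full (the prefix is cut at the smaller of the step index and the trace length), and the per-goal precision and recall are macro-averaged over the goals. The exact definitions are those of [31]; the function names below are hypothetical.

```python
def prefix_at(trace, k):
    """Observation prefix used at evaluation step k (assumed truncation rule)."""
    return trace[:min(k, len(trace))]


def macro_prf(true_goals, predicted_goals, labels):
    """Macro-averaged precision, recall, and F-measure over the goal labels."""
    precisions, recalls = [], []
    pairs = list(zip(true_goals, predicted_goals))
    for goal in labels:
        tp = sum(1 for t, p in pairs if t == goal and p == goal)
        fp = sum(1 for t, p in pairs if t != goal and p == goal)
        fn = sum(1 for t, p in pairs if t == goal and p != goal)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

In use, each predicted goal would be the goal with the largest posterior probability after filtering the prefix returned by `prefix_at`.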
(A) The Recognition Results of the Specific Traces. To show the details of the recognition results obtained with our method, we selected two specific traces from the dataset (number 1 and number 4). These two traces were selected because Trace number 1 was the first trace where the goal was changed before it was achieved, and number 4 was the first trace where the goal was kept until it was achieved. The detailed information about the two traces is shown in Table 1.
In Trace number 1, predators selected Prey PA as their first goal from to . Then, the goal was changed to Prey PB, which was achieved at . In Trace number 4 predators selected Prey PB as their initial goal. This goal was kept until it was achieved at . Given the real policies and other parameters of the DecPOMDM including , , , , , , and , we used the MF to compute the posterior distribution of goals at each time. The recognition results are shown in Figure 3.
Figure 3: Recognition results of (a) Trace number 1 and (b) Trace number 4.
In Figure 3(a), the probability of the real goal (Prey PA) increases fast during the initial period. At , the probability of Prey PA exceeds 0.9 and keeps a high value until . When the goal is changed at , our method has a very fast response, because predators select highly certain joint actions at this time. In Figure 3(b), the probability of the false goal increases at , and the probability of the real goal (Prey PB) is low at first. The failure happens because the observations suggest joint actions that the predators select only with a small probability when the goal is Prey PB. Nevertheless, the probability of the real goal increases continuously and exceeds that of Prey PA after . From the recognition results of the two specific traces, we conclude that the DecPOMDM and the MF perform well whether or not the goal changes.
(B) Comparison of the Performances of the DecPOMDM and HMMs. To show the advantages of the DecPOMDM, we modeled the predator-prey problem as the well-known HMM as a baseline. In the HMM, we set the hidden state as the tuple , where , , , and are positions of Predator PX, Predator PY, Prey PA, and Prey PB, respectively. The observation model for the recognizer in the HMM was the same as that in the DecPOMDM. Thus, there were 607,200 possible states in this HMM, and the dimension of the transition matrix was 607,200 × 607,200. Since the state space and transition matrix were too large, an unsupervised learning method such as the Baum-Welch algorithm was infeasible in this problem. Instead, we used simple supervised learning: counting the state transitions based on labeled training datasets. With the real policies of the predators, we ran the simulation model repeatedly and generated five training datasets. The detailed information of these datasets is shown in Table 2.
The datasets HMM50, HMM100a, and HMM200a were generated in a random and incremental way (HMM100a contains HMM50, and HMM200a contains HMM100a). Since HMM100a and HMM200a both covered 79% of the traces of the test dataset, which might cause an overfitting problem, we removed the traces covered by the test dataset from HMM100a and HMM200a and replaced them with extra traces. In this way, we got new datasets HMM100b and HMM200b, which did not cover any trace in the test dataset. With these five labeled datasets, we estimated the transition matrix by counting state transitions. Then, the MF was used to infer the goal. However, in the inference process, we may reach some states that have not been explored in the training dataset (HMM200a only explores 492,642 states, but there are 607,200 possible states in total). In this case, we assumed that the hidden state would move to any possible state with a uniform distribution. The rest of the parameters in the HMM inference were the same as those in the DecPOMDM inference. We compared the performances of the DecPOMDM and the HMMs. The recognition metrics are shown in Figure 4.
Figure 4: Recognition metrics of the DecPOMDM and the HMMs: (a) precision; (b) recall; (c) F-measure.
Figure 4 shows that the comparison between the DecPOMDM and the HMMs trained on different datasets is similar across precision, recall, and F-measure. More precisely, HMM200a had the highest performance; HMM100a performed comparably to our DecPOMDM, but the DecPOMDM performed better after ; HMM50 had the worst performance. Generally, there was no big difference between the performances of the DecPOMDM, HMM100a, and HMM200a, even though the number of traces in HMM200a was twice as large as that in HMM100a. The main reason is that HMM100a and HMM200a overlapped heavily with the test dataset, so the models trained on them were overfitted. Indeed, there was a very serious performance decline after we removed the traces covered by the test dataset from HMM200a and HMM100a. In this case, HMM200b performed better than HMM100b, but worse than our DecPOMDM. The results in Figure 4 showed that (a) our DecPOMDM performed well on all three metrics, almost as well as the overfitted models; (b) the learned HMM suffered from an overfitting problem, and there is a serious performance decline if the training dataset does not cover most traces in the test dataset.
The curves of HMM50, HMM100b, and HMM200b also showed that when we model the problem through the HMM, it may be possible to improve the recognition performance by increasing the size of the training dataset. However, this solution is infeasible in practice. The dataset HMM200a, which consisted of 200,000 traces, covered only 81.13% of all possible states, and only 71.46% of the states in HMM200a had been reached more than once. Thus, we can conclude that HMM200a will perform poorly if agents select actions with higher uncertainty. Besides, the training dataset would have to be extremely large for most states to be reached a large number of times. In real applications, it is very hard to obtain a training dataset of such a large size, especially when all traces must be labeled.
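The state count and the coverage figure quoted above can be checked with a few lines of arithmetic. The factorization used here is an assumption that happens to reproduce the reported total: two possible goals, and four agents occupying distinct cells among the 25 possible positions.

```python
import math

CELLS = 25        # possible positions for each agent (from the settings)
AGENTS = 4        # two predators and two preys
GOALS = 2         # Prey PA or Prey PB

# Ordered placements of 4 agents on distinct cells, times the number of goals
# (assumed structure of the hidden state).
n_states = GOALS * math.perm(CELLS, AGENTS)

explored = 492_642            # states explored by HMM200a (reported)
coverage = explored / n_states

print(n_states)                    # 607200
print(round(coverage * 100, 2))    # 81.13
```

Under this reading, a dense transition matrix would have more than 3.6 × 10^11 entries, which gives a sense of why counting-based estimation leaves many transitions unobserved.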
We also performed a Wald test over the DecPOMDM and the HMMs trained on different datasets to test whether their recognition results came from different distributions. Given our test dataset, there were 100 goals to be recognized for each value of . Let be the set of samples obtained from the DecPOMDM; then we set if the recognition result of the DecPOMDM is correct on the test case and , otherwise; let be the set of samples obtained from the baseline model (one of the HMMs); then we set if the recognition result of the baseline model is correct on the test case and , otherwise. We define , and let ; the null hypothesis is , which means the recognition results from different models follow the same distribution. A more detailed test process can be found in [32]. The matrix of values is shown in Table 3.
The values in Table 3 show that recognition results of the DecPOMDM follow different distributions from those of HMM50, HMM100b, and HMM200b, respectively, with a high confidence. We cannot reject the null hypothesis when the baseline is an overfitted trained model. The Wald test results are consistent with the metrics in Figure 4. We also performed the Wilcoxon test, and the test results showed the same trend.
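A paired Wald test on per-case correctness indicators can be sketched as follows. This is an assumed reading of the procedure, since the exact formulation is given in [32]: the two models are compared through the per-case difference of their 0/1 correctness indicators, the statistic is the mean difference divided by its estimated standard error, and the two-sided p value comes from the standard normal distribution. The function name is hypothetical.

```python
import math


def paired_wald_test(x, y):
    """Return (statistic, two-sided p value) for H0: mean(x - y) = 0.

    x and y are equal-length lists of 0/1 correctness indicators,
    one entry per test case, for the two models being compared.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    std_err = math.sqrt(variance / n)
    if std_err == 0.0:
        # All differences are identical: no evidence against H0 only if they are zero.
        if mean == 0:
            return 0.0, 1.0
        return math.copysign(math.inf, mean), 0.0
    statistic = mean / std_err
    # Two-sided normal tail probability: 2 * (1 - Phi(|W|)) = erfc(|W| / sqrt(2)).
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value
```

A small p value leads to rejecting the null hypothesis that the two models' results follow the same distribution; identical indicator lists give a p value of 1.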
(C) Comparison of the Performances of the MF and the PF. Here, we used the PF as the baseline. The model information used in the PF is the same as that in the MF. We evaluated the PF with different numbers of particles and compared its performance with that of the MF. All inference was done under the framework of the DecPOMDM. Note that when the PF uses weighted particles to approximate the posterior distribution of the goal, it is possible that all weights decrease to 0 if the number of particles is not large enough. In this case, we simply reset all weights to continue the inference, where is the number of particles in the PF. The recognition metrics of the MF and PF are shown in Figure 5.
Figure 5: Recognition metrics of the MF and the PF: (a) precision; (b) recall; (c) F-measure.
In Figure 5, the red solid curve indicates the metrics of the MF. The green, blue, purple, black, and cyan dashed curves indicate the metrics of the PF with 1000, 2000, 4000, 6000, and 16000 particles, respectively. All filters had similar precision. However, in terms of recall and F-measure, the MF had the best performance, and the PF with the largest number of particles performed better than the other PFs. We got these results because an exact MF (without the pruning step) is used in this section, and the PF approximates the real posterior distribution better with more particles.
As in Part (B), we also performed the Wald test on the MF and the PF with different numbers of particles. The matrix of the values is shown in Table 4.
The null hypothesis is that the recognition results of the baseline and the MF follow the same distribution. Generally, with larger and smaller number of particles, we can reject the null hypothesis with higher confidence, which is consistent with the results in Figure 5.
Since there was variance in the results inferred by the PF (because the PF performs approximate inference), we ran the PF with 6000 particles 10 times. The mean value with 0.90 belief interval of the values of metrics at different is shown in Figure 6.
Figure 6: Metrics of the PF with 6000 particles over 10 runs: (a) precision; (b) recall; (c) F-measure.
The blue solid line indicates the mean value of the PF metrics, and the cyan area shows the 90% belief interval. Note that since we do not know the underlying distribution of the metrics, an empirical distribution was used to compute the belief interval. At the same time, because we ran the PF 10 times, the bounds of the 90% belief interval also indicate the extrema of the PF. We can see that the metrics of the MF are better than the mean of the PF when , and even better than the maximum value of PF except for . Actually, the MF also outperforms 90% of the runs of the PF at . Additionally, the MF needs on average only 75.78% of the time needed by the PF for inference at each step. Thus, the MF consumes less time and performs better than the PF with 6000 particles.
Generally, the computational complexities of the PF and the MF are both linear functions of the number of particles. When there are multiple possible results of an action, the MF consumes more time than the PF when their numbers of particles are equal. However, since the MF does not duplicate particles, it needs far fewer particles than the PF when the state space is large and discrete. In fact, the number of possible states in the MF after the UPDATE operation never exceeds 156 in the test dataset. At the same time, the PF has to duplicate particles in the resampling step to approximate the exact distribution, which makes it inefficient under the framework of the DecPOMDM.
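The two PF behaviors discussed in Part (C), the fallback when all weights collapse to zero and the duplication of particles during resampling, can be illustrated with a short sketch. The reset value used here (equal weights, one over the number of particles) is an assumption, because the original value is not recoverable from the text; multinomial resampling is likewise only one common choice, and the function names are hypothetical.

```python
import random


def normalize_or_reset(weights):
    """Normalize particle weights; fall back to equal weights if all are zero."""
    n = len(weights)
    total = sum(weights)
    if total == 0.0:
        # Assumed fallback: every particle gets weight 1 / n.
        return [1.0 / n] * n
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Multinomial resampling: draw len(particles) particles with replacement."""
    return rng.choices(particles, weights=weights, k=len(particles))


if __name__ == "__main__":
    rng = random.Random(0)
    particles = ["s1", "s2", "s3", "s4"]
    weights = normalize_or_reset([0.7, 0.2, 0.1, 0.0])
    drawn = resample(particles, weights, rng)
    # Heavy states tend to appear several times, so the number of distinct
    # states after resampling is usually smaller than the particle count.
    print(drawn, len(set(drawn)))
```

A marginal filter stores each distinct state once with its accumulated weight, which is why it can represent the same distribution with fewer entries.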
6. Conclusion and Future Work
In this paper, we propose a novel model, the DecPOMDM, for solving multiagent goal recognition problems and present its corresponding learning and inference algorithms.
First, we use the DecPOMDM to model the general multiagent goal recognition problem. The DecPOMDM presents the agents’ cooperation in a compact way; details of cooperation are unnecessary in the modeling process. It can also make use of existing algorithms for solving the DecPOMDP problem.
Second, we show that the existing MARL algorithm can be used to estimate the agents’ policies in the DecPOMDM under the assumption that agents are approximately rational. This method needs neither a training dataset nor the state transition function of the environment.
Third, we use the MF to infer goals under the framework of the DecPOMDM, and we show that the MF is more efficient than the PF when the state space is large and discrete.
Last, we design a modified predator-prey problem to test our method. In this modified problem, there are multiple possible joint goals, and agents may change their goals before they are achieved. With this scenario, we compare our method with baselines including the HMM and the PF. The experiment results show that the DecPOMDM together with the MARL and MF algorithms can recognize the multiagent goal well whether the goal is changed or not; the DecPOMDM outperforms the HMM in terms of precision, recall, and F-measure; and the MF can infer goals more efficiently than the PF.
In the future, we plan to apply the DecPOMDM in more complex scenarios. Further research on pruning technique of the MF is also planned.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
This work is sponsored by the National Natural Science Foundation of China under Grants no. 61473300 and no. 61573369 and DFG research training group 1424 “Multimodal Smart Appliance Ensembles for Mobile Applications.”