Discrete Dynamics in Nature and Society

Volume 2017, Article ID 4580206, 15 pages

https://doi.org/10.1155/2017/4580206

## A Decentralized Partially Observable Markov Decision Model with Action Duration for Goal Recognition in Real Time Strategy Games

College of Information System and Management, National University of Defense Technology, Changsha 410073, China

Correspondence should be addressed to Kai Xu; xukai09@nudt.edu.cn

Received 22 March 2017; Accepted 8 June 2017; Published 16 July 2017

Academic Editor: Filippo Cacace

Copyright © 2017 Peng Jiao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Multiagent goal recognition is a tough yet important problem in many real time strategy games and simulation systems. Traditional modeling methods either require detailed domain knowledge about the agents and a training dataset for policy estimation or lack a clear definition of action duration. To solve these problems, we propose a novel Dec-POMDM-T model, which combines the classic Dec-POMDP, an observation model for the recognizer, a joint goal with its termination indicator, and duration and termination variables for actions. In this paper, a model-free algorithm named cooperative colearning based on Sarsa is used. Considering that the Dec-POMDM-T usually faces multiagent goal recognition problems with different sorts of noise, partially missing data, and unknown action durations, the paper exploits the SIS particle filter with resampling for inference under the dynamic Bayesian network structure of the Dec-POMDM-T. In the experiments, a modified predator-prey scenario is adopted to study the multiagent joint goal recognition problem, namely, the recognition of the joint target shared among cooperative predators. Experiment results show that (a) the Dec-POMDM-T works effectively in multiagent goal recognition and adapts well to goals that change dynamically within the agent group; (b) the Dec-POMDM-T outperforms traditional Dec-MDP-based methods in terms of precision, recall, and F-measure.

#### 1. Introduction

Recently, more and more commercial real time strategy (RTS) games have received attention from AI researchers, behavior scientists, policy evaluators, and staff training groups [1]. A key aspect in developing these RTS games is to create human-like players or agents who can act or react intelligently to a changing virtual environment and to interactions from real players [2]. Though many AI planning and decision-making algorithms have been applied to agents in RTS games, their behavior patterns are still easy to predict, making games less entertaining or intuitive. This is partially because of agents' limited information processing and understanding ability, for example, their inability to recognize the goals or intentions of opponents or teammates. In other words, understanding goals or intentions in time helps agents cooperate better or make counter decisions more efficiently.

A typical scenario in RTS games is a group of AI players cooperating to achieve a certain mission. In StarCraft, for example, the AI players have to cooperate to besiege enemy bases or intercept certain logistic forces [3]. Therefore, if AI players can recognize the real moving or attacking target, they will be better prepared, whether in early defense deployment or in counter decision-making. Considering these benefits, goal recognition has attracted much attention from researchers in many different fields. Many related models and algorithms have been proposed and applied, such as hidden Markov models (HMMs) [4], conditional random fields (CRFs) [5], Markov decision processes (MDPs) [6], and particle filtering (PF) [7].

Hidden Markov models [8] are especially known for their applications in temporal pattern recognition such as speech, handwriting, and gesture recognition. Though convenient for representing system states, HMMs have limited ability to describe agent actions in a dynamic environment. Compared to HMMs, MDPs offer a better representation of actions and their future effects. The MDP is a framework for solving sequential decision problems: agents select actions sequentially based on states, and each action has an impact on future states. MDPs have been successfully applied in goal and intention recognition [6]. Several modifications based on the MDP framework give a finer formalization for more complex scenarios. Among these models, the Dec-POMDM (decentralized partially observable Markov decision model) [9] is an MDP-based method focusing on the multiagent goal recognition problem. Though it has all details of cooperation embedded in the team's joint policy, the Dec-POMDM only considers actions that start and terminate within one time step. This is usually not applicable in RTS games.

Based on ideas from the Dec-POMDM and SMDPs [10], we propose a novel decentralized partially observable Markov decision model with time duration (Dec-POMDM-T) to formalize multiagent cooperative behaviors with durative actions. The Dec-POMDM-T models the joint goal, the actions, and the world states hierarchically. Compared to the works in [9, 11], the Dec-POMDM-T explicitly models the time duration of primitive actions, indicating whether actions have terminated or not. In the Dec-POMDM-T, multiagent joint goal recognition consists of three components: (a) formalization of behaviors, the environment, and the observations of the recognizer; (b) model parameter estimation through learning or other methods; and (c) goal inference from observations.

(a) For the problem formalization, agents' cooperative behaviors are modeled by joint policies, ensuring the model's effectiveness without considering any domain-related cooperation mechanism. Besides, explicit time duration modeling of primitive actions is also implemented.

(b) For the parameter estimation, under the assumption of agents' rationality, many algorithms for the Dec-POMDP can be exploited for accurate or approximate policy estimation, making a training dataset unnecessary. This paper uses a model-free algorithm named cooperative colearning based on Sarsa [12] for policy learning.

(c) For the goal inference, a modified particle filtering method is exploited because of its advantages in solving goal recognition problems with different sorts of noise, partially missing data, and unknown action durations.
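The cooperative colearning of (b) builds on the standard Sarsa update rule. The following is a minimal single-agent tabular sketch; the function name, learning rate, and discount factor are illustrative choices, not the paper's actual settings:

```python
# Minimal tabular Sarsa update sketch (illustrative only; the paper's
# cooperative colearning extends this to joint actions among agents).
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.95):
    """One Sarsa step: Q(s,a) <- Q(s,a) + alpha*(r + gamma*Q(s',a') - Q(s,a))."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0)
    )
    return Q

Q = {}
# One transition: state 0, action 1, reward 1.0, next state 1, next action 0.
Q = sarsa_update(Q, s=0, a=1, r=1.0, s2=1, a2=0)
```

In the cooperative setting, the same on-policy update would be applied with joint actions and a shared reward signal.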

Like the modified predator-prey problem presented in [9], the scenario in this paper also has more than one prey and predator. The predators first establish a joint pursuit target or goal, which may change halfway, before capturing it. The model and its inference methods are applied to recognize the real goal behind the agents' cooperative behaviors, observed as partially observable traces with additional noise. Based on this scenario, we retrieve the agents' optimal policies using a model-free multiagent reinforcement learning (MARL) algorithm. After that, we run a simulation model in which agents select actions according to the policies and generate a dataset consisting of 100 labeled traces. With this dataset, statistical metrics including precision, recall, and F-measure are computed using the Dec-POMDM-T and other Dec-MDP-based methods, respectively. Experiments show that the Dec-POMDM-T outperforms the others in all three metrics. Besides, the recognition results of two traces are also analyzed, showing that the Dec-POMDM-T is quite robust when joint goals change dynamically during the recognition process. The paper also analyzes the estimation variance and time efficiency of our modified particle filter algorithm and thus demonstrates its effectiveness in practice.
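The three evaluation metrics follow their standard definitions over labeled traces; a small sketch with made-up counts (not the paper's data):

```python
# Precision, recall, and F-measure from true-positive, false-positive,
# and false-negative counts (toy numbers, purely illustrative).
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = prf(tp=80, fp=10, fn=20)
```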

The rest of the paper is organized as follows. Section 2 introduces related works. Section 3 analyzes the moving process in RTS games and presents the formal definition of the Dec-POMDM-T as well as its DBN structure. Based on that, Section 4 introduces how the modified particle filter algorithm is used in multiagent joint goal inference. After that, the experiment scenarios, parameter settings, and results are presented in Section 5. Finally, the paper draws conclusions and discusses future work in Section 6.

#### 2. Related Works

As an interdisciplinary research hotspot covering psychology and artificial intelligence, the problem of goal recognition or intention recognition has been approached in many different ways. In the early days, the formalization of the goal recognition problem was usually tied to the construction of a plan library, and the recognition process was based on logical consistency matching between observations and the plan library. Later, the well-known Probabilistic Graphical Models (PGMs) [13] family, including MDPs [6], HMMs [3], and CRFs [5], was proposed as a more compact graph-based representation approach. Additionally, PGMs have an advantage in modeling the uncertainty and dynamics of both the environment and the agent itself, which is not possible in the above consistency-based methods. Among PGMs, several modifications, including hierarchical graph model structures [14–16] and explicit modeling of action duration [17, 18], have also been proposed. Although probabilistic methods have an advantage in uncertainty modeling, they still cannot represent and process structural or relational data. Statistical relational learning (SRL) [19] is a relatively new theory applied in intention recognition, including logical HMMs (LHMMs) [20], Markov logic networks (MLNs) [21], and Bayesian logic programs (BLPs) [22]. It combines relational representation, first-order logic, probabilistic inference, and machine learning. Besides, several other methods based on probabilistic grammars have been proposed following the discovery of the similarity between natural language processing (NLP) and intention recognition [23]. Most recently, deep learning and other intelligent algorithms for retrieving an agent's decision model have also been applied in intention recognition [24]. Other lines of work, like goal recognition design (GRD) [25, 26], try to solve the same problem from a different angle.

##### 2.1. Goal Recognition with Action Duration Modeling

A group of models in PGMs, like the HMM-/MDP-based models, relies on the Markov property, which assumes that future states depend only on the current state. Generally speaking, the Markov property enables reasoning and computation with models that would otherwise be intractable. Though it is desirable for models to exhibit the Markov property, it does not always hold in real goal recognition scenarios, causing serious performance degradation such as lower precision, longer convergence time, and even wrong predictions. One main cause of Markov property violation is agents having durative primitive actions. Typically there are two approaches to solving this problem. One is forming hierarchical structures. Fine et al. [14] proposed the Hierarchical HMM (HHMM) in 1998. Bui et al. [3] used abstract hidden Markov models (AHMMs) for hierarchical goal recognition based on abstract Markov policies (AMPs). A problem of the AHMM is that it does not allow the top-level policy to be interrupted when the subplan is not completed. Saria and Mahadevan [27] extended the work by Bui to multiagent goal recognition. Similar modifications include the Layered HMM (LHMM) [15], the Dynamic CRF (DCRF) [28], and the Hierarchical CRF (HCRF) [16].

The other kind of approach tackles the non-Markov property by explicitly modeling action duration time. Hladky and Bulitko [17] applied the hidden semi-Markov model (HSMM) to opponent position estimation in the first person shooter (FPS) game Counter-Strike. Duong et al. [18] proposed a Coxian hidden semi-Markov model (CxHSMM) for recognizing human activities of daily living (ADL). The CxHSMM modifies the HMM in two aspects: on one hand, it is a special DBN representation of a two-layer HMM with termination variables; on the other hand, it uses a Coxian distribution to model the duration of primitive actions explicitly. Besides, Yue et al. [9] proposed a SMDM (semi-Markov decision model) based on the AHMM, which not only has a hierarchical structure but also models time duration. Similar methods include the semi-Markov CRF (SMCRF) [29] and the Hierarchical Semi-Markov CRF (HSCRF) [30].
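To illustrate the common idea behind explicit duration modeling, the sketch below (all names and the duration distribution are hypothetical, not taken from any cited model) draws a duration when an action starts and raises a termination flag only when the counter runs out, so a new action can be selected only at termination:

```python
import random

# Semi-Markov-style action-duration sketch: an action persists for a sampled
# number of ticks; the termination variable gates when a new action may start.
def step_duration_model(state, rng):
    """Advance one tick; return the updated (action, remaining) and a flag."""
    action, remaining = state
    remaining -= 1
    terminated = remaining <= 0
    if terminated:
        # Hypothetical action set and duration distribution (uniform on 1..4).
        action = rng.choice(["move_north", "move_south", "wait"])
        remaining = rng.randint(1, 4)
    return (action, remaining), terminated

rng = random.Random(0)
state = ("move_north", 2)  # current action with 2 ticks left
flags = []
for _ in range(6):
    state, done = step_duration_model(state, rng)
    flags.append(done)
```

A recognizer observing such a process cannot treat every tick as an independent decision point, which is exactly the non-Markov difficulty these models address.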

##### 2.2. Multiagent Goal Recognition Based on MDP Framework

As noted above, the MDP is a framework for solving sequential decision problems. Baker et al. [6] proposed a computational framework based on Bayesian inverse planning for recognizing mental states such as goals. They assumed that the agent is rational: actions are selected based on an optimal or approximately optimal value function, given the beliefs about the world, and the posterior distribution of goals is computed by Bayesian inference. Ullman et al. [31] also successfully applied this theory to more complex social goals, such as helping and hindering, where an agent's goals depend on the goals of other agents. In the military domain, Riordan et al. [32] borrowed Baker's idea and applied Bayesian inverse planning to infer intents in multi-Unmanned Aerial Systems (UASs). Ramírez and Geffner [11] extended Baker's work by applying the goal-POMDP to formalize the problem. Compared to the MDP, the POMDP explicitly models the relation between the real world state and the observation of the agent. Compared to the POMDP, the I-POMDP defines an interactive state space, which combines the traditional physical state space with explicit models of other agents sharing the environment in order to predict their behavior. Ramírez and Geffner also solved the inference problem even when observations are incomplete. Besides, Yue et al. [9] proposed the Dec-POMDM, based on the Dec-POMDP, for multiagent goal recognition. Their model, however, does not consider situations in which agents have durative actions in RTS games. The above modifications based on the MDP framework, like SMDPs, POMDPs, and Dec-POMDPs, all give a finer formalization for more complex scenarios.
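The core of Bayesian inverse planning can be stated in a few lines: the posterior over goals is proportional to the likelihood of the observed actions under each goal's (approximately optimal) policy times the prior. A minimal sketch, with illustrative goal names and likelihood values:

```python
# Goal posterior by Bayes' rule: P(g | o_{1:t}) ∝ P(o_{1:t} | g) * P(g).
# Goal names and numbers below are made up for illustration.
def goal_posterior(prior, likelihoods):
    unnorm = {g: prior[g] * likelihoods[g] for g in prior}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

post = goal_posterior(
    prior={"base_A": 0.5, "base_B": 0.5},
    likelihoods={"base_A": 0.08, "base_B": 0.02},  # hypothetical P(obs | goal)
)
```

In practice the likelihood term comes from the (learned) policy evaluated along the observed trace, which is where the rationality assumption enters.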

#### 3. The Model

We propose the Dec-POMDM-T to formalize the world states, behaviors, goals, and action durations in the goal recognition problem. In this section, we first introduce how agents do path planning and move between adjacent grids in RTS games. Then, the formal definition of the Dec-POMDM-T is given, and the relations among the variables in the model are explained by a DBN representation. Based on that, the planning algorithm for finding the optimal policies is given.

##### 3.1. Agent Maneuvering in RTS Games

Agents’ maneuvering in RTS games usually consists of two processes: one is path planning, in which the starting point and destination are known beforehand; the other is moving from the current position to an adjacent grid.

###### 3.1.1. Path Planning

Like many classical planning problems, path planning generates courses of action given starting points and destinations, specifically sequences of positions. In dynamic environments, however, the effects of actions are uncertain. Besides, agent maneuvering is essentially a sequential decision problem, in which agents select actions according to the current states and destinations. Further, in multiagent cooperative behaviors, path planning also needs to follow the joint policy shared among the agent group. Thus a probabilistic Markov decision model is needed.
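As a toy illustration of why a probabilistic Markov decision model fits, consider value iteration on a one-dimensional corridor where a move may fail and the agent stays put; the corridor size, success probability, and discount factor below are made up:

```python
# Value iteration on a 5-cell corridor with a goal at cell 4.
# Each move succeeds with probability 0.8; otherwise the agent stays put.
N, GOAL, P_SUCCESS, GAMMA = 5, 4, 0.8, 0.9
V = [0.0] * N
for _ in range(100):
    newV = V[:]
    for s in range(N):
        if s == GOAL:
            continue  # terminal cell keeps value 0
        best = float("-inf")
        for move in (-1, 1):
            s2 = min(max(s + move, 0), N - 1)  # clamp to corridor bounds
            r = 1.0 if s2 == GOAL else 0.0     # reward for reaching the goal
            q = P_SUCCESS * (r + GAMMA * V[s2]) + (1 - P_SUCCESS) * GAMMA * V[s]
            best = max(best, q)
        newV[s] = best
    V = newV
```

The converged values rise monotonically toward the goal, so the greedy policy is "move toward the goal" despite the stochastic action effects.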

###### 3.1.2. Moving between Adjacent Grids

After the path planning algorithm determines the next position or grid, the agent needs to move from its original position to it. In real situations, one moving action usually lasts for several steps before the agent arrives at the target position. This breaks the Markov property and turns the agent's decision process into a semi-Markov one.

As in Figure 1, which was originally shown in [33], assume that an agent is at a point in grid C2 and wants to reach a point in grid A3. At the path planning level, the agent needs to choose a grid among the five adjacent grids (B1, B2, B3, C1, and C3). In this example, the agent decides to go to grid B2. At the moving level, the agent moves along the line from its current point to the center of grid B2. Because a simulation step is a short time span, the agent computes how long it will take to reach the grid center according to its current speed. Because the position of the agent is a continuous variable, it is very unlikely that the agent lands exactly on the grid center when a simulation step ends. Thus, the duration of moving is usually computed by

$$n = \left\lfloor \frac{d}{v \cdot \Delta t} \right\rfloor,$$

where $v$ is the speed, a constant during the moving process, $\Delta t$ is the time of one simulation step, $d$ is the distance between the current point and the grid center, and $\lfloor \cdot \rfloor$ is the floor operator. In this case, $n = 3$. After moving for 3 steps, the agent reaches the new position and chooses the next grid. This moving process will not be interrupted unless the intention changes.
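The floor-based duration above can be sketched directly; the symbol names are illustrative, and the distance is chosen so the call reproduces the 3-step example:

```python
import math

# Number of whole simulation steps needed to traverse the distance to the
# next grid center: n = floor(d / (v * dt)).
def moving_duration(distance, v, dt):
    return math.floor(distance / (v * dt))

# Hypothetical numbers: distance 3.5 units, speed 1 unit per step of length 1.
n = moving_duration(distance=3.5, v=1.0, dt=1.0)
```

With these numbers the agent moves for 3 full steps and then selects the next grid, matching the example in the text.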