Mathematical Problems in Engineering

Volume 2016 (2016), Article ID 1907971, 12 pages

http://dx.doi.org/10.1155/2016/1907971

## A Semi-Markov Decision Model for Recognizing the Destination of a Maneuvering Agent in Real Time Strategy Games

College of Information System and Management, National University of Defense Technology, Changsha 410073, China

Received 19 August 2015; Accepted 16 December 2015

Academic Editor: M. I. Herreros

Copyright © 2016 Quanjun Yin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Recognizing the destinations of a maneuvering agent is important in real time strategy games. Because finding a path in an uncertain environment is essentially a sequential decision problem, we can model the maneuvering process with a Markov decision process (MDP). However, the MDP does not define action durations. In this paper, we propose a novel semi-Markov decision model (SMDM). In the SMDM, the destination is regarded as a hidden state, which affects the selection of an action; the action is associated with a duration variable, which indicates whether the action is completed. We also exploit a Rao-Blackwellised particle filter (RBPF) for inference under the dynamic Bayesian network structure of the SMDM. In experiments, we simulate agents’ maneuvering in a combat field and employ the agents’ traces to evaluate the performance of our method. The results show that the SMDM outperforms another extension of the MDP in terms of precision, recall, and F-measure. Our method recognizes destinations efficiently whether or not they change midway. Additionally, the RBPF infers destinations with smaller variance and in less time than the standard particle filter (SPF), and its average failure rates are lower when the number of particles is insufficient.

#### 1. Introduction

In recent decades, commercial real time strategy (RTS) games such as Star Craft and War Craft have become increasingly popular. A key problem in developing these games is creating AI players that can recognize the intentions of their opponents, which makes the game more challenging and interesting [1].

A typical and significant intention in RTS games is the destination of a maneuvering player. In many attacking missions, players need to plan a path given a destination and the current situation, move along the planned path, and then destroy the enemies’ buildings. Thus, if AI players can recognize the destination from the observed traces of opponents, they can prepare a defense. Because of these benefits, recognition methods have been applied in several digital games. Like intention recognition in general, recognizing the destination of a maneuvering agent usually consists of three steps: formalization, parameter estimation, and destination inference [2]. In this paper, we focus only on formalization and inference; the parameters of the opponents’ real decision model are used directly. Note that these parameters can also be estimated by machine learning algorithms or by simple counting [3].

Hidden Markov models (HMMs) are widely used to model the maneuvering process of an agent. The idea behind HMMs for destination recognition is as follows: the position of the agent is regarded as a hidden state, and the probabilities of transiting between waypoints are modelled by a transition matrix. In other words, a Markov chain represents all possible paths. From the viewpoint of planning, HMMs focus on representing the system states but neglect the actions. However, the action, which is determined by the situation, strongly affects the state transitions. In particular, when the agent is in a dynamic environment, the concept of action is essential for modelling the behavior precisely.
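As a concrete illustration of this HMM view, the following sketch tracks a belief over three waypoints with a transition matrix and a noisy position likelihood (the numbers are toy assumptions for illustration, not taken from the paper):

```python
import numpy as np

# Toy 3-waypoint map: T[i, j] is the probability of moving from
# waypoint i to waypoint j (a Markov chain over possible paths).
T = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.1, 0.7],
              [0.1, 0.1, 0.8]])

def hmm_filter_step(belief, T, likelihood):
    """One HMM filtering update: predict with T, correct with p(obs | waypoint)."""
    predicted = belief @ T
    posterior = predicted * likelihood
    return posterior / posterior.sum()

belief = np.array([1.0, 0.0, 0.0])               # agent known to start at waypoint 0
belief = hmm_filter_step(belief, T, np.array([0.1, 0.7, 0.2]))
```

One filtering step multiplies the predicted belief by the observation likelihood and renormalizes; note that nothing in this model represents *why* the agent moves, which is exactly the limitation discussed above.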

In the uncertain planning domain, planning is defined as a sequential decision problem: the agent selects actions sequentially based on the state [4]. Because an action affects future states, the agent needs to compute the accumulated reward to obtain the optimal solution. Solving sequential decision problems falls under the framework of the Markov decision process (MDP). Compared to the HMM, an MDP can describe the actions of the agent and the interaction between the agent and the environment. Thus, many models based on the MDP framework have been proposed for intention recognition.
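The accumulated-reward computation behind an MDP can be sketched with value iteration on a toy one-dimensional corridor (the states, rewards, and discount factor below are illustrative assumptions, not the paper’s model):

```python
import numpy as np

# Value iteration on a toy 1-D corridor: cells 0..4, cell 4 is the
# destination (terminal), each move costs -1, discount gamma = 0.9.
n_states = 5
actions = (-1, +1)                     # move left / move right
gamma = 0.9

def step(s, a):
    """Deterministic transition, clipped to the corridor."""
    return min(max(s + a, 0), n_states - 1)

V = np.zeros(n_states)                 # V[4] stays 0 (terminal state)
for _ in range(100):
    for s in range(n_states - 1):
        # Bellman optimality backup: best one-step cost plus discounted future value.
        V[s] = max(-1.0 + gamma * V[step(s, a)] for a in actions)
```

After convergence the values increase toward the destination, so a greedy agent moves right; this is the sequential structure an HMM cannot express.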

In many cases, a complex mission is decomposed into sublevel tasks repeatedly until the mission consists only of primitive actions. To represent the decision process hierarchically, Bui et al. [5] proposed the abstract hidden Markov model (AHMM) based on the notion of abstract Markov policies (AMPs), which can be described simply in terms of a state space and a Markov policy that selects among a set of other AMPs. When the AHMM is used in intention recognition, it concerns only the probabilities of selecting a policy or abstract policy and does not need to build reward functions as MDPs do. A problem with the AHMM is that it does not allow the top-level policy to be interrupted while a subplan is incomplete. To solve this problem, we refined the structure of the AHMM and proposed a new model named AHMM with Changeable Top-level Policy (AHMM-CTP) [6]. In the AHMM-CTP, the top-level policies are allowed to change and are executed from top to bottom, which means the subplans can be forcibly terminated when the top-level policies are interrupted. However, the execution of primitive actions in the AHMM-CTP is still an MDP, which means a primitive action terminates at every step. In an RTS game, a primitive action in path planning is moving for a distance in a direction. Since the simulation step is quite short, a primitive action lasts for several steps. This process is consistent with the semi-Markov decision process (SMDP) in the planning domain [7].

Based on the idea of the SMDP, we propose a semi-Markov decision model (SMDM) to formalize maneuvering behaviors in RTS games. The SMDM has a structure similar to that of a three-layer AHMM-CTP. It models the intention (destination), the action, and the situation hierarchically: actions are selected depending on the intention and the situation, and actions result in updated situations. One difference is that, in the SMDM, each primitive action is associated with a duration variable, which indicates whether the primitive action is completed. The SMDM can also represent multilayer policies as the AHMM-CTP does, but this more complex model is not discussed in this paper.
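A single time slice of such a model could be sketched as follows (the field names are our illustrative assumptions; the paper’s formal definitions are given in Section 3):

```python
from dataclasses import dataclass

# One time slice of the hierarchical model sketched above: the hidden
# destination g drives action selection, and the duration/termination
# flag e marks whether the current primitive action has completed.
@dataclass
class SMDMSlice:
    g: int             # hidden destination (intention)
    a: int             # current primitive action (e.g., move direction)
    e: bool            # duration/termination flag: True when a is completed
    s: tuple           # observable situation, e.g., the agent's grid cell

def needs_new_action(t: SMDMSlice) -> bool:
    """A new action is selected only when the current one has terminated."""
    return t.e
```

The termination flag is the semi-Markov ingredient: unlike in an MDP, the action persists across slices until `e` becomes true.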

Inferring destinations online is actually a filtering problem [8]. As an approximate inference method, the particle filter (PF) is suitable for inferring destinations modelled by the SMDM, because it imposes no restriction on the form of the noise and can handle partially missing data. Although other classic methods could also solve this problem, they become infeasible when the maximum duration of actions is unknown. Another advantage of the PF is that its running time depends only on the number of particles. Thus, the time constraint can be satisfied by reducing the number of particles when computational resources are limited. However, when the state space is multidimensional, a small number of particles results in a very high estimation variance. One solution is the Rao-Blackwellised particle filter (RBPF), which combines exact inference with Monte Carlo sampling [9]. In the RBPF, some variables of a particle are not sampled but are represented as distributions, and their posterior distributions are computed by exact inference after the other variables are instantiated. In this way, the sampled state space is reduced. The RBPF has been successfully applied in [5, 6]. However, the exact inference depends strongly on the dynamic Bayesian network (DBN) structure of the model. To compute the posterior distributions of the uninstantiated variables exactly at each step, we exploit the link reversal method based on the DBN structure of the SMDM.
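The Rao-Blackwellisation idea can be sketched generically (an illustrative toy, not the paper’s exact algorithm): each particle samples the motion variables but keeps an exact, closed-form posterior over the destination, updated by Bayes’ rule once the sampled variables are fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy RBPF: each particle samples a discrete position x, but keeps an
# exact posterior over 3 candidate destinations as a probability vector.
n_particles, n_dests = 100, 3
particles = [{"x": 0,
              "g_post": np.full(n_dests, 1.0 / n_dests),
              "w": 1.0 / n_particles}
             for _ in range(n_particles)]

def rbpf_step(particles, obs_lik, move_lik_given_g):
    """Sample motion, update destination posteriors exactly, reweight."""
    for p in particles:
        p["x"] = int(rng.integers(0, 5))          # sampled motion (dummy model)
        # Exact Bayes update given the sampled move:
        # p(g | moves) is proportional to p(move | g) * p(g)
        p["g_post"] = p["g_post"] * move_lik_given_g
        p["g_post"] /= p["g_post"].sum()
        p["w"] *= obs_lik(p["x"])
    total = sum(p["w"] for p in particles)
    for p in particles:
        p["w"] /= total
    return particles

particles = rbpf_step(particles, lambda x: 1.0, np.array([0.6, 0.3, 0.1]))
```

Because the destination is never sampled, the per-particle variance contributed by that dimension is zero, which is why fewer particles suffice than in a standard PF.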

We design a combat scenario to validate our SMDM and the RBPF: on a grid map, a soldier moves to a predefined destination while trying to stay far from a patrolling vehicle. Our goal is to recognize the destination of the soldier from observed traces. Based on this scenario, we design a decision model for the soldier and generate a dataset consisting of 100 traces. With this dataset, statistical metrics including precision, recall, and F-measure are computed for both the SMDM and the AHMM-CTP. The results show that the SMDM outperforms the AHMM-CTP in all three metrics. The recognition results of two specific traces are also analyzed, which shows that the SMDM performs well whether or not the soldier’s destination changes halfway. We also compare the estimation variance and running time of the standard particle filter (SPF) and the RBPF. The results show that, when the number of particles is large, the RBPF obtains results with smaller weighted variance while costing less time. Additionally, when the number of particles is insufficient, the RBPF has a lower average failure rate than the SPF.

The rest of the paper is organized as follows: the next section introduces related work on formalization and inference. Section 3 analyzes the maneuvering process in RTS games and gives the formal definition of the SMDM as well as its DBN structure. Section 4 introduces how to use the RBPF to infer destinations approximately. Section 5 presents the background, settings, and results of our experiments. Finally, we conclude and discuss future work in the last section.

#### 2. Related Work

Since the problem of intention recognition, or plan recognition, was proposed at the intersection of psychology and artificial intelligence, people have formalized the planning or decision process in different ways. In the early days, formalization was usually based on the concept of a plan library. The event hierarchy proposed by Kautz and Allen may be the earliest representation for plan recognition [10]. In Kautz’s theory, plans and actions are both defined as events, and an event hierarchy describes abstraction, instantiation, components, and functions in first-order logic. Kautz’s theory is a milestone in plan recognition, but it may fail when two or more hypotheses explain the observations. To speed up the reasoning process, Avrahami-Zilberbrand and Kaminka presented the Feature Decision Tree (FDT), which efficiently maps observations to a plan. In an FDT, nodes correspond to features, and branches correspond to conditions on their values. Since the FDT is actually a special sort of decision tree, it can be built automatically [11]. Another famous model in the automated planning domain is the hierarchical task network (HTN). An HTN recursively decomposes tasks into lower-level components until a plan constituted by a series of low-level actions or primitive tasks is obtained [12]. Although the HTN was proposed for automated planning, it can also describe complex tasks in the recognition problem [13]. The common problem of the event hierarchy, the FDT, and the HTN is that they do not use probability theory to model the uncertainty and dynamics of real-world or simulation systems, and thus they cannot provide probabilities of intentions. Other formalization methods that aim to represent plans in classic planning theory suffer from the same problem.

The well-known Probabilistic Graphical Models (PGMs) use a graph-based representation to compactly encode complex distributions: nodes correspond to variables, and edges correspond to direct probabilistic interactions between them [14]. Since PGMs are very expressive and come with many effective learning and inference algorithms, researchers have applied different sorts of PGMs to model planning or strategy, such as conditional random fields (CRFs) [15], Markov logic networks (MLNs) [16], dynamic Bayesian networks (DBNs) [17], hidden Markov models (HMMs) [18], Markov decision processes (MDPs) [19], and other extensions [20]. Additionally, representations in some plan recognition theories, such as parsing trees, can also be viewed as special cases of directed PGMs [21].

PGMs have been successfully applied to recognize intentions in many domains. For example, the hidden semi-Markov model (HSMM) was applied to estimate the position of an opponent in the game Counter Strike [22]. The parameters of the model were learnt from a dataset consisting of 190 game logs collected in a champion-level competition, and both prediction accuracy error and human similarity error were computed to evaluate the estimation performance. Southey et al. also used the HSMM for recognizing destinations and start points in the RTS game War 3. Besides solving the recognition problem, they further abstracted the map, making inference feasible even when the number of grids is large and observations are partially missing [8]. Zouaoui-Elloumi et al. used the HMM to model the behaviors of ships in a harbor. Their goal was not to predict the destination or position of the object, but to recognize the behavior pattern of a ship as it enters the harbor [23]. Duong et al. proposed a Coxian hidden semi-Markov model (CxHSMM) for recognizing human activities of daily living (ADL) [24]. The CxHSMM modifies the HMM in two aspects: on one hand, it is a special DBN representation of a two-layer HMM with termination variables; on the other hand, it uses a Coxian distribution to model the durations of primitive actions explicitly.

van Kasteren et al. further compared the performance of the CRF, HMM, SMCRF, and HSMM in the ADL domain, using real data collected in a lab [25]. In their experiments the HSMM consistently outperformed the HMM, showing that accurate duration modeling can significantly increase recognition performance. The SMCRF only slightly outperformed the CRF, showing that the CRF is more robust in dealing with violations of the modeling assumptions. Auslander et al. evaluated the performance of HMMs, MLNs, and CRFs for detecting small-boat maritime attacks. The data was obtained from the Trident Warrior exercise (Summer 2010), and the results showed that PGMs outperformed the deployed rule-based approach on these tasks [26].

Ullman et al. proposed a model for inferring social goals from people’s actions, based on inverse planning in multiagent MDPs [27]. In their model, under the assumption that agents are rational, the goal most likely to drive the observed behaviors is estimated. Tastan et al. presented a framework to predict the positions of an opponent [28]. Unlike the work of Hladky, they used a learned MDP as the motion model of the PF. Another contribution was that they defined tactical features, rather than the directly observed data, as the states.

In the inference problem, the destination is regarded as a hidden state. Exact inference algorithms such as the HMM filter can compute the posterior probabilities of destinations, but they usually cost much time and require complete data. Another way is approximate inference, and one of the most widely used algorithms is the PF. Besides the works in [8, 22, 28], Weber et al. estimated the locations of enemy units that had been encountered in Star Craft based on a particle filter. In their work, each single particle, which consists of a class, a weight, and a trajectory, corresponds to one previously encountered enemy unit [29]. Pfeffer et al. used DBNs to represent the tasks of units collaborating to attack targets in an urban environment, and they used a factored PF to reason about team composition and goals [30].

#### 3. Modeling Maneuvering by the SMDM

In this section, we discuss how to model the maneuvering process in RTS games with our SMDM. A simple example is used to explain how the agent plans a path and moves between grids on a grid-based map. Then, the definitions of the SMDM and their corresponding meanings in the maneuvering process are given. Finally, we depict the DBN structure of the SMDM to analyze the relations among its variables.

##### 3.1. Maneuvering in RTS Games

In RTS games, the maneuvering process of an agent consists of two levels: path planning based on the grids and moving between adjacent grids.

*(a) Path Planning*. Whether the agent is controlled by a human or a computer, path planning returns a sequence of nodes from the start point to the destination. But in a dynamic environment, path planning is essentially a sequential decision process: moving to an adjacent grid is a primitive action, and the agent selects an action based on the current destination and situation. Obviously, the decision results may differ even in the same situation, especially when the agent is controlled by a human. Thus, a probabilistic model is needed to describe path planning.
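One simple way to realize such a probabilistic planner is soft-max action selection over the adjacent grids, scored by distance to the destination (a hypothetical sketch for illustration; the soldier’s actual decision model is specified in the experiments):

```python
import math
import random

# Soft-max action selection: score each adjacent grid by its distance
# to the destination, then sample, so that decisions can differ even in
# the same situation (beta controls how greedy the agent is).
def choose_grid(dest, adjacent, beta=2.0, rng=random.Random(0)):
    scores = [-beta * math.hypot(g[0] - dest[0], g[1] - dest[1])
              for g in adjacent]
    m = max(scores)                                  # for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return rng.choices(adjacent, weights=weights)[0]

adjacent = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3)]  # neighbors of cell (2, 2)
nxt = choose_grid(dest=(0, 3), adjacent=adjacent)
```

Grids closer to the destination receive higher probability, but any adjacent grid can be chosen, which captures the variability of human decisions noted above.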

*(b) Moving between Adjacent Grids*. After identifying the target adjacent grid, the agent needs to move from its current position to it. This process is entirely controlled by an engineering mechanism. Although the moving mechanisms may differ between systems, they are deterministic and known to the recognizer. Additionally, since the simulation step is short, the agent usually cannot reach the target grid in one step. In other words, the decision process has the semi-Markov property.

A classic example of the maneuvering process on a grid map in an RTS game is presented in Figure 1. Assume that an agent is now at point A in grid C2 and wants to go to a point in A3. At the path planning level, the agent needs to choose a grid among the five adjacent grids (B1, B2, B3, C1, and C3). In this example, the agent decides to go to grid B2. At the moving level, the agent moves along the line from point A to point B, the center of grid B2. Because a simulation step is a short time, the agent has to compute how long it will take to reach point B at the current speed. Because the position of the agent is a continuous variable, it is very unlikely that the agent arrives exactly at the grid center when a simulation step ends. Thus, the duration of the move is usually computed by

d = ⌊dist(A, B) / (v·Δt)⌋,

where v is the speed, which is constant during the moving process, Δt is the real time of a simulation step, dist(A, B) is the distance between point A and point B, and ⌊·⌋ is the floor operator. In this case, d = 3. After moving for 3 steps from position A, the agent will reach position B and choose the next grid. This moving process will not be interrupted unless the intention changes.
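The floor-based duration computation can be written directly as follows (the coordinates, speed, and step time are assumed values chosen to reproduce the three-step example):

```python
import math

# Duration of one move: the number of whole simulation steps needed to
# reach the next grid center, using the floor formula from the text.
def move_duration(a, b, v, dt):
    """d = floor(dist(a, b) / (v * dt)) for start point a, target point b."""
    dist = math.hypot(b[0] - a[0], b[1] - a[1])
    return math.floor(dist / (v * dt))

# Assumed values: a 3.5-unit move at unit speed with a 1-second step.
d = move_duration(a=(0.0, 0.0), b=(3.5, 0.0), v=1.0, dt=1.0)   # d == 3
```

The flooring means the agent ends each action slightly short of or past the grid center, which is exactly why the engine recomputes the duration for every new move.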