Abstract

With the rapid growth of video games and the increasing number of players, only games with strong policies, actions, and tactics survive. How a game responds to opponent actions is the key issue for popular games. Many algorithms have been proposed to solve this problem, such as Least-Squares Policy Iteration (LSPI) and State-Action-Reward-State-Action (SARSA), but they mainly depend on discrete actions, whereas agents in such settings have to learn from the consequences of continuous actions in order to maximize the total reward over time. In this paper we propose a new algorithm based on LSPI called Least-Squares Continuous Action Policy Iteration (LSCAPI). LSCAPI was implemented and tested on three different games: one board game, 8 Queens, and two real-time strategy (RTS) games, StarCraft Brood War and Glest. The evaluation showed that LSCAPI outperforms LSPI in time, policy learning ability, and effectiveness.

1. Introduction

An agent is anything that can be viewed as perceiving its environment through sensors and acting in that environment through actuators as in Figure 1, while a rational agent is the one that does the right thing [1].

However, agents may have no prior knowledge of what the right or optimal actions are. To learn the best action selection, the agent needs to explore the state-action search space; from the rewards provided by the environment, it can estimate the true expected reward of selecting an action in a state.

Reinforcement learning (RL) is learning what the agent can do and how to map situations to actions in order to maximize a numerical reward signal. Reinforcement learning helps agents discover which actions yield the most reward or punishment by trying them, through trial and error and delayed reward. Reinforcement learning concentrates on finding a balance between exploration of unknown areas and exploitation of current knowledge [2–4].

Batch reinforcement learning (BRL) is a subfield of dynamic programming (DP) [4, 5] based reinforcement learning that has grown immensely in recent years. Batch RL is mainly used where the complete amount of learning experience, usually a set of transitions sampled from the system, is fixed and given a priori. The concern of the learning system is then to derive an optimal policy out of the given batch of samples [6–8]. Batch reinforcement learning algorithms therefore aim to achieve the utmost data efficiency by saving experience data and making aggregate batch updates to the learned policy [7, 8].
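To make the batch setting concrete, the sketch below (Python, not from the paper) shows the basic structure such a learner works from: transitions are stored and the policy is derived from the whole batch in one aggregate update. The learner callback is a placeholder for any batch method such as FQI or LSPI.

from collections import namedtuple

# One sampled transition: the unit of experience a batch RL learner works from.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class TransitionBatch:
    """Fixed (or growing) set of transitions from which the policy is derived."""
    def __init__(self):
        self.data = []

    def add(self, state, action, reward, next_state):
        self.data.append(Transition(state, action, reward, next_state))

    def update_policy(self, learner):
        # One aggregate update over the whole batch instead of per-sample updates.
        return learner(self.data)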

Figure 2 classifies batch RL algorithms, from an interaction perspective, into offline and online algorithms. The offline algorithms, also known as pure batch algorithms, work offline on a fixed set of transition samples; examples are Fitted Q Iteration (FQI) [9, 10], Kernel-Based Approximate Dynamic Programming (KADP) [11, 12], and Least-Squares Policy Iteration (LSPI) [13–15]. The online algorithms comprise pure online algorithms, semibatch algorithms, and growing batch algorithms such as Neural Fitted Q Iteration (NFQ) [16], which makes an aggregate update over several transitions and stores and reuses the experience after making this update.

Different problems have been solved using BRL, relying on the ability of RL agents to learn without expert supervision. Game playing is one of the major areas where BRL offers an effective way to determine the best policy to apply in a game. This often depends on a number of different factors, as the space of possible actions is too large for an agent to reason about directly while still meeting real-time constraints. Covering this many states with a standard rule-based approach would mean specifying a large number of hard-coded rules. BRL removes the need to manually specify rules, as agents learn simply by playing the game. For example, in backgammon, an agent can be trained by playing against a human player or even against another RL agent [7].

As mentioned above, obtaining the optimal policy [17] (the decision-making function mapping states to actions) that the game can apply is a major key in evaluating game performance through game-human interaction. The problem is that games, especially those with a large space of possible actions such as real-time strategy (RTS) games and board games, struggle to find the optimal policy and thus the best action. Almost all existing algorithms handle only discrete states and discrete actions, or continuous states and discrete actions, while agents in such settings have to learn from the consequences of continuous actions in order to maximize the total reward over time. We therefore propose an algorithm based on LSPI that considers continuous actions, called Least-Squares Continuous Actions Policy Iteration (LSCAPI). LSCAPI was applied and tested on three games: StarCraft Brood War, Glest, and 8 Queens.

This paper is organized as follows. Section 2 covers a brief review of video games, concentrating on RTS games and board games. In Section 3, the proposed Least-Squares Continuous Actions Policy Iteration (LSCAPI) algorithm is described in detail. Section 4 comprises the testing of LSCAPI on some real cases. Section 5 describes the implementation of LSCAPI, and Section 6 presents the simulation results and discussion. Finally, Section 7 outlines our conclusions.

2. Background

2.1. Board Games

Board games are among the best known and most played games over the centuries. A board game involves pieces moved or placed on a board according to a set of rules. Moves in board games may depend on pure strategy, on pure chance such as rolling dice, or on both; in all cases expert opponents can achieve their aims [20].

Earlier board games were represented as a battle between two armies with no rules except points, while modern board games basically rely on defeating opponent players in terms of counters, winning position, or points entitlement, thus relying on rules. Many board games [21] such as Tic-Tac-Toe [22], chess, and Chinese chess have a long history in machine learning research. Trinh et al. in [23] discussed the application of temporal-difference learning in training a neural network to play a scaled-down version of Chinese chess.

Runarsson and Lucas in [24] studied and compared temporal difference learning (TDL) using the self-play gradient-descent method and coevolutionary learning using an evolution strategy, for acquiring position evaluation on small Go boards. The two approaches were compared with the hope of gaining greater insight into the problem of searching for optimal strategies.

Wiering et al. in [25] used reinforcement learning algorithms that learn a game position evaluation function for the game of backgammon. They examined three different training methods: learning by self-play, learning by playing against an expert program, and learning from observing experts play against themselves.

Block et al. in [26] proposed a chess engine showing that reinforcement learning combined with classification of the board state leads to a notable improvement compared with engines that use only reinforcement learning, such as Knight-Cap.

Skoulakis and Lagoudakis in [27] demonstrated the efficiency of the LSPI agent over the TD agent in the classical board game of Othello/Reversi. They presented a learning approach based on LSPI algorithm that focuses on learning a state-action evaluation function. The key advantage of the proposed approach is that the agent can make batch updates to the evaluation function with any collection of samples, can utilize samples from past games, and can make updates that do not depend on the current evaluation function since there is no bootstrapping.

Finally, Szubert and Jaskowski in [28] employed three variants of temporal difference learning to acquire action value, state value, and after-state value functions for evaluating player moves in the puzzle game 2048. To represent these functions they adopted n-tuple networks, which have recently been successfully applied to the Othello and Connect 4 board games.

2.1.1. 8 Queens

The 8 Queens puzzle is an instance of the N-Queens problem in which eight chess queens are placed on an 8 × 8 chessboard [29]. No queen may attack any other, so no two queens share the same row, column, or diagonal, as in Figure 3.
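As an illustration of this constraint, a minimal Python check (not part of the original paper) for whether two queens attack each other:

def attacks(q1, q2):
    """True if two queens, given as (row, column) pairs, attack each other:
    same row, same column, or same diagonal."""
    r1, c1 = q1
    r2, c2 = q2
    return r1 == r2 or c1 == c2 or abs(r1 - r2) == abs(c1 - c2)

# Example: queens at (8, 3) and (7, 5) do not attack each other.
print(attacks((8, 3), (7, 5)))  # False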

Many algorithms have been proposed for solving 8 Queens, as in Lim and Son [30], who applied Q-learning as a problem-solving algorithm for N-Queens and compared it with traditional existing methods for solving the N-Queens problem.

2.2. Real-Time Strategy Games

Real-time strategy (RTS) games are a subgenre of strategy games where players need to build an economy and military power in order to defeat their opponents. In RTS games players race and struggle against enemy factions by harvesting resources scattered over the terrain, producing buildings and units, and helping one another in order to set up economies, improve their technological level, and win battles until their enemies are extinct [31–34]. The better the balance among economy, technology, and army, the higher the chances of winning an RTS game [34].

Marthi et al. in [35] applied hierarchical reinforcement learning in a limited RTS domain. This approach used reinforcement learning augmented with prior knowledge about the high-level structure of behavior, constraining the possibilities of the learning agent and thus greatly reducing the search space.

Gusmao in [36] considered the problem of effective and automated decision-making in modern real-time strategy games through the use of reinforcement learning techniques. The researcher proposed a stable, model-based Monte Carlo method that assumes models are imperfect, reducing their influence in the decision-making process. Its effectiveness is further improved by including a novel online search procedure in the control policy.

Leece and Jhala in [37] presented a simplified game that mimics spatial reasoning aspects of more complex games, while removing other complexities through analyzing the effectiveness of classical reinforcement learning for spatial management in order to build a detailed evaluative standard across a broad set of opponent strategies. The authors also demonstrated the potentiality of knowledge transfer to more complex games with similar components.

In 2015, Sethy et al. [38] proposed a reinforcement learning algorithm based on Q-learning and SARSA with a generalized reward function to train agents. The proposed model was evaluated using the Battle City RTS game, proving superior to the state of the art in enhancing agent learning ability.

2.2.1. Glest

Glest is an open source 3D real-time strategy game where the player can command armies of two different factions, Tech and Magic. Tech mainly includes warriors and mechanical devices, while Magic relies on mages and creatures summoned to the battlefield [34, 39]. In Glest warriors fight with weapons and mangonels, while magicians cast spells and magic forces, as shown in Figure 4. The final goal for both factions is to wipe out all enemy units, finish the episode, or die.

Glest actions are structured in three levels. The first consists of primitive low-level actions, such as modifications of basic variables that describe low-level microstates. In the second level, primitive actions are grouped to form a more abstract task or rule. There are thirteen different rules in Glest, such as Worker Harvest, Return Base, Massive Attack, Add Tasks, Produce Resource Producer, Build One Farm, Produce, Build, Upgrade, and Repair [33, 39, 40].

The third level represents a strategic tactic level, grouping second-level action-rules with respect to the game timer. Each second-level action-rule is associated with a certain time interval. Depending on the timer value modulo the rule's interval, a certain rule is picked up and applied by the game tactic within that interval [40].
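A rough sketch of this timer-based rule selection is shown below; the rule names follow the list above, but the interval values are purely hypothetical placeholders, since the actual Glest intervals are engine-specific.

# Hypothetical intervals (in game time units); the real Glest engine defines
# its own values per rule.
RULE_INTERVALS = {
    "WorkerHarvest": 5,
    "ReturnBase": 10,
    "MassiveAttack": 120,
    "Produce": 15,
}

def rules_due(timer):
    """Second-level rules whose interval divides the current timer value,
    i.e. the rules the tactic layer would consider at this tick."""
    return [rule for rule, interval in RULE_INTERVALS.items()
            if timer % interval == 0]

print(rules_due(120))  # ['WorkerHarvest', 'ReturnBase', 'MassiveAttack', 'Produce']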

Glest has four main variables: Kills, Units, Resources, and Timer. Each state is characterized by the values of the four variables, where "Kills" refers to the number of kills that the player has achieved, "Units" counts the units the player has produced, "Resources" are the resources the player has harvested, and finally "Timer" counts the game time units.

Every variable is connected with some related actions except for “Timer”; its only role is to count time. “Units” is connected with Build One Farm, Produce, Build, Upgrade, Expand, and Repair actions, while “Kills” is connected with Scout Patrol, Return Base, Massive Attack, and Add Tasks. Finally “Resources” are connected with Worker Harvest, Refresh Harvester, and Produce Resource Producer.
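The state and the variable-to-action grouping described above can be summarized in a small sketch (Python, illustrative only; the names follow the rules listed in the text):

from dataclasses import dataclass

@dataclass
class GlestState:
    kills: int      # number of kills the player has achieved
    units: int      # units the player has produced
    resources: int  # resources the player has harvested
    timer: int      # game time units (not tied to any action)

# Which second-level rules relate to each state variable, as described above.
VARIABLE_ACTIONS = {
    "Units": ["BuildOneFarm", "Produce", "Build", "Upgrade", "Expand", "Repair"],
    "Kills": ["ScoutPatrol", "ReturnBase", "MassiveAttack", "AddTasks"],
    "Resources": ["WorkerHarvest", "RefreshHarvester", "ProduceResourceProducer"],
}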

Glest has recently attracted the attention of researchers. In 2009, Dimitriadis [33] investigated the design of reinforcement learning autonomous agents that learn to play the Glest RTS game through interaction with the game itself, using the well-known SARSA and LSPI algorithms to learn how to choose among different high-level strategies in order to win against the embedded AI opponent. Aref et al. in 2012 proposed a model called REAL-SEE for exchanging opponent experiences among real-time strategy game engines of Glest [34].

2.2.2. StarCraft Brood War

StarCraft Brood War, shown in Figure 5, is an expansion pack released in 1998 for the award-winning real-time strategy game StarCraft. Brood War gameplay remains fundamentally unchanged from that of StarCraft, but it adds new difficult campaigns, map tilesets, music, extra units for each race, upgraded advancements, and less practical rushing tactics, and missions are no longer entirely linear [41, 42]. In Brood War a single agent can make high-level strategic decisions, such as attacking an opponent or creating a secondary base, and mid-level decisions, such as deciding which buildings to build [42, 43].

Brood War has gained popularity as a test bed for research as it exhibits all of the issues of most interest to AI research in RTS environments, including pathfinding, planning, spatial and temporal reasoning, and opponent modeling. Brood War also has a huge number of online game replays that have been used as the case base for case-based reasoning (CBR) systems to analyze opponent strategies [42, 43].

In 2012, Wender and Watson introduced an evaluation of the suitability of Q-learning and SARSA reinforcement learning algorithms for the task of micromanaging combat units in the StarCraft Brood War RTS game [44], while in 2014 Siebra and Neto proposed a modeling approach for the use of SARSA in enabling computational agents to evolve their combat behavior according to the actions of opponents, in order to obtain better results in later battles [45].

3. Least-Squares Continuous Actions Policy Iteration (LSCAPI)

Reinforcement learning and video games have a long, mutually beneficial history, as games are fruitful fields for testing reinforcement learning algorithms. In any application of reinforcement learning, the choice of algorithm is just one of many factors that determine success or failure, and not even the most significant one. The choice of representation, formalization, the encoding of domain knowledge, and the setting of parameters can all have great influence.

In this research we propose an offline pure batch algorithm based on the Least-Squares Policy Iteration algorithm. Least-Squares Policy Iteration is a relatively new, model-free, approximate policy iteration algorithm for control. It is an offline, off-policy batch training method that exhibits good sample efficiency and offers stability of approximation [13–15].

Least-Squares Policy Iteration evaluates policies using the least-squares temporal difference for Q-functions (LSTD-Q) and performs exact policy improvements. To find the Q-function of the current policy, it uses a batch of transition samples with LSTD-Q to compute the parameter vector. Then an improved policy under this Q-function is determined, the Q-function of the improved policy is found, and so on [5]. Most releases of LSPI use discrete actions, although for control problems continuous actions are needed: as systems need to be stabilized, any discrete action may cause unneeded chattering of the control action.
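For reference, a minimal numpy sketch of this evaluation-improvement loop is given below. It assumes a generic feature function phi(s, a) and a finite set of candidate actions for the greedy improvement; it is a textbook LSPI skeleton, not the authors' implementation.

import numpy as np

def lstdq(samples, phi, policy, gamma, n_features):
    """One LSTD-Q sweep: estimate the weight vector of the Q-function of
    `policy` from a batch of (s, a, r, s_next) samples.
    phi(s, a) must return a feature vector of length n_features."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # features of the policy's next action
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.pinv(A) @ b               # weights of the approximate Q-function

def lspi(samples, phi, actions, gamma, n_features, iterations=10):
    """Approximate policy iteration: alternate LSTD-Q evaluation with a
    greedy improvement over a finite candidate-action set."""
    theta = np.zeros(n_features)
    for _ in range(iterations):
        policy = lambda s, th=theta: max(actions, key=lambda a: phi(s, a) @ th)
        theta = lstdq(samples, phi, policy, gamma, n_features)
    return theta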

The idea of LSCAPI, as described in Algorithm 1, is to address the discrete action problem by comparing candidate actions and applying the one with the largest Q-value. First, scalar control actions are considered, so that the actions can be treated as a continuous action chain.

Input: discount factor, batch of transition samples (state, action, reward, next state), continuous chain of candidate actions
(1) initialize the policy
(2) measure the initial state
(3) for each step do
(4)  evaluate the candidate actions in the continuous action chain with the orthogonal polynomial Ψ and keep the fitted actions
(5)  apply the fitted action with the largest Q-value, measure the next state, and collect the reward
(6)  start LSTD-Q policy evaluation (initialize the evaluation estimates)
(7)–(9)  update the LSTD-Q estimates with the new transition sample
(10)  finalize policy evaluation (compute the Q-function parameter vector)
(11)  policy improvement
(12)  until the policy is satisfactory
(13)  store the difference between the previous and the newly selected action, as in (1)
(14) end for
Output: the learned policy

A new parameter for evaluating actions, in addition to the state parameters, is proposed: the orthogonal polynomial Ψ. Ψ takes one of two values depending on whether or not the action is selected for application in the game at the current step. A fitted action is evaluated by 1, in which case the engine returns the true value that is used to compute the policy. With the other value the action is discarded and a new action is selected from the action space.

As soon as the fitted action is selected, the LSTD-Q [13] function evaluates the policy by computing the parameter vector from the input transitions using the selected fitted action.

Finally, the difference between the previously selected action and the newly selected action is calculated and stored with the new action as in (1). This step optimizes the storage space needed for actions, since only the difference between actions is stored rather than the actions themselves:

Δu_k = u_k − u_{k−1},  (1)

where u_k is the newly selected action and u_{k−1} the previously selected action.
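The selection mechanism described in this section can be summarized in the short sketch below (Python, illustrative only). The helpers q_value and is_fitted stand in for the learned Q-function and the Ψ indicator; they are assumptions here rather than the paper's actual code.

def select_action(state, action_chain, q_value, is_fitted, prev_action):
    """One LSCAPI-style action-selection step (a sketch, not the authors' code):
    keep the actions whose indicator says 'fitted' (Psi = 1), pick the fitted
    action with the largest Q-value, and store only its difference from the
    previously applied action, as in (1)."""
    fitted = [u for u in action_chain if is_fitted(state, u)]   # actions with Psi = 1
    if not fitted:
        return None, None            # no fitted action: caller resamples from the action space
    best = max(fitted, key=lambda u: q_value(state, u))         # largest Q-value wins
    delta = best - prev_action       # only the difference is stored, as in (1)
    return best, delta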

4. Testing LSCAPI on Real Cases

To explain it further, the LSCAPI algorithm was tested on the 8 Queens, Glest, and StarCraft Brood War games and implemented in C# using Microsoft Visual Studio 2013.

4.1. 8 Queens

In this section the proposed algorithm is used to solve the 8 Queens problem, tracing each step separately. But first there are some assumptions to clarify:
(i) The states in the game are eight states (the eight rows).
(ii) The actions are continuous, as every state has a chain of actions associated with it.
(iii) Considering an 8 × 8 board, a matrix represents the probabilities of the actions that can be taken for the 1st queen on the board.
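To make the notion of a per-row action chain concrete, the sketch below (illustrative Python, not the authors' code) derives the candidate columns for the next row given the queens already placed; the two example calls reproduce the counts used in the worked example that follows.

def candidate_columns(placed, next_row, n=8):
    """Columns in `next_row` not attacked by any already placed queen
    (given as (row, column) pairs): the action chain for the next state."""
    def attacked(col):
        return any(r == next_row or c == col or abs(r - next_row) == abs(c - col)
                   for r, c in placed)
    return [col for col in range(1, n + 1) if not attacked(col)]

print(candidate_columns([(8, 3)], next_row=7))          # [1, 5, 6, 7, 8] -> five actions
print(candidate_columns([(8, 3), (7, 5)], next_row=6))  # [2, 7, 8] -> three actions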

(i) 1st Queen. As clarified in Figure 6, the first queen is set randomly on the first row of the board, where the initial state indicates that the 1st queen is in row 8 and column 3, that is, position (8, 3). This [row, column] position of the 1st queen indicates that the action for placing the 1st queen in column 3 is taken.

(ii) 2nd Queen. After placing the 1st queen, the action probabilities for the second queen are reduced by eliminating the probabilities that other queens could lie on the same row, column, or diagonals. In the current state a matrix of continuous actions is initiated, holding the five available action probabilities that the second queen could take in the next row.

Through applying LSCAPI, the value of Ψ for an action is set to the discard value if the action increases the probability of having pairs of attacking queens, and it is set to 1 if the action provides more possible actions to take. Based on LSCAPI, the action with the higher Q-function value is selected. The resulting state means that the 2nd queen is in row 7 and column 5, so the action with the largest Q-function value was to place the queen at position (7, 5), as shown in Figure 7.

(iii) 3rd Queen. After placing the 2nd queen, we move to the next state, the third row, with 26 action probabilities for placing the 3rd queen on the board and only 3 action probabilities in the continuous actions matrix. The steps of state 2 are repeated to select the action with the higher Q-function value, indicating the action that generates more possible positions on the board for placing the 4th queen. As in Figure 8, the 3rd queen is placed at position (6, 2), where the state means that the 3rd queen is in row 6 and column 2.

The same is done for the next five states to obtain the game policy and solve the problem by placing the eight queens in their correct places without errors, as shown in Figures 9, 10, 11, 12, and 13.

Table 1 describes some of the other solutions generated by LSPI and LSCAPI for the 8 Queens game.

The performance of any action in the chain is measured based on the number of nonattacking queen pairs. The minimum is zero, where all queens attack each other, and the maximum is 28, where no queens attack each other. The performance is calculated based on (2) to determine the fitness of an action compared to the alternative actions.

Table 2 presents the performance evaluation of the actions taken by LSCAPI in case 1. Every row has a chain of fitted actions evaluated by (2). Note that if two fitted actions have the same value, the fitted one is chosen depending on the previously chosen actions. Table 3 presents the performance evaluation of the discrete actions taken by LSPI.
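A simple way to compute the measure described above is sketched below (Python, illustrative). It counts nonattacking queen pairs, from 0 up to the maximum of 28; how this count is normalized into the fitness of (2) is left to the paper's equation. The example placement extends the three positions from the worked example above into one valid solution chosen purely for illustration.

from itertools import combinations

def non_attacking_pairs(queens):
    """Number of nonattacking queen pairs for a board given as a list of
    (row, column) positions; 0 when all queens attack each other, 28 for a
    valid 8 Queens solution."""
    def attack(a, b):
        return a[0] == b[0] or a[1] == b[1] or abs(a[0] - b[0]) == abs(a[1] - b[1])
    return sum(1 for a, b in combinations(queens, 2) if not attack(a, b))

# Illustrative valid placement (first three queens as in the worked example).
solution = [(8, 3), (7, 5), (6, 2), (5, 8), (4, 1), (3, 7), (2, 4), (1, 6)]
print(non_attacking_pairs(solution))  # 28 (no attacking pairs)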

It can be seen that the case performance achieved by LSCAPI is higher than that achieved by LSPI, from which we can also conclude that the actions selected by LSCAPI are much better than those of LSPI, finally producing better game performance.

4.2. Glest

The evaluation of LSCAPI in Glest is somewhat different from 8 Queens, as in Glest LSCAPI works on improving the game engine's policy against the player. In evaluating LSCAPI in Glest we concentrated on three parameters, ΔKills, ΔUnits, and ΔResources, as in Table 4, where ΔKills, ΔUnits, and ΔResources represent the kills, units, and resources gained or lost as a result of moving from one state to the next (executing an action).

In the LSCAPI algorithm the actions for a state are obtained not as discrete actions but as a continuous chain of actions. Every action in a chain is assigned the polynomial value (1 or not), which is used not as a calculation in itself but only for selecting the fittest action in the state. Using a continuous chain improves the efficiency of the game and speeds it up. LSCAPI orders the most fitted action first, then the next, and so on, based on priority. This saves the time needed to select the fitted action and enhances performance. Fitted actions are evaluated by the reward resulting from them.

In Table 4 a sample case of the Glest game called Three Towers is introduced. LSCAPI arranges the available actions in a priority continuous chain. This method helps to obtain the actions with the higher performance in less time.

In this case the available actions were Add Tasks (AT), Build (B), Massive Attack (MA), Produce (P), Produce Resource Producer (PRP), and Worker Harvest (WH). These actions are arranged based on their priority: the action with the highest priority comes at the beginning of the chain, and so on, until the set of fitted actions used in the state is complete.

In the game's state 1 the priority continuous actions chain is (PRP, MA, B, AT, P); only five of the six input actions are used in this case, with the reward shown. In state 2 only four of them are used, while for the player the priority chain of actions includes all of the input actions in state 1 and only five of them in state 2.

Meanwhile, in the Tower of Souls case, as in Table 5, the available actions were Produce Resource Producer, Build One Farm, Refresh Harvest, Massive Attack, Return Base, and Upgrade.

In the game's state 1 the priority continuous actions chain is (MA, PRP, RH, RB, BOF); only five of the six input actions are used in this case, with the reward shown. In state 2 only four of them are used, while for the player the priority chain of actions, as in the table, includes all of the input actions in state 1 and only five of them in state 2. Finally, the reward function used to evaluate the action selection mechanism is calculated from ΔKills, ΔUnits, and ΔResources as in (3), where these terms are the differences in kills, units, and resources between two consecutive states. Table 6 demonstrates an example of calculating the reward function, based on the values obtained for the number of kills, units, and resources in each two consecutive states as in (3). The reward values support the superiority of LSCAPI over LSPI with a reasonable difference.
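A hedged sketch of such a reward computation is given below (Python, not the paper's exact Eq. (3)); the unit weights are placeholders for whatever combination the original equation uses.

def glest_reward(prev, curr, w_kills=1.0, w_units=1.0, w_resources=1.0):
    """Reward for moving between two states, each given as a
    (kills, units, resources) tuple: a weighted sum of the gained (or lost)
    kills, units, and resources. The weights are placeholders for the
    combination defined by the paper's reward equation."""
    deltas = [c - p for p, c in zip(prev, curr)]
    weights = [w_kills, w_units, w_resources]
    return sum(w * d for w, d in zip(weights, deltas))

# Example: 3 kills gained, 2 units gained, 500 resources harvested.
print(glest_reward((10, 40, 2000), (13, 42, 2500)))  # 505.0 with unit weights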

4.3. StarCraft Brood War

StarCraft Brood War uses the same case form as Glest, so we concentrate only on the calculations of the reward function. The reward function of Brood War represents the difference between the damage done by the agent and the damage received by the agent in each two consecutive states, and it is used to evaluate the action selection mechanism. Table 7 demonstrates an example of calculating the agent reward when applying the LSCAPI and the SARSA (already applied in the StarCraft Brood War engine) action selection mechanisms, through the Protoss versus Zerg case. This case included four ground units, Probe, Dragoon, Drone, and Larva, where the Drone and Larva units belong to Zerg, which represents the game agent, while the Probe and Dragoon units belong to Protoss, which represents the game enemy.
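A minimal sketch of this damage-difference reward is shown below (Python, illustrative). The optional normalization constant is an assumption added here to keep rewards in a small range, as reported for Brood War; the paper's exact formula may differ.

def broodwar_reward(damage_dealt, damage_received, scale=None):
    """Reward as the difference between damage done and damage received
    between two consecutive states; `scale` is an assumed normalization
    constant (e.g. maximum possible damage in the interval)."""
    diff = damage_dealt - damage_received
    return diff / scale if scale else diff

print(broodwar_reward(damage_dealt=60, damage_received=20, scale=100))  # 0.4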

We can notice that the reward values in Brood War fall within a much smaller range than in Glest. Also, from the table it is clear that the reward the agent obtains from applying the actions supported by LSCAPI is much higher than the agent reward in the case of SARSA.

5. LSCAPI Implementation

LSCAPI was implemented in C# using Microsoft Visual Studio 2013 and packaged into the Glest and StarCraft Brood War game engines, while the 8 Queens application was a fully standalone application based on LSPI and LSCAPI.

6. Results and Discussion

6.1. 8 Queens

The overall performance of five different solutions generated by LSCAPI and LSPI was calculated in the implementation and is shown in Figure 14, from which we can see that LSCAPI generated actions and solutions that achieved better performance than those of LSPI, which also indicates a better policy through the game.

Meanwhile, Figure 15 presents the time taken by each of these solutions under LSPI and LSCAPI. It can be seen from Figure 15 that LSCAPI not only generated policies that achieved higher performance but also generated them in less time; for example, in one LSCAPI case the performance reaches 0.905 in 0.269 seconds, while in LSPI case 4 the performance reached only 0.28 in 0.88 seconds, nearly triple the time needed to generate the LSCAPI solution.

Table 8 demonstrates the number of solutions generated by LSPI and LSCAPI in 10 different simulation attempts and the time taken by each attempt. In attempt 4, LSCAPI generated 34 different arrangements of the 8 Queens in 9.16 seconds, while LSPI generated only 10 solutions in 8.8 seconds.

Finally, we can confirm that LSCAPI achieved a clear superiority over LSPI, generating more accurate solutions in less time, which leads to better policies in game solving.

6.2. Glest

To evaluate Glest, the agent reward in the cases under consideration, generated by LSPI and LSCAPI, was calculated based on (3) and listed in Table 9, demonstrating that LSCAPI achieved much higher rewards for agents after performing the actions it selected, which leads to better policies when facing opponents. For example, LSCAPI achieved 15000 in case 8 while the actions selected by LSPI achieved only 8500, nearly half.

As we can also see from Table 9, LSCAPI cases with higher reward are associated with a smaller number of used actions. This means that the reward has an inverse relationship with the number of used actions: when the number of actions decreases, the reward increases, and vice versa.

Finally, Figure 16 presents the game army score achieved by LSCAPI policy learning against the score achieved by learning the LSPI policy in the Duel scenario, a medium difficulty level scenario. LSCAPI and LSPI policy scores were evaluated over 15 test games, where each policy test game was played after every 10 games of learning.

Figure 16 shows that the game army using the LSCAPI policy achieved a much higher score than that achieved with LSPI, which means that LSCAPI really helps the game agents to efficiently face opponents and defeat them.

6.3. StarCraft Brood War

To evaluate Brood War, the agent reward in the cases under consideration, generated by LSCAPI and SARSA, was calculated and listed in Table 10. The reward values in Table 10 illustrate that LSCAPI is superior to SARSA with respect to agent rewards, leading to better policies when facing opponents, as in case 8, where LSCAPI achieved an agent reward value of 0.75 while the SARSA action achieved only 0.44.

Finally, Figure 17 presents the game army score achieved by LSCAPI policy learning against the score achieved by SARSA in the Terran 1 scenario, a medium difficulty level scenario. LSCAPI and SARSA policy scores were evaluated over 15 test games, where each policy test game was played after every 10 games of learning.

Figure 17 illustrates that LSCAPI achieved much higher scores over the 15 test games than SARSA. We can also notice that the SARSA scores increase at a decreasing rate and finally settle at a fixed value regardless of the number of trials. On the other side, LSCAPI scores grow at an increasing rate, indicating that the learning rate is increasing as the agent is rewarded more often as a result of better action selection. From all of the foregoing, we can confirm that LSCAPI offers real help to game engines in easily and efficiently facing opponents and defeating them.

7. Conclusions and Future Work

Mapping from states to actions is the basic function by which a game engine faces and reacts to an opponent; this mapping is known as the game policy. In this paper, we studied the impact of batch reinforcement learning on enhancing game policy and proposed a new algorithm named Least-Squares Continuous Actions Policy Iteration (LSCAPI). The LSCAPI algorithm relies on LSPI and handles continuous actions through a tradeoff among the available actions, electing the action that scores the higher reward for the game agent.

LSCAPI was tested on two different types of games: a board game, represented by 8 Queens, and RTS games, represented by the open source games Glest and StarCraft Brood War. The proposed algorithm was evaluated based on agent reward values, scores, time, and the number of generated solutions. The evaluation results indicated that LSCAPI achieves better performance, time, policy, and agent learning ability than the original LSPI.

In the future we plan to test LSCAPI on more complicated games such as chess and poker to examine its impact on game policy, especially since the nonplaying character in these two games relies heavily on the game policy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.