Research Article  Open Access
Enhancing Video Games Policy Based on Least-Squares Continuous Action Policy Iteration: Case Study on StarCraft Brood War and Glest RTS Games and the 8 Queens Board Game
Abstract
With the recent rapid growth of video games and the increasing number of players, only a challenging game with strong policy, actions, and tactics survives. How the game responds to opponent actions is the key issue for popular games. Many algorithms have been proposed to solve this problem, such as Least-Squares Policy Iteration (LSPI) and State-Action-Reward-State-Action (SARSA), but they mainly depend on discrete actions, while agents in such a setting have to learn from the consequences of their continuous actions in order to maximize the total reward over time. In this paper we therefore propose a new algorithm based on LSPI called Least-Squares Continuous Action Policy Iteration (LSCAPI). LSCAPI was implemented and tested on three different games: one board game, the 8 Queens, and two real-time strategy (RTS) games, StarCraft Brood War and Glest. The evaluation showed that LSCAPI outperforms LSPI in time, policy learning ability, and effectiveness.
1. Introduction
An agent is anything that can be viewed as perceiving its environment through sensors and acting in that environment through actuators as in Figure 1, while a rational agent is the one that does the right thing [1].
However, agents may have no prior knowledge of what the right or optimal actions are. To learn the best action selection, the agent needs to explore the state-action search space and, from the rewards provided by the environment, calculate the true expected reward of selecting an action from a state.
Reinforcement learning (RL) is learning what the agent can do and how to map situations to actions in order to maximize a numerical reward signal. Reinforcement learning helps agents discover which actions yield the most reward and the most punishment by trying them through trial-and-error and delayed reward. Reinforcement learning concentrates on finding a balance between exploration of unknown areas and exploitation of current knowledge [2–4].
Batch reinforcement learning (BRL) is a subfield of dynamic programming (DP) [4, 5] based reinforcement learning that has grown immensely in recent years. Batch RL is mainly used where the complete amount of learning experience, usually a set of transitions sampled from the system, is fixed and given a priori. The concern of the learning system is then to derive an optimal policy from the given batch of samples [6–8]. Batch reinforcement learning algorithms thus aim to achieve the utmost data efficiency by saving experience data and making an aggregate batch of updates to the learned policy [7, 8].
Figure 2 classifies batch RL algorithms, from the interaction perspective, into offline and online algorithms. The offline algorithms, also known as pure batch algorithms, work offline on a fixed set of transition samples; examples are Fitted Q Iteration (FQI) [9, 10], Kernel-Based Approximate Dynamic Programming (KADP) [11, 12], and Least-Squares Policy Iteration (LSPI) [13–15]. Online algorithms comprise pure online algorithms, semi-batch algorithms, and growing batch algorithms such as Neural Fitted Q Iteration (NFQ) [16], which makes an aggregate update for several transitions and stores and reuses the experiences after making this update.
Different problems have been solved using BRL, relying on the ability of RL agents to learn without expert supervision. Game playing is one of the major areas where BRL offers a solution for determining the best policy to be applied in a game. This often depends on a number of different factors, as the space of possible actions is too large for an agent to reason about directly while still meeting real-time constraints. Covering this many states with a standard rule-based approach would mean specifying a large number of hard-coded rules. BRL removes the need to specify rules manually, as agents learn simply by playing the game. For example, in backgammon an agent can be trained by playing against a human player or even another RL agent [7].
As mentioned, obtaining the optimal policy [17] (the decision-making function that maps states to actions) that a game can apply is key to evaluating game performance through game-human interaction. The problem is that all games, especially games with a large space of possible actions such as real-time strategy (RTS) games and board games, struggle to find the optimal policy and thus the best action. Almost all algorithms account only for discrete states with discrete actions or continuous states with discrete actions, while agents in such a setting have to learn from the consequences of their continuous actions in order to maximize the total reward over time. In this paper we therefore propose an algorithm based on LSPI that considers continuous actions, which we call Least-Squares Continuous Actions Policy Iteration (LSCAPI). LSCAPI was applied and tested on three games: StarCraft Brood War, Glest, and 8 Queens.
This paper is organized as follows. Section 2 gives a brief review of video games, concentrating on RTS games and board games. In Section 3, the proposed Least-Squares Continuous Actions Policy Iteration (LSCAPI) algorithm is described in detail. Section 4 covers the testing of LSCAPI on some real cases. Section 5 describes the implementation of LSCAPI, and Section 6 presents the simulation results and discussion. Finally, Section 7 outlines our conclusions.
2. Background
2.1. Board Games
Board games are among the best known and most played games over the centuries. A board game is a game that involves pieces moved or placed on a board according to a set of rules. Moves in board games may depend on pure strategy, on pure chance such as rolling dice, or on both; in all cases expert opponents can achieve their aims [20].
Earlier board games were represented as a battle between two armies with no rules except points, while modern board games rely on rules and are basically about defeating opponent players in terms of counters, winning position, or points. Many board games [21] such as Tic-Tac-Toe [22], chess, and Chinese chess (checker) have a long history in machine learning research. Trinh et al. in [23] discussed the application of temporal-difference learning in training a neural network to play a scaled-down version of Chinese chess.
Runarsson and Lucas in [24] compared temporal difference learning (TDL) using the self-play gradient-descent method with coevolutionary learning using an evolution strategy for acquiring position evaluation for small Go boards. The two approaches were compared with the hope of gaining greater insight into the problem of searching for optimal strategies.
Wiering et al. in [25] used reinforcement learning algorithms that can learn a game position evaluation function for backgammon. They examined three different methods for training: learning by self-play, learning by playing against an expert program, and learning from viewing experts play against themselves.
Block et al. in [26] proposed a chess engine showing that reinforcement learning in combination with the classification of board states leads to a notable improvement compared with other engines that use only reinforcement learning, such as KnightCap.
Skoulakis and Lagoudakis in [27] demonstrated the efficiency of the LSPI agent over the TD agent in the classical board game of Othello/Reversi. They presented a learning approach based on the LSPI algorithm that focuses on learning a state-action evaluation function. The key advantage of the proposed approach is that the agent can make batch updates to the evaluation function with any collection of samples, can utilize samples from past games, and can make updates that do not depend on the current evaluation function since there is no bootstrapping.
Finally, Szubert and Jaskowski in [28] employed three variants of temporal difference learning to acquire action value, state value, and afterstate value functions for evaluating player moves in the puzzle game 2048. To represent these functions they adopted n-tuple networks, which have recently been successfully applied to the Othello and Connect 4 board games.
2.1.1. 8 Queens
The 8 Queens puzzle is an instance of the N-Queens problem in which eight chess queens are placed on an 8 × 8 chessboard [29]. No queen may attack any of the others, so no two queens share the same row, column, or diagonal, as in Figure 3.
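The non-attacking constraint can be checked mechanically. The following is a minimal sketch, not part of the original implementation; it assumes a placement is encoded as a list `cols` where `cols[r]` is the column of the queen in row `r` (an encoding chosen here for illustration), so that the row constraint holds by construction.

```python
def is_solution(cols):
    """Return True if no two queens share a column or a diagonal.

    cols[r] is the column of the queen in row r, so rows are distinct
    by construction (illustrative encoding, not taken from the paper).
    """
    n = len(cols)
    columns = set(cols)
    # Queens on the same "\" diagonal share r - c; on the same "/" diagonal, r + c.
    main_diagonals = {r - c for r, c in enumerate(cols)}
    anti_diagonals = {r + c for r, c in enumerate(cols)}
    return len(columns) == n and len(main_diagonals) == n and len(anti_diagonals) == n


valid = is_solution([0, 4, 7, 5, 2, 6, 1, 3])    # a well-known valid placement
invalid = is_solution([0, 1, 2, 3, 4, 5, 6, 7])  # all queens on one diagonal
```

Using sets keeps the check linear in the number of queens, at the cost of not reporting which pair of queens conflicts.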
Many algorithms have been proposed for solving 8 Queens, as in Lim and Son [30], who applied Q-learning as a problem-solving algorithm for N-Queens and compared it with the traditional existing methods for solving the N-Queens problem.
2.2. Real-Time Strategy Games
Real-time strategy (RTS) games are a subgenre of strategy games where players need to build an economy and military power in order to defeat their opponents. In RTS games players race and struggle against enemy factions by harvesting resources scattered over a terrain, producing buildings and units, and helping one another in order to set up economies, improve their technological skill and level, and win battles, until their enemies are eliminated [31–34]. The better the balance among economy, technology, and army, the greater the chance of winning an RTS game [34].
Marthi et al. in [35] applied hierarchical reinforcement learning in a limited RTS domain. This approach used reinforcement learning augmented with prior knowledge about the high-level structure of behavior, constraining the possibilities of the learning agent and thus greatly reducing the search space.
Gusmao in [36] considered the problem of effective and automated decision-making in modern real-time strategy games through the use of reinforcement learning techniques. The researcher proposed a stable, model-based Monte Carlo method that assumes models are imperfect and reduces their influence in the decision-making process. Its effectiveness is further improved by including a novel online search procedure in the control policy.
Leece and Jhala in [37] presented a simplified game that mimics the spatial reasoning aspects of more complex games while removing other complexities. They analyzed the effectiveness of classical reinforcement learning for spatial management in order to build a detailed evaluative standard across a broad set of opponent strategies. The authors also demonstrated the potential of knowledge transfer to more complex games with similar components.
In 2015, Sethy et al. [38] proposed a reinforcement learning algorithm based on Q-learning and SARSA with a generalized reward function to train agents. The proposed model was evaluated using the Battle City RTS game and was reported to outperform the state of the art in enhancing agent learning ability.
2.2.1. Glest
Glest is an open source 3D real-time strategy game in which the player can command armies of two different factions, Tech and Magic. Tech mainly includes warriors and mechanical devices, while Magic relies on mages and summoned creatures on the battlefield [34, 39]. In Glest, warriors fight with weapons and mangonels, while magicians cast spells and magic forces, as shown in Figure 4. The final goal for both factions is to eliminate all enemy units, finish the episode, or die.
Glest actions are structured in three levels. The first consists of primitive low-level actions, such as modifications of basic variables that describe low-level microstates. In the second level, grouped primitive actions form a more abstract task or rule. There are thirteen different rules in Glest, including Worker Harvest, Return Base, Massive Attack, Add Tasks, Produce Resource Producer, Build One Farm, Produce, Build, Upgrade, and Repair [33, 39, 40].
The third level represents a strategy or tactic level, grouping second-level actions (rules) with respect to the timer of the game. Each second-level action (rule) is associated with a certain time interval. Depending on the modulus of the timer, a certain rule is picked up and applied by the tactic of the game within the rule’s interval [40].
Glest has four main variables: Kills, Units, Resources, and Timer. Each state is characterized by the values of the four variables, where “Kills” refers to the number of kills that the player has achieved, “Units” counts the units the player has produced, “Resources” are the resources the player has harvested, and finally “Timer” counts the game time units.
Every variable is connected with some related actions except for “Timer”; its only role is to count time. “Units” is connected with Build One Farm, Produce, Build, Upgrade, Expand, and Repair actions, while “Kills” is connected with Scout Patrol, Return Base, Massive Attack, and Add Tasks. Finally “Resources” are connected with Worker Harvest, Refresh Harvester, and Produce Resource Producer.
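The state description and the variable-to-rule associations above can be written down as a small data structure. This is only an illustrative sketch: the class name `GlestState`, the dictionary `RULES_BY_VARIABLE`, and the helper `variable_for_rule` are hypothetical names introduced here and are not part of the Glest engine or of the implementation described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlestState:
    """One game state, described by the four variables named in the text."""
    kills: int      # kills the player has achieved
    units: int      # units the player has produced
    resources: int  # resources the player has harvested
    timer: int      # elapsed game time units


# Rules (second-level actions) grouped by the variable they are connected with.
RULES_BY_VARIABLE = {
    "Units": ["Build One Farm", "Produce", "Build", "Upgrade", "Expand", "Repair"],
    "Kills": ["Scout Patrol", "Return Base", "Massive Attack", "Add Tasks"],
    "Resources": ["Worker Harvest", "Refresh Harvester", "Produce Resource Producer"],
    "Timer": [],  # the timer only counts time and has no associated rule
}


def variable_for_rule(rule):
    """Return the state variable a rule is connected with, or None if unknown."""
    for variable, rules in RULES_BY_VARIABLE.items():
        if rule in rules:
            return variable
    return None


total_rules = sum(len(rules) for rules in RULES_BY_VARIABLE.values())
```

The three non-empty groups together contain the thirteen rules mentioned earlier.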
Glest has recently attracted the attention of different researchers. In 2009, Dimitriadis [33] investigated the design of reinforcement learning autonomous agents that learn to play the Glest RTS game through interaction with the game itself. He used the well-known SARSA and LSPI algorithms to learn how to choose among different high-level strategies with the purpose of winning against the embedded AI opponent. In 2012, Aref et al. proposed a new model called REALSEE for exchanging opponent experiences among real-time strategy game engines of Glest [34].
2.2.2. StarCraft Brood War
StarCraft Brood War, shown in Figure 5, is an expansion pack released in 1998 for the award-winning real-time strategy game StarCraft. Brood War gameplay remains fundamentally unchanged from that of StarCraft but adds new difficult campaigns, map tilesets, music, extra units for each race, upgraded advancements, and less practical rushing tactics, and its missions are no longer entirely linear [41, 42]. In Brood War a single agent can make high-level strategic decisions, such as attacking an opponent or creating a secondary base, and mid-level decisions, such as deciding what buildings to build [42, 43].
Brood War has gained popularity as a test bed for research, as it exhibits the issues of most interest to AI research in RTS environments, including pathfinding, planning, spatial and temporal reasoning, and opponent modeling. Brood War also has a huge number of online game replays that have been used as the case base for case-based reasoning (CBR) systems to analyze opponent strategies [42, 43].
In 2012, Wender and Watson evaluated the suitability of the Q-learning and SARSA reinforcement learning algorithms for micromanaging combat units in the StarCraft Brood War RTS game [44], while in 2014 Siebra and Neto proposed a modeling approach for using SARSA to let computational agents evolve their combat behavior according to the actions of opponents and obtain better results in later battles [45].
3. Least-Squares Continuous Actions Policy Iteration (LSCAPI)
Reinforcement learning and video games have a long and mutually beneficial history, as games are fruitful fields for testing reinforcement learning algorithms. In any application of reinforcement learning, the choice of algorithm is just one of many factors that determine success or failure, and it is not even the most significant one. The choice of representation, the formalization, the encoding of domain knowledge, and the setting of parameters can all have great influence.
In this research we propose an offline pure batch algorithm based on the Least-Squares Policy Iteration algorithm. Least-Squares Policy Iteration is a relatively new, model-free, approximate policy iteration algorithm for control. It is an offline, off-policy batch training method that exhibits good sample efficiency and offers stability of approximation [13–15].
Least-Squares Policy Iteration evaluates policies using the least-squares temporal difference for Q-functions (LSTD) and performs exact policy improvements. To find the Q-function of the current policy, it uses a batch of transition samples, with LSTD, to compute the parameter vector. Then an improved policy in this Q-function is determined, the Q-function of the improved policy is found, and so on [5]. Most versions of LSPI use discrete actions, although control problems often need continuous actions. When a system needs to be stabilized, discrete actions may cause unwanted chattering of the control action.
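To make the evaluate-then-improve loop concrete, the following is a minimal sketch of standard discrete-action LSPI with an LSTD evaluation step, in the spirit of [13, 14]. It is not the LSCAPI algorithm and not the implementation used in this paper. The function names, the feature map `phi`, the small `ridge` regularizer, and the one-state toy problem are all assumptions made for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lstdq(samples, phi, policy, gamma, k, ridge=1e-6):
    """Policy evaluation: fit weights w so that Q(s, a) ~ w . phi(s, a)."""
    A = [[ridge if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        for i in range(k):
            b[i] += f[i] * r
            for j in range(k):
                A[i][j] += f[i] * (f[j] - gamma * f_next[j])
    return solve_linear(A, b)


def lspi(samples, phi, actions, gamma, k, max_iter=20, tol=1e-6):
    """Alternate LSTD evaluation and greedy improvement until weights settle."""
    w = [0.0] * k
    for _ in range(max_iter):
        def greedy(s, w=w):  # policy improvement: pick the highest-valued action
            return max(actions, key=lambda a: sum(wi * fi for wi, fi in zip(w, phi(s, a))))
        w_new = lstdq(samples, phi, greedy, gamma, k)
        converged = max(abs(x - y) for x, y in zip(w, w_new)) < tol
        w = w_new
        if converged:
            break
    return w


# Toy problem (hypothetical): one state, action 0 pays 0, action 1 pays 1.
def phi(s, a):
    return [1.0, 0.0] if a == 0 else [0.0, 1.0]


samples = [(0, 0, 0.0, 0), (0, 1, 1.0, 0)]  # (state, action, reward, next state)
weights = lspi(samples, phi, actions=[0, 1], gamma=0.5, k=2)
```

With a discount of 0.5 the learned weights approach the exact values Q(s, 0) = 1 and Q(s, 1) = 2, so the greedy policy settles on action 1. The maximization over a finite `actions` list is exactly the discrete-action restriction discussed above.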
The idea of LSCAPI, as described in Algorithm 1, concentrates on solving the discrete action problem by comparing actions to find the action with the largest value to be applied. But first, scalar control actions are considered in order to deal with the actions as a continuous action chain where ().

A new parameter for evaluating actions, in addition to the state parameters, is proposed, called the orthogonal polynomials Ψ. Ψ takes one of two values, depending on whether or not the action is selected to be applied in the game in the current step. A fitted action is evaluated by 1, in which case the engine returns the true value that was used to compute the policy. The other value is assigned when the action is discarded and a new action is selected from the action space.
As soon as the fitted action is selected, the LSTD [13] function evaluates the policy by computing the parameter vector from the input transition using the selected fitted action.
Finally, the difference between the previous and the newly selected actions is calculated and stored with the new action as in (1). This step reduces the storage space needed for actions by storing only the difference between actions, not the actions themselves.
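The storage idea in the last step can be illustrated with a generic delta encoding. Since equation (1) is not reproduced here, the following is only a sketch under the assumption that actions are scalars and that the first action is stored in full while every later entry stores its difference from the previous action; the function names are hypothetical.

```python
def encode_deltas(actions):
    """Store the first action as-is and every later action as a difference."""
    if not actions:
        return []
    deltas = [actions[0]]
    for previous, current in zip(actions, actions[1:]):
        deltas.append(current - previous)
    return deltas


def decode_deltas(deltas):
    """Rebuild the original action sequence from the stored differences."""
    actions = []
    running = 0
    for index, delta in enumerate(deltas):
        running = delta if index == 0 else running + delta
        actions.append(running)
    return actions


chain = [3, 5, 2, 2]           # hypothetical scalar actions
stored = encode_deltas(chain)  # [3, 2, -3, 0]
restored = decode_deltas(stored)
```

Differences tend to be small when consecutive actions are close to each other, which is where such an encoding saves space.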
4. Testing LSCAPI on Real Cases
To explain the algorithm further, LSCAPI was tested on the 8 Queens, Glest, and StarCraft Brood War games and implemented in C# using Microsoft Visual Studio 2013.
4.1. 8 Queens
In this section the proposed algorithm is used to solve the 8 Queens problem, and each step is traced separately. But first there are some assumptions to clarify:
(i) The states of the game are the eight rows of the board.
(ii) The actions are continuous, as every state has a chain of actions.
(iii) The board is considered together with the probability of actions, that is, the positions the 1st queen can take on the board.
(i) 1st Queen. As clarified in Figure 6, the first queen is set randomly on the first row of the board, where the initial state indicates that the queen is in row 8 and column 3. The probability [row, column] of the 1st queen indicates that the action for placing the 1st queen in column 3 is taken.
(ii) 2nd Queen. After placing the 1st queen, the probability of actions for the second queen is reduced by eliminating the positions that share a row, column, or diagonal with another queen. In the current state a matrix of continuous actions is initiated, holding the five available probabilities of actions that the second queen could take in the next row.
When applying LSCAPI, the value of Ψ is set to 1 if the selected action provides more possible actions to take, and to its other value if the action increases the probability of having pairs of attacking queens. Based on LSCAPI, the action with the higher value is selected. The resulting state places the queen in row 7 and column 5, so the action with the largest value was to place the queen in that position, as shown in Figure 7.
(iii) 3rd Queen. After placing the 2nd queen, we move to the next state, the third row, which has 26 probabilities of actions for placing the 3rd queen on the board and only 3 probabilities of actions in the continuous actions matrix. The previous steps are repeated from state 2 to select the action with the higher value, that is, the action generating more possible positions on the board for placing the 4th queen. As in Figure 8, the 3rd queen is placed in row 6 and column 2.
The same is done in the next five states to get the policy of the game and solve the problem by placing the eight queens in their correct places without errors, as shown in Figures 9–13.
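The counting in the walkthrough above can be reproduced with a few lines of code. This is a hedged illustration of the "keep the action that leaves the most room" heuristic only, not the LSCAPI algorithm itself. It uses zero-based rows counted from the first placed queen, so the text's row 8, column 3 becomes `(0, 2)` and row 7, column 5 becomes `(1, 4)`; all function names are hypothetical.

```python
def is_safe(queens, row, col):
    """True if (row, col) is not attacked by any queen placed in another row."""
    return all(c != col and abs(r - row) != abs(c - col) for r, c in queens)


def free_squares_after(queens, row, col, n=8):
    """Count squares in later rows that stay unattacked once (row, col) is taken."""
    placed = queens + [(row, col)]
    return sum(is_safe(placed, r, c) for r in range(row + 1, n) for c in range(n))


def rank_actions(queens, row, n=8):
    """Order the legal columns of `row`, most remaining free squares first."""
    candidates = [c for c in range(n) if is_safe(queens, row, c)]
    return sorted(candidates,
                  key=lambda c: free_squares_after(queens, row, c, n),
                  reverse=True)


first_queen = [(0, 2)]                                    # row 8, column 3 in the text
second_row_options = rank_actions(first_queen, 1)         # legal columns for queen 2
open_after_second = free_squares_after(first_queen, 1, 4)  # queen 2 at row 7, column 5
third_row_options = rank_actions(first_queen + [(1, 4)], 2)
```

Under this encoding the second row offers five legal columns, 26 squares remain open after the second queen, and three columns are legal in the third row, which agrees with the counts given in the walkthrough.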
Table 1 describes some of the other solutions generated by LSPI and LSCAPI for the 8 Queens game.

The performance of any action in the chain is measured based on the number of nonattacking queens. The minimum is equal to zero, where all queens attack each other. The maximum is 28, where there are no attacking queens. The performance is calculated based on (2) to determine the fitness of the action compared to alternative actions.
Table 2 represents the performance evaluation of actions taken by LSCAPI in case 1. Every row has a chain of fitted actions evaluated by (2). Notice that if two fitted actions have the same value, the fitted one is chosen depending on the previously chosen actions. Table 3 represents the performance evaluation of discrete actions taken by LSPI.
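The bounds of 0 and 28 follow from counting pairs: eight queens form 8 · 7 / 2 = 28 distinct pairs. The sketch below counts the nonattacking pairs of a full placement. It does not reproduce equation (2), and it assumes one queen per row with `cols[r]` giving that queen's column (an encoding chosen here for illustration).

```python
from itertools import combinations


def non_attacking_pairs(cols):
    """Count pairs of queens that do not attack each other.

    cols[r] is the column of the queen in row r, so two queens can only
    attack through a shared column or a shared diagonal.
    """
    queens = list(enumerate(cols))
    return sum(
        1
        for (r1, c1), (r2, c2) in combinations(queens, 2)
        if c1 != c2 and abs(r1 - r2) != abs(c1 - c2)
    )


best = non_attacking_pairs([0, 4, 7, 5, 2, 6, 1, 3])  # a valid solution
worst = non_attacking_pairs([0] * 8)                  # all queens in one column
```

A valid solution reaches the maximum of 28, while placing every queen in the same column gives the minimum of 0.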


The case performance achieved by LSCAPI is higher than that achieved by LSPI, which indicates that the actions selected by LSCAPI are better than those selected by LSPI and ultimately produce better game performance.
4.2. Glest
The evaluation of LSCAPI in Glest is a bit different from that in 8 Queens, as in Glest LSCAPI works on improving the game engine policy against the player. In evaluating LSCAPI in Glest we concentrated on three parameters, Δkills, ΔUnits, and Δresources, as in Table 4, where Δkills, ΔUnits, and Δresources represent the kills, units, and resources gained or lost as a result of moving from one state to the next (executing an action).

In the LSCAPI algorithm the action for a state is obtained not as a discrete action but as a continuous chain of actions. Every action in the chain carries a polynomial value (1 or otherwise), which is not used as a calculation but only for selecting the fittest action in this state. Using a continuous chain improves the efficiency of the game and speeds it up. LSCAPI arranges the actions by priority, the most fitted action first, then the next, and so on. This saves time in selecting the fitted action and enhances performance. Fitted actions are evaluated by the reward resulting from them.
Table 4 introduces a sample case of the Glest game called three towers. LSCAPI arranges the available actions in a continuous priority chain. This method helps to get the actions with the higher performance in less time.
In this case the available actions were Add Tasks (AT), Build (B), Massive Attack (MA), Produce (P), Produce Resource Producer (PRP), and Worker Harvest (WH). These actions are arranged based on their priority; the action with the highest priority is at the beginning of the chain and so on, until the chain contains only the fitted actions that will be used in the state.
In “game” state 1 the priority of the continuous actions chain is (PRP, MA, B, AT, P); only five of the six input actions are used in this case, with the shown reward. In state 2 only four of them are used, while for the player the priority chain of actions in state 1 includes all of the input actions and in state 2 it includes only five of them.
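The ordering step can be pictured as sorting the available actions by their reward and keeping the leading entries. The sketch below is illustrative only: the reward numbers are invented so that the result matches the chain quoted for "game" state 1, and `priority_chain` is a hypothetical helper, not a function of the implementation.

```python
def priority_chain(action_rewards, keep):
    """Return the `keep` best actions, ordered from highest to lowest reward."""
    ranked = sorted(action_rewards.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:keep]]


# Hypothetical rewards for the six actions of the three towers case.
rewards = {"AT": 2.0, "B": 3.0, "MA": 4.0, "P": 1.0, "PRP": 5.0, "WH": 0.5}

state_1_chain = priority_chain(rewards, keep=5)
state_2_chain = priority_chain(rewards, keep=4)
```

Because the chain is already ordered, picking the fittest action is a matter of reading its first element rather than re-evaluating every action.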
Meanwhile, in the tower of souls case, as in Table 5, the available actions were Produce Resource Producer (PRP), Build One Farm (BOF), Refresh Harvester (RH), Massive Attack (MA), Return Base (RB), and Upgrade.

In game state 1 the priority of the continuous actions chain is (MA, PRP, RH, RB, BOF); only five of the six input actions are used in this case, with the shown reward. In state 2 only four of them are used, while for the player the priority chain of actions, as in the table, includes all of the input actions in state 1 and only five of them in state 2. Finally, the calculation of the reward function used to evaluate the action selection mechanism is based on (3).
Table 6 demonstrates an example of calculating the reward function, based on the values obtained from the number of kills, units, and resources in each two consecutive states as in (3). The reward values support the superiority of LSCAPI over LSPI by a reasonable difference.
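The inputs to this reward are the changes in kills, units, and resources between two consecutive states. The sketch below computes those changes; how they are combined is an assumption, since equation (3) is not reproduced here. A plain weighted sum with unit weights is used purely as a placeholder, and `Snapshot`, `deltas`, and `reward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Values of the three evaluated variables in one state."""
    kills: int
    units: int
    resources: int


def deltas(previous, current):
    """Return (delta kills, delta units, delta resources) between two states."""
    return (
        current.kills - previous.kills,
        current.units - previous.units,
        current.resources - previous.resources,
    )


def reward(previous, current, weights=(1.0, 1.0, 1.0)):
    """ASSUMPTION: weighted sum of the deltas; the paper's equation (3) may differ."""
    return sum(w * d for w, d in zip(weights, deltas(previous, current)))


before = Snapshot(kills=2, units=10, resources=500)  # hypothetical values
after = Snapshot(kills=5, units=9, resources=650)
change = deltas(before, after)
value = reward(before, after)
```

Keeping the weights as a parameter makes it easy to substitute the actual form of (3) once it is known.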

4.3. StarCraft Brood War
StarCraft Brood War uses the same case form as Glest, so we concentrate here on the calculation of the reward function. The reward function of Brood War is the difference between the damage done by the agent and the damage received by the agent over each two consecutive states, and it is used to evaluate the action selection mechanism. Table 7 demonstrates an example of calculating the agent reward when applying the LSCAPI and SARSA (already applied in the StarCraft Brood War engine) action selection mechanisms in the Protoss versus Zerg case. This case includes four ground units, Probe, Dragoon, Drone, and Larva; the Drone and Larva units belong to Zerg, which represents the game agent, while the Probe and Dragoon units belong to Protoss, which represents the game enemy.
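The reward described above can be sketched as follows. The sketch assumes that each state records cumulative damage dealt and damage taken by the agent, so that the per-step amounts are the differences between two consecutive states; `CombatState` and `combat_reward` are hypothetical names, and any scaling applied in the actual implementation is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombatState:
    """Cumulative damage counters of the agent (assumed representation)."""
    damage_dealt: float
    damage_taken: float


def combat_reward(previous, current):
    """Damage done minus damage received between two consecutive states."""
    dealt = current.damage_dealt - previous.damage_dealt
    taken = current.damage_taken - previous.damage_taken
    return dealt - taken


step_reward = combat_reward(
    CombatState(damage_dealt=40.0, damage_taken=10.0),  # hypothetical values
    CombatState(damage_dealt=70.0, damage_taken=25.0),
)
```

A positive value means the agent dealt more damage than it received during the step, and a negative value means the opposite.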

The reward values in Brood War fall within a much smaller range than those in Glest. It is also clear from the table that the agent reward resulting from actions selected by LSCAPI is much higher than the agent reward in the case of SARSA.
5. LSCAPI Implementation
LSCAPI was implemented in C# using Microsoft Visual Studio 2013 and packaged into the Glest and StarCraft Brood War game engines, while the 8 Queens game was a fully standalone application based on LSPI and LSCAPI.
6. Results and Discussion
6.1. 8 Queens
The overall performance of 5 different solutions generated by LSCAPI and LSPI was calculated through the implementation and is shown in Figure 14. It shows that LSCAPI generated actions and solutions that achieved better performance than LSPI, which also indicates a better policy through the game.
Figure 15 presents the time taken by each of these solutions with LSPI and LSCAPI. It shows that LSCAPI not only generated policies that achieved higher performance but also generated them in less time. For example, one LSCAPI solution reaches a performance of 0.905 in 0.269 seconds, while LSPI case 4 reaches a performance of only 0.28 in 0.88 seconds, which is more than three times the time needed to generate the LSCAPI solution.
Table 8 shows the number of solutions generated by LSPI and LSCAPI in 10 different simulation attempts and the time taken by each attempt. In attempt 4, LSCAPI generated 34 different arrangements of the 8 Queens in 9.16 seconds, while LSPI generated only 10 solutions in 8.8 seconds.

We can conclude that LSCAPI achieved a clear advantage over LSPI in these tests, generating more solutions with higher performance in comparable or less time, which leads to better policies in solving the game.
6.2. Glest
To evaluate Glest, the agent reward generated by LSPI and LSCAPI in the cases under consideration was calculated based on (3) and is listed in Table 9. It shows that the actions selected by LSCAPI achieved much higher rewards for agents, which leads to better policies when facing opponents. For example, LSCAPI achieved 15000 in case 8, while the actions selected by LSPI achieved only 8500, which is a little over half.

As Table 9 also shows, the LSCAPI cases with higher reward are associated with a smaller number of used actions. This suggests an inverse relationship between the reward and the number of used actions: when the number of actions decreases the reward increases, and vice versa.
Finally, Figure 16 presents the game army score achieved by LSCAPI policy learning against the score achieved by LSPI policy learning in the Duel scenario, which is a medium difficulty scenario. The LSCAPI and LSPI policy scores were evaluated in 15 test games, where each test game was played after every 10 games of learning.
Figure 16 shows that the game army using the LSCAPI policy achieved a much higher score than that achieved with LSPI, which indicates that LSCAPI helps the game agents to face opponents efficiently and defeat them.
6.3. StarCraft Brood War
To evaluate Brood War, the agent reward generated by LSCAPI and SARSA in the cases under consideration was calculated and is listed in Table 10. The reward values in Table 10 show that LSCAPI outperformed SARSA in terms of agent rewards, leading to better policies when facing opponents. In case 8, for example, LSCAPI achieved an agent reward of 0.75, while the SARSA action achieved only 0.44.

Finally, Figure 17 presents the game army score achieved by LSCAPI policy learning against the score achieved by SARSA in the Terran scenario 1, which is a medium difficulty scenario. The LSCAPI and SARSA policy scores were evaluated in 15 test games, where each test game was played after every 10 games of learning.
Figure 17 shows that LSCAPI achieved much higher scores across the 15 test games than SARSA. The SARSA scores increase at a decreasing rate and finally settle at a fixed value regardless of the number of trials. The LSCAPI scores, on the other hand, grow at an increasing rate, indicating that learning keeps improving as the agent is rewarded more often for better action selection. From all of the foregoing, we conclude that LSCAPI helps game engines to face opponents efficiently and defeat them.
7. Conclusions and Future Work
Mapping from states to actions, known as the game policy, is the basic function a game engine uses to face and react to an opponent. In this paper, we studied the impact of batch reinforcement learning on enhancing game policy and proposed a new algorithm named Least-Squares Continuous Actions Policy Iteration (LSCAPI). The LSCAPI algorithm builds on LSPI and handles continuous actions through a tradeoff between the available actions, selecting the action that scores the higher reward for the game agent.
LSCAPI was tested on two different types of games: board games, represented by 8 Queens, and RTS games, represented by Glest and StarCraft Brood War. The proposed algorithm was evaluated based on the agent reward values, scores, time, and number of generated solutions. The evaluation results indicated that LSCAPI achieved better performance, time, policy, and agent learning ability than the original LSPI.
In the future we plan to test LSCAPI on more complicated games such as chess and poker to check its impact on game policy, especially since the nonplaying character in these two games relies heavily on the game policy.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Copyright
Copyright © 2016 Shahenda Sarhan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.