Computational Intelligence and Neuroscience

Volume 2016, Article ID 7420984, 13 pages

http://dx.doi.org/10.1155/2016/7420984

## Active Player Modeling in the Iterated Prisoner’s Dilemma

Department of Computer Science and Engineering, Sejong University, 209 Neungdong-ro, Gwangjin-gu, Seoul 05006, Republic of Korea

Received 12 November 2015; Revised 14 January 2016; Accepted 20 January 2016

Academic Editor: Reinoud Maex

Copyright © 2016 Hyunsoo Park and Kyung-Joong Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The iterated prisoner’s dilemma (IPD) is well known within the domain of game theory. Although it is relatively simple, it can also elucidate important problems related to cooperation and trust. Generally, players can predict their opponents’ actions when they are able to build a precise model of their behavior based on their game playing experience. However, it is difficult to make such predictions based on a limited number of games. The creation of a precise model requires not only an appropriate learning algorithm and framework but also a good dataset. Active learning approaches have recently been introduced to the machine learning community; they can usually produce informative datasets with relatively little effort. Therefore, we propose an active modeling technique to predict the behavior of IPD players. The proposed method can model the opponent player’s behavior while taking advantage of interactive game environments. Our experiment used twelve representative types of players as opponents, and an observer used an active modeling algorithm to model these opponents. The observer actively collected data and modeled each opponent’s behavior online. In most cases, our data showed that the observer built a more accurate model of an opponent’s behavior through actively chosen actions than when the data were collected through random actions.

#### 1. Introduction

Understanding one’s opponents is very useful when playing games. In many games, each player tries to figure out his or her opponents’ hidden beliefs, desires, and intentions in order to maximize his or her reward. However, this is difficult in many cases because such information is hidden by the opponents. Instead, each player can usually only infer other players’ internal states from observable information such as behavior. This raises the question of how players can come to understand one another, and there are many possible explanations for the development of such understanding [1, 2]. In the present paper, however, we consider only player modeling, that is, techniques for building models that can predict or infer a player’s future behavior, as an approach to understanding opponents [3]. Player modeling usually relies on data about the player’s past behavior. This approach is simple and effective and can accommodate many techniques and styles of implementation depending on the data. An accurate player model may enable us to infer a player’s current inner state, predict his or her future actions, and determine the reasons for his or her current actions.

The iterated prisoner’s dilemma (IPD) [4] is among the games in which player modeling is important. This mathematical game is well known in the domains of economics, international politics, and artificial intelligence (AI). When playing the IPD, the ability to predict the future action of one’s opponent is the most important contributor to maximizing one’s own benefit. Generally, the creation of a precise model to predict the future action of an opponent is sufficient to win this game, and several studies have examined opponent modeling in the IPD [5].

Application of a modeling technique requires the prior collection of sufficient data, which is difficult because it requires that a player actually play the game to provide ground-truth data. Modeling techniques based on data are usually known as data mining [6], an approach that has been studied for the last few decades and successfully applied in various domains. The success of many of these applications requires considerable data; with insufficient data, the performance of the model is uncertain. Indeed, more sophisticated models with more features require more data, a phenomenon known as the curse of dimensionality [7]. The same problems arise in player modeling. Although labeling a future action is easy in the IPD (it involves only recording the action), the problem of a limited dataset remains.

The development of an active learning algorithm [8] is one approach to dealing with the scarcity of data. This approach enables the collection of more informative data through online interactions/queries, and it usually increases the accuracy of the model with fewer data points compared with conventional (passive) learning algorithms. This approach can be applied in a straightforward manner to collect data related to the IPD, because interaction is an important element of this game. These interactions can form the basis for the active learning algorithm used to model the behavior of one’s opponent.

In this paper, we propose an active learning method for player modeling in the IPD game. Our approach is based on the query-by-committee (QBC) algorithm [9] and the estimation-exploration algorithm (EEA) [10]. QBC is an active learning algorithm that maintains multiple hypotheses (an ensemble of models); the EEA takes a similar approach but originates in evolutionary computation. We conducted simulations to evaluate the performance of our approach. These simulations involved two players, an observer (learner) and an opponent. The observer played the game to model his/her opponent’s behavior using an active learning algorithm. The opponent was a typical player who played to obtain rewards, drawn from 12 types of strategies. In this way, we comparatively evaluated the advantage offered by our method with regard to data collection. According to our results, our approach performed better in most cases: over the course of a few games, it was able to build more accurate models than a random approach.
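To make the query-by-committee idea concrete, the following is a minimal, hypothetical sketch, not the authors’ implementation: a committee of deterministic opponent models votes on the opponent’s next move in each game state, the observer queries the state with the greatest disagreement, and hypotheses inconsistent with the observed response are discarded, in the spirit of the EEA’s hypothesis refinement. The state encoding, the tit-for-tat opponent, and all names are illustrative assumptions.

```python
import itertools

C, D = "C", "D"
# A state is the previous joint move: (observer's last move, opponent's last move).
STATES = list(itertools.product([C, D], repeat=2))

def all_hypotheses():
    """Enumerate every deterministic mapping from state to predicted move
    (the full version space: 2^4 = 16 hypotheses)."""
    for outputs in itertools.product([C, D], repeat=len(STATES)):
        yield dict(zip(STATES, outputs))

def disagreement(committee, state):
    """Fraction of committee votes in the minority; 0 means consensus."""
    votes = [h[state] for h in committee]
    return min(votes.count(C), votes.count(D)) / len(votes)

def tit_for_tat(state):
    """Illustrative opponent: repeats the observer's previous move."""
    observer_last, _ = state
    return observer_last

committee = list(all_hypotheses())
queried = []
while len(committee) > 1:
    # Active query: pick the state the committee disagrees about most.
    state = max(STATES, key=lambda s: disagreement(committee, s))
    # Steer play to reach that state and observe the opponent's response.
    observed = tit_for_tat(state)
    # Keep only hypotheses consistent with the observation.
    committee = [h for h in committee if h[state] == observed]
    queried.append(state)

# The committee converges to the single hypothesis matching the opponent.
```

In this toy version each query halves the version space, so the opponent model is identified after one query per state; the paper’s setting replaces the exhaustive hypothesis set with an evolved ensemble, but the query-selection principle is the same.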

#### 2. Related Research

##### 2.1. Iterated Prisoner’s Dilemma

In the original prisoner’s dilemma, two players choose to cooperate (C) or to defect (D) and receive a reward or penalty based on their choices. Table 1 presents the prisoner’s dilemma’s payoff table. If both players cooperate, each gets an intermediate reward (R: reward). However, if one defects while the other cooperates, the defector gets the maximum reward (T: temptation) and the cooperator gets nothing (S: sucker). If both players defect to avoid being exploited, each gets the minimum reward (P: penalty). The payoffs have to satisfy the conditions in (1), which prevent a strategy of alternating between C and D from yielding an incentive and also motivate each player to play noncooperatively:

$$T > R > P > S, \qquad 2R > T + S. \tag{1}$$

It is usually in the best interest of both players to trust and cooperate with each other, because this strategy yields the maximum joint reward. However, there is always the risk that one’s opponent will choose to defect. In this setting, the cooperative action’s expected reward is 1.5 ((3 + 0)/2) and the defect action’s expected reward is 3.0 ((5 + 1)/2). Therefore, if each player is rational, defection is the action that maximizes the expected reward. Furthermore, each player assumes that the opponent is rational, in other words, that the opponent will defect to maximize the expected reward, and so each player should defect. This is the rational strategy in the original prisoner’s dilemma.
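The arithmetic above can be checked directly with the payoff values implied by the text (T = 5, R = 3, P = 1, S = 0). This snippet is purely illustrative and assumes an opponent who cooperates or defects with equal probability:

```python
# Payoff values implied by the expected-reward arithmetic in the text.
T, R, P, S = 5, 3, 1, 0  # temptation, reward, penalty, sucker

# The conditions in (1): defection tempts, mutual cooperation beats
# mutual defection, and alternating C/D cannot beat sustained cooperation.
assert T > R > P > S
assert 2 * R > T + S

# Expected reward of each action against an opponent who cooperates
# or defects with probability 1/2 each.
expected_cooperate = (R + S) / 2   # (3 + 0) / 2 = 1.5
expected_defect = (T + P) / 2      # (5 + 1) / 2 = 3.0
```

Since the expected reward of defecting (3.0) exceeds that of cooperating (1.5) under this assumption, a rational player defects, which is exactly the dilemma.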