Journal of Control Science and Engineering

Volume 2015 (2015), Article ID 264953, 7 pages

http://dx.doi.org/10.1155/2015/264953

## Combining Multiple Strategies for Multiarmed Bandit Problems and Asymptotic Optimality

Department of Computer Science and Engineering, Sogang University, Seoul 121-742, Republic of Korea

Received 13 January 2015; Accepted 11 March 2015

Academic Editor: Shengwei Mei

Copyright © 2015 Hyeong Soo Chang and Sanghee Choe. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This brief paper provides a simple algorithm that, at each time, selects a strategy from a given set of multiple strategies for stochastic multiarmed bandit problems and then plays the arm chosen by the selected strategy. The algorithm follows the idea of the probabilistic ε-switching in the ε_t-greedy strategy and is asymptotically optimal in the sense that the selected strategy converges to the best in the set under some conditions on the strategies in the set and on the sequence {ε_t}.

#### 1. Introduction

This paper considers the problem of stochastic non-Bayesian multiarmed bandit (MAB) in which a player with a bandit has to decide which arm to play at each time among available arms to maximize the sum of rewards earned through a sequence of playing arms. When played, each arm provides a random reward from an unknown distribution specific to that arm. The problem models the well-known trade-off between “exploration” and “exploitation” in sequential learning. The player needs to obtain new knowledge (exploration) and at the same time optimize her decisions based on existing knowledge (exploitation). The player attempts to balance these competing tasks in order to achieve the goal. Many practical problems, for example, in networking [1, 2], in games [3], and in prediction [4], and problems such as clinical trials and ad placement on the Internet (see, e.g., [1, 5, 6] and the references therein) have been studied with a (properly extended) model of the MAB problems.

Specifically, we consider a stochastic K-armed bandit problem with a finite set of arms A = {1, …, K}, one of which needs to be played sequentially at each time. When an arm a in A is played at time t, the player obtains a bounded sample reward drawn from an unknown distribution associated with a, whose unknown expectation and variance are μ(a) and σ²(a), respectively. We define a *strategy* π as a sequence of mappings {π_t, t ≥ 1} such that π_t maps from the set of all possible sequences of past plays and rewards to the set of all possible distributions over A_π, where A_π is an arbitrarily given nonempty subset of A. We denote the set of all possible strategies by Π. Given a particular sequence of past plays and rewards obtained by following π over the first t − 1 time steps, π_t selects the arm to be played at time t with the probability assigned to it by the resulting distribution. We assume that A_π is arbitrarily given.

Let the random variable α_t^π denote the arm selected by π at time t and let φ_t^π be the distribution over A_π given by π_t at time t, so that the probability that α_t^π = a is equal to φ_t^π(a) for each a in A_π. We assume that the rewards of the arms are independent of the strategy followed and of one another, and that the rewards obtained by playing any fixed arm a in A are identically distributed.

Let μ* = max_{a∈A} μ(a) and A* = {a ∈ A : μ(a) = μ*}. For a given π, if the probability that α_t^π ∈ A* converges to one as t → ∞, then we say that π is an *asymptotically optimal* strategy. The notion of asymptotic optimality was introduced by Robbins [5]. He presented a strategy which achieves the optimality for the case where K = 2 and, in a single play, each arm produces a reward of 1 or 0 with unknown probabilities. Bather [7] considered the same Bernoulli problem with K ≥ 2 and established an asymptotically optimal index-based strategy: at each time it selects an arm maximizing a randomly perturbed sample-mean index, where X̄_a(t) denotes the average of the reward samples obtained by playing a, T_a(t) denotes the number of times arm a has been played during the first t plays, the perturbations are scaled by a sequence of strictly positive constants that vanishes as t → ∞, and the perturbation variables are i.i.d., positive, and unbounded with a suitably light-tailed common distribution function. The idea is to ensure that each arm is played infinitely often by adding small perturbations to the sample means and to make the effect vanish as t increases. The well-known asymptotically optimal ε_t-greedy strategy [8], with ε_t = min{1, cK/(d²t)} for constants c > 0 and 0 < d < 1, follows exactly the same idea: at time t, with probability ε_t it selects an arm uniformly at random over A and, with probability 1 − ε_t, it selects an arm maximizing the sample mean X̄_a(t).
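
As a rough illustration of the ε_t-greedy rule just described, the following sketch (our own minimal implementation, with Bernoulli reward functions standing in for the unknown distributions) explores uniformly with probability ε_t = min{1, cK/(d²t)} and otherwise plays the arm with the highest sample mean:

```python
import random

def eps_t_greedy(reward_fns, horizon, c=0.1, d=0.1, rng=None):
    """Sketch of the eps_t-greedy strategy of [8]: at time t, explore
    uniformly with probability eps_t = min(1, c*K/(d*d*t)) and otherwise
    play the arm with the highest sample-mean reward."""
    rng = rng or random.Random(0)
    K = len(reward_fns)
    counts = [0] * K       # T_a(t): number of plays of each arm
    means = [0.0] * K      # running sample means of each arm
    for t in range(1, horizon + 1):
        eps = min(1.0, c * K / (d * d * t))
        if rng.random() < eps:
            a = rng.randrange(K)                        # explore uniformly
        else:
            a = max(range(K), key=lambda i: means[i])   # exploit
        x = reward_fns[a](rng)                          # sample a bounded reward
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]          # update running average
    return counts, means
```

Because the exploration probability decays like 1/t, its sum diverges, so every arm is played infinitely often while the greedy choice dominates in the limit.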

This brief paper provides a randomized algorithm which follows the spirit of the ε_t-greedy strategy for combining multiple strategies in a given finite nonempty Λ ⊂ Π. At each time t, we use the probabilistic ε_t-switching either to select uniformly a strategy in Λ or to select a strategy with the highest sample average of the rewards obtained so far by playing the bandit. Once a strategy is selected, the arm chosen by that strategy is played for the bandit. Analogous to the case of the ε_t-greedy strategy, the algorithm is asymptotically optimal in the sense that the selected strategy converges to the “best” in the set under some conditions on the strategies in Λ and on {ε_t}.

#### 2. Related Work

In the following, we briefly summarize the works in the literature most relevant to the results of the present paper. A seminal work by Gittins and Jones [9] provides an optimal policy (or allocation index rule) to maximize the discounted reward over an infinite horizon when the rewards are given by Markov chains whose statistics are perfectly known in advance. Note that our model does not consider discounting in the rewards and assumes that the relevant statistics are unknown.

Auer et al. [10] presented an algorithm, called Exp4, which combines multiple strategies in a *nonstochastic* bandit setting. In the nonstochastic MAB, it is assumed that each arm is *initially assigned* an arbitrary and unknown sequence of rewards, one for each time step. In other words, the rewards obtained by playing a specific arm are predetermined. In Exp4, the “uniform expert,” which always selects an arm uniformly over A, needs to be always included in the expert set. At each time, Exp4 mixes the arm distributions advised by the experts in proportion to their weights, together with a small amount of uniform exploration, and plays an arm according to the mixture. Each expert’s weight is then updated multiplicatively from an importance-weighted estimate of the observed sample reward, computed with an Iverson bracket indicating the arm actually played. A finite-time upper bound (but no lower bound) on the expected “regret” was analyzed against a best strategy that achieves the optimal total reward with respect to the fixed sample reward sequences for the arms (see [11] for a probability version within a contextual bandit setting and [12] for a simplified derivation).
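
A minimal sketch of the Exp4 loop just described, adapted to our stochastic setting, may help fix ideas. All variable names are ours; each expert is represented as a function mapping the time step to a probability vector over the arms, and weights are renormalized each round purely for numerical stability:

```python
import math
import random

def exp4(experts, reward_fns, horizon, gamma=0.05, rng=None):
    """Sketch of Exp4 [10]: mix the experts' advised arm distributions by
    weight, add gamma-uniform exploration, play an arm from the mixture,
    and update weights exponentially in importance-weighted rewards.
    Rewards are assumed to lie in [0, 1]; the uniform expert should be
    included in `experts`."""
    rng = rng or random.Random(0)
    K = len(reward_fns)
    N = len(experts)
    w = [1.0] * N                               # expert weights
    pulls = [0] * K
    for t in range(1, horizon + 1):
        advice = [e(t) for e in experts]        # each row sums to 1 over arms
        W = sum(w)
        p = [(1 - gamma) * sum(w[j] * advice[j][a] for j in range(N)) / W
             + gamma / K for a in range(K)]     # mixed arm distribution
        a = rng.choices(range(K), weights=p)[0]
        x = reward_fns[a](rng)
        pulls[a] += 1
        xhat = x / p[a]                         # importance-weighted reward
        for j in range(N):
            yhat = advice[j][a] * xhat          # expert j's estimated reward
            w[j] *= math.exp(gamma * yhat / K)
        W = sum(w)
        w = [wj / W for wj in w]                # renormalize (cosmetic)
    return pulls
```

Experts whose advised distributions earn higher importance-weighted rewards accumulate weight, so the mixture drifts toward the better expert.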

McMahan and Streeter [13] proposed a variant of Exp4, called NEXP, in the same nonstochastic setting. NEXP needs to solve a linear program (LP) at each time step to obtain a distribution over the experts that offers a “locally optimal” trade-off between exploration and exploitation. Although some improvement over Exp4 was shown, it comes at the expense of solving an LP at every time step.

de Farias and Megiddo [14] presented another variant of Exp4, called EEE, within a “reactive” setting. In this setting, at each time a player chooses an arm and an environment chooses its state, which is unknown to the player. The reward obtained by the player depends on both the chosen arm and the current state but is not necessarily determined by a distribution specific to them. An example of this setting is playing a repeated game against another player. When an expert is selected by EEE for a phase, it is followed for multiple time steps during the phase, rather than picking a different expert at each time, and the average reward accumulated over that phase is kept track of. The current best strategy with respect to the estimate of the average reward, or a random strategy, is selected at each phase with a certain control rule of exploration and exploitation, which is similar to the ε_t-schedule we consider here. (See also a survey section in [15] for expert-combining algorithms in different scenarios.)

Because these representative approaches combine multiple experts, we compare them with our algorithm after adapting them to our setting (cf. Section 4). However, more importantly, the notion of the best strategy in nonstochastic or reactive settings does not directly apply to the stochastic setting. Establishing some kind of asymptotic optimality for Exp4 or its variants with respect to a properly defined best strategy (even after adapting Exp4 as a strategy in Λ) is an open problem. In fact, to the authors’ best knowledge, there seems to be no notable work yet which studies asymptotic optimality in combining multiple strategies for *stochastic* multiarmed bandit problems.

Finally, we stress that this paper focuses on asymptotic optimality as the performance measure, also termed “instantaneous regret” [8], rather than on the “expected regret” typically considered in the (recent) bandit-theory literature (see, e.g., [12] for a survey). It is worthwhile to note that the instantaneous regret is a stronger measure of convergence than the expected regret [8].

#### 3. Algorithm and Convergence

Assume that a finite nonempty subset Λ of Π is given. Once a strategy π ∈ Λ is selected by the algorithm at time t, the bandit is played with the arm selected by π and a sample reward is obtained. Let σ_t denote the random variable denoting the strategy selected by the algorithm at time t, and (with abuse of notation) let T_π(t) denote the number of times π has been selected during the first t time steps, that is, the number of steps s ≤ t with σ_s = π. Let M̄_π(t) denote the sample average of the rewards obtained over the time steps at which π was selected, the kth summand being the reward obtained at the time of the kth selection of π. For finite t, the current best strategy in Λ is then any maximizer of M̄_π(t) over π ∈ Λ. We formally describe the algorithm below.

*The Algorithm*
(1) *Initialization:* select ε_t ∈ (0, 1] for t ≥ 1. Set t = 1 and initialize M̄_π and T_π for each π ∈ Λ.
(2) *Loop:*
(2.1) Obtain π̂ ∈ argmax over π ∈ Λ of M̄_π(t) (ties broken arbitrarily).
(2.2) With probability 1 − ε_t, select π̂ and, with probability ε_t, select uniformly a strategy in Λ. Set the selected strategy to be σ_t. Select an arm according to σ_t and obtain a sample reward by playing it.
(2.3) Update M̄_{σ_t} and T_{σ_t}, and set t ← t + 1.
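
The loop above can be sketched as follows. This is our own minimal illustration, not the paper’s reference implementation: the strategies are represented as objects with `select`/`update` methods, the `GreedySubset` class is a toy ε_t-greedy strategy restricted to a subset of arms (in the spirit of the decomposition remark at the end of this section), and all names and parameters are ours:

```python
import random

class GreedySubset:
    """Toy strategy: eps_t-greedy play restricted to a fixed subset of arms."""
    def __init__(self, arms, c=0.1, d=0.1):
        self.arms, self.c, self.d = list(arms), c, d
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}
        self.t = 0
    def select(self, rng):
        self.t += 1
        eps = min(1.0, self.c * len(self.arms) / (self.d ** 2 * self.t))
        if rng.random() < eps:
            return rng.choice(self.arms)                 # explore in subset
        return max(self.arms, key=lambda a: self.means[a])
    def update(self, a, x):
        self.counts[a] += 1
        self.means[a] += (x - self.means[a]) / self.counts[a]

def combine_strategies(strategies, reward_fns, horizon, eps_schedule, rng=None):
    """Sketch of the eps_t-switching combiner: with probability 1 - eps_t
    select the strategy with the highest sample-average reward so far,
    otherwise select a strategy uniformly; the chosen strategy plays one arm."""
    rng = rng or random.Random(0)
    n = len(strategies)
    sel_counts = [0] * n       # T_pi(t): times each strategy was selected
    sel_means = [0.0] * n      # M_pi(t): sample-average reward per strategy
    for t in range(1, horizon + 1):
        eps = eps_schedule(t)
        if rng.random() < eps:
            s = rng.randrange(n)                             # uniform switching
        else:
            s = max(range(n), key=lambda i: sel_means[i])    # greedy choice
        a = strategies[s].select(rng)     # the chosen strategy picks an arm
        x = reward_fns[a](rng)
        strategies[s].update(a, x)        # the strategy learns as usual
        sel_counts[s] += 1
        sel_means[s] += (x - sel_means[s]) / sel_counts[s]
    return sel_counts
```

With a schedule such as ε_t = min{1, 10/t}, every strategy keeps being sampled (the ε_t sum diverges) while the greedy choice dominates in the limit, matching the conditions of Theorem 1 below in spirit.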

Note that the algorithm above involves a general schedule {ε_t}. By setting the ε_t-schedule properly, it subsumes the schedules used in the ε-greedy, ε-first, and ε-decreasing strategies [16, 17]. In particular, as a special case, if Λ contains one strategy per arm and each such strategy always plays its single fixed arm, then the algorithm degenerates to the ε_t-greedy strategy. As shown in the experimental results in [16, 17], the performance of *tuned* ε-greedy is no worse than (or very close to) those of the ε-first and ε-decreasing strategies, and so forth, even with tuning of the schedule (and the relevant parameters) each strategy uses. However, because these schedules are usually heuristically tuned and ε-greedy uses a constant value of ε, it is not necessarily guaranteed that employing such schedules achieves asymptotic optimality. Furthermore, it is very difficult to tune the value of ε in advance. The theorem below establishes general conditions for asymptotic optimality of the algorithm with respect to a properly defined best strategy in Λ. It states that if each π ∈ Λ is selected infinitely often by the algorithm, each arm in A_π is selected infinitely often by π, each π’s action-selection distribution converges to a stationary distribution, and the selection of strategies becomes greedy in the limit, then the strategy selected by the algorithm converges to the best strategy in Λ.

Theorem 1. *Given a finite nonempty Λ ⊂ Π, consider the algorithm above. Suppose that ε_t → 0 as t → ∞ and that the sum of ε_t over t ≥ 1 diverges, so that each strategy in Λ is selected infinitely often with probability 1. Suppose further that, for each π ∈ Λ, each arm in A_π is played infinitely often by π and there exists a distribution φ_π over A_π such that φ_t^π(a) → φ_π(a) with probability 1 for all a ∈ A_π. Then the probability that σ_t maximizes Σ_{a∈A_π} φ_π(a)μ(a) over Λ converges to one as t → ∞.*

*Proof.* Because the sum of ε_t diverges, each strategy in Λ is selected infinitely often with probability 1, so T_π(t) → ∞ for every π ∈ Λ. Fix any π ∈ Λ. By the assumed convergence of π’s action-selection distribution, for any prescribed tolerance there exists, with probability 1, a time t₀ such that φ_s^π is within that tolerance of φ_π for all s ≥ t₀. Decompose the sample average M̄_π(t) into the part contributed by the selections of π made before t₀ and the part contributed by the selections made from t₀ onwards. As t → ∞, the first part goes to zero because T_π(t) → ∞ while its number of summands stays fixed. For the second part, condition on the arms chosen at the selection times: for each fixed a ∈ A_π, the rewards obtained when π chose a are i.i.d., so by the law of large numbers their average converges to μ(a) with probability 1; and by Poisson’s limit theorem in the law of large numbers [18, Chapter 11], the empirical frequency with which π chooses each arm a ∈ A_π at its selection times converges to φ_π(a) with probability 1, since each arm in A_π is played infinitely often. Combining the two limits via the triangle inequality, for any prescribed tolerance the deviation of M̄_π(t) from Σ_{a∈A_π} φ_π(a)μ(a) eventually stays within it, so M̄_π(t) → Σ_{a∈A_π} φ_π(a)μ(a) with probability 1 for every π ∈ Λ. Finally, because ε_t → 0, the probability that the algorithm selects the greedy strategy converges to one, and the result of the theorem follows.

We remark that this algorithm can be used for solving a bandit problem in a decomposed manner. Suppose that we partition A into nonempty subsets A_1, …, A_m that are pairwise disjoint and whose union is A. Choose any asymptotically optimal strategy and associate with each A_i a strategy π_i that runs it restricted to the arms in A_i, so that π_i converges to playing a best arm within A_i. Then by employing Λ = {π_1, …, π_m}, we have that the arm played by the algorithm converges to an optimal arm in A as t → ∞.

#### 4. A Numerical Example

For a proof-of-concept implementation of the approach, we consider three simple numerical examples.

We have Bernoulli reward distributions with K = 10, where the reward expectations of arms 1 through 10 are fixed in advance, the largest being 0.6. For any strategy involved with an ε_t-schedule in this section, including our algorithm, we used the schedule of the ε_t-greedy strategy, ε_t = min{1, cK/(d²t)}, based on [8, Theorem 3], where d ∈ (0, 1) and c > 0 is a constant chosen to control the degree of exploration and exploitation.

For the first case, A is partitioned into three subsets A_1, A_2, and A_3, and the ε_t-greedy strategy associated with each A_i, playing only the arms in A_i, corresponds to π_i (cf. the remark given at the end of Section 3). Thus we have Λ = {π_1, π_2, π_3}, and trivially the best strategy is the one whose subset contains the optimal arm. The second case considers combining two pursuit learning automata (PLA) algorithms [19] with different learning rates designed for solving stochastic optimization problems. Even though PLA was not designed specifically for solving multiarmed bandit problems, PLA guarantees “ε-optimality” and can be cast into a strategy for playing the bandit. (Roughly, the error probability of choosing nonoptimal solutions is bounded by ε.) The first PLA strategy uses the learning rate corresponding to the parameter setting in [19, Theorem 3.1] for a theoretical performance guarantee, and the second one uses the tuned learning rate of 0.002, which achieves the best performance among the various rates we tested for the above distribution. These two PLAs constitute Λ, and the PLA with learning rate 0.002 is taken as the best strategy.
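
To make concrete how a PLA can be cast as a bandit strategy, the following is a rough sketch of the standard pursuit update (our own illustration under the usual formulation, not the exact variant or tuned parameters used in [19]): a probability vector over the arms is moved, by a step of size λ, toward the unit vector of the arm with the best current sample mean:

```python
import random

def pursuit(reward_fns, horizon, lam=0.005, rng=None):
    """Sketch of a pursuit learning automaton (cf. [19]): sample an arm from
    a probability vector p, update the played arm's sample-mean estimate,
    then move p a step of size lam toward the empirically best arm."""
    rng = rng or random.Random(0)
    K = len(reward_fns)
    p = [1.0 / K] * K          # action probability vector
    counts = [0] * K
    means = [0.0] * K
    for _ in range(horizon):
        a = rng.choices(range(K), weights=p)[0]
        x = reward_fns[a](rng)
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]
        best = max(range(K), key=lambda i: means[i])
        # pursue the current best arm: p <- (1 - lam) p + lam e_best
        p = [(1 - lam) * pi + (lam if i == best else 0.0)
             for i, pi in enumerate(p)]
    return p, counts
```

The probability vector concentrates on the empirically best arm at a rate governed by λ, which is why the learning rate trades off the theoretical guarantee against empirical convergence speed, as in the two PLA strategies combined above.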

Figures 1–3 show the percentage of selections of the best strategy and of plays of the optimal arm for the (tuned) algorithm for each case, along with those of (tuned) Exp4, NEXP, and EEE, respectively. (The percentage for the optimal arm for the second case is not shown due to the space constraint.) The tuned strategy corresponds to the best empirical parameter setting we obtained. The performances of all tested strategies were obtained by averaging over 1000 different runs, where each strategy is followed over 100,000 time steps in a single run. For the first case, we set the exploration-control constant to 0.15 for all the combined strategies, for our algorithm, and for EEE. (The value of 0.15 was chosen for a reasonable allocation of exploration and exploitation.) The tuned version of our algorithm uses 0.075 and the tuned EEE uses 0.07. Exp4 uses 0.0116 for its exploration parameter, which was obtained by the parameter setting in [10, Corollary 3.2], and the tuned Exp4 uses 0.03. For the second case, we again set the constant to 0.15; the tuned version of our algorithm uses a separately tuned value, the tuned EEE uses 0.04, and (tuned) Exp4 uses the same values as in the first case. As we can see from Figures 1–3, the (tuned) algorithm successfully combines the strategies in Λ, achieving asymptotic optimality with respect to the best strategy for each case. In particular, its convergence rate is much faster than that of (even tuned) Exp4, NEXP, and EEE for each case. Note that by tuning {ε_t} the algorithm achieves a better performance, which shows a sensitivity to the ε_t-schedule similar to that of the ε_t-greedy strategy [8].