Abstract

This brief paper provides a simple algorithm that, at each time step, selects one strategy from a given set of strategies for a stochastic multiarmed bandit problem and plays the arm chosen by that strategy. The algorithm follows the idea of the probabilistic ε-switching in the ε-greedy strategy and is asymptotically optimal in the sense that the selected strategy converges to the best strategy in the set under some conditions on the strategies in the set and on the sequence of ε-values.

1. Introduction

This paper considers the stochastic non-Bayesian multiarmed bandit (MAB) problem, in which a player has to decide which of the available arms to play at each time step so as to maximize the sum of rewards earned over a sequence of plays. When played, each arm provides a random reward from an unknown distribution specific to that arm. The problem models the well-known trade-off between “exploration” and “exploitation” in sequential learning: the player needs to obtain new knowledge (exploration) and at the same time optimize her decisions based on existing knowledge (exploitation), and she attempts to balance these competing tasks in order to achieve the goal. Many practical problems, for example, in networking [1, 2], games [3], and prediction [4], as well as problems such as clinical trials and ad placement on the Internet (see, e.g., [1, 5, 6] and the references therein), have been studied with (properly extended) models of MAB problems.

Specifically, we consider a stochastic multiarmed bandit problem with a finite set of arms, one of which must be played at each time step. When an arm is played at time t, the player obtains a bounded sample reward drawn from an unknown distribution associated with that arm, whose expectation and variance are unknown. We define a strategy as a sequence of mappings, one per time step, each of which maps the history of past plays and rewards to a distribution over an arbitrarily given nonempty subset of the arms. Given a particular history obtained by following a strategy up to time t, the strategy plays each arm in its subset with the probability that the corresponding distribution assigns to it at time t. We refer to the set of all such strategies as the strategy space.
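For illustration, a strategy in the above sense can be represented programmatically as an object that maps the observed history to a distribution over its subset of arms. The following minimal Python sketch (class and method names are ours, not the paper's notation) is also reused in the later sketches.

```python
import random
from abc import ABC, abstractmethod

class Strategy(ABC):
    """A strategy maps the observed history of (arm, reward) pairs
    to a probability distribution over its subset of arms."""

    def __init__(self, arms):
        self.arms = list(arms)      # the subset of arms this strategy plays over
        self.history = []           # list of (arm, reward) pairs observed so far

    @abstractmethod
    def distribution(self, t):
        """Return a list of probabilities, one per arm in self.arms, at time t."""

    def select_arm(self, t):
        probs = self.distribution(t)
        return random.choices(self.arms, weights=probs, k=1)[0]

    def update(self, arm, reward):
        self.history.append((arm, reward))

class UniformStrategy(Strategy):
    """Example: always plays uniformly at random over its arms."""
    def distribution(self, t):
        return [1.0 / len(self.arms)] * len(self.arms)
```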

Let a random variable denote the arm selected by a strategy at time t, and let the distribution over the arms produced by the strategy at time t be such that the probability that the strategy selects a given arm at time t equals the probability that this distribution assigns to that arm. We assume that the sample rewards are independent across plays and that, for any fixed arm, the rewards obtained by playing that arm are identically distributed.

Let the largest expected reward over the arms be the optimal value. For a given strategy, if the expected reward of the arm it plays at time t converges to the optimal value as t goes to infinity, then we say that the strategy is asymptotically optimal. The notion of asymptotic optimality was introduced by Robbins [5], who presented a strategy that achieves the optimality for the Bernoulli case in which, in a single play, each arm produces a reward of 1 or 0 with an unknown arm-specific probability. Bather [7] considered the same Bernoulli problem and established an asymptotically optimal index-based strategy: at each time, it selects an arm whose index, formed by adding a perturbation term to the arm's sample-average reward, is the largest, where the sample average is taken over the reward samples obtained by playing the arm, and the perturbation term is built from the number of times the arm has been played so far, a sequence of strictly positive constants, and i.i.d. positive unbounded random variables. The idea is to ensure that each arm is played infinitely often by adding small perturbations to the sample averages and to make the effect of the perturbations vanish as time increases. The well-known asymptotically optimal ε-greedy strategy [8] follows exactly the same idea: at time t, with probability ε_t, it selects an arm uniformly at random and, with probability 1 − ε_t, it selects an arm with the highest sample-average reward.
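A minimal sketch of the ε-greedy selection rule just described is given below; the decreasing schedule min{1, c/t} and the constant c are illustrative choices, not values prescribed by [8].

```python
import random

def epsilon_greedy_select(counts, sums, t, c=5.0):
    """epsilon_t-greedy arm selection (sketch).

    counts[i] -- number of times arm i has been played so far
    sums[i]   -- total reward obtained from arm i so far
    t         -- current time step (1-based)
    c         -- exploration constant (illustrative choice)
    """
    n_arms = len(counts)
    eps_t = min(1.0, c / t)                      # assumed decreasing schedule
    if random.random() < eps_t:
        return random.randrange(n_arms)          # explore: uniform over arms
    means = [s / k if k > 0 else 0.0 for s, k in zip(sums, counts)]
    return max(range(n_arms), key=lambda i: means[i])   # exploit: greedy arm
```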

This brief paper provides a randomized algorithm which follows the spirit of the ε-greedy strategy for combining multiple strategies in a given finite nonempty strategy set. At each time t, we use probabilistic ε-switching either to select a strategy uniformly at random from the set or to select the strategy with the highest sample average of the rewards it has obtained so far by playing the bandit. Once a strategy is selected, the arm chosen by that strategy is played. Analogous to the case of the ε-greedy strategy, the algorithm is asymptotically optimal in the sense that the selected strategy converges to the “best” strategy in the set under some conditions on the strategies in the set and on the ε-schedule.

In the following, we briefly summarize the works in the literature most relevant to the results of the present paper. A seminal work by Gittins and Jones [9] provides an optimal policy (or allocation index rule) for maximizing the discounted reward over an infinite horizon when the rewards are given by Markov chains whose statistics are perfectly known in advance. Note that our model does not consider discounting of the rewards and assumes that the relevant statistics are unknown.

Auer et al. [10] presented an algorithm, called Exp4, which combines multiple strategies (experts) in a nonstochastic bandit setting. In the nonstochastic MAB, it is assumed that each arm is initially assigned an arbitrary and unknown sequence of rewards, one for each time step; in other words, the rewards obtained by playing a specific arm are predetermined. In Exp4, the “uniform expert,” which always selects an arm uniformly at random, needs to be included in the expert set. At each time, Exp4 maintains a weight for each expert, mixes the experts' advice distributions over the arms according to the normalized weights together with a uniform-exploration term, and plays an arm according to the resulting distribution. The weight of each expert is then updated multiplicatively from an importance-weighted estimate of the observed sample reward. A finite-time upper bound (but no lower bound) on the expected “regret” was analyzed against a best strategy that achieves the optimal total reward with respect to the fixed sample reward sequences for the arms (see [11] for a high-probability version within a contextual bandit setting and [12] for a simplified derivation).
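For concreteness, a compact sketch of one round of Exp4 in this standard formulation is given below; the variable names and the exploration parameter gamma are illustrative, and the sketch is not taken verbatim from [10].

```python
import math
import random

def exp4_step(weights, advice, gamma, draw_reward):
    """One round of Exp4 (sketch).

    weights     -- current positive weights, one per expert
    advice      -- advice[j][i]: probability expert j assigns to arm i (rows sum to 1)
    gamma       -- exploration parameter in (0, 1]
    draw_reward -- function mapping the played arm index to a reward in [0, 1]
    """
    n_experts, n_arms = len(advice), len(advice[0])
    total_w = sum(weights)
    # Mix expert advice according to normalized weights and add uniform exploration.
    probs = [(1 - gamma) * sum(weights[j] * advice[j][i] for j in range(n_experts)) / total_w
             + gamma / n_arms
             for i in range(n_arms)]
    arm = random.choices(range(n_arms), weights=probs, k=1)[0]
    reward = draw_reward(arm)
    # Importance-weighted reward estimate for the played arm only.
    x_hat = [reward / probs[i] if i == arm else 0.0 for i in range(n_arms)]
    # Each expert's estimated gain and multiplicative weight update.
    new_weights = []
    for j in range(n_experts):
        y_hat = sum(advice[j][i] * x_hat[i] for i in range(n_arms))
        new_weights.append(weights[j] * math.exp(gamma * y_hat / n_arms))
    return new_weights, arm, reward
```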

McMahan and Streeter [13] proposed a variant of Exp4, called NEXP, in the same nonstochastic setting. NEXP needs to solve a linear program (LP) at each time step to obtain a distribution over the arms that offers a “locally optimal” trade-off between exploration and exploitation. Although some improvement over Exp4 was shown, this comes at the expense of solving an LP at every time step.

de Farias and Megiddo [14] presented another variant of Exp4, called EEE, within a “reactive” setting. In this setting, at each time a player chooses an arm and an environment chooses its state, which is unknown to the player. The reward obtained by the player depends on both the chosen arm and the current state but is not necessarily determined by a distribution specific to the chosen arm and the current state. An example of this setting is playing a repeated game against another player. When an expert is selected by EEE for a phase, it is followed for multiple time steps during that phase, rather than a different expert being picked at each time step, and the average reward accumulated over the phase is recorded. The current best strategy with respect to the estimated average rewards, or a random strategy, is selected at each phase with a certain control rule for exploration and exploitation, which is similar to the ε-schedule we consider here. (See also the survey section in [15] for expert-combining algorithms in different scenarios.)
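The phase-based selection just described can be sketched as follows; this is only our reading of the description, with the phase length, the exploration probability, and all names chosen for illustration rather than taken from [14].

```python
import random

def eee_style_run(experts, n_phases, phase_len, explore_prob, play):
    """Phase-based expert selection in the spirit of EEE (sketch).

    experts      -- list of strategy objects with select_arm(t) and update(arm, reward)
    n_phases     -- number of phases to run
    phase_len    -- time steps an expert is followed per phase (illustrative: fixed)
    explore_prob -- probability of picking a random expert for a phase (illustrative)
    play         -- function mapping an arm to a sample reward
    """
    avg_reward = [0.0] * len(experts)   # per-expert average phase reward
    n_selected = [0] * len(experts)     # phases in which each expert was followed
    t = 1
    for _ in range(n_phases):
        if random.random() < explore_prob or all(n == 0 for n in n_selected):
            j = random.randrange(len(experts))                              # explore
        else:
            j = max(range(len(experts)), key=lambda k: avg_reward[k])       # exploit
        phase_total = 0.0
        for _ in range(phase_len):      # follow expert j for the whole phase
            arm = experts[j].select_arm(t)
            r = play(arm)
            experts[j].update(arm, r)
            phase_total += r
            t += 1
        n_selected[j] += 1
        avg_reward[j] += (phase_total / phase_len - avg_reward[j]) / n_selected[j]
    return avg_reward
```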

Because these representative approaches combine multiple experts, we compare them with our algorithm after adapting them to our setting (cf. Section 4). More importantly, however, the notion of the best strategy in the nonstochastic or reactive settings does not directly apply to the stochastic setting. Establishing some kind of asymptotic optimality for Exp4 or its variants with respect to a properly defined best strategy (even after adapting Exp4 as a strategy in the given set) is an open problem. In fact, to the authors' best knowledge, there seems to be no notable work yet that studies asymptotic optimality in combining multiple strategies for stochastic multiarmed bandit problems.

Finally, we stress that this paper focuses on asymptotic optimality as the performance measure, also termed “instantaneous regret” [8], rather than on the “expected regret” typically considered in the (recent) bandit-theory literature (see, e.g., [12] for a survey). It is worthwhile to note that instantaneous regret is a stronger measure of convergence than expected regret [8].

3. Algorithm and Convergence

Assume that a finite nonempty subset of the strategy space is given. Once a strategy is selected by the algorithm at time t, the bandit is played with the arm selected by that strategy and a sample reward is obtained. Let a random variable denote the strategy selected by the algorithm at time t, and (abusing notation slightly) let the selection count of a strategy be the number of times it has been selected by the algorithm during the first t time steps, obtained by summing the indicator of its selection over the time steps. For each strategy, the algorithm maintains the sample average of the rewards obtained at the times of its selections, and this sample average serves as the algorithm's estimate of the strategy's value. We formally describe the algorithm below.

The Algorithm
(1) Initialization: select the ε-schedule and initialize the selection count and the sample-average reward of every strategy in the given set. Set t = 1.
(2) Loop:
(2.1) obtain a strategy with the highest sample-average reward so far (ties broken arbitrarily);
(2.2) with probability 1 − ε_t, select the strategy obtained in (2.1) and, with probability ε_t, select a strategy uniformly at random from the given set; set the selected strategy to be the one followed at time t; select an arm according to the selected strategy and obtain a sample reward by playing it;
(2.3) update the selection count and the sample-average reward of the selected strategy and set t ← t + 1.
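A minimal runnable sketch of this loop, reusing the illustrative Strategy interface sketched in the introduction, is given below; the ε-schedule is passed in as a function, and all names are ours.

```python
import random

def combine_strategies(strategies, eps, horizon, play):
    """epsilon-switching over a finite set of strategies (sketch).

    strategies -- list of strategy objects with select_arm(t) and update(arm, reward)
    eps        -- function t -> epsilon_t in [0, 1]
    horizon    -- number of time steps to run
    play       -- function mapping an arm to a sample reward from the bandit
    """
    counts = [0] * len(strategies)        # times each strategy has been selected
    averages = [0.0] * len(strategies)    # sample-average reward of each strategy
    for t in range(1, horizon + 1):
        best = max(range(len(strategies)), key=lambda j: averages[j])   # step (2.1)
        if random.random() < eps(t):                                    # step (2.2)
            j = random.randrange(len(strategies))   # explore: uniform over strategies
        else:
            j = best                                # exploit: current best strategy
        arm = strategies[j].select_arm(t)
        reward = play(arm)
        strategies[j].update(arm, reward)
        counts[j] += 1                                                  # step (2.3)
        averages[j] += (reward - averages[j]) / counts[j]
    return averages, counts
```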

Note that the algorithm above involves a general schedule of ε-values. By setting the ε-schedule properly, the algorithm subsumes the schedules used in the ε-greedy, ε-first, and ε-decreasing strategies [16, 17]. In particular, as a special case, if the given set consists of strategies that each play a single fixed arm (one strategy per arm) and the ε-schedule is a constant ε for all t, then the algorithm degenerates to the ε-greedy strategy. As shown in the experimental results in [16, 17], the performance of a tuned ε-greedy strategy is no worse than (or very close to) that of the ε-first, ε-decreasing, and related strategies, even when the schedule (and the relevant parameters) of each strategy is tuned. However, because these schedules are usually tuned heuristically and ε-greedy uses a constant value of ε, employing such schedules in the algorithm does not necessarily guarantee asymptotic optimality. Furthermore, it is very difficult to tune the value of ε in advance. The theorem below establishes general conditions for asymptotic optimality of the algorithm with respect to a properly defined best strategy in the given set. It states that if each strategy in the set is selected infinitely often by the algorithm, each strategy selects every arm in its subset infinitely often, each strategy's action-selection distribution converges to a stationary distribution, and the selection of strategies becomes greedy in the limit, then the strategy selected by the algorithm converges to the best strategy in the set.
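For illustration, the following sketch collects ε-schedules in the spirit of the ε-greedy, ε-first, and ε-decreasing strategies that could be plugged into the eps parameter of the sketch above; the constants are illustrative, not values used in the paper.

```python
def constant_eps(eps=0.1):
    """epsilon-greedy style: a constant exploration probability."""
    return lambda t: eps

def eps_first(explore_steps=1000):
    """epsilon-first style: pure exploration for a fixed prefix, then pure exploitation."""
    return lambda t: 1.0 if t <= explore_steps else 0.0

def eps_decreasing(c=5.0):
    """epsilon-decreasing style: epsilon_t shrinks with t (here as min{1, c/t})."""
    return lambda t: min(1.0, c / t)
```

Of these, only a schedule with ε_t going to zero while its partial sums diverge (such as the decreasing one) matches the conditions of Theorem 1 below.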

Theorem 1. Given a finite nonempty strategy set, consider the algorithm above. Suppose that ε_t goes to zero and that the sum of ε_t over all t diverges (so that every strategy in the set is selected infinitely often), that each strategy selects every arm in its subset infinitely often, and that each strategy's action-selection distribution converges to a stationary distribution over its arms. Then the strategy selected by the algorithm converges to the best strategy in the set; that is, the probability that the algorithm selects a strategy maximizing the expected reward under its stationary action-selection distribution goes to one as t goes to infinity.
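Under notation we introduce only for illustration (ε_t for the switching schedule, Λ for the given strategy set, π_λ for the stationary action-selection distribution of a strategy λ, and μ_a for the expected reward of arm a), the conditions and the conclusion of the theorem can be written roughly as follows.

```latex
% Illustrative formalization; the notation is ours, not the paper's.
% Conditions on the switching schedule:
\varepsilon_t \to 0, \qquad \sum_{t \ge 1} \varepsilon_t = \infty.
% Each strategy's action-selection distribution converges to a stationary one:
\Pr\{\lambda \text{ plays arm } a \text{ at its } k\text{th selection}\}
  \;\longrightarrow\; \pi_\lambda(a) \qquad (k \to \infty), \ \forall a, \ \forall \lambda \in \Lambda.
% Conclusion (asymptotic optimality over strategies):
\Pr\Bigl\{\text{the strategy selected at time } t \in
  \arg\max_{\lambda \in \Lambda} \textstyle\sum_a \pi_\lambda(a)\,\mu_a\Bigr\}
  \;\longrightarrow\; 1 \qquad (t \to \infty).
```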

Proof. Because each strategy's action-selection distribution converges by assumption, there exists a time after which every strategy's distribution is arbitrarily close to its stationary limit. Fix any strategy in the given set and decompose the sample average of the rewards it has obtained into the part accumulated up to that time and the part accumulated afterwards. As the number of selections of the strategy grows, the first part goes to zero, because it contains only finitely many bounded terms while the strategy is selected infinitely often. The second part can be written as the product of the fraction of the strategy's selections made after that time and the corresponding average reward; the fraction goes to one, so it remains to establish the convergence of the average.
Grouping the rewards in that average by the arm that produced them, the average becomes a sum over arms of the empirical frequency of each arm multiplied by the average of the rewards obtained at that arm. The latter converges to the arm's expected reward by the law of large numbers, since the rewards at a fixed arm are i.i.d. The former converges to the arm's stationary selection probability: by Poisson's limit theorem [18, Chapter 11] in the law of large numbers, the empirical frequency of an arm concentrates around the average of the (converging) selection probabilities, which in turn approaches the stationary probability. Combining the two convergences with the triangle inequality shows that, with probability 1, the sample-average reward of the strategy converges to its expected reward under its stationary distribution. Finally, because ε_t goes to zero, the algorithm selects a strategy with the highest sample-average reward with probability approaching one, and the result of the theorem follows.

We remark that this algorithm can be used for solving a bandit problem in a decomposed manner. Suppose that we partition the set of arms into disjoint nonempty subsets whose union is the whole set. For each subset, choose any asymptotically optimal strategy restricted to that subset, so that the expected reward of the arm it plays converges to the best expected reward within the subset. Then, by employing the algorithm with the set consisting of these per-subset strategies, the expected reward of the arm played converges to the best expected reward over all arms.
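A small usage sketch of this decomposition, reusing the illustrative Strategy and combine_strategies sketches above, is given below; the per-subset strategy is an ε-greedy restricted to its subset, and the arm means, the partition, and all constants are illustrative, not the paper's experimental values.

```python
import random

class SubsetEpsGreedy(Strategy):
    """epsilon_t-greedy restricted to this strategy's subset of arms (illustrative)."""
    def __init__(self, arms, c=5.0):
        super().__init__(arms)
        self.c = c
        self.counts = {a: 0 for a in self.arms}
        self.sums = {a: 0.0 for a in self.arms}

    def distribution(self, t):
        eps_t = min(1.0, self.c / max(t, 1))
        means = {a: (self.sums[a] / self.counts[a]) if self.counts[a] else 0.0
                 for a in self.arms}
        best = max(self.arms, key=lambda a: means[a])
        # Mix uniform exploration over the subset with greedy exploitation.
        return [eps_t / len(self.arms) + (1 - eps_t) * (1.0 if a == best else 0.0)
                for a in self.arms]

    def update(self, arm, reward):
        super().update(arm, reward)
        self.counts[arm] += 1
        self.sums[arm] += reward

# Partition ten Bernoulli arms into three subsets and combine the per-subset strategies.
means = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.52, 0.55, 0.58, 0.6]   # illustrative values
play = lambda a: 1.0 if random.random() < means[a] else 0.0
partition = [range(0, 3), range(3, 6), range(6, 10)]              # illustrative partition
strategies = [SubsetEpsGreedy(list(part)) for part in partition]
averages, counts = combine_strategies(strategies, eps=lambda t: min(1.0, 5.0 / t),
                                      horizon=100_000, play=play)
```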

4. A Numerical Example

For a proof-of-concept implementation of the approach, we consider three simple numerical examples.

We have Bernoulli reward distributions over ten arms, where the reward expectations of arms 1 through 10 are fixed values, the last being 0.6. For any strategy in this section that involves an ε-schedule, including the proposed algorithm, we used the ε-schedule of the ε-greedy strategy based on [8, Theorem 3], where the remaining constant is chosen to control the degree of exploration and exploitation.
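For reference, our reading of the ε-schedule of [8, Theorem 3] has the form sketched below; the parameter names and default values are illustrative.

```python
def eps_greedy_schedule(c=0.1, d=0.1, n_arms=10):
    """epsilon_t schedule of the form used in [8, Theorem 3] (our reading):
    epsilon_t = min{1, c*K / (d^2 * t)}, with c > 0 and 0 < d < 1."""
    return lambda t: min(1.0, c * n_arms / (d * d * t))
```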

For the first case, the arm set is partitioned into disjoint subsets, and the ε-greedy strategy associated with each subset, playing only the arms in that subset, constitutes one strategy of the combined set (cf. the remark given at the end of Section 3). The best strategy is then trivially the one whose subset contains the optimal arm. The second case considers combining two pursuit learning automata (PLA) algorithms [19] with different learning rates, designed for solving stochastic optimization problems. Even though PLA was not designed specifically for solving multiarmed bandit problems, PLA guarantees “ε-optimality” and can be cast as a strategy for playing the bandit. (Roughly, the error probability of choosing nonoptimal solutions is bounded by ε.) The first PLA strategy uses a learning rate corresponding to the parameter setting in [19, Theorem 3.1] for a theoretical performance guarantee, and the second one uses the tuned learning rate of 0.002, which achieves the best performance among the various rates we tested for the above distribution. These two PLAs form the combined set, and the PLA with the learning rate of 0.002 is taken as the best strategy.

In Figures 1–3, we show the percentage of selections of the best strategy and the percentage of plays of the optimal arm for the (tuned) proposed algorithm in each case, along with those of (tuned) Exp4, NEXP, and EEE, respectively. (The percentage for the optimal arm for the second case is not shown due to the space constraint.) The tuned version of a strategy corresponds to the best empirical parameter setting we obtained. The performance of every tested strategy was obtained by averaging over 1000 different runs, where each strategy is followed over 100,000 time steps in a single run. For the first case, we set the exploration constant to 0.15 for the proposed algorithm and for EEE. (The value of 0.15 was chosen for a reasonable allocation of exploration and exploitation.) The tuned proposed algorithm uses 0.075 and the tuned EEE uses 0.07. Exp4 uses 0.0116 for its exploration parameter, which was obtained from the parameter setting in [10, Corollary 3.2], and the tuned Exp4 uses 0.03. For the second case, we again set the constant to 0.15. The tuned proposed algorithm and the tuned EEE use their best empirical constants (0.04 for EEE), and (tuned) Exp4 uses the same values as in the first case. As we can see from Figures 1–3, the (tuned) proposed algorithm successfully combines the strategies, achieving asymptotic optimality with respect to the best strategy for each case. In particular, its convergence rate is much faster than that of (even tuned) Exp4, NEXP, and EEE in each case. Note that by tuning the schedule constant, the algorithm achieves a better performance, which shows a sensitivity to the ε-schedule similar to that of the ε-greedy strategy [8].

The third case considers combining two different strategies, the ε-greedy strategy and UCB1-tuned [8], and was chosen to show some robustness of the proposed algorithm. Here the ε-greedy strategy uses the value of 0.3 for its ε-schedule. (Tuning the ε-greedy strategy would make it more competitive with UCB1-tuned; this particular value was chosen for illustration purposes only.) Figure 4 shows the average regret of each tested strategy over the time steps for two different distributions, calculated for each run by summing, over the arms, the gap between the optimal expected reward and the arm's expected reward multiplied by the number of times the arm has been played during the run, and then averaging over the runs. The first distribution is the same as the one used for the first and the second cases. The second distribution again consists of Bernoulli reward distributions over ten arms, but the reward expectations are given by a fixed value for the optimal arm 1 and 0.7 for all the remaining arms. For the first distribution, UCB1-tuned's regret is much smaller than that of the ε-greedy strategy, but for the second distribution, the ε-greedy strategy is slightly better than UCB1-tuned. In sum, we have a case where which of the two strategies performs better in terms of the average regret is distribution-dependent, even though both achieve asymptotic optimality empirically (not shown here). By combining the two strategies with the proposed algorithm, we obtain a reasonably distribution-independent algorithm for playing the bandit. As we can see from Figure 4, the tuned proposed algorithm shows a robust performance across the two distributions. For both distributions, the performances of the tuned Exp4, NEXP, and EEE are not good.
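A minimal sketch of this average-regret computation, assuming the per-run play counts of each arm are recorded, is given below; all names are illustrative.

```python
def average_regret(play_counts_per_run, means):
    """Average regret over runs: for each run, sum over arms of
    (optimal mean - arm mean) * (number of plays of that arm), then average over runs."""
    mu_star = max(means)
    regrets = [sum((mu_star - means[a]) * counts[a] for a in range(len(means)))
               for counts in play_counts_per_run]
    return sum(regrets) / len(regrets)
```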

5. Concluding Remarks

In this paper, we provided a randomized algorithm for playing a given stochastic MAB when a finite nonempty strategy set is available. Following the spirit of the ε-greedy strategy, the algorithm combines the strategies in the set and finds the best strategy in the limit. Specifically, at each time step, we use probabilistic ε-switching either to select a strategy uniformly at random from the set or to select the strategy with the highest sample average of the rewards it has obtained so far by playing the bandit. Once a strategy is selected, the arm chosen by that strategy is played. We showed that the algorithm is asymptotically optimal in the sense that the selected strategy converges to the best strategy in the set under some conditions on the strategies and on the ε-schedule, and we illustrated the result by simulation studies on some example problems.

If each strategy in the set is stationary (as opposed to the general case studied in the present paper), in that its action-selection distribution is the same fixed distribution over the arms at every time step, then each strategy can itself be viewed as an arm: playing according to the strategy provides a sample reward whose unknown expectation is the expected reward under that fixed distribution. Therefore, in this case any asymptotically optimal bandit strategy over these "strategy arms" can be used in place of the proposed algorithm for selecting a strategy at each time, achieving asymptotic optimality with respect to a strategy attaining the highest such expectation.

Even though we showed convergence to the best strategy in the set, the convergence rate has not been discussed. In the statement of Theorem 1, we assumed only that each strategy's action-selection distribution converges to a stationary distribution; in other words, the rate at which each strategy in the set converges to its stationary behavior is assumed to be unknown. Because the convergence rate of the proposed algorithm would need to be expressed in terms of these rates, knowing the convergence rate of each strategy in the set is crucial. It would be a good direction for future work to analyze the convergence rate of the algorithm, for example, by extending the result of Theorem 3 in [8] for the ε-greedy strategy to our setting, under conditions on the rates at which the strategies converge to their stationary behaviors.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.