Efficient Optimal Learning for Contextual Bandits

Miroslav Dudik [email protected]   Daniel Hsu [email protected]   John Langford [email protected]   Satyen Kale [email protected]   Lev Reyzin [email protected]   Nikos Karampatziakis [email protected]   Tong Zhang [email protected]

Abstract

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost-sensitive classification learner as an oracle and has a running time polylog(N), where N is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay, as in all previous work.

1 INTRODUCTION

The contextual bandit setting consists of the following loop repeated indefinitely:

1. The world presents context information as features x.
2. The learning algorithm chooses an action a from K possible actions.
3. The world presents a reward r for the action.

The key difference between the contextual bandit setting and standard supervised learning is that only the reward of the chosen action is revealed. For example, after always choosing the same action several times in a row, the feedback given provides almost no basis to prefer the chosen action over another action. In essence, the contextual bandit setting captures the difficulty of exploration while avoiding the difficulty of credit assignment as in more general reinforcement learning settings.

The contextual bandit setting is a half-way point between standard supervised learning and full-scale reinforcement learning, where it appears possible to construct algorithms with convergence rate guarantees similar to supervised learning. Many natural settings satisfy this half-way point, motivating the investigation of contextual bandit learning. For example, the problem of choosing interesting news articles or ads for users by internet companies can be naturally modeled as a contextual bandit setting. In the medical domain, where discrete treatments are tested before approval, the process of deciding which patients are eligible for a treatment takes contexts into account. More generally, we can imagine that in a future with personalized medicine, new treatments are essentially equivalent to new actions in a contextual bandit setting.

In the i.i.d. setting, the world draws a pair (x, ~r) consisting of a context and a reward vector from some unknown distribution D, revealing x in Step 1, but only the reward r(a) of the chosen action a in Step 3. Given a set of policies Π = {π : X → A}, the goal is to create an algorithm for Step 2 which competes with the set of policies. We measure our success by comparing the algorithm's cumulative reward to the expected cumulative reward of the best policy in the set. The difference of the two is called regret. All existing algorithms for this setting either achieve a suboptimal regret (Langford and Zhang, 2007) or require computation linear in the number of policies (Auer et al., 2002b; Beygelzimer et al., 2011). In unstructured policy spaces, this computational complexity is the best one can hope for.
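To make the interaction protocol and the regret measure concrete, here is a small simulation sketch in Python; the toy environment, the threshold policy class, and the uniformly random learner are illustrative stand-ins only, not part of the algorithms developed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, d = 4, 10_000, 5                      # actions, rounds, context dimension

def draw_pair():
    """One i.i.d. draw (x, ~r) from an unknown distribution D (toy example)."""
    x = rng.normal(size=d)
    theta = np.arange(1, K + 1) / K          # hidden structure, unknown to the learner
    r = np.clip(0.5 + 0.1 * theta * x[0] + 0.05 * rng.normal(size=K), 0.0, 1.0)
    return x, r

# A tiny illustrative policy class Pi: each policy thresholds one coordinate of x.
policies = [(j, a0, a1) for j in range(d) for a0 in range(K) for a1 in range(K)]

def act(policy, x):
    j, a0, a1 = policy
    return a0 if x[j] <= 0 else a1

learner_reward = 0.0
policy_reward = np.zeros(len(policies))      # full-reward bookkeeping, used only to evaluate regret
for t in range(T):
    x, r = draw_pair()                       # world reveals x ...
    a = rng.integers(K)                      # ... learner picks an action (here: uniformly at random) ...
    learner_reward += r[a]                   # ... and observes only r[a], not the whole vector ~r
    policy_reward += r[[act(p, x) for p in policies]]

regret = policy_reward.max() - learner_reward   # empirical analogue of sum_t (eta_D(pi_max) - r_t)
print(f"Regret of the uniform-exploration learner over T={T} rounds: {regret:.1f}")
```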
On the other hand, in the case where the rewards of all actions are revealed, the problem is equivalent to cost-sensitive classification, and we know of algorithms to efficiently search the space of policies (classification rules) such as cost-sensitive logistic regression and support vector machines. In these cases, the space of classification rules is exponential in the number of features, but these problems can be efficiently solved using convex optimization. Our goal here is to efficiently solve the contextual bandit problem for similarly large policy spaces. We do this by reducing the contextual bandit problem to cost-sensitive classification. Given a supervised cost-sensitive learning algorithm as an oracle (Beygelzimer et al., 2009), our algorithm runs in time only polylog(N) while achieving regret $O(\sqrt{TK\ln N})$, where N is the number of possible policies (classification rules), K is the number of actions (classes), and T is the number of time steps. This efficiency is achieved in a modular way, so any future improvement in cost-sensitive learning immediately applies here.

1.1 PREVIOUS WORK AND MOTIVATION

All previous regret-optimal approaches are measure based: they work by updating a measure over policies, an operation which is linear in the number of policies. In contrast, regret guarantees scale only logarithmically in the number of policies. If not for the computational bottleneck, these regret guarantees imply that we could dramatically increase performance in contextual bandit settings using more expressive policies.

We overcome the computational bottleneck using an algorithm which works by creating cost-sensitive classification instances and calling an oracle to choose optimal policies. Actions are chosen based on the policies returned by the oracle rather than according to a measure over all policies. This is reminiscent of AdaBoost (Freund and Schapire, 1997), which creates weighted binary classification instances and calls a "weak learner" oracle to obtain classification rules. These classification rules are then combined into a final classifier with boosted accuracy. Just as AdaBoost converts a weak learner into a strong learner, our approach converts a cost-sensitive classification learner into an algorithm that solves the contextual bandit problem.

In a more difficult version of contextual bandits, an adversary chooses (x, ~r) given knowledge of the learning algorithm (but not any random numbers). All known regret-optimal solutions in the adversarial setting are variants of the EXP4 algorithm (Auer et al., 2002b).
EXP4 achieves the same regret rate as our algorithm: $O(\sqrt{KT\ln N})$, where T is the number of time steps, K is the number of actions available in each time step, and N is the number of policies. Why not use EXP4 in the i.i.d. setting? For example, it is known that the algorithm can be modified to succeed with high probability (Beygelzimer et al., 2011), and also for VC classes when the adversary is constrained to i.i.d. sampling. There are two central benefits that we hope to realize by directly assuming i.i.d. contexts and reward vectors.

1. Computational Tractability. Even when the reward vector is fully known, adversarial regrets scale as $O(\sqrt{T\ln N})$ while computation scales as O(N) in general. One attempt to get around this is the follow-the-perturbed-leader algorithm (Kalai and Vempala, 2005), which provides a computationally tractable solution in certain special-case structures. This algorithm has no mechanism for efficient application to arbitrary policy spaces, even given an efficient cost-sensitive classification oracle. An efficient cost-sensitive classification oracle has been shown effective in transductive settings (Kakade and Kalai, 2005). Aside from the drawback of requiring a transductive setting, the regret achieved there is substantially worse than for EXP4.

2. Improved Rates. When the world is not completely adversarial, it is possible to achieve substantially lower regrets than are possible with algorithms optimized for the adversarial setting. For example, in supervised learning, it is possible to obtain regrets scaling as O(log(T)) with a problem-dependent constant (Bartlett et al., 2007). When the feedback is delayed by τ rounds, lower bounds imply that the regret in the adversarial setting increases by a multiplicative $\sqrt{\tau}$, while in the i.i.d. setting it is possible to achieve an additive regret of τ (Langford et al., 2009).

In a direct i.i.d. setting, the previous-best approach using a cost-sensitive classification oracle was given by the ε-greedy and epoch-greedy algorithms (Langford and Zhang, 2007), which have a regret scaling as $O(T^{2/3})$ in the worst case. There have also been many special-case analyses. For example, the theory of the context-free setting is well understood (Lai and Robbins, 1985; Auer et al., 2002a; Even-Dar et al., 2006). Similarly, good algorithms exist when rewards are linear functions of features (Auer, 2002) or actions lie in a continuous space with the reward function sampled according to a Gaussian process (Srinivas et al., 2010).

1.2 WHAT WE PROVE

In Section 3 we state the PolicyElimination algorithm and prove the following regret bound for it.

Theorem 4. For all distributions D over (x, ~r) with K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of PolicyElimination (Algorithm 1) over T rounds is at most
$$16\sqrt{2TK\ln\frac{4T^2N}{\delta}}.$$

This result can be extended to deal with VC classes, as well as other special cases. It forms the simplest method we have of exhibiting the new analysis. The key new element of this algorithm is the identification of a distribution over actions which simultaneously achieves small expected regret and allows estimating the value of every policy with small variance. The existence of such a distribution is shown nonconstructively by a minimax argument.

PolicyElimination is computationally intractable and also requires exact knowledge of the context distribution (but not the reward distribution!). We show how to address these issues in Section 4 using an algorithm we call RandomizedUCB. Namely, we prove the following theorem.

Theorem 5. For all distributions D over (x, ~r) with K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of RandomizedUCB (Algorithm 2) over T rounds is at most
$$O\left(\sqrt{TK\log(TN/\delta)} + K\log(NK/\delta)\right).$$

RandomizedUCB's analysis is substantially more complex, with a key subroutine being an application of the ellipsoid algorithm with a cost-sensitive classification oracle (described in Section 5). RandomizedUCB does not assume knowledge of the context distribution, and instead works with the history of contexts it has observed. Modifying the proof for this empirical distribution requires a covering argument over the distributions over policies which uses the probabilistic method. The net result is an algorithm with a similar top-level analysis as PolicyElimination, but with running time only poly-logarithmic in the number of policies given a cost-sensitive classification oracle.
Theorem 11. In each time step t, RandomizedUCB makes at most O(poly(t, K, log(1/δ), log N)) calls to the cost-sensitive classification oracle, and requires additional O(poly(t, K, log N)) processing time.

Apart from a tractable algorithm, our analysis can be used to derive tighter regrets than would be possible in the adversarial setting. For example, in Section 6 we consider a common setting where reward feedback is delayed by τ rounds. A straightforward modification of PolicyElimination yields a regret with an additive term proportional to τ compared with the delay-free setting. Namely, we prove the following.

Theorem 12. For all distributions D over (x, ~r) with K actions, for all sets of N policies Π, and all delay intervals τ, with probability at least 1 − δ, the regret of DelayedPE (Algorithm 3) is at most
$$16\sqrt{2K\ln\frac{4T^2N}{\delta}}\left(\tau + \sqrt{T}\right).$$

We start next with precise settings and definitions.

2 SETTING AND DEFINITIONS

2.1 THE SETTING

Let A be the set of K actions, let X be the domain of contexts x, and let D be an arbitrary joint distribution on (x, ~r). We denote the marginal distribution of D over X by D_X. We let Π be a finite set of policies {π : X → A}, where each policy π, given a context x_t in round t, chooses the action π(x_t). The cardinality of Π is denoted by N. Let ~r_t ∈ [0, 1]^K be the vector of rewards, where r_t(a) is the reward of action a on round t.

In the i.i.d. setting, on each round t = 1, ..., T, the world chooses (x_t, ~r_t) i.i.d. according to D and reveals x_t to the learner. The learner, having access to Π, chooses action a_t ∈ {1, ..., K}. Then the world reveals the reward r_t(a_t) (which we call r_t for short) to the learner, and the interaction proceeds to the next round.

We consider two modes of accessing the set of policies Π. The first option is through the enumeration of all policies. This is impractical in general, but suffices for the illustrative purpose of our first algorithm. The second option is oracle access, through an argmax oracle, corresponding to a cost-sensitive learner:

Definition 1. For a set of policies Π, an argmax oracle (AMO for short) is an algorithm which, for any sequence {(x_{t'}, ~r_{t'})}_{t'=1,...,t} with x_{t'} ∈ X and ~r_{t'} ∈ R^K, computes
$$\arg\max_{\pi\in\Pi}\ \sum_{t'=1}^{t} r_{t'}(\pi(x_{t'})).$$

The reason why the above can be viewed as a cost-sensitive classification oracle is that the vectors of rewards ~r_{t'} can be interpreted as negative costs, and hence the policy returned by the AMO is the optimal cost-sensitive classifier on the given data.

2.2 EXPECTED AND EMPIRICAL REWARDS

Let the expected instantaneous reward of a policy π ∈ Π be denoted by
$$\eta_D(\pi) := \mathop{\mathbb{E}}_{(x,\vec{r})\sim D}[r(\pi(x))].$$
The best policy π_max ∈ Π is that which maximizes η_D(π). More formally,
$$\pi_{\max} := \mathop{\arg\max}_{\pi\in\Pi}\ \eta_D(\pi).$$

We define h_t to be the history that the learner has seen at time t. Specifically,
$$h_t = \bigcup_{t'=1,\ldots,t} (x_{t'}, a_{t'}, r_{t'}, p_{t'}),$$
where p_{t'} is the probability of the algorithm choosing action a_{t'} at time t'. Note that a_{t'} and p_{t'} are produced by the learner while x_{t'} and r_{t'} are produced by nature. We write x ∼ h to denote choosing x uniformly at random from the x's in history h.

Using the history of past actions and the probabilities with which they were taken, we can form an unbiased estimate of the policy value for any π ∈ Π:
$$\eta_t(\pi) := \frac{1}{t}\sum_{(x,a,r,p)\in h_t} \frac{r\,I(\pi(x)=a)}{p}.$$
The unbiasedness follows because
$$\mathop{\mathbb{E}}_{a\sim p}\left[\frac{r\,I(\pi(x)=a)}{p(a)}\right] = \sum_a p(a)\,\frac{r\,I(\pi(x)=a)}{p(a)} = r(\pi(x)).$$
The empirically best policy at time t is denoted
$$\pi_t := \mathop{\arg\max}_{\pi\in\Pi}\ \eta_t(\pi).$$
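The estimator η_t(π) and the empirical maximizer π_t are simple to compute from the logged tuples; the following sketch assumes the history is a list of (x, a, r, p) tuples and policies are callables, which is an illustrative convention rather than anything prescribed here.

```python
def eta_hat(policy, history):
    """Importance-weighted estimate eta_t(policy) from logged tuples (x, a, r, p),
    where p is the probability with which the logged action a was chosen."""
    t = len(history)
    return sum(r * (policy(x) == a) / p for (x, a, r, p) in history) / t

def empirically_best(policies, history):
    """pi_t: the policy in the class maximizing the importance-weighted estimate."""
    return max(policies, key=lambda pi: eta_hat(pi, history))
```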
2.3 REGRET

The goal of this work is to obtain a learner that has small regret relative to the expected performance of π_max over T rounds, which is
$$\sum_{t=1,\ldots,T} \bigl(\eta_D(\pi_{\max}) - r_t\bigr). \qquad (2.1)$$
We say that the regret of the learner over T rounds is bounded by ε with probability at least 1 − δ if
$$\Pr\left[\sum_{t=1,\ldots,T} \bigl(\eta_D(\pi_{\max}) - r_t\bigr) \le \varepsilon\right] \ge 1 - \delta,$$
where the probability is taken with respect to the random pairs (x_t, ~r_t) ∼ D for t = 1, ..., T, as well as any internal randomness used by the learner.

We can also define notions of regret and empirical regret for policies π. For all π ∈ Π, let
$$\Delta_D(\pi) = \eta_D(\pi_{\max}) - \eta_D(\pi), \qquad \Delta_t(\pi) = \eta_t(\pi_t) - \eta_t(\pi).$$

Our algorithms work by choosing distributions over policies, which in turn induce distributions over actions. For any distribution P over policies Π, let W_P(x, a) denote the induced conditional distribution over actions a given the context x:
$$W_P(x, a) := \sum_{\pi\in\Pi:\,\pi(x)=a} P(\pi). \qquad (2.2)$$
In general, we shall use W, W' and Z as conditional probability distributions over the actions A given contexts X, i.e., W : X × A → [0, 1] such that W(x, ·) is a probability distribution over A (and similarly for W' and Z). We shall think of W' as a smoothed version of W with a minimum action probability of μ (to be defined by the algorithm), such that
$$W'(x, a) = (1 - K\mu)\,W(x, a) + \mu.$$
Conditional distributions such as W (and W', Z, etc.) correspond to randomized policies. We define notions of true and empirical value and regret for them as follows:
$$\eta_D(W) := \mathop{\mathbb{E}}_{(x,\vec{r})\sim D}[\vec{r}\cdot W(x)], \qquad \eta_t(W) := \frac{1}{t}\sum_{(x,a,r,p)\in h_t} \frac{r\,W(x, a)}{p},$$
$$\Delta_D(W) := \eta_D(\pi_{\max}) - \eta_D(W), \qquad \Delta_t(W) := \eta_t(\pi_t) - \eta_t(W).$$

3 POLICY ELIMINATION

The basic ideas behind our approach are demonstrated in our first algorithm: PolicyElimination (Algorithm 1). The key step is Step 1, which finds a distribution over policies which induces low variance in the estimate of the value of all policies. Below we use a minimax theorem to show that such a distribution always exists. How to find this distribution is not specified here, but in Section 5 we develop a method based on the ellipsoid algorithm. Step 2 then projects this distribution onto a distribution over actions and applies smoothing. Finally, Step 5 eliminates the policies that have been determined to be suboptimal (with high probability).

Algorithm 1 PolicyElimination(Π, δ, K, D_X)

Let Π_0 = Π and history h_0 = ∅.
Define: δ_t := δ / (4N t²).
Define: $b_t := 2\sqrt{\frac{2K\ln(1/\delta_t)}{t}}$.
Define: $\mu_t := \min\left\{\frac{1}{2K},\ \sqrt{\frac{\ln(1/\delta_t)}{2Kt}}\right\}$.
For each timestep t = 1, ..., T, observe x_t and do:
1. Choose a distribution P_t over Π_{t−1} s.t. for all π ∈ Π_{t−1}:
$$\mathop{\mathbb{E}}_{x\sim D_X}\left[\frac{1}{(1 - K\mu_t)\,W_{P_t}(x, \pi(x)) + \mu_t}\right] \le 2K$$
2. Let W'_t(a) = (1 − Kμ_t) W_{P_t}(x_t, a) + μ_t for all a ∈ A
3. Choose a_t ∼ W'_t
4. Observe reward r_t
5. Let Π_t = {π ∈ Π_{t−1} : η_t(π) ≥ max_{π'∈Π_{t−1}} η_t(π') − 2b_t}
6. Let h_t = h_{t−1} ∪ (x_t, a_t, r_t, W'_t(a_t))
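The following Python sketch mirrors the structure of Algorithm 1 for a small, enumerable policy class. Step 1 is kept abstract as a `solve_step1` callback (the analysis below only establishes that a suitable distribution exists, and Section 5 returns to the question of computing one); the dependence on the known context distribution D_X is folded into that callback, and `draw_round` stands in for the environment. Both names are assumptions of this sketch.

```python
import math, random

def policy_elimination(policies, delta, K, T, draw_round, solve_step1):
    """Schematic sketch of Algorithm 1 (PolicyElimination).

    policies:    list of callables x -> action in {0, ..., K-1}
    draw_round:  () -> (x_t, r_vec) with r_vec a length-K reward vector; only r_vec[a_t] is used
    solve_step1: (surviving, mu_t) -> probabilities over `surviving` meeting the Step-1
                 variance constraint (left abstract here)
    """
    N = len(policies)
    surviving = list(policies)
    history = []                                               # tuples (x, a, r, p)

    def eta_hat(pi):                                           # importance-weighted estimate eta_t(pi)
        return sum(r * (pi(x) == a) / p for (x, a, r, p) in history) / len(history)

    for t in range(1, T + 1):
        delta_t = delta / (4 * N * t * t)
        b_t = 2 * math.sqrt(2 * K * math.log(1 / delta_t) / t)
        mu_t = min(1.0 / (2 * K), math.sqrt(math.log(1 / delta_t) / (2 * K * t)))

        x_t, r_vec = draw_round()
        P_t = solve_step1(surviving, mu_t)                     # Step 1: low-variance distribution over policies
        W = [0.0] * K                                          # Step 2: induced action distribution, smoothed
        for pi, prob in zip(surviving, P_t):
            W[pi(x_t)] += prob
        W_smooth = [(1 - K * mu_t) * w + mu_t for w in W]

        a_t = random.choices(range(K), weights=W_smooth)[0]    # Step 3: sample the action
        history.append((x_t, a_t, r_vec[a_t], W_smooth[a_t]))  # Steps 4 and 6: only r_t(a_t) is revealed

        best = max(eta_hat(pi) for pi in surviving)            # Step 5: keep near-empirically-best policies
        surviving = [pi for pi in surviving if eta_hat(pi) >= best - 2 * b_t]
    return surviving
```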
ALGORITHM ANALYSIS

We analyze PolicyElimination in several steps. First, we prove the existence of P_t in Step 1, provided that Π_{t−1} is non-empty. We recast the feasibility problem in Step 1 as a game between two players: Prover, who is trying to produce P_t, and Falsifier, who is trying to find a π violating the constraints. We give more power to Falsifier and allow him to choose a distribution over π (i.e., a randomized policy) which would violate the constraints.

Note that any policy π corresponds to a point in the space of randomized policies (viewed as functions X × A → [0, 1]), with π(x, a) := I(π(x) = a). For any distribution P over policies in Π_{t−1}, the induced randomized policy W_P then corresponds to a point in the convex hull of Π_{t−1}. Denoting the convex hull of Π_{t−1} by C, Prover's choice by W and Falsifier's choice by Z, the feasibility of Step 1 follows by the following lemma:

Lemma 1. Let C be a compact and convex set of randomized policies. Let μ ∈ (0, 1/K] and for any W ∈ C, let W'(x, a) := (1 − Kμ) W(x, a) + μ. Then for all distributions D,
$$\min_{W\in C}\ \max_{Z\in C}\ \mathop{\mathbb{E}}_{x\sim D_X}\ \mathop{\mathbb{E}}_{a\sim Z(x,\cdot)}\left[\frac{1}{W'(x, a)}\right] \le \frac{K}{1 - K\mu}.$$

Proof. Let f(W, Z) := E_{x∼D_X} E_{a∼Z(x,·)}[1/W'(x, a)] denote the inner expression of the minimax problem. Note that f(W, Z) is:

• everywhere defined: Since W'(x, a) ≥ μ, we obtain that 1/W'(x, a) ∈ [0, 1/μ], hence the expectations are defined for all W and Z.

• linear in Z: Linearity follows from rewriting f(W, Z) as
$$f(W, Z) = \mathop{\mathbb{E}}_{x\sim D_X}\left[\sum_{a\in A} \frac{Z(x, a)}{W'(x, a)}\right].$$

• convex in W: Note that 1/W'(x, a) is convex in W(x, a) by convexity of 1/(c_1 w + c_2) in w ≥ 0, for c_1 ≥ 0, c_2 > 0. Convexity of f(W, Z) in W then follows by taking expectations over x and a.

Hence, by Theorem 14 (in Appendix B), min and max can be reversed without affecting the value:
$$\min_{W\in C}\ \max_{Z\in C}\ f(W, Z) = \max_{Z\in C}\ \min_{W\in C}\ f(W, Z).$$
The right-hand side can be further upper-bounded by max_{Z∈C} f(Z, Z), which is upper-bounded by
$$f(Z, Z) = \mathop{\mathbb{E}}_{x\sim D_X}\left[\sum_{a\in A} \frac{Z(x, a)}{Z'(x, a)}\right] \le \mathop{\mathbb{E}}_{x\sim D_X}\left[\sum_{a\in A:\,Z(x,a)>0} \frac{Z(x, a)}{(1 - K\mu)\,Z(x, a)}\right] = \frac{K}{1 - K\mu}.$$

Corollary 2. The set of distributions satisfying the constraints of Step 1 is non-empty. (Since μ_t ≤ 1/(2K), Lemma 1 gives a value of at most K/(1 − Kμ_t) ≤ 2K.)

Given the existence of P_t, we will see below that the constraints in Step 1 ensure low variance of the policy value estimator η_t(π) for all π ∈ Π_{t−1}. The small variance is used to ensure accuracy of policy elimination in Step 5, as quantified in the following lemma:

Lemma 3. With probability at least 1 − δ, for all t:
1. π_max ∈ Π_t (i.e., Π_t is non-empty)
2. η_D(π_max) − η_D(π) ≤ 4b_t for all π ∈ Π_t

Proof. We will show that for any policy π ∈ Π_{t−1}, the probability that η_t(π) deviates from η_D(π) by more than b_t is at most 2δ_t. Taking the union bound over all policies and all time steps, we find that with probability at least 1 − δ,
$$|\eta_t(\pi) - \eta_D(\pi)| \le b_t \qquad (3.1)$$
for all t and all π ∈ Π_{t−1}. Then:

1. By the triangle inequality, in each time step, η_t(π) ≤ η_t(π_max) + 2b_t for all π ∈ Π_{t−1}, yielding the first part of the lemma.

2. Also by the triangle inequality, if η_D(π) < η_D(π_max) − 4b_t for π ∈ Π_{t−1}, then η_t(π) < η_t(π_max) − 2b_t. Hence the policy π is eliminated in Step 5, yielding the second part of the lemma.

It remains to show Eq. (3.1). We fix the policy π ∈ Π and time t, and show that the deviation bound is violated with probability at most 2δ_t. Our argument rests on Freedman's inequality (see Theorem 13 in Appendix A). Let
$$y_t = \frac{r_t\,I(\pi(x_t) = a_t)}{W'_t(a_t)},$$
i.e., η_t(π) = (Σ_{t'=1}^{t} y_{t'})/t. Let E_t denote the conditional expectation E[· | h_{t−1}]. To use Freedman's inequality, we need to bound the range of y_t and its conditional second moment E_t[y_t²]. Since r_t ∈ [0, 1] and W'_t(a_t) ≥ μ_t, we have the bound 0 ≤ y_t ≤ 1/μ_t =: R_t.
Next, we bound the conditional second moment:
$$E_t[y_t^2] = \mathop{\mathbb{E}}_{(x_t,\vec{r}_t)\sim D}\ \mathop{\mathbb{E}}_{a_t\sim W'_t}\bigl[y_t^2\bigr] = \mathop{\mathbb{E}}_{(x_t,\vec{r}_t)\sim D}\ \mathop{\mathbb{E}}_{a_t\sim W'_t}\left[\frac{r_t^2\,I(\pi(x_t)=a_t)}{W'_t(a_t)^2}\right] \le \mathop{\mathbb{E}}_{(x_t,\vec{r}_t)\sim D}\left[\frac{W'_t(\pi(x_t))}{W'_t(\pi(x_t))^2}\right] \qquad (3.2)$$
$$= \mathop{\mathbb{E}}_{x_t\sim D_X}\left[\frac{1}{W'_t(\pi(x_t))}\right] \le 2K, \qquad (3.3)$$
where Eq. (3.2) follows by the boundedness of r_t and Eq. (3.3) follows from the constraints in Step 1. Hence,
$$\sum_{t'=1,\ldots,t} E_{t'}[y_{t'}^2] \le 2Kt =: V_t.$$

Since (ln t)/t is decreasing for t ≥ 3, we obtain that μ_t is non-increasing (by separately analyzing t = 1, t = 2, t ≥ 3). Let t_0 be the first t such that μ_t < 1/(2K). Note that b_t ≥ 4Kμ_t, so for t < t_0 we have b_t ≥ 2 and Π_t = Π. Hence, the deviation bound holds for t < t_0. Let t ≥ t_0. For t' ≤ t, by the monotonicity of μ_t,
$$R_{t'} = 1/\mu_{t'} \le 1/\mu_t = \sqrt{\frac{2Kt}{\ln(1/\delta_t)}} = \sqrt{\frac{V_t}{\ln(1/\delta_t)}}.$$
Hence, the assumptions of Theorem 13 are satisfied, and Pr[|η_t(π) − η_D(π)| ≥ b_t] ≤ 2δ_t. The union bound over π and t yields Eq. (3.1).

This immediately implies that the cumulative regret is bounded by
$$\sum_{t=1,\ldots,T} \bigl(\eta_D(\pi_{\max}) - r_t\bigr) \le 8\sqrt{2K\ln\frac{4NT^2}{\delta}}\ \sum_{t=1}^{T}\frac{1}{\sqrt{t}} \le 16\sqrt{2TK\ln\frac{4T^2N}{\delta}} \qquad (3.4)$$
and gives us the following theorem.

Theorem 4. For all distributions D over (x, ~r) with K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of PolicyElimination (Algorithm 1) over T rounds is at most
$$16\sqrt{2TK\ln\frac{4T^2N}{\delta}}.$$

4 THE RANDOMIZED UCB ALGORITHM

PolicyElimination is the simplest exhibition of the minimax argument, but it has some drawbacks:

1. The algorithm keeps explicit track of the space of good policies (like a version space), which is difficult to implement efficiently in general.
2. If the optimal policy is mistakenly eliminated by chance, the algorithm can never recover.
3. The algorithm requires perfect knowledge of the distribution D_X over contexts.

These difficulties are addressed by RandomizedUCB (or RUCB for short), an algorithm which we present and analyze in this section. Our approach is reminiscent of the UCB algorithm (Auer et al., 2002a), developed for the context-free setting, which keeps an upper-confidence bound on the expected reward for each action. However, instead of choosing the highest upper confidence bound, we randomize over choices according to the value of their empirical performance. The algorithm has the following properties:

1. The optimization step required by the algorithm always considers the full set of policies (i.e., explicit tracking of the set of good policies is avoided), and thus it can be efficiently implemented using an argmax oracle. We discuss this further in Section 5.
2. Suboptimal policies are implicitly used with decreasing frequency by using a non-uniform variance constraint that depends on a policy's estimated regret. A consequence of this is a bound on the value of the optimization, stated in Lemma 7 below.
3. Instead of D_X, the algorithm uses the history of previously seen contexts. The effect of this approximation is quantified in Theorem 6 below.

Algorithm 2 RandomizedUCB(Π, δ, K)

Let h_0 = ∅ be the initial history. Define the following quantities:
$$C_t := 2\log\frac{Nt}{\delta} \qquad\text{and}\qquad \mu_t := \min\left\{\frac{1}{2K},\ \sqrt{\frac{C_t}{2Kt}}\right\}.$$
For each timestep t = 1, ..., T, observe x_t and do:
1. Let P_t be a distribution over Π that approximately solves the optimization problem
$$\min_{P}\ \sum_{\pi\in\Pi} P(\pi)\,\Delta_{t-1}(\pi)$$
s.t. for all distributions Q over Π:
$$\mathop{\mathbb{E}}_{\pi\sim Q}\left[\frac{1}{t-1}\sum_{i=1}^{t-1}\frac{1}{(1-K\mu_t)\,W_P(x_i,\pi(x_i)) + \mu_t}\right] \le \max\left\{4K,\ \frac{(t-1)\,\Delta_{t-1}(W_Q)^2}{180\,C_{t-1}}\right\} \qquad (4.1)$$
so that the objective value at P_t is within $\varepsilon_{\mathrm{opt},t} = O(\sqrt{KC_t/t})$ of the optimal value, and so that each constraint is satisfied with slack ≤ K.
2. Let W'_t be the distribution over A given by W'_t(a) := (1 − Kμ_t) W_{P_t}(x_t, a) + μ_t for all a ∈ A.
3. Choose a_t ∼ W'_t.
4. Observe reward r_t.
5. Let h_t := h_{t−1} ∪ (x_t, a_t, r_t, W'_t(a_t)).
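To fix ideas about what the optimization in Step 1 of Algorithm 2 asks for, the sketch below evaluates, for a candidate distribution P over a small enumerable policy class, the objective of (4.1) and the empirical variance quantity appearing on the left-hand side of its constraints. It is purely illustrative; it does not solve the optimization (that is the subject of Section 5), and the list-based representations are assumptions of the sketch.

```python
def smoothed_prob(P, policies, x, a, mu, K):
    """W'_P(x, a) = (1 - K*mu) * W_P(x, a) + mu, where W_P is induced by P over `policies`."""
    w = sum(prob for pi, prob in zip(policies, P) if pi(x) == a)
    return (1 - K * mu) * w + mu

def rucb_objective(P, empirical_regret):
    """Objective of (4.1): sum_pi P(pi) * Delta_{t-1}(pi), with empirical_regret[i] = Delta_{t-1}(pi_i)."""
    return sum(prob * d for prob, d in zip(P, empirical_regret))

def empirical_variance(P, policies, pi, past_contexts, mu, K):
    """hat{V}_{P,pi,t}: average over past contexts of 1 / W'_P(x, pi(x)).  The constraint in (4.1)
    requires its average under any Q to stay below max{4K, (t-1)*Delta_{t-1}(W_Q)^2 / (180*C_{t-1})}."""
    return sum(1.0 / smoothed_prob(P, policies, x, pi(x), mu, K) for x in past_contexts) / len(past_contexts)
```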
The regret of RandomizedUCB is the following:

Theorem 5. For all distributions D over (x, ~r) with K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of RandomizedUCB (Algorithm 2) over T rounds is at most
$$O\left(\sqrt{TK\log(TN/\delta)} + K\log(NK/\delta)\right).$$

The proof is given in Appendix D.4. Here, we present an overview of the analysis. The bulk of the analysis consists of analyzing the variance of the importance-weighted reward estimates η_t(π), and showing how they relate to their actual expected rewards η_D(π). The details are deferred to Appendix D.

4.1 EMPIRICAL VARIANCE ESTIMATES

A key technical prerequisite for the regret analysis is the accuracy of the empirical variance estimates. For a distribution P over policies Π and a particular policy π ∈ Π, define
$$V_{P,\pi,t} = \mathop{\mathbb{E}}_{x\sim D_X}\left[\frac{1}{(1-K\mu_t)\,W_P(x,\pi(x)) + \mu_t}\right], \qquad \hat{V}_{P,\pi,t} = \frac{1}{t-1}\sum_{i=1}^{t-1}\frac{1}{(1-K\mu_t)\,W_P(x_i,\pi(x_i)) + \mu_t}.$$
The first quantity V_{P,π,t} is (a bound on) the variance incurred by an importance-weighted estimate of reward in round t using the action distribution induced by P, and the second quantity V̂_{P,π,t} is an empirical estimate of V_{P,π,t} using the finite sample {x_1, ..., x_{t−1}} ⊆ X drawn from D_X. We show that for all distributions P and all π ∈ Π, V̂_{P,π,t} is close to V_{P,π,t} with high probability.

Theorem 6. For any ε ∈ (0, 1), with probability at least 1 − δ,
$$V_{P,\pi,t} \le (1+\varepsilon)\cdot \hat{V}_{P,\pi,t} + \frac{7500}{\varepsilon^3}\cdot K$$
for all distributions P over Π, all π ∈ Π, and all t ≥ 16K log(8KN/δ). The proof appears in Appendix C.

4.2 REGRET ANALYSIS

Central to the analysis is the following lemma that bounds the value of the optimization in each round. It is a direct corollary of Lemma 24 in Appendix D.4.

Lemma 7. If OPT_t is the value of the optimization problem (4.1) in round t, then
$$\mathrm{OPT}_t \le O\left(\sqrt{\frac{K\,C_{t-1}}{t-1}}\right) = O\left(\sqrt{\frac{K\log(Nt/\delta)}{t}}\right).$$

This lemma implies that the algorithm is always able to select a distribution over the policies that focuses mostly on the policies with low estimated regret. Moreover, the variance constraints ensure that good policies never appear too bad, and that only bad policies are allowed to incur high variance in their reward estimates. Hence, minimizing the objective in (4.1) is an effective surrogate for minimizing regret.

5 USING AN ARGMAX ORACLE

In this section, we show how to solve the optimization problem (4.1) using the argmax oracle (AMO) for our set of policies. Namely, we describe an algorithm running in polynomial time independent¹ of the number of policies, which makes queries to the AMO to compute a distribution over policies suitable for the optimization step of Algorithm 2.

¹ Or rather, dependent only on log N, the representation size of a policy.
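As a concrete reading of the oracle interface from Definition 1, the sketch below enumerates a finite class to return the reward-maximizing policy; in practice the enumeration would be replaced by training a cost-sensitive classifier on costs c(a) = −r(a), and the function shown here is an assumption of this sketch rather than part of the algorithm.

```python
def argmax_oracle(policies, data):
    """AMO from Definition 1: `data` is a list of (x, r) pairs with r a length-K reward vector.
    Returns the policy maximizing sum over (x, r) of r[pi(x)].  A practical implementation
    would call a cost-sensitive classification learner on costs -r instead of enumerating."""
    return max(policies, key=lambda pi: sum(r[pi(x)] for x, r in data))
```

The same interface supports linear optimization over the convex hull of policies (used by the separation oracle constructed below, cf. Lemma 9): a weight vector w over context-action pairs is simply presented to the oracle as per-context reward vectors.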
The ellipsoid algorithm decides correctly if S is empty or not, by executing at most O(n2 log( Rr )) iterations, each involving one call to the separation oracle and additional O(n2 ) processing time. We now write a convex program whose solution is the required distribution, and show how to solve it using the ellipsoid method by giving a separation oracle for its feasible set using AMO. Fix a time period t. Let Xt−1 be the set of all contexts seen so far, i.e. Xt−1 = {x1 , x2 , . . . , xt−1 }. We embed all policies π ∈ Π in R(t−1)K , with coordinates identified with (x, a) ∈ Xt−1 × A. With abuse of notation, a policy π is represented by the vector π with coordinate π(x, a) = 1 if π(x) = a and 0 otherwise. Let C be the convex hull of all policy vectors π. Recall that a distribution P over policies corresponds to P a point inside C, i.e., WP (x, a) = π:π(x)=a P (π), and that W 0 (x, a) = (1 − µt K)W (x, a) + µt , where µt is as t−1 defined in Algorithm 2. Also define βt = 180C . In t−1 the following, we use the notation x ∼ ht−1 to denote a context drawn uniformly at random from Xt−1 . Consider the following convex program: min s s.t. ∆t−1 (W ) ≤ s (5.1) W ∈ C (5.2) ∀Z ∈ C : " # X Z(x, a) ≤ max{4K, βt ∆t−1 (Z)2 } (5.3) E 0 (x, a) W x∼ht−1 a We claim that this program is equivalent to the RUCB optimization problem (4.1), up to finding an explicit distribution over policies which corresponds to the optimal solution. This can be seen as follows. Since we require W ∈ C, it can be interpreted as being equal to WP for some distribution over policies P . The constraints (5.3) are equivalent to (4.1) by substitution Z = WQ . The above convex program can be solved by performing a binary search over s and testing feasibility of the constraints. For a fixed value of s, the feasibility problem defined by (5.1)–(5.3) is denoted by A. We now give a sketch of how we construct a separation oracle for the feasible region of A. The details of the algorithm are a bit complicated due to the fact that we need to ensure that the feasible region, when non-empty, has a non-negligible volume (recall the requirements of Lemma 8). This necessitates having a small error in satisfying the constraints of the program. We leave the details to Appendix E. Modulo these details, the construction of the separation oracle essentially implies that we can solve A. Before giving the construction of the separation oracle, we first show that AMO allows us to do linear optimization over C efficiently: Lemma 9. Given a vector w ∈ R(t−1)K , we can compute arg maxZ∈C w · Z using one invocation of AMO. Proof. The sequence for AMO consists of xt0 ∈ Xt−1 and ~rt0 (a) P = w(xt0 , a). The lemma now follows since w · π = x∈Xt−1 w(x, π(x)). We need another simple technical lemma which explains how to get a separating hyperplane for violations of convex constraints: Lemma 10. For x ∈ Rn , let f (x) be a convex function of x, and consider the convex set K defined by K = {x : f (x) ≤ 0}. Suppose we have a point y such that f (y) > 0. Let ∇f (y) be a subgradient of f at y. Then the hyperplane f (y) + ∇f (y) · (x − y) = 0 separates y from K. Proof. Let g(x) = f (y) + ∇f (y) · (x − y). By the convexity of f , we have f (x) ≥ g(x) for all x. Thus, for any x ∈ K, we have g(x) ≤ f (x) ≤ 0. Since g(y) = f (y) > 0, we conclude that g(x) = 0 separates y from K. Now given a candidate point W , a separation oracle can be constructed as follows. We check whether W satisfies the constraints of A. 
If any constraint is violated, then we find a hyperplane separating W from all points satisfying the constraint. 1. First, for constraint (5.1), note that ηt−1 (W ) is linear in W , and so we can compute maxπ ηt−1 (π) via AMO as in Lemma 9. We can then compute ηt−1 (W ) and check if the constraint is satisfied. If not, then the constraint, being linear, automatically yields a separating hyperplane. 2. Next, we consider constraint (5.2). To check if W ∈ C, we use the perceptron algorithm. We shift the origin to W , and run the perceptron algorithm with all points π ∈ Π being positive examples. The perceptron algorithm aims to find a hyperplane putting all policies π ∈ Π on one side. In each iteration of the perceptron algorithm, we have a candidate hyperplane (specified by its normal vector), and then if there is a policy π that is on the wrong side of the hyperplane, we can find it by running a linear optimization over C in the negative normal vector direction as in Lemma 9. If W ∈ / C, then in a bounded number of iterations (depending on the distance of W from C, and the maximum magnitude kπk2 ) we obtain a separating hyperplane. In passing we also note that if W ∈ C, the same technique allows us to explicitly compute an approximate convex combination of policies in Π that yields W . This is done by running the perceptron algorithm as before and stopping after the bound on the number of iterations has been reached. Then we collect all the policies we have found in the run of the perceptron algorithm, and we are guaranteed that W is close in distance to their convex hull. We can then find the closest point in the convex hull of these policies by solving a simple quadratic program. 3. Finally, we consider constraint (5.3). We rewrite ηt−1 (W ) as ηt−1 (W ) = w · W , where w(xt0 , a) = rt0 I(a = at0 )/Wt00 (at0 ). Thus, ∆t−1 (Z) = v −w ·Z, where v = maxπ0 ηt−1 (π 0 ) = maxπ0 w · π 0 , which can be computed by using AMO once. Next, using the candidate point W , compute the /t , where nx vector u defined as u(x, a) = Wn0x(x,a) is the number of times x appears in h t−1 , so that hP i Z(x,a) Ex∼ht−1 a W 0 (x,a) = u · Z. Now, the problem reduces to finding a policy Z ∈ C which violates the constraint To do this, we again apply the ellipsoid method. For this, we need a separation oracle for the program. A separation oracle for the constraints (5.5) can be constructed as in Step 2 above. For the constraints (5.4), if the candidate solution Z has f (Z) > 0, then we can construct a separating hyperplane as in Lemma 10. Suppose that after solving the program, we get a point Z ∈ C such that f (Z) ≤ 0, i.e. W violates the constraint (5.3) for Z. Then since constraint (5.3) is convex in W , we can construct a separating hyperplane as in Lemma 10. This completes the description of the separation oracle. Working out the details carefully yields the following theorem, proved in Appendix E: Theorem 11. There is an iterative algorithm with O(t5 K 4 log2 ( tK δ )) iterations, each involving one call to AMO and O(t2 K 2 ) processing time, that either declares correctly that A is infeasible or outputs a distribution P over policies in Π such that WP satisfies ∀Z ∈ C : " # X Z(x, a) 2 E 0 (x, a) ≤ max{4K, βt ∆t−1 (Z) } + 5 W x∼ht−1 P a ∆t−1 (W ) ≤ s + 2γ, where = 6 8δ µ2t and γ = δ µt . DELAYED FEEDBACK In a delayed feedback setting, we observe rewards with a τ step delay according to: 1. The world presents features xt . 2. The learning algorithm chooses an action at ∈ {1, ..., K}. 3. 
The world presents a reward rt−τ for the action at−τ given the features xt−τ . We deal with delay by suitably modifying Algorithm 1 to incorporate the delay τ , giving Algorithm 3. u · Z ≤ max{4K, βt (w · Z − v)2 }. Now we can prove the following theorem, which shows the delay has an additive effect on regret. Define f (Z) = max{4K, βt (w·Z−v)2 }−u·Z. Note that f is a convex function of Z. Finding a point Z that violates the above constraint is equivalent to solving the following (convex) program: Theorem 12. For all distributions D over (x, ~r) with K actions, for all sets of N policies Π, and all delay intervals τ , with probability at least 1 − δ, the regret of DelayedPE (Algorithm 3) is at most r √ 4T 2 N 16 2K ln τ+ T . δ f (Z) ≤ 0 (5.4) Z ∈ C (5.5) Algorithm 3 DelayedPE(Π,δ,K,DX ,τ ) Let Π0 = Π and history h0 = ∅ r 2K ln(1/δt ) . . Define: δt = δ / 4N t2 and bt = 2 ( ) t r 1 ln(1/δ ) . t Define: µt = min , 2K 2Kt For each timestep t = 1 . . . T , observe xt and do: 1. Let t0 = max(t − τ, 1). 2. Choose distribution Pt over Πt−1 s.t. ∀ π ∈ Πt−1 : 1 ≤ 2K E x∼DX (1 − Kµt0 )WPt (x, π(x)) + µt0 3. ∀ a ∈ A, Let Wt0 (a) = (1 − Kµt0 )WPt (xt , a) + µt0 4. Choose at ∼ Wt0 5. Observe reward rt . n 6. Let Πt = π ∈ Πt−1 : o ηh (π) ≥ 0max ηh (π 0 ) − 2bt0 π ∈Πt−1 7. Let ht = ht−1 ∪ (xt , at , rt , Wt0 (at )) Proof. Essentially as Theorem 4. The variance bound is unchanged because it depends only on context Pthe T distribution. Thus, it suffices to replace t−1 √1t with PT +τ PT 1 τ + t=τ +1 √t−τ = τ + t=1 √1t in Eq. (3.4). Acknowledgements We thank Alina Beygelzimer, who helped in several formative discussions. References Peter Auer. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002. Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finitetime analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002a. Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 32(1):48–77, 2002b. P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In NIPS, 2007. Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error correcting tournaments. In ALT, 2009. Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In AISTATS, 2011. Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. David A. Freedman. On tail probabilities for martingales. Annals of Probability, 3(1):100–118, 1975. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119–139, 1997. Sham M. Kakade and Adam Kalai. From batch to transductive online learning. In NIPS, 2005. Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291–307, 2005. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985. J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. In NIPS, 2009. John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. Maurice Sion. On general minimax theorems. Pacific J. 
Math., 8(1):171–176, 1958. Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010. A Concentration Inequality The following is an immediate corollary of Theorem 1 of (Beygelzimer et al., 2011). It can be viewed as a version of Freedman’s Inequality (Freedman, 1975). Let y1 , . . . , yT be a sequence of real-valued random variables. Let Et denote the conditional expectation E[ · | y1 , . . . , yt−1 ] and Vt conditional variance. Theorem 13 (Freedman-style Inequality). Let V, R ∈ PT R such that t=1 Vt [yt ] ≤ V , and for all t, yt − E [y ] ≤ R. Then for any δ > 0 such that R ≤ t t p V / ln(2/δ), with probability at least 1 − δ, T T X X p yt − Et [yt ] ≤ 2 V ln(2/δ) . t=1 B t=1 Minimax Theorem The following is a continuous version of Sion’s Minimax Theorem (Sion, 1958, Theorem 3.4). Theorem 14. Let W and Z be compact and convex sets, and f : W × Z → R a function which for all Z ∈ Z is convex and continuous in W and for all W ∈ W is concave and continuous in Z. Then min max f (W, Z) = max min f (W, Z) . W ∈W Z∈Z Z∈Z W ∈W C Empirical Variance Bounds and then applying the AM/GM inequality. In this section we prove Theorem 6. We first show uniform convergence for a certain class of policy distributions (Lemma 15), and argue that each distribution P is close to some distribution Pe from this class, in the sense that VP,π,t is close to VPe,π,t and VbP,π,t is close to VbPe,π,t (Lemma 16). Together, they imply the main uniform convergence result in Theorem 6. For each positive integer m, let Sparse[m] be the set of distributions Pe over Π that can be written as m 1 X I(π = πi ) Pe(π) = m i=1 (i.e., the average of m delta functions) for some π1 , . . . , πm ∈ Π. In our analysis, we approximate an arbitrary distribution P over Π by a distribution Pe ∈ Sparse[m] chosen randomly by independently drawing π1 , . . . , πm ∼ P ; we denote this process by Pe ∼ P m . Lemma 15. Fix positive integers (m1 , m2 , . . . ). With probability at least 1 − δ over the random samples (x1 , x2 , . . . ) from DX , VPe,π,t ≤ (1 + λ) · VbPe,π,t (mt + 1) log N + log 1 + 5+ · 2λ µt · (t − 1) 2t2 δ Lemma 16. Fix any γ ∈ [0, 1], and any x ∈ X. For any distribution P over Π and any π ∈ Π, if . m= 6 , γ 2 µt then 1 E (1 − Kµ )W m e t e P ∼P P (x, π(x)) + µt 1 − (1 − Kµt )WP (x, π(x)) + µt γ . ≤ (1 − Kµt )WP (x, π(x)) + µt This implies that for all distributions P over Π and any π ∈ Π, there exists Pe ∈ Sparse[m] such that for any λ > 0, for all λ > 0, all t ≥ 1, all π ∈ Π, and all distributions Pe ∈ Sparse[mt ]. Proof. Let . ZPe,π,t (x) = VP,π,t − VPe,π,t + (1 + λ) VbPe,π,t − VbP,π,t ≤ γ(VP,π,t + (1 + λ)VbP,π,t ). 1 (1 − Kµt )WPe (x, π(x)) + µt so VPe,π,t = Ex∼DX [ZPe,π,t (x)] and VbPe,π,t = (t − Pt−1 1)−1 i=1 ZPe,π,t (xi ). Also let . Proof. randomly draw Pe ∼ P m , with Pe(π 0 ) = PWe m −1 0 m i=1 I(π = πi ), and then define . log(|Sparse[mt ]|N 2t2 /δ) εt = µt · (t − 1) ((mt + 1) log N + log = µt · (t − 1) 2t2 δ ) . We apply Bernstein’s inequality and union bounds over Pe ∈ Sparse[mt ], π ∈ Π, and t ≥ 1 so that with probability at least 1 − δ, q VPe,π,t ≤ VbPe,π,t + 2VPe,π,t εt + (2/3)εt all t ≥ 1, all π ∈ Π, and all distributions P ∈ Sparse[mt ]. The conclusion follows by solving the quadratic inequality for VPe,π,t to get q VPe,π,t ≤ VbPe,π,t + 2VbPe,π,t εt + 5εt . X z= P (π 0 ) · I(π 0 (x) = π(x)) and π 0 ∈Π . X e 0 ẑ = P (π ) · I(π 0 (x) = π(x)). 
π 0 ∈Π 0 We have Pm z = Eπ0 ∼P [I(π (x) = π(x)] and ẑ = −1 m i=1 I(πi (x) = π(x)). In other words, ẑ is the average of m independent Bernoulli random variables, each with mean z. Thus, EPe∼P m [(ẑ−z)2 ] = z(1−z)/m and PrPe∼P m [ẑ ≤ z/2] ≤ exp(−mz/8) by a Chernoff bound. We have 1 1 − E (1 − Kµ )ẑ + µ (1 − Kµ )z + µ e∼P m t t t t P ≤ E (1 − Kµt )|ẑ − z| [(1 − Kµt )ẑ + µt ][(1 − Kµt )z + µt ] E (1 − Kµt )|ẑ − z|I(ẑ ≥ 0.5z) 0.5[(1 − Kµt )z + µt ]2 e∼P m P ≤ e∼P m P (1 − Kµt )|ẑ − z|I(ẑ ≤ 0.5z) µt [(1 − Kµt )z + µt ] e∼P m P p (1 − Kµt ) EPe∼P m |ẑ − z|2 ≤ 0.5[(1 − Kµt )z + µt ]2 (1 − Kµt )z PrPe∼P m (ẑ ≤ 0.5z) + µt [(1 − Kµt )z + µt ] p (1 − Kµt ) z/m p ≤ 0.5[2 (1 − Kµt )zµt ][(1 − Kµt )z + µt ] (1 − Kµt )z exp(−mz/8) + µt [(1 − Kµt )z + µt ] p √ γ 1 − Kµt z/m ≤p z(6/m)[(1 − Kµt )z + µt ] + E (1 − Kµt )γ 2 mz exp(−mz/8) , + 6[(1 − Kµt )z + µt ] where the third inequality follows from Jensen’s inequality, and the fourth inequality uses the AM/GM inequality in the denominator of the first term and the previous observations in the numerators. The final expression simplifies to the first desired displayed inequality by observing that mz exp(−mz/8) ≤ 3 for all mz ≥ 0 (the maximum is achieved at mz = 8). The second displayed inequality follows from the following facts: E e∼P m P |VP,π,t − VPe,π,t | ≤ γVP,π,t , E (1 + λ)|VbP,π,t − VbPe,π,t | ≤ γ(1 + λ)VbP,π,t . e∼P m P Both inequalities follow from the first displayed bound of the lemma, by taking expectation with respect to the true (and empirical) distributions over x. The desired bound follows by adding the above two inequalities, which implies that the bound holds in expectation, and hence the existence of Pe for which the bound holds. Now, we can prove Theorem 6. Proof of Theorem 6. Let . mt = 6 1 · λ2 µt (for some λ ∈ (0, 1/5) to be determined) and condition on the ≥ 1 − δ probability event from Lemma 15 that VPe,π,t − (1 + λ)VbPe,π,t (mt + 1) log(N ) + log(2t2 /δ) 1 ≤K · 5+ · 2λ Kµt · (t − 1) 1 (mt + 1) log(N ) + log(2t2 /δ) · ≤K ·5 1+ λ Kµt · t for all t ≥ 2, all Pe ∈ Sparse[mt ], and all π ∈ Π. Using the definitions of mt and µt , the second term is at most (40/λ2 ) · (1 + 1/λ) · K for all t ≥ 16K log(8KN/δ): the key here p is that for t ≥ 16K log(8KN/δ), we have µt = log(N t/δ)/(Kt) ≤ 1/(2K) and therefore mt log(N ) 6 ≤ 2 Kµt t λ and log(N ) + log(2t2 /δ) ≤ 2. Kµt t Now fix t ≥ 16K log(8KN/δ), π ∈ Π, and a distribution P over Π. Let Pe ∈ Sparse[mt ] be the distribution guaranteed by Lemma 16 with γ = λ satisfying VP,π,t ≤ VPe,π,t − (1 + λ)VbPe,π,t + (1 + λ)2 VbP,π,t 1−λ . Substituting the previous bound for VPe,π,t − (1 + λ)Vb e gives P ,π,t VP,π,t ≤ 1 1−λ 40 2b (1 + 1/λ)K + (1 + λ) V P,π,t . λ2 This can be bounded as (1 + ) · VbP,π,t + (7500/3 ) · K by setting λ = /5. D D.1 Analysis of RandomizedUCB Preliminaries First, we define the following constants. • ∈ (0, 1) is a fixed constant, and . • ρ = 7500 3 is the factor that appears in the bound from Theorem 6. . 2 1 + 7500 ≥5 • θ = (ρ + 1)/(1 − (1 + )/2) = 1− 3 is a constant central to Lemma 21, which bounds the variance of the optimal policy’s estimated rewards. Recall the algorithm-specific quantities Nt . Ct = 2 log δ ( ) r 1 Ct . µt = min , . 2K 2Kt It can be checked that µt is non-increasing. We define the following time indices: p • t0 is the first round t in which µt = Ct /(2Kt). Note that 8K ≤ t0 ≤ 8K log(N K/δ). D.2 Deviation Bound for ηt (π) For any policy π ∈ Π, define, for 1 ≤ t ≤ t0 , . 
V̄t (π) = K, and for t > t0 , • t1 := d16K log(8KN/δ)e is the round given by Theorem 6 such that, with probability at least 1 − δ, . V̄t (π) = K + E xt ∼DX 1 . Wt0 (π(xt )) The V̄t (π) bounds the variances of the terms in ηt (π). E xt ∼DX ≤ (1 + ) 1 0 Wt (π(xt )) E x∼ht−1 Lemma 18. Assume the bound in (D.1) holds for all π ∈ Π and t ≥ t1 . For all π ∈ Π: 1 + ρK WPt ,µt (x, π(x)) (D.1) for all π ∈ Π and all t ≥ t1 , where WP,µ (x, ·) is the distribution over A given by 1. If t ≤ t1 , then K ≤ V̄t (π) ≤ 4K. 2. If t > t1 , then . WP,µ (x, a) = (1 − Kµ)WP (x, a) + µ, V̄t (π) ≤ (1 + ) and the notation Ex∼ht−1 denotes expectation with respect to the empirical (uniform) distribution over x1 , . . . , xt−1 . The following lemma shows the effect of allowing slack in the optimization constraints. Lemma 17. If P satisfies the constraints of the optimization problem (4.1) with slack K for each distribution Q over Π, i.e., E x∼ht−1 1 (1 − Kµt )WPt (x, π(x)) + µt + (ρ + 1)K. Proof. For the first claim, note that if t < t0 , then V̄t (π) = K, and if t0 ≤ t < t1 , then s r log(N t/δ) log(N t0 /δ) 1 ≥ ≥ ; µt = Kt 16K 2 log(8KN/δ) 4K so Wt0 (a) ≥ µt ≥ 1/(4K). 1 π∼Q x∼ht−1 (1 − Kµt )WP (x, π(x)) + µt (t − 1)∆t−1 (WQ )2 +K ≤ max 4K, 180Ct−1 E E for all Q, then P satisfies E E π∼Q x∼ht−1 1 (1 − Kµt )WP (x, π(x)) + µt (t − 1)∆t−1 (WQ )2 ≤ max 5K, 144Ct−1 n o 2 . t−1 (π) Proof. Let b = max 4K, (t−1)∆ . Note that 180Ct−1 ≥ K. Hence b + K ≤ bound. The stated bound on V̄t (π) now follows from its definition. Let . V̄max,t (π) = max{V̄τ (π), τ = 1, 2, . . . , t} for all Q. b 4 For the second claim, pick any t > t1 , and note that by definition of t1 , for any π ∈ Π we have 1 E 0 xt ∼DX Wt (π(xt )) 1 ≤ (1 + ) E + ρK. x∼ht−1 (1 − Kµt )WPt (x, π(x)) + µt 5b 4 which gives the stated Note that the allowance of slack K is somewhat arbitrary; any O(K) slack is tolerable provided that other constants are adjusted appropriately. The following lemma gives a deviation bound for ηt (π) in terms of these quantities. Lemma 19. Pick any δ ∈ (0, 1). With probability at least 1 − δ, for all pairs π, π 0 ∈ Π and t ≥ t0 , we have (ηt (π) − ηt (π 0 )) − (ηD (π) − ηD (π 0 )) r (V̄max,t (π) + V̄max,t (π 0 )) · Ct ≤2 . (D.2) t Proof. Fix any t ≥ t0 and π, π 0 ∈ Π. exp(−Ct ). Pick any τ ≤ t. Let Let δt := . rτ (aτ )I(π(xτ ) = aτ ) Zτ (π) = Wτ0 (aτ ) Pt so ηt (π) = t−1 τ =1 Zτ (π). It is easy to see that E [Zτ (π) − Zτ (π 0 )] = ηD (π) − ηD (π 0 ) (xτ ,~ rτ )∼D, aτ ∼Wτ0 The next two lemmas relate the V̄t (π) to the ∆t (π). Lemma 20. Assume Condition 1. For any t ≥ t1 and π ∈ Π, if V̄t (π) > θK, then s ∆t−1 (π) ≥ Proof. By Lemma 18, the fact V̄t (π) > θK implies that and E t X τ =1 ≤ x∼ht−1 (Zτ (π) − Zτ (π 0 ))2 E (xτ ,~ r (τ ))∼D, aτ ∼Wτ0 t X τ =1 E xτ ∼DX 1 1 + 0 0 0 Wτ (π(xτ )) Wτ (π (xτ )) 72V̄t (π)Ct−1 . t−1 1 (1 − Kµt )WPt (x, π(x)) + µt ρ+1 1 1 1− V̄t (π) ≥ V̄t (π). > 1+ θ 2 ≤ t · (V̄max,t (π) + V̄max,t (π 0 )). Since V̄t (π) > θK ≥ 5K, Lemma 17 implies that in order for Pt to satisfy the optimization constraint in (4.1) corresponding to π (with slack ≤ K), it must be the case that Moreover, with probability 1, 1 . µτ q Ct Now, note that since t ≥ t0 , µt = 2Kt , so that |Zτ (π) − Zτ (π 0 )| ≤ Ct 0 t = 2Kµ 2 . Further, both V̄max,t (π) and V̄max,t (π ) are t at least K. Using these bounds we get s 1 · t · (V̄max,t (π) + V̄max,t (π 0 )) log(1/δt ) s 1 1 Ct 1 · 2K = ≥ · ≥ , Ct 2Kµ2t µt µτ for all τ ≤ t, since the µτ ’s are non-increasing. 
Therefore, by Freedman’s inequality (Theorem 13), we have " Pr (ηt (π) − ηt (π 0 )) − (ηD (π) − ηD (π 0 )) r >2 # (V̄max,t (π) + V̄max,t (π 0 )) · log(1/δt ) ≤ 2δt . t The conclusion follows by taking a union bound over t0 < t ≤ T and all pairs π, π 0 ∈ Π. ∆t−1 (π) s 144Ct−1 1 ≥ · E . t−1 x∼ht−1 (1 − Kµt )WPt (x, π(x)) + µt Combining with the above, we obtain s 72V̄t (π)Ct−1 . ∆t−1 (π) ≥ t−1 Lemma 21. Assume Condition 1. For all t ≥ 1, V̄max,t (πmax ) ≤ θK and V̄max,t (πt ) ≤ θK. Proof. By induction on t. The claim for all t ≤ t1 follows from Lemma 18. So take t > t1 , and assume as the (strong) inductive hypothesis that V̄max,τ (πmax ) ≤ θK and V̄max,τ (πτ ) ≤ θK for τ ∈ {1, . . . , t − 1}. Suppose for sake of contradiction that V̄t (πmax ) > θK. By Lemma 20, s 72V̄t (πmax )Ct−1 ∆t−1 (πmax ) ≥ . t−1 However, by the deviation bounds, we have D.3 Variance Analysis We define the following condition, which will be assumed by most of the subsequent lemmas in this section. Condition 1. The deviation bound (D.1) holds for all π ∈ Π and t ≥ t1 , and the deviation bound (D.2) holds for all pairs π, π 0 ∈ Π and t ≥ t0 . ∆t−1 (πmax ) + ∆D (πt−1 ) s (V̄max,t−1 (πt−1 ) + V̄max,t−1 (πmax ))Ct−1 ≤2 t−1 s s 2V̄t (πmax )Ct−1 72V̄t (πmax )Ct−1 ≤2 < . t−1 t−1 The second inequality follows from our assumption and the induction hypothesis: This contradicts the inequality in (D.3), so it must be that V̄max,t (πt ) ≤ θK. V̄t (πmax ) > θK ≥ V̄max,t−1 (πt−1 ), V̄max,t−1 (πmax ). Corollary 22. Under the assumptions of Lemma 21, Since ∆D (πt−1 ) ≥ 0, we have a contradiction, so it must be that V̄t (πmax ) ≤ θK. This proves that V̄max,t (πmax ) ≤ θK. It remains to show that V̄max,t (πt ) ≤ θK. So suppose for sake of contradiction that the inequality fails, and let t1 < τ ≤ t be any round for which V̄τ (πt ) = V̄max,t (πt ) > θK. By Lemma 20, s ∆τ −1 (πt ) ≥ 72V̄τ (πt )Cτ −1 . τ −1 (D.3) On the other hand, ∆τ −1 (πt ) ≤ ∆D (πτ −1 ) + ∆τ −1 (πt ) + ∆t (πmax ) = ∆D (πτ −1 ) + ∆τ −1 (πmax ) + ητ −1 (πmax ) − ητ −1 (πt ) − ∆D (πt ) + ∆D (πt ) + ∆t (πmax ) . The parenthesized terms can be bounded using the deviation bounds, so we have ∆τ −1 (πt ) s (V̄max,τ −1 (πτ −1 ) + V̄max,τ −1 (πmax ))Cτ −1 ≤2 τ −1 s (V̄max,τ −1 (πt ) + V̄max,τ −1 (πmax ))Cτ −1 +2 τ −1 r (V̄max,t (πt ) + V̄max,t (πmax ))Ct +2 s st 2V̄τ (πt )Cτ −1 +2 τ −1 r 2V̄τ (πt )Ct +2 t s ≤2 < 2V̄τ (πt )Cτ −1 τ −1 72V̄τ (πt )Cτ −1 τ −1 where the second inequality follows from the following facts: 1. By induction hypothesis, we have V̄max,τ −1 (πτ −1 ), V̄max,τ −1 (πmax ), V̄max,t (πmax ) ≤ θK, and V̄τ (πt ) > θK, 2. V̄τ (πt ) ≥ V̄max,t (πt ), and 3. since τ is a round that achieves V̄max,t (πt ), we have V̄τ (πt ) ≥ V̄τ −1 (πt ). r ∆D (πt ) + ∆t (πmax ) ≤ 2 2θKCt t for all t ≥ t0 . Proof. Immediate from Lemma 21 and the deviation bounds from (D.2). The following lemma shows that if a policy π has large ∆τ (π) in some round τ , then ∆t (π) remains large in later rounds t > τ . Lemma 23. Assume Condition 1. Pick any π ∈ Π and t ≥ t1 . If V̄max,t (π) > θK, then r ∆t (π) > 2 2V̄max,t (π)Ct . t Proof. Let τ ≤ t be any round in which V̄τ (π) = V̄max,t (π) > θK. 
We have ∆t (π) ≥ ∆t (π) − ∆t (πmax ) − ∆D (πτ −1 ) = ∆τ −1 (π) + ηt (πmax ) − ηt (π) − ∆D (π) + ηD (πτ −1 ) − ηD (π) − ∆τ −1 (π) s 72V̄τ (π)Cτ −1 ≥ τ −1 r (V̄max,t (π) + V̄max,t (πmax ))Ct −2 t s (V̄max,τ −1 (π) + V̄max,τ −1 (πτ −1 ))Cτ −1 τ −1 s r 72V̄max,t (π)Cτ −1 2V̄max,t (π)Ct > −2 τ −1 t s 2V̄max,t (π)Cτ −1 −2 τ −1 s r 2V̄max,t (π)Cτ −1 2V̄max,t (π)Ct ≥2 ≥2 τ −1 t −2 where the second inequality follows from Lemma 20 and the deviation bounds, and the third inequality follows from Lemma 21 and the facts that V̄τ (π) = V̄max,t (π) > θK ≥ V̄max,t (πmax ), V̄max,τ −1 (πτ −1 ), and V̄max,t (π) ≥ V̄max,τ −1 (π). D.4 Regret Analysis We now bound the value of the optimization problem (4.1), which then leads to our regret bound. The next lemma shows the existence of a feasible solution with a certain structure based on the non-uniform constraints. Recall from Section 5, that solving the optimization problem A, i.e. constraints (5.1, 5.2, 5.3), for the smallest feasible value of s is equivalent to solving the RUCB optimization problem (4.1). Recall that t−1 . βt = 180C t−1 Lemma 24. There is a point W ∈ R(t−1)K such that s K ∆t−1 (W ) ≤ 4 βt W ∈ C # " X Z(x, a) ≤ max{4K, βt ∆t−1 (Z)2 } ∀Z ∈ C : E 0 (x, a) W x∼ht−1 a In particular, the value of the qoptimization q probt−1 lem (4.1), OPTt , is bounded by 8 βKt ≤ 110 KC t−1 . Proof. Define the sets {Ci : i = 1, 2, . . .} such that Ci := {Z ∈ C : 2i+1 κ ≤ ∆t−1 (Z) ≤ 2i+2 κ}, q where κ = βKt . Note that since ∆t−1 (Z) is a linear function of Z, each Ci is a closed, convex, compact set. Also, define C0 = {Z ∈ C : ∆t−1 (Z) ≤ 4κ}. ThisSis also a closed, convex, compact set. Note that ∞ C = i=0 Ci . Let I = {i : Ci 6=P ∅}.For i ∈ I \ {0}, define wi = 4−i , and let w0 = 1 − i∈I\{0} wi . Note that w0 ≥ 2/3. By Lemma 1, for each i ∈ I, there is a point Wi ∈ Ci such that for all Z ∈ Ci , we have " # X Z(x, a) ≤ 2K. E Wi0 (x, a) x∼ht−1 a Here we use the fact that Kµt ≤ 1/2 to upper K by 2K. Now consider the point W = bound 1−Kµ t P w W . Since C is convex, W ∈ C. i i i∈I Now fix any i ∈ I. For any (x, a), we have W 0 (x, a) ≥ wi Wi0 (x, a), so that for all Z ∈ Ci , we have " # X Z(x, a) 1 ≤ 2K E 0 (x, a) W w x∼ht−1 i a ≤ 4i+1 K ≤ max{4K, βt ∆t−1 (Z)2 }, so the constraint for Z is satisfied. Finally, since for all i ∈ I, we have wi ≤ 4−i and ∆t−1 (Wi ) ≤ 2i+2 κ, we get ∆t−1 (W ) = X wi ∆t−1 (Wi ) ≤ ∞ X 4−i · 2i+2 κ ≤ 8κ. i=0 i∈I The value of the optimization problem (4.1) can be related to the expected instantaneous regret of policy drawn randomly from the distribution Pt . Lemma 25. Assume Condition 1. Then r X √ KCt−1 Pt (π)∆D (π) ≤ 220 + 4 2θ · + 2εopt,t t−1 π∈Π for all t > t1 . Proof. Fix any π ∈ Π and t > t1 . By the deviation bounds, we have ηD (πt−1 ) − ηD (π) s (V̄max,t−1 (π) + V̄max,t−1 (πt−1 ))Ct−1 ≤ ∆t−1 (π) + 2 t−1 s V̄max,t−1 (π) + θK Ct−1 ≤ ∆t−1 (π) + 2 , t−1 by Lemma 21. By Corollary 22 we have r 2θKCt−1 ∆D (πt−1 ) ≤ 2 t−1 Thus, we get ∆D (π) ≤ ηD (πt−1 ) − ηD (π) + ∆D (πt−1 ) s V̄max,t−1 (π) + θK Ct−1 ≤ ∆t−1 (π) + 2 t−1 r 2θKCt−1 +2 . t−1 If V̄max,t−1 (π) ≤ θK, then we have r 2θKCt−1 ∆D (π) ≤ ∆t−1 (π) + 4 . t−1 Otherwise, Lemma 23 implies that V̄max,t−1 (π) ≤ (t − 1) · ∆t−1 (π)2 , 8Ct−1 so r ∆t−1 (π)2 θKCt−1 ∆D (π) ≤ ∆t−1 (π) + 2 + 8 t−1 r 2θKCt−1 +2 t−1 r 2θKCt−1 ≤ 2∆t−1 (π) + 4 . t−1 Therefore X E Pt (π)∆D (π) π∈Π ≤2 X r Pt (π)∆t−1 (π) + 4 π∈Π r ≤ 2 (OPTt +εopt,t ) + 4 2θKCt−1 t−1 2θKCt−1 t−1 where OPTt is the value of the optimization problem (4.1). The conclusion follows from Lemma 24. 
We can now finally prove the main regret bound for RUCB. Proof of Theorem 5. The regret through the first t1 rounds is trivially bounded by t1 . In the event that Condition 1 holds, we have for all t ≥ t1 , X X Wt (a)rt (a) ≥ (1 − Kµt )WPt (xt , a)rt (a) a∈A a∈A ≥ X Details of Oracle-based Algorithm We show how to (approximately) solve A using the ellipsoid algorithm with AMO. Fix a time period t. To avoid clutter, (only) in this section we drop the subscript t − 1 from ηt−1 (·), ∆t−1 (·), and ht−1 so that they becomes η(·), ∆(·), and h respectively. In order to use the ellipsoid algorithm, we need to relax the program a little bit in order to ensure that the feasible region has a non-negligible volume. To do this, we need to obtain some perturbation bounds for the constraints of A. The following lemma gives such bounds. For any δ > 0, we define Cδ to be the set of all points within a distance of δ from C. Lemma 26. Let δ ≤ b/4 be a parameter. Let U, W ∈ C2δ be points such that kU − W k ≤ δ. Then we have |∆(U ) − ∆(W )| ≤ γ (E.1) ∀Z ∈ C1 : # " # " X Z(x, a) X Z(x, a) − E ≤ E x∼h U 0 (x, a) W 0 (x, a) x∼h a a WPt (xt , a)rt (a) − Kµt (E.2) a∈A = X Pt (π)rt (π(xt )) − Kµt , where = 8δ µ2t and γ = δ µt . π∈Π Proof. First, we have and therefore |η(U ) − η(W )| ≤ [rt (at )] E " = E (xt ,~ r (t))∼D ≥ X Wt0 (a)rt (a) which implies (E.1). a∈A Pt (π)ηD (π) − Kµt π∈Π r KCt−1 + εopt,t t−1 ! where the last inequality follows from Lemma 25. Summing the bound from t = t1 + 1, . . . , T gives Next, for any Z ∈ C1 , we have X Z(x, a) X Z(x, a) − U 0 (x, a) W 0 (x, a) a a X |U 0 (x, a) − W 0 (x, a)| ≤ |Z(x, a)| U 0 (x, a)W 0 (x, a) a ≤ T X t=1 E [ηD (πmax ) − rt (at )] (xt ,~ r (t))∼D at ∼Wt0 ≤ t1 + O r |U (x, a) − W (x, a)| p δ ≤ = γ, µt # ≥ ηD (πmax ) − O X (x,a,r,q)∈h (xt ,~ r (t))∼D at ∼Wt0 X 1 t−1 p T K log (N T /δ) . By PT Azuma’s inequality, the probability that t=1 p rt (at ) deviates from its mean by more than O( T log(1/δ)) is at most δ. Finally, the probability that Condition 1 does not hold is at most 2δ by Lemma 19, Theorem 6, and a union bound. The conclusion follows by a final union bound. 8δ = . µ2t In the last inequality, we use the Cauchy-Schwarz inequality, and use the following facts (here, Z(x, ·) denotes the vector hZ(x, a)ia , etc.): 1. kZ(x, ·)k ≤ 2 since Z ∈ C1 , 2. kU 0 (x, ·) − W 0 (x, ·)k ≤ kU (x, ·) − W (x, ·)k ≤ δ, and 3. U 0 (x, a) ≥ (1 − bK) · (−2δ) + b ≥ b/2, for δ ≤ b/4, and similarly W 0 (x, a) ≥ b/2. This implies (E.2). We now consider the following relaxed form of A. Here, δ ∈ (0, b/4) is a parameter. We want to find a point W ∈ R(t−1)K such that ∆(W ) ≤ s + γ (E.3) W ∈ Cδ (E.4) ∀Z ∈ C2δ : # " X Z(x, a) E W 0 (x, a) x∼h a ≤ max{4K, βt ∆(Z)2 } + , (E.5) linear optimization over C efficiently. This immediately gives us the following useful corollary: Corollary 28. Given a vector w ∈ R(t−1)K and δ > 0, we can compute arg maxZ∈Cδ w · Z using one invocation of AMO. Proof. This follows directly from the following fact: arg max w · Z = Z∈Cδ δ w + arg max w · Z. Z∈C kwk where and γ are as defined in Lemma 26. Call this relaxed program A0 . We apply the ellipsoid method to A0 rather than A. Recall the requirements of Lemma 8: we need an enclosing ball of bounded radius for the feasible region, and the radius of an enclosed ball in the feasible region. The following lemma gives this. 0 Lemma √ 27. The feasible region for A is contained in B(0, t + δ), and if A is feasible, then it contains a ball of radius δ. Proof. 
Recall the requirements of Lemma 8: we need an enclosing ball of bounded radius for the feasible region, and the radius of a ball enclosed in the feasible region. The following lemma gives both.

Lemma 27. The feasible region for A' is contained in $B(0, \sqrt{t} + \delta)$, and if A is feasible, then it contains a ball of radius $\delta$.

Proof. Note that for any $W \in C_\delta$, we have $\|W\| \le \sqrt{t} + \delta$, so the feasible region lies in $B(0, \sqrt{t} + \delta)$. Next, if A is feasible, let $W^\star \in C$ be any feasible solution to A. Consider the ball $B(W^\star, \delta)$. Let $U$ be any point in $B(W^\star, \delta)$. Clearly $U \in C_\delta$. By Lemma 26, assuming $\delta \le 1/2$, we have for all $Z \in C_{2\delta}$,
\[
\mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{U'(x,a)}\Bigr] \le \mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W^{\star\prime}(x,a)}\Bigr] + \epsilon \le \max\{4K,\ \beta_t\,\Delta(Z)^2\} + \epsilon.
\]
Also $\Delta(U) \le \Delta(W^\star) + \gamma \le s + \gamma$. Thus, $U$ is feasible for A', and hence the entire ball $B(W^\star, \delta)$ is feasible for A'.

We now give the construction of a separation oracle for the feasible region of A' by checking for violations of the constraints. In the following, we use the word "iteration" to indicate one step of either the ellipsoid algorithm or the perceptron algorithm. Each such iteration involves one call to AMO, and additional $O(t^2K^2)$ processing time.

Let $W \in \mathbb{R}^{(t-1)K}$ be a candidate point that we want to check for feasibility for A'. We can check for violation of the constraint (E.3) easily, and since it is a linear constraint in $W$, it automatically yields a separating hyperplane if it is violated. The harder constraints are (E.4) and (E.5).

Recall that Lemma 9 shows that AMO allows us to do linear optimization over $C$ efficiently. This immediately gives us the following useful corollary:

Corollary 28. Given a vector $w \in \mathbb{R}^{(t-1)K}$ and $\delta > 0$, we can compute $\arg\max_{Z\in C_\delta} w\cdot Z$ using one invocation of AMO.

Proof. This follows directly from the following fact:
\[
\arg\max_{Z\in C_\delta} w\cdot Z = \frac{\delta}{\|w\|}\,w + \arg\max_{Z\in C} w\cdot Z.
\]

Now we show how to use AMO to check constraint (E.4):

Lemma 29. Suppose we are given a point $W$. Then in $O(t/\delta^2)$ iterations, if $W \notin C_{2\delta}$, we can construct a hyperplane separating $W$ from $C_\delta$. Otherwise, we declare correctly that $W \in C_{2\delta}$. In the latter case, we can find an explicit distribution $P$ over policies in $\Pi$ such that $W_P$ satisfies $\|W_P - W\| \le 2\delta$.

Proof. We run the perceptron algorithm with the origin at $W$ and all points in $C_\delta$ being positive examples. The goal of the perceptron algorithm then is to find a hyperplane going through $W$ that puts all of $C_\delta$ (strictly) on one side. In each iteration of the perceptron algorithm, we have a weight vector $w$ that is the normal to a candidate hyperplane, and we need to find a point $Z \in C_\delta$ such that $w\cdot(Z - W) \le 0$ (note that we have shifted the origin to $W$). To do this, we use AMO as in Lemma 9 to find $Z^\star = \arg\max_{Z\in C_\delta} -w\cdot Z$. If $w\cdot(Z^\star - W) \le 0$, we use $Z^\star$ to update $w$ using the perceptron update rule, $w \leftarrow w + (Z^\star - W)$. Otherwise, we have $w\cdot(Z - W) > 0$ for all $Z \in C_\delta$, and hence we have found our separating hyperplane.

Now suppose that $W \notin C_{2\delta}$, i.e. the distance of $W$ from $C_\delta$ is more than $\delta$. Since $\|Z - W\| \le 2\sqrt{t} + 3\delta = O(\sqrt{t})$ for all $Z \in C_\delta$ (assuming $\delta = O(\sqrt{t})$), the perceptron convergence guarantee implies that in $O(t/\delta^2)$ iterations we find a separating hyperplane. If in $k = O(t/\delta^2)$ iterations we haven't found a separating hyperplane, then $W \in C_{2\delta}$.

In fact the perceptron algorithm gives a stronger guarantee: if the $k$ policies found in the run of the perceptron algorithm are $\pi_1, \pi_2, \ldots, \pi_k \in \Pi$, then $W$ is within a distance of $2\delta$ from their convex hull, $C' = \mathrm{conv}(\pi_1, \pi_2, \ldots, \pi_k)$. This is because a run of the perceptron algorithm on $C'_{2\delta}$ would be identical to that on $C_{2\delta}$ for $k$ steps. We can then compute the explicit distribution over policies $P$ by computing the Euclidean projection of $W$ on $C'$ in $\mathrm{poly}(k)$ time using a convex quadratic program:
\[
\min\ \Bigl\|W - \sum_{i=1}^{k} P_i\,\pi_i\Bigr\|^2 \quad\text{s.t.}\quad \sum_i P_i = 1, \qquad \forall i: P_i \ge 0.
\]
Solving this quadratic program, we get a distribution $P$ over the policies $\{\pi_1, \pi_2, \ldots, \pi_k\}$ such that $\|W_P - W\| \le 2\delta$.
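A minimal sketch of the perceptron-based membership test of Lemma 29, assuming an AMO wrapper `amo_argmax(v)` that returns $\arg\max_{Z\in C} v\cdot Z$ (Lemma 9), with the shift of Corollary 28 used to optimize over $C_\delta$; the iteration budget and return conventions are illustrative, not the paper's pseudocode.

```python
import numpy as np

def membership_check(W, amo_argmax, delta, t):
    """Perceptron-style check that W is close to C (sketch of Lemma 29).

    Returns ('separate', w) with a normal w of a hyperplane separating W from
    C_delta, or ('inside', witnesses) if no separating hyperplane is found
    within the O(t/delta^2) budget, in which case W is within 2*delta of the
    convex hull of the witness policies returned by the oracle.
    """
    w = np.zeros_like(W)                  # candidate normal (origin shifted to W)
    witnesses = []                        # AMO outputs (vertices of C, i.e. policies)
    budget = int(np.ceil(t / delta ** 2))
    for _ in range(budget):
        Z = amo_argmax(-w)                # argmax_{Z in C} (-w) . Z, one AMO call
        witnesses.append(Z)
        if np.linalg.norm(w) > 0:
            # Corollary 28: argmax over C_delta is the C-argmax shifted by delta
            Z = Z - delta * w / np.linalg.norm(w)
        if w @ (Z - W) <= 0:
            w = w + (Z - W)               # perceptron update: Z not yet separated
        else:
            return 'separate', w          # all of C_delta lies strictly on one side
    return 'inside', witnesses
```

In the 'inside' case, the explicit distribution $P$ of Lemma 29 would then come from projecting $W$ onto the convex hull of the witnesses, e.g. with any off-the-shelf convex QP solver for the quadratic program above.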
Finally, we show how to check constraint (E.5):

Lemma 30. Suppose we are given a point $W$. In $O(\frac{t^3K^2}{\delta^2}\cdot\log(\frac{t}{\delta}))$ iterations, we can either find a point $Z \in C_{2\delta}$ such that
\[
\mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] \ge \max\{4K,\ \beta_t\,\Delta(Z)^2\} + 2\epsilon,
\]
or else we conclude correctly that for all $Z \in C$, we have
\[
\mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] \le \max\{4K,\ \beta_t\,\Delta(Z)^2\} + 3\epsilon.
\]

Proof. We first rewrite $\eta$ as a linear function, $\eta(Z) = w\cdot Z$, where $w$ is the vector defined as
\[
w(x,a) = \frac{1}{t-1}\sum_{\substack{(x',a',r,p)\in h:\\ x'=x,\ a'=a}} \frac{r}{p}.
\]
Thus, $\Delta(Z) = v - w\cdot Z$, where $v = \max_{\pi'}\eta(\pi') = \max_{\pi'} w\cdot\pi'$, which can be computed by using AMO once.

Next, using the candidate point $W$, compute the vector $u$ defined as $u(x,a) = \frac{n_x/t}{W'(x,a)}$, where $n_x$ is the number of times $x$ appears in $h$, so that $\mathbb{E}_{x\sim h}\bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\bigr] = u\cdot Z$. Now, the problem reduces to finding a point $Z \in C$ which violates the constraint
\[
u\cdot Z \le \max\{4K,\ \beta_t(w\cdot Z - v)^2\} + 3\epsilon.
\]
Define
\[
f(Z) = \max\{4K,\ \beta_t(w\cdot Z - v)^2\} + 3\epsilon - u\cdot Z.
\]
Note that $f$ is a convex function of $Z$. Checking for violation of the above constraint is equivalent to solving the following (convex) program:
\[
f(Z) \le 0 \tag{E.6}
\]
\[
Z \in C \tag{E.7}
\]
To do this, we again apply the ellipsoid method, but on the relaxed program
\[
f(Z) \le \epsilon \tag{E.8}
\]
\[
Z \in C_\delta \tag{E.9}
\]
To run the ellipsoid algorithm, we need a separation oracle for the program. Given a candidate solution $Z$, we run the algorithm of Lemma 29, and if $Z \notin C_{2\delta}$, we construct a hyperplane separating $Z$ from $C_\delta$. Now suppose we conclude that $Z \in C_{2\delta}$. Then we construct a separation oracle for (E.6) as follows. If $f(Z) > \epsilon$, then since $f$ is a convex function of $Z$, we can construct a separating hyperplane as in Lemma 10.

Now we can run the ellipsoid algorithm with the starting ellipsoid being $B(0,\sqrt{t})$. If there is a point $Z^\star \in C$ such that $f(Z^\star) \le 0$, then consider the ball $B\bigl(Z^\star, \frac{4\delta}{5\sqrt{tK\beta_t}}\bigr)$. For any $Y \in B\bigl(Z^\star, \frac{4\delta}{5\sqrt{tK\beta_t}}\bigr)$, we have
\[
|(u\cdot Z^\star) - (u\cdot Y)| \le \|u\|\,\|Z^\star - Y\| \le \frac{\epsilon}{2}
\]
since $\|u\| \le \frac{K}{\mu_t}$. Also,
\[
\beta_t\,\bigl|(w\cdot Z^\star - v)^2 - (w\cdot Y - v)^2\bigr| = \beta_t\,\bigl|(w\cdot Z^\star - w\cdot Y)(w\cdot Z^\star + w\cdot Y - 2v)\bigr| \le \beta_t\,\|w\|\,\|Z^\star - Y\|\,\bigl(\|w\|(\|Z^\star\| + \|Y\|) + 2|v|\bigr) \le \frac{\epsilon}{2},
\]
since $\|w\| \le \frac{1}{\mu_t}$, $\|Z^\star\| \le \sqrt{t}$, $\|Y\| \le \sqrt{t} + \delta \le 2\sqrt{t}$, and $|v| \le \|w\|\cdot\sqrt{t} \le \frac{\sqrt{t}}{\mu_t}$. Thus, $f(Y) \le f(Z^\star) + \epsilon \le \epsilon$, so the entire ball $B\bigl(Z^\star, \frac{4\delta}{5\sqrt{tK\beta_t}}\bigr)$ is feasible for the relaxed program.

By Lemma 8, in $O(t^2K^2\cdot\log(\frac{tK}{\delta}))$ iterations of the ellipsoid algorithm, we obtain one of the following:
1. we either find a point $Z \in C_{2\delta}$ such that $f(Z) \le \epsilon$, i.e.
\[
\mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] \ge \max\{4K,\ \beta_t\,\Delta(Z)^2\} + 2\epsilon,
\]
2. or else we conclude that the original convex program (E.6, E.7) is infeasible, i.e. for all $Z \in C$, we have
\[
\mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] \le \max\{4K,\ \beta_t\,\Delta(Z)^2\} + 3\epsilon.
\]
The total number of iterations is bounded by $O(t^2K^2\cdot\log(\frac{tK}{\delta}))\cdot O(\frac{t}{\delta^2}) = O(\frac{t^3K^2}{\delta^2}\cdot\log(\frac{tK}{\delta}))$.

Lemma 31. Suppose we are given a point $Z \in C_{2\delta}$ such that
\[
\mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] \ge \max\{4K,\ \beta_t\,\Delta(Z)^2\} + 2\epsilon.
\]
Then we can construct a hyperplane separating $W$ from all feasible points for A'.

Proof. For notational convenience, define the function
\[
f_Z(W) := \mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] - \max\{4K,\ \beta_t\,\Delta(Z)^2\} - 2\epsilon.
\]
Note that it is a convex function of $W$. Note that for any point $U$ that is feasible for A', we have $f_Z(U) \le -\epsilon$, whereas $f_Z(W) \ge 0$. Thus, by Lemma 10, we can construct the desired separating hyperplane.
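To make the quantities in the proof of Lemma 30 concrete, here is a sketch, under assumed data layouts, of how $w$, $u$, $v$ and the convex function $f$ with a subgradient could be computed; the subgradient is what the inner ellipsoid run cuts with when $f(Z) > \epsilon$ ("separating hyperplane as in Lemma 10"). The history format $(x,a,r,p)$ with integer indices and the helper names are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def build_w_u_v(history, W_prime, amo_argmax):
    """Vectors used in the proof of Lemma 30 (illustrative sketch).

    history: list of (x, a, r, p) with x, a integer indices;
    W_prime: array of shape (num_contexts, K) holding the smoothed values W'(x, a);
    amo_argmax(w): assumed to return argmax over C of w . Z as an array (one AMO call).
    """
    t_minus_1 = len(history)
    num_contexts, K = W_prime.shape
    w = np.zeros((num_contexts, K))
    n_x = np.zeros(num_contexts)
    for (x, a, r, p) in history:
        w[x, a] += r / (p * t_minus_1)           # w(x,a) = (1/(t-1)) sum of r/p over matching records
        n_x[x] += 1
    u = (n_x[:, None] / (t_minus_1 + 1)) / W_prime   # u(x,a) = (n_x / t) / W'(x,a)
    v = np.sum(w * amo_argmax(w))                # v = max_pi w . pi, via one AMO call
    return w, u, v

def f_and_subgradient(Z, w, u, v, beta_t, K, eps):
    """f(Z) = max{4K, beta_t (w.Z - v)^2} + 3*eps - u.Z, plus one subgradient in Z."""
    wz = np.sum(w * Z)
    quad = beta_t * (wz - v) ** 2
    val = max(4 * K, quad) + 3 * eps - np.sum(u * Z)
    grad = -u if 4 * K >= quad else 2 * beta_t * (wz - v) * w - u
    return val, grad
```

A violated constraint $f(Z) \le \epsilon$ is then cut with the half-space $\{Y : f(Z) + \mathrm{grad}\cdot(Y - Z) \le \epsilon\}$, the standard subgradient form of the hyperplane that Lemma 10 provides.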
We can finally prove Theorem 11.

Proof of Theorem 11. We run the ellipsoid algorithm starting with the ball $B(0, \sqrt{t}+\delta)$. At each point, we are given a candidate solution $W$ for program A'. We check for violation of constraint (E.3) first. If it is violated, the constraint, being linear, gives us a separating hyperplane. Else, we use Lemma 29 to check for violation of constraint (E.4). If $W \notin C_{2\delta}$, then we can construct a separating hyperplane. Else, we use Lemmas 30 and 31 to check for violation of constraint (E.5). If there is a $Z \in C$ such that
\[
\mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] \ge \max\{4K,\ \beta_t\,\Delta(Z)^2\} + 3\epsilon,
\]
then we can find a separating hyperplane. Else, we conclude that the current point $W$ satisfies the following constraints:
\[
\Delta(W) \le s + \gamma,
\]
\[
\forall Z \in C:\quad \mathbb{E}_{x\sim h}\Bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\Bigr] \le \max\{4K,\ \beta_t\,\Delta(Z)^2\} + 3\epsilon,
\]
\[
W \in C_{2\delta}.
\]
We can then use the perceptron-based algorithm of Lemma 29 to "round" $W$ to an explicit distribution $P$ over policies in $\Pi$ such that $W_P$ satisfies $\|W_P - W\| \le 2\delta$. Then Lemma 26 implies the stated bounds for $W_P$.

By Lemma 8, in $O(t^2K^2\log(\frac{t}{\delta}))$ iterations of the ellipsoid algorithm, we find the point $W$ satisfying the constraints given above, or declare correctly that A is infeasible. In the worst case, we might have to run the algorithm of Lemma 30 in every iteration, leading to an upper bound of $O(t^2K^2\log(\frac{t}{\delta}))\times O(\frac{t^3K^2}{\delta^2}\cdot\log(\frac{tK}{\delta})) = O(\frac{t^5K^4}{\delta^2}\log^2(\frac{tK}{\delta}))$ on the number of iterations.
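The proof of Theorem 11 is essentially a dispatch among the three constraint checks; the sketch below mirrors that control flow with the subroutines of Lemmas 29-31 passed in as stubs. All names and return conventions here are illustrative, not the paper's.

```python
def separation_oracle_A_prime(W, check_E3, membership_check,
                              find_violating_Z, hyperplane_from_Z,
                              round_to_distribution):
    """One call of the separation oracle for the relaxed program A' (sketch).

    check_E3(W)            -> None, or the (linear) cut if Delta(W) > s + gamma
    membership_check(W)    -> ('separate', h) or ('inside', witnesses)    [Lemma 29]
    find_violating_Z(W)    -> a point Z in C_2delta violating (E.5), or None  [Lemma 30]
    hyperplane_from_Z(W,Z) -> a separating hyperplane from a violating Z      [Lemma 31]
    round_to_distribution(W, witnesses) -> explicit distribution P over policies
    """
    h = check_E3(W)                               # constraint (E.3): linear, easy
    if h is not None:
        return ('cut', h)
    status, payload = membership_check(W)         # constraint (E.4): Lemma 29
    if status == 'separate':
        return ('cut', payload)
    Z = find_violating_Z(W)                       # constraint (E.5): Lemma 30
    if Z is not None:
        return ('cut', hyperplane_from_Z(W, Z))   # Lemma 31
    # W satisfies all relaxed constraints: round it to a distribution over policies
    return ('feasible', round_to_distribution(W, payload))
```

Feeding this oracle to the ellipsoid loop sketched earlier reproduces the accounting in the proof: $O(t^2K^2\log(t/\delta))$ outer cuts, each costing at most the $O((t^3K^2/\delta^2)\log(tK/\delta))$ iterations of Lemma 30.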