
Efficient Optimal Learning for Contextual Bandits
Miroslav Dudik
Daniel Hsu
John Langford
Satyen Kale
Lev Reyzin
Nikos Karampatziakis
Tong Zhang
Abstract
We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions,
and receives reward for the action taken. We
provide the first efficient algorithm with an
optimal regret. Our algorithm uses a cost-sensitive classification learner as an oracle
and has a running time polylog(N ), where N
is the number of classification rules among
which the oracle might choose. This is exponentially faster than all previous algorithms
that achieve optimal regret in this setting.
Our formulation also enables us to create an
algorithm with regret that is additive rather
than multiplicative in feedback delay as in all
previous work.
1 INTRODUCTION

The contextual bandit setting is a half-way point between standard supervised learning and full-scale reinforcement learning where it appears possible to construct algorithms with convergence rate guarantees
similar to supervised learning. Many natural settings
satisfy this half-way point, motivating the investigation of contextual bandit learning. For example, the
problem of choosing interesting news articles or ads for
users by internet companies can be naturally modeled
as a contextual bandit setting. In the medical domain
where discrete treatments are tested before approval,
the process of deciding which patients are eligible for
a treatment takes contexts into account. More generally, we can imagine that in a future with personalized
medicine, new treatments are essentially equivalent to
new actions in a contextual bandit setting.
The contextual bandit setting consists of the following
loop repeated indefinitely:
1. The world presents context information as features x.
2. The learning algorithm chooses an action a from
K possible actions.
3. The world presents a reward r for the action.
The key difference between the contextual bandit setting and standard supervised learning is that only the
reward of the chosen action is revealed. For example,
after always choosing the same action several times
in a row, the feedback given provides almost no basis to prefer the chosen action over another action.
In essence, the contextual bandit setting captures the difficulty of exploration while avoiding the difficulty of credit assignment as in more general reinforcement learning settings.
In the i.i.d. setting, the world draws a pair (x, ~r) consisting of a context and a reward vector from some
unknown distribution D, revealing x in Step 1, but
only the reward r(a) of the chosen action a in Step 3.
Given a set of policies Π = {π : X → A}, the goal
is to create an algorithm for Step 2 which competes
with the set of policies. We measure our success by
comparing the algorithm’s cumulative reward to the
expected cumulative reward of the best policy in the
set. The difference of the two is called regret.
All existing algorithms for this setting either achieve
a suboptimal regret (Langford and Zhang, 2007) or
require computation linear in the number of policies (Auer et al., 2002b; Beygelzimer et al., 2011). In
unstructured policy spaces, this computational complexity is the best one can hope for. On the other
hand, in the case where the rewards of all actions are
revealed, the problem is equivalent to cost-sensitive
classification, and we know of algorithms to efficiently
search the space of policies (classification rules) such
as cost-sensitive logistic regression and support vector machines. In these cases, the space of classification rules is exponential in the number of features, but
these problems can be efficiently solved using convex
optimization.
Our goal here is to efficiently solve the contextual bandit problems for similarly large policy spaces. We
do this by reducing the contextual bandit problem
to cost-sensitive classification. Given a supervised
cost-sensitive learning algorithm as an oracle (Beygelzimer et al., 2009), our algorithm runs in time only polylog(N) while achieving regret $O(\sqrt{TK \ln N})$, where N is the number of possible policies (classification rules), K is the number of actions (classes), and T is the number of time steps. This efficiency is achieved in a modular way, so any future improvement in cost-sensitive learning immediately applies here.
1.1 PREVIOUS WORK AND MOTIVATION
All previous regret-optimal approaches are measure
based—they work by updating a measure over policies, an operation which is linear in the number of
policies. In contrast, regret guarantees scale only logarithmically in the number of policies. If not for the
computational bottleneck, these regret guarantees imply that we could dramatically increase performance in
contextual bandit settings using more expressive policies. We overcome the computational bottleneck using
an algorithm which works by creating cost-sensitive
classification instances and calling an oracle to choose
optimal policies. Actions are chosen based on the
policies returned by the oracle rather than according to a measure over all policies. This is reminiscent
of AdaBoost (Freund and Schapire, 1997), which creates weighted binary classification instances and calls
a “weak learner” oracle to obtain classification rules.
These classification rules are then combined into a final classifier with boosted accuracy. Just as AdaBoost converts a weak learner into a strong learner,
our approach converts a cost-sensitive classification
learner into an algorithm that solves the contextual
bandit problem.
In a more difficult version of contextual bandits, an adversary chooses (x, ~r) given knowledge of the learning
algorithm (but not any random numbers). All known
regret-optimal solutions in the adversarial setting are
variants of the EXP4 algorithm (Auer et al., 2002b).
EXP4 achieves the same regret rate as our algorithm: $O(\sqrt{KT \ln N})$, where T is the number of time steps, K is the number of actions available in each time step, and N is the number of policies.
Why not use EXP4 in the i.i.d. setting? For example, it is known that the algorithm can be modified to succeed with high probability (Beygelzimer et al., 2011), and also for VC classes when the adversary is constrained to i.i.d. sampling. There are two central benefits that we hope to realize by directly assuming i.i.d. contexts and reward vectors.

1. Computational Tractability. Even when the reward vector is fully known, adversarial regrets scale as $O(\sqrt{T \ln N})$ while computation scales as O(N) in general. One attempt to get around this is the follow-the-perturbed-leader algorithm (Kalai and Vempala, 2005), which provides a computationally tractable solution in certain special-case structures. This algorithm has no mechanism for efficient application to arbitrary policy spaces, even given an efficient cost-sensitive classification oracle. An efficient cost-sensitive classification oracle has been shown effective in transductive settings (Kakade and Kalai, 2005). Aside from the drawback of requiring a transductive setting, the regret achieved there is substantially worse than for EXP4.
2. Improved Rates. When the world is not completely adversarial, it is possible to achieve substantially lower regrets than are possible with algorithms optimized for the adversarial setting. For example, in supervised learning, it is possible to obtain regrets scaling as O(log(T)) with a problem-dependent constant (Bartlett et al., 2007). When the feedback is delayed by τ rounds, lower bounds imply that the regret in the adversarial setting increases by a multiplicative $\sqrt{\tau}$, while in the i.i.d. setting it is possible to achieve an additive regret of τ (Langford et al., 2009).
In a direct i.i.d. setting, the previous-best approach using a cost-sensitive classification oracle was given by the ε-greedy and epoch-greedy algorithms (Langford and Zhang, 2007), which have a regret scaling as $O(T^{2/3})$ in the worst case.
There have also been many special-case analyses. For example, the theory of the context-free setting is well understood (Lai and Robbins, 1985; Auer et al., 2002a; Even-Dar et al., 2006). Similarly, good algorithms exist when rewards are linear functions of features (Auer, 2002) or actions lie in a continuous space with the reward function sampled according to a Gaussian process (Srinivas et al., 2010).
1.2 WHAT WE PROVE
In Section 3 we state the PolicyElimination algorithm, and prove the following regret bound for it.
Theorem 4. For all distributions D over $(x, \vec r)$ with K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of PolicyElimination (Algorithm 1) over T rounds is at most
$$16\sqrt{2TK \ln \frac{4T^2 N}{\delta}}\,.$$
This result can be extended to deal with VC classes,
as well as other special cases. It forms the simplest
method we have of exhibiting the new analysis.
The key new element of this algorithm is the identification of a distribution over actions which simultaneously achieves small expected regret and allows estimating the value of every policy with small variance. The existence of such a distribution is shown nonconstructively by a minimax argument.
PolicyElimination is computationally intractable
and also requires exact knowledge of the context distribution (but not the reward distribution!). We show
how to address these issues in Section 4 using an algorithm we call RandomizedUCB. Namely, we prove
the following theorem.
Theorem 5. For all distributions D over (x, ~r) with
K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of RandomizedUCB
(Algorithm 2) over T rounds is at most
$$O\!\left(\sqrt{TK \log (TN/\delta)} + K \log(NK/\delta)\right).$$
RandomizedUCB's analysis is substantially more complex, with a key subroutine being an application of the ellipsoid algorithm with a cost-sensitive classification oracle (described in Section 5). RandomizedUCB does not assume knowledge of the context distribution, and instead works with the history of contexts it has observed. Modifying the proof for this empirical distribution requires a covering argument over the distributions over policies which uses the probabilistic method. The net result is an algorithm with a similar top-level analysis as PolicyElimination, but with a running time only poly-logarithmic in the number of policies given a cost-sensitive classification oracle.
Theorem 11. In each time step t, RandomizedUCB makes at most O(poly(t, K, log(1/δ), log N)) calls to the cost-sensitive classification oracle, and requires additional O(poly(t, K, log N)) processing time.
Apart from a tractable algorithm, our analysis can be used to derive tighter regrets than would be possible in the adversarial setting. For example, in Section 6, we consider a common setting where reward feedback is delayed by τ rounds. A straightforward modification of PolicyElimination yields a regret with an additive term proportional to τ compared with the delay-free setting. Namely, we prove the following.
Theorem 12. For all distributions D over (x, ~r) with
K actions, for all sets of N policies Π, and all delay
intervals τ , with probability at least 1 − δ, the regret
of DelayedPE (Algorithm 3) is at most
$$16\sqrt{2K \ln \frac{4T^2 N}{\delta}}\left(\tau + \sqrt{T}\right).$$
We start next with precise settings and definitions.
2 SETTING AND DEFINITIONS

2.1 THE SETTING
Let A be the set of K actions, let X be the domain of
contexts x, and let D be an arbitrary joint distribution
on (x, ~r). We denote the marginal distribution of D
over X by DX .
We let Π denote a finite set of policies {π : X → A}, where each policy π, given a context $x_t$ in round t, chooses the action $\pi(x_t)$. The cardinality of Π is denoted by N. Let $\vec r_t \in [0,1]^K$ be the vector of rewards, where $r_t(a)$ is the reward of action a on round t.
In the i.i.d. setting, on each round t = 1 . . . T , the
world chooses (xt , ~rt ) i.i.d. according to D and reveals
xt to the learner. The learner, having access to Π,
chooses action at ∈ {1, . . . , K}. Then the world reveals
reward rt (at ) (which we call rt for short) to the learner,
and the interaction proceeds to the next round.
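To make the protocol concrete, here is a minimal Python sketch of the i.i.d. interaction loop; the toy environment, policy class, and uniform-exploration learner below are illustrative stand-ins (not part of the paper), and the empirical best-policy sum is used as a stand-in for the expected value in the regret definition of Section 2.3.

```python
# Minimal sketch of the i.i.d. contextual bandit protocol of Section 2.1.
# The environment, policy class, and learner here are illustrative stand-ins.
import random

K = 3          # number of actions
T = 1000       # number of rounds
random.seed(0)

def draw_from_D():
    """Draw (x, r) i.i.d.: a context and a full reward vector (only r[a_t] is revealed)."""
    x = random.random()
    r = [max(0.0, min(1.0, 0.5 + (a - K * x) * 0.1 + random.gauss(0, 0.05))) for a in range(K)]
    return x, r

# A tiny policy class Pi: threshold policies mapping contexts to actions.
policies = [lambda x, th=th: min(K - 1, int(x > th) + int(x > 2 * th)) for th in (0.2, 0.4, 0.6)]

def learner_choose(x, t):
    """Placeholder learner: uniform exploration (the paper's algorithms replace this step)."""
    return random.randrange(K)

reward_sum = 0.0
policy_sums = [0.0] * len(policies)   # hindsight bookkeeping, only to estimate regret here
for t in range(1, T + 1):
    x, r = draw_from_D()              # world draws (x, r) from D and reveals x
    a = learner_choose(x, t)          # learner picks an action
    reward_sum += r[a]                # only r[a] is revealed to the learner
    for i, pi in enumerate(policies):
        policy_sums[i] += r[pi(x)]    # not available to the learner

regret = max(policy_sums) - reward_sum
print(f"cumulative reward {reward_sum:.1f}, regret vs. best policy {regret:.1f}")
```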
We consider two modes of accessing the set of policies
Π. The first option is through the enumeration of all
policies. This is impractical in general, but suffices
for the illustrative purpose of our first algorithm. The
second option is an oracle access, through an argmax
oracle, corresponding to a cost-sensitive learner:
Definition 1. For a set of policies Π, an argmax oracle (AMO for short) is an algorithm which, for any sequence $\{(x_{t'}, \vec r_{t'})\}_{t'=1\ldots t}$, $x_{t'} \in X$, $\vec r_{t'} \in \mathbb{R}^K$, computes
$$\arg\max_{\pi\in\Pi}\ \sum_{t'=1\ldots t} r_{t'}(\pi(x_{t'}))\,.$$
The reason why the above can be viewed as a cost-sensitive classification oracle is that the vectors of rewards $\vec r_{t'}$ can be interpreted as negative costs, and hence the policy returned by the AMO is the optimal cost-sensitive classifier on the given data.
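As a concrete (and deliberately inefficient) illustration, an AMO can be realized by brute-force enumeration over Π; the whole point of Section 5 is that a cost-sensitive learner plays this role without enumeration. The sketch below is illustrative only and the type aliases are ours.

```python
# Illustrative brute-force argmax oracle (AMO) over an explicitly enumerated policy set.
# Real instantiations replace the enumeration with a cost-sensitive learner.
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[float], int]   # a policy maps a context to an action index

def argmax_oracle(policies: Sequence[Policy],
                  data: Sequence[Tuple[float, List[float]]]) -> Policy:
    """Return argmax over pi in Pi of sum_{t'} r_{t'}(pi(x_{t'})) for the given (x, reward-vector) pairs."""
    best_pi, best_val = None, float("-inf")
    for pi in policies:
        val = sum(r[pi(x)] for x, r in data)
        if val > best_val:
            best_pi, best_val = pi, val
    return best_pi

# Interpreting reward vectors as negative costs makes this exactly cost-sensitive
# classification: the returned policy is the cost-minimizing classifier on the data.
```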
2.2 EXPECTED AND EMPIRICAL REWARDS
Let the expected instantaneous reward of a policy π ∈ Π be denoted by
$$\eta_D(\pi) \doteq \mathop{\mathbb{E}}_{(x,\vec r)\sim D}\bigl[r(\pi(x))\bigr]\,.$$
The best policy $\pi_{\max} \in \Pi$ is that which maximizes $\eta_D(\pi)$. More formally,
$$\pi_{\max} \doteq \operatorname*{argmax}_{\pi\in\Pi}\, \eta_D(\pi)\,.$$
We define $h_t$ to be the history at time t that the learner has seen. Specifically,
$$h_t = \bigcup_{t'=1\ldots t} (x_{t'}, a_{t'}, r_{t'}, p_{t'})\,,$$
where $p_{t'}$ is the probability of the algorithm choosing action $a_{t'}$ at time $t'$. Note that $a_{t'}$ and $p_{t'}$ are produced by the learner while $x_{t'}, r_{t'}$ are produced by nature. We write $x \sim h$ to denote choosing x uniformly at random from the x's in history h.

Using the history of past actions and probabilities with which they were taken, we can form an unbiased estimate of the policy value for any π ∈ Π:
$$\eta_t(\pi) \doteq \frac{1}{t} \sum_{(x,a,r,p)\in h_t} \frac{r\, I(\pi(x)=a)}{p}\,.$$
The unbiasedness follows because $\mathbb{E}_{a\sim p}\bigl[\tfrac{r I(\pi(x)=a)}{p(a)}\bigr] = \sum_a p(a)\, \tfrac{r I(\pi(x)=a)}{p(a)} = r(\pi(x))$. The empirically best policy at time t is denoted
$$\pi_t \doteq \operatorname*{argmax}_{\pi\in\Pi}\, \eta_t(\pi)\,.$$

2.3 REGRET

The goal of this work is to obtain a learner that has small regret relative to the expected performance of $\pi_{\max}$ over T rounds, which is
$$\sum_{t=1\ldots T} \bigl(\eta_D(\pi_{\max}) - r_t\bigr)\,. \tag{2.1}$$
We say that the regret of the learner over T rounds is bounded by ε with probability at least 1 − δ, if
$$\Pr\Bigl[\,\sum_{t=1\ldots T} \bigl(\eta_D(\pi_{\max}) - r_t\bigr) \le \varepsilon\Bigr] \ge 1 - \delta\,,$$
where the probability is taken with respect to the random pairs $(x_t, \vec r_t) \sim D$ for t = 1 . . . T, as well as any internal randomness used by the learner.

We can also define notions of regret and empirical regret for policies π. For all π ∈ Π, let
$$\Delta_D(\pi) = \eta_D(\pi_{\max}) - \eta_D(\pi)\,, \qquad \Delta_t(\pi) = \eta_t(\pi_t) - \eta_t(\pi)\,.$$
Our algorithms work by choosing distributions over policies, which in turn induce distributions over actions. For any distribution P over policies Π, let $W_P(x,a)$ denote the induced conditional distribution over actions a given the context x:
$$W_P(x,a) \doteq \sum_{\pi\in\Pi:\, \pi(x)=a} P(\pi)\,. \tag{2.2}$$
In general, we shall use W, W′ and Z as conditional probability distributions over the actions A given contexts X, i.e., $W : X \times A \to [0,1]$ such that $W(x,\cdot)$ is a probability distribution over A (and similarly for W′ and Z). We shall think of W′ as a smoothed version of W with a minimum action probability of µ (to be defined by the algorithm), such that
$$W'(x,a) = (1 - K\mu)\,W(x,a) + \mu\,.$$
Conditional distributions such as W (and W′, Z, etc.) correspond to randomized policies. We define notions of true and empirical value and regret for them as follows:
$$\eta_D(W) \doteq \mathop{\mathbb{E}}_{(x,\vec r)\sim D}\bigl[\vec r \cdot W(x)\bigr]\,, \qquad \eta_t(W) \doteq \frac{1}{t} \sum_{(x,a,r,p)\in h_t} \frac{r\, W(x,a)}{p}\,,$$
$$\Delta_D(W) = \eta_D(\pi_{\max}) - \eta_D(W)\,, \qquad \Delta_t(W) = \eta_t(\pi_t) - \eta_t(W)\,.$$
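A short sketch may help fix the estimators just defined: the inverse-propensity value estimate $\eta_t(\pi)$, the induced action distribution $W_P$, and its smoothed version W′. The function and variable names are ours, not the paper's.

```python
# Sketch of the quantities in Sections 2.2-2.3: the IPS value estimate eta_t(pi),
# the induced action distribution W_P(x, .), and its mu-smoothed version W'.
from typing import Callable, Dict, List, Sequence, Tuple

Policy = Callable[[float], int]
HistoryItem = Tuple[float, int, float, float]   # (x, a, r, p) as in h_t

def eta_hat(pi: Policy, history: Sequence[HistoryItem]) -> float:
    """eta_t(pi) = (1/t) * sum over (x, a, r, p) in h_t of r * 1[pi(x) = a] / p."""
    t = len(history)
    return sum(r * (pi(x) == a) / p for (x, a, r, p) in history) / max(t, 1)

def induced_action_dist(P: Dict[Policy, float], x: float, K: int) -> List[float]:
    """W_P(x, a) = sum of P(pi) over policies pi with pi(x) = a."""
    W = [0.0] * K
    for pi, prob in P.items():
        W[pi(x)] += prob
    return W

def smooth(W: List[float], mu: float) -> List[float]:
    """W'(x, a) = (1 - K*mu) * W(x, a) + mu, guaranteeing a minimum action probability mu."""
    K = len(W)
    return [(1 - K * mu) * w + mu for w in W]
```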
3 POLICY ELIMINATION
The basic ideas behind our approach are demonstrated
in our first algorithm: PolicyElimination (Algorithm 1).
The key step is Step 1, which finds a distribution over
policies which induces low variance in the estimate of
the value of all policies. Below we use the minimax theorem to show that such a distribution always exists.
How to find this distribution is not specified here, but
in Section 5 we develop a method based on the ellipsoid algorithm. Step 2 then projects this distribution
onto a distribution over actions and applies smoothing.
Finally, Step 5 eliminates the policies that have been
determined to be suboptimal (with high probability).
Algorithm 1 PolicyElimination(Π, δ, K, $D_X$)
Let $\Pi_0 = \Pi$ and history $h_0 = \emptyset$
Define: $\delta_t \doteq \delta / (4Nt^2)$
Define: $b_t \doteq 2\sqrt{\frac{2K \ln(1/\delta_t)}{t}}$
Define: $\mu_t \doteq \min\left\{ \frac{1}{2K},\, \sqrt{\frac{\ln(1/\delta_t)}{2Kt}} \right\}$
For each timestep t = 1 . . . T, observe $x_t$ and do:
1. Choose distribution $P_t$ over $\Pi_{t-1}$ s.t. ∀ π ∈ $\Pi_{t-1}$:
$$\mathop{\mathbb{E}}_{x\sim D_X}\left[\frac{1}{(1 - K\mu_t)\, W_{P_t}(x, \pi(x)) + \mu_t}\right] \le 2K$$
2. Let $W'_t(a) = (1 - K\mu_t)\, W_{P_t}(x_t, a) + \mu_t$ for all a ∈ A
3. Choose $a_t \sim W'_t$
4. Observe reward $r_t$
5. Let $\Pi_t = \bigl\{ \pi \in \Pi_{t-1} :\ \eta_t(\pi) \ge \max_{\pi'\in\Pi_{t-1}} \eta_t(\pi') - 2b_t \bigr\}$
6. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$

ALGORITHM ANALYSIS

We analyze PolicyElimination in several steps. First, we prove the existence of $P_t$ in Step 1, provided that $\Pi_{t-1}$ is non-empty. We recast the feasibility problem in Step 1 as a game between two players: Prover, who is trying to produce $P_t$, and Falsifier, who is trying to find π violating the constraints. We give more power to Falsifier and allow him to choose a distribution over π (i.e., a randomized policy) which would violate the constraints.

Note that any policy π corresponds to a point in the space of randomized policies (viewed as functions $X \times A \to [0,1]$), with $\pi(x,a) \doteq I(\pi(x) = a)$. For any distribution P over policies in $\Pi_{t-1}$, the induced randomized policy $W_P$ then corresponds to a point in the convex hull of $\Pi_{t-1}$. Denoting the convex hull of $\Pi_{t-1}$ by C, Prover's choice by W and Falsifier's choice by Z, the feasibility of Step 1 follows by the following lemma:

Lemma 1. Let C be a compact and convex set of randomized policies. Let µ ∈ (0, 1/K] and for any W ∈ C, $W'(x,a) \doteq (1 - K\mu)W(x,a) + \mu$. Then for all distributions D,
$$\min_{W\in C} \max_{Z\in C} \mathop{\mathbb{E}}_{x\sim D_X} \mathop{\mathbb{E}}_{a\sim Z(x,\cdot)}\left[\frac{1}{W'(x,a)}\right] \le \frac{K}{1 - K\mu}\,.$$

Proof. Let $f(W,Z) \doteq \mathbb{E}_{x\sim D_X} \mathbb{E}_{a\sim Z(x,\cdot)}[1/W'(x,a)]$ denote the inner expression of the minimax problem. Note that f(W, Z) is:
• everywhere defined: Since $W'(x,a) \ge \mu$, we obtain that $1/W'(x,a) \in [0, 1/\mu]$, hence the expectations are defined for all W and Z.
• linear in Z: Linearity follows from rewriting f(W, Z) as
$$f(W,Z) = \mathop{\mathbb{E}}_{x\sim D_X}\left[\sum_{a\in A} \frac{Z(x,a)}{W'(x,a)}\right].$$
• convex in W: Note that $1/W'(x,a)$ is convex in W(x, a) by convexity of $1/(c_1 w + c_2)$ in w ≥ 0, for $c_1 \ge 0$, $c_2 > 0$. Convexity of f(W, Z) in W then follows by taking expectations over x and a.
Hence, by Theorem 14 (in Appendix B), min and max can be reversed without affecting the value:
$$\min_{W\in C} \max_{Z\in C} f(W,Z) = \max_{Z\in C} \min_{W\in C} f(W,Z)\,.$$
The right-hand side can be further upper-bounded by $\max_{Z\in C} f(Z,Z)$, which is upper-bounded by
$$f(Z,Z) = \mathop{\mathbb{E}}_{x\sim D_X}\left[\sum_{a\in A} \frac{Z(x,a)}{Z'(x,a)}\right] \le \mathop{\mathbb{E}}_{x\sim D_X}\left[\sum_{a\in A:\, Z(x,a)>0} \frac{Z(x,a)}{(1-K\mu)Z(x,a)}\right] = \frac{K}{1-K\mu}\,.$$

Corollary 2. The set of distributions satisfying the constraints of Step 1 is non-empty.

Given the existence of $P_t$, we will see below that the constraints in Step 1 ensure low variance of the policy value estimator $\eta_t(\pi)$ for all $\pi \in \Pi_{t-1}$. The small variance is used to ensure accuracy of policy elimination in Step 5, as quantified in the following lemma:

Lemma 3. With probability at least 1 − δ, for all t:
1. $\pi_{\max} \in \Pi_t$ (i.e., $\Pi_t$ is non-empty)
2. $\eta_D(\pi_{\max}) - \eta_D(\pi) \le 4b_t$ for all $\pi \in \Pi_t$

Proof. We will show that for any policy $\pi \in \Pi_{t-1}$, the probability that $\eta_t(\pi)$ deviates from $\eta_D(\pi)$ by more than $b_t$ is at most $2\delta_t$. Taking the union bound over all policies and all time steps, we find that with probability at least 1 − δ,
$$|\eta_t(\pi) - \eta_D(\pi)| \le b_t \tag{3.1}$$
for all t and all $\pi \in \Pi_{t-1}$. Then:
1. By the triangle inequality, in each time step, $\eta_t(\pi) \le \eta_t(\pi_{\max}) + 2b_t$ for all $\pi \in \Pi_{t-1}$, yielding the first part of the lemma.
2. Also by the triangle inequality, if $\eta_D(\pi) < \eta_D(\pi_{\max}) - 4b_t$ for $\pi \in \Pi_{t-1}$, then $\eta_t(\pi) < \eta_t(\pi_{\max}) - 2b_t$. Hence the policy π is eliminated in Step 5, yielding the second part of the lemma.
It remains to show Eq. (3.1). We fix the policy π ∈ Π and time t, and show that the deviation bound is violated with probability at most $2\delta_t$. Our argument rests on Freedman's inequality (see Theorem 13 in Appendix A). Let
$$y_t = \frac{r_t\, I(\pi(x_t) = a_t)}{W'_t(a_t)}\,,$$
i.e., $\eta_t(\pi) = \bigl(\sum_{t'=1}^t y_{t'}\bigr)/t$. Let $\mathbb{E}_t$ denote the conditional expectation $\mathbb{E}[\,\cdot \mid h_{t-1}]$. To use Freedman's inequality, we need to bound the range of $y_t$ and its conditional second moment $\mathbb{E}_t[y_t^2]$.
Since $r_t \in [0,1]$ and $W'_t(a_t) \ge \mu_t$, we have the bound $0 \le y_t \le 1/\mu_t \doteq R_t$. Next,
$$\mathbb{E}_t[y_t^2] = \mathop{\mathbb{E}}_{(x_t,\vec r_t)\sim D}\, \mathop{\mathbb{E}}_{a_t\sim W'_t}\bigl[y_t^2\bigr] = \mathop{\mathbb{E}}_{(x_t,\vec r_t)\sim D}\, \mathop{\mathbb{E}}_{a_t\sim W'_t}\left[\frac{r_t^2\, I(\pi(x_t)=a_t)}{W'_t(a_t)^2}\right] \le \mathop{\mathbb{E}}_{(x_t,\vec r_t)\sim D}\left[\frac{W'_t(\pi(x_t))}{W'_t(\pi(x_t))^2}\right] \tag{3.2}$$
$$= \mathop{\mathbb{E}}_{x_t\sim D}\left[\frac{1}{W'_t(\pi(x_t))}\right] \le 2K\,, \tag{3.3}$$
where Eq. (3.2) follows by boundedness of $r_t$ and Eq. (3.3) follows from the constraints in Step 1. Hence,
$$\sum_{t'=1\ldots t} \mathbb{E}_{t'}[y_{t'}^2] \le 2Kt \doteq V_t\,.$$
Since (ln t)/t is decreasing for t ≥ 3, we obtain that $\mu_t$ is non-increasing (by separately analyzing t = 1, t = 2, t ≥ 3). Let $t_0$ be the first t such that $\mu_t < 1/2K$. Note that $b_t \ge 4K\mu_t$, so for $t < t_0$, we have $b_t \ge 2$ and $\Pi_t = \Pi$. Hence, the deviation bound holds for $t < t_0$.
Let $t \ge t_0$. For $t' \le t$, by the monotonicity of $\mu_t$,
$$R_{t'} = 1/\mu_{t'} \le 1/\mu_t = \sqrt{\frac{2Kt}{\ln(1/\delta_t)}} = \sqrt{\frac{V_t}{\ln(1/\delta_t)}}\,.$$
Hence, the assumptions of Theorem 13 are satisfied, and
$$\Pr\bigl[|\eta_t(\pi) - \eta_D(\pi)| \ge b_t\bigr] \le 2\delta_t\,.$$
The union bound over π and t yields Eq. (3.1).

This immediately implies that the cumulative regret is bounded by
$$\sum_{t=1\ldots T} \bigl(\eta_D(\pi_{\max}) - r_t\bigr) \le 8\sqrt{2K \ln\frac{4NT^2}{\delta}}\, \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le 16\sqrt{2TK \ln\frac{4T^2 N}{\delta}} \tag{3.4}$$
and gives us the following theorem.

Theorem 4. For all distributions D over $(x, \vec r)$ with K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of PolicyElimination (Algorithm 1) over T rounds is at most
$$16\sqrt{2TK \ln \frac{4T^2 N}{\delta}}\,.$$
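To show the control flow of Algorithm 1 in code, here is a minimal Python sketch for a small, explicitly enumerated policy class. The choice of $P_t$ below is a naive uniform-over-survivors heuristic, not the variance-constrained distribution the algorithm actually requires (that step is the subject of the minimax argument and of Section 5); the smoothing, IPS updates, and elimination with threshold $2b_t$ follow the box above.

```python
# Structural sketch of PolicyElimination (Algorithm 1) for a tiny enumerated policy class.
# The choice of P_t is a naive uniform-over-survivors heuristic, NOT the
# variance-constrained distribution required by Step 1.
import math, random

def policy_elimination(policies, K, T, delta, draw_context, get_reward):
    """draw_context() samples x ~ D_X; get_reward(x, a) returns the realized reward r_t(a)."""
    survivors = list(policies)
    history = []                                    # tuples (x, a, r, p)
    for t in range(1, T + 1):
        delta_t = delta / (4 * len(policies) * t * t)
        b_t = 2 * math.sqrt(2 * K * math.log(1 / delta_t) / t)
        mu_t = min(1.0 / (2 * K), math.sqrt(math.log(1 / delta_t) / (2 * K * t)))
        x = draw_context()
        # Step 1 (heuristic stand-in): uniform distribution over surviving policies.
        W = [0.0] * K
        for pi in survivors:
            W[pi(x)] += 1.0 / len(survivors)
        # Step 2: smooth so every action has probability at least mu_t.
        W_prime = [(1 - K * mu_t) * w + mu_t for w in W]
        # Steps 3-4: sample an action and observe its reward.
        a = random.choices(range(K), weights=W_prime)[0]
        r = get_reward(x, a)
        history.append((x, a, r, W_prime[a]))
        # Step 5: eliminate policies whose IPS estimate trails the leader by more than 2*b_t.
        def eta(pi):
            return sum(ri * (pi(xi) == ai) / prob
                       for (xi, ai, ri, prob) in history) / t
        best = max(eta(pi) for pi in survivors)
        survivors = [pi for pi in survivors if eta(pi) >= best - 2 * b_t]
    return survivors
```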
4 THE RANDOMIZED UCB ALGORITHM

Algorithm 2 RandomizedUCB(Π, δ, K)
Let $h_0 \doteq \emptyset$ be the initial history.
Define the following quantities:
$$C_t \doteq 2 \log\frac{Nt}{\delta} \qquad \text{and} \qquad \mu_t \doteq \min\left\{\frac{1}{2K},\, \sqrt{\frac{C_t}{2Kt}}\right\}.$$
For each timestep t = 1 . . . T, observe $x_t$ and do:
1. Let $P_t$ be a distribution over Π that approximately solves the optimization problem
$$\min_P\ \sum_{\pi\in\Pi} P(\pi)\, \Delta_{t-1}(\pi)$$
$$\text{s.t. for all distributions } Q \text{ over } \Pi:\quad \mathop{\mathbb{E}}_{\pi\sim Q}\left[\frac{1}{t-1}\sum_{i=1}^{t-1} \frac{1}{(1 - K\mu_t)\, W_P(x_i, \pi(x_i)) + \mu_t}\right] \le \max\left\{4K,\ \frac{(t-1)\,\Delta_{t-1}(W_Q)^2}{180\, C_{t-1}}\right\} \tag{4.1}$$
so that the objective value at $P_t$ is within $\varepsilon_{\mathrm{opt},t} = O\bigl(\sqrt{K C_t / t}\bigr)$ of the optimal value, and so that each constraint is satisfied with slack ≤ K.
2. Let $W'_t$ be the distribution over A given by $W'_t(a) \doteq (1 - K\mu_t)\, W_{P_t}(x_t, a) + \mu_t$ for all a ∈ A.
3. Choose $a_t \sim W'_t$.
4. Observe reward $r_t$.
5. Let $h_t \doteq h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$.
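To connect the box above to code, here is a minimal sketch of a single RandomizedUCB round in which the optimization of Step 1 is delegated to a black-box routine solve_OP; solve_OP is a placeholder of ours standing in for the oracle-based procedure of Section 5, and everything else mirrors the box.

```python
# One round of RandomizedUCB (Algorithm 2), with the optimization of Step 1 delegated
# to a placeholder solve_OP; the remaining steps follow the box above.
import math, random

def rucb_round(t, x_t, history, policies, K, delta, solve_OP, observe_reward):
    C_t = 2 * math.log(len(policies) * t / delta)
    mu_t = min(1.0 / (2 * K), math.sqrt(C_t / (2 * K * t)))
    # Step 1: distribution over policies approximately solving (4.1).
    P_t = solve_OP(history, policies, mu_t, C_t)          # dict: policy -> probability
    # Step 2: project onto actions and smooth.
    W = [0.0] * K
    for pi, prob in P_t.items():
        W[pi(x_t)] += prob
    W_prime = [(1 - K * mu_t) * w + mu_t for w in W]
    # Steps 3-5: act, observe, and log the propensity for later IPS estimates.
    a_t = random.choices(range(K), weights=W_prime)[0]
    r_t = observe_reward(a_t)
    history.append((x_t, a_t, r_t, W_prime[a_t]))
    return a_t, r_t
```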
PolicyElimination is the simplest exhibition of the
minimax argument, but it has some drawbacks:
1. The algorithm keeps explicit track of the space
of good policies (like a version space), which is
difficult to implement efficiently in general.
2. If the optimal policy is mistakenly eliminated by
chance, the algorithm can never recover.
3. The algorithm requires perfect knowledge of the
distribution DX over contexts.
These difficulties are addressed by RandomizedUCB (or RUCB for short), an algorithm which we present and analyze in this section. Our approach is reminiscent of the UCB algorithm (Auer et al., 2002a), developed for the context-free setting, which keeps an upper confidence bound on the expected reward for each action. However, instead of choosing the highest upper confidence bound, we randomize over choices according to the value of their empirical performance. The algorithm has the following properties:
1. The optimization step required by the algorithm
always considers the full set of policies (i.e.,
explicit tracking of the set of good policies is
avoided), and thus it can be efficiently implemented using an argmax oracle. We discuss this
further in Section 5.
2. Suboptimal policies are implicitly used with decreasing frequency by using a non-uniform variance constraint that depends on a policy’s estimated regret. A consequence of this is a bound on
the value of the optimization, stated in Lemma 7
below.
3. Instead of DX , the algorithm uses the history of
previously seen contexts. The effect of this approximation is quantified in Theorem 6 below.
The regret of RandomizedUCB is the following:

Theorem 5. For all distributions D over $(x, \vec r)$ with K actions, for all sets of N policies Π, with probability at least 1 − δ, the regret of RandomizedUCB (Algorithm 2) over T rounds is at most
$$O\!\left(\sqrt{TK \log (TN/\delta)} + K \log(NK/\delta)\right).$$
The proof is given in Appendix D.4. Here, we present an overview of the analysis.

The bulk of the analysis consists of analyzing the variance of the importance-weighted reward estimates $\eta_t(\pi)$, and showing how they relate to their actual expected rewards $\eta_D(\pi)$. The details are deferred to Appendix D.

4.1 EMPIRICAL VARIANCE ESTIMATES

A key technical prerequisite for the regret analysis is the accuracy of the empirical variance estimates. For a distribution P over policies Π and a particular policy π ∈ Π, define
$$V_{P,\pi,t} = \mathop{\mathbb{E}}_{x\sim D_X}\left[\frac{1}{(1 - K\mu_t)\, W_P(x, \pi(x)) + \mu_t}\right]$$
$$\widehat V_{P,\pi,t} = \frac{1}{t-1}\sum_{i=1}^{t-1} \frac{1}{(1 - K\mu_t)\, W_P(x_i, \pi(x_i)) + \mu_t}\,.$$
The first quantity $V_{P,\pi,t}$ is (a bound on) the variance incurred by an importance-weighted estimate of reward in round t using the action distribution induced by P, and the second quantity $\widehat V_{P,\pi,t}$ is an empirical estimate of $V_{P,\pi,t}$ using the finite sample $\{x_1, \ldots, x_{t-1}\} \subseteq X$ drawn from $D_X$. We show that for all distributions P and all π ∈ Π, $\widehat V_{P,\pi,t}$ is close to $V_{P,\pi,t}$ with high probability.

Theorem 6. For any ε ∈ (0, 1), with probability at least 1 − δ,
$$V_{P,\pi,t} \le (1 + \varepsilon) \cdot \widehat V_{P,\pi,t} + \frac{7500}{\varepsilon^3} \cdot K$$
for all distributions P over Π, all π ∈ Π, and all t ≥ 16K log(8KN/δ).

The proof appears in Appendix C.

4.2 REGRET ANALYSIS

Central to the analysis is the following lemma that bounds the value of the optimization in each round. It is a direct corollary of Lemma 24 in Appendix D.4.

Lemma 7. If $\mathrm{OPT}_t$ is the value of the optimization problem (4.1) in round t, then
$$\mathrm{OPT}_t \le O\!\left(\sqrt{\frac{K C_{t-1}}{t-1}}\right) = O\!\left(\sqrt{\frac{K \log(Nt/\delta)}{t}}\right).$$

This lemma implies that the algorithm is always able to select a distribution over the policies that focuses mostly on the policies with low estimated regret. Moreover, the variance constraints ensure that good policies never appear too bad, and that only bad policies are allowed to incur high variance in their reward estimates. Hence, minimizing the objective in (4.1) is an effective surrogate for minimizing regret.

5 USING AN ARGMAX ORACLE
In this section, we show how to solve the optimization problem (4.1) using the argmax oracle (AMO) for our set of policies. Namely, we describe an algorithm running in polynomial time independent¹ of the number of policies, which makes queries to AMO to compute a distribution over policies suitable for the optimization step of Algorithm 2.

¹ Or rather dependent only on log N, the representation size of a policy.
This algorithm relies on the ellipsoid method. The ellipsoid method is a general technique for solving convex programs equipped with a separation oracle. A
separation oracle is defined as follows:
Definition 2. Let S be a convex set in Rn . A separation oracle for S is an algorithm that, given a point
x ∈ Rn , either declares correctly that x ∈ S, or produces a hyperplane H such that x and S are on opposite sides of H.
We do not describe the ellipsoid algorithm here (since
it is standard), but only spell out its key properties in
the following lemma. For a point x ∈ Rn and r ≥ 0,
we use the notation B(x, r) to denote the `2 ball of
radius r centered at x.
Lemma 8. Suppose we are required to decide whether a convex set $S \subseteq \mathbb{R}^n$ is empty or not. We are given a separation oracle for S and two numbers R and r such that $S \subseteq B(0, R)$ and, if S is non-empty, then there is a point $x^\star$ such that $S \supseteq B(x^\star, r)$. The ellipsoid algorithm decides correctly whether S is empty or not, by executing at most $O(n^2 \log(R/r))$ iterations, each involving one call to the separation oracle and additional $O(n^2)$ processing time.
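For completeness, here is a compact sketch of the textbook central-cut ellipsoid iteration that Lemma 8 refers to; it uses numpy, the separation oracle is an arbitrary callable, and termination is by a plain iteration budget rather than the careful radius/volume bookkeeping the lemma relies on.

```python
# Central-cut ellipsoid sketch: searches for a feasible point of a convex set S,
# given a separation oracle.  Termination here is by iteration count only; Lemma 8's
# precise guarantee needs the R/r volume accounting described in the text.
import numpy as np

def ellipsoid(separation_oracle, n, R, max_iters):
    """separation_oracle(x) returns None if x is in S, else a vector g with S contained in
    the half-space {z : g . (z - x) <= 0}.  Assumes n >= 2."""
    c = np.zeros(n)                 # center of the current ellipsoid
    P = (R ** 2) * np.eye(n)        # ellipsoid {z: (z-c)^T P^{-1} (z-c) <= 1} contains B(0, R)
    for _ in range(max_iters):
        g = separation_oracle(c)
        if g is None:
            return c                # feasible point found
        g = np.asarray(g, dtype=float)
        gtilde = g / np.sqrt(g @ P @ g)
        c = c - (P @ gtilde) / (n + 1)
        P = (n ** 2 / (n ** 2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(P @ gtilde, gtilde @ P))
    return None                     # no feasible point found within the iteration budget
```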
We now write a convex program whose solution is the
required distribution, and show how to solve it using
the ellipsoid method by giving a separation oracle for
its feasible set using AMO.
Fix a time period t. Let $X_{t-1}$ be the set of all contexts seen so far, i.e., $X_{t-1} = \{x_1, x_2, \ldots, x_{t-1}\}$. We embed all policies π ∈ Π in $\mathbb{R}^{(t-1)K}$, with coordinates identified with $(x, a) \in X_{t-1} \times A$. With abuse of notation, a policy π is represented by the vector π with coordinate π(x, a) = 1 if π(x) = a and 0 otherwise. Let C be the convex hull of all policy vectors π. Recall that a distribution P over policies corresponds to a point inside C, i.e., $W_P(x,a) = \sum_{\pi:\pi(x)=a} P(\pi)$, and that $W'(x,a) = (1 - \mu_t K)W(x,a) + \mu_t$, where $\mu_t$ is as defined in Algorithm 2. Also define $\beta_t = \frac{t-1}{180\, C_{t-1}}$. In the following, we use the notation $x \sim h_{t-1}$ to denote a context drawn uniformly at random from $X_{t-1}$.
Consider the following convex program:
$$\min\ s \quad \text{s.t.}$$
$$\Delta_{t-1}(W) \le s \tag{5.1}$$
$$W \in C \tag{5.2}$$
$$\forall Z \in C:\quad \mathop{\mathbb{E}}_{x\sim h_{t-1}}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \le \max\{4K,\ \beta_t\, \Delta_{t-1}(Z)^2\} \tag{5.3}$$
We claim that this program is equivalent to the RUCB optimization problem (4.1), up to finding an explicit distribution over policies which corresponds to the optimal solution. This can be seen as follows. Since we require W ∈ C, it can be interpreted as being equal to $W_P$ for some distribution over policies P. The constraints (5.3) are equivalent to (4.1) by the substitution $Z = W_Q$.
The above convex program can be solved by performing a binary search over s and testing feasibility of
the constraints. For a fixed value of s, the feasibility
problem defined by (5.1)–(5.3) is denoted by A.
We now give a sketch of how we construct a separation oracle for the feasible region of A. The details
of the algorithm are a bit complicated due to the fact
that we need to ensure that the feasible region, when
non-empty, has a non-negligible volume (recall the requirements of Lemma 8). This necessitates having a
small error in satisfying the constraints of the program.
We leave the details to Appendix E. Modulo these details, the construction of the separation oracle essentially implies that we can solve A.
Before giving the construction of the separation oracle, we first show that AMO allows us to do linear
optimization over C efficiently:
Lemma 9. Given a vector $w \in \mathbb{R}^{(t-1)K}$, we can compute $\arg\max_{Z\in C} w \cdot Z$ using one invocation of AMO.

Proof. The sequence for AMO consists of $x_{t'} \in X_{t-1}$ and $\vec r_{t'}(a) = w(x_{t'}, a)$. The lemma now follows since $w \cdot \pi = \sum_{x\in X_{t-1}} w(x, \pi(x))$.
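Lemma 9's reduction is mechanical: the weight vector w is reshaped into per-context reward vectors and handed to the AMO. A sketch under our naming conventions, reusing the illustrative brute-force argmax_oracle from Section 2 as the stand-in oracle:

```python
# Sketch of Lemma 9: linear optimization over C via one AMO call.
# The weight vector w, indexed by (x, a) for x in X_{t-1}, becomes a sequence of
# (context, reward-vector) pairs; the AMO's argmax policy maximizes w . pi over vertices,
# which by linearity equals the maximum over all of C.
def linear_opt_over_C(w, contexts, K, argmax_oracle, policies):
    """w: dict mapping (x, a) -> weight; contexts: the distinct contexts X_{t-1}."""
    data = [(x, [w.get((x, a), 0.0) for a in range(K)]) for x in contexts]
    return argmax_oracle(policies, data)   # maximizes sum_x w(x, pi(x)) = w . pi
```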
We need another simple technical lemma which explains how to get a separating hyperplane for violations of convex constraints:
Lemma 10. For x ∈ Rn , let f (x) be a convex function
of x, and consider the convex set K defined by K =
{x : f (x) ≤ 0}. Suppose we have a point y such that
f (y) > 0. Let ∇f (y) be a subgradient of f at y. Then
the hyperplane f (y) + ∇f (y) · (x − y) = 0 separates y
from K.
Proof. Let g(x) = f (y) + ∇f (y) · (x − y). By the
convexity of f , we have f (x) ≥ g(x) for all x. Thus,
for any x ∈ K, we have g(x) ≤ f (x) ≤ 0. Since
g(y) = f (y) > 0, we conclude that g(x) = 0 separates
y from K.
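Lemma 10 is essentially a one-liner in code: given a subgradient of f at an infeasible point y, the affine function $f(y) + \nabla f(y)\cdot(x - y)$ is the separating hyperplane. A minimal sketch, with function names of our choosing:

```python
# Sketch of Lemma 10: a separating hyperplane for {x : f(x) <= 0} from a violated convex
# constraint.  Returns (normal, offset) describing {x : normal . x + offset = 0}, with
# normal . y + offset = f(y) > 0 at the violating point y and <= 0 on the feasible set.
import numpy as np

def separating_hyperplane(f_y, grad_f_y, y):
    """f_y = f(y) > 0; grad_f_y = a subgradient of f at y; y = the infeasible point."""
    normal = np.asarray(grad_f_y, dtype=float)
    offset = float(f_y) - normal @ np.asarray(y, dtype=float)
    return normal, offset
```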
Now given a candidate point W , a separation oracle
can be constructed as follows. We check whether W
satisfies the constraints of A. If any constraint is violated, then we find a hyperplane separating W from
all points satisfying the constraint.
1. First, for constraint (5.1), note that ηt−1 (W ) is
linear in W , and so we can compute maxπ ηt−1 (π)
via AMO as in Lemma 9. We can then compute
ηt−1 (W ) and check if the constraint is satisfied. If
not, then the constraint, being linear, automatically yields a separating hyperplane.
2. Next, we consider constraint (5.2). To check if
W ∈ C, we use the perceptron algorithm. We
shift the origin to W , and run the perceptron algorithm with all points π ∈ Π being positive examples. The perceptron algorithm aims to find a
hyperplane putting all policies π ∈ Π on one side.
In each iteration of the perceptron algorithm, we
have a candidate hyperplane (specified by its normal vector), and then if there is a policy π that is
on the wrong side of the hyperplane, we can find
it by running a linear optimization over C in the
negative normal vector direction as in Lemma 9.
If W ∉ C, then in a bounded number of iterations
(depending on the distance of W from C, and the
maximum magnitude kπk2 ) we obtain a separating hyperplane. In passing we also note that if
W ∈ C, the same technique allows us to explicitly compute an approximate convex combination
of policies in Π that yields W . This is done by
running the perceptron algorithm as before and
stopping after the bound on the number of iterations has been reached. Then we collect all the
policies we have found in the run of the perceptron algorithm, and we are guaranteed that W is
close in distance to their convex hull. We can then
find the closest point in the convex hull of these
policies by solving a simple quadratic program.
3. Finally, we consider constraint (5.3). We rewrite $\eta_{t-1}(W)$ as $\eta_{t-1}(W) = w \cdot W$, where $w(x_{t'}, a) = r_{t'}\, I(a = a_{t'})/W'_{t'}(a_{t'})$. Thus, $\Delta_{t-1}(Z) = v - w \cdot Z$, where $v = \max_{\pi'} \eta_{t-1}(\pi') = \max_{\pi'} w \cdot \pi'$, which can be computed by using AMO once.
Next, using the candidate point W, compute the vector u defined as $u(x,a) = \frac{n_x/t}{W'(x,a)}$, where $n_x$ is the number of times x appears in $h_{t-1}$, so that $\mathbb{E}_{x\sim h_{t-1}}\bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\bigr] = u \cdot Z$. Now, the problem reduces to finding a policy Z ∈ C which violates the constraint
$$u \cdot Z \le \max\{4K,\ \beta_t (w \cdot Z - v)^2\}\,.$$
Define $f(Z) = \max\{4K,\ \beta_t (w \cdot Z - v)^2\} - u \cdot Z$. Note that f is a convex function of Z. Finding a point Z that violates the above constraint is equivalent to solving the following (convex) program:
$$f(Z) \le 0 \tag{5.4}$$
$$Z \in C \tag{5.5}$$
To do this, we again apply the ellipsoid method. For this, we need a separation oracle for the program. A separation oracle for the constraint (5.5) can be constructed as in Step 2 above. For the constraint (5.4), if the candidate solution Z has f(Z) > 0, then we can construct a separating hyperplane as in Lemma 10.
Suppose that after solving the program, we get a point Z ∈ C such that f(Z) ≤ 0, i.e., W violates the constraint (5.3) for Z. Then, since constraint (5.3) is convex in W, we can construct a separating hyperplane as in Lemma 10. This completes the description of the separation oracle.

Working out the details carefully yields the following theorem, proved in Appendix E:

Theorem 11. There is an iterative algorithm with $O(t^5 K^4 \log^2(\frac{tK}{\delta}))$ iterations, each involving one call to AMO and $O(t^2 K^2)$ processing time, that either declares correctly that A is infeasible or outputs a distribution P over policies in Π such that $W_P$ satisfies
$$\forall Z \in C:\quad \mathop{\mathbb{E}}_{x\sim h_{t-1}}\left[\sum_a \frac{Z(x,a)}{W'_P(x,a)}\right] \le \max\{4K,\ \beta_t\, \Delta_{t-1}(Z)^2\} + 5\epsilon$$
$$\Delta_{t-1}(W_P) \le s + 2\gamma\,,$$
where $\epsilon = \frac{8\delta}{\mu_t^2}$ and $\gamma = \frac{\delta}{\mu_t}$.

6 DELAYED FEEDBACK

In a delayed feedback setting, we observe rewards with a τ-step delay according to:
1. The world presents features $x_t$.
2. The learning algorithm chooses an action $a_t \in \{1, \ldots, K\}$.
3. The world presents a reward $r_{t-\tau}$ for the action $a_{t-\tau}$ given the features $x_{t-\tau}$.
We deal with delay by suitably modifying Algorithm 1 to incorporate the delay τ, giving Algorithm 3. Now we can prove the following theorem, which shows the delay has an additive effect on regret.

Theorem 12. For all distributions D over $(x, \vec r)$ with K actions, for all sets of N policies Π, and all delay intervals τ, with probability at least 1 − δ, the regret of DelayedPE (Algorithm 3) is at most
$$16\sqrt{2K \ln \frac{4T^2 N}{\delta}}\left(\tau + \sqrt{T}\right).$$
Algorithm 3 DelayedPE(Π, δ, K, $D_X$, τ)
Let $\Pi_0 = \Pi$ and history $h_0 = \emptyset$
Define: $\delta_t \doteq \delta / (4Nt^2)$ and $b_t \doteq 2\sqrt{\frac{2K \ln(1/\delta_t)}{t}}$
Define: $\mu_t \doteq \min\left\{\frac{1}{2K},\, \sqrt{\frac{\ln(1/\delta_t)}{2Kt}}\right\}$
For each timestep t = 1 . . . T, observe $x_t$ and do:
1. Let $t_0 = \max(t - \tau, 1)$.
2. Choose distribution $P_t$ over $\Pi_{t-1}$ s.t. ∀ π ∈ $\Pi_{t-1}$:
$$\mathop{\mathbb{E}}_{x\sim D_X}\left[\frac{1}{(1 - K\mu_{t_0})\, W_{P_t}(x, \pi(x)) + \mu_{t_0}}\right] \le 2K$$
3. ∀ a ∈ A, let $W'_t(a) = (1 - K\mu_{t_0})\, W_{P_t}(x_t, a) + \mu_{t_0}$
4. Choose $a_t \sim W'_t$
5. Observe reward $r_t$.
6. Let $\Pi_t = \bigl\{ \pi \in \Pi_{t-1} :\ \eta_{h}(\pi) \ge \max_{\pi'\in\Pi_{t-1}} \eta_{h}(\pi') - 2b_{t_0} \bigr\}$
7. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$
Proof. Essentially as Theorem 4. The variance bound is unchanged because it depends only on the context distribution. Thus, it suffices to replace $\sum_{t=1}^{T} \frac{1}{\sqrt t}$ with $\tau + \sum_{t=\tau+1}^{T+\tau} \frac{1}{\sqrt{t-\tau}} = \tau + \sum_{t=1}^{T} \frac{1}{\sqrt t}$ in Eq. (3.4).
Acknowledgements
We thank Alina Beygelzimer, who helped in several
formative discussions.
References
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002a.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online
gradient descent. In NIPS, 2007.
Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error correcting tournaments. In ALT, 2009.
Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin,
and Robert E. Schapire. Contextual bandit algorithms
with supervised learning guarantees. In AISTATS, 2011.
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action
elimination and stopping conditions for the multi-armed
bandit and reinforcement learning problems. Journal of
Machine Learning Research, 7:1079–1105, 2006.
David A. Freedman. On tail probabilities for martingales.
Annals of Probability, 3(1):100–118, 1975.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):
119–139, 1997.
Sham M. Kakade and Adam Kalai. From batch to transductive online learning. In NIPS, 2005.
Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst.
Sci., 71(3):291–307, 2005.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied
Mathematics, 6:4–22, 1985.
J. Langford, A. Smola, and M. Zinkevich. Slow learners
are fast. In NIPS, 2009.
John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS,
2007.
Maurice Sion. On general minimax theorems. Pacific J.
Math., 8(1):171–176, 1958.
Niranjan Srinivas, Andreas Krause, Sham Kakade, and
Matthias Seeger. Gaussian process optimization in the
bandit setting: No regret and experimental design. In
ICML, 2010.
A Concentration Inequality
The following is an immediate corollary of Theorem
1 of (Beygelzimer et al., 2011). It can be viewed as
a version of Freedman’s Inequality (Freedman, 1975).
Let $y_1, \ldots, y_T$ be a sequence of real-valued random variables. Let $\mathbb{E}_t$ denote the conditional expectation $\mathbb{E}[\,\cdot \mid y_1, \ldots, y_{t-1}]$ and $\mathbb{V}_t$ the conditional variance.

Theorem 13 (Freedman-style Inequality). Let $V, R \in \mathbb{R}$ be such that $\sum_{t=1}^{T} \mathbb{V}_t[y_t] \le V$ and, for all t, $y_t - \mathbb{E}_t[y_t] \le R$. Then for any δ > 0 such that $R \le \sqrt{V / \ln(2/\delta)}$, with probability at least 1 − δ,
$$\sum_{t=1}^{T} y_t - \sum_{t=1}^{T} \mathbb{E}_t[y_t] \le 2\sqrt{V \ln(2/\delta)}\,.$$

B Minimax Theorem
The following is a continuous version of Sion’s Minimax Theorem (Sion, 1958, Theorem 3.4).
Theorem 14. Let $\mathcal{W}$ and $\mathcal{Z}$ be compact and convex sets, and $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ a function which for all $Z \in \mathcal{Z}$ is convex and continuous in W and for all $W \in \mathcal{W}$ is concave and continuous in Z. Then
$$\min_{W\in\mathcal{W}} \max_{Z\in\mathcal{Z}} f(W,Z) = \max_{Z\in\mathcal{Z}} \min_{W\in\mathcal{W}} f(W,Z)\,.$$
C Empirical Variance Bounds
In this section we prove Theorem 6. We first show uniform convergence for a certain class of policy distributions (Lemma 15), and argue that each distribution P is close to some distribution $\tilde P$ from this class, in the sense that $V_{P,\pi,t}$ is close to $V_{\tilde P,\pi,t}$ and $\widehat V_{P,\pi,t}$ is close to $\widehat V_{\tilde P,\pi,t}$ (Lemma 16). Together, they imply the main uniform convergence result in Theorem 6.
For each positive integer m, let Sparse[m] be the set of distributions $\tilde P$ over Π that can be written as
$$\tilde P(\pi) = \frac{1}{m}\sum_{i=1}^{m} I(\pi = \pi_i)$$
(i.e., the average of m delta functions) for some $\pi_1, \ldots, \pi_m \in \Pi$. In our analysis, we approximate an arbitrary distribution P over Π by a distribution $\tilde P \in$ Sparse[m] chosen randomly by independently drawing $\pi_1, \ldots, \pi_m \sim P$; we denote this process by $\tilde P \sim P^m$.
Lemma 15. Fix positive integers (m1 , m2 , . . . ). With
probability at least 1 − δ over the random samples
(x1 , x2 , . . . ) from DX ,
VPe,π,t ≤ (1 + λ) · VbPe,π,t
(mt + 1) log N + log
1
+ 5+
·
2λ
µt · (t − 1)
2t2
δ
Lemma 16. Fix any γ ∈ [0, 1], and any x ∈ X. For
any distribution P over Π and any π ∈ Π, if
.
m=
6
,
γ 2 µt
then
1
E (1
−
Kµ
)W
m
e
t
e
P ∼P
P (x, π(x)) + µt
1
−
(1 − Kµt )WP (x, π(x)) + µt γ
.
≤
(1 − Kµt )WP (x, π(x)) + µt
This implies that for all distributions P over Π and
any π ∈ Π, there exists Pe ∈ Sparse[m] such that for
any λ > 0,
for all λ > 0, all t ≥ 1, all π ∈ Π, and all distributions
Pe ∈ Sparse[mt ].
Proof. Let
.
ZPe,π,t (x) =
VP,π,t − VPe,π,t + (1 + λ) VbPe,π,t − VbP,π,t
≤ γ(VP,π,t + (1 + λ)VbP,π,t ).
1
(1 − Kµt )WPe (x, π(x)) + µt
so VPe,π,t = Ex∼DX [ZPe,π,t (x)] and VbPe,π,t = (t −
Pt−1
1)−1 i=1 ZPe,π,t (xi ). Also let
.
Proof.
randomly draw Pe ∼ P m , with Pe(π 0 ) =
PWe
m
−1
0
m
i=1 I(π = πi ), and then define
. log(|Sparse[mt ]|N 2t2 /δ)
εt =
µt · (t − 1)
((mt + 1) log N + log
=
µt · (t − 1)
2t2
δ )
.
We apply Bernstein’s inequality and union bounds
over Pe ∈ Sparse[mt ], π ∈ Π, and t ≥ 1 so that with
probability at least 1 − δ,
$$V_{\tilde P,\pi,t} \le \widehat V_{\tilde P,\pi,t} + \sqrt{2 V_{\tilde P,\pi,t}\,\varepsilon_t} + (2/3)\varepsilon_t$$
for all t ≥ 1, all π ∈ Π, and all distributions $\tilde P \in$ Sparse[$m_t$]. The conclusion follows by solving the quadratic inequality for $V_{\tilde P,\pi,t}$ to get
$$V_{\tilde P,\pi,t} \le \widehat V_{\tilde P,\pi,t} + \sqrt{2 \widehat V_{\tilde P,\pi,t}\,\varepsilon_t} + 5\varepsilon_t$$
and then applying the AM/GM inequality.
. X
z=
P (π 0 ) · I(π 0 (x) = π(x)) and
π 0 ∈Π
. X e 0
ẑ =
P (π ) · I(π 0 (x) = π(x)).
π 0 ∈Π
0
We have
Pm z = Eπ0 ∼P [I(π (x) = π(x)] and ẑ =
−1
m
i=1 I(πi (x) = π(x)). In other words, ẑ is the
average of m independent Bernoulli random variables,
each with mean z. Thus, EPe∼P m [(ẑ−z)2 ] = z(1−z)/m
and PrPe∼P m [ẑ ≤ z/2] ≤ exp(−mz/8) by a Chernoff
bound. We have
1
1
−
E (1
−
Kµ
)ẑ
+
µ
(1
−
Kµ
)z
+
µ
e∼P m
t
t
t
t
P
≤
E
(1 − Kµt )|ẑ − z|
[(1 − Kµt )ẑ + µt ][(1 − Kµt )z + µt ]
E
(1 − Kµt )|ẑ − z|I(ẑ ≥ 0.5z)
0.5[(1 − Kµt )z + µt ]2
e∼P m
P
≤
e∼P m
P
(1 − Kµt )|ẑ − z|I(ẑ ≤ 0.5z)
µt [(1 − Kµt )z + µt ]
e∼P m
P
p
(1 − Kµt ) EPe∼P m |ẑ − z|2
≤
0.5[(1 − Kµt )z + µt ]2
(1 − Kµt )z PrPe∼P m (ẑ ≤ 0.5z)
+
µt [(1 − Kµt )z + µt ]
p
(1 − Kµt ) z/m
p
≤
0.5[2 (1 − Kµt )zµt ][(1 − Kµt )z + µt ]
(1 − Kµt )z exp(−mz/8)
+
µt [(1 − Kµt )z + µt ]
p
√
γ 1 − Kµt z/m
≤p
z(6/m)[(1 − Kµt )z + µt ]
+
E
(1 − Kµt )γ 2 mz exp(−mz/8)
,
+
6[(1 − Kµt )z + µt ]
where the third inequality follows from Jensen’s inequality, and the fourth inequality uses the AM/GM
inequality in the denominator of the first term and
the previous observations in the numerators. The final expression simplifies to the first desired displayed
inequality by observing that mz exp(−mz/8) ≤ 3 for
all mz ≥ 0 (the maximum is achieved at mz = 8). The
second displayed inequality follows from the following
facts:
E
e∼P m
P
|VP,π,t − VPe,π,t | ≤ γVP,π,t ,
E (1 + λ)|VbP,π,t − VbPe,π,t | ≤ γ(1 + λ)VbP,π,t .
e∼P m
P
Both inequalities follow from the first displayed bound
of the lemma, by taking expectation with respect to
the true (and empirical) distributions over x. The desired bound follows by adding the above two inequalities, which implies that the bound holds in expectation, and hence the existence of Pe for which the bound
holds.
Now, we can prove Theorem 6.
Proof of Theorem 6. Let
.
mt =
6 1
·
λ2 µt
(for some λ ∈ (0, 1/5) to be determined) and condition
on the ≥ 1 − δ probability event from Lemma 15 that
VPe,π,t − (1 + λ)VbPe,π,t
(mt + 1) log(N ) + log(2t2 /δ)
1
≤K · 5+
·
2λ
Kµt · (t − 1)
1
(mt + 1) log(N ) + log(2t2 /δ)
·
≤K ·5 1+
λ
Kµt · t
for all t ≥ 2, all Pe ∈ Sparse[mt ], and all π ∈ Π. Using
the definitions of mt and µt , the second term is at most
(40/λ2 ) · (1 + 1/λ) · K for all t ≥ 16K log(8KN/δ): the
key here
p is that for t ≥ 16K log(8KN/δ), we have
µt = log(N t/δ)/(Kt) ≤ 1/(2K) and therefore
mt log(N )
6
≤ 2
Kµt t
λ
and
log(N ) + log(2t2 /δ)
≤ 2.
Kµt t
Now fix t ≥ 16K log(8KN/δ), π ∈ Π, and a distribution P over Π. Let Pe ∈ Sparse[mt ] be the distribution
guaranteed by Lemma 16 with γ = λ satisfying
VP,π,t ≤
VPe,π,t − (1 + λ)VbPe,π,t + (1 + λ)2 VbP,π,t
1−λ
.
Substituting the previous bound for VPe,π,t − (1 +
λ)Vb e
gives
P ,π,t
VP,π,t ≤
1
1−λ
40
2b
(1
+
1/λ)K
+
(1
+
λ)
V
P,π,t .
λ2
This can be bounded as (1 + ) · VbP,π,t + (7500/3 ) · K
by setting λ = /5.
D Analysis of RandomizedUCB

D.1 Preliminaries
First, we define the following constants.
• $\varepsilon \in (0, 1)$ is a fixed constant, and
• $\rho \doteq \frac{7500}{\varepsilon^3}$ is the factor that appears in the bound from Theorem 6.
• $\theta \doteq \frac{\rho + 1}{1 - (1+\varepsilon)/2} = \frac{2}{1-\varepsilon}\left(1 + \frac{7500}{\varepsilon^3}\right) \ge 5$ is a constant central to Lemma 21, which bounds the variance of the optimal policy's estimated rewards.
Recall the algorithm-specific quantities
$$C_t \doteq 2\log\frac{Nt}{\delta} \qquad \text{and} \qquad \mu_t \doteq \min\left\{\frac{1}{2K},\, \sqrt{\frac{C_t}{2Kt}}\right\}.$$
It can be checked that $\mu_t$ is non-increasing. We define the following time indices:
• $t_0$ is the first round t in which $\mu_t = \sqrt{C_t/(2Kt)}$. Note that $8K \le t_0 \le 8K \log(NK/\delta)$.
D.2 Deviation Bound for $\eta_t(\pi)$
For any policy π ∈ Π, define, for 1 ≤ t ≤ t0 ,
.
V̄t (π) = K,
and for t > t0 ,
• t1 := d16K log(8KN/δ)e is the round given by
Theorem 6 such that, with probability at least
1 − δ,
.
V̄t (π) = K +
E
xt ∼DX
1
.
Wt0 (π(xt ))
The V̄t (π) bounds the variances of the terms in ηt (π).
E
xt ∼DX
≤ (1 + )
1
0
Wt (π(xt ))
E
x∼ht−1
Lemma 18. Assume the bound in (D.1) holds for all
π ∈ Π and t ≥ t1 . For all π ∈ Π:
1
+ ρK
WPt ,µt (x, π(x))
(D.1)
for all π ∈ Π and all t ≥ t1 , where WP,µ (x, ·) is
the distribution over A given by
1. If t ≤ t1 , then
K ≤ V̄t (π) ≤ 4K.
2. If t > t1 , then
.
WP,µ (x, a) = (1 − Kµ)WP (x, a) + µ,
V̄t (π)
≤ (1 + )
and the notation Ex∼ht−1 denotes expectation
with respect to the empirical (uniform) distribution over x1 , . . . , xt−1 .
The following lemma shows the effect of allowing slack
in the optimization constraints.
Lemma 17. If P satisfies the constraints of the optimization problem (4.1) with slack K for each distribution Q over Π, i.e.,
E
x∼ht−1
1
(1 − Kµt )WPt (x, π(x)) + µt
+ (ρ + 1)K.
Proof. For the first claim, note that if t < t0 , then
V̄t (π) = K, and if t0 ≤ t < t1 , then
s
r
log(N t/δ)
log(N t0 /δ)
1
≥
≥
;
µt =
Kt
16K 2 log(8KN/δ)
4K
so Wt0 (a) ≥ µt ≥ 1/(4K).
1
π∼Q x∼ht−1 (1 − Kµt )WP (x, π(x)) + µt
(t − 1)∆t−1 (WQ )2
+K
≤ max 4K,
180Ct−1
E
E
for all Q, then P satisfies
E
E
π∼Q x∼ht−1
1
(1 − Kµt )WP (x, π(x)) + µt
(t − 1)∆t−1 (WQ )2
≤ max 5K,
144Ct−1
n
o
2
.
t−1 (π)
Proof. Let b = max 4K, (t−1)∆
. Note that
180Ct−1
≥ K. Hence b + K ≤
bound.
The stated bound on V̄t (π) now follows from its definition.
Let
.
V̄max,t (π) = max{V̄τ (π), τ = 1, 2, . . . , t}
for all Q.
b
4
For the second claim, pick any t > t1 , and note that
by definition of t1 , for any π ∈ Π we have
1
E
0
xt ∼DX Wt (π(xt ))
1
≤ (1 + ) E
+ ρK.
x∼ht−1 (1 − Kµt )WPt (x, π(x)) + µt
5b
4
which gives the stated
Note that the allowance of slack K is somewhat arbitrary; any O(K) slack is tolerable provided that other
constants are adjusted appropriately.
The following lemma gives a deviation bound for ηt (π)
in terms of these quantities.
Lemma 19. Pick any δ ∈ (0, 1). With probability at
least 1 − δ, for all pairs π, π 0 ∈ Π and t ≥ t0 , we have
(ηt (π) − ηt (π 0 )) − (ηD (π) − ηD (π 0 ))
r
(V̄max,t (π) + V̄max,t (π 0 )) · Ct
≤2
. (D.2)
t
Proof. Fix any t ≥ t0 and π, π 0 ∈ Π.
exp(−Ct ). Pick any τ ≤ t. Let
Let δt :=
. rτ (aτ )I(π(xτ ) = aτ )
Zτ (π) =
Wτ0 (aτ )
Pt
so ηt (π) = t−1 τ =1 Zτ (π). It is easy to see that
E
[Zτ (π) − Zτ (π 0 )] = ηD (π) − ηD (π 0 )
(xτ ,~
rτ )∼D,
aτ ∼Wτ0
The next two lemmas relate the V̄t (π) to the ∆t (π).
Lemma 20. Assume Condition 1. For any t ≥ t1 and
π ∈ Π, if V̄t (π) > θK, then
s
∆t−1 (π) ≥
Proof. By Lemma 18, the fact V̄t (π) > θK implies
that
and
E
t
X
τ =1
≤
x∼ht−1
(Zτ (π) − Zτ (π 0 ))2
E
(xτ ,~
r (τ ))∼D,
aτ ∼Wτ0
t
X
τ =1
E
xτ ∼DX
1
1
+ 0 0
0
Wτ (π(xτ )) Wτ (π (xτ ))
72V̄t (π)Ct−1
.
t−1
1
(1 − Kµt )WPt (x, π(x)) + µt
ρ+1
1
1
1−
V̄t (π) ≥ V̄t (π).
>
1+
θ
2
≤ t · (V̄max,t (π) + V̄max,t (π 0 )).
Since V̄t (π) > θK ≥ 5K, Lemma 17 implies that in order for Pt to satisfy the optimization constraint in (4.1)
corresponding to π (with slack ≤ K), it must be the
case that
Moreover, with probability 1,
1
.
µτ
q
Ct
Now, note that since t ≥ t0 , µt =
2Kt , so that
|Zτ (π) − Zτ (π 0 )| ≤
Ct
0
t = 2Kµ
2 . Further, both V̄max,t (π) and V̄max,t (π ) are
t
at least K. Using these bounds we get
s
1
· t · (V̄max,t (π) + V̄max,t (π 0 ))
log(1/δt )
s
1
1
Ct
1
· 2K =
≥
·
≥
,
Ct 2Kµ2t
µt
µτ
for all τ ≤ t, since the µτ ’s are non-increasing. Therefore, by Freedman’s inequality (Theorem 13), we have
"
Pr (ηt (π) − ηt (π 0 )) − (ηD (π) − ηD (π 0 ))
r
>2
#
(V̄max,t (π) + V̄max,t (π 0 )) · log(1/δt )
≤ 2δt .
t
The conclusion follows by taking a union bound over
t0 < t ≤ T and all pairs π, π 0 ∈ Π.
∆t−1 (π)
s
144Ct−1
1
≥
· E
.
t−1
x∼ht−1 (1 − Kµt )WPt (x, π(x)) + µt
Combining with the above, we obtain
s
72V̄t (π)Ct−1
.
∆t−1 (π) ≥
t−1
Lemma 21. Assume Condition 1. For all t ≥ 1,
V̄max,t (πmax ) ≤ θK and V̄max,t (πt ) ≤ θK.
Proof. By induction on t. The claim for all t ≤ t1 follows from Lemma 18. So take t > t1 , and assume as
the (strong) inductive hypothesis that V̄max,τ (πmax ) ≤
θK and V̄max,τ (πτ ) ≤ θK for τ ∈ {1, . . . , t − 1}. Suppose for sake of contradiction that V̄t (πmax ) > θK. By
Lemma 20,
s
72V̄t (πmax )Ct−1
∆t−1 (πmax ) ≥
.
t−1
However, by the deviation bounds, we have
D.3 Variance Analysis
We define the following condition, which will be assumed by most of the subsequent lemmas in this section.
Condition 1. The deviation bound (D.1) holds for
all π ∈ Π and t ≥ t1 , and the deviation bound (D.2)
holds for all pairs π, π 0 ∈ Π and t ≥ t0 .
∆t−1 (πmax ) + ∆D (πt−1 )
s
(V̄max,t−1 (πt−1 ) + V̄max,t−1 (πmax ))Ct−1
≤2
t−1
s
s
2V̄t (πmax )Ct−1
72V̄t (πmax )Ct−1
≤2
<
.
t−1
t−1
The second inequality follows from our assumption and
the induction hypothesis:
This contradicts the inequality in (D.3), so it must be
that V̄max,t (πt ) ≤ θK.
V̄t (πmax ) > θK ≥ V̄max,t−1 (πt−1 ), V̄max,t−1 (πmax ).
Corollary 22. Under the assumptions of Lemma 21,
Since ∆D (πt−1 ) ≥ 0, we have a contradiction, so
it must be that V̄t (πmax ) ≤ θK. This proves that
V̄max,t (πmax ) ≤ θK.
It remains to show that V̄max,t (πt ) ≤ θK. So suppose for sake of contradiction that the inequality fails,
and let t1 < τ ≤ t be any round for which V̄τ (πt ) =
V̄max,t (πt ) > θK. By Lemma 20,
s
∆τ −1 (πt ) ≥
72V̄τ (πt )Cτ −1
.
τ −1
(D.3)
On the other hand,
∆τ −1 (πt ) ≤ ∆D (πτ −1 ) + ∆τ −1 (πt ) + ∆t (πmax )
= ∆D (πτ −1 ) + ∆τ −1 (πmax )
+ ητ −1 (πmax ) − ητ −1 (πt ) − ∆D (πt )
+ ∆D (πt ) + ∆t (πmax ) .
The parenthesized terms can be bounded using the
deviation bounds, so we have
∆τ −1 (πt )
s
(V̄max,τ −1 (πτ −1 ) + V̄max,τ −1 (πmax ))Cτ −1
≤2
τ −1
s
(V̄max,τ −1 (πt ) + V̄max,τ −1 (πmax ))Cτ −1
+2
τ −1
r
(V̄max,t (πt ) + V̄max,t (πmax ))Ct
+2
s
st
2V̄τ (πt )Cτ −1
+2
τ −1
r
2V̄τ (πt )Ct
+2
t
s
≤2
<
2V̄τ (πt )Cτ −1
τ −1
72V̄τ (πt )Cτ −1
τ −1
where the second inequality follows from the following
facts:
1. By
induction
hypothesis,
we
have
V̄max,τ −1 (πτ −1 ), V̄max,τ −1 (πmax ), V̄max,t (πmax ) ≤
θK, and V̄τ (πt ) > θK,
2. V̄τ (πt ) ≥ V̄max,t (πt ), and
3. since τ is a round that achieves V̄max,t (πt ), we
have V̄τ (πt ) ≥ V̄τ −1 (πt ).
r
∆D (πt ) + ∆t (πmax ) ≤ 2
2θKCt
t
for all t ≥ t0 .
Proof. Immediate from Lemma 21 and the deviation
bounds from (D.2).
The following lemma shows that if a policy π has large
∆τ (π) in some round τ , then ∆t (π) remains large in
later rounds t > τ .
Lemma 23. Assume Condition 1. Pick any π ∈ Π
and t ≥ t1 . If V̄max,t (π) > θK, then
r
∆t (π) > 2
2V̄max,t (π)Ct
.
t
Proof. Let τ ≤ t be any round in which V̄τ (π) =
V̄max,t (π) > θK. We have
∆t (π) ≥ ∆t (π) − ∆t (πmax ) − ∆D (πτ −1 )
= ∆τ −1 (π) + ηt (πmax ) − ηt (π) − ∆D (π)
+ ηD (πτ −1 ) − ηD (π) − ∆τ −1 (π)
s
72V̄τ (π)Cτ −1
≥
τ −1
r
(V̄max,t (π) + V̄max,t (πmax ))Ct
−2
t
s
(V̄max,τ −1 (π) + V̄max,τ −1 (πτ −1 ))Cτ −1
τ −1
s
r
72V̄max,t (π)Cτ −1
2V̄max,t (π)Ct
>
−2
τ −1
t
s
2V̄max,t (π)Cτ −1
−2
τ −1
s
r
2V̄max,t (π)Cτ −1
2V̄max,t (π)Ct
≥2
≥2
τ −1
t
−2
where the second inequality follows from Lemma 20
and the deviation bounds, and the third inequality
follows from Lemma 21 and the facts that V̄τ (π) =
V̄max,t (π) > θK ≥ V̄max,t (πmax ), V̄max,τ −1 (πτ −1 ), and
V̄max,t (π) ≥ V̄max,τ −1 (π).
D.4 Regret Analysis
We now bound the value of the optimization problem (4.1), which then leads to our regret bound. The
next lemma shows the existence of a feasible solution
with a certain structure based on the non-uniform constraints. Recall from Section 5, that solving the optimization problem A, i.e. constraints (5.1, 5.2, 5.3), for
the smallest feasible value of s is equivalent to solving
the RUCB optimization problem (4.1). Recall that
t−1
.
βt = 180C
t−1
Lemma 24. There is a point W ∈ R(t−1)K such that
s
K
∆t−1 (W ) ≤ 4
βt
W ∈ C
#
"
X Z(x, a)
≤ max{4K, βt ∆t−1 (Z)2 }
∀Z ∈ C : E
0 (x, a)
W
x∼ht−1
a
In particular, the value of the qoptimization
q probt−1
lem (4.1), OPTt , is bounded by 8 βKt ≤ 110 KC
t−1 .
Proof. Define the sets {Ci : i = 1, 2, . . .} such that
Ci := {Z ∈ C : 2i+1 κ ≤ ∆t−1 (Z) ≤ 2i+2 κ},
q
where κ = βKt . Note that since ∆t−1 (Z) is a linear
function of Z, each Ci is a closed, convex, compact
set. Also, define C0 = {Z ∈ C : ∆t−1 (Z) ≤ 4κ}.
ThisSis also a closed, convex, compact set. Note that
∞
C = i=0 Ci .
Let I = {i : Ci 6=P
∅}.For i ∈ I \ {0}, define wi = 4−i ,
and let w0 = 1 − i∈I\{0} wi . Note that w0 ≥ 2/3.
By Lemma 1, for each i ∈ I, there is a point Wi ∈ Ci
such that for all Z ∈ Ci , we have
"
#
X Z(x, a)
≤ 2K.
E
Wi0 (x, a)
x∼ht−1
a
Here we use the fact that Kµt ≤ 1/2 to upper
K
by 2K. Now consider the point W =
bound 1−Kµ
t
P
w
W
.
Since
C is convex, W ∈ C.
i
i
i∈I
Now fix any i ∈ I. For any (x, a), we have W 0 (x, a) ≥
wi Wi0 (x, a), so that for all Z ∈ Ci , we have
"
#
X Z(x, a)
1
≤
2K
E
0 (x, a)
W
w
x∼ht−1
i
a
≤ 4i+1 K
≤ max{4K, βt ∆t−1 (Z)2 },
so the constraint for Z is satisfied.
Finally, since for all i ∈ I, we have wi ≤ 4−i and
∆t−1 (Wi ) ≤ 2i+2 κ, we get
∆t−1 (W ) =
X
wi ∆t−1 (Wi ) ≤
∞
X
4−i · 2i+2 κ ≤ 8κ.
i=0
i∈I
The value of the optimization problem (4.1) can be
related to the expected instantaneous regret of policy
drawn randomly from the distribution Pt .
Lemma 25. Assume Condition 1. Then
r
X
√ KCt−1
Pt (π)∆D (π) ≤ 220 + 4 2θ ·
+ 2εopt,t
t−1
π∈Π
for all t > t1 .
Proof. Fix any π ∈ Π and t > t1 . By the deviation
bounds, we have
ηD (πt−1 ) − ηD (π)
s
(V̄max,t−1 (π) + V̄max,t−1 (πt−1 ))Ct−1
≤ ∆t−1 (π) + 2
t−1
s
V̄max,t−1 (π) + θK Ct−1
≤ ∆t−1 (π) + 2
,
t−1
by Lemma 21. By Corollary 22 we have
r
2θKCt−1
∆D (πt−1 ) ≤ 2
t−1
Thus, we get
∆D (π) ≤ ηD (πt−1 ) − ηD (π) + ∆D (πt−1 )
s
V̄max,t−1 (π) + θK Ct−1
≤ ∆t−1 (π) + 2
t−1
r
2θKCt−1
+2
.
t−1
If V̄max,t−1 (π) ≤ θK, then we have
r
2θKCt−1
∆D (π) ≤ ∆t−1 (π) + 4
.
t−1
Otherwise, Lemma 23 implies that
V̄max,t−1 (π) ≤
(t − 1) · ∆t−1 (π)2
,
8Ct−1
so
r
∆t−1 (π)2
θKCt−1
∆D (π) ≤ ∆t−1 (π) + 2
+
8
t−1
r
2θKCt−1
+2
t−1
r
2θKCt−1
≤ 2∆t−1 (π) + 4
.
t−1
Therefore
X
E
Pt (π)∆D (π)
π∈Π
≤2
X
r
Pt (π)∆t−1 (π) + 4
π∈Π
r
≤ 2 (OPTt +εopt,t ) + 4
2θKCt−1
t−1
2θKCt−1
t−1
where OPTt is the value of the optimization problem (4.1). The conclusion follows from Lemma 24.
We can now finally prove the main regret bound for
RUCB.
Proof of Theorem 5. The regret through the first t1
rounds is trivially bounded by t1 . In the event that
Condition 1 holds, we have for all t ≥ t1 ,
X
X
Wt (a)rt (a) ≥
(1 − Kµt )WPt (xt , a)rt (a)
a∈A
a∈A
≥
X
E Details of Oracle-based Algorithm
We show how to (approximately) solve A using the
ellipsoid algorithm with AMO. Fix a time period t.
To avoid clutter, (only) in this section we drop the subscript t − 1 from $\eta_{t-1}(\cdot)$, $\Delta_{t-1}(\cdot)$, and $h_{t-1}$, so that they become η(·), ∆(·), and h, respectively.
In order to use the ellipsoid algorithm, we need to
relax the program a little bit in order to ensure that
the feasible region has a non-negligible volume. To do
this, we need to obtain some perturbation bounds for
the constraints of A. The following lemma gives such
bounds. For any δ > 0, we define Cδ to be the set of
all points within a distance of δ from C.
Lemma 26. Let δ ≤ b/4 be a parameter. Let U, W ∈
C2δ be points such that kU − W k ≤ δ. Then we have
|∆(U ) − ∆(W )| ≤ γ
(E.1)
∀Z ∈ C1 :
#
"
#
"
X Z(x, a) X Z(x, a)
− E
≤
E
x∼h
U 0 (x, a)
W 0 (x, a) x∼h
a
a
WPt (xt , a)rt (a) − Kµt
(E.2)
a∈A
=
X
Pt (π)rt (π(xt )) − Kµt ,
where =
8δ
µ2t
and γ =
δ
µt .
π∈Π
Proof. First, we have
and therefore
|η(U ) − η(W )| ≤
[rt (at )]
E
"
=
E
(xt ,~
r (t))∼D
≥
X
Wt0 (a)rt (a)
which implies (E.1).
a∈A
Pt (π)ηD (π) − Kµt
π∈Π
r
KCt−1
+ εopt,t
t−1
!
where the last inequality follows from Lemma 25.
Summing the bound from t = t1 + 1, . . . , T gives
Next, for any Z ∈ C1 , we have
X Z(x, a)
X Z(x, a) −
U 0 (x, a)
W 0 (x, a) a
a
X
|U 0 (x, a) − W 0 (x, a)|
≤
|Z(x, a)|
U 0 (x, a)W 0 (x, a)
a
≤
T
X
t=1
E
[ηD (πmax ) − rt (at )]
(xt ,~
r (t))∼D
at ∼Wt0
≤ t1 + O
r
|U (x, a) − W (x, a)|
p
δ
≤
= γ,
µt
#
≥ ηD (πmax ) − O
X
(x,a,r,q)∈h
(xt ,~
r (t))∼D
at ∼Wt0
X
1
t−1
p
T K log (N T /δ) .
By
PT Azuma’s inequality, the probability that
t=1
p rt (at ) deviates from its mean by more than
O( T log(1/δ)) is at most δ. Finally, the probability
that Condition 1 does not hold is at most 2δ by
Lemma 19, Theorem 6, and a union bound. The
conclusion follows by a final union bound.
8δ
= .
µ2t
In the last inequality, we use the Cauchy-Schwarz inequality, and use the following facts (here, Z(x, ·) denotes the vector hZ(x, a)ia , etc.):
1. kZ(x, ·)k ≤ 2 since Z ∈ C1 ,
2. kU 0 (x, ·) − W 0 (x, ·)k ≤ kU (x, ·) − W (x, ·)k ≤ δ,
and
3. U 0 (x, a) ≥ (1 − bK) · (−2δ) + b ≥ b/2, for δ ≤ b/4,
and similarly W 0 (x, a) ≥ b/2.
This implies (E.2).
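As a quick numerical sanity check of the perturbation bounds (E.1)-(E.2), the following sketch evaluates both sides on random instances. It assumes the smoothed form $W'(x,a) = (1 - K\mu_t)W(x,a) + \mu_t$ (so $b = \mu_t$) and a linear estimate $\eta(W) = w \cdot W$ with $|w(x,a)| \le 1/\mu_t$; these forms, and all names below, are illustrative assumptions rather than definitions taken from this section.

```python
import numpy as np

# Minimal sanity check of the perturbation bounds (E.1)-(E.2), assuming
# W'(x,a) = (1 - K*mu) * W(x,a) + mu (so b = mu) and eta(W) = w . W.
rng = np.random.default_rng(0)
n, K, mu = 50, 5, 0.05                 # n contexts, K actions, smoothing mu
delta = mu / 8                         # must satisfy delta <= b/4 = mu/4

w = rng.uniform(0, 1 / mu, size=(n, K))          # importance-weighted rewards
W = rng.dirichlet(np.ones(K), size=n)            # a point of C
U = W + delta * rng.normal(size=(n, K)) / np.sqrt(n * K)   # ||U - W|| ~ delta

def smooth(V):                          # assumed smoothing V'(x,a)
    return (1 - K * mu) * V + mu

def eta(V):                             # assumed linear reward estimate
    return np.sum(w * V) / n

Z = rng.dirichlet(np.ones(K), size=n)            # a test point Z in C (hence C_1)
lhs_E1 = abs(eta(U) - eta(W))           # equals |Delta(U) - Delta(W)|, since Delta = v - eta
lhs_E2 = abs(np.mean(np.sum(Z / smooth(U), axis=1) - np.sum(Z / smooth(W), axis=1)))

print(lhs_E1, "<=", delta / mu)         # gamma
print(lhs_E2, "<=", 8 * delta / mu**2)  # epsilon
```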
We now consider the following relaxed form of A. Here, $\delta \in (0, b/4)$ is a parameter. We want to find a point $W \in \mathbb{R}^{(t-1)K}$ such that
$$\Delta(W) \le s + \gamma \qquad \text{(E.3)}$$
$$W \in C_\delta \qquad \text{(E.4)}$$
$$\forall Z \in C_{2\delta}: \quad \mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \le \max\{4K, \beta_t \Delta(Z)^2\} + \epsilon, \qquad \text{(E.5)}$$
where $\epsilon$ and $\gamma$ are as defined in Lemma 26. Call this relaxed program A'.
We apply the ellipsoid method to A' rather than A. Recall the requirements of Lemma 8: we need an enclosing ball of bounded radius for the feasible region, and the radius of a ball enclosed in the feasible region. The following lemma gives both.

Lemma 27. The feasible region of A' is contained in $B(0, \sqrt{t} + \delta)$, and if A is feasible, then it contains a ball of radius $\delta$.
Proof. Note that for any $W \in C_\delta$, we have $\|W\| \le \sqrt{t} + \delta$, so the feasible region lies in $B(0, \sqrt{t} + \delta)$.

Next, if A is feasible, let $W^\star \in C$ be any feasible solution to A. Consider the ball $B(W^\star, \delta)$, and let $U$ be any point in $B(W^\star, \delta)$. Clearly $U \in C_\delta$. By Lemma 26, assuming $\delta \le 1/2$, we have for all $Z \in C_{2\delta}$,
$$\mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{U'(x,a)}\right] \;\le\; \mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{(W^\star)'(x,a)}\right] + \epsilon \;\le\; \max\{4K, \beta_t \Delta(Z)^2\} + \epsilon.$$
Also,
$$\Delta(U) \le \Delta(W^\star) + \gamma \le s + \gamma.$$
Thus, $U$ is feasible for A', and hence the entire ball $B(W^\star, \delta)$ is feasible for A'.
We now give the construction of a separation oracle for the feasible region of A' by checking for violations of the constraints. In the following, we use the word "iteration" to indicate one step of either the ellipsoid algorithm or the perceptron algorithm. Each such iteration involves one call to AMO, and additional $O(t^2 K^2)$ processing time.
Let $W \in \mathbb{R}^{(t-1)K}$ be a candidate point that we want to check for feasibility for A'. We can check for violation of the constraint (E.3) easily, and since it is a linear constraint in $W$, it automatically yields a separating hyperplane if it is violated.
The harder constraints are (E.4) and (E.5). Recall that Lemma 9 shows that AMO allows us to do linear optimization over $C$ efficiently. This immediately gives us the following useful corollary:

Corollary 28. Given a vector $w \in \mathbb{R}^{(t-1)K}$ and $\delta > 0$, we can compute $\arg\max_{Z \in C_\delta} w \cdot Z$ using one invocation of AMO.

Proof. This follows directly from the following fact:
$$\arg\max_{Z \in C_\delta} w \cdot Z \;=\; \frac{\delta}{\|w\|}\,w + \arg\max_{Z \in C} w \cdot Z.$$
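Corollary 28 translates into a one-liner around the oracle. In the sketch below, `amo_argmax(w)` is a hypothetical stand-in for AMO that returns $\arg\max_{Z \in C} w \cdot Z$ as a vector; everything else is a direct transcription of the identity above.

```python
import numpy as np

def argmax_over_C_delta(w, delta, amo_argmax):
    """arg max_{Z in C_delta} w . Z via one call to a (hypothetical) AMO.

    Corollary 28: the maximizer over the delta-enlargement of C is the
    maximizer over C shifted by delta * w / ||w||.
    """
    w = np.asarray(w, dtype=float)
    return (delta / np.linalg.norm(w)) * w + amo_argmax(w)
```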
Now we show how to use AMO to check for constraint
(E.4):
Lemma 29. Suppose we are given a point $W$. Then in $O(t/\delta^2)$ iterations, if $W \notin C_{2\delta}$, we can construct a hyperplane separating $W$ from $C_\delta$. Otherwise, we declare correctly that $W \in C_{2\delta}$. In the latter case, we can find an explicit distribution $P$ over policies in $\Pi$ such that $W_P$ satisfies $\|W_P - W\| \le 2\delta$.
Proof. We run the perceptron algorithm with the origin at $W$ and all points in $C_\delta$ being positive examples. The goal of the perceptron algorithm then is to find a hyperplane going through $W$ that puts all of $C_\delta$ (strictly) on one side. In each iteration of the perceptron algorithm, we have a weight vector $w$ that is the normal to a candidate hyperplane, and we need to find a point $Z \in C_\delta$ such that $w \cdot (Z - W) \le 0$ (note that we have shifted the origin to $W$). To do this, we use AMO as in Lemma 9 to find $Z^\star = \arg\max_{Z \in C_\delta} -w \cdot Z$. If $w \cdot (Z^\star - W) \le 0$, we use $Z^\star$ to update $w$ using the perceptron update rule, $w \leftarrow w + (Z^\star - W)$. Otherwise, we have $w \cdot (Z - W) > 0$ for all $Z \in C_\delta$, and hence we have found our separating hyperplane.

Now suppose that $W \notin C_{2\delta}$, i.e. the distance of $W$ from $C_\delta$ is more than $\delta$. Since $\|Z - W\| \le 2\sqrt{t} + 3\delta = O(\sqrt{t})$ for all $Z \in C_\delta$ (assuming $\delta = O(\sqrt{t})$), the perceptron convergence guarantee implies that in $O(t/\delta^2)$ iterations we find a separating hyperplane.

If in $k = O(t/\delta^2)$ iterations we haven't found a separating hyperplane, then $W \in C_{2\delta}$. In fact the perceptron algorithm gives a stronger guarantee: if the $k$ policies found in the run of the perceptron algorithm are $\pi_1, \pi_2, \ldots, \pi_k \in \Pi$, then $W$ is within a distance of $2\delta$ from their convex hull, $C' = \mathrm{conv}(\pi_1, \pi_2, \ldots, \pi_k)$. This is because a run of the perceptron algorithm on $C'_{2\delta}$ would be identical to that on $C_{2\delta}$ for $k$ steps. We can then compute the explicit distribution over policies $P$ by computing the Euclidean projection of $W$ on $C'$ in $\mathrm{poly}(k)$ time using a convex quadratic program:
$$\min_P \;\Bigl\|W - \sum_{i=1}^{k} P_i \pi_i\Bigr\|^2 \quad \text{subject to} \quad \sum_i P_i = 1, \quad \forall i: P_i \ge 0.$$
Solving this quadratic program, we get a distribution $P$ over the policies $\{\pi_1, \pi_2, \ldots, \pi_k\}$ such that $\|W_P - W\| \le 2\delta$.
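The membership test of Lemma 29 is easy to phrase operationally. The sketch below is a non-authoritative rendering: `amo_argmax` is again a hypothetical oracle for $\arg\max_{Z \in C} w \cdot Z$, and the return values are placeholders. It runs the perceptron with the origin shifted to $W$ and positive examples $C_\delta$, returning either a separating normal or the policies whose convex hull approximates $W$.

```python
import numpy as np

def check_membership(W, delta, amo_argmax, max_iters):
    """Perceptron-based check of constraint (E.4), per Lemma 29 (sketch).

    Returns ("separator", w) if a hyperplane through W with all of C_delta
    strictly on one side is found, or ("inside", policies) if no separator
    is found within max_iters = O(t / delta^2) iterations.
    """
    W = np.asarray(W, dtype=float)
    w = np.zeros_like(W)            # normal of the candidate hyperplane
    policies = []                   # vertices of C returned by the oracle
    for _ in range(max_iters):
        pi = amo_argmax(-w)         # arg max over C of (-w) . Z: one AMO call
        policies.append(pi)
        if np.linalg.norm(w) > 0:   # extend to C_delta via Corollary 28
            Z = (delta / np.linalg.norm(w)) * (-w) + pi
        else:
            Z = pi
        if w @ (Z - W) <= 0:        # violated example: perceptron update
            w = w + (Z - W)
        else:                       # all of C_delta is strictly on one side
            return "separator", w
    return "inside", policies       # W is within 2*delta of conv(policies)
```

In the "inside" case, the projection of $W$ onto $\mathrm{conv}(\pi_1, \ldots, \pi_k)$ can then be computed with any off-the-shelf quadratic-program solver over the $k$ simplex weights, as in the proof above.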
Finally, we show how to check constraint (E.5):
Lemma 30. Suppose we are given a point $W$. In $O\!\bigl(\frac{t^3 K^2}{\delta^2}\log\frac{t}{\delta}\bigr)$ iterations, we can either find a point $Z \in C_{2\delta}$ such that
$$\mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \ge \max\{4K, \beta_t \Delta(Z)^2\} + 2\epsilon,$$
or else we conclude correctly that for all $Z \in C$, we have
$$\mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \le \max\{4K, \beta_t \Delta(Z)^2\} + 3\epsilon.$$
Proof. We first rewrite $\eta(W)$ as $\eta(W) = w \cdot W$, where $w$ is the vector defined as
$$w(x,a) = \frac{1}{t-1}\sum_{(x',a',r,p) \in h:\; x'=x,\, a'=a} \frac{r}{p}.$$
Thus, $\Delta(Z) = v - w \cdot Z$, where $v = \max_{\pi'} \eta(\pi') = \max_{\pi'} w \cdot \pi'$, which can be computed using AMO once.

Next, using the candidate point $W$, compute the vector $u$ defined as $u(x,a) = \frac{n_x/t}{W'(x,a)}$, where $n_x$ is the number of times $x$ appears in $h$, so that $\mathbb{E}_{x \sim h}\bigl[\sum_a \frac{Z(x,a)}{W'(x,a)}\bigr] = u \cdot Z$. Now the problem reduces to finding a point $Z \in C$ which violates the constraint
$$u \cdot Z \le \max\{4K, \beta_t (w \cdot Z - v)^2\} + 3\epsilon.$$
Define
$$f(Z) = \max\{4K, \beta_t (w \cdot Z - v)^2\} + 3\epsilon - u \cdot Z.$$
Note that $f$ is a convex function of $Z$. Checking for violation of the above constraint is equivalent to solving the following (convex) program:
$$f(Z) \le 0 \qquad \text{(E.6)}$$
$$Z \in C \qquad \text{(E.7)}$$
To do this, we again apply the ellipsoid method, but on the relaxed program
$$f(Z) \le \epsilon \qquad \text{(E.8)}$$
$$Z \in C_\delta \qquad \text{(E.9)}$$
To run the ellipsoid algorithm, we need a separation oracle for this program. Given a candidate solution $Z$, we run the algorithm of Lemma 29, and if $Z \notin C_{2\delta}$, we construct a hyperplane separating $Z$ from $C_\delta$. Now suppose we conclude that $Z \in C_{2\delta}$. Then we construct a separation oracle for (E.8) as follows. If $f(Z) > \epsilon$, then since $f$ is a convex function of $Z$, we can construct a separating hyperplane as in Lemma 10.

Now we can run the ellipsoid algorithm with the starting ellipsoid being $B(0, \sqrt{t})$. If there is a point $Z^\star \in C$ such that $f(Z^\star) \le 0$, then consider the ball $B\bigl(Z^\star, \frac{4\delta}{5\sqrt{tK}\,\beta_t}\bigr)$. For any $Y \in B\bigl(Z^\star, \frac{4\delta}{5\sqrt{tK}\,\beta_t}\bigr)$, we have
$$|(u \cdot Z^\star) - (u \cdot Y)| \le \|u\|\,\|Z^\star - Y\| \le \frac{\epsilon}{2},$$
since $\|u\| \le \frac{\sqrt{K}}{\mu_t}$. Also,
$$\beta_t\,\bigl|(w \cdot Z^\star - v)^2 - (w \cdot Y - v)^2\bigr| = \beta_t\,\bigl|(w \cdot Z^\star - w \cdot Y)(w \cdot Z^\star + w \cdot Y - 2v)\bigr| \le \beta_t\,\|w\|\,\|Z^\star - Y\|\,\bigl(\|w\|(\|Z^\star\| + \|Y\|) + 2|v|\bigr) \le \frac{\epsilon}{2},$$
since $\|w\| \le \frac{1}{\mu_t}$, $\|Z^\star\| \le \sqrt{t}$, $\|Y\| \le \sqrt{t} + \delta \le 2\sqrt{t}$, and $|v| \le \|w\| \cdot \sqrt{t} \le \frac{\sqrt{t}}{\mu_t}$. Thus, $f(Y) \le f(Z^\star) + \epsilon \le \epsilon$, so the entire ball $B\bigl(Z^\star, \frac{4\delta}{5\sqrt{tK}\,\beta_t}\bigr)$ is feasible for the relaxed program.

By Lemma 8, in $O\bigl(t^2 K^2 \cdot \log\frac{tK}{\delta}\bigr)$ iterations of the ellipsoid algorithm, we obtain one of the following:
1. we find a point $Z \in C_{2\delta}$ such that $f(Z) \le \epsilon$, i.e.
$$\mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \ge \max\{4K, \beta_t \Delta(Z)^2\} + 2\epsilon,$$
2. or else we conclude that the original convex program (E.6, E.7) is infeasible, i.e. for all $Z \in C$, we have
$$\mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \le \max\{4K, \beta_t \Delta(Z)^2\} + 3\epsilon.$$
The total number of iterations is bounded by $O\bigl(t^2 K^2 \cdot \log\frac{tK}{\delta}\bigr) \cdot O\bigl(\frac{t}{\delta^2}\bigr) = O\bigl(\frac{t^3 K^2}{\delta^2} \cdot \log\frac{tK}{\delta}\bigr)$.
Lemma 31. Suppose we are given a point $Z \in C_{2\delta}$ such that
$$\mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \ge \max\{4K, \beta_t \Delta(Z)^2\} + 2\epsilon.$$
Then we can construct a hyperplane separating $W$ from all feasible points for A'.
Proof. For notational convenience, define the function
$$f_Z(W) := \mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] - \max\{4K, \beta_t \Delta(Z)^2\} - 2\epsilon.$$
Note that it is a convex function of $W$. For any point $U$ that is feasible for A', we have $f_Z(U) \le -\epsilon$, whereas $f_Z(W) \ge 0$. Thus, by Lemma 10, we can construct the desired separating hyperplane.
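The separating hyperplane here is just a subgradient cut of $f_Z$ at the current $W$. The sketch below makes this concrete under the assumed smoothing $W'(x,a) = (1 - K\mu_t)W(x,a) + \mu_t$ and the empirical weights $n_x/t$ used earlier; both are assumptions of this illustration, not definitions from the text.

```python
import numpy as np

def fZ_gradient(W, Z, n_x, t, K, mu_t):
    """Gradient (in W) of f_Z from Lemma 31, assuming the smoothing
    W'(x,a) = (1 - K*mu_t) * W(x,a) + mu_t (an assumption of this sketch).

    Only the term E_{x~h}[sum_a Z(x,a)/W'(x,a)] depends on W, so
    d/dW(x,a) = -(n_x/t) * Z(x,a) * (1 - K*mu_t) / W'(x,a)^2,
    which gives the Lemma 10-style separating hyperplane at W.
    """
    W_sm = (1 - K * mu_t) * W + mu_t
    return -(n_x[:, None] / t) * Z * (1 - K * mu_t) / W_sm ** 2
```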
We can finally prove Theorem 11:

Proof of Theorem 11. We run the ellipsoid algorithm starting with the ball $B(0, \sqrt{t} + \delta)$. At each point, we are given a candidate solution $W$ for program A'. We check for violation of constraint (E.3) first. If it is violated, the constraint, being linear, gives us a separating hyperplane. Else, we use Lemma 29 to check for violation of constraint (E.4). If $W \notin C_{2\delta}$, then we can construct a separating hyperplane. Else, we use Lemmas 30 and 31 to check for violation of constraint (E.5). If there is a $Z \in C$ such that
$$\mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \ge \max\{4K, \beta_t \Delta(Z)^2\} + 3\epsilon,$$
then we can find a separating hyperplane. Else, we conclude that the current point $W$ satisfies the following constraints:
$$\Delta(W) \le s + \gamma,$$
$$\forall Z \in C: \quad \mathbb{E}_{x \sim h}\left[\sum_a \frac{Z(x,a)}{W'(x,a)}\right] \le \max\{4K, \beta_t \Delta(Z)^2\} + 3\epsilon,$$
$$W \in C_{2\delta}.$$
We can then use the perceptron-based algorithm of Lemma 29 to "round" $W$ to an explicit distribution $P$ over policies in $\Pi$ such that $W_P$ satisfies $\|W_P - W\| \le 2\delta$. Then Lemma 26 implies the stated bounds for $W_P$.

By Lemma 8, in $O\bigl(t^2 K^2 \log\frac{t}{\delta}\bigr)$ iterations of the ellipsoid algorithm, we find the point $W$ satisfying the constraints given above, or declare correctly that A is infeasible. In the worst case, we might have to run the algorithm of Lemma 30 in every iteration, leading to an upper bound of $O\bigl(t^2 K^2 \log\frac{t}{\delta}\bigr) \times O\bigl(\frac{t^3 K^2}{\delta^2} \cdot \log\frac{tK}{\delta}\bigr) = O\bigl(\frac{t^5 K^4}{\delta^2} \log^2\frac{tK}{\delta}\bigr)$ on the number of iterations.
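Putting the pieces together, the outer loop of the oracle-based solver has roughly the following shape. This is an illustrative skeleton only: the ellipsoid update object, the Lemma 29/30 subroutines, and the program data `s`, `gamma` are stand-ins for the objects defined above, not the authors' implementation.

```python
def solve_relaxed_program(ellipsoid, Delta, s, gamma, check_E4, check_E5, round_to_policies):
    """Outer ellipsoid loop for the relaxed program A' (illustrative sketch).

    ellipsoid: object with .center() and .cut(normal) implementing one
               ellipsoid-method update; the remaining arguments stand in
               for the subroutines of Lemmas 26-31.
    """
    for _ in range(ellipsoid.max_iters):      # O(t^2 K^2 log(t/delta)) steps
        W = ellipsoid.center()
        if Delta.value(W) > s + gamma:        # constraint (E.3) is linear in W
            ellipsoid.cut(Delta.gradient())
            continue
        ok, certificate = check_E4(W)         # Lemma 29: is W in C_{2*delta}?
        if not ok:
            ellipsoid.cut(certificate)        # hyperplane separating W from C_delta
            continue
        ok, certificate = check_E5(W)         # Lemmas 30-31: variance constraints
        if not ok:
            ellipsoid.cut(certificate)
            continue
        return round_to_policies(W)           # explicit P with ||W_P - W|| <= 2*delta
    return None                               # declare A infeasible
```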