Bandits : optimality in exponential families
Odalric-Ambrym Maillard
IHES, January 2016
Odalric-Ambrym Maillard
Bandits
1 / 40
Introduction
.
1
2
3
Stochastic multi-armed bandits
Boundary crossing probabilities
Analysis for exponential families
Odalric-Ambrym Maillard
Bandits
2 / 40
Stochastic multi-armed bandits
Application
examples
.
Clinical trials
I
K possible treatments (unknown effects)
I
What treatment allocate to each patient, depending on
observed effects on previous patients?
Online advertising
I
K ads that can be displayed
I
What ad show to each user, depending on observed clicks
from past users?
Odalric-Ambrym Maillard
Bandits
4 / 40
The
stochastic Multi-Armed Bandit setting
.
Setting
Set of choices A. Each a ∈ A is associated with an unknown probability distribution νa ∈ D with mean µa .
Game
The game is sequential: At each round t > 1,
Goal
I
the player first picks an arm At ∈ A based on
previous observations.
I
then receives (and sees) a stochastic payoff Yt ∼ νAt .
"T
X
Maximize the cumulative payoff E
#
Yt or equivalently
t=1
minimize the regret at round T :
RT
"T
X
#
def
=
Tµ − E
=
max{ µa ; a ∈ A },
?
Yt
where
t=1
µ?
Odalric-Ambrym Maillard
Bandits
a? ∈ argmax{ µa ; a ∈ A } .
5 / 40
Applications
of Bandits
.
What book to
display?
(+Few
interactions)
What news to
display?
(+Short lifespan)
Odalric-Ambrym Maillard
What drug is the most
efficient?
(+Economic/Human cost)
What action to choose?
(+Risk Aversion)
Bandits
What power to
inject in the
electrical network?
(+Fast decisions)
How to organize
future cities?
(+Long-term
decisions)
6 / 40
Core
problem
.
Decision Making under
Uncertainty
Exploration versus Exploitation
I
Sample more information or trust current estimates ?
Optimism in face of (modeled) uncertainty
I
Build empirical confidence bounds on value. Choose point
with highest value.
Odalric-Ambrym Maillard
Bandits
7 / 40
Some
historical facts
.
I
1933: Introduction by Thompson ("Bayesian" way)
I
1952: Frequentist formalization by Robbins.
I
1985: Lai and Robbins’ first lower-performance bounds.
I
1996: Burnetas and Katehakis get general lower-performance
bounds.
.
I
2002: Upper-Confidence Bound (UCB) algorithm gets first
non-asymptotic guarantee.
I
2010+: Provably optimal strategies KL-UCB, D-MED,
Thompson Sampling for specific distributions.
I
This talk: Extension to general exponential families.
Odalric-Ambrym Maillard
Bandits
8 / 40
Typical
performance guarantees
.
Lower bound (Burnetas and Katehakis 96)
For all "consistent" algorithm, for arbitrary D ⊂ P([0, 1])
lim inf
T →∞
X µ? − µa
RT
>
,
log(T ) a∈A Kinf (νa , µ? )
where Kinf (νa , µ? ) = inf{KL(νa , ν) : ν ∈ D, Eν [Y ] > µ? }.
Upper bounds (Cappe, Garivier, M., Munos, Stoltz 2013)
KL-UCB achieves for specific D (e.g. 1-dim exponential families):
RT 6
Odalric-Ambrym Maillard
µ? − µa
log(T ) + 9 log(T )2/3 .
K
(ν
,
µ
)
a
?
inf
a∈A
X
Bandits
9 / 40
The
KL-UCB algorithm
.
Pull arm At+1 that maximizes the upper bound
sup Eν [Y ] : ν ∈ D, KL ΠD νba,Na (t) ,ν 6
f (t)
Na (t)
where
1
n
Pn
m=1 δXa,m ,
Na (t) =
Pt
I
νba,n =
I
ΠD : M1 (R) → D projection on D.
I
f : N → R non-decreasing. .
s=1 I{As
= a}.
Build empirical distribution for arm a.
Odalric-Ambrym Maillard
Bandits
10 / 40
The
KL-UCB algorithm
.
Pull arm At+1 that maximizes the upper bound
sup Eν [Y ] : ν ∈ D, KL ΠD νba,Na (t) ,ν 6
f (t)
Na (t)
where
1
n
Pn
m=1 δXa,m ,
Na (t) =
Pt
I
νba,n =
I
ΠD : M1 (R) → D projection on D.
I
f : N → R non-decreasing. .
s=1 I{As
= a}.
Project it on D.
Odalric-Ambrym Maillard
Bandits
10 / 40
The
KL-UCB algorithm
.
Pull arm At+1 that maximizes the upper bound
sup Eν [Y ] : ν ∈ D, KL ΠD νba,Na (t) ,ν 6
f (t)
Na (t)
where
1
n
Pn
m=1 δXa,m ,
Na (t) =
Pt
I
νba,n =
I
ΠD : M1 (R) → D projection on D.
I
f : N → R non-decreasing. Threshold.
s=1 I{As
= a}.
Consider all ν ∈ D close to this distribution.
Odalric-Ambrym Maillard
Bandits
10 / 40
The
KL-UCB algorithm
.
Pull arm At+1 that maximizes the upper bound
sup Eν [Y ] : ν ∈ D, KL ΠD νba,Na (t) ,ν 6
f (t)
Na (t)
where
1
n
Pn
m=1 δXa,m ,
Na (t) =
Pt
I
νba,n =
I
ΠD : M1 (R) → D projection on D.
I
f : N → R non-decreasing. .
s=1 I{As
= a}.
Compute the maximal mean of these candidate distributions.
Odalric-Ambrym Maillard
Bandits
10 / 40
The
KL-UCB+ algorithm
.
Pull arm At+1 that maximizes the upper bound
sup Eν [Y ] : ν ∈ D, KL ΠD νba,Na (t) ,ν 6
f t/Na (t) Na (t)
where
1
n
Pn
m=1 δXa,m ,
Na (t) =
Pt
I
νba,n =
I
ΠD : M1 (R) → D projection on D.
I
f : N → R non-decreasing.
Odalric-Ambrym Maillard
Bandits
s=1 I{As
= a}.
11 / 40
From
Regret to Boundary Crossing Probabilities
.
(a)
The regret is given by
z
}|
(b)
{
log(T )
RT =
(µ? −µa )E[Na (T )] 6
(µ? −µa )
+ 9 log(T )2/3 .
K
(ν
,
µ
)
?
inf a
a∈A
a∈A
X
z
X
}|
{
Theorem
Let 0 < ε < min{µ? − µa , a ∈ A}. Then, the number of pulls of a
sub-optimal arm a ∈ A by Algorithm KL-UCB satisfies
E NT (a) 6 2 +
T
n
X
P Kinf (ΠD (νba,n ), µ? − ε) 6 f (T )/n
o
n=1
|
+
T
−1
X
{z
(a) Easy term, via Sanov
o
Kinf ΠD (νba? ,Na? (t) ), µ? − ε > f (t)/Na? (t)
n
P
}
.
t=|A|
|
Odalric-Ambrym Maillard
{z
(b) Difficult: Boundary Crossing Probability
Bandits
}
12 / 40
From
Regret to Boundary Crossing Probabilities
.
(a)
The regret is given by
z
}|
(b)
{
log(T )
RT =
(µ? −µa )E[Na (T )] 6
(µ? −µa )
+ 9 log(T )2/3 .
K
(ν
,
µ
)
?
inf a
a∈A
a∈A
X
X
z
}|
{
Theorem
Let 0 < ε < min{µ? − µa , a ∈ A}. Likewise, the number of pulls
of a sub-optimal arm a ∈ A by Algorithm KL-UCB+ satisfies
E NT (a) 6 2 +
T
n
X
P Kinf (ΠD (νba,n ), µ? − ε) 6 f (T /n)/n
o
n=1
|
+
T
−1
X
{z
}
(a) Easy term, via Sanov
o
Kinf ΠD (νba? ,Na? (t) ), µ? − ε > f (t/Na? (t))/Na? (t)
n
P
.
t=|A|
|
Odalric-Ambrym Maillard
{z
(b) Difficult: Boundary Crossing Probability
Bandits
}
12 / 40
Exponential
families
.
E parametric family of distributions of the form
where
νθ (x ) = exp hθ, F (x )i − ψ(θ) ν0 (x ) ,
I
F : X → RK , ν0 ∈ P(X ): reference function/measure.
I
θ ∈ ΘE =
I
ψ(θ) = log
def
n
o
θ ∈ RK ; ψ(θ) < ∞ natural parameter.
Z
def
X
exp hθ, F (x )i ν0 (dx ): log-partition function.
Example
I
Bernoulli: X = {0, 1}, F : x → x (K = 1), ΘD = R,
ψ(θ) = log(1 + e θ ). B(µ): parameter θ = log(µ/(1 − µ)).
I
Gaussian: X = R, F : x → (x , x 2 )(K = 2), ΘD = R × R−
?,
θ2
ψ(θ) = − 4θ12 + 12 log
Odalric-Ambrym Maillard
−π
θ2
−1
. N (µ, σ 2 ): parameter θ = ( σµ2 , 2σ
2 ).
Bandits
13 / 40
Basic
properties of regular exponential families
.
I
Eνθ (F (X )) = ∇ψ(θ)
Odalric-Ambrym Maillard
Bandits
14 / 40
Basic
properties of regular exponential families
.
I
Eνθ (F (X )) = ∇ψ(θ)
I
KL(νθ , νθ0 ) = B ψ (θ, θ0 ) = hθ| −
θ0 , ∇ψ(θ)i − ψ(θ) + ψ(θ0 ).
{z } | {z }
| {z }
Bregman div.
Odalric-Ambrym Maillard
natural
Bandits
dual
14 / 40
Basic
properties of regular exponential families
.
I
Eνθ (F (X )) = ∇ψ(θ)
I
KL(νθ , νθ0 ) = B ψ (θ, θ0 ) = hθ| −
θ0 , ∇ψ(θ)i − ψ(θ) + ψ(θ0 ).
{z } | {z }
| {z }
Bregman div.
I
natural
dual
Estimation for regular family:
Odalric-Ambrym Maillard
Bandits
14 / 40
Basic
properties of regular exponential families
.
I
Eνθ (F (X )) = ∇ψ(θ)
I
KL(νθ , νθ0 ) = B ψ (θ, θ0 ) = hθ| −
θ0 , ∇ψ(θ)i − ψ(θ) + ψ(θ0 ).
{z } | {z }
| {z }
natural
Bregman div.
I
dual
Estimation for regular family:
I
Odalric-Ambrym Maillard
Let {Xi }i6n ∼ νθ? , νbn =
1
n
Pn
Bandits
i=1 δXi .
14 / 40
Basic
properties of regular exponential families
.
I
Eνθ (F (X )) = ∇ψ(θ)
I
KL(νθ , νθ0 ) = B ψ (θ, θ0 ) = hθ| −
θ0 , ∇ψ(θ)i − ψ(θ) + ψ(θ0 ).
{z } | {z }
| {z }
natural
Bregman div.
I
dual
Estimation for regular family:
I
I
Pn
Let {Xi }i6n ∼ νθ? , νbn = n1 i=1 δXi .
Pn
def
(F (X )) ∈ ∇ψ(ΘD )
Fbn = n1 i=1 F (Xi ) = Eb
νn
In the sequel, we would like to have
ΠE (νba? ,n ) = νθb for some θbn ∈ ΘE .
n
Odalric-Ambrym Maillard
Bandits
14 / 40
Technical
assumption: Well-defined empirical parameter θbn
.
We assume that θ? ∈ Θ and Fbn ∈ ∇ψ(Θρ ) where
I
I
Θ is convex
Neighborhood Θ2ρ ⊂ Θ̊I where
I
I
Θ2ρ = n{θ : inf θ0 ∈Θ ||θ − θ0 || < 2ρ}
o
def
ΘI = θ ∈ ΘE ; 0 < λMIN (∇2 ψ(θ)) 6 λMAX (∇2 ψ(θ)) < ∞ ,
with λMIN/MAX (M): min/max eigenvalues of a sdp matrix M.
I
2ρ < ρ? = d(θ? , ΘcI )
θ? ∈ Θ ⊂ Θ2ρ ⊂ Θ̊I ⊂ ΘE
Then we can form ΠE (νba? ,n ) = νθb with θbn ∈ Θρ and
n
Kinf ΠD (νba? ,n ), µ? − ε = inf{B ψ (θbn , θ) : Eνθ [X ] > µ? − ε}
Odalric-Ambrym Maillard
Bandits
15 / 40
Boundary crossing probabilities
for general exponential families
What
it is about
.
n
o
Pθ? Kinf ΠE (νba? ,Na? (t) ), µ? − ε > f (t/Na? (t))/Na? (t) 6
P
θ?
t
n[
inf{B ψ (θbn , θ) : Eνθ [X ] > µ? − ε}> f (t/n)/n
o
n=1
E.g. for canonical exp. family of dimension 1:
Unique θ = θε? ∈ R such that Eνθ [X ] = µ? − ε, thus:
Pθ ?
t
n[
B ψ (θbn , θε? ) > f (t/n)/n
o
n=1
We want this to be o(1/t) for well-chosen f .
Odalric-Ambrym Maillard
Bandits
17 / 40
Known
results
.
Theorem (Cappé, Garivier, M., Munos, Stoltz 2013)
For canonical (F (x ) = x ∈ X ) exp. families of dimension K = 1:
Pθ ?
t
n[
o
B ψ (θbn , θ? ) > f t /n 6 edf (t) log(t)ee −f (t) .
n=1
For f (x ) = log(x ) + ξ log log(x ), the bound is O( log
A similar result is shown for discrete distributions.
2−ξ
t
(t)
).
Limitations
I
Requires ξ > 2, while experiments shows ξ 6 0 still works
(and even better).
I
Only for threshold f (t)/n, does not work for f (t/n)/n.
I
Consider θ? : does not take advantage of ε > 0.
Odalric-Ambrym Maillard
Bandits
18 / 40
Forgetted
result
.
Theorem (Lai, 1988; exponential family of dimension K )
For γ ∈ (0, 1) let Cγ (θ) = {θ0 ∈ RK : hθ0 , θi > γ|θ||θ0 |}. Then, for
f (x ) = log(x ) + ξ log log(x ) it holds for all θ† ∈ Θρ such that
|θ† − θ? |2 > δt , where δt → 0, tδt → ∞ as t → ∞,
P
t
n[
θ?
o
θbn ∈ Θρ ∩B ψ (θbn , θ† ) > f t/n /n ∩∇ψ(θbn )−∇ψ(θ† ) ∈ Cγ (θ† −θ? )
n=1
t→∞
=
O
†
|θ − θ? |−2
t
Pros
log
−ξ−1+K /2
†
? 2
(t|θ − θ | ) .
Cons
I
ξ > K /2 − 1 is enough.
I
Blue term
I
For f (t/n)
I
Asymptotic
"Boundary crossing problems for sample means", 1988.
Odalric-Ambrym Maillard
Bandits
19 / 40
Illustrative
experiment on Bernoulli distributions
.
Regret as function of time for KL-UCB and KL-UCB+
(mean and quantiles).
Odalric-Ambrym Maillard
Bandits
20 / 40
This
talk
.
I
Explain the proof technique for this result:
1
2
3
4
5
I
Peeling (standard) and covering
Change of measure argument
Localization + change of measure
Concentration (standard)
Fine tuning (gain log2 factor)
Show that Lai’s theorem can be made non-asymptotic, and
used to control the general case.
Odalric-Ambrym Maillard
Bandits
21 / 40
Analysis and proof technique.
(up to some details for presentation purpose)
Pθ?
nS
16n6t
θbn ∈ Θρ ∩ Kinf (ΠE (νba? ,n ), µ? − ε) > f (t/n)/n
|
{z
En
}
o
I.
. Peeling
Let n0 = 1 < n1 < · · · < nIt = t + 1 for some It ∈ N? .
Pθ?
n [
16n6t
o
En 6
IX
t −1
n
Pθ ?
i=0
[
o
En .
ni 6n<ni+1
For some b > 1 and β ∈ (0, 1), use:
b i if i < I def
t = dlogb (βt + β)e
ni =
t + 1 if i = It ,
(valid sequence since nIt −1 6 b logb (βt+β) < t + 1 = nIt )
Odalric-Ambrym Maillard
Bandits
23 / 40
II.
. Cone covering
1
First:
Kinf (ΠE (νba? ,n ), µ? − ε)
= inf{B ψ (θbn , θ) : θ ∈ ΘE , µθ > µ? − ε}
= inf{B ψ (θbn , θ? − ∆) : θ? − ∆ ∈ ΘE , µθ? −∆ > µ? − ε}
2
Then
B ψ (θbn , θ? − ∆)
b θbn )i .
= B ψ (θbn , θ? ) −B ψ (θ? − ∆, θ? ) − h∆, ∇ψ(θ? − ∆) − ∇ψ(
|
Odalric-Ambrym Maillard
{z
shift
Bandits
}
24 / 40
II.
. Cone covering
Consider half-cone Cγ (∆) = θ0 ∈ RK : hθ0 , ∆i > γ||θ0 ||||∆|| ,
Cγ,K
then set {∆c }16c6Cγ,K of directions s.t. RK =
[
Cγ (∆c ).
c=1
I
Cγ (∆) invariant by rescaling of ∆: We can choose ∆c small
enough so that θ? − ∆c ∈ Θρ .
I
Cγ,1 = 2, Cγ,K → ∞ when γ → 1.
Odalric-Ambrym Maillard
Bandits
25 / 40
II.
. Cone covering
•
Kinf (ΠE (νba? ,n ), µ? − ε)
= inf{B ψ (θbn , θ? − ∆) : θ? − ∆ ∈ ΘE , µθ? −∆ > µ? − ε}
6 B ψ (θbn , θ? − ∆c ) ∀θ? − ∆c ∈ Θρ , µθ? −∆c > µ? − ε, c ∈ [Cγ,K ]
Cγ,K
•
Thus Pθ?
n
o
[
En 6 min
c
ni 6n<ni+1
where
En,c,c 0
=
X
c 0 =1
n
Pθ ?
[
En,c,c 0
o
ni 6n<ni+1
θbn ∈ Θρ ∩ ∇ψ(θ? − ∆c ) − Fbn ∈ Cγ (∆0c )
f (t/n)
∩ B (θbn , θ? − ∆c ) >
ψ
n
En,c,c 0 is almost the event appearing in Lai’s theorem.
Odalric-Ambrym Maillard
Bandits
26 / 40
Recap
of steps I. II.
.
Pθ?
n [
16n6t
Odalric-Ambrym Maillard
o
En 6
IX
t −1
i=0
Cγ,K
min
c
X
n
Pθ ?
c 0 =1
Bandits
[
o
En,c,c 0 .
ni 6n<ni+1
27 / 40
III.
. Change of measure
Lemma (Change of measure)
If n → nf (t/n) is non-decreasing and θ? − ∆c ∈ Θρ , then
n
Pθ ?
[
ni 6n<ni+1
En,c,c 0
o
6 C exp
ni
− vρ δc2 − γδc vρ
2
×Pθ? −∆c
n
[
s
2ni f (t/ni )
Vρ
o
En,c,c 0 ,
ni 6n<ni+1
where vρ = inf θ∈Θρ λMIN (∇2 ψ(θ)), Vρ = supθ∈Θρ λMAX (∇2 ψ(θ)),
δc = ||∆c ||.
Concentration of B ψ (θbn , θ? − ∆c ) for distribution νθ? vs
concentration of B ψ (θbn , θ? − ∆c ) for distribution νθ? −∆c .
Odalric-Ambrym Maillard
Bandits
28 / 40
III.
. Change of measure: proof details
I
dPθ?
dPθ? −∆c
for n i.i.d. variables look at the ratio
=
Πnk=1 νθ? (Xk )
Πnk=1 νθ? −∆c (Xk )
=
?
?
b
exp nh∆c , Fa? ,n i − n ψ(θ ) − ψ(θ − ∆c )
= exp
?
− n h∆c , ∇ψ(θ − ∆c ) − Fba? ,n i −n B ψ (θ? − ∆c , θ? )
|
{z
}
|
{z
}
(a)
Odalric-Ambrym Maillard
Bandits
.
(b)
29 / 40
III.
. Change of measure: proof details
I
dPθ?
dPθ? −∆c
for n i.i.d. variables look at the ratio
=
exp
?
− n h∆c , ∇ψ(θ − ∆c ) − Fba? ,n i −n B ψ (θ? − ∆c , θ? )
|
{z
}
{z
}
|
(a)
I
.
(b)
For (b):
1
B ψ (θ? − ∆c , θ? ) > vρ δc2 .
2
Odalric-Ambrym Maillard
Bandits
29 / 40
III.
. Change of measure: proof details
I for n i.i.d. variables look at the ratio
dPθ?
dPθ? −∆c
I
I
=
exp
− n h∆c , ∇ψ(θ? − ∆c ) − Fba? ,n i −n B ψ (θ? − ∆c , θ? )
|
{z
}
(a)
|
{z
(b)
.
}
For (b):
1
B ψ (θ? − ∆c , θ? ) > vρ δc2 .
2
For (a):
(up to going from ∆c to ∆0c )
h∆c 0 , ∇ψ(θ? − ∆c ) − Fbn i > γ||∆c 0 ||||∇ψ(θ? − ∆c ) − Fbn ||
> γδc 0 vρ ||θbn? − θ? + ∆c ||
s
2 ψ b ?
B ( θn , θ − ∆ c )
Vρ
s
2f (t/n)
.
Vρ n
> γδc 0 vρ
> γδc 0 vρ
Odalric-Ambrym Maillard
Bandits
29 / 40
IV.
. Localization
I
By construction:
∇ψ(θ? − ∆c ) − Fbn ∈ Cp (∆c 0 ) ⇔
h∆c 0 , ∇ψ(θ? − ∆c ) − Fbn i > γ||∆c 0 ||||∇ψ(θ? − ∆c ) − Fbn || ,
Thus, we look at ||∇ψ(θ? − ∆c ) − Fbn ||.
I
Decomposition, for some constant εt,i,c > 0
n
Pθ? −∆c
[
En,c
o
ni 6n<ni+1
6 Pθ? −∆c
n
[
En,c ∩ ||∇ψ(θ? − ∆c ) − Fbn || < εt,i,c
o
ni 6n<ni+1
|
{z
(a) Requires some care
+ Pθ? −∆c
n
[
}
En,c ∩ ||∇ψ(θ? − ∆c ) − Fbn || > εt,i,c
o
.
ni 6n<ni+1
|
Odalric-Ambrym Maillard
{z
(b) "Standard" concentration
Bandits
}
30 / 40
V.
. Localized change of measure and concentration
Lemma (Concentration (admitted))
Under some technical assumption,
(b) 6 2K exp
r
For εt,i,c =
2KV2ρ ni+1 f (t/ni+1 )
,
ni2
−
ni2 ε2t,i,c
2KV2ρ ni+1
then (b) 6 2K exp(−f (t/ni+1 )).
Lemma (Localized change of measure)
For the εt,i,c considered, we have
(a) 6 2Cb,ρ exp
− f (t/ni+1 ) f (t/ni+1 )K /2 ,
for an explicit constant Cb,ρ = max
Odalric-Ambrym Maillard
n
Bandits
2
8K 2 b 2 V2ρ
2KbV2ρ
,
4,
2
2
ρ vρ
(K +2)vρ2
oK /2
.
31 / 40
Recap
.
Pθ ?
n [
θbn ∈ Θρ ∩ Kinf (ΠE (νba? ,n ), µ? − ε) > f (t/n)/n
o
16n6t
Cγ,K It −1
X X
6
|Ce
c=1 i=0
−
ni
2
vρ δc2 −γδc vρ
q
2ni f (t/ni )
Vρ
{z
Change of measure
}
| {z }
Peeling+Covering
×e
|
−f (t/ni+1 )
2CCb,ρ f (t/ni+1 )
K /2
{z
+ 2K
Localized change of measure + Concentration
}
End of the proof:
I
"Change of measure" term kills ni & δc−2 / log(tδc2 ) terms
I
Remains e −f (tδc ) f (tδc2 )K /2 / log(tδc2 ) '
Odalric-Ambrym Maillard
2
Bandits
1
tδc2
log(tδc2 )−ξ+K /2−1
32 / 40
Illustrative
experiment on Bernoulli distributions
.
Regret as function of time for KL-UCB and KL-UCB+
(mean and quantiles).
Odalric-Ambrym Maillard
Bandits
33 / 40
Back
to V.(a) Localized change of measure
.
n
Pθ? −∆c
[
En,c ∩ ||∇ψ(θ? − ∆c ) − Fbn || < εt,i,c
o
ni 6n<ni+1
def
I
Let θc? = θ? − ∆c ∈ Θρ
I
En,c ⊂ {θbn ∈ Θρ ∩ B ψ (θbn , θc? ) > f (t/n)/n}.
Odalric-Ambrym Maillard
Bandits
34 / 40
Back
to V.(a) Localized change of measure: idea
.
Change of measure from Pθc? to the mixture QB (·) =
where
2εt,i,c
B = Θ2ρ ∩ Ball(θc? ,
).
vρ
εt,i,c
vρ , ρ})
I
We have Ball(θbn , min{
I
Study for n iid points the ratio
dPθ
dQB
"Z
=
θ0 ∈B
=
θ0 ∈B
exp
θ0 ∈B
Pθ0 (·)
⊂ B ⊂ Θ2ρ .
Πnk=1 νθ0 (Xk ) 0
dθ
Πnk=1 νθ (Xk )
"Z
R
#−1
#−1
0
b
nhθ − θ, Fa? ,n i − n ψ(θ ) − ψ(θ) dθ0
|
{z
}
0
h(θ0 )
Odalric-Ambrym Maillard
Bandits
35 / 40
Back
to V.(a) Localized change of measure: idea
.
Studying h, it comes for θ0 ∈ Θ2ρ
n
h(θ0 ) > nB ψ (θbn , θ) − (θ0 − θbn )0 ∇2 ψ(θ̃)(θ0 − θbn )
2
n 0 b 2
> f (t/n) − ||θ − θn || V2ρ (convexity of Θ2ρ )
2
Thus using that Ball(θbn , min{
dPθ
dQB
"Z
6
6 e
Odalric-Ambrym Maillard
εt,i,c
vρ , ρ})
⊂B
n
exp f (t/n) − ||θ0 − θbn ||2 V2ρ dθ0
2
θ0 ∈B
−f (t/ni+1 )
"Z
x ∈Ball(0,min{ρ,
Bandits
εt,i,c
vρ
exp
})
#−1
ni+1
−
||x ||2 V2ρ dx
2
#−1
,
36 / 40
Back
to V.(a) Localized change of measure: idea
.
Thus for our event of interest Ω,
n o
Pθc? Ω 6 e
−f (t/ni+1 )
"Z
Ball(0,min{ρ,
|
εt,i,c
vρ
e
−
ni+1
|x |2 V2ρ
2
#−1
Z
dx
})
{z
Volume of a Gaussian ball
}
|B
n o
Pθ0 Ω dθ0 ,
{z
6|B|
}
And after some easy computations
n o
Pθc? Ω 6 2 exp
r
We conclude using εt,i,c =
Odalric-Ambrym Maillard
n
− f (t/ni+1 ) min ρ2 vρ2 , ε2t,i,c ,
2KV2ρ ni+1 f (t/ni+1 )
.
ni2
Bandits
(K + 2)vρ2 o−K /2 K K
2 εt,i,c .
Kni+1 V2ρ
37 / 40
Conclusion
.
For a general exponential family E, known, we can*
I Exhibit a concentration of measure for the Bregman
divergence induced by the family:
I
I
I
I
Get fine understanding of the threshold
f (x ) = log(x ) + ξ log log(x )
I
I
I
in finite time.
for all dimension K .
powerful proof technique.
improve over existing work: ξ > K /2 − 1 is enough + hold for
all K + KL-UCB+ as well.
closely follows what is observed in experiments
. . . show this was known 30 years ago by Lai.
. . . but we still need to know the family E.
* we skipped some regularity restrictions on E.
Odalric-Ambrym Maillard
Bandits
38 / 40
Teaser...
.
Regret as function of time for KL-UCB, KL-UCB+ and BESA
(mean and quantiles)
BESA does not know the family E.
"Sub-sampling for the multi-armed bandit", A. Baransi et al. 2014
Odalric-Ambrym Maillard
Bandits
39 / 40
I
Looking for Interns, PhD candidates, collaborations...
Thank you
© Copyright 2026 Paperzz