
Introduction to Bandits:
Algorithms and Theory
Jean-Yves Audibert (1, 2) & Rémi Munos (3)
1. Université Paris-Est, LIGM, Imagine
2. CNRS/École Normale Supérieure/INRIA, LIENS, Sierra
3. INRIA Sequential Learning team, France
ICML 2011, Bellevue (WA), USA
Outline
- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set
    - linear bandits
    - Lipschitz bandits
    - tree bandits
- Extensions
Bandit game
Parameters available to the forecaster:
the number of arms (or actions) K and the number of rounds n
Unknown to the forecaster: the way the gain vectors $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0,1]^K$ are generated

For each round $t = 1, 2, \dots, n$:
  1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
  2. the forecaster receives the gain $g_{I_t,t}$
  3. only $g_{I_t,t}$ is revealed to the forecaster
Cumulative regret goal: maximize the cumulative gains obtained.
More precisely, minimize
$$R_n = \max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t} \;-\; \mathbb{E}\sum_{t=1}^n g_{I_t,t}$$
where the expectation comes from both a possible stochastic generation of the gain vector and a possible randomization in the choice of $I_t$.
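To make the protocol concrete, here is a minimal simulation loop in Python. The `Policy` interface (`select_arm` / `update`) and the function `play_bandit_game` are illustrative assumptions, not part of the slides; any forecaster from the following slides can be plugged in. The function returns the realized regret of one run against the best fixed arm (averaging over many runs approximates $R_n$).

```python
class Policy:
    """Illustrative forecaster interface: pick an arm, then observe only its own gain."""
    def select_arm(self, t):
        raise NotImplementedError
    def update(self, arm, gain):
        raise NotImplementedError

def play_bandit_game(policy, draw_gain_vector, n):
    """Run the bandit protocol for n rounds.

    draw_gain_vector(t) returns the full gain vector (g_{1,t}, ..., g_{K,t});
    only the coordinate of the chosen arm is revealed to the policy.
    """
    received = 0.0
    gain_vectors = []                 # kept only to compute the regret afterwards
    for t in range(1, n + 1):
        g = draw_gain_vector(t)
        arm = policy.select_arm(t)
        policy.update(arm, g[arm])
        received += g[arm]
        gain_vectors.append(g)
    K = len(gain_vectors[0])
    best_fixed_arm_total = max(sum(g[i] for g in gain_vectors) for i in range(K))
    return best_fixed_arm_total - received
```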
Stochastic and adversarial environments

- Stochastic environment: the gain vector $g_t$ is sampled from an unknown product distribution $\nu_1 \otimes \dots \otimes \nu_K$ on $[0,1]^K$, that is $g_{i,t} \sim \nu_i$.
- Adversarial environment: the gain vector $g_t$ is chosen by an adversary (which, at time $t$, knows all the past, but not $I_t$).
Numerous variants
- different environments: adversarial, "stochastic", non-stationary
- different targets: cumulative regret, simple regret, tracking the best expert
- continuous or discrete set of actions
- extensions with additional rules: varying set of arms, pay-per-observation, ...
Various applications
- Clinical trials (Thompson, 1933)
- Ads placement on webpages
- Nash equilibria (traffic or communication networks, agent simulation, phantom tic-tac-toe, ...)
- Game-playing computers (Go, Urban Rivals, ...)
- Packet routing, itinerary selection
- ...
Stochastic bandit game
(Robbins, 1952)
Parameters available to the forecaster: $K$ and $n$
Parameters unknown to the forecaster: the reward distributions $\nu_1, \dots, \nu_K$ of the arms (with respective means $\mu_1, \dots, \mu_K$)

For each round $t = 1, 2, \dots, n$:
  1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
  2. the environment draws the gain vector $g_t = (g_{1,t}, \dots, g_{K,t})$ according to $\nu_1 \otimes \dots \otimes \nu_K$
  3. the forecaster receives the gain $g_{I_t,t}$

Notation: $i^* = \arg\max_{i=1,\dots,K}\mu_i$, $\mu^* = \max_{i=1,\dots,K}\mu_i$, $\Delta_i = \mu^* - \mu_i$, $T_i(n) = \sum_{t=1}^n \mathbf{1}_{I_t = i}$

Cumulative regret: $\hat R_n = \sum_{t=1}^n g_{i^*,t} - \sum_{t=1}^n g_{I_t,t}$

Goal: minimize the expected cumulative regret
$$R_n = \mathbb{E}\hat R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n g_{I_t,t} = n\mu^* - \mathbb{E}\sum_{i=1}^K T_i(n)\,\mu_i = \sum_{i=1}^K \Delta_i\,\mathbb{E}T_i(n)$$
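As a hedged illustration of the regret decomposition $R_n = \sum_i \Delta_i\,\mathbb{E}T_i(n)$, the sketch below defines a Bernoulli environment and computes the pseudo-regret of a run from the arm-pull counts $T_i(n)$; the class and function names are ours, not from the slides.

```python
import random

class BernoulliBandit:
    """Stochastic environment: arm i pays 1 with probability means[i], else 0."""
    def __init__(self, means, seed=0):
        self.means = list(means)
        self.rng = random.Random(seed)

    def pull(self, arm):
        return 1.0 if self.rng.random() < self.means[arm] else 0.0

def pseudo_regret(means, counts):
    """R_n = sum_i Delta_i * T_i(n), computed from the arm-pull counts."""
    mu_star = max(means)
    return sum((mu_star - mu) * t for mu, t in zip(means, counts))

# Example: pulling each arm of a two-armed bandit 500 times gives
# pseudo-regret 500 * (0.6 - 0.5) = 50 for the suboptimal arm.
print(pseudo_regret([0.6, 0.5], [500, 500]))  # 50.0
```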
A simple policy: ε-greedy
- Playing the arm with the highest empirical mean does not work
- ε-greedy: at time $t$,
  - with probability $1-\varepsilon_t$, play the arm with the highest empirical mean
  - with probability $\varepsilon_t$, play a random arm
  (a code sketch follows below)
- Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002):
  - Let $\Delta = \min_{i:\Delta_i>0}\Delta_i$ and consider $\varepsilon_t = \min\big(\tfrac{6K}{\Delta^2 t}, 1\big)$
  - When $t \ge \tfrac{6K}{\Delta^2}$, the probability of choosing a suboptimal arm $i$ is bounded by $\tfrac{C}{\Delta^2 t}$ for some constant $C > 0$
  - As a consequence, $\mathbb{E}[T_i(n)] \le \tfrac{C}{\Delta^2}\log n$ and $\mathbb{E}R_n \le \sum_{i:\Delta_i>0}\tfrac{C\Delta_i}{\Delta^2}\log n$ → logarithmic regret
- Drawbacks:
  - requires knowledge of $\Delta$
  - outperformed by the UCB policy in practice
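A minimal sketch of the ε-greedy policy above, using the illustrative `Policy` interface from the protocol sketch; the schedule $\varepsilon_t = \min(6K/(\Delta^2 t), 1)$ follows the slide and therefore requires the gap $\Delta$ as an input.

```python
import random

class EpsilonGreedy:
    """epsilon-greedy with the decreasing schedule eps_t = min(6K / (Delta^2 * t), 1)."""
    def __init__(self, n_arms, gap_delta, seed=0):
        self.K = n_arms
        self.delta = gap_delta            # Delta = min_{i: Delta_i > 0} Delta_i (must be known)
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select_arm(self, t):
        eps_t = min(6 * self.K / (self.delta ** 2 * t), 1.0)
        if self.rng.random() < eps_t or 0 in self.counts:
            return self.rng.randrange(self.K)                   # explore
        means = [s / c for s, c in zip(self.sums, self.counts)]
        return max(range(self.K), key=means.__getitem__)        # exploit

    def update(self, arm, gain):
        self.counts[arm] += 1
        self.sums[arm] += gain
```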
Optimism in face of uncertainty
- At time $t$, from past observations and some probabilistic argument, you have an upper confidence bound (UCB) on the expected rewards.
- Simple implementation: play the arm having the largest UCB!
Why does it make sense?
- Could we stay a long time drawing a wrong arm? No, since:
  - the more we draw a wrong arm $i$, the closer the UCB gets to the expected reward $\mu_i$,
  - $\mu_i < \mu^* \le$ UCB on $\mu^*$
Illustration of UCB policy
(figure)

Confidence intervals vs sampling times
(figure)
Hoeffding-based UCB
(Auer, Cesa-Bianchi, Fischer, 2002)

- Hoeffding's inequality: let $X, X_1, \dots, X_m$ be i.i.d. random variables taking their values in $[0,1]$. For any $\varepsilon > 0$, with probability at least $1-\varepsilon$, we have
$$\mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}}$$
- UCB1 policy: at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}}\left\{\hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right\},
\qquad\text{where } \hat\mu_{i,t-1} = \frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s}$$
  (a code sketch follows below)
- Regret bound:
$$R_n \le \sum_{i\ne i^*}\min\left(\frac{10}{\Delta_i}\log n,\; n\Delta_i\right)$$
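A minimal sketch of UCB1, again using the illustrative `Policy` interface; the exploration constant is exposed as $\rho$ (with $\rho = 2$ recovering the UCB1 index above, other values giving the UCB1($\rho$) variant discussed a few slides later).

```python
import math

class UCB1:
    """UCB1(rho): index = empirical mean + sqrt(rho * log(t) / T_i(t-1)); rho = 2 is UCB1."""
    def __init__(self, n_arms, rho=2.0):
        self.K = n_arms
        self.rho = rho
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select_arm(self, t):
        for i in range(self.K):                 # play each arm once to initialize its index
            if self.counts[i] == 0:
                return i
        def index(i):
            mean = self.sums[i] / self.counts[i]
            return mean + math.sqrt(self.rho * math.log(t) / self.counts[i])
        return max(range(self.K), key=index)

    def update(self, arm, gain):
        self.counts[arm] += 1
        self.sums[arm] += gain
```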
Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002), continued

- UCB1 is an anytime policy (it does not need to know $n$ to be implemented)
Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002), continued

- UCB1 corresponds to $2\log t = \frac{\log(\varepsilon^{-1})}{2}$, hence $\varepsilon = 1/t^4$
- Critical confidence level $\varepsilon = 1/t$ (Lai & Robbins, 1985; Agrawal, 1995; Burnetas & Katehakis, 1996; Audibert, Munos, Szepesvári, 2009; Honda & Takemura, 2010)
Better confidence bounds imply smaller regret
- Hoeffding's inequality, $\tfrac{1}{t}$-confidence bound:
$$\mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log t}{2m}}$$
- Bernstein's inequality, $\tfrac{1}{t}$-confidence bound:
$$\mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(t)\,\mathrm{Var}X}{m}} + \frac{\log t}{3m}$$
- Empirical Bernstein's inequality, $\tfrac{1}{t}$-confidence bound:
$$\mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(t)\,\widehat{\mathrm{Var}}X}{m}} + \frac{8\log t}{3m}$$
  (Audibert, Munos, Szepesvári, 2009; Maurer, 2009; Audibert, 2010) — a code sketch of the corresponding index follows below
- Asymptotic confidence bound leads to catastrophe:
$$\mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\widehat{\mathrm{Var}}X}{m}}\;x
\quad\text{with } x \text{ s.t. } \int_x^{+\infty}\frac{e^{-u^2/2}}{\sqrt{2\pi}}\,du = \frac{1}{t}$$
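A hedged sketch of a UCB index built from the empirical Bernstein bound above (in the spirit of UCB-V); the class name and the plain $\log t$ confidence level are our illustrative choices.

```python
import math

class EmpiricalBernsteinUCB:
    """Index = mean + sqrt(2 * Vhat * log(t) / m) + 8 * log(t) / (3 * m)."""
    def __init__(self, n_arms):
        self.K = n_arms
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.sq_sums = [0.0] * n_arms    # sums of squared rewards, for the empirical variance

    def select_arm(self, t):
        for i in range(self.K):
            if self.counts[i] == 0:
                return i
        def index(i):
            m = self.counts[i]
            mean = self.sums[i] / m
            var_hat = max(self.sq_sums[i] / m - mean ** 2, 0.0)
            log_t = math.log(t)
            return mean + math.sqrt(2 * var_hat * log_t / m) + 8 * log_t / (3 * m)
        return max(range(self.K), key=index)

    def update(self, arm, gain):
        self.counts[arm] += 1
        self.sums[arm] += gain
        self.sq_sums[arm] += gain ** 2
```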
Better confidence bounds imply smaller regret
Hoeffding-based UCB:
- confidence bound: $\mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}}$
- regret: $R_n \le \sum_{i\ne i^*}\min\big(\frac{c}{\Delta_i}\log n,\; n\Delta_i\big)$

Empirical Bernstein-based UCB:
- confidence bound: $\mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(\varepsilon^{-1})\,\widehat{\mathrm{Var}}X}{m}} + \frac{8\log(\varepsilon^{-1})}{3m}$
- regret: $R_n \le \sum_{i\ne i^*}\min\Big(c\big(\frac{\mathrm{Var}\,\nu_i}{\Delta_i} + 1\big)\log n,\; n\Delta_i\Big)$
Tuning the exploration: simple vs difficult bandit problems
- UCB1(ρ) policy: at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}}\left\{\hat\mu_{i,t-1} + \sqrt{\frac{\rho\log t}{T_i(t-1)}}\right\}$$

(figures: expected regret of UCB1(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms; left: Dirac(0.6) and Ber(0.5), right: Ber(0.6) and Ber(0.5))
Tuning the exploration parameter: from theory to practice
- Theory:
  - for ρ < 0.5, UCB1(ρ) has polynomial regret
  - for ρ > 0.5, UCB1(ρ) has logarithmic regret
- Practice: ρ = 0.2 seems to be the best default value for $n < 10^8$

(figures: expected regret of UCB1(ρ) as a function of ρ, for n = 1000; left: K = 3 arms, Ber(0.6), Ber(0.5) and Ber(0.5); right: K = 5 arms, Ber(0.7), Ber(0.6), Ber(0.5), Ber(0.4) and Ber(0.3))
Deviations of UCB1 regret
- UCB1 policy: at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}}\left\{\hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right\}$$
- An inequality of the form $P(\hat R_n > \mathbb{E}\hat R_n + \gamma) \le c\,e^{-c\gamma}$ does not hold!
- If the smallest reward observable from the optimal arm is smaller than the mean reward of the second optimal arm, then the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \ge 2$,
$$P\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}}$$
(Audibert, Munos, Szepesvári, 2009)
Anytime UCB policies have a heavy-tailed regret

- For some difficult bandit problems, the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \ge 2$,
$$P\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}}$$
- UCB-Horizon policy: at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}}\left\{\hat\mu_{i,t-1} + \sqrt{\frac{2\log n}{T_i(t-1)}}\right\} \qquad (A)$$
  (Audibert, Munos, Szepesvári, 2009) — a code sketch follows below
- UCB-H satisfies $P(\hat R_n > \mathbb{E}\hat R_n + C\log n) \le \frac{C}{n}$ for some $C$
- (A) = unavoidable for anytime policies (Salomon, Audibert, 2011)
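A minimal sketch of the UCB-Horizon index: identical to UCB1 except that the known horizon $n$ replaces $t$ inside the logarithm (the class name is ours).

```python
import math

class UCBHorizon:
    """UCB-H: index = empirical mean + sqrt(2 * log(n) / T_i(t-1)), with the horizon n known."""
    def __init__(self, n_arms, horizon):
        self.K = n_arms
        self.log_n = math.log(horizon)
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select_arm(self, t):
        for i in range(self.K):
            if self.counts[i] == 0:
                return i
        return max(range(self.K),
                   key=lambda i: self.sums[i] / self.counts[i]
                   + math.sqrt(2 * self.log_n / self.counts[i]))

    def update(self, arm, gain):
        self.counts[arm] += 1
        self.sums[arm] += gain
```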
Comparison of UCB1 (solid lines) and UCB-H (dotted lines)
(figure)
Comparison of UCB1(ρ) and UCB-H(ρ) in expectation

(figures: expected regret of GCL*, UCB1(ρ) and UCB-H(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms; left: Dirac(0.6) vs Bernoulli(0.5), right: Bernoulli(0.6) vs Dirac(0.5))
Comparison of UCB1(ρ) and UCB-H(ρ) in deviations

- For n = 1000 and K = 2 arms: Bernoulli(0.6) and Dirac(0.5)

(figures comparing GCL*, UCB1(0.2), UCB1(0.5), UCB-H(0.2) and UCB-H(0.5); left: smoothed probability mass function of the regret, $P(r-2 \le \hat R_n \le r+2)$ against the regret level $r$; right: tail distribution of the regret, $P(\hat R_n > r)$ against the regret level $r$)
Knowing the horizon: theory and practice
- Theory: use UCB-H to avoid heavy tails of the regret
- Practice: theory is right. Besides, thanks to this robustness, the expected regret of UCB-H(ρ) is consistently smaller than that of UCB1(ρ), but the gain is small.
Knowing $\mu^*$

- Hoeffding-based GCL* policy: play each arm once, then play
$$I_t \in \arg\min_{i\in\{1,\dots,K\}} T_i(t-1)\,\big(\mu^* - \hat\mu_{i,t-1}\big)_+^2$$
  (a code sketch follows below)
- Underlying ideas:
  - compare the p-values of the $K$ tests $H_0 = \{\mu_i = \mu^*\}$, $i \in \{1,\dots,K\}$
  - the p-values are estimated using Hoeffding's inequality:
$$P_{H_0}\big(\hat\mu_{i,t-1} \le \hat\mu_{i,t-1}^{(\mathrm{obs})}\big) \lesssim \exp\Big(-2\,T_i(t-1)\,\big(\mu^* - \hat\mu_{i,t-1}^{(\mathrm{obs})}\big)_+^2\Big)$$
  - play the arm for which we have the Greatest Confidence Level that it is the optimal arm
- Advantages:
  - logarithmic expected regret
  - anytime policy
  - regret with a subexponential right tail
  - parameter-free policy!
  - outperforms any other Hoeffding-based algorithm!
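A minimal sketch of the Hoeffding-based GCL* index when $\mu^*$ is known (the class name and interface are our illustrative choices).

```python
class HoeffdingGCL:
    """Hoeffding-based GCL*: play argmin_i T_i(t-1) * (mu_star - mean_i)_+^2, with mu_star known."""
    def __init__(self, n_arms, mu_star):
        self.K = n_arms
        self.mu_star = mu_star
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select_arm(self, t):
        for i in range(self.K):                    # play each arm once
            if self.counts[i] == 0:
                return i
        def index(i):
            mean = self.sums[i] / self.counts[i]
            gap_plus = max(self.mu_star - mean, 0.0)
            return self.counts[i] * gap_plus ** 2  # small index = high confidence of optimality
        return min(range(self.K), key=index)

    def update(self, arm, gain):
        self.counts[arm] += 1
        self.sums[arm] += gain
```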
From Chernoff’s inequality to KL-based algorithms
- Let $K(p, q)$ be the Kullback-Leibler divergence between Bernoulli distributions of respective parameters $p$ and $q$
- Let $X_1, \dots, X_T$ be i.i.d. random variables of mean $\mu$, taking their values in $[0,1]$. Let $\bar X = \frac{1}{T}\sum_{i=1}^T X_i$. For any $\gamma > 0$,
$$P\big(\bar X \le \mu - \gamma\big) \le \exp\big(-T\,K(\mu - \gamma, \mu)\big).$$
In particular, we have
$$P\big(\bar X \le \bar X^{(\mathrm{obs})}\big) \le \exp\big(-T\,K(\min(\bar X^{(\mathrm{obs})}, \mu), \mu)\big).$$
- If $\mu^*$ is known, using the same idea of comparing the p-values of the tests $H_0 = \{\mu_i = \mu^*\}$, $i \in \{1,\dots,K\}$, we get the Chernoff-based GCL* policy: play each arm once, then play
$$I_t \in \arg\min_{i\in\{1,\dots,K\}} T_i(t-1)\,K\big(\min(\hat\mu_{i,t-1}, \mu^*), \mu^*\big)$$
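A hedged sketch of the Bernoulli KL divergence and the Chernoff-based GCL* index built from it; the clipping of $p$ and $q$ away from 0 and 1 is an illustrative way of handling the boundary cases.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """K(p, q) between Bernoulli(p) and Bernoulli(q), clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_gcl_index(count, mean, mu_star):
    """Index T_i(t-1) * K(min(mean_i, mu_star), mu_star); the arm minimizing it is played."""
    return count * bernoulli_kl(min(mean, mu_star), mu_star)
```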
Back to unknown $\mu^*$

- When $\mu^*$ is unknown, the principle of playing the arm for which we have the greatest confidence level that it is the optimal arm is replaced by being optimistic in face of uncertainty:
  - an arm $i$ is represented by the highest mean of a distribution $\nu$ for which the hypothesis $H_0 = \{\nu_i = \nu\}$ has a p-value greater than $\frac{1}{t^\beta}$ (critical $\beta = 1$, as usual)
  - the arm with the highest index (= UCB) is played
KL-based algorithms when µ∗ is unknown
- Approximating the p-value using Sanov's theorem is tightly linked to the DMED policy, which satisfies
$$\limsup_{n\to+\infty}\frac{R_n}{\log n} \le \sum_{i:\Delta_i>0}\frac{\Delta_i}{\inf_{\nu:\,\mathbb{E}_{X\sim\nu}X \ge \mu^*} K(\nu_i, \nu)}$$
for distributions with finite support (included in $[0,1]$) (Burnetas & Katehakis, 1996; Honda & Takemura, 2010). It matches the lower bound
$$\liminf_{n\to+\infty}\frac{R_n}{\log n} \ge \sum_{i:\Delta_i>0}\frac{\Delta_i}{\inf_{\nu:\,\mathbb{E}_{X\sim\nu}X \ge \mu^*} K(\nu_i, \nu)}$$
(Lai & Robbins, 1985; Burnetas & Katehakis, 1996)
- Approximating the p-value using a non-asymptotic version of Sanov's theorem leads to KL-UCB (Cappé & Garivier, COLT 2011) and the K-strategy (Maillard, Munos, Stoltz, COLT 2011)
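A hedged sketch of a KL-UCB-style index for Bernoulli rewards: the upper confidence bound is the largest $q$ with $T_i(t-1)\,K(\hat\mu_{i,t-1}, q) \le \log t$, found by bisection. Using plain $\log t$ as the exploration level and the chosen tolerance are simplifications for illustration (the published KL-UCB threshold includes an extra $c\log\log t$ term).

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """K(p, q) between Bernoulli(p) and Bernoulli(q), clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t, tol=1e-6):
    """Largest q in [mean, 1] such that count * K(mean, q) <= log(t), by bisection."""
    threshold = math.log(max(t, 2)) / count
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= threshold:
            lo = mid          # mid is still a plausible mean: move the bound up
        else:
            hi = mid
    return lo
```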
Adversarial bandit
Parameters: the number of arms $K$ and the number of rounds $n$

For each round $t = 1, 2, \dots, n$:
  1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$, possibly with the help of an external randomization
  2. the adversary chooses a gain vector $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0,1]^K$
  3. the forecaster receives and observes only the gain $g_{I_t,t}$

Goal: maximize the cumulative gains obtained. We consider the regret:
$$R_n = \max_{i=1,\dots,K}\mathbb{E}\sum_{t=1}^n g_{i,t} \;-\; \mathbb{E}\sum_{t=1}^n g_{I_t,t}$$

- In full information, step 3 is replaced by: the forecaster receives $g_{I_t,t}$ and observes the full gain vector $g_t$
- In both settings, the forecaster should use an external randomization to have $o(n)$ regret.
Adversarial setting in full information: an optimal policy
- Cumulative reward on $[1, t-1]$: $G_{i,t-1} = \sum_{s=1}^{t-1} g_{i,s}$
- Follow-the-leader, $I_t \in \arg\max_{i\in\{1,\dots,K\}} G_{i,t-1}$, is a bad policy
- An "optimal" policy is obtained by considering
$$p_{i,t} = P(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}}$$
  (a code sketch is given after the proof slide below)
- For this policy, $R_n \le \frac{n\eta}{8} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{8\log K}{n}}$, we have $R_n \le \sqrt{\frac{n\log K}{2}}$
(Littlestone, Warmuth, 1994; Long, 1996; Bylanger, 1997; Cesa-Bianchi, 1999)
Jean-Yves Audibert,
Introduction to Bandits: Algorithms and Theory
33/1
Proof of the regret bound
With $p_{i,t} = P(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}}$, we have
$$\begin{aligned}
\mathbb{E}\sum_t g_{I_t,t}
&= \mathbb{E}\sum_t\sum_i p_{i,t}\,g_{i,t}\\
&= \mathbb{E}\sum_t\left[-\frac{1}{\eta}\log\Big(\sum_i p_{i,t}\,e^{\eta(g_{i,t}-\sum_j p_{j,t} g_{j,t})}\Big) + \frac{1}{\eta}\log\Big(\sum_i p_{i,t}\,e^{\eta g_{i,t}}\Big)\right]\\
&= \mathbb{E}\sum_t\left[-\frac{1}{\eta}\log\mathbb{E}\,e^{\eta(V_t-\mathbb{E}V_t)} + \frac{1}{\eta}\log\frac{\sum_i e^{\eta G_{i,t}}}{\sum_i e^{\eta G_{i,t-1}}}\right]
\qquad\text{where } P(V_t = g_{i,t}) = p_{i,t}\\
&\ge \mathbb{E}\sum_t\Big(-\frac{\eta}{8}\Big) + \frac{1}{\eta}\,\mathbb{E}\log\frac{\sum_j e^{\eta G_{j,n}}}{\sum_j e^{\eta G_{j,0}}}\\
&\ge -\frac{n\eta}{8} + \frac{1}{\eta}\,\mathbb{E}\log\frac{e^{\eta\max_j G_{j,n}}}{K}
= -\frac{n\eta}{8} - \frac{\log K}{\eta} + \mathbb{E}\max_j G_{j,n}
\end{aligned}$$
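A minimal sketch of the exponentially weighted forecaster in the full-information setting, with the learning rate $\eta = \sqrt{8\log K / n}$ from the slide; the class name and interface are our illustrative choices.

```python
import math
import random

class ExponentiallyWeightedForecaster:
    """Full-information forecaster: p_{i,t} proportional to exp(eta * G_{i,t-1})."""
    def __init__(self, n_arms, horizon, seed=0):
        self.K = n_arms
        self.eta = math.sqrt(8 * math.log(n_arms) / horizon)
        self.G = [0.0] * n_arms                    # cumulative gains G_{i,t-1}
        self.rng = random.Random(seed)

    def probabilities(self):
        m = max(self.G)                            # subtract the max for numerical stability
        weights = [math.exp(self.eta * (g - m)) for g in self.G]
        total = sum(weights)
        return [w / total for w in weights]

    def select_arm(self, t):
        return self.rng.choices(range(self.K), weights=self.probabilities())[0]

    def update(self, gain_vector):
        # full information: the whole gain vector g_t is observed
        for i, g in enumerate(gain_vector):
            self.G[i] += g
```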
Adapting the exponentially weighted forecaster
- In the bandit setting, $G_{i,t-1}$, $i = 1, \dots, K$, are not observed. Trick: estimate them.
- Precisely, $G_{i,t-1}$ is estimated by $\tilde G_{i,t-1} = \sum_{s=1}^{t-1}\tilde g_{i,s}$ with
$$\tilde g_{i,s} = 1 - \frac{1 - g_{i,s}}{p_{i,s}}\,\mathbf{1}_{I_s = i}.$$
Note that $\mathbb{E}_{I_s\sim p_s}\,\tilde g_{i,s} = 1 - \sum_{k=1}^K p_{k,s}\,\frac{1 - g_{i,s}}{p_{i,s}}\,\mathbf{1}_{k = i} = g_{i,s}$.
$$p_{i,t} = P(I_t = i) = \frac{e^{\eta\tilde G_{i,t-1}}}{\sum_{k=1}^K e^{\eta\tilde G_{k,t-1}}}$$
  (a code sketch follows below)
- For this policy, $R_n \le \frac{nK\eta}{2} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{2\log K}{nK}}$, we have $R_n \le \sqrt{2nK\log K}$
(Auer, Cesa-Bianchi, Freund, Schapire, 1995)
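A hedged sketch of the bandit adaptation above: the same exponential weights, but fed the estimated gains $\tilde g_{i,s} = 1 - \frac{1-g_{i,s}}{p_{i,s}}\mathbf{1}_{I_s=i}$, with $\eta = \sqrt{2\log K/(nK)}$ as on the slide; the class name is ours.

```python
import math
import random

class Exp3StyleForecaster:
    """Exponential weights on the estimated cumulative gains G_tilde (bandit feedback)."""
    def __init__(self, n_arms, horizon, seed=0):
        self.K = n_arms
        self.eta = math.sqrt(2 * math.log(n_arms) / (horizon * n_arms))
        self.G_tilde = [0.0] * n_arms
        self.last_probs = [1.0 / n_arms] * n_arms
        self.rng = random.Random(seed)

    def select_arm(self, t):
        m = max(self.G_tilde)                      # subtract the max for numerical stability
        weights = [math.exp(self.eta * (g - m)) for g in self.G_tilde]
        total = sum(weights)
        self.last_probs = [w / total for w in weights]
        return self.rng.choices(range(self.K), weights=self.last_probs)[0]

    def update(self, arm, gain):
        # g_tilde_{i,s} = 1 - (1 - g_{i,s}) * 1{I_s = i} / p_{i,s}  (unbiased estimate of g_{i,s})
        for i in range(self.K):
            if i == arm:
                self.G_tilde[i] += 1.0 - (1.0 - gain) / self.last_probs[i]
            else:
                self.G_tilde[i] += 1.0
```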
Implicitly Normalized Forecaster
(Audibert, Bubeck, 2010)
Let $\psi: \mathbb{R}_-^* \to \mathbb{R}_+^*$ be increasing, convex, twice continuously differentiable, and such that $[\tfrac{1}{K}, 1] \subset \psi(\mathbb{R}_-^*)$.
Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.

For each round $t = 1, 2, \dots$:
- $I_t \sim p_t$
- Compute $p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$ where
$$p_{i,t+1} = \psi(\tilde G_{i,t} - C_t)$$
and $C_t$ is the unique real number such that $\sum_{i=1}^K p_{i,t+1} = 1$.
Minimax policy
- $\psi(x) = \exp(\eta x)$ with $\eta > 0$: this corresponds exactly to the exponentially weighted forecaster
- $\psi(x) = (-\eta x)^{-q}$ with $q > 1$ and $\eta > 0$: this is a new policy which is minimax optimal. For $q = 2$ and $\eta = \sqrt{2n}$, we have
$$R_n \le 2\sqrt{2nK}$$
(Audibert, Bubeck, 2010; Audibert, Bubeck, Lugosi, 2011), while for any strategy, we have
$$\sup R_n \ge \frac{1}{20}\sqrt{nK}$$
(Auer, Cesa-Bianchi, Freund, Schapire, 1995)
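A hedged sketch of the implicitly normalized forecaster with $\psi(x) = (-\eta x)^{-q}$: the normalization constant $C_t$ is found by bisection so that the probabilities sum to one, and the gains are estimated exactly as in the Exp3-style sketch above. The default $\eta$, the bisection bracket and the tolerance are our illustrative choices.

```python
import math
import random

class INFPolicy:
    """Implicitly Normalized Forecaster with psi(x) = (-eta * x)^(-q), bandit feedback."""
    def __init__(self, n_arms, horizon, q=2.0, eta=None, seed=0):
        self.K = n_arms
        self.q = q
        self.eta = eta if eta is not None else math.sqrt(2 * horizon)  # assumed tuning for q = 2
        self.G_tilde = [0.0] * n_arms
        self.probs = [1.0 / n_arms] * n_arms       # p_1 is the uniform distribution
        self.rng = random.Random(seed)

    def _psi(self, x):
        return (-self.eta * x) ** (-self.q)        # defined for x < 0

    def _renormalize(self):
        # Find C_t > max_i G_tilde_i such that sum_i psi(G_tilde_i - C_t) = 1;
        # the sum is decreasing in C_t, so bisection applies.
        g_max = max(self.G_tilde)
        lo = g_max + 1e-12                                      # sum is huge here
        hi = g_max + self.K ** (1.0 / self.q) / self.eta + 1.0  # sum < 1 here
        for _ in range(100):
            c = (lo + hi) / 2
            if sum(self._psi(g - c) for g in self.G_tilde) > 1.0:
                lo = c
            else:
                hi = c
        c = (lo + hi) / 2
        weights = [self._psi(g - c) for g in self.G_tilde]
        total = sum(weights)                       # total is ~1; renormalize the bisection error away
        self.probs = [w / total for w in weights]

    def select_arm(self, t):
        return self.rng.choices(range(self.K), weights=self.probs)[0]

    def update(self, arm, gain):
        # same estimated gains as in the Exp3-style sketch
        for i in range(self.K):
            self.G_tilde[i] += 1.0 if i != arm else 1.0 - (1.0 - gain) / self.probs[arm]
        self._renormalize()
```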