Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey
Deepayan Chakrabarti
Deepak Agarwal
Background: Bandits

Bandit “arms” with unknown reward probabilities μ1, μ2, μ3, …

Pull arms sequentially so as to maximize the total expected reward:
• Show ads on a webpage to maximize clicks
• Product recommendation to maximize sales
Dependent Arms

• Reward probabilities μi are generally assumed to be independent of each other
• What if they are dependent?
  • E.g., ads on similar topics, using similar text/phrases, should have similar rewards:

    “Skiing, snowboarding”   μ1 = 0.3
    “Skiing, snowshoes”      μ2 = 0.28
    “Snowshoe rental”        μ3 = 0.31
    “Get Vonage!”            μ4 = 10^-6
Dependent Arms

• Reward probabilities μi are generally assumed to be independent of each other
• What if they are dependent?
  • E.g., ads on similar topics, using similar text/phrases, should have similar rewards
  • A click on one ad ⇒ other “similar” ads may generate clicks as well
• Can we increase total reward using this dependency?
Cluster Model of Dependence

[Figure: Arms 1 and 2 form Cluster 1; Arms 3 and 4 form Cluster 2]

Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
μi ~ f(π[i]), where f is some known distribution and π[i] is the (unknown) cluster-specific parameter of arm i's cluster
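A minimal generative sketch of this cluster model, in Python. The Beta form of f, the parameter values, and the `strength` concentration knob are illustrative assumptions; the slides only require f to be some known distribution with an unknown cluster-specific parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster parameters pi[c] and the (known) cluster of each arm.
cluster_param = {1: 0.30, 2: 0.05}           # unknown to the policy in practice
arm_cluster   = {1: 1, 2: 1, 3: 2, 4: 2}

def draw_mu(pi, strength=50):
    """mu_i ~ f(pi): here f is assumed to be a Beta concentrated around pi."""
    return rng.beta(strength * pi, strength * (1 - pi))

mu = {arm: draw_mu(cluster_param[c]) for arm, c in arm_cluster.items()}

def pull(arm, n=1):
    """Successes s_i ~ Bin(n_i, mu_i)."""
    return rng.binomial(n, mu[arm])

print(mu)            # arms in the same cluster get similar reward probabilities
print(pull(1, 100))  # e.g., successes from 100 pulls of arm 1
```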
Cluster Model of Dependence

[Figure: Arms 1 and 2 have μi ~ f(π1); Arms 3 and 4 have μi ~ f(π2)]

Total reward:
• Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
• Undiscounted: ∑_{t=0}^{T} E[R(t)]
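For concreteness, a small sketch of the two objectives computed from an observed reward trace R(0), R(1), …; the trace and the α value here are made up for illustration.

```python
import numpy as np

def discounted_reward(rewards, alpha):
    """Sum over t of alpha^t * R(t), over the observed trace."""
    return sum(alpha ** t * r for t, r in enumerate(rewards))

def undiscounted_reward(rewards, T):
    """Sum of R(t) for t = 0 .. T."""
    return sum(rewards[: T + 1])

trace = np.random.default_rng(0).binomial(1, 0.3, size=100)  # hypothetical 0/1 rewards
print(discounted_reward(trace, alpha=0.9))
print(undiscounted_reward(trace, T=99))
```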
Discounted Reward

[Figure: one belief-state MDP per cluster; e.g., pulling Arm 1 moves cluster 1's state (x1, x2) to (x'1, x'2) or (x"1, x"2), and pulling Arm 3 moves cluster 2's state (x3, x4) similarly]

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:
• Compute an (“index”, arm) pair for each cluster
• Pick the cluster with the largest index, and pull the corresponding arm
Discounted Reward

[Figure: the same per-cluster belief-state MDPs as on the previous slide]

• Reduces the problem to smaller, per-cluster state spaces
• Reduces to Gittins’ Theorem [1979] for independent bandits
• Approximation bounds on the index for k-step lookahead

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:
• Compute an (“index”, arm) pair for each cluster
• Pick the cluster with the largest index, and pull the corresponding arm
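A rough sketch of a k-step lookahead value for a single cluster, used here as a stand-in for that cluster's index. It assumes an independent Beta belief per arm and a made-up discount factor; the talk's exact per-cluster MDP instead couples the arms through the shared cluster parameter π, so this is only illustrative.

```python
from functools import lru_cache

ALPHA = 0.9  # discount factor (assumed value)

@lru_cache(maxsize=None)
def lookahead_value(beliefs, k):
    """k-step lookahead value of one cluster.
    beliefs = ((a1, b1), (a2, b2), ...): an independent Beta(a, b) belief
    per arm -- an illustrative simplification of the per-cluster MDP."""
    if k == 0:
        return 0.0
    best = 0.0
    for i, (a, b) in enumerate(beliefs):
        p = a / (a + b)                              # posterior mean reward of arm i
        win  = list(beliefs); win[i]  = (a + 1, b)   # belief after a success
        lose = list(beliefs); lose[i] = (a, b + 1)   # belief after a failure
        q = p * (1 + ALPHA * lookahead_value(tuple(win), k - 1)) \
            + (1 - p) * ALPHA * lookahead_value(tuple(lose), k - 1)
        best = max(best, q)
    return best

# Usage idea: compute this value for every cluster, pick the cluster with the
# largest value, and pull the arm that attains the max at the top level.
print(lookahead_value(((2, 1), (1, 1)), 3))
```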
Cluster Model of Dependence

[Figure: Arms 1 and 2 have μi ~ f(π1); Arms 3 and 4 have μi ~ f(π2)]

Total reward:
• Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
• Undiscounted: ∑_{t=0}^{T} E[R(t)]
Undiscounted Reward

[Figure: “cluster arm” 1 groups Arms 1 and 2; “cluster arm” 2 groups Arms 3 and 4]

All arms in a cluster are similar
⇒ they can be grouped into one hypothetical “cluster arm”
Undiscounted Reward

Two-Level Policy

[Figure: “cluster arm” 1 over Arms 1–2, “cluster arm” 2 over Arms 3–4]

Each “cluster arm” must have some estimated reward probability.

In each iteration:
• Pick a “cluster arm” using a traditional bandit policy
• Pick an arm within that cluster using a traditional bandit policy
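A sketch of the Two-Level Policy. The slides do not name the traditional bandit policy; UCB1 is used here at both levels as one standard choice, and the cluster-level update shown corresponds to the MEAN estimate discussed on the next slides (the reward of a “cluster arm” is its overall success rate). The μ values are hypothetical.

```python
import math, random

class UCB1:
    """Standard UCB1 bandit policy over a fixed set of choices."""
    def __init__(self, choices):
        self.n = {c: 0 for c in choices}      # pulls per choice
        self.s = {c: 0.0 for c in choices}    # total reward per choice
        self.t = 0                            # total number of selections
    def select(self):
        self.t += 1
        for c, pulls in self.n.items():       # play every choice once first
            if pulls == 0:
                return c
        return max(self.n, key=lambda c:
                   self.s[c] / self.n[c] + math.sqrt(2 * math.log(self.t) / self.n[c]))
    def update(self, choice, reward):
        self.n[choice] += 1
        self.s[choice] += reward

def two_level_policy(clusters, pull, T):
    """clusters: dict cluster_id -> list of arm ids; pull(arm) -> 0/1 reward."""
    top   = UCB1(list(clusters))                              # over "cluster arms"
    inner = {c: UCB1(arms) for c, arms in clusters.items()}   # within each cluster
    total = 0
    for _ in range(T):
        c = top.select()           # pick a "cluster arm" with a bandit policy
        a = inner[c].select()      # pick an arm within that cluster
        r = pull(a)
        inner[c].update(a, r)
        top.update(c, r)           # cluster estimate s/n = MEAN success rate
        total += r
    return total

# Usage with hypothetical reward probabilities (arms in a cluster are similar):
mu = {1: 0.30, 2: 0.28, 3: 0.06, 4: 0.05}
clusters = {1: [1, 2], 2: [3, 4]}
print(two_level_policy(clusters, lambda a: int(random.random() < mu[a]), T=10000))
```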
Issues

• What is the reward probability of a “cluster arm”?
• How do cluster characteristics affect performance?
Reward probability of a “cluster arm”

• What is the reward probability r of a “cluster arm”?
• MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]
  • Initially, r = μavg = average μ of the arms in the cluster
  • Finally, r = μmax = maximum μ among the arms in the cluster
  ⇒ “Drift” in the reward probability of the “cluster arm”
Reward probability drift causes problems

[Figure: Cluster 1 (the optimal cluster, containing Arms 1 and 2) holds the best (optimal) arm, with reward probability μopt; Cluster 2 contains Arms 3 and 4]

Drift
⇒ Non-optimal clusters might temporarily look better
⇒ the optimal arm is explored only O(log T) times
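A small simulation of this drift for a single hypothetical cluster: the MEAN estimate ∑si / ∑ni starts near the cluster's average μ and climbs toward its maximum μ as the inner bandit policy (UCB1 here, as an illustrative choice) concentrates on the better arm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cluster: two arms with slightly different reward probabilities.
mu = [0.28, 0.30]
s = np.zeros(2)
n = np.zeros(2)

mean_estimate = []
for t in range(1, 5001):
    # Inner bandit policy: UCB1 over the arms of this cluster.
    if 0 in n:
        i = int(np.argmin(n))
    else:
        i = int(np.argmax(s / n + np.sqrt(2 * np.log(t) / n)))
    s[i] += rng.binomial(1, mu[i])
    n[i] += 1
    mean_estimate.append(s.sum() / n.sum())   # MEAN reward of the "cluster arm"

# Early on the MEAN estimate sits near the average mu (0.29); as the inner
# policy concentrates on the better arm it drifts up toward max mu (0.30).
print(mean_estimate[50], mean_estimate[-1])
```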
Reward probability of a “cluster arm”

• What is the reward probability r of a “cluster arm”?
• MEAN: r = ∑si / ∑ni, over all arms i in the cluster
• MAX: r = max( E[μi] ) over arms in the cluster
• PMAX: r = E[ max(μi) ] over arms in the cluster
• Both MAX and PMAX aim to estimate μmax and thus reduce drift
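A sketch of the three estimates computed from per-arm counts, assuming a Beta(1+si, 1+ni−si) posterior for each μi; the posterior form and the counts are illustrative assumptions, and PMAX is approximated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

def cluster_arm_reward(s, n, scheme, samples=10000):
    """Reward probability r of a "cluster arm", from per-arm successes s and
    pulls n, under an assumed Beta(1 + s_i, 1 + n_i - s_i) posterior per arm."""
    s, n = np.asarray(s, float), np.asarray(n, float)
    if scheme == "MEAN":
        return s.sum() / n.sum()                    # average success rate
    post_mean = (1 + s) / (2 + n)                   # E[mu_i]
    if scheme == "MAX":
        return post_mean.max()                      # max of the expectations
    if scheme == "PMAX":
        draws = rng.beta(1 + s, 1 + n - s, size=(samples, len(s)))
        return draws.max(axis=1).mean()             # E[max of mu_i], by Monte Carlo
    raise ValueError(scheme)

s, n = [12, 3], [40, 10]   # hypothetical per-arm counts within one cluster
for scheme in ("MEAN", "MAX", "PMAX"):
    print(scheme, round(cluster_arm_reward(s, n, scheme), 3))
```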
Reward probability of a “cluster arm”

Bias in estimating μmax, and variance of the estimator:
• MEAN: high bias
• MAX: low bias
• PMAX: unbiased, but high variance

Both MAX and PMAX aim to estimate μmax and thus reduce drift.
Comparison of schemes

[Plot: comparison of MEAN, MAX, and PMAX on 10 clusters with 11.3 arms/cluster on average]

MAX performs best.
Issues

• What is the reward probability of a “cluster arm”?
• How do cluster characteristics affect performance?
Effects of cluster characteristics

• We analytically study the effects of cluster characteristics on the “crossover-time”
• Crossover-time: the time at which the expected reward probability of the optimal cluster becomes the highest among all “cluster arms”
Effects of cluster characteristics

Crossover-time Tc for MEAN depends on:
• Cluster separation Δ = μopt − (max μ outside the optimal cluster)
  Δ increases ⇒ Tc decreases
• Cluster size Aopt
  Aopt increases ⇒ Tc increases
• Cohesiveness of the optimal cluster, 1 − avg(μopt − μi)
  Cohesiveness increases ⇒ Tc decreases
Experiments (effect of separation)

Δ increases ⇒ Tc decreases ⇒ higher reward
Experiments (effect of size)

Aopt increases ⇒ Tc increases ⇒ lower reward
Experiments (effect of cohesiveness)

Cohesiveness increases ⇒ Tc decreases ⇒ higher reward
Related Work

• Typical multi-armed bandit problems
  • Do not consider dependencies
  • Very few arms
• Bandits with side information
  • Cannot handle dependencies among arms
• Active learning
  • Emphasis on the number of examples required to achieve a given prediction accuracy
Conclusions

• We analyze bandits where dependencies are encapsulated within clusters
• Discounted reward ⇒ the optimal policy is an index scheme on the clusters
• Undiscounted reward:
  • Two-Level Policy with MEAN, MAX, and PMAX
  • Analysis of the effect of cluster characteristics on performance, for MEAN
Discounted Reward

[Figure: the joint belief-state MDP over all four arms; each state holds the estimated reward probabilities (x1, x2, x3, x4), and pulling Arm 1 changes the belief for both arms 1 and 2]

• Create a belief-state MDP
• Each state contains the estimated reward probabilities for all arms
• Solve for the optimal policy
Background: Bandits

Bandit “arms” with unknown payoff probabilities p1, p2, p3, …

Regret = optimal payoff − actual payoff
Reward probability of a “cluster arm”

• What is the reward probability of a “cluster arm”?
  • Eventually, every “cluster arm” must converge to the most rewarding arm μmax within that cluster, since a bandit policy is used within each cluster
  • However, “drift” causes problems
Experiments

• Simulation based on one week’s worth of data from a large-scale ad-matching application
• 10 clusters, with 11.3 arms/cluster on average
Comparison of schemes

10 clusters, 11.3 arms/cluster on average
• Cluster separation Δ = 0.08
• Cluster size Aopt = 31
• Cohesiveness = 0.75

MAX performs best.
Reward probability drift causes problems

[Figure: Cluster 1 (the optimal cluster, containing Arms 1 and 2) holds the best (optimal) arm, with reward probability μopt; Cluster 2 contains Arms 3 and 4]

Intuitively, to reduce regret, we must:
• quickly converge to the optimal “cluster arm”,
• and then to the best arm within that cluster.