The K-armed Dueling Bandits Problem
COLT 2009
Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims
Cornell University
Multi-armed Bandits
• K bandits (arms / actions / strategies)
• Each time step, algorithm chooses bandit
based on prior actions and feedback.
• Only observe feedback from actions taken.
• Partial Feedback Online Learning Problem
Optimizing Retrieval Functions
• Interactively learn the best retrieval function
• Search users provide implicit feedback
– E.g., what they click on
• Seems like a natural bandit problem
– Each retrieval function is a bandit
– Feedback is clicks
– Find the best w.r.t. clickthrough rate
– Assumes clicks → explicit absolute feedback!
What Results do Users View/Click?
[Figure: # times a result at each rank was selected, and mean time (s) spent in the abstract, plotted against rank of result (1–11).]
[Joachims et al., TOIS 2007]
Team-Game Interleaving
(u=thorsten, q=“svm”)
f1(u,q) → r1:
1. Kernel Machines
   http://svm.first.gmd.de/
2. Support Vector Machine
   http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines
   http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ...
   http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine
   http://ais.gmd.de/~thorsten/svm light/

f2(u,q) → r2:
1. Kernel Machines
   http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine
   http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel ... References
   http://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet
   http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine
   http://svm.dcs.rhbnc.ac.uk

Interleaving(r1,r2):
1. Kernel Machines (T2)
   http://svm.first.gmd.de/
2. Support Vector Machine (T1)
   http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (T2)
   http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines (T1)
   http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References (T2)
   http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... (T1)
   http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet (T2)
   http://svm.research.bell-labs.com/SVT/SVMsvt.html
• Mix results of f1 and f2
• Relative feedback
• More reliable
Interpretation: (r1 > r2) ↔ clicks(T1) > clicks(T2)
[Radlinski, Kurup, Joachims; CIKM 2008]
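To make the mechanics above concrete, here is a minimal Python sketch of this style of interleaving and of turning clicks into a relative judgment. It is an illustration only: the function names, the per-pair coin flip, and the duplicate handling are assumptions, not the exact scheme of the cited paper.

```python
import random

def interleave(r1, r2):
    """Mix two rankings, remembering which team (T1/T2) contributed each result.

    Illustrative sketch of team-style interleaving, not the reference
    implementation from the cited paper.
    """
    mixed, team_of = [], {}
    i = j = 0
    while i < len(r1) or j < len(r2):
        for team in random.sample(["T1", "T2"], 2):   # random pick order per round
            if team == "T1" and i < len(r1):
                doc, i = r1[i], i + 1
            elif team == "T2" and j < len(r2):
                doc, j = r2[j], j + 1
            else:
                continue
            if doc not in team_of:                    # skip duplicates already shown
                mixed.append(doc)
                team_of[doc] = team
    return mixed, team_of

def infer_preference(clicks, team_of):
    """Relative feedback: r1 > r2 iff team T1 collected more clicks than T2."""
    t1 = sum(team_of.get(d) == "T1" for d in clicks)
    t2 = sum(team_of.get(d) == "T2" for d in clicks)
    return "r1 > r2" if t1 > t2 else "r2 > r1" if t2 > t1 else "tie"
```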
Dueling Bandits Problem
• Given K bandits b1, …, bK
• Each iteration: compare (duel) two bandits
– E.g., interleaving two retrieval functions
• Comparison is noisy
– Each comparison result independent
– Comparison probabilities initially unknown
– Comparison probabilities fixed over time
• Total preference ordering, initially unknown
Dueling Bandits Problem
• Want to find best (or good) bandit
– Similar to finding the max w/ noisy comparisons
– Ours is a regret minimization setting
• Choose pair (bt, bt’) to minimize regret:
R_T = Σ_{t=1..T} [ P(b* > b_t) + P(b* > b_t') − 1 ]
• (% users who prefer best bandit over chosen ones)
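Read as code, the definition above charges, for each duel, how much more users would have preferred b* over the two bandits actually compared. A small sketch (the matrix P and the variable names are illustrative):

```python
def dueling_regret(P, best, chosen_pairs):
    """R_T = sum over t of [ P(b* > b_t) + P(b* > b_t') - 1 ].

    P[i][j] = probability bandit i beats bandit j; `best` is the index of b*.
    """
    return sum(P[best][bt] + P[best][bt_] - 1.0 for bt, bt_ in chosen_pairs)

# Example with 3 bandits, b0 best: dueling (b0, b1) costs 0.1, (b1, b2) costs 0.3.
P = [[0.5, 0.6, 0.7],
     [0.4, 0.5, 0.6],
     [0.3, 0.4, 0.5]]
print(dueling_regret(P, best=0, chosen_pairs=[(0, 1), (1, 2)]))  # 0.4
```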
Related Work
• Existing MAB settings
– [Lai & Robbins, ’85] [Auer et al., ’02]
– PAC setting [Even-Dar et al., ’06]
• Partial monitoring problem
– [Cesa-Bianchi et al., ’06]
• Computing with noisy comparisons
– Finding max [Feige et al., ’97]
– Binary search [Karp & Kleinberg, ’07] [Ben-Or & Hassidim, ’08]
• Continuous-armed Dueling Bandits Problem
– [Yue & Joachims, ’09]
Assumptions
• P(bi > bj) = ½ + εij (distinguishability)
• Strong Stochastic Transitivity
– For three bandits bi > bj > bk :  εik ≥ max{ εij , εjk }
– Monotonicity property
• Stochastic Triangle Inequality
– For three bandits bi > bj > bk :  εik ≤ εij + εjk
– Diminishing returns property
• Satisfied by many standard models
– E.g., Logistic/Bradley-Terry
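As a sanity check of the last bullet, the sketch below builds comparison probabilities from a logistic (Bradley-Terry-style) utility model and verifies both properties numerically; the utilities are arbitrary and chosen only for illustration.

```python
import itertools, math

def eps(u, i, j):
    """epsilon_ij = P(b_i > b_j) - 1/2 under a logistic choice model."""
    return 1.0 / (1.0 + math.exp(u[j] - u[i])) - 0.5

u = [3.0, 2.0, 1.2, 0.3]            # utilities, so b0 > b1 > b2 > b3
for i, j, k in itertools.combinations(range(len(u)), 3):
    e_ij, e_jk, e_ik = eps(u, i, j), eps(u, j, k), eps(u, i, k)
    assert e_ik >= max(e_ij, e_jk) - 1e-12   # strong stochastic transitivity
    assert e_ik <= e_ij + e_jk + 1e-12       # stochastic triangle inequality
print("both assumptions hold for this logistic instance")
```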
Naïve Approach
• In deterministic case, O(K) comparisons to find max
• Extend to noisy case:
– Repeatedly compare until confident one is better
• Problem: comparing two awful (but similar) bandits
– Waste comparisons to see which awful bandit is better
– Incur high regret for each comparison
– Also applies to elimination tournaments
R_T = O( (K/ε²) log T )
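A sketch of the "repeatedly compare until confident" primitive this naïve approach relies on (the duel oracle and the Hoeffding-style confidence radius are assumptions for illustration). Its weakness is exactly the point above: when two bad bandits are nearly tied, the loop runs for a long time and every comparison of them incurs high regret.

```python
import math

def compare_until_confident(duel, i, j, delta):
    """Duel i against j until the empirical win rate confidently leaves 1/2.

    duel(i, j) -> True iff i wins a single noisy comparison.
    Illustrative sketch with a Hoeffding-style anytime confidence radius.
    """
    wins = n = 0
    while True:
        n += 1
        wins += duel(i, j)
        radius = math.sqrt(math.log(2.0 * n * n / delta) / (2.0 * n))
        if wins / n - radius > 0.5:
            return i          # confident that i beats j
        if wins / n + radius < 0.5:
            return j          # confident that j beats i
```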
Interleaved Filter
• Choose candidate bandit at random
• Make noisy comparisons (Bernoulli trial)
against all other bandits simultaneously
– Maintain mean and confidence interval
for each pair of bandits being compared
• …until another bandit is better
– With confidence 1 – δ
• Repeat process with new candidate
– Remove all empirically worse bandits
• Continue until 1 candidate left
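The bullets above translate into roughly the following loop. This is a simplified sketch under stated assumptions (a duel oracle, a Hoeffding-style confidence radius, δ = 1/(TK²)); consult the paper for the exact algorithm and its handling of the time horizon.

```python
import math, random

def interleaved_filter(K, duel, T):
    """Sketch of the Interleaved Filter loop; returns the surviving candidate.

    duel(i, j) -> True iff bandit i wins one noisy comparison against j.
    """
    delta = 1.0 / (T * K * K)                      # confidence parameter from the analysis
    candidate = random.randrange(K)                # choose candidate at random
    rivals = set(range(K)) - {candidate}
    while rivals:
        wins = {b: 0 for b in rivals}              # candidate's wins against each rival
        plays = {b: 0 for b in rivals}
        challenger = None
        while challenger is None and rivals:
            for b in list(rivals):                 # duel all remaining bandits "at once"
                plays[b] += 1
                wins[b] += duel(candidate, b)
                radius = math.sqrt(math.log(1.0 / delta) / (2.0 * plays[b]))
                mean = wins[b] / plays[b]
                if mean - radius > 0.5:            # candidate confidently beats b: prune b
                    rivals.discard(b)
                elif mean + radius < 0.5:          # b confidently beats candidate
                    challenger = b
                    break
        if challenger is not None:
            # New round: also drop every rival the old candidate was empirically beating.
            rivals = {b for b in rivals
                      if b != challenger and 2 * wins[b] <= plays[b]}
            candidate = challenger
    return candidate

# Example: logistic preferences over 6 bandits, bandit 0 is best.
probs = [[1.0 / (1.0 + math.exp(i - j)) for j in range(6)] for i in range(6)]
duel = lambda i, j: random.random() < probs[i][j]
print(interleaved_filter(6, duel, T=100_000))      # usually prints 0
```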
Regret Analysis
• Define a round to be all the time steps for
a particular candidate bandit
– Halts round when 1 - δ confidence met
– Treat candidate bandits as random walk
– Takes log K rounds to reach best bandit
• Define a match to be all the comparisons
between two bandits in a round
– O(K) total matches in expectation
– Constant fraction of bandits removed
each round
Regret Analysis
• O(K) total matches
• Each match incurs regret O( (1/ε) log T )
– Depends on δ = K⁻²T⁻¹
• Finds best bandit w.p. 1 − 1/T
• Expected regret (ε = ε₁₂, the gap between the best and second-best bandit):
E[R_T] ≤ (1 − 1/T) · O( (K/ε) log T ) + (1/T) · O(T)
E[R_T] = O( (K/ε) log T )
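Spelled out, the calculation the bullets above compress is (a sketch in the same notation, constants suppressed):

```latex
\[
\mathbb{E}[R_T]
  \;\le\; \Bigl(1-\tfrac{1}{T}\Bigr)\,
     \underbrace{O(K)}_{\#\,\text{matches}} \cdot
     \underbrace{O\!\Bigl(\tfrac{1}{\epsilon}\log T\Bigr)}_{\text{regret per match}}
  \;+\; \underbrace{\tfrac{1}{T}\cdot O(T)}_{\text{failure case}}
  \;=\; O\!\Bigl(\tfrac{K}{\epsilon}\log T\Bigr).
\]
```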
Lower Bound Example
• Order bandits b1 > b2 > … > bK
– P(bi > bj) = ½ + ε for all i < j
• Each match takes (1/ε²) log T comparisons
• Pay Θ(ε) regret for each comparison
• Accumulated regret over all matches is at least Ω( (K/ε) log T )
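Multiplying the three ingredients above (a sketch; constants suppressed):

```latex
\[
R_T \;\ge\; \underbrace{\Omega(K)}_{\text{matches}}\cdot
            \underbrace{\Omega\!\Bigl(\tfrac{1}{\epsilon^{2}}\log T\Bigr)}_{\text{comparisons per match}}\cdot
            \underbrace{\Theta(\epsilon)}_{\text{regret per comparison}}
     \;=\; \Omega\!\Bigl(\tfrac{K}{\epsilon}\log T\Bigr).
\]
```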
Moving Forward
• Extensions
– Drifting user interests
• Changing user population / document collection
– Contextualization / side information
• Cost-sensitive active learning w/ relative feedback
• Live user studies
– Interactive experiments on search service
– Sheds insight; guides future design
Extra Slides
Per-Match Regret
• Number of comparisons in match bi vs bj : O( log T / max{ ε1i , εij }² )
– ε1i > εij : round ends before concluding bi > bj
– ε1i < εij : conclude bi > bj before round ends, remove bj
• Pay ε1i + ε1j regret for each comparison
– By triangle inequality ε1i + ε1j ≤ 2·max{ ε1i , εij }
– Thus by stochastic transitivity accumulated regret is
O( log T / max{ ε1i , εij } ) ≤ O( (1/ε) log T )
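In one line, the per-match accounting the bullets above describe (a sketch; constants suppressed):

```latex
\[
\underbrace{O\!\Bigl(\tfrac{\log T}{\max\{\epsilon_{1i},\epsilon_{ij}\}^{2}}\Bigr)}_{\text{comparisons in the match}}
\cdot
\underbrace{(\epsilon_{1i}+\epsilon_{1j})}_{\text{regret per comparison}}
\;\le\;
O\!\Bigl(\tfrac{\log T}{\max\{\epsilon_{1i},\epsilon_{ij}\}}\Bigr)
\;\le\;
O\!\Bigl(\tfrac{1}{\epsilon}\log T\Bigr).
\]
```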
Number of Rounds
• Assume all superior bandits have
equal prob of defeating candidate
– Worst case scenario under transitivity
• Model this as a random walk
– rj transitions to each ri (i < j) with equal probability
– Compute total number of steps before reaching r1 (i.e., r*)
• Can show O(log K) w.h.p. using Chernoff bound
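A toy simulation of this random-walk model (purely illustrative; the uniform transition to a better bandit is the worst-case assumption named above), whose average round count grows like the harmonic number, i.e. roughly ln K:

```python
import random
from statistics import mean

def rounds_until_best(K):
    """Candidate random walk: from position j, jump uniformly to a better position."""
    candidate, rounds = K, 0
    while candidate > 1:
        candidate = random.randrange(1, candidate)   # uniform over strictly better bandits
        rounds += 1
    return rounds

for K in (10, 100, 1000, 10000):
    print(K, mean(rounds_until_best(K) for _ in range(5000)))
# Averages grow roughly like ln K, consistent with the O(log K) claim.
```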
Total Matches Played
• O(K) matches played in each round
– Naïve analysis yields O(K log K) total
• However, all empirically worse bandits
are also removed at the end of each round
– Will not participate in future rounds
• Assume worst case that inferior bandits have ½ chance of
being empirically worse
– Can show w.h.p. that O(K) total matches are played
over O(log K) rounds
Removing Inferior Bandits
• At conclusion of each round
– Remove any empirically worse bandits
• Intuition:
– High confidence that winner is better
than incumbent candidate
– Empirically worse bandits cannot be “much better” than
incumbent candidate
– Can show via Hoeffding bound that winner is also better than
empirically worse bandits with high confidence
– Preserves 1-1/T confidence overall that we’ll find the best bandit