Beat the Mean Bandit
ICML 2011
Yisong Yue
Carnegie Mellon University
Joint work with Thorsten Joachims (Cornell University)
Optimizing Information Retrieval Systems
• Increasingly reliant on user feedback
– E.g., clicks on search results
• Online learning is a popular modeling tool
– Especially partial-information (bandit) settings
• Our focus: learning from relative preferences
– Motivated by recent work on interleaved retrieval evaluation (example following)
Team Draft Interleaving
(Comparison Oracle for Search)
Ranking A
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa Valley Wineries - Plan your wine...
   www.napavalley.com/wineries
3. Napa Valley College
   www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley
   www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine
   www.napavintners.com
6. Napa Country, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
Ranking B
1. Napa Country, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging...
   www.napavalley.com
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
5. NapaValley.org
   www.napavalley.org
6. The Napa Valley Marathon
   www.napavalleymarathon.org
Presented Ranking
1. Napa Valley – The authority for lodging...
   www.napavalley.com
2. Napa Country, California – Wikipedia
   en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden...
   books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine...
   www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast...
   www.napalinks.com
6. Napa Valley College
[Radlinski et al. 2008]
Team Draft Interleaving
(Comparison Oracle for Search)
• Ranking A, Ranking B, and the Presented Ranking are unchanged
• B wins!
[Radlinski et al. 2008]
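The interleaving step can be sketched in Python as follows. This is a minimal illustration of team-draft interleaving as commonly described (the team with fewer picks drafts next, a coin flip breaks ties, and clicks are credited to the team that contributed the clicked result). The function names `team_draft_interleave` and `duel_winner` are hypothetical, not taken from the slides.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Interleave two rankings; return (presented list, {result: team})."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    presented, team_of = [], {}
    while len(presented) < length:
        # The team with fewer picks drafts next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = min(picks, key=picks.get)
        else:
            team = rng.choice(["A", "B"])
        # Draft that team's highest-ranked result not yet presented.
        remaining = [r for r in rankings[team] if r not in team_of]
        if not remaining:
            # This ranking is exhausted; let the other team draft instead.
            team = "B" if team == "A" else "A"
            remaining = [r for r in rankings[team] if r not in team_of]
            if not remaining:
                break
        presented.append(remaining[0])
        team_of[remaining[0]] = team
        picks[team] += 1
    return presented, team_of

def duel_winner(team_of, clicked):
    """Credit each click to the team that drafted the clicked result."""
    a = sum(1 for r in clicked if team_of.get(r) == "A")
    b = sum(1 for r in clicked if team_of.get(r) == "B")
    if a == b:
        return "tie"
    return "A" if a > b else "B"
```

Each call to `duel_winner` yields one noisy comparison between the two retrieval functions, which is the feedback the dueling bandits setting assumes.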
Interleave A vs B
              A    B    C    Total wins   Total losses
A wins vs…    0    1    0    1            0
B wins vs…    0    0    0    0            1
C wins vs…    0    0    0    0            0
Interleave A vs C
              A    B    C    Total wins   Total losses
A wins vs…    0    1    0    1            1
B wins vs…    0    0    0    0            1
C wins vs…    1    0    0    1            0
Interleave B vs C
              A    B    C    Total wins   Total losses
A wins vs…    0    1    0    1            1
B wins vs…    0    0    1    1            1
C wins vs…    1    0    0    1            1
Interleave A vs B
              A    B    C    Total wins   Total losses
A wins vs…    0    1    0    1            2
B wins vs…    1    0    1    2            1
C wins vs…    1    0    0    1            1
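The running tally can be reproduced with a short Python sketch. The duel sequence below is read off the win/loss totals of the four interleaving steps (A beats B, C beats A, B beats C, B beats A). The helper name `tally` is hypothetical.

```python
def tally(duels, bandits):
    """Accumulate pairwise wins plus total wins/losses per bandit.

    duels is a list of (winner, loser) pairs, one per interleaving experiment.
    """
    wins_vs = {b: {o: 0 for o in bandits} for b in bandits}
    for winner, loser in duels:
        wins_vs[winner][loser] += 1
    total_wins = {b: sum(wins_vs[b].values()) for b in bandits}
    total_losses = {b: sum(wins_vs[o][b] for o in bandits) for b in bandits}
    return wins_vs, total_wins, total_losses

# Outcomes of: A vs B, A vs C, B vs C, A vs B (as (winner, loser) pairs).
DUELS = [("A", "B"), ("C", "A"), ("B", "C"), ("B", "A")]
wins_vs, total_wins, total_losses = tally(DUELS, ["A", "B", "C"])
```

Keeping one counter per ordered pair needs a quadratic number of estimates; the Beat-the-Mean tables later in the deck instead summarize each bandit by a single mean score.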
Outline
• Learning Formulation
– Dueling Bandits Problem [Yue et al. 2009]
• Modeling transitivity violation
– E.g., (A >> B) AND (B >> C) IMPLIES (A >> C) ??
– Not done in previous work
• Algorithm: Beat-the-Mean
• Empirical Validation
Dueling Bandits Problem
• Given K bandits b1, …, bK
• Each iteration: compare (duel) two bandits
– E.g., interleaving two retrieval functions
• Cost function (regret):
  $R_T = \sum_{t=1}^{T} \left[ P(b^* > b_t) + P(b^* > b_t') - 1 \right]$
• (bt, bt’) are the two bandits chosen
• b* is the overall best one
• (% users who prefer best bandit over chosen ones)
[Yue et al. 2009]
Example Pairwise Preferences
       A       B       C       D       E       F
A      0       0.05    0.05    0.04    0.11    0.11
B     -0.05    0       0.05    0.06    0.08    0.10
C     -0.05   -0.05    0       0.04    0.01    0.06
D     -0.04   -0.04   -0.04    0       0.04    0.00
E     -0.11   -0.08   -0.01   -0.04    0       0.01
F     -0.11   -0.10   -0.06   -0.00   -0.01    0
• Values are Pr(row > col) – 0.5
• Derived from interleaving experiments on http://arXiv.org
Example Pairwise Preferences
• Regret: $R_T = \sum_{t=1}^{T} \left[ P(b^* > b_t) + P(b^* > b_t') - 1 \right]$
• Compare E & F:
– P(A > E) = 0.61
– P(A > F) = 0.61
– Incurred Regret = 0.22
• Compare B & C:
– P(A > B) = 0.55
– P(A > C) = 0.55
– Incurred Regret = 0.10
• Compare A & A (interleaving shows ranking produced by A):
– P(A > A) = 0.50
– P(A > A) = 0.50
– Incurred Regret = 0.00
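The three worked comparisons can be checked with a few lines of Python. This sketch takes A as the best bandit b* and uses A's row of the preference table (values are Pr(A > col) – 0.5). The names `EPS_BEST`, `duel_regret`, and `total_regret` are illustrative only.

```python
# A's row of the example table: eps = Pr(A > col) - 0.5.
EPS_BEST = {"A": 0.00, "B": 0.05, "C": 0.05, "D": 0.04, "E": 0.11, "F": 0.11}

def duel_regret(first, second, eps_best=EPS_BEST):
    """Regret of one duel: P(b* > first) + P(b* > second) - 1."""
    p_first = 0.5 + eps_best[first]
    p_second = 0.5 + eps_best[second]
    return p_first + p_second - 1.0

def total_regret(duels, eps_best=EPS_BEST):
    """Cumulative regret R_T over a sequence of (b_t, b_t') pairs."""
    return sum(duel_regret(a, b, eps_best) for a, b in duels)

for pair in [("E", "F"), ("B", "C"), ("A", "A")]:
    print(pair, round(duel_regret(*pair), 2))
```

Only dueling the best bandit against itself incurs zero regret, which is why the learner is pushed to identify the best bandit rather than merely a good pair.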
Example Pairwise Preferences
• Violation in internal consistency!
• For strong stochastic transitivity:
– A > D should be at least 0.06
– C > E should be at least 0.04
– D > F should be at least 0.04
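The violations can be found mechanically. The sketch below assumes the ordering A > B > C > D > E > F and takes strong stochastic transitivity to mean ε_ik ≥ max{ε_ij, ε_jk} for every triplet i > j > k. The names `EPS` and `sst_violations` are hypothetical, and only the upper triangle of the table is used.

```python
ORDER = ["A", "B", "C", "D", "E", "F"]

# Upper triangle of the example table: EPS[row, col] = Pr(row > col) - 0.5.
EPS = {
    ("A", "B"): 0.05, ("A", "C"): 0.05, ("A", "D"): 0.04, ("A", "E"): 0.11, ("A", "F"): 0.11,
    ("B", "C"): 0.05, ("B", "D"): 0.06, ("B", "E"): 0.08, ("B", "F"): 0.10,
    ("C", "D"): 0.04, ("C", "E"): 0.01, ("C", "F"): 0.06,
    ("D", "E"): 0.04, ("D", "F"): 0.00,
    ("E", "F"): 0.01,
}

def sst_violations(order, eps):
    """Map each violating pair (i, k) to the smallest value eps[i, k] would need."""
    required = {}
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            for k in range(j + 1, len(order)):
                bi, bj, bk = order[i], order[j], order[k]
                need = max(eps[bi, bj], eps[bj, bk])
                if eps[bi, bk] < need:
                    # Keep the tightest requirement over all middle bandits j.
                    required[bi, bk] = max(need, required.get((bi, bk), 0.0))
    return required

print(sst_violations(ORDER, EPS))
```

Under these assumptions the function reports exactly the three pairs highlighted on the slides: A vs D, C vs E, and D vs F.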
Modeling Assumptions
• P(bi > bj) = ½ + εij
• Let b1 be the best overall bandit
• Relaxed Stochastic Transitivity
– For three bandits b1 > bj > bk : $\gamma \epsilon_{1k} \ge \max\{\epsilon_{1j}, \epsilon_{jk}\}$
– γ ≥ 1 (γ = 1 for strong transitivity **)
– Relaxed internal consistency property
• Stochastic Triangle Inequality
– For three bandits b1 > bj > bk : $\epsilon_{1k} \le \epsilon_{1j} + \epsilon_{jk}$
– Diminishing returns property
(** γ = 1 required in previous work, and required to apply for all bandit triplets)
Example Pairwise Preferences
• Relaxed stochastic transitivity: $\gamma \epsilon_{1k} \ge \max\{\epsilon_{1j}, \epsilon_{jk}\}$
• γ = 1.5
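The value γ = 1.5 can be reproduced from the example table. The sketch below assumes A is the best bandit b1 and the ordering A > B > C > D > E > F. It searches for the smallest γ ≥ 1 that satisfies γ ε_1k ≥ max{ε_1j, ε_jk} for every pair b_j > b_k. The names `EPS` and `min_gamma` are illustrative.

```python
ORDER = ["A", "B", "C", "D", "E", "F"]

# Upper triangle of the example table: EPS[row, col] = Pr(row > col) - 0.5.
EPS = {
    ("A", "B"): 0.05, ("A", "C"): 0.05, ("A", "D"): 0.04, ("A", "E"): 0.11, ("A", "F"): 0.11,
    ("B", "C"): 0.05, ("B", "D"): 0.06, ("B", "E"): 0.08, ("B", "F"): 0.10,
    ("C", "D"): 0.04, ("C", "E"): 0.01, ("C", "F"): 0.06,
    ("D", "E"): 0.04, ("D", "F"): 0.00,
    ("E", "F"): 0.01,
}

def min_gamma(order, eps):
    """Smallest gamma >= 1 with gamma * eps[b1, bk] >= max(eps[b1, bj], eps[bj, bk])."""
    best = order[0]
    gamma = 1.0
    for j in range(1, len(order)):
        for k in range(j + 1, len(order)):
            bj, bk = order[j], order[k]
            need = max(eps[best, bj], eps[bj, bk])
            # Assumes eps[best, bk] > 0, which holds for every entry in A's row.
            gamma = max(gamma, need / eps[best, bk])
    return gamma

print(round(min_gamma(ORDER, EPS), 2))
```

The binding triplet is A > B > D: ε_AD = 0.04 must be scaled by 1.5 to reach ε_BD = 0.06. Only triplets that include the best bandit are checked, matching the relaxed assumption.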
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   0/0     0/0     0/0     0/0     0/0     0/0     -- (0)        0.00         1.00
B wins/Total   0/0     0/0     0/0     0/0     0/0     0/0     -- (0)        0.00         1.00
C wins/Total   0/0     0/0     0/0     0/0     0/0     0/0     -- (0)        0.00         1.00
D wins/Total   0/0     0/0     0/0     0/0     0/0     0/0     -- (0)        0.00         1.00
E wins/Total   0/0     0/0     0/0     0/0     0/0     0/0     -- (0)        0.00         1.00
F wins/Total   0/0     0/0     0/0     0/0     0/0     0/0     -- (0)        0.00         1.00
• Columns A–F: Comparison Results
• Mean, Lower Bound, Upper Bound: Mean Score & Confidence Interval
• Row A, columns A–F: A's performance vs rest
• Row A, Mean: A's mean performance
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   0/0     1/1     0/0     0/0     0/0     0/0     1.00 (1)      0.00         1.00
B wins/Total   0/0     0/0     0/0     0/0     0/1     0/0     0.00 (1)      0.00         1.00
C wins/Total   0/0     0/0     0/0     0/0     0/0     1/1     1.00 (1)      0.00         1.00
D wins/Total   0/0     0/0     0/1     0/0     0/0     0/0     0.00 (1)      0.00         1.00
E wins/Total   0/1     0/0     0/0     0/0     0/0     0/0     0.00 (1)      0.00         1.00
F wins/Total   0/0     0/0     0/1     0/0     0/0     0/0     0.00 (1)      0.00         1.00
• Built up one comparison at a time: A vs B (A wins), B vs E (B loses), C vs F (C wins), D vs C (D loses), E vs A (E loses), F vs C (F loses)
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   13/25   16/24   11/22   16/28   20/30   13/21   0.59 (150)    0.49         0.69
B wins/Total   14/30   15/30   13/19   15/20   17/26   20/25   0.63 (150)    0.53         0.73
C wins/Total   12/28   10/22   13/23   15/28   20/24   13/25   0.55 (150)    0.45         0.65
D wins/Total   9/20    15/28   10/21   11/23   15/28   15/30   0.50 (150)    0.40         0.60
E wins/Total   8/24    11/25   6/22    14/29   14/31   10/19   0.42 (150)    0.32         0.52
F wins/Total   11/29   4/25    10/18   12/25   14/30   13/23   0.43 (150)    0.33         0.53
• B dominates E! (B's lower bound greater than E's upper bound)
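The dominance test can be written as a small check over the (mean, lower bound, upper bound) triples from the 150-comparison table. This is a sketch of the rule as stated on the slide (empirically best bandit's lower bound greater than the empirically worst bandit's upper bound). The names `STATS` and `dominated_worst` are hypothetical.

```python
# (mean, lower bound, upper bound) per bandit, copied from the table.
STATS = {
    "A": (0.59, 0.49, 0.69),
    "B": (0.63, 0.53, 0.73),
    "C": (0.55, 0.45, 0.65),
    "D": (0.50, 0.40, 0.60),
    "E": (0.42, 0.32, 0.52),
    "F": (0.43, 0.33, 0.53),
}

def dominated_worst(stats):
    """Return (best, worst) if the best bandit dominates the worst, else None."""
    best = max(stats, key=lambda b: stats[b][0])
    worst = min(stats, key=lambda b: stats[b][0])
    # Dominance: best's lower bound strictly above worst's upper bound.
    if stats[best][1] > stats[worst][2]:
        return best, worst
    return None

print(dominated_worst(STATS))
```

Here B (lower bound 0.53) dominates E (upper bound 0.52), while F (upper bound 0.53) is not yet dominated because the inequality is strict.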
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   13/25   16/24   11/22   16/28   20/30   13/21   0.58 (120)    0.49         0.67
B wins/Total   14/30   15/30   13/19   15/20   15/26   20/25   0.62 (124)    0.51         0.73
C wins/Total   12/28   10/22   13/23   15/28   20/24   13/25   0.50 (126)    0.39         0.61
D wins/Total   9/20    15/28   10/21   11/23   15/28   15/30   0.49 (122)    0.38         0.60
E wins/Total   8/24    11/25   6/22    14/29   14/31   10/19   0.42 (150)    0.32         0.52
F wins/Total   11/29   4/25    10/18   12/25   14/30   13/23   0.42 (120)    0.31         0.53
• E is removed; comparisons against E no longer count toward the means and totals of the remaining bandits
• Next comparison: A vs B, A wins (A vs B goes from 16/24 to 17/25, A's total from 120 to 121)
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   15/30   19/29   14/28   18/33   23/30   15/25   0.56 (145)    0.46         0.66
B wins/Total   15/33   17/34   15/24   20/27   15/26   23/27   0.62 (145)    0.52         0.72
C wins/Total   13/31   11/28   14/29   15/30   20/24   16/27   0.48 (145)    0.38         0.68
D wins/Total   11/26   17/31   12/26   14/29   15/28   17/33   0.49 (145)    0.39         0.59
E wins/Total   8/24    11/25   6/22    14/29   14/31   10/19   0.42 (150)    0.32         0.52
F wins/Total   12/32   7/30    13/26   13/28   14/30   15/29   0.41 (145)    0.31         0.51
• B dominates F! (B's lower bound greater than F's upper bound)
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   15/30   19/29   14/28   18/33   23/30   15/25   0.55 (120)    0.43         0.67
B wins/Total   15/33   17/34   15/24   20/27   15/26   23/27   0.56 (118)    0.44         0.68
C wins/Total   13/31   11/28   14/29   15/30   20/24   16/27   0.45 (118)    0.33         0.57
D wins/Total   11/26   17/31   12/26   14/29   15/28   17/33   0.48 (112)    0.36         0.60
E wins/Total   8/24    11/25   6/22    14/29   14/31   10/19   0.42 (150)    0.32         0.52
F wins/Total   12/32   7/30    13/26   13/28   14/30   15/29   0.41 (145)    0.31         0.51
• F is removed; comparisons against F no longer count toward the means and totals of the remaining bandits
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   41/80   44/75   38/70   42/75   23/30   15/25   0.55 (300)    0.48         0.62
B wins/Total   31/69   38/78   47/78   51/75   15/26   23/27   0.56 (300)    0.49         0.63
C wins/Total   33/77   31/77   35/70   39/76   20/24   16/27   0.46 (300)    0.49         0.53
D wins/Total   30/76   27/77   35/74   35/73   15/28   17/33   0.42 (300)    0.35         0.49
E wins/Total   8/24    11/25   6/22    14/29   14/31   10/19   0.42 (150)    0.32         0.52
F wins/Total   12/32   7/30    13/26   13/28   14/30   15/29   0.41 (145)    0.31         0.51
• B dominates D! (B's lower bound greater than D's upper bound)
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   41/80   44/75   38/70   42/75   23/30   15/25   0.55 (225)    0.46         0.64
B wins/Total   31/69   38/78   47/78   51/75   15/26   23/27   0.52 (225)    0.43         0.61
C wins/Total   33/77   31/77   35/70   39/76   20/24   16/27   0.33 (225)    0.24         0.42
D wins/Total   30/76   27/77   35/74   35/73   15/28   17/33   0.42 (300)    0.35         0.49
E wins/Total   8/24    11/25   6/22    14/29   14/31   10/19   0.42 (150)    0.32         0.52
F wins/Total   12/32   7/30    13/26   13/28   14/30   15/29   0.41 (145)    0.31         0.51
• D is removed
• A dominates C! (A's lower bound greater than C's upper bound)
Beat-the-Mean
               A       B       C       D       E       F       Mean (Total)  Lower Bound  Upper Bound
A wins/Total   41/80   44/75   38/70   42/75   23/30   15/25   0.51 (80)     0.38         0.64
B wins/Total   31/69   38/78   47/78   51/75   15/26   23/27   0.52 (147)    0.45         0.49
C wins/Total   33/77   31/77   35/70   39/76   20/24   16/27   0.33 (225)    0.24         0.42
D wins/Total   30/76   27/77   35/74   35/73   15/28   17/33   0.42 (300)    0.35         0.49
E wins/Total   8/24    11/25   6/22    14/29   14/31   10/19   0.42 (150)    0.32         0.52
F wins/Total   12/32   7/30    13/26   13/28   14/30   15/29   0.41 (145)    0.31         0.51
• C is removed
• Eventually… A is last bandit remaining.
• A is declared best bandit!
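The walkthrough can be condensed into a short Python sketch of the elimination loop: pick the active bandit with the fewest comparisons, duel it against a uniformly random active bandit, and remove the empirically worst bandit once the best one dominates it. The confidence radius used here is a simple Hoeffding-style choice assumed for illustration, not necessarily the constant used in the paper. The names `beat_the_mean` and `make_duel` are hypothetical.

```python
import math
import random

def beat_the_mean(bandits, duel, horizon, delta=0.05, rng=None):
    """Return the bandit judged best after at most `horizon` duels.

    duel(b, o) must return True when b wins the comparison against o.
    """
    rng = rng or random.Random()
    active = list(bandits)
    wins = {b: {o: 0 for o in bandits} for b in bandits}
    plays = {b: {o: 0 for o in bandits} for b in bandits}

    # Sums run over active opponents only, so comparisons against
    # removed bandits stop counting toward totals and means.
    def total(b):
        return sum(plays[b][o] for o in active)

    def mean(b):
        n = total(b)
        return sum(wins[b][o] for o in active) / n if n else 0.5

    for _ in range(horizon):
        if len(active) == 1:
            break
        b = min(active, key=total)        # fewest comparisons so far
        opponent = rng.choice(active)     # a draw from the "mean bandit"
        plays[b][opponent] += 1
        if duel(b, opponent):
            wins[b][opponent] += 1
        n_min = min(total(x) for x in active)
        if n_min == 0:
            continue
        # Assumed Hoeffding-style radius (illustrative, not the paper's constant).
        radius = math.sqrt(math.log(1.0 / delta) / (2 * n_min))
        best = max(active, key=mean)
        worst = min(active, key=mean)
        if mean(best) - radius > mean(worst) + radius:
            active.remove(worst)
    return max(active, key=mean)

def make_duel(p_win, rng):
    """Wrap a win-probability function p_win(b, o) into a stochastic duel."""
    return lambda b, o: rng.random() < p_win(b, o)
```

For instance, `make_duel` could be fed the probabilities ½ + ε from the example preference table to simulate the run shown on the slides. Because every estimate is a win rate against the same mean bandit, the per-bandit scores are directly comparable, which is what the regret analysis relies on.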
Regret Guarantee
• Playing against mean bandit calibrates preference scores
– Estimates of (active) bandits directly comparable
– One estimate per active bandit = linear number of estimates
• We can bound comparisons needed to remove worst bandit
– Varies smoothly with transitivity parameter γ
– High probability bound (not possible with previous approaches!)
• We can bound the regret incurred by each comparison
– Varies smoothly with transitivity parameter γ
• Thus, we can bound the total regret with high probability:
  $R_T = O\left( \frac{\gamma^7 K}{\epsilon} \log T \right)$
– γ is typically close to 1
• We also have a similar PAC guarantee.
• Simulation experiment where γ = 1.3
• Light = Beat-the-Mean
• Dark = Interleaved Filter [Yue et al. 2009]
• Beat-the-Mean maintains linear regret guarantee
• Interleaved Filter suffers quadratic regret in the worst case
• Simulation experiment where γ = 1 (original Dueling Bandits setting)
• Light = Beat-the-Mean
• Dark = Interleaved Filter [Yue et al. 2009]
• Beat-the-Mean has high probability bound
• Beat-the-Mean exhibits significantly lower variance
Conclusions
• Online learning approach using pairwise feedback
– Well-suited for optimizing information retrieval systems from user feedback
– Models violations in preference transitivity
• Algorithm: Beat-the-Mean
– Regret linear in #bandits and logarithmic in #iterations
– Degrades smoothly with transitivity violation
– Stronger guarantees than previous work
– Empirically supported