Regret to the Best vs. Regret to the Average
Eyal Even-Dar
Michael Kearns
Yishay Mansour
UPenn + Tel Aviv Univ.
Jennifer Wortman
Slides: Csaba
Motivation
Expert algorithms attempt to control regret to the return of the best expert.
What about regret to the average return? Same bound! Weak???
EW: w_{i,1} = 1, w_{i,t} = w_{i,t-1}·e^{η g_{i,t}}, p_{i,t} = w_{i,t}/W_t, W_t = Σ_i w_{i,t}
Example:
E1: 1 0 1 0 1 0 1 0 1 0 …
E2: 0 1 0 1 0 1 0 1 0 1 …
G_{A,T} = T/2 - c·T^{1/2}
G^+_T = G^-_T = G^0_T = T/2
⇒ R^+_T = R^0_T = c·T^{1/2}: the regret to the average is just as large as the regret to the best.
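A quick simulation of this example, assuming the usual tuning η ≈ sqrt(ln N / T); it shows A's shortfall to the average growing like c·T^{1/2}:

```python
import math

def ew_on_alternating(T, eta):
    """Run exponential weights (EW) on the two-expert alternating example.
    Expert 1 gains 1 on odd steps, expert 2 on even steps."""
    w = [1.0, 1.0]                      # unnormalized weights w_{i,t}
    gain_A = 0.0                        # cumulated gain G_{A,T}
    for t in range(1, T + 1):
        W = sum(w)
        p = [wi / W for wi in w]        # normalized weights p_{i,t}
        g = [1.0, 0.0] if t % 2 == 1 else [0.0, 1.0]
        gain_A += sum(pi * gi for pi, gi in zip(p, g))
        w = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]
    return gain_A

T = 10_000
eta = math.sqrt(math.log(2) / T)        # standard tuning for regret to the best
G_A = ew_on_alternating(T, eta)
print(T / 2 - G_A)                      # regret to the average, grows like c*sqrt(T)
```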
Notation - gains
g_{i,t} ∈ [0,1] – gains
g = (g_{i,t}) – sequence of gains
G_{i,T}(g) = Σ_{t=1..T} g_{i,t} – cumulated gains
G^0_T(g) = (Σ_i G_{i,T}(g)) / N – average gain
G^-_T(g) = min_i G_{i,T}(g) – worst gain
G^+_T(g) = max_i G_{i,T}(g) – best gain
G^D_T(g) = Σ_i D_i G_{i,T}(g) – weighted avg. gain
Notation - algorithms
w_{i,t} – unnormalized weights
p_{i,t} = w_{i,t}/W_t, W_t = Σ_i w_{i,t} – normalized weights
g_{A,t} = Σ_i p_{i,t} g_{i,t} – gain of A
G_{A,T}(g) = Σ_t g_{A,t} – cumulated gain of A
Notation - regret
Regret to the…
R^+_T(g) = max{ G^+_T(g) - G_{A,T}(g), 1 } – best
R^-_T(g) = max{ G^-_T(g) - G_{A,T}(g), 1 } – worst
R^0_T(g) = max{ G^0_T(g) - G_{A,T}(g), 1 } – avg.
R^D_T(g) = max{ G^D_T(g) - G_{A,T}(g), 1 } – dist. D
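A small sketch of the quantities just defined (the function names are my own):

```python
import numpy as np

def gains_summary(g, D=None):
    """g: T x N array of per-step gains g_{i,t} in [0,1].
    Returns cumulated, average, worst, best, and D-weighted gains."""
    G = g.sum(axis=0)                    # G_{i,T}, cumulated gain of each expert
    summary = {
        "per_expert": G,
        "average":   G.mean(),           # G^0_T
        "worst":     G.min(),            # G^-_T
        "best":      G.max(),            # G^+_T
    }
    if D is not None:
        summary["weighted"] = float(np.dot(D, G))   # G^D_T
    return summary

def regret(G_ref, G_A):
    """Regret to a reference gain, capped below at 1 as in the definitions above."""
    return max(G_ref - G_A, 1.0)
```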
Goal
Algorithm A is “nice” if ..
R^+_{A,T} = O(T^{1/2})
R^0_{A,T} ≤ 1
Program:
Examine existing algorithms (“difference algorithms”) – lower bound
Show “nice” algorithms
Show that no substantial further improvement is possible
“Difference” algorithms
Def:
A is a difference algorithm if, for N = 2 and g_{i,t} ∈ {0,1}: p_{1,t} = f(d_t), p_{2,t} = 1 - f(d_t), where d_t = G_{1,t} - G_{2,t}
Examples:
EW: w_{i,t} = e^{η G_{i,t}}
FPL: choose argmax_i ( G_{i,t} + Z_{i,t} )
Prod: w_{i,t} = Π_{s≤t} (1 + η g_{i,s}) = (1+η)^{G_{i,t}}
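A sketch of the three rules for N = 2, written as functions of the difference d alone (which is what makes them difference algorithms); the exponential perturbation used for FPL is one common choice, not necessarily the one assumed here:

```python
import math, random

# For N = 2 and binary gains, each rule below picks p_{1,t} as a function of
# the cumulated-gain difference d_t = G_{1,t} - G_{2,t} only.

def ew_p1(d, eta):
    # EW: w_i = e^{eta * G_i}  =>  p_1 = e^{eta d} / (e^{eta d} + 1)
    return 1.0 / (1.0 + math.exp(-eta * d))

def prod_p1(d, eta):
    # Prod: w_i = (1 + eta)^{G_i}  =>  p_1 = 1 / (1 + (1 + eta)^{-d})
    return 1.0 / (1.0 + (1.0 + eta) ** (-d))

def fpl_p1(d, scale, samples=10_000):
    # FPL: play argmax_i (G_i + Z_i); with i.i.d. perturbations Z_i, the
    # probability of picking expert 1 depends on d only (Monte Carlo estimate).
    wins = sum(d + random.expovariate(1 / scale) > random.expovariate(1 / scale)
               for _ in range(samples))
    return wins / samples
```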
A lower bound for difference
algorithms
Theorem:
If A is a difference algorithm, then there exist sequences g, g' (tuned to A) such that
R^+_{A,T}(g) · R^0_{A,T}(g') ≥ R^+_{A,T}(g) · R^-_{A,T}(g') = Ω(T)
Hence, for R^+_{A,T} = max_g R^+_{A,T}(g), R^-_{A,T} = max_g R^-_{A,T}(g), R^0_{A,T} = max_g R^0_{A,T}(g):
R^+_{A,T} · R^0_{A,T} ≥ R^+_{A,T} · R^-_{A,T} = Ω(T)
Proof
Assume T is even and p_{1,1} ≤ 1/2 (w.l.o.g., else swap the experts).
g: E1: 1 1 1 1 1 1 1 1 …   E2: 0 0 0 0 0 0 0 0 …
τ: first time t with p_{1,t} ≥ 2/3 ⇒ R^+_{A,T}(g) ≥ τ/3
(before τ, A puts weight < 2/3 on expert 1, losing at least 1/3 per step to the best)
Since p_{1,1} ≤ 1/2 and p_{1,τ} ≥ 2/3, ∃ t' ∈ {2,3,…,τ} s.t. p_{1,t'} - p_{1,t'-1} ≥ 1/(6τ)
Proof/2
g': same as g for the first t'-1 steps, then alternate, then reverse:
E1: 1 1 … 1 (t'-1 steps) | 0 1 0 1 … 0 1 (T - 2(t'-1) steps) | 0 0 … 0 (t'-1 steps)
E2: 0 0 … 0              | 1 0 1 0 … 1 0                     | 1 1 … 1
Middle block: d_t oscillates between t'-1 and t'-2, so p_{1,t} alternates between p_{1,t'} and p_{1,t'-1}; each pair of steps yields gain ≤ 1 - 1/(6τ).
First and last blocks: their gains telescope, contributing at most τ in total.
G^+_T = G^-_T = G^0_T = T/2
G_{A,T}(g') ≤ τ + (T - 2τ)/2 · (1 - 1/(6τ))
⇒ R^-_{A,T}(g') ≥ (T - 2τ)/(12τ)
⇒ R^+_{A,T}(g) · R^-_{A,T}(g') ≥ (T - 2τ)/36 = Ω(T)
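The arithmetic behind the last two lines, using G^-_T(g') = T/2 and R^+_{A,T}(g) ≥ τ/3 (if τ > T/4, then R^+_{A,T}(g) ≥ τ/3 = Ω(T) already):

```latex
\[
R^-_{A,T}(g') \;\ge\; \frac{T}{2} - G_{A,T}(g')
  \;\ge\; \frac{T}{2} - \tau - \frac{T-2\tau}{2}\Bigl(1 - \frac{1}{6\tau}\Bigr)
  \;=\; \frac{T-2\tau}{12\tau},
\qquad
R^+_{A,T}(g)\, R^-_{A,T}(g') \;\ge\; \frac{\tau}{3}\cdot\frac{T-2\tau}{12\tau}
  \;=\; \frac{T-2\tau}{36} \;=\; \Omega(T) \quad \text{for } \tau \le T/4 .
\]
```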
Tightness
We know that for difference algorithms
R^+_{A,T} · R^0_{A,T} ≥ R^+_{A,T} · R^-_{A,T} = Ω(T)
Can a (difference) algorithm achieve this?
Theorem: EW = EW(η), with appropriately tuned η = η(α), 0 ≤ α ≤ 1/2, has
R^+_{EW,T} ≤ T^{1/2+α} (1 + ln N)
R^0_{EW,T} ≤ T^{1/2-α}
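One way to see where such a tradeoff can come from is the standard EW potential argument; this is a sketch, and the paper's exact tuning η(α) and constants may differ:

```latex
% Standard EW potential bounds (gains in [0,1]):
\[
\eta G^+_T - \ln N \;\le\; \ln\frac{W_{T+1}}{W_1} \;\le\; \eta\,G_{EW,T} + \frac{\eta^2 T}{8},
\qquad
\eta G^0_T \;\le\; \ln\frac{W_{T+1}}{W_1}
\quad\text{(Jensen on } W_{T+1}/W_1 = \tfrac1N\textstyle\sum_i e^{\eta G_{i,T}}\text{)} .
\]
% Hence:
\[
R^+_{EW,T} \le \frac{\ln N}{\eta} + \frac{\eta T}{8},
\qquad
R^0_{EW,T} \le \frac{\eta T}{8};
\qquad
\eta = T^{-(1/2+\alpha)} \;\Rightarrow\;
R^+_{EW,T} \le T^{1/2+\alpha}\ln N + \tfrac18\,T^{1/2-\alpha},
\quad
R^0_{EW,T} \le \tfrac18\,T^{1/2-\alpha} .
\]
```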
Breaking the frontier
What’s wrong with the difference algorithms?
They are designed to find the best expert with low regret (fast)…
…but they pay no attention to the average gain and how it compares with the best gain.
BestWorst(A)
G^+_T - G^-_T: the spread of the cumulated gains
Idea: stay with the average until the spread becomes large; then switch to learning (using algorithm A).
When the spread is large enough, G^0_T = G_{BW(A),T} ≫ G^-_T ⇒ “nothing” to lose.
Spread threshold: NR, where R = R_{T,N} is a bound on the regret of A.
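A minimal sketch of the BestWorst idea, assuming a base algorithm object with get_weights/update methods (my own naming); the base algorithm is only consulted after the switch:

```python
import numpy as np

class BestWorst:
    """Play the uniform average until the spread G+_t - G-_t exceeds N*R,
    then hand control to the base algorithm A (a sketch of BW(A))."""
    def __init__(self, base_alg, N, R):
        self.A, self.N, self.R = base_alg, N, R
        self.G = np.zeros(N)            # cumulated gains of the experts
        self.switched = False

    def get_weights(self):
        if self.switched:
            return self.A.get_weights()
        return np.full(self.N, 1.0 / self.N)   # follow the average

    def update(self, gains):
        self.G += gains
        if self.switched:
            self.A.update(gains)
        elif self.G.max() - self.G.min() >= self.N * self.R:
            self.switched = True        # spread is large: switch to A
```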
BestWorst(A)
Theorem: R^+_{BW(A),T} = O(NR) and G_{BW(A),T} ≥ G^-_T.
Proof:
Before the switch the spread is < NR, so following the average loses < NR to the best; after the switch A adds regret ≤ R, giving R^+_{BW(A),T} = O(NR).
At the time of the switch,
G_{BW(A)} ≥ (G^+ + (N-1)·G^-)/N.
Since G^+ ≥ G^- + NR, G_{BW(A)} ≥ G^- + R.
PhasedAggression(A, R, D)
for k = 1 : log2(R) do
    α := 2^{k-1}/R
    A.reset(); s := 0                      // local time, new phase
    while (G^+_s - G^D_s < 2R) do          // gains cumulated within the phase
        s := s + 1
        q_s := A.getNormedWeights(g_{s-1})
        p_s := α·q_s + (1-α)·D
    end
end
A.reset()
run A until time T
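A runnable sketch of the pseudocode above, assuming a generic base-algorithm interface (reset / get_weights / update, my own naming) and numpy arrays for D and the per-step gains:

```python
import math
import numpy as np

def phased_aggression(A, R, D, gain_stream, T):
    """Sketch of PhasedAggression(A, R, D).
    A: base expert algorithm with regret bound R to the best expert.
    D: fixed prior distribution over the N experts (numpy array).
    gain_stream: iterator of per-step gain vectors g_t in [0,1]^N."""
    t = 0
    plays = []                                   # the distributions p_t actually played
    for k in range(1, int(math.log2(R)) + 1):
        alpha = 2 ** (k - 1) / R                 # aggression level of phase k
        A.reset()
        G_phase = np.zeros_like(D)               # gains cumulated within the phase
        while t < T and G_phase.max() - np.dot(D, G_phase) < 2 * R:
            q = A.get_weights()
            p = alpha * q + (1 - alpha) * D      # mix A with the prior D
            g = next(gain_stream); t += 1
            plays.append(p)
            A.update(g)
            G_phase += g
    A.reset()                                    # final phase: run A alone
    while t < T:
        plays.append(A.get_weights())
        g = next(gain_stream); t += 1
        A.update(g)
    return plays
```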
PA(A,R,D) – Theorem
Theorem:
Let A be any algorithm with regret R = R_{T,N} to the best expert, and D any distribution.
Then for PA = PA(A, R, D):
R^+_{PA,T} ≤ 2R (log R + 1)
R^D_{PA,T} ≤ 1
Proof
Consider local time s during phase k (mixing weight α = 2^{k-1}/R).
D and A “share” the gains and the regret:
G^+_s - G_{PA,s} < (2^{k-1}/R)·R + (1 - 2^{k-1}/R)·2R < 2R
G^D_s - G_{PA,s} ≤ (2^{k-1}/R)·R = 2^{k-1}
What happens at the end of the phase? Then G^+_s - G^D_s ≥ 2R, so
G_{PA,s} - G^D_s ≥ (2^{k-1}/R)·(G^+_s - R - G^D_s) ≥ (2^{k-1}/R)·R = 2^{k-1}
What if PA ends in phase k at time T?
G^+_T - G_{PA,T} ≤ 2R·k ≤ 2R (log R + 1)
G^D_T - G_{PA,T} ≤ 2^{k-1} - Σ_{j=1..k-1} 2^{j-1} = 2^{k-1} - (2^{k-1} - 1) = 1
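A concrete instance with R = 8, i.e. log R = 3 mixing phases with α = 1/8, 1/4, 1/2, ending in phase k = 3:

```latex
\[
G^D_T - G_{PA,T} \;\le\; 2^{2} - \bigl(2^{0} + 2^{1}\bigr) \;=\; 4 - 3 \;=\; 1,
\qquad
G^+_T - G_{PA,T} \;\le\; 2R\cdot k \;=\; 48 \;\le\; 2R(\log R + 1) \;=\; 64 .
\]
```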
General lower bounds
Theorem:
R^+_{A,T} = O(T^{1/2}) ⇒ R^0_{A,T} = Ω(T^{1/2})
R^+_{A,T} ≤ (T log T)^{1/2}/10 ⇒ R^0_{A,T} = Ω(T^α), where α ≥ 0.02
Compare this with
R^+_{PA,T} ≤ 2R (log R + 1), R^D_{PA,T} ≤ 1,
where R = (T log N)^{1/2}
Conclusions
Achieving constant regret to the average is a
reasonable goal.
“Classical” algorithms do not have this property; instead they satisfy R^+_{A,T} · R^0_{A,T} = Ω(T).
Modification: learn only when it makes sense, i.e., when the best is much better than the average.
PhasedAggression: optimal tradeoff.
Can we remove the dependence on T?