
Regret to the Best
vs.
Regret to the Average
Eyal Even-Dar
Michael Kearns
Yishay Mansour
Jennifer Wortman
UPenn + Tel Aviv Univ.

Slides: Csaba
Motivation

- Expert algorithms attempt to control the regret to the return of the best expert.
- What about the regret to the average return?
- The standard analysis gives the same $O(\sqrt{T})$ bound. Weak: the average should be much easier to track!
- EW: $w_{i,1}=1$, $w_{i,t}=w_{i,t-1}e^{\eta g_{i,t}}$, $p_{i,t}=w_{i,t}/W_t$, $W_t=\sum_i w_{i,t}$
- Example (two experts, alternating gains):
  E1: 1 0 1 0 1 0 1 0 1 0 ...
  E2: 0 1 0 1 0 1 0 1 0 1 ...
- With the usual tuning $\eta=\Theta(T^{-1/2})$: $G_{A,T}=T/2-c\sqrt{T}$, while $G_T^+=G_T^-=G_T^0=T/2$
- So $R_T^+ = R_T^0 = c\sqrt{T}$: the regret to the average is as large as the regret to the best.
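To make the example concrete, here is a quick simulation of EW on the alternating sequence; a sketch, where the tuning $\eta=\sqrt{8\ln N/T}$ is a standard textbook choice rather than anything from the slides:

    import math

    T = 10_000
    eta = math.sqrt(8 * math.log(2) / T)    # standard EW tuning, Theta(1/sqrt(T))

    w = [1.0, 1.0]                          # unnormalized weights w_{i,t}
    gain_alg = 0.0
    for t in range(T):
        g = [1.0, 0.0] if t % 2 == 0 else [0.0, 1.0]       # alternating gains
        W = sum(w)
        p = [wi / W for wi in w]            # normalized weights p_{i,t}
        gain_alg += sum(pi * gi for pi, gi in zip(p, g))   # algorithm's gain g_{A,t}
        w = [wi * math.exp(eta * gi) for wi, gi in zip(w, g)]

    print(f"G_A,T = {gain_alg:.1f}, G^+ = G^0 = {T / 2:.1f}")
    print(f"regret to best = regret to average = {T / 2 - gain_alg:.1f} ~ c*sqrt(T)")
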
Notation - gains

- $g_{i,t} \in [0,1]$ - gains
- $g = (g_{i,t})$ - sequence of gains
- $G_{i,T}(g) = \sum_{t=1}^T g_{i,t}$ - cumulated gains
- $G_T^0(g) = \frac{1}{N}\sum_i G_{i,T}(g)$ - average gain
- $G_T^-(g) = \min_i G_{i,T}(g)$ - worst gain
- $G_T^+(g) = \max_i G_{i,T}(g)$ - best gain
- $G_T^D(g) = \sum_i D_i\, G_{i,T}(g)$ - weighted avg. gain ($D$ a distribution over experts)
Notation - algorithms

- $w_{i,t}$ - unnormalized weights
- $p_{i,t} = w_{i,t}/W_t$, $W_t = \sum_i w_{i,t}$ - normalized weights
- $g_{A,t} = \sum_i p_{i,t}\, g_{i,t}$ - gain of A
- $G_{A,T}(g) = \sum_t g_{A,t}$ - cumulated gain of A
Notation - regret

Regret to the...

- $R_T^+(g) = (G_T^+(g) - G_{A,T}(g)) \vee 1$ - best
- $R_T^-(g) = (G_T^-(g) - G_{A,T}(g)) \vee 1$ - worst
- $R_T^0(g) = (G_T^0(g) - G_{A,T}(g)) \vee 1$ - avg
- $R_T^D(g) = (G_T^D(g) - G_{A,T}(g)) \vee 1$ - dist. D

($x \vee y = \max(x, y)$: regrets are capped below at 1, so that multiplicative statements such as the lower bounds below make sense.)
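For reference, these quantities are simple to compute from a gain matrix; a minimal sketch in Python (function names are mine, not from the slides):

    def cum_gains(g):
        """g[i][t] -> cumulated gain G_{i,T} of each expert i."""
        return [sum(row) for row in g]

    def best_worst_avg(g):
        """Return (G^+_T, G^-_T, G^0_T) for a gain matrix g."""
        G = cum_gains(g)
        return max(G), min(G), sum(G) / len(G)

    def regret(target_gain, alg_gain):
        """Regret to a target, capped below at 1 as in the definitions above."""
        return max(target_gain - alg_gain, 1.0)
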
Goal

Algorithm A is "nice" if

- $R_{A,T}^+ \le O(\sqrt{T})$
- $R_{A,T}^0 \le 1$, i.e., constant regret to the average
Program

- Examine existing algorithms ("difference algorithms"): a lower bound
- Show "nice" algorithms
- Show that no substantial further improvement is possible
“Difference” algorithms

Def: A is a difference algorithm if, for $N=2$ and $g_{i,t} \in \{0,1\}$,
$p_{1,t} = f(d_t)$, $p_{2,t} = 1 - f(d_t)$, where $d_t = G_{1,t} - G_{2,t}$.

Examples:

- EW: $w_{i,t} = e^{\eta G_{i,t}}$
- FPL: choose $\arg\max_i (G_{i,t} + Z_{i,t})$
- Prod: $w_{i,t} = \prod_s (1 + \eta g_{i,s}) = (1+\eta)^{G_{i,t}}$
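A quick check (not on the slide) that EW fits the definition: for $N=2$ its normalized weight depends on the cumulated gains only through their difference,

$$p_{1,t} = \frac{e^{\eta G_{1,t}}}{e^{\eta G_{1,t}} + e^{\eta G_{2,t}}}
          = \frac{1}{1 + e^{-\eta(G_{1,t} - G_{2,t})}}
          = \frac{1}{1 + e^{-\eta d_t}} = f(d_t),$$

and similarly for Prod with $(1+\eta)^{d_t}$ in place of $e^{\eta d_t}$.
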
A lower bound for difference algorithms

Theorem: If A is a difference algorithm, then there exist sequences $g$, $g'$ (tuned to A) such that
$R_{A,T}^+(g)\, R_{A,T}^0(g') \ge R_{A,T}^+(g)\, R_{A,T}^-(g') = \Omega(T)$.

Consequently, for the worst-case regrets $R_{A,T}^+ = \max_g R_{A,T}^+(g)$, $R_{A,T}^- = \max_g R_{A,T}^-(g)$, $R_{A,T}^0 = \max_g R_{A,T}^0(g)$:
$R_{A,T}^+\, R_{A,T}^0 \ge R_{A,T}^+\, R_{A,T}^- = \Omega(T)$.
Proof

Assume T is even and $p_{1,1} \le 1/2$. Consider

g:  E1: 1 1 1 1 1 1 1 1 1 1 ...
    E2: 0 0 0 0 0 0 0 0 0 0 ...

- Let $\tau$ be the first time $t$ with $p_{1,t} \ge 2/3$. Before $\tau$, A puts weight $> 1/3$ on E2 and loses that much per step to E1, so $R_{A,T}^+(g) \ge \tau/3$.
- There exists $\ell \in \{2,3,\ldots,\tau\}$ such that $p_{1,\ell} - p_{1,\ell-1} \ge 1/(6\tau)$.
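The existence of such an $\ell$ is a pigeonhole step, spelled out here:

$$\sum_{t=2}^{\tau} \big(p_{1,t} - p_{1,t-1}\big) = p_{1,\tau} - p_{1,1} \ge \frac{2}{3} - \frac{1}{2} = \frac{1}{6}
\;\Longrightarrow\;
\max_{2 \le t \le \tau} \big(p_{1,t} - p_{1,t-1}\big) \ge \frac{1}{6(\tau-1)} \ge \frac{1}{6\tau}.$$
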
Proof/2

g', built around the step $\ell$ with $p_{1,\ell} - p_{1,\ell-1} \ge 1/(6\tau)$:

g': E1: 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0
    E2: 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1

- Follow g up to step $\ell$, alternate for $T - 2\ell$ steps, then give all gain to E2, so that $G_T^+ = G_T^- = G_T^0 = T/2$.
- During the alternation $d_t$ oscillates between two adjacent values, so (A being a difference algorithm) $p_{1,t}$ alternates between $p_{1,\ell}$ and $p_{1,\ell-1}$; the gain of 1 always goes to the expert A currently underweights, so A gains at most $1 - 1/(6\tau)$ per pair of steps.
- $G_{A,T}(g') \le \ell + \frac{T-2\ell}{2}\left(1 - \frac{1}{6\tau}\right)$
- $\Rightarrow\; R_{A,T}^-(g') \ge \frac{T-2\ell}{12\tau}$
- $\Rightarrow\; R_{A,T}^+(g)\, R_{A,T}^-(g') \ge \frac{T-2\ell}{36}$
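For intuition, the construction can be run against a concrete difference algorithm such as EW; a simulation sketch (the fixed learning rate and the phase lengths are slight simplifications of mine, not the exact proof):

    import math

    def ew_p1(d, eta=0.05):
        """EW for N = 2 as a difference algorithm: p_1 = f(d), d = G_1 - G_2."""
        return 1.0 / (1.0 + math.exp(-eta * d))

    T = 10_000

    # On g (E1 always gains 1): find tau, the first t with p_1 >= 2/3,
    # and a step ell in {2, ..., tau} whose jump in p_1 is maximal.
    tau = next(t for t in range(1, T + 1) if ew_p1(t - 1) >= 2 / 3)
    ell = max(range(2, tau + 1), key=lambda t: ew_p1(t - 1) - ew_p1(t - 2))

    # g': follow g for ell-1 steps, alternate (E2 first), then balance with E2.
    m = ell - 1
    gain_alg, d = 0.0, 0
    for t in range(T):
        if t < m:
            g1, g2 = 1, 0                  # phase 1: all gain to E1
        elif t < T - m:
            g1, g2 = (0, 1) if (t - m) % 2 == 0 else (1, 0)   # phase 2: alternate
        else:
            g1, g2 = 0, 1                  # phase 3: all gain to E2
        p1 = ew_p1(d)
        gain_alg += p1 * g1 + (1 - p1) * g2
        d += g1 - g2

    print(f"tau = {tau}, ell = {ell}")
    print(f"G_A = {gain_alg:.1f} vs G^- = {T / 2}; "
          f"bound (T - 2*ell)/(12*tau) = {(T - 2 * ell) / (12 * tau):.1f}")
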
Tightness

- We know that for difference algorithms $R_{A,T}^+ R_{A,T}^0 \ge R_{A,T}^+ R_{A,T}^- = \Omega(T)$.
- Can a (difference) algorithm achieve this tradeoff?
- Theorem: EW = EW($\eta$), with appropriately tuned $\eta = \eta(\alpha)$, $0 \le \alpha \le 1/2$, has
  $R_{EW,T}^+ \le T^{1/2+\alpha}(1 + \ln N)$
  $R_{EW,T}^0 \le T^{1/2-\alpha}$
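The slide does not spell out the tuning; one choice consistent with the stated bounds is $\eta = T^{-(1/2+\alpha)}$ in the standard EW analysis (a sketch under that assumption, not the authors' derivation):

$$R_{EW,T}^+ \lesssim \frac{\ln N}{\eta} + \eta T = T^{1/2+\alpha} \ln N + T^{1/2-\alpha},
\qquad
R_{EW,T}^0 \lesssim \eta T = T^{1/2-\alpha}.$$
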
Breaking the frontier

- What's wrong with the difference algorithms?
- They are designed to find the best expert quickly, with low regret...
- ...but they pay no attention to the average gain and how it compares with the best gain.
BestWorst(A)

- $G_T^+ - G_T^-$: the spread of the cumulated gains.
- Idea: stay with the average until the spread becomes large; then switch to learning (using algorithm A).
- When the spread is large enough, $G_T^0 = G_{BW(A),T} \gg G_T^-$, so there is "nothing" to lose by switching.
- Spread threshold: $NR$, where $R = R_{T,N}$ is a bound on the regret of A.
BestWorst(A)

Theorem: $R_{BW(A),T}^+ = O(NR)$ and $G_{BW(A),T} \ge G_T^-$.

Proof: Until the switch, BW(A) plays the uniform average, so at the time of the switch
$G_{BW(A)} \ge (G^+ + (N-1)\,G^-)/N$.
Since at that moment $G^+ \ge G^- + NR$, this gives
$G_{BW(A)} \ge G^- + R$.
After the switch, A loses at most $R$ to the best expert, so this head start of $R$ keeps the final gain above $G_T^-$.
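A minimal sketch of the wrapper (the base-algorithm interface with reset() and getNormedWeights() mirrors the pseudocode on the next slide; everything else is my own scaffolding):

    def best_worst(A, N, R, gains):
        """BestWorst(A): play the uniform average until the spread G^+ - G^-
        reaches N*R, then switch to the base expert algorithm A for good."""
        G = [0.0] * N                      # cumulated gains of the experts
        switched, g_prev = False, None
        for g in gains:                    # g = gain vector of the round
            if not switched and max(G) - min(G) >= N * R:
                A.reset()                  # A starts fresh at the switch
                switched, g_prev = True, None
            p = A.getNormedWeights(g_prev) if switched else [1.0 / N] * N
            yield p                        # weights played this round
            G = [Gi + gi for Gi, gi in zip(G, g)]
            g_prev = g
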
PhasedAggression(A, R, D)

for k = 1 : log2(R) do
    eta := 2^(k-1) / R
    A.reset(); s := 0                      // local time, new phase
    while (G^+_s - G^D_s < 2R) do
        q_s := A.getNormedWeights(g_{s-1})
        p_s := eta q_s + (1 - eta) D       // play the mixture
        s := s + 1
    end
end
A.reset()
run A until time T                         // final phase: eta = 1
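A runnable version of the same scheme (a sketch under the same interface assumptions as above, not the authors' code):

    import math

    def phased_aggression(A, R, D, gains):
        """Mix base algorithm A (weight eta) with the fixed distribution D,
        doubling eta whenever the best expert gets 2R ahead of D."""
        N = len(D)
        G = [0.0] * N                      # per-phase cumulated expert gains
        GD = 0.0                           # per-phase cumulated gain of D
        num_phases = int(math.log2(R)) + 1 # last phase runs A alone
        k, eta, g_prev = 1, 1.0 / R, None  # eta = 2^(k-1)/R with k = 1
        A.reset()
        for g in gains:
            q = A.getNormedWeights(g_prev)
            p = [eta * qi + (1 - eta) * Di for qi, Di in zip(q, D)]
            yield p                        # weights played this round
            G = [Gi + gi for Gi, gi in zip(G, g)]
            GD += sum(Di * gi for Di, gi in zip(D, g))
            g_prev = g
            if k < num_phases and max(G) - GD >= 2 * R:
                k += 1                     # end of phase: double eta, reset
                eta = 1.0 if k == num_phases else 2 ** (k - 1) / R
                A.reset()
                G, GD, g_prev = [0.0] * N, 0.0, None
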
PA(A,R,D) – Theorem

Theorem: Let A be any algorithm with regret $R = R_{T,N}$ to the best expert, and let D be any distribution. Then PA = PA(A,R,D) satisfies
$R_{PA,T}^+ \le 2R(\log R + 1)$
$R_{PA,T}^D \le 1$
Proof

Consider a local time s during phase k, where $\eta = 2^{k-1}/R$.

- Within the phase, PA's gain is the $\eta$-mixture of A's and D's gains, so they share the regret:
  $G_s^+ - G_{PA,s} < \frac{2^{k-1}}{R} \cdot R + \left(1 - \frac{2^{k-1}}{R}\right) \cdot 2R < 2R$
  $G_s^D - G_{PA,s} \le \frac{2^{k-1}}{R} \cdot R = 2^{k-1}$
- At the end of the phase, $G_s^+ - G_s^D \ge 2R$ and $G_{A,s} \ge G_s^+ - R$, so
  $G_{PA,s} - G_s^D \ge \frac{2^{k-1}}{R}\,(G_s^+ - R - G_s^D) \ge \frac{2^{k-1}}{R} \cdot R = 2^{k-1}$.
- If PA ends in phase k at time T:
  $G_T^+ - G_{PA,T} \le 2Rk \le 2R(\log R + 1)$
  $G_T^D - G_{PA,T} \le 2^{k-1} - \sum_{j=1}^{k-1} 2^{j-1} = 2^{k-1} - (2^{k-1} - 1) = 1$
  (each completed phase j ends with PA at least $2^{j-1}$ ahead of D; the current phase costs at most $2^{k-1}$).

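As a concrete instance of the last telescoping bound (my numbers, purely illustrative): if PA ends in phase $k = 4$,

$$G_T^D - G_{PA,T} \le 2^3 - (2^0 + 2^1 + 2^2) = 8 - 7 = 1.$$
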
General lower bounds

Theorem:
- $R_{A,T}^+ = O(T^{1/2}) \;\Rightarrow\; R_{A,T}^0 = \Omega(T^{1/2})$
- $R_{A,T}^+ \le (T \log T)^{1/2}/10 \;\Rightarrow\; R_{A,T}^0 = \Omega(T^{\alpha})$, where $\alpha \ge 0.02$

Compare this with
$R_{PA,T}^+ \le 2R(\log R + 1)$, $R_{PA,T}^D \le 1$, where $R = (T \log N)^{1/2}$.
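Plugging $R = (T \log N)^{1/2}$ into the PhasedAggression bound makes the gap explicit (a quick calculation, not on the slide):

$$R_{PA,T}^+ \le 2\sqrt{T \log N}\,\big(\log\sqrt{T \log N} + 1\big) = O\big(\sqrt{T \log N}\,\log T\big),$$

so PA pays only an extra $\log T$ factor over the usual $O(\sqrt{T \log N})$ regret to the best while keeping constant regret to D; by the theorem above, some such factor is unavoidable.
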
Conclusions

- Achieving constant regret to the average is a reasonable goal.
- "Classical" (difference) algorithms do not have this property: they satisfy $R_{A,T}^+ R_{A,T}^0 = \Omega(T)$.
- Modification: learn only when it makes sense, i.e., when the best is much better than the average.
- PhasedAggression: an optimal tradeoff.
- Can we remove the dependence on T?