A Proof for Theorem 3.3
Proof. Let us define $\nabla_t = (\hat{y}_t - r_t - \gamma \hat{y}_{t+1}) \nabla_f f_t(x_t)$; namely, $\nabla_t$ is short for $\nabla \ell_{d_t}(f_t)$, the gradient of the loss with respect to the predictor $f_t$. Taking the gradient of Eq. 9 with respect to $f$ and setting it to zero, we have for $f_{t+1}$ and $f_t$:
\mu \sum_{i=0}^{t} \nabla_i + \nabla R(f_{t+1}) = 0, \qquad \mu \sum_{i=0}^{t-1} \nabla_i + \nabla R(f_t) = 0.   (19)

Subtracting the second optimality condition from the first, we have:

\mu \nabla_t + \nabla R(f_{t+1}) - \nabla R(f_t) = 0.   (20)
Since $\frac{1}{\mu} R(f)$ is $1/\mu$-strongly convex, from Eq. 9, for $f_t$ and $f_{t+1}$ we have:

\Big\langle f_t, \sum_{i=0}^{t} \nabla_i \Big\rangle + (1/\mu) R(f_t) \ge \Big\langle f_{t+1}, \sum_{i=0}^{t} \nabla_i \Big\rangle + (1/\mu) R(f_{t+1}) + \frac{1}{2\mu} \| f_{t+1} - f_t \|^2,   (21)

\Big\langle f_{t+1}, \sum_{i=0}^{t-1} \nabla_i \Big\rangle + (1/\mu) R(f_{t+1}) \ge \Big\langle f_t, \sum_{i=0}^{t-1} \nabla_i \Big\rangle + (1/\mu) R(f_t) + \frac{1}{2\mu} \| f_{t+1} - f_t \|^2.   (22)

From the above two inequalities, we have:

\| \Delta f_t \| = \| f_{t+1} - f_t \| \le \mu \| \nabla_t \|.
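In more detail (this elementary step is worth recording): adding (21) and (22), the regularization terms cancel, and Cauchy--Schwarz gives

\frac{1}{\mu} \| f_{t+1} - f_t \|^2 \le \Big\langle f_t - f_{t+1}, \sum_{i=0}^{t} \nabla_i - \sum_{i=0}^{t-1} \nabla_i \Big\rangle = \langle f_t - f_{t+1}, \nabla_t \rangle \le \| f_{t+1} - f_t \| \, \| \nabla_t \|,

and dividing both sides by $\| f_{t+1} - f_t \| / \mu$ yields the displayed bound.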
Now let us consider the progress $D_R(f^*, f_t) - D_R(f^*, f_{t+1})$ in every time step $t$. Similar to the proof for TD*(0), we have:
D_R(f^*, f_t) - D_R(f^*, f_{t+1})
= R(f^*) - R(f_t) - \nabla R(f_t)(f^* - f_t) - R(f^*) + R(f_{t+1}) + \nabla R(f_{t+1})(f^* - f_{t+1})
= R(f_{t+1}) - R(f_t) - \nabla R(f_t)(f^* - f_t) + \nabla R(f_{t+1})(f^* - f_{t+1})
= R(f_{t+1}) - R(f_t) + (\nabla R(f_{t+1}) - \nabla R(f_t)) f^* + \nabla R(f_t) f_t - \nabla R(f_{t+1}) f_{t+1}
= R(f_{t+1}) - R(f_t) + (\nabla R(f_{t+1}) - \nabla R(f_t)) f^* + \nabla R(f_t) f_t - \nabla R(f_{t+1}) f_t + \nabla R(f_{t+1}) f_t - \nabla R(f_{t+1}) f_{t+1}
= R(f_{t+1}) - R(f_t) + (\nabla R(f_{t+1}) - \nabla R(f_t))(f^* - f_t) + \nabla R(f_{t+1})(f_t - f_{t+1})
= -D_R(f_t, f_{t+1}) + \mu \nabla_t (f_t - f^*),   (23)

where the last step uses $\nabla R(f_{t+1}) - \nabla R(f_t) = -\mu \nabla_t$ from Eq. 20 and the definition of the Bregman divergence $D_R(f_t, f_{t+1})$.
Since we assume that $R(f)$ is also $\alpha$-smooth, we must have:

D_R(f_t, f_{t+1}) \le \frac{\alpha}{2} \| f_t - f_{t+1} \|^2.   (24)

Now let us upper bound the progress $D_R(f^*, f_{t+1}) - D_R(f^*, f_t)$:

D_R(f^*, f_{t+1}) - D_R(f^*, f_t)
= -\mu \nabla_t (f_t - f^*) + D_R(f_t, f_{t+1})
\le -\mu \nabla_t (f_t - f^*) + \frac{\alpha}{2} \| f_t - f_{t+1} \|^2
\le -\mu \nabla_t (f_t - f^*) + \frac{\alpha \mu^2}{2} \| \nabla_t \|^2
\le 2\mu (e_t - \gamma e_{t+1})(e^*_t - e_t) + \frac{\alpha \mu^2 X^2}{2} (e_t - \gamma e_{t+1})^2
= 2\mu e_t e^*_t - 2\mu e_t^2 - 2\mu\gamma e_{t+1} e^*_t + 2\mu\gamma e_{t+1} e_t + \frac{\alpha \mu^2 X^2}{2} e_t^2 - \alpha \mu^2 X^2 \gamma e_t e_{t+1} + \frac{\alpha \mu^2 X^2 \gamma^2}{2} e_{t+1}^2
\le \frac{2\mu^2}{b} e_t^2 + \frac{b}{2} e^{*2}_t - 2\mu e_t^2 + \frac{2\mu^2 \gamma^2}{b} e_{t+1}^2 + \frac{b}{2} e^{*2}_t + \mu\gamma e_t^2 + \mu\gamma e_{t+1}^2 + \frac{\alpha \mu^2 X^2}{2} e_t^2 + \frac{\alpha \mu^2 X^2 \gamma}{2} e_t^2 + \frac{\alpha \mu^2 X^2 \gamma}{2} e_{t+1}^2 + \frac{\alpha \mu^2 X^2 \gamma^2}{2} e_{t+1}^2.   (25)
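The cross-term bounds in the last step are all instances of Young's inequality $2uv \le \lambda u^2 + v^2/\lambda$ ($\lambda > 0$); for the record, the two applications involving $e^*_t$ use $\lambda = 2/b$, e.g.,

2\mu e_t e^*_t = 2 (\mu e_t) e^*_t \le \frac{2}{b} (\mu e_t)^2 + \frac{b}{2} (e^*_t)^2 = \frac{2\mu^2}{b} e_t^2 + \frac{b}{2} e^{*2}_t,

and similarly $-2\mu\gamma e_{t+1} e^*_t \le \frac{2\mu^2\gamma^2}{b} e_{t+1}^2 + \frac{b}{2} e^{*2}_t$, while $2\mu\gamma e_{t+1} e_t \le \mu\gamma (e_t^2 + e_{t+1}^2)$ and $-\alpha\mu^2 X^2 \gamma e_t e_{t+1} \le \frac{\alpha\mu^2 X^2 \gamma}{2}(e_t^2 + e_{t+1}^2)$ use $\lambda = 1$.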
Now we are going to sum the above inequality from $t = 1$ to $T$. Note that $\sum_{t=1}^{T} e_{t+1}^2$ can be rewritten as $\sum_{t=1}^{T} e_t^2 + (-e_1^2 + e_{T+1}^2)$. When summing the above inequality, we will repeatedly use this trick to get rid of $\sum_{t=1}^{T} e_{t+1}^2$ by replacing it with $\sum_{t=1}^{T} e_t^2 + (-e_1^2 + e_{T+1}^2)$.
Summing and telescoping, and using $D_R(f^*, f_{T+1}) \ge 0$, we have:

-D_R(f^*, f_1) \le \sum_{t=1}^{T} \big( D_R(f^*, f_{t+1}) - D_R(f^*, f_t) \big)
\le \Big[ \mu^2 \Big( \frac{2}{b} + \frac{2\gamma^2}{b} + \frac{\alpha X^2}{2} (1+\gamma)^2 \Big) - 2\mu (1-\gamma) \Big] \sum_t e_t^2 + b \sum_t e^{*2}_t + C,   (26)

where $C$ is a constant that only depends on $e_1^2$ and $e_{T+1}^2$, which under our assumption are finitely bounded real numbers.
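As a sanity check on the coefficient of $\sum_t e_t^2$ in Eq. 26 (the bookkeeping is not spelled out above): collecting from Eq. 25 the $e_t^2$ terms and the $e_{t+1}^2$ terms, which the index-shift trick turns into $e_t^2$ terms, gives

\underbrace{\frac{2\mu^2}{b} - 2\mu + \mu\gamma + \frac{\alpha\mu^2 X^2}{2}(1+\gamma)}_{\text{coefficient of } e_t^2} + \underbrace{\frac{2\mu^2\gamma^2}{b} + \mu\gamma + \frac{\alpha\mu^2 X^2}{2}(\gamma + \gamma^2)}_{\text{coefficient of } e_{t+1}^2} = \mu^2 \Big( \frac{2}{b} + \frac{2\gamma^2}{b} + \frac{\alpha X^2}{2}(1+\gamma)^2 \Big) - 2\mu(1-\gamma),

with the boundary terms $-e_1^2 + e_{T+1}^2$ produced by the shift absorbed into $C$.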
Now, let us set $\mu = \frac{1-\gamma}{2/b + 2\gamma^2/b + (\alpha X^2/2)(1+\gamma)^2}$; after rearranging terms, we have:

\sum e_t^2 \le \frac{2/b + 2\gamma^2/b + (\alpha X^2/2)(1+\gamma)^2}{(1-\gamma)^2} \Big( D_R(f^*, f_1) + b \sum e^{*2}_t + C \Big).   (27)
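To make the rearrangement explicit, write $B = 2/b + 2\gamma^2/b + (\alpha X^2/2)(1+\gamma)^2$, so that $\mu = (1-\gamma)/B$ and the bracket in Eq. 26 becomes

\mu^2 B - 2\mu(1-\gamma) = \mu(1-\gamma) - 2\mu(1-\gamma) = -\mu(1-\gamma).

Moving $\mu(1-\gamma)\sum_t e_t^2$ to the LHS of Eq. 26 and dividing by $\mu(1-\gamma) = (1-\gamma)^2/B$ yields Eq. 27.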
We can tighten the RHS of the above inequality by optimizing with respect to $b$. Under our assumption that $|e^*_t|$ is always bounded, we have $\sum e^{*2}_t \le T E^2$, where $E = \sup_t |e^*_t|$. We also assume that $D_R(f^*, f_1) \le R$, $R \in \mathbb{R}^+$. The RHS of the above inequality can then be upper bounded as:

\frac{2 + 2\gamma^2}{(1-\gamma)^2} \sum e^{*2}_t + \frac{2 + 2\gamma^2}{b(1-\gamma)^2} (R + C) + \frac{(\alpha X^2/2)(1+\gamma)^2}{(1-\gamma)^2} b \sum e^{*2}_t + \frac{(\alpha X^2/2)(1+\gamma)^2}{(1-\gamma)^2} (R + C)
\le \frac{2 + 2\gamma^2}{(1-\gamma)^2} \sum e^{*2}_t + \frac{(\alpha X^2/2)(1+\gamma)^2}{(1-\gamma)^2} (R + C) + \frac{2(1+\gamma)^2 X E}{(1-\gamma)^2} \sqrt{\alpha (R+C) T},   (28)

where the last inequality comes from the fact that $\sum e^{*2}_t \le T E^2$ and from setting $b = \sqrt{\frac{4(R+C)}{\alpha X^2 E^2 T}} = O(1/\sqrt{T})$.
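Concretely, the two $b$-dependent terms behave as $A/b + Bb$ with $A = \frac{(2+2\gamma^2)(R+C)}{(1-\gamma)^2}$ and $B = \frac{(\alpha X^2/2)(1+\gamma)^2 T E^2}{(1-\gamma)^2}$; plugging in the stated $b$ gives

\frac{A}{b} + Bb = \frac{(1+\gamma^2) X E \sqrt{\alpha (R+C) T}}{(1-\gamma)^2} + \frac{(1+\gamma)^2 X E \sqrt{\alpha (R+C) T}}{(1-\gamma)^2} \le \frac{2(1+\gamma)^2 X E \sqrt{\alpha (R+C) T}}{(1-\gamma)^2},

using $1 + \gamma^2 \le (1+\gamma)^2$ for $\gamma \ge 0$; this is the last term of Eq. 28.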
Substituting $b$ back into $\mu$, we have:

\mu = \frac{1-\gamma}{(1+\gamma^2) \sqrt{\frac{\alpha X^2 E^2 T}{R+C}} + (\alpha X^2/2)(1+\gamma)^2} = O(1/\sqrt{T}).   (29)

With this $\mu$, for $\sum e_t^2$ we have:
\sum e_t^2 \le \frac{2 + 2\gamma^2}{(1-\gamma)^2} \sum e^{*2}_t + \frac{(\alpha X^2/2)(1+\gamma)^2}{(1-\gamma)^2} (R + C) + \frac{2(1+\gamma)^2 X E}{(1-\gamma)^2} \sqrt{\alpha (R+C) T}
= \frac{2 + 2\gamma^2}{(1-\gamma)^2} \sum e^{*2}_t + O(\sqrt{T}).   (30)
Now, for the average prediction error, we divide both sides of the above inequality by $T$ and take $T \to \infty$, which gives:

\lim_{T\to\infty} \frac{\sum e_t^2}{T} \le \frac{2 + 2\gamma^2}{(1-\gamma)^2} \lim_{T\to\infty} \frac{\sum e^{*2}_t}{T}.   (31)

Hence we prove the theorem.
B Proof for Theorem 3.4

Proof. The proof is similar to the one for OMD-TD*(0). Again, we start by quantifying the progress made from step $t$ to $t+1$:
\| f_{t+1} - f^* \|^2 - \| f_t - f^* \|^2 = 2 \Delta f_t (f_t - f^*) + \| \Delta f_t \|^2   (32)
= \frac{2\mu}{1 + \mu K(x_t, x_t)} (e_t - \gamma e_{t+1})(e^*_t - e_t) + \frac{\mu^2 X^2}{(1 + \mu K(x_t, x_t))^2} (e_t - \gamma e_{t+1})^2
\le \frac{2\mu}{1 + \mu K(x_t, x_t)} (e_t - \gamma e_{t+1})(e^*_t - e_t) + \mu^2 X^2 (e_t - \gamma e_{t+1})^2.   (33)
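For the reader's convenience, here is a minimal sketch of where the equality above comes from, assuming the implicit TD*(0) step solves to the proximal closed form

f_{t+1} = f_t - \frac{\mu}{1 + \mu K(x_t, x_t)} (\hat{y}_t - r_t - \gamma \hat{y}_{t+1}) \, k(x_t, \cdot),

so that, with $\hat{y}_t - r_t - \gamma \hat{y}_{t+1} = e_t - \gamma e_{t+1}$ and $f_t(x_t) - f^*(x_t) = e_t - e^*_t$,

2 \Delta f_t (f_t - f^*) = \frac{2\mu}{1 + \mu K(x_t, x_t)} (e_t - \gamma e_{t+1})(e^*_t - e_t), \qquad \| \Delta f_t \|^2 = \frac{\mu^2 K(x_t, x_t)}{(1 + \mu K(x_t, x_t))^2} (e_t - \gamma e_{t+1})^2 \le \frac{\mu^2 X^2}{(1 + \mu K(x_t, x_t))^2} (e_t - \gamma e_{t+1})^2,

using $K(x_t, x_t) \le X^2$.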
Now, for notational simplicity, let us define $\xi_t = \frac{\mu}{1 + \mu K(x_t, x_t)}$. We have $\mu/(1 + \mu X^2) \le \xi_t \le \mu$. Substituting $\xi_t$ into the above inequality, we have:
\| f_{t+1} - f^* \|^2 - \| f_t - f^* \|^2
\le 2\xi_t (e_t - \gamma e_{t+1})(e^*_t - e_t) + \mu^2 X^2 (e_t - \gamma e_{t+1})^2   (34)
= 2\xi_t e_t e^*_t - 2\xi_t e_t^2 - 2\xi_t \gamma e_{t+1} e^*_t + 2\xi_t \gamma e_{t+1} e_t + \mu^2 X^2 e_t^2 - 2\mu^2 X^2 \gamma e_t e_{t+1} + \mu^2 X^2 \gamma^2 e_{t+1}^2
\le \frac{2\xi_t^2}{b} e_t^2 + \frac{b}{2} e^{*2}_t - 2\xi_t e_t^2 + \frac{2\xi_t^2 \gamma^2}{b} e_{t+1}^2 + \frac{b}{2} e^{*2}_t + \xi_t \gamma e_t^2 + \xi_t \gamma e_{t+1}^2 + \mu^2 X^2 e_t^2 + \mu^2 X^2 \gamma e_t^2 + \mu^2 X^2 \gamma e_{t+1}^2 + \mu^2 X^2 \gamma^2 e_{t+1}^2
\le \frac{2\mu^2}{b} e_t^2 + \frac{b}{2} e^{*2}_t - \frac{2\mu}{1 + \mu X^2} e_t^2 + \frac{2\mu^2 \gamma^2}{b} e_{t+1}^2 + \frac{b}{2} e^{*2}_t + \mu\gamma e_t^2 + \mu\gamma e_{t+1}^2 + \mu^2 X^2 e_t^2 + \mu^2 X^2 \gamma e_t^2 + \mu^2 X^2 \gamma e_{t+1}^2 + \mu^2 X^2 \gamma^2 e_{t+1}^2,

where we used $\xi_t \le \mu$ for the positive terms and $\xi_t \ge \mu/(1 + \mu X^2)$ for the negative term.
Now let us sum the above inequality from $t = 1$ to $T$, using the same index-shift trick as in the proof of Theorem 3.3. We have:

-\| f_1 - f^* \|^2 \le \Big[ \mu^2 \Big( \frac{2}{b} + \frac{2\gamma^2}{b} + X^2 (1+\gamma)^2 \Big) - 2\mu \Big( \frac{1}{1 + \mu X^2} - \gamma \Big) \Big] \sum e_t^2 + b \sum e^{*2}_t + C,   (35)
where $C$ again is a bounded non-negative constant that only depends on $e_1$ and $e_{T+1}$. We first need $1/(1 + \mu X^2) > \gamma$; hence we need $\mu < (1/\gamma - 1)/X^2$. Eventually we will set $\mu = O(1/\sqrt{T})$, as we will show later (and we assume $T$ is big enough). Hence we here simply assume that $\mu \le \frac{1}{2}(1/\gamma - 1)/X^2$, which gives $\frac{1}{1 + \mu X^2} - \gamma \ge \frac{1-\gamma}{1+\gamma}$.
Now let us rearrange the terms in Eq. 35. We move the term containing $\sum e_t^2$ to the LHS and the term $\| f_1 - f^* \|^2$ to the RHS, and set $\mu = \frac{1-\gamma}{(1+\gamma)(2/b + 2\gamma^2/b + X^2(1+\gamma)^2)}$. We have:

\sum e_t^2 \le \frac{(1+\gamma)^2 \big( 2/b + 2\gamma^2/b + X^2 (1+\gamma)^2 \big)}{(1-\gamma)^2} \Big( \| f_1 - f^* \|^2 + b \sum e^{*2}_t + C \Big).   (36)
Based on our assumptions, $\| f_1 - f^* \|^2 \le R$ and $\sum e^{*2}_t \le E^2 T$. We can further tighten the upper bound by optimizing over $b$. The RHS of the above inequality can be upper bounded as:

(1+\gamma)^2 \Big[ \frac{2 + 2\gamma^2}{(1-\gamma)^2} \sum e^{*2}_t + \frac{2 + 2\gamma^2}{b(1-\gamma)^2} (R + C) + \frac{X^2 (1+\gamma)^2}{(1-\gamma)^2} b \sum e^{*2}_t + \frac{X^2 (1+\gamma)^2}{(1-\gamma)^2} (R + C) \Big].   (37)
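As in the proof of Theorem 3.3, the two $b$-dependent terms in Eq. 37 trade off as $A/b + Bb$, here with $A = \frac{(2+2\gamma^2)(1+\gamma)^2 (R+C)}{(1-\gamma)^2}$ and $B = \frac{X^2 (1+\gamma)^4 E^2 T}{(1-\gamma)^2}$; with the choice of $b$ made below, both scale as $O(\sqrt{T})$:

\frac{A}{b} + Bb = \frac{(1+\gamma^2)(1+\gamma)^2 X E \sqrt{2(R+C)T}}{(1-\gamma)^2} + \frac{(1+\gamma)^4 X E \sqrt{2(R+C)T}}{(1-\gamma)^2} \le \frac{2(1+\gamma)^4 X E \sqrt{2(R+C)T}}{(1-\gamma)^2},

again using $1 + \gamma^2 \le (1+\gamma)^2$.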
Similar to the proof for OMD-TD*(0), we set $b = \sqrt{\frac{2(R+C)}{X^2 E^2 T}} = O(1/\sqrt{T})$. Substituting $b$ back into the expression for $\mu$, we have:

\mu = \frac{1-\gamma}{(1+\gamma) \Big( (1+\gamma^2) \sqrt{\frac{2 X^2 E^2 T}{R+C}} + X^2 (1+\gamma)^2 \Big)} = O(1/\sqrt{T}).   (38)

Substituting this $\mu$ into the above expression, we have:
\sum e_t^2 \le (1+\gamma)^2 \Big[ \frac{2 + 2\gamma^2}{(1-\gamma)^2} \sum e^{*2}_t + \frac{X^2 (1+\gamma)^2}{(1-\gamma)^2} (R + C) + \frac{2(1+\gamma)^2 X E}{(1-\gamma)^2} \sqrt{2(R+C)T} \Big]
= \frac{(2 + 2\gamma^2)(1+\gamma)^2}{(1-\gamma)^2} \sum e^{*2}_t + O(\sqrt{T}).   (39)
For the average prediction error, we divide both sides of the above inequality by $T$ and take $T \to \infty$, which gives:

\lim_{T\to\infty} \frac{\sum e_t^2}{T} \le \frac{(2 + 2\gamma^2)(1+\gamma)^2}{(1-\gamma)^2} \lim_{T\to\infty} \frac{\sum e^{*2}_t}{T}.   (40)

Hence we prove the theorem.