LINEAR PREDICTION WITH SIDE INFORMATION
Csaba Szepesvári
University of Alberta
CMPUT 654
E-mail: [email protected]
UofA, November 23, 2006
OUTLINE
1. Linear prediction with side information
2. Convex analysis
3. Gradient based linear forecaster
4. Forecasters based on the polynomial potential
   - Adaptive learning rates
   - Tracking
5. Exponentiated Gradient (EG) Algorithm
6. Bibliography
LINEAR PREDICTION WITH SIDE INFORMATION
Note: x, y ∈ Rd; x · y = x^T y = Σi xi yi.
Side information (features): x1, x2, . . . , xt, . . . ∈ Rd
Targets: y1, y2, . . . , yt, . . . ∈ R
Linear expert u ∈ Rd: fu,t = σ(u · xt).
Forecaster: wt−1 = A(x1, y1, . . . , xt−1, yt−1); p̂t = σ(wt−1 · xt),
where σ : R → R is some “transfer function”.
Loss function: ℓ : R × R → [0, ∞).
Regret:
Rn(u) = L̂n − Ln(u) = Σ_{t=1}^n [ℓ(p̂t, yt) − ℓ(fu,t, yt)].
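To make the protocol concrete, here is a minimal sketch (not from the slides) of the online loop, assuming the identity transfer function σ(z) = z and the quadratic loss; all names are illustrative.

```python
import numpy as np

# Minimal sketch of the online protocol above, assuming sigma(z) = z and the
# quadratic loss; `forecaster_update` stands in for any of the algorithms below.
def run_forecaster(xs, ys, forecaster_update, w0):
    """Runs the online loop and returns the forecaster's cumulative loss."""
    w, cum_loss = w0, 0.0
    for x_t, y_t in zip(xs, ys):
        p_t = w @ x_t                       # prediction from side information x_t
        cum_loss += 0.5 * (p_t - y_t) ** 2  # incur the loss l(p_t, y_t)
        w = forecaster_update(w, x_t, y_t)  # update after y_t is revealed
    return cum_loss

def regret(xs, ys, forecaster_update, w0, u):
    """Regret against the fixed linear expert u (identity transfer)."""
    loss_u = sum(0.5 * (u @ x - y) ** 2 for x, y in zip(xs, ys))
    return run_forecaster(xs, ys, forecaster_update, w0, ) - loss_u
```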
BREGMAN DIVERGENCES
IDEA
Define a “distance” from an arbitrary convex function.
DEFINITION (LEGENDRE FUNCTION)
F : A → R is Legendre if
1. A ⊂ Rd, A ≠ ∅, A◦ is convex;
2. F is strictly convex, ∇F exists and is continuous on A◦;
3. if xn ∈ A is such that xn → ∂A, then ‖∇F(xn)‖ → ∞ as n → ∞.
DEFINITION (BREGMAN DIVERGENCE OF F)
Let F : A → R be Legendre, F ≥ 0. Define
DF(u, v) = F(u) − F(v) − (u − v)^T ∇F(v).
DF(u, v): “divergence from u to v”.
BREGMAN DIVERGENCE: EXAMPLES
PROPOSITION
DF(u, v) ≥ 0 and DF(u, u) = 0. However, DF(u, v) ≠ DF(v, u) in most cases.
EXAMPLE (EUCLIDEAN DISTANCE)
F(x) = ½‖x‖² (‖·‖ = ‖·‖2). Then ∇F(v) = v and
DF(u, v) = ½‖u − v‖².
EXAMPLE (UNNORMALIZED KULLBACK-LEIBLER DIVERGENCE)
F(p) = Σi pi ln pi − Σi pi (unnormalized neg-entropy); pi > 0 (A = (0, ∞)^d). Then
DF(p, q) = Σi pi ln(pi/qi) + Σi (qi − pi).
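As a quick illustration (a sketch, not part of the slides), the two example divergences can be computed and their basic properties checked numerically:

```python
import numpy as np

# The two example Bregman divergences above; a small numerical sanity check.
def bregman_euclidean(u, v):
    """D_F for F(x) = 0.5*||x||_2^2: half the squared Euclidean distance."""
    return 0.5 * np.sum((u - v) ** 2)

def bregman_unnormalized_kl(p, q):
    """D_F for F(p) = sum_i p_i ln p_i - p_i, with p, q > 0 componentwise."""
    return np.sum(p * np.log(p / q) + (q - p))

p = np.array([0.2, 1.5, 0.7])
q = np.array([0.4, 0.9, 1.1])
print(bregman_unnormalized_kl(p, q) >= 0)                              # True
print(bregman_unnormalized_kl(p, q) == bregman_unnormalized_kl(q, p))  # generally False
```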
BREGMAN DIVERGENCE /2
LEMMA (“3-DIVS LEMMA”)
F : A → R Legendre, u ∈ A, v, w ∈ A◦. Then
DF(u, v) + DF(v, w) = DF(u, w) + (u − v) · (∇F(w) − ∇F(v)).
PROOF.
Homework.
BREGMAN PROJECTION
DEFINITION (BREGMAN PROJECTION)
S ⊂ Rd convex, closed, S ∩ A ≠ ∅, w ∈ A◦. Then
ProjF(w; S) = argmin_{x∈S∩A} DF(x, w).
Homework: this is well-defined!
LEMMA (GENERALIZED PYTHAGOREAN INEQUALITY)
S ⊂ Rd convex, closed, S ∩ A ≠ ∅, u ∈ S. Then for all w ∈ A◦, with w′ = ProjF(w; S),
DF(u, w) ≥ DF(u, w′) + DF(w′, w).
GENERALIZED PYTHAGOREAN INEQUALITY
For S convex, closed, u ∈ S, S ∩ A ≠ ∅, w ∈ A◦, w′ = ProjF(w; S),
DF(u, w) ≥ DF(u, w′) + DF(w′, w).
EXAMPLE (EUCLIDEAN NORM)
Let F(u) = ½‖u‖² (Euclidean norm), w′ = argmin_{x∈S} ‖x − w‖. Then
‖u − w‖² ≥ ‖u − w′‖² + ‖w′ − w‖².
NOTE
Equality if S is a hyperplane. Insight: “law of cosines”.
THE PYTHAGOREAN INEQUALITY
[Figure: the point w, its projection w′ onto the convex set S, a point u ∈ S, and the angle θw at w′. The law of cosines gives
‖u − w‖² = ‖u − w′‖² + ‖w′ − w‖² − 2‖u − w′‖·‖w′ − w‖ cos(θw),
and for convex S the angle θw is obtuse, so cos(θw) ≤ 0.]
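A quick numerical check of the generalized Pythagorean inequality (a sketch, assuming the Euclidean potential and S a halfspace, for which the projection has the simple closed form used below):

```python
import numpy as np

# Verify D_F(u, w) >= D_F(u, w') + D_F(w', w) for F(u) = 0.5*||u||^2 and
# S = {x : a.x <= b}, whose Euclidean projection is computed in closed form.
rng = np.random.default_rng(0)
a, b = rng.normal(size=3), 0.5

def proj_halfspace(w):
    slack = a @ w - b
    return w if slack <= 0 else w - slack * a / (a @ a)

def d_euc(u, v):
    return 0.5 * np.sum((u - v) ** 2)

for _ in range(1000):
    w = rng.normal(size=3)
    u = proj_halfspace(rng.normal(size=3))   # an arbitrary point of S
    w_proj = proj_halfspace(w)
    assert d_euc(u, w) >= d_euc(u, w_proj) + d_euc(w_proj, w) - 1e-12
```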
LEGENDRE DUAL
DEFINITION
Let F : A → R be a Legendre function. Its Legendre dual is F* : A* → R,
A* = {∇F(u) | u ∈ A◦},
defined by
F*(u) = sup_{v∈A} (u · v − F(v)).
PROPOSITION
If F is Legendre then so is F*. Moreover, F** = F.
LEGENDRE DUAL PROPERTIES
LEMMA (DUAL BY GRADIENT)
F(u) + F*(u′) = u · u′  ⇔  u′ = ∇F(u).
PROOF.
Let G(v) = u′ · v − F(v). G is concave; ∇G(v) = u′ − ∇F(v).
⇒: F*(u′) = sup_v G(v) = G(u), so u is a maximizer of G. Hence ∇G(u) = 0, i.e., u′ = ∇F(u).
⇐: Since u′ = ∇F(u), ∇G(u) = 0; hence u is a maximizer of G, so F*(u′) = G(u) = u · u′ − F(u).
LEGENDRE DUAL PROPERTIES /2
LEMMA (INVERSION LEMMA)
∇F* = (∇F)^{−1}.
PROOF.
u′ = ∇F(u) ⇔ F(u) + F*(u′) = u · u′,
u = ∇F*(u′) ⇔ F*(u′) + F**(u) = u · u′, and F**(u) = F(u).
LEGENDRE DUAL PROPERTIES /3
LEMMA (DUAL DIVERGENCES)
Let u, v ∈ A◦, u′ = ∇F(u), v′ = ∇F(v). Then
DF(u, v) = DF*(v′, u′).
PROOF.
Use the previous two lemmas.
LEGENDRE DUAL: EXAMPLES /1
EXAMPLE (POLYNOMIAL POTENTIALS)
Fp(u) = ½‖u‖p², p ≥ 2. Then Fp*(u) = ½‖u‖q² = Fq(u), where 1/p + 1/q = 1 ((p, q) are conjugate pairs).
(∇Fp(u))i = sgn(ui)|ui|^{p−1} / ‖u‖p^{p−2},
(∇Fp)^{−1} = ∇Fq (by the Inversion Lemma).
The only self-dual case is the Euclidean norm ‖u‖2 (p = 2 above).
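A small numerical sketch (not from the slides) of the gradient formula above and of the Inversion Lemma for the polynomial potential:

```python
import numpy as np

# Gradient of the polynomial potential F_p(u) = 0.5*||u||_p^2, and a check that
# grad F_q inverts grad F_p and that F_p(u) + F_q(grad F_p(u)) = u . grad F_p(u).
def grad_poly(u, p):
    norm_p = np.linalg.norm(u, ord=p)
    return np.sign(u) * np.abs(u) ** (p - 1) / norm_p ** (p - 2)

p = 3.0
q = p / (p - 1)                       # conjugate exponent: 1/p + 1/q = 1
u = np.array([0.5, -1.2, 2.0])

theta = grad_poly(u, p)               # map to the dual space
print(np.allclose(grad_poly(theta, q), u))              # inverse map recovers u: True
print(np.isclose(0.5 * np.linalg.norm(u, p) ** 2        # dual-by-gradient identity
                 + 0.5 * np.linalg.norm(theta, q) ** 2, u @ theta))
```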
LEGENDRE DUAL: EXAMPLES /2
EXAMPLE (EXPONENTIAL POTENTIAL)
F(u) = Σ_{i=1}^d e^{ui}. This is Legendre.
∇F(u) : u ↦ (e^{u1}, . . . , e^{ud}),
∇F^{−1} : v ↦ (ln v1, . . . , ln vd), vi > 0.
Hence (∇F*)(v) = (ln v1, . . . , ln vd).
⇒ F*(v) = Σi vi (ln vi − 1).
Note: v ∈ ∆1+ ⇒ F*(v) = −(H(v) + 1).
LEGENDRE DUAL: EXAMPLES /3
EXAMPLE (HYPERBOLIC COSINE)
F(u) = ½ Σi (e^{ui} + e^{−ui}).
(∇F(u))i = sinh(ui), ⇒
(∇F*(v))i = arcsinh(vi) = ln(√(vi² + 1) + vi)
⇒ F*(v) = Σ_{i=1}^d (vi arcsinh(vi) − √(vi² + 1)).
GRADIENT BASED LINEAR FORECASTER (GBLF)
DEFINITION (REGULAR LOSS)
ℓ : R × R → R is a regular loss if ℓ ≥ 0, convex, and for x ∈ Rd, y ∈ R,
ℓx,y(w) := ℓ(w · x, y) is differentiable.
GRADIENT BASED LINEAR FORECASTER [WARMUTH AND JAGOTA, 1997, GROVE ET AL., 2001]
Let Φ be Legendre; ℓt(w) := ℓ(w · xt, yt), λ > 0. Let
wt = ∇Φ(∇Φ*(wt−1) − λ∇ℓt(wt−1)),
where w0 = ∇Φ(0).
“DUAL GRADIENT UPDATES”
GBLF:
wt = ∇Φ(∇Φ*(wt−1) − λ∇ℓt(wt−1)).
Let θt = ∇Φ*(wt). Then
∇Φ^{−1}(wt) = ∇Φ*(wt−1) − λ∇ℓt(wt−1),
hence, using (∇Φ)^{−1} = ∇Φ*,
∇Φ*(wt) = ∇Φ*(wt−1) − λ∇ℓt(wt−1),
or
θt = θt−1 − λ∇ℓt(wt−1).
This is called the “dual gradient update”.
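A minimal sketch of one dual gradient step, assuming a squared loss and a potential Φ given through the pair (∇Φ, ∇Φ*) with ∇Φ* = (∇Φ)^{−1}; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

# One step of w_t = grad_Phi(grad_Phi_star(w_{t-1}) - lam * grad l_t(w_{t-1})),
# assuming the quadratic loss l_t(w) = 0.5*(w.x_t - y_t)^2.
def gblf_step(w_prev, x_t, y_t, lam, grad_Phi, grad_Phi_star):
    grad_loss = (w_prev @ x_t - y_t) * x_t           # gradient of the loss at w_{t-1}
    theta = grad_Phi_star(w_prev) - lam * grad_loss  # update in the dual space
    return grad_Phi(theta)                           # map back to the primal space

# Euclidean potential Phi(u) = 0.5*||u||^2: both maps are the identity, so the
# step reduces to plain online gradient descent (the Widrow-Hoff rule below).
identity = lambda u: u
w = np.zeros(3)
w = gblf_step(w, np.array([1.0, 0.5, -0.2]), 0.7, lam=0.1,
              grad_Phi=identity, grad_Phi_star=identity)
print(w)
```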
REGRET OF GBLF
THEOREM ([WARMUTH AND JAGOTA, 1997])
Let Rn(u) = L̂n − Ln(u), ℓ regular, Φ Legendre. Then
Rn(u) ≤ (1/λ) DΦ*(u, w0) + (1/λ) Σ_{t=1}^n DΦ*(wt−1, wt).
PROOF.
By the convexity of ℓt:
ℓt(wt−1) − ℓt(u) ≤ (wt−1 − u) · ∇ℓt(wt−1)
= (1/λ)(u − wt−1) · (∇Φ*(wt) − ∇Φ*(wt−1))
= (1/λ)(DΦ*(u, wt−1) − DΦ*(u, wt) + DΦ*(wt−1, wt)).
The last line follows by the “3-Divs Lemma”. Now sum over t = 1, . . . , n and drop −DΦ*(u, wn) to get the result.
EXAMPLES
An alternative form:
wt = argmin_{u∈Rd} [DΦ*(u, wt−1) + λ(ℓt(wt−1) + (u − wt−1) · ∇ℓt(wt−1))].
Note that the argument is convex!
An approximate form (convex optimization, aka the “proximal-point algorithm” [Rockafellar, 1970]); a closed-form instance for the Euclidean case is sketched below:
wt = argmin_{u∈Rd} [DΦ*(u, wt−1) + λℓt(u)].
[Widrow and Hoff, 1960]:
wt = wt−1 − λ(wt−1 · xt − yt) xt
with Φ(u) = ½‖u‖2², quadratic loss.
Exponentiated Gradient ([Kivinen and Warmuth, 1997]):
wi,t ∝ wi,t−1 e^{−λ(∇ℓt(wt−1))i}
with Φ(u) = Σi e^{ui} and a “projection” step.
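For the Euclidean potential and the quadratic loss, the proximal-point (implicit) update has a closed form obtained from the first-order optimality condition; the sketch below (illustrative, not from the slides) contrasts it with the explicit Widrow-Hoff step.

```python
import numpy as np

# w_t = argmin_u [0.5*||u - w_{t-1}||^2 + lam*0.5*(u.x - y)^2]; solving
# (u - w) + lam*(u.x - y)*x = 0 gives the closed form used in proximal_step.
def widrow_hoff_step(w, x, y, lam):
    return w - lam * (w @ x - y) * x                          # explicit (gradient) step

def proximal_step(w, x, y, lam):
    return w - lam * (w @ x - y) * x / (1.0 + lam * (x @ x))  # implicit (proximal) step

w = np.zeros(3)
x, y = np.array([1.0, -2.0, 0.5]), 1.0
print(widrow_hoff_step(w, x, y, 0.4), proximal_step(w, x, y, 0.4))
```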
TRANSFER FUNCTIONS [HELMBOLD ET AL., 1999]
DEFINITION (NICE PAIR)
(σ, ℓ) is a nice pair if ℓ is regular and for all y ∈ R, ℓ(σ(·), y) is convex.
DEFINITION (α-SUBQUADRATIC PAIR [CESA-BIANCHI, 1999])
Let α > 0. (σ, ℓ) is α-subquadratic if
( d/dv ℓ(σ(v), y) )² ≤ α ℓ(σ(v), y)
holds for all v, y ∈ R.
Abbreviations: ℓσ(v, y) = ℓ(σ(v), y); L̂σn, Lσn(u).
GBLF WITH TRANSFER FUNCTION σ
wt = ∇Φ(∇Φ*(wt−1) − λ∇ℓσt(wt−1)),
where ℓσt(u) = ℓσ(u · xt, yt) = ℓ(σ(u · xt), yt).
EXAMPLES
EXAMPLE
Simple case: ℓ regular, σ(x) = x.
Identity transfer function, quadratic loss:
ℓ(p, y) = ½(p − y)², σ(x) = x. (1-subquadratic)
Sigmoid transfer function, entropic loss:
σ(x) = tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) ∈ [−1, 1],
ℓ(p, y) = (1+y)/2 ln((1+y)/(1+p)) + (1−y)/2 ln((1−y)/(1−p)). (2-subquadratic)
Logistic transfer function with Hellinger distance:
σ(x) = 1/(1 + e^{−x}) ∈ [0, 1],
ℓ(p, y) = (√p − √y)² + (√(1−p) − √(1−y))². (1/4-subquadratic)
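One can probe the α-subquadratic condition numerically for a candidate pair (σ, ℓ); the sketch below estimates the smallest feasible α on a grid using finite differences. It is illustrative only and not a proof of the constants quoted above.

```python
import numpy as np

# Estimate the smallest alpha with (d/dv l(sigma(v),y))^2 <= alpha * l(sigma(v),y)
# over a grid of (v, y) values, via central finite differences. Sketch only.
def estimate_alpha(sigma, loss, vs, ys, eps=1e-5):
    worst = 0.0
    for y in ys:
        for v in vs:
            l = loss(sigma(v), y)
            if l < 1e-8:            # skip points where the loss vanishes
                continue
            dl = (loss(sigma(v + eps), y) - loss(sigma(v - eps), y)) / (2 * eps)
            worst = max(worst, dl ** 2 / l)
    return worst

# Example probe: tanh transfer with the entropic loss from the slide.
entropic = lambda p, y: 0.5 * (1 + y) * np.log((1 + y) / (1 + p)) \
                      + 0.5 * (1 - y) * np.log((1 - y) / (1 - p))
vs = np.linspace(-3, 3, 61)
print(estimate_alpha(np.tanh, entropic, vs, ys=[-0.5, 0.0, 0.5]))
```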
REGRET FOR GBLF WITH A POLYNOMIAL POTENTIAL
THEOREM
Let (σ, ℓ) be an α-subquadratic nice pair, p ≥ 2, Φp(x) = ½‖x‖p². Consider GBLF with Φp and σ. Let Xp > 0 be such that max_{t=1,...,n} ‖xt‖p ≤ Xp, and let (p, q) be conjugate pairs. Then
Rσn(u) ≤ Φq(u)/λ + (p − 1)(αλ/2) Xp² L̂σn,
and with λ = 2ε/((p − 1)αXp²), ε ∈ (0, 1),
L̂σn ≤ Lσn(u)/(1 − ε) + ‖u‖q² (p − 1)αXp² / (4ε(1 − ε)).
A REPRESENTATION OF BREGMAN DIVERGENCES
LEMMA (REPRESENTATION LEMMA)
Let Φ be Legendre. Then there exists ξ ∈ Rd (on the segment joining u and v) such that
DΦ(u, v) = ½ (u − v)^T (∇²Φ)|ξ (u − v).
PROOF.
DΦ(u, v) = Φ(u) − Φ(v) − (u − v) · ∇Φ(v).
A second order Taylor expansion of Φ around v gives
Φ(u) = Φ(v) + (u − v) · ∇Φ(v) + ½ (u − v)^T (∇²Φ)|ξ (u − v),
where ξ ∈ Rd is appropriate. Combining the two equations gives the result.
A BOUND ON THE QUADRATIC FORMS
LEMMA (A BOUND FOR POTENTIALS)
Assume that Φ(u) = ψ(Σi φ(ui)), where ψ″ exists with ψ″ ≤ 0 and φ″ exists. Then
x^T ∇²Φ|ξ x ≤ ψ′(Σk φ(ξk)) Σi φ″(ξi) xi².
Proof: Just expand the l.h.s. and observe that ψ″ ≤ 0.
COROLLARY (A BOUND FOR THE POLYNOMIAL POTENTIAL)
If Φ(u) = ½ (Σi |ui|^p)^{2/p} then
x^T ∇²Φ|ξ x ≤ (p − 1)‖x‖p².
Proof: Use the previous result with ψ(u) = ½ u^{2/p}, φ(u) = |u|^p and Hölder’s inequality to bound the second sum.
PROOF OF THE THEOREM
As in the general regret theorem, since (σ, ℓ) is a nice pair:
Rσn(u) ≤ (1/λ) DΦq(u, w0) + (1/λ) Σ_{t=1}^n DΦq(wt−1, wt).
Since w0 = ∇Φp(0) = 0, DΦq(u, w0) = Φq(u).
Let θt = ∇Φq(wt); by the “Dual Divergences Lemma”, the “Representation Lemma” and the bound on the previous slide:
DΦq(wt−1, wt) = DΦp(θt, θt−1) ≤ ((p − 1)/2) ‖θt−1 − θt‖p².
By the definition of the algorithm θt − θt−1 = −λ∇ℓσt(wt−1), hence
DΦq(wt−1, wt) ≤ ((p − 1)/2) λ² ‖∇ℓσt(wt−1)‖p²
= ((p − 1)/2) λ² ( dℓ(σ(v), yt)/dv |_{v=wt−1·xt} )² ‖xt‖p²
≤ ((p − 1)/2) λ² α ℓσt(wt−1) ‖xt‖p²,
since (σ, ℓ) is α-subquadratic.
SELF-CONFIDENT FORECASTER [AUER ET AL., 2002]
SELF-CONFIDENT FORECASTER (Uq)
Let Uq > 0, S = {u ∈ Rd | Φq(u) ≤ Uq},
wt′ = ∇Φp(∇Φq(wt−1) − λt ∇ℓσt(wt−1)),
wt = ProjΦq(wt′; S),
λt = βt / ((p − 1)αXpt²), Xpt = max_{s=1,...,t} ‖xs‖p,
βt = √( kt / (kt + L̂σt) ), kt = (p − 1)αXpt² Uq,
L̂σt = Σ_{s=1}^t ℓσs(ws−1).
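The adaptive ingredients above are just running maxima and running sums; a minimal sketch of the learning-rate schedule (the mirror step and the Bregman projection onto S are left abstract; names are illustrative):

```python
import numpy as np

# Sketch of the self-confident learning-rate schedule lambda_t, computed from the
# observed ||x_s||_p and the losses l^sigma_s(w_{s-1}) accumulated so far.
def self_confident_rates(xs, losses, p, alpha, U_q):
    """Yields lambda_t for t = 1, 2, ... given the streams of x_t and losses."""
    X_pt, cum_loss = 0.0, 0.0
    for x_t, loss_t in zip(xs, losses):
        X_pt = max(X_pt, np.linalg.norm(x_t, ord=p))   # X_{p,t} = max_s ||x_s||_p
        k_t = (p - 1) * alpha * X_pt ** 2 * U_q
        cum_loss += loss_t                             # hat L^sigma_t
        beta_t = np.sqrt(k_t / (k_t + cum_loss))
        yield beta_t / ((p - 1) * alpha * X_pt ** 2)   # lambda_t
```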
REGRET FOR THE SELF-CONFIDENT FORECASTER
THEOREM
Consider the self-confident forecaster with parameter Uq > 0. Let u ∈ Rd be such that Φq(u) ≤ Uq. Then
Rσn(u) ≤ 5 √( 2(p − 1) α Xp,n² Uq Lσn(u) ) + 30 (p − 1) α Xp,n² Uq.
PROJECTED LINEAR FORECASTER: TRACKING
PROJECTED LINEAR FORECASTER (S, λ)
S ⊂ Rd convex, closed, λ > 0.
wt′ = ∇Φp(∇Φq(wt−1) − λ∇ℓσt(wt−1)),
wt = ProjΦq(wt′; S).
TRACKING REGRET
THEOREM ([HERBSTER AND WARMUTH, 2001])
Let (σ, ℓ) be a nice, α-subquadratic pair, (p, q) a conjugate pair, Uq > 0, ε ∈ (0, 1),
S = {w | Φq(w) ≤ Uq}, λ = 2ε/((p − 1)αXp²), ‖xt‖p ≤ Xp. Then for any sequence {ut}t ⊂ S,
L̂σn ≤ Lσn(⟨ut⟩)/(1 − ε) + ((p − 1)αXp²)/(2ε(1 − ε)) × ( √(2Uq) Σ_{t=1}^n ‖ut−1 − ut‖q + ‖un‖q²/2 ).
DISCUSSION
Lσn(⟨ut⟩) := Σ_{t=1}^n ℓ(σ(ut · xt), yt).
Role of Uq.
What if ut−1 ≡ ut ≡ u?
A LEMMA FOR THE POLYNOMIAL POTENTIAL
LEMMA
Let Φp be the Legendre polynomial potential. For all θ ∈ Rd,
Φp(θ) = Φq(∇Φp(θ)).
PROOF.
Let w = ∇Φp(θ). By the “dual by gradient” lemma:
Φp(θ) + Φq(w) = θ · w.
By Hölder’s inequality, θ · w ≤ 2 √(Φp(θ)Φq(w)). Hence
( √(Φp(θ)) − √(Φq(w)) )² ≤ 0,
and so Φp(θ) = Φq(w).
PROOF /1
Since (σ, ℓ) is nice:
ℓσt(wt−1) − ℓσt(ut−1) ≤ (1/λ)[ DΦq(ut−1, wt−1) − DΦq(ut−1, wt′) + DΦq(wt−1, wt′) ].
By the Generalized Pythagorean inequality (since ut−1 ∈ S):
DΦq(ut−1, wt′) ≥ DΦq(ut−1, wt) + DΦq(wt′, wt) ≥ DΦq(ut−1, wt).
Hence,
ℓσt(wt−1) − ℓσt(ut−1)
≤ (1/λ)[ DΦq(ut−1, wt−1) − DΦq(ut−1, wt) + DΦq(wt−1, wt′) ]
= (1/λ)[ DΦq(ut−1, wt−1) − DΦq(ut, wt) ]
+ (1/λ)[ −DΦq(ut−1, wt) + DΦq(ut, wt) ]
+ (1/λ) DΦq(wt−1, wt′) =: (1/λ)(at + bt + ct).
P ROOF /2
at = DΦq (ut−1 , wt−1 ) − DΦq (ut , wt )
n
X
at ≤ Φq (u0 ).
t=1
bt
= −DΦq (ut−1 , wt ) + DΦq (ut , wt )
= Φq (ut ) − Φq (ut−1 ) + (ut−1 − ut ) · ∇Φq (wt )
= Φq (ut ) − Φq (ut−1 ) + (ut−1 − ut ) · θt .
First two terms telescope; estimate last term:
q
(ut−1 − ut ) · θt ≤ 2 Φq (ut−1 − ut )Φp (θt ) (Hölder)
q
= 2 Φq (ut−1 − ut )Φq (wt ) (prev. lemma)
q
≤ kut−1 − ut kq 2Uq (wt ∈ S).
P ROOF /3
ct = DΦq (wt−1 , wt0 ); as before:
n
X
t=1
DΦq (wt−1 , wt0 ) ≤ (p − 1)
αλ2 2 σ
Xp L̂n .
2
The result follows by collecting the inequalities obtained.
EXPONENTIATED GRADIENT (EG) ALGORITHM [KIVINEN AND WARMUTH, 1997]
Φ(u) = Σi e^{ui}, (∇Φ*(w))i = ln wi. Projected GBLF:
wt′ = ∇Φ(∇Φ*(wt−1) − λ∇ℓσt(wt−1)),
wt = ProjΦ*(wt′; S),
where S = ∆1.
w′i,t = [∇Φ(∇Φ*(wt−1) − λ∇ℓσt(wt−1))]i
= exp( (∇Φ*(wt−1) − λ∇ℓσt(wt−1))i )
= exp( ln(wt−1,i) − λ(∇ℓσt(wt−1))i )
= wt−1,i e^{−λ(∇ℓσt(wt−1))i}.
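Putting the pieces together, a minimal EG sketch for the simplex (assuming the identity transfer, the quadratic loss and a fixed learning rate; illustrative only):

```python
import numpy as np

# Minimal EG sketch: multiplicative update followed by normalization, which plays
# the role of the "projection" step back onto the simplex Delta_1.
def eg_update(w_prev, x_t, y_t, lam):
    grad = (w_prev @ x_t - y_t) * x_t     # gradient of 0.5*(w.x - y)^2
    w = w_prev * np.exp(-lam * grad)      # w'_{i,t} = w_{i,t-1} * exp(-lam * grad_i)
    return w / w.sum()                    # normalize: projection onto Delta_1

d = 4
w = np.full(d, 1.0 / d)                   # w_0: uniform weights
rng = np.random.default_rng(1)
for _ in range(100):
    x = rng.uniform(-1.0, 1.0, size=d)
    y = x[0]                              # target follows the first feature
    w = eg_update(w, x, y, lam=0.2)
print(np.round(w, 3))                     # weight shifts toward coordinate 0
```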
ANALYSIS
Assume (σ, ℓ) is nice. Then:
ℓσt(wt−1) − ℓσt(u) ≤ −(u − wt−1) · ∇ℓσt(wt−1).
Let z = λ∇ℓσt(wt−1) and define v by vi = wt−1 · z − zi. Then (using Σj uj = 1)
−(u − wt−1) · z
= −u · z + wt−1 · z − ln(Σi wt−1,i e^{vi}) + ln(Σi wt−1,i e^{vi})
= −u · z − ln(Σi wt−1,i e^{−zi}) + ln(Σi wt−1,i e^{vi})
= Σj uj ln e^{−zj} − ln(Σi wt−1,i e^{−zi}) + ln(Σi wt−1,i e^{vi})
= Σj uj ln( wt−1,j e^{−zj} / (wt−1,j Σi wt−1,i e^{−zi}) ) + ln(Σi wt−1,i e^{vi})
= Σj uj ln( wt,j / wt−1,j ) + ln(Σi wt−1,i e^{vi})
= D(u‖wt−1) − D(u‖wt) + ln(Σi wt−1,i e^{vi}).
ANALYSIS /2
ℓσt(wt−1) − ℓσt(u) ≤ (1/λ)[ D(u‖wt−1) − D(u‖wt) + ln(Σi wt−1,i e^{vi}) ].
The term D(u‖wt−1) − D(u‖wt) telescopes. Let’s bound the second term! We can use Hoeffding’s inequality once we establish the range of the vi:
vi = wt−1 · z − zi, z = λ∇ℓσt(wt−1) = λ ( dℓ(σ(v), yt)/dv |_{v=wt−1·xt} ) xt.
Let
Ct = | dℓ(σ(v), yt)/dv |_{v=wt−1·xt} | ‖xt‖∞.
Then zi ∈ [−λCt, λCt].
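A quick numerical sanity check of the Hoeffding step used next (a sketch; w is an arbitrary probability vector and z an arbitrary vector with entries in [−λCt, λCt]):

```python
import numpy as np

# Check ln(sum_i w_i * exp(v_i)) <= (lam*C_t)^2 / 2 with v_i = w.z - z_i,
# for random probability vectors w and z_i in [-lam*C_t, lam*C_t] (Hoeffding's lemma).
rng = np.random.default_rng(2)
lam_Ct = 0.8
for _ in range(1000):
    w = rng.dirichlet(np.ones(5))
    z = rng.uniform(-lam_Ct, lam_Ct, size=5)
    v = w @ z - z
    assert np.log(np.sum(w * np.exp(v))) <= lam_Ct ** 2 / 2 + 1e-12
```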
ANALYSIS /3
The goal is to bound ln(Σi wt−1,i e^{vi}), where zi ∈ [−λCt, λCt]. Hoeffding’s inequality:
ln(Σi wt−1,i e^{vi}) ≤ λ²Ct²/2.
If (σ, ℓ) is α-subquadratic, Ct² ≤ α ℓσt(wt−1) X∞².
Collecting the inequalities:
Rσn(u) ≤ (ln d)/λ + (αλ/2) X∞² L̂σn.
Choosing λ “cleverly” gives the following result:
REGRET FOR EG
THEOREM
Assume (σ, ℓ) is nice and α-subquadratic. Let u ∈ ∆1, X∞ = max_{t=1,...,n} ‖xt‖∞, ε ∈ (0, 1). Then, with λ chosen appropriately,
L̂σn ≤ Lσn(u)/(1 − ε) + αX∞² ln d / (2ε(1 − ε)).
NEGATIVE WEIGHTS? [GROVE ET AL., 1997]
Use EG with
xt′ = (−xt1, . . . , −xtd, xt1, . . . , xtd).
This is ≡ projected GBLF with the hyperbolic cosine potential.
CHOOSING THE POTENTIAL
Fix u; scale the simplex to contain u ⇒ can use EG:
L̂σn ≤ Lσn(u)/(1 − ε) + α/(2ε(1 − ε)) × (ln d) ‖u‖1² X∞².
Regret for GBLF using the polynomial potential with p ≥ 2:
L̂σn ≤ Lσn(u)/(1 − ε) + (p − 1)α/(4ε(1 − ε)) × ‖u‖q² Xp².
p = 2 ln d gives a close match.
EG is at advantage: dense xt (xti ∈ {−1, 1}), sparse u (e.g., u = (1, 0, . . . , 0)).
Poly is at advantage: sparse side information, dense expert.
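The trade-off on this slide can be seen by plugging representative vectors into the leading factors of the two bounds, (ln d)·‖u‖1²·X∞² versus (p−1)·‖u‖q²·Xp²; the rough numeric sketch below uses illustrative values and drops the ε-dependent constants.

```python
import numpy as np

# Dominant factors of the two bounds:
#   EG:   ln(d) * ||u||_1^2 * ||x||_inf^2
#   Poly: (p-1) * ||u||_q^2 * ||x||_p^2,  with 1/p + 1/q = 1.
d = 1000

def eg_term(u, x):
    return np.log(d) * np.linalg.norm(u, 1) ** 2 * np.linalg.norm(x, np.inf) ** 2

def poly_term(u, x, p):
    q = p / (p - 1)
    return (p - 1) * np.linalg.norm(u, q) ** 2 * np.linalg.norm(x, p) ** 2

dense_x, sparse_u = np.ones(d), np.eye(d)[0]       # dense side info, sparse expert
sparse_x, dense_u = np.eye(d)[0], np.ones(d)       # sparse side info, dense expert

print(eg_term(sparse_u, dense_x), poly_term(sparse_u, dense_x, p=2 * np.log(d)))  # EG smaller
print(eg_term(dense_u, sparse_x), poly_term(dense_u, sparse_x, p=2.0))            # Poly smaller
```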
REFERENCES
Auer, P., Cesa-Bianchi, N., and Gentile, C. (2002). Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1).
Cesa-Bianchi, N. (1999). Analysis of two gradient-based algorithms for on-line regression. Journal of Computer and System Sciences, 59(3):392–411.
Grove, A., Littlestone, N., and Schuurmans, D. (1997). General convergence results for linear discriminant updates. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 171–183. ACM Press.
Grove, A., Littlestone, N., and Schuurmans, D. (2001). General convergence results for linear discriminant updates. Machine Learning, 43(3):173–210.
Helmbold, D., Kivinen, J., and Warmuth, M. (1999). Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291–1304.
Herbster, M. and Warmuth, M. (2001). Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309.
Kivinen, J. and Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63.
Rockafellar, R. (1970). Convex Analysis. Princeton University Press.
Warmuth, M. and Jagota, A. (1997). Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics.
Widrow, B. and Hoff, M. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pages 96–104.