LINEAR PREDICTION WITH SIDE INFORMATION
Csaba Szepesvári, University of Alberta, CMPUT 654
E-mail: [email protected]
UofA, November 23, 2006

OUTLINE
1 Linear prediction with side information
2 Convex analysis
3 Gradient based linear forecaster
4 Forecasters based on the polynomial potential (adaptive learning rates; tracking)
5 Exponentiated Gradient (EG) algorithm
6 Bibliography

LINEAR PREDICTION WITH SIDE INFORMATION
Note: $x, y \in \mathbb{R}^d$; $x \cdot y = x^T y = \sum_i x_i y_i$.
Side information (features): $x_1, x_2, \ldots, x_t, \ldots \in \mathbb{R}^d$.
Targets: $y_1, y_2, \ldots, y_t, \ldots \in \mathbb{R}$.
Linear expert $u \in \mathbb{R}^d$: $f_{u,t} = \sigma(u \cdot x_t)$.
Forecaster: $w_{t-1} = A(x_1, y_1, \ldots, x_{t-1}, y_{t-1})$; $\hat{p}_t = \sigma(w_{t-1} \cdot x_t)$, where $\sigma : \mathbb{R} \to \mathbb{R}$ is some "transfer function".
Loss function: $\ell : \mathbb{R} \times \mathbb{R} \to [0, \infty)$.
Regret: $R_n(u) = \hat{L}_n - L_n(u) = \sum_{t=1}^n \ell(\hat{p}_t, y_t) - \ell(f_{u,t}, y_t)$.

BREGMAN DIVERGENCES
Idea: define a "distance" from an arbitrary convex function.

DEFINITION (Legendre function). $F : A \to \mathbb{R}$ is Legendre if
1. $A \subset \mathbb{R}^d$, $A \neq \emptyset$, $A^\circ$ is convex;
2. $F$ is strictly convex, $\nabla F$ exists and is continuous on $A^\circ$;
3. if $x_n \in A$ is such that $x_n \to \partial A$, then $\|\nabla F(x_n)\| \to \infty$ as $n \to \infty$.

DEFINITION (Bregman divergence of $F$). Let $F : A \to \mathbb{R}$ be Legendre, $F \geq 0$. Define
$D_F(u, v) = F(u) - F(v) - (u - v)^T \nabla F(v)$.
$D_F(u, v)$: "divergence from $u$ to $v$".

BREGMAN DIVERGENCE: EXAMPLES
PROPOSITION. $D_F(u, v) \geq 0$ and $D_F(u, u) = 0$. However, $D_F(u, v) \neq D_F(v, u)$ in most cases.

EXAMPLE (Euclidean distance). $F(x) = \frac{1}{2}\|x\|^2$ ($\|\cdot\| = \|\cdot\|_2$). Then $\nabla F(v) = v$ and
$D_F(u, v) = \frac{1}{2}\|u - v\|^2$.

EXAMPLE (Unnormalized Kullback-Leibler divergence). $F(p) = \sum_i p_i \ln p_i - \sum_i p_i$ (unnormalized negative entropy), $p_i > 0$ ($A = (0, \infty)^d$). Then
$D_F(p, q) = \sum_i p_i \ln \frac{p_i}{q_i} + \sum_i (q_i - p_i)$.

BREGMAN DIVERGENCE /2
LEMMA ("3-Divs Lemma"). $F : A \to \mathbb{R}$ Legendre, $u \in A$, $v, w \in A^\circ$. Then
$D_F(u, v) + D_F(v, w) = D_F(u, w) + (u - v) \cdot (\nabla F(w) - \nabla F(v))$.
PROOF. Homework.

BREGMAN PROJECTION
DEFINITION (Bregman projection). $S \subset \mathbb{R}^d$ convex, closed, $S \cap A \neq \emptyset$, $w \in A^\circ$. Then
$\mathrm{Proj}_F(w; S) = \operatorname{argmin}_{x \in S \cap A} D_F(x, w)$.
Homework: well-defined!

LEMMA (Generalized Pythagorean Inequality). $S \subset \mathbb{R}^d$ convex, closed, $S \cap A \neq \emptyset$, $u \in S$. Then for all $w \in A^\circ$ and $w' = \mathrm{Proj}_F(w; S)$,
$D_F(u, w) \geq D_F(u, w') + D_F(w', w)$.

GENERALIZED PYTHAGOREAN INEQUALITY
For $S$ convex, closed, $u \in S$, $S \cap A \neq \emptyset$, $w \in A^\circ$, $w' = \mathrm{Proj}_F(w; S)$,
$D_F(u, w) \geq D_F(u, w') + D_F(w', w)$.
EXAMPLE (Euclidean norm). Let $F(u) = \frac{1}{2}\|u\|^2$, $w' = \operatorname{argmin}_{x \in S} \|x - w\|$. Then
$\|u - w\|^2 \geq \|u - w'\|^2 + \|w' - w\|^2$.
NOTE. Equality if $S$ is a hyperplane. Insight: "law of cosines".

THE PYTHAGOREAN INEQUALITY
[Figure: $w$ is projected onto the convex set $S$, giving $w'$; for $u \in S$ the law of cosines gives $\|u - w\|^2 = \|u - w'\|^2 + \|w' - w\|^2 - 2\|u - w'\|\,\|w' - w\|\cos(\theta_{w'})$, and the angle $\theta_{w'}$ at $w'$ is obtuse, so the cosine term is nonpositive.]

LEGENDRE DUAL
DEFINITION. Let $F : A \to \mathbb{R}$ be a Legendre function. Its Legendre dual is $F^* : A^* \to \mathbb{R}$, $A^* = \{\nabla F(u) \mid u \in A^\circ\}$, defined by
$F^*(u) = \sup_{v \in A} (u \cdot v - F(v))$.
PROPOSITION. If $F$ is Legendre then so is $F^*$. Moreover, $F^{**} = F$.

LEGENDRE DUAL PROPERTIES
LEMMA (Dual by gradient). $F(u) + F^*(u') = u \cdot u'$ if and only if $u' = \nabla F(u)$.
PROOF. Let $G(v) = u' \cdot v - F(v)$. $G$ is concave; $\nabla G(v) = u' - \nabla F(v)$.
$\Rightarrow$: $F^*(u') = \sup_v G(v) = G(u)$, hence $\nabla G(u) = 0$, i.e., $u' = \nabla F(u)$.
$\Leftarrow$: Since $u' = \nabla F(u)$, $\nabla G(u) = 0$; hence $u$ is a maximizer of $G$ and $F^*(u') = G(u) = u \cdot u' - F(u)$.

LEGENDRE DUAL PROPERTIES /2
LEMMA (Inversion Lemma). $\nabla F^* = (\nabla F)^{-1}$.
PROOF.
$u' = \nabla F(u) \iff F(u) + F^*(u') = u \cdot u'$,
$u = \nabla F^*(u') \iff F^*(u') + F^{**}(u) = u \cdot u'$,
and $F^{**}(u) = F(u)$.
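To make these definitions concrete, here is a small numerical sketch (ours, not part of the original slides) in Python. It evaluates the two example divergences and checks the Generalized Pythagorean Inequality for the Euclidean potential, taking $S$ to be the probability simplex; the sort-based simplex projection helper is an assumed standard routine, not something introduced in the lecture.

```python
import numpy as np

def bregman_euclidean(u, v):
    # F(x) = 1/2 ||x||_2^2, grad F(v) = v  =>  D_F(u, v) = 1/2 ||u - v||^2
    return 0.5 * np.sum((u - v) ** 2)

def bregman_unnorm_kl(p, q):
    # F(p) = sum_i (p_i ln p_i - p_i)  =>  D_F(p, q) = sum_i p_i ln(p_i/q_i) + (q_i - p_i)
    return np.sum(p * np.log(p / q) + (q - p))

def project_simplex(w):
    # Euclidean projection onto the probability simplex (standard sort-based method)
    z = np.sort(w)[::-1]
    css = np.cumsum(z) - 1.0
    idx = np.arange(1, len(w) + 1)
    rho = np.nonzero(z - css / idx > 0)[0][-1]
    return np.maximum(w - css[rho] / (rho + 1.0), 0.0)

rng = np.random.default_rng(0)
p, q = rng.random(5) + 0.1, rng.random(5) + 0.1
print(bregman_unnorm_kl(p, q) >= 0.0)          # D_F(p, q) >= 0

w = rng.normal(size=5)                         # an arbitrary point outside the simplex
u = rng.dirichlet(np.ones(5))                  # a point u inside S (the simplex)
w_proj = project_simplex(w)
lhs = bregman_euclidean(u, w)
rhs = bregman_euclidean(u, w_proj) + bregman_euclidean(w_proj, w)
print(lhs >= rhs - 1e-12)                      # generalized Pythagorean inequality holds
```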
LEGENDRE DUAL PROPERTIES /3
LEMMA (Dual divergences). Let $u, v \in A^\circ$, $u' = \nabla F(u)$, $v' = \nabla F(v)$. Then
$D_F(u, v) = D_{F^*}(v', u')$.
PROOF. Use the previous two lemmas.

LEGENDRE DUAL: EXAMPLES /1
EXAMPLE (Polynomial potentials). $F_p(u) = \frac{1}{2}\|u\|_p^2$, $p \geq 2$. Then $F_p^*(u) = \frac{1}{2}\|u\|_q^2 = F_q(u)$, where $1/p + 1/q = 1$ ($(p, q)$ are conjugate pairs).
$(\nabla F_p(u))_i = \mathrm{sgn}(u_i)\,|u_i|^{p-1} / \|u\|_p^{p-2}$,
$(\nabla F_p)^{-1} = \nabla F_q$ (by the Inversion Lemma).
The only self-dual potential of this family is $\frac{1}{2}\|u\|_2^2$ ($p = 2$ above).

LEGENDRE DUAL: EXAMPLES /2
EXAMPLE (Exponential potential). $F(u) = \sum_{i=1}^d e^{u_i}$. This is Legendre.
$\nabla F(u) : u \mapsto (e^{u_1}, \ldots, e^{u_d})$, $(\nabla F)^{-1} : v \mapsto (\ln v_1, \ldots, \ln v_d)$, $v_i > 0$.
Hence $(\nabla F^*)(v) = (\ln v_1, \ldots, \ln v_d)$ and so $F^*(v) = \sum_i v_i(\ln v_i - 1)$.
Note: $v \in \Delta_1^+ \Rightarrow F^*(v) = -(H(v) + 1)$.

LEGENDRE DUAL: EXAMPLES /3
EXAMPLE (Hyperbolic cosine). $F(u) = \frac{1}{2}\sum_i (e^{u_i} + e^{-u_i})$.
$(\nabla F(u))_i = \sinh(u_i)$, so $(\nabla F^*(v))_i = \operatorname{arcsinh}(v_i) = \ln\big(\sqrt{v_i^2 + 1} + v_i\big)$,
and $F^*(v) = \sum_{i=1}^d \big(v_i \operatorname{arcsinh}(v_i) - \sqrt{v_i^2 + 1}\big)$.

GRADIENT BASED LINEAR FORECASTER (GBLF)
DEFINITION (Regular loss). $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a regular loss if $\ell \geq 0$, $\ell$ is convex, and for all $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $\ell_{x,y}(w) \stackrel{\mathrm{def}}{=} \ell(w \cdot x, y)$ is differentiable.

GRADIENT BASED LINEAR FORECASTER [Warmuth and Jagota, 1997; Grove et al., 2001]
Let $\Phi$ be Legendre; $\ell_t(w) \stackrel{\mathrm{def}}{=} \ell(w \cdot x_t, y_t)$, $\lambda > 0$. Let
$w_t = \nabla\Phi\big(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1})\big)$,
where $w_0 = \nabla\Phi(0)$.

"DUAL GRADIENT UPDATES"
GBLF: $w_t = \nabla\Phi\big(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1})\big)$.
Let $\theta_t = \nabla\Phi^*(w_t)$. Then $(\nabla\Phi)^{-1}(w_t) = \nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1})$, hence using $(\nabla\Phi)^{-1} = \nabla\Phi^*$,
$\nabla\Phi^*(w_t) = \nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t(w_{t-1})$, i.e.,
$\theta_t = \theta_{t-1} - \lambda\nabla\ell_t(w_{t-1})$.
This is called the "dual gradient update".

REGRET OF GBLF
THEOREM [Warmuth and Jagota, 1997]. Let $R_n(u) = \hat{L}_n - L_n(u)$, $\ell$ regular, $\Phi$ Legendre. Then
$R_n(u) \leq \frac{1}{\lambda} D_{\Phi^*}(u, w_0) + \frac{1}{\lambda}\sum_{t=1}^n D_{\Phi^*}(w_{t-1}, w_t)$.
PROOF. By the convexity of $\ell_t$:
$\ell_t(w_{t-1}) - \ell_t(u) \leq (w_{t-1} - u)\cdot\nabla\ell_t(w_{t-1}) = \frac{1}{\lambda}(u - w_{t-1})\cdot(\nabla\Phi^*(w_t) - \nabla\Phi^*(w_{t-1})) = \frac{1}{\lambda}\big(D_{\Phi^*}(u, w_{t-1}) - D_{\Phi^*}(u, w_t) + D_{\Phi^*}(w_{t-1}, w_t)\big)$.
The last equality follows from the "3-Divs Lemma". Now sum over $t = 1, \ldots, n$ and drop $-D_{\Phi^*}(u, w_n) \leq 0$ to get the result.

EXAMPLES
An alternative form:
$w_t = \operatorname{argmin}_{u \in \mathbb{R}^d}\big[D_{\Phi^*}(u, w_{t-1}) + \lambda\big(\ell_t(w_{t-1}) + (u - w_{t-1})\cdot\nabla\ell_t(w_{t-1})\big)\big]$.
Note that the argument is convex!
An approximate form (convex optimization, a.k.a. the "proximal point algorithm" [Rockafellar, 1970]):
$w_t = \operatorname{argmin}_{u \in \mathbb{R}^d}\big[D_{\Phi^*}(u, w_{t-1}) + \lambda\ell_t(u)\big]$.
[Widrow and Hoff, 1960]: $w_t = w_{t-1} - \lambda(w_{t-1}\cdot x_t - y_t)\,x_t$ with $\Phi(u) = \frac{1}{2}\|u\|_2^2$ and the quadratic loss.
Exponentiated Gradient [Kivinen and Warmuth, 1997]: $w_{i,t} \propto w_{i,t-1}\,e^{-\lambda(\nabla\ell_t(w_{t-1}))_i}$ with $\Phi(u) = \sum_i e^{u_i}$ and a "projection" onto the simplex.

TRANSFER FUNCTIONS [Helmbold et al., 1999]
DEFINITION (Nice pair). $(\sigma, \ell)$ is a nice pair if $\ell$ is regular and for all $y \in \mathbb{R}$, $\ell(\sigma(\cdot), y)$ is convex.
DEFINITION ($\alpha$-subquadratic pair [Cesa-Bianchi, 1999]). Let $\alpha > 0$. $(\sigma, \ell)$ is $\alpha$-subquadratic if
$\left(\frac{d}{dv}\ell(\sigma(v), y)\right)^2 \leq \alpha\,\ell(\sigma(v), y)$
holds for all $v, y \in \mathbb{R}$.
Abbreviations: $\ell^\sigma(v, y) = \ell(\sigma(v), y)$; $\hat{L}_n^\sigma$, $L_n^\sigma(u)$.

GBLF WITH TRANSFER FUNCTION $\sigma$
$w_t = \nabla\Phi\big(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t^\sigma(w_{t-1})\big)$, where $\ell_t^\sigma(u) = \ell^\sigma(u\cdot x_t, y_t) = \ell(\sigma(u\cdot x_t), y_t)$.

EXAMPLES
Simple case: $\ell$ regular, $\sigma(x) = x$.
Identity transfer function, quadratic loss: $\ell(p, y) = \frac{1}{2}(p - y)^2$, $\sigma(x) = x$: 1-subquadratic.
Sigmoid transfer function, entropic loss: $\sigma(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \in [-1, 1]$, $\ell(p, y) = \frac{1+y}{2}\ln\frac{1+y}{1+p} + \frac{1-y}{2}\ln\frac{1-y}{1-p}$: 2-subquadratic.
Logistic transfer function with Hellinger distance: $\sigma(x) = \frac{1}{1+e^{-x}} \in [0, 1]$, $\ell(p, y) = (\sqrt{p} - \sqrt{y})^2 + (\sqrt{1-p} - \sqrt{1-y})^2$: 1/4-subquadratic.
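The dual gradient update is easy to state in code. Below is a minimal sketch (ours, not part of the slides) of GBLF for the simple case $\sigma(x) = x$ with the quadratic loss; the Euclidean map recovers the Widrow-Hoff rule, while the exponential map gives the unnormalized EG-style update (without the simplex projection discussed later). Function names, the learning rate, and the toy data are illustrative choices.

```python
import numpy as np

def grad_phi_euclidean(theta):
    # Phi(u) = 1/2 ||u||_2^2  =>  grad Phi = identity (Widrow-Hoff / LMS case)
    return theta

def grad_phi_exponential(theta):
    # Phi(u) = sum_i exp(u_i)  =>  (grad Phi(theta))_i = exp(theta_i) (EG-style case)
    return np.exp(theta)

def gblf(xs, ys, grad_phi, theta0, lam=0.05):
    """GBLF with quadratic loss l_t(w) = 1/2 (w . x_t - y_t)^2:
    theta_t = theta_{t-1} - lam * grad l_t(w_{t-1}),  w_t = grad Phi(theta_t)."""
    theta = theta0.copy()                 # theta_0 = grad Phi^*(w_0), here 0 so w_0 = grad Phi(0)
    w = grad_phi(theta)
    losses = []
    for x, y in zip(xs, ys):
        p = w @ x
        losses.append(0.5 * (p - y) ** 2)
        grad = (p - y) * x                # gradient of l_t at w_{t-1}
        theta = theta - lam * grad        # dual gradient step
        w = grad_phi(theta)               # map back to primal weights
    return np.array(losses), w

# Toy run: d = 3 features, targets generated by a fixed linear expert.
rng = np.random.default_rng(1)
u_true = np.array([0.5, -0.2, 0.7])
xs = rng.normal(size=(200, 3))
ys = xs @ u_true
losses, w_final = gblf(xs, ys, grad_phi_euclidean, theta0=np.zeros(3))
print(losses[:5].sum(), losses[-5:].sum())   # the late per-round losses are much smaller
```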
REGRET FOR GBLF WITH A POLYNOMIAL POTENTIAL
THEOREM. Let $(\sigma, \ell)$ be an $\alpha$-subquadratic nice pair, $p \geq 2$, $\Phi_p(x) = \frac{1}{2}\|x\|_p^2$. Consider GBLF with $\Phi_p$ and $\sigma$. Let $X_p > 0$ be such that $\max_{t=1,\ldots,n}\|x_t\|_p \leq X_p$, and let $(p, q)$ be conjugate pairs. Then
$R_n^\sigma(u) \leq \frac{\Phi_q(u)}{\lambda} + \frac{\alpha\lambda}{2}(p-1)X_p^2\,\hat{L}_n^\sigma$,
and with $\lambda = \frac{2\varepsilon}{(p-1)\alpha X_p^2}$, $\varepsilon \in (0, 1)$,
$\hat{L}_n^\sigma \leq \frac{L_n^\sigma(u)}{1-\varepsilon} + \frac{\|u\|_q^2}{\varepsilon(1-\varepsilon)}\times\frac{(p-1)\alpha X_p^2}{4}$.

A REPRESENTATION OF BREGMAN DIVERGENCES
LEMMA (Representation Lemma). Let $\Phi$ be Legendre (and twice differentiable). Then there exists $\xi \in \mathbb{R}^d$ such that
$D_\Phi(u, v) = \frac{1}{2}(u - v)^T(\nabla^2\Phi)|_\xi\,(u - v)$.
PROOF. $D_\Phi(u, v) = \Phi(u) - \Phi(v) - (u - v)\cdot\nabla\Phi(v)$. A second-order Taylor expansion of $\Phi$ around $v$ gives
$\Phi(u) = \Phi(v) + (u - v)\cdot\nabla\Phi(v) + \frac{1}{2}(u - v)^T(\nabla^2\Phi)|_\xi\,(u - v)$
for an appropriate $\xi$. Combining the two equations gives the result.

A BOUND ON THE QUADRATIC FORMS
LEMMA (A bound for potentials). Assume that $\Phi(u) = \psi\big(\sum_i\phi(u_i)\big)$, where $\psi$ and $\phi$ are twice differentiable and $\psi'' \leq 0$. Then
$x^T\nabla^2\Phi|_\xi\,x \leq \psi'\!\Big(\sum_k\phi(\xi_k)\Big)\sum_i\phi''(\xi_i)\,x_i^2$.
Proof: just expand the left-hand side and observe that $\psi'' \leq 0$.

COROLLARY (A bound for the polynomial potential). If $\Phi(u) = \frac{1}{2}\big(\sum_i|u_i|^p\big)^{2/p}$ then
$x^T\nabla^2\Phi|_\xi\,x \leq (p-1)\|x\|_p^2$.
Proof: use the previous result with $\psi(u) = \frac{1}{2}u^{2/p}$, $\phi(u) = |u|^p$, and Hölder's inequality to bound the second sum.

PROOF OF THE THEOREM
As in the general regret theorem, since $(\sigma, \ell)$ is a nice pair:
$R_n^\sigma(u) \leq \frac{1}{\lambda}D_{\Phi_q}(u, w_0) + \frac{1}{\lambda}\sum_{t=1}^n D_{\Phi_q}(w_{t-1}, w_t)$.
Since $w_0 = \nabla\Phi_p(0) = 0$, $D_{\Phi_q}(u, w_0) = \Phi_q(u)$.
Let $\theta_t = \nabla\Phi_q(w_t)$; by the "Dual Divergences Lemma", the "Representation Lemma", and the bound on the previous slide,
$D_{\Phi_q}(w_{t-1}, w_t) = D_{\Phi_p}(\theta_t, \theta_{t-1}) \leq \frac{p-1}{2}\|\theta_{t-1} - \theta_t\|_p^2$.
By the definition of the algorithm, $\theta_t - \theta_{t-1} = -\lambda\nabla\ell_t^\sigma(w_{t-1})$, so
$D_{\Phi_q}(w_{t-1}, w_t) \leq \frac{p-1}{2}\lambda^2\|\nabla\ell_t^\sigma(w_{t-1})\|_p^2 = \frac{p-1}{2}\lambda^2\left(\frac{d\ell(\sigma(v), y_t)}{dv}\Big|_{v = w_{t-1}\cdot x_t}\right)^{\!2}\|x_t\|_p^2 \leq \frac{p-1}{2}\lambda^2\,\alpha\,\ell_t^\sigma(w_{t-1})\,\|x_t\|_p^2$,
since $(\sigma, \ell)$ is $\alpha$-subquadratic.

SELF-CONFIDENT FORECASTER [Auer et al., 2002]
SELF-CONFIDENT FORECASTER ($U_q$)
Let $U_q > 0$, $S = \{u \in \mathbb{R}^d \mid \Phi_q(u) \leq U_q\}$,
$w_t' = \nabla\Phi_p\big(\nabla\Phi_q(w_{t-1}) - \lambda_t\nabla\ell_t^\sigma(w_{t-1})\big)$,
$w_t = \mathrm{Proj}_{\Phi_q}(w_t'; S)$,
$\lambda_t = \frac{\beta_t}{(p-1)\alpha X_{p,t}^2}$, $\beta_t = \sqrt{\frac{k_t}{k_t + \hat{L}_t^\sigma}}$,
where $X_{p,t} = \max_{s=1,\ldots,t}\|x_s\|_p$, $k_t = (p-1)\alpha X_{p,t}^2 U_q$, and $\hat{L}_t^\sigma = \sum_{s=1}^t\ell_s^\sigma(w_{s-1})$.

REGRET FOR THE SELF-CONFIDENT FORECASTER
THEOREM. Consider the self-confident forecaster with parameter $U_q > 0$. Let $u \in \mathbb{R}^d$ be such that $\Phi_q(u) \leq U_q$. Then
$R_n^\sigma(u) \leq 5\sqrt{(p-1)\,\alpha\,X_{p,n}^2\,U_q\,L_n^\sigma(u)} + 30\,(p-1)\,\alpha\,X_{p,n}^2\,U_q$.

PROJECTED LINEAR FORECASTER: TRACKING
PROJECTED LINEAR FORECASTER ($S$, $\lambda$)
$S \subset \mathbb{R}^d$ convex, closed, $\lambda > 0$.
$w_t' = \nabla\Phi_p\big(\nabla\Phi_q(w_{t-1}) - \lambda\nabla\ell_t^\sigma(w_{t-1})\big)$,
$w_t = \mathrm{Proj}_{\Phi_q}(w_t'; S)$.

TRACKING REGRET
THEOREM [Herbster and Warmuth, 2001]. Let $(\sigma, \ell)$ be nice, $(p, q)$ a conjugate pair, $U_q > 0$, $\varepsilon \in (0, 1)$, $S = \{w \mid \Phi_q(w) \leq U_q\}$, $\lambda = 2\varepsilon/((p-1)\alpha X_p^2)$, $\|x_t\|_p \leq X_p$. Then for any sequence $\{u_t\}_t \subset S$,
$\hat{L}_n^\sigma \leq \frac{L_n^\sigma(\langle u_t\rangle)}{1-\varepsilon} + \frac{(p-1)\alpha X_p^2}{2\varepsilon(1-\varepsilon)}\left(\sqrt{2U_q}\sum_{t=1}^n\|u_{t-1} - u_t\|_q + \frac{\|u_n\|_q^2}{2}\right)$.

DISCUSSION
$L_n^\sigma(\langle u_t\rangle) \stackrel{\mathrm{def}}{=} \sum_{t=1}^n\ell(\sigma(u_t\cdot x_t), y_t)$.
Role of $U_q$. What if $u_{t-1} \equiv u_t \equiv u$?

A LEMMA FOR THE POLYNOMIAL POTENTIAL
LEMMA. Let $\Phi_p$ be the Legendre polynomial potential. For all $\theta \in \mathbb{R}^d$, $\Phi_p(\theta) = \Phi_q(\nabla\Phi_p(\theta))$.
PROOF. Let $w = \nabla\Phi_p(\theta)$. By the "dual by gradient" lemma, $\Phi_p(\theta) + \Phi_q(w) = \theta\cdot w$. By Hölder's inequality, $\theta\cdot w \leq \|\theta\|_p\|w\|_q = 2\sqrt{\Phi_p(\theta)\Phi_q(w)}$. Hence
$\big(\sqrt{\Phi_p(\theta)} - \sqrt{\Phi_q(w)}\big)^2 \leq 0$,
and so $\Phi_p(\theta) = \Phi_q(w)$.
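As a quick numerical check (ours, not from the slides) of the two facts about the polynomial potential used in these proofs, the following Python snippet verifies the Inversion Lemma, $\nabla\Phi_q = (\nabla\Phi_p)^{-1}$, and the lemma $\Phi_p(\theta) = \Phi_q(\nabla\Phi_p(\theta))$, using the gradient formula from the "Polynomial potentials" example.

```python
import numpy as np

def phi(u, p):
    # Phi_p(u) = 1/2 ||u||_p^2
    return 0.5 * np.linalg.norm(u, ord=p) ** 2

def grad_phi(u, p):
    # (grad Phi_p(u))_i = sign(u_i) |u_i|^{p-1} / ||u||_p^{p-2}
    norm = np.linalg.norm(u, ord=p)
    return np.sign(u) * np.abs(u) ** (p - 1) / norm ** (p - 2)

p = 4.0
q = p / (p - 1.0)                  # conjugate exponent: 1/p + 1/q = 1
theta = np.array([0.3, -1.2, 2.0, 0.05])
w = grad_phi(theta, p)             # primal weights obtained from dual parameters

print(np.allclose(grad_phi(w, q), theta))     # Inversion Lemma: grad Phi_q inverts grad Phi_p
print(np.isclose(phi(theta, p), phi(w, q)))   # Phi_p(theta) == Phi_q(grad Phi_p(theta))
```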
PROOF /1
Since $(\sigma, \ell)$ is nice:
$\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u_{t-1}) \leq \frac{1}{\lambda}\big(D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_{t-1}, w_t') + D_{\Phi_q}(w_{t-1}, w_t')\big)$.
By the Generalized Pythagorean Inequality (since $u_{t-1} \in S$):
$D_{\Phi_q}(u_{t-1}, w_t') \geq D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(w_t, w_t') \geq D_{\Phi_q}(u_{t-1}, w_t)$.
Hence,
$\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u_{t-1}) \leq \frac{1}{\lambda}\big(D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(w_{t-1}, w_t')\big)$
$= \frac{1}{\lambda}\big(D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_t, w_t)\big) + \frac{1}{\lambda}\big({-D_{\Phi_q}(u_{t-1}, w_t)} + D_{\Phi_q}(u_t, w_t)\big) + \frac{1}{\lambda}D_{\Phi_q}(w_{t-1}, w_t') =: \frac{1}{\lambda}(a_t + b_t + c_t)$.

PROOF /2
$a_t = D_{\Phi_q}(u_{t-1}, w_{t-1}) - D_{\Phi_q}(u_t, w_t)$: the sum telescopes and, since $w_0 = 0$ and $D_{\Phi_q}(u_n, w_n) \geq 0$,
$\sum_{t=1}^n a_t \leq D_{\Phi_q}(u_0, w_0) = \Phi_q(u_0)$.
$b_t = -D_{\Phi_q}(u_{t-1}, w_t) + D_{\Phi_q}(u_t, w_t) = \Phi_q(u_t) - \Phi_q(u_{t-1}) + (u_{t-1} - u_t)\cdot\nabla\Phi_q(w_t) = \Phi_q(u_t) - \Phi_q(u_{t-1}) + (u_{t-1} - u_t)\cdot\theta_t$.
The first two terms telescope; estimate the last term:
$(u_{t-1} - u_t)\cdot\theta_t \leq 2\sqrt{\Phi_q(u_{t-1} - u_t)\,\Phi_p(\theta_t)}$ (Hölder)
$= 2\sqrt{\Phi_q(u_{t-1} - u_t)\,\Phi_q(w_t)}$ (previous lemma)
$\leq \|u_{t-1} - u_t\|_q\,\sqrt{2U_q}$ ($w_t \in S$).

PROOF /3
$c_t = D_{\Phi_q}(w_{t-1}, w_t')$; as before,
$\sum_{t=1}^n D_{\Phi_q}(w_{t-1}, w_t') \leq (p-1)\frac{\alpha\lambda^2}{2}X_p^2\,\hat{L}_n^\sigma$.
The result follows by collecting the inequalities obtained.

EXPONENTIATED GRADIENT (EG) ALGORITHM
[Kivinen and Warmuth, 1997] $\Phi(u) = \sum_i e^{u_i}$, $(\nabla\Phi^*(w))_i = \ln w_i$.
Projected GBLF:
$w_t' = \nabla\Phi\big(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t^\sigma(w_{t-1})\big)$,
$w_t = \mathrm{Proj}_{\Phi^*}(w_t'; S)$, where $S = \Delta_1$.
$w_{i,t}' = \big[\nabla\Phi\big(\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t^\sigma(w_{t-1})\big)\big]_i = \exp\big((\nabla\Phi^*(w_{t-1}) - \lambda\nabla\ell_t^\sigma(w_{t-1}))_i\big) = \exp\big(\ln(w_{t-1,i}) - \lambda(\nabla\ell_t^\sigma(w_{t-1}))_i\big) = w_{t-1,i}\,e^{-\lambda(\nabla\ell_t^\sigma(w_{t-1}))_i}$,
and the projection onto $\Delta_1$ amounts to normalization.

ANALYSIS
Assume $(\sigma, \ell)$ is nice. Then
$\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u) \leq -(u - w_{t-1})\cdot\nabla\ell_t^\sigma(w_{t-1})$.
Let $z = \lambda\nabla\ell_t^\sigma(w_{t-1})$ and let $v$ be defined by $v_i = w_{t-1}\cdot z - z_i$. Then
$-(u - w_{t-1})\cdot z$
$= -u\cdot z + w_{t-1}\cdot z - \ln\big(\sum_i w_{t-1,i}e^{v_i}\big) + \ln\big(\sum_i w_{t-1,i}e^{v_i}\big)$
$= -u\cdot z - \ln\big(\sum_i w_{t-1,i}e^{-z_i}\big) + \ln\big(\sum_i w_{t-1,i}e^{v_i}\big)$
$= \sum_j u_j\ln e^{-z_j} - \ln\big(\sum_i w_{t-1,i}e^{-z_i}\big) + \ln\big(\sum_i w_{t-1,i}e^{v_i}\big)$
$= \sum_j u_j\ln\frac{w_{t-1,j}e^{-z_j}}{w_{t-1,j}\sum_i w_{t-1,i}e^{-z_i}} + \ln\big(\sum_i w_{t-1,i}e^{v_i}\big)$
$= \sum_j u_j\ln\frac{w_{t,j}}{w_{t-1,j}} + \ln\big(\sum_i w_{t-1,i}e^{v_i}\big)$
$= D(u\|w_{t-1}) - D(u\|w_t) + \ln\big(\sum_i w_{t-1,i}e^{v_i}\big)$.

ANALYSIS /2
$\ell_t^\sigma(w_{t-1}) - \ell_t^\sigma(u) \leq \frac{1}{\lambda}\Big(D(u\|w_{t-1}) - D(u\|w_t) + \ln\big(\sum_i w_{t-1,i}e^{v_i}\big)\Big)$.
The term $D(u\|w_{t-1}) - D(u\|w_t)$ telescopes. Let's bound the second term! We can use Hoeffding's inequality once we establish the range of the $z_i$:
$v_i = w_{t-1}\cdot z - z_i$, $z = \lambda\nabla\ell_t^\sigma(w_{t-1}) = \lambda\,\frac{d\ell(\sigma(v), y_t)}{dv}\Big|_{v = w_{t-1}\cdot x_t}\,x_t$.
Let $C_t = \Big|\frac{d\ell(\sigma(v), y_t)}{dv}\Big|_{v = w_{t-1}\cdot x_t}\Big|\,\|x_t\|_\infty$. Then $z_i \in [-\lambda C_t, \lambda C_t]$.

ANALYSIS /3
The goal is to bound $\ln\big(\sum_i w_{t-1,i}e^{v_i}\big)$, where $z_i \in [-\lambda C_t, \lambda C_t]$. Hoeffding's inequality gives
$\ln\Big(\sum_i w_{t-1,i}e^{v_i}\Big) \leq \frac{\lambda^2 C_t^2}{2}$.
If $(\sigma, \ell)$ is $\alpha$-subquadratic, $C_t^2 \leq \alpha\,\ell_t^\sigma(w_{t-1})\,X_\infty^2$.
Collecting the inequalities (with uniform initial weights $w_0$, so that $D(u\|w_0) \leq \ln d$):
$R_n^\sigma(u) \leq \frac{\ln d}{\lambda} + \frac{\alpha\lambda}{2}X_\infty^2\,\hat{L}_n^\sigma$.
Choosing $\lambda$ "cleverly" gives the following result.

REGRET FOR EG
THEOREM. Assume $(\sigma, \ell)$ is nice and $\alpha$-subquadratic. Let $u \in \Delta_1$, $X_\infty = \max_{t=1,\ldots,n}\|x_t\|_\infty$, $\varepsilon \in (0, 1)$, and $\lambda = 2\varepsilon/(\alpha X_\infty^2)$. Then
$\hat{L}_n^\sigma \leq \frac{L_n^\sigma(u)}{1-\varepsilon} + \frac{\alpha X_\infty^2\,\ln d}{2\varepsilon(1-\varepsilon)}$.

NEGATIVE WEIGHTS? [Grove et al., 1997]
Use EG with $x_t' = (-x_{t,1}, \ldots, -x_{t,d}, x_{t,1}, \ldots, x_{t,d})$. This is equivalent to projected GBLF with the hyperbolic cosine potential.

CHOOSING THE POTENTIAL
Fix $u$; scale the simplex to contain $u$ $\Rightarrow$ can use EG:
$\hat{L}_n^\sigma \leq \frac{L_n^\sigma(u)}{1-\varepsilon} + \frac{\alpha}{2\varepsilon(1-\varepsilon)}\times(\ln d)\,\|u\|_1^2\,X_\infty^2$.
Regret for GBLF using the polynomial potential with $p \geq 2$:
$\hat{L}_n^\sigma \leq \frac{L_n^\sigma(u)}{1-\varepsilon} + \frac{p-1}{2}\cdot\frac{\alpha}{2\varepsilon(1-\varepsilon)}\times\|u\|_q^2\,X_p^2$.
$p = \ln d$ gives a close match.
EG is at an advantage: dense $x_t$ ($x_{t,i} \in \{-1, 1\}$), sparse $u$ (e.g., $u = (1, 0, \ldots, 0)$).
Poly is at an advantage: sparse side information, dense expert.
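The comparison above is easy to try out. Below is a minimal sketch (ours, not from the slides) of the normalized EG update for the simple case $\sigma(x) = x$ with the quadratic loss, run on the "dense side information, sparse expert" setup where EG is at an advantage; the learning rate and the toy data are arbitrary illustrative choices.

```python
import numpy as np

def eg_step(w, x, y, lam):
    # Normalized EG: w'_{i} = w_i * exp(-lam * grad_i), then project onto the simplex,
    # which for the (unnormalized KL) divergence used here is plain renormalization.
    p = w @ x                       # prediction with identity transfer, sigma(v) = v
    grad = (p - y) * x              # gradient of the quadratic loss 1/2 (p - y)^2 w.r.t. w
    w_new = w * np.exp(-lam * grad)
    return w_new / w_new.sum()

rng = np.random.default_rng(2)
d, n = 10, 500
u = np.zeros(d); u[0] = 1.0         # sparse expert u = (1, 0, ..., 0)
w = np.ones(d) / d                  # uniform w_0, so D(u || w_0) <= ln d
total_loss = 0.0
for t in range(n):
    x = rng.choice([-1.0, 1.0], size=d)   # dense side information, x_{t,i} in {-1, 1}
    y = u @ x
    total_loss += 0.5 * (w @ x - y) ** 2
    w = eg_step(w, x, y, lam=0.1)
print(total_loss, w.round(2))       # weight mass concentrates on the first coordinate
```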
REFERENCES
Auer, P., Cesa-Bianchi, N., and Gentile, C. (2002). Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1).
Cesa-Bianchi, N. (1999). Analysis of two gradient-based algorithms for on-line regression. Journal of Computer and System Sciences, 59(3):392–411.
Grove, A., Littlestone, N., and Schuurmans, D. (1997). General convergence results for linear discriminant updates. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 171–183. ACM Press.
Grove, A., Littlestone, N., and Schuurmans, D. (2001). General convergence results for linear discriminant updates. Machine Learning, 43(3):173–210.
Helmbold, D., Kivinen, J., and Warmuth, M. (1999). Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291–1304.
Herbster, M. and Warmuth, M. (2001). Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309.
Kivinen, J. and Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63.
Rockafellar, R. (1970). Convex Analysis. Princeton University Press.
Warmuth, M. and Jagota, A. (1997). Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics.
Widrow, B. and Hoff, M. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pages 96–104.