CIS 700: "Algorithms for Big Data"
Lecture 8: Gradient Descent
Slides at http://grigory.us/big-data-class.html
Grigory Yaroslavtsev, http://grigory.us

Smooth Convex Optimization
• Minimize f over ℝⁿ:
  – f admits a minimizer x* (∇f(x*) = 0)
  – f is continuously differentiable and convex on ℝⁿ:
      ∀x, y ∈ ℝⁿ: f(x) − f(y) ≥ (x − y)ᵀ∇f(y)
  – f is smooth (∇f is β-Lipschitz):
      ∀x, y ∈ ℝⁿ: ‖∇f(x) − ∇f(y)‖ ≤ β‖x − y‖
• Example: f = ½ xᵀAx − bᵀx  ⇒  ∇f = Ax − b  ⇒  x* = A⁻¹b

Gradient Descent Method
• Gradient descent method:
  – Start with an arbitrary x₁
  – Iterate x_{s+1} = x_s − η·∇f(x_s)
• Thm. If η = 1/β then: f(x_t) − f(x*) ≤ 2β‖x₁ − x*‖² / (t + 3)
• "Linear" O(1/t) convergence; can be improved to a "quadratic" O(1/t²) rate using Nesterov's accelerated gradient descent

Gradient Descent: Analysis
• Lemma 1: If f is β-smooth then ∀x, y ∈ ℝⁿ:
    f(x) ≤ f(y) + ∇f(y)ᵀ(x − y) + (β/2)‖x − y‖²
• Proof: f(x) − f(y) − ∇f(y)ᵀ(x − y) = ∫₀¹ (∇f(y + t(x − y)) − ∇f(y))ᵀ(x − y) dt
    ≤ ∫₀¹ β·t·‖x − y‖² dt = (β/2)‖x − y‖²

Gradient Descent: Analysis
• Lemma 2: If f is convex and β-smooth then ∀x, y ∈ ℝⁿ:
    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (1/(2β))‖∇f(x) − ∇f(y)‖²
• Cor: (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/β)‖∇f(x) − ∇f(y)‖²
• Define f_x(y) = f(y) − ∇f(x)ᵀy
• ∇f_x(y) = ∇f(y) − ∇f(x)
• f_x is convex, β-smooth and minimized at x (since ∇f_x(x) = 0):
  – convexity: f_x(x) − f_x(y) = f(x) − ∇f(x)ᵀx − f(y) + ∇f(x)ᵀy ≥ (x − y)ᵀ∇f_x(y)
  – smoothness: ‖∇f_x(y₁) − ∇f_x(y₂)‖ = ‖∇f(y₁) − ∇f(y₂)‖ ≤ β‖y₁ − y₂‖

Gradient Descent: Analysis
• Lemma 2: If f is convex and β-smooth then ∀x, y ∈ ℝⁿ:
    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (1/(2β))‖∇f(x) − ∇f(y)‖²
• f_x(y) = f(y) − ∇f(x)ᵀy,  ∇f_x(y) = ∇f(y) − ∇f(x)
• f(x) + ∇f(x)ᵀ(y − x) − f(y) = f_x(x) − f_x(y)
    ≤ f_x(y − (1/β)∇f_x(y)) − f_x(y)            [x minimizes f_x]
    ≤ −(1/β)‖∇f_x(y)‖² + (1/(2β))‖∇f_x(y)‖²     [by Lemma 1]
    = −(1/(2β))‖∇f_x(y)‖²
    = −(1/(2β))‖∇f(x) − ∇f(y)‖²

Gradient Descent: Analysis
• Gradient descent: x_{s+1} = x_s − (1/β)·∇f(x_s)
• Thm: f(x_t) − f(x*) ≤ 2β‖x₁ − x*‖² / (t + 3)
• By Lemma 1:
    f(x_{s+1}) − f(x_s) ≤ ∇f(x_s)ᵀ(x_{s+1} − x_s) + (β/2)‖x_{s+1} − x_s‖² = −(1/(2β))‖∇f(x_s)‖²
• Let δ_s = f(x_s) − f(x*). Then δ_{s+1} ≤ δ_s − (1/(2β))‖∇f(x_s)‖²
• By convexity: δ_s ≤ ∇f(x_s)ᵀ(x_s − x*) ≤ ‖x_s − x*‖·‖∇f(x_s)‖
• Lem: ‖x_s − x*‖ is decreasing with s.
• Therefore δ_{s+1} ≤ δ_s − δ_s² / (2β‖x₁ − x*‖²)

Gradient Descent: Analysis
• δ_{s+1} ≤ δ_s − ω·δ_s², where ω = 1 / (2β‖x₁ − x*‖²)
• Dividing by δ_s·δ_{s+1}:  ω·(δ_s/δ_{s+1}) + 1/δ_s ≤ 1/δ_{s+1}
• Since δ_{s+1} ≤ δ_s:  1/δ_{s+1} ≥ 1/δ_s + ω,  so  1/δ_t ≥ ω(t − 1) + 1/δ₁
• f(x₁) − f(x*) ≤ ∇f(x*)ᵀ(x₁ − x*) + (β/2)‖x₁ − x*‖² = (β/2)‖x₁ − x*‖², so 1/δ₁ ≥ 4ω
• δ_t ≤ 1 / (ω(t + 3)) = 2β‖x₁ − x*‖² / (t + 3)

Gradient Descent: Analysis
• Lem: ‖x_s − x*‖ is decreasing with s.
• By the corollary of Lemma 2 (applied with y = x*, using ∇f(x*) = 0):
    ∇f(x_s)ᵀ(x_s − x*) ≥ (1/β)‖∇f(x_s)‖²
• ‖x_{s+1} − x*‖² = ‖x_s − (1/β)∇f(x_s) − x*‖²
    = ‖x_s − x*‖² − (2/β)∇f(x_s)ᵀ(x_s − x*) + (1/β²)‖∇f(x_s)‖²
    ≤ ‖x_s − x*‖² − (1/β²)‖∇f(x_s)‖²
    ≤ ‖x_s − x*‖²
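The update rule above is easy to try directly. Below is a minimal Python/NumPy sketch of gradient descent with step size 1/β on the quadratic example f(x) = ½xᵀAx − bᵀx from the first slide; the random test instance and all function names are illustrative, not part of the lecture.

```python
import numpy as np

def gradient_descent(grad, x1, beta, t):
    """Run t steps of x_{s+1} = x_s - (1/beta) * grad(x_s)."""
    x = x1.copy()
    for _ in range(t):
        x = x - grad(x) / beta
    return x

# Example from the slides: f(x) = 1/2 x^T A x - b^T x, so grad f(x) = A x - b,
# and the smoothness constant beta is the largest eigenvalue of A.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                  # positive definite test matrix
b = rng.standard_normal(5)
beta = np.linalg.eigvalsh(A).max()

x_hat = gradient_descent(lambda x: A @ x - b, np.zeros(5), beta, t=500)
print(np.allclose(x_hat, np.linalg.solve(A, b), atol=1e-3))  # approaches x* = A^{-1} b
```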
Nesterov's Accelerated Gradient Descent
• Params: λ₀ = 0,  λ_s = (1 + √(1 + 4λ_{s−1}²)) / 2,  γ_s = (1 − λ_s) / λ_{s+1}
• Accelerated gradient descent (x₁ = y₁):
  – y_{s+1} = x_s − (1/β)∇f(x_s)
  – x_{s+1} = (1 − γ_s)·y_{s+1} + γ_s·y_s
• Optimal convergence rate O(1/t²)
• Thm. If f is convex and β-smooth then:
    f(y_t) − f(x*) ≤ 2β‖x₁ − x*‖² / t²

Accelerated Gradient Descent: Analysis
• f(x − (1/β)∇f(x)) − f(y)
    ≤ f(x − (1/β)∇f(x)) − f(x) + ∇f(x)ᵀ(x − y)                         [by convexity]
    ≤ ∇f(x)ᵀ(x − (1/β)∇f(x) − x) + (β/2)‖(1/β)∇f(x)‖² + ∇f(x)ᵀ(x − y)  [by Lemma 1]
    = −(1/(2β))‖∇f(x)‖² + ∇f(x)ᵀ(x − y)

Accelerated Gradient Descent: Analysis
• f(x − (1/β)∇f(x)) − f(y) ≤ −(1/(2β))‖∇f(x)‖² + ∇f(x)ᵀ(x − y)
• Apply to x = x_s, y = y_s (note y_{s+1} − x_s = −(1/β)∇f(x_s)):
    f(y_{s+1}) − f(y_s) = f(x_s − (1/β)∇f(x_s)) − f(y_s)
      ≤ −(1/(2β))‖∇f(x_s)‖² + ∇f(x_s)ᵀ(x_s − y_s)
      = −(β/2)‖y_{s+1} − x_s‖² − β(y_{s+1} − x_s)ᵀ(x_s − y_s)          (1)
• Apply to x = x_s, y = x*:
    f(y_{s+1}) − f(x*) ≤ −(β/2)‖y_{s+1} − x_s‖² − β(y_{s+1} − x_s)ᵀ(x_s − x*)          (2)

Accelerated Gradient Descent: Analysis
• Take (λ_s − 1)·(1) + (2); for δ_s = f(y_s) − f(x*):
    λ_s·δ_{s+1} − (λ_s − 1)·δ_s ≤ −(β/2)·λ_s‖y_{s+1} − x_s‖² − β(y_{s+1} − x_s)ᵀ(λ_s·x_s − (λ_s − 1)·y_s − x*)
• Multiply by λ_s and use λ_{s−1}² = λ_s² − λ_s:
    λ_s²·δ_{s+1} − λ_{s−1}²·δ_s ≤ −(β/2)·( ‖λ_s(y_{s+1} − x_s)‖² + 2λ_s(y_{s+1} − x_s)ᵀ(λ_s·x_s − (λ_s − 1)·y_s − x*) )
• It holds that:
    ‖λ_s(y_{s+1} − x_s)‖² + 2λ_s(y_{s+1} − x_s)ᵀ(λ_s·x_s − (λ_s − 1)·y_s − x*)
      = ‖λ_s·y_{s+1} − (λ_s − 1)·y_s − x*‖² − ‖λ_s·x_s − (λ_s − 1)·y_s − x*‖²

Accelerated Gradient Descent: Analysis
• By definition of AGD: x_{s+1} = y_{s+1} + γ_s(y_s − y_{s+1})
    ⇒ λ_{s+1}·x_{s+1} = λ_{s+1}·y_{s+1} + (1 − λ_s)(y_s − y_{s+1})
    ⇒ λ_{s+1}·x_{s+1} − (λ_{s+1} − 1)·y_{s+1} = λ_s·y_{s+1} − (λ_s − 1)·y_s
• Putting the last three facts together, for u_s = λ_s·x_s − (λ_s − 1)·y_s − x* we have:
    λ_s²·δ_{s+1} − λ_{s−1}²·δ_s ≤ (β/2)·( ‖u_s‖² − ‖u_{s+1}‖² )
• Adding up over s = 1 to s = t − 1:  δ_t ≤ β‖u₁‖² / (2λ_{t−1}²)
• By induction λ_{t−1} ≥ t/2.  Q.E.D.
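For comparison with plain gradient descent, here is a direct transcription of the accelerated method into Python/NumPy; the diagonal quadratic used as a test instance and all names are illustrative.

```python
import numpy as np

def nesterov_agd(grad, x1, beta, t):
    """Accelerated gradient descent from the slides (x_1 = y_1)."""
    x, y = x1.copy(), x1.copy()
    lam_prev = 0.0                                        # lambda_0 = 0
    for _ in range(t):
        lam = (1 + np.sqrt(1 + 4 * lam_prev ** 2)) / 2    # lambda_s
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2    # lambda_{s+1}
        gamma = (1 - lam) / lam_next                      # gamma_s
        y_next = x - grad(x) / beta                       # gradient step
        x = (1 - gamma) * y_next + gamma * y              # momentum combination
        y, lam_prev = y_next, lam
    return y

# Illustrative quadratic: f(x) = 1/2 x^T A x - b^T x with grad f(x) = A x - b.
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
beta = 100.0                                              # largest eigenvalue of A
y_t = nesterov_agd(lambda x: A @ x - b, np.zeros(3), beta, t=200)
print(np.linalg.norm(y_t - np.linalg.solve(A, b)))        # distance to x* = A^{-1} b
```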
Constrained Convex Optimization
• Non-convex optimization is NP-hard:
    ∀i: x_i²(1 − x_i)² = 0  ⟺  ∀i: x_i ∈ {0, 1}
• Knapsack:
  – Minimize Σ_i c_i·x_i
  – Subject to: Σ_i w_i·x_i ≤ W
• Convex optimization can often be solved by the ellipsoid algorithm in poly(n) time, but this is too slow

Convex multivariate functions
• Convexity:
  – ∀x, y ∈ ℝⁿ: f(x) ≥ f(y) + (x − y)ᵀ∇f(y)
  – ∀x, y ∈ ℝⁿ, 0 ≤ λ ≤ 1: f(λx + (1 − λ)y) ≤ λ·f(x) + (1 − λ)·f(y)
• If higher derivatives exist:
    f(x) = f(y) + ∇f(y)ᵀ(x − y) + ½(x − y)ᵀ∇²f(y)(x − y) + ⋯
• ∇²f(x)_{ij} = ∂²f / ∂x_i∂x_j is the Hessian matrix
• f is convex iff its Hessian is positive semidefinite: yᵀ∇²f·y ≥ 0 for all y

Examples of convex functions
• ℓ_p-norm is convex for 1 ≤ p ≤ ∞:
    ‖λx + (1 − λ)y‖_p ≤ ‖λx‖_p + ‖(1 − λ)y‖_p = λ‖x‖_p + (1 − λ)‖y‖_p
• f(x) = log(e^{x₁} + e^{x₂} + ⋯ + e^{x_n}):
    max(x₁, …, x_n) ≤ f(x) ≤ max(x₁, …, x_n) + log n
• f(x) = xᵀAx where A is a p.s.d. matrix; ∇²f = 2A
• Examples of constrained convex optimization:
  – (Linear equations with p.s.d. constraints): minimize ½xᵀAx − bᵀx (solution satisfies Ax = b)
  – (Least squares regression): minimize ‖Ax − b‖₂² = xᵀAᵀAx − 2(Ax)ᵀb + bᵀb

Constrained Convex Optimization
• General formulation for convex f and a convex set K:
    minimize f(x) subject to x ∈ K
• Example (SVMs):
  – Data: E₁, …, E_n ∈ ℝᵈ labeled by y₁, …, y_n ∈ {−1, 1} (spam / non-spam)
  – Find a linear model W:
      W·E_i ≥ 1  ⇒ E_i is spam
      W·E_i ≤ −1 ⇒ E_i is non-spam
      ∀i: 1 − y_i·(W·E_i) ≤ 0
• More robust version:
    minimize Σ_i Loss(1 − y_i·(W·E_i)) + λ‖W‖₂²
  – E.g. hinge loss: Loss(t) = max(0, t)
  – Another regularizer: λ‖W‖₁ (favors sparse solutions)

Gradient Descent for Constrained Convex Optimization
• (Projection): given x ∉ K, find y ∈ K:  y = argmin_{z∈K} ‖z − x‖
• Easy to compute for the ℓ₂-ball:  y = x / ‖x‖₂ (when ‖x‖₂ > 1)
• Let ‖∇f(x)‖₂ ≤ G and max_{x,y∈K} ‖x − y‖₂ ≤ D
• Let T = 4D²G² / ε²
• Gradient descent (gradient + projection oracles):
  – Let η = D / (G√T)
  – Repeat for i = 0, …, T:
    • y^(i+1) = x^(i) − η∇f(x^(i))
    • x^(i+1) = projection of y^(i+1) onto K
  – Output z = (1/T)·Σ_i x^(i)

Gradient Descent for Constrained Convex Optimization
• ‖x^(i+1) − x*‖₂² ≤ ‖y^(i+1) − x*‖₂²            [projection can only decrease distance to x* ∈ K]
    = ‖x^(i) − x* − η∇f(x^(i))‖₂²
    = ‖x^(i) − x*‖₂² − 2η∇f(x^(i))ᵀ(x^(i) − x*) + η²‖∇f(x^(i))‖₂²
• Using the definition of G (‖∇f(x^(i))‖₂ ≤ G):
    ∇f(x^(i))ᵀ(x^(i) − x*) ≤ (1/(2η))·( ‖x^(i) − x*‖₂² − ‖x^(i+1) − x*‖₂² ) + ηG²/2
• By convexity: f(x^(i)) − f(x*) ≤ ∇f(x^(i))ᵀ(x^(i) − x*)
• Sum over i = 1, …, T:
    Σ_{i=1}^T f(x^(i)) − f(x*) ≤ (1/(2η))‖x^(0) − x*‖₂² + ηG²T/2

Gradient Descent for Constrained Convex Optimization
• Σ_{i=1}^T f(x^(i)) − f(x*) ≤ (1/(2η))‖x^(0) − x*‖₂² + ηG²T/2 ≤ D²/(2η) + ηG²T/2 = DG√T
• By convexity of f, for z = (1/T)·Σ_i x^(i):
    f(z) − f(x*) ≤ (1/T)·Σ_i ( f(x^(i)) − f(x*) ) ≤ DG/√T ≤ ε

Online Gradient Descent
• Gradient descent works in a more general case:
• f — a sequence of convex functions f₁, f₂, …, f_T
• At step i need to output x^(i) ∈ K
• Let x* be the minimizer of Σ_i f_i(w)
• Minimize regret: Σ_i f_i(x^(i)) − f_i(x*)
• Same analysis as before works in the online case.

Stochastic Gradient Descent
• (Expected gradient oracle): returns g̃ such that E[g̃] = ∇f(x)
• Example: for SVM, pick one term of the loss function at random
• Let g̃_i be the gradient returned at step i
• Let f_i(x) = g̃_iᵀx be the function used in the i-th step of OGD
• Let z = (1/T)·Σ_i x^(i) and let x* be the minimizer of f

Stochastic Gradient Descent
• Thm. E[f(z)] ≤ f(x*) + DG/√T, where G is an upper bound on the norm of any gradient output by the oracle
• f(z) − f(x*) ≤ (1/T)·Σ_i ( f(x^(i)) − f(x*) )            (convexity)
    ≤ (1/T)·Σ_i ∇f(x^(i))ᵀ(x^(i) − x*)                      (convexity)
    = (1/T)·Σ_i E[ g̃_iᵀ(x^(i) − x*) ]                       (grad. oracle)
    = (1/T)·Σ_i E[ f_i(x^(i)) − f_i(x*) ]
    = (1/T)·E[ Σ_i f_i(x^(i)) − f_i(x*) ]
• E[⋯] = regret of OGD, which is always ≤ DG√T
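To make the constrained and online variants above concrete, here is a Python/NumPy sketch of projected gradient descent with iterate averaging, using the unit ℓ₂-ball as the feasible set K (matching the projection example in the slides); the function names and the exact handling of T and η are illustrative.

```python
import numpy as np

def project_l2_ball(x):
    """Projection onto the unit l2-ball: rescale only if x lies outside the ball."""
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

def projected_gd(grad, project, x0, D, G, eps):
    """Projected gradient descent with the step size and iteration count from the slides:
    T = 4 D^2 G^2 / eps^2 and eta = D / (G * sqrt(T)); outputs the average iterate z."""
    T = int(np.ceil(4 * D**2 * G**2 / eps**2))
    eta = D / (G * np.sqrt(T))
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for _ in range(T):
        y = x - eta * grad(x)        # gradient step
        x = project(y)               # projection step onto K
        avg += x
    return avg / T                   # average iterate z
```

With ‖∇f‖₂ ≤ G on K and diameter D of K, the analysis above gives f(z) − f(x*) ≤ DG/√T ≤ ε for the averaged iterate z returned here.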
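Finally, a possible instantiation of the expected gradient oracle for the SVM example: each step uses the (sub)gradient of one randomly chosen hinge-loss term plus the regularizer, which is an unbiased estimate of a (sub)gradient of the full objective. This is a sketch only; the projection onto K is omitted, the step size is taken as a parameter, and all names are illustrative.

```python
import numpy as np

def svm_sgd(E, y, lam, T, eta, rng=None):
    """SGD for the regularized hinge loss
        f(w) = (1/n) * sum_i max(0, 1 - y_i * <w, E_i>) + lam * ||w||_2^2.
    Each step queries an 'expected gradient oracle': the subgradient of one
    randomly chosen hinge term plus the regularizer's gradient, which is an
    unbiased estimate of a subgradient of f. Returns the average iterate z."""
    rng = rng or np.random.default_rng(0)
    n, d = E.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                                           # pick one loss term at random
        g = -y[i] * E[i] if y[i] * (E[i] @ w) < 1 else np.zeros(d)    # subgradient of the sampled hinge term
        g = g + 2 * lam * w                                           # gradient of the regularizer
        w = w - eta * g                                               # gradient step
        avg += w
    return avg / T                                                    # average iterate z
```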