On Gradient-Based Optimization: Accelerated, Asynchronous, Distributed and Stochastic

Michael I. Jordan, University of California, Berkeley
June 8, 2017
with Andre Wibisono, Ashia Wilson and Michael Betancourt

Outline
• Variational, Hamiltonian and symplectic perspectives on Nesterov acceleration
• Avoiding saddle points, efficiently
• Stochastically-controlled stochastic gradient

Part I: Variational, Hamiltonian and Symplectic Perspectives on Acceleration
(with Andre Wibisono, Ashia Wilson and Michael Betancourt)

Interplay between Differentiation and Integration
• The 300-year-old fields: physics, statistics (cf. Lagrange/Hamilton, Laplace expansions, saddlepoint expansions)
• The numerical disciplines (e.g., finite elements, Monte Carlo)
• Optimization?

Accelerated gradient descent
Setting: unconstrained convex optimization,
    $\min_{x \in \mathbb{R}^d} f(x)$
• Classical gradient descent,
    $x_{k+1} = x_k - \beta \nabla f(x_k),$
  obtains a convergence rate of $O(1/k)$.
• Accelerated gradient descent,
    $y_{k+1} = x_k - \beta \nabla f(x_k)$
    $x_{k+1} = (1 - \lambda_k)\, y_{k+1} + \lambda_k\, y_k,$
  obtains the (optimal) convergence rate of $O(1/k^2)$.

The acceleration phenomenon
Two classes of algorithms:
• Gradient methods: gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak '06), etc. These are greedy descent methods and are relatively well understood.
• Accelerated methods: Nesterov's accelerated gradient descent, accelerated mirror descent, accelerated cubic-regularized Newton's method (Nesterov '08), etc. These are important for both theory (they achieve the optimal rate for first-order methods) and practice (many extensions: FISTA, the stochastic setting, etc.). They are not descent methods, they are faster than gradient methods, and they remain mysterious.

Accelerated methods: the continuous-time perspective
• Gradient descent is a discretization of gradient flow, $\dot{X}_t = -\nabla f(X_t)$ (and mirror descent is a discretization of natural gradient flow).
• Su, Boyd and Candes '14: the continuous-time limit of accelerated gradient descent is a second-order ODE,
    $\ddot{X}_t + \frac{3}{t} \dot{X}_t + \nabla f(X_t) = 0.$
• These ODEs are obtained by taking continuous-time limits. Is there a deeper generative mechanism?
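The $O(1/k)$ versus $O(1/k^2)$ gap is easy to see numerically. The following is a minimal sketch (an illustration, not from the talk; the quadratic objective, dimension, step size, and iteration count are assumptions) comparing plain gradient descent with the accelerated recursion above, where $\lambda_k = -(k-1)/(k+2)$ recovers Nesterov's momentum:

    # Minimal sketch: gradient descent vs. Nesterov acceleration on a convex
    # quadratic f(x) = 0.5 * x^T A x. All problem choices here are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 100
    eigs = np.logspace(-3, 0, d)          # ill-conditioned spectrum
    A = np.diag(eigs)

    def f(x):
        return 0.5 * x @ A @ x            # minimum value 0 at x* = 0

    def grad(x):
        return A @ x

    beta = 1.0 / eigs.max()               # step size 1/L for an L-smooth f

    x0 = rng.standard_normal(d)
    x_gd = x0.copy()                      # gradient-descent iterate
    x = x0.copy()                         # accelerated iterate x_k
    y_prev = x0.copy()                    # previous y_k

    for k in range(1, 2001):
        # Plain gradient descent: O(1/k) on smooth convex problems.
        x_gd = x_gd - beta * grad(x_gd)

        # Accelerated recursion from the slide, with lambda_k = -(k-1)/(k+2),
        # i.e., x_{k+1} = y_{k+1} + ((k-1)/(k+2)) (y_{k+1} - y_k).
        y = x - beta * grad(x)
        lam = -(k - 1) / (k + 2)
        x = (1 - lam) * y + lam * y_prev
        y_prev = y

    print(f"GD : f = {f(x_gd):.3e}")      # decays roughly like 1/k
    print(f"NAG: f = {f(y):.3e}")         # decays roughly like 1/k^2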
Our work:
• A general variational approach to acceleration
• A systematic discretization methodology

Bregman Lagrangian
Define the Bregman Lagrangian:
    $\mathcal{L}(x, \dot{x}, t) = e^{\gamma_t + \alpha_t} \big( D_h(x + e^{-\alpha_t} \dot{x}, \, x) - e^{\beta_t} f(x) \big)$
• It is a function of position $x$, velocity $\dot{x}$, and time $t$.
• $D_h(y, x) = h(y) - h(x) - \langle \nabla h(x), y - x \rangle$ is the Bregman divergence.
• $h$ is the convex distance-generating function.
• $f$ is the convex objective function.
• $\alpha_t, \beta_t, \gamma_t \in \mathbb{R}$ are arbitrary smooth functions.
• In the Euclidean setting, it simplifies to the damped Lagrangian
    $\mathcal{L}(x, \dot{x}, t) = e^{\gamma_t - \alpha_t} \left( \tfrac{1}{2} \|\dot{x}\|^2 - e^{2\alpha_t + \beta_t} f(x) \right).$
[Figure: the Bregman divergence $D_h(y, x)$ as the gap between $h(y)$ and the tangent approximation of $h$ at $x$.]

Ideal scaling conditions:
    $\dot{\beta}_t \le e^{\alpha_t}, \qquad \dot{\gamma}_t = e^{\alpha_t}$

Variational problem over curves:
    $\min_X \int \mathcal{L}(X_t, \dot{X}_t, t) \, dt$
The optimal curve is characterized by the Euler-Lagrange equation:
    $\frac{d}{dt} \frac{\partial \mathcal{L}}{\partial \dot{x}}(X_t, \dot{X}_t, t) = \frac{\partial \mathcal{L}}{\partial x}(X_t, \dot{X}_t, t)$
The E-L equation for the Bregman Lagrangian under ideal scaling:
    $\ddot{X}_t + (e^{\alpha_t} - \dot{\alpha}_t) \dot{X}_t + e^{2\alpha_t + \beta_t} \left[ \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) \right]^{-1} \nabla f(X_t) = 0$
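As a sanity check, this Euler-Lagrange reduction can be verified symbolically. Below is a minimal SymPy sketch (my addition, not from the talk) for the one-dimensional Euclidean case $h(x) = x^2/2$, where $\nabla^2 h \equiv 1$; it imposes the ideal scaling $\dot{\gamma}_t = e^{\alpha_t}$ and confirms that the damped Lagrangian yields the E-L equation just stated:

    # Symbolic sanity check (SymPy), 1-D Euclidean case h(x) = x^2/2:
    # the Euler-Lagrange equation of
    #   L = e^{g - a} ( xdot^2 / 2 - e^{2a + b} f(x) )
    # should reduce, under the ideal scaling g' = e^a, to
    #   X'' + (e^a - a') X' + e^{2a + b} f'(X) = 0.
    import sympy as sp

    t = sp.symbols('t')
    X = sp.Function('X')
    f, a, b, g = (sp.Function(s) for s in ('f', 'a', 'b', 'g'))

    x, v = X(t), X(t).diff(t)
    L = sp.exp(g(t) - a(t)) * (v**2 / 2 - sp.exp(2*a(t) + b(t)) * f(x))

    # Euler-Lagrange: d/dt (dL/dv) - dL/dx = 0
    EL = sp.diff(L.diff(v), t) - L.diff(x)

    # Impose ideal scaling g' = e^a, then normalize by e^{g - a}.
    EL = EL.subs(g(t).diff(t), sp.exp(a(t)))
    EL = sp.expand(sp.simplify(EL / sp.exp(g(t) - a(t))))

    target = (x.diff(t, 2) + (sp.exp(a(t)) - a(t).diff(t)) * v
              + sp.exp(2*a(t) + b(t)) * sp.Derivative(f(x), x))
    print(sp.simplify(EL - target))   # expected output: 0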
General convergence rate
Theorem. Under ideal scaling, the E-L equation has convergence rate
    $f(X_t) - f(x^*) \le O(e^{-\beta_t}).$
Proof. Exhibit a Lyapunov function for the dynamics:
    $\mathcal{E}_t = D_h\big(x^*, \, X_t + e^{-\alpha_t} \dot{X}_t\big) + e^{\beta_t} \big(f(X_t) - f(x^*)\big)$
    $\dot{\mathcal{E}}_t = -e^{\alpha_t + \beta_t} D_f(x^*, X_t) + (\dot{\beta}_t - e^{\alpha_t})\, e^{\beta_t} \big(f(X_t) - f(x^*)\big) \le 0$
Note: this only requires convexity and differentiability of $f$ and $h$.

Polynomial convergence rates
For $p > 0$, choose the parameters
    $\alpha_t = \log p - \log t, \qquad \beta_t = p \log t + \log C, \qquad \gamma_t = p \log t.$
The E-L equation then has an $O(e^{-\beta_t}) = O(1/t^p)$ convergence rate:
    $\ddot{X}_t + \frac{p+1}{t} \dot{X}_t + C p^2 t^{p-2} \left[ \nabla^2 h\Big(X_t + \frac{t}{p} \dot{X}_t\Big) \right]^{-1} \nabla f(X_t) = 0$
For $p = 2$:
• We recover the result of Krichene et al. with its $O(1/t^2)$ convergence rate.
• In the Euclidean case, we recover the ODE of Su et al.:
    $\ddot{X}_t + \frac{3}{t} \dot{X}_t + \nabla f(X_t) = 0$

Time dilation property (reparameterizing time)
For $p = 2$ (accelerated gradient descent),
    $\ddot{X}_t + \frac{3}{t} \dot{X}_t + 4C \left[ \nabla^2 h\Big(X_t + \frac{t}{2} \dot{X}_t\Big) \right]^{-1} \nabla f(X_t) = 0, \qquad O(1/t^2).$
Speeding up time via $Y_t = X_{t^{3/2}}$ yields
    $\ddot{Y}_t + \frac{4}{t} \dot{Y}_t + 9C t \left[ \nabla^2 h\Big(Y_t + \frac{t}{3} \dot{Y}_t\Big) \right]^{-1} \nabla f(Y_t) = 0, \qquad O(1/t^3),$
which is $p = 3$ (accelerated cubic-regularized Newton's method).
• All accelerated methods are traveling the same curve in space-time at different speeds.
• Gradient methods don't have this property. (From gradient flow to rescaled gradient flow: replace $\frac{1}{2}\|\cdot\|^2$ by $\frac{1}{p}\|\cdot\|^p$.)

Time dilation for the general Bregman Lagrangian
Speeding up time via $Y_t = X_{\tau(t)}$ takes the E-L equation for the Lagrangian $\mathcal{L}_{\alpha,\beta,\gamma}$, with rate $O(e^{-\beta_t})$, to the E-L equation for the Lagrangian $\mathcal{L}_{\tilde{\alpha},\tilde{\beta},\tilde{\gamma}}$, with rate $O(e^{-\beta_{\tau(t)}})$, where
    $\tilde{\alpha}_t = \alpha_{\tau(t)} + \log \dot{\tau}(t), \qquad \tilde{\beta}_t = \beta_{\tau(t)}, \qquad \tilde{\gamma}_t = \gamma_{\tau(t)}.$

Question: how do we discretize the E-L equation while preserving the convergence rate?
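Before discretizing, the continuous-time guarantee can be checked numerically. The sketch below (an illustration, not from the talk; the quadratic objective and solver tolerances are assumptions) integrates the $p = 2$ Euclidean ODE of Su et al. with an off-the-shelf adaptive solver and tracks $t^2 \cdot (f(X_t) - f(x^*))$, which should stay bounded:

    # Numerical sketch: integrate  X'' + (3/t) X' + grad f(X) = 0  and verify
    # the O(1/t^2) rate by checking that t^2 * (f(X_t) - f*) stays bounded.
    import numpy as np
    from scipy.integrate import solve_ivp

    eigs = np.array([0.01, 0.1, 1.0, 10.0])       # illustrative quadratic
    f = lambda x: 0.5 * np.sum(eigs * x**2)       # f* = 0 at x* = 0
    grad = lambda x: eigs * x

    def rhs(t, state):
        # First-order form: state = (X, V), X' = V, V' = -(3/t) V - grad f(X).
        x, v = np.split(state, 2)
        return np.concatenate([v, -(3.0 / t) * v - grad(x)])

    x0 = np.array([1.0, -2.0, 0.5, 1.5])
    state0 = np.concatenate([x0, np.zeros_like(x0)])   # start at rest

    # Start slightly after t = 0 to avoid the 3/t singularity.
    sol = solve_ivp(rhs, (1e-3, 100.0), state0,
                    t_eval=np.logspace(-1, 2, 7), rtol=1e-9, atol=1e-12)

    for t, state in zip(sol.t, sol.y.T):
        x = state[:len(x0)]
        print(f"t = {t:8.2f}   t^2 (f - f*) = {t**2 * f(x):.4f}")
    # The last column stays O(1), consistent with f(X_t) - f* = O(1/t^2).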
Discretizing the dynamics (naive approach)
Write the E-L equation as a system of first-order equations:
    $Z_t = X_t + \frac{t}{p} \dot{X}_t$
    $\frac{d}{dt} \nabla h(Z_t) = -C p t^{p-1} \nabla f(X_t)$
Euler discretization with time step $\delta > 0$ (i.e., set $x_k = X_t$, $x_{k+1} = X_{t+\delta}$):
    $x_{k+1} = \frac{p}{k+p} z_k + \frac{k}{k+p} x_k$
    $z_k = \arg\min_z \left\{ C p k^{(p-1)} \langle \nabla f(x_k), z \rangle + \frac{1}{\epsilon} D_h(z, z_{k-1}) \right\}$
with step size $\delta = \epsilon^{1/p}$ (i.e., $\epsilon = \delta^p$), where $k^{(p-1)} = k(k+1)\cdots(k+p-2)$ is the rising factorial.

Naive discretization doesn't work
We cannot obtain a convergence guarantee for this scheme, and it is empirically unstable.
[Figure: iterate trajectory in the plane and $f(x_k)$ versus $k$ under the naive discretization, illustrating the instability.]

Modified discretization
Introduce an auxiliary sequence $y_k$:
    $x_{k+1} = \frac{p}{k+p} z_k + \frac{k}{k+p} y_k$
    $z_k = \arg\min_z \left\{ C p k^{(p-1)} \langle \nabla f(y_k), z \rangle + \frac{1}{\epsilon} D_h(z, z_{k-1}) \right\}$
Sufficient condition:
    $\langle \nabla f(y_k), x_k - y_k \rangle \ge M \epsilon^{1/(p-1)} \|\nabla f(y_k)\|_*^{p/(p-1)}$
Assume $h$ is uniformly convex: $D_h(y, x) \ge \frac{1}{p} \|y - x\|^p$.
Theorem.
    $f(y_k) - f(x^*) \le O\left(\frac{1}{\epsilon k^p}\right)$
Note the matching convergence rates: $1/(\epsilon k^p) = 1/(\delta k)^p = 1/t^p$. The proof uses a generalization of Nesterov's estimate-sequence technique.

Accelerated higher-order gradient method
    $x_{k+1} = \frac{p}{k+p} z_k + \frac{k}{k+p} y_k$
    $y_k = \arg\min_y \left\{ f_{p-1}(y; x_k) + \frac{2}{\epsilon p} \|y - x_k\|^p \right\} \qquad \leftarrow O\left(\frac{1}{\epsilon k^{p-1}}\right)$
    $z_k = \arg\min_z \left\{ C p k^{(p-1)} \langle \nabla f(y_k), z \rangle + \frac{1}{\epsilon} D_h(z, z_{k-1}) \right\}$
Here $f_{p-1}(y; x)$ denotes the $(p-1)$-st-order Taylor approximation of $f$ around $x$. If $\nabla^{p-1} f$ is $(1/\epsilon)$-Lipschitz and $h$ is uniformly convex of order $p$, then
    $f(y_k) - f(x^*) \le O\left(\frac{1}{\epsilon k^p}\right) \qquad \leftarrow \text{accelerated rate}$
p = 2: accelerated gradient/mirror descent
p = 3: accelerated cubic-regularized Newton's method (Nesterov '08)
p ≥ 2: accelerated higher-order methods

Recap: gradient vs. accelerated methods
How to design dynamics for minimizing a convex function $f$?

Gradient methods. Rescaled gradient flow,
    $\dot{X}_t = -\nabla f(X_t) / \|\nabla f(X_t)\|_*^{(p-2)/(p-1)}, \qquad O\Big(\frac{1}{t^{p-1}}\Big),$
is matched by the higher-order gradient method, with rate $O\big(\frac{1}{\epsilon k^{p-1}}\big)$ when $\nabla^{p-1} f$ is $\frac{1}{\epsilon}$-Lipschitz; the rates match with $\epsilon = \delta^{p-1} \Leftrightarrow \delta = \epsilon^{1/(p-1)}$.

Accelerated methods. The polynomial Euler-Lagrange equation,
    $\ddot{X}_t + \frac{p+1}{t} \dot{X}_t + C p^2 t^{p-2} \left[ \nabla^2 h\Big(X_t + \frac{t}{p} \dot{X}_t\Big) \right]^{-1} \nabla f(X_t) = 0, \qquad O\Big(\frac{1}{t^p}\Big),$
is matched by the accelerated higher-order method, with rate $O\big(\frac{1}{\epsilon k^p}\big)$ when $\nabla^{p-1} f$ is $\frac{1}{\epsilon}$-Lipschitz; the rates match with $\epsilon = \delta^p \Leftrightarrow \delta = \epsilon^{1/p}$.
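To make the modified discretization concrete, here is a minimal sketch (my own; the objective and the constants $\epsilon$ and $C$ are illustrative assumptions) of the $p = 2$ Euclidean case, where $h = \frac{1}{2}\|\cdot\|^2$ makes both argmin steps explicit and the scheme is essentially accelerated gradient/mirror descent in $(x, y, z)$ form:

    # Minimal sketch of the modified discretization for p = 2, Euclidean h.
    # y_k: gradient step; z_k: mirror (dual-averaging) step with weight
    # C p k^{(p-1)} = 2 C k; x_{k+1}: convex combination coupling the two.
    import numpy as np

    eigs = np.logspace(-3, 0, 100)
    f = lambda x: 0.5 * np.sum(eigs * x**2)
    grad = lambda x: eigs * x

    L_smooth = eigs.max()      # gradient Lipschitz constant
    eps = 1.0 / L_smooth       # grad f is (1/eps)-Lipschitz
    C = 0.25                   # illustrative choice of the constant C

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100)
    z = x.copy()               # z_0 = x_0

    for k in range(1, 1001):
        # y_k = argmin_y <grad f(x_k), y> + (1/eps)||y - x_k||^2
        y = x - (eps / 2.0) * grad(x)
        # z_k = argmin_z 2Ck <grad f(y_k), z> + (1/(2 eps))||z - z_{k-1}||^2
        z = z - 2.0 * C * eps * k * grad(y)
        # x_{k+1} = (2/(k+2)) z_k + (k/(k+2)) y_k
        x = (2.0 / (k + 2)) * z + (k / (k + 2)) * y

    print(f"f(y_k) - f* = {f(y):.3e} after 1000 steps")  # ~ O(1/(eps k^2))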
Towards A Symplectic Perspective
• If initialized close enough, diminishing gradient flow will relax to an optimum quickly.
• We can construct physical systems that will rapidly evolve into the neighborhood of the optimum, but the inertia can slow relaxation once we get there.
• Can a mixture of these flows yield rapid convergence to the optimum in both regimes?
• Speed also depends on the discretization, and discretization of the Lagrangian dynamics is fragile, requiring small step sizes.
• We can build more robust solutions by taking a Legendre transform and considering a Hamiltonian formalism.
• The Hamiltonian perspective admits symplectic integrators, which are accurate and stable even for large step sizes.
• Exploiting this stability yields algorithms with state-of-the-art performance, and perhaps even more.

Part II: Avoiding Saddle Points, Efficiently
(with Chi Jin, Rong Ge, Praneeth Netrapalli and Sham Kakade)

Definitions and algorithm
A function $f(\cdot)$ is $\rho$-Hessian Lipschitz if, for all $x_1, x_2$,
    $\|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho \|x_1 - x_2\|.$
A point $x$ is an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) if
    $\|\nabla f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}.$

Perturbed Gradient Descent (PGD)
1. for t = 0, 1, ... do
2.   if the perturbation condition holds then
3.     $x_t \leftarrow x_t + \xi_t$, with $\xi_t$ drawn uniformly from the ball $B_0(r)$
4.   $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
The perturbation is added only when $\|\nabla f(x_t)\| \le \epsilon$, and no more than once per $T$ steps.

Main result: PGD converges to an SOSP (this work)
For an $\ell$-smooth and $\rho$-Hessian Lipschitz function $f$, PGD with $\eta = O(1/\ell)$ and a proper choice of $r$, $T$ w.h.p. finds an $\epsilon$-SOSP in
    $\tilde{O}\left(\frac{\ell\, (f(x_0) - f^\star)}{\epsilon^2}\right)$
iterations. The dimension dependence of the iteration count is $\log^4(d)$ (almost dimension-free).

                 GD (Nesterov 1998)                       PGD (this work)
Assumptions      $\ell$-grad-Lipschitz                    $\ell$-grad-Lipschitz + $\rho$-Hessian-Lipschitz
Guarantees       $\epsilon$-FOSP                          $\epsilon$-SOSP
Iterations       $2\ell(f(x_0) - f^\star)/\epsilon^2$     $\tilde{O}(\ell(f(x_0) - f^\star)/\epsilon^2)$

Part III: Stochastically-Controlled Stochastic Gradient
(with Lihua Lei)

Setup
Task: minimizing a composite objective,
    $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i \in [n]} f_i(x)$
Assumption: there exist $L < \infty$ and $\mu \ge 0$ such that
    $\frac{\mu}{2} \|x - y\|^2 \le f_i(x) - f_i(y) - \langle \nabla f_i(y), x - y \rangle \le \frac{L}{2} \|x - y\|^2.$
Here $\mu = 0$ is the non-strongly convex case and $\mu > 0$ is the strongly convex case; $\kappa \triangleq L/\mu$.

Computation complexity
Accessing $(f_i(x), \nabla f_i(x))$ incurs one unit of cost. Given $\epsilon > 0$, let $T(\epsilon)$ be the minimum cost to achieve
    $\mathbb{E}\, f(x_{T(\epsilon)}) - f(x^*) \le \epsilon.$
Worst-case analysis bounds $T(\epsilon)$ almost surely, e.g., $T(\epsilon) = O\big((n + \kappa) \log \frac{1}{\epsilon}\big)$ for SVRG.

SVRG algorithm
SVRG (within an epoch):
1: $I \leftarrow [n]$
2: $g \leftarrow \frac{1}{|I|} \sum_{i \in I} \nabla f_i(x_0)$
3: $m \leftarrow n$
4: generate $N \sim \mathrm{Uniform}([m])$
5: for $k = 1, 2, \ldots, N$ do
6:   randomly pick $i \in [n]$
7:   $\nu \leftarrow \nabla f_i(x) - \nabla f_i(x_0) + g$
8:   $x \leftarrow x - \eta \nu$
9: end for

Analysis
                  General convex           Strongly convex
Nesterov's AGD    $n/\sqrt{\epsilon}$      $n \sqrt{\kappa} \log \frac{1}{\epsilon}$
SGD               $1/\epsilon^2$           $\frac{\kappa}{\epsilon} \log \frac{1}{\epsilon}$
SVRG              --                       $(n + \kappa) \log \frac{1}{\epsilon}$
Katyusha          $n \sqrt{1/\epsilon}$    $(n + \sqrt{n \kappa}) \log \frac{1}{\epsilon}$

All of the above results are from worst-case analysis. SGD is the only method with complexity free of $n$; however, its stepsize $\eta$ depends on the unknown quantity $\|x_0 - x^*\|^2$ and on the total number of epochs $T$.

Average-case analysis
Instead of bounding $T(\epsilon)$, one can bound $\mathbb{E}\, T(\epsilon)$; there is evidence from other areas that average-case analysis can give much better complexity bounds than worst-case analysis.
Goal: design an efficient algorithm such that $\mathbb{E}\, T(\epsilon)$ is independent of $n$ and is smaller than the worst-case bounds.
An algorithm is tuning-friendly if the stepsize $\eta$ is the only parameter to tune, and $\eta$ is a constant that depends only on $L$ and $\mu$.
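As a reference point for the SCSG variants below, here is a minimal Python sketch (my own; the least-squares components, problem sizes, and stepsize are illustrative assumptions) of the SVRG epoch shown above:

    # Minimal sketch of one SVRG epoch: full anchor gradient, then
    # N ~ Uniform([n]) variance-reduced inner steps.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 20
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)

    def grad_i(x, i):
        # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
        return (A[i] @ x - b[i]) * A[i]

    def svrg_epoch(x0, eta):
        g = A.T @ (A @ x0 - b) / n            # full gradient at the anchor (line 2)
        N = int(rng.integers(1, n + 1))       # epoch length N ~ Uniform([n]) (line 4)
        x = x0.copy()
        for _ in range(N):
            i = int(rng.integers(n))                   # sample i from [n] (line 6)
            nu = grad_i(x, i) - grad_i(x0, i) + g      # variance-reduced direction
            x -= eta * nu
        return x

    L_max = (np.linalg.norm(A, axis=1) ** 2).max()     # max component smoothness
    x, eta = np.zeros(d), 0.1 / L_max                  # illustrative constant stepsize
    for _ in range(30):
        x = svrg_epoch(x, eta)
    print("f(x) =", 0.5 * np.mean((A @ x - b) ** 2))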
SCSG/SCSG+: complexity
• SGD: general convex $\frac{1}{\epsilon^2}$; strongly convex $\frac{\kappa}{\epsilon} \log \frac{1}{\epsilon}$; not tuning-friendly.
• SCSG: general convex $\frac{1}{\epsilon^2} \wedge \frac{n}{\epsilon}$; strongly convex $\big(\frac{1}{\epsilon} \wedge n + \kappa\big) \log \frac{1}{\epsilon}$; tuning-friendly.
• SCSG+: tuning-friendly. [Rate entries not recoverable from the extraction.]
• SCSG+ ($\alpha \ll 1$): not tuning-friendly. [Rate entries not recoverable from the extraction.]

SCSG/SCSG+: algorithm
SVRG (within an epoch):
1: $I \leftarrow [n]$
2: $g \leftarrow \frac{1}{|I|} \sum_{i \in I} \nabla f_i(x_0)$
3: $m \leftarrow n$
4: generate $N \sim \mathrm{Uniform}([m])$
5: for $k = 1, 2, \ldots, N$ do
6:   randomly pick $i \in [n]$
7:   $\nu \leftarrow \nabla f_i(x) - \nabla f_i(x_0) + g$
8:   $x \leftarrow x - \eta \nu$
9: end for

SCSG(+) (within an epoch):
1: randomly pick $I$ with size $B$
2: $g \leftarrow \frac{1}{|I|} \sum_{i \in I} \nabla f_i(x_0)$
3: $\gamma \leftarrow 1 - 1/B$
4: generate $N \sim \mathrm{Geo}(\gamma)$
5: for $k = 1, 2, \ldots, N$ do
6:   randomly pick $i \in I$
7:   $\nu \leftarrow \nabla f_i(x) - \nabla f_i(x_0) + g$
8:   $x \leftarrow x - \eta \nu$
9: end for

The changes from SVRG: the anchor gradient $g$ is computed on a random batch $I$ of size $B$ rather than on all of $[n]$, the epoch length $N$ is geometric rather than uniform, and the inner samples are drawn from $I$. (A Python sketch of one SCSG epoch appears at the end of this part.)

In epoch $j$, SCSG fixes $B_j \equiv B(\epsilon)$; explicit forms of $B(\epsilon)$ are given in both the non-strongly convex and strongly convex cases. SCSG+ uses a geometrically increasing sequence $B_j = \lceil B_0 b^j \wedge n \rceil$.

Conclusion
• SCSG/SCSG+ save computation costs on average by avoiding calculation of the full gradient.
• SCSG/SCSG+ also save communication costs in distributed systems by avoiding sampling a gradient from the whole dataset.
• SCSG/SCSG+ are able to achieve an approximate optimum with potentially less than a single pass through the data.
• The average computation cost of SCSG+ beats the oracle lower bounds from worst-case analysis.
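For concreteness, here is the promised sketch of one SCSG epoch (my own illustration, a self-contained variant of the earlier SVRG sketch with the same illustrative least-squares setup). Note the three changes: mini-batch anchor gradient, Geometric($\gamma$) epoch length with $\gamma = 1 - 1/B$, and inner sampling restricted to the batch $I$:

    # Minimal sketch of one SCSG epoch. Problem data and stepsize illustrative.
    import numpy as np

    rng = np.random.default_rng(1)
    n, d, B = 500, 20, 64
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)

    def grad_i(x, i):
        # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
        return (A[i] @ x - b[i]) * A[i]

    def scsg_epoch(x0, eta):
        I = rng.choice(n, size=B, replace=False)       # mini-batch anchor (line 1)
        g = np.mean([grad_i(x0, i) for i in I], axis=0)
        # N ~ Geo(gamma) with gamma = 1 - 1/B; numpy's geometric takes the
        # success probability 1/B, so the mean epoch length is B (line 4).
        N = int(rng.geometric(1.0 / B))
        x = x0.copy()
        for _ in range(N):
            i = int(rng.choice(I))                     # sample from I, not [n] (line 6)
            nu = grad_i(x, i) - grad_i(x0, i) + g      # same variance-reduced direction
            x -= eta * nu
        return x

    L_max = (np.linalg.norm(A, axis=1) ** 2).max()
    x, eta = np.zeros(d), 0.1 / L_max                  # constant stepsize of order 1/L
    for _ in range(50):
        x = scsg_epoch(x, eta)                         # SCSG+: grow B geometrically here
    print("f(x) =", 0.5 * np.mean((A @ x - b) ** 2))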