On Gradient-Based Optimization:
Accelerated, Asynchronous, Distributed
and Stochastic
Michael I. Jordan
University of California, Berkeley
June 8, 2017
with Andre Wibisono, Ashia Wilson and Michael Betancourt
Outline
•  Variational, Hamiltonian and symplectic perspectives
on Nesterov acceleration
•  Avoiding saddle points, efficiently
•  Stochastically-controlled stochastic gradient
Variational, Hamiltonian and Symplectic
Perspectives on Acceleration
with Andre Wibisono, Ashia Wilson and Michael Betancourt
Interplay between Differentiation and
Integration
•  The 300-yr-old fields: Physics, Statistics
–  cf. Lagrange/Hamilton, Laplace expansions, saddlepoint
expansions
•  The numerical disciplines
–  e.g., finite elements, Monte Carlo
•  Optimization?
Accelerated gradient descent
Setting: unconstrained convex optimization
    min_{x∈R^d} f(x)
•  Classical gradient descent:
       xk+1 = xk − β∇f(xk)
   obtains a convergence rate of O(1/k)
•  Accelerated gradient descent:
       yk+1 = xk − β∇f(xk)
       xk+1 = (1 − λk) yk+1 + λk yk
   obtains the (optimal) convergence rate of O(1/k²)
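As a concrete reference point (an addition, not from the slides), here is a minimal NumPy sketch of both updates; the momentum schedule λk = −k/(k+3) is an assumed choice, the standard Nesterov momentum rewritten for the convex-combination form above:

import numpy as np

def gd(grad_f, x0, beta, iters):
    # Classical gradient descent: x_{k+1} = x_k - beta * grad_f(x_k); O(1/k).
    x = x0.copy()
    for _ in range(iters):
        x = x - beta * grad_f(x)
    return x

def nesterov_agd(grad_f, x0, beta, iters):
    # Accelerated gradient descent in the slide's form:
    #   y_{k+1} = x_k - beta * grad_f(x_k)
    #   x_{k+1} = (1 - lam_k) y_{k+1} + lam_k y_k
    # With lam_k = -k/(k+3) this is x_{k+1} = y_{k+1} + (k/(k+3))(y_{k+1} - y_k),
    # the usual Nesterov momentum; O(1/k^2).
    x, y_prev = x0.copy(), x0.copy()
    for k in range(iters):
        y = x - beta * grad_f(x)
        lam = -k / (k + 3.0)    # assumed schedule; the slide leaves lam_k unspecified
        x = (1 - lam) * y + lam * y_prev
        y_prev = y
    return x

# Ill-conditioned quadratic f(x) = 0.5 x^T diag(1, 10, 100) x, with beta = 1/L
A = np.array([1.0, 10.0, 100.0])
grad = lambda x: A * x
print(gd(grad, np.ones(3), beta=0.01, iters=500))
print(nesterov_agd(grad, np.ones(3), beta=0.01, iters=500))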
The acceleration phenomenon
Two classes of algorithms:
•  Gradient methods
   • Gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak '06), etc.
   • Greedy descent methods, relatively well understood
•  Accelerated methods
   • Nesterov's accelerated gradient descent, accelerated mirror descent, accelerated cubic-regularized Newton's method (Nesterov '08), etc.
   • Important for both theory (optimal rate for first-order methods) and practice (many extensions: FISTA, stochastic setting, etc.)
   • Not descent methods, faster than gradient methods, still mysterious
Accelerated methods: Continuous-time perspective
•  Gradient descent is a discretization of the gradient flow
       Ẋt = −∇f(Xt)
   (and mirror descent is a discretization of the natural gradient flow)
•  Su, Boyd, Candes '14: the continuous-time limit of accelerated gradient descent is a second-order ODE,
       Ẍt + (3/t) Ẋt + ∇f(Xt) = 0
•  These ODEs are obtained by taking continuous-time limits. Is there a deeper generative mechanism?
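For illustration (an addition, not from the talk), the ODE can be simulated directly; a minimal forward-Euler sketch of the first-order system in (x, v), with an assumed test function and step size:

import numpy as np

def nesterov_ode(grad_f, x0, t_max, dt=1e-3, t0=1e-3):
    # Integrate X'' + (3/t) X' + grad_f(X) = 0 as a system in (x, v).
    # Naive Euler: needs a small dt; start at t0 > 0 to avoid the 3/t singularity.
    x, v = x0.copy(), np.zeros_like(x0)
    t = t0
    while t < t_max:
        a = -(3.0 / t) * v - grad_f(x)   # acceleration prescribed by the ODE
        x = x + dt * v
        v = v + dt * a
        t += dt
    return x

# f(x) = 0.5 ||x||^2, so grad_f(x) = x; the trajectory decays toward the minimizer 0
print(nesterov_ode(lambda x: x, np.array([1.0, -2.0]), t_max=10.0))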
Our work: A general variational approach to acceleration
A systematic discretization methodology
Bregman Lagrangian
Define the Bregman Lagrangian:
    L(x, ẋ, t) = e^{γt+αt} ( Dh(x + e^{−αt} ẋ, x) − e^{βt} f(x) )
•  Function of position x, velocity ẋ, and time t
•  Dh(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence
•  h is the convex distance-generating function
•  f is the convex objective function
•  αt, βt, γt ∈ R are arbitrary smooth functions
•  In the Euclidean setting (h(x) = ½‖x‖²), it simplifies to the damped Lagrangian
       L(x, ẋ, t) = e^{γt−αt} ( ½‖ẋ‖² − e^{2αt+βt} f(x) )
Ideal scaling conditions:
    β̇t ≤ e^{αt}
    γ̇t = e^{αt}
[Figure: the Bregman divergence Dh(y, x) is the gap at y between h and its linearization around x]
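A small sketch (an addition for concreteness) of the Bregman divergence, checking the two classical special cases: squared Euclidean distance, and the KL divergence for negative entropy on the simplex:

import numpy as np

def bregman_divergence(h, grad_h, y, x):
    # D_h(y, x) = h(y) - h(x) - <grad_h(x), y - x>:
    # the gap at y between h and its linearization around x; >= 0 for convex h.
    return h(y) - h(x) - grad_h(x) @ (y - x)

# Euclidean h(x) = 0.5||x||^2 gives D_h(y, x) = 0.5||y - x||^2
h = lambda x: 0.5 * x @ x
grad_h = lambda x: x
y, x = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman_divergence(h, grad_h, y, x))      # 1.0
print(0.5 * np.sum((y - x) ** 2))               # 1.0

# Negative entropy h(p) = sum p_i log p_i gives KL(p || q) on the simplex
h_ent = lambda p: np.sum(p * np.log(p))
grad_h_ent = lambda p: np.log(p) + 1.0
p, q = np.array([0.3, 0.7]), np.array([0.5, 0.5])
print(bregman_divergence(h_ent, grad_h_ent, p, q))
print(np.sum(p * np.log(p / q)))                # same value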
Bregman Lagrangian
    L(x, ẋ, t) = e^{γt+αt} ( Dh(x + e^{−αt} ẋ, x) − e^{βt} f(x) )
Variational problem over curves:
    min_X ∫ L(Xt, Ẋt, t) dt
The optimal curve is characterized by the Euler-Lagrange equation:
    d/dt { ∂L/∂ẋ (Xt, Ẋt, t) } = ∂L/∂x (Xt, Ẋt, t)
E-L equation for the Bregman Lagrangian under ideal scaling:
    Ẍt + (e^{αt} − α̇t) Ẋt + e^{2αt+βt} [∇²h(Xt + e^{−αt} Ẋt)]^{−1} ∇f(Xt) = 0
General convergence rate
Theorem. Under ideal scaling, the E-L equation has convergence rate
    f(Xt) − f(x*) ≤ O(e^{−βt})
Proof. Exhibit a Lyapunov function for the dynamics:
    Et = Dh(x*, Xt + e^{−αt} Ẋt) + e^{βt} (f(Xt) − f(x*))
    Ėt = −e^{αt+βt} Df(x*, Xt) + (β̇t − e^{αt}) e^{βt} (f(Xt) − f(x*)) ≤ 0
Note: this only requires convexity and differentiability of f and h
Polynomial convergence rate
For p > 0, choose parameters:
    αt = log p − log t
    βt = p log t + log C
    γt = p log t
The E-L equation then has an O(e^{−βt}) = O(1/t^p) convergence rate:
    Ẍt + ((p+1)/t) Ẋt + C p² t^{p−2} [∇²h(Xt + (t/p) Ẋt)]^{−1} ∇f(Xt) = 0
For p = 2:
•  Recover the result of Krichene et al. with O(1/t²) convergence rate
•  In the Euclidean case, recover the ODE of Su et al.:
       Ẍt + (3/t) Ẋt + ∇f(Xt) = 0
Time dilation property (reparameterizing time)
(p = 2: accelerated gradient descent)
    O(1/t²) :  Ẍt + (3/t) Ẋt + 4C [∇²h(Xt + (t/2) Ẋt)]^{−1} ∇f(Xt) = 0
        ↓  speed up time: Yt = X_{t^{3/2}}
    O(1/t³) :  Ÿt + (4/t) Ẏt + 9Ct [∇²h(Yt + (t/3) Ẏt)]^{−1} ∇f(Yt) = 0
(p = 3: accelerated cubic-regularized Newton's method)
•  All accelerated methods are traveling the same curve in space-time at different speeds
•  Gradient methods don't have this property
   • From gradient flow to rescaled gradient flow: replace ½‖·‖² by (1/p)‖·‖^p
Time dilation for general Bregman Lagrangian
    O(e^{−βt}) :  E-L for Lagrangian L_{α,β,γ}
        ↓  speed up time: Yt = X_{τ(t)}
    O(e^{−β_{τ(t)}}) :  E-L for Lagrangian L_{α̃,β̃,γ̃}
where
    α̃t = α_{τ(t)} + log τ̇(t)
    β̃t = β_{τ(t)}
    γ̃t = γ_{τ(t)}
Question: How to discretize the E-L equation while preserving the convergence rate?
Discretizing the dynamics (naive approach)
Write the E-L equation as a system of first-order equations:
    Zt = Xt + (t/p) Ẋt
    d/dt ∇h(Zt) = −Cp t^{p−1} ∇f(Xt)
Euler discretization with time step δ > 0 (i.e., set xk = Xt, xk+1 = Xt+δ):
    xk+1 = (p/(k+p)) zk + (k/(k+p)) xk
    zk = arg min_z { Cp k^{(p−1)} ⟨∇f(xk), z⟩ + (1/ε) Dh(z, zk−1) }
with step size ε = δ^p, where k^{(p−1)} = k(k+1)···(k+p−2) is the rising factorial
Naive discretization doesn't work
    xk+1 = (p/(k+p)) zk + (k/(k+p)) xk
    zk = arg min_z { Cp k^{(p−1)} ⟨∇f(xk), z⟩ + (1/ε) Dh(z, zk−1) }
Cannot obtain a convergence guarantee, and empirically unstable
[Figure: oscillating iterates and non-decreasing f(xk) under the naive discretization]
Modified discretization
Introduce an auxiliary sequence yk:
    xk+1 = (p/(k+p)) zk + (k/(k+p)) yk
    zk = arg min_z { Cp k^{(p−1)} ⟨∇f(yk), z⟩ + (1/ε) Dh(z, zk−1) }
Sufficient condition: ⟨∇f(yk), xk − yk⟩ ≥ M ε^{1/(p−1)} ‖∇f(yk)‖*^{p/(p−1)}
Assume h is uniformly convex: Dh(y, x) ≥ (1/p) ‖y − x‖^p
Theorem.  f(yk) − f(x*) ≤ O(1/(ε k^p))
Note: matching convergence rates, 1/(ε k^p) = 1/(δk)^p = 1/t^p
Proof uses a generalization of Nesterov's estimate-sequence technique
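A sketch (an addition, not the slides' general construction) of the p = 2, Euclidean instance: with h = ½‖·‖² the z-argmin has the closed form zk = zk−1 − εCpk ∇f(yk), and taking yk to be a plain gradient step from xk is one simple way to satisfy the sufficient condition when ∇f is (1/ε)-Lipschitz:

import numpy as np

def accelerated_p2(grad_f, x0, C=0.25, eps=0.1, iters=200):
    # Modified discretization, p = 2, Euclidean h:
    #   y_k = x_k - eps * grad_f(x_k)                 (auxiliary sequence; assumed choice)
    #   z_k = z_{k-1} - eps * C * p * k * grad_f(y_k) (closed-form mirror step, k^{(1)} = k)
    #   x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k       (coupling)
    p = 2
    x, z, y = x0.copy(), x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x - eps * grad_f(x)
        z = z - eps * C * p * k * grad_f(y)
        x = (p / (k + p)) * z + (k / (k + p)) * y
    return y

A = np.array([1.0, 10.0])            # f(x) = 0.5 * sum(A * x^2), so L = 10
grad = lambda x: A * x
print(accelerated_p2(grad, np.array([1.0, 1.0]), eps=1 / 10.0))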
Accelerated higher-order gradient method
    xk+1 = (p/(k+p)) zk + (k/(k+p)) yk
    yk = arg min_y { f_{p−1}(y; xk) + (2/(εp)) ‖y − xk‖^p }      ← O(1/(ε k^{p−1}))
    zk = arg min_z { Cp k^{(p−1)} ⟨∇f(yk), z⟩ + (1/ε) Dh(z, zk−1) }
(here f_{p−1}(y; x) is the (p−1)-st-order Taylor approximation of f around x)
If ∇^{p−1} f is (1/ε)-Lipschitz and h is uniformly convex of order p, then:
    f(yk) − f(x*) ≤ O(1/(ε k^p))      ← accelerated rate
p = 2: accelerated gradient/mirror descent
p = 3: accelerated cubic-regularized Newton's method (Nesterov '08)
p ≥ 2: accelerated higher-order method
Recap: Gradient vs. accelerated methods
How to design dynamics for minimizing a convex function f?

Rescaled gradient flow:
    Ẋt = −∇f(Xt) / ‖∇f(Xt)‖*^{(p−2)/(p−1)}          O(1/t^{p−1})
Higher-order gradient method: O(1/(ε k^{p−1})) when ∇^{p−1} f is (1/ε)-Lipschitz;
    matching rate with ε = δ^{p−1} ⇔ δ = ε^{1/(p−1)}

Polynomial Euler-Lagrange equation:
    Ẍt + ((p+1)/t) Ẋt + t^{p−2} [∇²h(Xt + (t/p) Ẋt)]^{−1} ∇f(Xt) = 0          O(1/t^p)
Accelerated higher-order method: O(1/(ε k^p)) when ∇^{p−1} f is (1/ε)-Lipschitz;
    matching rate with ε = δ^p ⇔ δ = ε^{1/p}
Towards A Symplectic Perspective
•  If initialized close enough, diminishing gradient flow will relax to an optimum quickly
•  We can construct physical systems that will rapidly evolve into the neighborhood of the optimum, but the inertia can slow relaxation once we get there
•  Can a mixture of these flows yield rapid convergence to the optimum in both regimes?
•  Speed also depends on the discretization
•  Discretization of the Lagrangian dynamics, however, is fragile and requires small step sizes
•  We can build more robust solutions by taking a Legendre transform and considering a Hamiltonian formalism
•  The Hamiltonian perspective admits symplectic integrators, which are accurate and stable even for large step sizes
•  Exploiting this stability yields algorithms with state-of-the-art performance, and perhaps even more
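To make "symplectic integrator" concrete (an illustrative addition), a minimal leapfrog (Störmer-Verlet) sketch for the separable Hamiltonian H(x, p) = ½‖p‖² + f(x); the dissipative, time-dependent Hamiltonians behind the accelerated dynamics are more elaborate than this:

import numpy as np

def leapfrog(grad_f, x0, p0, step, n_steps):
    # Leapfrog integrator for H(x, p) = 0.5||p||^2 + f(x).
    # Symplectic: preserves phase-space volume and stays stable at far
    # larger step sizes than naive Euler integration.
    x, p = x0.copy(), p0.copy()
    for _ in range(n_steps):
        p = p - 0.5 * step * grad_f(x)   # half kick
        x = x + step * p                 # drift
        p = p - 0.5 * step * grad_f(x)   # half kick
    return x, p

# Harmonic oscillator f(x) = 0.5 x^2: energy stays near 0.5 even with step 0.5
x, p = leapfrog(lambda x: x, np.array([1.0]), np.array([0.0]), step=0.5, n_steps=1000)
print(0.5 * p @ p + 0.5 * x @ x)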
Part II
Avoiding Saddle Points, Efficiently
with Chi Jin, Rong Ge, Praneeth Netrapalli and Sham Kakade
Definitions and Algorithm
A function f(·) is ρ-Hessian Lipschitz if
    ∀ x1, x2:  ‖∇²f(x1) − ∇²f(x2)‖ ≤ ρ ‖x1 − x2‖
A point x is an ε-second-order stationary point (ε-SOSP) if
    ‖∇f(x)‖ ≤ ε   and   λmin(∇²f(x)) ≥ −√(ρε)
Perturbed Gradient Descent (PGD)
1.  for t = 0, 1, . . . do
2.      if perturbation condition holds then
3.          xt ← xt + ξt,   with ξt sampled uniformly from B0(r)
4.      xt+1 ← xt − η∇f(xt)
Only adds a perturbation when ‖∇f(xt)‖ ≤ ε, and no more than once per T steps.
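A minimal sketch of PGD as pseudocoded above (the ball-sampling helper, test function, and parameter values are illustrative assumptions; the theory sets η = O(1/ℓ) and r, T from ℓ, ρ, ε):

import numpy as np

def perturbed_gradient_descent(grad_f, x0, eta, eps, r, T_wait, iters, seed=0):
    # Gradient descent, plus uniform ball noise whenever the gradient is small
    # (a candidate saddle) and no perturbation occurred in the last T_wait steps.
    rng = np.random.default_rng(seed)
    x = x0.copy()
    last_perturb = -T_wait
    for t in range(iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps and t - last_perturb >= T_wait:
            xi = rng.normal(size=x.shape)                                 # random direction
            xi *= r * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)  # uniform in B_0(r)
            x = x + xi
            last_perturb = t
            g = grad_f(x)
        x = x - eta * g
    return x

# f(x) = 0.5 x1^2 - 0.5 x2^2 + 0.25 x2^4: saddle at the origin, minima at x2 = ±1
grad = lambda x: np.array([x[0], -x[1] + x[1] ** 3])
print(perturbed_gradient_descent(grad, np.zeros(2), eta=0.1, eps=1e-3,
                                 r=0.1, T_wait=50, iters=2000))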
Main Result
PGD converges to an SOSP (this work): for an ℓ-smooth and ρ-Hessian Lipschitz function f, PGD with η = O(1/ℓ) and a proper choice of r, T w.h.p. finds an ε-SOSP in
    Õ( ℓ (f(x0) − f*) / ε² )
iterations.
*The dimension dependence in the iteration count is log⁴(d) (almost dimension-free).

                GD (Nesterov 1998)      PGD (this work)
Assumptions     ℓ-grad-Lip              ℓ-grad-Lip + ρ-Hessian-Lip
Guarantees      ε-FOSP                  ε-SOSP
Iterations      2ℓ(f(x0) − f*)/ε²       Õ(ℓ(f(x0) − f*)/ε²)
Part III
Stochastically-Controlled Stochastic Gradient
with Lihua Lei
Setup
Task: minimizing a finite-sum objective:
    min_{x∈R^d} f(x) = (1/n) Σ_{i∈[n]} fi(x)
Assumption: ∃ L < ∞, µ ≥ 0, s.t.
    (µ/2) ‖x − y‖² ≤ fi(x) − fi(y) − ⟨∇fi(y), x − y⟩ ≤ (L/2) ‖x − y‖²
µ = 0: non-strongly convex case; µ > 0: strongly convex case; κ ≜ L/µ.
Computation Complexity
Accessing (fi(x), ∇fi(x)) incurs one unit of cost;
Given ε > 0, let T(ε) be the minimum cost to achieve
    E[f(x_{T(ε)})] − f(x*) ≤ ε;
Worst-case analysis: bound T(ε) almost surely, e.g.,
    T(ε) = O((n + κ) log(1/ε))   (SVRG).
SVRG Algorithm
SVRG (within an epoch):
1:  I ← [n]
2:  g ← (1/|I|) Σ_{i∈I} ∇fi(x0)
3:  m ← n
4:  Generate N ∼ U([m])
5:  for k = 1, 2, · · · , N do
6:      Randomly pick i ∈ [n]
7:      ν ← ∇fi(x) − ∇fi(x0) + g
8:      x ← x − ην
9:  end for
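A runnable sketch of one SVRG epoch matching the pseudocode above (the least-squares test problem and stepsize are assumptions):

import numpy as np

def svrg_epoch(grad_fi, x0, n, eta, rng):
    # One epoch: full gradient at the anchor x0, epoch length N ~ Uniform{1..n},
    # then variance-reduced inner steps nu = grad_fi(x) - grad_fi(x0) + g.
    g = np.mean([grad_fi(i, x0) for i in range(n)], axis=0)
    N = rng.integers(1, n + 1)
    x = x0.copy()
    for _ in range(N):
        i = rng.integers(n)                      # uniform i in [n]
        nu = grad_fi(i, x) - grad_fi(i, x0) + g  # control variate keeps E[nu] = grad f(x)
        x = x - eta * nu
    return x

# Least squares: f_i(x) = 0.5 (a_i . x - b_i)^2
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]
x = np.zeros(5)
for epoch in range(30):
    x = svrg_epoch(grad_fi, x, n=50, eta=0.05, rng=rng)
print(x)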
Analysis
                    General Convex      Strongly Convex
SGD                 1/ε²                (κ/ε) log(1/ε)
Nesterov's AGD      n/√ε                n√κ log(1/ε)
SVRG                -                   (n + κ) log(1/ε)
Katyusha            n/√ε                (n + √(nκ)) log(1/ε)

All of the above results are from worst-case analysis;
SGD is the only method with complexity free of n; however, its stepsize η depends on the unknown ‖x0 − x*‖² and the total number of epochs T.
Average-Case Analysis
Instead of bounding T(ε), one can bound E T(ε);
there is evidence from other areas that average-case analysis can give much better complexity bounds than worst-case analysis.
Goal: design an efficient algorithm such that
    E T(ε) is independent of n;
    E T(ε) is smaller than the worst-case bounds.
Average-Case Analysis
An algorithm is tuning-friendly if:
    the stepsize η is the only parameter to tune;
    η is a constant which depends only on L and µ.

            General Convex      Strongly Convex               Tuning-friendly
SGD         1/ε²                (κ/ε) log(1/ε)                No
SCSG        (1/ε²) ∧ (n/ε)      ((1/ε) ∧ n + κ) log(1/ε)      Yes
SCSG+       …                   …                             Yes
SCSG+       …                   …                             No
SCSG/SCSG+: Algorithm
SVRG (within an epoch):
1:  I ← [n]
2:  g ← (1/|I|) Σ_{i∈I} ∇fi(x0)
3:  m ← n
4:  Generate N ∼ U([m])
5:  for k = 1, 2, · · · , N do
6:      Randomly pick i ∈ [n]
7:      ν ← ∇fi(x) − ∇fi(x0) + g
8:      x ← x − ην
9:  end for

SCSG(+) (within an epoch):
1:  Randomly pick I with size B
2:  g ← (1/|I|) Σ_{i∈I} ∇fi(x0)
3:  γ ← 1 − 1/B
4:  Generate N ∼ Geo(γ)
5:  for k = 1, 2, · · · , N do
6:      Randomly pick i ∈ I
7:      ν ← ∇fi(x) − ∇fi(x0) + g
8:      x ← x − ην
9:  end for
SCSG/SCSG+: Algorithm
In epoch j,
    SCSG fixes Bj ≡ B(ε);
    explicit forms of B(ε) are given in both the non-strongly convex and strongly convex cases;
    SCSG+ uses a geometrically increasing sequence Bj = ⌈B0 bʲ ∧ n⌉
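A runnable sketch of one SCSG epoch, with the SCSG+ geometric batch schedule wrapped around it (NumPy's Geometric(1/B), which has mean B, stands in for Geo(γ) with γ = 1 − 1/B; the test problem and stepsize are assumptions):

import numpy as np

def scsg_epoch(grad_fi, x0, n, B, eta, rng):
    # One SCSG epoch: batch gradient on a random subset I of size B,
    # epoch length N ~ Geometric (E[N] is about B), inner samples drawn from I.
    I = rng.choice(n, size=B, replace=False)
    g = np.mean([grad_fi(i, x0) for i in I], axis=0)   # batch gradient at anchor
    N = rng.geometric(1.0 / B)
    x = x0.copy()
    for _ in range(N):
        i = rng.choice(I)                              # sample from I, not from [n]
        nu = grad_fi(i, x) - grad_fi(i, x0) + g
        x = x - eta * nu
    return x

rng = np.random.default_rng(0)
A, b = rng.normal(size=(500, 5)), rng.normal(size=500)
grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]        # f_i(x) = 0.5 (a_i . x - b_i)^2
x, B = np.zeros(5), 8
for j in range(40):
    x = scsg_epoch(grad_fi, x, n=500, B=min(B, 500), eta=0.05, rng=rng)
    B *= 2                                             # SCSG+: B_j = B_0 * b^j ∧ n
print(x)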
Conclusion
SCSG/SCSG+ save computation on average by avoiding full-gradient calculations;
SCSG/SCSG+ also save communication costs in distributed systems by avoiding sampling gradients from the whole dataset;
SCSG/SCSG+ can achieve an approximate optimum with potentially less than a single pass through the data;
The average computation cost of SCSG+ beats the oracle lower bounds from worst-case analysis.