Gradient Descent Converges to Minimizers
Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht
Presented by Qiuwei Li
EECS, Colorado School of Mines
SINE: SIgnals and NEtworks
Outline
▶ Gradient Descent Introduction
▶ Almost Sure Non-Convergence to a Saddle
▶ Proof by the Stable-Center Manifold Theorem
Gradient Descent
▶ Objective: minimize a function f : Rⁿ → R
▶ Input: initial point x_0, stepsize α
▶ Update: x_{k+1} = x_k − α∇f(x_k), as sketched below
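A minimal sketch of this update in Python (the quadratic objective, stepsize, and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters=200):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Illustrative objective: f(x) = 2(x1^2 + x2^2), so grad f(x) = 4x.
x_final = gradient_descent(lambda x: 4.0 * x, x0=[1.0, -2.0], alpha=0.1)
print(x_final)  # approaches the unique minimizer at the origin
```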
When Does Gradient Descent Converge?
▶ Gradient descent can only converge to critical points, i.e., points with ∇f(x) = 0
▶ Local minima, e.g., f(x) = 2(x₁² + x₂²)
▶ Saddle points, e.g., f(x) = −x₁² + x₂² (see the Hessian check below)
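The two kinds of critical point can be told apart by the Hessian; a small check (my own illustration, using the two functions above):

```python
import numpy as np

# Both example functions have a critical point at the origin; the sign of
# the smallest Hessian eigenvalue distinguishes a local min from a saddle.
hessians = {
    "f(x) = 2(x1^2 + x2^2)": np.diag([4.0, 4.0]),
    "f(x) = -x1^2 + x2^2":   np.diag([-2.0, 2.0]),
}
for name, H in hessians.items():
    lam_min = np.linalg.eigvalsh(H).min()
    kind = "local min" if lam_min > 0 else "saddle"
    print(f"{name}: smallest Hessian eigenvalue {lam_min} -> {kind}")
```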
Hard to Converge to a Saddle
▶ The set of initial points that converge to a saddle typically has measure zero.
▶ Example: only the red region converges to the saddle (see the simulation below).
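A quick Monte Carlo illustration of the measure-zero claim (my own sketch, not from the slides), using f(x) = −x₁² + x₂²: the saddle at the origin attracts an iterate only if x₁ starts at exactly 0.

```python
import numpy as np

# Gradient descent on f(x) = -x1^2 + x2^2, grad f(x) = (-2*x1, 2*x2).
# Each step multiplies x1 by (1 + 2*alpha), so any nonzero x1 is pushed
# away from the saddle at the origin.
rng = np.random.default_rng(0)
alpha, hits = 0.1, 0
for _ in range(10_000):
    x = rng.standard_normal(2)
    for _ in range(100):
        x = x - alpha * np.array([-2.0 * x[0], 2.0 * x[1]])
    hits += np.linalg.norm(x) < 1e-6
print(hits)  # 0: no random initialization converged to the saddle
```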
Terminologies and Assumptions
▶ Strict saddle: λ_min(∇²f(x*)) < 0
▶ The gradient map: g(x) = x − α∇f(x)
▶ The stable set: W^s(x*) = {x : lim_k g^k(x) = x*}
▶ Gradient Lipschitz assumption: ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ (see the quadratic sketch below)
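For a quadratic f(x) = ½xᵀHx these objects are explicit: ∇f(x) = Hx, ∇²f(x) = H, and L = max_i |λ_i(H)|. A small sketch under those assumptions (the matrix and stepsize are my own illustrative choices):

```python
import numpy as np

# f(x) = 0.5 * x^T H x: grad f(x) = H @ x, Hessian = H, L = max |eig(H)|.
H = np.diag([2.0, -1.0])                   # x* = 0 is a strict saddle
L = np.max(np.abs(np.linalg.eigvalsh(H)))  # L = 2
alpha = 0.9 / L                            # satisfies 0 < alpha < 1/L

def g(x):
    """Gradient map g(x) = x - alpha * grad f(x)."""
    return x - alpha * (H @ x)

print(g(np.array([1.0, 1.0])))          # one gradient-descent step
print(np.linalg.eigvalsh(H).min() < 0)  # True: strict saddle condition holds
```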
Main Result
▶ Assume f ∈ C²
▶ Let x* be a strict saddle
▶ Assume 0 < α < 1/L
Then
Vol(W^s(x*)) = 0,
i.e., gradient descent from a random initialization almost surely does not converge to x*.
Stable-Center Manifold Theorem
▶ Let x* be a fixed point of a map g
▶ g is a diffeomorphism
▶ E^cs = span{v_i : |λ_i(Dg(x*))| ≤ 1} (see the linear example below)
Then there exist a local disk W^cs_loc (tangent to E^cs at x*) and a neighbourhood B of x* such that
g(W^cs_loc) ∩ B ⊂ W^cs_loc,
g^k(x) ∈ B, ∀k ≥ 0 ⇒ x ∈ W^cs_loc.
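For a linear map this is concrete; a sketch (my own illustration): with g(x) = (I − αH)x, Dg(x*) = I − αH, and E^cs is spanned by the eigenvectors whose Dg-eigenvalues have magnitude at most 1.

```python
import numpy as np

# Center-stable subspace of the linear gradient map g(x) = (I - alpha*H) x.
H = np.diag([2.0, 1.0, -1.0])   # one negative-curvature direction
alpha = 0.4                     # 0 < alpha < 1/L with L = 2
Dg = np.eye(3) - alpha * H      # eigenvalues 1 - alpha * lambda_i(H)
vals, vecs = np.linalg.eigh(Dg)
Ecs = vecs[:, np.abs(vals) <= 1.0]
print(Ecs.shape[1])  # 2: dim(E^cs) < n because of the negative eigenvalue
```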
Tangent Spaces Have the Same Dimension as the Manifold
Proof by the Stable-Center Manifold Theorem
▶ Show the gradient map g is a diffeomorphism. Then, for any x ∈ W^s(x*):

x ∈ W^s(x*) ⇒ g^t(x) ∈ B, ∀t ≥ T_x          (1)
            ⇒ g^k(g^t(x)) ∈ B, ∀k ≥ 0        (2)
            ⇒ g^t(x) ∈ W^cs_loc              (3)
            ⇒ x ∈ g^{−t}(W^cs_loc)           (4)
            ⇒ x ∈ ∪_{t≥0} g^{−t}(W^cs_loc)   (5)

▶ x* is a strict saddle ⇒ λ_min(∇²f(x*)) < 0
▶ Dg(x*) = I − α∇²f(x*) then has an eigenvalue 1 − αλ_min > 1 ⇒ dim(E^cs) < n ⇒ Vol(W^cs_loc) = 0
▶ A diffeomorphism maps zero volume to zero volume ⇒ Vol(g^{−t}(W^cs_loc)) = 0; a countable union of measure-zero sets has measure zero, so Vol(W^s(x*)) = 0
Show g Is a Diffeomorphism if 0 < α < 1/L
▶ Diffeomorphism = bijection + g and g⁻¹ continuously differentiable
▶ Show bijection by constructing the inverse map of y = g(x) (verified numerically below):
x = Prox_{α(−f)}(y) = argmin_x ½‖x − y‖² + α(−f)(x)
▶ ∇²(½‖x − y‖² − αf(x)) = I − α∇²f(x) ≻ 0
▶ Strong convexity ⇒ a unique minimizer x_y with x_y − y − α∇f(x_y) = 0 for every y, i.e., g(x_y) = y
▶ f ∈ C² ⇒ g continuously differentiable
▶ Inverse function theorem: Dg(x) = I − α∇²f(x) ≻ 0 ⇒ g⁻¹ continuously differentiable
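A numeric sanity check of the inversion (my own sketch, using a quadratic f(x) = ½xᵀHx so the prox step has a closed form):

```python
import numpy as np

# For f(x) = 0.5 x^T H x, g(x) = (I - alpha*H) x, and the inverse map
# solves x_y - y - alpha * H @ x_y = 0, i.e., x_y = (I - alpha*H)^{-1} y.
rng = np.random.default_rng(1)
H = np.diag([2.0, -1.0])
alpha = 0.4                    # 0 < alpha < 1/L with L = 2
A = np.eye(2) - alpha * H      # Dg; positive definite since alpha < 1/L

x = rng.standard_normal(2)
y = A @ x                      # y = g(x)
x_rec = np.linalg.solve(A, y)  # recover x: the prox map in closed form
print(np.allclose(x, x_rec))   # True: g is a bijection here
```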
An Example
▶ f(x) = ½xᵀHx ∈ C² with H = diag(λ_1, …, λ_n)
▶ λ_1, …, λ_k > 0 and λ_{k+1}, …, λ_n < 0
▶ x* = 0 is the unique saddle point, and it is a strict saddle
▶ g(x) = (I − αH)x is a diffeomorphism
▶ E^s = span{e_1, …, e_k} ⇒ zero volume
Then 0 < α < 1/|λ|_max ⇒ Vol(W^s(x*)) = 0
Double check (run numerically below):
▶ After m steps: g^m(x) = (I − αH)^m x = Σ_{i=1}^{k} (1 − αλ_i)^m x_i e_i + Σ_{j=k+1}^{n} (1 − αλ_j)^m x_j e_j
▶ For j > k, |1 − αλ_j| > 1, so the iterates converge to x* = 0 only if x_{k+1} = … = x_n = 0
▶ W^s(x*) = {x : x_{k+1} = … = x_n = 0} ⇒ zero volume
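The same double check run numerically (my own sketch; the eigenvalues are illustrative choices):

```python
import numpy as np

# H = diag(1, 2, -1): k = 2 positive eigenvalues, one negative, saddle at 0.
# Convergence to the saddle requires the coordinate along the negative
# eigenvalue to start at exactly zero.
H = np.diag([1.0, 2.0, -1.0])
alpha = 0.4                    # 0 < alpha < 1/|lambda|_max = 0.5
A = np.eye(3) - alpha * H

def iterate(x, m=100):
    for _ in range(m):
        x = A @ x
    return x

print(np.linalg.norm(iterate(np.array([1.0, -1.0, 0.0]))))   # ~0: in W^s(0)
print(np.linalg.norm(iterate(np.array([1.0, -1.0, 1e-8]))))  # large: escapes
```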
An Example: f(x) = −x₁² + x₂²