Taming nonconvexity via geometry

GEOMETRIC OPTIMIZATION
SUVRIT SRA
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
NIPS 2016, Barcelona
Nonconvex optimization workshop
Includes work with:
Reshad Hosseini
Pourya H. Zadeh
Hongyi Zhang
‣ Vector spaces
‣ Convex sets (probability simplex, semidefinite cone, polyhedra)
‣ Manifolds (hypersphere, orthogonal matrices, complicated surfaces)
‣ Metric spaces (tree space, Wasserstein spaces, CAT(0), space-of-spaces)

Geometric Optimization — arising across: Machine Learning, Graphics, Robotics, Vision, BCI, NLP, Statistics
Suvrit Sra (MIT)
Geometric Optimization
Example: Riemannian optimization

Vector space optimization with structured constraints maps to optimization over manifolds:

‣ Orthogonality constraint → Stiefel manifold
‣ Fixed-rank constraint → Grassmann manifold
‣ Positive semi-definite constraint → PSD manifold
‣ ...

[Udriste, 1994; Absil et al., 2009]
Classes of functions in optimization
Convex
Smooth
Lipschitz
Strongly convex
Classes of functions in optimization

Geodesically convex
Smooth
Lipschitz
Strongly convex
What is geodesic convexity?

Convexity (vector space):
    f((1 − t)x + ty) ≤ (1 − t) f(x) + t f(y)

Geodesic convexity: along the geodesic γ joining x to y on a Riemannian manifold,
    f(γ(t)) ≤ (1 − t) f(x) + t f(y)

First-order characterization:
    f(y) ≥ f(x) + ⟨g_x, Exp_x^{-1}(y)⟩_x
Metric spaces & curvature: [Menger; Alexandrov; Busemann; Bridson, Haefliger; Gromov; Perelman]
Positive definite matrix manifold

Geodesic between X, Y ≻ 0 (the manifold analog of the segment (1 − t)X + tY):
    X #_t Y := X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2}

Examples of g-convex functions:
    f(X) = log det(X),  tr(X^α),  log tr(X),  ‖X^α‖.

Exercise: show that f(X #_t Y) ≤ (1 − t) f(X) + t f(Y)
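The geodesic formula and the exercise are easy to check numerically. A minimal NumPy sketch (the helper names `spd_power`, `geodesic`, and `rand_spd` are mine, not from the talk); for f(X) = log det X the inequality in fact holds with equality, while for f(X) = tr X it is generally strict:

```python
import numpy as np

def spd_power(M, t):
    """Fractional power M^t of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**t) @ V.T

def geodesic(X, Y, t):
    """X #_t Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2} on the PD manifold."""
    Xh, Xih = spd_power(X, 0.5), spd_power(X, -0.5)
    inner = Xih @ Y @ Xih
    inner = (inner + inner.T) / 2        # re-symmetrize against round-off
    return Xh @ spd_power(inner, t) @ Xh

def rand_spd(n, rng):
    """Random well-conditioned symmetric positive definite matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

rng = np.random.default_rng(0)
X, Y, t = rand_spd(5, rng), rand_spd(5, rng), 0.3
Z = geodesic(X, Y, t)                    # a point on the geodesic from X to Y
logdet = lambda M: np.linalg.slogdet(M)[1]
```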
Positive definite matrix manifold

Recognizing, constructing, and optimizing g-convex functions

Corollaries:
    X ↦ log det(B + Σ_i A_i* X A_i)
    X ↦ log per(B + Σ_i A_i* X A_i)
    δ_R²(X, Y),  δ_S²(X, Y)    (jointly g-convex)

Many more theorems and corollaries
[Sra, Hosseini (2013, 2015)]
– [Wiesel 2012]
– [Rapcsák 1984]
– [Udriste 1994]

One-D version known as: Geometric Programming
www.stanford.edu/~boyd/papers/gp_tutorial.html
[Boyd, Kim, Vandenberghe, Hassibi (2007). 61pp.]
X ≻ 0
Matrix square root
Broadly applicable
Key to ‘expm’, ‘logm’
Matrix square root

Nonconvex optimization through the Euclidean lens
[Jain, Jin, Kakade, Netrapalli; Jul 2015]

    min_{X ∈ R^{n×n}}  ‖M − X²‖_F²

Gradient descent:
    X_{t+1} = X_t − η (X_t² − M) X_t − η X_t (X_t² − M)

Simple algorithm; linear convergence; nontrivial analysis
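The iteration is a few lines of NumPy; a sketch under my own choices of step size, initialization, and test matrix (the paper's tuning is more careful):

```python
import numpy as np

def sqrt_gd(M, eta=0.1, iters=300):
    """Gradient descent on (1/2)||X^2 - M||_F^2 with update
    X <- X - eta*(X^2 - M)X - eta*X(X^2 - M)."""
    X = np.eye(M.shape[0])        # identity init; reasonable when M is near I
    for _ in range(iters):
        G = X @ X - M             # residual of the current square
        X = X - eta * (G @ X + X @ G)
    return X

rng = np.random.default_rng(1)
U = rng.standard_normal((50, 3)) / np.sqrt(50)
M = np.eye(50) + U @ U.T          # well-conditioned SPD test matrix
X = sqrt_gd(M)
rel_err = np.linalg.norm(X @ X - M) / np.linalg.norm(M)
```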
Matrix square root

Geodesic:
    X #_t Y := X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2}

Midpoint:
    A^{1/2} = A #_{1/2} I
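The midpoint identity follows by plugging Y = I into the geodesic formula, and it is a one-line numerical check (the helper names are mine):

```python
import numpy as np

def spd_power(M, t):
    """Fractional power M^t of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * w**t) @ V.T

def geodesic(X, Y, t):
    """X #_t Y on the positive definite manifold."""
    Xh, Xih = spd_power(X, 0.5), spd_power(X, -0.5)
    inner = Xih @ Y @ Xih
    return Xh @ spd_power((inner + inner.T) / 2, t) @ Xh

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)

midpoint = geodesic(A, np.eye(6), 0.5)   # A #_{1/2} I
root = spd_power(A, 0.5)                 # A^{1/2}
```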
Matrix square root

Nonconvex optimization through a non-Euclidean lens
[Sra; Jul 2015]

    min_{X ≻ 0}  δ_S²(X, A) + δ_S²(X, I)

where  δ_S²(X, Y) := log det((X + Y)/2) − ½ log det(XY)

Fixed-point iteration:
    X_{k+1} = [(X_k + A)^{-1} + (X_k + I)^{-1}]^{-1}

Simple method; linear convergence; 1/2 page analysis!
Global optimality thanks to geodesic convexity
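The fixed-point iteration translates directly into NumPy; a sketch with my own iteration count and test matrix:

```python
import numpy as np

def sqrt_fixed_point(A, iters=200):
    """Iterate X <- [(X + A)^{-1} + (X + I)^{-1}]^{-1}, whose fixed point is A^{1/2}."""
    I = np.eye(A.shape[0])
    X = I.copy()
    inv = np.linalg.inv
    for _ in range(iters):
        X = inv(inv(X + A) + inv(X + I))
    return X

rng = np.random.default_rng(3)
B = rng.standard_normal((20, 20))
A = B @ B.T + np.eye(20)          # SPD test matrix

X = sqrt_fixed_point(A)
rel_err = np.linalg.norm(X @ X - A) / np.linalg.norm(A)
```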
Matrix square root

[Figure: relative error (Frobenius norm) vs. running time (seconds) on a 50×50 matrix I + UU^T with κ ≈ 64; left panel compares GD and LSGD, right panel compares YAMSR and LSGD.]
Metric learning
What does a metric learning method do?
Metric Learning
[Habibzadeh, Hosseini, Sra, ICML 2016]
Euclidean metric learning

Pairwise constraints:
    S := {(x_i, x_j) | x_i and x_j are in the same class}
    D := {(x_i, x_j) | x_i and x_j are in different classes}

Goal: given pairwise constraints, learn the Mahalanobis distance
    d_A(x, y) := (x − y)^T A (x − y)
with a positive definite matrix A.
Metric learning methods

MMC [Xing, Jordan, Russell, Ng 2002]
    min_{A ⪰ 0}   Σ_{(x_i,x_j)∈S} d_A(x_i, x_j)
    such that     Σ_{(x_i,x_j)∈D} √(d_A(x_i, x_j)) ≥ 1

LMNN [Weinberger, Saul 2005]
    min_{A ⪰ 0}   Σ_{(x_i,x_j)∈S} [ (1 − µ) d_A(x_i, x_j) + µ Σ_l (1 − y_il) ξ_ijl ]
    such that     d_A(x_i, x_l) − d_A(x_i, x_j) ≥ 1 − ξ_ijl,   ξ_ijl ≥ 0

ITML [Davis, Kulis, Jain, Sra, Dhillon 2007]
    min_{A ⪰ 0}   D_ld(A, A_0)
    such that     d_A(x, y) ≤ u, (x, y) ∈ S;   d_A(x, y) ≥ l, (x, y) ∈ D
    where  D_ld(A, A_0) := tr(A A_0^{-1}) − log det(A A_0^{-1}) − d

... and tons of other methods!
A simple new way for metric learning

Euclidean idea:
    min_{A ⪰ 0}   Σ_{(x_i,x_j)∈S} d_A(x_i, x_j) − Σ_{(x_i,x_j)∈D} d_A(x_i, x_j)

New idea:
    min_{A ⪰ 0}   Σ_{(x_i,x_j)∈S} d_A(x_i, x_j) + Σ_{(x_i,x_j)∈D} d_{A^{-1}}(x_i, x_j)

Equivalently solve
    min_{A ≻ 0}   h(A) := tr(A S) + tr(A^{-1} D)      (cool!)
where
    S := Σ_{(x_i,x_j)∈S} (x_i − x_j)(x_i − x_j)^T,
    D := Σ_{(x_i,x_j)∈D} (x_i − x_j)(x_i − x_j)^T
A simple new way for metric learning

Closed form solution!
    ∇h(A) = 0   ⟺   S − A^{-1} D A^{-1} = 0   ⟺   A = S^{-1} #_{1/2} D

(recall the geodesic X #_t Y := X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2})

More generally,
    min_{A ≻ 0}   (1 − t) δ_R²(S^{-1}, A) + t δ_R²(D, A)
is solved by A = S^{-1} #_t D.
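The closed form can be verified numerically: ∇h(A) = 0 is equivalent to A S A = D, which the geodesic midpoint of S^{-1} and D satisfies. A sketch with synthetic similarity/dissimilarity pairs (all helper names and data are my own):

```python
import numpy as np

def spd_power(M, t):
    w, V = np.linalg.eigh(M)
    return (V * w**t) @ V.T

def geodesic(X, Y, t):
    """X #_t Y on the positive definite manifold."""
    Xh, Xih = spd_power(X, 0.5), spd_power(X, -0.5)
    inner = Xih @ Y @ Xih
    return Xh @ spd_power((inner + inner.T) / 2, t) @ Xh

def scatter(pairs):
    """Sum of (x_i - x_j)(x_i - x_j)^T over the given pairs."""
    return sum(np.outer(x - y, x - y) for x, y in pairs)

rng = np.random.default_rng(4)
d = 5
sim = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(50)]
dis = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(50)]
S, D = scatter(sim), scatter(dis)

A = geodesic(np.linalg.inv(S), D, 0.5)   # A = S^{-1} #_{1/2} D
```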
Experiments
[Habibzadeh, Hosseini, Sra ICML 2016]
Gaussian mixture models

    p_mix(x) := Σ_{k=1}^K π_k p_N(x; µ_k, Σ_k)

    max  Π_i p_mix(x_i)

where
    p_N(x; µ, Σ) ∝ (1/√(det Σ)) exp(−½ (x − µ)^T Σ^{-1} (x − µ))

Expectation maximization (EM): default choice

[Hosseini, Sra NIPS 2015]
Gaussian mixture models

– Nonconvex: difficult, possibly several local optima
– GMMs: recent surge of theoretical results
– In practice: EM still the default choice
  (often claimed that standard nonlinear programming algorithms are inferior for GMMs)

Difficulty: positive definiteness constraint on Σ_k

Two ways to handle the constraint:
– Folklore: unconstrained optimization via the Cholesky parameterization Σ = LL^T
– New: geometric optimization directly over the PD manifold S_+^d
Failure of “obvious” LL^T

sep. | EM (time / avg. LL) | CG-LL^T (time / avg. LL)
0.2  | 52s / 12.7          | 614s / 12.7
1    | 160s / 13.4         | 435s / 13.5
5    | 72s / 12.8          | 426s / 12.8

(d = 20, simulated data; separation: ‖µ_i − µ_j‖ ≥ sep · max_{ij}{tr Σ_i, tr Σ_j})
Failure of manifold optimization

K  | EM (time / avg. LL) | Riem-CG (time / avg. LL)
2  | 17s / 29.28         | 947s / 29.28
5  | 202s / 32.07        | 5262s / 32.07
10 | 2159s / 33.05       | 17712s / 33.03

(d = 35, n = 200,000, images dataset; Riemannian opt. toolbox: manopt.org)
What’s wrong?

Log-likelihood for one component:
    max_{µ, Σ ≻ 0}  L(µ, Σ) := Σ_{i=1}^n log p_N(x_i; µ, Σ)
                             = −(n/2) log det Σ − ½ Σ_{i=1}^n (x_i − µ)^T Σ^{-1} (x_i − µ) + const.

Euclidean convex problem
Not geodesically convex
Mismatched geometry?
Reformulate as g-convex

Augment the data and the parameters:
    y_i = [x_i; 1],      S = ( Σ + µµ^T   µ
                               µ^T        1 )

    max_{S ≻ 0}  L̂(S) := Σ_{i=1}^n log q_N(y_i; S)

Thm. The modified log-likelihood is g-convex. Local
max of modified mixture LL is local max of original.
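The algebra behind the reformulation is easy to verify: with y = [x; 1] and the block matrix S above, det(S) = det(Σ) and y^T S^{-1} y = (x − µ)^T Σ^{-1} (x − µ) + 1, so the augmented Gaussian reproduces the original likelihood up to a constant. A quick numerical check on random data:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)       # random SPD covariance
mu = rng.standard_normal(d)
x = rng.standard_normal(d)

# augmented parameter matrix S = [[Sigma + mu mu^T, mu], [mu^T, 1]]
S = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
              [mu[None, :], np.ones((1, 1))]])
y = np.append(x, 1.0)             # augmented data point

det_match = np.isclose(np.linalg.det(S), np.linalg.det(Sigma))
quad_S = y @ np.linalg.solve(S, y)
quad_Sigma = (x - mu) @ np.linalg.solve(Sigma, x - mu) + 1.0
```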
Success of geometric optimization

K  | EM             | Riem-CG        | L-RBFGS
2  | 17s / 29.28    | 18s / 29.28    | 14s / 29.28
5  | 202s / 32.07   | 140s / 32.07   | 117s / 32.07
10 | 2159s / 33.05  | 1048s / 33.06  | 658s / 33.06

(time / avg. log-likelihood; d = 35, n = 200,000, images dataset)

Riem-CG (manopt) savings: 947s → 18s; 5262s → 140s; 17712s → 1048s

github.com/utvisionlab/mixest
First-order algorithms

[Zhang, Sra, COLT 2016]
first-order g-convex optimization

    min_{x ∈ X ⊂ M}  f(x)

X: g-convex set;  f: g-convex function;  M: Riemannian manifold
Oracle access to exact or stochastic (sub)gradients

Update:
    x ← Exp_x(−η ∇f(x))      (analog of: x ← x − η ∇f(x))
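As a concrete instance of the update x ← Exp_x(−η∇f(x)), here is a sketch of Riemannian gradient descent on the unit sphere for f(x) = −x^T M x (leading-eigenvector computation); the step size, iteration count, and test matrix are my own choices:

```python
import numpy as np

def exp_sphere(x, v):
    """Exponential map on the unit sphere: great circle from x along tangent v."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def riemannian_gd(M, x0, eta=0.05, iters=1000):
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        egrad = -2.0 * M @ x                 # Euclidean gradient of f(x) = -x^T M x
        rgrad = egrad - (x @ egrad) * x      # project onto the tangent space at x
        x = exp_sphere(x, -eta * rgrad)      # x <- Exp_x(-eta * grad f(x))
    return x

M = np.diag([3.0, 2.0, 1.0])
x = riemannian_gd(M, np.ones(3))
rayleigh = x @ M @ x      # approaches the top eigenvalue, 3
```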
In particular, we study the global complexity
of first-order g-convex optimization

Methods: Gradient Descent, Stochastic Gradient Descent, Coordinate Descent, Accelerated Gradient Descent, Fast Incremental Gradient, ...

Question: bounds of the form E[f(x_a) − f(x*)] ≤ ?, transferring rates from convex optimization to g-convex optimization.
Convergence rates depend on lower bounds on the sectional curvature

With sectional curvature bounded below by κ_min and domain diameter D, the Euclidean rates acquire the curvature factor
    ζ_max := √(|κ_min|) D / tanh(√(|κ_min|) D)
(ζ_max → 1 in the flat case). Representative rates for (sub)gradient and stochastic (sub)gradient methods:

                                    convex          g-convex
Lipschitz                           O(√(1/t))       O(√(ζ_max / t))
strongly convex (µ)                 O(1/(µ t))      O(ζ_max / (µ t))
strongly convex & smooth (L_g)      ...             ... (constants involve ζ_max and L_g)
See paper for other interesting results [Zhang, Sra, COLT 2016]
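For intuition, the curvature factor is simple to compute; a sketch treating ζ = 1 as the flat-space limit (the function name is mine):

```python
import math

def zeta(kappa_min, diameter):
    """Curvature factor sqrt(|k|)*D / tanh(sqrt(|k|)*D) for a sectional-curvature
    lower bound k and domain diameter D; tends to 1 as |k| or D goes to 0."""
    c = math.sqrt(abs(kappa_min)) * diameter
    if c < 1e-12:
        return 1.0
    return c / math.tanh(c)

# flat space: no slowdown; more negative curvature (or larger diameter): bigger factor
vals = [zeta(0.0, 1.0), zeta(-1.0, 1.0), zeta(-4.0, 1.0)]
```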
Nonconvex optimization on manifolds

    min_{x ∈ M}  f(x) := (1/n) Σ_{i=1}^n f_i(x)

• M is a Riemannian manifold
• g-convex and g-nonconvex ‘f’ allowed!
• First global complexity results for stochastic methods on general Riemannian manifolds
• Can be faster than Riemannian SGD
• New insights into eigenvector computation

[Zhang, Reddi, Sra, NIPS 2016]
See also: [Kasai, Sato, Mishra, OPT2016]