
Sub-Sampled Newton Methods for Machine Learning
Jorge Nocedal
Northwestern University
Goldman Lecture, Sept 2016
Collaborators
Raghu Bollapragada, Northwestern University
Richard Byrd, University of Colorado
Optimization, Statistics, Machine Learning
Dramatic breakdown of barriers between optimization and statistics, partly stimulated by the rise of machine learning
Learning process and learning measures defined using continuous optimization models – convex and non-convex
Nonlinear optimization algorithms used to train statistical models – a great challenge with the advent of high-dimensional models and huge data sets
The stochastic gradient method currently plays a central role
Stochastic Gradient Method
Robbins-Monro (1951)
1.  Why has it risen to such prominence?
2.  What is the main mechanism that drives it?
3.  What can we say about its behavior in convex and nonconvex cases?
4.  What ideas have been proposed to improve upon SG?
"Optimization Methods for Machine Learning," Bottou, Curtis, Nocedal (2016)
Problem statement
Given training set {(x1, y1), …, (xn, yn)}
Given a loss function ℓ(z, y)          (hinge loss, logistic, ...)
Find a prediction function h(x; w)     (linear, DNN, ...)

min_w  (1/n) ∑_{i=1}^n ℓ(h(xi; w), yi)

Notation:  fi(w) = ℓ(h(xi; w), yi)

Rn(w) = (1/n) ∑_{i=1}^n fi(w)          (empirical risk)

Random variable ξ = (x, y)
F(w) = E[ f(w; ξ) ]                    (expected risk)
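As a concrete (hypothetical) instantiation of these definitions, here is a short Python sketch with a linear prediction function and logistic loss; the data, dimensions, and names are made up for illustration:

```python
# Hypothetical instantiation of h, ℓ, f_i, and R_n for binary logistic regression.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X_data = rng.standard_normal((n, d))          # features x_i
y_data = rng.choice([-1.0, 1.0], size=n)      # labels y_i

def h(x, w):                                  # prediction function h(x; w): linear model
    return x @ w

def loss(z, y):                               # loss ℓ(z, y): logistic loss
    return np.log(1.0 + np.exp(-y * z))

def f_i(w, i):                                # f_i(w) = ℓ(h(x_i; w), y_i)
    return loss(h(X_data[i], w), y_data[i])

def R_n(w):                                   # empirical risk: average of f_i over the training set
    return np.mean([f_i(w, i) for i in range(n)])

# The expected risk F(w) = E[f(w; ξ)] would instead average over the
# underlying distribution of ξ = (x, y), which is not available here.
```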
Stochastic Gradient Method
First present algorithms for empirical risk minimization
Rn(w) = (1/n) ∑_{i=1}^n fi(w)

wk+1 = wk − αk ∇fi(wk),    i ∈ {1, …, n} chosen at random
•  Very cheap, noisy iteration; gradient w.r.t. just 1 data point
•  Not a gradient descent method
•  Stochastic process dependent on the choice of i
•  Descent in expectation
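A minimal sketch of one epoch of this iteration, assuming least-squares components fi(w) = ½(aiᵀw − bi)² and a made-up diminishing step-size schedule:

```python
# One pass of the stochastic gradient method on R_n: each step uses the
# gradient of a single randomly chosen component f_i. Data and step sizes
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_fi(w, i):                        # ∇f_i(w) for f_i(w) = 0.5*(a_i^T w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

w = np.zeros(d)
for k in range(n):                        # one epoch = n cheap, noisy steps
    i = rng.integers(n)                   # i chosen uniformly at random
    alpha_k = 1.0 / (10 + k)              # diminishing steps (α_k → 0) to control the variance
    w = w - alpha_k * grad_fi(w, i)
```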
Batch Optimization Methods
wk+1 = wk − α k ∇Rn (wk )
wk+1 = wk − (αk / n) ∑_{i=1}^n ∇fi(wk)        (batch gradient method)
•  More expensive, accurate step
•  Can choose among a wide range of optimization algorithms
•  Opportunities for parallelism
Why has SG emerged as the preeminent method?
Computational trade-offs between stochastic and batch methods
Ability to minimize F (the generalization error)
Practical Experience
Logistic regression; speech data

[Figure: empirical risk Rn vs. accessed data points (0 to 4 × 10^5, about 10 epochs), comparing batch L-BFGS with SGD]

Fast initial progress of SG followed by drastic slowdown
Can we explain this?
Intuition
SG employs information more efficiently than batch methods
Argument 1:
Suppose the data consists of 10 copies of a set S
An iteration of the batch method is then 10 times more expensive
SG performs the same computations
Computational complexity
Total work to obtain Rn (wk ) ≤ Rn (w* ) + ε
Batch gradient method: n log(1 / ε)
Stochastic gradient method: 1 / ε
Think of ε = 10⁻³. Which one is better?

More precisely:
Batch:  n d κ log(1/ε)
SG:     d ν κ² / ε
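A hedged back-of-the-envelope illustration (the values n = 10⁶ and ε = 10⁻³ are assumed here, and the condition-number and dimension factors are ignored):

```latex
\text{Batch: } \; n\log(1/\varepsilon) \approx 10^{6}\cdot\ln(10^{3}) \approx 7\times 10^{6}
\;\text{ component gradients}
\qquad
\text{SG: } \; 1/\varepsilon = 10^{3} \;\text{ component gradients}
```

For large n and moderate accuracy, the stochastic method wins by orders of magnitude in this crude count.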
Example by Bertsekas
[Figure: one-dimensional example; Rn(w) = (1/n) ∑_{i=1}^n fi(w) is a sum of quadratics whose individual minimizers spread over a "region of confusion" around the minimizer of Rn]
Note that this is a geographical argument
Analysis: given wk, what is the expected decrease in the objective Rn when one of the quadratics is chosen at random?
A fundamental observation
E[Rn(wk+1) − Rn(wk)] ≤ −αk ‖∇Rn(wk)‖₂² + αk² E‖∇fik(wk)‖²
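For completeness, a sketch of where this inequality comes from under standard assumptions (here the Lipschitz constant L of ∇Rn, which the slide absorbs into the last term, is written explicitly):

```latex
% Smoothness of R_n with the step  w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k):
R_n(w_{k+1}) \;\le\; R_n(w_k) - \alpha_k \nabla R_n(w_k)^{\top} \nabla f_{i_k}(w_k)
              \;+\; \tfrac{L}{2}\,\alpha_k^2\, \|\nabla f_{i_k}(w_k)\|_2^2
% Take expectation over i_k and use  E[\nabla f_{i_k}(w_k)] = \nabla R_n(w_k):
E\big[R_n(w_{k+1}) - R_n(w_k)\big] \;\le\; -\alpha_k \|\nabla R_n(w_k)\|_2^2
              \;+\; \tfrac{L}{2}\,\alpha_k^2\, E\|\nabla f_{i_k}(w_k)\|_2^2
```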
Initially the gradient decrease dominates; later the variance in the gradient hinders progress.

To ensure convergence, the SG method requires αk → 0 to control the variance.
Sub-sampled Newton methods directly control the noise in the last term.

What can we say when αk = α is constant?
•  Only converges to a neighborhood of the optimal value.
Deep Neural Networks
Although much is understood about the SG method, some great mysteries remain: why is it so much better than batch methods on DNNs?
Sharp and flat minima
Keskar et al. (2016)

[Figure: R observed along the line from the SG solution to the batch solution (cf. Goodfellow et al.); deep convolutional neural net, CIFAR-10]

SG: mini-batch of size 256, ADAM optimizer
Batch: 10% of the training set
Testing accuracy and sharpness
Keskar et al. (2016)

[Figures: testing accuracy vs. batch size, and sharpness of the minimizer vs. batch size]

Sharpness: max of R in a small box around the minimizer
Drawback of SG method: distributed computing
Sub-sampled Newton Methods
Rn(w) = (1/n) ∑_{i=1}^n fi(w)
Iteration
Choose S ⊂ {1, …, n} and X ⊂ {1, …, n} uniformly and independently
wk+1 = wk + αk p,    where    ∇²FS(wk) p = −∇FX(wk)

Sub-sampled gradient and Hessian:

∇FX(wk) = (1/|X|) ∑_{i∈X} ∇fi(wk)
∇²FS(wk) = (1/|S|) ∑_{i∈S} ∇²fi(wk)
True Newton method impractical in large-scale machine learning
Will not achieve scale invariance or quadratic convergence
But the stochastic nature of the objective creates opportunities:
Coordinate Hessian sample S and gradient sample X for optimal complexity
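A minimal Python sketch of this iteration under an assumed linear least-squares setup (the data, sample sizes, the tiny regularization added before the solve, and the step length are all illustrative choices, not part of the lecture):

```python
# A minimal sketch of sub-sampled Newton steps (illustration only).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))          # one data row per component f_i
b = rng.standard_normal(n)

def grad_fi(w, i):                        # gradient of f_i(w) = 0.5*(a_i^T w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

def hess_fi(w, i):                        # Hessian of f_i (constant for least squares)
    return np.outer(A[i], A[i])

def subsampled_newton_step(w, size_X, size_S, alpha=1.0):
    X = rng.choice(n, size=size_X, replace=False)     # gradient sample
    S = rng.choice(n, size=size_S, replace=False)     # Hessian sample
    g = np.mean([grad_fi(w, i) for i in X], axis=0)   # ∇F_X(w)
    H = np.mean([hess_fi(w, i) for i in S], axis=0)   # ∇²F_S(w)
    p = np.linalg.solve(H + 1e-8 * np.eye(d), -g)     # exact solve (small d)
    return w + alpha * p

w = np.zeros(d)
size_X = 10
for k in range(8):
    w = subsampled_newton_step(w, size_X=size_X, size_S=50)   # constant |S_k|
    size_X = min(n, int(2 * size_X))      # geometric growth of the gradient sample
```

Keeping |Sk| constant while growing |Xk| geometrically mirrors the conditions of the linear-convergence result on the next slide.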
Active research area
•  Friedlander and Schmidt (2011)
•  Byrd, Chin, Neveitt, Nocedal (2011)
•  Royset (2012)
•  Erdogdu and Montanari (2015)
•  Roosta-Khorasani and Mahoney (2016)
•  Agarwal, Bullins and Hazan (2016)
•  Pilanci and Wainwright (2015)
•  Pasupathy, Glynn, Ghosh, Hashemi (2015)
•  Xu, Yang, Roosta-Khorasani, Ré, Mahoney (2016)
Linear convergence
∇²FSk(wk) p = −∇FXk(wk),    wk+1 = wk + α p

The following result is well known for strongly convex objectives:

Theorem: Under standard assumptions, if
a) α = µ / L
b) |Sk| = constant
c) |Xk| = η^k, η > 1 (geometric growth)
then E[‖wk − w*‖] → 0 at a linear rate, and the
work complexity matches that of the stochastic gradient method.

µ = smallest eigenvalue of any sub-sampled Hessian
L = largest eigenvalue of the Hessian of F
Local superlinear convergence
Objective: expected risk F
We can show the linear-quadratic result

Ek[‖wk+1 − w*‖] ≤ C1 ‖wk − w*‖² + σ‖wk − w*‖ / (µ √|Sk|) + v / (µ √|Xk|)
To obtain superlinear convergence:
i) |Sk| → ∞
ii) |Xk| must increase faster than geometrically
Closer look at the constant
We can show the linear-quadratic result

Ek[‖wk+1 − w*‖] ≤ C1 ‖wk − w*‖² + σ‖wk − w*‖ / (µ √|Sk|) + v / (µ √|Xk|)

where the constants satisfy

‖E[(∇²Fi(w) − ∇²F(w))²]‖ ≤ σ²
tr(Cov(∇fi(w))) ≤ v²
Achieving faster convergence
We can show the linear-quadratic result

Ek[‖wk+1 − w*‖] ≤ C1 ‖wk − w*‖² + σ‖wk − w*‖ / (µ √|Sk|) + v / (µ √|Xk|)

To obtain the middle term we used the bound

Ek[‖(∇²FSk(wk) − ∇²F(wk))(wk − w*)‖] ≤ σ‖wk − w*‖ / √|Sk|

not matrix concentration inequalities.
Formal superlinear convergence
Theorem: under the conditions just stated, there is a neighborhood
of w* such that for the sub-sampled Newton method with α k = 1
E[‖wk − w*‖] → 0 superlinearly
Observations on quasi-Newton methods
Bk p = −∇FX (wk )
Dennis-Moré
wk+1 = wk + α k p and suppose wk → w*
If    ‖[Bk − ∇²F(w*)] pk‖ / ‖pk‖ → 0

then the convergence rate is superlinear.

1. Bk need not converge to ∇²F(w*); it only needs to be a good approximation
along the search directions
Hacker’s definition of second-order method
1.  One for which unit steplength is acceptable (yields sufficient reduction) most of the time
2.  Why? How can one learn the scale of the search directions without having learned the curvature of the function in some relevant subspaces?
3.  This "definition" is problem dependent
Inexact Methods
∇²FS(wk) p = −∇FX(wk),    wk+1 = wk + αk p

1.  Exact method: the Hessian approximation is inverted (e.g., Newton-sketch)
2.  Inexact method: solve the linear system inexactly with an iterative solver
3.  Conjugate gradient
4.  Stochastic gradient
5.  Both require only Hessian-vector products

Newton-CG chooses a fixed sample S and applies CG to

qk(p) = F(wk) + ∇F(wk)ᵀ p + (1/2) pᵀ ∇²FS(wk) p
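A minimal sketch of an inexact (Hessian-free) Newton-CG step in Python; the least-squares objective, sample sizes, CG iteration limit, and tolerance are illustrative assumptions, and the gradient is sub-sampled as in the iteration above:

```python
# One inexact Newton-CG step using only Hessian-vector products of ∇²F_S.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 50
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_FX(w, X):                        # sub-sampled gradient ∇F_X(w)
    return A[X].T @ (A[X] @ w - b[X]) / len(X)

def hess_vec_FS(v, S):                    # Hessian-vector product ∇²F_S(w) v
    return A[S].T @ (A[S] @ v) / len(S)   # (Hessian is constant for least squares)

def cg(hess_vec, g, max_iter=10, tol=1e-6):
    """Run a few CG iterations on H p = -g, returning an inexact solution p."""
    p = np.zeros_like(g)
    r = -g - hess_vec(p)                  # residual of H p = -g
    q = r.copy()
    for _ in range(max_iter):
        Hq = hess_vec(q)
        alpha = (r @ r) / (q @ Hq)
        p += alpha * q
        r_new = r - alpha * Hq
        if np.linalg.norm(r_new) < tol:
            break
        q = r_new + ((r_new @ r_new) / (r @ r)) * q
        r = r_new
    return p

w = np.zeros(d)
S = rng.choice(n, size=100, replace=False)   # fixed Hessian sample for this step
X = rng.choice(n, size=200, replace=False)   # gradient sample
p = cg(lambda v: hess_vec_FS(v, S), grad_FX(w, X))
w = w + 1.0 * p                              # unit steplength
```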
Newton-SGI (stochastic gradient iteration)
If we apply the standard gradient method to

qk(p) = F(wk) + ∇F(wk)ᵀ p + (1/2) pᵀ ∇²F(wk) p

we obtain the iteration

pk^(i+1) = pk^(i) − ∇qk(pk^(i)) = (I − ∇²F(wk)) pk^(i) − ∇F(wk)

Consider instead the semi-stochastic gradient iteration, which changes the sampled Hessian at each inner iteration:
1. Choose an index j at random
2. pk^(i+1) = (I − ∇²Fj(wk)) pk^(i) − ∇F(wk)

This is a gradient method using the exact gradient but an estimate of the Hessian.
The method is implicit in Agarwal, Bullins, Hazan (2016).
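A minimal Python sketch of the Newton-SGI inner loop; the least-squares data (with rows normalized so the unit inner step is stable) and the inner iteration count are assumptions:

```python
# Newton-SGI inner loop: gradient iterations on the quadratic model q_k with
# the exact gradient ∇F(w_k) but a freshly sampled single-component Hessian
# ∇²F_j(w_k) at every inner step.
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 20
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # ‖a_j‖ = 1, so each ∇²F_j has norm 1
b = rng.standard_normal(n)

def full_grad(w):                                # exact gradient ∇F(w)
    return A.T @ (A @ w - b) / n

def newton_sgi_direction(w, inner_iters=500):
    g = full_grad(w)                             # fixed for the whole inner loop
    p = np.zeros(d)
    for _ in range(inner_iters):
        j = rng.integers(n)                      # fresh Hessian index each inner iteration
        hess_p = A[j] * (A[j] @ p)               # ∇²F_j(w) p = a_j (a_jᵀ p)
        p = p - (hess_p + g)                     # p ← (I − ∇²F_j(w)) p − ∇F(w)
    return p

w = np.zeros(d)
w = w + newton_sgi_direction(w)                  # one outer (Newton-SGI) step
```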
Comparing Newton-CG and Newton-SGI
Number of Hessian-vector products to achieve

‖wk+1 − w*‖ ≤ (1/2) ‖wk − w*‖        (*)

Newton-SGI:   O( (κ̂_l^max)² κ̂_l log(κ̂_l) log(d) )
Newton-CG:    O( (κ̂_l^max)² κ̂_l^max log(κ̂_l^max) )
Results in Agarwal, Bullins and Hazan (2016) and Xu, Yang, Ré, Roosta-Khorasani, Mahoney (2016):
the decrease (*) is obtained at each step with probability 1 − p.
Our results give convergence of the whole sequence in expectation.
Complexity bounds are very pessimistic, particularly for CG.
Final Remarks
•  The search for effective optimization algorithms for machine learning is ongoing
•  In spite of the total dominance of the SG method at present on very large-scale applications:
   •  SG does not parallelize well
   •  SG is a first-order method affected by ill conditioning
•  A method that is noisy at the start and gradually becomes more accurate seems attractive