Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization
Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu
Motivation
• Nonconvex optimization is quite common in machine learning (deep learning, natural language processing, recommendation systems, etc.).
• Asynchronous Stochastic Gradient (AsySG) is a powerful method for solving large-scale machine learning problems.
• However, its theoretical analysis for nonconvex optimization is still limited.
Background
Consider the nonconvex optimization problem (deep learning, NLP, recommendation):

    \min_{x \in \mathbb{R}^N} f(x) = \mathbb{E}_{\xi}[ F(x; \xi) ],

where
• ξ ∈ Ξ is a random variable;
• f(x) is a smooth but not necessarily convex function.
Example: Ξ = {1, 2, ..., N} is an index set of all training samples and F(x; ξ) is the corresponding loss function.

Two Implementations of AsySG (AsySG-con & AsySG-incon)
The procedure of AsySG: a central node or a shared memory maintains the optimization variable x, and all child nodes/threads run the following procedure concurrently:
1. (Read): read the parameter x_k from the central node/shared memory.
2. (Compute): sample M training data ξ_1, ..., ξ_M and compute a batch of the stochastic gradient G(x_k; ξ_m) = ∇F(x_k; ξ_m), m = 1, ..., M, locally.
3. (Update): update the parameter x in the central node/shared memory without locks.
Here M is the mini-batch size and γ is the steplength.

AsySG-con (consistent read):

    x_{k+1} = x_k - \gamma \sum_{m=1}^{M} G(x_{k - \tau_{k,m}}; \xi_{k,m}).

Figure 1: Consistent read in a computing cluster (AsySG-con); the central node communicates with Child Node-0, Child Node-1, and Child Node-2.

AsySG-incon (inconsistent read):

    x_{k+1} = x_k - \gamma \sum_{m=1}^{M} G(\hat{x}_k; \xi_{k,m}), \qquad \hat{x}_k = x_k - \sum_{j \in J(k)} (x_{j+1} - x_j).

Figure 2: Inconsistent read in a multicore machine (AsySG-incon).

T is an upper bound on the staleness: in AsySG-con, max_m τ_{k,m} ≤ T; in AsySG-incon, J(k) ⊂ {k-1, ..., k-T}. In practice T is proportional to the number of workers. (A minimal simulation of the update is sketched below, after the key challenges.)

Key challenges in analysis
1) The gradient is evaluated at a delayed or inconsistent iterate rather than at x_k, i.e., at x_{k-τ_{k,m}} or x̂_k instead of x_k.
2) Different implementations => different forms of the stale iterate and of the staleness.
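To make the AsySG-con update concrete, here is a minimal single-process Python simulation (a sketch under stated assumptions, not the authors' implementation): each stochastic gradient is evaluated at a stale iterate x_{k-τ} with staleness at most T, and the toy objective, noise level, and all parameter values are illustrative.

    import numpy as np

    def asysg_con(grad, x0, K=1000, M=4, gamma=0.01, T=8, sigma=0.1, seed=0):
        """Simulate x_{k+1} = x_k - gamma * sum_m G(x_{k-tau}; xi_{k,m})."""
        rng = np.random.default_rng(seed)
        history = [np.asarray(x0, dtype=float)]       # x_0, x_1, ..., x_k
        for k in range(K):
            x_k = history[-1]
            update = np.zeros_like(x_k)
            for _ in range(M):                        # mini-batch of M stochastic gradients
                tau = rng.integers(0, min(T, k) + 1)  # staleness tau_{k,m} <= T
                x_stale = history[-1 - tau]           # delayed read x_{k - tau}
                update += grad(x_stale) + sigma * rng.standard_normal(x_k.shape)
            history.append(x_k - gamma * update)
        return history[-1]

    # Toy smooth nonconvex objective: f(x) = sum(x_i^2 + 0.5 * sin(3 * x_i)).
    toy_grad = lambda x: 2 * x + 1.5 * np.cos(3 * x)
    print(asysg_con(toy_grad, x0=np.ones(5)))

With T = 0 this reduces to ordinary mini-batch SGD, which makes the role of the staleness explicit.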
Asynchronous vs Synchronous
• Synchronous parallelism: high system overhead. Every update waits for all workers, so along the time axis each worker alternates between computing, sitting idle, and applying the update.
• Asynchronous parallelism: low system overhead. Workers keep computing and updating without waiting for one another, as in Hogwild on shared memory [1] and distributed parameter-server systems [3]. (A threaded sketch of this access pattern follows.)
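The lock-free, inconsistent-read style of AsySG-incon can be mimicked with ordinary Python threads updating a shared NumPy array. This is only an illustration of the access pattern (the objective, steplength, and thread count are made up, and CPython threads do not deliver real numerical speedup); each worker performs the Read/Compute/Update loop from the procedure above without any lock.

    import threading
    import numpy as np

    def worker(x, gamma, steps, seed):
        # Read a possibly inconsistent snapshot, compute a noisy gradient at it,
        # then update the shared parameter in place without taking a lock.
        rng = np.random.default_rng(seed)
        grad = lambda v: 2 * v + 1.5 * np.cos(3 * v)       # toy nonconvex gradient
        for _ in range(steps):
            snapshot = x.copy()                            # inconsistent read
            g = grad(snapshot) + 0.1 * rng.standard_normal(x.shape)
            x -= gamma * g                                 # lock-free in-place update

    x_shared = np.ones(10)                                 # parameter in shared memory
    threads = [threading.Thread(target=worker, args=(x_shared, 0.005, 500, s))
               for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(x_shared)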
Theorem: Convergence Rate for AsySG
Assume that certain assumptions hold and the variance of the stochastic gradient is bounded:

    \mathbb{E}\| G(x; \xi) - \nabla f(x) \|^2 \le \sigma^2.

Set the steplength to be a constant

    \gamma = \sqrt{\frac{f(x_1) - f^*}{L M K \sigma^2}}.

If the delay parameter T is bounded by K \ge O(T^2) (hiding constants depending on M, L, σ, and f(x_1) - f^*), then the output of AsySG-con satisfies the following ergodic convergence rate:

    \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\| \nabla f(x_k) \|^2 \le 4 \sqrt{\frac{(f(x_1) - f^*) L \sigma^2}{M K}};

in particular, \min_{k \in \{1,\dots,K\}} \mathbb{E}\| \nabla f(x_k) \|^2 obeys the same bound. The error consists of two parts: one caused by SGD (the stochastic-gradient variance), which dominates, and one caused by asynchrony; under the delay condition above the asynchrony part is dominated by the SGD part.

• Consistent convergence rate with SGD.
• Linear speedup up to O(\sqrt{K}) machines.
• Better linear speedup property than existing work (linear speedup up to O(K^{1/4}) machines in [2]).
• For AsySG-incon, we have a similar convergence rate.
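One way to read the corollary: the bound depends on M and K only through the product MK and does not involve T at all, so as long as the delay condition holds (T up to roughly the order of sqrt(K)), adding workers shortens wall-clock time without weakening the guarantee. The snippet below just plugs purely illustrative constants into the reconstructed bound above to make this reading concrete.

    from math import sqrt

    def ergodic_bound(gap, L, sigma, M, K):
        # 4 * sqrt((f(x_1) - f*) * L * sigma^2 / (M * K)), from the corollary above
        return 4 * sqrt(gap * L * sigma ** 2 / (M * K))

    gap, L, sigma, M, K = 1.0, 1.0, 1.0, 8, 10 ** 6              # illustrative constants
    print("ergodic bound:", ergodic_bound(gap, L, sigma, M, K))  # no dependence on T
    print("staleness budget, T = O(sqrt(K)):", int(sqrt(K)))     # ~10^3 workers here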
Experiment
AsySG-con (CIFAR10-FULL). Figure 3: the AsySG-con algorithm run on various numbers of machines.

                 mpi-1  mpi-2  mpi-3  mpi-4  mpi-5  mpi-6  mpi-7  mpi-8
  iter speedup    1.01   1.93   2.65   3.42   4.27   4.92   5.36   5.96
  time speedup    1.00   1.73   2.28   2.88   3.56   4.07   4.41   5.00

AsySG-incon (synthetic data). Figure 4: the AsySG-incon algorithm run on various numbers of cores.

                 thr-1  thr-4  thr-8  thr-12  thr-16  thr-20  thr-24  thr-28  thr-32
  iter speedup     1.0    3.9    7.8    11.6    15.4    19.9    24.1    28.7    31.6
  time speedup     1.0    4.0    8.1    11.9    16.3    19.2    22.7    26.1    29.2
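As a quick sanity check on the tables above, the snippet below computes the parallel efficiency (time speedup divided by the number of machines or cores) from the reported time-speedup rows.

    con   = {1: 1.00, 2: 1.73, 3: 2.28, 4: 2.88, 5: 3.56, 6: 4.07, 7: 4.41, 8: 5.00}
    incon = {1: 1.0, 4: 4.0, 8: 8.1, 12: 11.9, 16: 16.3, 20: 19.2,
             24: 22.7, 28: 26.1, 32: 29.2}

    for name, table in (("AsySG-con, machines", con), ("AsySG-incon, cores", incon)):
        worst = min(speedup / n for n, speedup in table.items())
        print(f"{name}: worst-case time efficiency = {worst:.0%}")
    # AsySG-con keeps over 60% efficiency up to 8 machines; AsySG-incon stays above
    # 90% up to 32 cores, i.e. close to ideal linear speedup.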
References
[1] F. Niu, B. Recht, C. Re, and S. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS, 2011.
[2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.
[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. NIPS, 2012.
[4] J. Liu, S. J. Wright, C. Re, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014.