Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization
Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu
Motivation
β€’ Nonconvex optimization is quite common in machine learning (deep learning, natural language processing, recommendation systems, etc.).
β€’ Asynchronous Stochastic Gradient (AsySG) is a powerful method for solving large-scale machine learning problems.
β€’ However, its theoretical analysis is still limited for nonconvex optimization.
Background
Consider the nonconvex optimization problem (deep learning, NLP, recommendation, etc.):
   $\min_{x \in \mathbb{R}^n} f(x) = \mathbb{E}_{\xi}[F(x, \xi)]$,
where
β€’ $\xi \in \Xi$ is a random variable;
β€’ $f(x)$ is a smooth but not necessarily convex function.
Example: $\Xi = \{1, 2, \cdots, N\}$ is an index set of all training samples and $F(x; \xi)$ is the corresponding loss function.
Asynchronous vs Synchronous
β€’ Synchronous parallelism: high system overhead. In every round all workers must finish computing before the update is applied, so faster workers sit idle waiting for the slowest one.
β€’ Asynchronous parallelism: low system overhead. Workers keep computing and updating without waiting for each other.
[Figure: per-worker timelines over time, showing the computing, idle, and update phases under synchronous and asynchronous parallelism.]
Two Implementations of AsySG (AsySG-con & AsySG-incon)
The procedure of AsySG: a central node or a shared memory maintains the optimization variable $x$. All child nodes/threads run the following procedure concurrently:
1. (Read): read the parameter $x_k$ from the central node/shared memory.
2. (Compute): sample $M$ training data $\xi_1, \cdots, \xi_M$ and compute a batch of stochastic gradients $G(x_k, \xi_m) = \nabla F(x_k, \xi_m)$, $m = 1, \cdots, M$, locally.
3. (Update): update the parameter $x$ in the central node/shared memory without locks:
   $x_{k+1} \leftarrow x_k - \gamma \sum_{m=1}^{M} G(x_k, \xi_m)$,
   where $M$ is the mini-batch size and $\gamma$ is the steplength.
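A minimal Python sketch of this Read / Compute / Update loop, where a shared numpy vector stands in for the central node / shared memory and the illustrative least-squares loss from the Background sketch is repeated so the snippet is self-contained; the thread count, step size, and names are our assumptions, not the paper's code.

```python
import threading
import numpy as np

# Sketch only: a shared numpy vector plays the role of the central node / shared memory.
rng = np.random.default_rng(0)
N, n, M, gamma = 1000, 10, 8, 1e-3
A = rng.standard_normal((N, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(N)
x_shared = np.zeros(n)                          # the optimization variable x

def stochastic_gradient(x, xi):
    a = A[xi]
    return a * (a @ x - b[xi])                  # G(x, xi) for the assumed least-squares loss

def worker(seed, num_iters=2000):
    local_rng = np.random.default_rng(seed)
    for _ in range(num_iters):
        x = x_shared.copy()                     # 1. (Read) the current parameter
        g = sum(stochastic_gradient(x, local_rng.integers(N))
                for _ in range(M))              # 2. (Compute) a mini-batch of M gradients
        x_shared[:] -= gamma * g                # 3. (Update) the shared parameter without locks

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.linalg.norm(x_shared - x_true))        # small: the shared iterate approaches x_true
```

Because the read in step 1 can lag behind the most recent updates, the parameter actually used to compute a gradient may be stale or inconsistent, which is what the two implementations below model.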
AsySG-con (consistent read, implemented on a computing cluster; Figure 1): the parameter a worker reads may be stale by $\tau_k$ updates, so the $k$-th update actually applied at the central node is
   $x_{k+1} = x_k - \gamma \sum_{m=1}^{M} G(x_{k-\tau_k}, \xi_{k,m})$.
[Figure 1: Consistent read in a computing cluster (AsySG-con); a central node maintains $x$ while Child Node-0, Child Node-1, Child Node-2, ... compute gradients concurrently.]

AsySG-incon (inconsistent read, implemented on a multicore shared-memory machine; Figure 2): each update modifies a single coordinate $i_k$, and the value $\hat{x}_k$ read from shared memory may be inconsistent because other threads write to $x$ while it is being read:
   $(x_{k+1})_{i_k} = (x_k)_{i_k} - \gamma \left( \sum_{m=1}^{M} G(\hat{x}_k, \xi_{k,m}) \right)_{i_k}$, where $\hat{x}_k = x_k - \sum_{j \in J(k)} (x_{j+1} - x_j)$.
[Figure 2: Inconsistent read in a multicore machine (AsySG-incon).]
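To make the delayed read $x_{k-\tau_k}$ and the inconsistent read $\hat{x}_k$ explicit, here is a serial Python sketch of the two update rules; `grad` (the mini-batch gradient map), `J_k`, `i_k`, and the quadratic in the usage example are illustrative assumptions, not the paper's code.

```python
import numpy as np

def asysg_con_update(iterates, tau_k, grad, gamma):
    """AsySG-con: x_{k+1} = x_k - gamma * sum_m G(x_{k - tau_k}, xi_{k,m})."""
    x_k = iterates[-1]
    x_delayed = iterates[-1 - tau_k]            # the stale parameter the worker actually read
    return x_k - gamma * grad(x_delayed)

def asysg_incon_update(iterates, J_k, i_k, grad, gamma):
    """AsySG-incon: update a single coordinate i_k using an inconsistent read x_hat_k."""
    x_k = iterates[-1]
    # x_hat_k = x_k - sum_{j in J(k)} (x_{j+1} - x_j): some recent updates are missing.
    x_hat = x_k - sum(iterates[j + 1] - iterates[j] for j in J_k)
    x_next = x_k.copy()
    x_next[i_k] -= gamma * grad(x_hat)[i_k]
    return x_next

# Tiny usage example with f(x) = 0.5 * ||x||^2, so grad(x) = x (mini-batch of size 1):
iterates = [np.ones(3)]
for k in range(5):
    iterates.append(asysg_con_update(iterates, tau_k=min(k, 2), grad=lambda x: x, gamma=0.1))
print(iterates[-1])
```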
$T :=$ upper bound of the staleness. For example, in AsySG-con: $\max_k \tau_k \le T$; in AsySG-incon: $J(k) \subset \{k-1, \cdots, k-T\}$. In practice $T$ is proportional to the number of workers.
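A small serial simulation of AsySG-con (ours, purely for illustration) in which the staleness $\tau_k$ is drawn uniformly from $\{0, \cdots, T\}$; the quadratic objective and Gaussian gradient noise are assumptions made to keep the sketch short.

```python
import numpy as np

# f(x) = 0.5 * ||x - x_star||^2; G(x, xi) is the true gradient plus N(0, sigma^2) noise.
rng = np.random.default_rng(0)
n, M, T, K, gamma, sigma = 10, 4, 8, 5000, 1e-2, 1.0
x_star = rng.standard_normal(n)

def G(x):
    return (x - x_star) + sigma * rng.standard_normal(n)

iterates = [np.zeros(n)]
for k in range(K):
    tau_k = rng.integers(0, min(k, T) + 1)      # staleness bounded by T
    x_delayed = iterates[-1 - tau_k]            # the worker read a stale parameter
    grad = sum(G(x_delayed) for _ in range(M))  # mini-batch of M stochastic gradients
    iterates.append(iterates[-1] - gamma * grad)

print(np.linalg.norm(iterates[-1] - x_star))    # much smaller than ||x_star||: still converges
```

Larger $T$ adds error from stale gradients, but as long as the delay condition in the theorem below holds, it does not change the order of the convergence rate.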
Key challenges in analysis
1) $\hat{x}_t \neq x_t$: the parameter used to compute the gradient is not the parameter being updated;
2) Different implementations => different forms of $\hat{x}_t$.
Theorem (Convergence Rate for AsySG)
Assume that certain assumptions hold and $\mathbb{E}_{\xi} \|G(x, \xi) - \nabla f(x)\|^2 \le \sigma^2$. Set the steplength to be a constant $\gamma = O\big(\sqrt{1/(M K \sigma^2)}\big)$. If the delay parameter $T$ is bounded by $K \ge O(M T^2 / \sigma^2)$, then the output of AsySG-con satisfies the following ergodic convergence rate:
   $\min_{k \in \{1, \cdots, K\}} \mathbb{E}\|\nabla f(x_k)\|^2 \le \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\|\nabla f(x_k)\|^2 \le O\big(\sigma / \sqrt{M K}\big)$.
[Figure: the deviation of the applied stochastic gradient $G(x)$ from $\nabla f(x)$ has one component caused by asynchrony and one caused by SGD; the SGD component dominates.]
β€’ Consistent convergence rate with SGD.
β€’ Linear speedup up to $O(K^{1/2})$ machines.
β€’ Better linear speedup property than existing work (linear speedup up to $O(K^{1/4})$ machines in [2]).
β€’ For AsySG-incon, we have a similar convergence rate.
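A small numeric sanity check of the steplength, delay condition, and rate bound in the theorem above; the constants hidden by the $O(\cdot)$ notation are set to 1, which is purely an illustrative assumption.

```python
import math

# Constants hidden inside O(.) are set to 1 for illustration only.
M, K, sigma, T = 8, 10**6, 1.0, 32

gamma = math.sqrt(1.0 / (M * K * sigma**2))   # steplength gamma = O(sqrt(1 / (M K sigma^2)))
rate = sigma / math.sqrt(M * K)               # ergodic rate bound O(sigma / sqrt(M K))
delay_ok = K >= M * T**2 / sigma**2           # delay condition K >= O(M T^2 / sigma^2)

print(f"gamma ~ {gamma:.1e}, rate bound ~ {rate:.1e}, delay condition holds: {delay_ok}")
```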
Experiment
AsySG-con (CIFAR10-FULL)
Figure 3: AsySG-con algorithm run on various numbers of machines.

               mpi-1  mpi-2  mpi-3  mpi-4  mpi-5  mpi-6  mpi-7  mpi-8
iter speedup    1.01   1.93   2.65   3.42   4.27   4.92   5.36   5.96
time speedup    1.00   1.73   2.28   2.88   3.56   4.07   4.41   5.00
AsySG-incon (Synthetic data)
Figure 4: AsySG-incon algorithm run on various numbers of cores.

               thr-1  thr-4  thr-8  thr-12  thr-16  thr-20  thr-24  thr-28  thr-32
iter speedup       1    3.9    7.8    11.6    15.4    19.9    24.1    28.7    31.6
time speedup       1    4.0    8.1    11.9    16.3    19.2    22.7    26.1    29.2

References
[1] F. Niu, B. Recht, C. Re, and S. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. NIPS, 2011.
[2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.
[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. NIPS, 2012.
[4] J. Liu, S. J. Wright, C. Re, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014.