Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization
Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu
Motivation
• Nonconvex optimization is quite common in machine learning (deep learning, natural language processing, recommendation systems, etc.).
• Asynchronous Stochastic Gradient (AsySG) is a powerful method for solving large-scale machine learning problems.
• However, its theoretical analysis for nonconvex optimization is still limited.
Background
Consider the nonconvex optimization problem (deep learning, NLP, recommendation):

    \min_{x \in \mathbb{R}^N} f(x) = \mathbb{E}_{\xi}[ F(x; \xi) ],

where
• ξ ∈ Ξ is a random variable;
• f(x) is a smooth but not necessarily convex function.
Example: Ξ = {1, 2, ..., N} is an index set of all training samples and F(x; ξ) is the corresponding loss function.

Two Implementations of AsySG (AsySG-con & AsySG-incon)
The procedure of AsySG: a central node or a shared memory maintains the optimization variable x, and all child nodes/threads run the following procedure concurrently:
1. (Read): read the parameter x_k from the central node/shared memory.
2. (Compute): sample M training data ξ_1, ..., ξ_M and compute a batch of the stochastic gradient G(x_k; ξ_m) = ∇F(x_k; ξ_m), m = 1, ..., M, locally.
3. (Update): update the parameter x in the central node/shared memory without locks.
Here M is the mini-batch size and γ is the steplength.

AsySG-con (consistent read):

    x_{k+1} = x_k - \gamma \sum_{m=1}^{M} G(x_{k - \tau_{k,m}}; \xi_{k,m}).

Figure 1: Consistent read in a computing cluster (AsySG-con); the central node communicates with Child Node-0, Child Node-1, and Child Node-2.

AsySG-incon (inconsistent read):

    x_{k+1} = x_k - \gamma \sum_{m=1}^{M} G(\hat{x}_k; \xi_{k,m}), \qquad \hat{x}_k = x_k - \sum_{j \in J(k)} (x_{j+1} - x_j).

Figure 2: Inconsistent read in a multicore machine (AsySG-incon).

T is an upper bound on the staleness: in AsySG-con, max_m τ_{k,m} ≤ T; in AsySG-incon, J(k) ⊂ {k-1, ..., k-T}. In practice T is proportional to the number of workers. (A minimal simulation of the update is sketched below, after the key challenges.)

Key challenges in analysis
1) The gradient is evaluated at a delayed or inconsistent iterate rather than at x_k, i.e., at x_{k-τ_{k,m}} or x̂_k instead of x_k.
2) Different implementations => different forms of the stale iterate and of the staleness.
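To make the AsySG-con update concrete, here is a minimal single-process Python simulation (a sketch under stated assumptions, not the authors' implementation): each stochastic gradient is evaluated at a stale iterate x_{k-τ} with staleness at most T, and the toy objective, noise level, and all parameter values are illustrative.

    import numpy as np

    def asysg_con(grad, x0, K=1000, M=4, gamma=0.01, T=8, sigma=0.1, seed=0):
        """Simulate x_{k+1} = x_k - gamma * sum_m G(x_{k-tau}; xi_{k,m})."""
        rng = np.random.default_rng(seed)
        history = [np.asarray(x0, dtype=float)]       # x_0, x_1, ..., x_k
        for k in range(K):
            x_k = history[-1]
            update = np.zeros_like(x_k)
            for _ in range(M):                        # mini-batch of M stochastic gradients
                tau = rng.integers(0, min(T, k) + 1)  # staleness tau_{k,m} <= T
                x_stale = history[-1 - tau]           # delayed read x_{k - tau}
                update += grad(x_stale) + sigma * rng.standard_normal(x_k.shape)
            history.append(x_k - gamma * update)
        return history[-1]

    # Toy smooth nonconvex objective: f(x) = sum(x_i^2 + 0.5 * sin(3 * x_i)).
    toy_grad = lambda x: 2 * x + 1.5 * np.cos(3 * x)
    print(asysg_con(toy_grad, x0=np.ones(5)))

With T = 0 this reduces to ordinary mini-batch SGD, which makes the role of the staleness explicit.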
Asynchronous vs Synchronous
• Synchronous parallelism: high system overhead. Every update waits for all workers, so along the time axis each worker alternates between computing, sitting idle, and applying the update.
• Asynchronous parallelism: low system overhead. Workers keep computing and updating without waiting for one another, as in Hogwild on shared memory [1] and distributed parameter-server systems [3]. (A threaded sketch of this access pattern follows.)
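The lock-free, inconsistent-read style of AsySG-incon can be mimicked with ordinary Python threads updating a shared NumPy array. This is only an illustration of the access pattern (the objective, steplength, and thread count are made up, and CPython threads do not deliver real numerical speedup); each worker performs the Read/Compute/Update loop from the procedure above without any lock.

    import threading
    import numpy as np

    def worker(x, gamma, steps, seed):
        # Read a possibly inconsistent snapshot, compute a noisy gradient at it,
        # then update the shared parameter in place without taking a lock.
        rng = np.random.default_rng(seed)
        grad = lambda v: 2 * v + 1.5 * np.cos(3 * v)       # toy nonconvex gradient
        for _ in range(steps):
            snapshot = x.copy()                            # inconsistent read
            g = grad(snapshot) + 0.1 * rng.standard_normal(x.shape)
            x -= gamma * g                                 # lock-free in-place update

    x_shared = np.ones(10)                                 # parameter in shared memory
    threads = [threading.Thread(target=worker, args=(x_shared, 0.005, 500, s))
               for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(x_shared)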
Theorem: Convergence Rate for AsySG
Assume that certain assumptions hold and the variance of the stochastic gradient is bounded:

    \mathbb{E}\| G(x; \xi) - \nabla f(x) \|^2 \le \sigma^2.

Set the steplength to be a constant

    \gamma = \sqrt{\frac{f(x_1) - f^*}{L M K \sigma^2}}.

If the delay parameter T is bounded by K \ge O(T^2) (hiding constants depending on M, L, σ, and f(x_1) - f^*), then the output of AsySG-con satisfies the following ergodic convergence rate:

    \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\| \nabla f(x_k) \|^2 \le 4 \sqrt{\frac{(f(x_1) - f^*) L \sigma^2}{M K}};

in particular, \min_{k \in \{1,\dots,K\}} \mathbb{E}\| \nabla f(x_k) \|^2 obeys the same bound. The error consists of two parts: one caused by SGD (the stochastic-gradient variance), which dominates, and one caused by asynchrony; under the delay condition above the asynchrony part is dominated by the SGD part.

• Consistent convergence rate with SGD.
• Linear speedup up to O(\sqrt{K}) machines.
• Better linear speedup property than existing work (linear speedup up to O(K^{1/4}) machines in [2]).
• For AsySG-incon, we have a similar convergence rate.
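One way to read the corollary: the bound depends on M and K only through the product MK and does not involve T at all, so as long as the delay condition holds (T up to roughly the order of sqrt(K)), adding workers shortens wall-clock time without weakening the guarantee. The snippet below just plugs purely illustrative constants into the reconstructed bound above to make this reading concrete.

    from math import sqrt

    def ergodic_bound(gap, L, sigma, M, K):
        # 4 * sqrt((f(x_1) - f*) * L * sigma^2 / (M * K)), from the corollary above
        return 4 * sqrt(gap * L * sigma ** 2 / (M * K))

    gap, L, sigma, M, K = 1.0, 1.0, 1.0, 8, 10 ** 6              # illustrative constants
    print("ergodic bound:", ergodic_bound(gap, L, sigma, M, K))  # no dependence on T
    print("staleness budget, T = O(sqrt(K)):", int(sqrt(K)))     # ~10^3 workers here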
Experiment
AsySG-con (CIFAR10-FULL). Figure 3: the AsySG-con algorithm run on various numbers of machines.

                 mpi-1  mpi-2  mpi-3  mpi-4  mpi-5  mpi-6  mpi-7  mpi-8
  iter speedup    1.01   1.93   2.65   3.42   4.27   4.92   5.36   5.96
  time speedup    1.00   1.73   2.28   2.88   3.56   4.07   4.41   5.00

AsySG-incon (synthetic data). Figure 4: the AsySG-incon algorithm run on various numbers of cores.

                 thr-1  thr-4  thr-8  thr-12  thr-16  thr-20  thr-24  thr-28  thr-32
  iter speedup     1.0    3.9    7.8    11.6    15.4    19.9    24.1    28.7    31.6
  time speedup     1.0    4.0    8.1    11.9    16.3    19.2    22.7    26.1    29.2
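As a quick sanity check on the tables above, the snippet below computes the parallel efficiency (time speedup divided by the number of machines or cores) from the reported time-speedup rows.

    con   = {1: 1.00, 2: 1.73, 3: 2.28, 4: 2.88, 5: 3.56, 6: 4.07, 7: 4.41, 8: 5.00}
    incon = {1: 1.0, 4: 4.0, 8: 8.1, 12: 11.9, 16: 16.3, 20: 19.2,
             24: 22.7, 28: 26.1, 32: 29.2}

    for name, table in (("AsySG-con, machines", con), ("AsySG-incon, cores", incon)):
        worst = min(speedup / n for n, speedup in table.items())
        print(f"{name}: worst-case time efficiency = {worst:.0%}")
    # AsySG-con keeps over 60% efficiency up to 8 machines; AsySG-incon stays above
    # 90% up to 32 cores, i.e. close to ideal linear speedup.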
References
[1] F. Niu, B. Recht, C. Re, and S. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS, 2011.
[2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.
[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. NIPS, 2012.
[4] J. Liu, S. J. Wright, C. Re, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014.