Appendix
Fugue: Slow-Worker-Agnostic Distributed Learning for Big Models on Big Data
This appendix contains proof details for the main paper, as well as some experimental details. Some of the theorems and equations in this document refer to the main paper; we mark such references with "(main paper)" to avoid confusion.
A  Parameter Tuning for Experiments
Model               | Fugue's η0 | Fugue's η′ | BarrieredFugue's η0 | GraphLab's η0 | GraphLab's step_dec | PSGD's η0
TM                  | 0.01       | 0.01       | 0.01                | 0.05          | 0.9                 | 0.005
Dictionary Learning | 0.01       | 0.01       | 0.01                | 0.001         | 0.9                 | 0.01
MMND                | 0.01       | 0.05       | 0.05                | 0.1           | 0.9                 | 0.1

Table 1: Final tuned parameter values for Fugue, BarrieredFugue, GraphLab and PSGD. All methods are tuned to perform optimally. η0 is the initial step size, where η is defined in equation 4. λ is the Dictionary Learning ℓ1 penalty defined in equation 2. η′ is a parameter that modifies the learning rate when extra updates are executed while waiting for slow workers. step_dec is a parameter for decreasing the learning rate in GraphLab's collaborative filtering library.
Table 1 shows the final parameter values for each problem, after they have been tuned optimally for each method. η′ controls the learning rate of the additional updates a worker performs while it is waiting for slower workers to finish; this makes the step size satisfy the condition in equation 11. The modified learning rate is ηt · (η′)^(x+1), where x is the number of extra update iterations the worker has performed while waiting in this sub-epoch. One iteration here means updating all the data points in the current sub-epoch once. ηt is the sub-epoch's original learning rate without any extra updates, as in BarrieredFugue. The learning rate for epoch t is given by ηt = η0/(t+1) for Fugue and BarrieredFugue.
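As a concrete illustration of this schedule, here is a minimal Python sketch (ours, not code from the Fugue system; the function names base_step_size and extra_update_step_size are hypothetical) that computes the per-epoch step size η0/(t+1) and the damped step size ηt · (η′)^(x+1) used for extra updates:

```python
def base_step_size(eta0, t):
    """Per-epoch step size eta_t = eta0 / (t + 1), as used by Fugue and BarrieredFugue."""
    return eta0 / (t + 1)

def extra_update_step_size(eta0, eta_prime, t, x):
    """Damped step size eta_t * (eta_prime)^(x + 1) for extra passes a worker makes
    over its current sub-epoch while waiting for slower workers; x counts those passes."""
    return base_step_size(eta0, t) * (eta_prime ** (x + 1))

# Example: with eta0 = 0.01 and eta_prime = 0.5, epoch t = 0 gives
# x = 0 -> 0.005 and x = 1 -> 0.0025, so each extra pass is damped further.
print(extra_update_step_size(0.01, 0.5, t=0, x=0))
print(extra_update_step_size(0.01, 0.5, t=0, x=1))
```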
B  Convergence Proof: Theorem 1 (main paper)
From equation 4 (main paper) we have
\[
\begin{aligned}
\psi^{(t+1)} &= \psi^{(t)} - \eta_t\,\delta L^{(t)}(V^{(t)}, \psi^{(t)}) \\
&= \psi^{(t)} - \eta_t \nabla L(\psi^{(t)}) + \eta_t \left[ \nabla L(\psi^{(t)}) - \delta L^{(t)}(V^{(t)}, \psi^{(t)}) \right] \\
&= \psi^{(t)} - \eta_t \nabla L(\psi^{(t)}) + \eta_t \varepsilon_t
\end{aligned}
\]
Using $n_i$ and $N_w$ as defined in equation 6 (main paper),
\[
\psi^{\left(t + \left(\sum_{1}^{w} n_i\right) m\right)} = \psi^{(t)} + \sum_{i=t}^{t + m\left(\sum_{1}^{w} n_i\right)} -\eta_i \nabla L(\psi^{(i)}) + \sum_{i=t}^{t + m\left(\sum_{1}^{w} n_i\right)} \eta_i \varepsilon_i
\tag{1}
\]
\[
\Longrightarrow \psi^{(t+mN_w)} = \psi^{(t)} + \sum_{i=t}^{t+mN_w} -\eta_i \nabla L(\psi^{(i)}) + \sum_{i=t}^{t+mN_w} \eta_i \varepsilon_i
\qquad \text{assuming } \sum_{1}^{w} n_i = N_w
\]
\[
\Longrightarrow \psi^{(t+mN_w)} = \psi^{(t)} + \sum_{i=t}^{t+mN_w} -\eta_i \nabla L(\psi^{(i)}) + M_{mN_w}
\tag{2}
\]
where $M_{mN_w} = \sum_{i=t}^{t+mN_w} \eta_i \varepsilon_i$ is a martingale sequence, since it is a sum of a martingale difference sequence. $mN_w$ captures the $m$ whole sub-epochs of work done by all the workers combined. From Doob's martingale inequality (Friedman, 1975, ch. 1, Thm 3.8),
\[
P\left( \sup_{t+mN_w \ge r \ge t} |M_r| \ge c \right) \le \frac{E\left[ \left( \sum_{i=t}^{t+mN_w} \eta_i \varepsilon_i \right)^2 \right]}{c^2}
\tag{3}
\]
where $M_r = \sum_{i=t}^{r} \eta_i \varepsilon_i$. Let us look at the RHS of equation 3 above:
\[
E\left[ \left( \sum_{i=t}^{t+mN_w} \eta_i \varepsilon_i \right)^2 \right] = E\left[ \sum_{i=1}^{mN_w} (\eta_i \varepsilon_i)^2 \right]
\qquad \text{(equation 8 (main paper)} \Longrightarrow E[\varepsilon_i \varepsilon_j] = 0 \text{ if } i \ne j\text{)}
\]
\[
= \sum_{i=1}^{mN_w} \eta_i^2\, E[\varepsilon_i^2] \le \sum_{i=1}^{mN_w} \eta_i^2\, D \to 0
\]
where $E[\varepsilon_i^2] < D\ \forall i$ and assuming $\sum \eta_i^2 < \infty$
\[
\Longrightarrow \lim_{t \to \infty} P\left( \sup_{i \ge t} |M_i| \ge c \right) = 0
\tag{4}
\]
From equation 4 we have, asymptotically,
\[
\psi^{(t+mN_w)} = \psi^{(t)} + \sum_{i=t}^{t+mN_w} -\eta_i \nabla L(\psi^{(i)}).
\]
Note that we carry out the theoretical analysis of the algorithm without projection steps. Extending the proof to include projection can be done by using the Arzelà-Ascoli theorem and the limits of a converging sub-sequence of our algorithm's SGD updates (Kushner and Yin, 2003).
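To build intuition for why the noise term vanishes, the following small numerical sketch (ours, purely illustrative and not part of the proof) simulates the martingale sum M_r = Σ_{i≤r} η_i ε_i with square-summable step sizes η_i = η0/(i+1) and i.i.d. zero-mean noise standing in for ε_i, and compares sup_r |M_r| against the second-moment quantity Σ η_i² D that appears in the Doob bound above:

```python
import numpy as np

rng = np.random.default_rng(0)

eta0 = 0.1
T = 100_000

# Square-summable step sizes eta_i = eta0 / (i + 1)
idx = np.arange(1, T + 1)
eta = eta0 / (idx + 1)

# Zero-mean noise standing in for eps_i, with E[eps_i^2] = D = 1
eps = rng.standard_normal(T)

# Martingale partial sums M_r = sum_{i <= r} eta_i * eps_i
M = np.cumsum(eta * eps)

print("sup_r |M_r|      :", np.max(np.abs(M)))
print("sum_i eta_i^2 * D:", np.sum(eta ** 2))  # second moment of the full sum
```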
C  Intra sub-epoch variance
\[
\psi^{(t+1)} = \psi^{(t)} - \delta\psi^{(t)}(V^{(t)}, \psi^{(t)}),
\qquad \text{where } \delta\psi^{(t)}(V^{(t)}, \psi^{(t)}) = \eta_t\, \delta L^{(t)}(V^{(t)}, \psi^{(t)})
\;\Rightarrow\; \psi^{(t+1)} = \psi^{t} - \eta_t\, \delta L^{t}(V^{t}, \psi^{t})
\tag{5}
\]
Summing equation 5 over $n_i$, the number of points updated in block $i$ of a sub-epoch,
\[
\psi^{t+n_i} = \psi^{t} - \sum_{i=1}^{n_i} \eta_{t+i}\, \delta L^{t+i}(V^{t+i}, \psi^{t+i})
\tag{6}
\]
As defined earlier, $V$ denotes the joint potential for all the $n_i$ points encountered in block $i$. The equation 7 (main paper) for $V$ is
\[
p(\psi^{(t+n_i)} \mid \psi^{t})\, d\psi^{(t+n_i)} = p(V(\psi^{(t+n_i)}, \psi^{t}))\, dV
\]
\[
\Rightarrow \int p(\psi^{(t+n_i)})\, d\psi^{(t+n_i)}
= \int_{\psi^{t}} p(\psi^{(t+n_i)} \mid \psi^{t})\, p(\psi^{t})\, d\psi^{t}\, d\psi^{(t+n_i)}
= \int_{\psi^{t}} p(V(\psi^{(t+n_i)}, \psi^{t}))\, dV\, p(\psi^{t})\, d\psi^{t}
\tag{7}
\]
Lemma 1 Let $u(\psi^{(t+n_i)})$ be a function of $\psi^{(t+n_i)}$; then
\[
E_{\psi^{(t+n_i)}}[u(\psi^{(t+n_i)})] = E_{\psi^{t}}\big[ E_{V}[u(\psi^{(t+n_i)})] \big]
\]
Proof. From equation 7,
\[
\begin{aligned}
E_{\psi^{(t+n_i)}}[u(\psi^{(t+n_i)})]
&= \int_{\psi^{(t+n_i)}} u(\psi^{(t+n_i)})\, p(\psi^{(t+n_i)})\, d\psi^{(t+n_i)} \\
&= \int_{\psi^{t}} \int_{V} u(\psi^{(t+n_i)})\, p(V(\psi^{(t+n_i)}, \psi^{t}))\, dV\, p(\psi^{t})\, d\psi^{t} \\
&= E_{\psi^{t}}\big[ E_{V}[u(\psi^{(t+n_i)})] \big]
\end{aligned}
\]
Lemma 2
\[
E_{V}[\delta L^{t+i}(v^{t+i}, \psi^{t+i})] = \frac{d\, E_{V}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}}
\]
Proof. Due to the randomness in picking the point to be updated in iteration $t+i$, we have
\[
E_{V}[L^{t+i}(v^{t+i}, \psi^{t+i})] = \int L(y, \psi^{t+i})\, dy
\]
\[
\Rightarrow \frac{d\, E_{V}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}}
= E_{V}\left[ \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} \right]
= E_{V}[\delta L^{t+i}(v^{t+i}, \psi^{t+i})]
\]
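As an informal sanity check of Lemma 2 (our example, not from the paper), take the squared loss L(y, ψ) = ½(y − ψ)² with data points drawn at random: the average of the per-point gradients matches the gradient of the average loss, up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=1.0, size=1_000_000)  # randomly picked data points
psi = 0.5

# Per-point loss L(y, psi) = 0.5 * (y - psi)^2, so dL/dpsi = psi - y
mean_of_gradients = np.mean(psi - y)   # E[ dL/dpsi ]
gradient_of_mean = psi - np.mean(y)    # d E[L] / dpsi for this loss

print(mean_of_gradients, gradient_of_mean)  # both close to 0.5 - 3.0 = -2.5
```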
Lemma 3
\[
E_{V}[L^{t+i}(v^{t+i}, \psi^{t+i})] = E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]
\]
Proof. Using the definition of $V^{t+i}$ in equation 7 (main paper), the fact that $V$ is a joint variable of each $V^{t+i}$, and that at any iteration $t+i$ the choice of data point is completely random and independent of any other iteration, we have
\[
E_{V}[L^{t+i}(v^{t+i}, \psi^{t+i})] = E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]
\]
Lemma 4
\[
E_{V}\left[ \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} \cdot \frac{dL^{t+j}(v^{t+j}, \psi^{t+j})}{d\psi^{t+j}} \right]
= \frac{d\, E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} \cdot \frac{d\, E_{v^{t+j}}[L^{t+j}(v^{t+j}, \psi^{t+j})]}{d\psi^{t+j}}
\]
Proof. Two different data points picked at iterations $(t+i)$ and $(t+j)$ are independent of each other. Using this fact and the definition of the potential function $V$ in equation 7,
\[
\begin{aligned}
E_{V}\left[ \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} \cdot \frac{dL^{t+j}(v^{t+j}, \psi^{t+j})}{d\psi^{t+j}} \right]
&= E_{V}\left[ \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} \right] E_{V}\left[ \frac{dL^{t+j}(v^{t+j}, \psi^{t+j})}{d\psi^{t+j}} \right] \\
\text{using lemma 2} \quad
&= \frac{d\, E_{V}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} \cdot \frac{d\, E_{V}[L^{t+j}(v^{t+j}, \psi^{t+j})]}{d\psi^{t+j}} \\
\text{using lemma 3} \quad
&= \frac{d\, E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} \cdot \frac{d\, E_{v^{t+j}}[L^{t+j}(v^{t+j}, \psi^{t+j})]}{d\psi^{t+j}}
\end{aligned}
\]
Theorem 1 We define $\psi_*$ as the global optimum and $\Omega_0$ as the Hessian of the loss at $\psi_*$, i.e. $\Omega_0 = \frac{d^2 E[L(\psi_*)]}{d\psi_*^2}$ (assuming that $\psi$ is univariate); then
\[
\frac{d\, E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} = \Omega_0 (\psi_t - \psi_* + \delta_i) + O(\rho_t^2)
\]
where $O(\rho_t^2) = O(|\psi_{t+i} - \psi_*|^2)$, with the assumption that $O(\rho_{t+i})$ is small $\forall i \ge 0$, and $\delta_i = \psi_{t+i} - \psi_t$.
Proof. Let us define $\phi(\psi_{t+i}) = E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]$. Using Taylor's theorem and expanding around $\psi_*$,
\[
\begin{aligned}
\phi(\psi^{t+i}) &= \phi(\psi_*) + \frac{d\phi(\psi_*)}{d\psi_*}(\psi^{t+i} - \psi_*) + \frac{(\psi^{t+i} - \psi_*)^2}{2} \frac{d^2\phi(\psi_*)}{d\psi_*^2} + O((\psi^{t+i} - \psi_*)^3) \\
&= \phi(\psi_*) + \frac{(\psi^{t+i} - \psi_*)^2}{2} \frac{d^2\phi(\psi_*)}{d\psi_*^2} + O((\psi^{t+i} - \psi_*)^3)
\qquad \text{as } \frac{d\phi(\psi_*)}{d\psi_*} = 0 \text{ at the optimum} \\
\Rightarrow \frac{d\phi(\psi^{t+i})}{d\psi^{t+i}} &= (\psi^{t+i} - \psi_*) \frac{d^2\phi(\psi_*)}{d\psi_*^2} + O((\psi^{t+i} - \psi_*)^2) \\
\Rightarrow \frac{d\, E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} &= \Omega_0 (\psi^{t} - \psi_* + \delta_i) + O(\rho_t^2)
\end{aligned}
\]
where, with the assumption that $O(\rho_t)$ is small, we have $O(\rho_{t+i}^2) = O(\rho_t^2)$.
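For intuition, consider the special case of a quadratic expected loss (an illustrative example of ours, not from the paper), where the expansion in Theorem 1 is exact and the remainder vanishes. Take $\phi(\psi) = E_{v}[L(v, \psi)] = \tfrac{1}{2}(\psi - \psi_*)^2$, so $\Omega_0 = \phi''(\psi_*) = 1$ and
\[
\frac{d\phi(\psi^{t+i})}{d\psi^{t+i}} = \psi^{t+i} - \psi_*
= \Omega_0\big(\psi_t - \psi_* + \underbrace{(\psi_{t+i} - \psi_t)}_{\delta_i}\big),
\]
i.e. the statement of Theorem 1 holds with no $O(\rho_t^2)$ remainder.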
Theorem 2 With $\psi_*$ as defined in theorem 1 and assuming that $\psi$ is univariate, we have
\[
E_{v^{t+i}}\left[ \left( \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} \right)^2 \right] = \Omega_1 + O(E[O(\rho_t)]) + O(\rho_t^2)
\]
where $O(\rho_t^2)$ and $\delta_i$ are as defined in theorem 1 and $\Omega_1 = E_{v^{t+i}}\left[ \left( \frac{dL^{t+i}(v^{t+i}, \psi_*)}{d\psi_*} \right)^2 \right]$.
Proof. Expanding $L^{t+i}(v^{t+i}, \psi^{t+i})$ around $\psi_*$ using Taylor's theorem,
\[
L^{t+i}(v^{t+i}, \psi^{t+i}) = L^{t+i}(v^{t+i}, \psi_*) + \frac{dL^{t+i}(v^{t+i}, \psi_*)}{d\psi_*} (\psi^{t+i} - \psi_*)
+ \frac{1}{2} \frac{d^2 L^{t+i}(v^{t+i}, \psi_*)}{d\psi_*^2} (\psi^{t+i} - \psi_*)^2 + O((\psi^{t+i} - \psi_*)^3)
\]
\[
\Rightarrow \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} = \frac{dL^{t+i}(v^{t+i}, \psi_*)}{d\psi_*}
+ \frac{d^2 L^{t+i}(v^{t+i}, \psi_*)}{d\psi_*^2} (\psi^{t+i} - \psi_*) + O((\psi^{t+i} - \psi_*)^2)
\]
\[
\Rightarrow E_{v^{t+i}}\left[ \left( \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} \right)^2 \right]
= E_{v^{t+i}}\left[ \left( \frac{dL^{t+i}(v^{t+i}, \psi_*)}{d\psi_*} \right)^2
+ 2\, \frac{dL^{t+i}(v^{t+i}, \psi_*)}{d\psi_*} \frac{d^2 L^{t+i}(v^{t+i}, \psi_*)}{d\psi_*^2} (\psi^{t+i} - \psi_*)
+ O((\psi^{t+i} - \psi_*)^2) \right]
\]
\[
\Rightarrow E_{v^{t+i}}\left[ \left( \frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}} \right)^2 \right]
= \Omega_1 + O(E[(\psi_{t+i} - \psi_*)]) + O(\rho_t^2)
= \Omega_1 + O(E[O(\rho_t)]) + O(\rho_t^2)
\]
C.1  Within-block variance bound
Theorem 3 The variance of the parameter $\psi$ at the end of a sub-epoch $S$ in block $S_i$, which updated $n_i$ points as defined in equation 6 (main paper), is
\[
\mathrm{Var}(\psi^{t+n_i}) = \mathrm{Var}(\psi^{t}) - 2\eta_t n_i \Omega_0\, \mathrm{Var}(\psi^{t}) - 2\eta_t n_i \Omega_0\, \mathrm{CoVar}(\psi_t, \bar{\delta}_t) + \eta_t^2 n_i \Omega_1
+ \underbrace{O(\eta_t^2 \rho_t) + O(\eta_t \rho_t^2) + O(\eta_t^3) + O(\eta_t^2 \rho_t^2)}_{\Delta_t}
\]
Constants $\Omega_0$ and $\Omega_1$ are defined in theorems 1 and 2 respectively.
Proof. We start by analysing the $E_V[u(\psi^{(t+n_i)})]$ term from lemma 1:
\[
\begin{aligned}
E_V[u(\psi^{(t+n_i)})]
&= E_V\Big[u\Big(\psi^{t} + \underbrace{\Big(-\sum_{i=1}^{n_i} \eta_{t+i}\,\delta L^{t+i}(v^{t+i}, \psi^{t+i})\Big)}_{\nabla}\Big)\Big] \\
&= E_V\Big[u(\psi^{t}) + \frac{du(\psi^{t})}{d\psi^{t}}\,\nabla + \frac{1}{2}\frac{d^{2}u(\psi^{t})}{d(\psi^{t})^{2}}\,\nabla^{2} + O(\eta_t^{3})\Big] \\
&= u(\psi^{t}) - \eta_t\,\frac{du(\psi^{t})}{d\psi^{t}}\sum_{i=1}^{n_i} E_V\big[\delta L^{t+i}(v^{t+i}, \psi^{t+i})\big]
 + \eta_t^{2}\,\frac{1}{2}\frac{d^{2}u(\psi^{t})}{d(\psi^{t})^{2}}\,E_V\Big[\Big(\sum_{i=1}^{n_i}\delta L^{t+i}(v^{t+i}, \psi^{t+i})\Big)^{2}\Big] + O(\eta_t^{3}) \\
&\qquad \text{since } \eta_t = \eta_{t+i} \text{ within a block, and expanding } \nabla \\
&= u(\psi^{t}) - \eta_t\,\frac{du(\psi^{t})}{d\psi^{t}}\sum_{i=1}^{n_i}\frac{d\,E_V[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} \\
&\quad + \eta_t^{2}\,\frac{1}{2}\frac{d^{2}u(\psi^{t})}{d(\psi^{t})^{2}}\Big[\sum_{i=1}^{n_i} E_V\Big[\Big(\frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}}\Big)^{2}\Big]
 + \sum_{i \ne j} E_V\Big[\frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}}\Big]\,E_V\Big[\frac{dL^{t+j}(v^{t+j}, \psi^{t+j})}{d\psi^{t+j}}\Big]\Big] + O(\eta_t^{3}) \\
&\qquad \text{using Lemma 2 and the independence of points picked at different iterations} \\
&= u(\psi^{t}) - \eta_t\,\frac{du(\psi^{t})}{d\psi^{t}}\sum_{i=1}^{n_i}\frac{d\,E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} \\
&\quad + \eta_t^{2}\,\frac{1}{2}\frac{d^{2}u(\psi^{t})}{d(\psi^{t})^{2}}\Big[\sum_{i=1}^{n_i} E_{v^{t+i}}\Big[\Big(\frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}}\Big)^{2}\Big]
 + \sum_{i \ne j} \frac{d\,E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}}\cdot\frac{d\,E_{v^{t+j}}[L^{t+j}(v^{t+j}, \psi^{t+j})]}{d\psi^{t+j}}\Big] + O(\eta_t^{3}) \\
&\qquad \text{using Lemmas 3 and 4}
\end{aligned}
\tag{8}
\]
From equation 8 and lemma 1,
\[
\begin{aligned}
E_{\psi^{(t+n_i)}}[u(\psi^{(t+n_i)})] = E_{\psi^{(t)}}\bigg[ u(\psi^{t})
&- \eta_t\,\frac{du(\psi^{t})}{d\psi^{t}}\sum_{i=1}^{n_i}\frac{d\,E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}} \\
&+ \eta_t^{2}\,\frac{1}{2}\frac{d^{2}u(\psi^{t})}{d(\psi^{t})^{2}}\bigg(\sum_{i=1}^{n_i} E_{v^{t+i}}\Big[\Big(\frac{dL^{t+i}(v^{t+i}, \psi^{t+i})}{d\psi^{t+i}}\Big)^{2}\Big] \\
&\qquad + \sum_{i \ne j} \frac{d\,E_{v^{t+i}}[L^{t+i}(v^{t+i}, \psi^{t+i})]}{d\psi^{t+i}}\cdot\frac{d\,E_{v^{t+j}}[L^{t+j}(v^{t+j}, \psi^{t+j})]}{d\psi^{t+j}}\bigg) + O(\eta_t^{3})\bigg]
\end{aligned}
\tag{9}
\]
From the equation above, the variance of $\psi^{t+n_i}$ is
\[
\begin{aligned}
\mathrm{Var}(\psi^{t+n_i}) &= E_{\psi^{(t+n_i)}}[(\psi^{(t+n_i)})^{2}] - \left(E_{\psi^{(t+n_i)}}[\psi^{(t+n_i)}]\right)^{2} \\
&= E_{\psi^{t}}[(\psi^{t})^{2}] - \eta_t n_i\, E_{\psi^{t}}\big[2\psi^{t}\big(\Omega_0(\psi^{t} - \psi_* + \bar{\delta}_t) + O(\rho_t^{2})\big)\big]
\qquad \text{(using theorem 1 and defining } \bar{\delta}_t = \tfrac{1}{n_i}\textstyle\sum_{i=1}^{n_i}\delta_i\text{)} \\
&\quad + \eta_t^{2}\, E_{\psi^{t}}\Big[ n_i\big(\Omega_1 + O(E[\rho_t]) + O(\rho_t^{2})\big)
 + \sum_{i \ne j}\big(\Omega_0(\psi^{t+i} - \psi_*) + O(\rho_t^{2})\big)\big(\Omega_0(\psi^{t+j} - \psi_*) + O(\rho_t^{2})\big)\Big] \\
&\quad - \Big( E_{\psi^{t}}[\psi^{t}] - \eta_t n_i\, E_{\psi^{t}}\big[\Omega_0(\psi^{t} - \psi_* + \bar{\delta}_t) + O(|\psi^{t} - \psi_*|^{2})\big] \Big)^{2} \\
&= E_{\psi^{t}}[(\psi^{t})^{2}] - 2\Omega_0\eta_t n_i\, E_{\psi^{t}}[(\psi^{t})^{2}] + 2\Omega_0\eta_t n_i\psi_*\, E_{\psi^{t}}[\psi^{t}] - 2\Omega_0\eta_t n_i\, E_{\psi^{t}}[\psi^{t}\bar{\delta}_t] - O(\eta_t\rho_t^{2}) \\
&\quad + \eta_t^{2} n_i \Omega_1 + O(\eta_t^{2}\rho_t) + O(\eta_t^{2}\rho_t^{2}) + O(\eta_t^{2}\rho_t^{3}) + O(\eta_t^{2}\rho_t^{4}) \\
&\quad - \left(E_{\psi^{t}}[\psi^{t}]\right)^{2} + 2 n_i\eta_t\, E_{\psi^{t}}[\psi^{t}]\big(E_{\psi^{t}}[\Omega_0\psi^{t}] - \Omega_0\psi_* + E_{\psi^{t}}[\Omega_0\bar{\delta}_t] + O(\rho_t^{2})\big) - O(\eta_t^{2}\rho_t^{2}) + O(\eta_t^{3}) \\
&= \mathrm{Var}(\psi^{t}) - 2\eta_t n_i\Omega_0\, \mathrm{Var}(\psi^{t}) - 2\eta_t n_i\Omega_0\, \mathrm{CoVar}(\psi_t, \bar{\delta}_t) + \eta_t^{2} n_i \Omega_1
+ \underbrace{O(\eta_t^{2}\rho_t) + O(\eta_t\rho_t^{2}) + O(\eta_t^{3}) + O(\eta_t^{2}\rho_t^{2})}_{\Delta_t}
\end{aligned}
\tag{10}
\]
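The recursion in equation 10 can be read as a damped update of the parameter variance. The following small sketch (ours; it drops the CoVar(ψt, δ̄t) and Δt terms and uses arbitrary illustrative constants, chosen so that 2 ηt ni Ω0 < 1) iterates the leading terms Var(ψ^{t+ni}) ≈ (1 − 2 ηt ni Ω0) Var(ψ^t) + ηt² ni Ω1 under the schedule ηt = η0/(t+1), showing the variance contracting as the step size decays:

```python
eta0, n_i = 0.001, 50       # illustrative constants only (not from the paper)
omega0, omega1 = 1.0, 4.0   # stand-ins for the constants of theorems 1 and 2
var = 1.0                   # initial Var(psi^t)

for t in range(20):
    eta_t = eta0 / (t + 1)
    # Leading terms of equation 10, dropping CoVar and the Delta_t remainder:
    var = (1.0 - 2.0 * eta_t * n_i * omega0) * var + eta_t ** 2 * n_i * omega1
    print(f"epoch {t:2d}  eta_t = {eta_t:.5f}  Var = {var:.6f}")
```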