Large Scale SGD on Hadoop

SGD ON HADOOP FOR BIG DATA & HUGE MODELS
Alex Beutel
Based on work done with Abhimanu Kumar,
Vagelis Papalexakis, Partha Talukdar, Qirong Ho,
Christos Faloutsos, and Eric Xing
Outline
1. When to use SGD for distributed learning
2. Optimization
• Review of DSGD
• SGD for Tensors
• SGD for ML models – topic modeling, dictionary learning, MMSB
3. Hadoop
• General algorithm
• Setting up the MapReduce body
• Reducer communication
• Distributed normalization
• “Always-On SGD” – How to deal with the straggler problem
4. Experiments
When distributed SGD is useful
• Collaborative Filtering: predict movie preferences (1 billion users on Facebook)
• Dictionary Learning: remove noise or missing pixels from images (300 million photos uploaded to Facebook per day)
• Tensor Decomposition: find communities in temporal graphs
• Topic Modeling: what are the topics of webpages, tweets, or status updates? (400 million tweets per day)
Gradient Descent
Stochastic Gradient Descent (SGD)
Full objective: y = (x − 4)² + (x − 5)²
Split it into one term per data point:
• z₁ = 4 gives the term y₁ = (x − 4)²
• z₂ = 5 gives the term y₂ = (x − 5)²
SGD picks one term at a time and takes a gradient step on it, instead of computing the gradient of the full sum.
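As a concrete illustration (my own sketch, not from the slides), here is minimal Python for SGD on this two-term toy objective; the step-size schedule is an arbitrary choice for the example.

```python
import random

# Toy objective: y = (x - 4)^2 + (x - 5)^2, written as a sum over "data points" z.
data = [4.0, 5.0]

def grad_term(x, z):
    # Gradient of the single term (x - z)^2 with respect to x.
    return 2.0 * (x - z)

x = 0.0
step = 0.1
for t in range(1, 201):
    z = random.choice(data)          # pick one data point at random
    x -= step * grad_term(x, z)      # SGD step on that single term
    step = 0.1 / (1.0 + 0.01 * t)    # slowly decrease the step size

print(x)  # hovers close to the minimizer x = 4.5
```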
DSGD for Matrices (Gemulla, 2011)
[Figure: the ratings matrix X (Users × Movies) factorized as X ≈ U V, with U (Users × Genres) and V (Genres × Movies)]
DSGD for Matrices (Gemulla, 2011)
[Figure: blocks of X that share no rows of U and no columns of V are independent, so they can be updated in parallel]
DSGD for Matrices (Gemulla, 2011)
Partition your data & model into d × d blocks.
With d = 3, this yields 3 strata.
Process the strata sequentially; within each stratum, process the d blocks in parallel (one way to enumerate the strata is sketched below).
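A minimal sketch (my own illustration, not the authors' code) of how the d strata of a matrix can be enumerated: in stratum s, block b covers row-block b and column-block (b + s) mod d, so no two blocks in a stratum share rows of U or columns of V.

```python
d = 3  # number of row and column blocks

def matrix_strata(d):
    """Yield, for each of the d strata, its list of (row_block, col_block) pairs."""
    for s in range(d):
        yield [(b, (b + s) % d) for b in range(d)]

for s, stratum in enumerate(matrix_strata(d)):
    # Within a stratum, every block has a distinct row block and a distinct
    # column block, so the blocks touch disjoint parts of U and V.
    print(f"stratum {s}: {stratum}")
```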
TENSORS
What is a tensor?
• Tensors are used for structured data with more than 2 dimensions
• Think of a 3-mode tensor as a 3-D matrix
For example, the fact "Derek Jeter plays baseball" is one entry in a Subject × Verb × Object tensor.
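To make that concrete, one common way to store such a tensor is as a sparse list of (subject, verb, object, value) entries; this is just an illustrative sketch, not the talk's data format.

```python
# A sparse Subject x Verb x Object tensor stored as coordinate entries,
# where each entry is (subject_id, verb_id, object_id, value).
entries = [
    (0, 1, 2, 1.0),   # e.g. ("Derek Jeter", "plays", "baseball")
    (3, 1, 4, 1.0),
]

# Index the entries for lookup; triples that never occur are implicitly 0.
index = {(i, j, k): v for (i, j, k, v) in entries}
print(index.get((0, 1, 2), 0.0))   # 1.0
print(index.get((0, 1, 4), 0.0))   # 0.0
```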
Tensor Decomposition
[Figure: the tensor X factorized as X ≈ U, V (and a third factor W), one factor matrix per mode]
[Figure: some pairs of blocks of X are independent, but blocks that share rows of any factor matrix are not independent]
Tensor Decomposition
For d = 3 blocks per stratum, we require d² = 9 strata.
[Figure: the 9 strata, each consisting of 3 independent blocks Z1, Z2, Z3]
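A sketch (again my own illustration) of one standard way to enumerate the d² strata of a 3-mode tensor: stratum (a, b) assigns block i the index triple (i, (i + a) mod d, (i + b) mod d), so within a stratum all first, second, and third block indices are distinct.

```python
d = 3  # blocks per mode

def tensor_strata(d):
    """Yield the d*d strata; each stratum is a list of d independent block triples."""
    for a in range(d):
        for b in range(d):
            yield [(i, (i + a) % d, (i + b) % d) for i in range(d)]

strata = list(tensor_strata(d))
print(len(strata))            # 9 strata for d = 3
for stratum in strata[:2]:
    # No two triples in a stratum share a U, V, or W block index.
    print(stratum)
```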
Coupled Matrix + Tensor Decomposition
[Figure: a Subject × Verb × Object tensor X coupled with a matrix Y that shares one of the tensor's modes and has Documents as its other mode]
Coupled Matrix + Tensor Decomposition
[Figure: X and Y are factorized jointly: the tensor X uses factors U and V (and a third factor), the matrix Y uses factors U and A, so the factor U is shared]
Coupled Matrix + Tensor Decomposition
[Figure: both the tensor and the coupled matrix are partitioned into blocks Z1, Z2, Z3 that are processed together within each stratum]
CONSTRAINTS & PROJECTIONS
Example: Topic Modeling
[Figure: a Documents × Words matrix factorized into Documents × Topics and Topics × Words factors]
Constraints
• Sometimes we want to restrict the solution:
• Non-negative: U_{i,k} ≥ 0
• Sparsity: add ℓ1 penalties to the objective, min ‖X − U Vᵀ‖_F + λ_u ‖U‖₁ + λ_v ‖V‖₁
• Simplex (so vectors become probabilities): Σ_k U_{i,k} = 1
• Keep inside unit ball: Σ_k U_{i,k}² ≤ 1
How to enforce? Projections
• Example: Non-negative: after each SGD step, clip negative entries to zero, P(x)_k = max(x_k, 0)
More projections
• Sparsity (soft thresholding): P(x)_k = sign(x_k) · max(|x_k| − λ, 0)
• Simplex: P(x) = x / ‖x‖₁
• Unit ball: rescale so that ‖P(x)‖₂ ≤ 1, e.g. P(x) = x / max(1, ‖x‖₂)
(a small code sketch of these operators follows)
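Here is a short Python sketch of these projection operators, as I would write the standard versions; they are not necessarily the exact variants used in the talk.

```python
import numpy as np

def project_nonnegative(x):
    # Clip negative entries to zero.
    return np.maximum(x, 0.0)

def soft_threshold(x, lam):
    # Shrink every entry toward zero by lam; small entries become exactly 0.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def project_simplex_simple(x):
    # Simple renormalization onto the probability simplex:
    # clip to non-negative, then divide by the L1 norm.
    x = np.maximum(x, 0.0)
    s = x.sum()
    return x / s if s > 0 else np.full_like(x, 1.0 / len(x))

def project_unit_ball(x):
    # Rescale so the L2 norm is at most 1.
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

v = np.array([0.5, -0.2, 1.3])
print(project_nonnegative(v))
print(soft_threshold(v, 0.3))
print(project_simplex_simple(v))
print(project_unit_ball(v))
```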
Dictionary Learning
• Learn a dictionary of concepts and a sparse reconstruction
• Useful for fixing noise and missing pixels in images
• Constraints: the encoding is kept sparse, and the dictionary elements are kept within the unit ball
Mixed Membership Network Decomposition
• Used for modeling communities in graphs (e.g. a social network)
• Constraints: membership vectors lie on the simplex (so they are probabilities), and the parameters are non-negative
IMPLEMENTING ON HADOOP
High level algorithm
[Figure: Stratum 1, Stratum 2, Stratum 3, …, each made up of blocks Z1, Z2, Z3]
for epoch e = 1 … T do
    for subepoch s = 1 … d² do
        Let Z(s) = {Z1, Z2, …, Zd} be the set of blocks in stratum s
        for block b = 1 … d in parallel do
            Run SGD on all points in block Zb ∈ Z(s)
        end
    end
end
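A compact Python sketch of this driver loop (illustrative only: the real system runs the inner loop as parallel reducers, and run_sgd_on_block is a hypothetical helper).

```python
T = 10   # epochs
d = 3    # blocks per mode, giving d*d subepochs for a 3-mode tensor

def run_sgd_on_block(block_id, subepoch):
    # Hypothetical placeholder: run SGD over every point in this block,
    # updating only the factor blocks this block touches.
    pass

for epoch in range(T):
    for s in range(d * d):          # one subepoch per stratum
        for b in range(d):          # the d blocks in a stratum are independent,
            run_sgd_on_block(b, s)  # so in the real system they run in parallel
```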
Bad Hadoop Algorithm: Subepoch 1
[Figure: one MapReduce job per subepoch; mappers route blocks Z1(1), Z2(1), Z3(1) to three reducers, which run SGD and update the factor blocks (U2, V1, W3), (U3, V2, W1), (U1, V3, W2)]
Bad Hadoop Algorithm: Subepoch 2
[Figure: the next job routes blocks Z1(2), Z2(2), Z3(2) to the reducers, which run SGD and update (U2, V1, W2), (U3, V2, W3), (U1, V3, W1)]
Hadoop Challenges
• MapReduce is typically very bad for iterative algorithms
• T × d² iterations, each a separate Hadoop job with sizable overhead
• Little flexibility
High Level Algorithm
[Figure: across subepochs, each reducer keeps one block of the data and the factor blocks it needs, e.g. (U1, V1, W1), (U2, V2, W2), (U3, V3, W3) in the first subepoch, then (U1, V1, W3), (U2, V2, W1), (U3, V3, W2) in the next; the rotating factor blocks are passed between reducers]
Hadoop Algorithm
Each data point is p = {i, j, k, v}.
Mappers: process the points, mapping each point p to its block b and subepoch s, emitting {p, b, s} with the necessary info to order the blocks.
Partition & Sort: use a custom Partitioner, KeyComparator, and GroupingComparator so each reducer receives its blocks in the right order (the key logic is sketched below).
Reducers: receive the ordered blocks and run SGD.
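A sketch of the mapper-side logic in Python (my own illustration of the idea; the actual implementation would express this through Hadoop's Partitioner, key comparator, and grouping comparator in Java, and the block sizes here are made up).

```python
d = 3  # blocks per mode

def block_and_subepoch(i, j, k, rows_per_block, cols_per_block, fibers_per_block):
    """Map a tensor entry (i, j, k) to the reducer that owns its block and the
    subepoch in which that block is processed."""
    bi = i // rows_per_block
    bj = j // cols_per_block
    bk = k // fibers_per_block
    # With the stratum construction (bi, (bi + a) % d, (bi + b) % d), the
    # offsets a and b of this block determine its subepoch.
    a = (bj - bi) % d
    b = (bk - bi) % d
    return bi, a * d + b

def map_point(i, j, k, v):
    reducer, subepoch = block_and_subepoch(i, j, k, 100, 100, 100)
    # Composite key: partition by reducer id, sort by subepoch within a reducer.
    key = (reducer, subepoch)
    value = (i, j, k, v)
    return key, value

print(map_point(5, 117, 230, 3.5))   # -> ((0, 5), (5, 117, 230, 3.5))
```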
Hadoop Algorithm
[Figure: each reducer receives its blocks as an ordered stream, e.g. reducer 1 gets Z1(1), Z1(2), Z1(3), Z1(4), …]
[Figure: in the first subepoch, reducer b runs SGD on block Zb(1) and updates its factor blocks (U1, V1, W1), (U2, V2, W2), (U3, V3, W3)]
[Figure: each reducer then moves on to its next block Zb(2) and updates the corresponding factor blocks]
[Figure: after finishing a block, each reducer writes its updated factor blocks to HDFS and reads from HDFS the factor blocks it needs for the next block]
Hadoop Summary
1. Use mappers to send data points to the correct
reducers in order
2. Use reducers as machines in a normal cluster
3. Use HDFS as the communication channel between
reducers
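A sketch of the reducer-side loop implied by these three points (illustrative only; read_factors_from_hdfs, write_factors_to_hdfs, and sgd_pass are hypothetical stand-ins, not the talk's actual code).

```python
def read_factors_from_hdfs(reducer_id, subepoch):
    # Stub: in the real system this reads the needed factor blocks from
    # HDFS paths that all reducers agree on.
    return {}

def write_factors_to_hdfs(reducer_id, subepoch, factors):
    # Stub: write the updated factor blocks back to HDFS for the other reducers.
    pass

def sgd_pass(block, factors):
    # Stub: one SGD sweep over the points in this block.
    return factors

def reducer(block_stream, reducer_id):
    """block_stream yields this reducer's blocks in subepoch order."""
    for subepoch, block in enumerate(block_stream):
        factors = read_factors_from_hdfs(reducer_id, subepoch)   # get current params
        factors = sgd_pass(block, factors)                        # update on this block
        write_factors_to_hdfs(reducer_id, subepoch, factors)      # publish updates

reducer(iter([[(0, 1, 2, 1.0)], [(3, 1, 4, 1.0)]]), reducer_id=0)
```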
Distributed Normalization
[Figure: the topic modeling factors, a Documents × Topics matrix π and a Topics × Words matrix β, split across machines into blocks (π1, β1), (π2, β2), (π3, β3)]
Distributed Normalization
Each machine b calculates a local sum σ(b), a k-dimensional vector summing its terms of β_b:
    σ_k(b) = Σ_{j ∈ block b} β_{j,k}
Transfer σ(b) to all machines and sum:
    σ = Σ_{b=1…d} σ(b)
Normalize:
    β_{j,k} ← β_{j,k} / σ_k
[Figure: the machines holding (π1, β1), (π2, β2), (π3, β3) exchange their local sums σ(1), σ(2), σ(3)]
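A small Python sketch of these three steps (illustrative: here the "transfer" is simulated by summing local vectors in one process, whereas the real system exchanges the σ(b) vectors between machines).

```python
import numpy as np

np.random.seed(0)
num_topics = 4
# Machine b holds a block of beta: rows are its share of the words, columns are topics.
beta_blocks = [np.random.rand(5, num_topics) for _ in range(3)]

# Step 1: each machine computes its local per-topic sum sigma^(b).
local_sigmas = [block.sum(axis=0) for block in beta_blocks]

# Step 2: the sigma^(b) vectors are exchanged and summed to get the global sigma.
sigma = np.sum(local_sigmas, axis=0)

# Step 3: every machine normalizes its own block with the global sigma.
beta_blocks = [block / sigma for block in beta_blocks]

# Check: each topic column now sums to 1 across all blocks.
print(sum(block.sum(axis=0) for block in beta_blocks))
```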
Barriers & Stragglers
[Figure: the same mapper/reducer pipeline as before, but one reducer is still running SGD on its block while the others have already finished and written their parameters to HDFS; they waste time waiting at the synchronization barrier]
Solution: “Always-On SGD”
For each reducer:
1. Run SGD on all points in the current block Z.
2. Check if the other reducers are ready to sync.
3. If they are not ready to sync, do not wait: shuffle the points in Z, decrease the step size, and run SGD on the points in Z again, then check again.
4. Once everyone is ready, sync parameters and get a new block Z.
(a sketch of this loop follows)
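A minimal sketch of that loop in Python (my own illustration; others_ready_to_sync, sync_and_get_next_block, and sgd_pass are stubs standing in for the real HDFS-based coordination).

```python
import random

def sgd_pass(points, params, step_size):
    # Stub: one SGD sweep over `points` with the given step size.
    return params

def others_ready_to_sync():
    # Stub: in the real system this checks whether the other reducers have
    # written their parameters for this subepoch to HDFS yet.
    return random.random() < 0.3

def sync_and_get_next_block(params):
    # Stub: exchange parameters via HDFS and receive the next block of points.
    return params, []

def always_on_reducer(block, params, step_size):
    params = sgd_pass(block, params, step_size)         # first pass over the block
    while not others_ready_to_sync():
        random.shuffle(block)                           # reshuffle the same points
        step_size *= 0.9                                # decrease the step size
        params = sgd_pass(block, params, step_size)     # extra pass instead of waiting
    return sync_and_get_next_block(params)

print(always_on_reducer([1, 2, 3], params={}, step_size=0.1))
```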
“Always-On SGD”
[Figure: the same reducer pipeline, but a reducer that finishes its block early runs SGD on its old points again instead of sitting idle until the others are ready to sync]
“Always-On SGD”
[Figure: timeline for Reducers 1 to 4 showing, for each reducer, the phases: read parameters from HDFS, first SGD pass of block Z, extra SGD updates while waiting, write parameters to HDFS]
EXPERIMENTS
FlexiFaCT (Tensor Decomposition)
Convergence
FlexiFaCT (Tensor Decomposition)
Scalability in Data Size
FlexiFaCT (Tensor Decomposition)
Scalability in Tensor Dimension
Handles up to 2 billion parameters!
FlexiFaCT (Tensor Decomposition)
Scalability in Rank of Decomposition
Handles up to 4 billion parameters!
FlexiFaCT (Tensor Decomposition)
Scalability in Number of Machines
Fugue (Using “Always-On SGD”)
Dictionary Learning: Convergence
Fugue (Using “Always-On SGD”)
Community Detection: Convergence
Fugue (Using “Always-On SGD”)
Topic Modeling: Convergence
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability in Data Size
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability in Rank
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability over Machines
Fugue (Using “Always-On SGD”)
Topic Modeling: Number of Machines
Key Points
• Flexible method for tensors & ML models
• Can use stock Hadoop by using HDFS for communication between reducers
• When waiting for slower machines, run updates on old data again
Questions?
Alex Beutel
[email protected]
http://alexbeutel.com