
Incremental Parallel and Distributed Systems
Pramod Bhatotia
MPI-SWS & Saarland University
April 2015
Collaborators: Alexander, Akshat, Ekin, Björn, Pedro, Flavio, Rodrigo, Rafael, Umut
Common workflow: rerun the same application with evolving input
• Scientific computing
• Data analytics
• Reactive systems

Goal: efficient execution of apps in successive runs
Small input change → small work to update the output
Existing approaches
• Re-compute from scratch: (+) easy to design, (−) inefficient
• Design application-specific algorithms: (+) efficient, (−) hard to design
• Incremental computation: automatic + efficient
My thesis research
To enable transparent, practical, and efficient incremental computation in real-world parallel and distributed systems.
My projects @ MPI-SWS
• Incoop: incremental batch processing [HotCloud’11, SoCC’11]
• Shredder: incremental storage [FAST’12]
• iThreads: incremental multithreading [ASPLOS’15]
• Slider: incremental stream processing [Middleware’14, Hadoop Summit’15]
• Conductor*: incremental job deployment [NSDI’12, LADIS’10, PODC’10]
* 2nd author
Incremental Multithreading
[email protected]
Joint work with: Pedro Fonseca & Björn Brandenburg (MPI-SWS), Rodrigo Rodrigues (Nova University of Lisbon), and Umut Acar (CMU)
Auto “Incrementalization”
• Well-studied topic in the PL community
• Compiler- and language-based approaches
• But these approaches target sequential programs!
Parallelism
Proposals for parallel incremental computation:
• Hammer et al. [DAMP’07]
• Burckhardt et al. [OOPSLA’11]
Limitations:
• Require a new language with special data types
• Restrict the programming model to strict fork-join
Existing multi-threaded programs are not supported!
Design goals
1. Transparency: target unmodified pthreads-based programs
2. Practicality: support the full range of synchronization primitives
3. Efficiency: design a parallel infrastructure for incremental computation
iThreads

$ LD_PRELOAD="iThreads.so"
$ ./myProgram <input-file>            (initial run)
$ echo "<offset> <len>" >> changes.txt
$ ./myProgram <input-file>            (incremental run)

Speedups w.r.t. pthreads of up to 8X
Outline
• Motivation
• Design
• Evaluation
Behind the scenes
• Step #1: divide the computation into sub-computations (initial run)
• Step #2: build the dependence graph (initial run)
• Step #3: perform change propagation (incremental run)
A simple example
shared variables: x, y, and z

thread-1:
    lock();
    z = ++y;
    unlock();

thread-2:
    lock();
    x++;
    unlock();
    local_var++;
    lock();
    y = 2*x + z;
    unlock();
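For concreteness, here is a compilable pthreads version of this example; the single global mutex, the main() driver, and the final printf are additions for the sketch, not part of the slide:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static int x, y, z;   /* shared variables */

    static void *thread1(void *arg) {
        (void)arg;
        pthread_mutex_lock(&m);
        z = ++y;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    static void *thread2(void *arg) {
        int local_var = 0;   /* thread-local state, not shared */
        (void)arg;
        pthread_mutex_lock(&m);
        x++;
        pthread_mutex_unlock(&m);
        local_var++;         /* work with no synchronization */
        pthread_mutex_lock(&m);
        y = 2 * x + z;
        pthread_mutex_unlock(&m);
        (void)local_var;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x=%d y=%d z=%d\n", x, y, z);
        return 0;
    }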
Step #1: Divide the computation into sub-computations
How do we divide a computation into sub-computations?
Sub-computations
(thread-1 and thread-2 as in the example above)
Two candidate granularities:
• Single instruction? “Fine-grained”: requires tracking individual load/store instructions.
• Entire thread? “Coarse-grained”: a small change implies recomputing the entire thread.
Sub-computation granularity
• Entire thread: too coarse-grained.
• Single instruction: too expensive.
iThreads uses the Release Consistency (RC) memory model to define the granularity of a sub-computation:
a sub-computation is a sequence of instructions between pthreads synchronization points.
Sub-computations
thread-1: a single sub-computation:
    lock(); z = ++y; unlock();
thread-2: sub-computations delimited by its synchronization points:
    lock(); x++; unlock();
    local_var++;
    lock(); y = 2*x + z; unlock();
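As a sketch of that rule, thread-2’s body splits at each pthreads synchronization call (reusing the shared variables and mutex from the earlier sketch; where exactly each boundary falls is my reading of the RC-based rule):

    void *thread2_split(void *arg) {
        int local_var = 0;
        (void)arg;
        /* --- sub-computation A --- */
        pthread_mutex_lock(&m);
        x++;
        pthread_mutex_unlock(&m);   /* sync point: boundary */
        /* --- sub-computation B --- */
        local_var++;
        pthread_mutex_lock(&m);     /* sync point: boundary */
        /* --- sub-computation C --- */
        y = 2 * x + z;
        pthread_mutex_unlock(&m);
        (void)local_var;
        return NULL;
    }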
Step #2: Build the dependence graph
How do we build the dependence graph?
Example: Changed schedule
shared variables: x, y, and z
(1): Record happens-before dependencies (the interleaving differs under a different schedule).

thread-1:
    lock(); z = ++y; unlock();          Read={y}, Write={y,z}
thread-2:
    lock(); x++; unlock();              Read={x}, Write={x}
    local_var++;                        Read={local_var}, Write={local_var}
    lock(); y = 2*x + z; unlock();      Read={x,z}, Write={y}
Example: Same schedule
shared variables: x, y, and z

thread-1:
    lock(); z = ++y; unlock();          Read={y}, Write={y,z}
thread-2:
    lock(); x++; unlock();              Read={x}, Write={x}
    local_var++;                        Read={local_var}, Write={local_var}
    lock(); y = 2*x + z; unlock();      Read={x,z}, Write={y}
Example: Changed input
shared variables: x, y, and z
(2): Record data dependencies (with a different input, thread-1 writes new values y’ and z’, which thread-2’s last sub-computation reads).

thread-1:
    lock(); z = ++y; unlock();          Read={y}, Write={y,z}
thread-2:
    lock(); x++; unlock();              Read={x}, Write={x}
    local_var++;                        Read={local_var}, Write={local_var}
    lock(); y = 2*x + z; unlock();      Read={x,z}, Write={y}
Dependence graph
Vertices: sub-computations
Edges:
• Happens-before order b/w sub-computations
  • Intra-thread: totally ordered based on execution order
  • Inter-thread: partially ordered based on synchronization primitives
• Data dependency b/w sub-computations:
  • the two can be ordered based on happens-before, and
  • the earlier sub-computation writes the data that the later one reads
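As a rough data-structure sketch (illustrative field names, not iThreads’ actual internals), each graph node could record:

    #include <stddef.h>

    /* One dependence-graph node per sub-computation. */
    typedef struct subcomp_node {
        int thread_id;                  /* owning thread */
        int seq_no;                     /* intra-thread position (total order) */
        const void **read_set;          /* addresses read while recording */
        size_t n_reads;
        const void **write_set;         /* addresses written while recording */
        size_t n_writes;
        struct subcomp_node **hb_preds; /* inter-thread happens-before predecessors,
                                           established at synchronization points */
        size_t n_hb_preds;
    } subcomp_node_t;

    /* A data-dependency edge runs from node A to node B when A happens-before B
       and A's write_set intersects B's read_set. */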
Step #3: Perform change propagation
How do we perform change propagation?
Change propagation
Dirty set ← {changed input}
For each sub-computation in a thread, visited in the recorded happens-before order:
If (Read set ∩ Dirty set ≠ ∅)
– Re-compute the sub-computation
– Add its write set to the dirty set
Else
– Skip execution of the sub-computation
– Apply the memoized effects of the sub-computation
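The loop above maps directly onto code. Below is a minimal C sketch under stated assumptions: addrset_t, subcomp_t, recompute, and apply_memoized are hypothetical stand-ins for iThreads’ internal machinery, not its real API.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { const void *addrs[64]; size_t n; } addrset_t;

    typedef struct subcomp {
        addrset_t reads, writes;                  /* recorded read/write sets */
        void (*recompute)(struct subcomp *);      /* re-executes the sub-computation */
        void (*apply_memoized)(struct subcomp *); /* replays its recorded writes */
    } subcomp_t;

    static bool intersects(const addrset_t *a, const addrset_t *b) {
        for (size_t i = 0; i < a->n; i++)
            for (size_t j = 0; j < b->n; j++)
                if (a->addrs[i] == b->addrs[j])
                    return true;
        return false;
    }

    static void add_all(addrset_t *dst, const addrset_t *src) {
        for (size_t i = 0; i < src->n && dst->n < 64; i++)
            dst->addrs[dst->n++] = src->addrs[i]; /* may duplicate; fine for a sketch */
    }

    /* Walk one thread's sub-computations in recorded happens-before order. */
    void propagate(subcomp_t *subs, size_t n, addrset_t *dirty) {
        for (size_t i = 0; i < n; i++) {
            if (intersects(&subs[i].reads, dirty)) {
                subs[i].recompute(&subs[i]);      /* reads changed data: re-run */
                add_all(dirty, &subs[i].writes);  /* its outputs become dirty */
            } else {
                subs[i].apply_memoized(&subs[i]); /* unaffected: replay cached effects */
            }
        }
    }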
Outline
• Motivation
• Design
• Evaluation
Evaluation
• Evaluating iThreads:
  1. Speedups for the incremental run
  2. Overheads for the initial run
• Implementation (for Linux): a 32-bit dynamically linkable shared library
• Platform: a 12-core Intel Xeon machine running Linux
1. Speedup for the incremental run
[Figure: time speedup over pthreads (log scale; higher is better) with 12, 24, 48, and 64 threads, across Histogram, Linear_reg, Kmeans, Matrix_mul, Swaptions, Blackscholes, String_match, PCA, Canneal, Word_count, and RevIndex]
Speedups vs. pthreads of up to 8X
Space overheads
Memoization overhead: iThreads writes a lot of intermediate state.
[Figure: space overheads w.r.t. input size (log scale) across the same benchmarks]
2. Overheads for the initial run
[Figure: runtime overhead vs. pthreads (log scale; lower is better) with 12, 24, 48, and 64 threads, across the same benchmarks]
Summary
A case for parallel incremental computation
iThreads: Incremental multithreading
• Transparent: Targets unmodified programs
• Practical: Supports the full range of sync primitives
• Efficient: Employs parallelism for change propagation
Usage: A dynamically linkable shared library
Incremental Systems
Transparent + Practical + Efficient
[email protected]
Thank you all!