Incremental Parallel and Distributed Systems
Pramod Bhatotia
MPI-SWS & Saarland University
April 2015

Collaborators: Alexander, Akshat, Ekin, Björn, Pedro, Flavio, Rodrigo, Rafael, Umut

Common workflow: rerun the same application with evolving input
• Scientific computing
• Data analytics
• Reactive systems
Goal: efficient execution of apps in successive runs
• A small input change should require only a small amount of work to update the output

Existing approaches
• Re-compute from scratch: (+) easy to design, (-) inefficient
• Design application-specific algorithms: (+) efficient, (-) hard to design
• Incremental computation: automatic + efficient

My thesis research
To enable transparent, practical, and efficient incremental computation in real-world parallel and distributed systems

My projects @ MPI-SWS
• Incoop: incremental batch processing [HotCloud'11, SoCC'11]
• Shredder: incremental storage [FAST'12]
• iThreads: incremental multithreading [ASPLOS'15]
• Slider: incremental stream processing [Middleware'14, Hadoop Summit'15]
• Conductor*: incremental job deployment [NSDI'12, LADIS'10, PODC'10]
  (* 2nd author)

Incremental Multithreading
[email protected]
Joint work with: Pedro Fonseca & Björn Brandenburg (MPI-SWS), Rodrigo Rodrigues (Nova University of Lisbon), Umut Acar (CMU)

Auto "incrementalization"
• Well-studied topic in the PL community
• Compiler- and language-based approaches
• Target sequential programs!

Parallelism
Proposals for parallel incremental computation:
• Hammer et al. [DAMP'07]
• Burckhardt et al. [OOPSLA'11]
Limitations:
• Require a new language with special data types
• Restrict the programming model to strict fork-join
Existing multithreaded programs are not supported!

Design goals
1. Transparency: target unmodified pthreads-based programs
2. Practicality: support the full range of synchronization primitives
3. Efficiency: design a parallel infrastructure for incremental computation

iThreads
$ LD_PRELOAD="iThreads.so"
$ ./myProgram <input-file>                 (initial run)
$ echo "<offset> <len>" >> changes.txt
$ ./myProgram <input-file>                 (incremental run)
Speedups w.r.t. pthreads: up to 8X

Outline
• Motivation
• Design
• Evaluation

Behind the scenes
Initial run:
• Step #1: divide the computation into sub-computations
• Step #2: build the dependence graph
Incremental run:
• Step #3: perform change propagation

A simple example
Shared variables: x, y, and z

thread-1:
  lock(); z = ++y; unlock();

thread-2:
  lock(); x++; unlock();
  local_var++;
  lock(); y = 2*x + z; unlock();

Step #1: Divide the computation into sub-computations
How do we divide a computation into sub-computations?

Sub-computations
• A single instruction? "Fine-grained": requires tracking individual load/store instructions
• An entire thread? "Coarse-grained": a small change implies recomputing the entire thread

Sub-computation granularity
Between the two extremes (an entire thread is too coarse-grained, a single instruction is too expensive to track), iThreads uses the Release Consistency (RC) memory model to define the granularity of a sub-computation:
a sub-computation is a sequence of instructions between pthreads synchronization points.

Sub-computations in the example
• thread-1: one sub-computation: lock(); z = ++y; unlock();
• thread-2: three sub-computations: lock(); x++; unlock(); then local_var++; then lock(); y = 2*x + z; unlock();
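To make the running example concrete, here is a minimal sketch of it as a compilable pthreads program. The thread bodies follow the pseudocode above; everything else (the mutex m, main(), the initial values, and the final printf) is an assumption added for illustration, and the comments mark where the RC-based sub-computation boundaries fall.

/* Minimal pthreads sketch of the running example.
 * Thread bodies follow the slide pseudocode; mutex, main(), and
 * initial values are assumptions for illustration only. */
#include <pthread.h>
#include <stdio.h>

static int x = 1, y = 2, z = 3;   /* shared variables (initial values assumed) */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg) {
    (void)arg;
    /* thread-1, sub-computation 1: reads y, writes y and z */
    pthread_mutex_lock(&m);
    z = ++y;
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    int local_var = 0;            /* thread-local, not shared state */

    /* thread-2, sub-computation 1: reads and writes x */
    pthread_mutex_lock(&m);
    x++;
    pthread_mutex_unlock(&m);

    /* thread-2, sub-computation 2: touches only local_var */
    local_var++;

    /* thread-2, sub-computation 3: reads x and z, writes y */
    pthread_mutex_lock(&m);
    y = 2 * x + z;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%d y=%d z=%d\n", x, y, z);
    return 0;
}

Because every pthreads synchronization call ends one sub-computation and begins the next, thread-1 contributes a single sub-computation and thread-2 contributes three, exactly as listed above.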
Step #2: Build the dependence graph
How do we build the dependence graph?

Example: Changed schedule
The read and write sets recorded for each sub-computation in the example:
• thread-1: lock(); z = ++y; unlock();         Read = {y}, Write = {y, z}
• thread-2: lock(); x++; unlock();             Read = {x}, Write = {x}
• thread-2: local_var++;                       Read = {local_var}, Write = {local_var}
• thread-2: lock(); y = 2*x + z; unlock();     Read = {x, z}, Write = {y}
If the incremental run interleaves these sub-computations in a different schedule than the initial run, the results recorded for them can no longer simply be replayed.
(1): Record happens-before dependencies between sub-computations.

Example: Same schedule
When the incremental run follows the same schedule as the initial run, the recorded happens-before order still holds and the recorded read/write sets remain valid.

Example: Changed input
When the input changes (y becomes y'), thread-1's sub-computation must be re-executed because it reads y; its new write to z in turn forces re-execution of thread-2's last sub-computation, which reads z and writes y.
(2): Record data dependencies between sub-computations.

Dependence graph
Vertices: sub-computations
Edges:
• Happens-before order between sub-computations
  • Intra-thread: totally ordered, based on execution order
  • Inter-thread: partially ordered, based on synchronization primitives
• Data dependency between sub-computations
  • if they can be ordered by happens-before, and
  • the earlier sub-computation writes data that the later one reads

Step #3: Perform change propagation
How do we perform change propagation?

Change propagation
Dirty set := {changed input}
For each sub-computation in a thread, visited in the recorded happens-before order:
• If its read set intersects the dirty set:
  – re-compute the sub-computation
  – add its write set to the dirty set
• Else:
  – skip execution of the sub-computation
  – apply the memoized effects of the sub-computation
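To make this loop concrete, here is a minimal sketch in C. The representation is assumed for illustration only: sets of shared locations are modeled as bitmasks, and SubComputation, recompute(), and apply_memoized() are hypothetical stand-ins rather than iThreads APIs. The sketch also walks only a single thread's sub-computations and ignores inter-thread ordering.

/* Minimal sketch of change propagation over one thread's sub-computations,
 * visited in their recorded happens-before order. LocSet, SubComputation,
 * recompute(), and apply_memoized() are illustrative stand-ins. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t LocSet;                  /* bit i set => shared location i is in the set */

typedef struct {
    LocSet read_set;                      /* locations read in the initial run */
    LocSet write_set;                     /* locations written in the initial run */
    int    id;
} SubComputation;

static void recompute(SubComputation *sc)      { printf("re-compute sub-computation %d\n", sc->id); }
static void apply_memoized(SubComputation *sc) { printf("reuse memoized effects of %d\n", sc->id); }

/* dirty initially holds the locations touched by the input change */
static void propagate(SubComputation *subs, int n, LocSet dirty) {
    for (int i = 0; i < n; i++) {                 /* recorded happens-before order */
        if (subs[i].read_set & dirty) {           /* read set intersects dirty set */
            recompute(&subs[i]);
            dirty |= subs[i].write_set;           /* its outputs may have changed too */
        } else {
            apply_memoized(&subs[i]);             /* skip execution, reuse old effects */
        }
    }
}

int main(void) {
    enum { X = 1 << 0, Y = 1 << 1, Z = 1 << 2, LOCAL = 1 << 3 };
    /* thread-2's three sub-computations from the running example */
    SubComputation t2[] = {
        { .read_set = X,     .write_set = X,     .id = 1 },   /* x++         */
        { .read_set = LOCAL, .write_set = LOCAL, .id = 2 },   /* local_var++ */
        { .read_set = X | Z, .write_set = Y,     .id = 3 },   /* y = 2*x + z */
    };
    propagate(t2, 3, Z);   /* suppose the input change dirtied z */
    return 0;
}

Run with the dirty set {z}, the sketch re-executes only thread-2's final sub-computation and reuses the memoized effects of the other two, mirroring the changed-input example above.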
Outline
• Motivation
• Design
• Evaluation

Evaluation
Evaluating iThreads:
1. Speedups for the incremental run
2. Overheads for the initial run
Implementation (for Linux): 32-bit dynamically linkable shared library
Platform: Intel Xeon with 12 cores, running Linux

1. Speedup for the incremental run
[Figure: time speedup over pthreads (log scale) for 12, 24, 48, and 64 threads across the benchmarks Histogram, Linear_reg, Kmeans, Matrix_mul, Swaptions, Blackscholes, String_match, PCA, Canneal, Word_count, and RevIndex]
Speedups vs. pthreads: up to 8X

Space overheads
Memoization overhead: iThreads writes a lot of intermediate state
[Figure: space overheads w.r.t. input size (log scale) for the same benchmarks]

2. Overheads for the initial run
[Figure: runtime overheads vs. pthreads (log scale) for 12, 24, 48, and 64 threads across the same benchmarks]

Summary
A case for parallel incremental computation
iThreads: incremental multithreading
• Transparent: targets unmodified programs
• Practical: supports the full range of sync primitives
• Efficient: employs parallelism for change propagation
Usage: a dynamically linkable shared library

Incremental Systems: Transparent + Practical + Efficient
[email protected]
Thank you all!