Thomas J. Watson Research Center Modeling Optimistic Concurrency using Quantitative Dependence Analysis Christoph von Praun Rajesh Bordawekar Calin Cascaval PPoPP, February 2008 © 2007 IBM Corporation Motivation Which applications can benefit from optimistic concurrency? How can the performance benefit be quantified? 1) Parallelization of sequential code: optimistic concurrency may simplify the parallelization process. – How much parallelism can a straightforward parallelization achieve? – Given a set of tasks: How much do data dependences inhibit parallelism? – Which data structures have to be changed or protected? 2) Optimization of parallel codes that synchronize through locks: – Benefit of using optimistic vs. pessimistic concurrency control? Quantitative dependence analysis can help to answer these questions. PPoPP, February 2008 2 Corporation © 2004 IBM Contributions 1) Quantitative dependence analysis on a task-based program execution model. Key metrics are dependence density among tasks and (algorithmically) available parallelism. 2) Tool that extracts input data for the model from a singlethreaded program execution. 3) Case study on real applications using quantitative dependence analysis as a guideline for the use of optimistic concurrency. PPoPP, February 2008 3 Corporation © 2004 IBM Outline Motivation Model Experience Conclusions PPoPP, February 2008 4 Corporation © 2004 IBM Example (1/2) int filter(const double* arr, double* res, int len) { int i, j = 0; Intended double v; for (i = 0; i < arr_len; ++i) { v = f(arr[i]); if (v < THRESH) { result_arr[j] = v; ++j; } } return j; behavior: record all values v in arr where f(v) larger than THRESH in the result array res. Entries in res can occur in any order. } PPoPP, February 2008 5 Corporation © 2004 IBM Example (2/2) Loop parallelization with optimistic concurrency OpenTM [PACT07] notation int filter(const double* arr, double* res, int len) { int i, j = 0; double v; #pragma omp transfor private (i,v) for (i = 0; i < len; ++i) { v = f(arr[i]); if (v < THRESH) { res[j] = v; ++j; } } return j; } PPoPP, February 2008 Execution of method filter is a program phase. One loop iteration is a (speculative) task. Tasks may execute concurrently as unordered transactions. Memory-level dependencies among tasks are resolved through the runtime system. 6 Corporation © 2004 IBM Model of a program execution program program execution program execution schedule independent tasks, low dependence density critical sections medium dependence density high dependence density program phase PPoPP, February 2008 group of tasks corresponding to the same critical section task inter-task dependence execution context, thread 7 Corporation © 2004 IBM Preliminary definitions tasks: t,t1,t 2 ,... set of tasks: T all tasks in a program phase: T p length of a task: len(t) read _ set(t) : {l | t reads from location l and t has not previously written to write_ set(t) : {l | t writes to location l} flow_ dep(t1,t 2 ) : write_ set(t1) read _ set(t 2 ) l} for ordered tasks: pred(t) : {s | execution of task s must precede t} succ(t) : {s | execution of task s must follow t} PPoPP, February 2008 8 Corporation © 2004 IBM Data dependence has_ data_ dep(t1,t 2 ) : t1 t 2 ( flow _ dep(t1,t 2 ) flow _ dep(t 2 ,t1) ) Dependence density data_ dep _ dens(t) : len (s) s Tp has_ data t2 t1 _ dep(t,s) len (t) tTp {t} t3 t6 flow_dep data_ dep _dens(t) data_ dep _ dens(T ) : t5 t7 {t i | has_ data_ dep(t1,t i )} t T |T | Probabilty of flow dependence between t and some other randomly chosen task s TP in the same program phase. PPoPP, February 2008 t4 9 Corporation © 2004 IBM Causal dependence (simple variant: ‘seq’) has_ causal _ dep(t1,t 2 ) : (t 2 succ(t1)has_ causal _ chain(t1,t 2 )) (t 2 pred(t1) has_ causal _ chain(t 2 ,t1)) has_ causal _ chain(t1,t 2 ) : flow _ dep(t1,t 2 ) t 3 succ(t1) pred(t 2 ) : flow _ dep(t1,t 3 ) Dependence density causal _ dep _ dens(t) : len (s) sTp has_ t 2 dep(t,s)t 3 t1 causal_ len (t) t Tp {t} succ _ dens(t) causal _ depflow_dep causal _ dep _ dens(T ) : PPoPP, February 2008 t4 t5 t6 tT {t i| | has_ causal _ chain(t1,t i )} |T 10 Corporation © 2004 IBM Available parallelism n1 avail _ par(T p ,n) : (1 dep _ dens(T p )) k k 0 1 (1 dep _ dens(T p )) n dep _ dens(T p ) Example avail _dep par(T _ density 0.5 avail _ par(T p ,n) p ) : lim n avail _ par(T p ,4) 1 0.5 0.25 0.125 1 contribution of useful dep _ dens(T p ) work (fraction of a task) executed by thread k: increasing probablity of conflict with additional concurrent tasks diminishing return of additional execution units PPoPP, February 2008 11 Corporation © 2004 IBM Limitations This model is idealized ... considers only memory-level dependences ignores shortage or contention on other resources, e.g., execution threads, memory access path does not model TM contention management for unordered tasks: scheduler picks tasks at random PPoPP, February 2008 12 Corporation © 2004 IBM Outline Motivation Model Experience Conclusions PPoPP, February 2008 13 Corporation © 2004 IBM Methodology 1) Program annotation • Mark phase and task boundaries in the program code. 2) Recording • Dynamic binary instrumentation of a single-threaded program run • Monitor execution of phases and tasks (critical sections) • Sample 5% of the tasks: record addresses of shared (non-stack) memory accesses 3) Analysis • Compute probability of memory-level dependence among two randomly chosen tasks within a phase (word-granularity address disambiguation, higher granularity possible to model false sharing) • Compute dependence density and available parallelism PPoPP, February 2008 14 Corporation © 2004 IBM Results - Overview Unordered tasks program src coverage [%-phase] vacation-low vacation-high kmeans-low kmeans-high mysql-keycache client.c:170 client.c:170 normal.c:164 normal.c:164 mf_keycache.c:1808 mf_keycache.c:1863 90.8 86.7 1.8 3.5 3.3 3.5 data-dep density 0.0012 0.0026 0.0242 0.0497 0.2577 0.9946 avail-par 833 385 41 20 4 1 Ordered tasks program src umt2k snswp3d.c:357 PPoPP, February 2008 coverage [%-phase] 100 causal-depdensity seq (avg) 0.88-0.97 (0.91) causal-depdensity win250 (avg) 0.08–0.86 (0.31) 15 Corporation © 2004 IBM MySQL keycache (1/3) MySQL keycache – part of the MyISAM storage manager – caches index bocks for database tables that reside on disk – implementation: thread-safe data structure protected by a single lock. ATIS SQL benchmark (serial execution) 1. 2. 3. 4. create tables insert records retrieve data (select, join, key prefix join, distinct, group join): drop tables phase-A phase-B read-only database operation PPoPP, February 2008 16 Corporation © 2004 IBM MySQL keycache (2/3) Distribution of write-access probability in different critical sections: fraction of total written locations [%] Locations that are updated by almost every critical region. Potential scalability bottleneck for transactions. Reason for high data-dependence density. ... adds up to 100 16 14 Example: keycache->global_cache_write++; 12 10 8 6 4 2 0 <0.1 0.1-0.2 0.2-0.3 0.3-0.4 0.4-0.5 0.5-0.6 0.6-0.7 0.7-0.8 0.8-0.9 mf_keycache.c:1863 mf_keycache.c:2083 PPoPP, February 2008 mf_keycache.c:1808 mf_keycache.c:2284 >0.9 probability of access in a specific critical section. 17 Corporation © 2004 IBM MySQL keycache (3/3) Single lock is held during ~8% of the execution time of phase-B (retrieve) – 8% split in ~500k critical section instances Amdahl’s Law: 8% serial, 92 % of the workload are ‘perfectly‘ parallel – assume, e.g., 92 processors – maximum speedup: 100 / (1 + 8) = 11.1 – simplifying assumption: execution of 500k critical section instances smooth out such that threads are not waiting ... this ideal model makes too optimistic predictions. Lessons learned: – read-only database operation is internally r-w operation – keycache is an optimization designed for workloads with no or little concurrency – keycache may become scalability bottleneck PPoPP, February 2008 18 Corporation © 2004 IBM Umt2k Simulation of energy transport/propagation across an object Object represented as mesh (sparse, ‘unstructured’ data representation) Mesh traversersal is 50% of execution time start of traversal progress of traversal Graphic source: Wikipedia PPoPP, February 2008 19 Corporation © 2004 IBM Doacross loop: snswp3d.c 356-549 for (i = 1; i <= nelem; ++i) { ... /* Flux calculation */ if (afezm > zero) { iexit = ixfez; for (ip = 1; ip <= npart; ++ip) { loop carried dependence – prevents doall parallelization Potential point of conflict for speculative parallelization (in this case, dependence occurs in almost every iteration) /* Compute sources */ sigvx = sigvol_ref(ip, ix); ... /* Calculate average angular flux in each tet (PSIT) */ psit_ref(ip) = stet + ybase * psi_inc_ref(ip, ix); psifez = sfez + xbase * psi_inc_ref(ip, ix); tpsic_ref(ip, ic) = tpsic_ref(ip, ic) + tetwtx * psit_ref(ip); psi_inc_ref(ip, iexit) = psi_inc_ref(ip, iexit) + two * afezm * psifez; ... ix = next_ref(i + 1); ixfez = konnect_ref(3, ix); ... reduction } PPoPP, February 2008 20 Corporation © 2004 IBM RAW dependence distance in snswp3d.c 356-549 dependence distance 1: 98661 (37.2%), 2: 12173 (45.26%) count 5000 ... 4500 4000 3500 3000 2500 2000 1500 1000 500 ... 0 1 11 21 31 41 51 61 71 81 91 101 Loop is doacross at runtime; 265K iterations, avg. dependence distance: ~12.5 PPoPP, February 2008 111 121 dependence distance [# iterations] 21 Corporation © 2004 IBM Causal dependence density precise: scheduler considers window of 250 consecutive tasks for execution PPoPP, February 2008 22 Corporation © 2004 IBM Algorithmic considerations Iteration space is an unstructured graph Iteration order is a linearization of that graph: topological sort. Computation of topo-sort is ~15% of the total umt2k runtime Any topological sort is good (i.e. respects algorithmic dependences) – the one chosen is good for uniprocessor cache locality – choose a different one that allows larger speculation window PPoPP, February 2008 © 2004 IBM Corporation Scheduling experiment (1/2) Schedule iterations into buckets with k-wide execution buckets: topo-sort k-wide buckets 1 2 3 4 ... Algorithm from the tail of the topo-sort: compute closest algorithmic dependences (RAW) from the head of the topo-sort: hoist iteration into bucket that – has a free slot – follows closest the bucket that holds the iteration on which it is dependent PPoPP, February 2008 24 Corporation © 2004 IBM Scheduling experiment (2/2) Iterations in each bucket can execute in parallel (transactions required to account for occasional write-write dependence on reduction variables) Buckets execute in series How well did the buckets fill up...? Bucket fill histogram with k=32: percent of total buckets [%] 98.2 0.5 0.4 0.3 0.2 0.1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 Buckets fill up very well. Possible parallelism with k=32: 30.96 Conclusion: turned type “doacross” loop into “mostly doall” loop. PPoPP, February 2008 bucket fill 25 Corporation © 2004 IBM Outline Motivation Model Experience Conclusions PPoPP, February 2008 26 Corporation © 2004 IBM Concluding remarks We designed a simple execution model and analysis that estimates the performance potential of optimistic concurrency from a profiled, single threaded program run. Key metrics: dependence density, available parallelism. Metrics capture application properties and abstract from runtime implementation artifacts. Methodology and tool proved useful in a project on transactional synchronization: Goal was to identify and quantify opportunities for optimistic concurrency in today’s programs. no or little effort to adapt applications no architecture simulation or STM overheads quick turnaround time PPoPP, February 2008 27 Corporation © 2004 IBM [email protected] PPoPP, February 2008 28 Corporation © 2004 IBM
© Copyright 2025 Paperzz