Parallelization Strategies
Laxmikant Kale

Overview
• OpenMP strategies
• Need for adaptive strategies
  – Object-migration-based dynamic load balancing
  – Minimal-modification strategies
• Thread-based techniques: ROCFLO, ...
• Some future plans

OpenMP
• Motivation:
  – The shared-memory model is often easy to program
  – Incremental optimization is possible

ROCFLO via OpenMP
• Parallelization of ROCFLO using a loop-parallel paradigm via OpenMP
  – Poor speedup compared with the MPI version
  – Was locality the culprit?
• Study conducted by Jay Hoeflinger
  – In collaboration with Fady Najjar

[Figure: ROCFLO with MPI speedup.]
[Figure: loop-parallel speedup for ROCFLO with OpenMP vs. perfect speedup, 1–16 processors.]

The Methodology
• Do OpenMP/MPI comparison experiments.
• Write an OpenMP version of ROCFLO:
  – Start with the MPI version of ROCFLO.
  – Duplicate the structure of the MPI code exactly (including the message-passing calls).
  – This removes locality as a problem.
• Measure performance:
  – If any parts do not scale well, determine why.

[Figure: speedup of the MPI-analog OpenMP ROCFLO vs. perfect speedup, 1–64 processors.]

Barrier Cost: MPI vs OpenMP (Origin 2000)

  Processors      2        4        8        16       32       64
  MPI (µs)        16.1     44.957   75.926   130.919  235.971  557.49
  OpenMP (µs)     5.2577   10.508   21.92    46.43    83.117   181.498
  MPI/OpenMP      3.06     4.28     3.46     2.82     2.84     3.07

A Comparison of ALLOCATE with MPI and OpenMP (times in seconds)

  Processors   MPI ALLOCATE   MPI computation   OpenMP ALLOCATE   OpenMP computation
       2          198.098        138.212           204.95             138.302
       4          139.292        133.024           208.375            138.341
       8          139.577        133.087           214.2              138.371
      16             —              —              223.671            138.283
      32          139.412        133.03            262.556            138.32
      64          139.8          133.02            396.008            138.322

  The OpenMP ALLOCATE loop grows from about 205 s on 2 processors to 396 s on 64, while the MPI ALLOCATE loop does not grow with processor count.

[Figures: I/O time vs. processors used (1–64), writing to stdout vs. separate files. OpenMP I/O time grows to several seconds at 64 processors, while MPI I/O time stays below about 0.05 s.]

Reasons for Speedup Loss

[Figure: measured speedup of the OpenMP code on 1–64 processors, compared against projected speedups with scalable I/O; with scalable I/O and ALLOCATE; with scalable I/O, ALLOCATE, and messaging infrastructure; and perfect speedup.]
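As an aside on the synchronization numbers: the barrier-cost table above can be reproduced with a microbenchmark along the following lines. This is a minimal sketch, assuming a hybrid MPI + OpenMP build (e.g., `mpicxx -fopenmp`); the iteration count and the averaging scheme are illustrative choices, not those of the original study.

```cpp
// barrier_cost.cpp -- minimal sketch of a barrier-cost microbenchmark of the
// kind behind the Origin 2000 table above (not the study's actual code).
// Build, e.g.:  mpicxx -fopenmp barrier_cost.cpp -o barrier_cost
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int iters = 10000;  // illustrative choice; averages out timer noise

    // Average cost of one MPI_Barrier across many repetitions.
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i)
        MPI_Barrier(MPI_COMM_WORLD);
    double mpi_us = (MPI_Wtime() - t0) / iters * 1e6;

    // Average cost of one OpenMP barrier. For an apples-to-apples number,
    // run this part as a single process with OMP_NUM_THREADS equal to the
    // MPI process count used above.
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int i = 0; i < iters; ++i) {
            #pragma omp barrier
        }
    }
    double omp_us = (omp_get_wtime() - t1) / iters * 1e6;

    if (rank == 0)
        printf("MPI barrier (%d procs): %.2f us; OpenMP barrier (%d threads): %.2f us\n",
               nprocs, mpi_us, omp_get_max_threads(), omp_us);

    MPI_Finalize();
    return 0;
}
```

In the table above, the MPI barrier costs roughly 3–4 times the OpenMP barrier at every processor count from 2 to 64.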
So Locality was not the whole problem!
• The other problems turned out to be:
  – I/O, which doesn't scale
  – ALLOCATE, which doesn't scale
  – our non-scaling reduction implementation
  – our first-cut messaging infrastructure, which could be improved
• Conclusion:
  – An efficient loop-parallel version may be feasible, avoiding ALLOCATEs and using scalable I/O

Need for adaptive strategies
• Computation structure changes over time:
  – Combustion
• Adaptive techniques in application codes:
  – Adaptive refinement in structures, or even fluids
  – Other codes, such as crack propagation
• These can affect the load balance dramatically:
  – One can go from 90% efficiency to less than 25%

Multi-partition decompositions
• Idea: decompose the problem into a number of partitions
  – independent of the number of processors
  – #partitions > #processors
• The system maps partitions to processors
  – The system should be able to map and re-map objects as needed

Load Balancing Framework
• Aimed at handling:
  – Continuous (slow) load variation
  – Abrupt load variation (refinement)
  – Workstation clusters in multi-user mode
• Measurement based:
  – Exploits the temporal persistence of computation and communication structures
  – Very accurate (compared with estimation)
  – Instrumentation possible via Charm++/Converse

Charm++
• A parallel C++ library:
  – supports data-driven objects
  – many objects per processor, with method execution scheduled by the availability of data
  – the system supports automatic instrumentation and object migration
  – works with other paradigms: MPI, OpenMP, ...

Load balancing framework
[Diagram of the load balancing framework.]

Load balancing demonstration
• To test the abilities of the framework:
  – A simple problem: Gauss-Jacobi iterations
  – Refine selected sub-domains
• AppSpector: a web-based tool to
  – submit parallel jobs
  – monitor performance and application behavior
  – interact with running jobs via GUI interfaces

Adaptivity with minimal modification
• The current code base is parallel (MPI)
  – but doesn't support adaptivity directly
  – Rewrite the code with objects? ...
• Idea: support adaptivity with minimal changes to F90/MPI codes
• Work by:
  – Milind Bhandarkar, Jay Hoeflinger, Eric de Sturler

Migratable threads approach
• Change required:
  – Encapsulate global variables in modules
    • Dynamically allocatable
• Intercept MPI calls
  – Implement them in a multithreaded layer
• Run each original MPI process as a thread
  – A user-level thread
• Migrate threads as needed by load balancing
  – A trickier problem than object migration

Progress
• Fortran-90 – C++ interface test
• Encapsulation feasibility
• Thread migration mechanics
• ROCFLO study
• Test code implementation
• ROCFLO implementation

Another approach to adaptivity
• Cleanly separate parallel and sequential code (see the sketch below):
  – All parallel code in C++
  – All application code in sequential Fortran 90 subroutines
• Needs more restructuring of application codes
  – But is feasible, especially for new codes
  – Much easier to migrate
  – Improves modularity
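To make the split concrete, here is a hypothetical sketch of a C++ driver that owns the partitions and all MPI calls while delegating the numerics to a sequential Fortran 90 subroutine. The routine name `rocflo_compute_flux`, the `Partition` type, the sizes, and the build line are illustrative assumptions, not ROCFLO's actual interface.

```cpp
// driver.cpp -- hypothetical sketch of the "parallel C++, sequential F90"
// split. Build, e.g.:  gfortran -c flux.f90 && mpicxx driver.cpp flux.o
#include <mpi.h>
#include <vector>

// Sequential Fortran 90 numerics, compiled separately. The trailing
// underscore follows a common F90/C name-mangling convention; a real code
// would use ISO_C_BINDING instead.
extern "C" void rocflo_compute_flux_(double* field, int* n);

// One partition of the decomposed domain. With many partitions per
// processor (#partitions > #processors), a load balancer can remap them
// without touching the numerics.
struct Partition {
    std::vector<double> field;
};

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Partitions assigned to this processor (assignment logic omitted).
    std::vector<Partition> mine(4, Partition{std::vector<double>(1024, 0.0)});

    for (Partition& p : mine) {
        int n = static_cast<int>(p.field.size());
        // All physics stays in sequential Fortran; no MPI below this call.
        rocflo_compute_flux_(p.field.data(), &n);
    }

    // ... exchange partition boundary data via MPI here ...

    MPI_Finalize();
    return 0;
}
```

Because the Fortran side never sees MPI, remapping a partition to another processor is just a data move; the sequential code is unaffected, which is what makes migration and modularity easier.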