Adaptive Parallelization Strategies

Laxmikant Kale
Overview
• OpenMP Strategies
• Need for adaptive strategies
– Object-migration-based dynamic load balancing
– Minimal-modification strategies
• Thread-based techniques: ROCFLO, ...
• Some future plans
OpenMP
• Motivation:
– Shared-memory model is often easy to program
– Incremental optimization is possible (see the example below)
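For example, the loop-parallel style can require as little as one directive added to an existing loop (a generic illustration, not ROCFLO code):

    // Incremental loop-level parallelization: the serial loop is unchanged
    // except for the directive above it (generic example, not from ROCFLO).
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

        #pragma omp parallel for        // the only change to the serial code
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + 0.5 * b[i];

        std::printf("c[0] = %f\n", c[0]);
        return 0;
    }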
ROCFLO via OpenMP
• Parallelization of ROCFLO using a loop-parallel paradigm via OpenMP
– Poor speedup compared with MPI version
– Was locality the culprit?
• Study conducted by Jay Hoeflinger
– In collaboration with Fady Najjar
ROCFLO with MPI
[Figure: ROCFLO with MPI, speedup vs. processors]
Loop-Parallel Speedup for ROCFLO with OpenMP
[Figure: OpenMP speedup vs. perfect speedup, 1-16 processors]
The Methodology
• Do OpenMP/MPI comparison experiments.
• Write an OpenMP version of ROCFLO:
– Start with the MPI version of ROCFLO.
– Duplicate the structure of the MPI code exactly, including the message-passing calls (see the sketch below).
– This removes locality as a problem.
• Measure performance:
– If any parts do not scale well, determine why.
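A minimal sketch of what an "MPI-analog" OpenMP structure can look like (illustration only, not ROCFLO code; names are invented): each OpenMP thread plays the role of an MPI rank, and messages become writes to a shared buffer separated by barriers.

    // "MPI-analog" OpenMP sketch (invented example): each thread acts as a
    // rank; a shared mailbox plus barriers stand in for message passing.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int nranks = omp_get_max_threads();
        std::vector<double> mailbox(nranks, 0.0);    // one slot per "rank"

        #pragma omp parallel num_threads(nranks)
        {
            int me    = omp_get_thread_num();
            int right = (me + 1) % nranks;

            double local = 10.0 * me;                // stand-in for local work

            mailbox[me] = local;                     // "send": post to the shared buffer
            #pragma omp barrier                      // acts as message delivery
            double recvd = mailbox[right];           // "receive" from the neighbor

            #pragma omp critical
            std::printf("rank %d received %f from rank %d\n", me, recvd, right);
        }
        return 0;
    }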
Speedup of MPI-analog OpenMP ROCFLO
[Figure: Speedup of the MPI-analog OpenMP code vs. perfect speedup, 1-64 processors]
Barrier Cost: MPI vs OpenMP (Origin 2000)

Barrier time (microseconds):

Processors          2         4         8        16        32        64
MPI              16.1    44.957    75.926   130.919   235.971    557.49
OpenMP         5.2577    10.508     21.92     46.43    83.117   181.498
MPI/OpenMP   3.062175  4.278359  3.463777  2.819707  2.839022  3.071604

[Figure: Barrier cost (microseconds) vs. number of processors, MPI and OpenMP]
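One plausible way such numbers are obtained (a sketch, assuming that averaging many back-to-back barriers is acceptable; not the original benchmark):

    // Sketch of how per-barrier cost might be measured (not the original
    // benchmark): time many consecutive barriers and report the average.
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        const int iters = 10000;

        // MPI barrier cost, averaged over many iterations.
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) MPI_Barrier(MPI_COMM_WORLD);
        double mpi_us = (MPI_Wtime() - t0) / iters * 1e6;

        // OpenMP barrier cost (includes the one-time cost of opening the region).
        double t1 = omp_get_wtime();
        #pragma omp parallel
        {
            for (int i = 0; i < iters; ++i) {
                #pragma omp barrier
            }
        }
        double omp_us = (omp_get_wtime() - t1) / iters * 1e6;

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            std::printf("avg barrier: MPI %.3f us, OpenMP %.3f us\n", mpi_us, omp_us);
        MPI_Finalize();
        return 0;
    }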
A Comparison of ALLOCATE with MPI and OpenMP

Time per loop (seconds) vs. number of processors:

Processors                    2        4        8       16       32       64
MPI ALLOCATE loop       198.098  139.292  139.577      n/a  139.412    139.8
MPI computation loop    138.212  133.024  133.087      n/a   133.03   133.02
OpenMP ALLOCATE loop     204.95  208.375    214.2  223.671  262.556  396.008
OpenMP computation loop 138.302  138.341  138.371  138.283   138.32  138.322

[Figure: Time (seconds) vs. processors (1-64) for the MPI and OpenMP ALLOCATE and computation loops]
I/O time vs processors used - OpenMP
[Figure: I/O time (sec) vs. processors (1-64) for OpenMP, writing to stdout vs. separate files]
I/O times vs processors used - MPI
[Figure: I/O time (sec) vs. processors (1-64) for MPI, writing to stdout vs. separate files]
I/O times vs processors used - OpenMP and MPI
[Figure: I/O time vs. processors (1-64), stdout and separate files, for both OpenMP and MPI]
Reasons for Speedup Loss
[Figure: Speedup vs. processors (1-64) for the OpenMP code, compared with perfect speedup and with projected speedups when scaling I/O; scaling I/O and ALLOCATE; and scaling I/O, ALLOCATE, and the messaging infrastructure]
So Locality was not the whole problem!
• The other problems turned out to be:
– I/O, which doesn't scale
– ALLOCATE, which doesn't scale
– our non-scaling reduction implementation
– our first-cut messaging infrastructure, which could be improved
• Conclusion
– An efficient loop-parallel version may be feasible, avoiding ALLOCATEs and using scalable I/O
Need for adaptive strategies
• Computation structure changes over time:
– Combustion
• Adaptive techniques in application codes:
– Adaptive refinement in structures or even fluids
– Other codes such as crack propagation
• Can affect the load balance dramatically
– One can go from 90% efficiency to less than 25%
Multi-partition decompositions
• Idea: decompose the problem into a number
of partitions,
– independent of the number of processors
– # Partitions > # Processors
• The system maps partitions to processors (see the sketch below)
– The system should be able to map and re-map objects as needed
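A toy illustration of over-decomposition (all names hypothetical): the partition count is fixed by the problem, and only a mapping table ties partitions to processors, so re-mapping for load balance never touches the decomposition itself.

    // Illustrative sketch of multi-partition decomposition (hypothetical
    // names): the number of partitions is chosen independently of the
    // processor count, and a mapping table assigns each partition to a
    // processor. Re-mapping only changes the table, not the decomposition.
    #include <cstdio>
    #include <vector>

    int main() {
        const int numProcs      = 4;
        const int numPartitions = 16;            // # partitions > # processors

        // Initial mapping: round-robin of partitions onto processors.
        std::vector<int> partitionToProc(numPartitions);
        for (int p = 0; p < numPartitions; ++p)
            partitionToProc[p] = p % numProcs;

        // Later, the runtime may re-map a heavily loaded partition elsewhere.
        partitionToProc[5] = 3;                  // example re-mapping decision

        for (int p = 0; p < numPartitions; ++p)
            std::printf("partition %2d -> processor %d\n", p, partitionToProc[p]);
        return 0;
    }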
Load Balancing Framework
• Aimed at handling ...
– Continuous (slow) load variation
– Abrupt load variation (refinement)
– Workstation clusters in multi-user mode
• Measurement based (a rough sketch follows below)
– Exploits temporal persistence of computation and communication structures
– Very accurate (compared with estimation)
– Instrumentation possible via Charm++/Converse
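A toy sketch of the measurement-based idea (hypothetical data structures, not the actual framework's interface): loads measured in the previous phase are assumed to persist, and objects are greedily re-assigned to the least-loaded processors.

    // Toy measurement-based load balancer (illustration only): assume each
    // object's measured load from the previous phase predicts its next phase,
    // and greedily assign the heaviest objects to the least-loaded processor.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Obj { int id; double measuredLoad; };  // load observed last phase

    std::vector<int> rebalance(std::vector<Obj> objs, int numProcs) {
        // Heaviest objects first.
        std::sort(objs.begin(), objs.end(),
                  [](const Obj& a, const Obj& b) { return a.measuredLoad > b.measuredLoad; });

        using Slot = std::pair<double, int>;      // (accumulated load, processor)
        std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> heap;
        for (int p = 0; p < numProcs; ++p) heap.push({0.0, p});

        std::vector<int> objToProc(objs.size());
        for (const Obj& o : objs) {
            auto [load, proc] = heap.top();       // least-loaded processor so far
            heap.pop();
            objToProc[o.id] = proc;
            heap.push({load + o.measuredLoad, proc});
        }
        return objToProc;
    }

    int main() {
        std::vector<Obj> objs = {{0, 5.0}, {1, 1.0}, {2, 3.0}, {3, 3.0}, {4, 2.0}};
        std::vector<int> map = rebalance(objs, 2);
        for (std::size_t i = 0; i < map.size(); ++i)
            std::printf("object %zu -> processor %d\n", i, map[i]);
        return 0;
    }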
Charm++
• A parallel C++ library
– supports data-driven objects
– many objects per processor, with method execution scheduled by availability of data (see the cartoon below)
– system supports automatic instrumentation and object migration
– works with other paradigms: MPI, OpenMP, ...
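A plain-C++ cartoon of the data-driven object model (illustration only; this is not Charm++ syntax, which generates the equivalent machinery from interface definitions): many objects share a processor, and a scheduler invokes an object's method when a message for it arrives.

    // Cartoon of data-driven objects (illustration only, not the Charm++ API):
    // many objects live on one processor; a scheduler pulls pending messages
    // off a queue and runs the target object's method as its data arrives.
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <vector>

    struct Partition {
        int id;
        void receiveBoundary(double value) {     // "entry method" run by the scheduler
            std::printf("partition %d got boundary value %f\n", id, value);
        }
    };

    int main() {
        std::vector<Partition> objects = {{0}, {1}, {2}, {3}};  // many objects per processor
        std::queue<std::function<void()>> scheduler;            // pending method invocations

        // Arriving "messages" are enqueued as bound method invocations.
        scheduler.push([&] { objects[2].receiveBoundary(3.14); });
        scheduler.push([&] { objects[0].receiveBoundary(2.71); });

        // The scheduler runs each method when its data is available.
        while (!scheduler.empty()) {
            scheduler.front()();
            scheduler.pop();
        }
        return 0;
    }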
Load balancing framework
Load balancing demonstration
• To test the abilities of the framework
– A simple problem: Gauss-Jacobi iterations (sketched below)
– Refine selected sub-domains
• AppSpector: a web-based tool
– Submit parallel jobs
– Monitor performance and application behavior
– Interact with running jobs via GUI interfaces
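For reference, the kernel behind the demonstration problem, shown as a plain sequential sketch (the demonstration itself decomposes the grid into many migratable sub-domains): Jacobi relaxation replaces each interior point with the average of its four neighbors.

    // Sequential sketch of Gauss-Jacobi iteration on a 2D grid (shown only to
    // illustrate the kernel; the demonstration partitions this domain into
    // many migratable sub-domains). Boundary values are held fixed.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 64;                       // grid size, including boundary
        std::vector<double> cur(n * n, 0.0), nxt(n * n, 0.0);
        for (int j = 0; j < n; ++j) cur[j] = nxt[j] = 1.0;   // fixed hot top edge

        for (int iter = 0; iter < 500; ++iter) {
            for (int i = 1; i < n - 1; ++i)
                for (int j = 1; j < n - 1; ++j)
                    nxt[i * n + j] = 0.25 * (cur[(i - 1) * n + j] + cur[(i + 1) * n + j] +
                                             cur[i * n + j - 1] + cur[i * n + j + 1]);
            cur.swap(nxt);
        }
        std::printf("center value after relaxation: %f\n", cur[(n / 2) * n + n / 2]);
        return 0;
    }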
Adaptivity with minimal modification
• Current code base is parallel (MPI)
– But doesn’t support adaptivity directly
– Rewrite the code with objects?...
• Idea: support adaptivity with minimal changes to F90/MPI codes
• Work by:
– Milind Bhandarkar, Jay Hoeflinger, Eric de Sturler
Migratable threads approach
• Change required:
– Encapsulate global variables in modules
• Dynamically allocatable
• Intercept MPI calls (see the sketch after this list)
– Implement them in a multithreaded layer
• Run each original MPI process as a thread
– User level thread
• Migrate threads as needed by load balancing
– Trickier problem than object migration
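A rough sketch of the interception step (illustration only; the real layer also virtualizes each MPI process as a migratable user-level thread): here the standard MPI profiling interface is used to wrap MPI_Send, which is where such a layer could suspend the calling thread instead of blocking the processor.

    // Sketch of MPI call interception via the standard profiling interface
    // (illustration only). The wrapper version of MPI_Send is linked ahead of
    // the library and forwards to PMPI_Send; a migratable-threads layer could
    // use this hook to yield the calling user-level thread to its scheduler.
    #include <mpi.h>
    #include <cstdio>

    extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                            int dest, int tag, MPI_Comm comm) {
        std::printf("[layer] intercepted MPI_Send to rank %d (tag %d)\n", dest, tag);
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int payload = 42;
        if (size >= 2) {
            if (rank == 0)
                MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1) {
                MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                std::printf("rank 1 received %d\n", payload);
            }
        }
        MPI_Finalize();
        return 0;
    }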
Progress:
• Test Fortran-90 - C++ interface
• Encapsulation feasibility
• Thread migration mechanics
• ROCFLO study
• Test code implementation
• ROCFLO implementation
Another approach to adaptivity
• Cleanly separate parallel and sequential code (sketched below):
– All parallel code in C++
– All application code in sequential Fortran 90 subroutines
• Needs more restructuring of application codes
– But is feasible, especially for new codes
– Much easier to migrate
– Improves modularity
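A shape-only sketch of this separation (all names hypothetical; in the actual approach the kernel would be a sequential Fortran 90 subroutine called from the C++ driver): the driver owns partitioning and parallel execution, while the per-partition kernel contains no parallel constructs.

    // Sketch of the "parallel driver + sequential kernel" split (hypothetical
    // names). The kernel is plain C++ standing in for a sequential Fortran 90
    // subroutine; only the driver knows about partitions and parallelism.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Sequential application kernel: operates on one partition, no parallel calls.
    void relax_partition(std::vector<double>& u) {
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    int main() {
        const int numPartitions = 8;                       // over-decomposed domain
        std::vector<std::vector<double>> parts(numPartitions,
                                               std::vector<double>(100, 1.0));

        // Parallel driver: decides which partitions run where and when.
        #pragma omp parallel for
        for (int p = 0; p < numPartitions; ++p)
            relax_partition(parts[p]);

        std::printf("done: %d partitions relaxed\n", numPartitions);
        return 0;
    }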