Synchronization Transformations for Parallel Computing

Pedro Diniz and Martin Rinard
Department of Computer Science
University of California, Santa Barbara
http://www.cs.ucsb.edu/~{pedro,martin}
Motivation
• Parallel Computing Becomes the Dominant Form of Computation
• Parallel Machines Require Parallel Software
• Parallel Constructs Require New Analysis and Optimization Techniques

Our Goal: Eliminate Synchronization Overhead
Talk Outline
• Motivation
• Model of Computation
• Synchronization Optimization Algorithm
• Applications Experience
• Dynamic Feedback
• Related Work
• Conclusions
Model of Computation
• Parallel Programs
  • Serial Phases
  • Parallel Phases
• Single Address Space
• Atomic Operations on Shared Data
  • Mutual Exclusion Locks
  • Acquire Constructs
  • Release Constructs

[Figure: a mutual exclusion region: statement S1 bracketed by an Acq construct and a Rel construct on the same lock]
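A minimal C++ sketch of one mutual exclusion region, assuming std::mutex as the lock and a hypothetical shared counter; the model itself is language-independent:

```cpp
#include <mutex>

std::mutex lock;        // the mutual exclusion lock
long shared_total = 0;  // shared data in the single address space

// One mutual exclusion region: Acq, S1, Rel on the same lock.
void add_to_total(long value) {
    lock.lock();            // Acq: acquire construct
    shared_total += value;  // S1: atomic operation on shared data
    lock.unlock();          // Rel: release construct
}
```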
Reducing Synchronization Overhead

[Figure: mutual exclusion regions around statements S1, S2, and S3, with the Acq and Rel constructs that bracket them]
Synchronization Optimization

Idea: Replace Computations that Repeatedly Acquire and Release the Same Lock with a Computation that Acquires and Releases the Lock Only Once

Result: Reduction in the Number of Executed Acquire and Release Constructs

Mechanism: Lock Movement Transformations and Lock Cancellation Transformations
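As a hedged illustration of the idea (the loop and names are hypothetical, not taken from the benchmark applications), a computation that reacquires the same lock on every iteration becomes one that acquires and releases it only once:

```cpp
#include <mutex>
#include <vector>

std::mutex lock;
long shared_total = 0;

// Before: n executed acquire/release pairs on the same lock.
void accumulate_before(const std::vector<long>& contribution) {
    for (long c : contribution) {
        lock.lock();
        shared_total += c;
        lock.unlock();
    }
}

// After: one executed acquire/release pair for the whole computation.
void accumulate_after(const std::vector<long>& contribution) {
    lock.lock();
    for (long c : contribution) {
        shared_total += c;
    }
    lock.unlock();
}
```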
[Figures: the three basic transformations: Lock Cancellation, Acquire Lock Movement, and Release Lock Movement]
Synchronization Optimization Algorithm
Overview:
• Find Two Mutual Exclusion Regions With the Same Lock
• Expand the Mutual Exclusion Regions Using Lock Movement Transformations Until They Are Adjacent
• Coalesce Using the Lock Cancellation Transformation to Form a Single Larger Mutual Exclusion Region
(a code sketch of these steps follows the figures below)
[Figures: the algorithm illustrated on the Interprocedural Control Flow Graph (ICFG): acquire movement paths, release movement paths, migration paths and the meeting edge, intersection of paths, compensation nodes, and the final result]
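The same steps on straight-line code, as a hedged C++ sketch with hypothetical names: two regions on the same lock are found, expanded past an intervening statement by release lock movement, and coalesced by lock cancellation.

```cpp
#include <mutex>

struct Cell { long v = 0; void update() { ++v; } };

std::mutex lock;
Cell a, c;           // shared data protected by `lock`
long log_count = 0;  // updated outside any mutual exclusion region

// Step 0: two mutual exclusion regions with the same lock.
void original_code() {
    lock.lock();  a.update();  lock.unlock();  // region 1
    ++log_count;                               // no region
    lock.lock();  c.update();  lock.unlock();  // region 2
}

// Step 1: release lock movement pushes the first unlock() past
// ++log_count, making the two regions adjacent.
// Step 2: lock cancellation deletes the adjacent unlock()/lock()
// pair, forming a single larger mutual exclusion region.
void optimized_code() {
    lock.lock();
    a.update();
    ++log_count;   // now executes while the lock is held
    c.update();
    lock.unlock();
}
```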
Synchronization Optimization Trade-Off
• Advantage:
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead
• Disadvantage: May Introduce False Exclusion
  • Multiple Processors Attempt to Acquire the Same Lock
  • The Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
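A hedged sketch of false exclusion (all names hypothetical): after coalescing, purely local work executes while the lock is held, so other processors that need the lock only for their own brief shared updates must wait.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex lock;
long shared_total = 0;

// Purely local computation that was originally in no mutual
// exclusion region.
long local_work() {
    long s = 0;
    for (long i = 0; i < 1000000; ++i) s += i;
    return s;
}

// Coalesced region: the lock is now held across local_work(), so
// other processors attempting to acquire the same lock wait on code
// that never touches shared data.
void worker() {
    lock.lock();
    shared_total += 1;      // originally region 1
    long s = local_work();  // originally no region: false exclusion
    shared_total += s;      // originally region 2
    lock.unlock();
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}
```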
False Exclusion Policy

Goal: Limit the Potential Severity of False Exclusion

Mechanism: Constrain the Application of the Basic Transformations
• Original: Never Apply Transformations
• Bounded: Apply Transformations Only on Cycle-Free Subgraphs of the ICFG
• Aggressive: Always Apply Transformations
Experimental Results
• Automatic Parallelizing Compiler Based on Commutativity Analysis [PLDI’96]
• Set of Complete Scientific Applications (C++ Subset)
  • Barnes-Hut N-Body Solver (1500 Lines of Code)
  • Liquid Water Simulation Code (1850 Lines of Code)
  • Seismic Modeling String Code (2050 Lines of Code)
• Different False Exclusion Policies
• Performance of Generated Parallel Code on the Stanford DASH Shared-Memory Multiprocessor
Lock Overhead

Percentage of Time that the Single-Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks

[Bar charts: Percentage Lock Overhead (0-60%) under each false exclusion policy: Barnes-Hut (16K Particles) and Water (512 Molecules) compare Original, Bounded, and Aggressive; String (Big Well Model) compares Original and Aggressive]
Contention Overhead

Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors

[Graphs: Contention Percentage (0-100%) versus number of processors (0-16), with curves for the Aggressive, Bounded, and Original policies; panels for Barnes-Hut (16K Bodies), Water (512 Molecules), and String (Big Well Model)]
Performance Results: Barnes-Hut

[Speedup graph for Barnes-Hut (16384 Bodies): speedup versus number of processors (0-16); curves, top to bottom: Ideal, Aggressive, Bounded, Original]
Performance Results: Water

[Speedup graph for Water (512 Molecules): speedup versus number of processors (0-16); curves, top to bottom: Ideal, Bounded, Original, Aggressive]
Performance Results: String

[Speedup graph for String (Big Well Model): speedup versus number of processors (0-16); curves, top to bottom: Ideal, Original, Aggressive]
Choosing the Best Policy
• Best False Exclusion Policy May Depend On
  • Topology of Data Structures
  • Dynamic Schedule of Computation
• Information Required to Choose the Best Policy is Unavailable at Compile Time
• Complications
  • Different Phases May Have Different Best Policies
  • In the Same Phase, the Best Policy May Change Over Time
Solution: Dynamic Feedback
• Generated Code Consists of
  • Sampling Phases: Measure Performance of the Different Policies
  • Production Phases: Use the Best Policy From the Sampling Phase
• Periodically Resample to Discover Changes in the Best Policy
• Guaranteed Performance Bounds
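A minimal sketch of the structure of the generated code, with all names and interval lengths hypothetical; the actual compiler emits one version of each parallel phase per policy, and the guaranteed bounds come from fixing the relative lengths of the sampling and production intervals.

```cpp
#include <chrono>
#include <initializer_list>

enum Policy { ORIGINAL, BOUNDED, AGGRESSIVE };

// One compiled version of the parallel phase per false exclusion
// policy; a stub stands in for the real generated code here.
void run_phase(Policy /*policy*/, int /*work_units*/) {}

// Time one policy on a small slice of the work.
double cost_per_unit(Policy p, int units) {
    auto t0 = std::chrono::steady_clock::now();
    run_phase(p, units);
    std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - t0;
    return dt.count() / units;
}

void dynamic_feedback(int total_units) {
    const int sample_units = 64;        // per-policy sampling slice
    const int production_units = 4096;  // production interval
    int done = 0;
    while (done < total_units) {
        // Sampling phase: measure the performance of each policy.
        Policy best = ORIGINAL;
        double best_cost = 1e300;
        for (Policy p : {ORIGINAL, BOUNDED, AGGRESSIVE}) {
            double c = cost_per_unit(p, sample_units);
            done += sample_units;
            if (c < best_cost) { best_cost = c; best = p; }
        }
        // Production phase: run the best policy from the sampling
        // phase, then loop back to resample and discover changes.
        run_phase(best, production_units);
        done += production_units;
    }
}
```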
Dynamic Feedback

[Timeline: code version (Aggressive, Bounded, Original) versus time; a sampling phase measures each version in turn (incurring sampling overhead), a production phase then runs the best version (here Aggressive), and a later sampling phase repeats the measurements]
Dynamic Feedback: Barnes-Hut

[Speedup graph for Barnes-Hut (16384 Bodies): speedup versus number of processors (0-16); curves, top to bottom: Ideal, Aggressive, Dynamic Feedback, Bounded, Original]
Dynamic Feedback: Water

[Speedup graph for Water (512 Molecules): speedup versus number of processors (0-16); curves, top to bottom: Ideal, Bounded, Dynamic Feedback, Original, Aggressive]
Dynamic Feedback: String

[Speedup graph for String (Big Well Model): speedup versus number of processors (0-16); curves, top to bottom: Ideal, Original, Dynamic Feedback, Aggressive]
Related Work
• Parallel Loop Optimizations (e.g. [Tseng:PPoPP95])
• Array-based Scientific Computations
• Barriers vs. Cheaper Mechanisms
• Concurrent Object-Oriented Programs (e.g. [PZC:POPL95])
• Merge Access Regions for Invocations of Exclusive Methods
• Concurrent Constraint Programming
• Bring Together Ask and Tell Constructs
• Efficient Synchronization Algorithms
• Efficient Implementations of Synchronization Primitives
Conclusions
• Synchronization Optimizations
  • Basic Synchronization Transformations for Locks
  • Synchronization Optimization Algorithm
• Integrated into Prototype Parallelizing Compiler
  • Object-Based Programs with Dynamic Data Structures
  • Commutativity Analysis
• Experimental Results
  • Optimizations Have a Significant Performance Impact
  • With Optimizations, Applications Perform Well
• Dynamic Feedback