
ECE 669
Parallel Computer Architecture
Lecture 23
Parallel Compilation
April 29, 2004
Parallel Compilation
° Two approaches to compilation
• Parallelize a program manually
- Sequential code is converted to parallel code by hand
° Develop a parallel compiler
• Intermediate form
• Partitioning
- Block-based or loop-based
• Placement
• Routing
Compilation technologies for parallel machines
Assumptions:
Input: Parallel program
Output: “Coarse” parallel program & directives for:
• Which threads run in 1 task
• Where tasks are placed
• Where data is placed
• Which data elements go in each data chunk
Limitation: No special optimizations for synchronization; synchronization memory references are treated like any other communication.
Toy example
° Loop parallelization
Example
° Matrix multiply
° Typically,
FORALL i
  FORALL j
    FOR k
      C[i, j] = C[i, j] + A[i, k] * B[k, j]
° Looking to find parallelism...
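As a point of reference, a minimal C sketch of this loop nest is shown below (N and the array names are illustrative). The i and j loops carry no dependences between iterations and can become FORALLs; the k loop accumulates into C[i, j] and stays sequential.

/* Minimal C sketch of the matrix-multiply nest above (N and array names are
   illustrative).  The i and j loops are independent and can run as FORALLs;
   the k loop is a reduction into C[i][j] and stays sequential per (i, j).
   C is assumed to be zero-initialized by the caller. */
#define N 4

void matmul(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)            /* FORALL i (parallel)   */
        for (int j = 0; j < N; j++)        /* FORALL j (parallel)   */
            for (int k = 0; k < N; k++)    /* FOR k   (sequential)  */
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}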
Choosing a program representation...
° Dataflow graph
[Figure: dataflow graph in which "+" nodes combine a0 and b0 into c0, a2 and b2 into c2, and a5 and b5 into c5]
• No notion of storage
• Data values flow along arcs
• Nodes represent operations
Compiler representation
° For certain kinds of structured programs
[Figure: structured-program representation: the data array A[i, j], its index expressions i and j, and the enclosing FORALL/LOOP nest]
° Unstructured programs
[Figure: unstructured-program representation: task nodes A and B and data X, with edges labeled by communication weight]
Process reference graph
FORALL i FROM 0 TO 5
  C[i] = A[i] + B[i]

[Figure: processes P0, P1, P2, ..., P5 all referencing one monolithic memory, with edge weights of 1 or 2 ... not very useful]
• Nodes represent the computation of threads (processes)
• Edges represent communication (memory references)
• Can attach weights to edges to represent the volume of communication
• Extension: precedence-relation edges can be added too
• Can also try to represent multiple loop-produced threads as one node
Process communication graph
[Figure: process communication graph: process nodes P0, P1, ..., P5 and data nodes d0 ... d5 (elements of A, B, and C), connected by unit-weight edges]
• Allocate data items to nodes as well
• Nodes: threads and data objects
• Edges: communication
• Key: works for shared-memory, object-oriented (message-passing), and dataflow systems!
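One hypothetical way a compiler could store such a graph is sketched below; all type and field names are invented for illustration and are not from the lecture.

/* Hypothetical PCG node table for C[0] = A[0] + B[0] (all names invented for
   illustration).  Thread nodes and data nodes share one node type; each edge
   carries the communication volume. */
#include <stdio.h>

enum node_kind { THREAD_NODE, DATA_NODE };

struct pcg_edge { int dst; int weight; };     /* weight = communication volume */
struct pcg_node { enum node_kind kind; int nedges; struct pcg_edge edges[4]; };

int main(void)
{
    struct pcg_node pcg[4] = {
        { THREAD_NODE, 3, { {1, 1}, {2, 1}, {3, 1} } },  /* P0 reads A0, B0; writes C0 */
        { DATA_NODE,   0, { {0, 0} } },                  /* A0 */
        { DATA_NODE,   0, { {0, 0} } },                  /* B0 */
        { DATA_NODE,   0, { {0, 0} } },                  /* C0 */
    };
    printf("P0 touches %d data objects\n", pcg[0].nedges);
    return 0;
}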
PCG for Jacobi relaxation
[Figure: fine PCG for Jacobi, with computation nodes and data nodes and weight-4 edges between each computation node and its data; grouping them yields the coarse PCG]
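For context, a minimal C sketch of one Jacobi relaxation step is given below, assuming the usual 4-point stencil (N and the array names are illustrative). The four neighbor reads per grid point are what give each computation node in the fine PCG its weight-4 communication.

/* Minimal sketch of one 4-point Jacobi relaxation step (standard stencil
   assumed; N and array names are illustrative).  Each interior point reads
   its four neighbours, matching the weight-4 edges in the fine PCG. */
#define N 8

void jacobi_step(const double in[N][N], double out[N][N])
{
    for (int i = 1; i < N - 1; i++)        /* interior points only */
        for (int j = 1; j < N - 1; j++)
            out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                in[i][j-1] + in[i][j+1]);
}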
Compilation with PCGs
Fine process communication graph
    ↓ Partitioning
Coarse process communication graph
Compilation with PCGs
Fine process communication graph
    ↓ Partitioning
Coarse process communication graph
    ↓ MP: Placement
Coarse process communication graph (placed)
... other phases, scheduling. Dynamic?
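As one illustration of the partitioning phase (not the algorithm presented in the lecture), the sketch below performs a single greedy step: it merges the two nodes joined by the heaviest communication edge so that this traffic becomes internal to a task. The node count and weights are made up.

/* Illustrative greedy partitioning step (not the lecture's algorithm; all
   weights are made up).  Merging the endpoints of the heaviest edge of the
   fine PCG turns that communication into local references in the coarse PCG. */
#include <stdio.h>

#define NNODES 4

int weight[NNODES][NNODES] = {     /* symmetric communication volumes */
    {0, 4, 1, 0},
    {4, 0, 2, 1},
    {1, 2, 0, 3},
    {0, 1, 3, 0},
};
int group[NNODES] = {0, 1, 2, 3};  /* initially each node is its own task */

int main(void)
{
    int bi = 0, bj = 1;
    for (int i = 0; i < NNODES; i++)
        for (int j = i + 1; j < NNODES; j++)
            if (weight[i][j] > weight[bi][bj]) { bi = i; bj = j; }
    group[bj] = group[bi];         /* heaviest edge becomes task-internal */
    printf("merged nodes %d and %d (edge weight %d)\n", bi, bj, weight[bi][bj]);
    return 0;
}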
Parallel Compilation
° Consider loop partitioning
° Create small local computations
° Consider static routing between tiles
° Short neighbor-to-neighbor communication
° Compiler orchestrated
Flow Compilation
° Modulo unrolling
° Partitioning
° Scheduling
Modulo Unrolling – Smart Memory
° Loop unrolling relies on dependencies
° Allow maximum parallelism
° Minimize communication
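A minimal sketch of modulo unrolling, assuming 4 interleaved memory banks with A[i] homed on bank i % 4 (bank count and names are illustrative): unrolling the loop by the number of banks gives every unrolled reference a fixed, statically known bank.

/* Sketch of modulo unrolling (bank count and names are illustrative).
   With A[i] on bank i % 4, unrolling by 4 gives each reference a fixed bank,
   so accesses can be routed at compile time.  N is assumed divisible by 4. */
#define N 16
#define BANKS 4

void scale(double A[N], double s)
{
    for (int i = 0; i < N; i += BANKS) {
        A[i]     *= s;   /* always bank 0 */
        A[i + 1] *= s;   /* always bank 1 */
        A[i + 2] *= s;   /* always bank 2 */
        A[i + 3] *= s;   /* always bank 3 */
    }
}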
Array Partitioning – Smart Memory
° Assign each line to a separate memory
° Consider exchange of data
° Approach is scalable
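A tiny sketch of one such layout (the tile count and the function name are assumptions): each line (row) of the array is homed on a fixed memory tile, so the mapping scales directly with the number of tiles.

/* Hypothetical row-wise array partitioning (NTILES and the function name are
   assumptions): line (row) i of the array is homed on memory tile i % NTILES. */
#define NTILES 4

int home_tile(int row)
{
    return row % NTILES;
}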
Communication Scheduling – Smart Memory
° Determine where data should be sent
° Determine when data should be sent
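One way to picture the result is a static schedule table; the sketch below is purely illustrative (every name and value is invented), but it conveys the idea that the compiler fixes, per cycle, which tile sends which element to which tile.

/* Hypothetical compiler-generated static communication schedule (all values
   are invented): each entry fixes when, from where, and to where an array
   element is sent. */
#include <stdio.h>

struct comm_slot { int cycle; int src_tile; int dst_tile; int element; };

static const struct comm_slot schedule[] = {
    { 0, 0, 1,  7 },   /* cycle 0: tile 0 sends element 7 to tile 1  */
    { 1, 1, 2, 15 },   /* cycle 1: tile 1 sends element 15 to tile 2 */
};

int main(void)
{
    int n = (int)(sizeof schedule / sizeof schedule[0]);
    for (int i = 0; i < n; i++)
        printf("cycle %d: tile %d -> tile %d, element %d\n",
               schedule[i].cycle, schedule[i].src_tile,
               schedule[i].dst_tile, schedule[i].element);
    return 0;
}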
Speedup for Jacobi – Smart Memory
° Virtual wires indicate scheduled paths
° Hard wires are dedicated paths
° Hard wires require more wiring resources
° RAW is a parallel processor from MIT
Partitioning
• Use a heuristic for unstructured programs
• For structured programs... start from:

[Figure: the list of arrays (A, B, C) and the list of loop nests (L0, L1, L2)]
Notion of Iteration space, data space
E.g.:

  Forall i
    Forall j
      A[i, j] = A[i + 2, j + 1] + A[i, j + 1]

[Figure: the data space of matrix A and the (i, j) iteration space; each point of the iteration space represents a “thread” with a given value of (i, j)]
Notion of Iteration space, data space
E.g.: (same example as above)

[Figure: the same data space and iteration space, annotated: “This thread affects the above computation”]
° Partitioning: How to “tile” the iteration and data spaces for MIMD machines?
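A minimal sketch of one answer, assuming the iteration space is cut into B x B rectangular tiles (N, B, and the in/out arrays are illustrative; the stencil follows the example above as reconstructed): each processor would execute one tile and own the matching block of the data space, so most references stay local.

/* Sketch of tiling the FORALL iteration space into B x B blocks (N, B, and
   array names are illustrative; the stencil follows the example above).
   Each (ii, jj) tile would be given to one processor, together with the
   corresponding block of the data space.  Separate in/out arrays keep the
   FORALL (read-old-values) semantics. */
#define N 8
#define B 4

void tiled_update(const double in[N][N], double out[N][N])
{
    for (int ii = 0; ii < N; ii += B)          /* one tile per processor */
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B && i + 2 < N; i++)
                for (int j = jj; j < jj + B && j + 1 < N; j++)
                    out[i][j] = in[i + 2][j + 1] + in[i][j + 1];
}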
Loop partitioning for caches
° Machine model
[Figure: machine model: array A in a shared memory, connected through a network to per-processor caches]
• Assume all data is in memory
• Minimize first-time cache fetches
• Ignore secondary effects such as invalidations due to writes
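Under this model, the cost of a candidate loop partition can be estimated by counting the distinct cache lines one tile touches. The sketch below does this for the stencil of the earlier example, assuming (for illustration) 2 array elements per cache line.

/* Back-of-the-envelope estimate of first-time cache fetches for one h x w
   tile of the iteration space, using the stencil of the earlier example and
   an assumed 2 elements per cache line.  The tile reads and writes rows
   i..i+h+1 and columns j..j+w, so its footprint is about (h+2)*(w+1)
   elements. */
#include <stdio.h>

#define ELEMS_PER_LINE 2

long first_time_fetches(long h, long w)
{
    long footprint = (h + 2) * (w + 1);   /* distinct elements touched */
    return (footprint + ELEMS_PER_LINE - 1) / ELEMS_PER_LINE;
}

int main(void)
{
    printf("4 x 4 tile: about %ld cache-line fetches\n", first_time_fetches(4, 4));
    return 0;
}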
Summary
° Parallel compilation often targets block-based and loop-based parallelism
° Compilation steps address identification of parallelism and its representation
• Graphs are often useful for representing program dependencies
° For static scheduling, both computation and communication can be represented
° Data positioning is important for computation