final version

Chapter 10
Scheduling
Presented by Vladimir Yanovsky
The goals
• Scheduling: Mapping of parallelism within the
constraints of limited available parallel resources
• In general, we must sacrifice some parallelism to
fit a program within the available resources
• Our goal: Minimize the amount of parallelism
sacrificed/maximize utilization of the resources
Lecture Outline
– Straight line scheduling
– Trace Scheduling
– Loops: Kernel Scheduling
(Software Pipelining)
– Vector unit scheduling
Scheduling - Motivation
• Transistor sizes have shrunk. This can be exploited
by:
1. Several processors on the same silicon.
2. Multiple identical execution units.
• The more parallelism the processor allows,
the more important scheduling is.
Processor Types
• Superscalar
Multiple functional units controlled and
scheduled by the hardware.
• VLIW (Very Large Instruction Word)
Scheduled by the compiler
VLIW vs Superscalar
• Compatibility
• Capability of run-time adjustments
(branches & cache misses)
• Design simplicity
• Global view of the program
Scheduling – standard approach
• Scheduling in VLIW and Superscalar
architectures:
– Receive a sequential stream of instructions
– Reorder this sequential stream to utilize available
parallelism
– Reordering must preserve dependences
• Our model for this talk is VLIW
Reuse Constraints
• Need to execute:
a = b + c + d + e
• One possible sequential stream:
add a, b, c
add a, a, d
add a, a, e
• And, another:
add r1, b, c
add r2, d, e
add a, r1, r2
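The difference between the two streams can be checked with a small sketch. It assumes unit latency and unlimited issue width, and tracks only true and output dependences on register names; the helper name chain_depth is made up for illustration.

```python
def chain_depth(stream):
    """Length of the longest dependence chain, assuming unit latency.

    stream is a list of (dst, src1, src2) register names.
    """
    ready = {}   # register -> cycle at which its latest value is available
    depth = 0
    for dst, src1, src2 in stream:
        # Wait for both sources (true dependences) and for any earlier
        # write to the same destination (output dependence).
        start = max(ready.get(src1, 0), ready.get(src2, 0), ready.get(dst, 0))
        ready[dst] = start + 1
        depth = max(depth, ready[dst])
    return depth

reuse = [("a", "b", "c"), ("a", "a", "d"), ("a", "a", "e")]
fresh = [("r1", "b", "c"), ("r2", "d", "e"), ("a", "r1", "r2")]
print(chain_depth(reuse), chain_depth(fresh))
```

Reusing a serializes all three adds, while the stream with r1 and r2 lets the first two adds run in the same cycle at the cost of two extra registers.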
Fundamental Problem
• Fundamental conflict in scheduling:
– If the original instruction stream takes into
account available resources it will create
artificial dependences
– If not, then there may not be enough
resources to correctly execute the stream
• Which should come first, register allocation
or scheduling?
Processor Model
• VLIW type
• Processor contains a number of issue units
• Issue unit has an associated type and a
delay
• Purpose: to select a set of instructions for each
cycle such that the number of instructions of
each type is not greater than the number of
execution units of this type
Straight Line Scheduling
• Scheduling a basic block: receives a
dependence graph
G = (N, E, type, delay)
– N: set of instructions in the code
– E: (n1, n2) ∈ E iff n2 must wait for completion of n1
due to a dependence
– Each n ∈ N has a type, type(n), and a delay,
delay(n).
Straight Line Scheduling
• A correct schedule is a mapping, S, from vertices in the
graph to nonnegative integers representing cycle
numbers such that:
1. If (n1,n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2), i.e. dependences are satisfied
2. Hardware constraints are satisfied.
• The length of a schedule, S, denoted L(S), is defined as:
L(S) = max_n (S(n) + delay(n))
• Goal of straight-line scheduling: Find a shortest possible
correct schedule.
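A minimal sketch of the two correctness conditions and of L(S), assuming the hardware constraint is simply "no more instructions of a type per cycle than there are units of that type". The function names and the dictionary-based graph encoding are assumptions for illustration.

```python
def is_correct(schedule, edges, delay, unit_type, units):
    """schedule: instruction -> cycle; units: type -> number of issue units."""
    # Condition 1: every dependence (n1, n2) is satisfied.
    for n1, n2 in edges:
        if schedule[n1] + delay[n1] > schedule[n2]:
            return False
    # Condition 2: per cycle, no type is over-subscribed.
    used = {}
    for n, cycle in schedule.items():
        key = (cycle, unit_type[n])
        used[key] = used.get(key, 0) + 1
        if used[key] > units[unit_type[n]]:
            return False
    return True

def length(schedule, delay):
    """L(S) = max over n of S(n) + delay(n)."""
    return max(schedule[n] + delay[n] for n in schedule)

edges = [("i1", "i3"), ("i2", "i3")]
delay = {"i1": 1, "i2": 1, "i3": 1}
unit_type = {"i1": "int", "i2": "int", "i3": "int"}
serial = {"i1": 0, "i2": 1, "i3": 2}
packed = {"i1": 0, "i2": 0, "i3": 1}
print(is_correct(packed, edges, delay, unit_type, {"int": 2}), length(packed, delay))
```

With a single integer unit only the serial schedule is correct; with two units the packed one is correct and shorter.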
List Scheduling
• Use variant of topological sort:
– Maintain a list of instructions which have no
unscheduled predecessors in the graph
– Schedule these instructions
– This will allow other instructions to be added to
the list
– Repeat until all instructions are scheduled
List Scheduling
• We maintain two arrays:
– count determines for each instruction how many
predecessors are still to be scheduled
– earliest array maintains the earliest cycle on which
the instruction can be scheduled.
• Maintain a number of worklists which hold
instructions to be scheduled for a particular cycle
number. All their predecessors are scheduled.
List Scheduling - Initialization
for each n ∈ N do begin count[n] := 0; earliest[n] := 0 end
for each (n1,n2) ∈ E do begin
  count[n2] := count[n2] + 1;
  successors[n1] := successors[n1] ∪ {n2};
end
for i := 0 to MaxC – 1 do W[i] := ∅; //MaxC = max(delay)+1
Wcount := 0; //The number of ready instructions
for each n ∈ N do begin
  if count[n] = 0 then begin //No dependencies
    W[0] := W[0] ∪ {n}; Wcount := Wcount + 1;
  end
end
c := 0; // c is the cycle number
cW := 0; // cW is the number of the worklist for cycle c
instr[c] := ∅;
List Scheduling Algorithm
while Wcount > 0 do begin
  while W[cW] = ∅ do begin
    c := c + 1; instr[c] := ∅; cW := mod(cW+1,MaxC);
  end
  nextc := mod(c+1,MaxC); //next cycle
  while W[cW] ≠ ∅ do begin
    select and remove an arbitrary instruction x from W[cW]; //Priority: the choice of x can follow a priority
    if free issue units of type(x) on cycle c then begin
      instr[c] := instr[c] ∪ {x}; Wcount := Wcount - 1;
      for each y ∈ successors[x] do begin
        count[y] := count[y] – 1;
        earliest[y] := max(earliest[y], c+delay(x));
        if count[y] = 0 then begin
          loc := mod(earliest[y],MaxC);
          W[loc] := W[loc] ∪ {y}; Wcount := Wcount + 1;
        end
      end
    end
    else W[nextc] := W[nextc] ∪ {x}; //x could not be scheduled
  end
  //For each unused unit insert stall
end
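The algorithm can be sketched in Python as follows. This is not a line-by-line port: it keeps the count and earliest arrays but replaces the circular array of MaxC worklists with a dictionary keyed by cycle number, and it assumes every delay is at least 1 and that each needed unit type has at least one unit. All names are illustrative.

```python
from collections import defaultdict

def list_schedule(nodes, edges, delay, unit_type, units, priority=None):
    """Return a dict instruction -> cycle. units: type -> number of units."""
    count = {n: 0 for n in nodes}        # unscheduled predecessors
    earliest = {n: 0 for n in nodes}     # earliest legal cycle
    successors = defaultdict(list)
    for n1, n2 in edges:
        count[n2] += 1
        successors[n1].append(n2)
    worklist = defaultdict(list)         # cycle -> ready instructions
    for n in nodes:
        if count[n] == 0:
            worklist[0].append(n)
    pending = len(worklist[0])           # plays the role of Wcount
    schedule = {}
    c = 0
    while pending > 0:
        free = dict(units)               # issue units still free on cycle c
        ready = worklist.pop(c, [])
        if priority is not None:
            ready.sort(key=priority, reverse=True)
        for x in ready:
            if free[unit_type[x]] > 0:
                free[unit_type[x]] -= 1
                schedule[x] = c
                pending -= 1
                for y in successors[x]:
                    count[y] -= 1
                    earliest[y] = max(earliest[y], c + delay[x])
                    if count[y] == 0:
                        worklist[earliest[y]].append(y)
                        pending += 1
            else:
                worklist[c + 1].append(x)   # x could not be scheduled
        c += 1
    return schedule

nodes = ["i1", "i2", "i3"]
edges = [("i1", "i3"), ("i2", "i3")]
delay = {n: 1 for n in nodes}
unit_type = {n: "int" for n in nodes}
one_unit = list_schedule(nodes, edges, delay, unit_type, {"int": 1})
two_units = list_schedule(nodes, edges, delay, unit_type, {"int": 2})
print(one_unit, two_units)
```

The optional priority argument is a function from instruction to a number; larger values are tried first within a cycle.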
Finding the critical path
for each n ∈ N do begin count[n] := 0; remaining[n] := delay(n); end
for each (n1,n2) ∈ E do begin
  count[n1] := count[n1] + 1; //count[n]==0 iff nothing depends on n
  predecessors[n2] := predecessors[n2] ∪ {n1};
end
W := ∅;
for each n ∈ N do if count[n] = 0 then W := W ∪ {n}; //init: W holds instructions that nothing depends on
while W ≠ ∅ do begin
  select and remove an arbitrary instruction x from W;
  for each y ∈ predecessors[x] do begin
    count[y] := count[y] – 1;
    remaining[y] := max(remaining[y], remaining[x]+delay(y));
    if count[y] = 0 then W := W ∪ {y};
  end
end
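A direct Python rendering of the backward pass, as a sketch. The example graph is a three-instruction chain with delays 2, 3 and 1 plus one independent instruction; the delay of 1 for the last two is an assumption made for the example.

```python
def remaining_path(nodes, edges, delay):
    """remaining[n]: length of the longest delay-weighted path starting at n."""
    count = {n: 0 for n in nodes}     # successors not yet processed
    preds = {n: [] for n in nodes}
    for n1, n2 in edges:
        count[n1] += 1
        preds[n2].append(n1)
    remaining = {n: delay[n] for n in nodes}
    work = [n for n in nodes if count[n] == 0]   # nothing depends on these
    while work:
        x = work.pop()
        for y in preds[x]:
            count[y] -= 1
            remaining[y] = max(remaining[y], remaining[x] + delay[y])
            if count[y] == 0:
                work.append(y)
    return remaining

nodes = ["fld", "fadd", "fst", "ai"]
edges = [("fld", "fadd"), ("fadd", "fst")]
delay = {"fld": 2, "fadd": 3, "fst": 1, "ai": 1}
remaining = remaining_path(nodes, edges, delay)
print(remaining)
```

The resulting values can serve as the priority when picking an instruction from a worklist: instructions with the longest remaining path go first.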
Problems of list scheduling
• Previous basic block must complete
before the next is started.
• Cannot schedule loops.
Trace Scheduling
• Exploit parallelism between
several basic blocks.
• Trace: is a collection of
basic blocks that form a
single path through all or
part of the program.
• CFG without loops
Trace Scheduling
[Figure: a trace before and after scheduling, containing j=j+1, i=i+2 and the split "if e1".]
i = i + 2 is moved below
the split, so fixup code
is inserted.
Trace Scheduling
• Trace scheduling algorithm:
1. Select a trace based on profiling information.
2. Schedule the trace using the basic block
scheduler, adding dependences from the
splits/joins to the upstream/downstream
instructions respectively.
3. Insert fixup code.
4. Remove the scheduled trace from the CFG.
5. If the CFG is not empty, go to 1.
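Steps 1, 4 and 5 can be sketched as below. The slides only say that traces are selected from profiling information; the greedy rule used here (start at the most frequently executed remaining block, then follow the most probable successor) is one common choice and is an assumption, as are the function names and the profile encoding. Scheduling and fixup insertion (steps 2 and 3) are omitted.

```python
def select_trace(counts, succ_prob, done):
    """Pick one trace among blocks not in done.

    counts: block -> execution count (profile data)
    succ_prob: block -> {successor: branch probability}
    """
    candidates = [b for b in counts if b not in done]
    if not candidates:
        return []
    block = max(candidates, key=lambda b: counts[b])
    trace, seen = [block], {block}
    while True:
        options = {s: p for s, p in succ_prob.get(block, {}).items()
                   if s not in done and s not in seen}
        if not options:
            break
        block = max(options, key=options.get)   # most probable successor
        trace.append(block)
        seen.add(block)
    return trace

def all_traces(counts, succ_prob):
    """Repeat selection until every block belongs to some trace."""
    done, traces = set(), []
    while len(done) < len(counts):
        trace = select_trace(counts, succ_prob, done)
        traces.append(trace)
        done.update(trace)
    return traces

counts = {"A": 100, "B": 90, "C": 10, "D": 100}
succ_prob = {"A": {"B": 0.9, "C": 0.1}, "B": {"D": 1.0}, "C": {"D": 1.0}}
traces = all_traces(counts, succ_prob)
print(traces)
```

On this diamond-shaped CFG the hot path A, B, D becomes the first trace and the rarely executed block C is left as a trace of its own.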
Trace & line scheduling conclusions
1. Problem with line & trace scheduling – cannot
schedule loops effectively. Must unroll loops to
have more “meat” for work.
2. Trace scheduling increases code size by
inserting fixup code, may lead to exponential
code increase.
3. Need up-to-date memory dependence
information in order to move
memory accesses.
Kernel Scheduling
• Moves instructions not only in space but
also in time – across iterations.
• Allows better exploitation of parallelism
between loop iterations.
Kernel Scheduling problem
• A kernel scheduling problem is a graph:
G = (N, E, delay, type, cross)
where cross (n1, n2) defined for each edge
in E is the number of iterations crossed by
the dependence relating n1 and n2
Software Pipelining
• Example:
      ld    r1,0
      ld    r2,400
      fld   fr1,c
l0    fld   fr2,a(r1)
l1    fadd  fr2,fr2,fr1
l2    fst   fr2,b(r1)
l3    ai    r1,r1,8
l4    comp  r1,r2
l5    ble   l0
Dependence chain with delays: fld -(2)-> fadd -(3)-> fst
• A legal schedule:
Cycle  Load/Store           Integer        Floating Pt.
0      l0: fld fr2,a(r1)    ai r1,r1,8
1                           comp r1,r2
2      fst fr3,b-16(r1)     ble l0         fadd fr3,fr2,fr1
Software Pipelining
Cycle  Load/Store           Integer        Floating Pt.
0      l0: fld fr2,a(r1)    ai r1,r1,8
1                           comp r1,r2
2      fst fr3,b-16(r1)     ble l0         fadd fr3,fr2,fr1

Original loop, with the schedule S and iteration I of each kernel instruction:
Label  Instruction         S   I
       ld    r1,0
       ld    r2,400
       fld   fr1,c
l0     fld   fr2,a(r1)     0   0
l1     fadd  fr2,fr2,fr1   2   0
l2     fst   fr2,b(r1)     2   1
l3     ai    r1,r1,8       0   0
l4     comp  r1,r2         1   0
l5     ble   l0            2   0
Software Pipelining
• Have to generate epilog and prolog to ensure correctness
• Prolog:
      ld   r1,0
      ld   r2,400
      fld  fr1,c
p1    fld  fr2,a(r1);  ai r1,r1,8
p2    comp r1,r2
p3    beq  e1;  fadd fr3,fr2,fr1
• Epilog:
e1    nop
e2    nop
e3    fst  fr3,b-8(r1)
Kernel Scheduling
• A solution to the kernel scheduling problem is a
pair of tables (S,I), where:
– the schedule S maps each instruction n to a cycle
within the kernel
– the iteration I maps each instruction to an iteration
offset from zero, such that:
S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1,n2)) · Lk(S)
for each edge (n1,n2) in E, where:
Lk(S) = max_n (S[n]) + 1 is the length of the kernel for S (cycles are numbered from 0).
• Another name for kernel’s length is II – initiation
interval
Kernel scheduling - intuition
• S[n1] + delay(n1) ≤ S[n2] + (I[n2] – I[n1] + cross(n1,n2)) · Lk(S)
• Instructions with I[n] = 0 are running in the “current”
iteration.
• If I[n]>0 this means that the instruction is delayed by I[n]
iterations.
• Even if n1 has large delay, n2 can be moved to a later
iteration instead of forcing it to be scheduled in the cycle
S[n1] + delay(n1)
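The constraint can be checked mechanically. The sketch below tests the (S, I) tables of the fld/fadd/fst example against it, with a kernel length of 3 cycles. Only the delays of fld (2) and fadd (3) come from the example; the delay of 1 for the remaining instructions and the choice of edges are assumptions.

```python
def kernel_ok(S, I, edges, delay, L):
    """edges: list of (n1, n2, cross); L: kernel length in cycles."""
    return all(
        S[n1] + delay[n1] <= S[n2] + (I[n2] - I[n1] + cross) * L
        for n1, n2, cross in edges
    )

S = {"l0": 0, "l1": 2, "l2": 2, "l3": 0, "l4": 1, "l5": 2}
I = {"l0": 0, "l1": 0, "l2": 1, "l3": 0, "l4": 0, "l5": 0}
delay = {"l0": 2, "l1": 3, "l2": 1, "l3": 1, "l4": 1, "l5": 1}
edges = [("l0", "l1", 0), ("l1", "l2", 0), ("l3", "l4", 0), ("l4", "l5", 0)]
print(kernel_ok(S, I, edges, delay, 3))

# Without moving fst to the next iteration the fadd -> fst edge is violated.
I_same_iter = dict(I, l2=0)
print(kernel_ok(S, I_same_iter, edges, delay, 3))
```

The second call shows the intuition above: fadd finishes at cycle 5, so fst at cycle 2 is only legal once it is delayed by one iteration.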
Resource Constraints
• Resource usage constraint:
– Assume no recurrence in the loop
– #t: number of instructions in each iteration that must
issue in a unit of type t
– m_t: number of units of type t
$$L_k(S) \ge \max_t \left\lceil \frac{\#t}{m_t} \right\rceil$$
• We can always find a schedule S, such that
$$L_k(S) = \max_t \left\lceil \frac{\#t}{m_t} \right\rceil$$
Kernel Scheduling
for each instruction x in G in topological order do begin
  earlyS := 0; earlyI := 0;
  for each predecessor y of x in G do begin
    thisS := S[y] + delay(y); thisI := I[y];
    if thisS ≥ L then begin
      thisI := thisI + ⌊thisS/L⌋; thisS := mod(thisS,L); //iteration offset first, then wrap the cycle
    end
    if thisI > earlyI or ((thisI = earlyI) and (thisS > earlyS)) then begin
      earlyI := thisI; earlyS := thisS;
    end
  end
  starting at cycle earlyS, find the first cycle c0
  where the resource needed by x is available,
  wrapping to the beginning of the kernel if necessary;
  S[x] := c0;
  if c0 < earlyS then I[x] := earlyI + 1 else I[x] := earlyI; //Wrapped over kernel
end
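A Python sketch of this simple (recurrence-free) kernel scheduler. The iteration offset is computed before the cycle is wrapped, the resource table is a per-cycle count of busy units, and an error is raised when no slot is free; the names and the error handling are assumptions. The example is a load, three dependent adds and a store on 2 memory and 3 integer units with L = 1, assuming a delay of 1 for every instruction.

```python
def kernel_schedule(order, preds, delay, unit_type, units, L):
    """order: instructions in topological order; returns (S, I)."""
    S, I = {}, {}
    busy = [dict.fromkeys(units, 0) for _ in range(L)]   # busy[c][type]
    for x in order:
        early_s, early_i = 0, 0
        for y in preds.get(x, []):
            this_s = S[y] + delay[y]
            this_i = I[y] + this_s // L    # iterations crossed by the delay
            this_s %= L                    # then wrap into the kernel
            if (this_i, this_s) > (early_i, early_s):
                early_i, early_s = this_i, this_s
        t = unit_type[x]
        for step in range(L):              # search from early_s, wrapping
            c0 = (early_s + step) % L
            if busy[c0][t] < units[t]:
                break
        else:
            raise ValueError(f"no free {t} unit for {x} with L={L}")
        busy[c0][t] += 1
        S[x] = c0
        I[x] = early_i + 1 if c0 < early_s else early_i   # wrapped over kernel
    return S, I

order = ["l0", "l1", "l2", "l3", "l4"]
preds = {"l1": ["l0"], "l2": ["l1"], "l3": ["l2"], "l4": ["l3"]}
delay = {n: 1 for n in order}
unit_type = {"l0": "mem", "l1": "int", "l2": "int", "l3": "int", "l4": "mem"}
S, I = kernel_schedule(order, preds, delay, unit_type, {"mem": 2, "int": 3}, 1)
print(S, I)
```

Every instruction lands in cycle 0 and each one is pushed one iteration later than its predecessor, so five iterations overlap in a one-cycle kernel.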
Software Pipelining Example
l0    ld  a,x(i)
l1    ai  a,a,1
l2    ai  a,a,1
l3    ai  a,a,1
l4    st  a,x(i)

Unit       Instruction
Memory1    l0: S=0; I=0
Integer1   l1: S=0; I=1
Integer2   l2: S=0; I=2
Integer3   l3: S=0; I=3
Memory2    l4: S=0; I=4
• 2 memory units, 3 integer units.
• II=1 is enough. Each successive instruction is pushed to the next iteration.
Register Pressure
1. The same register a
cannot be used in 4 different
iterations running
simultaneously.
2. A separate copy of the register's value
is needed for each overlapping
iteration, and the copies must be renamed
cyclically after each iteration.
3. Issue 2 can be solved by
unrolling with renaming,
though this will increase code
size.
Renamed code:
l0    ld  a0,x(i)
l1    ai  a1,a0,1
l2    ai  a2,a1,1
l3    ai  a3,a2,1
l4    st  a3,x(i)
Prolog & Epilog
[Figure: code layout of a software-pipelined loop whose iterations consist of Stages A to E. Successive iterations (iter 1, iter 2, …, iter x-1, iter x) start II cycles apart. Prologue: fill pipeline. Kernel: steady state. Epilogue: drain pipeline.]
1. Current iteration when entering the kernel is 5.
2. I(Stage A) = 0, that is, Stage A executes in the same iteration as in the original loop.
3. I(Stage B) = 1, i.e. Stage B is always delayed to the next iteration.
4. Prolog: StageA1; StageB1, StageA2; StageC1, StageB2, StageA3; …
Prolog & Epilog generation
• Prolog:
for k = 0 to max_n(I(n)) - 1
lay out the kernel replacing all
n s.t. I(n) > k by NO-OP
• Epilog:
for k = 1 to max_n(I(n))
lay out the kernel replacing all
n s.t. I(n) < k by NO-OP
• Compact both using list scheduling.
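The two loops can be sketched as follows, using the three-cycle kernel of the fld/fadd/fst example, where only fst has I = 1. The kernel is encoded as a list of cycles, each a list of instruction names; this encoding and the function name are assumptions. Branch retargeting, address adjustment and the final compaction step are not modelled.

```python
def prolog_epilog(kernel, I):
    """kernel: list of cycles, each a list of instruction names."""
    max_i = max(I.values())

    def layout(keep):
        # One copy of the kernel with the filtered-out instructions as nops.
        return [[n if keep(n) else "nop" for n in cycle] for cycle in kernel]

    # Prolog copy k keeps only instructions with I(n) <= k.
    prolog = [layout(lambda n, k=k: I[n] <= k) for k in range(max_i)]
    # Epilog copy k keeps only instructions with I(n) >= k.
    epilog = [layout(lambda n, k=k: I[n] >= k) for k in range(1, max_i + 1)]
    return prolog, epilog

kernel = [["fld", "ai"], ["comp"], ["fst", "ble", "fadd"]]
I = {"fld": 0, "ai": 0, "comp": 0, "fst": 1, "ble": 0, "fadd": 0}
prolog, epilog = prolog_epilog(kernel, I)
print(prolog)
print(epilog)
```

The single prolog copy drops only fst, and the single epilog copy keeps only fst, which has the same shape as the p1 to p3 and e1 to e3 code shown earlier.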
Recurrences
• Given a recurrence (n1, n2, …, nk):
$$L_k(S) \ge \frac{\sum_{i=1}^{k} \mathrm{delay}(n_i)}{\sum_{i=1}^{k} \mathrm{cross}(n_i, n_{i+1})}$$
where n_{k+1} is taken to be n_1.
– The right-hand side is called the slope of the
recurrence. The numerator is the number of cycles it
takes to complete all the computations of the
recurrence; the denominator is the number of
iterations available to do this.
– Over all recurrences c:
$$L_k(S) \ge \max_c \frac{\sum_{i=1}^{k} \mathrm{delay}(n_i)}{\sum_{i=1}^{k} \mathrm{cross}(n_i, n_{i+1})}$$
Kernel Scheduling – General Case
1. Compute MII to be the
maximum of the resource
constraint and the maximum
slope.
2. II = MII
3. Remove an edge from every
recurrence.
4. Schedule(II) using the simple
kernel scheduling algorithm.
5. If failed (dependency of any
removed edge is violated),
increase II and go to 4.
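Step 1 can be sketched as below: the resource bound is the largest per-type ratio of instructions to units, rounded up, and each recurrence contributes its slope. Rounding the slope up to a whole number of cycles is an assumption made here because II is an integer; the function names and the two-instruction recurrence (x, y) are hypothetical.

```python
from collections import Counter
from math import ceil

def resource_bound(unit_type, units):
    """Largest ceil(#t / m_t) over all unit types t that are used."""
    need = Counter(unit_type.values())            # #t per type
    return max(ceil(need[t] / units[t]) for t in need)

def slope(rec, delay, cross):
    """rec: (n1, ..., nk); cross maps each edge (ni, ni+1), wrapping nk -> n1."""
    k = len(rec)
    total_delay = sum(delay[n] for n in rec)
    total_cross = sum(cross[(rec[i], rec[(i + 1) % k])] for i in range(k))
    return total_delay / total_cross

def mii(unit_type, units, recurrences, delay, cross):
    bound = resource_bound(unit_type, units)
    for rec in recurrences:
        bound = max(bound, ceil(slope(rec, delay, cross)))
    return bound

unit_type = {"l0": "mem", "l1": "int", "l2": "int", "l3": "int", "l4": "mem"}
units = {"mem": 2, "int": 3}
delay = {"x": 2, "y": 3}
cross = {("x", "y"): 0, ("y", "x"): 1}
print(resource_bound(unit_type, units))
print(mii(unit_type, units, [("x", "y")], delay, cross))
```

Steps 3 to 5 would then call a kernel scheduler with II starting at this value and increase it until the removed edges are satisfied.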
Kernel Scheduling - Conclusions
• Handling control flow is difficult. May use
hardware support for predicated execution
or handling the “control flow regions” as
black boxes.
• Increased register pressure may limit its use
to single basic block inner loops anyway.
• Benefits from unrolling with renaming.
Vector Unit Scheduling
• A vector instruction involves the execution
of many scalar instructions
• Much of the benefit from the pipelining is
already achieved
• Still, something can be done
Chaining
• Chaining:
vload   t1, a
vload   t2, b
vadd    t3, t1, t2
vstore  t3, c
• Two load units
• Each operation takes 64 cycles
• 192 cycles without chaining
• 66 cycles with chaining
• Proximity within instructions required for hardware to
identify opportunities for chaining
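The two cycle counts can be reproduced with a toy timing model. It assumes that independent operations in the same dependence stage run in parallel, that without chaining each stage waits for the previous one to finish, and that with chaining each further stage adds a single cycle. The model and its names are assumptions for illustration, not a description of any particular vector unit.

```python
def stages(ops):
    """ops: list of (dest, sources) in program order; returns the number
    of dependent stages in the sequence."""
    level = {}
    for dest, sources in ops:
        level[dest] = 1 + max((level.get(s, 0) for s in sources), default=0)
    return max(level.values())

def vector_time(ops, length=64, chained=False):
    n = stages(ops)
    if chained:
        return length + (n - 1)   # one extra cycle per chained stage
    return n * length             # each stage runs to completion first

ops = [
    ("t1", ["a"]),         # vload  t1, a
    ("t2", ["b"]),         # vload  t2, b
    ("t3", ["t1", "t2"]),  # vadd   t3, t1, t2
    ("c", ["t3"]),         # vstore t3, c
]
print(vector_time(ops), vector_time(ops, chained=True))
```

The two loads share a stage because there are two load units, which gives three stages in total.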
Chaining rearranging
vload  a,x(i)
vload  b,y(i)
vadd   t1,a,b
vload  c,z(i)
vmul   t2,c,t1
vmul   t3,a,b
vadd   t4,c,t3
• Rearranging:
vload  a,x(i)
vload  b,y(i)
vadd   t1,a,b
vmul   t3,a,b
vload  c,z(i)
vmul   t2,c,t1
vadd   t4,c,t3
• Machine: 2 load pipes, 1 addition pipe,
1 multiplication pipe
Instruction fusion
vload  a,x(i)
vload  b,y(i)
vadd   t1,a,b
vload  c,z(i)
vmul   t2,c,t1
vmul   t3,a,b
vadd   t4,c,t3
Instruction fusion – cont.
After fusion:
vload  a,x(i)
vload  b,y(i)
vadd   t1,a,b
vmul   t3,a,b
vload  c,z(i)
vmul   t2,c,t1
vadd   t4,c,t3
The End!
