CSE P501 – Compiler Construction

Instruction Scheduling
  • Issues
  • Latencies
  • List scheduling

Jim Hogg - UW - CSE - P501
Spring 2014
Instruction Scheduling is . . .

[Figure: instructions a b c d e f g h re-issued in a different order]

• Execute in-order to get the correct answer
• Issue in a new order
  • eg: memory fetch is slow
  • eg: divide is slow
• Overall faster
• Still get the correct answer!

• Originally devised for super-computers
• Now used everywhere:
  • in-order procs - older ARM
  • out-of-order procs - newer x86
• Compiler does the 'heavy lifting' - reduces chip power
Chip Complexity, 1

The following factors make scheduling complicated:

• Different kinds of instruction take different times (in clock cycles) to complete
• Modern chips have multiple functional units
  • so they can issue several operations per cycle
  • "super-scalar"
• Loads are non-blocking
  • ~50 in-flight loads and ~50 in-flight stores
Typical Instruction Timings

  Instruction      Time in Clock Cycles
  int + int        1
  int * int        3
  float + float    3
  float * float    5
  float / float    15
  int / int        30
Load Latencies

[Figure: two cores, each with private L1 and L2 caches, sharing L3 and DRAM]
  • L1 = 64 KB per core
  • L2 = 256 KB per core
  • L3 = 2-8 MB shared

  Access         Latency
  Instruction    ~5 per cycle
  Register       1 cycle
  L1 Cache       ~4 cycles
  L2 Cache       ~10 cycles
  L3 Cache       ~40 cycles
  DRAM           ~100 ns
Super-Scalar
Chip Complexity, 2

• Branch costs vary (branch predictor)
• Branches on some processors have delay slots (eg: Sparc)
• Modern processors have branch-predictor logic in hardware
  • heuristics predict whether branches are taken or not
  • keeps pipelines full

GOAL: Scheduler should reorder instructions to:
  • hide latencies
  • take advantage of multiple function units (and delay slots)
  • help the processor effectively pipeline execution

However, many chips schedule on-the-fly too
  • eg: Haswell out-of-order window = 192 ops
Data Dependence Graph

  Instruction
  a: loadAI  rarp, @a => r1
  b: add     r1, r1   => r1
  c: loadAI  rarp, @b => r2
  d: mult    r1, r2   => r1
  e: loadAI  rarp, @c => r2
  f: mult    r1, r2   => r1
  g: loadAI  rarp, @d => r2
  h: mult    r1, r2   => r1
  i: storeAI r1       => rarp, @a

[Dependence graph: leaves a, c, e, g; edges a→b; b, c→d; d, e→f; f, g→h; h→i (root)]

• read-after-write = RAW = true dependence = flow dependence
• write-after-read = WAR = anti-dependence
• write-after-write = WAW = output-dependence

The scheduler has freedom to re-order instructions, so long as it complies with inter-instruction dependencies.
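To make the three dependence classes concrete, here is a small Python sketch (my own encoding, not course code) that recovers the RAW/WAR/WAW edges from the instruction list above by tracking the last writer and the readers of each register; memory dependences through rarp are ignored:

```python
# Each instruction is (name, registers read, registers written); the tuples
# transcribe the loadAI/add/mult/storeAI sequence from the slide.
INSTRS = [
    ("a", [], ["r1"]),            # loadAI  rarp, @a => r1
    ("b", ["r1"], ["r1"]),        # add     r1, r1   => r1
    ("c", [], ["r2"]),            # loadAI  rarp, @b => r2
    ("d", ["r1", "r2"], ["r1"]),  # mult    r1, r2   => r1
    ("e", [], ["r2"]),            # loadAI  rarp, @c => r2
    ("f", ["r1", "r2"], ["r1"]),  # mult    r1, r2   => r1
    ("g", [], ["r2"]),            # loadAI  rarp, @d => r2
    ("h", ["r1", "r2"], ["r1"]),  # mult    r1, r2   => r1
    ("i", ["r1"], []),            # storeAI r1       => rarp, @a
]

def dependences(instrs):
    """Return (earlier, later, kind) register-dependence edges."""
    edges = []
    last_writer = {}   # reg -> most recent writer
    readers = {}       # reg -> readers since the last write
    for name, reads, writes in instrs:
        for r in reads:
            if r in last_writer:                    # RAW: read of a written value
                edges.append((last_writer[r], name, "RAW"))
            readers.setdefault(r, []).append(name)
        for w in writes:
            for rd in readers.get(w, []):           # WAR: write after read
                if rd != name:
                    edges.append((rd, name, "WAR"))
            if w in last_writer:                    # WAW: write after write
                edges.append((last_writer[w], name, "WAW"))
            last_writer[w] = name
            readers[w] = []
    return edges

print(dependences(INSTRS))
```

The RAW edges come out exactly as drawn in the graph (a→b; b, c→d; d, e→f; f, g→h; h→i); the WAR edge d→e is the anti-dependence on r2 that register renaming removes on the "Scheduling Really Works" slide.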
Scheduling Really Works ...

Original:
  Start Cycle   Instruction
  1             loadAI  rarp, @a => r1
  4             add     r1, r1   => r1
  5             loadAI  rarp, @b => r2
  8             mult    r1, r2   => r1
  10            loadAI  rarp, @c => r2
  13            mult    r1, r2   => r1
  15            loadAI  rarp, @d => r2
  18            mult    r1, r2   => r1
  20            storeAI r1       => rarp, @a

Scheduled:
  Start Cycle   Instruction
  1             loadAI  rarp, @a => r1
  2             loadAI  rarp, @b => r2
  3             loadAI  rarp, @c => r3
  4             add     r1, r1   => r1
  5             mult    r1, r2   => r1
  6             loadAI  rarp, @d => r2
  7             mult    r1, r3   => r1
  9             mult    r1, r2   => r1
  11            storeAI r1       => rarp, @a

Computes a = 2*a*b*c*d

Assumptions: 1 Functional Unit; Load or Store: 3 cycles; Multiply: 2 cycles; Otherwise: 1 cycle

• New schedule uses an extra register, r3: renaming breaks the WAR anti-dependence on r2, so the load of @c can issue early, while the output-dependencies (WAW) are still respected
• Scheduled code finishes in 13 cycles instead of 22
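The finish cycles of the two schedules can be checked with a tiny timing model (a sketch; the 3/2/1-cycle latencies are the slide's assumptions, the list encoding is mine):

```python
# Latency model from the slide: load/store 3 cycles, mult 2, add 1.
DELAY = {"loadAI": 3, "storeAI": 3, "mult": 2, "add": 1}

def finish_cycle(schedule):
    """Last cycle occupied by any instruction: start + delay - 1."""
    return max(start + DELAY[op] - 1 for start, op in schedule)

original = [(1, "loadAI"), (4, "add"), (5, "loadAI"), (8, "mult"),
            (10, "loadAI"), (13, "mult"), (15, "loadAI"), (18, "mult"),
            (20, "storeAI")]
scheduled = [(1, "loadAI"), (2, "loadAI"), (3, "loadAI"), (4, "add"),
             (5, "mult"), (6, "loadAI"), (7, "mult"), (9, "mult"),
             (11, "storeAI")]

print(finish_cycle(original), finish_cycle(scheduled))  # 22 13
```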
Scheduler: Job Description

• The Job
  • Given code for some machine, and latencies for each instruction, reorder to minimize execution time
• Constraints
  • Produce correct code
  • Minimize wasted cycles
  • Avoid spilling registers
  • Don't take forever to reach an answer
Job Description - Part 2

• foreach instruction ins in the dependence graph
  • denote the number of cycles ins takes to execute as ins.delay
  • denote the cycle number in which ins should start as ins.start
  • foreach instruction dep that is dependent on ins
    • ensure ins.start + ins.delay <= dep.start

What if the scheduler makes a mistake? On-chip hardware stalls the pipeline until operands become available: slower, but still correct!
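The ins.start + ins.delay <= dep.start rule is easy to turn into a schedule checker (a sketch; the class and field names simply mirror the slide's notation):

```python
# Hypothetical instruction record mirroring the slide's ins.start / ins.delay.
class Ins:
    def __init__(self, name, delay):
        self.name, self.delay = name, delay
        self.start = None        # cycle assigned by the scheduler
        self.dependents = []     # instructions that consume this result

def schedule_is_legal(instructions):
    """True iff every dependent starts only after its operands are ready."""
    return all(ins.start + ins.delay <= dep.start
               for ins in instructions
               for dep in ins.dependents)

# A 3-cycle load feeding a 1-cycle add:
load, add = Ins("load", 3), Ins("add", 1)
load.dependents.append(add)
load.start, add.start = 1, 4
print(schedule_is_legal([load, add]))   # True  (1 + 3 <= 4)
add.start = 3
print(schedule_is_legal([load, add]))   # False (add issued too early)
```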
Dependence Graph + Timings

  Instruction
  a: loadAI  rarp, @a => r1
  b: add     r1, r1   => r1
  c: loadAI  rarp, @b => r2
  d: mult    r1, r2   => r1
  e: loadAI  rarp, @c => r2
  f: mult    r1, r2   => r1
  g: loadAI  rarp, @d => r2
  h: mult    r1, r2   => r1
  i: storeAI r1       => rarp, @a

[Dependence graph as before, each node annotated with its priority:
 a:13, c:12, b:10, e:10, d:9, g:8, f:7, h:5, i:3]

Assumptions: 1 Functional Unit; Load or Store: 3 cycles; Multiply: 2 cycles; Otherwise: 1 cycle

• The number attached to each node is its latency-weighted path length to the end of the computation
• a-b-d-f-h-i is the critical path
• Can schedule leaves any time - no constraints
• Since a has the longest path, schedule it first; then c; then ...
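Those priorities can be reproduced with a short backward computation over the RAW edges (a Python sketch; the graph and delays are transcribed from the slides):

```python
from functools import lru_cache

# Delays: load/store 3 cycles, mult 2, add 1 (the slide's assumptions).
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# RAW successors: instruction -> instructions that consume its result.
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}

@lru_cache(maxsize=None)
def priority(node):
    """Latency-weighted length of the longest path from node to the root i."""
    return DELAY[node] + max((priority(s) for s in SUCCS[node]), default=0)

print({n: priority(n) for n in "abcdefghi"})
# {'a': 13, 'b': 10, 'c': 12, 'd': 9, 'e': 10, 'f': 7, 'g': 8, 'h': 5, 'i': 3}
```

This matches the slide's annotations (a:13, c:12, ..., i:3) and confirms a-b-d-f-h-i as the critical path: 3 + 1 + 2 + 2 + 2 + 3 = 13 cycles of latency.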
List Scheduling

• Build a precedence graph D
• Compute a priority function over the nodes in D
  • typical: longest latency-weighted path
• Rename registers to remove WAR/WAW conflicts
• Create the schedule, one cycle at a time
  • Use a queue of operations that are Ready
  • At each cycle
    • Choose a Ready operation and schedule it
    • Update the Ready queue
List Scheduling Algorithm

cycle = 1             // clock cycle number
Ready = leaves of D   // ready to be scheduled
Active = {}           // being executed

while Ready ∪ Active ≠ {} do
  foreach ins ∈ Active do
    if ins.start + ins.delay <= cycle then
      remove ins from Active
      foreach successor suc of ins in D do
        if all of suc's operands are available and suc ∉ Ready then
          Ready = Ready ∪ {suc}
        endif
      endforeach
    endif
  endforeach
  if Ready ≠ {} then
    remove the highest-priority instruction, ins, from Ready
    ins.start = cycle
    Active = Active ∪ {ins}
  endif
  cycle++
endwhile
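Putting the pieces together, here is a runnable sketch of the whole algorithm on the lecture's example graph (data-structure names are mine; one instruction issues per cycle, and loads are pipelined/non-blocking as on the earlier slides):

```python
from functools import lru_cache

# The lecture's example: delays (load/store 3, mult 2, add 1) and the
# RAW successor edges of the dependence graph D.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}
PREDS = {n: [p for p in SUCCS if n in SUCCS[p]] for n in SUCCS}

@lru_cache(maxsize=None)
def prio(n):
    """Longest latency-weighted path from n to the end of the computation."""
    return DELAY[n] + max((prio(s) for s in SUCCS[n]), default=0)

def list_schedule():
    start, done = {}, set()
    ready = {n for n in SUCCS if not PREDS[n]}   # leaves of D
    active, cycle = set(), 1
    while ready or active:
        for ins in sorted(active):               # retire finished instructions
            if start[ins] + DELAY[ins] <= cycle:
                active.remove(ins)
                done.add(ins)
        for ins in done:                         # promote newly ready successors
            for suc in SUCCS[ins]:
                if suc not in start and all(p in done for p in PREDS[suc]):
                    ready.add(suc)
        if ready:                                # issue one instruction per cycle
            ins = max(ready, key=prio)
            ready.remove(ins)
            start[ins] = cycle
            active.add(ins)
        cycle += 1
    return start

print(sorted(list_schedule().items(), key=lambda kv: kv[1]))
# a,c,e,b,d,g,f,h,i issued at cycles 1,2,3,4,5,6,7,9,11
```

The result reproduces the hand schedule from the "Scheduling Really Works ..." slide, including the stall at cycle 8 while the final load and multiply are in flight.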
Beyond Basic Blocks

• List scheduling dominates, but moving beyond basic blocks can improve the quality of the code. Possibilities:
  • Schedule extended basic blocks (EBBs)
    • Watch for exit points - they limit reordering or require compensating code
  • Trace scheduling
    • Use profiling information to select regions for scheduling, using traces (paths) through the code