Combining Scheduling and Register Allocation

COMP 412, Fall 2010
The Last Lecture
Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 412 at Rice University have explicit permission to make copies
of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit
educational purposes, provided this copyright notice is preserved.
Combining Scheduling & Allocation
Sometimes, combining two optimizations can produce solutions
that cannot be obtained by solving them independently.
• Requires bilateral interactions between optimizations
  — Click and Cooper, “Combining Analyses, Combining Optimizations,” TOPLAS 17(2), March 1995   (SCCP)
• Combining two optimizations can be a challenge

Scheduling & allocation are a classic example
• Scheduling changes variable lifetimes
• Renaming in the allocator changes (false) dependences
• Spilling changes the underlying code

Comp 412, Fall 2010
Combining Scheduling & Allocation
Many authors have tried to combine allocation & scheduling
• Underallocate to leave “room” for the scheduler
— Can result in underutilization of registers
• Preallocate to use all registers
— Can create false dependences
• Solving the problems together can produce solutions that
cannot be obtained by solving them independently
— See Click and Cooper, “Combining Analyses, Combining
Optimizations”, TOPLAS 17(2), March 1995.
In general, these papers try to combine global allocators with
local or regional schedulers — an algorithmic mismatch
Before we go there, a long digression about
how much improvement we might expect …
Iterative Repair Scheduling
The Problem
• List scheduling has dominated the field for 20 years
• Anecdotal evidence both good & bad, little solid evidence
• No intuitive paradigm for how it works
• It works well, but will it work well in the future?
• Is there room for improvement?   (e.g., with allocation?)
Schielke’s Idea
• Try more powerful algorithms from other domains
• Look for better schedules
• Look for understanding of the solution space
This led us to iterative repair scheduling
Iterative Repair

The Algorithm
• Start from some approximation to a schedule   (bad or broken)
• Find & prioritize all cycles that need repair   (tried 6 schemes)
  — Either resource or data constraints
• Perform the needed repairs, in priority order
  — Break ties randomly
  — Reschedule dependent operations, in random order
  — Evaluation function on a repair can reject the repair   (try another)
• Iterate until the repair list is empty
• Repeat this process many times to explore the solution space
  — Keep the best result!

Iterative repair works well on many kinds of scheduling problems, e.g.,
scheduling cargo for the space shuttle. Typical problems in the literature
involve 10s or 100s of repairs; we used it with millions of repairs.

Randomization & restart is a fundamental theme of our recent work.
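The repair loop can be sketched in Python. This is a minimal illustration under stated assumptions, not Schielke's implementation: it models issue width as the only resource, repairs an op by moving it to the earliest feasible cycle, and relies on random restarts; the function and parameter names are invented for the example.

```python
import random

def iterative_repair(ops, deps, latency, width, restarts=20, seed=0):
    """Iterative-repair scheduling sketch (all names are illustrative).

    ops     : list of operation ids
    deps    : op -> list of predecessors (data dependences)
    latency : op -> latency in cycles
    width   : ops that may issue per cycle (the only resource modeled)
    """
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        # Start from a broken approximation: every op in cycle 0.
        sched = {op: 0 for op in ops}
        for _ in range(10_000):                 # safety bound on repairs
            broken = [op for op in ops
                      if violated(op, sched, deps, latency, width)]
            if not broken:
                break
            rng.shuffle(broken)                 # break ties randomly
            op = broken[0]
            # Repair: earliest cycle satisfying op's data constraints,
            # then slide past any cycles that are already full.
            t = max([sched[p] + latency[p] for p in deps.get(op, [])],
                    default=0)
            while sum(1 for o in ops
                      if o != op and sched[o] == t) >= width:
                t += 1
            sched[op] = t
        else:
            continue                            # did not converge; restart
        length = max(sched[op] + latency[op] for op in ops)
        if best is None or length < best[0]:
            best = (length, dict(sched))        # keep the best result
    return best

def violated(op, sched, deps, latency, width):
    """An op needs repair if a data or resource constraint is broken."""
    if any(sched[op] < sched[p] + latency[p] for p in deps.get(op, [])):
        return True
    return sum(1 for o in sched if sched[o] == sched[op]) > width
```

The restarts explore different random tie-breaks, which is what the study below does at much larger scale.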
Iterative Repair Scheduling
How does iterative repair do versus list scheduling?
• Found many schedules that used fewer registers   (a hopeful sign for this lecture)
• Found very few faster schedules
• Were disappointed with the results
• Began a study of the properties of scheduling problems
Iterative repair, itself, doesn’t justify the additional costs
• Can we identify schedulers where it will win?
• Can we learn about the properties of scheduling problems?
— And about the behavior of list scheduling ...
Instruction Scheduling Study
Methodology
• Looked at blocks & extended blocks in benchmark programs
• Used his RBF algorithm & tested for optimality   (simple tests: holes in the schedule? delays on the critical path?)
• If non-optimal, used IR to find its best schedule
• Checked these results against an IP formulation using CPLEX
The Results
• List scheduling does quite well on a conventional uniprocessor¹
  — Over 92% of blocks scheduled optimally for speed
  — Over 73% of extended blocks scheduled optimally for speed
• CPLEX had a hard time with the easy blocks
  — Too many optimal solutions to investigate

¹ These results were obtained with code from benchmark programs. Recall,
from the local scheduling lecture, that RBF generated optimal schedules
for 80% of the randomly generated blocks.
Back to today’s subject
Combining Allocation & Scheduling
The Problem
• Well-understood that the problems are intricately related
• Previous work under-allocates or under-schedules
— Except Goodman & Hsu
Our Approach
• Formulate an iterative repair framework
— Moves for scheduling, as before
— Moves to decrease register pressure or to spill
• Allows fair competition in a combined attack
Grows out of search for novel techniques from other areas
Combining Allocation & Scheduling
The Details
• Run the IR scheduler & keep the schedule with the lowest demand for
registers   (register pressure)
• Start with an ALAP schedule rather than an ASAP schedule
• Reject any repair that increases maximum pressure
• A cycle with pressure > k triggers “pressure repair”
  — Identify ops that reduce pressure & move one
  — A threshold lower than k seems to help
• Ran it against the classic method
  — Schedule, allocate, schedule   (using Briggs’ allocator)
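The pressure computation that drives these repairs can be sketched briefly: derive each value's live range from the schedule and report per-cycle pressure, so any cycle whose pressure exceeds k can be flagged for a pressure repair. The names and the def/use encoding here are illustrative, not Schielke's actual data structures.

```python
def pressure_profile(sched, defs, uses):
    """Per-cycle register pressure for a schedule (illustrative names).

    sched : op -> cycle      defs : op -> value defined (or absent)
    uses  : op -> values read
    A cycle whose pressure exceeds k would trigger a "pressure repair".
    """
    live = {}
    # Walk ops in schedule order so each value's def is seen before its uses.
    for op, cycle in sorted(sched.items(), key=lambda kv: kv[1]):
        v = defs.get(op)
        if v is not None:
            live[v] = [cycle, cycle]             # live range starts at the def
        for u in uses.get(op, ()):
            live[u][1] = max(live[u][1], cycle)  # extend range to this use
    profile = [0] * (max(sched.values()) + 1)
    for lo, hi in live.values():
        for c in range(lo, hi + 1):
            profile[c] += 1
    return profile
```

Rejecting a repair is then just a comparison of `max(profile)` before and after the move.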
Combining Allocation & Scheduling
The Results
• Many opportunities to lower pressure
  — 12% of basic blocks
  — 33% of extended blocks
• These schedules may be faster, too
  — Best case was 41.3%   (procedure)
  — Average case, 16 regs, was 5.4%   (whole applications)
  — Average case, 32 regs, was 3.5%   (whole applications)

Knowing that new solutions exist does not ensure that they are better
solutions! This work confirms years of suspicion, while providing an
effective, albeit nontraditional, technique.

This approach finds faster codes that spill fewer values
• It is competing against a very good global allocator
  — Rematerialization catches many of the same effects

The opportunity is present, but the IR scheduler is still quite slow …
Other approaches in the literature
Balancing Speed and Register Pressure
Goodman & Hsu proposed a novel scheme
• Context: debate about prepass versus postpass scheduling
• Problem: tradeoff between allocation & scheduling
• Solution:
— Schedule for speed until fewer than Threshold registers remain free
— Schedule for registers until more than Threshold registers are free
• Details:
— “for speed” means one of the latency-weighted priorities
— “for registers” means an incremental adaptation of SU scheme
James R. Goodman and Wei-Chung Hsu, “Code Scheduling and Register
Allocation in Large Basic Blocks,” Proceedings of the 2nd International
Conference on Supercomputing, St. Malo, France, 1988, pages 442-452.
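The Goodman–Hsu switch can be illustrated in a few lines of Python. This is a hedged sketch of the idea, not their code: `cp_priority`, `uses_freed`, and `defs_added` are invented names for the latency-weighted priority and the bookkeeping an incremental SU-style pressure priority would need.

```python
def pick_op(ready, pressure, threshold, cp_priority, uses_freed, defs_added):
    """Sketch of the Goodman-Hsu threshold switch (hypothetical helpers).

    ready       : set of ops ready to issue
    pressure    : current count of live values
    cp_priority : op -> latency-weighted (critical-path) priority
    uses_freed  : op -> registers this op's issue would free
    defs_added  : op -> new values it defines (usually 1)
    """
    if pressure < threshold:
        # Schedule for speed: pick the op on the longest latency path.
        return max(ready, key=lambda op: cp_priority[op])
    # Schedule for registers: prefer ops whose net effect lowers pressure.
    return max(ready, key=lambda op: uses_freed[op] - defs_added[op])
```

The scheduler calls this at each issue step, so the priority function flips back and forth as pressure crosses the threshold.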
Local Scheduling & Register Allocation
List scheduling is a local, incremental algorithm
• Decisions made on an operation-by-operation basis
• Use local (basic-block level) metrics
Need a local, incremental register-allocation algorithm
• Best’s algorithm, called “bottom-up local” in EaC
— To free a register, evict the value with furthest next use
• Uses local (basic-block level) metrics
Combining these two algorithms leads to a fair, local algorithm
for the combined problem
— Idea is due to Dae-Hwan Kim & Hyuk-Jae Lee
— Can use a non-local eviction heuristic   (a new twist on Best’s alg.)

See Dae-Hwan Kim and Hyuk-Jae Lee, “Integrated instruction scheduling and
fine-grain register allocation for embedded processors,” LNCS 4017, pages
269-278, July 2006 (6th Int’l Workshop on Embedded Computer Systems:
Architectures, Modeling, and Simulation (SAMOS 2006), Samos, Greece).
Paraphrasing from the local scheduling lecture …
Original Code for Local List Scheduling
Cycle  1
Ready  leaves of D
Active  Ø
while (Ready  Active  Ø)
if (Ready  Ø) then
remove an op from Ready
S(op)  Cycle
Active  Active  op
Cycle  Cycle + 1
update the Ready queue
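A runnable Python transcription of this pseudocode may help. It issues one op per cycle for simplicity (a real scheduler issues up to the machine's width and pops the Ready queue in priority order) and represents the dependence graph D as a predecessor map.

```python
def list_schedule(ops, deps, latency):
    """Forward local list scheduling, following the slide's pseudocode.

    deps : op -> list of predecessors in the dependence graph D.
    Issues one op per cycle for simplicity; returns S, op -> issue cycle.
    """
    cycle = 1
    S = {}
    ready = {op for op in ops if not deps.get(op)}   # leaves of D
    active = set()
    while ready or active:                           # Ready ∪ Active ≠ Ø
        if ready:
            op = ready.pop()                         # remove an op from Ready
            S[op] = cycle
            active.add(op)
        cycle += 1
        # Update the Ready queue: retire completed ops, then add each op
        # whose predecessors have all completed.
        for op in [o for o in active if S[o] + latency[o] <= cycle]:
            active.remove(op)
        for op in ops:
            if op not in S and op not in ready and \
               all(p in S and S[p] + latency[p] <= cycle
                   for p in deps.get(op, ())):
                ready.add(op)
    return S
```

The arbitrary `ready.pop()` is where a priority function (critical path, Goodman–Hsu, etc.) would plug in.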
The Combined Algorithm
Cycle  1
Ready  leaves of D
Active  Ø
while (Ready  Active  Ø)
if (Ready  Ø) then
remove an op from Ready
make operands available in
registers
allocate a register for target
S(op)  Cycle
Active  Active  op
Bottom-up local:
Keep a list of free registers
On last use, put register back
on free list
To free register, store value
used farthest in the future
Cycle  Cycle + 1
update the Ready queue
Reload Live on Exit values, if
necessary
Comp 412, Fall 2010
Fast, simple, & effective
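Plugging Best's allocator into the scheduler's issue order can be sketched as below. This illustrates the combined idea, not Kim & Lee's exact algorithm: it runs over a fixed issue order, evicts the value with the farthest next use, and for simplicity counts every eviction as a spill store and ignores live-on-exit reloads.

```python
def allocate_bottom_up(order, uses, defs, k):
    """Best's ("bottom-up local") allocation over a scheduled order.

    order : ops in issue order (e.g., from the list scheduler)
    uses  : op -> values it reads   defs : op -> value it writes (or absent)
    k     : number of physical registers (must cover an op's operand count)
    Returns a spill count; all names here are illustrative.
    """
    def next_use(val, pos):
        # Position of val's next read at or after pos; inf if never used again.
        for i in range(pos, len(order)):
            if val in uses.get(order[i], ()):
                return i
        return float('inf')

    in_regs, spills = set(), 0

    def free_one(pos):
        # To free a register, store the value used farthest in the future.
        nonlocal spills
        victim = max(in_regs, key=lambda v: next_use(v, pos))
        in_regs.remove(victim)
        spills += 1

    for pos, op in enumerate(order):
        for v in uses.get(op, ()):           # make operands available
            if v not in in_regs:
                if len(in_regs) >= k:
                    free_one(pos + 1)
                in_regs.add(v)
        # On last use, a value's register goes back on the free list.
        in_regs = {v for v in in_regs if next_use(v, pos + 1) < float('inf')}
        d = defs.get(op)
        if d is not None:                    # allocate a register for target
            if len(in_regs) >= k:
                free_one(pos + 1)
            in_regs.add(d)
    return spills
```

Because the allocator is local and incremental, the scheduler could also consult the spill count when choosing among ready ops, which is the fair competition the slide describes.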
Notes on the Final Exam
• Closed-notes, closed-book exam
• Exam available Wednesday.
• Three hour time limit
— I aimed for a two-hour exam, but I don’t want you to feel time
pressure. You may take one break of up to fifteen minutes.
• You are responsible for the entire course
— Exam focuses primarily on material since the midterm
— Chapters 5, 6, 7, 8, 9.1, 9.2, 11, 12, & 13
— All the lecture notes
• Return the exam to DH 3080 (Penny Anderson’s office) by
5PM on the last day of exams – December 15, 2010
• If you must leave, you can email me a Word file or a PDF
document.
Schielke’s RBF Algorithm for Local Scheduling
Relying on randomization & restart, we can smooth the behavior
of classic list scheduling algorithms
Schielke’s RBF (Randomized Backward & Forward) algorithm
• Run 5 passes of forward list scheduling and 5 passes of backward
list scheduling
• Break each tie randomly
• Keep the best schedule
  — Shortest time to completion
  — Other metrics are possible   (e.g., shortest time + fewest registers)

In practice, this approach does very well
  — Reuses the dependence graph
My “algorithm of choice” for list scheduling …
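Under the simplifications of one op per cycle and a predecessor-map graph, RBF can be sketched as follows. The backward passes here schedule the reversed dependence graph and are compared only by length; a real implementation would map the backward cycles back onto the forward order before using that schedule.

```python
import random

def rbf(ops, deps, latency, passes=5, seed=0):
    """Sketch of Schielke's RBF: randomized forward and backward list
    scheduling passes, keeping the shortest schedule found."""
    rng = random.Random(seed)
    # Backward scheduling = forward scheduling of the reversed graph.
    rev = {op: [s for s in ops if op in deps.get(s, ())] for op in ops}
    best = None
    for graph in (deps, rev):                    # forward, then backward
        for _ in range(passes):
            S = one_pass(ops, graph, latency, rng)
            length = max(S[op] + latency[op] for op in ops)
            if best is None or length < best[0]:
                best = (length, S)               # keep the best schedule
    return best

def one_pass(ops, deps, latency, rng):
    """Minimal one-op-per-cycle list scheduling with random tie-breaking."""
    cycle, S, active = 1, {}, set()
    ready = [op for op in ops if not deps.get(op)]
    while ready or active:
        if ready:
            op = ready.pop(rng.randrange(len(ready)))  # break ties randomly
            S[op] = cycle
            active.add(op)
        cycle += 1
        active = {o for o in active if S[o] + latency[o] > cycle}
        for op in ops:
            if op not in S and op not in ready and \
               all(p in S and S[p] + latency[p] <= cycle
                   for p in deps.get(op, ())):
                ready.append(op)
    return S
```

Both directions reuse the same dependence graph, which is why the extra passes are cheap in practice.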