Optimal Polynomial Time Algorithms for Register Assignment

Fernando M. Q. Pereira
University of California, Los Angeles
Presented at the Chinese University of Hong Kong, August 28th, 2007
Background
Register Allocation

Assign physical locations to the
variables in a program.




Registers are fast, but few.
Memory is large, but slow.
Constraints: variables simultaneously
alive must be assigned to different
physical locations.
If there are not enough registers, some
variables must be mapped into memory.
These are called spilled variables.
Spill Free Register Allocation


Instance: program P and K registers
Problem: can each of the variables
of P be mapped to one of the K
registers such that variables
simultaneously alive are given
different registers?
Liveness? Live Range?


A variable is alive if it can be used in the
future.
Live range of a variable is the collection
of program points where it is alive.
1) a := 1
2) b := 2
3) c := a
4) d := b
5) e := c
6) a := d
7) ret a + e
[Figure: live ranges of a, b, c, d and e drawn beside the program.]
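The definition of liveness above can be made concrete for straight-line code with a single backward scan. The Python sketch below is illustrative only: the tuple encoding of instructions and the names `PROGRAM` and `live_out_sets` are assumptions made for this example and do not come from the talk.

```python
# Each instruction is (defined_variable_or_None, [used_variables]).
# This encoding of the seven-line example program is an assumption.
PROGRAM = [
    ("a", []),          # 1) a := 1
    ("b", []),          # 2) b := 2
    ("c", ["a"]),       # 3) c := a
    ("d", ["b"]),       # 4) d := b
    ("e", ["c"]),       # 5) e := c
    ("a", ["d"]),       # 6) a := d
    (None, ["a", "e"]), # 7) ret a + e
]

def live_out_sets(program):
    """Return, for each instruction, the variables alive right after it."""
    live = set()
    result = [None] * len(program)
    # Walk backwards: a variable is alive if some later instruction uses it
    # before it is redefined.
    for i in range(len(program) - 1, -1, -1):
        defined, used = program[i]
        result[i] = set(live)
        live.discard(defined)  # the definition ends the range going backwards
        live.update(used)      # every use makes the variable alive before it
    return result

if __name__ == "__main__":
    for number, live in enumerate(live_out_sets(PROGRAM), start=1):
        print(number, sorted(live))
```

The live range of a variable is then the set of program points whose live set contains it. Two variables that appear in the same set are simultaneously alive.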
Quiz 1

How many registers?
1) a := 1
2) b := 2
3) c := a
4) d := b
5) e := c
6) a := d
7) ret a + e
[Figure: live ranges of a, b, c, d and e.]
a(R1):= 1
b(R2):= 2
c(R1):= a(R1)
d(R2):= b(R2)
e(R3):= c(R1)
a(R1):= d(R2)
ret a(R1)+e(R3)
Is there a general algorithm?
Is this problem in P or NP?
Register Allocation and Graphs

SFRA = Graph coloring [Chaitin81]
a := 1
b := 2
c := a
d := b
e := c
a := d
ret a + e
[Figure: live ranges of the program and its interference graph over a, b, c, d and e.]
SFRA is NP-complete…
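To illustrate the correspondence between spill-free register allocation and graph coloring, the sketch below builds an interference graph from sets of simultaneously alive variables and tests K-colorability by exhaustive search. The live sets are written out by hand for the example program, and the function names are hypothetical. The exhaustive search is exponential and is only a demonstration on a tiny graph, not a practical allocator.

```python
from itertools import combinations, product

# Variables alive together after each instruction of the example program
# (written out by hand for this illustration).
LIVE_SETS = [{"a"}, {"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "e"}, {"a", "e"}]

def interference_edges(live_sets):
    """Two variables interfere if they are alive at the same program point."""
    edges = set()
    for live in live_sets:
        for u, v in combinations(sorted(live), 2):
            edges.add((u, v))
    return edges

def is_k_colorable(vertices, edges, k):
    """Brute-force test: try every assignment of k colors to the vertices."""
    vertices = sorted(vertices)
    for colors in product(range(k), repeat=len(vertices)):
        assignment = dict(zip(vertices, colors))
        if all(assignment[u] != assignment[v] for u, v in edges):
            return True
    return False

if __name__ == "__main__":
    edges = interference_edges(LIVE_SETS)
    variables = set().union(*LIVE_SETS)
    print(sorted(edges))
    print(is_k_colorable(variables, edges, 2), is_k_colorable(variables, edges, 3))
```

For this program the interference graph is a cycle on five vertices, so two registers are not enough and three are.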
Example
Three registers: R1, R2 and R3

a := 1        R1 := 1
b := 2        R2 := 2
c := a        R1 := R1
d := b        R3 := R2
e := c        R2 := R1
a := d        R1 := R3
ret a + e     ret R1 + R2

Coloring: a(R1), b(R2), c(R1), d(R3), e(R2)
Live Range Splitting


Live ranges are split via copy instructions
and/or renaming of variables.
May reduce the degree of the
interference graph.
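(a) Original program:
a := 1
b := 1
  := b
c := 1
  := a
  := c
(b) Live ranges of a, b and c. (c) Interference graph of a, b and c.

(d) Program after splitting the live range of a:
a1 := 1
b := 1
  := b
a2 := a1
c := 1
  := a2
  := c
(e) Live ranges of a1, a2, b and c. (f) Interference graph of a1, a2, b and c.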
Quiz 2

If I can split live ranges, how many
registers?
Original program:
a := 1
b := 2
c := a
d := b
e := c
a := d
ret a + e

After splitting the live range of a:
a1 := 1
b := 2
c := a1
d := b
e := c
a2 := d
ret a2 + e

[Figure: live ranges of both programs.]
Quiz 3

P or NP?




Instance: program P, K registers
Problem: is there a way to split the live
ranges of P so that all its variables can fit
into K registers?
This problem has a polynomial-time solution!
Three independent proofs in 2005:



Philip Brisk, WLS’05
Florent Bouchez, INRIA, Master’s thesis
Sebastian Hack, CC’06
Quiz 4, and a bit of intuition…
Is coloring of circular-arc graphs in P or NP?
[Figure: circular arcs a, b, c, d and e.]

Is coloring of interval graphs in P or NP?
[Figure: intervals a, b, c, d and e.]
Intuition on Live Range Splitting
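[Figure: live ranges and interference graph of a, b, c, d and e before splitting, and of a1, a2, b, c, d and e after splitting the live range of a.]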
SSA-Form: the new hope.



Static Single Assignment [CFR+91].
Intermediate program representation.
Each variable is defined only once.
1) a1 := 1
2) b := 2
3) c := a1
4) d := b
5) e := c
6) a2 := d
7) ret a2 + e
[Figure: live ranges and interference graph of a1, b, c, d, e and a2.]
Polynomial time SFRA





[Brisk05, Bouchez05, Hack06]: the interference graph of SSA-form programs is chordal.
Chordal graphs can be colored in polynomial time.
SFRA has a polynomial-time solution for SSA-form programs.
Any program can be converted to SSA-form.
The SSA-form program never requires more registers than the original program.
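One standard way to color a chordal graph in polynomial time is to visit the vertices in maximum cardinality search order and color greedily. The sketch below shows that technique on the interference graph of the SSA-form example program. The slides do not say which chordal coloring algorithm the cited works use, so this choice, the adjacency-dictionary encoding and the function names are all assumptions.

```python
# Interference graph of the SSA-form example (a path a1-b-c-d-e-a2),
# written as an adjacency dictionary. The encoding is an assumption.
GRAPH = {
    "a1": {"b"},
    "b": {"a1", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d", "a2"},
    "a2": {"e"},
}

def mcs_order(graph):
    """Maximum cardinality search: repeatedly visit the unvisited vertex
    with the most already-visited neighbors."""
    weight = {v: 0 for v in graph}
    order = []
    while weight:
        v = max(sorted(weight), key=weight.get)  # sorted() makes ties deterministic
        order.append(v)
        del weight[v]
        for n in graph[v]:
            if n in weight:
                weight[n] += 1
    return order

def greedy_color(graph, order):
    """Give each vertex the smallest color not used by its colored neighbors."""
    color = {}
    for v in order:
        used = {color[n] for n in graph[v] if n in color}
        color[v] = next(c for c in range(len(graph)) if c not in used)
    return color

if __name__ == "__main__":
    coloring = greedy_color(GRAPH, mcs_order(GRAPH))
    print(coloring, "registers:", len(set(coloring.values())))
```

On a chordal graph, the already-colored neighbors of each vertex in this order form a clique, so the greedy pass never uses more colors than the largest clique requires.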
Quiz 5: RA in basic blocks



A basic block is a sequence of instructions with no branches.
What does the interference graph of an SSA-form basic block look like?
Give a polynomial time algorithm for register assignment in basic blocks.
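One possible answer to the quiz, offered as a sketch rather than as the speaker's solution: inside a basic block each SSA variable is alive over a single interval, so registers can be assigned by scanning the intervals in order of their start points and reusing registers whose intervals have ended. The half-open interval encoding and the name `assign_registers` are assumptions made for this example.

```python
import heapq

# Live intervals [start, end) of the SSA-form example, by instruction number.
# A variable whose last use defines another variable does not overlap with it.
INTERVALS = {
    "a1": (1, 3), "b": (2, 4), "c": (3, 5),
    "d": (4, 6), "e": (5, 7), "a2": (6, 7),
}

def assign_registers(intervals, k):
    """Return a variable -> register map using at most k registers, or None."""
    free = list(range(k))
    heapq.heapify(free)
    active = []  # heap of (end point, register) for intervals still alive
    assignment = {}
    for var, (start, end) in sorted(intervals.items(), key=lambda item: item[1][0]):
        # Release registers whose intervals ended at or before this start.
        while active and active[0][0] <= start:
            _, reg = heapq.heappop(active)
            heapq.heappush(free, reg)
        if not free:
            return None  # more than k variables are alive at this point
        reg = heapq.heappop(free)
        assignment[var] = reg
        heapq.heappush(active, (end, reg))
    return assignment

if __name__ == "__main__":
    print(assign_registers(INTERVALS, 2))
    print(assign_registers(INTERVALS, 1))
```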
Too good, but…

… real computer architectures are a
little too surreal…
There are more things in x86,
Horatio…


The polynomial time register
assignment algorithm is too abstract.
Some computer architectures are
messy:



Pre-colored registers
Registers of different sizes.
Testimony: no publicly available
implementation for x86 after two years.
Pre-colored registers

Some variables must be assigned to
particular registers.

Ex.: calling conventions, division, etc
Function call (PowerPC):
a := 10;
b := 2;
R0 := a;
R1 := b;
call(R0, R1);

Division (x86):
a := 10;
b := 2;
AX := a;
(AL,AH) := DIV AX, b;
d := AL; // quotient
r := AH; // remainder
Quiz 6: pre-coloring extension

Is pre-coloring extension of interval
graphs in P or NP?
easy :)

difficult :(
Pre-coloring extension is NP-complete for interval graphs [Biro92],
and even for unit-interval graphs [Marx06]…
Alias Register Allocation

Aliased registers can be used independently, or in combination.
Ex.: x86, Sun SPARC, MIPS floating point numbers, etc.
Ex.: aliased registers in the Pentium:

32 bits   EAX     EBX     ECX     EDX
16 bits   AX      BX      CX      DX
8 bits    AH AL   BH BL   CH CL   DH DL
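The aliasing in the table above can be modeled by recording which bytes of a register family each name covers: two names alias exactly when their byte sets overlap. The sketch below covers only the four families in the table. The byte-slot model and the names `register_bytes` and `aliases` are assumptions made for illustration.

```python
def register_bytes(name):
    """Return the set of (family, byte index) slots that a register name covers.

    Only the A, B, C and D families from the table are modeled."""
    if len(name) == 3 and name[0] == "E" and name[1] in "ABCD" and name[2] == "X":
        return {(name[1], i) for i in range(4)}   # EAX: all four bytes
    if len(name) == 2 and name[0] in "ABCD":
        if name[1] == "X":
            return {(name[0], 0), (name[0], 1)}   # AX: the low two bytes
        if name[1] == "L":
            return {(name[0], 0)}                 # AL: lowest byte
        if name[1] == "H":
            return {(name[0], 1)}                 # AH: second-lowest byte
    raise ValueError(f"register not modeled: {name}")

def aliases(first, second):
    """Two registers alias if they share at least one byte."""
    return bool(register_bytes(first) & register_bytes(second))

if __name__ == "__main__":
    print(aliases("AX", "AL"), aliases("AH", "AL"), aliases("EAX", "BX"))
```

AH and AL can hold two different variables at the same time, but neither can be used together with AX or EAX.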
Quiz 7: Weighted Coloring
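What is the optimal 1-2-coloring of the graph on the left?
[Figure: a graph on vertices a, b, c, d and e, shown with two colorings, one labeled Shipbuilding and one labeled Alias RA. Labels recovered from the figure: b(0), a(23), b(2), a(01), c(01), c(12), e(1), d(3), e(3), d(4).]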
Alias Register Allocation


Alias Register Allocation is similar to the shipbuilding problem [Gol04, p. 204].
Alias Register Allocation is NP-complete [LPP07] for interval graphs.
And so is the shipbuilding problem...
What can SSA do?
The SSA transformation
is too weak to handle
alias register allocation
and programs with precolored variables.
Register Allocation by Puzzle Solving
Polynomial time 1-2-coloring extension
with live range splitting.
Aliased Register Allocation with
Pre-coloring


Instance: program P containing variables that are either short or long, 2K available registers, plus a partial function φ that associates variables with registers. Long variables are assigned two registers {2i, 2i+1}, 0 ≤ i < K, and short variables are assigned one register.
Problem: is it possible to extend φ so that it constitutes a valid register allocation of P? The register allocator is allowed to split live ranges.
In other words…

Optimal spill free register allocation.


x86, Ultra SPARC, MIPS, PowerPC, … as
far as I know, any register based
architecture.
Heuristics for spilling.
Heuristics for spilling?


Optimal solution for spill free register
allocation.
If it is not possible to find an optimal
register assignment for program P,
variables of P must be stored in
memory.


Finding the minimum number of variables
that must be spilled is NP-complete.
Finding the largest K-colorable induced subgraph of a chordal graph is NP-complete [Yannakakis87].
[PP07] - The Main Ideas




Elementary Programs and Elementary
graphs.
Elementary programs have elementary
interference graphs.
Any well structured program can be
converted to an elementary program.
Each connected component of an
elementary graph is a clique substitution
of P3.
[PP07] - The Main Ideas
Elementary Programs
P is an elementary program if:
1. P is strict
2. P is in static single assignment form
3. For any variable v of P, LR(v) contains at most one program point outside the basic block that contains def(v)
4. If two variables u,v of P interfere, then either def(u) = def(v), or kill(u) = kill(v)
5. If two variables u,v of P interfere, then either LR(u) ⊆ LR(v), or LR(v) ⊆ LR(u)
(a) Strict program (b) Elementary program
Interference graph
Clique Substitution of P3

P3 is a path with three vertices.
[Figure: P3 with its three vertices replaced by an X clique, a Y clique and a Z clique. Example shown: P3[K2, K2, K3].]
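A clique substitution of P3 can be built directly: replace the three vertices of the path by cliques X, Y and Z, connect every X vertex to every Y vertex and every Y vertex to every Z vertex, and leave X and Z unconnected. The sketch below constructs such a graph. The vertex naming scheme and the function name `clique_substitution_p3` are assumptions made for this example.

```python
from itertools import combinations

def clique_substitution_p3(x_size, y_size, z_size):
    """Build P3[K_x, K_y, K_z] and return (vertices, edges).

    Edges are frozensets of two vertex names."""
    xs = [f"x{i}" for i in range(x_size)]
    ys = [f"y{i}" for i in range(y_size)]
    zs = [f"z{i}" for i in range(z_size)]
    edges = set()
    # Each of the three groups is a clique.
    for group in (xs, ys, zs):
        edges.update(frozenset(pair) for pair in combinations(group, 2))
    # The path X - Y - Z: all X-Y and all Y-Z pairs are adjacent, X-Z pairs are not.
    edges.update(frozenset((u, v)) for u in xs for v in ys)
    edges.update(frozenset((u, v)) for u in ys for v in zs)
    return xs + ys + zs, edges

if __name__ == "__main__":
    vertices, edges = clique_substitution_p3(2, 2, 3)  # P3[K2, K2, K3]
    print(len(vertices), "vertices,", len(edges), "edges")
```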
Elementary Graphs

Definition: G is an elementary graph if and only if every connected component of G is a clique substitution of P3.
Theorem: An elementary program has an elementary interference graph.
Aligned 1-2-coloring extension


Instance: Graph G with nodes that are either short or long, 2K available colors, plus a partial function φ that associates nodes with colors. Long nodes are assigned two colors {2i, 2i+1}, 0 ≤ i < K, and short nodes are assigned one.
Problem: is it possible to extend φ so that it constitutes a valid coloring of G?
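To make the problem statement concrete, the sketch below decides small instances of aligned 1-2-coloring extension by exhaustive search. It is a brute-force illustration of the definition and not the polynomial-time puzzle solver presented in this talk. The data encoding, the use of `phi` for the partial coloring and the function names are assumptions, and the pre-coloring itself is assumed to be consistent.

```python
from itertools import product

def candidate_colors(is_long, k):
    """Long nodes take an aligned pair {2i, 2i+1}; short nodes take one color."""
    if is_long:
        return [frozenset((2 * i, 2 * i + 1)) for i in range(k)]
    return [frozenset((c,)) for c in range(2 * k)]

def extend_coloring(nodes, edges, k, phi):
    """Try to extend the partial coloring phi to all nodes.

    nodes: dict mapping node name to "short" or "long"
    edges: iterable of (u, v) pairs
    phi:   dict mapping pre-colored nodes to frozensets of colors
    Returns a full coloring, or None if no extension exists."""
    free = [n for n in nodes if n not in phi]
    options = [candidate_colors(nodes[n] == "long", k) for n in free]
    for choice in product(*options):
        coloring = dict(phi)
        coloring.update(zip(free, choice))
        # Adjacent nodes must not share any color.
        if all(coloring[u].isdisjoint(coloring[v]) for u, v in edges):
            return coloring
    return None

if __name__ == "__main__":
    nodes = {"A": "long", "b": "short", "c": "short"}
    edges = [("A", "b"), ("A", "c"), ("b", "c")]
    print(extend_coloring(nodes, edges, 2, {"b": frozenset((1,))}))
    print(extend_coloring(nodes, edges, 2, {"b": frozenset((1,)), "c": frozenset((2,))}))
```

In the second call the two pre-colored short nodes block both aligned pairs, so the long node cannot be colored even though four colors are available.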
Graph Hierarchy
The Puzzles
The Board:
The Pieces:
From graphs to puzzles








Given PX,Y,Z we build a puzzle:
Vertex → piece
Color → column
X-clique → upper row
Y-clique → both rows
Z-clique → lower row
Pre-coloring → some pieces are already on the board
Theorem: Aligned 1-2-coloring extension for
clique substitutions of P3 and puzzle solving are
equivalent under linear-time reductions
Rules, Patterns and matches
match
Don’t match
Example Program
Our Solution
Counter-example 1
Lesson: use a size-2 piece before two size-1 pieces
Counter-example 2
Lesson: statements 7-10 must come before statements 11-14
Counter-example 3
Lesson: statement 15 must come before statements 11-14
Counter-example 4
Lesson: the order of statements 11-14 is crucial
Running Complexity


Theorem: a puzzle is solvable if, and
only if, our program succeeds on the
puzzle.
Our puzzle solving program runs in
linear time.
Spilling




Visit each puzzle once.
If the puzzle is not solvable, then
remove some pieces and try to solve
again.
Each time we remove a piece, we also
remove all other pieces that stem from
the same variable in the original
program.
Spill farthest use first.
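The "farthest use first" rule can be read as: among the candidate variables, spill the one whose next use is farthest from the current instruction. The sketch below implements that reading. How the actual allocator measures distance is not stated on the slide, so the per-instruction use sets and both function names are assumptions.

```python
def next_use_distance(var, uses_per_instruction, position):
    """Number of instructions from `position` to the next use of `var`.

    Returns infinity if the variable is never used again."""
    for distance, used in enumerate(uses_per_instruction[position:]):
        if var in used:
            return distance
    return float("inf")

def choose_spill(candidates, uses_per_instruction, position):
    """Pick the candidate whose next use is farthest away."""
    return max(
        sorted(candidates),  # sorted() makes ties deterministic
        key=lambda v: next_use_distance(v, uses_per_instruction, position),
    )

if __name__ == "__main__":
    uses = [{"a"}, {"b"}, {"c"}, {"a"}]
    print(choose_spill({"a", "b", "c"}, uses, 0))
    print(choose_spill({"a", "b", "c"}, uses, 1))
```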
Experimental Results

Puzzle solver has been implemented in
the LLVM[CV04] framework.



Compile C programs to x86 target.
Over one million lines of code compiled!
We have compared our allocator with LLVM’s default algorithm and with a well-known graph coloring heuristic.
Benchmarks
Benchmark              LoC       Asm       btcode
ASCI Purple:smg2000    74,875    73,039    303,037
SPEC2000:175.vpr       70,253    52,917    173,475
SPEC2000:188.ammp      54,335    35,567    149,245
MallocBench:expresso   52,853    45,041    250,770
SPEC2000:197.parser    49,388    32,849    163,025
SPEC2000:164.gzip      39,157    8,130     46,188
… (six more)           …         …         …
Total                  409,540   286,900   1,345,898
Types of Puzzles
Number of Iterations
Benchmark              Puzzles   Avg    Max   Once
ASCI Purple:smg2000    52,791    1.33   8     33,822
SPEC2000:175.vpr       47,276    1.10   10    45,575
SPEC2000:188.ammp      33,428    1.09   9     28,515
MallocBench:expresso   43,791    1.06   3     38,925
SPEC2000:197.parser    30,868    1.05   4     28,992
SPEC2000:164.gzip      7,840     1.06   3     6,718
… (six more)           …         …      …     …
Total                  251,428   1.13   10    213,411
Execution Time of Generated Code
Data normalized with respect to GCC -O2.
Conclusion



If you want to do register allocation for
the Pentium, your problem is to solve a
collection of puzzles.
Fast compilation time, competitive code
quality.
Many possible directions for future
research.