Compile-time Copy Elimination

Register Allocation and Spilling
via Graph Coloring
G. J. Chaitin
IBM Research, 1982
Motivation
Before the register allocation phase, the compiler assumes that there is an unlimited number of general-purpose registers
The symbolic registers must be mapped to real registers in a way that avoids conflicts
Symbolic registers that cannot be mapped to real registers must be spilled to memory
We need an algorithm to map registers with minimal spilling cost
Paper Overview
Register allocation overview
Subsumption algorithm
Interference graph coloring algorithm
Spilling algorithm

Register Allocation Steps
1. Determine which registers are live at any point in the intermediate language (IL) program
2. Build a register interference graph
– Nodes represent symbolic registers
– Edges represent a conflict between symbolic registers
3. Subsumption: eliminate unnecessary register copies
4. Find a 32-coloring of the interference graph
5. Decide which registers to spill if necessary
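Steps 1 and 2 can be sketched for straight-line code as a single backward pass. This is a minimal sketch, not the paper's implementation: the instruction encoding `(dest, sources, is_copy)` and the function name `build_interference` are assumptions made for illustration, and control flow is ignored. A register being defined interferes with every register live after the instruction, except that the target of a copy is not made to interfere with its source.

```python
def build_interference(instrs, live_out=()):
    """Return the set of interference edges for straight-line code.

    instrs: list of (dest, sources, is_copy); dest is None for a pure use.
    live_out: registers still live after the last instruction.
    Both encodings are assumptions of this sketch.
    """
    live = set(live_out)
    edges = set()
    # Walk backwards so that `live` always holds the registers live
    # immediately after the instruction being processed.
    for dest, sources, is_copy in reversed(instrs):
        if dest is not None:
            for other in live:
                if other == dest:
                    continue
                if is_copy and other in sources:
                    continue  # a copy does not make source and target conflict
                edges.add(frozenset((dest, other)))
            live.discard(dest)
        live.update(sources)
    return edges


# Instruction list taken from the "3-coloring example" slide.
example = [
    ("A", (), False), ("B", (), False), ("C", (), False),
    (None, ("A",), False),
    ("D", (), False),
    (None, ("B",), False), (None, ("C",), False), (None, ("D",), False),
]
edges = build_interference(example)
print(sorted("".join(sorted(e)) for e in edges))
# ['AB', 'AC', 'BC', 'BD', 'CD']
```

A and D never appear in an edge together because A is dead before D is defined, so the two can share a real register.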
Subsumption
If the source and destination of a register copy do not interfere, they may be coalesced into a single node
For each register copy in the IL, determine whether the registers interfere
If not, coalesce the two nodes into one
After the first pass, rewrite the IL code
Repeat until no more coalescing is possible
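The coalescing loop above can be sketched as follows. This is an illustrative sketch under stated assumptions: registers are plain strings, merged nodes are named by concatenating the old names, and the function name `coalesce` is hypothetical. The sketch only updates the graph; it does not rewrite the IL. The input edges are worked out by hand from the instruction list on the Subsumption Example slide.

```python
def coalesce(edges, copies):
    """Merge non-interfering copy pairs until nothing changes.

    edges: set of frozenset({r1, r2}) interference edges.
    copies: list of (dest, src) register copies found in the IL.
    Returns (new_edges, names) where names maps each merged register
    to the name of the node that now represents it.
    """
    edges = set(edges)
    alias = {}

    def find(reg):
        # Follow merges to the node currently representing `reg`.
        while reg in alias:
            reg = alias[reg]
        return reg

    changed = True
    while changed:
        changed = False
        for dest, src in copies:
            a, b = find(dest), find(src)
            if a == b or frozenset((a, b)) in edges:
                continue  # already merged, or the two registers interfere
            merged = "".join(sorted(a + b))  # naming scheme is an assumption
            alias[a] = alias[b] = merged
            edges = {frozenset(merged if n in (a, b) else n for n in e)
                     for e in edges}
            changed = True
    return edges, {reg: find(reg) for reg in alias}


# A=1; B=A; B=B+1; C=B; D=A; then C and D are used.
edges = {frozenset("AB"), frozenset("AC"), frozenset("CD")}
copies = [("B", "A"), ("C", "B"), ("D", "A")]
new_edges, names = coalesce(edges, copies)
print(names["A"], names["B"])  # AD BC
```

The copy B=A survives because A is still live while B is modified, so the two nodes interfere and cannot be merged.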
Subsumption Example
Instructions    Live    Dead
A = 1           A
B = A           B
B = B + 1
C = B           C       B
D = A           D       A
…               C, D
[Figure: interference graph with nodes A, B, C, D]
Subsumption Example
Instructions    Live      Dead
AD = 1          AD
BC = AD         BC
BC = BC + 1
…               AD, BC
[Figure: interference graph with nodes AD, BC]
Finding a 32-Coloring
Each symbolic register is assigned a color representing a real register
If no adjacent nodes have the same color, then the coloring succeeds
Assume that G has a node N with degree < 32
Then G is 32-colorable iff the reduced graph from which N and all its edges have been omitted is 32-colorable
The algorithm removes nodes of degree < 32 one at a time until all nodes have been removed
The algorithm fails if nodes remain and none of them has degree < 32
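The node-removal argument can be sketched with a generic number of colors k in place of 32. This is a minimal sketch rather than the paper's code: the function name `color_graph` and the choice to assign colors in reverse removal order are assumptions of the sketch. The example graph is the one implied by the 3-coloring example's instruction list, worked out by hand.

```python
def color_graph(nodes, edges, k):
    """Try to k-color a graph by repeatedly removing nodes of degree < k.

    Returns a dict node -> color (0..k-1), or None if at some point
    nodes remain and none of them has degree < k.
    """
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    remaining = {n: set(neigh) for n, neigh in adj.items()}
    removed = []
    while remaining:
        node = next((n for n in sorted(remaining)
                     if len(remaining[n]) < k), None)
        if node is None:
            return None  # coloring fails; something must be spilled
        for neighbor in remaining.pop(node):
            remaining[neighbor].discard(node)
        removed.append(node)

    # Re-insert nodes in reverse order. Each node had fewer than k
    # neighbors when it was removed, so a free color always exists.
    colors = {}
    while removed:
        node = removed.pop()
        used = {colors[m] for m in adj[node] if m in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
colors = color_graph(nodes, edges, 3)
print(colors["A"] == colors["D"])    # True: A and D share a register
print(color_graph(nodes, edges, 2))  # None: every node has degree >= 2
```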
3-coloring example
Instructions    Live    Dead
A = 1           A
B = 2           B
C = 3           C
? = A                   A
D = 4           D
? = B                   B
? = C                   C
? = D                   D
[Figure: interference graph with nodes A, B, C, D]
Spilling
If the 32-coloring fails, then nodes must be spilled to memory
Spilled registers are stored to memory, then loaded momentarily when their results are needed
Every time spill code is generated, the interference graph must be rebuilt
Usually recoloring succeeds after spilling, but sometimes several passes are required
Spilling
Choosing an optimal set of registers to spill is an NP-complete problem
Heuristic: spill the node that minimizes
– Cost of spilling / Degree of node
Cost of spilling
– Each definition point and each use point counts once, weighted by the execution frequency of that point
In some cases, a spilled node can be reloaded for an extended interval
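The heuristic can be sketched as a small selection function. The numbers below are invented for illustration, and the names `spill_cost` and `choose_spill` are hypothetical; the sketch assumes each candidate comes with the estimated execution frequencies of its definition and use points and its degree in the interference graph.

```python
def spill_cost(frequencies):
    """Sum the estimated execution frequency over all def and use points."""
    return sum(frequencies)


def choose_spill(candidates):
    """Pick the register with the smallest cost / degree ratio.

    candidates: {register: (frequencies of its def/use points, degree)}.
    Degrees are assumed positive, which holds for nodes that blocked
    the coloring.
    """
    return min(candidates,
               key=lambda r: spill_cost(candidates[r][0]) / candidates[r][1])


# Hypothetical data: B and C are touched inside a loop that runs ~10 times.
candidates = {
    "A": ([1, 1], 3),        # ratio 2 / 3
    "B": ([1, 10, 10], 3),   # ratio 21 / 3
    "C": ([10, 10], 4),      # ratio 20 / 4
}
print(choose_spill(candidates))  # A
```

Dividing by the degree favors nodes whose removal frees the most neighbors, so a cheap, highly connected register is spilled first.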
Conclusion
The graph coloring and spilling algorithms should produce faster code
The register allocation algorithm is efficient
– Graph coloring takes O(N) time
– But uses O(N^2) space
Compile-time Copy Elimination
Peter Schnorf
Mahadevan Ganapathi
John Hennessy
Stanford, 1993
Motivation
Single-assignment languages simplify dependency checking, which simplifies automatic detection and exploitation of parallelism
But single-assignment languages require a large number of copies
Previous implementations eliminate copies at runtime
Efficiency increases if copies can be eliminated at compile time
Paper Overview
Single-assignment languages
Code generation
Compile-time copy elimination techniques
– Substitution
– Pattern matching
– Substructure sharing
– Substructure targeting
Results – success!
– Eliminated all copies in bubble sort
Single-assignment languages
Functional languages (LISP, Haskell, SISAL)
Simpler dependency checking
– True dependencies – write, read
   b = f(c), a = f(b)
– Anti-dependencies – read, write
   a = f(b), b = f(c)
– Output dependencies – write, write
   a = f(b), a = f(c)
– Aliasing
   caused by pointers, array indexes
To avoid aliasing, all inputs and outputs are passed by value
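The three dependency kinds can be told apart mechanically from what each statement writes and reads. The following is a small sketch; the statement encoding `(written_name, names_read)` and the function name `dependencies` are assumptions for illustration. With single assignment no name is written twice, so only the first kind can occur.

```python
def dependencies(first, second):
    """Classify the dependencies of `second` on `first` (program order).

    Each statement is (written_name, set_of_names_read).
    """
    written1, read1 = first
    written2, read2 = second
    found = []
    if written1 in read2:
        found.append("true")    # write, then read
    if written2 in read1:
        found.append("anti")    # read, then write
    if written1 == written2:
        found.append("output")  # write, then write
    return found


# The three statement pairs from the slide.
print(dependencies(("b", {"c"}), ("a", {"b"})))  # ['true']
print(dependencies(("a", {"b"}), ("b", {"c"})))  # ['anti']
print(dependencies(("a", {"b"}), ("a", {"c"})))  # ['output']
```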
Example – Swap(A,i,j)
Data flow diagram
– Edges transport values
– Simple nodes are operations
Pick any feasible node evaluation order at random
Naïve implementation
– Each edge has its own memory
– Swap uses 5 array copies!
Optimized implementation
– Swap array updates are done in place
[Figure: data flow graph of Swap with nodes Input, AElement, AElement, AReplace, AReplace]
Example: BubbleSort(A)
Compound nodes represent control flow
Loops are implemented using recursion to avoid multiple assignment of the iteration variable
Naïve implementation
– Bubble sort requires O(n^2) array copies
Optimized implementation
– All array updates are done in place
– But parallelism is decreased
Code Generation Overview
Input is from compiler front-end
– IF1: intermediate data-flow graph representation
Code generator eliminates copies
Output is in C
– Compiled into machine code using an optimizing C compiler
Vertical Substitution
If input and output have the same type and size, they can share memory
– Updates are done in place
[Figure: Swap data flow graph (Input, AElement, AElement, AReplace, AReplace) with labels 1 to 4]
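The effect of vertical substitution on Swap can be sketched by contrasting a by-value AReplace with one that reuses its input's storage. This is an illustration only: Python lists stand in for arrays, all function names are hypothetical, and the in-place version is valid only under the assumption that the compiler has shown the input array is not needed afterwards.

```python
def a_replace_by_value(array, index, value):
    """Naive AReplace: the output edge gets its own memory."""
    result = list(array)  # full array copy
    result[index] = value
    return result


def a_replace_in_place(array, index, value):
    """AReplace after vertical substitution: output shares the input's memory."""
    array[index] = value
    return array


def swap_naive(a, i, j):
    x, y = a[i], a[j]                   # the two AElement nodes
    b = a_replace_by_value(a, i, y)     # copies the whole array
    return a_replace_by_value(b, j, x)  # copies it again


def swap_substituted(a, i, j):
    x, y = a[i], a[j]
    b = a_replace_in_place(a, i, y)     # no copy
    return a_replace_in_place(b, j, x)  # no copy


data = [10, 20, 30]
print(swap_naive(data, 0, 2))        # [30, 20, 10]; data is unchanged
result = swap_substituted(data, 0, 2)
print(result, result is data)        # [30, 20, 10] True
```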
Horizontal Substitution
If an output has several destinations, the output edges can share memory
[Figure: Swap data flow graph (Input, AElement, AElement, AReplace, AReplace) with labels 1 to 4]
Horizontal and Vertical Substitution
Horizontal and vertical substitution can interfere with each other
– This happens when a node along the substitution chain modifies the shared object before its last use
Edges can be marked as read-only if they are shared and this is not the last use
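The interaction can be sketched by counting how many updates are forced to copy under a given evaluation order. This is a simplified model, not the paper's analysis: node names such as `AElement_i` and the value name `A1` are hypothetical, every listed order is assumed to respect the data dependencies, and an update may work in place only if no later node still uses the same array value.

```python
# node -> (kind, array value it uses). "read" nodes only read the array;
# "update" nodes build a new array value from it.
SWAP = {
    "AElement_i": ("read", "A"),
    "AElement_j": ("read", "A"),
    "AReplace_i": ("update", "A"),
    "AReplace_j": ("update", "A1"),  # A1 is the output of AReplace_i
}


def copies_needed(graph, order):
    """Count updates that cannot be done in place for this order."""
    copies = 0
    for position, node in enumerate(order):
        kind, value = graph[node]
        if kind != "update":
            continue
        # The input edge is read-only if a later node still uses the value.
        still_used = any(graph[later][1] == value
                         for later in order[position + 1:])
        if still_used:
            copies += 1
    return copies


good = ["AElement_i", "AElement_j", "AReplace_i", "AReplace_j"]
bad = ["AElement_j", "AReplace_i", "AElement_i", "AReplace_j"]
print(copies_needed(SWAP, good))  # 0
print(copies_needed(SWAP, bad))   # 1
```

In the second order AReplace_i runs while AElement_i still needs the original array, so the shared edge is read-only at that point and the update has to copy.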
Horizontal and Vertical Substitution
[Figure: two versions of the Swap data flow graph (Input, AElement, AElement, AReplace, AReplace), each with labels 1 to 4]
Interprocedural Substitution
Previous discussion concerned simple nodes that can be analyzed at compiler design time
Information about a function is needed in order to use substitution
– Does the function modify an input?
– Will an input be chained to an output?
Intersubgraph Substitution
Substitution analysis is done for each construct
Same basic principles

Determining the Evaluation Order
Evaluation order can impact efficiency of substitution
Naïve implementation selects the next node to evaluate at random
Hints tell the algorithm which nodes should be evaluated before and after other nodes if possible
Hints are ad hoc?

Pattern Matching
Replace hard-to-optimize pieces of code
Patterns are language-specific
Patterns are detected using “ad hoc” methods

Substructure Sharing
Allow substructures to be referenced without copies
AElement can be treated as a NoOp
Happens after substitution analysis – less important
Same principles as substitution analysis

Substructure Targeting
Allow structures to be built from substructures without copies
Similar to substructure sharing

Results
Compared the optimizations against the naïve implementation
The optimizations eliminate all copies for bubble sort
Informal comparison to a run-time optimizer shows improvements

Conclusions
Substitution, pattern matching and substructure sharing can almost eliminate unnecessary copies in a single-assignment language.
Copy elimination no longer has to be done at run-time.
Single-assignment languages should be more efficient for parallel programs.
