Compiling for IA-64
Carol Thompson
Optimization Architect
Hewlett Packard
History of ILP Compilers
• CISC era: no significant ILP
– Compiler is merely a tool to enable use of high-level language, at some performance cost
• RISC era: advent of ILP
– Compiler-influenced architecture
– Instruction scheduling becomes important
• EPIC era: ILP as driving force
– Compiler-specified ILP
Increasing Scope for ILP Compilation
• Early RISC Compilers
– Basic block scope (delimited by branches & branch targets)
• Superscalar RISC and early VLIW Compilers
– Trace scope (single entry, single path)
– Superblocks & Hyperblocks (single entry, multiple path)
• EPIC Compilers
– Composite regions: multiple entry, multiple path
[Figure: basic blocks, traces, superblocks, and composite regions]
Unbalanced and Unbiased Control Flow
• Most code is not well balanced
– Many very small blocks
– Some very large
– Then and else clauses frequently unbalanced, in both number of instructions and path length (see the example below)
• Many branches are highly biased
– But some are not
– Compiler can obtain frequency information from profiling or derive it heuristically
[Figure: example control-flow graph annotated with block execution frequencies]
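• A minimal C sketch of the pattern above (function and variable names are hypothetical): a highly biased branch whose then-clause is far shorter than its else-clause.

/* Hypothetical example: a biased, unbalanced branch. */
void tally(int *counts, int key, int n) {
    if (key >= 0 && key < n) {       /* almost always true            */
        counts[key]++;               /* short then-clause             */
    } else {                         /* rare, much longer else-clause */
        for (int i = 0; i < n; i++)
            counts[i] = 0;
        counts[0] = 1;
    }
}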
Basic Blocks
• Basic Blocks are simple
– No issues with executing unnecessary instructions
– No speculation or predication support required
• But, very limited ILP
– Short blocks offer very little opportunity for parallelism
– Long latency code is unable to take advantage of issue bandwidth in an earlier block
[Figure: example control-flow graph annotated with block execution frequencies]
Traces
• Traces allow scheduling of multiple blocks together
– Increases available ILP
– Long latency operations can be moved up, as long as they are on the same trace (see the sketch below)
• But, unbiased branches are a problem
– Long latency code in slightly less frequent paths can’t move up
– Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)
[Figure: example control-flow graph annotated with block execution frequencies]
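• A minimal C sketch (hypothetical names) of upward code motion along a trace: the long-latency load on the frequent path is hoisted above the biased branch, while the off-trace load stays put.

/* Before: the load feeding the frequent path sits below the branch. */
int trace_before(int *p, int *q, int likely_flag) {
    int r;
    if (likely_flag)          /* strongly biased toward taken       */
        r = *p + 1;           /* long-latency load on the trace     */
    else
        r = *q - 1;           /* off-trace path: cannot be moved up */
    return r;
}

/* After: the on-trace load is hoisted above the branch so its latency
   overlaps earlier work. This assumes p is always safe to dereference;
   otherwise a speculative load and check would be needed. */
int trace_after(int *p, int *q, int likely_flag) {
    int t = *p;
    int r;
    if (likely_flag)
        r = t + 1;
    else
        r = *q - 1;
    return r;
}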
Superblocks and Hyperblocks
• Superblocks and Hyperblocks allow inclusion of multiple important paths
– Long latency code may migrate up from multiple paths
– Hyperblocks may be fully predicated
– More effective utilization of issue bandwidth
• But, requires code duplication (see the tail-duplication sketch below)
• Wholesale predication may lengthen important paths
[Figure: example control-flow graph annotated with block execution frequencies]
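• A minimal C sketch (hypothetical names) of the code duplication mentioned above: the tail after the join is duplicated so that entry, frequent path, and tail form a single-entry superblock.

/* Before: the code after the if has two predecessors (a join). */
int sb_before(int a, int b) {
    int t;
    if (a > 0) t = a * 3;    /* frequent path   */
    else       t = b - 1;    /* infrequent path */
    return t + 7;            /* tail shared by both paths */
}

/* After tail duplication: the frequent path and the tail can now be
   scheduled together as one single-entry region. */
int sb_after(int a, int b) {
    if (a > 0) {
        int t = a * 3;
        return t + 7;        /* superblock: entry + frequent path + tail */
    } else {
        int t = b - 1;
        return t + 7;        /* duplicated tail on the infrequent path   */
    }
}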
Composite Regions
• Allow rejoin from non-Region code
– Wholesale code duplication is not required
– Support full code motion across region
– Allow all interesting paths to be scheduled concurrently
• Nested, less important Regions bear the burden of the rejoin
– Compensation code, as needed
[Figure: example control-flow graph annotated with block execution frequencies]
Predication Approaches
• Full Predication of entire Region
– Penalizes short paths
[Figure: example control-flow graph annotated with block execution frequencies]
On-Demand Predication
[Figure: example control-flow graph annotated with block execution frequencies]
• Predicate (and Speculate) as needed
– reduce critical path(s)
– fully utilize issue bandwidth
• Retain control flow to accommodate unbalanced paths
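• A minimal C sketch (hypothetical names) of on-demand predication: only the short arm is a good candidate for if-conversion, while the long, unbalanced arm keeps its branch so the frequent path is not lengthened.

int on_demand(int x, const int *hist, int n) {
    int r = 0;
    if (x > 0) {
        r = x * 2;            /* short arm: a compiler could if-convert
                                 this into a single predicated instruction */
    } else {
        for (int i = 0; i < n; i++)
            r += hist[i];     /* long arm: wholesale predication would
                                 lengthen the frequent path, so control
                                 flow is retained here                     */
    }
    return r;
}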
Predicate Analysis
• Instruction scheduler requires knowledge of predicate relationships
– For dependence analysis
– For code motion
– …
• Predicate Query System
– Graphical representation of predicate relationships
– Superset, subset, disjoint, …
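• A minimal sketch in C, assuming a simple matrix representation, of the kind of query the scheduler needs; the names and data structure are illustrative, not the actual implementation.

#include <stdbool.h>

#define NPRED 8
static bool disjoint[NPRED][NPRED];   /* filled in from compare instructions */

/* Record that a dual-target compare defined p and q as complements,
   so they can never both be true. */
static void pqs_note_complement(int p, int q) {
    disjoint[p][q] = disjoint[q][p] = true;
}

/* Dependence test used during scheduling: two writes to the same
   register guarded by disjoint predicates do not conflict, so the
   scheduler may reorder or overlap them. */
static bool pqs_defs_conflict(int p, int q) {
    return !disjoint[p][q];
}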
Predicate Computation
• Compute all predicates possibly needed
• Optimize
– to share predicates where possible
– to utilize parallel compares
– to fully utilize dual-targets
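• Hypothetical C source illustrating these opportunities; the comments describe how an IA-64 compiler could map the conditions, not a specific compiler's output.

int checks(int a, int b, int c) {
    /* A single dual-target compare can set p1 = (a < 0) and its
       complement p2, so both arms of the selection share one compare. */
    int r = (a < 0) ? b : c;

    /* A multi-term condition can be evaluated with parallel compares
       that all target the same predicate, rather than a chain of
       dependent compares and branches. */
    if (a == 0 && b == 0 && c == 0)
        r = 1;
    return r;
}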
Predication and Branch Counts
[Bar chart: Normalized Dynamic Branch Counts by benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex), comparing -O, -O w/pred, +O4+P, and +O4 +P w/pred]
• Predication reduces branches
– at both moderate and aggressive opt. levels
Predication & Branch Prediction
[Bar chart: Normalized Mispredict Rates by benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex), comparing -O, -O w/pred, +O4+P, and +O4 +P w/pred]
• Comparable misprediction rate with predication
– despite significantly fewer branches
– so the mean time between mispredicted branches increases
Register Allocation
• Modeled as a graph-coloring problem (see the coloring sketch below)
– Nodes in the graph represent live ranges of variables
– Edges represent a temporal overlap of the live ranges
– Nodes sharing an edge must be assigned different colors (registers)
x = ...
y = ...
= …x
z = ...
= …y
= …z
[Figure: the corresponding interference graph over x, y, z]
Requires Two Colors
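• A toy C sketch of the coloring step for the example above (illustrative only): greedy assignment over an interference matrix in which x overlaps y and y overlaps z.

#include <stdbool.h>
#include <stdio.h>

#define N 3                              /* live ranges: x=0, y=1, z=2 */
static const bool interfere[N][N] = {
    /* x */ {false, true,  false},       /* x overlaps y            */
    /* y */ {true,  false, true },       /* y overlaps both x and z */
    /* z */ {false, true,  false},
};

int main(void) {
    int color[N];
    for (int v = 0; v < N; v++) {
        bool used[N] = {false};
        for (int u = 0; u < v; u++)      /* colors already taken by neighbors */
            if (interfere[v][u])
                used[color[u]] = true;
        int c = 0;
        while (used[c]) c++;             /* pick the lowest free color */
        color[v] = c;
    }
    printf("x->r%d  y->r%d  z->r%d\n", color[0], color[1], color[2]);
    return 0;                            /* prints x->r0  y->r1  z->r0 */
}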
Register Allocation With Control Flow
x = ...
y = ...
branch:
  path 1:  = …y ;  x = ...
  path 2:  z = ... ;  = …z
join:
= …x
[Figure: the corresponding interference graph over x, y, z]
Requires Two Colors
Register Allocation With Predication
x = ...
y = ...
z = ...
= …y
x = ...
= …z
= …x
[Figure: interference graph over x, y, z after if-conversion]
Now Requires Three Colors
Predicate Analysis
x = ...
y = ...
z = ...
= …y
x = ...
= …z
= …x
[Figure: live ranges x, y, z with guarding predicates p0, p1, p2]
p1 and p2 are disjoint: if p1 is TRUE, p2 is FALSE, and vice versa
Register Allocation With Predicate Analysis
x = ...
y = ...
z = ...
= …y
x = ...
= …z
= …x
[Figure: the interference graph recomputed using the predicate relationships]
Now Back to Two Colors
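• A minimal C sketch, under assumed representations, of a predicate-aware interference test: live ranges that overlap in time still do not interfere when their guarding predicates are disjoint, as with the ranges guarded by p1 and p2 above.

#include <stdbool.h>

typedef struct {
    int start, end;     /* positions bounding the live range          */
    int pred;           /* guarding predicate id; 0 = always executed */
} live_range;

/* Stand-in for the predicate query system: in this sketch, predicates
   1 and 2 are the two targets of one compare, hence disjoint. */
static bool pred_disjoint(int p, int q) {
    return (p == 1 && q == 2) || (p == 2 && q == 1);
}

static bool ranges_interfere(live_range a, live_range b) {
    bool overlap = a.start < b.end && b.start < a.end;
    if (!overlap)
        return false;   /* no temporal overlap: never interfere */
    /* Without predicate analysis the allocator would stop here and
       report interference; with it, disjoint guards remove the edge. */
    return !pred_disjoint(a.pred, b.pred);
}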
Effect of Predicate-Aware Register Allocation
• Reduces register requirements for individual procedures by 0% to 75%
– Depends upon how aggressively predication is applied
• Average dynamic reduction in register stack allocation for gcc is 4.7%
Object-Oriented Code
• Challenges
– Small procedures, many indirect (virtual) calls
– Limits size of regions, scope for ILP
– Exception handling
– Bounds checking (Java): inherently serial, must check before executing load or store
• Solutions
– Inlining, for non-virtual functions or provably unique virtual functions
– Speculative inlining for the most common variant
– Dynamic optimization (e.g. Java): make use of dynamic profile
– Speculative execution: guarantees correct exception behavior
– Liveness analysis of handlers
– Architectural support for speculation ensures recoverability
Method Calls
• Barrier between execution streams
• Often, location of called method must be determined at runtime
– Costly “identity check” on object must complete before method may begin
– Even if the call nearly always goes to the same place
– Little ILP
[Figure: resolve target method, dispatch to one of several possible targets, then the call-dependent code]
Speculating Across Method Calls
• Compiler predicts target method
– Profiling
– Current state of class hierarchy
• Predicted method is inlined
– Full or partial
• Speculative execution of called method begins while actual target is determined (see the sketch below)
Speculation Across Method Calls
[Figure: two schedules. Without speculation: resolve target method, call method, then execute the dominant called method or one of the other target methods. With speculation: the dominant called method executes while the target is resolved, and another target method is called only if needed.]
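• A C sketch of the speculative inlining described above (names hypothetical), using a function pointer to stand in for the virtual dispatch: the predicted dominant target is inlined, and the identity check only decides whether its result may be kept.

typedef struct Shape {
    double (*area)(const struct Shape *self);   /* "virtual" method */
    double w, h;
} Shape;

static double rect_area(const Shape *s) { return s->w * s->h; }

double speculative_area(const Shape *s) {
    double speculated = s->w * s->h;   /* inlined body of the predicted
                                          dominant target (rect_area)    */
    if (s->area == rect_area)          /* identity check on the object   */
        return speculated;             /* prediction held                */
    return s->area(s);                 /* rare case: call the real target */
}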
Bounds & Null Checks
• Checks inhibit code motion
• Null checks
x = y.foo;
if( y == null ) throw NullPointerException;
x = y.foo;
• Bounds checks
x = a[i];
if( a == null ) throw NullPointerException;
if( i < 0 || i >= a.length ) throw ArrayIndexOutOfBoundsException;
x = a[i];
Speculating Across Bounds Checks
• Bounds checks rarely fail
x = a[i];
ld.s  t = a[i];
if( a == null ) throw NullPointerException;
if( i < 0 || i >= a.length ) throw ArrayIndexOutOfBoundsException;
chk.s t
x = t;
• Long latency load can begin before checks
Exception Handling
• Exception handling inhibits motion of subsequent code
if( y.foo ) throw MyException;
x = y.bar + z.baz;
Speculation in the Presence of Exception Handling
• Execution of subsequent instructions may begin before exception is resolved
if( y.foo ) throw MyException;
x = y.bar + z.baz;

ld     t1 = y.foo
ld.s   t2 = y.bar
ld.s   t3 = z.baz
add    x = t2 + t3
if( t1 ) throw MyException;
chk.s  x
Dependence Graph for Instruction Scheduling
Source:
if( n < p->count ) {
    (*log)++;
    return p->x[n];
} else {
    return 0;
}
[Figure: dependence graph over the generated instructions]
      add      t1 = 8,p
      ld4      count = [t1]
      cmp4.ge  p1,p2 = n,count
(p1)  ld4      t3 = [log]
(p1)  add      t2 = 1,t3
(p1)  st4      [log] = t2
      mov      out0 = 0
(p1)  ld4      t3 = [p]
      shladd   t4 = n,4,t3
(p1)  ld4      out0 = [t4]
      br.ret   rp
Dependence Graph with Predication & Speculation
• During dependence graph construction, potentially control and data speculative edges and nodes are identified
• Check nodes are added where possibly needed (note that only data speculation checks are shown here)
[Figure: the same dependence graph with check nodes added]
      add      t1 = 8,p
      ld4      count = [t1]
      cmp4.ge  p1,p2 = n,count
(p1)  ld4      t3 = [log]
(p1)  add      t2 = 1,t3
(p1)  st4      [log] = t2
      mov      out0 = 0
(p1)  ld4      t3 = [p]
      chk.a    p
      shladd   t4 = n,4,t3
(p1)  ld4      out0 = [t4]
      chk.a    t4
      br.ret   rp
Dependence Graph with Predication & Speculation
[Figure: the graph re-drawn with the speculated nodes raised; read top to bottom in rough dependence order]
(p1)  ld4      t3 = [p]
(p1)  ld4      t3 = [log]
      add      t1 = 8,p
      shladd   t4 = n,4,t3
(p1)  add      t2 = 1,t3
      ld4      count = [t1]
(p1)  ld4      out0 = [t4]
(p2)  mov      out0 = 0
      cmp4.ge  p1,p2 = n,count
(p1)  st4      [log] = t2
      chk.a    p
      chk.a    t4
      br.ret   rp
• Speculative edges may be violated. Here the graph is re-drawn to show the enhanced parallelism
• Note that the speculation of both writes to the out0 register would require insertion of a copy. The scheduler must consider this in its scheduling
• Nodes with sufficient slack (e.g. writes to out0) will not be speculated
Conclusions
• IA-64 compilers push compiler complexity further
– However, the technology is a logical progression from today’s
– Today’s RISC compilers are more complex, more reliable, and deliver more performance than those of the early days
– Complexity trend is mirrored in both hardware and applications
– Need a balance to maximize benefits from each