VLIW - University of Houston

COSC3330 Computer Architecture
Lecture 16. VLIW
Instructor: Weidong Shi (Larry), PhD
Computer Science Department
University of Houston
Topic
• VLIW
Example Pipelined ILP Machine
Max Throughput, Six Instructions per Cycle
[Figure: pipeline diagram; each box is one pipeline stage, and latency is given in cycles]
• Two Integer Units, Single Cycle Latency
• Two Load/Store Units, Three Cycle Latency
• Two Floating-Point Units, Four Cycle Latency
• How much instruction-level parallelism (ILP)
required to keep machine pipelines busy?
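One way to answer the question is to count how many operations must be in flight at once. The sketch below is a back-of-the-envelope calculation, assuming every unit is fully pipelined and accepts one new operation per cycle; the names `UNITS`, `issue_width`, and `operations_in_flight` are illustrative, not from the lecture.

```python
# Functional units from the slide: (name, number of units, latency in cycles).
UNITS = [
    ("integer", 2, 1),
    ("load/store", 2, 3),
    ("floating-point", 2, 4),
]

def issue_width(units):
    """Operations that can start in one cycle: one per functional unit."""
    return sum(count for _, count, _ in units)

def operations_in_flight(units):
    """Independent operations needed to keep every pipeline stage occupied.

    Each unit holds `latency` operations at once when fully pipelined.
    """
    return sum(count * latency for _, count, latency in units)

print(issue_width(UNITS))           # 2 + 2 + 2 = 6
print(operations_in_flight(UNITS))  # 2*1 + 2*3 + 2*4 = 16
```

Under these assumptions the machine needs roughly 16 independent operations available at any time, well above its issue width of 6.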
Sequential ISA Bottleneck
[Figure: compilation and execution flow for a sequential ISA]
Sequential source code (fragment shown: a = foo(b); for (i=0, i<)
→ Superscalar compiler: find independent operations, schedule operations
→ Sequential machine code
→ Superscalar processor: check instruction dependencies, schedule execution
VLIW: Very Long Instruction Word
Instruction word slots: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
 Int Op slots feed Two Integer Units, Single Cycle Latency
 Mem Op slots feed Two Load/Store Units, Three Cycle Latency
 FP Op slots feed Two Floating-Point Units, Four Cycle Latency
• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
 Parallelism within an instruction => no cross-operation RAW check
 No data use before data ready => no data interlocks
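The fixed-slot format and the "no RAW within an instruction" guarantee can be sketched as follows. This is a toy model, assuming the six-slot layout shown on this slide; the `Op` record and the `pack` and `has_intra_word_raw` helpers are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Slot order taken from the slide: two integer, two memory, two FP slots.
SLOT_KINDS = ("int", "int", "mem", "mem", "fp", "fp")

@dataclass(frozen=True)
class Op:
    kind: str    # "int", "mem", or "fp"
    dest: str    # destination register name
    srcs: tuple  # source register names

def pack(ops):
    """Place each op in the first free slot of its kind; unused slots stay None (NOP)."""
    word = [None] * len(SLOT_KINDS)
    for op in ops:
        for i, kind in enumerate(SLOT_KINDS):
            if kind == op.kind and word[i] is None:
                word[i] = op
                break
        else:
            raise ValueError(f"no free {op.kind} slot")
    return word

def has_intra_word_raw(word):
    """True if any op reads a register written by another op in the same word."""
    ops = [op for op in word if op is not None]
    return any(a is not b and a.dest in b.srcs for a in ops for b in ops)

word = pack([Op("int", "r1", ("r2", "r3")), Op("mem", "r4", ("r5",))])
print(has_intra_word_raw(word))  # False: the two ops are independent
```

In a real VLIW the compiler performs this check, so the hardware can issue all slots without comparing register names.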
Very-Long Instruction Word (VLIW) Computers
[Figure: the PC indexes Instruction Memory; each fetched word holds several fields of the form Op Rd Ra Rb, all connected to one Register File]
• Instruction word consists of several conventional 3-operand instructions, one for each of the ALUs
• Register file has 3N ports to feed N ALUs. All ALU-ALU communication takes place via the register file.
Why VLIWs?
• Opportunity for much simpler hardware
 Compiler discovers dependencies
 Places the instructions
 Simple encodings
 Potentially lower # of transistors than other designs
 Reduced speculation, Out-of-Order not needed
 Size efficiencies, price, power consumption
• Is this true for Itanium?
VLIW Compiler Responsibilities
The compiler:
• Schedules to maximize parallel execution
• Guarantees intra-instruction parallelism
• Schedules to avoid data hazards
 Typically separates operations with explicit NOPs
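The NOP-padding idea can be sketched with a deliberately simplified scheduler. It assumes a single-issue, in-order instruction stream and the latencies of the example machine from earlier in the lecture; a real VLIW compiler would also reorder operations and fill several slots per word. `LATENCY` and `schedule` are illustrative names.

```python
# Latencies assumed from the example machine earlier in the lecture.
LATENCY = {"int": 1, "mem": 3, "fp": 4}

def schedule(ops):
    """Pad an in-order operation list with explicit NOPs.

    ops: list of (kind, dest, srcs) tuples in program order.
    Returns one entry per cycle: either an op tuple or the string "nop".
    """
    ready_at = {}  # register -> first cycle in which its value may be read
    stream = []
    for kind, dest, srcs in ops:
        earliest = max([ready_at.get(s, 0) for s in srcs], default=0)
        while len(stream) < earliest:
            stream.append("nop")  # no interlocks: the compiler must wait explicitly
        issue_cycle = len(stream)
        stream.append((kind, dest, srcs))
        ready_at[dest] = issue_cycle + LATENCY[kind]
    return stream

load = ("mem", "r1", ("r2",))
add = ("int", "r3", ("r1",))
print(schedule([load, add]))  # load, two NOPs, then the dependent add
```

The load issues in cycle 0 and its result is readable in cycle 3, so two NOPs separate it from the add.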
Design Philosophy: VLIW vs. Superscalar
[Figure: the same normal source code is compiled along two paths that target the same ILP hardware]
• Superscalar: a normal compiler produces RISC object code (fragment shown: IM1 = I-1, IM2 = I-2, IM3 = I-3, T1 = LOAD, T3 = 2*T1); scheduling and operation-independence recognition are done by hardware at run time
• VLIW: a normal compiler plus scheduling and operation-independence recognition done by software at compile time
• The same ILP hardware in both cases
Early VLIW Machines
• Multiflow Trace (1987)
 commercialization of ideas from Fisher’s Yale group, including “trace scheduling”
 available in configurations with 7, 14, or 28 operations/instruction
 28 operations packed into a 1024-bit instruction word
• Cydrome Cydra-5 (1987)
 7 operations encoded in 256-bit instruction word
 rotating register file
Josh Fisher, 2003 Eckert-Mauchly Award:
“In recognition of 25 years of seminal contributions to instruction-level parallelism, pioneering work on VLIW architectures, and the formulation of the Trace Scheduling compilation technique”
Intel/HP EPIC
• Explicitly Parallel Instruction Computing (EPIC)
• A close relative of VLIW (e.g., the compiler holds the key to high performance)
• New Intel architecture (designed from the ground up)
• Not compatible with x86
• 64 bit, IA-64 (not x86-64)
• RISC + Superscalar
An Itanium Instruction Bundle
ld4 r43=[r38]
add r38=16,r38
br.call.sptk b0=printf# ;;
Intel Itanium
• Execution: 6 inst. per clock
• 10 stage pipeline
• 4 ALU, 4 Multimedia ALU, 4 FP (up to 8 FP ops./cycle), 2 Load / Store, 3 Branch units
• 128 64-bit general purpose registers
• 128 82-bit floating-point registers
Intel Itanium ISA
• Itanium Instruction “Bundle” (VLIW)
 128 bits each
 Contains three Itanium instructions (aka syllables)
 Template bits in each bundle specify dependencies both within a bundle as well as between sequential bundles
 A collection of independent bundles forms a “group” (use stops)
Bundle layout (bit positions):
 bits 127-87: Instruction Slot 2
 bits 86-46: Instruction Slot 1
 bits 45-5: Instruction Slot 0
 bits 4-0: Template
• Each Itanium Instruction
 Fixed-length 41 bits long
 Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
 Contains max three 7-bit register specifiers
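The field widths add up to 5 + 3 × 41 = 128 bits, which suggests a simple decoder. The sketch below treats a bundle as a Python integer and assumes the 5-bit template occupies bits 4-0 with the three 41-bit slots above it, numbered from 0 at the low-order end; that numbering, and the helper names, are assumptions of this sketch.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    bundle = template & 0x1F
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle

def decode_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & 0x1F
    slots = [(bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
             for i in range(3)]
    return template, slots

def major_opcode(slot):
    """Bits 40-37 of a 41-bit instruction slot."""
    return (slot >> 37) & 0xF

bundle = encode_bundle(0x02, [1, 2, (0x8 << 37) | 3])
template, slots = decode_bundle(bundle)
print(template, major_opcode(slots[2]))  # 2 8
```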
Intel Itanium ISA
• Each IA-64 instruction is categorized into 6 types and may be executed on one or more execution unit types.
• 4 functional unit categories:
– I unit (integer)
– F unit (floating-point)
– M unit (memory)
– B unit (branch)
• 6 instruction type categories:
– Integer ALU (A-type) executed on M- or I units
– Non-ALU Integer (I-type) executed on I units
– Memory (M-type) executed on M units
– Floating-point (F-type) executed on F units
– Branch (B-type) executed on B units
– Extended (L/X-type) executed on I- or B units
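The type-to-unit mapping above can be written as a lookup table and used to test whether three instructions fit a given template. This is a minimal sketch: it only handles templates built from the letters M, I, F, and B (so MLX is deliberately not modelled), and `UNITS_FOR_TYPE` and `fits_template` are hypothetical names.

```python
# Which unit kinds may execute each instruction type, per the list above.
UNITS_FOR_TYPE = {
    "A": {"M", "I"},
    "I": {"I"},
    "M": {"M"},
    "F": {"F"},
    "B": {"B"},
    "L/X": {"I", "B"},
}

def fits_template(template, instr_types):
    """True if each instruction type can execute on the unit its slot names.

    template: e.g. "MII" or "MI_I" (underscores mark stops and are ignored).
    instr_types: three keys of UNITS_FOR_TYPE, in slot order.
    """
    units = template.replace("_", "").upper()
    if len(units) != 3 or len(instr_types) != 3:
        raise ValueError("expected three slots and three instructions")
    if any(u not in "MIFB" for u in units):
        raise ValueError("only M/I/F/B slots are modelled in this sketch")
    return all(u in UNITS_FOR_TYPE[t] for u, t in zip(units, instr_types))

# A load followed by two ALU adds fits an M-I-I bundle.
print(fits_template("MI_I", ["M", "A", "A"]))  # True
print(fits_template("MII", ["F", "A", "A"]))   # False: F-type needs an F slot
```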
Encoding Instruction Bundle
{ .mii
ld4 r28=[r8]
add r9 = 2,r1;;
add r30= 1,r9
}
MI_I format → template encoded as “02”
• Use “;;” as “stop bit” in assembly code to separate dependent instructions
• Instructions between “;;” belong to the same “instruction group”
– RAW and WAW are not allowed in the same instruction group
• Each instruction slot can represent one functional unit type based on encoding (e.g. slot 0 can be M-unit or B-unit)
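The instruction-group rule can be checked mechanically. The sketch below splits an instruction list at stops and reports RAW and WAW register conflicts inside each group; it ignores the special cases the real architecture allows, and the tuple encoding and function names are illustrative only.

```python
def split_groups(instrs):
    """Split instructions into instruction groups at stops.

    instrs: list of (dests, srcs, stop) tuples in program order, where
    stop is True if the instruction is followed by ';;'.
    """
    groups, current = [], []
    for ins in instrs:
        current.append(ins)
        if ins[2]:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def group_violations(group):
    """Return a list of ("RAW" | "WAW", register) conflicts within one group."""
    written = set()
    problems = []
    for dests, srcs, _ in group:
        problems += [("RAW", s) for s in srcs if s in written]
        problems += [("WAW", d) for d in dests if d in written]
        written.update(dests)
    return problems

# The bundle from this slide: ld4 r28=[r8]; add r9=2,r1;; add r30=1,r9
bundle = [
    (("r28",), ("r8",), False),
    (("r9",), ("r1",), True),   # stop after this instruction
    (("r30",), ("r9",), False),
]
print([group_violations(g) for g in split_groups(bundle)])  # [[], []]
```

Without the stop, the second add would read r9 in the same group that writes it, which is a RAW violation.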
Intel Itanium ISA
• There are 12 basic bundle types: MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB.
• Each basic type has two versions, one with a stop after the third slot and one without
– Without the stop: MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
– With the stop: MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_
Itanium Instruction Example
{ .mii
add r1 = r2, r3
sub r4 = r4, r5;;
shr r1 = r4, r1;;
}
{ .mmi
ld8 r2 = [r1];;
st8 [r1] = r23
tbit p1,p2 = r4, 5
}
{ .mbb
ld8 r45 = [r55]
(p3)br.call b1=func1
(p4)br.cond Label1
}
{ .mfi
st4 [r45] = r6
fmac f1=f2,f3
add r3=r3, 8;;
}
Predication
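[Figure: Traditional Architectures vs. Itanium™ Architecture. Traditional: a cmp feeds a branch that selects the then block or the else block. Itanium: a cmp sets predicates p1 and p2, which guard the then and else instructions.]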
• Converts branches to conditional execution
 Executes multiple paths simultaneously
• Exposes parallelism and reduces critical path
 Better utilizes wider machines
 Reduces mispredicted branches
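If-conversion can be illustrated with a tiny interpreter in which every instruction carries a qualifying predicate and is simply nullified when that predicate is false. This is a toy model, not Itanium semantics in full; the tuple format, the `run` function, and the register names `ra` and `rb` are assumptions made for the example.

```python
def run(program, regs):
    """Execute a straight-line predicated program with no branches.

    program: list of (qp, op, dst, a, b) tuples; qp is the qualifying predicate.
    regs: dict of register values, updated in place and returned.
    """
    preds = {"p0": True}  # p0 is hard-wired true on Itanium
    for qp, op, dst, a, b in program:
        if not preds.get(qp, False):
            continue  # nullified: the instruction has no effect
        if op == "cmp.eq":
            p_true, p_false = dst
            preds[p_true] = regs[a] == regs[b]
            preds[p_false] = not preds[p_true]
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        else:
            raise ValueError(op)
    return regs

# if (ra == rb) r1 = r2 + r3; else r4 = r5 - r6;   (converted, no branch)
IF_CONVERTED = [
    ("p0", "cmp.eq", ("p1", "p2"), "ra", "rb"),
    ("p1", "add", "r1", "r2", "r3"),
    ("p2", "sub", "r4", "r5", "r6"),
]

regs = {"ra": 1, "rb": 1, "r1": 0, "r2": 2, "r3": 3, "r4": 0, "r5": 9, "r6": 4}
print(run(IF_CONVERTED, regs)["r1"])  # 5: only the then-path instruction took effect
```

Both paths are fetched and issued; the predicates decide which results are kept, so there is no branch to mispredict.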
More Examples of Parallel Compare
if (c1 && c2 && c3 && c4)
  r1 = r2 + r3;
else // !c1 || !c2 || !c3 || !c4
  r4 = r5 - r6;
[Figure: flow graph of the c1, c2, c3, c4 tests (cycles 0, 1, 2) leading to the then and else blocks]
Itanium Code
cmp.eq p1,p2 = r0,r0;;
cmp.eq.and.orcm p1,p2 = c1,r0
cmp.eq.and.orcm p1,p2 = c2,r0
cmp.eq.and.orcm p1,p2 = c3,r0
cmp.eq.and.orcm p1,p2 = c4,r0
(p1) add r1=r2,r3
(p2) sub r4=r5,r6
More Examples of Parallel Compare
• Parallel cmp.eq.and or cmp.eq.or write the same values to both predicates
• Use cmp.eq.and.orcm or cmp.eq.or.andcm for writing complementary predicates
 Also called DeMorgan type (for complementary output)
 cmp.ge.and.orcm p6,p7= 80, r4
And Predicate
cmp.eq.and p1,p2= 80, r4
• Usage
 p1 = p1 and (80 == r4?)
 p2 = p2 and (80 == r4?)
• How to initialize p1 and p2
 cmp.eq.unc p1,p2 = r0,r0
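The parallel compare types can be modelled as update rules on the predicate pair. The sketch below encodes the and, or, and.orcm, and or.andcm behaviour as I understand it from the IA-64 architecture, with the qualifying predicate omitted; `parallel_cmp` and `all_true` are illustrative names, and the conditions are passed in as already-evaluated booleans.

```python
def parallel_cmp(ctype, result, p1, p2):
    """Return the new (p1, p2) after one parallel compare.

    result is the outcome of the comparison; each type only ever forces
    predicates in one direction, so several compares can target the same
    predicates in one instruction group.
    """
    if ctype == "and":        # p1 = p1 and result; p2 = p2 and result
        return (p1, p2) if result else (False, False)
    if ctype == "or":         # p1 = p1 or result; p2 = p2 or result
        return (True, True) if result else (p1, p2)
    if ctype == "and.orcm":   # complementary outputs
        return (p1, p2) if result else (False, True)
    if ctype == "or.andcm":   # complementary outputs
        return (True, False) if result else (p1, p2)
    raise ValueError(ctype)

def all_true(conds):
    """p1 = all conditions hold, p2 = its complement (the slide's pattern)."""
    p1, p2 = True, False      # initialised as by comparing r0 with r0
    for c in conds:
        p1, p2 = parallel_cmp("and.orcm", c, p1, p2)
    return p1, p2

print(all_true([True, True, True, True]))   # (True, False)
print(all_true([True, False, True, True]))  # (False, True)
```

Because every and.orcm compare can only clear p1 and set p2, the final values do not depend on the order in which the compares execute.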
VLIW - Compiler Challenges
• Very complex compiler
 Statically predictable branches
 Static disambiguation of memory addresses
 Information unavailable at static compile time
 Interprocedural optimization is difficult
• Code bloat
 Compiler specifies placement of each instruction
 Places NOPs to preserve instruction execution order
 Many NOPs
HW Issues - Scalability
[Figure: the PC drives Instruction Memory 1 through Instruction Memory N; each supplies Op Rd Ra Rb fields to its own Register File 1 through Register File N]
• Register file ports cause these structures to get large.
• One remedy is clustering, in which several functional units share a register file and the compiler orchestrates the movement of data among them
Other Hardware Issues
• Compatibility of code
 Backward compatibility or upgradeability
 Due to exposed implementation details
• Multiflow sold machines from 7-wide to 28-wide
• Each required recompilation of source program