COSC3330 Computer Architecture
Lecture 11. ILP
Instructor: Weidong Shi (Larry), PhD
Computer Science Department
University of Houston
Topic
• ILP
Role Play
• Fetch Unit
• Execution Unit
• Clock
After Spring Break
• ~5 functional units
• 1 clock
• 4-5 instructions
Sequential Program Semantics
• Humans expect “sequential semantics”
 The processor tries to issue an instruction every clock cycle
 But there are dependencies, control hazards, and long-latency instructions
• To achieve performance with minimum effort
 Issue more instructions every clock cycle
 E.g., an embedded system can save power by exploiting instruction-level parallelism and decreasing the clock frequency
Scalar Pipeline (Baseline)
• Five stages: IF, DE, EX, MEM, WB
• Machine Parallelism = D (= 5)
• Issue Latency (IL) = 1
• Peak IPC = 1
[Figure: one instruction enters the five-stage pipeline per execution cycle]
Superpipelined Machine
• 1 major cycle = M minor cycles
• Machine Parallelism = M x D (= 15) per major cycle
• Issue Latency (IL) = 1 minor cycle
• Peak IPC = 1 per minor cycle = M per baseline cycle
• Superpipelined machines are simply deeper pipelines
[Figure: each IF/DE/EX/MEM/WB stage split into M = 3 minor cycles; one instruction issues per minor cycle]
Cost of Pipelining
[Figure only]
Superscalar Machine
• Can issue > 1 instruction per cycle in hardware
• Replicates resources, e.g., multiple adders or multi-ported data caches
• Machine Parallelism = S x D (= 10), where S is the superscalar degree
• Issue Latency (IL) = 1
• Peak IPC = S (= 2)
[Figure: S = 2 instructions enter the five-stage pipeline each execution cycle]
Diversified Pipelines
• Separate pipelines for integer, multiply, FPU, load/store
Instruction Level Parallelism (ILP)
• Basic idea: execute several instructions in parallel
• We already do pipelining…
 But it can only churn out at best 1 instr/cycle
• We want multiple instr/cycle
 Yes, it gets a bit complicated and we have to add a fan to cool the processor, but it delivers performance (power is another issue)
 That’s how we got from the 486 (pipelined) to the Pentium and beyond
Is This Legal?!?
• ISA defines instruction execution one by one
 I1: ADD R1 = R2 + R3
• fetch the instruction
• read R2 and R3
• do the addition
• write R1
• increment PC
 Now repeat for I2
• Darth Sidious: Begin landing your troops.
Nute Gunray: Ah, my lord, is that... legal?
Darth Sidious: I will make it legal.
Sure, As Long As We Don’t Get Caught!
• How about pipelining?
 It already breaks the “rules” described on the previous slide
 We fetch I2 before I1 has finished
• Parallelism exists in that we perform different operations (fetch, decode, …) on several different instructions in parallel
 As mentioned, limit of 1 IPC
What Does It Mean to Not “Get Caught”?
• Program executes correctly
• Ok, what’s “correct”?
 As defined by the ISA
 Same processor state (registers, PC, memory)
as if you had executed one-at-a-time
• You can squash instructions that don’t correspond
to the “correct” execution
Example: Toll Booth
• Cars A, B, C, D caravanning on a trip must stay in order to prevent losing anyone
• When we get to the toll, everyone gets in the same lane (Lane 1) to stay in order
• This works… but it’s slow. Everyone has to wait for D to get through the toll booth
• Instead, go through two at a time (in parallel) using Lane 1 and Lane 2
• Before the toll booth the cars are in order; after the toll booth they may not be (“You didn’t see that…”)
Illusion of Sequentiality
• So long as everything looks OK to the outside
world you can do whatever you want!
 “Outside Appearance” = “Architecture” (ISA)
 “Whatever you want” = “Microarchitecture”
 mArch basically includes everything not explicitly
defined in the ISA
• pipelining, branch prediction, number of
instructions issued per cycle, etc.
Back to ILP… But How?
• Simple ILP recipe
 Read and decode a few instructions each cycle
• can’t execute > 1 IPC if we’re not fetching > 1 IPC
 If instructions are independent, do them at the same
time
 If not, do them one at a time
Ex. Original Pentium
• Fetch: fetch up to 32 bytes
• Decode1: decode up to 2 insts
• Decode2 ×2: read operands and check dependencies
• Execute ×2
• Writeback ×2
This is “Superscalar”
• “Scalar” CPU executes one inst at a time
 includes pipelined processors
• “Superscalar” can execute more than one inst at
a time
 X + Y, W * Z
ILP is Bounded
• For any sequence of instructions, the available
parallelism is limited
• Hazards, Dependencies are what limit the ILP
 Data dependencies
 Control dependencies
 Memory dependencies
Dependencies
• Data Dependencies
 RAW: Read-After-Write (True Dependence)
 WAR: Write-After-Read (Anti-Dependence)
 WAW: Output Dependence
Data Dependencies
• Register dependencies
 RAW, WAR, WAW, based on register number
(Examples below start with R1=5, R2=-2, R3=9, R4=3.)

Read-After-Write
A: R1 = R2 + R3
B: R4 = R1 * R4
In order, A writes R1 = 7, then B reads it and writes R4 = 21. If B goes first, it reads the stale R1 = 5 and writes R4 = 15, which is wrong.

Write-After-Read
A: R1 = R3 / R4
B: R3 = R2 * R4
In order, A reads R3 = 9 and writes R1 = 3, then B writes R3 = -6. If B goes first, A reads R3 = -6 and writes R1 = -2, which is wrong.

Write-After-Write
A: R1 = R2 + R3
B: R1 = R3 * R4
In order, R1 holds A’s 7 and then B’s 27. If B goes first, R1 ends with A’s 7 instead of 27, which is wrong.
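The three cases above can be checked mechanically. Below is a small Python sketch (illustrative, not part of the slides): each instruction is a (destination, source, source) tuple, and `classify` reports which dependencies a later instruction B has on an earlier instruction A.

```python
# Toy model: an instruction is (dest, src1, src2); register names are strings.
def classify(a, b):
    """Return the dependencies instruction B has on earlier instruction A."""
    a_dest, *a_srcs = a
    b_dest, *b_srcs = b
    deps = []
    if a_dest in b_srcs:
        deps.append("RAW")   # true dependence: B reads what A wrote
    if b_dest in a_srcs:
        deps.append("WAR")   # anti-dependence: B overwrites A's input
    if b_dest == a_dest:
        deps.append("WAW")   # output dependence: both write the same register
    return deps

# A: R1 = R2 + R3;  B: R4 = R1 * R4
print(classify(("R1", "R2", "R3"), ("R4", "R1", "R4")))  # -> ['RAW']
```

The same helper reproduces the other two slide examples: A: R1 = R3 / R4 with B: R3 = R2 * R4 gives WAR, and two writes to R1 give WAW.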
ILP
• Arrange instructions based on dependencies
• ILP = Number of instructions / Longest Path
R5 = 8(R6)
R7 = R5 – R4
R9 = R7 * R7
R15 = 16(R6)
R17 = R15 – R14
R19 = R15 * R15
Window in Search of ILP
Small windows over the same six instructions:
 Window {R5 = 8(R6)}: ILP = 1
 Window {R7 = R5 – R4; R9 = R7 * R7; R15 = 16(R6)}: ILP = 1.5
 Window {R17 = R15 – R14; R19 = R15 * R15}: ILP = ?
With a window covering all six instructions, schedule by dependence:
 C1: R5 = 8(R6); R15 = 16(R6)
 C2: R7 = R5 – R4; R17 = R15 – R14; R19 = R15 * R15
 C3: R9 = R7 * R7
• ILP = 6/3 = 2, better than 1 and 1.5
• A larger window gives more opportunities
• Who exploits the instruction window?
• But what limits the window?
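The ILP figure used above (instruction count divided by the longest dependence chain) can be computed with a short sketch. The tuple encoding and the `ilp` helper below are assumptions for illustration, and latencies are idealized to one cycle per instruction.

```python
# Sketch: each instruction is (dest, [sources]).
# ILP = number of instructions / length of the longest RAW chain.
def ilp(instrs):
    depth = {}    # instruction index -> length of its dependence chain
    writer = {}   # register -> index of its most recent writer
    for i, (dest, srcs) in enumerate(instrs):
        d = 1 + max((depth[writer[s]] for s in srcs if s in writer), default=0)
        depth[i] = d
        writer[dest] = i
    return len(instrs) / max(depth.values())

code = [
    ("R5",  ["R6"]),          # R5  = 8(R6)
    ("R7",  ["R5", "R4"]),    # R7  = R5 - R4
    ("R9",  ["R7"]),          # R9  = R7 * R7
    ("R15", ["R6"]),          # R15 = 16(R6)
    ("R17", ["R15", "R14"]),  # R17 = R15 - R14
    ("R19", ["R15"]),         # R19 = R15 * R15
]
print(ilp(code))  # 6 instructions / longest chain of 3 -> 2.0
```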
Scheduling
• Central problem to ILP processing
 need to determine when parallelism (independent
instructions) exist
 in Pentium example, decode stage checks for multiple
conditions:
• is there a data dependency?
 does one instruction generate a value needed by
the other?
 do both instructions write to the same register?
• is there a structural dependency?
 most CPUs only have one divider, so two divides
cannot execute at the same time
Scheduling
• How many instructions are we looking for?
 3-6 is typical today
 A CPU that can ideally* do N instrs per cycle
is called “N-way superscalar”, “N-issue superscalar”,
or simply “N-way” or “N-issue”
• *Peak execution bandwidth
• This “N” is also called the “issue width”
Static (In-Order) Scheduling
Program code
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: AND R6, R1, R7
I4: OR R8, R2, R6
I5: XOR R10, R2, R11
• Cycle 1
 Start I1.
 Can we also start I2? No (RAW on R1).
• Cycle 2
 Start I2.
 Can we also start I3? Yes.
 Can we also start I4? No (RAW on R6).
• If the next instruction cannot start, stop looking for things to do in this cycle!
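The in-order policy above (issue consecutive instructions until the first one that stalls, then stop for the cycle) can be sketched as follows. The encoding and the one-cycle result latency are simplifying assumptions, not slide material.

```python
# Toy in-order dual-issue model. Instruction format: (name, dest, [sources]).
def inorder_schedule(instrs, width=2):
    ready_at = {}                  # register -> cycle its value is available
    schedule, i, cycle = [], 0, 1
    while i < len(instrs):
        issued = []
        for name, dest, srcs in instrs[i:i + width]:
            if any(ready_at.get(s, 0) > cycle for s in srcs):
                break              # in-order: stop at the first stalled instr
            issued.append(name)
            ready_at[dest] = cycle + 1   # result usable next cycle
        if issued:
            schedule.append((cycle, issued))
            i += len(issued)
        cycle += 1
    return schedule

prog = [("I1", "R1", ["R2", "R3"]), ("I2", "R4", ["R1", "R5"]),
        ("I3", "R6", ["R1", "R7"]), ("I4", "R8", ["R2", "R6"]),
        ("I5", "R10", ["R2", "R11"])]
print(inorder_schedule(prog))
```

For the slide’s five-instruction program this yields I1 in cycle 1, I2 and I3 in cycle 2, and I4 and I5 in cycle 3, matching the slide.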
Dynamic (Out-of-Order) Scheduling
Program code
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: AND R6, R1, R7
I4: OR R8, R2, R6
I5: XOR R10, R2, R11
• Cycle 1
 Operands ready? I1, I5.
 Start I1, I5.
• Cycle 2
 Operands ready? I2, I3.
 Start I2, I3.
• Window size (W): how many instructions ahead we look.
 Do not confuse with “issue width” (N).
 E.g., a 4-issue out-of-order processor can have a 128-entry window (it can look at up to 128 instructions at a time).
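The out-of-order policy can be sketched the same way: scan the whole window each cycle and issue any instructions whose producers have finished. This toy model tracks only RAW dependencies (as if WAR/WAW had already been removed by renaming) and assumes a one-cycle latency.

```python
# Toy out-of-order model. Instruction format: (name, dest, [sources]).
def ooo_schedule(instrs, width=2):
    # Precompute each instruction's RAW producers: the most recent
    # earlier writer of each of its source registers.
    last_writer, deps = {}, []
    for name, dest, srcs in instrs:
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = len(deps) - 1
    finish = {}                        # index -> cycle its result is ready
    pending = set(range(len(instrs)))
    schedule, cycle = [], 1
    while pending:
        ready = [i for i in sorted(pending)
                 if all(d not in pending and finish[d] <= cycle
                        for d in deps[i])]
        for i in ready[:width]:        # issue up to `width` ready instrs
            finish[i] = cycle + 1
            pending.discard(i)
        schedule.append((cycle, [instrs[i][0] for i in ready[:width]]))
        cycle += 1
    return schedule

prog = [("I1", "R1", ["R2", "R3"]), ("I2", "R4", ["R1", "R5"]),
        ("I3", "R6", ["R1", "R7"]), ("I4", "R8", ["R2", "R6"]),
        ("I5", "R10", ["R2", "R11"])]
print(ooo_schedule(prog))
```

This yields I1 and I5 in cycle 1, I2 and I3 in cycle 2, and I4 in cycle 3, as on the slide.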
Ordering?
• In the previous example, I5 executed before I2, I3, and I4!
• How do we maintain the illusion of sequentiality?
Toll-booth analogy: one-at-a-time = 45s. Cars A, B, and C each take 5s, but D hands the toll-booth agent a $100 bill, and it takes a while (30s) to count the change.
With a “4-Issue” Toll Booth (lanes L1–L4), out-of-order = 30s.
Out-of-Order Execution
• We’re now executing instructions in data-flow order
 Great! More performance
• But the outside world can’t know about this
 Must maintain the illusion of sequentiality
Atom Processor
• 2-issue simultaneous multithreading
• 16-stage in-order pipeline
• two integer ALUs
• no instruction reordering
Intel Quad Core
• 4 cores/chip
• 16 pipeline stages,
~3GHz
• 4-wide superscalar
• Out of order
Quiz
In-order or Out-of-Order?
Two-way superscalar, in-order RISC processor
Quiz
In-order or Out-of-Order?
Three in-order cores
Quiz
In-order or Out-of-Order?
Pentium III, 3-issue, out-of-order
Quiz
In-order or Out-of-Order?
Dual-issue, in-order Cortex A8 (iPhone 4)
Quiz
In-order or Out-of-Order?
Cortex-A9 MPCore, out-of-order (iPhone 4S)
Cortex A9
• The A8 has a dual-issue, in-order, 13-stage integer pipeline. Doubling the issue width increased IPC (instructions per clock), and the deeper pipeline gave it frequency headroom.
• The Cortex A9 goes back down to an 8-stage pipeline. It’s still a dual-issue pipeline, but instructions can execute out of order.
ILP is Bounded
• For any sequence of instructions, the available
parallelism is limited
• Hazards/Dependencies are what limit the ILP
 Data dependencies
 Control dependencies
 Memory dependencies
RAW Memory Dependency
• RAW (Read-After-Write)
 A writes to a location, B reads from the location,
therefore B has a RAW dependency on A
 Also called a “true dependency”
A: STORE R1, 0[R2]
B: LOAD R5, 0[R2]
Instructions executing in the same cycle cannot have a RAW dependence between them
WAR Memory Dependency
• WAR (Write-After-Read)
 A reads from a location, B writes to the location,
therefore B has a WAR dependency on A
 If B executes before A has read its operand, then the
operand will be lost
 Also called an anti-dependence
A: LOAD R5, 0[R2]
B: STORE R3, 0[R2]
A: LOAD R5, 0[R2]
ADD R7, R5, R7
B: STORE R3, 0[R2]
WAW Memory Dependency
• Write-After-Write
 A writes to a location, B writes to the same location
 If B writes first, then A writes, the location will end up
with the wrong value
 Also called an output-dependence
A: STORE R1, 0[R2]
B: STORE R3, 0[R2]
A: STORE R1, 0[R2]
LOAD R5, 0[R2]
B: STORE R3, 0[R2]
Memory Location Ambiguity
• When the exact location is not known:
 A: STORE R1, 0[R2]
 B: LOAD R5, 24[R8]
 C: STORE R3, -8[R9]
 RAW exists if (R2+0) == (R8+24)
 WAR exists if (R8+24) == (R9 – 8)
 WAW exists if (R2+0) == (R9 – 8)
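Once the register values are known, the ambiguity disappears and each address pair can be tested directly. The values below are made-up examples chosen so that only the RAW pair conflicts.

```python
# Check the slide's three address pairs for conflicts, assuming word-sized
# accesses that conflict only at equal addresses. Register values are
# illustrative assumptions.
def overlaps(addr_a, addr_b):
    return addr_a == addr_b

R2, R8, R9 = 100, 76, 120
print("RAW:", overlaps(R2 + 0, R8 + 24))   # A: STORE 0[R2] vs B: LOAD 24[R8]
print("WAR:", overlaps(R8 + 24, R9 - 8))   # B: LOAD 24[R8] vs C: STORE -8[R9]
print("WAW:", overlaps(R2 + 0, R9 - 8))    # A: STORE 0[R2] vs C: STORE -8[R9]
```

With these values, only the RAW check is true (100 == 100); the hardware must perform such checks dynamically because R2, R8, and R9 are not known until run time.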
Memory Dependency
• Ambiguous dependencies also force “sequentiality”
• To increase ILP, we need dynamic memory-disambiguation mechanisms that are either safe or recoverable
• ILP could be 1, could be 3, depending on the actual dependences
 i1: load r2, (r12)
  ? (possible dependence)
 i2: store r7, 24(r20)
  ? (possible dependence)
 i3: store r1, (0xFF00)
Control Dependencies
• If we have a conditional branch, until we
actually know the outcome, all later instructions
must wait
 That is, all instructions are control dependent on all
earlier branches
 This is true for unconditional branches as well (e.g.,
can’t return from a function until we’ve loaded the
return address)
la $8, array
beq $20, $22, L1
lb $10, 1($8)
add $11, $9, $10
sb $11, ($8)
L1:
addiu $8, $8, 4
Pop from the Stack
C code snippet
z = fact(x);
int fact(int n)
{
  if (n < 1)
    return(1);
  else
    return(n * fact(n-1));
}
Stack frame at $sp: return address, then $a0 (= x)
MIPS snippet
        li   $v0, 1
        j    fact
fact:   blt  $a0, $v0, return
        sub  $sp, $sp, 8
        sw   $ra, 4($sp)
        sw   $a0, 0($sp)
        sub  $a0, $a0, 1
        jal  fact
        lw   $a0, 0($sp)
        lw   $ra, 4($sp)
        mult $a0, $v0
        mflo $v0
        add  $sp, $sp, 8
return: jr   $ra
Name Dependency
• WAR and WAW result from the reuse of names
R2 = R1 + R3
R1 = R5 – R7
R2 = R1 >> #3
• Would WAR and WAW exist with more registers?
ILP Example
• True dependency forces “sequentiality”
• ILP = 3/3 = 1
 i1: load r2, (r12)
 i2: add r1, r2, 9   (true dependence on i1’s r2)
 i3: mul r2, r5, r6  (false dependence: reuses r2)
 Schedule: c1 = i1, c2 = i2, c3 = i3
• With the false dependency removed (rename i3’s destination to r8)
• ILP = 3/2 = 1.5
 i1: load r2, (r12)
 i2: add r1, r2, 9
 i3: mul r8, r5, r6
 Schedule: c1: load r2, (r12) and mul r8, r5, r6; c2: add r1, r2, 9
Eliminating WAR Dependencies
• WAR dependencies are from reusing registers (initial values: R1=5, R2=-2, R3=9, R4=3, R5=4)
A: R1 = R3 / R4
B: R3 = R2 * R4
If B executes first, A reads B’s R3 = -6 and writes R1 = -2, which is wrong.
Rename B’s destination:
A: R1 = R3 / R4
B: R5 = R2 * R4
Now even with B first, A reads R3 = 9 and writes R1 = 3, and B writes R5 = -6.
With no dependencies, reordering still produces the correct results.
Eliminating WAW Dependencies
• WAW dependencies are also from reusing registers
A: R1 = R2 + R3
B: R1 = R3 * R4
Executed in order, R1 ends with B’s value 27. If B executes first, R1 ends with A’s value 7, which is wrong.
Same solution works: rename A’s destination.
A: R5 = R2 + R3
B: R1 = R3 * R4
Now even with B first, R1 = 27 and R5 = 7, matching sequential semantics.
Another Register Example
When only 4 registers are available:
R1 = 8(R0)
R3 = R1 – 5
R2 = R1 * R3
24(R0) = R2
R1 = 16(R0)
R3 = R1 – 5
R2 = R1 * R3
32(R0) = R2
ILP = ?
Another Register Example
When more registers (or register renaming) are available, the second group is renamed:
R1 = 8(R0)
R3 = R1 – 5
R2 = R1 * R3
24(R0) = R2
R5 = 16(R0)   (was R1 = 16(R0))
R6 = R5 – 5   (was R3 = R1 – 5)
R7 = R5 * R6  (was R2 = R1 * R3)
32(R0) = R7   (was 32(R0) = R2)
ILP = ?
Obvious Solution: More Registers
• Add more registers to the ISA? BAD!!!
 Changing the ISA can break binary
compatibility
 All code must be recompiled
 Not a scalable solution
Better Solution: Register Renaming
• Give the processor more registers than specified by the ISA; temporarily map ISA registers (“logical” or “architected” registers) to the physical registers to avoid overwrites
• Components:
 mapping mechanism
 physical registers
• allocated vs. free registers
• allocation/deallocation mechanism
Register Renaming
• Example
Program code
I1: ADD R1, R2, R3
I2: SUB R2, R1, R6
I3: AND R6, R11, R7
I4: OR R8, R5, R2
I5: XOR R2, R4, R11
 I3 cannot execute before I2, because I3 would overwrite the R6 that I2 reads (WAR)
 I5 cannot go before I2, because I2, when it goes, would overwrite R2 with a stale value (WAW)
Register Renaming
• Solution: let’s give I2 a temporary name/location (e.g., S) for the value it produces.
Program code (renamed)
I1: ADD R1, R2, R3
I2: SUB S, R1, R6
I3: AND U, R11, R7
I4: OR R8, R5, S
I5: XOR T, R4, R11
• But I4 uses that value, so we must also change its source R2 to S…
• In fact, all uses of R2 from I3 to the next instruction that writes to R2 again must now be changed to S!
• We remove WAW deps in the same way: change R2 in I5 (and subsequent instrs) to T.
Register Renaming
• Implementation
 Space for S, T, U, etc.
 How do we know when to rename a register?
Program code (renamed)
I1: ADD R1, R2, R3
I2: SUB S, R1, R6
I3: AND U, R11, R7
I4: OR R8, R5, S
I5: XOR T, R4, R11
• Simple Solution
 Do renaming for every instruction
 Change the name of a register each time we decode an instruction that will write to it
 Remember what name we gave it 
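The “rename every destination” scheme can be sketched in a few lines of Python. The RAT maps each architected register to its most recent name; the names S, T, and U mirror the slide’s example, while P1 and P4 (and the whole encoding) are made up for illustration.

```python
# Toy renamer: every destination gets a fresh name from a free pool;
# sources are looked up in the RAT (register alias table).
def rename(instrs, free_pool):
    rat = {}                                  # architected -> physical name
    out = []
    for op, dest, srcs in instrs:
        srcs = [rat.get(s, s) for s in srcs]  # read current mappings
        phys = free_pool.pop(0)               # allocate a fresh register
        rat[dest] = phys                      # later readers use the new name
        out.append((op, phys, srcs))
    return out

prog = [("ADD", "R1", ["R2", "R3"]),
        ("SUB", "R2", ["R1", "R6"]),
        ("AND", "R6", ["R11", "R7"]),
        ("OR",  "R8", ["R5", "R2"]),
        ("XOR", "R2", ["R4", "R11"])]
for ins in rename(prog, ["P1", "S", "U", "P4", "T"]):
    print(ins)
```

As on the slides, I2’s destination becomes S, I3’s becomes U (removing the WAR on R6), I4 reads S instead of R2, and I5’s destination becomes T (removing the WAW on R2).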
Register File Organization
• We need some physical structure to store the register values
 ARF (Architected Register File): the “outside” world sees the ARF
 RAT (Register Alias Table): maps architected registers to physical registers
 PRF (Physical Register File): one physical register per instruction in flight
Putting it all Together
top:
 R1 = R2 + R3
 R2 = R4 – R1
 R1 = R3 * R6
 R2 = R1 + R2
 R3 = R1 >> 1
 BNEZ R3, top
Free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5…
[Figure: ARF and RAT entries for R1–R6 and a 16-entry PRF (X1–X16), before renaming begins]
Renaming in action
 R1 = R2 + R3
 R2 = R4 – R1
 R1 = R3 * R6
 R2 = R1 + R2
 R3 = R1 >> 1
 BNEZ R3, top
Free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5…
[Figure: each destination is renamed to the next physical register from the free pool, sources are replaced by their current RAT mappings, and the RAT is updated after every instruction]
Even Physical Registers are Limited
• We keep using new physical registers
 What happens when we run out?
• There must be a way to “recycle”
• When can we recycle?
 When we have given its value to all
instructions that use it as a source operand!
 This is not as easy as it sounds
Instruction Commit (Leaving the Pipe)
• The architected register file contains the “official” processor state
• When an instruction leaves the pipeline, it makes its result “official” by updating the ARF
• The ARF now contains the correct value; update the RAT (R3 now maps to the ARF instead of T42)
• T42 is no longer needed; return it to the physical register free pool
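The commit step can be sketched as follows. Freeing the register right at commit is a simplification, and, as the next slide warns, the RAT is redirected only if this instruction is still the most recent writer. The dictionary-based encoding is an assumption for illustration.

```python
# Toy commit: copy the result into the ARF, conditionally fix the RAT,
# and recycle the physical register.
def commit(arch_reg, phys_reg, value, arf, rat, free_pool):
    arf[arch_reg] = value              # architected state becomes official
    if rat.get(arch_reg) == phys_reg:  # still the most recent writer?
        rat[arch_reg] = arch_reg       # RAT now points at the ARF copy
    free_pool.append(phys_reg)         # recycle the physical register

arf, rat, pool = {"R3": 0}, {"R3": "T42"}, []
commit("R3", "T42", 7, arf, rat, pool)
print(arf, rat, pool)
```

If a newer writer (say T17) already owns R3 in the RAT, the `if` test fails and the RAT is left alone, exactly the hazard the next slide illustrates.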
Careful with the RAT Update!
• Update the ARF as usual and deallocate the physical register (T42)
• But don’t touch that RAT! Someone else (T17) is the most recent writer to R3
• At some point in the future, the newer writer of R3 exits; that instruction was the most recent writer, so now update the RAT and deallocate its physical register