Lecture 9 Dynamic Branch Prediction, Superscalar, VLIW, and

Lecture 10
Dynamic Branch Prediction,
Superscalar, VLIW, and
Software Pipelining
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 1
Review: Tomasulo Summary
•
•
•
•
•
Register file is not the bottleneck
Avoids the WAR, WAW hazards of Scoreboard
Not limited to basic blocks (provided branch prediction)
Allows loop unrolling in HW
Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 2
Reasons for Branch Prediction
•
•
For a branch instruction, predict whether the branch will actually
be TAKEN or NOT TAKEN
Pipeline bubbles(stalls) due to branch are the main source of
performance degradation
– Avoids pipeline bubbles by feeding the pipeline with the instructions
in the predicted path(speculative)
•
•
•
Speculative execution significantly improves the performance of
the deeply pipelined, wide issue superscalar processors
More accurate branch prediction is required because all
speculative work beyond a branch must be thrown away if
mispredicted
Performance=f(Accuracy, cost of misprediction)
– Accuracy is better with Dynamic
Dynamic Branch Prediction
Prediction
CS510 Computer Architectures
Lecture 10 - 3
1-Bit Branch History Table
•
•
Simplest of all dynamic prediction schemes
Based on the properties that most branches are either usually
TAKEN or usually NOT TAKEN
–
•
•
property found in the iterations of a loop
Keep the last outcome of each branch in the BHT
BHT is indexed using the lower order bits of PC
.
.
.
PC
BHT
Dynamic Branch Prediction
• Problem: in a loop, 1-bit BHT
will cause two mispredictions:
Actual outcome End of loop branch,
- Last time through the loop,
when it exits instead of
Prediction
looping as before
- First time through loop,
when it predicts exit
instead of looping
CS510 Computer Architectures
Lecture 10 - 4
1-Bit Branch History Table
Example
main()
{
int i, j;
while (1)
for (i=1; i<=4; i++) {
j = i++;
}
}
‘while’ branch : always TAKEN
11111111111111111….
‘for’ branch : TAKEN x 3, NOT TAKEN x 1
1110111011101110….
Prediction accuracy of ‘for’ branch: 50%
branch outcome
1
1
1
0
1
1
1
0
1
1
1
0
1 ...
last outcome
1
1
1
1
0
1
1
1
0
1
1
1
0 ...
prediction
T
T
T
T
N T
T
T
N
T
T
T
N ...
new last outcome
1
1
1
0
1
1
0
1
1
1
0
1 ...
X O O X
X ...
correctness
O O O X
1
X O O X
Assume initial last outcome = 1
Dynamic Branch Prediction
CS510 Computer Architectures
O : correct, X : mispredict
Lecture 10 - 5
2-Bit Counter Scheme
•
•
Prediction is made based on the last two branch outcomes
Each of BHT entries consists of 2 bits, usually a 2-bit counter,
which is associated with the state of the automaton(many different
automata are possible)
Branch outcome
.
.
.
Automaton
PC
Prediction
BHT
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 6
2-Bit BHT(1)
• 2-bit scheme where change prediction only if get misprediction twice
• MSB of the state symbol represents the prediction;
– 1x: TAKEN, 0x: NOT TAKEN
T
Predict
TAKEN
NT
Predict
TAKEN
10
11
T
NT
T
Predict
NOT
TAKEN
branch outcome
1 1 1 0 1 1 1 0 1 1 1 0 ...
counter value
11 11 11 11 10 11 11 11 10 11 11 11 . . .
prediction
T T T T T T T T T T T T ...
new counter value 11 11 11 10 11 11 11 10 11 11 11 10 . . .
NT
01
00
T
Predict
NOT
TAKEN
NT
Dynamic Branch Prediction
correctness
O O O X O O O X O O O X ...
O :: correct,
X : mispredict
assume initial counter value
11
Prediction accuracy of ‘for’ branch: 75 %
CS510 Computer Architectures
Lecture 10 - 7
2-Bit BHT(2)
•
•
A branch going unusual direction once causes a misprediction only
once, rather than twice as in the 1-bit BHT
MSB of the state represents the prediction;
– 1x: TAKEN, 0x: NOT TAKEN
T
Predict
TAKEN
NT
10
11
Predict
TAKEN
T
NT
T
Predict
NOT
TAKEN
branch outcome
1 1 1 0 1 1 1 0 1 1 1 0 ...
counter value
10 11 11 11 10 11 11 11 10 11 11 11 . . .
prediction
T T T T T T T T T T T T ...
new counter value 11 11 11 10 11 11 11 10 11 11 11 10 . . .
NT
00
01
T
NT
Dynamic Branch Prediction
Predict
NOT
TAKEN
correctness
O O O X O O O X O O O X ...
assume initial counter value : 10
Prediction accuracy of ‘for’ branch: 75 %
CS510 Computer Architectures
Lecture 10 - 8
2-Level Self Prediction
• First level
– Record previous few outcomes(history) of the branch itself
– Branch History Table(BHT) - A shift register
• Second Level
– Predictor Table(PT) indexed by history pattern captured by BHT
– Each entry is a 2-bit counter
PC
BHT
PT
...
Prediction
...
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 9
2-Level Self Prediction
Algorithm
Using PC, read BHT
Using self history from BHT, read the counter value from PT
If (MSB==1), predict T
else predict NT
After resolving branch outcome, bookkeeping
If mispredicted, discard all speculative executions
Shift right the new branch outcome into the entry read from BHT
Update PT:
If(branch outcome==T) increase counter value by 1
else decrease by 1
Branch Outcome
Self History(BHT)
Counter Value(PT)
Prediction
New BHT entry
New PT entry
Correctness
...
1
1
0
1
1
1
0
1
1
1
0
1
0000 1000 1100 1110 0111 1011 1101 1110 0111 1011 1101 1110 ...
10
10
10
10
10
10
01
11
11
11
00 ...
10
...
T
T
T
T
T
T
N
T
T
T
N
T
1000 1100 1110 0111 1011 1101 1110 0111 1011 1101 1110 0111 ...
11
11
01
11
11
11
00
11
11
11
00 ...
11
...
O
O
X
O
O
O
O
O
O
O
O
O
Initial BHT entry =0000, Self history length=4, Initial PT value=10
Dynamic Branch Prediction
Prediction Accuracy = 100%
CS510 Computer Architectures
Lecture 10 - 10
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch when indexing the table
• 4096 entry table, programs vary from 1% misprediction
(nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc
at 12%
• 4096 about as good as infinite table, but 4096 is a lot of HW
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 11
Correlating Branches
Example:
if (d==0)
L1 :
d=1;
if (d==1)
Initial value
of d
0
1
2
BNEZ
ADDI
SUBI
BNEZ
R1,L1 ;
(b1)(d!=0)
R1,R1,#1; d==0, so d=1
R3,R1,#1
R3,L2;
(b2)(d!=1)
.....
L2 :
d==0?
Y
N
N
b1
NT
T
T
Value of d
before b2
1
1
2
d==1?
Y
Y
N
b2
NT
NT
T
If b1 is NT, then b2 is NT
1-bit
self history
predictor
Sequence of
2,0,2,0,2,0,...
d=?
2
0
2
0
b1
prediction
NT
T
NT
T
b1
action
T
NT
T
NT
New b1
prediction
T
NT
T
NT
b2
prediction
NT
T
NT
T
b2
action
T
NT
T
NT
New b2
prediction
T
NT
T
NT
All branches are mispredicted
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 12
Correlating Branches
(2,2) correlating predictor
Idea: TAKEN/NOT TAKEN of
recently executed branches
(GHR-global history) is related
to the behavior of the next
branch (as well as the history
of that branch behavior)(PHTself history)
Branch address
4
Pattern History Table(PHT)
XX
- Behavior of the recent
branches(GHR) selects
from, say, four predictions
(PHT) of the next branch,
and updating the selected
prediction only
XX
Prediction
2-bits per
predictors
Global History Register(GHR; 2 bits)
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 13
Correlating Branches
Example:
if (d==0)
L1 :
d=1;
if (d==1)
Self
Prediction
bits(XX)
NT/NT
NT/T
T/NT
T/T
d=? b1 prediction
2
NT/NT
0
T/NT
2
T/NT
0
T/NT
b1 action
T
NT
T
NT
R1,L1 ;
branch b1 (d!=0)
R1,R1,#1; d==0, so d=1
R3,R1,#1
R3,L2;
branch b2(d!=1)
.....
L2 :
Gloabal
Prediction, if last
Prediction, if last
branch action was NT branch action was T
NT
NT
T
NT
NT
T
T
T
new b1 prediction b2 prediction
T/NT
NT/NT
T/NT
NT/T
T/NT
NT/T
T/NT
NT/T
Initial self prediction bits NT/NT and
Initial last branch was NT.
Dynamic Branch Prediction
BNEZ
ADDI
SUBI
BNEZ
b2 action
T
NT
T
NT
new b2 prediction
NT/T
NT/T
NT/T
NT/T
Prediction used is shown in purple
Misprediction only in the firest two predictions
CS510 Computer Architectures
Lecture 10 - 14
Accuracy of Different Schemes
4096 Entries 2-bit BHT
Unlimited Entries 2-bit BHT
1024 Entries (2,2) BHT
16%
14%
12%
11%
10%
8%
6%
6%
6%
6%
5%
5%
4%
4%
Dynamic Branch Prediction
CS510 Computer Architectures
li
eqntott
espresso
gcc
matrix300
nasa7
0%
fpppp
1%
0%
spice
1%
doducd
2%
tomcatv
of Mispredictions
Frequency
Frequency of Mispredictions
18%
Lecture 10 - 15
Need Predicted Address
as well as Prediction
Branch Target Buffer (BTB): Indexed by the Address of branch to
get prediction AND branch address (if taken)
PC of instruction to fetch
Look up
Predicted PC
T/NT(as predicted)
Number of
entries
in branchtarget
buffer
=
No: instruction is not predicted to be
a branch. Proceed normally
Yes: then instruction is a branch and predicted
PC should be used as the next PC
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 16
Branch Target Buffer
Send PC to memory and
branch-target buffer
BTB
Y
Y
N
IF
No
No
ID
Is
instruction
a taken
branch?
Entry found in
BTB?
Prediction
T
T
-
Actual
T
NT
T
Penalty
0
2
2
Yes
Send out predicted PC
Yes
No
Taken
branch?
Yes
Normal
instruction
execution
EX
Dynamic Branch Prediction
Enter branch instr.
address and next
PC into BTB
Mispredicted branch, kill
fetched instr.; restart fetch
at other target; delete entry
from BTB.
CS510 Computer Architectures
Branch
correctly
predicted
Lecture 10 - 17
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 18
Getting CPI < 1:
Issuing Multiple Instructions/Cycle
Two variations
– Superscalar: varying number of instructions/cycle (1 to 8),
scheduled by compiler or by HW(Tomasulo)
• IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100
– Very Long Instruction Words (VLIW): fixed number of
instructions (16) scheduled by the compiler
• Joint HP/Intel agreement in 1998?
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 19
Getting CPI < 1:
Issuing Multiple Instructions/Cycle
• Superscalar DLX: 2 instructions, 1 FP & 1 anything else
– Fetch 64-bits/clock cycle; Int on left, FP on right
Integer Instruction
FP Instruction
– Can issue the 2nd instruction only if the 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Instr Type
Int. instruction
FP instruction
Int. instruction
FP instruction
Int. instruction
FP instruction
IF
IF
ID
ID
IF
IF
Pipe
Stages
EX
EX
ID
ID
MEM
MEM
EX
EX
WB
WB
MEM
MEM
IF
IF
WB
WB
ID
ID
EX
EX
MEM
MEM
WB
WB
 1 cycle load delay expands to 3 instructions in SS
– instruction in right half cannot use it, nor instructions in the next slot
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 20
Unrolled Loop that Minimizes
Stalls for Scalar
Loop: LD
ADDD
SD
SUBI
BNEZ
F0, 0(R1)
F4, F0, F2
0(R1), F4
R1, R1, #32
R1, Loop
LD to ADDD: 1 Cycle
ADDD to SD: 2 Cycles
1 Loop:
2
3
4
5
6
7
8
9
10
11
12
13
14
LD
LD
LD
LD
ADDD
ADDD
ADDD
ADDD
SD
SD
SUBI
SD
BNEZ
SD
F0,0(R1)
F6,-8(R1)
F10,-16(R1)
F14,-24(R1)
F4,F0,F2
F8,F6,F2
F12,F10,F2
F16,F14,F2
0(R1),F4
-8(R1),F8
R1,R1,#32
16(R1),F12
R1,LOOP
8(R1),F16
; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 21
Loop Unrolling in Superscalar
Loop:
Integer instruction
LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)
LD F14,-24(R1)
LD F18,-32(R1)
SD 0(R1),F4
SD -8(R1),F8
SD -16(R1),F12
SUBI R1,R1,#40
SD 16(R1),F16
BNEZ R1,LOOP
SD 8(R1),F20
FP instruction
LD to ADDD: 1 Cycle
ADDD to SD: 2 Cycles
ADDD F4,F0,F2
ADDD F8,F6,F2
ADDD F12,F10,F2
ADDD F16,F14,F2
ADDD F20,F18,F2
Clock cycle
1
2
3
4
5
6
7
8
9
10
11
12
• Unrolled 5 times to avoid delays (+1 due to 2
cycle initiation delay for ADDD to SD)
• 12 clocks, or 2.4 clocks per iteration
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 22
Dynamic Scheduling in
Superscalar
• Dependencies stop instruction issue
• Code compiled for scalar version will run poorly on
SS
– Good code for superscalar depends on the structure of
the superscalar
• Simple approach
– Issue an integer instruction and a FP instruction to their
separate reservation stations, i.e. for Integer FU/Reg and
for FP FU/Reg
– Issue both only when two instructions do not access the
same register - instructions with data dependence
cannot be issued together
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 23
Dynamic Scheduling
in Superscalar
How to issue two instructions to the reservation stations and
keep in-order issue for Tomasulo?
– In order issue for the purpose of simplifying bookkeeping
– Issue stage runs 2X Clock Rate, so that issue can be made 2
times in an ordinary clock cycle to make issues remain in order
but execution can be done on the same ordinary clock cycle
– Only FP loads might cause dependency between integer and FP
issue:
• Replace load reservation station with a load queue;
operands must be read in the order they are fetched
• Load checks addresses in Store Queue to avoid RAW violation
• Store checks addresses in Load Queue to avoid WAR, WAW
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 24
Performance of Dynamic SS
Iter No.
1 LD F0,0(R1)
1 ADDD F4,F0,F2
1 SD 0(R1),F4
1 SUBI R1,R1,# 8
1 BNEZ R1,LOOP
2 LD F0,0(R1)
2 ADDD F4,F0,F2
2 SD 0(R1),F4
2 SUBI R1,R1,#8
2 BNEZ R1,LOOP
Issues
1
1
2
3
4
5
5
6
7
8
Executes Memory access Write Result
3
4
2(efa)
7
4
2 more cycles to complete
3(efa)
4
5
6
5
8
7
6(efa)
2 more cycles to complete 11
8
10
7(efa)
9
8
9
* 4 clocks per iteration
Branches, Decrements still take 1 clock cycle
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 25
Limits of Superscalar
• While Integer/FP split is simple for the HW, get CPI of 0.5 only
for programs with:
– Exactly 50% FP operations
– No hazards
• If more instructions issue at the same time, greater difficulty of
decode and issue
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide
if 1 or 2 instructions can issue
• VLIW: trade-off instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long
instruction word can execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
• 16 to 24 bits per field => 7x16 or 112 bits to 7x24 or 168 bits
wide
– Need compiling technique that schedules across several branches
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 26
Loop Unrolling in VLIW
Memory
reference 1
Memory
reference 2
LD F0,0(R1)
LD F10,-16(R1)
LD F18,-32(R1)
LD F26,-48(R1)
LD F6,-8(R1)
LD F14,-24(R1)
LD F22,-40(R1)
SD 0(R1),F4
SD -16(R1),F12
SD -32(R1),F20
SD 0(R1),F28
SD -8(R1),F8
SD -24(R1),F16
SD -40(R1),F24
FP
operation 1
ADDD
ADDD
ADDD
ADDD
F4,F0,F2
F12,F10,F2
F20,F18,F2
F28,F26,F2
FP
op. 2
Int. op/
branch
Clock
a
ADDD F8,F6,F2
ADDD F16,F14,F2
ADDD F24,F22,F2
SUBI R1,R1,#48
BNEZ R1,LOOP
1
2
3
4
5
6
7
8
9
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration
• Need more registers in VLIW
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 27
Limits to Multi-Issue Machines
1. Inherent limitations of ILP in programs
– 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
– Latencies of units => many operations must be scheduled
– Need about Pipeline Depth x No. Functional Units of
independent operations to keep machines busy
2. Difficulties in building HW
– Duplicate FUs to get parallel execution
– Increase ports to Register File - VLIW example
• needs 6 reads(2 Integer Unit, 2 LD Units and 2 SD Units) and 3
writes(1 Integer Unit and 2 LD Units) for Int. register
• needs 6 reads(4 FP Units, 2 SD Units) and
4 writes(2 FP Units and 2 LD Units) for FP register
– Increase ports to memory
– Decoding SS and impact on clock rate, pipeline depth
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 28
Limits to Multi-Issue Machines
3.Limitations specific to either SS or VLIW
implementation
–
–
–
–
Multiple issue logic in SS
VLIW code size: unroll loops + wasted fields in VLIW
VLIW lock step => 1 hazard & all instructions stall
VLIW & binary compatibility is practical weakness
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 29
Detecting and Eliminating
Dependencies
• Read Section 4.5
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 30
Software Pipelining
• Observation: if iterations from loops are independent, then can get
ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is
made from instructions chosen from different iterations of the
original loop (Tomasulo in SW)
Iteration
0
Iteration
1
Iteration
2
Iteration
3
Iteration
4
Softwarepipelined
iteration
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 31
SW Pipelining Example
After: Software Pipelined version of loop
Before: Unrolled 3 times
Iter i
Iter i+1
Iter i+2
1 LOOP
2
3
4
5
6
7
8
9
10
11
LD
ADDD
SD
LD
ADDD
SD
LD
ADDD
SD
SUBI
BNEZ
F0,0(R1)
F4,F0,F2
0(R1),F4
F6,-8(R1)
F8,F0,F2
-8(R1),F8
F10,16(R1)
F12,F10,F2
-16(R1),F12
R1,R1,#24
R1,LOOP
1 LOOP
2
3
4
5
LD
ADDD
LD
SD
ADDD
LD
SUBI
BNEZ
SD
ADDD
SD
F0,0(R1)
Start-up code
F4,F0,F2
F0,-8(R1)
0(R1),F4; Stores to M[i]
F4,F0,F2; Adds to M[i-1]
F0,-16(R1); Loads from M[i-2]
R1,R1,#8
R1,LOOP
0(R1),F4
F4,F0,F2
Finish code
-8(R1),F4
Read F4(i)
SD
IF
ADDD
LD
ID
IF
EX
ID
IF
Read F0(i)
Dynamic Branch Prediction
Mem
EX
ID
WB
Mem
EX
Write F4(i+1)
WB
Mem
WB
Write F0(i+2)
CS510 Computer Architectures
Lecture 10 - 32
SW Pipelining Example
Symbolic Loop Unrolling
– Less code space
– Overhead paid only once
vs. each iteration in loop unrolling
Number of
overlapped
operations
Software Pipelining
Time
Number of
overlapped
operations
Loop Unrolling
...
Time
100 iterations = 25 loops with 4 unrolled iterations each
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 33
Summary
•
•
•
Branch Prediction
– Branch History Table: 2 bits for loop accuracy
– Correlation: Recently executed branches correlated with next branch
– Branch Target Buffer: include branch address & prediction
Superscalar and VLIW
– CPI < 1
– Dynamic issue vs. Static issue
– More instructions issue at the same time, larger the penalty of hazards
SW Pipelining
– Symbolic Loop Unrolling to get most from pipeline with little code
expansion, little overhead
– SW pipelining works when the behavior of branches is fairly
predictable at compile time
Dynamic Branch Prediction
CS510 Computer Architectures
Lecture 10 - 34