Branch Prediction

CSL718 : Pipelined Processors
Improving Branch Performance – contd.
21st Jan, 2006
Anshul Kumar, CSE IITD
Improving Branch Performance
• Branch Elimination
– replace branch with other instructions
• Branch Speed Up
– reduce time for computing CC and TIF
• Branch Prediction
– guess the outcome and proceed, undo if necessary
• Branch Target Capture
– make use of history
Branch Elimination

Use conditional/guarded instructions (predicated execution).

Branch form:
    OP1
    BC   CC = Z, + 2     ; skip the ADD when CC is zero
    ADD  R3, R2, R1
    OP2

Predicated form (branch eliminated):
    OP1
    ADD  R3, R2, R1, NZ  ; ADD takes effect only if CC is non-zero
    OP2

Examples: HP PA (all integer arithmetic/logical instructions); DEC Alpha, SPARC V9 (conditional move)
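The if-conversion above can be sketched in Python. The register and condition names mirror the slide; the two functions model the branchy and predicated forms of the same computation, not any particular ISA:

```python
# Sketch of if-conversion (predicated execution). Names R1-R3 and CC
# follow the slide; the semantics shown are illustrative.
def branchy(cc, r1, r2, r3):
    if cc != 0:            # BC CC = Z, + 2 would skip the ADD
        r3 = r2 + r1       # ADD R3, R2, R1
    return r3

def predicated(cc, r1, r2, r3):
    # ADD R3, R2, R1, NZ: always executed, committed only if CC != 0
    r3 = r2 + r1 if cc != 0 else r3
    return r3

print(branchy(1, 2, 3, 0), predicated(1, 2, 3, 0))   # → 5 5
print(branchy(0, 2, 3, 7), predicated(0, 2, 3, 7))   # → 7 7
```

Both forms compute the same result, but the predicated form has no control-flow change for the pipeline to predict.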
Branch Elimination – contd.

[Pipeline timing diagram: with the branch, the CC-setting OP1 flows through IF D AG DF DF DF EX EX while BC goes IF D AG TIF TIF TIF to fetch the target; with the predicated (conditional) ADD, the ADD and OP2 proceed through IF D AG DF/EX with no target fetch.]
Branch Speed Up : early target address generation

• Assume each instruction is a branch
• Generate the target address while decoding
• If the target is in the same page, omit translation
• After decoding, discard the target address if the instruction is not a branch

[Timing diagram: BC — IF, then D overlapped with AG, then TIF TIF TIF for the target fetch.]
Branch Speed Up : increase CC – branch gap

Increase the gap between the instruction which sets CC and the branch:
• Early CC setting
• Delayed branch
Summary - Branch Speed Up

Branch penalty in cycles; n is the number of instructions separating the CC-setting instruction from the branch (early CC setting) or the number of delay slots (delayed branch):

        early CC setting           delayed branch
  n   uncond cond(T) cond(I)   uncond cond(T) cond(I)
  0     4      6       5         4      6       5
  1     4      5       4         3      5       4
  2     4      4       3         2      4       3
  3     4      4       2         1      3       2
  4     4      4       1         0      2       1
  5     4      4       0         0      1       0

(cond(T): conditional branch taken; cond(I): conditional branch going inline. Early CC setting cannot help unconditional branches, so their penalty stays at 4.)
Delayed Branch with Nullification (also called annulment)

• Delay slot is used optionally
• Branch instruction specifies the option
• Option may be exercised based on correctness of branch prediction
• Helps in better utilization of delay slots
Branch Prediction

• Treat conditional branches as unconditional branches / NOP
• Undo if necessary

Strategies:
– Fixed (always guess inline)
– Static (guess on the basis of instruction type / displacement)
– Dynamic (guess based on recent history)
Static Branch Prediction

Instr      % of branches   Guess    % taken   Correct
uncond         14.5        always    100%      14.5%
cond           58          never      54%      27%
loop            9.8        always     91%       9%
call/ret       17.7        always    100%      17.7%
                                     Total:    68.2%

(Correct = % of branches × fraction guessed right; e.g. for cond, guessing "never taken" is right 46% of the time, and 58 × 0.46 ≈ 27.)
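The table's arithmetic can be checked with a short sketch. The fractions come from the table above; the per-type correctness of the fixed guess is derived from the "% taken" column:

```python
# Sketch: reproduce the static-prediction accuracy from the table.
# Each entry: (fraction of branches of this type,
#              probability the fixed guess for that type is correct).
mix = {
    "uncond":   (0.145, 1.00),  # guess always taken, 100% taken
    "cond":     (0.580, 0.46),  # guess never taken; 54% are taken
    "loop":     (0.098, 0.91),  # guess always taken; 91% taken
    "call/ret": (0.177, 1.00),  # guess always taken, 100% taken
}

total = sum(frac * correct for frac, correct in mix.values())
# ≈ 68%; the slide's rounded per-row entries sum to 68.2%
print(f"overall correct guesses: {total:.1%}")
```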
Threshold for Static Prediction

[Pipeline timing: instruction I-1 goes IF D AG AG DF DF EX EX; the branch goes IF D AG AG TIF TIF once CC is available.]

Penalty in cycles, guess vs. actual outcome:

            actual T   actual I
guess T        4          5
guess I        6          0

Guess target if 4p + 5(1 − p) < 6p + 0(1 − p), i.e. p > 5/7 ≈ 0.71, where p is the probability that the branch is taken.
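The threshold can be verified with a sketch using the penalties from the table above:

```python
# Sketch: expected penalty of each static guess as a function of
# p = probability that the branch is taken (penalties from the slide).
def expected_cost_guess_target(p):
    return 4 * p + 5 * (1 - p)   # right on taken (4), wrong on inline (5)

def expected_cost_guess_inline(p):
    return 6 * p + 0 * (1 - p)   # wrong on taken (6), right on inline (0)

# Guessing target wins when 4p + 5(1-p) < 6p, i.e. p > 5/7.
threshold = 5 / 7
print(f"guess target when p > {threshold:.2f}")  # → guess target when p > 0.71
```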
Dynamic Branch Prediction – basic idea

Predict based on the history of the previous branch outcome.

    loop: xxx
          xxx
          xxx
          xxx
          BC loop

A predict-same-as-last scheme suffers 2 mispredictions for every occurrence (execution) of the loop: one on loop exit and one on re-entry.
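The 2-mispredictions-per-occurrence behaviour can be simulated with a 1-bit (predict same as last outcome) scheme. The 5-iteration trace is illustrative:

```python
# Sketch: 1-bit "predict same as the last outcome" scheme applied to
# the loop branch BC above. The trace length is an assumption.
def run(trace):
    last = True          # predictor state: last observed outcome
    mispredicts = 0
    for taken in trace:
        if taken != last:
            mispredicts += 1
        last = taken
    return mispredicts

one_pass = [True] * 4 + [False]   # BC taken 4 times, then falls through
# Every execution of the loop after the first costs 2 mispredictions:
# one on exit, one on re-entry.
print(run(one_pass * 3))  # → 5 (1 on the first pass, 2 on each later pass)
```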
Dynamic Branch Prediction 2 bit prediction scheme
N
0
T
3/2
0/1
predict taken
1
T
T
N
predict not taken
N
2
3
N
T
Anshul Kumar, CSE IITD
slide 16
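The state machine above can be sketched as a saturating counter; the trace below shows why a single not-taken "hiccup" costs only one misprediction:

```python
# Sketch of the 2-bit scheme: states 0-1 predict not taken, states 2-3
# predict taken; taken increments (saturating at 3), not taken
# decrements (saturating at 0).
class TwoBit:
    def __init__(self, state=0):
        self.state = state
    def predict(self):
        return self.state >= 2          # True = predict taken
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBit(state=3)
outcomes = [True, True, False, True, True]   # one not-taken hiccup
mispred = 0
for taken in outcomes:
    if p.predict() != taken:
        mispred += 1
    p.update(taken)
print(mispred)  # → 1 (a 1-bit scheme would mispredict twice)
```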
Dynamic Branch Prediction – second scheme

Predict based on the history of the previous n branches.
e.g., if n = 3 then
3 branches taken → predict taken
2 branches taken → predict taken
1 branch taken → predict not taken
0 branches taken → predict not taken
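The majority rule above can be sketched directly (n = 3; the all-not-taken initial history is an assumption):

```python
# Sketch: keep the last n outcomes of a branch and predict taken iff
# a strict majority of them were taken (matches the n = 3 rule above).
from collections import deque

class MajorityPredictor:
    def __init__(self, n=3):
        self.history = deque([False] * n, maxlen=n)
    def predict(self):
        return 2 * sum(self.history) > len(self.history)
    def update(self, taken):
        self.history.append(taken)   # maxlen drops the oldest outcome

p = MajorityPredictor()
p.update(True); p.update(True); p.update(False)   # last three: T T N
print(p.predict())  # → True: 2 of the last 3 were taken
```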
Dynamic Branch Prediction – bimodal predictor

Maintain saturating counters.

[Counter state diagram: T increments 0 → 1 → 2 → 3 (saturating), N decrements 3 → 2 → 1 → 0.]

• One counter per branch, or
• One counter per cache line – merge results if multiple branches
Dynamic Branch Prediction – history of last n occurrences

Current entry: 1 1 0 – the outcomes of the last three occurrences of this branch (0: not taken, 1: taken). Prediction uses a majority decision.

If the actual outcome is 'taken', the new bit is shifted in and the oldest bit dropped, giving the updated entry 1 1 1.
Dynamic Branch Prediction – storing prediction counters

Store in a separate buffer, or store in the cache directory:

[Diagram: cache = directory + storage; the directory entry for each cache line holds its prediction counter.]
Correct guesses vs. history length

% correct guesses for history length n, by workload:

  n   Compiler   Business   Scientific   Supervisor
  0     64.1       64.4        70.4         54.0
  1     91.9       95.2        86.6         79.7
  2     93.3       96.5        90.8         83.4
  3     93.7       96.6        91.0         83.5
  4     94.5       96.8        91.8         83.7
  5     94.7       97.0        92.0         83.9
Two-Level Prediction

• Uses two levels of information to make a direction prediction
  – Branch History Table (BHT) – last n occurrences
  – Pattern History Table (PHT) – saturating 2-bit counters
• Captures patterned behavior of branches
  – Groups of branches are correlated
  – Particular branches have particular behavior
Correlation between branches

    B1: if (x)
        ...
    B2: if (y)
        ...
        z = x && y
    B3: if (z)
        ...

• B3 can be predicted with 100% accuracy based on the outcomes of B1 and B2
Some Two-level Predictors

[Global predictor: a single Global Branch History Register (GBHR, e.g. 10110) indexes the PHT of 2-bit counters to give a T/NT prediction.

Local predictor: bits from the PC select a per-branch history entry in the BHT (e.g. 11010, 01111, 11100, 00111), which then indexes the PHT.]

Bits from the PC and the BHT can be combined to index the PHT.
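One common way to combine PC and history bits, often called "gshare" (naming it is an assumption here; the slide only says the bits can be combined), is to XOR them. The PHT size and addresses below are illustrative:

```python
# Sketch: gshare-style PHT indexing - XOR low PC bits with the global
# branch history register. PHT_BITS and the addresses are assumptions.
PHT_BITS = 5                           # hypothetical 32-entry PHT

def pht_index(pc, gbhr):
    mask = (1 << PHT_BITS) - 1
    # Drop the byte offset within the instruction, fold in the history.
    return ((pc >> 2) ^ gbhr) & mask

# The same branch PC with different global histories maps to different
# counters - this is how cross-branch correlation is captured.
print(pht_index(0x4001C, 0b10110))
print(pht_index(0x4001C, 0b01001))
```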
Two-level Predictor Classification

• Yeh and Patt 3-letter naming scheme
  – Type of history collected: G (global), P (per branch), S (per set)
  – PHT type: A (adaptive), S (static)
  – PHT organization: g (global), p (per branch), s (per set)
• Examples – GAs, PAp, etc.
Branch Target Capture

• Branch Target Buffer (BTB): indexed by instr addr; holds pred stats and the target addr
• Target Instruction Buffer (TIB): indexed by instr addr; holds pred stats and the target instr

Probability of target change < 5%
BTB Performance

decision                     actual outcome   prob   delay
BTB miss (.4) → go inline    inline            .8      0
                             target            .2      5
BTB hit (.6) → go to target  target            .8      0
                             inline            .2      4

Expected delay = .4 × .8 × 0 + .4 × .2 × 5 + .6 × .2 × 4 + .6 × .8 × 0 = 0.88 cycles
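The expected-delay computation above can be checked with a sketch (probabilities and penalties taken from the slide):

```python
# Sketch: expected branch delay for the BTB scenario above.
cases = [
    # (P(case), P(outcome | case), delay in cycles)
    (0.4, 0.8, 0),   # BTB miss, actually inline  -> no delay
    (0.4, 0.2, 5),   # BTB miss, actually target  -> 5-cycle delay
    (0.6, 0.2, 4),   # BTB hit,  actually inline  -> 4-cycle delay
    (0.6, 0.8, 0),   # BTB hit,  actually target  -> no delay
]
expected = sum(p_case * p_outcome * delay
               for p_case, p_outcome, delay in cases)
print(f"{expected:.2f} cycles")  # → 0.88 cycles
```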
Dynamic information about branch

Branch History Table (BHT):
• Previous branch decisions
• Explicit prediction
• Stored in cache directory

Branch Target Buffer (BTB) / Br Target Addr Cache (BTAC), and Target Instr Buffer (TIB) / Br Target Instr Cache (BTIC):
• Previous target address / instruction
• Implicit prediction
• Stored in separate buffer

These two can be combined.
Storing prediction info

• In cache: the directory entry for each cache line holds a counter
• In separate buffer: entries of the form instr addr → pred stats, target
Combined prediction mechanism

• Explicit: use history bits
• Implicit: use BTB hit/miss
  – hit → go to target, miss → go inline
• Combined: BTB hit/miss followed by explicit prediction using history bits. One of the following is commonly used:
  – hit → go to target, miss → explicit prediction
  – miss → go inline, hit → explicit prediction
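The two combined policies can be sketched as decision functions. `history_predicts_taken` stands for the explicit history-bit prediction; both names are illustrative:

```python
# Sketch of the two combined policies above (True = go to target).
def policy_hit_wins(btb_hit, history_predicts_taken):
    # hit -> go to target, miss -> consult explicit prediction
    return True if btb_hit else history_predicts_taken

def policy_miss_wins(btb_hit, history_predicts_taken):
    # miss -> go inline, hit -> consult explicit prediction
    return history_predicts_taken if btb_hit else False

print(policy_hit_wins(True, False))    # → True  (hit forces target)
print(policy_miss_wins(False, True))   # → False (miss forces inline)
```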
Combined prediction

[Decision trees for the two combined policies: BTB miss/hit first, then, where applicable, explicit prediction, with the resulting predicted and actual outcomes.]

Prediction → T: Target, I: Inline. Actual outcome → T: Target, I: Inline.
Structure of Tables
Instruction fetch path with
• BHT
• BTAC
• BTIC
Compute/fetch scheme (no dynamic branch prediction)

[Diagram: the instruction fetch address register (IFAR) indexes the I-cache, fetching instructions I, I+1, I+2, I+3 (or target instructions BTI, BTI+1, BTI+2, BTI+3). An adder computes the next sequential address, and the branch target address (BTA) is computed separately; either can become the next fetch address.]
BHT (Branch History Table)

[Diagram: the instruction fetch address indexes, in parallel, a 16 K 4-way set-associative I-cache (128 × 4 lines, 8 instr/line) and a BHT of 128 × 4 entries with 2 history bits per instruction. 4 instructions per cycle enter the decode and issue queues; prediction logic combines them with the history bits to produce taken / not taken and the BTA for a taken guess.]
BTAC scheme

[Diagram: as in the compute/fetch scheme, but a Branch Target Address Cache (BTAC) is accessed in parallel with the I-cache using the fetch address; on a hit it supplies the branch address (BA) and branch target address (BTA), so the target can be fetched without first computing the BTA.]
BTIC scheme - 1

[Diagram: a Branch Target Instruction Cache (BTIC) is accessed in parallel with the I-cache; on a hit it supplies the target instruction (BTI) directly to the decoder, together with BTA+, the address following the target, which becomes the next fetch address.]
BTIC scheme - 2

[Diagram: as in scheme 1, but the BTIC supplies two target instructions (BTI, BTI+1) to the decoder, and BTA+ is computed rather than stored in the BTIC.]
Successor index in I-cache

[Diagram: each I-cache line holds, along with instructions I … I+3, a successor index that directly gives the next fetch address, so a taken branch's targets BTI … BTI+3 are fetched without a separate target-address computation.]