Two research studies related to branch prediction and instruction sequencing

André Seznec
INRIA/IRISA
Storage-Free Confidence Estimator for the TAGE predictor

Why confidence estimation for branch predictors?
• Energy/performance tradeoffs:
  – Guiding fetch gating or fetch throttling
  – Dynamic resizing of speculative structures
• Controlling SMT resource allocation through fetch policies:
  – Fetch the “most” useful instructions
• Dual-path execution
What is confidence estimation?
• Assign a confidence to a prediction: is it likely that the prediction is correct?
• Generally, only low- and high-confidence predictions are discriminated:
  – High confidence: « very likely » to be correct
  – Low confidence: « not so likely » to be correct
Confidence estimation for branch predictors
• 1981, Jim Smith: predictions made by weak counters are more likely to mispredict
• 1996, Jacobsen, Rotenberg, Smith (JRS): a gshare-like table of 4-bit counters (sketched below)
  – Increment on a correct prediction, reset on a misprediction
  – low confidence < threshold ≤ high confidence
• 1998, enhanced JRS, Grunwald et al.: use the prediction in the index
• A few other proposals: self-confidence for perceptrons, ..
• Most studies still use enhanced JRS confidence estimators
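For illustration, a minimal sketch of such a JRS-style resetting-counter estimator; the table size, counter width, threshold and hashing are illustrative choices, not values taken from the papers above:

#include <cstdint>
#include <vector>

// JRS-style confidence estimator: a table of 4-bit resetting counters.
class JRSConfidence {
    std::vector<uint8_t> ctr;                  // 4-bit resetting counters
    static constexpr uint8_t kMax = 15;
    static constexpr uint8_t kThreshold = 15;  // >= threshold : high confidence
public:
    explicit JRSConfidence(size_t entries) : ctr(entries, 0) {}

    // Enhanced JRS (Grunwald et al.): fold the prediction into the index.
    size_t index(uint64_t pc, uint64_t ghist, bool prediction) const {
        return (pc ^ ghist ^ (uint64_t)prediction) % ctr.size();
    }
    bool highConfidence(uint64_t pc, uint64_t ghist, bool prediction) const {
        return ctr[index(pc, ghist, prediction)] >= kThreshold;
    }
    void update(uint64_t pc, uint64_t ghist, bool prediction, bool correct) {
        uint8_t &c = ctr[index(pc, ghist, prediction)];
        if (correct) { if (c < kMax) c++; }    // increment on a correct prediction
        else         c = 0;                    // reset on a misprediction
    }
};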
Metrics for confidence estimators (Grunwald et al., 1998)
• SENS, Sensitivity: fraction of correct predictions classified as high confidence
• PVP, Predictive Value of a Positive test: probability of a high-confidence prediction to be correct
• SPEC, Specificity: fraction of mispredictions classified as low confidence
• PVN, Predictive Value of a Negative test: probability of a low-confidence prediction to be mispredicted
• Different qualities matter for different usages (the four ratios are written out below)
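Written out as ratios, the verbal definitions above correspond to:

  SENS = (correct predictions classified high confidence) / (all correct predictions)
  SPEC = (mispredictions classified low confidence) / (all mispredictions)
  PVP  = (correct high-confidence predictions) / (all high-confidence predictions)
  PVN  = (mispredicted low-confidence predictions) / (all low-confidence predictions)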
The current limits of confidence prediction
• Discriminating between high and low confidence is insufficient:
  – What is the misprediction rate on high- and low-confidence predictions?
• Malik et al.: use a probability for each counter value of an enhanced JRS estimator
• Enhanced JRS and state-of-the-art branch predictors?
  – Each predictor needs its own confidence estimator
This study
A cost-effective confidence estimator for TAGE:
• No storage overhead
• Discriminates three classes:
  – Low-confidence predictions: ≈ 30 % misprediction rate or more
  – Medium-confidence predictions: 8-15 % misprediction rate
  – High-confidence predictions: < 1 % misprediction rate
TAGE: a multiple-table, global-history predictor
The set of history lengths forms a geometric series (see the sketch below):
  L(0) = 0,  L(i) = α^(i-1) · L(1)
e.g. {0, 2, 4, 8, 16, 32, 64, 128}
• Captures correlation on very long histories, while most of the storage serves short histories!
• What is important: L(i) - L(i-1) increases drastically
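A minimal sketch of how such a series can be generated; the minimum and maximum lengths below are only an example reproducing the series above:

#include <cmath>
#include <cstdio>
#include <vector>

// Geometric series of history lengths: L(0) = 0, L(i) = Lmin * alpha^(i-1),
// with alpha chosen so that the last tagged component uses Lmax history bits.
std::vector<int> historyLengths(int numTagged, double Lmin, double Lmax) {
    double alpha = std::pow(Lmax / Lmin, 1.0 / (numTagged - 1));
    std::vector<int> L(numTagged + 1);
    L[0] = 0;                                        // tagless base component
    for (int i = 1; i <= numTagged; i++)
        L[i] = (int)std::lround(Lmin * std::pow(alpha, i - 1));
    return L;
}

int main() {
    for (int l : historyLengths(7, 2, 128))          // prints: 0 2 4 8 16 32 64 128
        std::printf("%d ", l);
    std::printf("\n");
}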
TAGE: geometric history lengths + PPM-like tagged tables + optimized update policy

[Diagram: a tagless base (bimodal) predictor plus several tagged tables indexed by hash(pc, h[0:L1]), hash(pc, h[0:L2]), hash(pc, h[0:L3]), ...; each tagged entry holds a prediction counter (ctr), a partial tag and a useful field (u); tag comparisons (=?) select which table provides the prediction.]
[Diagram: prediction selection; the longest-history table that hits provides the prediction (Pred), the next longest matching table provides the alternate prediction (Altpred), and tables that miss are ignored.]
Prediction computation
• General case: the longest matching component provides the prediction
• Special case: many mispredictions on newly allocated entries (weak Ctr)
  – On many applications, Altpred is more accurate than Pred
  – This property is dynamically monitored through a single 4-bit counter (selection sketched below)
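A minimal sketch of this selection between Pred and Altpred; the signal names and the way “newly allocated” is exposed are assumptions for illustration, and the 3-bit counters are taken as signed values in [-4, 3]:

#include <cstdint>
#include <vector>

struct Component { bool hit; int8_t ctr; };   // ctr >= 0 means "predict taken"

// comp[0] is the tagless bimodal (always hits), comp.back() the longest history.
bool tagePredict(const std::vector<Component> &comp,
                 bool providerNewlyAllocated,  // provider entry was just allocated (weak Ctr)
                 int8_t useAltOnNA)            // the single 4-bit counter: >= 0 -> prefer Altpred
{
    int provider = 0, alt = 0;
    for (int i = (int)comp.size() - 1; i >= 1; i--) {
        if (comp[i].hit) {
            if (provider == 0) provider = i;   // longest matching component
            else { alt = i; break; }           // next longest matching component
        }
    }
    bool pred    = comp[provider].ctr >= 0;
    bool altpred = comp[alt].ctr >= 0;
    // Special case: on newly allocated entries, Altpred is often the better choice.
    return (providerNewlyAllocated && useAltOnNA >= 0) ? altpred : pred;
}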
A tagged table entry: [Ctr | Tag | U] (struct sketch below)
• Ctr: 3-bit prediction counter
• U: 2-bit useful counter (was the entry recently useful?)
• Tag: partial tag
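Sketched as a C++ struct with bit-fields; the field widths for Ctr and U are those listed above, while the tag width and the packing are illustrative:

#include <cstdint>

struct TaggedEntry {
    uint16_t ctr : 3;   // 3-bit prediction counter
    uint16_t u   : 2;   // 2-bit useful counter
    uint16_t tag : 11;  // partial tag (actual width depends on the table)
};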
Confidence by observation on TAGE
• Apart from the prediction itself, the predictor delivers the provider component and the value of its prediction counter
  – Highly correlated with the quality of the prediction
• The history of recent mispredictions can also be observed
  – A burst of mispredictions may indicate predictor warming or a program phase change
Experimental framework
• 20 traces from CBP-1 and 20 traces from CBP-2
• 16-Kbit TAGE: 5 tables, maximum history 80 bits
• 64-Kbit TAGE: 8 tables, maximum history 130 bits
• 256-Kbit TAGE: 9 tables, maximum history 300 bits
• Probability of misprediction as the metric of confidence: Mispredictions Per Kilo-predictions (MKP, defined below)
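For a given class of predictions, this metric reads:

  MKP = 1000 × (mispredictions in the class) / (predictions in the class)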
Bimodal as the provider component
• Provides many (often most) of the predictions:
  – Allocation of a tagged-table entry happens only on a misprediction
• 256-Kbit TAGE: the bimodal prediction is very accurate
  – Generally the bimodal prediction is the bias of the branch
  – Often less than 1 MKP, always significantly lower than the global misprediction rate
• 16-Kbit TAGE:
  – Often the bimodal prediction is very accurate
  – On demanding applications: bimodal is not better than average
Discriminating the bimodal predictions
• Weak counters:
  – Systematically more than 250 MKP (generally more than 300 MKP)
  – Can be classified as low confidence
• « Identify » conflicts due to the limited predictor size: was a misprediction provided by the bimodal recently (last 10 branches)?
  – ≈ 80-150 MKP for 16 Kbits, ≈ 50-70 MKP for 64 Kbits
  – Can be classified as medium confidence
• The remaining: high confidence, < 10 MKP, generally much less
A tagged component as the provider
• Discriminate on the value of the prediction counter, |2·Ctr + 1|:

                      |2·Ctr+1|   TAGE 16 Kbits   TAGE 256 Kbits
  Weak:                   1          340 MKP          325 MKP
  Nearly Weak:            3          313 MKP          312 MKP
  Nearly Saturated:       5          213 MKP          225 MKP
  Saturated:              7           29 MKP           17 MKP
Tagged component as provider: a more thorough analysis
• Weak, Nearly Weak, Nearly Saturated:
  – For all benchmarks and all three TAGE configurations, in the range of 200 MKP or higher
• Saturated:
  – Slightly lower than the global misprediction rate of the application
  – Very high confidence for predictable applications (< 10 MKP)
  – Not that high confidence for poorly predictable applications (> 50 MKP)
• Problem: Saturated often represents more than 50 % of the predictions
Intermediate summary
• High confidence class: (bimodal saturated, no recent misprediction by the bimodal)
• Low confidence class: bimodal weak and non-saturated tagged
• Medium confidence class: (bimodal and recent misprediction by the bimodal)
• Tagged saturated:
  – Depends on the application, the predictor size, etc.
  – Very large class ..
Tweaking the predictor to improve confidence

How to improve confidence on the tagged saturated class?
• Widening the prediction counter? Not that good:
  – Slightly decreased accuracy
  – Only marginal improvement in accuracy on the saturated class
• Modifying the counter update: transition to the saturated state with a very low probability (sketched below)
  – P = 1/128 in our experiments
  – Marginal accuracy loss (≈ 0.02 MPKI)
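A minimal sketch of this tweaked update for a 3-bit signed counter in [-4, 3], whose saturated states are 3 and -4; the software PRNG stands in for the few pseudo-random bits (e.g. from an LFSR) a hardware implementation would use:

#include <cstdint>
#include <random>

void updateCtr(int8_t &ctr, bool taken, std::mt19937 &rng) {
    std::uniform_int_distribution<int> dice(0, 127);
    if (taken) {
        if (ctr == 2)      { if (dice(rng) == 0) ctr = 3; }   // saturate with P = 1/128
        else if (ctr < 3)  ctr++;
    } else {
        if (ctr == -3)     { if (dice(rng) == 0) ctr = -4; }  // saturate with P = 1/128
        else if (ctr > -4) ctr--;
    }
}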
Towards 3 confidence classes
• With the modified update, tagged Saturated becomes high confidence:

             16 Kbits   64 Kbits   256 Kbits
  Maximum     16 MKP     13 MKP     12 MKP
  Average      4 MKP      2 MKP      2 MKP

• Nearly Saturated is enlarged and becomes medium confidence:

             16 Kbits   64 Kbits   256 Kbits
  Maximum    169 MKP    173 MKP    174 MKP
  Average     85 MKP     71 MKP     73 MKP
Towards 3 confidence classes
• Low confidence: weak bimodal + Weak tagged + Nearly Weak tagged
• Medium confidence: bimodal recently mispredicted + Nearly Saturated tagged
• High confidence: bimodal saturated + Saturated tagged (classifier sketched below)
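Putting the pieces together, a minimal sketch of the resulting storage-free classifier; the inputs are signals TAGE already produces, and the names below are illustrative:

#include <cstdint>
#include <cstdlib>

enum class Confidence { Low, Medium, High };

Confidence classify(bool providerIsBimodal,
                    int8_t ctr,               // provider prediction counter, signed in [-4, 3]
                    bool bimodalWeak,         // bimodal counter is weak
                    bool bimodalRecentMisp)   // bimodal mispredicted among the last ~10 branches
{
    if (providerIsBimodal) {
        if (bimodalWeak)       return Confidence::Low;
        if (bimodalRecentMisp) return Confidence::Medium;
        return Confidence::High;              // bimodal saturated, no recent misprediction
    }
    // Tagged provider: discriminate on |2*Ctr + 1|.
    int conf = std::abs(2 * ctr + 1);
    if (conf == 7) return Confidence::High;   // Saturated (with the tweaked update)
    if (conf == 5) return Confidence::Medium; // Nearly Saturated, enlarged by the tweak
    return Confidence::Low;                   // Weak or Nearly Weak
}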
Prediction and misprediction coverage
Each cell: prediction coverage - misprediction coverage (misprediction rate in MKP)

              high conf           medium conf          low conf
  16 Kbits    0.740-0.093 (5)     0.209-0.466 (85)     0.051-0.439 (317)
  64 Kbits    0.799-0.076 (3)     0.160-0.450 (71)     0.040-0.474 (316)
  256 Kbits   0.813-0.050 (2)     0.148-0.455 (73)     0.036-0.491 (325)
Behavior examples, 64 Kbits
Each cell: prediction coverage - misprediction coverage (misprediction rate in MKP)

                        high conf           medium conf          low conf
  twolf  15.143 MPKI    0.465-0.053 (13)    0.385-0.460 (137)    0.150-0.487 (390)
  gcc     4.192 MPKI    0.780-0.093 (3)     0.195-0.450 (51)     0.025-0.457 (295)
  vortex  0.300 MPKI    0.976-0.004 (0)     0.019-0.710 (110)    0.005-0.286 (207)
[Chart: breakdown of predictions and mispredictions into the low, medium and high confidence classes.]
Summary on confidence estimation
• Many studies on applications of confidence estimation, but very few on confidence estimators themselves
• Each predictor requires a different confidence estimator
• A very cost-effective and efficient confidence estimator for TAGE:
  – Storage-free, very limited logic
  – Discriminates between 3 confidence classes:
      Medium + low confidence cover > 90 % of the mispredictions
      High confidence is in the range of 1 % mispredictions or less
SYRANT
with Nathanael Prémillieu
« Moderate cost » control-independence exploitation
Why?
• Branch prediction accuracy is reaching a plateau: TAGE (2006), then what?
• Try something else ..
Control flow reconvergence
[Diagram: a branch (if) splits the instruction flow into a not-taken path and a taken path (else); the two paths merge again at the reconvergence point.]
Exploiting control flow reconvergence
• On a misprediction: can we save some useful work done after the reconvergence point?
[Diagram: on the wrong path, the Control-Dependent (CD) instructions are to be invalidated; past the reconvergence point, Control-Independent Data-Independent (CIDI) instructions should be conserved, while Control-Independent Data-Dependent (CIDD) instructions must be re-executed.]
Difficulties
• Not the same renaming scheme on both paths: how to conserve results?
• Identification of the reconvergence point: check against all previously fetched instructions on the wrong path?
• Identification of CIDI and CIDD instructions?

SYRANT: SYmmetric Resource Allocation on Not-taken and Taken paths
[Diagram: physical registers (and LSQ/ROB entries) P0-P8 allocated along the two paths of a branch; a gap of unused registers is inserted on the shorter path so that, from the reconvergence point on, both paths allocate the same physical registers. Insert gaps to reuse the same physical registers.]
Execution
Register validity is tracked through a tagging process at the rename stage; at refetch after a misprediction, the tag is incremented (X to Y).
[Diagram: the predicted path and the corrected path around a mispredicted branch, showing the tag, destination register and source operands of each renamed instruction before and after the reconvergence point.]
Conserve the tag and the validity if:
1) same instruction
2) same operands, including tags
[Diagram, repeated over several animation steps: after the reconvergence point, the corrected path is compared instruction by instruction with the predicted path; matching instructions keep their tag and validity.]
Reconvergence detection
• Precise detection would require checking the PC of every previously fetched instruction
• Use approximate detection: detect the first branch after reconvergence

Approximate detection of the reconvergence point (data-structure sketch below)
[Diagram: the Active Branch List records one entry (Branch, NbR, Direction) per in-flight branch: (B1, 1, T), (B2, 12, T), (B3, 17, NT), (B4, 22, NT), (B5, 23, T), (B6, 29, T), (B7, 40, NT); on branch misprediction detection, the wrong-path entries are copied into the Shadow Branch List.]
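A very rough sketch of such branch lists; the entry fields and the copy-on-misprediction step follow the figure above, but the exact matching and bookkeeping rules are assumptions for illustration only:

#include <cstdint>
#include <deque>

struct BranchEntry {
    uint64_t pc;      // branch PC
    uint32_t nbR;     // resources allocated so far on this path (NbR)
    bool     taken;   // direction
};

struct ReconvergenceLists {
    std::deque<BranchEntry> abl;   // Active Branch List: current (predicted) path
    std::deque<BranchEntry> sbl;   // Shadow Branch List: previous wrong path

    void onBranchFetched(const BranchEntry &e) { abl.push_back(e); }

    // On a misprediction, the wrong-path part of the ABL is copied into the SBL,
    // then dropped from the ABL before the correct path is refetched.
    void onMisprediction(size_t firstWrongPathIndex) {
        sbl.assign(abl.begin() + firstWrongPathIndex, abl.end());
        abl.erase(abl.begin() + firstWrongPathIndex, abl.end());
    }
};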
[Diagram: after the misprediction, the ABL holds the corrected-path branches (B1, B2, B'3, B'4, B'5, B6, ...) while the SBL holds the wrong-path branches (B3 .. B7).]
This allows monitoring the resource consumption on both paths.
[Diagram: wrong path (WP) and right path (RP) of branches B1 and B2 with their reconvergence points RP1 and RP2; SYRANT first determines the gap between the two paths, then uses the gap to align resource allocation.]
Gap size issue
• The two paths may be very different: sometimes 100's of instructions, a waste of resources
• Different filters:
  – Only try when the gap size is limited
  – Only try if the wrong path was the longer one
  – Only try if the branch confidence is low (or medium)
  – Only try if the reconvergence point / gap confidence is high
Continue execution after branch misprediction resolution
• On « normal » superscalar processors: kill every instruction after the misprediction
• Control independence exploitation: let execution continue until the resources are claimed back
  – Phantom execution
Preliminary performance evaluation
• 8-way superscalar, deep 20-stage pipeline
• Very large instruction window
• TAGE predictor
• SPEC 2006
[Results: reconvergence is detected in most cases.]
[Results: some speed-up, but relatively poor.]
[Results: 4-way issue processor.]
That's preliminary ..
• No gap size limit on the predicted path
• No discrimination on medium/low confidence
• No feedback on branch prediction: we just did not use the computed path
Pre-execution of branches
• Just considering the ABL/SBL mechanisms: can pre-execution of branches be helpful?
  – Without visibility on validity
  – With visibility on validity (in SYRANT): to be done
• Just use pre-execution to guide the branch prediction
Summary on SYRANT
• Control independence exists
• It can potentially be exploited through a SYRANT-like mechanism:
  – Still to be improved/understood
  – Need to understand the feedback effects
• Pre-execution of branches can be exploited:
  – Reduces the misprediction rate
Back-up slides
Control-dependent instructions
[Backup diagram: execution of the not-taken and taken paths around a branch, listing each renamed instruction with its source registers, destination register and tag; after the reconvergence point, instructions are classified as CIDI or CIDD depending on whether their operands were produced on the wrong path.]
Updating the U counter
If (Altpred ≠ Pred) then
• Pred correct: U = U + 1
• Pred incorrect: U = U - 1
Graceful aging:
• Periodic reset of all U counters, implemented through the reset of a single bit at a time (sketched below)
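A minimal sketch of this update and of the aging; alternating between clearing the most and least significant bit is one possible implementation of the single-bit reset, shown here as an assumption:

#include <cstdint>
#include <vector>

// Update the 2-bit useful counter of the provider entry.
void updateU(uint8_t &u, bool pred, bool altpred, bool outcome) {
    if (altpred != pred) {                         // only when the two predictions disagree
        if (pred == outcome) { if (u < 3) u++; }   // the provider entry was useful
        else                 { if (u > 0) u--; }
    }
}

// Graceful aging: periodically clear one bit of every U counter.
void ageU(std::vector<uint8_t> &uTable, bool clearMsb) {
    for (uint8_t &u : uTable)
        u &= clearMsb ? 0x1 : 0x2;                 // keep only the other bit
}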
Allocating a new entry on a misprediction
• Find a single “useless” entry in a component with a longer history:
  – Privilege the smallest possible history, to minimize the footprint
  – But not too much, to avoid ping-pong phenomena
• Initialize Ctr as weak and U as zero (allocation sketched below)
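A minimal sketch of this allocation; the probabilistic skip is only one illustrative way of privileging short histories "but not too much", not necessarily the policy used in TAGE:

#include <cstdint>
#include <random>
#include <vector>

struct Entry { int8_t ctr; uint8_t u; uint16_t tag; };

// candidates[i] is the entry indexed in component i; components are ordered
// by increasing history length, and `provider` gave the mispredicted prediction.
int allocateOnMisprediction(std::vector<Entry*> &candidates, int provider,
                            bool taken, uint16_t newTag, std::mt19937 &rng) {
    std::bernoulli_distribution skip(0.5);
    for (size_t i = provider + 1; i < candidates.size(); i++) {
        if (candidates[i]->u == 0) {                       // a "useless" entry
            if (i + 1 < candidates.size() && skip(rng))    // sometimes skip, so the shortest
                continue;                                  // history is not always chosen
            candidates[i]->ctr = taken ? 0 : -1;           // initialize Ctr as weak
            candidates[i]->u   = 0;
            candidates[i]->tag = newTag;
            return (int)i;
        }
    }
    return -1;                                             // no entry allocated
}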
TAGE update policy
• General principle: minimize the footprint of the prediction
• Just update the longest-history matching component and allocate at most one entry on a misprediction
[Backup diagram: control flow reconvergence; a branch (if) with an incorrect path (then), the other path (else), and the reconvergence point in the instruction flow.]

SYRANT: SYmmetric Resource Allocation on Not-taken and Taken paths
[Backup diagram: physical registers P0-P8 allocated on both paths of a branch, with a gap of unused registers inserted so that the registers allocated after the reconvergence point are the same on both paths.]
[Backup diagram: ABL and SBL entries of the form (Branch, CI, Direction): (B1, 1, T), (B2, 12, T), (B3, 17, NT), (B4, 22, NT), (B5, 23, T), (B6, 29, T), (B7, 40, NT).]