CSL718 : Pipelined Processors
Improving Branch Performance – contd.
21st Jan, 2006
Anshul Kumar, CSE IITD

Improving Branch Performance
• Branch Elimination – replace branch with other instructions
• Branch Speed Up – reduce time for computing CC and TIF
• Branch Prediction – guess the outcome and proceed, undo if necessary
• Branch Target Capture – make use of history

Branch Elimination
Use conditional/guarded instructions (predicated execution)
[Figure: flowchart with condition C (exits T and F) and statement S]
With branch:
    OP1
    BC CC = Z, + 2
    ADD R3, R2, R1
    OP2
Guarded form (C : S):
    OP1
    ADD R3, R2, R1, NZ
    OP2
Examples:
• HP PA (all integer arithmetic/logical instructions)
• DEC Alpha, SPARC V9 (conditional move)

Branch Elimination - contd.
[Figure: pipeline timing diagram comparing the sequence OP1, BC, ADD/OP2 with the sequence OP1, ADD (cond). Labels shown: CC, IF, D, D’, AG, DF, EX, TIF.]

Branch Speed Up : early target address generation
• Assume each instruction is Branch
• Generate target address while decoding
• If target in same page omit translation
• After decoding discard target address if not Branch
[Figure: pipeline timing for BC. Labels shown: IF, D, AG, TIF.]

Branch Speed Up : increase CC - branch gap
Increase the gap between the instruction which sets CC and branching
• Early CC setting
• Delayed branch

Summary - Branch Speed Up
                     n=0  n=1  n=2  n=3  n=4  n=5
early CC setting
    uncond            4    4    4    4    4    4
    cond (T)          6    5    4    4    4    4
    cond (I)          5    4    3    2    1    0
delayed branch
    uncond            4    3    2    1    0    0
    cond (T)          6    5    4    3    2    1
    cond (I)          5    4    3    2    1    0

Delayed Branch with Nullification
(Also called annulment)
• Delay slot is used optionally
• Branch instruction specifies the option
• Option may be exercised based on correctness of branch prediction
• Helps in better utilization of delay slots

Branch Prediction
• Treat conditional branches as unconditional branches / NOP
• Undo if necessary
Strategies:
– Fixed (always guess inline)
– Static (guess on the basis of instruction type / displacement)
– Dynamic (guess based on recent history)
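The three strategy classes listed above can be compared on a toy trace. The following is only a sketch under stated assumptions: the `Branch` record, the addresses, and the trace are hypothetical; the static rule used here (backward displacement means guess taken) is one possible displacement-based rule, not necessarily the one intended in the slides; and the dynamic rule simply repeats the last outcome of the same branch.

```python
from dataclasses import dataclass

@dataclass
class Branch:
    pc: int       # address of the branch instruction (hypothetical)
    target: int   # address of the branch target (hypothetical)
    taken: bool   # actual outcome

def predict_fixed(branch, last_outcome):
    """Fixed: always guess inline (not taken)."""
    return False

def predict_static(branch, last_outcome):
    """Static: guess from the displacement sign (assumed rule: backward => taken)."""
    return branch.target < branch.pc

def predict_dynamic(branch, last_outcome):
    """Dynamic: repeat the most recent outcome of this branch (default: not taken)."""
    return last_outcome.get(branch.pc, False)

def count_correct(predict, trace):
    """Run one strategy over a trace and count the correct guesses."""
    last_outcome = {}
    correct = 0
    for b in trace:
        if predict(b, last_outcome) == b.taken:
            correct += 1
        last_outcome[b.pc] = b.taken  # history is recorded after the guess is checked
    return correct

# Hypothetical trace: a loop-closing branch at 0x40 jumping back to 0x10,
# taken three times and then falling through; the loop is entered twice.
loop = [Branch(0x40, 0x10, True)] * 3 + [Branch(0x40, 0x10, False)]
trace = loop + loop

for name, fn in [("fixed", predict_fixed),
                 ("static", predict_static),
                 ("dynamic", predict_dynamic)]:
    print(name, count_correct(fn, trace), "of", len(trace))
```

On this particular trace the fixed guess is right only on the two loop exits, while the displacement rule is wrong only on those exits; the ranking would differ for forward branches that are mostly not taken.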

Static Branch Prediction
Instr       %       Guess Branch   Correct   Correct (of all branches)
uncond      14.5    always         100%      14.5%
cond        58      never          54%       27%
loop        9.8     always         91%       9%
call/ret    17.7    always         100%      17.7%
Total                                        68.2%

Threshold for Static prediction
[Figure: pipeline timing for instructions I-1 and I. Labels shown: IF, D, AG, DF, EX, CC, TIF.]
             actual T   actual I
guess T         4          5
guess I         6          0
guess target if 4 p + 5 (1 - p) < 6 p + 0 (1 - p), i.e. p > .71

Dynamic Branch Prediction basic idea
Predict based on the history of previous branch
    loop: xxx
          xxx
          xxx
          xxx
          BC loop
2 mispredictions for every occurrence

Dynamic Branch Prediction 2 bit prediction scheme
[Figure: state diagram with four states 0, 1, 2, 3 and transitions labelled T and N; states 3/2 predict taken, states 0/1 predict not taken.]

Dynamic Branch Prediction second scheme
Predict based on the history of previous n branches
e.g., if n = 3 then
    3 branches taken    predict taken
    2 branches taken    predict taken
    1 branch taken      predict not taken
    0 branches taken    predict not taken

Dynamic Branch Prediction Bimodal predictor
Maintain saturating counters
[Figure: state diagram of a saturating counter with states 0, 1, 2, 3 and transitions labelled T and N.]
• One counter per branch, or
• One counter per cache line – merge results if multiple branches

Dynamic Branch Prediction History of last n occurrences
current entry (outcome of last three occurrences of this branch):   1 1 0
actual outcome: ‘taken’
updated entry:   1 1 1
0 : not taken, 1 : taken
prediction using majority decision

Dynamic Branch Prediction storing prediction counters
• store in separate buffer, or
• store in cache directory
[Figure: CACHE with directory, storage (cache line) and counter fields.]

Correct guesses vs. history length
n    Compiler   Business   Scientific   Supervisor
0    64.1       64.4       70.4         54.0
1    91.9       95.2       86.6         79.7
2    93.3       96.5       90.8         83.4
3    93.7       96.6       91.0         83.5
4    94.5       96.8       91.8         83.7
5    94.7       97.0       92.0         83.9

Two-Level Prediction
• Uses two levels of information to make a direction prediction
– Branch History Table (BHT) - last n occurrences
– Pattern History Table (PHT) - saturating 2 bit counters
• Captures patterned behavior of branches
– Groups of branches are correlated
– Particular branches have particular behavior

Correlation between branches
    B1: if (x) ...
    B2: if (y) ...
    z = x && y
    B3: if (z) ...
• B3 can be predicted with 100% accuracy based on the outcomes of B1 and B2

Some Two-level Predictors
[Figure: Global Predictor: the GBHR (e.g. 10110) indexes the PHT, which gives T/NT. Local Predictor: the PC selects a BHT entry (e.g. 11010), which indexes the PHT to give T/NT.]
bits from PC and BHT can be combined to index PHT

Two-level Predictor Classification
• Yeh and Patt 3-letter naming scheme
– Type of history collected
    • G (global), P (per branch), S (per set)
– PHT type
    • A (adaptive), S (static)
– PHT organization
    • g (global), p (per branch), s (per set)
• Examples - GAs, PAp etc.
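
The B1/B2/B3 correlation example above can be tried with a small two-level predictor. This is a minimal sketch, not a model of any particular processor: it assumes a 2-bit global history register, a pattern history table of 2-bit saturating counters selected by branch identity plus global history (roughly a GAp organisation in the naming scheme above), counters that start at 1, and that a branch is "taken" when its condition is true. The class name, the branch labels used as addresses, and the random inputs are all hypothetical.

```python
import random
from collections import defaultdict

class TwoLevelPredictor:
    """Global history register + per-branch tables of 2-bit saturating counters."""

    def __init__(self, history_bits=2):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                       # outcomes of the most recent branches
        self.pht = defaultdict(lambda: 1)  # counter values 0..3, start at 1

    def predict(self, pc):
        # Counter values 2 and 3 predict taken, 0 and 1 predict not taken.
        return self.pht[(pc, self.ghr)] >= 2

    def update(self, pc, taken):
        key = (pc, self.ghr)
        counter = self.pht[key]
        self.pht[key] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # Shift the actual outcome into the global history.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

def run(trials=1000, seed=0):
    """Feed random x, y through B1, B2, B3 and count mispredictions per branch."""
    rng = random.Random(seed)
    predictor = TwoLevelPredictor()
    misses = {"B1": 0, "B2": 0, "B3": 0}
    for _ in range(trials):
        x = rng.random() < 0.5
        y = rng.random() < 0.5
        for name, outcome in (("B1", x), ("B2", y), ("B3", x and y)):
            if predictor.predict(name) != outcome:
                misses[name] += 1
            predictor.update(name, outcome)
    return misses

if __name__ == "__main__":
    print(run())
```

When B3 is predicted, the history register holds exactly the outcomes of B1 and B2, so each history pattern always leads to the same B3 outcome. After at most one warm-up miss B3 is predicted correctly every time, while B1 and B2, which depend on random data, stay hard to predict.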

Branch Target Capture
• Branch Target Buffer (BTB)
• Target Instruction Buffer (TIB)
[Figure: buffer entry with fields instr addr, pred stats, target; the target field holds the target addr or the target instr.]
prob of target change < 5%

BTB Performance
              decision        result         delay
BTB miss .4   go inline       inline .8      0
                              target .2      5
BTB hit  .6   go to target    inline .2      4
                              target .8      0
.4*.8*0 + .4*.2*5 + .6*.2*4 + .6*.8*0 = 0.88

Dynamic information about branch
• Previous branch decisions
    • Explicit prediction
    • Stored in cache directory
    • Branch History Table, BHT
• Previous target address / instruction
    • Implicit prediction
    • Stored in separate buffer
    • Branch Target Buffer, BTB
    • Br Target Addr Cache, BTAC
    • Target Instr Buffer, TIB
    • Br Target Instr Cache, BTIC
These two can be combined

Storing prediction info
• In cache: directory, storage (cache line), counter
• In separate buffer: instr addr, pred stats, target

Combined prediction mechanism
• Explicit : use history bits
• Implicit : use BTB hit/miss
– hit go to target, miss go inline
• Combined : BTB hit/miss followed by explicit prediction using history bits. One of the following is commonly used
– hit go to target, miss explicit prediction
– miss go inline, hit explicit prediction

Combined prediction
[Figure: decision trees for the two combined schemes, branching on BTB miss / BTB hit, then on explicit prediction where it is used, then on the actual outcome.]
Prediction T: Target, I: Inline
Actual outcome T: Target, I: Inline

Structure of Tables
Instruction fetch path with
• BHT
• BTAC
• BTIC

Compute/fetch scheme (no dynamic branch prediction)
[Figure: the instruction fetch address register (IFAR) addresses the I-cache, which delivers I, I+1, I+2, I+3 or BTI, BTI+1, BTI+2, BTI+3. The IFAR is loaded from the next sequential address (+) or from the computed BTA. Other labels: IIFA, Compute BTA.]

BHT (Branch History Table)
[Figure: the instruction fetch address indexes the I-cache and the BHT in parallel.]
• I-cache: 16 K, 4-way set assoc, 128 x 4 lines, 8 instr/line, 4 instr/cycle to the decode queue and issue queue
• BHT: 128 x 4 entries, history bits (2-bit fields)
• Prediction logic: taken / not taken, BTA for a taken guess

BTAC scheme
[Figure: as the compute/fetch scheme, with a BTAC beside the I-cache holding BA and BTA pairs; the IFAR is loaded from the next sequential address (+) or from the BTA. The I-cache delivers I, I+1, I+2, I+3 or BTI, BTI+1, BTI+2, BTI+3. Other labels: IIFA.]

BTIC scheme - 1
[Figure: a BTIC beside the I-cache holds BA, BTI and BTA+; BTI goes to the decoder and BTA+ becomes the next fetch address. Other labels: IIFA, IFAR, next sequential address (+).]

BTIC scheme - 2
[Figure: a BTIC beside the I-cache holds BA, BTI and BTI+1, which go to the decoder; BTA+ is computed. Other labels: IIFA, IFAR, next sequential address (+).]

Successor index in I-cache
[Figure: each I-cache line (I, I+1, I+2, I+3) carries a successor index that supplies the next address, leading to BTI, BTI+1, BTI+2, BTI+3. Other labels: IIFA, IFAR.]
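
The implicit scheme built on a BTAC (hit go to target, miss go inline) can be sketched in a few lines. This is an illustration under assumptions rather than a description of the hardware in the figures: the class and method names are hypothetical, one instruction is fetched per step, entries are replaced in FIFO order, and an entry is dropped when its branch turns out not to be taken.

```python
class BTAC:
    """Branch target address cache: maps a branch address (BA) to its target (BTA)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}  # dicts keep insertion order, used here as a FIFO

    def next_fetch_address(self, fetch_address, step=1):
        """Implicit prediction: BTAC hit => go to target, miss => go inline."""
        target = self.entries.get(fetch_address)
        if target is not None:
            return target
        return fetch_address + step  # next sequential address

    def update(self, branch_address, taken, target):
        """Record the actual outcome once the branch is resolved."""
        if taken:
            if (branch_address not in self.entries
                    and len(self.entries) >= self.capacity):
                oldest = next(iter(self.entries))
                del self.entries[oldest]  # evict the oldest entry (assumed policy)
            self.entries[branch_address] = target
        else:
            # Assumed policy: a not-taken branch loses its entry.
            self.entries.pop(branch_address, None)

btac = BTAC(capacity=2)
print(btac.next_fetch_address(8))   # miss: fetch continues inline
btac.update(8, taken=True, target=2)
print(btac.next_fetch_address(8))   # hit: fetch is redirected to the target
```

A combined scheme would consult history bits after the hit/miss decision instead of acting on the hit or miss alone.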