EE382A Lecture 6: Register Renaming
Department of Electrical Engineering
Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009, Lecture 6, John P Shen

Announcements

• Project proposal due on Wed 10/14
  – 2-3 pages submitted through email
  – List the group members
  – Describe the topic, including why it is important and your thesis
  – Describe the methodology you will use (experiments, tools, machines)
  – Statement of expected results
  – Few key references to related work
• Still missing some photos

Lecture 6 Outline

1. Branch Prediction (epilog)
   a. 2-level Predictors
   b. AMD Opteron Example
   c. Confidence Prediction
   d. Trace Cache
2. Register Data Flow
   a. False Register Dependences
   b. Register Renaming Technique
   c. Register Renaming Implementation

Dynamic Branch Prediction Using History

[Figure: pipeline of Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Execute, Finish, Completion Buffer. A branch predictor (using a BTB) at Fetch supplies a speculative target and a speculative condition. The FA-mux selects the nPC sent to the I-cache: the prediction nPC = BP(PC) or the sequential nPC(seq.) = PC+4. The Branch unit sends the BTB update (target addr. and history) back to the predictor.]

2-Level Adaptive Prediction [Yeh & Patt]

Nomenclature: {G,P}A{g,p,s}

[Figure: the Branch History Shift Register (BHSR, shifted left on update with the branch result) indexes the Pattern History Table (PHT), whose entries run from 00...00 to 11...11. The selected PHT bits (old) feed FSM logic that produces the prediction and the new PHT bits.]

To achieve 97% average prediction accuracy:

  BHR                    PHT                       Total
  G (1): 18 bits         g (1): 2^18 x 2 bits      524 kbits
  P (512x4): 12 bits     g (1): 2^12 x 2 bits      33 kbits
  P (512x4): 6 bits      s (512): 2^6 x 2 bits     78 kbits

Example: Global BHSR Scheme (GAs)

[Figure: j bits of the branch address and the k bits of the Branch History Shift Register (BHSR) together index a BHT of 2 x 2^(j+k), which produces the prediction.]

Example: Per-Branch BHSR Scheme (PAs)

[Figure: i bits of the branch address select one entry of a per-branch Branch History Shift Register (BHSR) table of k x 2^i. The k history bits and j bits of the branch address index a standard BHT of 2 x 2^(j+k), which produces the prediction.]

Gshare Branch Prediction [McFarling]

[Figure: j bits of the branch address are XORed with the k bits of the Branch History Shift Register (BHSR); the result indexes a BHT of 2 x 2^max(j,k), which produces the prediction.]

Fetch & Predict Example: AMD Opteron

[Figure]

Why is Prediction Important in Opteron?

[Figure]
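Returning to the gshare scheme from the earlier slide: a minimal sketch of the indexing and update, assuming j = k (one `index_bits` parameter), 4-byte instructions, and 2-bit saturating counters initialized to weakly not-taken. The class name `Gshare` and its methods are illustrative only and are not taken from the lecture or from any real predictor.

```python
class Gshare:
    """Toy gshare predictor: 2-bit counters indexed by PC XOR global history."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global BHSR
        self.counters = [1] * (1 << index_bits)   # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the 2 byte-offset bits,
        # then XOR the branch address bits with the global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the branch result into the history register (shift left).
        self.history = ((self.history << 1) | int(taken)) & self.mask


# A branch that strictly alternates taken / not-taken.
predictor = Gshare(index_bits=4)
correct_after_warmup = 0
for i in range(100):
    taken = (i % 2 == 0)
    if i >= 50 and predictor.predict(0x40) == taken:
        correct_after_warmup += 1
    predictor.update(0x40, taken)
```

A single 2-bit counter cannot follow an alternating branch, but here the two history patterns map to two different counters, so the sketch predicts every outcome correctly once it has warmed up.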

Fetch & Predict Example: AMD Opteron

[Figure]

Other Branch Prediction Related Issues

• Multi-cycle BTB
  – Keep fetching sequentially, repair later (bubbles for taken branches)
  – Need pipelined access though
• BTB & predictor in series
  – Get fast target/direction prediction from BTB only
  – After decoding, use predictor to verify BTB
    • Causes a pipeline mini-flush if BTB was wrong
  – This approach allows for a much larger/slower predictor
• BTB and predictor integration
  – Can merge BTB with the local part of a predictor
  – Can merge both with I-cache entries
• Predictor/BTB/RAS updates
  – Can you see any issue?

Prediction Confidence: A Very Useful Tool for Speculation

• Estimate if your prediction is likely to be correct
• Applications
  – Avoid fetching down unlikely path
    • Save time & power by waiting
  – Start executing down both paths (selective eager execution)
  – Switch to another thread (for multithreaded processors)
• Implementation
  – Naïve: don't use NT or TN states in 2-bit counters
  – Better: array of CIR (correct/incorrect registers)
    • Shift in if last prediction was correct/incorrect
    • Count the number of 0s to determine confidence
  – Many other implementations are possible
    • Using counters etc.

Branch Confidence Prediction

[Figure]

Dynamic History Length

• Four types of history
  – Local (bimodal) history (Smith predictor)
    • Table of counters summarizes local history
    • Simple, but only effective for biased branches
  – Local outcome history
    • Shift register of individual branch outcomes
    • Separate counter for each outcome history
  – Global outcome history
    • Shift register of recent branch outcomes
    • Separate counter for each outcome history
  – Path history
    • Shift register of recent (partial) block addresses
    • Can differentiate similar global outcome histories
• Can combine or "alloy" histories in many ways

Understanding Advanced Predictors

• History length
  – Short history: lower training cost
  – Long history: captures macro-level behavior
  – Variable history length predictors
• Really long history (long loops)
  – Loop count predictors
  – Fourier transform into frequency domain
• Limited capacity & interference
  – Constructive vs. destructive
  – Bi-mode, gskewed, agree, YAGS
  – Read sec. 9.3.2 carefully

High-Bandwidth Fetch: Trace Cache

[Figure: (a) Instruction Cache, with instructions A through J spread over separate cache lines; (b) Trace Cache, with A B C D E F G H I J stored as one contiguous trace.]

• Fold out taken branches by tracing instructions as they commit into a fill buffer
• Eric Rotenberg, S. Bennett, and James E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. MICRO, December 1996.

Intel Pentium 4 Trace Cache

[Figure: Front-End BTB, Instruction TLB and Prefetcher, Level-Two Unified Data and Instruction Cache, Instruction Decode, Trace Cache BTB, Trace Cache, Instruction Fetch Queue, then to renamer, execute, etc.]

• No first-level instruction cache: trace cache only
• Trace cache BTB identifies next trace
• Miss leads to fetch from level two cache
• Trace cache instructions are decoded (uops)

Modern Superscalar, Out-of-order Processor

• Pipelining reduces cycle time
• Superscalar increases IPC (instructions per cycle)
• Both schemes need to find lots of ILP in the program
  – Must simultaneously increase number of instructions considered, number of instructions executed, and allow for out-of-order execution

[Figure: I-cache and Branch Predictor feed FETCH, then Instruction Buffer, DECODE, EXECUTE (Integer, Floating-point, Media, Memory), Reorder Buffer (ROB) and Store Queue, COMMIT, D-cache. The three flows are labeled Instruction Flow, Register Data Flow, and Memory Data Flow.]

What Limits ILP

INSTRUCTION PROCESSING CONSTRAINTS

• Resource Contention (Structural Dependences)
• Code Dependences
  – Control Dependences
  – Data Dependences
    • True Dependences (RAW)
    • Storage Conflicts
      – Anti-Dependences (WAR)
      – Output Dependences (WAW)

Register Renaming & Dynamic Scheduling

• Register Renaming: address limitations of the scoreboard
  – Scoreboard limitation
    • Up to one pending instruction per destination register
  – Eliminate WAR and WAW dependences without stalling
• Dynamic scheduling
  – Track & resolve true-data dependences (RAW)
  – Scheduling hardware:
    • Instruction window, reservation stations, common data bus, …
  – Original proposal: Tomasulo's algorithm [Tomasulo, 1967]

Register Data Flow

INSTRUCTION EXECUTION MODEL

Each ALU instruction ("Register Transfer"):

  Ri ← Fn (Rj, Rk)

  Ri = Dest. Reg., Fn = Funct. Unit, (Rj, Rk) = Source Registers

[Figure: Registers R0, R1, ..., Rm connected through an Interconnect to Functional Units FU1, FU2, ..., FUn, with the steps "Read", "Execute", "Write".]

• Need availability of Fn (Structural Dependences)
• Need availability of Rj, Rk (True Data Dependences)
• Need availability of Ri (Anti- and Output Dependences)

Causes of (Register) Storage Conflict

• REGISTER RECYCLING
  – MAXIMIZE USE OF REGISTERS
  – MULTIPLE ASSIGNMENTS OF VALUES TO REGISTERS
• OUT OF ORDER ISSUING AND COMPLETION
  – LOSE IMPLIED PRECEDENCE OF SEQUENTIAL CODE
  – LOSE 1-1 CORRESPONDENCE BETWEEN VALUES AND REGISTERS

[Figure: DEF/USE chains on register Ri illustrating WAW (a DEF of Ri followed by another DEF of Ri) and WAR (a USE of Ri followed by a DEF of Ri).]

The Reason for WAW and WAR: Register Recycling

COMPILER REGISTER ALLOCATION

• CODE GENERATION: Single Assignment, Symbolic Reg.
• REG. ALLOCATION: Map Symbolic Reg. to Physical Reg.; Maximize Reuse of Reg.
• INSTRUCTION LOOPS

  for (k = 1; k <= 10; k++)
      t += a[i][k] * b[k][j];

  $34:  mul   $14, $7, 40
        addu  $15, $4, $14
        mul   $24, $9, 4
        addu  $25, $15, $24
        lw    $11, 0($25)
        mul   $12, $9, 40
        addu  $13, $5, $12
        mul   $14, $8, 4
        addu  $15, $13, $14
        lw    $24, 0($15)
        mul   $25, $11, $24
        addu  $10, $10, $25
        addu  $9, $9, 1
        ble   $9, 10, $34

• Reuse Same Set of Reg. in each Iteration
• Overlapped Execution of different Iterations

Resolving False Dependences

  (1) R4 ← R3 + 1
  (2) R3 ← R5 + 1
  Must prevent (2) from completing before (1) is dispatched (WAR)

  (1) R3 ← R3 + R5
      ...
      ← R3
  (2) R3 ← R5 + 1
  Must prevent (2) from completing before (1) completes (WAW)

• Stalling: delay dispatching (or write back) of the later instruction
• Copy Operands: copy the not-yet-used operand to prevent it from being overwritten (WAR)
• Register Renaming: use a different register (WAW & WAR)

Register Renaming: The Idea

• Anti and output dependences are false dependences

  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7

• The dependence is on name/location rather than data
• Given unlimited number of registers, anti and output dependences can always be eliminated

  Original          Renamed
  r1 ← r2 / r3      r1 ← r2 / r3
  r4 ← r1 * r5      r4 ← r1 * r5
  r1 ← r3 + r6      r8 ← r3 + r6
  r3 ← r1 - r4      r9 ← r8 - r4

Register Renaming Technique

Register Renaming Resolves:
• Anti-Dependences
• Output Dependences

[Figure: Architected Registers R1, R2, ..., Rn mapped onto Physical Registers P1, P2, ..., Pn, ..., Pn+k.]

Design of Redundant Registers:
• Number:
  – One
  – Multiple
• Allocation:
  – Fixed for Each Register
  – Pooled for all Registers
• Location:
  – Attached to Register File (Centralized)
  – Attached to functional units (Distributed)

Register Renaming Implementation

• Renaming:
  – Map a small set of architectural registers to a large set of physical registers
  – New mapping for an architectural register when it is assigned a new value
• Renaming buffer organization (how are registers stored)
  – Unified RF, split RF, renaming in the ROB
  – RF = register file
• Number of renaming registers
• Number of read/write ports
• Register mapping (how do I find the register I am looking for)
  – Allocation, de-allocation, and tracking

Renaming Buffer Options

• Unified/merged register file
  – MIPS R10K, Alpha 21264
  – Registers change role between architectural and renamed
• Rename register file (RRF)
  – PA 8500, PPC 620
  – Holds new values until they are committed to ARF
  – Extra data transfer…
• Renaming in the ROB
  – Pentium III
• Note: can have a single scheme or separate for integer/FP

Unified Register File: Physical Register FSM

[Figure]

Number of Rename Registers

• Naïve: as many as the number of pending instructions
  – Waiting to be scheduled + executing + waiting to commit
• Simplification
  – Do not need renaming for stores, branches, …
• Usual approach:
  – # scheduler entries ≤ # RRF entries ≤ # ROB entries
• Examples:
  – PPC 620: scheduler 15, RRF 16 (RRF), ROB 16
  – MIPS R12000: scheduler 48, RRF 64 (merged), ROB 48
  – Pentium III: scheduler 20, RRF 40 (in ROB), ROB 40

Register File Ports

• Read: if operands are read as instructions enter the scheduler
  – Max # ports = 2 * # instructions dispatched
• Read: if operands are read as instructions leave the scheduler
  – Max # ports = 2 * # instructions issued
    • Can be wider than the # of instructions dispatched…
• Write: # of FUs or # of instructions committing
  – Depends on unified vs separate rename registers
• Notes:
  – Can implement fewer ports and have structural hazards
    • Need control logic for port assignment & hazard handling
  – When using separate RRF and ARF, need ports for the final transfer
  – Alternatives to increasing ports: duplicated RF or banked RF
    • What are the issues?
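To make the Original/Renamed example from "Register Renaming: The Idea" concrete, here is a minimal software sketch. It assumes an unlimited supply of fresh register names starting at r8, assumes those names do not already appear in the program, and walks the instructions sequentially. The function name `rename` and the tuple format are hypothetical; real hardware uses a map table and free list and handles several instructions per cycle.

```python
def rename(instrs, first_free=8):
    """Rename destinations that would otherwise create a WAR or WAW hazard.

    instrs: list of (dest, op, src1, src2) tuples of register names.
    Assumption: names r<first_free>, r<first_free+1>, ... are unused.
    """
    current = {}      # architectural name -> name holding its latest value
    touched = set()   # names already read or written earlier in the sequence
    next_free = first_free
    renamed = []
    for dest, op, src1, src2 in instrs:
        # Sources read the latest version of each register (keeps RAW intact).
        s1 = current.get(src1, src1)
        s2 = current.get(src2, src2)
        touched.update((s1, s2))
        new_dest = dest
        if dest in touched:
            # Writing this name again would be a WAR or WAW: use a fresh one.
            new_dest = f"r{next_free}"
            next_free += 1
        current[dest] = new_dest
        touched.add(new_dest)
        renamed.append((new_dest, op, s1, s2))
    return renamed


original = [
    ("r1", "/", "r2", "r3"),
    ("r4", "*", "r1", "r5"),
    ("r1", "+", "r3", "r6"),
    ("r3", "-", "r1", "r4"),
]
result = rename(original)
```

On the slide's four instructions this yields the same renamed sequence as the slide: the third instruction writes r8 instead of r1, and the fourth writes r9 and reads r8.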

Register Mapping (From Architectural to Physical Address)

• Option 1: use a map table (ARF # → physical location)
  – Map holds the state of the register too…
  – Simple, but need two steps for reading (ok if operands read late)
• Option 2: associative search in RRF, ROB, …
  – Each physical register remembers its status (ok if operand read early)
  – More complicated but one step read

Integrating Map Tables with the ARF

[Figure]

Renaming Operation: Allocation, Lookup, De-allocation

• At dispatch: for each instruction handled in parallel
  – Check the physical location & availability of source operands
  – Map destination register to new physical register
    • Stall if no register available
  – Note: must have enough ports to any map tables
• At complete: update physical location
• At commit/retire: for each instruction handled in parallel
  – Copy from RRF/ROB to ARF & deallocate RRF entry, OR
  – Upgrade physical location and deallocate register with old value
    • It is now safe to do that
• Question: can we allocate later or deallocate earlier?

Renaming Operation

[Figure]

Renaming Difficulties: Wide Instruction Issue

• Need many ports in RFs and mapping tables
• Instruction dependencies during dispatching/issuing/committing
  – Must handle dependencies across instructions
  – E.g. add R1←R2+R3; sub R6←R1-R5
  – Implementation: use comparators, multiplexors, counters
    • Comparators: discover RAW dependencies
    • Multiplexors: generate right physical address (old or new allocation)
    • Counters: determine number of physical registers allocated

Renaming Difficulties: Mispredictions & Exceptions

• If exception/misprediction occurs, register mapping must be precise
• Separate RRF: consider all RRF entries free
• ROB renaming: consider all ROB entries free
• Unified RF: restore precise mapping
  – Single map: traverse ROB to undo mapping (history file approach)
    • ROB must remember old mapping…
  – Two maps: architectural and future register map
    • On exception, copy architectural map into future map…
  – Checkpointing: keep regular checkpoints of map, restore when needed
    • When do we make a checkpoint? On every instruction? On every branch?
    • What are the trade-offs?
    • We'll revisit this approach later on…

Dynamic Scheduling Based on Reservation Stations

[Figure: in-order front end (Dispatch Buffer, Dispatch, with Reg. File and Ren. Reg. access, allocate Reorder Buffer entries); out-of-order core (Reservation Stations feeding Branch, Integer, Integer, Float.-Point, and Load/Store units, with Reg. Write Back); in-order back end (Compl. Buffer (Reorder Buff.), Complete).]

Embedded "Data Flow" Engine

[Figure: Dispatch Buffer, Dispatch, Reservation Stations, functional units (Branch, ...), Completion Buffer, Complete; the region between dispatch and completion is labeled "Dynamic Execution".]

• Dispatch
  – Read register or
  – Assign register tag
  – Advance instructions to reservation stations
• Reservation Stations
  – Monitor reg. tag
  – Receive data being forwarded
  – Issue when all operands ready
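
The allocation, lookup, and de-allocation steps for a unified register file can be sketched as below. This is a simplified model under stated assumptions: one instruction is renamed at a time (no intra-group comparators), architectural register Ri initially lives in physical register Pi, and recovery walks the uncommitted entries youngest first in the spirit of the history file approach. The class `Renamer` and its method names are hypothetical and are not taken from the lecture.

```python
from collections import deque


class Renamer:
    """Toy unified-register-file renamer: map table plus free list."""

    def __init__(self, num_arch=8, num_phys=12):
        # Assumption: architectural Ri starts out in physical Pi.
        self.map = {f"R{i}": f"P{i}" for i in range(num_arch)}
        self.free = deque(f"P{i}" for i in range(num_arch, num_phys))
        # In-order record of (arch dest, new phys, old phys), like a ROB.
        self.rob = deque()

    def dispatch(self, dest, srcs):
        """Rename one instruction; return None to signal a stall."""
        if not self.free:
            return None                            # no register available
        phys_srcs = [self.map[s] for s in srcs]    # look up sources first
        new = self.free.popleft()                  # allocate destination
        self.rob.append((dest, new, self.map[dest]))
        self.map[dest] = new                       # later readers see new mapping
        return new, phys_srcs

    def commit(self):
        """Retire the oldest instruction; the old value's register is freed."""
        _dest, _new, old = self.rob.popleft()
        self.free.append(old)

    def flush(self):
        """Undo all uncommitted mappings, youngest first."""
        while self.rob:
            dest, new, old = self.rob.pop()
            self.map[dest] = old
            self.free.appendleft(new)


r = Renamer()
add = r.dispatch("R1", ["R2", "R3"])   # add R1 <- R2 + R3
sub = r.dispatch("R6", ["R1", "R5"])   # sub R6 <- R1 - R5 (reads the new R1)
r.commit()                             # add retires, P1 returns to the free list
r.flush()                              # sub is squashed, R6 maps to P6 again
```

Because the sub is renamed after the add has updated the map, its R1 source picks up the add's new physical register. A wide-issue design has to produce the same answer for instructions renamed in the same cycle, which is what the comparators and multiplexors on the "Wide Instruction Issue" slide are for.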