Reducing Datapath Energy Through the Isolation of Short-Lived Operands

Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower

Outline
– Introduction
– Motivations
– Contributions
    Basic idea: isolate short-lived operands in a small dedicated register file and avoid their writes to the ROB and the ARF
    Resources impacted: ROB, ARF
    Power savings: 21% with 32-entry additional RF
– Results
– Conclusions
– Future work

A P6-like Superscalar Datapath
[Figure: block diagram with Fetch (F1, F2), Decode/Dispatch (D1, D2), IQ, LSQ, instruction dispatch and instruction issue, function units FU1, FU2, ..., FUm (EX), D-cache, ROB, Architectural Register File (ARF), and result/status forwarding buses]

Out-of-Order Execution and In-Order Retirement
[Figure: pipeline with stages F, R, D (in-order front end), Inst. Queue, Ex and ROB (out-of-order core), and ARF (in-order retirement)]

Energy-dissipating Events
[Figure: the same pipeline annotated with Write and Read events at the ROB and a Write event at the ARF]

The Idea: Isolating Short-Lived Values
Write short-lived values into a small dedicated RF (SRF)
[Figure: the same pipeline with an SRF added; writes of short-lived values go to the SRF]

Register Renaming
– Used to avoid false data dependencies.
– A new physical register is allocated for EVERY new result
– P6 style: ROB slots serve as physical registers

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, P2, 100
    SUB  R5, R1, R3        SUB  P32, P31, P3
    ADD  R1, R5, R4        ADD  P33, P32, P4

Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           1            1
    2           2            1
    3           3            1
    4           4            1
    5           5            1

    Original code
    LOAD R1, R2, 100
    SUB  R5, R1, R3
    ADD  R1, R5, R4

Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           31           0
    2           2            1
    3           3            1
    4           4            1
    5           5            1

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3
    ADD  R1, R5, R4

Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           31           0
    2           2            1
    3           3            1
    4           4            1
    5           32           0

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3        SUB  P32, P31, R3
    ADD  R1, R5, R4

Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           33           0
    2           2            1
    3           3            1
    4           4            1
    5           32           0

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3        SUB  P32, P31, R3
    ADD  R1, R5, R4        ADD  P33, P32, R4

Short-Lived Values
– Our definition: a value is short-lived if the destination register is renamed by the time of the result generation.
– Identified one cycle before the result writeback

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3        SUB  P32, P31, R3
    ADD  R1, R5, R4        ADD  P33, P32, R4      <- RENAMER

The Good News: 80%+ of the Values are Short-Lived
96-entry ROB, 4-way processor
[Figure: bar chart of the percentage of short-lived values (y-axis 0 to 100) for bzip2, gap, vortex, twolf, perlbmk, parser, mcf, gzip, gcc, vpr, wupwise, swim, mgrid, mesa, equake, art, apsi and applu, with Int. average, FP average and overall average bars]
As rename-to-writeback latency increases in future datapaths, the percentage of short-lived values will also go up

The Idea: Isolating Short-Lived Values
Write short-lived values into a small dedicated RF (SRF)

    LOAD R1, R2, 100
    SUB  R5, R1, R3
    ADD  R1, R5, R4

Why do we need the SRF?
Need to hang on to the short-lived values to:
    Recover from branch mispredictions
    Reconstruct precise state

    LOAD R1, R2, 100
    BEQ  R5, R1, #100
    ADD  R1, R5, R4

Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           31           0
    2           2            1
    3           3            1
    4           4            1
    5           32           0

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3        SUB  P32, P31, R3
    ADD  R1, R5, R4        ADD  P33, P32, R4

    Renamed[31] = 1

Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           33           0
    2           2            1
    3           3            1
    4           4            1
    5           32           0

    Renamed[31] = 1

Identifying Short-Lived Values
– Renamed bit is checked one cycle before writeback
– Value produced by LOAD is short-lived because Renamed[31]=1

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3        SUB  P32, P31, R3
    ADD  R1, R5, R4        ADD  P33, P32, R4

Managing the SRF: the Issues
– When do we write short-lived values into the SRF?
– When and how are the short-lived values removed from the SRF?
– What happens on a branch misprediction?
– How do we reconstruct a precise state?

Format of an SRF entry
Fields, in order:
    Valid
    ROB idx
    Dest. Arch. Reg.
    Data
    Branch Tag 1 (Branch Identifier for Renamer): used to remove this entry if renamer gets squashed
    Branch Tag 2 (Branch Identifier for this instruction): used to remove this entry if this instruction gets squashed
Branch Identifier of an instruction = id/tag of immediately preceding conditional branch

Writing to the SRF: the Conditions
– An instruction writes a short-lived result value into the SRF if:
    A free entry exists in the SRF
    No SRF entry keyed with the same ROB slot is already established
– Bit-vector Allocated_in_SRF is maintained
    One bit for each ROB entry
    Set at the time of writeback if value is written into the SRF
    Reset at the time of removing the value from the SRF

Scenarios for Removing the Values from the SRF
    Scenario 1: Normal Commitment of Renamer
    Scenario 2: Renamer gets squashed
    Scenario 3: The instruction generating the short-lived value itself gets squashed

Removing the Values from the SRF: Scenario 1
– Values are removed by the Renamer
– 2-step process:
    Mark the instruction whose value is to be removed from the SRF (done at the time of renaming)
    Remove the marked value from the SRF IF NEED BE (done at the time of commitment)

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3        SUB  P32, P31, R3
    ADD  R1, R5, R4        ADD  P33, P32, R4      <- Renamer

– When ADD commits, it removes the value written by LOAD

Marking the Values for Removal

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           31           0
    2           2            1
    3           3            1
    4           4            1
    5           32           0

    Original code          Renamed code           ROB slot
    LOAD R1, R2, 100       LOAD P31, R2, 100      31
    SUB  R5, R1, R3        SUB  P32, P31, R3      32
    ADD  R1, R5, R4        ADD  P33, P32, R4      33

Marking the Values for Removal

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           31           0
    2           2            1
    3           3            1
    4           4            1
    5           32           0

    Original code          Renamed code           ROB slot   FS (Flush SRF) field of the ROB
    LOAD R1, R2, 100       LOAD P31, R2, 100      31
    SUB  R5, R1, R3        SUB  P32, P31, R3      32
    ADD  R1, R5, R4        ADD  P33, P32, R4      33         31

Removing the Values (B is the renamer for A)
– FS field of B must match the ROB index field of a SRF entry
– This SRF entry must belong to A
[Figure: ROB slots 31 (A), 32 and 33 (B, FS = 31) next to an SRF entry with Valid = 1, ROB idx = 31, Dest = 1, Data = load]

Another Example (LOAD could not write to SRF)

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           33           0
    2           2            1
    3           3            1
    4           4            1
    5           32           0

    Renamed[31] = 1

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100
    SUB  R5, R1, R3        SUB  P32, P31, R3
    ADD  R1, R5, R4        ADD  P33, P32, R4

SRF was full!

Another Example

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           33           0
    2           2            1
    3           3            1
    4           4            1
    5           5            1

    Renamed[31] = 0

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100      Committed
    SUB  R5, R1, R3        SUB  P32, P31, R3      Committed
    ADD  R1, R5, R4        ADD  P33, P32, R4
    …
    MUL  R2, R3, R4
    DIV  R2, R2, R5

Another Example

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           33           0
    2           31           0
    3           3            1
    4           4            1
    5           5            1

    Renamed[31] = 0

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100      Committed
    SUB  R5, R1, R3        SUB  P32, P31, R3      Committed
    ADD  R1, R5, R4        ADD  P33, P32, R4
    …                      …
    MUL  R2, R3, R4        MUL  P31, R3, R4
    DIV  R2, R2, R5

Another Example

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           33           0
    2           32           0
    3           3            1
    4           4            1
    5           5            1

    Renamed[31] = 1

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, R2, 100      Committed
    SUB  R5, R1, R3        SUB  P32, P31, R3      Committed
    ADD  R1, R5, R4        ADD  P33, P32, R4
    …                      …
    MUL  R2, R3, R4        MUL  P31, R3, R4
    DIV  R2, R2, R5        DIV  P32, P31, R5

Another Example (A's ROB slot is assigned for C)
[Figure: ROB slots 31 (A), 32 and 33 (B, FS = 31); the SRF entry with ROB idx = 31 has Valid = 0 because A (the LOAD) could not write to the SRF]

Another Example (A's ROB slot is assigned for C)
[Figure: ROB slots 31 (C), 32 (D) and 33 (B) after slot 31 is reused by the MUL; the SRF now holds an entry with Valid = 1, ROB idx = 31, Dest = 2, Data = mul]

Ensuring that the right values are removed
– Bit-vector Uncommitted_Write is maintained
    One bit for each ROB entry
    Set at the time of establishing SRF entry
    Reset at the time of commitment
– Instruction B removes the value written by A (allocated to ROB slot i) if:
    Allocated_in_SRF[i]=1, and (this needs to be better explained)
    Uncommitted_Write[i]=0;

Avoiding Unnecessary Commitments
– When an instruction allocated to ROB slot i commits and Allocated_in_SRF[i]=1, the data is not copied to the ARF.
[Figure: the pipeline with the SRF; the ROB read and ARF write at commitment are skipped for values held in the SRF]

Handling Branch Mispredictions: Scenario 2
– Problem: Renamer can get squashed -> stale entries remain in the SRF if nothing is done
– Example:
[Figure: ROB slots 31 to 34 and an SRF entry (Valid = 1, ROB idx = 31, Dest = 1, Data = load), shown before and after the renamer with FS = 31 is squashed]

Handling Branch Mispredictions
– Solution:
    Tag each entry in the SRF with the id of the branch preceding the renamer (BT1).
    When the renamer is squashed, the value is removed from the SRF and is written to either the ROB or the ARF (based on the value of the Uncommitted_Write bit)
    Multiplex the ports to reduce complexity

Obtaining Branch Tag BT1
– Maintain the array Branch_Tags
– One entry for each ROB slot

    Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
    0           0            1
    1           31           0
    2           2            1
    3           3            1
    4           4            1
    5           33           0

    Original code          Renamed code
    LOAD R1, R2, 100       LOAD P31, P2, 100
    BEQ  R6, R7, 200       BEQ  P6, P7, 200
    SUB  R5, R1, R3        SUB  P33, P31, P3
    ADD  R1, R5, R4        ADD  P34, P33, P4

    Branch_Tags[31] = 7

Handling Branch Mispredictions: Scenario 3
– Problem: The instruction whose value was inserted into the SRF can itself be squashed
– Example:
[Figure: ROB slots 30 to 33 and an SRF entry (Valid = 1, ROB idx = 31, Dest = 1, Data = load), shown before and after the squash]

Handling Branch Mispredictions
– Solution:
    Tag each entry in the SRF with the id of the branch preceding the instruction itself (BT2).
    Simply remove the value from the SRF if such a branch is mispredicted

Supporting Precise Interrupts
– Allow all instructions preceding the faulting instruction to commit
– Squash all instructions following the faulting instruction
– Copy the values of ALL valid SRF entries to the ARF.
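
The two squash scenarios and the precise-interrupt rule above can be sketched in a few lines of Python. This is only an illustrative software model, not the authors' hardware: the names `SRFEntry`, `handle_misprediction` and `precise_interrupt` are hypothetical, the squash is assumed to be communicated as a set of branch tags (the mispredicted branch plus every younger in-flight branch), and the choice between ROB and ARF in Scenario 2 follows an assumed reading of the Uncommitted_Write bit (set means the producer has not yet committed, so the value returns to its ROB slot).

```python
from dataclasses import dataclass


@dataclass
class SRFEntry:
    valid: bool
    rob_idx: int   # ROB slot of the producing instruction
    dest: int      # destination architectural register
    data: int
    bt1: int       # Branch Tag 1: branch preceding the renamer
    bt2: int       # Branch Tag 2: branch preceding the producer itself


def handle_misprediction(srf, squashed_tags, uncommitted_write,
                         allocated_in_srf, rob, arf):
    """Remove the SRF entries affected by a squash (Scenarios 2 and 3)."""
    for e in srf:
        if not e.valid:
            continue
        if e.bt2 in squashed_tags:
            # Scenario 3: the producer itself is squashed; just drop the value.
            pass
        elif e.bt1 in squashed_tags:
            # Scenario 2: only the renamer is squashed, so the value is live again.
            # Assumed reading: ROB if the producer is still uncommitted, else ARF.
            if e.rob_idx in uncommitted_write:
                rob[e.rob_idx] = e.data
            else:
                arf[e.dest] = e.data
        else:
            continue  # entry not affected by this squash
        e.valid = False
        allocated_in_srf.discard(e.rob_idx)  # reset on removal from the SRF


def precise_interrupt(srf, allocated_in_srf, arf):
    """Copy every valid SRF entry to the ARF, then invalidate it."""
    for e in srf:
        if e.valid:
            arf[e.dest] = e.data
            e.valid = False
            allocated_in_srf.discard(e.rob_idx)


# Toy state: three SRF entries, tag values chosen arbitrarily (they fit in 4 bits).
srf = [
    SRFEntry(True, 31, 1, 100, bt1=7, bt2=3),  # renamer follows branch 7
    SRFEntry(True, 40, 2, 55, bt1=9, bt2=7),   # producer itself follows branch 7
    SRFEntry(True, 12, 4, 8, bt1=2, bt2=1),    # unrelated to branch 7
]
rob, arf = {}, {}
uncommitted_write = {31, 40}
allocated_in_srf = {31, 40, 12}

# Branch 7 mispredicts; branch 9 is younger and is squashed with it.
handle_misprediction(srf, {7, 9}, uncommitted_write, allocated_in_srf, rob, arf)
# A later fault: the one remaining valid entry is copied to the ARF.
precise_interrupt(srf, allocated_in_srf, arf)
```

In this toy run the first entry moves back to ROB slot 31, the second is dropped, and the third survives the squash and is copied to the ARF by the interrupt handler. RAT recovery and clearing of the Renamed bit are left out of the sketch.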

Experimental Setup
[Figure: tool flow. Compiled SPEC benchmarks and datapath specs feed the microarchitectural simulator, which produces performance stats and passes transition counts and context information through inter-thread buffers (two separate threads) to the data analyzer / intra-stream analysis. VLSI layout data and SPICE decks feed SPICE, which produces SPICE measures of energy per transition. The energy/power estimator combines both to produce power/energy stats.]

Results: Percentage of Values Written into the SRF
[Figure: bar chart per benchmark (bzip2, gap, gcc, gzip, mcf, pars, perl, twolf, vort, vpr, applu, apsi, art, eq, mesa, mgrid, swim, wupw); y-axis 0 to 100%]

    SRF size      Average
    8 entries     40.5%
    16 entries    60.1%
    32 entries    77.5%
    48 entries    82.3%

    % of short-lived results: 86.7%

Results: Average Time Spent by a Value in the SRF
[Figure: bar chart in cycles per benchmark for 8, 16, 32 and 48 SRF entries]
Average: 12-15 cycles

Results: Percentage of Values not copied into the ARF
[Figure: bar chart per benchmark; y-axis 0 to 100%]

    SRF size      Average
    8 entries     42.2%
    16 entries    61.9%
    32 entries    79.3%
    48 entries    84.1%

    % of short-lived results: 86.7%

Results: Net Energy Reduction
[Figure: stacked bar chart in pJ (0 to 800) for the baseline and for 8, 16, 32 and 48 SRF entries; components: ROB + additional logic, ARF, SRF]

    SRF size      Net energy reduction
    8 entries     9%
    16 entries    16%
    32 entries    21%
    48 entries    23%

Related Work
– Register Traffic Analysis (Franklin and Sohi, MICRO'92).
    Studied the useful lifetime of register instances
    Delaying the writes until 30 more instructions are dispatched can eliminate 80% of the writes (if perfect knowledge of the last use is available)
    Buffering the 30 most recently generated results avoids 80% of writebacks
– Lozano and Gao (MICRO'95)
    90% of all result values are short-lived (consumed while in the ROB)
    Mechanism to avoid commitment of these values and also avoid register allocation for them is proposed
    ROB slots are exposed to the compiler in the form of symbolic registers
– Lazy Retirement (Savransky, Ronen, Gonzalez, WCED'02)
    Hardware-based scheme to avoid unnecessary commitments
    Copying from the ROB to the ARF is delayed until the ROB slot is reused. In many cases, the register is invalidated by the newer instruction
    Additional rename table is needed. About 75% of commits are avoided.

Conclusions
– Significant power savings & negligible impact on performance
– Sources of power savings:
    Majority of generated results are written into a small, lightly-ported SRF
    Unnecessary commitments are avoided
    Additional logic/storage needed to do this is simple
– For a 32-entry SRF, more than 77% of writebacks and more than 79% of commitments can be avoided
– This results in energy savings of 21% on the ROB and the ARF

THANK YOU!
LOW POWER RESEARCH GROUP
Parallel Architectures and Compilation Techniques (PACT'03), October 1st 2003
This work was supported in part by DARPA through the PAC-C program and NSF

Complexity of the Solution
– SRF
– Three bit vectors (same size as the ROB)
    Renamed
    Allocated_in_SRF
    Uncommitted_Write
– 4-bit array Branch_Tags (same size as the ROB)
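
To make the role of the three bit vectors more concrete, the sketch below replays the LOAD/SUB/ADD example from the talk in Python. It is a simplified software model under stated assumptions, not the authors' design: the class `Datapath` and its methods are hypothetical names, bit vectors are modelled as Python sets of ROB slot numbers, the SRF capacity of 2 is an arbitrary toy value, the Renamed bit is assumed to be cleared when its slot commits, and branch tags are left out.

```python
class Datapath:
    def __init__(self, srf_capacity=2):   # toy capacity, illustration only
        self.srf_capacity = srf_capacity
        self.rat = {}    # arch reg -> ROB slot of the newest in-flight producer
        self.dest = {}   # ROB slot -> destination arch reg
        self.fs = {}     # ROB slot -> FS (Flush SRF) field
        self.renamed = set()            # slots with Renamed = 1
        self.allocated_in_srf = set()   # slots with Allocated_in_SRF = 1
        self.uncommitted_write = set()  # slots with Uncommitted_Write = 1
        self.srf = {}    # ROB idx -> (dest arch reg, data)
        self.rob = {}    # ROB slot -> data for results kept in the ROB
        self.arf = {}    # arch reg -> committed data

    def rename(self, slot, dest_reg):
        prev = self.rat.get(dest_reg)
        if prev is not None:
            self.renamed.add(prev)  # the previous producer's value is now renamed
            self.fs[slot] = prev    # mark it for removal when this renamer commits
        self.rat[dest_reg] = slot
        self.dest[slot] = dest_reg

    def writeback(self, slot, value):
        short_lived = slot in self.renamed          # checked just before writeback
        if (short_lived and len(self.srf) < self.srf_capacity
                and slot not in self.srf):          # free entry, no same-slot entry
            self.srf[slot] = (self.dest[slot], value)
            self.allocated_in_srf.add(slot)
            self.uncommitted_write.add(slot)
            return "SRF"
        self.rob[slot] = value
        return "ROB"

    def commit(self, slot):
        # Step 1: as a renamer, remove the marked value if it is safe to do so.
        target = self.fs.pop(slot, None)
        if (target in self.allocated_in_srf
                and target not in self.uncommitted_write):
            del self.srf[target]
            self.allocated_in_srf.discard(target)
        # Step 2: commit this instruction's own result.
        if slot in self.allocated_in_srf:
            self.uncommitted_write.discard(slot)    # no copy to the ARF
        else:
            self.arf[self.dest[slot]] = self.rob.pop(slot)
        if self.rat.get(self.dest[slot]) == slot:
            del self.rat[self.dest[slot]]           # mapping falls back to the ARF
        self.renamed.discard(slot)                  # assumed reset point


dp = Datapath()
dp.rename(31, 1)   # LOAD R1, R2, 100 -> ROB slot 31
dp.rename(32, 5)   # SUB  R5, R1, R3  -> ROB slot 32
dp.rename(33, 1)   # ADD  R1, R5, R4  -> ROB slot 33, renames R1
where = [dp.writeback(31, 100), dp.writeback(32, 7), dp.writeback(33, 9)]
for slot in (31, 32, 33):
    dp.commit(slot)
print(where, dp.arf, dp.srf)
```

In this run only the LOAD result goes to the SRF; it is never copied to the ARF, and the ADD removes it when it commits. The data values 100, 7 and 9 are placeholders rather than computed results.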