EE382A Lecture 6: Register Renaming

Department of Electrical Engineering
Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009
Lecture 6- 1
John P Shen
Announcements
• Project proposal due on Wed 10/14
  – 2-3 pages submitted through email
  – List the group members
  – Describe the topic including why it is important and your thesis
  – Describe the methodology you will use (experiments, tools, machines)
  – Statement of expected results
  – Few key references to related work
• Still missing some photos
Lecture 6 Outline
1. Branch Prediction (epilog)
a. 2-level Predictors
b. AMD Opteron Example
c. Confidence Prediction
d. Trace Cache
2. Register Data Flow
a. False Register Dependences
b. Register Renaming Technique
c. Register Renaming Implementation
Dynamic Branch Prediction Using History
[Figure: pipeline stages Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Execute, Finish, Completion Buffer. A branch predictor (using a BTB) computes nPC = BP(PC), supplying the speculative target and speculative condition; the sequential path is nPC(seq.) = PC+4. An FA-mux selects the predicted nPC sent to the I-cache. The branch unit sends BTB updates (target address and history).]
2-Level Adaptive Prediction [Yeh & Patt]
Nomenclature: {G,P}A{g,p,s}
[Figure: the Branch History Shift Register (BHSR), shifted left on each update with the branch result, indexes the Pattern History Table (PHT), whose entries run from 00...00 to 11...11; the PC also feeds the lookup. FSM logic takes the old PHT bits, produces the prediction, and writes back the new PHT bits.]
To achieve 97% average prediction accuracy:
G (1) BHR: 18 bits; g (1) PHT: 2^18 x 2 bits; total = 524 kbits
P (512x4) BHR: 12 bits; g (1) PHT: 2^12 x 2 bits; total = 33 kbits
P (512x4) BHR: 6 bits; s (512) PHT: 2^6 x 2 bits; total = 78 kbits
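The three totals can be checked with a few lines of arithmetic. This is only a sketch: the function name `two_level_bits` is made up for illustration, and it assumes that "512x4" means 512 x 4 = 2048 per-branch history registers and that every PHT entry is a 2-bit counter.

```python
def two_level_bits(num_bhrs, bhr_bits, num_phts, counter_bits=2):
    """Storage in bits: all history registers plus all pattern history tables."""
    history = num_bhrs * bhr_bits
    # Each PHT holds one counter per possible history pattern.
    pattern = num_phts * (2 ** bhr_bits) * counter_bits
    return history + pattern

# Assumption: "512x4" is read as 512 * 4 history registers.
configs = {
    "GAg": two_level_bits(num_bhrs=1, bhr_bits=18, num_phts=1),
    "PAg": two_level_bits(num_bhrs=512 * 4, bhr_bits=12, num_phts=1),
    "PAs": two_level_bits(num_bhrs=512 * 4, bhr_bits=6, num_phts=512),
}

for name, bits in configs.items():
    print(f"{name}: {bits} bits (~{round(bits / 1000)} kbits)")
```

Under these assumptions the results round to the 524, 33, and 78 kbits quoted on the slide.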
Example: Global BHSR Scheme (GAs)
[Figure: j bits of the branch address and the k bits of the global Branch History Shift Register (BHSR) together index a BHT of 2 x 2^(j+k), which produces the prediction.]
Example: Per-Branch BHSR Scheme (PAs)
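[Figure: i bits of the branch address select one of the per-branch Branch History Shift Registers (BHSR, k x 2^i); the selected k history bits and j bits of the branch address together index a BHT of 2 x 2^(j+k), which produces the prediction. The figure also carries the label "Standard BHT".]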
Gshare Branch Prediction [McFarling]
[Figure: j bits of the branch address are XORed with the k bits of the Branch History Shift Register (BHSR); the result indexes a BHT of 2 x 2^max(j,k), which produces the prediction.]
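A minimal gshare sketch, to make the XOR indexing concrete. The class name `Gshare`, the table size, the initial counter value, and the use of `pc >> 2` to drop the byte offset are all assumptions chosen for illustration, not details from the slide.

```python
class Gshare:
    """Global history XORed with branch address bits indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global BHSR
        self.counters = [1] * (1 << index_bits)   # start weakly not-taken

    def _index(self, pc):
        # Assumption: instructions are 4 bytes, so drop the low 2 bits.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history register (shift left on update).
        self.history = ((self.history << 1) | int(taken)) & self.mask


# One branch that alternates taken / not taken: a single 2-bit counter
# cannot learn this, but the history-indexed counters can.
predictor = Gshare()
late_misses = 0
for i in range(200):
    taken = (i % 2 == 0)
    if i >= 100 and predictor.predict(0x400) != taken:
        late_misses += 1
    predictor.update(0x400, taken)
print("mispredictions in the last 100 branches:", late_misses)
```

Once the history register has filled, the two alternating histories map to two different counters, so the pattern is predicted without further misses.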
Fetch & Predict Example:
AMD Opteron
Why is Prediction Important in Opteron?
Fetch & Predict Example:
AMD Opteron
Other Branch Prediction Related Issues
• Multi-cycle BTB
  – Keep fetching sequentially, repair later (bubbles for taken branches)
  – Need pipelined access though
• BTB & predictor in series
  – Get fast target/direction prediction from BTB only
  – After decoding, use predictor to verify BTB
    • Causes a pipeline mini-flush if BTB was wrong
  – This approach allows for a much larger/slower predictor
• BTB and predictor integration
  – Can merge BTB with the local part of a predictor
  – Can merge both with I-cache entries
• Predictor/BTB/RAS updates
  – Can you see any issue?
Prediction Confidence
A Very Useful Tool for Speculation
• Estimate if your prediction is likely to be correct
• Applications
  – Avoid fetching down unlikely path
    • Save time & power by waiting
  – Start executing down both paths (selective eager execution)
  – Switch to another thread (for multithreaded processors)
• Implementation
  – Naïve: don’t use NT or TN states in 2-bit counters
  – Better: array of CIR (correct/incorrect registers)
    • Shift in if last prediction was correct/incorrect
    • Count the number of 0s to determine confidence
  – Many other implementations are possible
    • Using counters etc
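The CIR scheme can be sketched as follows. The encoding (1 = correct, 0 = incorrect), the table size, the register width, and the threshold `max_misses` are assumptions for illustration; the slide only says to shift in the outcome and count the 0s.

```python
class ConfidenceEstimator:
    """Table of correct/incorrect registers (CIRs), indexed by branch address."""

    def __init__(self, entries=256, cir_bits=8, max_misses=1):
        self.entries = entries
        self.cir_bits = cir_bits
        self.max_misses = max_misses
        self.mask = (1 << cir_bits) - 1
        # All zeros = "all recent predictions wrong", so unseen branches
        # start out as low confidence.
        self.cirs = [0] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def record(self, pc, correct):
        """Shift in 1 if the last prediction was correct, 0 if it was not."""
        i = self._index(pc)
        self.cirs[i] = ((self.cirs[i] << 1) | int(correct)) & self.mask

    def is_confident(self, pc):
        """High confidence when few 0s (mispredictions) are in the CIR."""
        cir = self.cirs[self._index(pc)]
        zeros = self.cir_bits - bin(cir).count("1")
        return zeros <= self.max_misses


est = ConfidenceEstimator()
print(est.is_confident(0x400))      # unseen branch: low confidence
for _ in range(8):
    est.record(0x400, correct=True)
print(est.is_confident(0x400))      # eight correct predictions in a row
```

A front end could use `is_confident` to decide whether to keep fetching down the predicted path, fork both paths, or switch threads.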
Branch Confidence Prediction
Dynamic History Length
• Four types of history
  – Local (bimodal) history (Smith predictor)
    • Table of counters summarizes local history
    • Simple, but only effective for biased branches
  – Local outcome history
    • Shift register of individual branch outcomes
    • Separate counter for each outcome history
  – Global outcome history
    • Shift register of recent branch outcomes
    • Separate counter for each outcome history
  – Path history
    • Shift register of recent (partial) block addresses
    • Can differentiate similar global outcome histories
• Can combine or “alloy” histories in many ways
Understanding Advanced Predictors
• History length
  – Short history: lower training cost
  – Long history: captures macro-level behavior
  – Variable history length predictors
• Really long history (long loops)
  – Loop count predictors
  – Fourier transform into frequency domain
• Limited capacity & interference
  – Constructive vs. destructive
  – Bi-mode, gskewed, agree, YAGS
  – Read sec. 9.3.2 carefully
High-Bandwidth Fetch: Trace Cache
[Figure: (a) Instruction cache: the instructions of one dynamic sequence are spread over separate lines (A B; C; D; E F G; H I J). (b) Trace cache: the same sequence A B C D E F G H I J is stored contiguously.]
• Fold out taken branches by tracing instructions as they commit into a fill buffer
• Eric Rotenberg, S. Bennett, and James E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. MICRO, December 1996.
Intel Pentium 4 Trace Cache
[Figure: Front-End BTB, Instruction TLB and Prefetcher, Level-Two Unified Data and Instruction Cache, Instruction Decode, Trace Cache BTB, Trace Cache, Instruction Fetch Queue, then to renamer, execute, etc.]
• No first-level instruction cache: trace cache only
• Trace cache BTB identifies next trace
• Miss leads to fetch from level two cache
• Trace cache instructions are decoded (uops)
Modern Superscalar, Out-of-order Processor
• Pipelining reduces cycle time
• Superscalar increases IPC (instructions per cycle)
• Both schemes need to find lots of ILP in the program
  – Must simultaneously increase number of instructions considered, number of instructions executed, and allow for out-of-order execution
[Figure: Instruction Flow: I-cache, Branch Predictor, FETCH, Instruction Buffer, DECODE. Register Data Flow: Integer, Floating-point, Media, and Memory units, EXECUTE. Memory Data Flow: Reorder Buffer (ROB), Store Queue, COMMIT, D-cache.]
What Limits ILP
INSTRUCTION PROCESSING CONSTRAINTS
  – Resource Contention (Structural Dependences)
  – Code Dependences
    • Control Dependences
    • Data Dependences
      – (RAW) True Dependences
      – Storage Conflicts
        • (WAR) Anti-Dependences
        • Output Dependences (WAW)
Register Renaming &
Dynamic Scheduling
• Register Renaming: address limitations of the scoreboard
  – Scoreboard limitation
    • Up to one pending instruction per destination register
  – Eliminate WAR and WAW dependences without stalling
• Dynamic scheduling
  – Track & resolve true-data dependences (RAW)
  – Scheduling hardware:
    • Instruction window, reservation stations, common data bus, …
  – Original proposal: Tomasulo’s algorithm [Tomasulo, 1967]
Register Data Flow
INSTRUCTION EXECUTION MODEL
Each ALU Instruction: Ri ← Fn (Rj, Rk)
  Ri: Dest. Reg.; Fn: Funct. Unit; Rj, Rk: Source Registers (“Register Transfer”)
[Figure: Registers R0, R1, ..., Rm are connected through an Interconnect to Functional Units FU1, FU2, ..., FUn. The three steps are “Read”, “Execute”, and “Write”.]
Need Availability of Fn (Structural Dependences)
Need Availability of Rj, Rk (True Data Dependences)
Need Availability of Ri (Anti- and Output Dependences)
Causes of (Register) Storage Conflict
REGISTER RECYCLING
MAXIMIZE USE OF REGISTERS
MULTIPLE ASSIGNMENTS OF VALUES TO REGISTERS
OUT OF ORDER ISSUING AND COMPLETION
LOSE IMPLIED PRECEDENCE OF SEQUENTIAL CODE
LOSE 1-1 CORRESPONDENCE BETWEEN VALUES AND REGISTERS
[Figure: a code sequence DEF Ri ... USE Ri ... USE Ri ... DEF Ri. The two DEFs form a WAW dependence; the last USE and the following DEF form a WAR dependence.]
The Reason for WAW and WAR:
Register Recycling
COMPILER REGISTER ALLOCATION
CODE GENERATION
  Single Assignment, Symbolic Reg.
REG. ALLOCATION
  Map Symbolic Reg. to Physical Reg.
  Maximize Reuse of Reg.
INSTRUCTION LOOPS
  For (k=1;k<= 10; k++)
    t += a [i] [k] * b [k] [j] ;

  $34:  mul   $14, $7, 40
        addu  $15, $4, $14
        mul   $24, $9, 4
        addu  $25, $15, $24
        lw    $11, 0($25)
        mul   $12, $9, 40
        addu  $13, $5, $12
        mul   $14, $8, 4
        addu  $15, $13, $14
        lw    $24, 0($15)
        mul   $25, $11, $24
        addu  $10, $10, $25
        addu  $9, $9, 1
        ble   $9, 10, $34

  Reuse Same Set of Reg. in each Iteration
  Overlapped Execution of different Iterations
Resolving False Dependences
Anti-dependence (WAR):
  (1) R4 ← R3 + 1
  (2) R3 ← R5 + 1
  Must prevent (2) from completing before (1) is dispatched
Output dependence (WAW):
  (1) R3 ← R3 + R5
      ...
       ← R3
  (2) R3 ← R5 + 1
  Must prevent (2) from completing before (1) completes
• Stalling: delay dispatching (or write back) of the later instruction
• Copy Operands: copy not-yet-used operand to prevent being overwritten (WAR)
• Register Renaming: use a different register (WAW & WAR)
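To make the three dependence types concrete, here is a small pairwise classifier. The function name `classify` and the `(dest, [sources])` tuple format are made up for this sketch, and it only compares two instructions directly, ignoring anything in between.

```python
def classify(earlier, later):
    """Return the register dependences of `later` on `earlier`.

    Each instruction is a (dest, [sources]) pair in program order.
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = set()
    if e_dest in l_srcs:
        kinds.add("RAW")   # later reads what earlier writes (true dependence)
    if l_dest in e_srcs:
        kinds.add("WAR")   # later overwrites what earlier reads (anti)
    if l_dest == e_dest:
        kinds.add("WAW")   # both write the same register (output)
    return kinds


# First example: (1) R4 <- R3 + 1, (2) R3 <- R5 + 1
print(classify(("R4", ["R3"]), ("R3", ["R5"])))
# Second example: (1) R3 <- R3 + R5, (2) R3 <- R5 + 1
print(classify(("R3", ["R3", "R5"]), ("R3", ["R5"])))
```

The second pair reports both WAW and WAR, because instruction (1) also reads R3 before (2) overwrites it. Neither pair has a RAW dependence, which is why renaming the destination of (2) removes every constraint between them.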
Register Renaming: The Idea
• Anti and output dependences are false dependences
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7
• The dependence is on name/location rather than data
• Given unlimited number of registers, anti and output dependences can always be eliminated

  Original            Renamed
  r1 ← r2 / r3        r1 ← r2 / r3
  r4 ← r1 * r5        r4 ← r1 * r5
  r1 ← r3 + r6        r8 ← r3 + r6
  r3 ← r1 - r4        r9 ← r8 - r4
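The renaming idea can be sketched in a few lines. This version gives every destination a fresh physical register, so its output uses names p8 to p11 rather than the r8/r9 of the slide, which only renames the later definitions. The function name `rename`, the tuple format, and the initial identity mapping rN → pN are assumptions for illustration.

```python
from itertools import count


def rename(program, num_arch_regs=8):
    """Rename a straight-line program given as (dest, op, src1, src2) tuples."""
    # Assumption: architectural register rN initially lives in physical pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    fresh = count(num_arch_regs)          # unlimited supply of new registers
    renamed = []
    for dest, op, src1, src2 in program:
        # Sources are looked up first, so they see the older mapping.
        s1, s2 = mapping[src1], mapping[src2]
        # Every new value gets a new name.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], op, s1, s2))
    return renamed


program = [
    ("r1", "/", "r2", "r3"),
    ("r4", "*", "r1", "r5"),
    ("r1", "+", "r3", "r6"),
    ("r3", "-", "r1", "r4"),
]
for dest, op, s1, s2 in rename(program):
    print(f"{dest} <- {s1} {op} {s2}")
```

In the renamed program no destination is written twice and no source is overwritten later, so only the RAW chains remain.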
Register Renaming Technique
Register Renaming Resolves:
  Anti-Dependences
  Output Dependences
[Figure: Architected Registers R1, R2, ..., Rn map onto Physical Registers P1, P2, ..., Pn, ..., Pn+k.]
Design of Redundant Registers:
  Number:
    One
    Multiple
  Allocation:
    Fixed for Each Register
    Pooled for all Registers
  Location:
    Attached to Register File (Centralized)
    Attached to functional units (Distributed)
Register Renaming Implementation
• Renaming:
  – Map a small set of architecture registers to a large set of physical registers
  – New mapping for an architectural register when it is assigned a new value
• Renaming buffer organization (how are registers stored)
  – Unified RF, split RF, renaming in the ROB
  – RF = register file
• Number of renaming registers
• Number of read/write ports
• Register mapping (how do I find the register I am looking for)
  – Allocation, de-allocation, and tracking
Renaming Buffer Options
• Unified/merged register file – MIPS R10K, Alpha 21264
  – Registers change role between architectural and renamed
• Rename register file (RRF) – PA 8500, PPC 620
  – Holds new values until they are committed to ARF
  – Extra data transfer…
• Renaming in the ROB – Pentium III
• Note: can have a single scheme or separate for integer/FP
Unified Register File:
Physical Register FSM
Number of Rename Registers
• Naïve: as many as the number of pending instructions
  – Waiting to be scheduled + executing + waiting to commit
• Simplification
  – Do not need renaming for stores, branches, …
• Usual approach:
  – # scheduler entries ≤ # RRF entries ≤ # ROB entries
• Examples:
  – PPC 620: scheduler 15, RRF 16 (RRF), ROB 16
  – MIPS R12000: scheduler 48, RRF 64 (merged), ROB 48
  – Pentium III: scheduler 20, RRF 40 (in ROB), ROB 40
Register File Ports
• Read: if operands read as instructions enter scheduler
  – Max # ports = 2 * # instructions dispatched
• Read: if operands read as instructions leave scheduler
  – Max # ports = 2 * # instructions issued
    • Can be wider than the # of instructions dispatched…
• Write: # of FUs or # of instructions committing
  – Depends on unified vs separate rename registers
• Notes:
  – Can implement fewer ports and have structural hazards
    • Need control logic for port assignment & hazard handling
  – When using separate RRF and ARF, need ports for the final transfer
  – Alternatives to increasing ports: duplicated RF or banked RF
    • What are the issues?
Register Mapping
(From Architectural to Physical Address)
• Option 1: use a map table (ARF # → physical location)
  – Map holds the state of the register too…
  – Simple, but need two steps for reading (ok if operands read late)
• Option 2: associative search in RRF, ROB, …
  – Each physical register remembers its status (ok if operand read early)
  – More complicated but one step read
Integrating Map Tables with the ARF
Renaming Operation:
Allocation, Lookup, De-allocation
• At dispatch: for each instruction handled in parallel
  – Check the physical location & availability of source operands
  – Map destination register to new physical register
    • Stall if no register available
  – Note: must have enough ports to any map tables
• At complete: update physical location
• At commit/retire: for each instruction handled in parallel
  – Copy from RRF/ROB to ARF & deallocate RRF entry OR
  – Upgrade physical location and deallocate register with old value
    • It is now safe to do that
• Question: can we allocate later or deallocate earlier?
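The dispatch, complete, and commit steps can be sketched for the unified register file case (the second commit option above). Everything here is an illustrative assumption: the class name `RenameUnit`, the tiny register counts, one instruction per call, and a free list held in a deque.

```python
from collections import deque


class RenameUnit:
    """Map table + free list + ready bits for a unified physical register file."""

    def __init__(self, num_arch=4, num_phys=6):
        self.map = {f"r{i}": i for i in range(num_arch)}   # arch -> phys
        self.free = deque(range(num_arch, num_phys))       # unallocated regs
        self.ready = [True] * num_arch + [False] * (num_phys - num_arch)

    def dispatch(self, dest, srcs):
        """Look up sources, allocate a destination; return None to stall."""
        if not self.free:
            return None                        # no register available: stall
        operands = [(self.map[s], self.ready[self.map[s]]) for s in srcs]
        old = self.map[dest]                   # freed when this instr commits
        new = self.free.popleft()
        self.ready[new] = False
        self.map[dest] = new
        return new, operands, old

    def complete(self, phys):
        """The value has been produced; dependents may now read it."""
        self.ready[phys] = True

    def commit(self, old_phys):
        """The previous mapping can no longer be needed: recycle it."""
        self.free.append(old_phys)


ru = RenameUnit()
first = ru.dispatch("r1", ["r2", "r3"])    # r1 gets physical register 4
second = ru.dispatch("r1", ["r1", "r0"])   # reads 4 (not ready), r1 gets 5
stalled = ru.dispatch("r2", ["r1"])        # free list empty: stall
ru.complete(first[0])
ru.commit(first[2])                        # release the old copy of r1
retry = ru.dispatch("r2", ["r1"])          # succeeds with the recycled register
print(first, second, stalled, retry)
```

The register released at commit is the old mapping of the destination, not the newly allocated one. That is why deallocation waits until commit: an older in-flight instruction may still need the old value until then.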
Renaming Operation
Renaming Difficulties:
Wide Instruction Issue
• Need many ports in RFs and mapping tables
• Instruction dependencies during dispatching/issuing/committing
  – Must handle dependencies across instructions
  – E.g. add R1←R2+R3; sub R6←R1+R5
  – Implementation: use comparators, multiplexors, counters
    • Comparators: discover RAW dependencies
    • Multiplexors: generate right physical address (old or new allocation)
    • Counters: determine number of physical registers allocated
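The comparator/multiplexor/counter logic for renaming a group in one cycle can be mimicked in software. This is a sketch under assumptions: the function name `rename_group`, the `(dest, [sources])` format, and register names such as P7 are all invented for illustration, and the loops stand in for logic that hardware evaluates in parallel.

```python
def rename_group(group, map_table, free_regs):
    """Rename instructions dispatched together in the same cycle.

    group:     list of (dest, [sources]) in program order
    map_table: dict arch -> phys, as it stood before this cycle
    free_regs: list of free physical registers
    Returns the renamed group, or None if the group must stall.
    """
    # Counter: how many physical registers does the group need?
    if len(free_regs) < len(group):
        return None
    new_regs = free_regs[:len(group)]

    renamed = []
    for i, (dest, srcs) in enumerate(group):
        phys_srcs = []
        for s in srcs:
            choice = map_table[s]            # default: old allocation
            # Comparators: match against destinations of earlier
            # instructions in the group; the youngest match wins (mux).
            for j in range(i):
                if group[j][0] == s:
                    choice = new_regs[j]     # new allocation
            phys_srcs.append(choice)
        renamed.append((new_regs[i], phys_srcs))

    # Update the map and free list once for the whole group.
    for (dest, _), new in zip(group, new_regs):
        map_table[dest] = new
    del free_regs[:len(group)]
    return renamed


table = {f"R{i}": f"P{i}" for i in range(1, 7)}
free = ["P7", "P8"]
# add R1 <- R2 + R3 ; sub R6 <- R1 + R5 (the pair from the slide)
print(rename_group([("R1", ["R2", "R3"]), ("R6", ["R1", "R5"])], table, free))
```

The second instruction receives P7 for R1 rather than the stale P1 from the map table, which is exactly the intra-group RAW case the comparators exist to catch.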
Renaming Difficulties:
Mispredictions & Exceptions
• If exception/misprediction occurs, register mapping must be precise
• Separate RRF: consider all RRF entries free
• ROB renaming: consider all ROB entries free
• Unified RF: restore precise mapping
  – Single map: traverse ROB to undo mapping (history file approach)
    • ROB must remember old mapping…
  – Two maps: architectural and future register map
    • On exception, copy architectural map into future map…
  – Checkpointing: keep regular checkpoints of map, restore when needed
    • When do we make a checkpoint? On every instruction? On every branch?
    • What are the trade-offs?
    • We’ll revisit this approach later on…
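A minimal sketch of the checkpointing option, assuming one checkpoint per branch and branch ids that increase in program order. The class name `CheckpointedMap` and its methods are hypothetical, and free-list recovery is deliberately left out: physical registers allocated on the wrong path are simply not reclaimed here.

```python
class CheckpointedMap:
    """Register map with per-branch checkpoints (free list recovery omitted)."""

    def __init__(self, num_arch):
        self.map = list(range(num_arch))   # arch index -> phys index
        self.next_phys = num_arch          # assumption: unlimited registers
        self.checkpoints = {}              # branch id -> saved copy of map

    def rename_dest(self, arch):
        self.map[arch] = self.next_phys
        self.next_phys += 1
        return self.map[arch]

    def checkpoint(self, branch_id):
        """Save the map as it stands when the branch is renamed."""
        self.checkpoints[branch_id] = list(self.map)

    def resolve(self, branch_id, mispredicted):
        saved = self.checkpoints.pop(branch_id)
        if mispredicted:
            self.map = saved
            # Younger branches were on the wrong path: drop their checkpoints.
            self.checkpoints = {
                b: m for b, m in self.checkpoints.items() if b < branch_id
            }


m = CheckpointedMap(num_arch=4)
m.rename_dest(1)                 # before branch 0
m.checkpoint(0)
m.rename_dest(1)                 # speculative, after branch 0
m.rename_dest(2)
m.checkpoint(1)
m.rename_dest(3)
m.resolve(0, mispredicted=True)
print(m.map)                     # mapping as of branch 0
```

Each checkpoint costs a full copy of the map, which is the trade-off behind the question of checkpointing on every instruction versus only on branches.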
Dynamic Scheduling Based on Reservation Stations
[Figure: Dispatch Buffer, Dispatch (Reg. File, Ren. Reg., allocate reorder buffer entries), Reservation Stations, functional units (Branch, Integer, Integer, Float.-Point, Load/Store), Compl. Buffer (Reorder Buff.), Complete, Reg. Write Back. The dispatch and complete ends are in order; the section between them is out of order.]
Embedded “Data Flow” Engine
[Figure: Dispatch Buffer, Dispatch, Reservation Stations, functional units (Branch, ...), Completion Buffer, Complete. The section between dispatch and completion is labeled “Dynamic Execution”.]
Dispatch:
- Read register or assign register tag
- Advance instructions to reservation stations
Reservation Stations:
- Monitor reg. tag
- Receive data being forwarded
- Issue when all operands ready
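The monitor/receive/issue behavior of a reservation station entry can be sketched as below. The class name `ReservationStation`, the `("val", x)` / `("tag", t)` operand encoding, and the `snoop` method are assumptions made for this illustration.

```python
class ReservationStation:
    """One entry: each operand is either a value or a tag still being awaited."""

    def __init__(self, op, operands):
        # Each operand is ("val", value) if it was read at dispatch,
        # or ("tag", tag) if its producer has not finished yet.
        self.op = op
        self.operands = list(operands)

    def snoop(self, tag, value):
        """Monitor a forwarded result; capture it if an operand waits on it."""
        self.operands = [
            ("val", value) if operand == ("tag", tag) else operand
            for operand in self.operands
        ]

    def ready(self):
        """Issue is allowed once every operand holds a value."""
        return all(kind == "val" for kind, _ in self.operands)


rs = ReservationStation("add", [("val", 3), ("tag", 7)])
print(rs.ready())        # still waiting on tag 7
rs.snoop(5, 10)          # result for another tag: ignored
rs.snoop(7, 4)           # matching tag: data captured
print(rs.ready(), rs.operands)
```

In hardware every entry compares its tags against each forwarded result in the same cycle; the method call here stands in for that broadcast.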