Short-Lived Values

Reducing Datapath Energy Through
the Isolation of Short-Lived Operands
Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
Outline
– Introduction
– Motivations
– Contributions
  • Basic idea: isolate short-lived operands in a small dedicated register file and avoid their writes to the ROB and the ARF
  • Resources impacted: ROB, ARF
  • Power savings: 21% with a 32-entry additional RF
– Results
– Conclusions
– Future work
A P6-like Superscalar Datapath
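[Figure: block diagram of a P6-like superscalar datapath. Labeled blocks: Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), LSQ, D-cache, ROB, Architectural Register File (ARF), and the result/status forwarding buses.]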
Out-of-Order Execution and In-Order Retirement
[Figure: pipeline diagram. In-order front end (F, R, D), out-of-order core (Inst. Queue, Ex, ROB), and in-order retirement into the ARF.]
Energy-dissipating Events
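[Figure: pipeline diagram with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex, ROB), and in-order retirement into the ARF. The Write and Read events on the ROB and the ARF are marked as the energy-dissipating events.]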
The Idea : Isolating Short-Lived Values
– Write short-lived values into a small dedicated RF (SRF)
[Figure: pipeline diagram with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex, ROB), and in-order retirement into the ARF, with the SRF added; Write and Read events are marked.]
Register Renaming
– Used to avoid false data dependencies.
– A new physical register is allocated for EVERY new result
– P6 style: ROB slots serve as physical registers

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, P2, 100
SUB R5, R1, R3       SUB P32, P31, P3
ADD R1, R5, R4       ADD P33, P32, P4
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers

Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           1            1
2           2            1
3           3            1
4           4            1
5           5            1

Original code
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers

Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           5            1

Original code
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code
LOAD P31, R2, 100
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers

Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0

Original code
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code
LOAD P31, R2, 100
SUB P32, P31, R3
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers

Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           32           0

Original code
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
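
The renaming steps shown on the last few slides can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rename`, the tuple encoding of instructions, and the starting ROB slot of 31 are hypothetical choices made to match the slide example, and ROB wrap-around and commitment are ignored.

```python
def rename(program, first_rob_slot=31, num_arch_regs=6):
    """Rename destinations to ROB slots, P6 style (hypothetical helper).

    Each RAT entry is (physical register, location), where location
    0 means the value lives in the ROB and 1 means the ARF.
    """
    rat = {f"R{i}": (i, 1) for i in range(num_arch_regs)}
    slot = first_rob_slot
    renamed_code = []
    for op, dest, *sources in program:
        operands = []
        for src in sources:
            if isinstance(src, int):          # immediate operand
                operands.append(str(src))
            else:
                phys, location = rat[src]
                # Sources still in the ARF keep their architectural name.
                operands.append(f"P{phys}" if location == 0 else src)
        rat[dest] = (slot, 0)                 # new result lives in a ROB slot
        renamed_code.append(f"{op} P{slot}, " + ", ".join(operands))
        slot += 1
    return renamed_code, rat


program = [
    ("LOAD", "R1", "R2", 100),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
renamed_code, rat = rename(program)
for line in renamed_code:
    print(line)
```

Sources are looked up before the destination mapping is updated, so an instruction that reads and writes the same register still sees the older mapping.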
Short-Lived Values
– Our definition: a value is short-lived if its destination register is renamed by the time the result is generated.
– Identified one cycle before the result writeback

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4     (RENAMER)
The Good News : 80%+ of the Values are Short-Lived
96-entry ROB, 4-way processor
[Figure: bar chart of the percentage of short-lived values (y-axis, 0 to 100) per benchmark: bzip2, gap, vortex, twolf, perlbmk, parser, mcf, gzip, gcc, vpr, wupwise, swim, mgrid, mesa, equake, art, apsi, applu, followed by the Int. average, the FP average, and the overall average.]
As rename-to-writeback latency increases in future datapaths, the percentage of short-lived values will also go up
The Idea : Isolating Short-Lived Values
– Write short-lived values into a small dedicated RF (SRF)
[Figure: pipeline diagram with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex, ROB), and in-order retirement into the ARF, with the SRF added; Write and Read events are marked.]
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Why do we need the SRF ?
Need to hang on to the short-lived values to:
  • Recover from branch mispredictions
  • Reconstruct precise state
LOAD R1, R2, 100
BEQ R5, R1, #100
ADD R1, R5, R4
Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming

Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4

Renamed[31] = 1
Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming

Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           32           0

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4

Renamed[31] = 1
Identifying Short-Lived Values
– Renamed bit is checked one cycle before writeback
– Value produced by LOAD is short-lived because Renamed[31]=1

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4

Renamed[31] = 1
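
One way to picture the Renamed bit-vector is the sketch below. It is an assumption-laden illustration rather than the hardware described in the talk: the class name `RenameStage`, its methods, and the `latest_producer` map are hypothetical, and commitment (which would return a mapping to the ARF) is not modeled.

```python
ROB_SIZE = 96  # matches the 96-entry ROB mentioned earlier in the slides


class RenameStage:
    """Tracks the Renamed bit-vector (hypothetical class name)."""

    def __init__(self):
        self.latest_producer = {}        # arch reg -> ROB slot of newest writer
        self.renamed = [0] * ROB_SIZE    # one bit per ROB slot

    def rename_dest(self, arch_reg, new_slot):
        previous = self.latest_producer.get(arch_reg)
        if previous is not None:
            self.renamed[previous] = 1   # the older value now has a renamer
        self.renamed[new_slot] = 0       # a freshly allocated slot starts clear
        self.latest_producer[arch_reg] = new_slot

    def is_short_lived(self, slot):
        """Checked one cycle before the result in `slot` is written back."""
        return self.renamed[slot] == 1


stage = RenameStage()
stage.rename_dest("R1", 31)   # LOAD R1, R2, 100
stage.rename_dest("R5", 32)   # SUB  R5, R1, R3
stage.rename_dest("R1", 33)   # ADD  R1, R5, R4 (renamer of the LOAD's R1)
print(stage.is_short_lived(31))
```

In this example only slot 31 is flagged, because ADD renames R1 before the LOAD result is written back; slots 32 and 33 have no renamer yet.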
Managing the SRF: the Issues
– When do we write short-lived values into the SRF?
– When and how are the short-lived values removed from the SRF?
– What happens on a branch misprediction?
– How do we reconstruct a precise state?
Format of an SRF entry
Valid | ROB idx | Dest. Arch. Reg. | Data | Branch Tag 1 | Branch Tag 2
– Branch Tag 1: Branch Identifier for Renamer, used to remove this entry if the renamer gets squashed
– Branch Tag 2: Branch Identifier for this instruction, used to remove this entry if this instruction gets squashed
– Branch Identifier of an instruction = id/tag of the immediately preceding conditional branch
Writing to the SRF: the Conditions
– An instruction writes a short-lived result value into the SRF if:
  • A free entry exists in the SRF
  • No SRF entry keyed with the same ROB slot is already established
– Bit-vector Allocated_in_SRF is maintained
  • One bit for each ROB entry
  • Set at the time of writeback if the value is written into the SRF
  • Reset at the time of removing the value from the SRF
SRF format: Valid | ROB idx | Dest. reg | Data | Branch Tag 1 | Branch Tag 2
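
The write conditions above can be sketched as follows. This is a minimal software model under assumptions: the names `SRFEntry`, `SRF`, `try_write`, and `remove` are hypothetical, a `None` list element stands in for a cleared Valid bit, and a failed write is assumed to fall back to the normal ROB write.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SRFEntry:
    rob_idx: int
    dest_reg: int
    data: int
    branch_tag1: int = 0
    branch_tag2: int = 0


class SRF:
    def __init__(self, num_entries: int, rob_size: int):
        # None models a cleared Valid bit.
        self.entries: List[Optional[SRFEntry]] = [None] * num_entries
        self.allocated_in_srf = [0] * rob_size   # one bit per ROB entry

    def try_write(self, entry: SRFEntry, short_lived: bool) -> bool:
        if not short_lived:
            return False
        if self.allocated_in_srf[entry.rob_idx]:
            return False                      # same ROB slot already has an entry
        for i, slot in enumerate(self.entries):
            if slot is None:                  # free entry found
                self.entries[i] = entry
                self.allocated_in_srf[entry.rob_idx] = 1
                return True
        return False                          # SRF is full

    def remove(self, rob_idx: int) -> Optional[SRFEntry]:
        for i, slot in enumerate(self.entries):
            if slot is not None and slot.rob_idx == rob_idx:
                self.entries[i] = None
                self.allocated_in_srf[rob_idx] = 0
                return slot
        return None


srf = SRF(num_entries=2, rob_size=96)
first = srf.try_write(SRFEntry(31, 1, 42), short_lived=True)
same_slot = srf.try_write(SRFEntry(31, 1, 43), short_lived=True)
second = srf.try_write(SRFEntry(32, 5, 7), short_lived=True)
when_full = srf.try_write(SRFEntry(33, 1, 9), short_lived=True)
```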
Scenarios for Removing the Values from the SRF
Scenario 1 : Normal Commitment of Renamer
Scenario 2 : Renamer gets squashed
Scenario 3 : The instruction generating the short-lived value itself gets squashed
Removing the Values from the SRF : Scenario 1
– Values are removed by the Renamer
– 2-step process:
  • Mark the instruction whose value is to be removed from the SRF (done at the time of renaming)
  • Remove the marked value from the SRF IF NEED BE (done at the time of commitment)

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4     (Renamer)

– When ADD commits, it removes the value written by LOAD
Marking the Values for Removal
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4

ROB: slots 31, 32, 33 hold LOAD, SUB, ADD
Marking the Values for Removal
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4

ROB: slots 31, 32, 33 hold LOAD, SUB, ADD
FS (Flush SRF) field of the ROB: the entry of ADD (slot 33) holds 31
Removing the Values (B is the renamer for A)
– FS field of B must match the ROB index field of an SRF entry
– This SRF entry must belong to A

A:  LOAD R1, R2, 100
    SUB R5, R1, R3
B:  ADD R1, R5, R4

ROB: slots 31, 32, 33; FS field of B = 31
SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = load
SRF format: Valid | ROB idx | Dest | Data | Branch Tag 1 | Branch Tag 2
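
The two-step removal (mark at rename, remove at commit) might look like the sketch below. The helper names `rename_and_mark` and `commit`, and the use of plain dictionaries for the RAT, the FS fields, and the SRF, are assumptions made for illustration; the ownership check on the matched entry is left out here.

```python
def rename_and_mark(rat, fs_field, arch_dest, new_slot):
    """Step 1 (rename time): remember which ROB slot the renamer supersedes."""
    fs_field[new_slot] = rat.get(arch_dest)   # None if the old value is in the ARF
    rat[arch_dest] = new_slot


def commit(slot, fs_field, srf):
    """Step 2 (commit time): drop the SRF entry whose ROB idx matches FS."""
    target = fs_field.pop(slot, None)
    if target is None:
        return None
    return srf.pop(target, None)              # None if A never wrote to the SRF


rat, fs_field = {}, {}
rename_and_mark(rat, fs_field, "R1", 31)      # A: LOAD R1, R2, 100
rename_and_mark(rat, fs_field, "R5", 32)      #    SUB  R5, R1, R3
rename_and_mark(rat, fs_field, "R1", 33)      # B: ADD  R1, R5, R4
fs_of_b = fs_field[33]

srf = {31: "load"}                            # keyed by ROB idx, as on the slide
removed_by_a = commit(31, fs_field, srf)
removed_by_sub = commit(32, fs_field, srf)
removed_by_b = commit(33, fs_field, srf)
```

Only B's commit removes anything here, since it is the only instruction whose FS field points at a ROB slot.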
Another Example (LOAD could not write to SRF)
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           32           0

Renamed[31] = 1

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100
SUB R5, R1, R3       SUB P32, P31, R3
ADD R1, R5, R4       ADD P33, P32, R4

SRF was full!
Another Example
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           5            1

Renamed[31] = 0

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100     (Committed)
SUB R5, R1, R3       SUB P32, P31, R3      (Committed)
ADD R1, R5, R4       ADD P33, P32, R4
…
MUL R2, R3, R4
DIV R2, R2, R5
Another Example
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           31           0
3           3            1
4           4            1
5           5            1

Renamed[31] = 0

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100     (Committed)
SUB R5, R1, R3       SUB P32, P31, R3      (Committed)
ADD R1, R5, R4       ADD P33, P32, R4
…                    …
MUL R2, R3, R4       MUL P31, R3, R4
DIV R2, R2, R5
Another Example
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           32           0
3           3            1
4           4            1
5           5            1

Renamed[31] = 1

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, R2, 100     (Committed)
SUB R5, R1, R3       SUB P32, P31, R3      (Committed)
ADD R1, R5, R4       ADD P33, P32, R4
…                    …
MUL R2, R3, R4       MUL P31, R3, R4
DIV R2, R2, R5       DIV P32, P31, R5
Another Example (A’s ROB slot is assigned for C)
ROB: slots 31, 32, 33; FS field of B = 31
SRF entry: Valid = 0

A:  LOAD P31, R2, 100
    SUB P32, P31, R3
B:  ADD P33, P32, R4

SRF format: Valid | ROB idx | Dest | Data | Branch Tag 1 | Branch Tag 2
Another Example (A’s ROB slot is assigned for C)
ROB: slots 31 (C), 32 (D), 33 (B); FS field of B = 31
SRF entry: Valid = 1, ROB idx = 31, Dest = 2, Data = mul

    LOAD P31, R2, 100
    SUB P32, P31, R3
B:  ADD P33, P32, R4
    …
C:  MUL P31, R3, R4
D:  DIV P32, P31, R5

SRF format: Valid | ROB idx | Dest | Data | Branch Tag 1 | Branch Tag 2
Ensuring that the right values are removed
– Bit-vector Uncommitted_Write is maintained
  • One bit for each ROB entry
  • Set at the time of establishing the SRF entry
  • Reset at the time of commitment
– Instruction B removes the value written by A (allocated to ROB slot i) if:
  • Allocated_in_SRF[i]=1, and
  • Uncommitted_Write[i]=0
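
A short sketch of this check follows. The interpretation in the comments is one reading of the slides, not a statement from the authors: because commitment is in order, A has always committed before its renamer B, so a set Uncommitted_Write bit suggests the ROB slot has been reused by a later instruction (C in the previous example). The function name `may_remove` is hypothetical.

```python
def may_remove(i, allocated_in_srf, uncommitted_write):
    """True if committing renamer B may remove the SRF entry keyed by ROB slot i."""
    return allocated_in_srf[i] == 1 and uncommitted_write[i] == 0


ROB_SIZE = 96
allocated_in_srf = [0] * ROB_SIZE
uncommitted_write = [0] * ROB_SIZE

# Normal case: A (slot 31) wrote to the SRF and has already committed.
allocated_in_srf[31] = 1
uncommitted_write[31] = 0
normal_case = may_remove(31, allocated_in_srf, uncommitted_write)

# Slot-reuse case: slot 31 now holds C (MUL), which wrote to the SRF
# but has not committed, so B must leave that entry alone.
uncommitted_write[31] = 1
reuse_case = may_remove(31, allocated_in_srf, uncommitted_write)

# A never wrote to the SRF (for example, the SRF was full): nothing to remove.
nothing_allocated = may_remove(40, allocated_in_srf, uncommitted_write)
```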
Avoiding Unnecessary Commitments
– When an instruction allocated to ROB slot i commits and Allocated_in_SRF[i]=1, the data is not copied to the ARF.
[Figure: pipeline diagram (F, R, D, Inst. Queue, Ex, ROB, SRF, ARF) with the Dest. reg field and the Write and Read events marked.]
Handling Branch Mispredictions : Scenario 2
– Problem:
  • Renamer can get squashed -> stale entries remain in the SRF if nothing is done
– Example:
ROB: slots 31, 32, 33, 34; FS field = 31
SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = load
Handling Branch Mispredictions
– Problem:
  • Renamer can get squashed -> stale entries remain in the SRF if nothing is done
– Example:
ROB: slots 31, 32, 33, 34
SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = load
Handling Branch Mispredictions
– Solution:
  • Tag each entry in the SRF with the id of the branch preceding the renamer (BT1).
  • When the renamer is squashed, the value is removed from the SRF and is written to either the ROB or the ARF (based on the value of the Uncommitted_Write bit)
  • Multiplex the ports to reduce complexity
SRF format: Valid | ROB idx | Dest | Data | Branch Tag 1 | Branch Tag 2
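
The BT1-based recovery could be modeled roughly as below. Several points are assumptions rather than facts from the slides: the function name `squash_renamers`, the dictionary layout of an SRF entry, passing the set of squashed branch tags as a parameter, and in particular the direction of the choice (ROB when the producer is still uncommitted, ARF when it has committed). Resetting Allocated_in_SRF is omitted for brevity.

```python
def squash_renamers(srf, squashed_tags, uncommitted_write, rob, arf):
    """Evict SRF entries whose renamer (tagged by bt1) was squashed."""
    survivors = []
    for entry in srf:
        if entry["bt1"] in squashed_tags:
            if uncommitted_write[entry["rob_idx"]]:
                rob[entry["rob_idx"]] = entry["data"]   # producer still in flight
            else:
                arf[entry["dest"]] = entry["data"]      # producer already committed
        else:
            survivors.append(entry)
    return survivors


srf = [
    {"rob_idx": 31, "dest": 1, "data": 111, "bt1": 7, "bt2": 3},
    {"rob_idx": 40, "dest": 2, "data": 222, "bt1": 7, "bt2": 3},
    {"rob_idx": 41, "dest": 4, "data": 333, "bt1": 5, "bt2": 3},
]
uncommitted_write = {31: 1, 40: 0, 41: 1}
rob, arf = {}, {}
srf = squash_renamers(srf, {7}, uncommitted_write, rob, arf)
```

Without this step, the entries tagged with branch 7 would never be removed, since the renamers that were supposed to remove them at commit no longer exist.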
Obtaining Branch Tag BT1
– Maintain the array Branch_Tags
– One entry for each ROB slot

Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           33           0

Original code        Renamed code
LOAD R1, R2, 100     LOAD P31, P2, 100
BEQ R6, R7, 200      BEQ P6, P7, 200
SUB R5, R1, R3       SUB P33, P31, P3
ADD R1, R5, R4       ADD P34, P33, P4

Branch_Tags[31] = 7
Handling Branch Mispredictions : Scenario 3
– Problem:
  • The instruction whose value was inserted into the SRF can itself be squashed
– Example:
ROB: slots 30, 31, 32, 33; FS field = 31
SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = load
Handling Branch Mispredictions
– Problem:
  • The instruction whose value was inserted into the SRF can itself be squashed
– Example:
ROB: slots 30, 31, 32, 33
SRF entry: Valid = 1, ROB idx = 31, Dest = 1, Data = load
Handling Branch Mispredictions
– Solution:
  • Tag each entry in the SRF with the id of the branch preceding the instruction itself (BT2).
  • Simply remove the value from the SRF if such a branch is mispredicted
SRF format: Valid | ROB idx | Dest | Data | Branch Tag 1 | Branch Tag 2
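
Since a squashed producer's value is simply discarded, the BT2 check reduces to a filter. The sketch below is illustrative only; the function name `squash_producers`, the dictionary layout of an entry, and the set of squashed branch tags are hypothetical.

```python
def squash_producers(srf, squashed_tags):
    """Drop SRF entries whose producing instruction (tagged by bt2) was squashed."""
    return [entry for entry in srf if entry["bt2"] not in squashed_tags]


srf = [
    {"rob_idx": 31, "dest": 1, "data": 111, "bt1": 9, "bt2": 7},
    {"rob_idx": 20, "dest": 3, "data": 222, "bt1": 9, "bt2": 4},
]
srf = squash_producers(srf, {7})
```

Unlike the renamer case, nothing is copied to the ROB or the ARF, because the value belongs to the mispredicted path.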
Supporting Precise Interrupts
– Allow all instructions preceding the faulting instruction to commit
– Squash all instructions following the faulting instruction
– Copy the values of ALL valid SRF entries to the ARF.
SRF format: Valid | ROB idx | Dest | Data | Branch Tag 1 | Branch Tag 2
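
The last step, draining the SRF into the ARF, can be sketched as follows. The function name `drain_srf_to_arf` and the dictionary layout are hypothetical, and the sketch assumes the first two steps (committing older instructions and squashing younger ones) have already left only committed values in the SRF.

```python
def drain_srf_to_arf(srf, arf):
    """Copy every valid SRF entry to its architectural register, then free it."""
    for entry in srf:
        if entry["valid"]:
            arf[entry["dest"]] = entry["data"]
            entry["valid"] = 0
    return arf


arf = {1: 0, 2: 0, 5: 0}
srf = [
    {"valid": 1, "rob_idx": 31, "dest": 1, "data": 111},
    {"valid": 0, "rob_idx": 32, "dest": 5, "data": 999},   # stale, not copied
    {"valid": 1, "rob_idx": 35, "dest": 2, "data": 222},
]
drain_srf_to_arf(srf, arf)
```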
Experimental Setup
[Figure: experimental flow. Inputs: compiled SPEC benchmarks and datapath specs. A microarchitectural simulator produces performance stats, transition counts, and context information. A data analyzer (intra-stream analysis) runs as a second, separate thread connected through inter-thread buffers. VLSI layout data and SPICE decks are run through SPICE to obtain SPICE measures of energy per transition. The energy/power estimator produces the power/energy stats.]
Results: Percentage of Values Written into the SRF
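[Figure: bar chart of the percentage of values written into the SRF (y-axis, 0 to 100%) per benchmark (bzip2, gap, gcc, gzip, mcf, pars, perl, twolf, vort, vpr, applu, apsi, art, eq, mesa, mgrid, swim, wupw) for SRF sizes of 8, 16, 32, and 48 entries.]
Averages: 8 entries: 40.5%; 16 entries: 60.1%; 32 entries: 77.5%; 48 entries: 82.3%
% of short-lived results: 86.7%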
Results: Average Time Spent by a Value in the SRF
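[Figure: bar chart of the average time (in cycles) spent by a value in the SRF per benchmark (bzip2, gap, gcc, gzip, mcf, pars, perl, twolf, vort, vpr, applu, apsi, art, eq, mesa, mgrid, swim, wupw) for SRF sizes of 8, 16, 32, and 48 entries.]
Average: 12-15 cycles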
Results: Percentage of Values not copied into the ARF
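[Figure: bar chart of the percentage of values not copied into the ARF (y-axis, 0 to 100%) per benchmark (bzip2, gap, gcc, gzip, mcf, pars, perl, twolf, vort, vpr, applu, apsi, art, eq, mesa, mgrid, swim, wupw) for SRF sizes of 8, 16, 32, and 48 entries.]
Averages: 8 entries: 42.2%; 16 entries: 61.9%; 32 entries: 79.3%; 48 entries: 84.1%
% of short-lived results: 86.7%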
Results: Net Energy Reduction
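[Figure: stacked bar chart of energy (pJ, 0 to 800) for the baseline and for SRF sizes of 8, 16, 32, and 48 entries. Each bar is split into ROB + additional logic, ARF, and SRF.]
Net energy reduction: 8 entries: 9%; 16 entries: 16%; 32 entries: 21%; 48 entries: 23%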
Related Work
– Register Traffic Analysis (Franklin and Sohi, MICRO’92).
  • Studied the useful lifetime of register instances
  • Delaying the writes until 30 more instructions are dispatched can eliminate 80% of the writes (if perfect knowledge of the last use is available)
  • Buffering the 30 most recently generated results avoids 80% of writebacks
– Lozano and Gao (MICRO’95)
  • 90% of all result values are short-lived (consumed while in the ROB)
  • Mechanism to avoid commitment of these values and also avoid register allocation for them is proposed
  • ROB slots are exposed to the compiler in the form of symbolic registers
– Lazy Retirement (Savransky, Ronen, Gonzalez, WCED’02)
  • Hardware-based scheme to avoid unnecessary commitments
  • Copying from the ROB to the ARF is delayed until the ROB slot is reused. In many cases, the register is invalidated by the newer instruction
  • Additional rename table is needed. About 75% of commits are avoided.
Conclusions
– Significant power savings & negligible impact on performance
– Sources of power savings:
  • The majority of generated results are written into a small, lightly-ported SRF
  • Unnecessary commitments are avoided
  • Additional logic/storage needed to do this is simple
– For a 32-entry SRF, more than 77% of writebacks and more than 79% of commitments can be avoided
– This results in energy savings of 21% on the ROB and the ARF
THANK YOU !
LOW POWER RESEARCH GROUP
Parallel Architectures and Compilation Techniques (PACT’03)
October 1st 2003
This work was supported in part by DARPA through the PAC-C program and NSF
Complexity of the Solution
– SRF
– Three bit vectors (same size as the ROB)
  • Renamed
  • Allocated_in_SRF
  • Uncommitted_Write
– 4-bit array Branch_Tags (same size as the ROB)
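
As a rough worked example of the bookkeeping cost, the sketch below counts the bits in the three bit vectors and the Branch_Tags array. It assumes the 96-entry ROB quoted earlier for the short-lived-value statistics; the function name `bookkeeping_bits` is hypothetical, and the SRF storage itself and the FS field added to each ROB entry are not included.

```python
def bookkeeping_bits(rob_entries, num_bit_vectors=3, branch_tag_bits=4):
    """Bits used by Renamed, Allocated_in_SRF, Uncommitted_Write, and Branch_Tags."""
    bit_vector_bits = num_bit_vectors * rob_entries
    branch_tags_bits = branch_tag_bits * rob_entries
    return bit_vector_bits + branch_tags_bits


bits_for_96_entry_rob = bookkeeping_bits(96)
print(bits_for_96_entry_rob)   # 3*96 + 4*96 = 672 bits
```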