Less is More - Cache Replacement Championship

Less is More: Leveraging Belady's Algorithm with Demand-based Learning
Jiajun Wang, Lu Zhang, Reena Panda, Lizy John
The University of Texas at Austin
Introduction
• Why is an efficient LLC replacement policy important?
– The LLC is shared by multiple cores
– LLC accesses have low temporal locality and long data reuse distances
– LLC capacity is small compared with the working set sizes of big-data applications
• Goal
– Ideally, every LLC cache block gets reused before eviction (maximize the total reuse count)
– This requires:
• Bypassing streaming accesses
• Selecting dead blocks as victims
Review of Belady’s Optimal Algorithm
• Given knowledge of the future, Belady's algorithm yields optimal cache behavior: at the time of a miss, the block with the largest forward distance in the string of future references is replaced.
[Figure: access sequence A, B, B, C, B, A, D, C, A served by a 2-way fully associative cache, showing the cache contents over time under Belady's algorithm]
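The example above can be replayed in software. Below is a minimal sketch of Belady's MIN rule for a fully associative cache (the function name is mine; it models demand caching only, with no bypass):

```python
def belady_min(accesses, capacity):
    """Simulate Belady's MIN on a fully associative cache and
    return the miss count."""
    cache = set()
    misses = 0
    for i, addr in enumerate(accesses):
        if addr in cache:
            continue                      # hit
        misses += 1
        if len(cache) == capacity:
            # Evict the resident block whose next reference lies
            # farthest in the future (never-reused blocks first).
            def next_use(block):
                for j in range(i + 1, len(accesses)):
                    if accesses[j] == block:
                        return j
                return float("inf")
            cache.discard(max(cache, key=next_use))
        cache.add(addr)
    return misses
```

On the sequence A, B, B, C, B, A, D, C, A with a 2-way cache, this yields 6 misses (3 hits), which no eviction schedule without bypass can beat.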
Motivation
• However…
[Figure: the same address sequence A, B, B, C, B, A, D, C, A annotated with access types LD, LD, LD, ST, LD, LD, LD, WB, LD; two replacement schedules with identical miss counts can differ in cycle cost]
• The same miss count != the same cycle penalty
– Miss latency varies (e.g., fetching missed data from the LLC vs. from DRAM)
– Access types have different priorities (e.g., writebacks and prefetches are not on the critical path)
Lime Proposal
• Basic idea:
A cache replacement policy that leverages the key idea of Belady's algorithm but focuses on demand accesses (i.e., loads and stores), which directly impact system performance, and skips the training process for writeback and prefetch accesses.
• Builds on prior work
– The caching behavior of past load instructions can guide future caching decisions [1][2]
– Belady's algorithm can be leveraged on past accesses [3]
[1] W. A. Wong and J.-L. Baer. Modified LRU policies for improving second-level cache behavior. In HPCA 2000.
[2] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer. SHiP: Signature-based hit predictor for high-performance caching. In MICRO 2011.
[3] A. Jain and C. Lin. Back to the future: Leveraging Belady's algorithm for improved cache replacement. In ISCA 2016.
Background: Hawkeye
• OPTgen: decides whether each access would hit under Belady's algorithm by tracking an occupancy vector over time
[Figure: occupancy vector updates for the access sequence B, B, C, B, A, D, C, A over unique addresses A–D, with each access marked Cached or Non-Cached]
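OPTgen's occupancy-vector check can be sketched in a few lines. This is a simplified model (a full-length vector and exact last-access times, whereas the hardware version uses sampled sets and a bounded sliding window; the function name is mine). Because accesses judged non-cacheable add no occupancy, the model effectively allows bypass:

```python
def optgen(accesses, capacity):
    """For each access, decide whether Belady's algorithm (with
    bypass) would have cached it, via an occupancy vector."""
    occupancy = [0] * len(accesses)  # live-line count per time slot
    last_seen = {}                   # addr -> time of previous access
    decisions = []                   # True = would be cached (OPT hit)
    for t, addr in enumerate(accesses):
        if addr in last_seen:
            start = last_seen[addr]
            # The line fits under OPT iff occupancy stays below
            # capacity throughout its reuse interval.
            if all(occupancy[j] < capacity for j in range(start, t)):
                for j in range(start, t):
                    occupancy[j] += 1
                decisions.append(True)   # Cached
            else:
                decisions.append(False)  # Non-Cached
        else:
            decisions.append(False)      # compulsory miss
        last_seen[addr] = t
    return decisions
```

On the running example (A, B, B, C, B, A, D, C, A, 2-way), this marks both B reuses and both A reuses as Cached: one more hit than MIN without bypass, since C can bypass at its store instead of evicting A or B.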
Lime Structure: Overall
[Figure: overall LIME structure]
Lime Structure: Belady’s Trainer
[Figure: the Belady Trainer is an ordered queue of access entries, from the oldest access entry to the latest, where each entry holds a PC, an address tag, a Cached? bit, and an occupancy vector]
Handle Writeback and Prefetch
[Figure: load/store accesses pass through the Belady Trainer, which makes the cache/bypass decision for the data cache; writeback and prefetch accesses skip the trainer and are cached directly, using SRRIP replacement or replacing way[0]]
Lime Structure: PC Classifier
• Input: PC; Output: Cached (should the data be installed into the cache?)
• Three bins: KEEP (Bloom filter), BYPASS (Bloom filter), RANDOM (LUT)
If the PC is not found in the PC Classifier:
    Cached = true
Else if the PC is in the RANDOM bin:
    Cached = latest cache decision
Else if the PC is in the KEEP bin:
    Cached = true
Else if the PC is in the BYPASS bin:
    Cached = false
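The decision logic above can be sketched directly. Plain sets stand in for the Bloom filters and the LUT, and treating the "latest cache decision" as a single stored flag is my assumption:

```python
class PCClassifier:
    """Sketch of the PC Classifier (bin names from the slide)."""

    def __init__(self):
        self.keep = set()      # KEEP bin (Bloom filter in hardware)
        self.bypass = set()    # BYPASS bin (Bloom filter in hardware)
        self.random = set()    # RANDOM bin (LUT in hardware)
        self.latest_decision = True  # most recent caching decision

    def should_cache(self, pc):
        # Unknown PCs are cached by default.
        if pc not in (self.keep | self.bypass | self.random):
            return True
        if pc in self.random:
            return self.latest_decision
        if pc in self.keep:
            return True
        return False           # pc is in the BYPASS bin
```

A real Bloom filter would admit false positives (a BYPASS hit for a PC never inserted), which this set-based sketch cannot show.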
Configuration
• Storage Cost
• Workloads
– SimPoints of length 200M instructions
– Single core, 2MB LLC
– Multicore, multiprogrammed, 8MB LLC
– Compared against LRU
Results: Single Core, w/o Prefetch
[Figure: normalized IPC per benchmark (0.9–1.2) and MPKI reduction (-40% to 50%), split into total-miss reduction and load/store-miss reduction]
Results: Single Core, w/ Prefetch
[Figure: normalized IPC per benchmark (0.9–1.2) and MPKI reduction (-100% to 100%), split into total-miss reduction and load/store-miss reduction]
Results: Multicore, w/o Prefetch
[Figure: normalized IPC (0.9–1.25) for workload mixes Mix 0 through Mix 8]
Results: Multicore, w/ Prefetch
[Figure: normalized IPC (0.9–1.25) for workload mixes Mix 0 through Mix 8]
Conclusion
• LIME respects the observation that load/store misses are more likely to cause pipeline stalls than writeback and prefetch misses
• LIME achieves significant IPC improvements, even while increasing total misses in some cases.
Thank you!