Less is More: Leveraging Belady’s Algorithm with Demand-based Learning
Jiajun Wang, Lu Zhang, Reena Panda, Lizy John
The University of Texas at Austin

Introduction
• Why is an efficient LLC replacement policy important?
– The LLC is shared by multiple cores
– LLC accesses have low temporal locality and long data reuse distances
– LLC capacity is small compared with the working-set sizes of big-data applications
• Goal
– Ideally, every LLC cache block gets reused before eviction (maximize total reuse count)
– This requires:
• Bypassing streaming accesses
• Selecting dead blocks as victims

Review of Belady’s Optimal Algorithm
• With knowledge of the future, Belady’s algorithm gives optimal cache behavior: at the time of a miss, replace the block with the largest forward distance in the string of future references.
• Example on a 2-way fully associative cache:

  Access:    A    B    B    C    B    A    D    C    A
  Contents:  A    AB   AB   CB   CB   CA   CA   CA   CA

• On the miss to C, block A is evicted because its next reference is furthest away; D is never reused, so it is not cached. The trace takes five misses (A, B, C, A, D).

Motivation
• However, accesses differ in type:

  Type:      LD   LD   LD   ST   LD   LD   LD   WB   LD
  Addr:      A    B    B    C    B    A    D    C    A
  Contents:  A    AB   AB   AB   AB   AB   AB   AB   AB

• Keeping A and B resident also takes five misses (A, B, C, D, and the writeback of C), but it trades Belady’s load miss on the reload of A for a writeback miss, which is off the critical path.
• Same miss count != same cycle penalty cost
– Miss latency variance (e.g., missed data served from the LLC vs. from DRAM)
– Access type priority (e.g., writebacks and prefetches are not on the critical path)

Lime Proposal
• Basic idea: a cache replacement policy that leverages the key idea of Belady’s algorithm but focuses on demand accesses (i.e., loads and stores), which directly impact system performance, and bypasses the training process for writeback and prefetch accesses.
• Builds on prior work:
– The caching behavior of past load instructions can guide future caching decisions [1][2]
– Belady’s algorithm can be applied to past accesses [3]

[1] W. A. Wong and J.-L. Baer. Modified LRU policies for improving second-level cache behavior. In HPCA 2000.
[2] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In MICRO 2011.
[3] A. Jain and C. Lin. Back to the future: Leveraging Belady’s algorithm for improved cache replacement. In ISCA 2016.

Background: Hawkeye
• OPTgen reconstructs what Belady’s OPT would have done for past accesses. A reused block hits under OPT only if, at every point in the interval since its previous use, the cache still had spare capacity; if it hits, the occupancy counters over that interval are incremented.
• Occupancy vector after each access of the example trace (2-way cache):

  A: 0
  B: 0 0
  B: 0 1 0
  C: 0 1 0 0
  B: 0 1 1 1 0
  A: 1 2 2 2 1 0
  D: 1 2 2 2 1 0 0
  C: 1 2 2 2 1 0 0 0   (this reuse of C is non-cached: its interval is already full)
  A: 1 2 2 2 1 1 1 1 0
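• To make the occupancy-vector bookkeeping concrete, below is a minimal Python sketch of OPTgen-style training replayed on the example trace. The names OptGen and train are illustrative assumptions, not taken from the Hawkeye or LIME sources; the hardware version works on sampled sets with a bounded history window.

  # Minimal sketch of OPTgen-style training (names are illustrative).
  class OptGen:
      def __init__(self, capacity):
          self.capacity = capacity   # number of ways the vector models
          self.occupancy = []        # one counter per access in the window
          self.last_use = {}         # addr -> index of its previous access

      def train(self, addr):
          """Return True/False if OPT would/would not have cached this
          reuse; None for a first-touch access (no OPT decision)."""
          now = len(self.occupancy)
          self.occupancy.append(0)
          decision = None
          if addr in self.last_use:
              start = self.last_use[addr]
              # The reuse hits under OPT only if every point in the
              # liveness interval [start, now) has spare capacity.
              if all(self.occupancy[i] < self.capacity
                     for i in range(start, now)):
                  for i in range(start, now):
                      self.occupancy[i] += 1
                  decision = True
              else:
                  decision = False
          self.last_use[addr] = now
          return decision

  # The slide's running example: 2-way cache, trace A B B C B A D C A.
  gen = OptGen(capacity=2)
  for addr in "ABBCBADCA":
      print(addr, gen.train(addr))

• Replaying the trace reproduces the final vector 1 2 2 2 1 1 1 1 0 shown above, with the second C classified as non-cached.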
Lime Structure: Overall
[Diagram: overall LIME organization]

Lime Structure: Belady’s Trainer
• A table of recent accesses, ordered from the oldest access entry to the latest
• Each entry records: PC, address tag, Cached? bit, and the occupancy vector state

Handle Writeback and Prefetch
• Loads and stores consult the Belady trainer, which decides cache vs. bypass; installed data goes to the data cache under SRRIP replacement
• Writebacks are cached without training, replacing way[0]
• Prefetches are cached without training, replacing way[0]

Lime Structure: PC Classifier
• Input: PC. Output: Cached (should the data be installed into the cache?)
• Three bins: KEEP (Bloom filter), BYPASS (Bloom filter), RANDOM (LUT)
• Decision logic:

  if PC is not found in PC Classifier:
      Cached = true
  else if PC is in RANDOM bin:
      Cached = latest cache decision
  else if PC is in KEEP bin:
      Cached = true
  else if PC is in BYPASS bin:
      Cached = false

Configuration
• Storage cost
• Workloads
– SimPoint length of 200M instructions
– Single core, 2MB LLC
– Multicore, multiprogrammed, 8MB LLC
– Compared against LRU

Results: Single Core, w/o Prefetch
[Figure: normalized IPC per benchmark]
[Figure: MPKI reduction, total misses vs. load/store misses]

Results: Single Core, w/ Prefetch
[Figure: normalized IPC per benchmark]
[Figure: MPKI reduction, total misses vs. load/store misses]

Results: Multicore, w/o Prefetch
[Figure: normalized IPC for Mix 0 through Mix 8]

Results: Multicore, w/ Prefetch
[Figure: normalized IPC for Mix 0 through Mix 8]

Conclusion
• LIME respects the observation that load/store misses are more likely to stall the pipeline than writeback and prefetch misses
• LIME achieves significant IPC improvements, even though total misses increase in some cases

Thank you!
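Backup
• A minimal Python sketch of the access-type dispatch from the “Handle Writeback and Prefetch” slide. The class and parameter names are illustrative assumptions, not the authors’ implementation, and SRRIP is reduced to its 2-bit RRPV core.

  RRPV_MAX = 3  # 2-bit re-reference prediction values, as in SRRIP

  class LimeSet:
      """One cache set: demand accesses consult the predictor;
      writebacks and prefetches skip training and overwrite way[0]."""

      def __init__(self, ways, should_cache):
          self.tags = [None] * ways
          self.rrpv = [RRPV_MAX] * ways
          self.should_cache = should_cache  # PC classifier lookup: PC -> bool

      def access(self, kind, pc, tag):
          if tag in self.tags:                  # hit: promote the line
              self.rrpv[self.tags.index(tag)] = 0
              return True
          if kind in ("LOAD", "STORE"):
              if self.should_cache(pc):         # classifier says install
                  self._install_srrip(tag)
              # else: bypass, do not pollute the cache
          else:                                 # WRITEBACK or PREFETCH
              self.tags[0] = tag                # no training; replace way[0]
              self.rrpv[0] = RRPV_MAX           # assumed insertion value
          return False

      def _install_srrip(self, tag):
          while RRPV_MAX not in self.rrpv:      # age until a victim appears
              self.rrpv = [r + 1 for r in self.rrpv]
          victim = self.rrpv.index(RRPV_MAX)
          self.tags[victim] = tag
          self.rrpv[victim] = RRPV_MAX - 1      # SRRIP: insert at "long"

  # Example with hypothetical values: bypass a known streaming PC.
  s = LimeSet(ways=2, should_cache=lambda pc: pc != 0x400)
  s.access("LOAD", 0x100, tag="A")
  s.access("LOAD", 0x400, tag="B")   # bypassed, never installed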