An Adaptive Shared/Private NUCA Cache Partitioning Scheme for CMPs
Haakon Dybdahl, Per Stenström
HPCA 2007

CMP Caching
• Extremes
  – Private and shared caches
• NUCA organizations
• Shared caches (Chang and Sohi, ISCA ’06)
  – Adaptive vs uncontrolled sharing
• Pollution issues
So, is custom cache partitioning better?

Adaptive Partitioning
• Dynamic sharing of last-level caches among cores
  – Private and shared cache partitions
  – Who needs more, gets more!
  – Overall goal: minimize total cache misses

Issues to be considered?
• How to estimate private/shared space for a core?
• How to share the “shared space” among cores?
• Replacement policy for shared spaces?

Private/Shared Cache Partition Size
• Private partition: increase/decrease blocks per set, keep the number of sets constant

Private/Shared Cache Partition Size
• Shared partition: estimate relative gain
  – Estimate the misses a core would avoid if it gained one block per set
  – Estimate the extra misses a core would incur if it lost one block per set

H/W Support
• Core ID
• Shadow tags
• Counters
• Max # of blocks in a set
• Cost:
(Figure: Shadow Tags, Core ID, Counters)

Relative comparisons
• Avoiding cache misses?
  – Shadow tags: one per set per core
• Hits in shadow tags
  – Hits to LRU blocks
• Re-evaluation
  – Core_with_most_hits_to_shadow_tags (1) compared with core_with_lowest_hits_to_LRU_block (2)
• Done every 2000 cycles
  – If (1) > (2), one cache block per set is added to core (1)

Managing Partitions
• Private partition
  – LRU replacement
• Some key events:
  – Cache hit in private L3
    • Block found, classified as MRU
  – Cache hit in neighboring L3
    • All neighboring caches checked in parallel
    • Block brought to private L3, LRU replacement
    • Block evicted from private moved to shared $ (???)
  – Cache miss
    • Block fetched from memory, placed in private $
    • LRU block from private moved to shared

Shared partition block replacement
Algorithm

Results
• Single-threaded workloads on each core
• 4 MB shared L3, 1 MB private L3
• Workload characterization
  – Last-level-cache sensitive/insensitive
• Overall goal: maximize the harmonic mean (HM) of the IPCs of all 4 cores
  – Forms the basis for comparison

Speedups
For last-level-cache-sensitive benchmarks

Larger Caches
8 MB L3

Technology scaling
• With smaller technologies, delay is more dominant

With respect to Chang and Sohi, ISCA ’06
• Uncontrolled vs adaptive partitioning

Summary
• Adaptive cache partitioning gives you:
  – Better performance
  – Less interference
  – Improved sharing
• Can do more with less (cache)
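
The re-evaluation step from the "Relative comparisons" slide can be sketched roughly as below. This is a minimal sketch, not the authors' implementation. The names `CoreCounters` and `reevaluate` are hypothetical. The slide only says that one block per set is added to core (1); taking that block from core (2), excluding core (1) from the search for core (2), keeping at least one block per set for every core, and resetting the counters after each re-evaluation are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class CoreCounters:
    # Hits in the shadow tags: misses that one extra block per set
    # would have avoided for this core.
    shadow_tag_hits: int = 0
    # Hits to LRU blocks: hits that would become misses if this core
    # lost one block per set.
    lru_hits: int = 0
    # Current limit on blocks per set for this core (starting value is arbitrary).
    max_blocks_per_set: int = 4


def reevaluate(cores, min_blocks=1):
    """Run one re-evaluation; return (gainer, loser) indices or None."""
    # Core (1): the core with the most hits in its shadow tags.
    gainer = max(range(len(cores)), key=lambda i: cores[i].shadow_tag_hits)
    # Assumption: core (2) is chosen among the other cores that can still shrink.
    candidates = [
        i for i in range(len(cores))
        if i != gainer and cores[i].max_blocks_per_set > min_blocks
    ]
    decision = None
    if candidates:
        # Core (2): the core with the fewest hits to its LRU blocks.
        loser = min(candidates, key=lambda i: cores[i].lru_hits)
        # "If (1) > (2)": the estimated gain exceeds the estimated loss.
        if cores[gainer].shadow_tag_hits > cores[loser].lru_hits:
            cores[gainer].max_blocks_per_set += 1
            cores[loser].max_blocks_per_set -= 1  # assumption, see lead-in
            decision = (gainer, loser)
    # Assumption: counters restart for the next re-evaluation period.
    for core in cores:
        core.shadow_tag_hits = 0
        core.lru_hits = 0
    return decision
```

In a simulator, `reevaluate` would be called once per re-evaluation period (every 2000 cycles according to the slide), with the counters updated on each shadow-tag hit and each hit to an LRU block in between.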
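
The comparison metric from the "Results" slide, the harmonic mean of the per-core IPCs, can be computed as in the sketch below. The function name `harmonic_mean_ipc` and the IPC values in the usage note are hypothetical and are not results from the talk.

```python
def harmonic_mean_ipc(ipcs):
    """Harmonic mean of per-core IPC values: n / sum(1 / ipc_i)."""
    if not ipcs or any(ipc <= 0 for ipc in ipcs):
        raise ValueError("need at least one IPC value, all positive")
    return len(ipcs) / sum(1.0 / ipc for ipc in ipcs)
```

For example, with made-up IPCs of 2.0, 2.0, 1.0 and 1.0 on four cores, the harmonic mean is 4 / (0.5 + 0.5 + 1 + 1) = 4/3, lower than the arithmetic mean of 1.5. The harmonic mean is pulled toward the slowest cores, so a scheme cannot score well by speeding up one core at the expense of the others.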