An Adaptive Shared/Private
NUCA Cache Partitioning
Scheme for CMPs
Haakon Dybdahl, Per Stenström
HPCA 2007
1
CMP Caching
• Extremes
– Private and shared caches
• NUCA organizations
• Shared Caches
Chang and Sohi, ISCA ’06
– Adaptive vs uncontrolled sharing
• Pollution issues
So, is custom cache partitioning better?
2
Adaptive Partitioning
• Dynamic sharing of last-level caches among
cores
– Private and shared cache partitions
– Who needs more, gets more!
– Overall goal – minimize total cache misses
3
Issues to be considered?
• How to estimate private/shared space
for a core?
• How to share the “shared space” among
cores?
• Replacement policy for shared spaces?
4
Private/Shared Cache Partition
Size
• Private partition: increase/decrease
blocks per set, keep # of sets constant
5
Private/Shared Cache Partition
Size
• Shared partition
– Estimate relative gain
– Estimate the misses that could be avoided by
adding one block per set
– Estimate the additional misses incurred by
removing one block per set
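A minimal sketch of how the two estimates could be gathered for one core in one set. It assumes an LRU-ordered set plus a single shadow tag that remembers the most recently evicted block; the class name `CoreSetMonitor` and its fields are hypothetical, not taken from the paper.

```python
from collections import OrderedDict


class CoreSetMonitor:
    """One core's blocks in one cache set, plus a single shadow tag (illustrative)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, ordered from LRU to MRU
        self.shadow_tag = None       # tag of the most recently evicted block
        self.shadow_hits = 0         # misses that one extra block per set would avoid
        self.lru_hits = 0            # hits that one fewer block per set would lose

    def access(self, tag):
        """Return True on a hit, False on a miss, updating both counters."""
        if tag in self.blocks:
            # A hit in the LRU position of a full set would be a miss with one less block.
            if len(self.blocks) == self.ways and next(iter(self.blocks)) == tag:
                self.lru_hits += 1
            self.blocks.move_to_end(tag)  # promote to MRU
            return True
        # Miss: a shadow-tag match means one more block would have kept this one.
        if tag == self.shadow_tag:
            self.shadow_hits += 1
            self.shadow_tag = None
        self.blocks[tag] = None
        if len(self.blocks) > self.ways:
            victim, _ = self.blocks.popitem(last=False)  # evict the LRU block
            self.shadow_tag = victim
        return False
```

Feeding the access sequence A, B, C, A, C to a 2-way monitor yields one shadow-tag hit (the second A) and one LRU hit (the second C).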
6
H/W Support
• Core Id
• Shadow Tags
• Counters
• Max # of blocks in a set
• Cost:
7
[Figure labels: Shadow Tags, Core ID, Counters]
Relative comparisons
• Avoiding cache misses?
– Shadow tags: one per set per core
• Hits in shadow tags
– Hits to LRU blocks
• Re-evaluation
– The core with the most hits to shadow tags (1)
is compared with
the core with the fewest hits to its LRU block (2)
• Done every 2000 cycles
– If (1) > (2), one cache block per set is added to core (1)
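The re-evaluation step can be sketched as below. The slide only says that a block per set is added to core (1); taking that block from core (2) and never shrinking a core below one block per set are assumptions made here to keep the total constant. The function name `reevaluate` and its arguments are hypothetical.

```python
def reevaluate(shadow_hits, lru_hits, max_blocks):
    """Return the new per-core limits on blocks per set after one re-evaluation.

    shadow_hits[c]: hits in core c's shadow tags since the last re-evaluation
    lru_hits[c]:    hits to core c's LRU blocks since the last re-evaluation
    max_blocks[c]:  current maximum number of blocks per set for core c
    """
    cores = range(len(max_blocks))
    gainer = max(cores, key=lambda c: shadow_hits[c])  # core (1) on the slide
    loser = min(cores, key=lambda c: lru_hits[c])      # core (2) on the slide
    new_limits = list(max_blocks)
    # Assumption: the extra block comes from core (2), which keeps at least one block.
    if (gainer != loser
            and shadow_hits[gainer] > lru_hits[loser]
            and new_limits[loser] > 1):
        new_limits[gainer] += 1
        new_limits[loser] -= 1
    return new_limits
```

With `shadow_hits = [10, 2, 3, 1]`, `lru_hits = [8, 4, 6, 7]` and four blocks per core, core 0 gains a block and core 1 gives one up; if no core's shadow-tag hits exceed the smallest LRU-hit count, the limits stay unchanged.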
8
Managing Partitions
• Private partition
– LRU replacement
• Some key events:
– Cache hit in private L3
• Block found, classified as MRU
– Cache hit in neighboring L3
• All neighboring caches checked in parallel
• Block brought to private L3, LRU replacement
• Block evicted from private moved to shared $(???)
– Cache miss
• Block fetched from memory, placed in private $
• LRU block from private moved to shared
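The three key events above can be sketched for a single set as follows. This is a simplified model under stated assumptions: one shared pool stands in for the shared partitions of all neighboring L3 slices, and the shared pool uses plain LRU as a placeholder because the actual shared replacement algorithm is not spelled out in this text. `PartitionedL3` and its method names are hypothetical.

```python
from collections import OrderedDict


class PartitionedL3:
    """Single-set model of private partitions plus one shared pool (illustrative)."""

    def __init__(self, num_cores, private_ways, shared_ways):
        self.private = [OrderedDict() for _ in range(num_cores)]  # LRU -> MRU per core
        self.shared = OrderedDict()                               # LRU -> MRU
        self.private_ways = private_ways
        self.shared_ways = shared_ways

    def _insert_private(self, core, tag):
        """Place a block in the core's private partition, demoting its LRU block."""
        priv = self.private[core]
        priv[tag] = None
        if len(priv) > self.private_ways:
            victim, _ = priv.popitem(last=False)   # LRU block leaves the private partition
            self.shared[victim] = None             # ... and moves to the shared space
            self.shared.move_to_end(victim)
            if len(self.shared) > self.shared_ways:
                self.shared.popitem(last=False)    # placeholder policy: plain LRU

    def access(self, core, tag):
        priv = self.private[core]
        if tag in priv:
            priv.move_to_end(tag)                  # hit in private L3: block becomes MRU
            return "private_hit"
        if tag in self.shared:
            del self.shared[tag]                   # hit outside the private partition:
            self._insert_private(core, tag)        # block is brought into private L3
            return "shared_hit"
        self._insert_private(core, tag)            # miss: fetched from memory into private
        return "miss"
```

In this model a block evicted from a private partition remains reachable through the shared pool until it is displaced, which is the behavior the "moved to shared" bullets describe.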
9
Shared partition block
replacement Algorithm
10
Results
• Single threaded workloads on each core
• 4 MB shared L3, 1 MB private L3
• Workload characterization
– Last level cache sensitive/insensitive
• Overall goal: maximize the harmonic mean (HM) of the
IPCs of all 4 cores
– Forms basis for comparison
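For reference, the harmonic mean of per-core IPCs can be computed as in this short sketch. The function name `harmonic_mean_ipc` and the sample values in the note below are illustrative, not results from the paper.

```python
def harmonic_mean_ipc(ipcs):
    """Harmonic mean of per-core IPC values; all values must be positive."""
    if not ipcs or any(ipc <= 0 for ipc in ipcs):
        raise ValueError("need at least one IPC value, all greater than zero")
    return len(ipcs) / sum(1.0 / ipc for ipc in ipcs)
```

Because the harmonic mean is dominated by the slowest cores, hypothetical IPCs of 2.0, 2.0, 1.0 and 1.0 give 4/3 rather than the arithmetic mean of 1.5, so a scheme cannot score well by speeding up some cores at the expense of others.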
11
Speedups
For last-level-cache-sensitive
benchmarks
12
Larger Caches
13
8 MB L3
Technology scaling
• At smaller technology nodes, delay becomes more dominant
14
Comparison with Chang and Sohi, ISCA ’06
• Uncontrolled vs adaptive partitioning
15
Summary
• Adaptive cache partitioning gives you:
– Better performance
– Less interference
– Improved sharing
• Can do more with less (cache)
16