Accurate and Complexity-Effective Spatial Pattern Prediction
AENAO: Power Aware Memory Coherence & Hierarchies for Servers
http://eecg.toronto.edu/~aenao
Computer Architecture Lab
University of Toronto
at
Chi Chen
Se-Hyun Yang
Babak Falsafi
Andreas Moshovos
Motivation – Variation in Spatial Locality
Caches Exploit Spatial Locality via Block Size
“One Size Fits All” Solution
Prefetch Nearby Data → Improve Performance
Large enough for prefetching
Small enough to avoid memory link saturation
Opportunity
Variation Within and Across Applications
If the “Best Block Size” were known:
1. Prefetch even further → Higher Performance
2. “Turn off” unused data in cache → Lower Leakage Power
CALCM
This Work
Dynamic Spatial Pattern Prediction
Leakage Power Reduction
• Sub-blocks of a block as a Group
• Place “unused” block parts in low leakage state
Prefetching
• Consecutive Memory Blocks as a Group
• Selectively Prefetch Blocks Upon First Access in Group
Key Contribution: PC + Offset Within Group
• Quick Learning
• Compact Representation
• High Coverage
How Well it Works
Spatial Pattern Predictor (SPP)
• 256-entry Tag-Less Direct-Mapped
• ~95% coverage
L1 Data Leakage Energy Reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation
Prefetching w/ 1024 byte Group
• Up to 2x speedup and 56% Average
• Conventional Cache: 14% Slowdown
Outline
Conventional Cache: Optimization Opportunities
Variation in Spatial Locality
Prediction Framework
Prior Work
Results
Optimization Opportunity #1
Conventional Cache
typedef struct person {
    char name[20];
    …
    int age;
    int isAdult;
    struct person *next;
} person; // total 64 bytes
// do something …
while (people) {
    if (people->age >= 21)
        people->isAdult = TRUE;
    people = people->next;
}
L1D with 64-Byte cache lines
[Figure: three 64-byte cache lines, each filled on a miss; only the age, isAdult, and next fields are touched, and the rest of each line stays untouched]
Resident untouched data → Wasteful Leakage
Optimization Opportunity #2
Conventional Cache
struct person {
    char name[20];
    …
    int age;
    int isAdult;
} people[LARGE];
// do something …
for (i = 0; i < LARGE; i++) {
    if (people[i].age >= 21)
        people[i].isAdult = TRUE;
}
L1D with 64-Byte cache lines
[Figure: consecutive cache lines holding people[] elements, divided into Group #1 and Group #2; only the age and isAdult fields of each element are touched]
Detect Access Patterns at Group Level
Selectively Prefetch Same Block Members
Improve Performance w/o Saturating Memory
Variation in Spatial Locality
[Figure: stacked bars for facerec, gcc, mcf, and vortex; y-axis: share of all cache lines touched (0% to 100%), segmented by the fraction of each line used (1/8 through 8/8). Average line usage labels on the chart: 40%, 89%, 26%, 48%]
Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines
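The usage metric on this slide can be made concrete with a short C sketch. It assumes each 64B line is tracked as eight 8-byte chunks with one "touched" bit per chunk, which matches the 1/8 to 8/8 bins; the names usage_stats_t, count_touched, record_eviction, and average_usage are hypothetical and do not come from the original study.

```c
#include <assert.h>
#include <stdint.h>

/* Running totals over all evicted lines (hypothetical bookkeeping). */
typedef struct {
    unsigned long evictions;        /* number of lines evicted so far */
    unsigned long touched_eighths;  /* total touched 8-byte chunks    */
} usage_stats_t;

/* Count how many of the eight chunks of a line were touched. */
unsigned count_touched(uint8_t touched_mask) {
    unsigned n = 0;
    while (touched_mask) {
        n += touched_mask & 1u;
        touched_mask >>= 1;
    }
    return n;
}

/* Called when a line leaves the cache: fold its usage into the totals. */
void record_eviction(usage_stats_t *s, uint8_t touched_mask) {
    assert(s != 0);
    s->evictions += 1;
    s->touched_eighths += count_touched(touched_mask);
}

/* Fraction of fetched data that was used before eviction (0.0 to 1.0). */
double average_usage(const usage_stats_t *s) {
    assert(s != 0);
    if (s->evictions == 0)
        return 0.0;
    return (double)s->touched_eighths / (8.0 * (double)s->evictions);
}
```

Under these assumptions, one fully used line plus one line with two touched chunks averages to 10/16 of the fetched data.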
Prediction Framework
Minimum Fetch Unit (MFU):
• replacement unit of cache
• e.g., cache line or sub block
Spatial Group:
• group of adjacent MFUs
• indexed by logical tag
Spatial Pattern:
• reference pattern of a spatial group, e.g., 1 0 ... 1
Spatial Group Generation:
• starts with a new logical tag
[Figure: access timeline Tag0, Tag0, ..., Tag1, Tag1, Tag1, ...; a generation starts where a new logical tag appears]
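As a rough illustration of the framework terms, the C sketch below splits an address into a logical tag and an offset within the spatial group, and records an access as one bit of the spatial pattern. The sizes (64-byte MFUs, 1024-byte groups) and the function names are assumptions chosen for the example, not definitions from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes: cache lines as MFUs, 1024-byte spatial groups. */
enum {
    MFU_BYTES      = 64,
    GROUP_BYTES    = 1024,
    MFUS_PER_GROUP = GROUP_BYTES / MFU_BYTES   /* 16, fits a 16-bit pattern */
};

/* Logical tag: identifies the spatial group that contains addr. */
uint64_t logical_tag(uint64_t addr) {
    return addr / GROUP_BYTES;
}

/* Index of addr's MFU within its spatial group (0 .. MFUS_PER_GROUP - 1). */
unsigned group_offset(uint64_t addr) {
    return (unsigned)((addr % GROUP_BYTES) / MFU_BYTES);
}

/* Record an access in the group's spatial pattern, one bit per MFU. */
uint16_t mark_access(uint16_t pattern, uint64_t addr) {
    unsigned off = group_offset(addr);
    assert(off < MFUS_PER_GROUP);
    return (uint16_t)(pattern | (1u << off));
}
```

With these sizes, address 0x1234 falls in group 4 at offset 8, so mark_access(0, 0x1234) sets bit 8 of the pattern.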
Spatial Pattern Predictor
[Figure: SPP block diagram. Labels in the figure: Data Cache, Current Pattern Table (CPT), Pattern History Table (PHT), spatial pattern, prediction index, spatial pattern history, PHT entry pointer, prediction register, and a comparator (=?); example table contents are shown as short bit strings]
Prediction Index: 32 bits, formed from PC and SPG Offset
Spatial Pattern Prediction
Current Pattern Table records patterns
Pattern History Table stores captured patterns
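The interplay of the two tables can be sketched in C as follows. This is a minimal model under stated assumptions: a 256-entry tag-less direct-mapped PHT, 16-bit patterns, and a made-up hash in spp_index; the type and function names (spp_t, cpt_entry_t, spp_begin, spp_touch, spp_end) are hypothetical and the real hardware organization may differ.

```c
#include <assert.h>
#include <stdint.h>

enum { PHT_ENTRIES = 256, PATTERN_BITS = 16 };

/* Pattern History Table: last captured pattern per prediction index. */
typedef struct {
    uint16_t pht[PHT_ENTRIES];
} spp_t;

/* One Current Pattern Table entry: an in-flight group generation. */
typedef struct {
    unsigned pht_index;  /* where to store the pattern when the generation ends */
    uint16_t pattern;    /* MFUs touched so far, one bit each */
} cpt_entry_t;

/* Hypothetical hash of PC and group offset into the tag-less PHT. */
unsigned spp_index(uint32_t pc, unsigned offset) {
    return ((pc >> 2) ^ (offset * 31u)) % PHT_ENTRIES;
}

/* First access to a group: start recording and return the predicted pattern. */
uint16_t spp_begin(const spp_t *p, cpt_entry_t *e, uint32_t pc, unsigned offset) {
    assert(offset < PATTERN_BITS);
    e->pht_index = spp_index(pc, offset);
    e->pattern = (uint16_t)(1u << offset);
    /* The MFU being accessed is always part of the prediction. */
    return (uint16_t)(p->pht[e->pht_index] | e->pattern);
}

/* Later access within the same generation: update the current pattern. */
void spp_touch(cpt_entry_t *e, unsigned offset) {
    assert(offset < PATTERN_BITS);
    e->pattern = (uint16_t)(e->pattern | (1u << offset));
}

/* Generation ends (e.g., on eviction): store the captured pattern. */
void spp_end(spp_t *p, const cpt_entry_t *e) {
    p->pht[e->pht_index] = e->pattern;
}
```

In this model, the second generation triggered by the same PC at the same group offset receives the pattern captured during the first one, which is one way to picture the "quick learning" claim.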
Prior Work
Static profiling, V. Vleet, et al. ICCD 1999
Adjustable block size, Dubnicki & LeBlanc. ISCA 1992
Fetching adjacent cache lines, Temam & Jegou. ICS 1994
Dual cache, Gonzalez, Aliagas & Valero. ICS 1995
Spatial Locality Detection Table, Johnson, Merten & Hwu. MICRO 1998
Spatial Footprint Predictor (SFP), Kumar & Wilkerson. ISCA 1998
Key Difference is Prediction Handle: PC + Group Offset
1. Compact Representation
2. Quick Learning
3. High Coverage
Results Overview
Predictor Performance Statistics
Leakage Power Reduction
Performance Improvement w/ Prefetching
Methodology
SimpleScalar simulator
SPEC CPU2000
Simulated to completion
Performance impact evaluation
Alpha binaries + reference inputs
Predictor performance evaluation
64KB 2-way L1D/L1I cache, 2-cycle latency
2MB 8-way L2 cache, 12-cycle latency
Skipped 10B and simulated next 500M instructions
Energy reduction evaluation
SPICE w/ 70nm CMOS technology & 1V supply voltage
Practical Predictor: Performance
[Figure: stacked bars of % of perfect predictions (0% to 160%) for facerec, gcc, mcf, and vortex, split into Training, Over-Prediction, Under-Prediction, and Correct Prediction; 256-entry predictors organized as A: 16-way, B: DM, C: FA]
256-entry tag-less direct-mapped: average prediction accuracy of 96%
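One plausible way to score a prediction against the pattern actually observed is sketched below in C. The definitions are an assumption for illustration: an MFU counts as correct if predicted and used, over-predicted if predicted but unused, and under-predicted if used but not predicted. The slides do not spell out the exact accounting, and outcome_t, popcount16, and classify are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Per-generation tally of prediction outcomes, counted in MFUs. */
typedef struct {
    unsigned correct;  /* predicted and actually used   */
    unsigned over;     /* predicted but never used      */
    unsigned under;    /* used but not predicted        */
} outcome_t;

/* Number of set bits in a 16-bit pattern. */
unsigned popcount16(uint16_t v) {
    unsigned n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* Compare a predicted spatial pattern with the observed one. */
outcome_t classify(uint16_t predicted, uint16_t actual) {
    outcome_t o;
    o.correct = popcount16((uint16_t)(predicted & actual));
    o.over    = popcount16((uint16_t)(predicted & (uint16_t)~actual));
    o.under   = popcount16((uint16_t)((uint16_t)~predicted & actual));
    return o;
}
```

Under this accounting, over-predictions add to the total on top of the used MFUs, which would explain why a chart normalized to perfect predictions can exceed 100%.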
Predictor Applications
Leakage energy reduction
• Sub blocks as minimum fetch units
• Cache lines as spatial groups
• A cache miss starts a spatial group generation
• Assuming Gated-Ground by Agarwal, Li, & Roy
Spatial group prefetcher
• Cache lines as minimum fetch units
• Adjacent cache lines grouped into spatial groups
• A new logical tag starts a spatial group generation
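Both applications consume the same predicted bit pattern, which the C sketch below tries to show. It assumes 64-byte lines in 1024-byte groups for prefetching and eight sub-blocks per line for leakage control; prefetch_targets and gated_subblocks are hypothetical helper names, and the actual gating circuit is outside the scope of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_BYTES = 64, LINES_PER_GROUP = 16 };  /* hypothetical 1024-byte group */

/* Prefetcher: list the addresses of predicted lines other than the
 * demand-fetched one. Returns how many addresses were written to out[]. */
size_t prefetch_targets(uint64_t group_base, uint16_t predicted,
                        unsigned demand_line, uint64_t out[LINES_PER_GROUP]) {
    size_t n = 0;
    assert(demand_line < LINES_PER_GROUP);
    for (unsigned i = 0; i < LINES_PER_GROUP; i++) {
        if (i != demand_line && (predicted & (1u << i)))
            out[n++] = group_base + (uint64_t)i * LINE_BYTES;
    }
    return n;
}

/* Leakage control: sub-blocks outside the predicted pattern are the
 * candidates for the low-leakage state (one bit per sub-block). */
uint8_t gated_subblocks(uint8_t predicted) {
    return (uint8_t)~predicted;
}
```

For a group based at 0x4000 with predicted lines 2, 3, and 5 and a demand access to line 2, this sketch yields prefetches for 0x40C0 and 0x4140.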
Leakage Energy Reduction
[Figure: relative leakage power (0% to 100%, lower is better) and execution time increase (lower is better) for facerec, gcc, mcf, vortex, and AVG; execution time increase labels on the chart: 5%, <1%, ~2%]
Up to 73% leakage energy reduction
~40% average leakage energy reduction
< 1% average performance degradation
Performance Improvement
[Figure: performance improvement (-50% to 150%) for facerec, gcc, mcf, vortex, and AVG; series: SPG 1024, SPG 512, CONV. 1024, CONV. 512]
Up to 2x speedup with 1024B spatial groups
~60% average speedup with 1024B spatial groups
Summary
Spatial Pattern Predictor (SPP)
• Key Contribution: PC + Group Offset
• Small and Effective, High Coverage
• 256-entry Tag-Less Direct-Mapped
• ~95% coverage
L1 Data Leakage Energy Reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation
Prefetching w/ 1024 byte Group
• Up to 2x speedup and 56% Average
• Conventional Cache: 14% Slowdown
Prediction Index
[Figure: % of perfect predictions (0% to 160%) with infinite tables for facerec, gcc, mcf, and vortex, split into Training, Over-Prediction, Under-Prediction, and Correct Prediction; prediction index A: PC, B: PC+SPG ID, C: PC+SPG OFFSET, D: PC+ADDR]
PC + SPG offset yields high prediction accuracy
PC + SPG offset has low prediction memory requirements
Contributions
Spatial Pattern Predictor (SPP)
• 256-entry Tag-Less Direct-Mapped
• ~95% coverage
Leakage Energy Reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation
Processor Performance Improvement
• Up to 2x speedup
Variations in Spatial Locality
[Figure: percentage of all cache line usages (0% to 100%) in each usage bin (<=13%, 14-25%, 26-38%, 39-50%, 51-63%, 64-75%, 76-88%, 89-100%) for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex]
Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines
Prediction Index
[Figure: percent of perfect predictions (0% to 160%) for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, split into Training, Underprediction, Overprediction, and Correct Prediction; prediction index A: PC only, B: PC+SPG ID, C: PC+SPG OFFSET, D: PC+ADDR]
PC + SPG offset yields high prediction accuracy
PC + SPG offset has low prediction memory requirements
Predictor Memory Organization
[Figure: percent of perfect predictions (0% to 160%) for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, split into Training, Underprediction, Overprediction, and Correct Prediction; A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA]
256-entry tag-less direct-mapped yields average prediction accuracy of 96%
Spatial Group Size (1/2)
[Figure: percentage of perfect predictions (0% to 160%) for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, split into Training, Overprediction, Underprediction, and Correct Prediction; all with 8B Fetch Unit, A: 16B Spatial Group, B: 32B Spatial Group, C: 64B Spatial Group, D: 128B Spatial Group, E: 256B Spatial Group]
Spatial Group Size (2/2)
[Figure: percentage of perfect predictions (0% to 160%) for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, split into Training, Overprediction, Underprediction, and Correct Prediction; A: 32B Spatial Group, 8B Fetch Unit; B: 64B Spatial Group, 8B Fetch Unit; C: 128B Spatial Group, 8B Fetch Unit; D: 128B Spatial Group, 64B Fetch Unit; E: 256B Spatial Group, 64B Fetch Unit; F: 512B Spatial Group, 64B Fetch Unit]
Predictor Memory Organization
[Figure: percentage of perfect predictions (0% to 160%) for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, split into Training, Overprediction, Underprediction, and Correct Prediction; predictor size A: 8-entry, B: 16-entry, C: 32-entry, D: 64-entry, E: 128-entry, F: 256-entry, G: INF]
Leakage Energy Reduction
[Figure: fraction of baseline leakage dissipation (0% to 100%) and execution time increase (axis mark at 5%) for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex, and AVG]
Up to 73% leakage energy reduction
~40% average leakage energy reduction
< 1% average performance degradation
Performance Improvement (%)
Benchmark  SPG 1024B  SPG 512B  CONV. 1024B  CONV. 512B
ammp            -25        10          -63         -41
art             305       121           96          32
bzip              8         6          -49         -43
equake           99        59          -41         -34
facerec         103        58           -3         -13
fma3d             0         0           -9          -9
gap              47        31           31          20
gcc               1         1           -2          -2
lucas            51        34          -67         -23
mcf              67        38          -32         -27
mgrid            53        36           12           6
vortex            1         1          -43         -27
AVG              59        33          -14         -13
Up to 2x speedup with 1024B spatial groups
~60% average speedup with 1024B spatial groups
Predictor Memory Organization
[Figure: % of perfect predictions (0% to 160%) for facerec, gcc, mcf, and vortex, split into Training, Over-Prediction, Under-Prediction, and Correct Prediction; A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA]
256-entry tag-less direct-mapped: average prediction accuracy of 96%