Spatial Pattern Prediction

Accurate and Complexity-Effective
Spatial Pattern Prediction
AENAO: Power Aware Memory Coherence &
Hierarchies for Servers
http://eecg.toronto.edu/~aenao
Computer Architecture Lab
University of Toronto
at
Chi Chen
Se-Hyun Yang
Babak Falsafi
Andreas Moshovos

Motivation – Variation in Spatial Locality
• Caches Exploit Spatial Locality via Block Size
  • Prefetch Nearby Data -> Improve Performance
• “One Size Fits All” Solution
  • Large enough for prefetching
  • Small enough to avoid memory link saturation
• Opportunity
  • Variation Within and Across Applications
If “Best Block Size” was known:
1. Prefetch even further -> Higher Performance
2. “Turn-off” unused data in cache -> Lower Leakage Power
CALCM

This Work
• Dynamic Spatial Pattern Prediction
• Leakage Power Reduction
  • Sub-blocks of a block as a Group
  • Place “unused” block parts in low leakage state
• Prefetching
  • Consecutive Memory Blocks as a Group
  • Selectively Prefetch Blocks Upon First Access in Group
• Key Contribution: PC + Offset Within Group
  -> Quick Learning
  -> Compact Representation
  -> High Coverage

How Well it Works
• Spatial Pattern Predictor (SPP)
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• L1 Data Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Prefetching w/ 1024 byte Group
  • Up to 2x speedup and 56% Average
  • Conventional Cache: 14% Slowdown

Outline
• Conventional Cache: Optimization Opportunities
• Variation in Spatial Locality
• Prediction Framework
• Prior Work
• Results

Optimization Opportunity #1
Conventional Cache
    typedef struct person {
        char name[20];
        …
        int age;
        int isAdult;
        struct person* next;
    } person; // total 64 bytes
    // do something …
    while ( people ) {
        if ( people->age >= 21 )
            people->isAdult = TRUE;
        people = people->next;
    }
L1D with 64-Byte cache lines
[Figure: three cache lines, each brought in by a miss. In every line only the age, isAdult and next fields are touched; the rest of the line is untouched.]
Resident untouched data -> Wasteful Leakage
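The waste in the linked-list loop can be estimated by marking which parts of one 64-byte line the loop body touches. The sketch below is illustrative only: the `elided` padding stands in for the fields the slide omits, the 8-byte sub-block size is an assumption, and the helper names (`mark_touched`, `loop_footprint`, `touched_subblocks`) are hypothetical. It also assumes each node starts on a line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     64u  /* cache line size used on the slide */
#define SUBBLOCK_BYTES  8u  /* assumed sub-block granularity */

/* Hypothetical layout: the slide elides the middle fields, so `elided` is
 * only a stand-in that pads the struct to 64 bytes on a typical LP64 target. */
struct person {
    char name[20];
    char elided[28];
    int age;
    int isAdult;
    struct person *next;
};

/* Set one bit per sub-block overlapped by bytes [offset, offset + size). */
static uint8_t mark_touched(uint8_t mask, size_t offset, size_t size)
{
    assert(size > 0 && offset + size <= LINE_BYTES);
    size_t first = offset / SUBBLOCK_BYTES;
    size_t last = (offset + size - 1) / SUBBLOCK_BYTES;
    for (size_t b = first; b <= last; b++)
        mask = (uint8_t)(mask | (1u << b));
    return mask;
}

/* Footprint of one loop iteration, assuming the node starts on a line
 * boundary: the loop reads age and next and may write isAdult. */
static uint8_t loop_footprint(void)
{
    uint8_t mask = 0;
    mask = mark_touched(mask, offsetof(struct person, age), sizeof(int));
    mask = mark_touched(mask, offsetof(struct person, isAdult), sizeof(int));
    mask = mark_touched(mask, offsetof(struct person, next),
                        sizeof(struct person *));
    return mask;
}

/* Count the sub-blocks marked in a footprint. */
static unsigned touched_subblocks(uint8_t mask)
{
    unsigned n = 0;
    for (; mask; mask = (uint8_t)(mask & (mask - 1)))
        n++;
    return n;
}
```

Under this assumed layout the three fields fall in the last two 8-byte sub-blocks, so six of the eight sub-blocks of every resident line would never be referenced by the loop.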

Optimization Opportunity #2
Conventional Cache
    struct person {
        char name[20];
        …
        int age;
        int isAdult;
    } people[LARGE];
    // do something …
    for ( i = 0; i < LARGE; i++ ) {
        if ( people[i].age >= 21 )
            people[i].isAdult = TRUE;
    }
L1D with 64-Byte cache lines
[Figure: consecutive cache lines arranged into Group #1 and Group #2; the age and isAdult fields are touched at the same positions in each group.]
Detect Access Patterns at Group Level ->
Selectively Prefetch Same Block Members ->
Improve Performance w/o Saturating Memory

Variation in Spatial Locality
[Figure: stacked bars for facerec, gcc, mcf and vortex. Y-axis: All Cache Lines Touched, 0% to 100%. Each bar is split by the fraction of the line used before eviction, from 1/8 to 8/8. Average Line Usage labels on the slide: 40%, 89%, 26%, 48%.]
Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines
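The line-usage statistic on this slide can be reproduced from per-line bit masks collected at eviction time. The sketch below assumes a 64-byte line tracked as eight 8-byte pieces (inferred from the 1/8 to 8/8 bins); `popcount8` and `average_line_usage` are hypothetical helper names, not part of any simulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SUBBLOCKS_PER_LINE 8u  /* assumed: 64B line tracked as eight 8B pieces */

/* Count the set bits in an 8-bit usage mask. */
static unsigned popcount8(uint8_t x)
{
    unsigned n = 0;
    for (; x; x = (uint8_t)(x & (x - 1)))
        n++;
    return n;
}

/* hist[k] counts evicted lines with exactly k pieces touched (k = 0..8).
 * Returns the average fraction of a line used before eviction. */
static double average_line_usage(const uint8_t *masks, size_t n,
                                 unsigned hist[SUBBLOCKS_PER_LINE + 1])
{
    unsigned long total = 0;
    for (size_t k = 0; k <= SUBBLOCKS_PER_LINE; k++)
        hist[k] = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned used = popcount8(masks[i]);
        hist[used]++;
        total += used;
    }
    return n ? (double)total / ((double)n * SUBBLOCKS_PER_LINE) : 0.0;
}
```

The histogram corresponds to the stacked segments of each bar, and the returned value corresponds to the Average Line Usage label.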
Prediction Framework
Minimum Fetch Unit (MFU):
• replacement unit of cache
• e.g., cache line or sub block
Spatial Group:
• group of adjacent MFUs
• indexed by logical tag
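Spatial Pattern:
• reference pattern of a spatial group (a bit vector such as 1 0 ... 1)
Spatial Group Generation:
• starts with a new logical tag
[Figure: access stream over Time: Tag0, Tag0, ..., Tag1, Tag1, Tag1, ...; each change of logical tag starts a new generation.]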
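The framework terms can be made concrete with a small sketch that maps an address to a logical tag and an MFU index, and accumulates the spatial pattern of the current generation. It tracks a single generation at a time, which is a simplification, and all names (`spg_config`, `generation`, `record_access`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes in bytes; both are assumed to be powers of two. */
struct spg_config {
    uint32_t mfu_bytes;    /* minimum fetch unit */
    uint32_t group_bytes;  /* spatial group */
};

/* State of the spatial group generation currently being observed. */
struct generation {
    uint64_t tag;      /* logical tag of the group */
    uint32_t pattern;  /* bit i set: MFU i was referenced */
    int valid;
};

static uint64_t logical_tag(const struct spg_config *c, uint64_t addr)
{
    return addr / c->group_bytes;
}

static unsigned mfu_index(const struct spg_config *c, uint64_t addr)
{
    return (unsigned)((addr % c->group_bytes) / c->mfu_bytes);
}

/* Record one access. Returns 1 if it started a new generation,
 * i.e. the access carried a logical tag different from the current one. */
static int record_access(struct generation *g, const struct spg_config *c,
                         uint64_t addr)
{
    assert(c->mfu_bytes > 0 && c->group_bytes / c->mfu_bytes <= 32);
    uint64_t tag = logical_tag(c, addr);
    int started = !g->valid || g->tag != tag;
    if (started) {
        g->tag = tag;
        g->pattern = 0;
        g->valid = 1;
    }
    g->pattern |= 1u << mfu_index(c, addr);
    return started;
}
```

With 8-byte MFUs and 64-byte groups, accesses to 0x1030 and 0x1038 build the pattern 0xC0 under tag 0x40, and an access to 0x1040 starts a new generation.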

Spatial Pattern Predictor
[Figure: predictor block diagram. Labels: Data Cache; Current Pattern Table (CPT), with Spatial Pattern and PHT Entry Pointer fields; Pattern History Table (PHT), with Index and History fields and a tag comparison (=?); Spatial Pattern Register; Prediction Spatial Pattern. Prediction Index: 32 bits, formed from PC and SPG Offset. The table cells show example bit patterns.]
Spatial Pattern Prediction
• Current Pattern Table records patterns
• Pattern History Table stores captured patterns
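A minimal sketch of how the two tables could interact, assuming a 256-entry tag-less direct-mapped PHT and eight MFUs per group. The index hash below (word-aligned PC concatenated with the offset, then truncated) is an assumption made for illustration and is not taken from the slides; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PHT_ENTRIES    256u  /* tag-less, direct-mapped */
#define MFUS_PER_GROUP   8u  /* assumed group geometry */

/* Pattern History Table: last captured pattern per index, no tags. */
struct spp {
    uint8_t pht[PHT_ENTRIES];
};

/* One Current Pattern Table entry: an in-flight generation. */
struct cpt_entry {
    uint8_t pht_index;  /* pointer to the PHT entry to update later */
    uint8_t pattern;    /* pattern recorded so far */
};

/* Assumed hash: PC (4-byte instructions) concatenated with the SPG offset. */
static uint8_t spp_index(uint32_t pc, unsigned offset)
{
    assert(offset < MFUS_PER_GROUP);
    uint32_t key = ((pc >> 2) << 3) | offset;
    return (uint8_t)(key % PHT_ENTRIES);
}

/* Generation start: read the prediction, begin recording in the CPT.
 * A return value of 0 means no history has been captured yet. */
static uint8_t spp_begin(const struct spp *p, struct cpt_entry *e,
                         uint32_t pc, unsigned offset)
{
    e->pht_index = spp_index(pc, offset);
    e->pattern = (uint8_t)(1u << offset);
    return p->pht[e->pht_index];
}

/* Later access within the same generation. */
static void spp_touch(struct cpt_entry *e, unsigned offset)
{
    assert(offset < MFUS_PER_GROUP);
    e->pattern = (uint8_t)(e->pattern | (1u << offset));
}

/* Generation end (for example on eviction): capture the pattern. */
static void spp_end(struct spp *p, const struct cpt_entry *e)
{
    p->pht[e->pht_index] = e->pattern;
}
```

Because the PHT is tag-less, two different (PC, offset) pairs that hash to the same entry simply share it; the sketch accepts that aliasing rather than detecting it.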

Prior Work
• Static profiling, V. Vleet, et al. ICCD 1999
• Adjustable block size, Dubnicki & LeBlanc. ISCA 1992
• Fetching adjacent cache lines, Temam & Jegou. ICS 1994
• Dual cache, Gonzalez, Aliagas & Valero. ICS 1995
• Spatial Locality Detection Table, Johnson, Merten & Hwu. MICRO 1998
• Spatial Footprint Predictor (SFP), Kumar & Wilkerson. ISCA 1998
Key Difference is Prediction Handle: PC + Group Offset
1. Compact Representation
2. Quick Learning
3. High Coverage

Results Overview
• Predictor Performance Statistics
• Leakage Power Reduction
• Performance Improvement w/ Prefetching

Methodology
• SimpleScalar simulator
  • 64KB 2-way L1D/L1I cache, 2-cycle latency
  • 2MB 8-way L2 cache, 12-cycle latency
• SPEC CPU2000
  • Alpha binaries + reference inputs
• Predictor performance evaluation
  • Simulated to completion
• Performance impact evaluation
  • Skipped 10B and simulated next 500M instructions
• Energy reduction evaluation
  • SPICE w/ 70nm CMOS technology & 1V supply voltage

Practical Predictor: Performance
[Figure: stacked bars, y-axis "% of perfect predictions" (0% to 160%), split into Correct Prediction, Under-Prediction, Over-Prediction and Training. Three 256-entry configurations per benchmark: A: 16-way, B: DM (direct-mapped), C: FA (fully associative). Benchmarks: facerec, gcc, mcf, vortex.]
• 256-entry tag-less direct-mapped: average prediction accuracy of 96%

Predictor Applications
• Leakage energy reduction
  • Sub blocks as minimum fetch units
  • Cache lines as spatial groups
  • A cache miss starts a spatial group generation
  • Assuming Gated-Ground by Agarwal, Li, & Roy
• Spatial group prefetcher
  • Cache lines as minimum fetch units
  • Adjacent cache lines grouped into spatial groups
  • A new logical tag starts a spatial group generation
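Both applications consume the same predicted bit vector in different ways. The sketch below shows one plausible reading: for leakage, the complement of the prediction selects sub-blocks to place in a low-leakage state; for prefetching, every predicted line in the group other than the triggering one is fetched. Function names and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leakage use: sub-blocks NOT predicted to be referenced are candidates
 * for the low-leakage state (8 sub-blocks per line assumed). */
static uint8_t subblocks_to_gate(uint8_t predicted)
{
    return (uint8_t)~predicted;
}

/* Prefetch use: list the addresses of predicted lines in the trigger's
 * spatial group, skipping the line that caused the trigger access.
 * Returns the number of addresses written to out[]. */
static size_t prefetch_addresses(uint64_t trigger_addr, uint32_t predicted,
                                 uint32_t line_bytes, uint32_t group_bytes,
                                 uint64_t *out, size_t max_out)
{
    assert(line_bytes > 0 && group_bytes / line_bytes <= 32);
    uint64_t base = trigger_addr - trigger_addr % group_bytes;
    unsigned trigger_line = (unsigned)((trigger_addr % group_bytes) / line_bytes);
    unsigned lines = group_bytes / line_bytes;
    size_t n = 0;
    for (unsigned i = 0; i < lines && n < max_out; i++) {
        if (i != trigger_line && ((predicted >> i) & 1u))
            out[n++] = base + (uint64_t)i * line_bytes;
    }
    return n;
}
```

For a 1024-byte group of 64-byte lines, a trigger at 0x10040 with predicted lines 0, 1 and 3 yields prefetches for 0x10000 and 0x100C0.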

Leakage Energy Reduction
[Figure: bars for facerec, gcc, mcf, vortex and AVG. Axes: Relative Leakage Power (0% to 100%) and Execution Time Increase (0% to 5%). Data labels on the slide: <1%, ~2%.]
• Up to 73% leakage energy reduction
• ~40% average leakage energy reduction
• < 1% average performance degradation

Performance Improvement
[Figure: bars for facerec, gcc, mcf, vortex and AVG, y-axis -50% to 150%. Series: SPG 1024, SPG 512, CONV. 1024, CONV. 512.]
• Up to 2x speedup with 1024B spatial groups
• ~60% average speedup with 1024B spatial groups

Summary
• Spatial Pattern Predictor (SPP)
• Key Contribution: PC + Group Offset
  • Small and Effective, High Coverage
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• L1 Data Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Prefetching w/ 1024 byte Group
  • Up to 2x speedup and 56% Average
  • Conventional Cache: 14% Slowdown

Prediction Index
[Figure: stacked bars, y-axis 0% to 160%, split into Correct Prediction, Under-Prediction, Over-Prediction and Training. Four index choices per benchmark: A: PC, B: PC+SPG ID, C: PC+SPG OFFSET, D: PC+ADDR. Benchmarks: facerec, gcc, mcf, vortex. Infinite Tables.]
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has low prediction memory requirements

Contributions
• Spatial Pattern Predictor (SPP)
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Processor Performance Improvement
  • Up to 2x speedup

Variations in Spatial Locality
[Figure: stacked bars, y-axis "Percentage of All Cache Line Usages" (0% to 100%). Segments by fraction of data used before eviction: <=13%, 14-25%, 26-38%, 39-50%, 51-63%, 64-75%, 76-88%, 89-100%. Benchmarks: ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex.]
Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines

Prediction Index
[Figure: stacked bars, y-axis "Percent of Perfect Predictions" (0% to 160%), split into Correct Prediction, Underprediction, Overprediction and Training. Four index choices per benchmark: A: PC-only, B: PC+SPG ID, C: PC+SPG OFFSET, D: PC+ADDR. Benchmarks: ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex.]
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has a low prediction memory requirement

Predictor Memory Organization
[Figure: stacked bars, y-axis "Percent of Perfect Predictions" (0% to 160%), split into Correct Prediction, Underprediction, Overprediction and Training. Six configurations per benchmark: A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA. Benchmarks: ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex.]
• 256-entry tag-less direct-mapped yields average prediction accuracy of 96%

Spatial Group Size (1/2)
[Figure: stacked bars, y-axis "Percentage of Perfect Predictions" (0% to 160%), split into Correct Prediction, Underprediction, Overprediction and Training. Five configurations per benchmark, all with an 8B fetch unit: A: 16B spatial group, B: 32B spatial group, C: 64B spatial group, D: 128B spatial group, E: 256B spatial group. Benchmarks: ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex.]

Spatial Group Size (2/2)
[Figure: stacked bars, y-axis "Percentage of Perfect Predictions" (0% to 160%), split into Correct Prediction, Underprediction, Overprediction and Training. Six configurations per benchmark: A: 32B spatial group, 8B fetch unit; B: 64B spatial group, 8B fetch unit; C: 128B spatial group, 8B fetch unit; D: 128B spatial group, 64B fetch unit; E: 256B spatial group, 64B fetch unit; F: 512B spatial group, 64B fetch unit. Benchmarks: ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex.]

Predictor Memory Organization
[Figure: stacked bars, y-axis "Percentage of Perfect Predictions" (0% to 160%), split into Correct Prediction, Underprediction, Overprediction and Training. Seven table sizes per benchmark: A: 8-entry, B: 16-entry, C: 32-entry, D: 64-entry, E: 128-entry, F: 256-entry, G: INF. Benchmarks: ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex.]

Leakage Energy Reduction
[Figure: bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex and AVG. Axes: Fraction of Baseline Leakage Dissipation (0% to 100%) and Execution Time Increase (0% to 5%).]
• Up to 73% leakage energy reduction
• ~40% average leakage energy reduction
• < 1% average performance degradation


Performance Improvement
Performance improvement (%) per benchmark. SPG = spatial group prefetching; CONV. = conventional cache with the given block size.

Benchmark   SPG 1024B   SPG 512B   CONV. 1024B   CONV. 512B
ammp              -25         10           -63          -41
art               305        121            96           32
bzip                8          6           -49          -43
equake             99         59           -41          -34
facerec           103         58            -3          -13
fma3d               0          0            -9           -9
gap                47         31            31           20
gcc                 1          1            -2           -2
lucas              51         34           -67          -23
mcf                67         38           -32          -27
mgrid              53         36            12            6
vortex              1          1           -43          -27
AVG                59         33           -14          -13

• Up to 2x speedup with 1024B spatial groups
• ~60% average speedup with 1024B spatial groups
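The AVG entries can be cross-checked against the twelve per-benchmark values. The sketch below recomputes each average as an arithmetic mean rounded to the nearest integer; that the slide used this kind of mean is an assumption, and `rounded_mean` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Per-benchmark values copied from the slide, in the order
 * ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex. */
static const int spg_1024[12]  = {-25, 305, 8, 99, 103, 0, 47, 1, 51, 67, 53, 1};
static const int spg_512[12]   = {10, 121, 6, 59, 58, 0, 31, 1, 34, 38, 36, 1};
static const int conv_1024[12] = {-63, 96, -49, -41, -3, -9, 31, -2, -67, -32, 12, -43};
static const int conv_512[12]  = {-41, 32, -43, -34, -13, -9, 20, -2, -23, -27, 6, -27};

/* Arithmetic mean rounded to the nearest integer (halves away from zero),
 * computed in integer arithmetic to avoid depending on libm. */
static long rounded_mean(const int *v, size_t n)
{
    assert(n > 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    long d = 2 * (long)n;
    if (sum >= 0)
        return (2 * sum + (long)n) / d;
    return -((-2 * sum + (long)n) / d);
}
```

The recomputed means are 59, 33, -14 and -13, matching the AVG entries for the four configurations.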

Predictor Memory Organization
[Figure: stacked bars, y-axis 0% to 160%, split into Correct Prediction, Under-Prediction, Over-Prediction and Training. Six configurations per benchmark: A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA. Benchmarks: facerec, gcc, mcf, vortex.]
• 256-entry tag-less direct-mapped: average prediction accuracy of 96%