Software and Hardware Support for Locality-Aware High Performance Computing
Xiaodong Zhang
National Science Foundation
College of William and Mary
This talk does not necessarily reflect NSF's official opinions.
Acknowledgement
Participants of the project
David Bryan, Jefferson Labs (DOE)
Stefan Kubricht, Vsys Inc.
Song Jiang and Zhichun Zhu, William and Mary
Li Xiao, Michigan State University.
Yong Yan, HP Labs.
Zhao Zhang, Iowa State University.
Sponsors of the project
Air Force Office of Scientific Research
National Science Foundation
Sun Microsystems Inc.
CPU-DRAM Gap
[Chart omitted: relative performance on a log scale, 1980-2000. CPU performance improves about 60% per year, DRAM about 7% per year; the processor-memory gap grows about 50% per year.]
Cache Miss Penalty
A cache miss = executing hundreds of CPU instructions (thousands in the future).
At 2 GHz with an average issue rate of 2.5, the CPU could issue 350 instructions during a 70 ns access latency.
Even a small cache miss rate leads to a high share of memory stall time in total execution time.
On average, memory stalls account for 62% of execution time for SPEC2000.
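A quick sanity check of the 350-instruction figure, as a minimal sketch using only the numbers quoted on this slide:

```python
# Instructions a 2 GHz, 2.5-issue CPU could have executed during one 70 ns miss.
clock_rate_hz = 2e9        # 2 GHz (from the slide)
avg_issue_rate = 2.5       # instructions per cycle (from the slide)
miss_latency_s = 70e-9     # 70 ns DRAM access latency (from the slide)

cycles_stalled = clock_rate_hz * miss_latency_s        # 140 cycles
instructions_lost = cycles_stalled * avg_issue_rate    # 350 instructions
print(f"{cycles_stalled:.0f} cycles, {instructions_lost:.0f} instructions")
```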
I/O Bottleneck is Much Worse
Disk access time is limited by mechanical delays.
A fast Seagate Cheetah X15 disk (15,000 rpm):
average seek time: 3.9 ms; rotational latency: 2 ms;
internal transfer time for a stripe unit (8 KB): 0.16 ms;
total disk latency: 6.06 ms.
External transfer rates increase only about 40% per year.
From disk to DRAM: 160 MB/s (UltraSCSI I/O bus).
Getting 8 KB from disk to DRAM takes 11.06 ms, more than 22 million CPU cycles at 2 GHz!
Memory Hierarchy with Multi-level Caching
[Diagram omitted: CPU with registers and TLB; L1, L2, and L3 caches; CPU-memory bus; DRAM with row buffers and controllers; buffer cache in DRAM; I/O bus and I/O controller; disk with its disk cache. Locality is exploited by the algorithm implementation and compiler at the top, by the microarchitecture in the caches, row buffers, and controllers, and by the operating system in the buffer cache.]
Other System Effects on Locality
Locality exploitation is not guaranteed by the buffers!
Initial and runtime data placement:
static and dynamic data allocation, and interleaving.
Data replacement at different caching levels:
LRU is widely used but sometimes fails.
Locality-aware memory access scheduling:
reorder access sequences to reuse cached data.
Outline
Cache optimization at the application level.
Designing fast and high-associativity caches.
Exploiting multiprocessor cache locality at runtime.
Exploiting locality in the DRAM row buffer.
Fine-grain memory access scheduling.
Efficient replacement in the buffer cache.
Conclusion
Application Software Effort: Algorithm Restructuring for Cache Optimization
Traditional algorithm design means:
giving a sequence of computing steps that minimizes CPU operations.
It ignores:
inherent parallelism and interactions (e.g., ILP, pipelining, and multiprogramming),
the memory hierarchy where data are laid out, and
the increasingly high data access cost.
Mutually Adaptive Between Algorithms and Architecture
Restructuring commonly used algorithms:
effectively utilize caches and the TLB,
minimize cache and TLB misses.
A highly optimized application library is very useful.
Restructuring techniques (a sketch follows this list):
data blocking: grouping data in the cache for repeated use;
data padding to avoid conflict misses;
using registers as fast data buffers.
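A minimal sketch of the data-blocking idea, using matrix multiplication as a stand-in example (the tile size bs is a hypothetical tuning parameter; a real library would choose it to fit the cache and TLB reach of the target machine):

```python
def blocked_matmul(A, B, C, n, bs=64):
    """C += A * B for n x n matrices stored as lists of lists.
    Each bs x bs tile stays cache-resident, so every element brought
    into the cache is reused bs times instead of once."""
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a_ik = A[i][k]                 # register-like temporary
                        c_row, b_row = C[i], B[k]
                        for j in range(jj, min(jj + bs, n)):
                            c_row[j] += a_ik * b_row[j]
```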
Two Case Studies
Bit-reversals:
basic operations in FFT and other applications;
data layout and operations cause large numbers of conflict misses.
Sortings: merge-, quick-, and insertion-sort.
Our library outperforms system-level approaches.
TLB and cache misses are sensitive to the operations.
We know exactly where to pad and block!
Usage of the two libraries (both are open source):
bit-reversals: an alternative in Sun's scientific library;
the sorting codes are used as a benchmark for testing compilers.
Microarchitecture Effort: Exploit DRAM Row Buffer Locality
DRAM features:
high density and high capacity;
low cost but slow access (compared with SRAM);
non-uniform access latency.
The row buffer serves as a fast cache:
its access patterns have received little attention;
reusing buffered data minimizes DRAM latency.
Locality Exploitation in Row Buffer
[Diagram omitted: the same memory hierarchy as before, with the DRAM row buffer highlighted between the CPU-memory bus and the DRAM core.]
DRAM Access = Latency + Bandwidth Time
[Diagram omitted: the processor's request crosses the bus (bandwidth time); the DRAM latency consists of precharge, row access into the row buffer, and column access out of the DRAM core.]
Nonuniform DRAM Access Latency
Case 1: row buffer hit (20+ ns): column access only.
Case 2: row buffer miss, core precharged (40+ ns): row access + column access.
Case 3: row buffer miss, core not precharged (about 70 ns): precharge + row access + column access.
Row buffer misses come from a sequence of accesses to different pages in the same bank. (A small controller model follows.)
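A minimal open-page controller model illustrating the three cases (the 20/40/70 ns figures come from this slide; the per-bank bookkeeping is a simplifying assumption):

```python
class Bank:
    def __init__(self):
        self.open_row = None      # row currently latched in the row buffer
        self.precharged = True    # core precharged, no row open

def access_latency(bank, row):
    """Return the DRAM latency (ns) for one access to `row` in `bank`."""
    if bank.open_row == row:                  # Case 1: row-buffer hit
        return 20                             # column access only
    if bank.precharged:                       # Case 2: miss, core precharged
        bank.open_row, bank.precharged = row, False
        return 40                             # row access + column access
    bank.open_row, bank.precharged = row, False
    return 70                                 # Case 3: precharge + row + column
```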
Amdahl’s Law applies in DRAM
Time (ns) to fetch a 128-byte cache block:
  PC100  (0.8 GB/s): latency 70, bandwidth time 160
  PC2100 (2.1 GB/s): latency 70, bandwidth time  60
  Rambus (6.4 GB/s): latency 70, bandwidth time  20
As the bandwidth improves, DRAM latency will dominate the cache miss penalty.
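The table follows directly from time = latency + block size / bandwidth; a quick check with the bus names and numbers from the slide:

```python
# Fetch time (ns) for a 128-byte block: fixed DRAM latency plus transfer time.
block_bytes = 128
latency_ns = 70
for name, gb_per_s in [("PC100", 0.8), ("PC2100", 2.1), ("Rambus", 6.4)]:
    transfer_ns = block_bytes / gb_per_s   # bytes / (GB/s) comes out in ns
    print(f"{name}: {latency_ns} + {transfer_ns:.0f} = {latency_ns + transfer_ns:.0f} ns")
```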
Row Buffer Locality Benefit
Latency(row-buffer hit) << Latency(row-buffer miss): latency is reduced by up to 67%.
Objective: serve memory requests without accessing the DRAM core as much as possible.
Row Buffer Misses are Surprisingly High
[Chart omitted: row-buffer miss rates under the standard configuration for ijpeg, compress, applu, mgrid, hydro2d, and tomcatv.]
Standard configuration:
conventional cache mapping;
page interleaving for DRAM memories;
32 DRAM banks, 2 KB page size;
SPEC95 and SPEC2000 benchmarks.
What is the reason behind this?
Conventional Page Interleaving
Pages are interleaved across banks: pages 0-3 go to banks 0-3, pages 4-7 again to banks 0-3, and so on.
Address format: [ page index (r bits) | bank (k bits) | page offset (p bits) ]
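A minimal sketch of this address split (the bit widths are parameters; the 2 KB page and 32 bank configuration mentioned earlier corresponds to p = 11 and k = 5):

```python
def conventional_map(addr, p=11, k=5):
    """Split a physical address into (page index, bank, page offset)."""
    page_offset = addr & ((1 << p) - 1)
    bank        = (addr >> p) & ((1 << k) - 1)
    page_index  = addr >> (p + k)
    return page_index, bank, page_offset
```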
Address Mapping Symmetry
Page address:  [ page index (r bits) | bank (k bits) | page offset (p bits) ]
Cache address: [ cache tag (t bits) | cache set index (s bits) | block offset (b bits) ]
Cache-conflicting: same cache set index, different tags.
Row-buffer-conflicting: same bank index, different pages.
Address mapping: the bank index bits fall within the cache set index bits.
Property: for any x and y, if x and y conflict in the cache, they also conflict in the row buffer.
Sources of Misses
Symmetry: invariance in results under transformations.
Address mapping symmetry propagates conflicts from the cache address space to the memory address space:
cache-conflicting addresses are also row-buffer-conflicting addresses;
a cache write-back address conflicts with the address of the block to be fetched into the row buffer.
Thus cache conflict misses are also row-buffer conflict misses.
Breaking the Symmetry by Permutation-based Page Interleaving
The k-bit bank index is XORed with k bits taken from the L2 cache tag; the page index and page offset are unchanged:
[ L2 cache tag | index | bank (k bits) | page offset ]  --XOR k tag bits into the bank field-->  [ page index | new bank (k bits) | page offset ]
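A minimal sketch of the permutation, assuming the same p and k as above and that the low-order k bits of the L2 tag are the ones XORed in; the tag position is derived from placeholder cache widths s = 12 (set index bits) and b = 6 (block offset bits):

```python
def permutation_map(addr, p=11, k=5, s=12, b=6):
    """Permutation-based interleaving: XOR k bits of the L2 tag into the
    bank index; page index and page offset are unchanged, so spatial
    locality within a page is preserved."""
    page_offset = addr & ((1 << p) - 1)
    bank        = (addr >> p) & ((1 << k) - 1)
    page_index  = addr >> (p + k)
    l2_tag      = addr >> (s + b)             # bits above set index + block offset
    new_bank    = bank ^ (l2_tag & ((1 << k) - 1))
    return page_index, new_bank, page_offset
```

Cache-conflicting addresses share the set index but differ in the tag, so the XOR sends them to different banks, while addresses in the same page share the tag and keep the same bank.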
Permutation Property (1)
Conflicting addresses are distributed onto different banks.
[Diagram omitted: four L2-conflicting addresses with tag bits 1000, 1001, 1010, 1011 and identical bank bits 1010 map to the same bank under conventional interleaving; after XORing the tag bits into the bank index they map to four different banks.]
Permutation Property (2)
The spatial locality of memory references is preserved.
[Diagram omitted: addresses within one page share the same tag bits (1000) and bank bits (1010), so both conventional and permutation-based interleaving send them to the same bank.]
Permutation Property (3)
Pages are uniformly mapped onto ALL memory banks (P = page size, C = L2 cache size):

            bank 0         bank 1         bank 2         bank 3
            0,     4P      1P,    5P      2P,    6P      3P,    7P      ...
            C+1P,  C+5P    C,     C+4P    C+3P,  C+7P    C+2P,  C+6P    ...
            2C+2P, 2C+6P   2C+3P, 2C+7P   2C,    2C+4P   2C+1P, 2C+5P   ...
Row-buffer Miss Rates
[Chart omitted: row-buffer miss rates (%) of cache-line, page, swap, and permutation-based interleaving for apsi, wave5, swim, su2cor, hydro2d, mgrid, applu, turb3d, and tomcatv.]

Comparison of Memory Stall Time
[Chart omitted: normalized memory stall time under cache-line, page, swap, and permutation-based interleaving for TPC-C and the SPEC benchmarks above.]

Improvement of IPC
[Chart omitted: normalized IPC under cache-line, page, swap, and permutation-based interleaving for TPC-C, wave5, turb3d, applu, mgrid, hydro2d, su2cor, swim, and tomcatv.]
Where to Break the Symmetry?
Breaking the symmetry at the bottom level (the DRAM address) is most effective:
it is far from the critical path (little overhead), and
it reduces both address conflicts and write-back conflicts.
Our experiments confirm this (a 30% difference).
System Software Effort: Efficient Buffer Cache Replacement
The buffer cache borrows a variable amount of space in DRAM.
Accessing I/O data in the buffer cache is about a million times faster than accessing it on disk.
The performance of data-intensive applications relies on exploiting locality in the buffer cache.
Buffer cache replacement is a key factor.
Locality Exploitation in Buffer Cache
[Diagram omitted: the same memory hierarchy as before, with the DRAM buffer cache highlighted between the CPU-memory bus and the I/O bus.]
The Problem of LRU Replacement
Inability to cope with weak access locality:
file scanning: one-time accessed blocks are not replaced in a timely way;
loop-like accesses: the blocks to be accessed soonest can unfortunately be replaced;
accesses with distinct frequencies: frequently accessed blocks can unfortunately be replaced.
(A small illustration of the loop-like case follows.)
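A tiny illustration of the loop-pattern failure, as a minimal sketch assuming a repeated scan over 101 distinct blocks with a 100-block LRU cache; under that assumption LRU evicts every block just before it is reused:

```python
from collections import OrderedDict

def lru_hits(refs, size):
    """Count hits of an LRU cache of `size` blocks on reference string `refs`."""
    cache, hits = OrderedDict(), 0
    for b in refs:
        if b in cache:
            hits += 1
            cache.move_to_end(b)            # mark as most recently used
        else:
            if len(cache) >= size:
                cache.popitem(last=False)   # evict least recently used
            cache[b] = True
    return hits

loop = list(range(101)) * 10        # repeatedly scan 101 blocks
print(lru_hits(loop, 100))          # 0 hits after ten passes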
Why LRU Fails Sometimes, Yet is So Powerful
• Why does LRU fail sometimes?
• A recently used block will not necessarily be used again, or used again soon.
• Its prediction is based on a single source of information.
• Why is it so widely used?
• Simplicity: an easy and simple data structure.
• It works well for accesses that follow the LRU assumption.
Our Objectives and Contributions
Significant efforts have been made to improve or replace LRU, but they are either
• case by case, or
• burdened with high runtime overhead.
Our objectives:
• Address the limits of LRU fundamentally.
• Retain the low overhead and strong-locality merits of LRU.
Related Work
Aided by user-level hints:
application-hinted caching and prefetching [OSDI, SOSP, ...];
relies on users' understanding of data access patterns.
Detection and adaptation of access regularities:
SEQ, EELRU, DEAR, AFC, UBM [OSDI, SIGMETRICS, ...];
case-by-case oriented approaches.
Tracing and utilizing deeper history information:
LRFU, LRU-k, 2Q [VLDB, SIGMETRICS, SIGMOD, ...];
high implementation cost and runtime overhead.
Observation of Data Flow in LRU Stack
• Blocks are ordered by recency in the LRU stack.
• Blocks enter at the stack top and leave from its bottom; the stack is long and the bottom is the only exit.
[Diagram omitted: an LRU stack. A block evicted from the bottom of the stack should have been evicted much earlier!]
Inter-Reference Recency (IRR)
IRR of a block: the number of other unique blocks accessed between two consecutive references to the block.
Recency: the number of other unique blocks accessed from the block's last reference to the current time.
Example reference string: 1 2 3 4 3 1 5 6 5. For block 1, IRR = 3 (blocks 2, 3, 4 accessed between its two references) and R = 2 (blocks 5, 6 accessed since its last reference). A small sketch computing both follows.
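A minimal sketch of both measurements on the slide's reference string (the hypothetical helper reproduces IRR = 3 and R = 2 for block 1):

```python
def irr_and_recency(refs, block):
    """IRR: unique other blocks between the block's last two references.
    Recency: unique other blocks accessed since its last reference."""
    pos = [i for i, b in enumerate(refs) if b == block]
    irr = len(set(refs[pos[-2] + 1:pos[-1]]) - {block}) if len(pos) >= 2 else None
    recency = len(set(refs[pos[-1] + 1:]) - {block})
    return irr, recency

print(irr_and_recency([1, 2, 3, 4, 3, 1, 5, 6, 5], 1))   # (3, 2)
```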
Basic Ideas of LIRS
A block with a high IRR is unlikely to be used frequently, so high-IRR blocks are selected for replacement.
Recency is used as a secondary source of information.
LIRS: Low Inter-reference Recency Set algorithm:
keep low-IRR blocks in the buffer cache.
Foundations of LIRS:
effectively use multiple sources of access information;
responsively determine and change the status of each block;
low-cost implementation.
Data Structure: Keep LIR Blocks in Cache
Blocks are divided into a low-IRR (LIR) block set and a high-IRR (HIR) block set.
The physical cache of size L is split into an LIR part of size Llirs, holding all LIR blocks, and an HIR part of size Lhirs, holding the resident HIR blocks: L = Llirs + Lhirs.
Replacement Operations of LIRS
Llirs = 2, Lhirs = 1.
[Table omitted: references to blocks A-E over virtual times 1-10, with each block's recency (R) and IRR.]
LIR block set = {A, B}, HIR block set = {C, D, E}.
E becomes the resident HIR block, determined by its low recency.
Which Block is Replaced? Replace an HIR Block
D is referenced at time 10.
[Table omitted: the access table updated at time 10.]
The resident HIR block E is replaced!
How is the LIR Set Updated? LIR Block Recency is Used
[Table omitted: the access table with the recencies and IRRs of the LIR blocks A and B.]
HIR is a natural place for D, but this is not insightful.
After D is Referenced at Time 10
[Table omitted: the updated access table.]
D enters the LIR set and B steps down to the HIR set, because D's new IRR is smaller than Rmax, the maximum recency of the LIR blocks.
The Power of LIRS Replacement
Capability to cope with weak access locality:
file scanning: one-time accessed blocks will be replaced in a timely way (due to their high IRRs);
loop-like accesses: the blocks to be accessed soonest will NOT be replaced (due to their low IRRs);
accesses with distinct frequencies: frequently accessed blocks will NOT be replaced (dynamic status changes).
LIRS Efficiency: O(1)
The status switch compares IRR_HIR, the new IRR of an HIR block, against Rmax, the maximum recency of the LIR blocks.
Can O(LIRS) = O(LRU)? Yes: this efficiency is achieved by our LIRS stack:
• both recencies and useful IRRs are recorded automatically;
• the Rmax of the block at the stack bottom is larger than the IRRs of all other blocks;
• no comparison operations are needed.
LIRS Operations
• Initialization: all referenced blocks are given LIR status until the LIR block set is full.
• Resident HIR blocks are placed in a small LRU stack Q; the LIRS stack S records recency order.
• Three cases: accessing an LIR block (a hit), accessing a resident HIR block (a hit), and accessing a non-resident HIR block (a miss).
[Diagram omitted: LIRS stack S and LRU stack Q for resident HIR blocks; cache size L = 5, Llir = 3, Lhir = 2.]
Access an LIR Block (a Hit)
[Diagram omitted: accesses to blocks 4 and 8. The accessed LIR block moves to the top of stack S; if it was the bottom block, the HIR blocks now at the bottom of S are pruned. Llir = 3, Lhir = 2.]
Access a Resident HIR Block (a Hit)
[Diagram omitted: accesses to blocks 3 and 5. A resident HIR block that hits while it still has an entry in stack S is promoted to LIR, the bottom LIR block is demoted to the HIR set (moved to stack Q), and S is pruned; a resident HIR block no longer in S stays HIR and moves to the top of S and the end of Q. Llir = 3, Lhir = 2.]
Access a Non-Resident HIR Block (a Miss)
[Diagram omitted: access to block 7. The resident HIR block at the front of stack Q is evicted; the missed block is loaded, placed at the top of stack S, and added to Q as a resident HIR block. Llir = 3, Lhir = 2.]
Access a Non-Resident HIR Block (a Miss) (Cont.)
[Diagram omitted: accesses to blocks 9 and 5. If the missed block still has an entry in stack S, its new IRR is smaller than Rmax, so it is promoted to LIR and the bottom LIR block is demoted to Q; otherwise it becomes a resident HIR block in Q. Llir = 3, Lhir = 2.]
LIRS Stack Simplifies Replacement
Recency is ordered in the stack, with the LIR block of recency Rmax at the bottom.
There is no need to track each HIR block's IRR: a newly accessed HIR block's IRR while it is in the stack equals its recency, which is less than Rmax.
A small LRU stack is used to store the resident HIR blocks.
The additional pruning and demotion operations take constant time.
Although LIRS operations are much more dynamic than LRU's, its complexity is identical to LRU's. A compact sketch follows.
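A compact sketch of the replacement loop described above (a simplification for illustration, not the authors' implementation: stack S records recency history for LIR, resident HIR, and non-resident HIR blocks, queue Q holds the resident HIR blocks, and details such as bounding the number of non-resident entries in S are omitted):

```python
from collections import OrderedDict

class LIRS:
    """Sketch of LIRS replacement with an LIRS stack S and HIR queue Q."""
    def __init__(self, cache_size, lhirs=2):
        self.llirs, self.lhirs = cache_size - lhirs, lhirs
        self.S, self.Q = OrderedDict(), OrderedDict()   # end = most recent
        self.lir, self.resident = set(), set()
        self.hits = 0

    def _prune(self):
        # Keep an LIR block at the stack bottom: its recency is Rmax.
        while self.S and next(iter(self.S)) not in self.lir:
            self.S.popitem(last=False)

    def _demote_bottom_lir(self):
        victim = next(iter(self.S))          # bottom block, always LIR here
        self.lir.discard(victim)
        self.S.popitem(last=False)
        self.Q[victim] = True                # becomes a resident HIR block
        self._prune()

    def access(self, b):
        if b in self.resident:
            self.hits += 1
            if b in self.lir:                            # LIR hit
                was_bottom = b == next(iter(self.S))
                self.S.pop(b); self.S[b] = True
                if was_bottom:
                    self._prune()
            elif b in self.S:                            # resident HIR hit, in S
                self.lir.add(b); self.Q.pop(b, None)     # promote to LIR
                self.S.pop(b); self.S[b] = True
                self._demote_bottom_lir()
            else:                                        # resident HIR hit, not in S
                self.S[b] = True
                self.Q.pop(b, None); self.Q[b] = True
            return True
        # Miss: evict the resident HIR block at the front of Q if the cache is full.
        if len(self.resident) >= self.llirs + self.lhirs:
            victim, _ = self.Q.popitem(last=False)
            self.resident.discard(victim)
        self.resident.add(b)
        if len(self.lir) < self.llirs:                   # initialization phase
            self.lir.add(b); self.S[b] = True
        elif b in self.S:                                # its new IRR beats Rmax
            self.lir.add(b)
            self.S.pop(b); self.S[b] = True
            self._demote_bottom_lir()
        else:                                            # plain HIR miss
            self.S[b] = True
            self.Q[b] = True
        return False
```

For example, `LIRS(100, lhirs=2).access(b)` can be driven by the same reference strings as the LRU sketch earlier; on the 101-block loop it keeps the hit rate close to 99/101 instead of zero.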
Performance Evaluation
Trace-driven simulations on different access patterns show that LIRS outperforms existing replacement algorithms in almost all cases.
The performance of LIRS is not sensitive to its only parameter, Lhirs.
Performance is not affected even when the LIRS stack size is bounded.
The time/space overhead is as low as LRU's.
LRU can be regarded as a special case of LIRS.
Selected Workload Traces
• 2-pools: a synthetic trace that simulates the distinct-frequency case.
• cpp: a GNU C compiler pre-processor trace.
• cs: an interactive C source program examination tool trace.
• glimpse: a text information retrieval utility trace.
• link: a UNIX link-editor trace.
• postgres: a trace of join queries among four relations in a relational database system.
• sprite: from the Sprite network file system.
• multi1: cs and cpp executed together.
• multi2: cs, cpp, and postgres executed together.
• multi3: cpp, gnuplot, glimpse, and postgres executed together.
The traces cover (1) various patterns, (2) non-regular accesses, and (3) large traces.
Looping Pattern: postgres (Time-space map) [figure omitted]
Looping Pattern: postgres (Hit Rates) [figure omitted]
Potential Impact of LIRS
A LIRS patent has been filed and is pending approval.
LIRS has been positively evaluated by IBM Almaden Research.
Potential adoption by LaserFiche in digital libraries.
The trace-driven simulation package has been distributed to many universities for research and classroom teaching.
Conclusion
Locality-aware research is long-term and multidisciplinary.
Application software support
+: optimization is effective for architecture-dependent libraries.
-: cache optimization only, and case by case.
Hardware support
+: touches fundamental problems, such as address symmetry.
-: the optimization space is very limited due to cost considerations.
System software support
+: a key to locality optimization of I/O and virtual memory.
-: lacks application knowledge, and requires kernel modifications.
Selected References
Application software for cache optimization:
Cache-effective sortings, ACM Journal on Experimental Algorithmics, 2000.
Fast bit-reversals, SIAM Journal on Scientific Computing, 2001.
Fast and high-associativity cache designs:
Multicolumn caches, IEEE Micro, 1997.
Low-power caches, IEEE Micro, 2002.
Hardware support for DRAM locality exploitation:
Permutation-based page interleaving, MICRO-33, 2000.
Fine-grain memory access scheduling, HPCA-8, 2002.
System software support for buffer cache optimization:
LIRS replacement, SIGMETRICS 2002.
TPF systems, Software: Practice & Experience, 2002.