The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms
by Ali R. Butt, Chris Gniady, and Y. Charlie Hu, SIGMETRICS'05

Course: CSCI 780 – Advanced Topics on Caching Techniques in Computer and Distributed Systems
Presenter: Chuan Yue
07/05/2005
Outline
• The Buffer Cache
• Linux Kernel Prefetching
• Adapted Buffer Cache Replacement Algorithms
• Simulation Results
• Conclusions
• Discussions
Buffer Cache in Main Memory
• Two kinds of I/O operations:
– Direct-access read()/write() uses the block-based buffer cache
– Memory-mapped I/O shares the page cache with the virtual memory system
• Naturally, that leads to two separate buffers
• Problems:
– Double buffering
– Inconsistencies
[Figure: memory-mapped I/O and the virtual memory system use the page cache; read/write I/O uses the buffer cache; both caches are backed by the disk]
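To make the two I/O paths concrete, here is a minimal Python sketch that touches the same file once through read() and once through a memory mapping; the file path and sizes are illustrative. On a system without a unified cache, these two accesses may be served from two different in-memory copies.

import mmap
import os

path = "/tmp/example.dat"          # hypothetical file; any readable file works
with open(path, "wb") as f:
    f.write(b"hello buffer cache" * 1024)

# Path 1: ordinary read() -- with two separate caches this is served
# from the block-based buffer cache.
fd = os.open(path, os.O_RDONLY)
via_read = os.read(fd, 18)

# Path 2: memory-mapped I/O -- the file is mapped into the address
# space and served from the page cache by the VM system.
with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    via_mmap = bytes(m[:18])
os.close(fd)

# Same bytes either way; without a unified cache the kernel may hold
# two in-memory copies (double buffering) that can fall out of sync.
assert via_read == via_mmap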
Unification of Buffer Cache and Page Cache
• A unified buffer cache uses the same page cache to store
– Virtual memory pages
– Memory-mapped pages
– Ordinary file system I/O
• Issues:
– Complex interactions between the file system and VM
[Figure: virtual memory, memory-mapped I/O, and read/write I/O all go through a single unified buffer cache backed by the disk]
Buffer Cache Management
• Designing effective buffer cache replacement algorithms is a
fundamental challenge in improving system performance
– Traditional file I/O system
– Virtual memory system
• Various buffer cache replacement algorithms
– LRU replacement is widely used
– However, LRU copes poorly with access patterns that have weak locality
– Other well-known algorithms that utilize recency information:
LRU-2, 2Q, LIRS, LRFU, MQ, ARC
Prefetching
• Prefetching is another highly effective technique for improving I/O performance
• The main motivation for prefetching is to overlap computation with I/O and thus reduce the exposed I/O latency
• Various prefetching techniques:
– Prefetching using user-inserted hints about I/O access patterns
• Drawback: places a burden on the programmer
– Kernel-driven file system prefetching in modern operating systems
• Synchronous read-ahead to amortize seek cost
• Asynchronous prefetching after detecting sequential access patterns
The impact of kernel prefetching on buffer cache
replacement algorithms’ performance
• The close interactions between caching and prefetching:
– Prefetching file blocks into the cache can be harmful (P. Cao et al., 1995)
– Replacement policy & prefetching → buffer cache hit ratio
– Hit ratio, prefetching & clustering → I/O disk traffic
– I/O disk traffic → file system performance
• Almost all proposed buffer cache replacement algorithms
did not take kernel-driven prefetching into account
• The work in this paper:
– Shows the potential performance impact of kernel prefetching on buffer
cache replacement algorithms
– Presents the simulation results on 8 adapted replacement algorithms
Kernel components on the path from file system
operations to the disk
Kernel Prefetching in Linux
• Prefetching is based on the pattern of accesses to the file
– Only considers prefetching for read accesses
– Beneficial for sequential accesses to a file
• Read-ahead Group and Read-ahead Window
• Synchronous Prefetching and Asynchronous Prefetching
[Figure: blocks 1–4 form the current read-ahead group inside the read-ahead window; an access to the group triggers prefetching of a new group (blocks 5–10), and the window slides forward to cover the old group plus the new group]
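A minimal sketch of the group/window bookkeeping follows. The starting size of 4 pages, the doubling growth policy, and the 32-page cap are assumptions for illustration, not the real kernel's constants or code.

class ReadAheadState:
    # Toy model of the Linux read-ahead group/window scheme above.
    MAX_PAGES = 32

    def __init__(self, prefetch):
        self.prefetch = prefetch  # callback that issues the disk reads
        self.group = set()        # pages brought in by the latest prefetch
        self.window = set()       # previous group plus current group
        self.size = 4             # pages per prefetch (assumed start)

    def on_read(self, page):
        if page not in self.window:
            # Sequence broken (or first access): synchronous read-ahead.
            # The caller blocks on this page anyway, so read a small
            # group along with it to amortize the seek cost.
            self.size = 4
            self.group = set(range(page, page + self.size))
            self.window = set(self.group)
            self.prefetch(sorted(self.group))
        elif page in self.group:
            # Access fell inside the current group: the pattern looks
            # sequential, so asynchronously prefetch the next (larger)
            # group and slide the window over old group + new group.
            start = max(self.group) + 1
            self.size = min(self.size * 2, self.MAX_PAGES)
            new_group = set(range(start, start + self.size))
            self.window = self.group | new_group
            self.group = new_group
            self.prefetch(sorted(new_group))
        # Otherwise the page is in the window but outside the current
        # group: it was already prefetched, so nothing more is issued.

# Example: sequential reads trigger one synchronous read-ahead and
# then asynchronous prefetches of growing size.
ra = ReadAheadState(lambda pages: print("prefetch", pages))
for p in range(12):
    ra.on_read(p)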
Belady’s algorithm can be non-optimal given
kernel prefetching
• Access sequence: a c e g i k m o a b c d e f g h i j k l m n o p
• Without prefetching: Belady's algorithm incurs 16 cache misses; LRU incurs 23
• With prefetching: Belady's algorithm incurs 8 cache misses; LRU incurs only 6
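The no-prefetching counts can be reproduced with a small simulator. The cache size of 8 blocks is an assumption (the slide does not state it), chosen because it makes the counts match.

from collections import OrderedDict

seq = "a c e g i k m o a b c d e f g h i j k l m n o p".split()

def lru_misses(refs, size):
    cache, misses = OrderedDict(), 0
    for b in refs:
        if b in cache:
            cache.move_to_end(b)               # refresh recency
        else:
            misses += 1
            if len(cache) >= size:
                cache.popitem(last=False)      # evict least recently used
            cache[b] = True
    return misses

def opt_misses(refs, size):
    cache, misses = set(), 0
    for i, b in enumerate(refs):
        if b in cache:
            continue
        misses += 1
        if len(cache) >= size:
            def next_use(x):                   # index of next reference
                try:
                    return refs.index(x, i + 1)
                except ValueError:
                    return float("inf")        # never referenced again
            cache.remove(max(cache, key=next_use))
        cache.add(b)
    return misses

print(opt_misses(seq, 8), lru_misses(seq, 8))  # -> 16 23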
Prefetching has been ignored in algorithm design
• Caching algorithms have been proposed and studied
without considering prefetching
– OPT
– LRU
– LRU-K [SIGMOD 1993]
– 2Q [VLDB 1994]
– LRFU [TC 2001]
– MQ [USENIX 2001]
– LIRS [SIGMETRICS 2002]
– ARC [FAST 2003]
• Changes to OPT, LRU, 2Q, LIRS will be explained
OPT
• OPT is based on Belady's cache replacement algorithm
– Off-line; has knowledge of future references
• In the presence of Linux kernel prefetching, adapted OPT:
– Prefetched blocks are treated as most recently accessed and
inserted into the cache according to the original OPT algorithm
– However, OPT is given the capability to immediately identify wrong
prefetches, i.e., prefetched blocks that
• will never be accessed on-demand, or
• will be accessed farther in the future than all other blocks in the cache
– Wrongly prefetched blocks become immediate candidates for eviction
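A sketch of the eviction rule in the adapted OPT (names are illustrative): the standard farthest-next-use rule automatically covers both kinds of wrong prefetch, since a block that is never demanded again has an infinite next-use distance.

# cache: set of resident block ids; refs: the full reference string;
# now: index of the current access in refs.
def pick_victim(cache, refs, now):
    def next_use(block):
        try:
            return refs.index(block, now + 1)
        except ValueError:
            return float("inf")   # never demanded again: a wrong prefetch
    # The farthest-next-use block is evicted. A wrongly prefetched block
    # has a next use of infinity (or beyond every cached block), so it is
    # naturally chosen first, implementing the rule above.
    return max(cache, key=next_use)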
LRU
• LRU is the most widely used replacement policy
• In the presence of kernel prefetching, adapted LRU:
– On each access, the kernel determines the number of blocks that
need to be prefetched
– Prefetched blocks are inserted at the MRU locations just like regular
blocks
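A minimal sketch of this adaptation (the interface and names are assumed): the kernel-chosen prefetch set is passed in alongside the demand block, and both enter at the MRU end of a single LRU list.

from collections import OrderedDict

class PrefetchLRU:
    # Adapted LRU as described above: prefetched blocks enter at the
    # MRU end exactly like on-demand blocks.
    def __init__(self, size):
        self.size = size
        self.cache = OrderedDict()   # keys ordered from LRU to MRU

    def _touch(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)      # promote to MRU
            return True
        if len(self.cache) >= self.size:
            self.cache.popitem(last=False)     # evict the LRU block
        self.cache[block] = True
        return False

    def access(self, block, prefetched=()):
        hit = self._touch(block)               # the on-demand block
        for p in prefetched:                   # blocks the kernel decided
            self._touch(p)                     # to prefetch on this access
        return hit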
2Q
• Three queues and the algorithm:
– A1in queue: all missed blocks are initially placed here
– A1out queue: when blocks are replaced from the A1in queue in FIFO
order, their addresses are temporarily placed here (addresses only)
– Am queue: when a block is re-referenced and its address is in the
A1out queue, it is promoted to the Am queue
[Figure: example access sequence 10, 11, 12, 13, 14, 11, 12, 22, … flowing through the A1in, A1out (address-only), and Am queues]
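A simplified sketch of the three-queue scheme. The Kin and Kout thresholds are assumptions in the spirit of the original 2Q paper's tunables, not values taken from these slides.

from collections import OrderedDict, deque

class TwoQ:
    def __init__(self, size):
        self.size = size
        self.kin = max(1, size // 4)        # capacity of A1in (assumed)
        self.kout = max(1, size // 2)       # capacity of A1out (assumed)
        self.a1in = OrderedDict()           # FIFO of first-time blocks
        self.a1out = deque()                # ghost queue: addresses only
        self.am = OrderedDict()             # LRU of re-referenced blocks

    def _evict_for_space(self):
        # Keep the resident blocks (A1in + Am) within the cache size.
        while len(self.a1in) + len(self.am) > self.size:
            if len(self.a1in) > self.kin:
                old, _ = self.a1in.popitem(last=False)   # FIFO eviction
                self.a1out.append(old)                   # remember address
                if len(self.a1out) > self.kout:
                    self.a1out.popleft()
            else:
                self.am.popitem(last=False)              # LRU eviction

    def access(self, block):
        if block in self.am:
            self.am.move_to_end(block)      # LRU hit in Am
            return True
        if block in self.a1in:
            return True                     # hit in A1in: FIFO order kept
        if block in self.a1out:
            self.a1out.remove(block)        # recently seen: promote to Am
            self.am[block] = True
        else:
            self.a1in[block] = True         # cold miss: enters A1in
        self._evict_for_space()
        return False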
2Q – With Adaptation
(in the presence of kernel prefetching)
• Prefetched blocks are treated as on-demand blocks:
– A prefetched block is initially placed into the A1in queue
– On the subsequent on-demand access, the block stays in the A1in queue
– If a block currently in the A1out queue is prefetched, it is promoted
into the Am queue as if it were accessed on-demand
– If a prefetched block is evicted from the A1in queue before any
on-demand access, it is simply discarded, as opposed to being moved
into the A1out queue
[Figure: example sequence of demand and prefetched blocks 10, 11, 12, 11, 13, 14, 11, 22, 23 flowing through the A1in, A1out (address-only), and Am queues]
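Building on the TwoQ sketch from the previous slide, the adaptation can be expressed as a handful of extra rules; the "prefetched" flag is an illustrative device for the discard-without-ghost rule, not part of the paper's description.

class PrefetchTwoQ(TwoQ):
    def prefetch(self, block):
        if block in self.am or block in self.a1in:
            return                           # already resident
        if block in self.a1out:
            # Prefetching a ghost-queue block counts as a re-reference:
            # promote it straight into Am, as if accessed on-demand.
            self.a1out.remove(block)
            self.am[block] = True
        else:
            self.a1in[block] = "prefetched"  # enters A1in, but flagged
        self._evict_for_space()

    def access(self, block):
        if block in self.a1in:
            self.a1in[block] = "demanded"    # a demand access clears the flag
        return super().access(block)

    def _evict_for_space(self):
        while len(self.a1in) + len(self.am) > self.size:
            if len(self.a1in) > self.kin:
                old, state = self.a1in.popitem(last=False)
                if state != "prefetched":
                    # Only demand-accessed blocks leave a ghost entry;
                    # unused prefetches are simply discarded.
                    self.a1out.append(old)
                    if len(self.a1out) > self.kout:
                        self.a1out.popleft()
            else:
                self.am.popitem(last=False)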
LIRS
• Dynamically and responsively maintains the LIR and HIR block sets,
and keeps the LIR block set in the cache
• In the presence of kernel prefetching, adapted LIRS:
– Prefetched blocks are not inserted into the LIRS stack S; they are
only inserted into the resident-HIR queue Q
– If a prefetched block has no existing entry in stack S, the first
on-demand access to the block causes it to be pushed onto the top of
stack S as an HIR block
– If a prefetched block already exists in stack S, the first on-demand
access to the block is treated as a LIR block access
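A focused sketch of only these prefetch-handling rules, on top of two ordered maps standing in for LIRS stack S and the resident-HIR queue Q; the stack pruning and LIR/HIR demotion machinery of full LIRS is deliberately omitted.

from collections import OrderedDict

class LirsPrefetchRules:
    def __init__(self):
        self.stack_s = OrderedDict()   # recency stack S, bottom -> top
        self.queue_q = OrderedDict()   # resident HIR blocks

    def prefetch(self, block):
        # Prefetched blocks go only into Q, never into S, so a prefetch
        # cannot manufacture the recency LIRS uses to classify blocks.
        self.queue_q[block] = True

    def demand_access(self, block):
        if block in self.stack_s:
            # The block has real recency history in S: its first demand
            # access after the prefetch counts as a LIR block access.
            self.stack_s.move_to_end(block)
            self.queue_q.pop(block, None)
            return "LIR"
        # No entry in S: the first demand access pushes the block onto
        # the top of S as an HIR block (it stays resident in Q).
        self.stack_s[block] = True
        return "HIR"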
Performance Evaluation
• Trace collection
– Interception of I/O system calls (using a modified Linux strace utility)
– Collects I/O access type, time, file identifier (inode), and I/O size
• Timing-accurate trace simulator
– Detailed implementation of kernel prefetching and clustering
– Interfaces with the DiskSim simulator to simulate I/O time
– Implements: OPT, LRU, LRU-2, LRFU, LIRS, MQ, 2Q, ARC
• Metrics
– Hit ratio
– Aggregated synchronous and asynchronous disk I/O requests
– Actual running time
Applications and Trace Statistics
(Concurrent applications: Multi1: cscope, gcc; Multi2: cscope, gcc, viewperf;
Multi3: glimpse, TPC-H.)
Hit ratio results for cscope
• Kernel prefetching has a significant impact on the hit ratio
• The improvement differs across algorithms
• Prefetching can significantly change the relative
performance of replacement algorithms
Disk requests results for cscope
• The clustering of I/O requests in the presence of prefetching
results in a significant reduction in the number of disk requests
• The effect is complex and closely tied to the file access patterns
Execution time results for cscope
• A reduction in the number of disk requests due to kernel prefetching
does not necessarily translate into a reduction in execution time
Results for the other three sequential-access applications
• Glimpse
– It also benefits from prefetching
– The changes in the relative behavior of different algorithms observed
in cscope with prefetching are also observed in glimpse
• Viewperf
– It benefits the most from prefetching
– The behavior of different cache replacement algorithms is similar to
that observed in cscope
• Gcc
– Many accesses are to small files, leaving little opportunity for prefetching
– All three performance metrics are almost identical with and without
prefetching
Hit ratio results for tpc-h
• Prefetching provides little improvement in the hit ratio for
random access patterns
Disk requests results for tpc-h
• Most prefetched blocks are never accessed; as a result, the
number of disk requests doubles
Execution time results for tpc-h
• The significant increase in the number of I/Os translates into a
significant increase in the execution time
Results for concurrent applications
• Multi1: cscope, gcc
– Behavior similar to that of cscope
• Multi2: cscope, gcc, viewperf
– Similar to Multi1; however, prefetching does not improve the
execution time because viewperf is CPU-bound
• Multi3: glimpse, TPC-H
– Behavior similar to that of tpc-h
Number and size of synchronous and asynchronous
disk I/Os in cscope at 128MB cache size
• The total number of disk requests with prefetching is at least 30%
lower than without prefetching for all schemes except OPT
• Most of the reduction in disk requests comes from issuing asynchronous
disk requests, which can be overlapped with CPU time
Conclusions
• In this research work, the authors
– Proposed how kernel prefetching can be incorporated into different replacement algorithms
– Built a timing-accurate simulator to evaluate their relative performance
• The paper shows
– Prefetching impacts hit ratio, number of disk requests, and execution time
– Comparing hit ratios alone is insufficient
– Kernel prefetching can narrow the performance gap of different replacement
algorithms
– Kernel prefetching can also change the relative performance benefits of
different replacement algorithms
• Future buffer caching research should
– Take into consideration prefetching and I/O clustering
– Simulate execution time
Discussions (1)
• Good points
– No new algorithm, but this is the first paper to simulate and compare the impact
of kernel prefetching on well-known buffer cache replacement algorithms
– The results are not very surprising; we can guess the general results for
sequential and random workloads, but this paper is the first to report them
• Bad points
– The simulation is based only on I/O traces; it would be better if VM-trace-based
results were also presented
– The simulation results for concurrent applications are not analyzed in detail (in
the paper itself)
– It would be better if the unification of the buffer cache and page cache in many
OSes were considered, and if the competition between process page accesses
and file cache page accesses were simulated and analyzed
Discussions (2)
• Some questions:
– Regarding Belady's anomaly:
• In the LIRS paper: Belady's anomaly appears in 2Q and ARC for the glimpse workload
• In this paper: without prefetching, the simulation results did not show Belady's
anomaly; with prefetching, Belady's anomaly appears in ARC for the glimpse workload
• Why the difference? LRU is free of Belady's anomaly; what about the other algorithms?
– Regarding the simulations:
• Is there any relationship between the cache sizes chosen in simulation and the real
environment where the traces were collected?
• Is performance under thrashing conditions still worth simulating?
References
• "A Study of Integrated Prefetching and Caching Strategies", P. Cao et al., ACM SIGMETRICS, 1995
• "Making LRU Friendly to Weak Locality Workloads: A Novel Replacement Algorithm to Improve Buffer Cache Performance", S. Jiang and X. Zhang, IEEE Transactions on Computers, Vol. 54, No. 9, September 2005
• "CLOCK-Pro: An Effective Improvement of the CLOCK Replacement", S. Jiang, F. Chen, and X. Zhang, Proceedings of the 2005 USENIX Annual Technical Conference (USENIX '05)
• "Page Replacement in Linux 2.4 Memory Management", Rik van Riel, Proceedings of the 2001 USENIX Technical Conference, FREENIX track
• "Towards an O(1) VM: Making Linux Virtual Memory Management Scale Towards Large Amounts of Physical Memory", Rik van Riel, Proceedings of the Linux Symposium, July 2003
• "Journal File Systems in Linux", accessed June 21, 2005 (http://bulma.net/impresion.phtml?nIdNoticia=1154)
• "The Buffer Cache", accessed June 21, 2005 (http://www.faqs.org/docs/linux_admin/buffer-cache.html)
• "The Performance Impact of Kernel Prefetching on Buffer Cache Replacement", Chris Gniady et al. (Purdue University), ACM SIGMETRICS 2005 presentation slides
• "More on File System", lecture notes, accessed June 22, 2005 (http://www.cs.rochester.edu/~kshen/csc256-spring2005/lectures/lecture16-file2.pdf)
Thank you!