Fractal Prefetching B+-Trees: Optimizing Both Cache and Disk Performance
Authors: Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, Gary Valentin
Members: Iris Zhang, Grace Yung, Kara Kwon, Jessica Wong
Outline
1. Introduction
2. Optimizing I/O Performance
a. Searches
b. Range Scans
3. Optimizing Cache Performance
a. Disk-First fpB+-Trees
b. Cache-First fpB+-Trees
4. Conclusion
Introduction
• Traditional B+-Trees
– Optimized for I/O performance
– tree nodes = disk pages
• Recent types of B+-Trees
– Optimized for CPU cache performance
– tree node sizes = one or a few cache lines
– Introduce the concept of prefetching
Introduction (cont’d)
Figure 1: Traditional B+-Trees (figure labels: page control info; index entry = key and page/tuple ID)
Introduction (cont’d)
• Problem (due to the large discrepancy in optimal node sizes)
1. Disk-optimized B+-Trees suffer from poor cache performance
2. Cache-optimized B+-Trees suffer from poor disk performance
Introduction (cont’d)
• Proposal: Fractal Prefetching B+-Trees (fpB+-Trees)
1. Embed “cache-optimized” trees within “disk-optimized” trees
2. Optimize both cache and I/O performance
3. Two approaches:
-> disk-first
-> cache-first
Introduction (cont’d)
Figure 2: Self-similar “tree within a tree” structure
Introduction (cont’d)
• Disk-first and Cache-first
• What is done to optimize performance
• How to process operations efficiently
  – Bulkload
  – Search
  – Insertion
  – Deletion
Optimizing I/O Performance
• fpB+-Trees combine features of disk- and cache-optimized B+-Trees to achieve the best of both structures
• Consider two concepts from pB+-Trees
  – Searches: Prefetching and node sizes
  – Range Scans: Prefetching via jump-pointer arrays
Optimizing I/O Performance (cont’d)
• Prefetching:
  – Modern database servers are composed of multiple disks per processor
  – Goal: effectively exploit I/O parallelism
    • Explicitly prefetch disk pages even when the access patterns are not sequential
Searches: Prefetching and Node Sizes (cont’d)
• For disk-resident data
  – Increase the B+-Tree node size to be a multiple of the disk page size
  – Prefetch all pages of a node when accessing it
    • Pages are placed on different disks so that requests can be serviced in parallel
• Result: faster search
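
The idea of fetching every page of a multi-page node at once can be sketched as follows. This is only an illustration: `read_page` is a hypothetical stand-in for a blocking disk read, and a thread pool stands in for whatever asynchronous I/O mechanism a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def read_page(page_id):
    """Stand-in for a blocking read of one disk page (hypothetical)."""
    return f"page-{page_id}"


def fetch_node(page_ids, max_parallel=4):
    """Issue reads for every page of a multi-page node at once.

    Executor.map returns results in the same order as page_ids.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(read_page, page_ids))


# A node spanning three pages (page IDs are made up for the example).
node_pages = fetch_node([3, 4, 5])
print(node_pages)
```

If the pages live on different disks, the reads can overlap instead of being serviced one after another.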
Searches: Prefetching and Node Sizes (cont’d)
• Problem
  – I/O latency improves for a single search, but may become worse when a node requires extra seeks
  – Additional seeks may degrade performance
• Conclusion: the target node size for fpB+-Trees will be a single disk page
Range Scans: Prefetching via Jump-Pointer Arrays
• Range scan
  – Search for the starting key of the range, then read consecutive leaf nodes in the tree
• A jump-pointer array helps leaves to be prefetched effectively
• One implementation: add sibling pointers to each node that is a parent of leaves
Range Scans: Prefetching via Jump-Pointer Arrays (cont’d)
Figure 3: Internal jump-pointer array (figure labels: tree, parent, leaf)
Range Scans: Prefetching via Jump-Pointer Arrays (cont’d)
• This technique can be applied to fpB+-Trees
• Enhancement to avoid overshooting:
  – fpB+-Trees begin by searching for both the start and end key in order to remember the range end page
  – This technique does not decrease throughput
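
A minimal sketch of a range scan that walks sibling-linked leaf parents and stops at the remembered end page. The `LeafParent` class, the `link` helper, and the page IDs are hypothetical; in a real tree the starting leaf parent and the start and end pages would come from the two initial searches.

```python
class LeafParent:
    """Hypothetical leaf-parent node: child leaf page IDs plus a sibling pointer."""

    def __init__(self, leaf_pages):
        self.leaf_pages = list(leaf_pages)
        self.next = None


def link(parents):
    """Chain leaf parents left to right and return the first one."""
    for left, right in zip(parents, parents[1:]):
        left.next = right
    return parents[0]


def pages_to_prefetch(parent, start_page, end_page):
    """Yield leaf page IDs from start_page through end_page, then stop."""
    started = False
    while parent is not None:
        for page in parent.leaf_pages:
            if page == start_page:
                started = True
            if started:
                yield page
                if page == end_page:
                    return  # do not overshoot past the range end page
        parent = parent.next


head = link([LeafParent([1, 2, 3]), LeafParent([4, 5, 6]), LeafParent([7, 8])])
scan = list(pages_to_prefetch(head, start_page=2, end_page=5))
print(scan)
```

Because the end page is known up front, pages 6 to 8 are never requested in this example.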
Optimizing Cache Performance
• The search operation of B+-Trees suffers from poor cache performance
  – During a search, each page on the path to a key is visited
  – In each page, binary search is performed on the large contiguous array
  – Costly in terms of cache misses
Optimizing Cache Performance (cont’d)
• Example:
  – Key, page ID and tuple ID are all 4 bytes
  – An 8 KB page can hold over 1000 entries
  – A cache line is 64 bytes => holds 8 entries
  – Suppose a page has 1023 entries (1 to 1023)
  – Locating a matching entry 71 requires 10 probes with binary search
    • 512, 256, 128, 64, 96, 80, 72, 68, 70, 71
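
The probe sequence in this example can be reproduced with a short sketch. It assumes entries are numbered 1 to 1023, each entry is 8 bytes (a 4-byte key plus a 4-byte ID), and the array starts on a cache-line boundary; the function names are illustrative only.

```python
def binary_search_probes(lo, hi, target):
    """Return the positions probed by a textbook binary search."""
    probes = []
    while lo <= hi:
        mid = (lo + hi) // 2
        probes.append(mid)
        if mid == target:
            break
        if mid < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes


def cache_lines_touched(probes, entry_bytes=8, line_bytes=64):
    """Map 1-based entry positions to cache-line indexes (aligned array assumed)."""
    per_line = line_bytes // entry_bytes
    return {(p - 1) // per_line for p in probes}


probes = binary_search_probes(1, 1023, 71)
lines = cache_lines_touched(probes)
print(probes)      # the ten probes listed on the slide
print(len(lines))  # number of distinct cache lines touched
```

Under these assumptions the ten probes touch seven distinct cache lines; only the last four probes (72, 68, 70, 71) fall in the same line.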
Optimizing Cache Performance (cont’d)
• The update operation of B+-Trees is costly
  – Insertion and deletion both begin with a search
  – To insert an entry in a sorted array, on average half of the page must be copied to make room for the new entry
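
The "half of the page" cost can be illustrated with a small sketch that counts how many entries shift when a key is inserted into a sorted array. The helper names are hypothetical, and a Python list stands in for the page's entry array.

```python
import bisect


def insert_sorted(entries, key):
    """Insert key into a sorted list; return how many entries had to shift right."""
    pos = bisect.bisect_left(entries, key)
    shifted = len(entries) - pos
    entries.insert(pos, key)
    return shifted


def average_shift(n):
    """Average number of entries shifted over all n + 1 insert positions."""
    return sum(n - pos for pos in range(n + 1)) / (n + 1)


page = [10, 20, 30, 40]
moved = insert_sorted(page, 25)
print(page, moved)
print(average_shift(1000))
```

With insert positions equally likely, the average works out to n / 2 entries moved per insertion.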
Disk-First fpB+-Trees
• Start with disk-optimized B+-Trees
• Organize keys and pointers in each page-sized node into a cache-optimized tree
• In each node, a small cache-optimized tree: the in-page tree
  – Modeled after pB+-Trees, which have been shown to have the best cache performance
Disk-First fpB+-Trees (cont’d)
Figure 4: Disk-optimized fpB+-Trees: a cache-optimized tree inside each page (figure label: page control info)
Disk-First fpB+-Trees (cont’d)
• In-page tree has nodes aligned on cache line boundaries
• Each node is several cache lines wide
  – When a node is visited as part of a search, all cache lines in the node are prefetched
• Increases the fan-out of the node and reduces the height of the in-page tree
• Result: better overall performance
Disk-First fpB+-Trees (cont’d)
• Non-leaf nodes
  – Contain pointers to other in-page nodes within the same page
  – To pack more entries into each node, use short in-page offsets instead of full pointers
• Leaf nodes
  – Contain pointers to nodes external to their in-page tree
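
A rough sketch of why short in-page offsets raise fan-out. All sizes here are assumptions chosen for illustration: a 256-byte node (four 64-byte cache lines), 4-byte keys, 4-byte full pointers, 2-byte in-page offsets, and no per-node header.

```python
def fan_out(node_bytes, key_bytes, pointer_bytes, header_bytes=0):
    """Number of (key, pointer) entries that fit in one node."""
    return (node_bytes - header_bytes) // (key_bytes + pointer_bytes)


NODE_BYTES = 4 * 64  # assumed node width: four cache lines

with_full_pointers = fan_out(NODE_BYTES, key_bytes=4, pointer_bytes=4)
with_short_offsets = fan_out(NODE_BYTES, key_bytes=4, pointer_bytes=2)
print(with_full_pointers, with_short_offsets)
```

A 2-byte offset can address any position within a page of up to 64 KB, which is why it can replace a full pointer for links that stay inside the page.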
Disk-First fpB+-Trees (cont’d)
• Optimal in-page node size is determined by memory system parameters and key and pointer sizes
• Optimal page size is determined by I/O parameters and disk and memory prices
• With a mismatch between the two sizes, the tree may have overflow or underflow
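
The size mismatch can be made concrete with a simplified model: a full in-page tree of a given height either exceeds the page or leaves space unused. The numbers (8192-byte page, 256-byte nodes, fan-out 32) are assumptions for illustration, and page control info is ignored.

```python
def full_tree_bytes(levels, fan_out, node_bytes):
    """Bytes needed by a full tree with the given number of levels."""
    nodes = sum(fan_out ** level for level in range(levels))
    return nodes * node_bytes


def classify_fit(page_bytes, levels, fan_out, node_bytes):
    """Compare a full in-page tree against the page size."""
    used = full_tree_bytes(levels, fan_out, node_bytes)
    if used > page_bytes:
        return "overflow"
    if used < page_bytes:
        return "underflow"
    return "exact"


two_level = classify_fit(8192, levels=2, fan_out=32, node_bytes=256)
one_level = classify_fit(8192, levels=1, fan_out=32, node_bytes=256)
print(two_level, one_level)
```

Here a two-level tree needs 33 nodes (8448 bytes) and overflows the page, while a one-level tree uses only 256 bytes and leaves most of the page empty.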
Disk-First fpB+-Trees (cont’d)
Figure 5: Overflow and Underflow (figure labels: page control info, unused space)
Disk-First fpB+-Trees (cont’d)
Figure 6: Fitting cache-optimized trees in a page (use smaller nodes when there is overflow; use larger nodes when there is underflow)
Disk-First fpB+-Trees: Operations
• Bulkload: operations at two granularities
  – At page granularity: follow the common B+-Tree bulkload algorithm
  – For in-page trees of non-leaf pages, pack entries into one in-page leaf node after another
  – For in-page trees of leaf pages, try to distribute entries across all in-page leaf nodes
    • Maintain a linked list of all in-page leaf nodes
Disk-First fpB+-Trees: Operations (cont’d)
• Search
  – A straightforward search is done at each granularity
Disk-First fpB+-Trees: Operations (cont’d)
• Insertion: operations at two granularities
  – If there are empty slots in the in-page leaf node, insert the entry into the sorted array for the node
Disk-First fpB+-Trees: Operations (cont’d)
• Insertion: operations at two granularities
  – Otherwise, split the leaf node into two
    a. Allocate new nodes in the same page
    b. Reorganize the in-page tree if the number of entries is fewer than the page maximum fan-out
    c. Split the page by copying half of the in-page leaf nodes to a new page, and rebuild the two in-page trees in their respective pages
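
One way to read the insertion cases is as a cascade of fallbacks. The sketch below encodes that reading; treating (a), (b), (c) as options tried in order is an assumption, and the function name, parameters, and returned labels are hypothetical.

```python
def choose_insert_action(leaf_has_slot, page_has_free_node,
                         page_entries, page_max_fan_out):
    """Pick the insertion step for a disk-first fpB+-Tree page (assumed ordering)."""
    if leaf_has_slot:
        return "insert into in-page leaf node"
    if page_has_free_node:
        return "split leaf node within the page"       # case a
    if page_entries < page_max_fan_out:
        return "reorganize in-page tree"               # case b
    return "split page and rebuild both in-page trees"  # case c


action = choose_insert_action(
    leaf_has_slot=False,
    page_has_free_node=False,
    page_entries=400,
    page_max_fan_out=470,
)
print(action)
```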
Disk-First fpB+-Trees: Operations (cont’d)
• Deletion
  – Search for the entry
  – Follow with a lazy deletion of the entry in a leaf node
  – Do not merge leaf nodes that become half empty
Cache-First fpB+-Trees
• Start with cache-optimized B+-Trees
• Ignore page boundaries
• Then try to intelligently place cache-optimized nodes into disk pages
Cache-First fpB+-Trees (cont’d)
• Non-leaf node
  – Contains an array of keys and pointers
  – A pointer is a combination of a page ID and an offset in the page
    • Use the page ID to retrieve a disk page
    • Visit a node in the page by the offset
• Leaf node
  – Contains an array of keys and tuple IDs
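
A pointer made of a page ID and an in-page offset can be packed into a single integer. The layout below (offset in the low 16 bits) is an assumption for illustration, not a layout stated on the slides.

```python
OFFSET_BITS = 16  # assumed width of the in-page offset field
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_pointer(page_id, offset):
    """Pack a page ID and an in-page offset into one integer."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in the offset field")
    return (page_id << OFFSET_BITS) | offset


def split_pointer(pointer):
    """Recover (page_id, offset) from a packed pointer."""
    return pointer >> OFFSET_BITS, pointer & OFFSET_MASK


ptr = make_pointer(page_id=7, offset=1024)
print(split_pointer(ptr))
```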
Cache-First fpB+-Trees: Node Placement
• Goal 1: group sibling leaf nodes together into the same page to reduce disk operations during range scans
• Approach: designate certain pages as leaf pages that contain only leaf nodes
  – Leaf nodes in the same page are siblings
Cache-First fpB+-Trees: Node Placement (cont’d)
• Goal 2: group a parent node and its children together into the same page, so that a search needs only one disk operation for a parent and its child
• Problems:
  – Not possible for all nodes
  – Node size mismatch (overflow and underflow)
Cache-First fpB+-Trees: Node Placement (cont’d)
• For underflow (i.e. “not enough” children)
  – Place grandchildren, great-grandchildren, etc. onto the same page
• For overflow: two approaches
  a. Place the overflowed child into its own page as a top-level node with its own children
  b. Store the overflowed child in special overflow pages
Cache-First fpB+-Trees: Node Placement (cont’d)
Figure 8: Cache-first fpB+-Tree design (figure labels: non-leaf nodes with aggressive placement; overflow pages for leaf node parents)
Cache-First fpB+-Trees: Operations
• Bulkload: Leaf nodes
  – Placed consecutively in leaf pages, and linked together with sibling links
Cache-First fpB+-Trees: Operations (cont’d)
• Bulkload: Non-leaf nodes
  – Determine whether there is space for the node to fit into the same page as its parent
  – If not, then
    • Allocate the node as the top-level node in a new page, or
    • If the non-leaf node is a parent of a leaf node, place it into the overflow page
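
The placement rule for non-leaf nodes during bulkload can be sketched as a small decision function. The function name, parameters, and returned labels are hypothetical, and free space is modeled as a plain byte count.

```python
def place_nonleaf_node(parent_page_free_bytes, node_bytes, is_leaf_parent):
    """Decide where a non-leaf node goes during cache-first bulkload."""
    if parent_page_free_bytes >= node_bytes:
        return "same page as parent"
    if is_leaf_parent:
        return "overflow page"
    return "top-level node in a new page"


placement = place_nonleaf_node(
    parent_page_free_bytes=128, node_bytes=256, is_leaf_parent=True
)
print(placement)
```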
Cache-First fpB+-Trees: Operations (cont’d)
• Search
  – Straightforward, with one thing to note
  – When proceeding from a parent to one of its children, compare the page IDs
  – The same page ID indicates that parent and child are on the same page
    • The node can be accessed directly in the page without retrieving the page from the buffer manager
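
The page ID comparison can be sketched with a toy buffer manager that counts page requests. The `BufferManager` interface and the dictionary-based page layout are hypothetical; the point is only that a child on the same page as its parent needs no new request.

```python
class BufferManager:
    """Toy buffer manager that counts page requests (hypothetical interface)."""

    def __init__(self, pages):
        self.pages = pages
        self.requests = 0

    def get_page(self, page_id):
        self.requests += 1
        return self.pages[page_id]


def follow_path(buffer_manager, pointers):
    """Visit nodes along a root-to-leaf path of (page_id, offset) pointers."""
    current_id, page, visited = None, None, []
    for page_id, offset in pointers:
        if page_id != current_id:  # different page: ask the buffer manager
            page = buffer_manager.get_page(page_id)
            current_id = page_id
        visited.append(page[offset])  # same page: access the node directly
    return visited


bm = BufferManager({1: {0: "root", 64: "child"}, 2: {0: "leaf"}})
visited = follow_path(bm, [(1, 0), (1, 64), (2, 0)])
print(visited, bm.requests)
```

Three nodes are visited but only two page requests reach the buffer manager, because the root and its child share page 1.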
Cache-First fpB+-Trees: Operations (cont’d)
• Insertion:
  – If there are empty slots in the leaf node, simply insert the entry; otherwise the node must be split into two
  – If the leaf page has space, it accommodates the new node; otherwise the leaf page must be split
    • Move the second half of the leaf nodes to a new page
    • Update the corresponding child pointers in their parents
Cache-First fpB+-Trees: Operations (cont’d)
• Insertion:
  – After a leaf node split, an entry must be inserted into the parent node
  – If the parent node is full, it needs to be split
    • For a leaf parent node, the new node may be allocated from overflow pages
    • If further splits up the tree are needed, the new node must be allocated as described in bulkload
Cache-First fpB+-Trees: Operations (cont’d)
• Deletion
  – Similar to disk-first fpB+-Trees
Conclusion
1. Problems of traditional B+-Trees
2. In optimizing I/O performance, considered two concepts from pB+-Trees: searches and range scans
3. How disk-first and cache-first fpB+-Trees perform better than traditional B+-Trees
4. Operations (bulkload, search, insertion, deletion)