COMP5138 Relational Database Management Systems

Lecture 10:
Storage and Indexing – Part II
L9
Storage&Indexing
Today’s Agenda
- Storage
  - Disk
  - Buffer Management
  - RAID Systems
  - File Organization
- Indexing
  - Access Paths
  - Tree-structured Indexing
  - Hash-based Indexing
  - Bitmap Index
Access Path
- A way of retrieving tuples from a table: either a file scan, or an
  index plus a matching selection condition (i.e., the ways in which
  rows of a table can be retrieved)
- Refers to the algorithm + data structure (e.g., an index) used
  for retrieving and storing data in a table
- The choice of access path used to execute an SQL statement has
  no effect on the semantics of the statement
- This choice can, however, have a major effect on the execution
  time of the statement
Organization of Records in Files
- Heap: a record can be placed anywhere in the file where
  there is space
- Sequential: store records in sequential order, based on
  the value of the search key of each record
- Index based
  - Trees
  - Hashing: a hash function is computed on some attribute of each
    record; the result specifies in which block of the file the record
    should be placed
Heap Files
- Rows are appended to the end of the file as they are inserted
  - Hence the file is unordered
- Deleted rows create gaps in the file
  - File must be periodically compacted to recover space
- Performance (assume the file contains F pages):
  - Inserting a row: access path is a scan
    - Avg. F/2 page transfers if the row already exists (rows are unique)
    - F+1 page transfers if the row does not already exist
  - Deleting a row: access path is a scan
    - Avg. F/2+1 page transfers if the row exists
    - F page transfers if the row does not exist
  - Query: access path is a scan
    - Efficient if all rows are returned (SELECT * FROM table)
    - Very inefficient if only a few rows are requested
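The page-transfer counts above can be read as a small cost model. The sketch below is only illustrative: the function names are made up for this example, and it assumes that every page read or write counts as one transfer and that an insert first scans for an existing copy of the row.

```python
def heap_insert_cost(F, row_exists):
    """Expected page transfers to insert into a heap file of F pages."""
    if row_exists:
        return F / 2      # the duplicate is found, on average, halfway through
    return F + 1          # read all F pages, then write the page that gets the row

def heap_delete_cost(F, row_exists):
    """Expected page transfers to delete a row from a heap file of F pages."""
    if row_exists:
        return F / 2 + 1  # average scan to find the row, plus one page write
    return F              # full scan, nothing to write back

for F in (100, 10_000):
    print(F, heap_insert_cost(F, False), heap_delete_cost(F, True))
```

The costs grow linearly with F, which is why a heap file is a poor fit for queries that request only a few rows.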
Sorted File
- Rows are sorted based on some attribute(s)
  - Access path is binary search
  - An equality or range query on that attribute costs about log₂ F
    page transfers to retrieve the page containing the first row
  - Successive rows are in the same (or successive) page(s), so cache
    hits are likely
  - By storing all pages on the same track, seek time can be minimized
- Problem: maintaining sorted order
  - After the correct position for an insert has been determined,
    inserting the row requires (on average) F/2 reads and F/2 writes
    (because shifting is necessary to make space)
  - Partial solution 1: leave empty space in each page
  - Partial solution 2: use overflow pages (chains)
    - Successive pages are then no longer stored contiguously, so costs
      exceed log₂ F
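To make the log₂ F figure concrete, here is a minimal sketch of a binary search over the pages of a sorted file. It assumes each page is a sorted Python list of keys and counts one page transfer per probed page; the function name and the sample data are invented for the illustration.

```python
def find_first_page(pages, key):
    """Binary search over sorted pages; returns (page index, pages probed)."""
    lo, hi, reads = 0, len(pages) - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        reads += 1                    # one page transfer to inspect pages[mid]
        if pages[mid][-1] < key:      # every key on this page is too small
            lo = mid + 1
        else:
            hi = mid
    return lo, reads

# F = 1024 pages holding 4 sorted keys each (keys 0..4095).
pages = [list(range(i * 4, i * 4 + 4)) for i in range(1024)]
page_no, reads = find_first_page(pages, 2050)
print(page_no, reads)
```

With F = 1024 the search probes about log₂ 1024 = 10 pages, compared with roughly F/2 = 512 for a scan of a heap file.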
Index Structures
- An index on a relation is an access path to speed up
  selections on the search key fields of the index.
  - Any subset of the fields of a relation can be the search key for an
    index on the relation.
  - Search key is not the same as key (a minimal set of fields that
    uniquely identifies a record in a relation).
  - Primary keys are typically indexed automatically
- An index consists of records (called index entries) of the form
      (search-key, pointer)
- Index files are typically much smaller than the original file
Index Example
students
sid        name      birthdate  country
300697336  Peter     01.01.84   India
300673435  Ha Tschi  31.5.79    China
300136899  James     29.02.82   Australia
300304642  Nga       04.05.85   Singapore
300002001  Jesse     11.10.86   China
300254672  Ahmed     30.12.80   Pakistan

Index(name), each entry pointing to the matching row of students:
Ahmed, Ha Tschi, James, Jesse, Nga, Peter

- Ordered index: search keys are stored in sorted order
- Hash index: search keys are distributed uniformly across
  “buckets” using a “hash function”.
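The difference between the two kinds of index can be sketched on the students example. This is an in-memory illustration only: list positions stand in for record pointers, a Python dict stands in for the hash function plus buckets, and the helper names are invented.

```python
import bisect

# Rows of the students table (sid, name, birthdate, country).
students = [
    (300697336, "Peter", "01.01.84", "India"),
    (300673435, "Ha Tschi", "31.5.79", "China"),
    (300136899, "James", "29.02.82", "Australia"),
    (300304642, "Nga", "04.05.85", "Singapore"),
    (300002001, "Jesse", "11.10.86", "China"),
    (300254672, "Ahmed", "30.12.80", "Pakistan"),
]

# Ordered index: (search key, row position) pairs kept sorted by name.
ordered_index = sorted((row[1], pos) for pos, row in enumerate(students))

# Hash index: the dict plays the role of hash function and buckets.
hash_index = {}
for pos, row in enumerate(students):
    hash_index.setdefault(row[1], []).append(pos)

def lookup_ordered(name):
    """Equality search via binary search on the ordered index."""
    i = bisect.bisect_left(ordered_index, (name,))
    if i < len(ordered_index) and ordered_index[i][0] == name:
        return students[ordered_index[i][1]]
    return None

def range_ordered(lo, hi):
    """Names with lo <= name < hi; only the ordered index can do this cheaply."""
    start = bisect.bisect_left(ordered_index, (lo,))
    end = bisect.bisect_left(ordered_index, (hi,))
    return [name for name, _ in ordered_index[start:end]]

print(lookup_ordered("Nga"))
print(students[hash_index["Ahmed"][0]])
print(range_ordered("J", "K"))
```

Both structures answer equality lookups, but only the ordered index keeps neighbouring names next to each other, which is what makes the range lookup possible.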
Primary vs. Secondary Indices
- Primary index: in a sequentially ordered file, the index
  whose search key specifies the sequential order of the file.
  - Also called main index or clustering index
  - The search key of a primary index is usually, but not necessarily,
    the primary key.
- Secondary index: an index whose search key specifies an
  order different from the sequential order of the file.
  - Also called non-clustering index.
- A sequential scan using the primary index is efficient, but a
  sequential scan using a secondary index is expensive
  - Because each record access may fetch a new block from disk
Example: Primary Index
- Primary index on the branch-name field of account
  Source: Silberschatz/Korth/Sudarshan: Database System Concepts, 2002.
- Files with a primary index on the search key are also called
  index-sequential files (ISAM)
Example: Secondary Index
- Secondary index on the balance field of account
- As balance is not a candidate key, we need buckets as an
  indirection, holding pointers to all tuples with the same
  search-key value
  Source: Silberschatz/Korth/Sudarshan: Database System Concepts, 2002.
Index Definition in SQL
- Create an index:
      CREATE INDEX index-name ON relation-name (<attribute-list>)
  - E.g.:
      CREATE INDEX b_index ON branch (branch_name)
- Use CREATE UNIQUE INDEX to indirectly specify and
  enforce the condition that the search key is a candidate key.
  - Not really required if the SQL UNIQUE integrity constraint is supported
- To drop an index:
      DROP INDEX index-name
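The statements above can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: SQLite stands in for the DBMS, the branch table is reduced to two columns, the sample rows are made up, and the index names b_index and b_unique are chosen for the example (underscores are used because unquoted SQL identifiers cannot contain hyphens).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE branch (branch_name TEXT, assets INTEGER)")

# Plain index on the search key branch_name.
conn.execute("CREATE INDEX b_index ON branch (branch_name)")
# A unique index makes branch_name behave like a candidate key.
conn.execute("CREATE UNIQUE INDEX b_unique ON branch (branch_name)")

conn.execute("INSERT INTO branch VALUES ('Perryridge', 100)")
try:
    conn.execute("INSERT INTO branch VALUES ('Perryridge', 200)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True   # the unique index refused the duplicate key

conn.execute("DROP INDEX b_index")
remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(duplicate_rejected, remaining)
```

Other systems differ in detail (some require the table name in DROP INDEX, for example), so the exact syntax should be checked against the DBMS in use.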
Tree-Structured Indices
- Index Sequential Access Method (ISAM)
  - Ordered sequential file with a fixed (static) primary index.
  - Disadvantages of ISAM:
    - Performance degrades as the file grows, since many overflow
      blocks get created.
    - Periodic reorganization of the entire file is required.
- B+ Tree
  - Dynamic multi-level index structure
    - Reorganization of the entire file is not required to maintain
      performance.
  - Supports equality and range searches, multiple-attribute keys and
    partial key searches
  - Disadvantages: extra insertion and deletion overhead, and space
    overhead.
ISAM Tree
[Figure: ISAM tree, with non-leaf pages above a level of leaf pages]
- Low locking cost!
B+-Tree Structure
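[Figure: B+-tree with non-leaf pages above leaf pages; the leaf pages
are sorted by search key]
- Leaf level is a (sorted) linked list of index entries (clustered
  index: data entries)
- Non-leaf nodes hold index entries, which are only used to direct
  searches
- Layout of a non-leaf node with n pointers:
      P1 | K1 | P2 | ... | Pi | Ki | Pi+1 | ... | Pn-1 | Kn-1 | Pn
  - Pi leads to keys < Ki
  - Pi+1 leads to keys with Ki <= keys < Ki+1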
Example of a B+-Tree
                           Root: [17]
          Entries < 17                     Entries >= 17
            [5 | 14]                         [27 | 33]
  [2* 3*] [5* 7* 8*] [14* 16*]    [17* 24*] [27* 29*] [33* 34* 38* 39*]

- Note how the data entries in the leaves are sorted
  - Primary index: leaves hold the records themselves; otherwise they
    hold pointers to records
- Find 14? 29? All values > 20 and < 30?
- Insert/Delete:
  - Find the data entry in the leaf, then change it; the parent
    sometimes needs to be adjusted.
  - The change sometimes ‘bubbles up’ the tree
B+-Tree Index Structure
- A B+-tree is a rooted tree satisfying the following properties:
  - All paths from the root to the leaf index entries have the same length
    - i.e., it is a balanced tree
  - Each node (except the root) has at least ⌈n/2⌉ (pointers to) children.
    - The number n of pointers in a node is also called the fanout
      (typically > 100)
    - The root node can have between 1 and n children.
    - The search keys within a node are sorted.
  - Special cases:
    - If the root is not a leaf, it has at least 2 children.
    - If the root is a leaf (that is, there are no other nodes in the
      tree), it can hold between 0 and (n–1) search key values.
Queries on B+-Tree
- Find all records with a search-key value of k.
  1. Start with the root node
     - Examine the node for the smallest search-key value > k.
     - If such a value exists, assume it is Ki. Then follow Pi to the
       child node.
     - Otherwise k ≥ Kn–1, where there are n pointers in the node. Then
       follow Pn to the child node.
  2. Repeat the above procedure on each child node until a leaf node is
     reached.
  3. At the leaf node: if for some i, key Ki = k, follow pointer Pi to
     the desired record. Otherwise, no record with search-key value k
     exists.
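The lookup procedure can be sketched in a few lines of Python. This is a simplified in-memory illustration, not a disk-based implementation: the Node class, the leaf helper and the "rec..." strings standing in for record pointers are all invented here, and the tree is the one from the "Example of a B+-Tree" slide. A key equal to a separator is routed to the right, matching the node layout Ki <= keys < Ki+1.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # sorted search keys of this node
        self.children = children    # child pointers (non-leaf nodes only)
        self.records = records      # one record pointer per key (leaves only)

    @property
    def is_leaf(self):
        return self.children is None

def find(root, k):
    """Return the record pointer stored for key k, or None if k is absent."""
    node = root
    while not node.is_leaf:
        # Position of the smallest key > k. If no such key exists this is
        # len(keys), which selects the last pointer of the node.
        i = bisect.bisect_right(node.keys, k)
        node = node.children[i]
    i = bisect.bisect_left(node.keys, k)
    if i < len(node.keys) and node.keys[i] == k:
        return node.records[i]
    return None

def leaf(*keys):
    return Node(list(keys), records=[f"rec{key}" for key in keys])

root = Node([17], children=[
    Node([5, 14], children=[leaf(2, 3), leaf(5, 7, 8), leaf(14, 16)]),
    Node([27, 33], children=[leaf(17, 24), leaf(27, 29),
                             leaf(33, 34, 38, 39)]),
])

print(find(root, 14), find(root, 29), find(root, 15))
```

Every lookup visits exactly one node per level, so the cost is the height of the tree regardless of which key is requested.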
Updates on B+-Trees: Insertion
- Find the leaf node in which the search-key value would appear.
- If there is room in the leaf node, insert the (key-value, pointer)
  pair at the correct sorted position in the leaf node.
- Otherwise, split the leaf node as follows:
  - Rule 1: Leaf node split
    - Take the n (key-value, pointer) pairs (including the one being
      inserted) in sorted order. Place the first ⌈n/2⌉ in the original
      node, and the rest in a new node.
    - Let the new node be p, and let k be the least key value in p.
      Insert (k, p) in the parent of the node being split.
  - If the parent is full, split it and propagate the split further up
    using Rule 2: Index node split
    - Take the n (key, pointer) pairs (including the one being inserted)
      in sorted order. Remove the middle key km, then place the first
      (n-1)/2 pairs in the original node, and the rest in a new
      intermediate node.
    - Let the new node (pointer) be p, and let km be the middle key
      value not stored so far. Push the pair (km, p) up the tree (if
      there is no parent node, create a new root).
- Splitting of nodes proceeds upwards until a node that is not full is
  found. In the worst case the root may be split, increasing the tree
  height by 1.
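Rule 1 is easy to check on a small case. The sketch below covers only the leaf split, with a leaf modelled as a sorted list of (key, pointer) pairs; the function name and the sample entries are made up for the illustration.

```python
import math

def split_leaf(pairs):
    """Split n sorted (key, pointer) pairs as in Rule 1.

    Returns (left, right, k): the entries kept in the original node, the
    entries moved to the new node, and the least key k of the new node,
    which is inserted into the parent together with a pointer to it.
    """
    n = len(pairs)
    cut = math.ceil(n / 2)
    left, right = pairs[:cut], pairs[cut:]
    return left, right, right[0][0]

# A full leaf with 4 entries receives key 8 and overflows (n = 5).
overflowing = sorted([(5, "r5"), (7, "r7"), (14, "r14"), (16, "r16"),
                      (8, "r8")])
left, right, k = split_leaf(overflowing)
print(left, right, k)
```

Note that in a leaf split the separator key k is copied up and remains in the new leaf, whereas in an index node split (Rule 2) the middle key km is moved up and stored in neither half.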
Examples of B+-Tree Insertion
B+-Tree before and after insertion of “Clearview”
Updates on B+-Trees: Deletion
- Find the leaf node N holding the search-key value
- Remove (search-key-value, pointer) from leaf node N
- If the node has too few entries (< ⌈n/2⌉) due to the removal:
  - If N has a (left or right) sibling node S with more than ⌈n/2⌉
    entries:
    - Redistribute the entries between nodes N and S such that both
      have at least the minimum number of entries.
    - Update the corresponding search-key value in the parent of the
      node.
  - Otherwise choose any sibling of N:
    - Insert all the search-key values of the two nodes into a single
      node (the one on the left), and delete the other node.
    - Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted
      node, from its parent, recursively using the above procedure.
- The node deletions may cascade upwards until a node that has ⌈n/2⌉ or
  more pointers is found.
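The choice between redistributing and merging can be sketched for two adjacent leaves. This is a deliberately reduced illustration: leaves are plain sorted key lists, the minimum number of entries is passed in as a parameter instead of being derived from n, and the parent update is only reported as a return value. The function name is hypothetical.

```python
def fix_underflow(left, right, min_entries):
    """Repair two adjacent leaves after one dropped below min_entries.

    Returns (action, new_left, new_right, separator). After a merge the
    right node no longer exists, so new_right and separator are None.
    """
    donor = left if len(left) > len(right) else right
    if len(donor) > min_entries:
        # Redistribute: share the entries evenly between both nodes.
        combined = left + right
        cut = len(combined) // 2
        new_left, new_right = combined[:cut], combined[cut:]
        # The parent's separator becomes the least key of the right node.
        return "redistribute", new_left, new_right, new_right[0]
    # Merge: everything goes into the left node; the right one is deleted
    # and its (key, pointer) pair must then be removed from the parent.
    return "merge", left + right, None, None

print(fix_underflow([5, 7, 8], [14], 2))
print(fix_underflow([5, 7], [14], 2))
```

In the merge case the removal of the parent's entry may leave the parent with too few pointers, which is how deletions cascade upwards.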
Examples of B+-Tree Deletion
Before and after deleting “Downtown”
Hash Index
- Index entries are partitioned into buckets in accordance with a
  hash function, h(v), where v ranges over search key values
- Each bucket is identified by an address, a
- The bucket at address a contains all index entries with search
  key v such that h(v) = a
- A bucket is a unit of storage containing one or more records
  that is stored in a page (with a possible overflow chain)
- If index entries contain rows, the set of buckets forms an
  integrated storage structure (‘clustered file’); otherwise the set
  of buckets forms an (unclustered) secondary index
Equality Search with Hash Index
[Figure: location mechanism mapping a search key v to its bucket]
Given v:
1. Compute h(v)
2. Fetch bucket at h(v)
3. Search bucket
Cost = number of pages in bucket (cheaper than B+ tree, if no overflow
chains)
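The three steps can be sketched with a toy in-memory hash index. Everything here is an assumption made for illustration: the bucket count, the hash function (a sum of character codes, chosen only because it is deterministic), the helper names, and the record ids. The bucket numbers it produces do not correspond to those in any figure.

```python
NUM_BUCKETS = 8   # assumed number of buckets

def h(v):
    """Toy hash function: sum of character codes modulo the bucket count."""
    return sum(ord(c) for c in str(v)) % NUM_BUCKETS

# Each bucket holds (search key, record id) index entries.
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(v, rid):
    buckets[h(v)].append((v, rid))

def search(v):
    """Equality search: compute h(v), fetch that bucket, search it."""
    return [rid for key, rid in buckets[h(v)] if key == v]

for rid, name in enumerate(["Perryridge", "Round Hill", "Brighton",
                            "Perryridge"]):
    insert(name, rid)

print(search("Perryridge"), search("Downtown"))
```

Only one bucket is ever inspected per lookup, which is why the cost depends on the length of that bucket's overflow chain rather than on the size of the file.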
Example of Hash File Organisation
Hash file organization of account file, using branch-name as key
e.g. h(Perryridge) = 5
h(Round Hill) = 3 h(Brighton) = 3
Hash Indices - Problems
- Does not support range search
  - Since adjacent elements in a range might hash to different buckets,
    there is no efficient way to scan buckets to locate all search key
    values v between v1 and v2
- Although it supports multi-attribute keys, it does not support
  partial key search
  - The entire value of v must be provided to h
- Dynamically growing files produce overflow chains, which
  negate the efficiency of the algorithm
Choosing an Index
- An index should support a query of the application that has
  a significant impact on performance
  - Choice based on frequency of invocation, execution time, acquired
    locks, table size

Example 1:
    SELECT E.Id
    FROM Employee E
    WHERE E.Salary < :upper AND E.Salary > :lower
- This is a range search on Salary.
- Since the primary key is Id, it is likely that there is a clustered,
  main index on that attribute, which is of no use for this query.
- Choose a secondary B+ tree index with search key Salary
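Example 1 can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for the DBMS (its indexes are B-trees), the Employee table is reduced to two columns, the salaries are generated, and the index name emp_salary is made up. The text of the query plan is engine-specific, so it is printed rather than relied upon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [(i, 30000 + 500 * i) for i in range(1, 201)])

# Secondary index with search key Salary.
conn.execute("CREATE INDEX emp_salary ON Employee (Salary)")

query = ("SELECT E.Id FROM Employee E "
         "WHERE E.Salary < :upper AND E.Salary > :lower")
params = {"lower": 40000, "upper": 42000}

ids = sorted(row[0] for row in conn.execute(query, params))
# Ask the engine which access path it picked for the range search.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)]
print(ids)
print(plan)
```

Dropping the index and re-running EXPLAIN QUERY PLAN is a quick way to see the access path fall back to a scan of the table.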
Choosing An Index (cont’d)
Example 2:
    SELECT E.sid
    FROM Enrolled E
    WHERE E.grade = :grade
- This is an equality search on grade.
  - Since the primary key is (sid, CourseId), it is likely that
    there is a main, clustered index on these attributes,
    which is of no use for this query.
- Choose a secondary B+ tree or hash index with search key grade
Choosing an Index (cont’d)
Example 3:
    SELECT E.CourseCode, E.grade
    FROM Enrolled E
    WHERE E.StudId = :sid AND E.grade = 'D'
- Equality search on StudId and grade.
- If the primary key is (StudId, CourseId), it is likely that there
  is a main, clustered index on this sequence of attributes.
  - If the main index is a B+ tree, it can be used for this search,
    because StudId is a prefix of its search key.
  - If the main index is a hash index, it cannot be used for this
    search. Choose a B+ tree or hash index with search key StudId
    (since grade is not as selective as StudId) or (StudId, grade)
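Why the B+ tree on (StudId, CourseId) helps while the hash index does not can be sketched with a sorted list standing in for the ordered leaf level. The enrolment rows, course codes and helper names below are invented for the illustration, and StudId is assumed to be an integer so that sid + 1 is a valid upper bound.

```python
import bisect

# Hypothetical entries sorted on the composite key (StudId, CourseId),
# each carrying the grade as its data.
entries = sorted([
    (1001, "COMP5138", "D"),
    (1001, "COMP5028", "C"),
    (1002, "COMP5138", "D"),
    (1003, "COMP5045", "P"),
    (1001, "INFO5990", "D"),
])

def by_student(sid):
    """Partial-key (prefix) search: all entries whose StudId equals sid."""
    start = bisect.bisect_left(entries, (sid,))
    end = bisect.bisect_left(entries, (sid + 1,))
    return entries[start:end]

# A hash index on the full key can only be probed with both attributes.
hash_index = {(sid, course): grade for sid, course, grade in entries}

d_courses = [course for _, course, grade in by_student(1001) if grade == "D"]
print(d_courses)
print(hash_index.get((1001, "COMP5138")))
```

The sorted structure keeps all entries of one student adjacent, so the remaining condition on grade is checked on a handful of rows; the hash index offers no way to enumerate a student's entries without knowing every CourseId.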
Summary
- A DBMS stores data persistently on secondary storage
  - May use the file system, but with its own dynamic data layout
  - Problem: access gap between main memory and disk
- The buffer manager tries to minimize disk accesses by caching
  data in main memory.
- RAID systems improve the throughput and availability of disks.
- Indexes greatly help speed up query evaluation.
  - A relation can have several indexes, each with a different
    search key.
  - Most commonly used in DBMS: B-tree index
  - Interesting for data warehousing: bitmap index
Next Week
- Query Processing and Optimization
  - Query Parsing and Translation
  - Query Evaluation
  - Query Optimization
    - Cost-based
    - Heuristic
- Textbook
  - Chapter 12
  - plus Section 14.4