COMP5138 Relational Database Management Systems
Lecture 10: Storage and Indexing – Part II
L9 Storage&Indexing

Today’s Agenda
- Storage
  - Disk
  - Buffer Management
  - RAID Systems
  - File Organization
- Indexing
  - Access Paths
  - Tree-structured Indexing
  - Hash-based Indexing
  - Bitmap Index

Access Path
- A way of retrieving tuples from a table: a scan or an index-based selection condition (the ways in which rows of a table can be retrieved).
- Refers to the algorithm plus data structure (e.g., an index) used for retrieving and storing data in a table.
- The choice of access path for executing an SQL statement has no effect on the semantics of the statement.
- This choice can have a major effect on the execution time of the statement.

Organization of Records in Files
- Heap: a record can be placed anywhere in the file where there is space.
- Sequential: records are stored in sequential order, based on the value of the search key of each record.
- Index based
  - Trees
  - Hashing: a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed.

Heap Files
- Rows are appended to the end of the file as they are inserted.
  - Hence the file is unordered.
- Deleted rows create gaps in the file.
  - The file must be periodically compacted to recover space.
- Performance (assume the file contains F pages):
  - Inserting a row: access path is a scan.
    - Avg. F/2 page transfers if the row already exists (row is unique).
    - F+1 page transfers if the row does not already exist (F reads to confirm its absence, then 1 write).
  - Deleting a row:
    - Avg. F/2+1 page transfers if the row exists (find the row, then write the page back).
    - F page transfers if the row does not exist.
  - Query: access path is a scan.
    - Efficient if all rows are returned (SELECT * FROM table).
    - Very inefficient if only a few rows are requested.

Sorted File
- Rows are sorted based on some attribute(s).
  - Access path is binary search.
  - An equality or range query on that attribute costs about log2 F page transfers to retrieve the page containing the first row.
  - Successive rows are in the same (or successive) page(s), so cache hits are likely.
  - By storing all pages on the same track, seek time can be minimized.
- Problem: maintaining sorted order.
  - After the correct position for an insert has been determined, inserting the row requires (on average) F/2 reads and F/2 writes, because shifting is necessary to make space.
  - Partial solution 1: leave empty space in each page.
  - Partial solution 2: use overflow pages (chains).
    - Successive pages are then no longer stored contiguously, so costs exceed log2 F.

Index Structures
- An index on a relation is an access path that speeds up selections on the search-key fields of the index.
  - Any subset of the fields of a relation can be the search key for an index on the relation.
  - A search key is not the same as a key (a minimal set of fields that uniquely identifies a record in a relation).
  - Primary keys are typically indexed automatically.
- An index consists of records (called index entries) of the form (search-key, pointer).
- Index files are typically much smaller than the original file.

Index Example

students
sid        name      birthdate  country
300697336  Peter     01.01.84   India
300673435  Ha Tschi  31.5.79    China
300136899  James     29.02.82   Australia
300304642  Nga       04.05.85   Singapore
300002001  Jesse     11.10.86   China
300254672  Ahmed     30.12.80   Pakistan

Index(name): Ahmed, Ha Tschi, James, Jesse, Nga, Peter

- Ordered index: search keys are stored in sorted order.
- Hash index: search keys are distributed uniformly across “buckets” using a “hash function”.
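
The difference between an ordered index and a hash index on students(name) can be sketched in Python. This is only an illustration, not how a DBMS lays out index pages: row positions in a list stand in for record pointers, only the sid and name columns are kept, and `toy_hash`, `NUM_BUCKETS`, and all function names are hypothetical.

```python
import bisect

# Rows of the students table from the slide, reduced to (sid, name).
STUDENTS = [
    (300697336, "Peter"),
    (300673435, "Ha Tschi"),
    (300136899, "James"),
    (300304642, "Nga"),
    (300002001, "Jesse"),
    (300254672, "Ahmed"),
]
NAME = 1          # column position of the search key
NUM_BUCKETS = 4   # hypothetical bucket count


def toy_hash(value, num_buckets=NUM_BUCKETS):
    """Hypothetical hash function h(v): sum of character codes mod bucket count."""
    return sum(ord(ch) for ch in value) % num_buckets


def build_ordered_index(rows, col):
    """Ordered index: (search key, row position) entries kept in sorted order."""
    return sorted((row[col], pos) for pos, row in enumerate(rows))


def range_lookup(index, low, high):
    """Row positions with low <= key <= high: binary search, then scan forward."""
    start = bisect.bisect_left(index, (low,))
    positions = []
    for key, pos in index[start:]:
        if key > high:
            break
        positions.append(pos)
    return positions


def build_hash_index(rows, col):
    """Hash index: each entry goes to the bucket at address h(search key)."""
    buckets = {}
    for pos, row in enumerate(rows):
        buckets.setdefault(toy_hash(row[col]), []).append((row[col], pos))
    return buckets


def equality_lookup(buckets, key):
    """Fetch the single bucket at h(key) and search only that bucket."""
    return [pos for k, pos in buckets.get(toy_hash(key), []) if k == key]


ordered = build_ordered_index(STUDENTS, NAME)
hashed = build_hash_index(STUDENTS, NAME)
print([STUDENTS[p][NAME] for p in range_lookup(ordered, "H", "K")])
print(equality_lookup(hashed, "Nga"))
```

The range lookup relies on the entries being sorted, so it is only available on the ordered index; the hash index answers an equality lookup by touching one bucket but gives no help for a range of names.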

Primary vs. Secondary Indices
- Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
  - Also called main index or clustering index.
  - The search key of a primary index is usually, but not necessarily, the primary key.
- Secondary index: an index whose search key specifies an order different from the sequential order of the file.
  - Also called non-clustering index.
- A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive.
  - Because each record access may fetch a new block from disk.

Example: Primary Index
- Primary index on the branch-name field of account.
  Source: Silberschatz/Korth/Sudarshan: Database System Concepts, 2002.
- Files with a primary index on the search key are also called index-sequential files (ISAM).

Example: Secondary Index
- Secondary index on the balance field of account.
- As balance is not a candidate key, we need buckets as an indirection, with pointers to all tuples that share the same search-key value.
  Source: Silberschatz/Korth/Sudarshan: Database System Concepts, 2002.

Index Definition in SQL
- Create an index:
    CREATE INDEX name ON relation-name (<attributelist>)
  - E.g.: create index b-index on branch(branch-name)
- Use CREATE UNIQUE INDEX to indirectly specify and enforce the condition that the search key is a candidate key.
  - Not really required if the SQL unique integrity constraint is supported.
- To drop an index:
    DROP INDEX index-name

Tree-Structured Indices
- Index Sequential Access Method (ISAM)
  - Ordered sequential file with a fixed (static) primary index.
  - Disadvantage of ISAM: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
- B+ Tree
  - Dynamic multi-level index structure; reorganization of the entire file is not required to maintain performance.
  - Supports equality and range searches, multi-attribute keys, and partial key searches.
  - Disadvantages: extra insertion and deletion overhead, and space overhead.

ISAM Tree
- [Figure: ISAM tree with non-leaf pages above the leaf pages]
- Low locking cost!

B+-Tree Structure
- Non-leaf pages sit above the leaf pages; the leaf pages are sorted by search key.
- The leaf level is a (sorted) linked list of index entries (clustered: data entries).
- Non-leaf nodes hold index entries, which are used only to direct searches.
- Node layout: P1 K1 P2 ... Pi Ki Pi+1 ... Pn-1 Kn-1 Pn
  - Pi leads to keys < Ki.
  - Pi+1 leads to keys with Ki <= keys < Ki+1.

Example of a B+-Tree
  Root:           [17]
  Entries < 17:   [5 | 14]    leaves: 2* 3* | 5* 7* 8* | 14* 16*
  Entries >= 17:  [27 | 33]   leaves: 17* 24* | 27* 29* | 33* 34* 38* 39*
- Note how the data entries in the leaves are sorted.
  - Primary index: leaves hold the records themselves; otherwise they hold pointers to records.
- Find 14? 29? All values > 20 and < 30?
- Insert/Delete:
  - Find the data entry in the leaf, then change it; the parent sometimes needs to be adjusted.
  - The change sometimes "bubbles up" the tree.

B+-Tree Index Structure
- A B+-tree is a rooted tree satisfying the following properties:
  - All paths from the root to leaf index entries have the same length, i.e., it is a balanced tree.
  - Each node (except the root) has at least ⌈n/2⌉ (pointers to) children.
    - The number n of pointers in a node is also called the fanout (typically > 100).
    - The root node can have between 1 and n children.
    - The search keys within a node are sorted.
  - Special cases:
    - If the root is not a leaf, it has at least 2 children.
    - If the root is a leaf (that is, there are no other nodes in the tree), it can hold between 0 and (n-1) search-key values.

Queries on B+-Tree
Find all records with a search-key value of k.
1. Start with the root node.
   - Examine the node for the smallest search-key value > k.
   - If such a value exists, call it Ki; then follow Pi to the child node.
   - Otherwise k >= Kn-1, where there are n pointers in the node; then follow Pn to the child node.
2. Repeat the above procedure until a leaf node is reached.
3. At the leaf node: if for some i, key Ki = k, follow pointer Pi to the desired record. Otherwise no record with search-key value k exists.

Updates on B+-Trees: Insertion
- Find the leaf node in which the search-key value would appear.
- If there is room in the leaf node, insert the (key-value, pointer) pair at the correct sorted position in the leaf node.
- Otherwise, split the leaf node as follows:
  - Rule 1: Leaf node split
    - Take the n (key-value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
    - Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
  - If the parent is full, split it and propagate the split further up using Rule 2: Index node split
    - Take the n (key, pointer) pairs (including the one being inserted) in sorted order. Remove the middle key km, then place the first (n-1)/2 pairs in the original node, and the rest in a new intermediate node.
    - Let the new node (pointer) be p, and let km be the middle key value not stored so far. Push the pair (km, p) up the tree (if there is no parent node, create a new root).
- Splitting of nodes proceeds upwards until a node that is not full is found.
  - In the worst case the root is split, increasing the tree height by 1.
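
The search procedure and the leaf split (Rule 1) described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a full B+-tree: leaves hold bare keys with record pointers omitted, the parent update and the index node split (Rule 2) are left out, the example tree is the one from the earlier slide as best it can be read, and the fanout n = 5 as well as the names `Node`, `find_leaf`, and `insert_into_leaf` are hypothetical.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass, field
from math import ceil


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty list marks a leaf


def find_leaf(node, k):
    """Descend from the root to the leaf that would contain key k."""
    while node.children:
        # bisect_right counts the keys <= k; that count is the 0-based index
        # of the pointer to follow (the last pointer when k >= every key).
        node = node.children[bisect_right(node.keys, k)]
    return node


def insert_into_leaf(leaf, key, n):
    """Insert key into a leaf that may hold at most n - 1 keys.

    Returns None if the key fit, otherwise the pair (k, new_leaf) that
    Rule 1 says must be inserted into the parent (not done here).
    """
    insort(leaf.keys, key)
    if len(leaf.keys) <= n - 1:
        return None
    cut = ceil(n / 2)                    # first ceil(n/2) entries stay put
    new_leaf = Node(leaf.keys[cut:])     # the rest move to a new node
    leaf.keys = leaf.keys[:cut]
    return new_leaf.keys[0], new_leaf    # k = least key in the new node


# Example tree as read from the slide; n = 5 is an assumption.
N = 5
leaves = [Node(ks) for ks in ([2, 3], [5, 7, 8], [14, 16],
                              [17, 24], [27, 29], [33, 34, 38, 39])]
root = Node([17], [Node([5, 14], leaves[:3]), Node([27, 33], leaves[3:])])

print(29 in find_leaf(root, 29).keys)
print(insert_into_leaf(find_leaf(root, 36), 36, N))
```

Inserting 36 overflows the rightmost leaf, so the sketch returns the pair that would still have to be inserted into the parent; a complete implementation would continue with that step and, if the parent is full, with Rule 2.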

Examples of B+-Tree Insertion
- [Figure: B+-tree before and after insertion of “Clearview”]

Updates on B+-Trees: Deletion
- Find the leaf node N holding the search-key value.
- Remove (search-key value, pointer) from leaf node N.
- If the node has too few entries (< ⌈n/2⌉) due to the removal:
  - If N has a (left or right) sibling node S with more than ⌈n/2⌉ entries:
    - Redistribute the entries between nodes N and S such that both have at least the minimum number of entries.
    - Update the corresponding search-key value in the parent of the node.
  - Otherwise choose any sibling of N:
    - Insert all the search-key values of the two nodes into a single node (the one on the left), and delete the other node.
    - Delete the pair (Ki-1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.
- The node deletions may cascade upwards until a node that has ⌈n/2⌉ or more pointers is found.

Examples of B+-Tree Deletion
- [Figure: B+-tree before and after deleting “Downtown”]

Hash Index
- Index entries are partitioned into buckets in accordance with a hash function h(v), where v ranges over search-key values.
  - Each bucket is identified by an address a.
  - The bucket at address a contains all index entries with search key v such that h(v) = a.
- A bucket is a unit of storage containing one or more records, stored in a page (with a possible overflow chain).
- If the index entries contain rows, the set of buckets forms an integrated storage structure ('clustered file'); otherwise the set of buckets forms an (unclustered) secondary index.

Equality Search with Hash Index
- Location mechanism, given v:
  1. Compute h(v).
  2. Fetch the bucket at h(v).
  3. Search the bucket.
- Cost = number of pages in the bucket (cheaper than a B+ tree if there are no overflow chains).

Example of Hash File Organisation
- Hash file organization of the account file, using branch-name as key.
- E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3

Hash Indices - Problems
- Does not support range search.
  - Since adjacent elements in a range might hash to different buckets, there is no efficient way to scan buckets to locate all search-key values v between v1 and v2.
- Although it supports multi-attribute keys, it does not support partial key search.
  - The entire value of v must be provided to h.
- Dynamically growing files produce overflow chains, which negate the efficiency of the algorithm.

Choosing an Index
- An index should support a query of the application that has a significant impact on performance.
  - The choice is based on frequency of invocation, execution time, acquired locks, and table size.
- Example 1:
    SELECT E.Id
    FROM Employee E
    WHERE E.Salary < :upper AND E.Salary > :lower
  - This is a range search on Salary.
  - Since the primary key is Id, it is likely that there is a clustered, main index on that attribute, which is of no use for this query.
  - Choose a secondary B+ tree index with search key Salary.

Choosing an Index (cont’d)
- Example 2:
    SELECT E.sid
    FROM Enrolled E
    WHERE E.grade = :grade
  - This is an equality search on grade.
  - Since the primary key is (sid, CourseId), it is likely that there is a main, clustered index on these attributes, which is of no use for this query.
  - Choose a secondary B+ tree or hash index with search key grade.

Choosing an Index (cont’d)
- Example 3:
    SELECT E.CourseCode, E.grade
    FROM Enrolled E
    WHERE E.StudId = :sid AND E.grade = 'D'
  - Equality search on StudId and grade.
  - If the primary key is (StudId, CourseId), it is likely that there is a main, clustered index on this sequence of attributes.
    - If the main index is a B+ tree, it can be used for this search (a partial key search on StudId).
    - If the main index is a hash, it cannot be used for this search (the entire key value must be provided).
  - Choose a B+ tree or hash index with search key StudId (since grade is not as selective as StudId) or (StudId, grade).

Summary
- A DBMS stores data persistently on secondary storage.
  - It may use the file system, but with its own dynamic data layout.
  - Problem: access gap between main memory and disk.
- The buffer manager tries to minimize disk accesses by caching data in main memory.
- RAID systems improve the throughput and availability of disks.
- Indexes greatly help to speed up query evaluation.
  - There can be several indexes on a given relation, each with a different search key.
  - Most commonly used in DBMS: B-tree index.
  - Interesting for data warehousing: bitmap index.

Next Week
- Query Processing and Optimization
  - Query Parsing and Translation
  - Query Evaluation
  - Query Optimization
    - Cost-based
    - Heuristic
- Textbook
  - Chapter 12
  - plus Section 14.4