CS522 Advanced database Systems 8. Hash index Huiping Guo Department of Computer Science California State University, Los Angeles Outline Introduction to hash index Static hash index Dynamic hash index Extendible hash Linear hash 8. Hash CS522_S16 Intuition behind hash indexing Purpose Reduce page I/Os Hash indexing Organize the data entry pages into a number of buckets based on a search key Each bucket contains a primary page and overflow pages Use a hash function to locate the bucket 8. Hash CS522_S16 Static Hashing h(key) mod N key 0 1 h N-1 Primary bucket pages Overflow pages 8. Hash CS522_S16 Static Hashing (Cont.) Buckets contain data entries Sorted? Number of primary pages fixed allocated sequentially never de-allocated overflow pages if needed. 8. Hash CS522_S16 Static Hashing (Cont.) Properties of the hash function Should distribute the search key uniformly Popular hash functions h(key) = (a * key + b) MOD N usually works well. • a and b are constants h(key) MOD N 8. Hash CS522_S16 Problems with static hashing Long overflow chains can develop degrade performance Dynamic hashing Extendible hashing Linear hashing 8. Hash CS522_S16 Extendible Hashing Bucket (primary page) becomes full Overflow pages Why not re-organize file by doubling number of buckets? Reading and writing all pages is expensive! 8. Hash CS522_S16 Extendible Hashing idea Use directory of pointers to buckets Directory much smaller than file double the number of buckets by doubling the directory splitting just the bucket that overflowed! Doubling it is much cheaper Only one page of data entries is split No overflow page! Trick lies in how hash function is adjusted! 8. Hash CS522_S16 Example LOCAL DEPTH GLOBAL DEPTH 2 2 4* 12* 32* 16* Bucket A 2 00 1* 5* 21* 13* Bucket B 01 10 2 11 10* 2* 6* DIRECTORY 22* Bucket C 2 15* 7* 19* 35* Bucket D DATA PAGES 8. Hash CS522_S16 Search Directory is array of size 4. To find bucket for r Computer h(r) Take last `global depth’ bits of h(r) If h(r) = 5 = binary 101, it is in bucket pointed to by 01. 8. Hash CS522_S16 Insert If bucket is full, split it (allocate new page, re- distribute) If necessary, double the directory splitting a bucket does not always require doubling we can tell by comparing global depth with local depth for the split bucket.) 8. Hash CS522_S16 Insert h(r)=20 (Causes Doubling) LOCAL DEPTH GLOBAL DEPTH 2 LOCAL DEPTH 2 4* 12* 32* 16* 1* 5* 21*13* Bucket B 00 2 Bucket A 2 2 5* 21* 13* 32*16* GLOBAL DEPTH 2 1* 2 01 10 2 11 10* Bucket C 10* DIRECTORY DIRECTORY 2 15* 7* 19* Bucket D 2 15* 7* 19* 2 4* 12* 20* DATA PAGES 8. Hash CS522_S16 Bucket A2 (`split image' of Bucket A) Insert h(r)=20 (Cont.) LOCAL DEPTH 3 32* 16* Bucket A GLOBAL DEPTH 2 3 1* 5* 21* 13* Bucket B 000 001 010 2 011 10* Bucket C 100 101 2 110 15* 7* 19* Bucket D 111 3 DIRECTORY 8. Hash 4* 12* 20* Bucket A2 (`split image' of Bucket A) CS522_S16 How to search h(r)=20 20 = binary 10100 Last 2 bits (00) tell us r belongs in A or A2 Last 3 bits needed to tell which. Global depth of directory Max number of bits needed to tell which bucket an entry belongs to. Local depth of a bucket number of bits used to determine if an entry belongs to this bucket. 8. Hash CS522_S16 When double directory? Bucket split does not always cause directory doubling Only split if • local depth of bucket = global depth Initially all local depths are equal to the global depth At any time, global depth is always equal to the maximum local depth 8. Hash CS522_S16 Exercise Insert h(r) = 9 Insert h(r) = 6 Insert h(r) = 27 Insert h(r) = 11 8. Hash CS522_S16 Delete The data entry is located and removed If the bucket is empty Merge with its split image Decrease the local depth If each directory element points to the same bucket as its split image Halve the directory if necessary Reduce the global depth 8. Hash CS522_S16 Comments on Extendible Hashing Equality search Whether directory can fit in memory One or two If the distribution of hash values is skewed Multiple entries with same hash value cause problems! Directory can grow large. Choice of good hash function 8. Hash CS522_S16 Linear Hashing An alternative to Extendible Hashing LH handles the problem of long overflow chains without using a directory May have overflow pages The number of overflow pages is under control 8. Hash CS522_S16 Basic idea Use a family of hash functions h is some hash function h0, h1, h2, … hi(key) = h(key) mod(2iN), N is an initial number (power of 2) range is from 0 to N-1 hi+1 doubles the range of hi similar to directory doubling Ex: N = 32 i=0, h0 = h mod 32, check the last 5 bits i=1, h1 = h mod (2x32), check the last 6 bits, … 8. Hash CS522_S16 Rounds of splitting Current round number is Level hlevel and hlevel+1 are in use Splitting proceeds in `rounds’ Round ends when all initial buckets are split Two kinds of buckets have been split yet to be split 8. Hash CS522_S16 Some notations Level Indicate the current round number Initialized to 0 N: Number of buckets at the beginning of round 0 Nlevel The number of buckets at the beginning of round level Nlevel = N*2level Next: the bucket to split 8. Hash CS522_S16 Overview of LH File In the middle of a round. Bucket to be split Next Buckets that existed at the beginning of this round: this is the range of Buckets split in this round: If h Level ( search key value ) is in this range, must use h Level+1 ( search key value ) to decide if entry is in `split image' bucket. hLevel `split image' buckets: created (through splitting of other buckets) in this round 8. Hash CS522_S16 Example of a Linear Hashed file Level=0, N=4 h h 1 0 000 00 001 01 010 10 011 11 PRIMARY Next=0 PAGES 32*44* 36* 9* 25* 5* 14* 18*10* 30* Data entry r with h(r)=5 Primary bucket page 31*35* 7* 11* 8. Hash CS522_S16 Insert 43: split, overflow page Level=0, N=4 h h 1 0 000 00 001 01 010 10 011 11 (This info is for illustration only!) Level=0 PRIMARY Next=0 PAGES h 32*44* 36* Data entry r with h(r)=5 9* 25* 5* 14* 18*10*30* Primary bucket page 31*35* 7* 11* (The actual contents of the linear hashed file) 8. Hash h PRIMARY PAGES 1 0 000 00 001 Next=1 9* 25* 5* 01 010 10 011 11 100 00 OVERFLOW PAGES 32* 14* 18*10*30* 31*35* 7* 11* 44* 36* CS522_S16 43* Insert 37*, 29*: split, no overflow page Level=0 h h Level=0 PRIMARY PAGES 1 0 000 00 001 Next=1 9* 25* 5* 37* 01 OVERFLOW PAGES 10 011 11 100 00 h 1 0 000 00 001 01 PRIMARY PAGES 32* split 010 h 14* 18*10*30* 31*35* 7* 11* 43* 44* 36* 32* 9* 25* Next=2 010 10 011 11 100 00 14* 18*10* 30* 31*35* 7* 11* 44* 36* 5* 37* 29* 8. Hash OVERFLOW PAGES CS522_S16 43* Insert 22*, 66*, 34* Level=0 Level=0 h h 1 000 001 0 00 01 OVERFLOW PAGES h1 h0 000 00 32* 001 01 9* 25* 010 10 011 11 100 00 44*36* 44* 36* 101 01 5* 37*29* 5* 37* 29* 110 PRIMARY PAGES 10 011 11 100 00 OVERFLOW PAGES 32* 9* 25* Next=2 010 PRIMARY PAGES 14* 18*10* 30* 31*35* 7* 11* 43* 8. Hash 18* 10* 66* 34* Next=3 31*35* 7* 11* 14*30*22* CS522_S16 43* Insert 50: split, increment level Level=1 h1 PRIMARY PAGES h0 Next=0 Level=0 h1 h0 000 00 001 010 01 10 PRIMARY PAGES OVERFLOW PAGES 000 00 32* 001 01 9* 25* 010 10 18* 10* 66* 34* 011 11 43* 35* 11* 100 00 44* 36* 101 11 5* 37* 29* OVERFLOW PAGES 32* 9* 25* 18* 10* 66* 34* Next=3 31*35* 7* 11* 43* 011 11 100 00 44*36* 101 01 5* 37*29* 110 10 14* 30* 22* 110 10 14*30*22* 111 11 31*7* 8. Hash CS522_S16 50* Insert summary Find bucket by applying hLevel / hLevel+1: If bucket to insert into is full: Add overflow page and insert data entry. (Maybe) Split Next bucket and increment Next. Can choose any criterion to `trigger’ split. On split, hLevel+1 is used to re-distribute entries. 8. Hash CS522_S16 Search To find bucket for data entry r find hLevel(r): If hLevel(r) points to an unsplit bucket , r belongs here. Else, r could belong to bucket hLevel(r) or split bucket • must apply hLevel+1(r) to find out. 8. Hash CS522_S16 Deletion The inverse of insertion If the last bucket in the file is empty Remove the bucket Next is decreased by 1 If the last bucket becomes empty AND Next is 0 Next is made to point to (M/2) – 1 • M is the current number of buckets Level is decreased by 1 The last empty bucket is removed 8. Hash CS522_S16 Linear Hashing vs Extendible Hashing The two schemes are actually quite similar Moving from hi to hi+1 Double the directory Image there is a ‘directory’ in LH 8. Hash CS522_S16 Summary of hash indexing Hash-based indexes are best for equality selections Cannot support range searches. Static and dynamic hashing techniques exist trade-offs similar to ISAM vs. B+ trees. 8. Hash CS522_S16
© Copyright 2026 Paperzz