Introduction - csns - California State University, Los Angeles

CS522 Advanced database
Systems
8.
Hash index
Huiping Guo
Department of Computer Science
California State University, Los Angeles
Outline
 Introduction to hash index
 Static hash index
 Dynamic hash index
 Extendible hash
 Linear hash
8. Hash
CS522_S16
Intuition behind hash indexing
 Purpose
 Reduce page I/Os
 Hash indexing
 Organize the data entry pages into a number of buckets
based on a search key
 Each bucket contains a primary page and overflow pages
 Use a hash function to locate the bucket
8. Hash
CS522_S16
Static Hashing
h(key) mod N
key
0
1
h
N-1
Primary bucket pages Overflow pages
8. Hash
CS522_S16
Static Hashing (Cont.)
 Buckets contain data entries
 Sorted?
 Number of primary pages fixed
 allocated sequentially
 never de-allocated
 overflow pages if needed.
8. Hash
CS522_S16
Static Hashing (Cont.)
 Properties of the hash function
 Should distribute the search key uniformly
 Popular hash functions
 h(key) = (a * key + b) MOD N usually works well.
•

a and b are constants
h(key) MOD N
8. Hash
CS522_S16
Problems with static hashing
 Long overflow chains can develop
 degrade performance

Dynamic hashing


Extendible hashing
Linear hashing
8. Hash
CS522_S16
Extendible Hashing
 Bucket (primary page) becomes full
 Overflow pages
 Why not re-organize file by doubling number of buckets?

Reading and writing all pages is expensive!
8. Hash
CS522_S16
Extendible Hashing idea

Use directory of pointers to buckets



Directory much smaller than file




double the number of buckets by doubling the directory
splitting just the bucket that overflowed!
Doubling it is much cheaper
Only one page of data entries is split
No overflow page!
Trick lies in how hash function is adjusted!
8. Hash
CS522_S16
Example
LOCAL DEPTH
GLOBAL DEPTH
2
2
4* 12* 32* 16*
Bucket A
2
00
1*
5* 21* 13*
Bucket B
01
10
2
11
10* 2* 6*
DIRECTORY
22*
Bucket C
2
15* 7* 19* 35*
Bucket D
DATA PAGES
8. Hash
CS522_S16
Search
 Directory is array of size 4.
 To find bucket for r



Computer h(r)
Take last `global depth’ bits of h(r)
If h(r) = 5 = binary 101, it is in bucket pointed to by
01.
8. Hash
CS522_S16
Insert
 If bucket is full, split it (allocate new page, re-
distribute)
 If necessary, double the directory
 splitting a bucket does not always require doubling

we can tell by comparing global depth with local depth for the
split bucket.)
8. Hash
CS522_S16
Insert h(r)=20 (Causes Doubling)
LOCAL DEPTH
GLOBAL DEPTH
2
LOCAL DEPTH
2
4* 12* 32* 16*
1* 5* 21*13* Bucket B
00
2
Bucket A
2
2
5* 21* 13*
32*16*
GLOBAL DEPTH
2
1*
2
01
10
2
11
10*
Bucket C
10*
DIRECTORY
DIRECTORY
2
15* 7* 19*
Bucket D
2
15* 7* 19*
2
4* 12* 20*
DATA PAGES
8. Hash
CS522_S16
Bucket A2
(`split image'
of Bucket A)
Insert h(r)=20 (Cont.)
LOCAL DEPTH
3
32* 16* Bucket A
GLOBAL DEPTH
2
3
1* 5* 21* 13* Bucket B
000
001
010
2
011
10*
Bucket C
100
101
2
110
15* 7* 19*
Bucket D
111
3
DIRECTORY
8. Hash
4* 12* 20*
Bucket A2
(`split image'
of Bucket A)
CS522_S16
How to search h(r)=20
 20 = binary 10100
 Last 2 bits (00) tell us r belongs in A or A2
 Last 3 bits needed to tell which.

Global depth of directory


Max number of bits needed to tell which bucket an entry
belongs to.
Local depth of a bucket

number of bits used to determine if an entry belongs to
this bucket.
8. Hash
CS522_S16
When double directory?
 Bucket split does not always cause directory
doubling

Only split if
•

local depth of bucket = global depth
Initially all local depths are equal to the global
depth

At any time, global depth is always equal to the maximum
local depth
8. Hash
CS522_S16
Exercise
 Insert h(r) = 9
 Insert h(r) = 6
 Insert h(r) = 27
 Insert h(r) = 11
8. Hash
CS522_S16
Delete
 The data entry is located and removed
 If the bucket is empty
 Merge with its split image
 Decrease the local depth
 If each directory element points to the same
bucket as its split image


Halve the directory if necessary
Reduce the global depth
8. Hash
CS522_S16
Comments on Extendible Hashing
 Equality search
 Whether directory can fit in memory
 One or two
 If the distribution of hash values is skewed
 Multiple entries with same hash value cause problems!
 Directory can grow large.
 Choice of good hash function
8. Hash
CS522_S16
Linear Hashing
 An alternative to Extendible Hashing
 LH handles the problem of long overflow chains
without using a directory
 May have overflow pages

The number of overflow pages is under control
8. Hash
CS522_S16
Basic idea
 Use a family of hash functions



h is some hash function


h0, h1, h2, …
hi(key) = h(key) mod(2iN), N is an initial number (power of 2)
range is from 0 to N-1
hi+1 doubles the range of hi

similar to directory doubling
 Ex: N = 32



i=0, h0 = h mod 32, check the last 5 bits
i=1, h1 = h mod (2x32), check the last 6 bits,
…
8. Hash
CS522_S16
Rounds of splitting

Current round number is Level




hlevel and hlevel+1 are in use
Splitting proceeds in `rounds’
Round ends when all initial buckets are split
Two kinds of buckets


have been split
yet to be split
8. Hash
CS522_S16
Some notations
 Level


Indicate the current round number
Initialized to 0
 N: Number of buckets at the beginning of round 0
 Nlevel


The number of buckets at the beginning of round level
Nlevel = N*2level
 Next: the bucket to split
8. Hash
CS522_S16
Overview of LH File
 In the middle of a round.
Bucket to be split
Next
Buckets that existed at the
beginning of this round:
this is the range of
Buckets split in this round:
If h Level ( search key value )
is in this range, must use
h Level+1 ( search key value )
to decide if entry is in
`split image' bucket.
hLevel
`split image' buckets:
created (through splitting
of other buckets) in this round
8. Hash
CS522_S16
Example of a Linear Hashed file
Level=0, N=4
h
h
1
0
000
00
001
01
010
10
011
11
PRIMARY
Next=0 PAGES
32*44* 36*
9* 25* 5*
14* 18*10* 30*
Data entry r
with h(r)=5
Primary
bucket page
31*35* 7* 11*
8. Hash
CS522_S16
Insert 43: split, overflow page
Level=0, N=4
h
h
1
0
000
00
001
01
010
10
011
11
(This info
is for illustration
only!)
Level=0
PRIMARY
Next=0 PAGES
h
32*44* 36*
Data entry r
with h(r)=5
9* 25* 5*
14* 18*10*30*
Primary
bucket page
31*35* 7* 11*
(The actual contents
of the linear hashed
file)
8. Hash
h
PRIMARY
PAGES
1
0
000
00
001
Next=1
9* 25* 5*
01
010
10
011
11
100
00
OVERFLOW
PAGES
32*
14* 18*10*30*
31*35* 7* 11*
44* 36*
CS522_S16
43*
Insert 37*, 29*: split, no overflow
page
Level=0
h
h
Level=0
PRIMARY
PAGES
1
0
000
00
001
Next=1
9* 25* 5* 37*
01
OVERFLOW
PAGES
10
011
11
100
00
h
1
0
000
00
001
01
PRIMARY
PAGES
32*
split
010
h
14* 18*10*30*
31*35* 7* 11*
43*
44* 36*
32*
9* 25*
Next=2
010
10
011
11
100
00
14* 18*10* 30*
31*35* 7* 11*
44* 36*
5* 37* 29*
8. Hash
OVERFLOW
PAGES
CS522_S16
43*
Insert 22*, 66*, 34*
Level=0
Level=0
h
h
1
000
001
0
00
01
OVERFLOW
PAGES
h1
h0
000
00
32*
001
01
9* 25*
010
10
011
11
100
00
44*36*
44* 36*
101
01
5* 37*29*
5* 37* 29*
110
PRIMARY
PAGES
10
011
11
100
00
OVERFLOW
PAGES
32*
9* 25*
Next=2
010
PRIMARY
PAGES
14* 18*10* 30*
31*35* 7* 11*
43*
8. Hash
18* 10* 66* 34*
Next=3
31*35* 7* 11*
14*30*22*
CS522_S16
43*
Insert 50: split, increment level
Level=1
h1
PRIMARY
PAGES
h0
Next=0
Level=0
h1
h0
000
00
001
010
01
10
PRIMARY
PAGES
OVERFLOW
PAGES
000
00
32*
001
01
9* 25*
010
10
18* 10* 66* 34*
011
11
43* 35* 11*
100
00
44* 36*
101
11
5* 37* 29*
OVERFLOW
PAGES
32*
9* 25*
18* 10* 66* 34*
Next=3
31*35* 7* 11*
43*
011
11
100
00
44*36*
101
01
5* 37*29*
110
10
14* 30* 22*
110
10
14*30*22*
111
11
31*7*
8. Hash
CS522_S16
50*
Insert summary
 Find bucket by applying hLevel / hLevel+1:
 If bucket to insert into is full:




Add overflow page and insert data entry.
(Maybe) Split Next bucket and increment Next.
Can choose any criterion to `trigger’ split.
On split, hLevel+1 is used to re-distribute entries.
8. Hash
CS522_S16
Search
To find bucket for data entry r
 find hLevel(r):
 If hLevel(r) points to an unsplit bucket , r belongs
here.
 Else, r could belong to bucket hLevel(r) or split
bucket
• must apply hLevel+1(r) to find out.
8. Hash
CS522_S16
Deletion
 The inverse of insertion
 If the last bucket in the file is empty
 Remove the bucket
 Next is decreased by 1
 If the last bucket becomes empty AND Next is 0
 Next is made to point to (M/2) – 1
• M is the current number of buckets


Level is decreased by 1
The last empty bucket is removed
8. Hash
CS522_S16
Linear Hashing vs Extendible
Hashing
 The two schemes are actually quite similar
 Moving from hi to hi+1
 Double the directory
 Image there is a ‘directory’ in LH
8. Hash
CS522_S16
Summary of hash indexing
 Hash-based indexes are best for equality
selections

Cannot support range searches.
 Static and dynamic hashing techniques exist
 trade-offs similar to ISAM vs. B+ trees.
8. Hash
CS522_S16