Hashing

Principles of Database
Management Systems
4.2: Hashing Techniques
Pekka Kilpeläinen
(after Stanford CS245 slide originals by
Hector Garcia-Molina, Jeff Ullman and
Jennifer Widom)
DBMS 2001
Notes 4.2: Hashing
1
Hashing?
• Locating the storage block of a record
by the hash value h(k) of its key k
• Normally really fast
– records (often) located by a single disk
access
DBMS 2001
Notes 4.2: Hashing
2
Hashing
<key>
key  h(key)
.
.
.
DBMS 2001
Notes 4.2: Hashing
Buckets
(typically 1
disk block)
3
Two alternatives
(1) Hash value determines the storage block directly
.
.
.
records
key  h(key)
.
.
.
• to implement a primary index
DBMS 2001
Notes 4.2: Hashing
4
Two alternatives
(2) Records located indirectly via index buckets
key  h(key)
key 1
record
Index
• for a secondary index
DBMS 2001
Notes 4.2: Hashing
5
Example hash function
• Key = ‘x1 x2 … xn’ n byte character string
• Have b buckets
• h = (x1 + x2 + … + xn) mod b
 {0, 1, …, b-1}
DBMS 2001
Notes 4.2: Hashing
6
 This may not be best function …
Good hash
function:
Expected number of
keys/bucket is the
same for all buckets
 Read Knuth Vol. 3 if you really
need to select a good function.
DBMS 2001
Notes 4.2: Hashing
7
Next: example to illustrate
inserts, overflows, deletes
h(K)
DBMS 2001
Notes 4.2: Hashing
8
EXAMPLE 2 records/bucket
INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0
0
d
1
a
c
2
b
e
3
h(e) = 1
DBMS 2001
Notes 4.2: Hashing
9
EXAMPLE: deletion
Delete:
e
f
c
0
a
1
b
c d
e
2
3
DBMS 2001
f
g
d
maybe move
“g” up
Notes 4.2: Hashing
10
Rule of thumb:
• Try to keep space utilization
between 50% and 80%
Utilization =
# keys used
total # keys that fit
• If < 50%, wasting space
• If > 80%, overflows significant
depends on how good hash
function is & on # keys/bucket
DBMS 2001
Notes 4.2: Hashing
11
How do we cope with growth?
• Overflows and reorganizations
• Dynamic hashing: # of buckets may
vary
• Extensible
• Linear
• also others ...
DBMS 2001
Notes 4.2: Hashing
12
Extensible hashing: two ideas
(a) Use i of b bits output by hash function
b
h(K) 
For example, b=32
00110101
use i  grows over time….
DBMS 2001
Notes 4.2: Hashing
13
(b) Use directory
h(K)[i ]
.
.
.
to bucket
.
.
.
Directory contains 2i pointers to buckets, and stores i.
Each bucket stores j, indicating #bits used for placing
the records in this block (j  i)
DBMS 2001
Notes 4.2: Hashing
14
Extensible Hashing: Insertion
• If there's room in bucket h(k)[i], place record
there; Otherwise …
• If j=i, set i=i+1 and double the directory
• If j<i, split the block in two, distribute records
among them now using j+1 bits of h(k);
(Repeat until some records end up in the new
bucket); Update pointers of bucket array
• See the next example
DBMS 2001
Notes 4.2: Hashing
15
Example: h(k) is 4 bits; 2 keys/block
i= 1
1 (j)
0001
DBMS 2001
00
01
10
1 2
1001
1010 1100
Insert 1010
i=2
11
1 2
1100
Notes 4.2: Hashing
New directory
16
Example continued
i= 2
00
01
10
11
Insert:
0111
2
0000
0001
1 2
0001 0111
0111
2
1001
1010
2
1100
0000
DBMS 2001
Notes 4.2: Hashing
17
Example continued
0000 2
i= 2
0001
00
0111 2
01
10
11
Insert:
1001
DBMS 2001
1001 3
1001
1010 1001 2 3
1010
1100 2
Notes 4.2: Hashing
i=3
000
001
010
011
100
101
110
111
18
Extensible hashing: deletion
• Reverse insert procedure …
• Example:
Walk thru insert example in reverse!
DBMS 2001
Notes 4.2: Hashing
19
Summary
+
Extensible hashing
Can handle growing files
- without full reorganizations
+
-
Only one data block examined
Indirection
(Not bad if directory in memory)
-
Directory doubles in size
(First it fits in memory, then it does not
-> sudden performance degradation)
DBMS 2001
Notes 4.2: Hashing
20
Linear hashing:
grow # of buckets by one
Two ideas:
(a) Use i low order bits of hash
(b) File grows linearly
b
01110101
grows
i
-> No bucket directory needed
DBMS 2001
Notes 4.2: Hashing
21
Linear Hashing: Parameters
• n: number of buckets in use
– buckets numbered 0…n-1
• i: number of bits of h(k) used to address


i

log(n
)


buckets


• r: number of records in hash table
– ratio r/n limited to fit an avg bucket in a block
– next example: r  1.7n, and block holds 2 records
=> AVG bucket occupancy is  1.7/2 = 0.85 of a block
DBMS 2001
Notes 4.2: Hashing
22
Example: 2 keys/block, b=4 bits, n=2, i =1
• insert 0101
0000
1010
00
0101
1111
01
Rule
• now r=4 >1.7n
-> get new bucket 10
and distribute keys btw
buckets 00 and 10
If h(k)[i ] = (a1 … ai)2 < n, then
look at bucket h(k)[i ]; else
look at bucket h(k)[i ] - 2i -1 = (0a2 … ai)2
DBMS 2001
Notes 4.2: Hashing
23
-> n=3, i =2;
distribute keys btw buckets 00 and 10:
0000
1010
00
DBMS 2001
0101
1111
01
1010
10
Notes 4.2: Hashing
24
n=3, i =2; insert 0001:
0001
• can have overflow chains!
0000
00
DBMS 2001
0101
1111
01
1010
10
Notes 4.2: Hashing
25
n=3, i =2
0001
0111
0000
00
DBMS 2001
0101
1111
01
• insert 0111
1010
10
Notes 4.2: Hashing
• bucket 11 not in use
-> redirect to 01
• now r=6 > 1.7n
-> get new bucket 11
26
n=4, i =2; distribute keys btw 01 and 11
0001
0111
0000
00
DBMS 2001
0101
1111
0001
01
1010
10
1111
0111
11
Notes 4.2: Hashing
27
Example Continued: How to grow beyond this?
i=2 3
0000
0101
0101
0 00
0 01
101
1010
010
110
0101
0101
1111
0 11
111
100
101
...
m = 11 (max used block)
100
101
DBMS 2001
Notes 4.2: Hashing
28
Summary
+
Linear Hashing
Can handle growing files
- without full reorganizations
No indirection directory of extensible hashing
- Can have overflow chains
+
- but probability of long chains can be
kept low by controlling the r/n fill ratio (?)
DBMS 2001
Notes 4.2: Hashing
29
Summary
Hashing
- How it works
- Dynamic hashing
- Extensible
- Linear
DBMS 2001
Notes 4.2: Hashing
30
Next:
• Indexing vs Hashing
• Index definition in SQL
DBMS 2001
Notes 4.2: Hashing
31
Indexing vs Hashing
• Hashing good for probes given key
e.g.,
SELECT …
FROM R
WHERE R.A = 5
DBMS 2001
Notes 4.2: Hashing
32
Indexing vs Hashing
• INDEXING (Including B-Trees) good for
Range Searches:
e.g.,
SELECT
FROM R
WHERE R.A > 5
DBMS 2001
Notes 4.2: Hashing
33
Index definition in SQL
• Create index name on rel (attr)
• Create unique index name on rel (attr)
defines candidate key
• Drop INDEX name
DBMS 2001
Notes 4.2: Hashing
34
Note CANNOT SPECIFY TYPE OF INDEX
(e.g. B-tree, Hashing, …)
OR PARAMETERS
(e.g. Load Factor, Size of Hash,...)
... at least in SQL …
Oracle and IBM DB2 UDB provide a PCTFREE
clause to inditate the proportion of B-tree
blocks initially left unfilled
Oracle: “Hash clusters” with built-in or DBAspecified hash function
DBMS 2001
Notes 4.2: Hashing
35
The
BIG picture….
• Chapters 2 & 3: Storage, records, blocks...
• Chapter 4: Access Mechanisms
- Indexes
- B trees
- Hashing
• Chapters 6 & 7: Query Processing NEXT
DBMS 2001
Notes 4.2: Hashing
36