Hash - People Server at UNCW

CSC 231: Introduction to
Data Structures
Hash
Dr. Curry Guinn
Next two weeks
• Next week is spring break
• The following week you have a quiz on
Tuesday (Mar 14)
• And a homework due Thursday (Mar 16)
Today
• Hashing
– Plus we will learn how to install python packages (libraries)
– And do simple graphing in Python!
Hash tables
• Hash table: a list of some fixed size,
that positions elements according to
an algorithm called a hash function
[Figure: elements (e.g., strings) are passed through the hash function h(element), which maps each one to an index between 0 and length – 1 in the hash table]
Hashing, hash functions
– The idea: somehow we map every element into
some index in the list
• "hash" it
• This is the one and only place it should go
– Lookup becomes constant-time: simply look at
that one slot again later to see if the element is
there
– add, remove, contains all become O(1) !
If the key is an integer …
• For now, let's look at integers
– a "hash function" h for int is trivial:
store int i at index i (a direct mapping)
• If i >= len(alist),
then store i at index i % len(alist)
– h(i) = i % len(alist)
Hash function example
• Elements = Integers
• h(i) = i % 10
• Add 41, 34, 7, and 18
• Constant-time lookup:
– just look at i % 10 again later
• We lose all ordering information:
– Here’s what you can’t do quickly:
– getMin, getMax, removeMin, removeMax
– printing items in sorted order
index:  0   1   2   3   4   5   6   7   8   9
value:  -   41  -   -   34  -   -   7   18  -
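A minimal Python sketch of this example, assuming a table of size 10 and ignoring collisions for now. The names `TABLE_SIZE`, `table`, `h`, `add`, and `contains` are illustrative, not from the course code.

```python
TABLE_SIZE = 10                 # assumed size, matching h(i) = i % 10
table = [None] * TABLE_SIZE     # None marks an empty slot


def h(i):
    """Hash an integer to a slot index."""
    return i % TABLE_SIZE


def add(i):
    """Store i in its one slot (this sketch ignores collisions)."""
    table[h(i)] = i


def contains(i):
    """Constant-time lookup: check the single slot that i hashes to."""
    return table[h(i)] == i


for value in (41, 34, 7, 18):
    add(value)

print(table)         # [None, 41, None, None, 34, None, None, 7, 18, None]
print(contains(34))  # True
print(contains(35))  # False
```

Note that nothing in `table` records which element is smallest or largest, which is why getMin, getMax, and sorted printing cannot be done quickly.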
Hash collisions
• Collision: the event that two hash table
elements map into the same slot in the
list
• Example: add 41, 34, 7, 18, then 21
– 21 hashes into the same slot as 41!
– 21 should not replace 41 in the hash
table; they should both be there
Collision resolution: means for fixing
collisions in a hash table
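[Figure: the table as pictured, with 21 in slot 1, where 41 had been stored]
index:  0   1   2   3   4   5   6   7   8   9
value:  -   21  -   -   34  -   -   7   18  -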
Separate Chaining
• Chaining: All keys that map to the same hash
value are kept in a linked list
[Figure: each slot points to a chain of the keys that hash to it]
0: 10
1:
2: 22 → 12 → 42
3:
4:
5:
6:
7: 107
8:
9:
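A small sketch of separate chaining in Python. As a simplifying assumption it uses a Python list for each chain rather than a hand-built linked list, and the class name `ChainedHashTable` is made up for illustration.

```python
class ChainedHashTable:
    """Hash table of integers where each slot holds a chain of keys."""

    def __init__(self, size=10):
        # One independent (initially empty) chain per slot.
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        return key % len(self.slots)

    def add(self, key):
        chain = self.slots[self._index(key)]
        if key not in chain:        # do not store duplicates
            chain.append(key)

    def contains(self, key):
        # Only the one chain the key hashes to needs to be searched.
        return key in self.slots[self._index(key)]


t = ChainedHashTable()
for key in (10, 22, 107, 12, 42):
    t.add(key)

print(t.slots[2])       # [22, 12, 42]
print(t.slots[7])       # [107]
print(t.contains(42))   # True
```

Lookups stay fast only while the chains stay short, which is why an even spread of keys matters.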
Linear probing
• Linear probing: resolving collisions in slot i
by putting the colliding element into the
next available slot (i+1, i+2, ...)
– Add 41, 34, 7, 18, then 21, then 57
• 21 collides (41 is already there), so we search ahead
until we find empty slot 2
• 57 collides (7 is already there), so we search ahead
twice until we find empty slot 9
– Lookup algorithm becomes slightly modified; we
have to loop now until we find the element or an
empty slot
• What happens when the table gets mostly full?
index:  0   1   2   3   4   5   6   7   8   9
value:  -   41  21  -   34  -   -   7   18  57
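Linear probing for integer keys could be sketched like this. The function names `add` and `contains` and the use of `None` for an empty slot are assumptions made for this example.

```python
def add(table, key):
    """Insert key using linear probing; return the slot that was used."""
    size = len(table)
    start = key % size
    for step in range(size):
        slot = (start + step) % size      # wrap around the end of the list
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def contains(table, key):
    """Probe forward until we find key or reach an empty slot."""
    size = len(table)
    start = key % size
    for step in range(size):
        slot = (start + step) % size
        if table[slot] is None:
            return False                  # empty slot: key cannot be further on
        if table[slot] == key:
            return True
    return False                          # scanned the whole (full) table


table = [None] * 10
for key in (41, 34, 7, 18, 21, 57):
    add(table, key)

print(table)                # [None, 41, 21, None, 34, None, None, 7, 18, 57]
print(contains(table, 57))  # True
print(contains(table, 27))  # False
```

As the table fills up, both loops run longer, which is the point of the question about a mostly full table.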
Hash function in action
• Add these
elements to the
hash table:
– 89
– 18
– 49
– 58
– 9
Clustering problem
• Clustering: nodes being placed close
together by probing, which degrades
hash table's performance
– Add 89, 18, 49, 58, 9
– Now searching for the value 28 has to check
about half the hash table (slots 8, 9, 0, 1, 2, and 3)
before it reaches an empty slot: no longer constant time...
index:  0   1   2   3   4   5   6   7   8   9
value:  49  58  9   -   -   -   -   -   18  89
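To see the cost of clustering, the sketch below counts how many slots a linear-probing search examines. `linear_add` and `probes_to_search` are hypothetical helper names, and the table is assumed never to be completely full.

```python
def linear_add(table, key):
    """Insert key with linear probing (assumes at least one empty slot)."""
    size = len(table)
    slot = key % size
    while table[slot] is not None:
        slot = (slot + 1) % size
    table[slot] = key


def probes_to_search(table, key):
    """Count the slots examined by a linear-probing search for key."""
    size = len(table)
    slot = key % size
    probes = 1
    while table[slot] is not None and table[slot] != key and probes < size:
        slot = (slot + 1) % size
        probes += 1
    return probes


table = [None] * 10
for key in (89, 18, 49, 58, 9):
    linear_add(table, key)

print(table)                        # [49, 58, 9, None, None, None, None, None, 18, 89]
print(probes_to_search(table, 28))  # 6
print(probes_to_search(table, 35))  # 1
```

A key that hashes into the cluster pays for every element in it, while a key that hashes to an untouched region (such as 35) is resolved in one probe.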
Quadratic probing
• Quadratic probing: resolving collisions
on slot i by putting the colliding
element into slot i+1, i+4, i+9, i+16, ...
(wrapping around the end of the list)
– add 89, 18, 49, 58, 9
• 49 collides (89 is already there), so we search
ahead by +1 to empty slot 0
• 58 collides (18 is already there), so we search
ahead by +1 to occupied slot 9, then +4 to
empty slot 2
• 9 collides (89 is already there), so we search
ahead by +1 to occupied slot 0, then +4 to
empty slot 3
– Clustering is reduced
index:  0   1   2   3   4   5   6   7   8   9
value:  49  -   58  9   -   -   -   -   18  89
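A sketch of quadratic probing for integer keys, assuming offsets are taken from the home slot and wrapped with the modulo operator. The name `quadratic_add` is illustrative.

```python
def quadratic_add(table, key):
    """Insert key, probing home, home+1, home+4, home+9, ... (mod size)."""
    size = len(table)
    home = key % size
    for k in range(size):
        slot = (home + k * k) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    # Unlike linear probing, this sequence does not visit every slot,
    # so insertion can fail even when the table still has empty slots.
    raise RuntimeError("no empty slot found by quadratic probing")


table = [None] * 10
for key in (89, 18, 49, 58, 9):
    quadratic_add(table, key)

print(table)  # [49, None, 58, 9, None, None, None, None, 18, 89]
```

Compared with linear probing on the same keys, the elements are spread over slots 0, 2, and 3 instead of forming one solid run.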
Quadratic probing in action
Load factor
• load factor: ratio of elements to
capacity
• The book uses the symbol lambda (λ)
for the load factor
– load factor = size / capacity
– λ = 6 / 10 = 0.6
index:  0   1   2   3   4   5   6   7   8   9
value:  -   41  21  -   34  -   -   7   18  57
Increasing the hash table size
• If the load factor is high, increase the size of the
hash table's list and re-store all of the items
in the new list using the hash function
– Can we just copy the old contents to the larger
array?
– When should we do this? Some options:
• When the load reaches a certain level (e.g., λ = 0.5)
• When an insertion fails
Why Increase the Size?
• What is the cost (Big-O) of increasing?
• What is a good hash table list size?
– How much bigger should a hash table get when it
grows?
Hash table removal
• lazy removal: instead of actually removing
elements, replace them with a special
REMOVED value
– avoids expensive re-shuffling of elements on
remove
– example: remove 18 -->
index:  0   1   2   3   4   5   6   7   8        9
value:  -   41  21  -   34  -   -   7   REMOVED  57
Lazy Removal
– Lookup algorithm becomes slightly modified
• What should we do when we hit a slot containing the
REMOVED value?
– Keep going
– Add algorithm becomes slightly modified
• What should we do when we hit a slot containing the
REMOVED value?
– use that slot, replace REMOVED with the new value
– add(17) --> slot 8
index:  0   1   2   3   4   5   6   7   8        9
value:  -   41  21  -   34  -   -   7   REMOVED  57
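Lazy removal with linear probing might look like the following sketch. The `REMOVED` sentinel object and the helper names `find_slot`, `remove`, and `add` are assumptions made for this example.

```python
REMOVED = object()   # sentinel that marks a lazily removed slot


def find_slot(table, key):
    """Return the slot holding key, or None. REMOVED slots are skipped."""
    size = len(table)
    for step in range(size):
        slot = (key % size + step) % size
        if table[slot] is None:
            return None                    # a truly empty slot ends the search
        if table[slot] is not REMOVED and table[slot] == key:
            return slot
    return None


def remove(table, key):
    """Replace key with the REMOVED marker instead of shifting elements."""
    slot = find_slot(table, key)
    if slot is not None:
        table[slot] = REMOVED


def add(table, key):
    """Insert key into the first empty or REMOVED slot on its probe path."""
    existing = find_slot(table, key)
    if existing is not None:
        return existing                    # already present
    size = len(table)
    for step in range(size):
        slot = (key % size + step) % size
        if table[slot] is None or table[slot] is REMOVED:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [None, 41, 21, None, 34, None, None, 7, 18, 57]
remove(table, 18)
print(find_slot(table, 57))  # 9 (the search keeps going past REMOVED)
print(add(table, 17))        # 8 (the REMOVED slot is reused)
```

If `remove` set the slot back to `None`, the search for 57 would stop at slot 8 and wrongly report that 57 is missing.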
Hashing practice problem
• Draw a diagram of the state of a hash table of
size 10, initially empty, after adding the
following elements:
7, 84, 31, 57, 44, 19, 27, 14, and 64
• Assume that the hash table uses linear
probing.
• Repeat the above problem using quadratic
probing.
Writing a hash function
• If we write a hash table that can store objects,
we need a hash function for the objects, so
that we know at which index to store them
We want a hash function to:
1. Be simple/fast to compute
2. Map equal elements to the same index
3. Map different elements to different indexes
4. Have keys distributed evenly among indexes
Hash functions
• Would Social Security numbers be a good hash
value for a database of students?
• Student names?
• Student ID numbers?
Folding Method
• Hash function for integers
• Break the key into several equal-sized parts and add
them together. Then take the remainder after dividing by len(slots)
– 436-555-4601
– 43+65+55+46+01 = 210.
– If we assume our hash table has 11 slots, then we
need to perform the extra step of dividing by 11 and keeping the remainder:
210 % 11 is 1
– So the phone number 436-555-4601 hashes to
slot 1
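The folding method can be sketched in Python as below. The function name `folding_hash` and the `chunk` parameter (digits per part, 2 in the phone-number example) are assumptions for illustration.

```python
def folding_hash(key, num_slots, chunk=2):
    """Split the digits of key into chunk-sized parts, add them, then mod."""
    # Keep only the digits, so "436-555-4601" becomes "4365554601".
    digits = "".join(ch for ch in str(key) if ch.isdigit())
    parts = [int(digits[i:i + chunk]) for i in range(0, len(digits), chunk)]
    return sum(parts) % num_slots


print(folding_hash("436-555-4601", 11))  # 1
```

Here the parts are 43, 65, 55, 46, and 1, whose sum is 210, and 210 % 11 is 1.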
Mid-Square Method
• We first square the item, and then extract
some portion of the resulting digits.
• For example, if the item were 44, we would
first compute 44² = 1,936.
• By extracting the middle two digits, 93, and
performing the remainder step, we get 5
(93 % 11)
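A possible Python sketch of the mid-square method. Which digits count as "the middle" when the square has an odd number of digits is not specified on the slide, so the choice made here (and the name `mid_square_hash`) is an assumption.

```python
def mid_square_hash(item, num_slots):
    """Square the item, take the middle two digits, then mod by table size."""
    squared = str(item * item)
    mid = len(squared) // 2
    start = max(mid - 1, 0)             # guard for one-digit squares
    middle = squared[start:mid + 1]     # two digits around the middle
    return int(middle) % num_slots


print(mid_square_hash(44, 11))  # 5
```

For 44 the square is 1936, the middle two digits are 93, and 93 % 11 is 5.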
Hash function for strings
• Elements = Strings
• Let's view a string by its letters:
– String s : s_0, s_1, s_2, …, s_{n-1}
• How do we map a string into an integer index?
(how do we "hash" it?)
• One possible hash function:
– Treat first character as an int, and hash on that
• h(s) = s_0 % len(slots)
• is this a good hash function? When will strings collide?
Better string hash functions
• View a string by its letters:
– String s : s_0, s_1, s_2, …, s_{n-1}
• Treat each character as an int, sum them,
and hash on that
• $h(s) = \left( \sum_{i=0}^{n-1} s_i \right) \bmod \mathrm{len(slots)}$
• What's wrong with this hash function? When
will strings collide?
An even better function
• Third option:
– perform a weighted sum of the letters, and hash
on that
– $h(s) = \left( \sum_{i=0}^{k-1} s_i \cdot 37^{i} \right) \bmod \mathrm{len(slots)}$, where k is the number of characters in s
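The three string hash functions discussed so far can be sketched in Python as follows. The function names are made up for this example, `ord` supplies each character's integer value, and strings are assumed to be non-empty.

```python
def first_char_hash(s, num_slots):
    """Hash on the first character only (s must be non-empty)."""
    return ord(s[0]) % num_slots


def sum_hash(s, num_slots):
    """Hash on the sum of all character codes."""
    return sum(ord(ch) for ch in s) % num_slots


def weighted_hash(s, num_slots):
    """Hash on a weighted sum: character i is multiplied by 37**i."""
    return sum(ord(ch) * 37 ** i for i, ch in enumerate(s)) % num_slots


# Words starting with the same letter always collide under first_char_hash.
print(first_char_hash("apple", 11) == first_char_hash("avocado", 11))  # True

# Anagrams always collide under sum_hash, because addition ignores order.
print(sum_hash("stop", 101) == sum_hash("pots", 101))                  # True

# The weighted sum depends on position, so these anagrams are separated.
print(weighted_hash("stop", 101) == weighted_hash("pots", 101))        # False
```

The weighted version can still produce collisions, but they no longer follow an obvious pattern such as shared first letters or shared letter sets.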
Analysis of hash table search
• load: the load λ of a hash table is the ratio:
$\lambda = \frac{N}{M}$, where N = no. of elements and M = array size
Analysis of hash table search
• Analysis of search, with chaining:
– unsuccessful: λ
(the average length of a list at hash(i))
– successful: 1 + (λ/2)
(one node, plus half the avg. length of a list
(not including the item))
Analysis of Hashing with Linear
Probing
• Analysis of search, with
linear probing:
– unsuccessful: $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$
– successful: $\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$
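To get a feel for these estimates, the sketch below evaluates the standard linear-probing approximations, ½(1 + 1/(1 − λ)²) for an unsuccessful search and ½(1 + 1/(1 − λ)) for a successful one, at a few load factors. The function name `linear_probing_probes` is illustrative.

```python
def linear_probing_probes(load):
    """Return (unsuccessful, successful) expected probes for 0 <= load < 1."""
    unsuccessful = 0.5 * (1 + 1 / (1 - load) ** 2)
    successful = 0.5 * (1 + 1 / (1 - load))
    return unsuccessful, successful


for load in (0.25, 0.5, 0.75, 0.9):
    unsuccessful, successful = linear_probing_probes(load)
    print("load={:.2f}  unsuccessful={:6.2f}  successful={:5.2f}".format(
        load, unsuccessful, successful))
```

At λ = 0.5 this gives about 2.5 probes for an unsuccessful search and 1.5 for a successful one. At λ = 0.9 the unsuccessful cost is about 50 probes, which is why tables are resized well before they fill up.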
Linear Probing
[Figure: Load Factor vs. Number of Operations. x-axis: Load Factor (0 to 1); y-axis: Operations (0 to 250); series: Successful, Unsuccessful]
Linear Probing Detail
[Figure: Load Factor vs. Number of Operations. x-axis: Load Factor (0.4 to 0.8); y-axis: Operations (0 to 10); series: Successful, Unsuccessful]
Making the list bigger
• When the load factor exceeds a threshold, double
the table size (smallest prime > 2 * old table size).
• Rehash each record in the old table into the new
table.
• Expensive: O(N) work done in copying.
• However, if the threshold is large (e.g., ½), then we
need to rehash only once per O(N) insertions, so the
cost is “amortized” constant-time.
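Growing and rehashing could be sketched like this for a linear-probing table of integers. The helper names `is_prime`, `next_prime_above`, and `rehash` are assumptions, as is the use of `None` for an empty slot.

```python
def is_prime(n):
    """Trial-division primality test (fine for table-sized numbers)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime_above(n):
    """Return the smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def rehash(old_table):
    """Build a table about twice as large and re-insert every key: O(N)."""
    new_size = next_prime_above(2 * len(old_table))
    new_table = [None] * new_size
    for key in old_table:
        if key is not None:
            slot = key % new_size          # index depends on the NEW size
            while new_table[slot] is not None:
                slot = (slot + 1) % new_size
            new_table[slot] = key
    return new_table


old_table = [None, 41, 21, None, 34, None, None, 7, 18, 57]
new_table = rehash(old_table)
print(len(new_table))      # 23
print(new_table.index(41)) # 18, since 41 % 23 == 18
```

Each key has to be hashed again because its index depends on the table length, so simply copying the old slots into a longer list would leave most keys where lookups can no longer find them.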
Some lab activities
• Let’s see how good/bad the different string hashing algorithms are
• Here’s the plan:
– Implement each hash function
– Read in all English words
– Hash each word and count how many times the same hash value is
generated
– Graph it to see if there are any recognizable patterns
• Goals
– Hash functions should evenly distribute the values across all the “bins”
– Why?
• Because that will reduce the number of collisions
• And collisions are the main thing that prevents hash tables from
achieving true O(1) behavior
Create a file with the different hash
functions
• terribleHash: Uses the first character modulo size of
hash table.
• betterHash: Sums all the characters modulo size of
hash table.
• evenBetterHash: Sums each character times its
position in the string modulo size of hash table
• bestHash: Sums each character times 31**position
modulo size of hash table
Use these files as a helpful guide
• wordsEn.txt is a list of English words
• HashingExperiment.py contains a little code
for getting some things done
• We need to install matplotlib (see next page)
Installing Packages
• Packages (libraries) can be installed in python either through the PyCharm
interface or at the command line. I’m going to use the command line
• In Windows, open a Command Prompt with Administrator privileges!
– Under the Start menu find Command Prompt, right-click, and choose Run as Administrator.
– On a Mac, just open a terminal window.
• I usually “cd” to the Python directory, but as long as the path is set up
correctly you shouldn’t have to.
– On the lab machines, it’s under Program Files/Python35
• Run “pip install matplotlib”
– It should download everything and install without any other prompting
Here’s what we want the code to
do
1. Open the file of words, read them in, and add them
to a list
– This is missing from HashingExperiment.py.
• Add that code first
2. Create a table that keeps track of how many times a
word is hashed to some hash value (DONE)
3. For each word, hash it. Increase the appropriate
counter. (DONE)
4. Plot the graph. (DONE)
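The four steps might be sketched as below. This is not the contents of HashingExperiment.py: the function names, the `(word, table_size)` hash signature, and the one-word-per-line file layout are all assumptions. `matplotlib` is imported inside `plot_counts`, so the counting code runs even before the package is installed.

```python
def load_words(path):
    """Step 1: read one word per line into a list (assumed file layout)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def count_hashes(words, hash_function, table_size):
    """Steps 2 and 3: count how many words land on each hash value."""
    counts = [0] * table_size
    for word in words:
        counts[hash_function(word, table_size)] += 1
    return counts


def plot_counts(counts, title):
    """Step 4: bar chart of words per hash value (needs matplotlib)."""
    import matplotlib.pyplot as plt
    plt.bar(range(len(counts)), counts)
    plt.xlabel("Hash value")
    plt.ylabel("Number of words")
    plt.title(title)
    plt.show()


def sum_hash(word, table_size):
    """Example hash: sum of the character codes modulo the table size."""
    return sum(ord(ch) for ch in word) % table_size


# Tiny in-memory demonstration, so no word file is needed to try it.
counts = count_hashes(["stop", "pots", "tops", "hash"], sum_hash, 10)
print(counts)  # [1, 0, 0, 0, 3, 0, 0, 0, 0, 0]
```

For the real experiment, something like `plot_counts(count_hashes(load_words("wordsEn.txt"), sum_hash, 1000), "sum_hash")` would graph one hash function. A flat chart means the words are spread evenly, and tall spikes mean many collisions.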
Let’s do the same thing for all 4
potential hash functions (Not
done)
• Some things to play with
– What if we increase the hash table size for each of
the hash functions?
• Which hash functions does it help?
• Which are not affected very much by increasing the size?