Testing a Hash Function using Probability

Suppose you have a huge square turnip field with 1000 turnips
growing in it. They are all perfectly evenly spaced in a regular pattern.
Suppose also that the Germans fly over your field and drop 10
bombs totally at random, all falling on your turnip field. Each bomb is
so powerful that it will completely destroy the one turnip that it lands
closest to.
How many turnips would you have left?
Sounds easy: start with 1000 and 10 are destroyed, so 990 are left.
Except that there is a possibility that two bombs will land on the
same turnip, so only nine will be destroyed. Not very likely, but
certainly not impossible. You could even find three bombs landing on
the same turnip, or two landing on one and another two landing on
another one.
It is even possible that all 10 bombs will land on the same turnip.
Each turnip has a 1:1000 chance, or a 0.001 probability, of being hit
each time a bomb is dropped. So for each turnip the probability of
being hit by all ten bombs is 0.001 multiplied by itself ten times, about 10^(-30).
Being left with only 990 turnips is the worst possible case. The
best possible case is 999, and anything in between is also possible.
Exactly what are the probabilities? That is what the Poisson
Distribution is all about.
There are a number of opportunities (1000) for an event that is
individually rare (probability 0.01), but over the whole world of
opportunities is inevitable, and is in fact going to happen 10 times.
The key thing is the average number of events per opportunity. In
our case this is the average number of bombs per turnip, 10/1000.
This average is given the symbol “λ”, a lower case lambda or Greek L.
λ = 0.01
If you want to know the probability of any particular turnip being
hit by N bombs, the Poisson distribution tells us that
p(N) = e^(-λ) × λ^N / N!
In our example:
N is limited to the range 0 to 10.
λ = 0.01
e^(-0.01) = 0.9900498 (e ≈ 2.71828183)
Remember that 0! = 1 and anything to the power of 0 is 1.
p(0)  = 0.99
p(1)  = 0.0099
p(2)  = 0.000049
p(3)  = 0.00000017
p(4)  = 0.00000000041
p(5)  = 0.00000000000083
p(6)  = 0.0000000000000014
p(7)  = 0.0000000000000000019
p(8)  = 0.0000000000000000000025
p(9)  = 0.0000000000000000000000027
p(10) = 0.0000000000000000000000000027
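The table above can be reproduced with a few lines of Python. The function name `poisson_p` is my own choice for this sketch, not something from the text:

```python
import math

def poisson_p(n, lam):
    """Poisson probability of exactly n events when the average is lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

lam = 0.01  # 10 bombs spread over 1000 turnips
for n in range(11):
    print(f"p({n}) = {poisson_p(n, lam):.2g}")
```

Rounded to two significant figures, the printed values match the table, from p(0) = 0.99 down to p(10) ≈ 2.7 × 10⁻²⁷.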
For each turnip, there is a 0.99 chance of not being hit at all. With 1000 turnips, that
means we really do expect to see 990 surviving. But only 9.9 of them get hit exactly
once. Over the course of 10 raids, we would probably see one case of a turnip being
hit more than once.
On average, if we sat through 371,000,000,000,000,000,000,000 raids, we could
expect to see a case of a single turnip being hit by ten bombs only once.
All in all, this isn’t looking very useful, but now look at another example...
In a class of 29 students, what are the chances that two will share a birthday? With
365 days to spread 29 students over, it looks like only about an 8% chance of a
coincidence (29/365).
But the correct analysis is that the average number of students per day of the year is
roughly 29/365, which is 0.079452. Each day of the year can expect just 0.079452
birthdays to be on it.
λ = 0.079452
p(0) = e^(-λ) = 0.9236      (92.4% of days have no birthday on them)
p(1) = λ × p(0) = 0.0733    (7.3% of days have one birthday on them)
p(2) = λ × p(1)/2 = 0.00292 (0.292% of days have two birthdays on them)
But 0.292% of days is 0.292% of 365, which is about 1.06. Meaning that for any random
group of 29 people, on average, there will be just over 1 shared birthday.
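The birthday arithmetic can be checked directly. This quick sketch uses the Poisson recurrence p(n) = p(n-1) × λ/n from the worked example; the variable names are my own:

```python
import math

lam = 29 / 365            # average birthdays per day of the year
p0 = math.exp(-lam)       # days with no birthday
p1 = lam * p0             # days with exactly one birthday
p2 = lam * p1 / 2         # days with exactly two birthdays

print(round(lam, 6))      # 0.079452
print(round(p2, 5))       # 0.00292
print(round(p2 * 365, 2)) # 1.06 shared-birthday days expected
```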
So what?
German bombs, people, and strings are all the same kind of thing.
Turnips, days of the year, and hash table positions are all the same kind of thing.
From the turnip’s point of view, being blown up by a bomb is an unlikely event, it
probably isn’t going to happen. From the bomb’s point of view, landing on a turnip is
an absolute certainty.
From the day-of-the-year’s point of view, someone in a small group of people having
their birthday on it is unlikely. From the person-in-the-group’s point of view, having
their birthday land on some day of the year is a certainty.
From the point of view of one of the thousands of positions in a hash table, any
particular string landing on it is quite unlikely. From a string’s point of view, finding
a place in a hash table is a certainty: every string has a hash value.
It all works the same way.
If we have a hash table whose array contains 10,000 pointers and we eventually
store 5,000 strings in it, what would we expect to happen?
If the hash function is working properly, we will get a random distribution of
strings in the array, just like the distribution of people on days-of-the-year.
In this case, λ = 5000/10000 = 0.5
p(0) = e^(-λ)           = 0.6065
p(1) = e^(-λ) × λ       = 0.3033
p(2) = e^(-λ) × λ^2/2   = 0.0758
p(3) = e^(-λ) × λ^3/6   = 0.0126
p(4) = e^(-λ) × λ^4/24  = 0.0016
p(5) = e^(-λ) × λ^5/120 = 0.0002
p(6) = e^(-λ) × λ^6/720 = 0.0000
Interpretation: p(2) is 0.0758. Every one of the 10,000 positions in the
hash table has a 0.0758 probability of containing two strings.
Therefore we should expect 758 of the hash table’s linked lists to have
a length of 2.
Similarly, we should expect 6065 entries to be empty,
3033 linked lists to contain exactly one string,
and only 2 entries in the whole table to have 5 strings in them.
Notice how the numbers add up to 1.0000? We would expect to
have no linked lists at all with a length greater than 5.
Of course, these are just the most likely figures; we can't expect
nature to duplicate them exactly. But any properly working hash
function should deliver that shape of distribution whenever λ is 0.5,
i.e. whenever the hash table appears to be at half capacity.
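Scaling those probabilities by the number of array slots gives the expected count of lists of each length. A minimal sketch, assuming the 10,000-slot, 5,000-string example:

```python
import math

table_size = 10000
lam = 5000 / table_size   # half full, so lambda = 0.5
for length in range(7):
    p = math.exp(-lam) * lam ** length / math.factorial(length)
    print(f"lists of length {length}: expected {round(p * table_size)}")
```

This prints the expected counts 6065, 3033, 758, 126, 16, 2 and 0 for lengths 0 through 6.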
[Figure: number of linked lists of each length for λ = 0.5, i.e. number of strings in the table = 0.5 × array size; vertical scale 0 to 7000.]
To test a hash function:
1. Make your hash table quite large.
2. Read a large number of random strings into it (perhaps the text
of a book)
3. Calculate λ = number of strings / size of table
4. Make your program count how many linked lists are empty,
how many have one string in them, how many have two, and so
on and so on.
5. Calculate the expected numbers for the counts in step 4, but
this time using the Poisson formula.
6. Display the two sets of numbers, something like this:
number of empty lists: expected = 6065 actual = 6110
number with length 1: expected = 3033 actual = 2980
number with length 2: expected = 758 actual = 789
... etc
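Steps 1 to 6 might be sketched in Python like this. The function name `report` is my own, and Python's built-in `hash` merely stands in for the hash function you actually want to test:

```python
import math
from collections import Counter

def report(strings, table_size, hash_fn=hash):
    # Step 4: bucket each string, then count how many slots
    # ended up holding 0, 1, 2, ... strings.
    slots = Counter(hash_fn(s) % table_size for s in strings)
    lengths = Counter(slots.values())
    lengths[0] = table_size - len(slots)   # slots never hit are empty lists

    # Steps 3 and 5: lambda, then the Poisson-expected count per length.
    lam = len(strings) / table_size
    for n in range(max(lengths) + 1):
        expected = table_size * math.exp(-lam) * lam ** n / math.factorial(n)
        print(f"number with length {n}: expected = {expected:.0f} actual = {lengths[n]}")

# Step 2: any large supply of distinct strings will do.
words = [f"word{i}" for i in range(5000)]
report(words, 10000)
```

A badly skewed hash function shows up immediately: the actual column drifts far from the expected one, typically with too many empty slots and a few very long lists.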
You’ll soon notice if the numbers are significantly different.
Side note:
When a hash table appears to be full (the number of strings in it is
the same as the size of its array, so λ = 1), these are the probabilities:
linked list length 0   probability 0.3679
linked list length 1   probability 0.3679
linked list length 2   probability 0.1839
linked list length 3   probability 0.0613
linked list length 4   probability 0.0153
linked list length 5   probability 0.0031
linked list length 6   probability 0.0005
That is 0.9999 in total, so only 0.0001 is left for all longer lists.
Even under such conditions, there should be no long lists, and a hash
table remains a very fast-to-search storage system.
[Figures: the shape of the Poisson distribution when λ is small (much less than 1), when λ = 1, and when λ is large (much more than 1).]