Probability in Computing
LECTURE 6: BINS AND BALLS,
APPLICATIONS: HASHING & BLOOM FILTERS
© 2010, Van Nguyen
Probability for Computing
Agenda
Review: the problem of bins and balls
Poisson distribution
Hashing
Bloom Filters
Balls into Bins
We have m balls that are thrown into n bins,
with the location of each ball chosen
independently and uniformly at random from n
possibilities.
What does the distribution of the balls into the
bins look like?
"Birthday paradox" question: is there a bin with at
least 2 balls?
How many of the bins are empty?
How many balls are in the fullest bin?
Answers to these questions give solutions to
many problems in the design and analysis of
algorithms
The maximum load
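When n balls are thrown independently and uniformly at
random into n bins, the probability that the maximum
load is more than $3 \ln n / \ln\ln n$ is at most 1/n for n
sufficiently large.
Let $M = 3 \ln n / \ln\ln n$. By the union bound over all sets of M balls,
$\Pr[\text{bin 1 receives} \ge M \text{ balls}] \le \binom{n}{M} \left(\frac{1}{n}\right)^M$
Note that: $\binom{n}{M} \left(\frac{1}{n}\right)^M \le \frac{1}{M!} \le \left(\frac{e}{M}\right)^M$
Now, using the union bound again over the n bins, $\Pr[\text{any bin receives} \ge M \text{ balls}]$
is at most $n \left(\frac{e}{M}\right)^M$,
which is $\le 1/n$ for n sufficiently large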
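As a rough empirical check of the maximum-load statement, the sketch below throws n balls into n bins and compares the largest load seen over a few trials with 3 ln n / ln ln n. The function names, the seed and the number of trials are illustrative choices, not part of the lecture.

```python
import math
import random
from collections import Counter

def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    counts = Counter(rng.randrange(n) for _ in range(n))
    return max(counts.values())

def bound(n):
    """The threshold 3 ln n / ln ln n from the maximum-load statement."""
    return 3 * math.log(n) / math.log(math.log(n))

rng = random.Random(0)  # fixed seed so that runs are repeatable
for n in (100, 1_000, 10_000):
    worst = max(max_load(n, rng) for _ in range(20))
    print(f"n={n}: largest load over 20 trials = {worst}, bound = {bound(n):.2f}")
```

In such runs the observed maximum typically sits well below the threshold, which is consistent with the bound being an upper bound rather than a tight estimate.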
Application: Bucket Sort
A sorting algorithm that
breaks the Ω(n log n) lower
bound under certain input
assumptions
Bucket sort works as follows:
Set up an array of initially
empty "buckets."
Scatter: Go over the original
array, putting each object in its
bucket.
Sort each non-empty bucket.
Gather: Visit the buckets in
order and put all elements back
into the original array.
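The scatter, sort and gather steps can be sketched as follows, assuming the inputs are integers in [0, 2^k) and that one bucket is used per input element. The function name `bucket_sort` and the use of Python's built-in sort inside each bucket are choices made for this sketch.

```python
def bucket_sort(values, k):
    """Sort integers drawn from [0, 2**k) using len(values) buckets."""
    n = len(values)
    if n == 0:
        return []
    buckets = [[] for _ in range(n)]          # set up n empty buckets
    for v in values:
        # Scatter: scale v from [0, 2**k) down to a bucket index in [0, n).
        buckets[(v * n) >> k].append(v)
    out = []
    for bucket in buckets:                    # Gather: visit buckets in order
        bucket.sort()                         # each bucket is small in expectation
        out.extend(bucket)
    return out

print(bucket_sort([5, 3, 7, 0, 2, 6, 1, 4], 3))
```

Because the bucket index is a monotone function of the value, concatenating the sorted buckets in order yields a sorted array.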
A set of $n = 2^m$ integers,
randomly chosen from
$[0, 2^k)$, $k \ge m$, can be sorted
in expected time O(n)
Why: will analyze later!
The Poisson Distribution
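Consider m balls, n bins
$\Pr[\text{a given bin is empty}] = \left(1 - \frac{1}{n}\right)^m \approx e^{-m/n}$
Let $X_j$ be an indicator r.v. that is 1 if bin j is empty, 0 otherwise
Let X be a r.v. that represents the number of empty bins
By linearity of expectation, $E[X] = \sum_{j=1}^{n} E[X_j] = n \left(1 - \frac{1}{n}\right)^m \approx n e^{-m/n}$
Generalizing this argument, $\Pr[\text{a given bin has } r \text{ balls}] = \binom{m}{r} \left(\frac{1}{n}\right)^r \left(1 - \frac{1}{n}\right)^{m-r}$
Approximately, $p_r \approx \frac{e^{-m/n} (m/n)^r}{r!}$
So: the number of balls in a given bin is approximately Poisson distributed with mean $\mu = m/n$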
Limit of the Binomial Distribution
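A small numerical sketch of this limit, assuming the standard statement that the Binomial(m, 1/n) load of a single bin approaches a Poisson distribution with mean m/n as n grows with m/n held fixed. The helper names and the choice m = n are assumptions made for illustration.

```python
import math

def binomial_pmf(m, p, r):
    """Exact probability that a given bin receives r of m balls, each with probability p."""
    return math.comb(m, r) * p**r * (1 - p) ** (m - r)

def poisson_pmf(mu, r):
    """Poisson probability of the value r for mean mu."""
    return math.exp(-mu) * mu**r / math.factorial(r)

for n in (10, 100, 10_000):
    m = n  # keep the average load m/n fixed at 1
    for r in (0, 1, 2):
        exact = binomial_pmf(m, 1 / n, r)
        approx = poisson_pmf(m / n, r)
        print(f"n={n} r={r}: binomial={exact:.5f} poisson={approx:.5f}")
```

The gap between the two columns shrinks as n increases.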
Application: Hashing
The balls-and-bins model is a good model for hashing
Example: password checker
Goal: prevent people from choosing common, easily cracked
passwords
Keep a dictionary of unacceptable passwords and check each newly
created password against this dictionary.
Initial approach: sort this dictionary and do a binary
search on it when checking a password
Would require Ω(log m) time for m words in the dictionary
New approach: chain hashing
Place the words into bins and search the appropriate bin for the word
The words in a bin: implemented as a linked list
The placement of words into bins is done by using a hash function
Chain hashing
Hash table
A hash function f: U → [0, n-1] is a way of placing items from the
universe U into n bins
Here, U consists of all possible password strings
The collection of bins is called the hash table
Chain hashing: items that fall into the same bin are chained
together in a linked list
Using a hash table turns the dictionary problem into a
balls-and-bins problem
m words, hashing range [0..n-1] ⇒ m balls, n bins
Assumption: we can design hash functions that map
words into bins uniformly at random
A given word could be mapped into any bin with the same probability
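A minimal sketch of chain hashing for the password dictionary. It assumes SHA-256 as a stand-in for the uniformly random hash function, and it uses Python lists where the slides use linked lists; the class name `ChainHashTable` is hypothetical.

```python
import hashlib

class ChainHashTable:
    """Dictionary of words stored in n bins; each bin holds a chain of words."""

    def __init__(self, n_bins):
        self.bins = [[] for _ in range(n_bins)]

    def _bin(self, word):
        # Assumption: SHA-256 behaves like a uniformly random hash function.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.bins)

    def add(self, word):
        chain = self.bins[self._bin(word)]
        if word not in chain:
            chain.append(word)

    def __contains__(self, word):
        # Hash to find the bin, then search its chain sequentially.
        return word in self.bins[self._bin(word)]

table = ChainHashTable(8)
for bad in ("password", "123456", "qwerty"):
    table.add(bad)
print("password" in table, "correct horse battery" in table)
```

A lookup costs one hash evaluation plus a scan of one chain, so its cost tracks the number of words that landed in that bin.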
Search time in chain hashing
To search for an item
First hash it to find the corresponding bin, then find
it in the bin: sequential search through the linked
list
The expected number of balls in a bin is about m/n ⇒
expected time for the search is Θ(m/n)
If we choose m = n then a search takes expected
constant time
Worst case
Maximum number of balls in a bin: Θ(ln n / ln ln n) with high probability if we choose m = n
Another disadvantage: wasting a lot of space in
empty bins
Hashing: bit strings
In chain hashing with n balls and n bins, we waste a lot of
empty bins ⇒ should have m/n >> 1
Hashing using short fingerprints will help
Suppose: passwords are 8 characters, i.e. 64 bits
We use a hash function that maps each password into a 32-bit string, i.e. a fingerprint
We store the dictionary of fingerprints of the unacceptable
passwords
When checking a password, compute its fingerprint, then
check it against the dictionary: if found then reject this
password
But it is possible that our password checker may not
give the correct answer!
False positives
This hashing scheme gives a false positive
when it rejects a good password
The fingerprint of this password accidentally
matches that of an unacceptable password
For our password checker application this overly
conservative approach is, however, acceptable if
the probability of a false positive is not
too high
False positive probability
How many bits should we use to create
fingerprints?
We want a reasonably small probability of a false
positive match
$\Pr[\text{the fingerprint of a given good password} \ne \text{any given unacceptable fingerprint}] = 1 - \frac{1}{2^b}$; here b is the number of bits
Thus for m unacceptable passwords, $\Pr[\text{false positive occurs on a given good password}] = 1 - \left(1 - \frac{1}{2^b}\right)^m \ge 1 - e^{-m/2^b}$
Easy to see that: to make this probability less than a given
small constant, we need $b = \Omega(\log m)$
If we use $b = 2 \log_2 m$ bits ⇒ $\Pr[\text{a false positive}] = 1 - \left(1 - \frac{1}{m^2}\right)^m < \frac{1}{m}$
Dictionary of $2^{16}$ words using 32-bit fingerprints ⇒ false positive probability $< \frac{1}{65{,}536}$
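The fingerprint false positive probability 1 - (1 - 1/2^b)^m can be evaluated numerically. The sketch below uses `log1p` and `expm1` so that the tiny term 1/2^b is not lost to rounding; the function name is an illustrative choice.

```python
import math

def false_positive_prob(m, b):
    """Probability that a good password's b-bit fingerprint matches one of m stored ones."""
    # 1 - (1 - 2**-b)**m, rewritten as -(exp(m * log(1 - 2**-b)) - 1) for accuracy.
    return -math.expm1(m * math.log1p(-(2.0 ** -b)))

m = 2 ** 16
for b in (16, 24, 32):
    print(f"m=2^16, b={b}: false positive probability = {false_positive_prob(m, b):.3e}")
```

With b = 32 = 2 log2 m the value falls just below 1/m = 1/65,536, matching the bound on the slide.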
An approximate set membership
problem
Suppose we have a set S = {s1, s2, s3, …,
sm} of m elements from a large universe set
U. We would like to represent the elements of
S in such a way that
We can quickly answer queries of the form "Is x
an element of S?"
We want the representation to take as little space as
possible
To save space we can accept occasional
mistakes in the form of false positives
E.g. in our password checker application
Bloom filters
A Bloom filter: a data structure for this
approximate set membership problem
Generalizes the hashing ideas above to
achieve a more interesting trade-off between
required space and the false positive probability
Consists of an array of n bits, A[0] to A[n-1],
initially set to 0
Uses k independent hash functions h_1, h_2, …, h_k
with range {0, …, n-1}; all these are uniformly
random
Represent an element s ∈ S by setting A[h_i(s)] to 1,
i = 1, …, k
Checking: For any
value x, to see if x ∈ S
simply check if
A[h_i(x)] = 1 for all
i = 1, …, k
If not, clearly x is not a
member of S
If so, we assume
that x is in S but we
could be wrong! ⇒
false positive
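The insert and check operations can be sketched as below. This is a sketch under assumptions: the k hash functions are simulated by salting SHA-256 with the index i, and each bit is stored in a whole byte for readability. The class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Array of n bits with k hash functions, as described on the slides."""

    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Assumption: salting SHA-256 with i approximates k independent uniform hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n_bits=1000, k=4)
for bad in ("password", "123456", "qwerty"):
    bf.add(bad)
print("password" in bf)
```

Every inserted element is always reported as present, so errors can only be false positives.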
False positive probability
The probability of a false positive for an element not in
the set
After all m elements of S are hashed into the Bloom filter, $\Pr[\text{a given bit} = 0] = \left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}$. Let $p = e^{-km/n}$.
$\Pr[\text{a false positive}] = \left(1 - \left(1 - \frac{1}{n}\right)^{km}\right)^k \approx \left(1 - e^{-km/n}\right)^k = (1 - p)^k$
Let $f = (1 - p)^k$.
Given m, n what is the optimum k to minimize f?
Note that a higher k gives us more chances to find a 0-bit for an
element not in S, but using fewer hash functions increases the fraction
of 0-bits in the array.
Optimal $k = (\ln 2) \cdot n/m$, which reaches the minimum $f = (1/2)^k \approx (0.6185)^{n/m}$
Thus Bloom filters allow a small probability of a false positive
while keeping the number of storage bits per item constant
Note that in the previous consideration of fingerprints we need $\Omega(\log m)$ bits
per item
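To see the trade-off in k numerically, the sketch below evaluates the approximation f = (1 - e^{-km/n})^k for integer k and compares the best integer choice with (ln 2) n/m. The function names and the example sizes (m = 1000 items, n = 8000 bits) are assumptions for illustration.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate Bloom filter false positive probability (1 - e^{-km/n})^k."""
    return (1 - math.exp(-k * m / n)) ** k

def optimal_k(m, n):
    """Real-valued minimizer of the approximation: (ln 2) * n / m."""
    return math.log(2) * n / m

m, n = 1000, 8000  # 8 bits of storage per item
for k in range(1, 11):
    print(f"k={k}: f = {false_positive_rate(m, n, k):.4f}")
print(f"optimal k = {optimal_k(m, n):.2f}, (0.6185)^(n/m) = {0.6185 ** (n / m):.4f}")
```

Since k must be an integer in practice, one would pick a neighbouring integer of (ln 2) n/m and accept a slightly larger f.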
Bloom filters: applications
Discovering DoS attack attempts
Computing the difference between SYN
and FIN packets
Matching SYN and FIN packets by
tuples of addresses (source and destination addresses and ports)
Many, many other applications
Application of hashing: breaking
symmetry
Suppose that n users want a unique resource
(e.g. processes demanding CPU time). How can we decide on a
permutation quickly and fairly?
Hash each user ID into a b-bit value (one of $2^b$ possible values), then sort the resulting numbers
That is, the smallest hash will go first
How to avoid two users being hashed to the same value?
If b is large enough we can avoid such collisions, as in the
birthday paradox analysis
Fix a user. $\Pr[\text{another user has the same hash}] = 1 - \left(1 - \frac{1}{2^b}\right)^{n-1} \le \frac{n-1}{2^b}$
By the union bound, $\Pr[\text{two users have the same hash}] \le \frac{n(n-1)}{2^b}$
Thus, choosing $b = 3 \log_2 n$ guarantees success with probability at least $1 - \frac{1}{n}$
Leader election
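A minimal sketch of the permutation-by-hashing idea, assuming SHA-256 truncated to its top b bits as the hash function (so b must be at most 256) and string user IDs. The names `hash_bits`, `fair_order` and `collision_bound` are hypothetical.

```python
import hashlib

def hash_bits(user_id, b):
    """Map a user ID to a b-bit integer (top b bits of SHA-256; assumes 1 <= b <= 256)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - b)

def fair_order(user_ids, b):
    """Order users by their hash value; the smallest hash goes first."""
    return sorted(user_ids, key=lambda u: hash_bits(u, b))

def collision_bound(n, b):
    """Union bound n(n-1)/2^b on the probability that some two users collide."""
    return n * (n - 1) / 2 ** b

users = [f"user{i}" for i in range(16)]
b = 12  # 3 * log2(16)
print(fair_order(users, b)[:3], collision_bound(len(users), b))
```

For n = 16 and b = 12 the bound is 240/4096, which is below 1/n = 1/16, in line with the choice b = 3 log2 n. The first user in the resulting order could also serve as the elected leader.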
SYN FLOOD DEFENSE
SOLUTIONS
TCP SYN-Flooding Attack
TCP services are often susceptible to various
types of DoS attacks
SYN flood: external hosts attempt to overwhelm the
server machine by sending a constant stream of TCP
connection requests
Streaming spoofed TCP SYNs
Forcing the server to allocate resources for each new connection
until all resources are exhausted
90% of DoS attacks use TCP SYN floods
Takes advantage of the three-way handshake
Server starts "half-open" connections
These build up until the queue is full and all additional requests are
blocked
TCP: Overview
RFCs: 793, 1122, 1323, 2018, 2581
point-to-point:
one sender, one receiver
reliable, in-order byte stream:
no "message boundaries"
pipelined:
TCP congestion and flow control set window size
send & receive buffers
full duplex data:
bi-directional data flow in same connection
MSS: maximum segment size
connection-oriented:
handshaking (exchange of control msgs) initializes sender, receiver state before data exchange
flow controlled:
sender will not overwhelm receiver
[Figure: application writes data → socket door → TCP send buffer → segment → TCP receive buffer → socket door → application reads data]
TCP segment structure
Segment layout (each row is 32 bits wide):
source port # | dest port #
sequence number
acknowledgement number
head len | not used | U A P R S F | Receive window
checksum | Urg data pnter
Options (variable length)
application data (variable length)
Field notes:
URG: urgent data (generally not used)
ACK: ACK # valid
PSH: push data now (generally not used)
RST, SYN, FIN: connection estab (setup, teardown commands)
checksum: Internet checksum (as in UDP)
sequence and acknowledgement numbers: counting by bytes of data (not segments!)
Receive window: # bytes rcvr willing to accept
Attack Mechanism
Upon receiving a SYN segment, a Transmission Control Block
(TCB) is reserved
TCP SYN-RECEIVED state:
connection is half-opened
The connection does not transition to ESTABLISHED
until the last ACK arrives
Attack Mechanism
The attacker sends a
flood of SYNs ⇒ too
many TCBs ⇒ the host's
memory is exhausted.
To avoid this, the OS only
allows a fixed
maximum number of
TCBs in SYN-RECEIVED
If this threshold is
reached, newly arriving
SYNs will be rejected
SYN Flooding
[Figure: client C sends SYN_C1 through SYN_C5 to the listening server S, which stores data for each SYN]
Implementation Method
How to create a successful
flood
Forcing drops of incomplete connections (ICs)
Standard TCP: a connection times out only after some retransmissions ⇒
assuming 1024 ICs are allowed per socket ⇒ 2 connection attempts per second
exhaust all allocated resources.
Note that existing ICs are dropped when a new SYN request is received.
If an ACK arrives at the server but does not find a corresponding IC
state ⇒ the server fails to establish the requested connection
Round trip time (RTT): time required for the server to receive the client's reply
Forcing the server to drop each IC state before one RTT elapses ⇒ no connections
are able to complete ⇒ the attack succeeds!
The goal of the attack is to recycle every connection before the average
RTT
For a listen queue size of 1024 and a 100 millisecond RTT ⇒ need about 10,000 packets
per second.
A minimal-size TCP packet is 64 bytes, so the total bandwidth used is only
about 5 Mb/second (10,000 × 64 bytes × 8 bits) ⇒ practical!
Jan 2013
Flood Detection System (FDS)
Stateless, simple, runs on edge (leaf) routers
Utilizes SYN-FIN pair behavior
Includes (SYN/ACK - FIN) pairs, so it can sit on the client or
server side
However, RST violates SYN-FIN behavior
Placement: first/last mile leaf routers
First mile: detect a large DoS attacker
Last mile: detect DDoS attacks that the first mile
would miss
SYN – FIN Behavior
Generally every SYN has a FIN
SFD-Method
1. Classify the packets
2. Count the number of SYN and FIN packets
going through
3. Use the CUSUM algorithm to analyze the
(SYN-FIN) pair behaviour
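Step 3 can be illustrated with a non-parametric CUSUM over per-interval counts. This is a sketch under assumptions, not the detector from the slides: the normalization by a smoothed FIN count, the drift term `drift` and the alarm level `threshold` are hypothetical parameters chosen for the example.

```python
def cusum_alarms(syn_counts, fin_counts, drift=0.35, threshold=1.0):
    """Return one boolean per interval: True when the CUSUM statistic exceeds threshold."""
    y = 0.0            # accumulated evidence of a SYN surplus
    avg_fin = None     # smoothed FIN count used to normalize the difference
    alarms = []
    for syn, fin in zip(syn_counts, fin_counts):
        avg_fin = fin if avg_fin is None else 0.9 * avg_fin + 0.1 * fin
        x = (syn - fin) / max(avg_fin, 1.0)   # normalized SYN-FIN difference
        y = max(0.0, y + x - drift)           # CUSUM update, clipped at zero
        alarms.append(y > threshold)
    return alarms

normal = [100] * 5
flood = [500] * 5
print(cusum_alarms(normal + flood, [100] * 10))
```

Under normal traffic the SYN and FIN counts roughly balance, so the statistic stays near zero; a sustained SYN surplus pushes it over the threshold.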
SFD-BF Method
Improvement on the previous SFD:
Compute the difference between #SYN and
#FIN when the packets are matched on the
4-tuple:
When a SYN packet comes, determine the
corresponding 4-tuple and insert it into the BF.
Increase the counter specified by this
4-tuple.
When a FIN/RST packet comes, determine
the 4-tuple and find its hash in the BF to
decrease the corresponding counter
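The SYN/FIN matching can be sketched with a counting Bloom filter, where each position holds a counter instead of a bit. The class name, the SHA-256-based hashing and the `unmatched_syns` helper are assumptions made for this sketch; the slides do not specify these details.

```python
import hashlib

class CountingBloomFilter:
    """Counters indexed by k hashes of a connection 4-tuple."""

    def __init__(self, n_counters, k):
        self.n, self.k = n_counters, k
        self.counters = [0] * n_counters

    def _positions(self, four_tuple):
        key = repr(four_tuple)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def on_syn(self, four_tuple):
        for pos in self._positions(four_tuple):
            self.counters[pos] += 1

    def on_fin_or_rst(self, four_tuple):
        """Decrease the counters if this 4-tuple appears to match an earlier SYN."""
        positions = list(self._positions(four_tuple))
        if all(self.counters[p] > 0 for p in positions):
            for p in positions:
                self.counters[p] -= 1
            return True
        return False

    def unmatched_syns(self):
        # Each SYN adds k to the total and each matched FIN/RST removes k.
        return sum(self.counters) // self.k

cbf = CountingBloomFilter(n_counters=1024, k=3)
a = ("10.0.0.1", 4000, "10.0.0.9", 80)
b = ("10.0.0.2", 4001, "10.0.0.9", 80)
cbf.on_syn(a)
cbf.on_syn(b)
cbf.on_fin_or_rst(a)
print(cbf.unmatched_syns())
```

A growing count of unmatched SYNs is the quantity that a detector would monitor.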
Intentional Dropping Scheme for
SYN Flooding Mitigation
Idea
Normally, if a client machine does not receive a SYN-ACK
within a certain time after sending a SYN, it
resends the SYN until it gets
connected to the wanted server.
The idea of this method is to drop the first SYN
from every source machine, which helps to
reduce a SYN flood, which usually consists of first SYNs with
spoofed addresses
Method
The proposed solution uses 3 different BFs:
BF1: stores the 4-tuple address of the first
SYN coming from a given source
BF2: stores the 4-tuples of all SYNs for
which the 3-way handshake is already
completed
BF3: stores the 4-tuples of other SYNs.
Method
Once a SYN arrives, its 4-tuple address is checked against
the 3 BFs, and one of the following 4 cases occurs:
1. Not in any BF ⇒ this is the first SYN, so it will
be dropped; also insert the 4-tuple into BF1
2. If found in BF1 ⇒ this is a second SYN; we
just move the 4-tuple from BF1 to BF3
3. If in BF2 ⇒ let it go through.
4. If in BF3 ⇒ let it go through with
probability p = 1/n, where n is the value of the
corresponding counter in BF3
Method
When an ACK comes, its 4-tuple address is checked
against the BFs, which may result in 1 of the 3
following cases:
1. Not in any BF ⇒ drop the packet
2. If it matches one in BF2 ⇒ let it through
3. If in BF3 ⇒ the connection is completed;
move the 4-tuple address from BF3 to BF2
Result
The first SYN from any source will be dropped
The second SYN from the same source will
go through
If this same source continues sending SYNs,
the probability that the SYN numbered n is
allowed to go through is 1/n
⇒ Thus, the SYN flood caused by an attacking
source will be mitigated.
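The SYN and ACK handling rules can be sketched as a small state machine. To keep the example short, exact Python sets and a dict stand in for the three Bloom filters, and the counter is assumed to hold the SYN's sequence number from that source (2 after the second SYN, 3 after the third, and so on). The class and method names are hypothetical.

```python
import random

class IntentionalDropper:
    """Decides whether to forward SYN and ACK packets, keyed by 4-tuple."""

    def __init__(self, rng=None):
        self.first_seen = set()   # stands in for BF1
        self.completed = set()    # stands in for BF2
        self.pending = {}         # stands in for BF3: 4-tuple -> SYN counter
        self.rng = rng or random.Random()

    def on_syn(self, t):
        """Return True if the SYN is forwarded, False if it is dropped."""
        if t in self.completed:                 # handshake already completed
            return True
        if t in self.pending:                   # SYN number n passes with probability 1/n
            self.pending[t] += 1
            return self.rng.random() < 1.0 / self.pending[t]
        if t in self.first_seen:                # second SYN: move from BF1 to BF3
            self.first_seen.discard(t)
            self.pending[t] = 2
            return True
        self.first_seen.add(t)                  # first SYN: record it and drop
        return False

    def on_ack(self, t):
        """Return True if the ACK is forwarded, False if it is dropped."""
        if t in self.completed:
            return True
        if t in self.pending:                   # handshake completes: BF3 -> BF2
            del self.pending[t]
            self.completed.add(t)
            return True
        return False

d = IntentionalDropper(random.Random(0))
client = ("10.0.0.1", 4000, "10.0.0.9", 80)
print(d.on_syn(client), d.on_syn(client), d.on_ack(client), d.on_syn(client))
```

A legitimate client pays one retransmission delay, while a source that keeps sending SYNs without completing the handshake is forwarded less and less often.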