Fast Statistical Spam Filter by
Approximate Classifications
Authors:
Kang Li
Zhenyu Zhong
University of Georgia
Reader: Deke Guo
Outline
 Motivations of this paper
 The concrete problems
 Basic idea and solutions
 Questions to clarify
Motivations
1. Speed up the classification process in order to defend against spam quickly and, furthermore, improve the throughput of the system.
2. Improve the scalability of statistical classification methods.
3. Keep classification accuracy high.
The background and concrete problem
 Background
 Statistical Bayesian filters and their variants are used to block spam.
 The statistical value of each individual token is stored in a dictionary.
 A decision is based on a summary of the values of many tokens.
 Problems to investigate
 How to improve the performance of the value retrieval operation for each individual token (motivations 1 and 2).
 The solutions should not have much negative effect on classification accuracy (motivation 3).
Basic idea and solutions (1)
 A straightforward idea
 Use Bloom filters to store the values of tokens, and retrieve the value of any token on demand.
 The first obstacle
 A standard Bloom filter only answers membership queries. How can it be extended to store values?
[Figure: a standard Bloom filter, in which a hash function family maps the elements of data sets A and B onto a bit vector with positions 0 to m-1.]
[Figure: the extended Bloom filter, in which a hash function family maps tokens from the token universe and test set B onto a multibit vector; the first dimension holds the hash locations, the second dimension holds bits 0 to q-1, and a bit-wise AND of the selected entries gives the output value.]
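As a point of reference for the extension discussed next, a standard Bloom filter can be sketched as follows. This is a minimal illustration and not code from the paper: the class name, the salted SHA-256 hash family, and the parameter values m=1024 and h=4 are assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: an m-bit vector probed by h hash functions."""

    def __init__(self, m, h):
        self.m = m
        self.h = h
        self.bits = [0] * m

    def _positions(self, item):
        # Assumed hash family: SHA-256 salted with the hash-function index.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest(), "big")
            % self.m
            for i in range(self.h)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May report a false positive, but never misses an inserted item.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, h=4)
for token in ("viagra", "lottery", "meeting"):
    bf.add(token)
print("lottery" in bf)  # True: inserted items are always found
```

The filter answers only "possibly present" or "definitely absent", which is why it cannot return a token's value without being extended.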
Basic idea and solutions (2)
 Replace the bit vector with a two-dimensional vector of size m × q.
 The first dimension denotes the hash locations for each token in an m-bit vector, the same as in the standard Bloom filter.
 The second dimension of each hash location denotes the value of the token, with one bit for each distinct value.
 The second obstacle
 The value universe is usually large, even huge. It is impossible to allocate a bit in the second dimension for every element of the value universe.
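The two-dimensional layout described on this slide can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class and method names, the salted SHA-256 hash family, and the choice to pack the q value bits of each location into one Python integer are all hypothetical.

```python
import hashlib


class ExtendedBloomFilter:
    """m hash locations (first dimension), q value bits each (second dimension)."""

    def __init__(self, m, h, q):
        self.m, self.h, self.q = m, h, q
        self.cells = [0] * m  # each cell is a q-bit integer

    def _positions(self, token):
        # Assumed hash family: SHA-256 salted with the hash-function index.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{token}".encode()).digest(), "big")
            % self.m
            for i in range(self.h)
        ]

    def put(self, token, level):
        """Store an encoded value (0 <= level < q) by setting one bit per location."""
        if not 0 <= level < self.q:
            raise ValueError("level must be in range(q)")
        for pos in self._positions(token):
            self.cells[pos] |= 1 << level

    def get(self, token):
        """Return the q-bit output entry: the bit-wise AND of the h hashed cells."""
        out = (1 << self.q) - 1
        for pos in self._positions(token):
            out &= self.cells[pos]
        return out


ebf = ExtendedBloomFilter(m=4096, h=4, q=8)
ebf.put("lottery", 6)
ebf.put("meeting", 1)
entry = ebf.get("lottery")
print(format(entry, "08b"))  # bit 6 is set; other bits appear only on collisions
```

An output entry of 0 means the token was never stored; an entry with exactly one bit set identifies the stored level.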
Basic idea and solutions (3)
 Encode
 In this field, the value universe ranges from 0 to 1.
 This paper does not propose a new encoding method; it uses an algorithm taken from reference [20].
 Choose and tune the parameter q, which denotes the number of possible elements produced by the encoding algorithm.
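To make the role of q concrete, the sketch below maps a value in [0, 1] to one of q levels and back. It uses plain uniform quantization, which is an assumption made for illustration only; it is not claimed to be the encoding algorithm of reference [20], and the function names are hypothetical.

```python
def encode(value, q):
    """Map a statistical value in [0, 1] to a level in range(q) (uniform bins)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return min(int(value * q), q - 1)  # value == 1.0 falls into the last bin


def decode(level, q):
    """Return the midpoint of the bin as the approximate value."""
    return (level + 0.5) / q


q = 10
level = encode(0.97, q)
approx = decode(level, q)
print(level, approx)
```

With uniform bins the decoded value differs from the original by at most 1/(2q), so a larger q reduces the decoding deviation at the cost of more bits per hash location.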
Why can the idea meet motivations 1 and 2?
 Space (for the set of (token, value) pairs)
 Storing them in the extended Bloom filter needs less space than other solutions: K bits for each token.
 Given the allocated memory, the solution can store more (token, value) pairs than other solutions.
 Time
 The extended Bloom filter is small enough to load into memory, so no further I/O operations are needed.
 The response delay of a query is constant for any input, no matter how many pairs have been stored.
 In the same time slot, the solution can retrieve the values of more tokens than previous solutions.
The negative effects on the classification accuracy (1)
 A query on the extended Bloom filter may output two kinds of mistakes.
 A query with a token outside the test data set as input may return a seemingly valid output entry (exactly one bit is set to 1).
 A query with a token inside the test data set as input may return a conflicting output entry (more than one bit is set to 1).
 For any token, the decoded result usually does not equal the real statistical value.
[Figure: the extended Bloom filter for token sets A and B; the bit-wise AND of the hashed multibit entries yields an output value with more than one bit set to 1.]
The negative effects on the classification accuracy (2)
 The misclassification
 The former error affects the summary of the values of a message and may influence the decision.
 For a multi-bit error, choose the smallest value. If it is wrongly chosen, the error only makes the message less likely to be classified as spam and may result in a false negative. This can be tolerated.
 The decoding deviation
 It cannot be avoided, but it can be reduced by designing better algorithms and/or selecting the parameters carefully.
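The rule "choose the smallest value" for a multi-bit output entry can be sketched as follows. The function names are hypothetical, and the sketch assumes that bit i of the q-bit output entry stands for the i-th smallest encoded value.

```python
def is_conflict(entry):
    """True when more than one bit of the output entry is set."""
    return bin(entry).count("1") > 1


def resolve(entry):
    """Return the level to use for an output entry, or None if no bit is set."""
    if entry == 0:
        return None  # the token was never stored
    # entry & -entry isolates the lowest set bit, i.e. the smallest value.
    return (entry & -entry).bit_length() - 1


print(resolve(0b1000), is_conflict(0b1000))  # 3 False
print(resolve(0b0101), is_conflict(0b0101))  # 0 True
```

Picking the lowest level biases a conflicting token toward a smaller spam value, which matches the slide's point that the resulting error is a false negative rather than a false positive.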
Questions to clarify (1)
 For a query output entry, the probability that a single bit of the output entry is zero is
$$P_{m,n,h}(0) = 1 - P_{m,n,h}(\mathrm{fpos}) = 1 - \left(1 - \left(1 - \frac{1}{m}\right)^{nh}\right)^{h}$$
 For a query output entry, the probability of the former case is
$$P_{m,n,h,q}(\mathrm{fpos}) = 1 - \left(P_{m,n,h}(0)\right)^{q} \tag{6}$$
 The probability of the latter case is
$$P_{m,n,h,q}(\mathrm{multi}) = 1 - \left(P_{m,n,h}(0)\right)^{q} - q\left(1 - P_{m,n,h}(0)\right)^{q-1} \tag{7}$$
Questions to clarify (1)
 Formulas 6 and 7 are wrong or inconsistent with the error definitions.
 The probability of the event (exactly one bit of the output entry is set to 1) is:
$$P_{m,n,h,q}(\mathrm{fpos}) = \binom{q}{1}\, P_{m,n,h}(\mathrm{fpos}) \left(P_{m,n,h}(0)\right)^{q-1}$$
 The probability of the event (more than one bit of the output entry is set to 1) is:
 One minus the probability of all bits being 0 and the probability of exactly one bit being 1.
$$P_{m,n,h,q}(\mathrm{multi}) = 1 - \left(P_{m,n,h}(0)\right)^{q} - P_{m,n,h,q}(\mathrm{fpos})$$
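The reader's proposed expressions can be evaluated numerically with the short sketch below. It only transcribes the formulas stated on this slide: a single bit is set with probability (1 - (1 - 1/m)^(n*h))^h, exactly one of q bits is set with probability q times that value times P(0)^(q-1), and the multi-bit case is what remains after subtracting the all-zero and single-bit cases. The function names and the sample parameters are assumptions for illustration.

```python
def p_bit_set(m, n, h):
    """P_{m,n,h}(fpos): probability that a single output bit is 1."""
    return (1 - (1 - 1 / m) ** (n * h)) ** h


def output_entry_probabilities(m, n, h, q):
    """Return (P(all bits 0), P(exactly one bit 1), P(more than one bit 1))."""
    p1 = p_bit_set(m, n, h)
    p0 = 1 - p1
    p_none = p0 ** q
    p_single = q * p1 * p0 ** (q - 1)   # C(q, 1) * P(fpos) * P(0)^(q-1)
    p_multi = 1 - p_none - p_single
    return p_none, p_single, p_multi


p_none, p_single, p_multi = output_entry_probabilities(m=10_000, n=1_000, h=4, q=10)
print(p_none, p_single, p_multi)
```

By construction the three probabilities sum to 1, which is the consistency property the slide argues formulas 6 and 7 lack.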
Questions to clarify (2)
 Can this idea serve as a general way to extend the standard Bloom filter so that it stores and retrieves values?
 The size of the value universe.
 The multi-bit output error.
 The deletion of (key, value) pairs.
 Questions and Answers
Beyond Bloom Filters: From
Approximate Membership
Checks to Approximate State
Machines
Authors:
Flavio Bonomi
Michael Mitzenmacher
Rina Panigrahy
SIGCOMM 2006
Reader: Deke Guo
Questions
 How to track the simultaneous state of a large number of connections at each network device.
 The tracking data structure should be small enough to fit in on-chip memory.
Solution (1)
 Uses standard Bloom filters to summarize the simultaneous state of a large number of connections.
 Looks up the state of each connection in this summary.
 Introduces a new error type named “don’t know”, in addition to false positives and false negatives.
Solution (1)
 Introduces a timing-based deletion mechanism to deal with ill-behaved or non-terminating connections.
 Operations:
 Put (id, state)
 Lookup (id) or Lookup (id, state)
 Delete (id, state)
 Update (id, old state, new state)
 Ill-behaved or attacking connections may result in false negative errors.
[Figure: elements of data sets A and B hashed by h1, h2, h3, ..., hk onto the bit vector of a standard Bloom filter.]
x does not belong to set B, yet all of its bits have been set to 1.
y does not belong to set B, and its bits are not all 1.
a belongs to set B, and its bits are all 1.
[Figure: the same bit vector after x has been deleted by mistake.]
a belongs to set B, but its bits are no longer all 1 after the false deletion of x.
A false positive error may result in at most k false negatives.
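The chain from one false positive to several false negatives can be reproduced with a toy example. The sketch below is illustrative only: it uses a 10-bit vector, k = 2, naive bit-clearing deletion, and a hand-picked table of hash positions (all hypothetical) so that the outcome is deterministic.

```python
K = 2
# Hypothetical fixed hash outputs: x collides with one bit of a and one bit of b.
POSITIONS = {"a": (1, 3), "b": (6, 9), "x": (1, 6)}
bits = [0] * 10


def insert(item):
    for pos in POSITIONS[item]:
        bits[pos] = 1


def lookup(item):
    return all(bits[pos] for pos in POSITIONS[item])


def delete(item):
    # Naive deletion: trust the lookup result and clear the item's bits.
    if lookup(item):
        for pos in POSITIONS[item]:
            bits[pos] = 0


insert("a")
insert("b")
false_positive = lookup("x")  # x was never inserted, yet both of its bits are 1
delete("x")                   # the false deletion clears bits 1 and 6
missing = [item for item in ("a", "b") if not lookup(item)]
print(false_positive, missing)  # True ['a', 'b']
```

Here the single false positive on x produces k = 2 false negatives, the upper bound stated on the slide, because each of the k cleared bits belonged to a different stored element.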
Solution (2)
 Introduces the Stateful Bloom Filter (SBF) approach.
 Replaces the bit vector used by standard Bloom filters with a vector of cells.
 Its false positive rate is lower than that of standard Bloom filters. Note, however, that the two filters do not use the same amount of storage space, so a more careful comparison is needed.
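A simplified view of the cell-vector idea is sketched below. It is a sketch under stated assumptions rather than the authors' design: the per-cell counters (registers) and deletion are omitted, states are assumed to be positive integers, the hash family is salted SHA-256, and the lookup rule (empty cell means absent, agreeing states are returned, all-"don't know" returns "don't know", conflicting states mean absent) is this example's reading of the approach.

```python
import hashlib

EMPTY = 0
DONT_KNOW = -1  # sentinel for a cell written with two different states


class StatefulBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.cells = [EMPTY] * m

    def _positions(self, flow_id):
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{flow_id}".encode()).digest(), "big")
            % self.m
            for i in range(self.k)
        ]

    def put(self, flow_id, state):
        for pos in self._positions(flow_id):
            if self.cells[pos] == EMPTY:
                self.cells[pos] = state
            elif self.cells[pos] != state:
                self.cells[pos] = DONT_KNOW  # collision between different states

    def lookup(self, flow_id):
        seen = {self.cells[pos] for pos in self._positions(flow_id)}
        if EMPTY in seen:
            return None            # definitely not stored
        states = seen - {DONT_KNOW}
        if not states:
            return DONT_KNOW       # every cell was overwritten
        if len(states) == 1:
            return states.pop()
        return None                # conflicting states: the id was never stored


sbf = StatefulBloomFilter(m=1024, k=3)
sbf.put("flow-1", 2)
first = sbf.lookup("flow-1")       # 2
sbf.put("flow-1", 3)               # second state written without an update
second = sbf.lookup("flow-1")      # DONT_KNOW
print(first, second)               # 2 -1
```

Because a lookup must find consistent states in all k cells, an element outside the set is rejected in more situations than with a plain bit vector, at the cost of more bits per cell.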
[Figure: a Stateful Bloom Filter; x is hashed by h1(x), h2(x), ..., hk(x) onto a vector of cells.]
x does not belong to set B.
The lookup on this filter also makes the right judgment.
Solution (3)
 An approach using d-left hashing
 The authors do not explain, through formal comparison and analysis, why it is the best of the three solutions.
 The simulation tries to show this, but it is not convincing enough, especially because it does not compare the solutions under the same amount of space.
[Figure: the d-left hashing approach; elements of data sets A and B are hashed into rows of cells.]
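The general d-left hashing idea, storing a fingerprint and a state in the least loaded of d candidate buckets with ties broken to the left, can be sketched as follows. This is a simplified illustration and not the construction analysed in the paper: buckets are unbounded dictionaries rather than fixed-size cells, the fingerprint width and the salted SHA-256 hashes are assumptions, and all names are hypothetical.

```python
import hashlib


class DLeftStateTable:
    def __init__(self, d, buckets, fp_bits=16):
        self.d, self.buckets, self.fp_bits = d, buckets, fp_bits
        # d sub-tables; each bucket maps fingerprint -> state.
        self.tables = [[{} for _ in range(buckets)] for _ in range(d)]

    def _digest(self, flow_id, salt):
        data = f"{salt}:{flow_id}".encode()
        return int.from_bytes(hashlib.sha256(data).digest(), "big")

    def _fingerprint(self, flow_id):
        return self._digest(flow_id, "fp") % (1 << self.fp_bits)

    def _choices(self, flow_id):
        # One candidate bucket per sub-table, ordered left to right.
        return [
            self.tables[i][self._digest(flow_id, i) % self.buckets]
            for i in range(self.d)
        ]

    def put(self, flow_id, state):
        fp = self._fingerprint(flow_id)
        choices = self._choices(flow_id)
        for bucket in choices:
            if fp in bucket:          # already stored: update in place
                bucket[fp] = state
                return
        # min() returns the first minimum, i.e. the leftmost least-loaded bucket.
        min(choices, key=len)[fp] = state

    def lookup(self, flow_id):
        fp = self._fingerprint(flow_id)
        for bucket in self._choices(flow_id):
            if fp in bucket:
                return bucket[fp]
        return None


table = DLeftStateTable(d=3, buckets=64)
table.put("flow-1", 3)
table.put("flow-1", 4)            # state change for the same connection
print(table.lookup("flow-1"))     # 4
```

Since each connection occupies exactly one entry, deletion and state updates touch a single cell, which is one practical difference from the Bloom-filter-based solutions above.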
Questions to analyze
 Analyze the relationship between false positives and false negatives, and try to give a formula.
 If the old value of a cell was “don’t know”, the cell keeps that value until its register becomes 0.
 Analyze the fraction of cells whose value is “don’t know”, and compute the rate of this error.
 If the register drops to 1 from a larger value, the value “don’t know” should become a definite value, but the SBF cannot support this transformation.
 If we use the idea of the SBF to redesign the standard Bloom filter, can we achieve some benefits, such as a lower false positive rate?