Discrete Methods in
Mathematical Informatics
Kunihiko Sadakane
The University of Tokyo
http://researchmap.jp/sada/resources/
Count-Min Sketches
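• Data structure of CM Sketch for parameter $(\varepsilon, \delta)$
– 2D array count[1,1]...count[d,w] (initially 0), where
$d = \lceil \ln(1/\delta) \rceil$, $w = \lceil e/\varepsilon \rceil$
– d pairwise independent hash functions
$h_i : \{1,\ldots,u\} \to \{1,\ldots,w\}$
• Space: $(2 + e/\varepsilon) \ln(1/\delta)$ words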
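• When an element $(i_t, c_t)$ arrives
– for each $1 \le j \le d$,
$\mathrm{count}[j, h_j(i_t)] := \mathrm{count}[j, h_j(i_t)] + c_t$
(Figure: the $d \times w$ array; in each row $j = 1,\ldots,d$, the value $c_t$ is added to the cell in column $h_j(i_t)$.)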
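A minimal Python sketch of the update rule above, for illustration only. The class name `CountMinSketch`, the hash family ((a·x + b) mod p) mod w with the prime p = 2^61 − 1, the seed parameter, and the `estimate` method (the usual minimum-over-rows point query) are assumptions and are not taken from the slide.

```python
import math
import random


class CountMinSketch:
    """Illustrative CM sketch with d = ceil(ln(1/delta)), w = ceil(e/eps)."""

    def __init__(self, eps, delta, seed=0):
        self.d = math.ceil(math.log(1 / delta))
        self.w = math.ceil(math.e / eps)
        self.p = 2**61 - 1  # prime modulus (an assumption of this sketch)
        rng = random.Random(seed)
        # One (a, b) pair per row: two words per hash function.
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, item):
        a, b = self.ab[j]
        return ((a * item + b) % self.p) % self.w

    def update(self, item, c=1):
        # count[j, h_j(i_t)] := count[j, h_j(i_t)] + c_t for every row j
        for j in range(self.d):
            self.count[j][self._h(j, item)] += c

    def estimate(self, item):
        # With non-negative updates, every row overestimates the true count.
        return min(self.count[j][self._h(j, item)] for j in range(self.d))


cms = CountMinSketch(eps=0.1, delta=0.01)
for item, c in [(3, 2), (7, 1), (3, 4), (10, 5)]:
    cms.update(item, c)
```

With non-negative updates the estimate never falls below the true count, and every row sums to the total of all updates.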
Quantile Query
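• Return $j_k$ ($k = 0,\ldots,1/\phi$) with
$(k\phi - \varepsilon)\|a\|_1 \le \sum_{i \le j_k} a_i \le (k\phi + \varepsilon)\|a\|_1$
• We use $(\varepsilon/\log_2 u,\ \delta/\log_2 u)$ CM sketch for range queries.
$a_i \le \hat{a}_i \le a_i + \frac{\varepsilon}{\log_2 u}\|a\|_1$ (Pr. $\ge 1 - \frac{\delta}{\log_2 u}$)
$Q(l,r) \le \hat{Q}(l,r) \le Q(l,r) + 2\varepsilon\|a\|_1$ (Pr. $\ge 1 - 2\delta$, by a union bound over at most $2\log_2 u$ dyadic ranges)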
• Because the left end point of query ranges is 1,
we use only one dyadic range for each level.
$Q(1,r) \le \hat{Q}(1,r) \le Q(1,r) + \varepsilon\|a\|_1$ (Pr. $\ge 1 - \delta$, by a union bound over the dyadic ranges used)
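The remark that a prefix range [1, r] needs only one dyadic range per level can be checked with a small sketch. The function name `dyadic_prefix` is hypothetical, and u is assumed to be a power of two.

```python
def dyadic_prefix(r, u):
    """Split [1, r] into disjoint dyadic ranges, at most one per level.

    Assumes u is a power of two and 1 <= r <= u.
    """
    ranges = []
    lo = 1
    size = u
    while size >= 1:
        # Take the block of this size if it still fits inside [1, r].
        if lo + size - 1 <= r:
            ranges.append((lo, lo + size - 1))
            lo += size
        size //= 2
    return ranges


ranges = dyadic_prefix(7, 8)
```

Summing the estimated count of each returned range then gives the estimate for Q(1, r).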
Data Structures
• We consider an imaginary complete binary tree
for U = {1,2,...,u}
• Each leaf stores the frequency of an element $a_t$
• We maintain the data structure so that each node v
satisfies
$\mathrm{count}(v) \le n/m$   (1)
$\mathrm{count}(v) + \mathrm{count}(v_p) + \mathrm{count}(v_s) > n/m$   (2)
Do not apply (1) on leaves ($v_p$: parent of v, $v_s$: sibling of v)
Do not apply (2) on root
(m: parameter, $n = \|a\|_1$)
• If there is a node v not satisfying the invariant
– count(vp) += count(v) + count(vs)
– delete v and vs
(Figure: example tree with n = 15, m = 5, u = 8.)
• When an element $a_t$ arrives
– increase the count of the corresponding leaf by 1
– check whether the leaf count satisfies (1), (2)
– if the count is small, the leaf is merged with others
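The merge rule and the two invariants can be sketched in Python as follows. This is an illustration under stated assumptions: the class name `QDigest`, the heap-style node numbering (root 1, children 2v and 2v+1, leaf for element x at index u + x − 1), u being a power of two, and repeating the bottom-up pass until nothing changes are all choices of this sketch, not details given on the slides.

```python
class QDigest:
    """Illustrative tree summary; only nodes with non-zero count are stored."""

    def __init__(self, u, m):
        self.u = u          # universe size, assumed to be a power of two
        self.m = m          # compression parameter
        self.n = 0          # total frequency
        self.count = {}     # node index -> count

    def insert(self, x):
        leaf = self.u + x - 1
        self.count[leaf] = self.count.get(leaf, 0) + 1
        self.n += 1

    def compress(self):
        limit = self.n / self.m
        changed = True
        while changed:      # repeat until invariant (2) holds everywhere
            changed = False
            for v in sorted(self.count, reverse=True):  # deepest nodes first
                if v == 1 or v not in self.count:
                    continue
                s, p = v ^ 1, v // 2                    # sibling and parent
                total = (self.count[v] + self.count.get(s, 0)
                         + self.count.get(p, 0))
                if total <= limit:
                    # count(v_p) += count(v) + count(v_s); delete v and v_s
                    self.count[p] = total
                    self.count.pop(v)
                    self.count.pop(s, None)
                    changed = True


qd = QDigest(u=8, m=5)
for x in [1, 1, 1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 8, 8]:
    qd.insert(x)
qd.compress()
```

After `compress`, the total count is unchanged, every remaining non-root node satisfies (2), and every remaining internal node satisfies (1).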
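Lemma 1: The number of nodes of the tree is
at most 3m.
Proof: Let Q be the node set, n be the total freq.
From invariant (2),
$\sum_{v \in Q} \left( \mathrm{count}(v) + \mathrm{count}(v_p) + \mathrm{count}(v_s) \right) > \frac{n}{m}|Q|$
On the left side, the count value of each node appears
at most three times.
$\sum_{v \in Q} \left( \mathrm{count}(v) + \mathrm{count}(v_p) + \mathrm{count}(v_s) \right) \le 3 \sum_{v \in Q} \mathrm{count}(v) = 3n$
Therefore
$\frac{n}{m}|Q| < 3n$, and hence $|Q| < 3m$.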
• When an element comes, if its frequency is small,
it is accumulated in an ancestor node.
• If all the ancestors of a leaf have 0 count values,
the leaf count is correct.
• Otherwise the leaf count is at most the real value.
Lemma 2: The error of a leaf count value is at most
$\frac{n}{m}\log u$.
Proof: The error is at most the summation of count
values in ancestors of the leaf.
$\mathrm{error}(v) \le \sum_{x \in \mathrm{ancestor}(v)} \mathrm{count}(x) \le \sum_{x \in \mathrm{ancestor}(v)} \frac{n}{m} \le \frac{n}{m}\log u$
Consider all leaves al,...,ar in the subtree rooted at an
internal node v.
If the count values of all ancestors of v are 0,
the summation of frequencies of al,...,ar is equal to
the summation of count values in the subtree.
The error in the sum of count values in a subtree is
at most $\frac{n}{m}\log u$.
(Figure: complete binary tree over U = {1,...,8} with root [1,8], internal nodes [1,4], [5,8], [1,2], [3,4], [5,6], [7,8], and leaves 1,...,8.)
Query Algorithm
• Traverse tree nodes in post-order
• c = (sum of count values in visited nodes)
• If c becomes at least $k\phi\|a\|_1$, output the node
• If the node is not a leaf, output the rightmost value
in the range corresponding to the node.
(Figure: example traversal of the tree over [1,8] with n = 15, m = 5, u = 8, $\phi$ = 0.5, k = 1.)
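The query steps above can be sketched as follows. The function name `quantile_from_digest`, the representation of the tree as a dictionary from ranges (lo, hi) to counts, and the example counts are hypothetical; the counts are not the ones in the slide's figure. Sorting by right endpoint and then by range size reproduces the post-order of a tree of dyadic ranges.

```python
def quantile_from_digest(nodes, threshold):
    """Return the rightmost value of the first node (in post-order)
    at which the running sum of counts reaches the threshold."""
    c = 0
    # Post-order on dyadic ranges: smaller right endpoint first,
    # and for equal right endpoints the smaller (deeper) range first.
    order = sorted(nodes, key=lambda rng: (rng[1], rng[1] - rng[0]))
    for lo, hi in order:
        c += nodes[(lo, hi)]
        if c >= threshold:
            return hi
    return None


# Hypothetical summary with n = 15 over U = {1,...,8}
digest = {(1, 1): 4, (2, 2): 1, (1, 4): 3, (5, 5): 1, (6, 6): 3, (5, 8): 3}
n, phi, k = 15, 0.5, 1
answer = quantile_from_digest(digest, k * phi * n)
```

Here the running sum is 4, 5, 8 after visiting [1,1], [2,2], [1,4], so the node [1,4] is reached first and its rightmost value 4 is returned.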
Theorem 1: If s nodes are used, the output satisfies
$k\phi\|a\|_1 \le \sum_{i \le j_k} a_i \le \left(k\phi + \frac{3\log u}{s}\right)\|a\|_1$
Proof: Assume that an element x is output at node v.
Consider nodes storing counts of elements 1,...,x.
Because nodes are visited in post-order, counts which
are not stored in nodes before v are stored in ancestors
of v. Therefore the error is at most $\frac{\log u}{m} n$.
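From Lemma 1, $s \le 3m$. Thus
$k\phi\|a\|_1 \le \sum_{i \le j_k} a_i \le \left(k\phi + \frac{3\log u}{s}\right)\|a\|_1$
Required memory to bound the error by $\varepsilon$ is
$O\left(\frac{\log u}{\varepsilon}\right)$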
Relation between
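Heavy Hitters and $\phi$-Quantile
• $(\phi,\varepsilon)$-heavy hitters
– output all elements with $a_i \ge \phi\|a\|_1$
– do not output any element with $a_i < (\phi - \varepsilon)\|a\|_1$
– error is at most $\varepsilon\|a\|_1$
• $(\phi,\varepsilon)$-quantile: output $j_k$ ($k = 0,\ldots,1/\phi$) satisfying
$(k\phi - \varepsilon)\|a\|_1 \le \sum_{i \le j_k} a_i \le (k\phi + \varepsilon)\|a\|_1$
• Assume that the answers are obtained with Pr. 1.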
Lemma: $(\phi+\varepsilon, 2\varepsilon)$-heavy hitters can be obtained from
$(\phi, \varepsilon/2)$-quantile.
Proof: Let $j_k$ ($k = 0,\ldots,1/\phi$) be outputs of $(\phi, \varepsilon/2)$-quantile.
If $j_k = j_{k+1}$, output $j_k$ as a heavy hitter. Estimated
frequency is (number of equal $j_k$'s $- 1$) $\cdot\ \phi\|a\|_1$
(error $2\varepsilon\|a\|_1$)
For an element with frequency less than $(\phi - \varepsilon)\|a\|_1$,
$j_k = j_{k+1}$ does not happen and it is not output.
Any element with freq. at least $(\phi + \varepsilon)\|a\|_1$ is output.
(Figure: the sorted stream aaaabbbbbbcccdddddddddeeeeffffff of total length $\|a\|_1$, with the quantile positions marked at equal intervals below it.)
Lemma: $(\phi,\varepsilon)$-quantile is obtained from log u
$(\phi/\log u, \varepsilon/\log u)$-heavy hitters (u: #distinct elements)
Proof: Partition range [1,u] into dyadic intervals of
length $2^i$ ($i = 0,\ldots,\log u - 1$).
For each i, use $(\phi/\log u, \varepsilon/\log u)$-heavy hitter
(Figure: dyadic intervals over [1,8]: [1,8]; [1,4], [5,8]; [1,2], [3,4], [5,6], [7,8]; leaves 1,...,8.)
To obtain quantiles, do a binary search using
estimated frequencies of elements [1,i].
A range [1,i] is represented by at most log u
dyadic intervals. For each dyadic interval, if it is
output as a heavy hitter, we add the estimated
frequency to quantile.
The error in heavy hitters is at most $\varepsilon/\log u$.
⇒ the error in quantile is at most $\varepsilon$.
Finding Rare Items
• Input: integer ai at time i (i = 1, 2,…)
– $a_i \in U = \{1,2,\ldots,u\}$
• Let $c_t[j] = |\{a_i \mid a_i = j,\ i \le t\}|$ : number of occurrences
of j until time t
• Query: compute rarity $\rho[t]$ at time t
$\rho[t] = \frac{|\{j \mid c_t[j] = 1\}|}{u}$
• Memory: o(u) bits
If large memory can be used
• For each $i \in U = \{1,2,\ldots,u\}$, use a 2-bit counter
– 0: i never appeared
– 1: i appeared once
– 2: i appeared more than once
• Using 2u bits of memory, $\rho[t]$ is computed exactly.
• Is it possible using o(u) bits of memory?
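The exact method with saturating counters can be sketched as follows. The class name `ExactRarity` is hypothetical, and for readability the sketch spends one byte per element rather than packing the counters into 2 bits.

```python
class ExactRarity:
    """Exact rarity with one saturating counter (values 0, 1, 2) per element."""

    def __init__(self, u):
        self.u = u
        # One byte per element for clarity; only the values 0..2 are used,
        # so 2 bits per element would suffice.
        self.state = bytearray(u)

    def update(self, a):
        # a is an element of U = {1,...,u}; the counter saturates at 2.
        if self.state[a - 1] < 2:
            self.state[a - 1] += 1

    def rarity(self):
        # Fraction of U that has appeared exactly once.
        return sum(1 for s in self.state if s == 1) / self.u


exact = ExactRarity(u=8)
for a in [1, 2, 2, 5]:
    exact.update(a)
```

In this stream the elements 1 and 5 appear exactly once, so `exact.rarity()` is 2/8.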
Lower bound of required memory
Proposition: Any deterministic algorithm using o(u)
bits of working space cannot compute exact solutions.
Proof: Assume to the contrary that there exists a
deterministic algorithm using o(u) bits which can compute $\rho[t]$
exactly. Also assume that the set of elements that have
appeared exactly once by time t is $S \subseteq U$.
S can be recovered using the following algorithm.
• Put each element $i \in U$ in the stream at time t+i
• $\rho[t+i] < \rho[t+i-1] \Leftrightarrow i \in S$
To represent S, we need u bits. Contradiction.
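The recovery step in the proof can be simulated as follows. In this sketch an exact count dictionary stands in for the state of the hypothetical algorithm, and the names `rarity` and `recover_set` are illustrative only.

```python
from fractions import Fraction


def rarity(counts, u):
    """Exact rarity: fraction of U whose count is exactly 1."""
    return Fraction(sum(1 for c in counts.values() if c == 1), u)


def recover_set(counts, u):
    """Feed 1,...,u into the stream and watch where the rarity drops."""
    counts = dict(counts)          # work on a copy of the state
    recovered = set()
    prev = rarity(counts, u)
    for i in range(1, u + 1):
        counts[i] = counts.get(i, 0) + 1
        cur = rarity(counts, u)
        if cur < prev:             # rarity decreased <=> i had count 1
            recovered.add(i)
        prev = cur
    return recovered


S = {2, 5, 7}
state = {x: 1 for x in S}          # every element of S appeared once
result = recover_set(state, u=8)
```

Because the state determines S for every one of the 2^u possible sets, it must hold at least u bits.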
Approximation Algorithm
• At the start time, we choose k elements from U
uniformly at random, then count frequencies of
only those elements.
– X1[t],..., Xk[t]: freq. of each element at time t
• For a query, return $\hat{\rho}[t] = \frac{|\{i \mid X_i[t] = 1\}|}{k}$
• For each i, it holds $\Pr[X_i[t] = 1] = \frac{|\{j \mid c_t[j] = 1\}|}{u} = \rho[t]$,
where the probability is over the choice of the sampled elements, so
$\hat{\rho}[t]$ is a good approximation of $\rho[t]$.
• It is a good approximation only if $\rho[t] \ge 1/k$.
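A minimal sketch of this sampling estimator. The class name `SampledRarity` and the seed parameter are hypothetical, and sampling without replacement is an assumption; the slide only says the k elements are chosen uniformly at random.

```python
import random


class SampledRarity:
    """Track frequencies of k sampled elements of U = {1,...,u} only."""

    def __init__(self, u, k, seed=0):
        rng = random.Random(seed)
        # Assumption: the k elements are sampled without replacement.
        self.freq = {x: 0 for x in rng.sample(range(1, u + 1), k)}

    def update(self, a):
        if a in self.freq:          # elements outside the sample are ignored
            self.freq[a] += 1

    def estimate(self):
        # Fraction of sampled elements that have appeared exactly once.
        return sum(1 for c in self.freq.values() if c == 1) / len(self.freq)


sampler = SampledRarity(u=8, k=8)   # k = u: the sample is all of U
for a in [1, 2, 2, 5]:
    sampler.update(a)
```

With k = u the sample covers all of U, so the estimate coincides with the exact rarity; for k much smaller than u it is only an estimate.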
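Let $Y_i$ be a random variable for a trial of choosing
an element of U, and let $Y_i = 1$ if it is rare (0 otherwise). Then
$\Pr[Y_i = 1] = \rho[t]$, and if we define $Y = \sum_{i=1}^{k} Y_i$, $\hat{\rho}[t] = \frac{Y}{k}$.
$\Pr[\hat{\rho}[t] \ge (1+\varepsilon)\rho[t]] = \Pr[Y \ge (1+\varepsilon)E[Y]] \le \left( \frac{e^{\varepsilon}}{(1+\varepsilon)^{(1+\varepsilon)}} \right)^{k\rho[t]}$
$\Pr[\hat{\rho}[t] \le (1-\varepsilon)\rho[t]] \le \left( \frac{e^{-\varepsilon}}{(1-\varepsilon)^{(1-\varepsilon)}} \right)^{k\rho[t]}$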
The probability that $\hat{\rho}[t]$ is not a $(1 \pm \varepsilon)$-approximation
of $\rho[t]$ decreases exponentially in k.
However, if $\rho[t] < 1/k$, the error probability is large.
In general, u is huge (IP addresses, etc.).
Therefore k should be large.
Change of Definition of Rarity
• The ratio of rare items to items that have actually appeared
$\rho[t] = \frac{|\{j \mid c_t[j] = 1\}|}{|\{j \mid c_t[j] > 0\}|}$
• In this case, it is also not possible to solve the
problem exactly using o(u) bits of space.
Min-wise Independent Hash Functions
Definition: [u] = {1,...,u}
Definition: A family of hash functions
$H \subseteq [u] \to [u]$ is min-wise independent
⇔ For any $X \subseteq [u]$ and $x \in X$,
$\Pr_{h \in H}\left[ h(x) = \min\{h(y) \mid y \in X\} \right] = \frac{1}{|X|}$
Approximation Algorithm
• Choose k min-wise hash functions randomly
• Let $h_i^*(t) = \min_{j \le t} h_i(a_j)$
$C_i(t)$: number of elements $a_j$ ($j \le t$) with the current
hash value $h_i^*(t)$
• For each input $a_t$, update the variables
$h_i^*(t) = \min\{h_i(a_t), h_i^*(t-1)\}$
If $h_i(a_t) = h_i^*(t-1)$, $C_i(t) = C_i(t-1) + 1$
If $h_i(a_t) < h_i^*(t-1)$, $C_i(t) = 1$
otherwise $C_i(t) = C_i(t-1)$
• For a query, return $\hat{\rho}[t] = \frac{|\{i \mid 1 \le i \le k,\ C_i(t) = 1\}|}{k}$
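The update and query rules can be sketched as follows. As an assumption suited only to small u, each hash function is an explicit random permutation of U, which is exactly min-wise independent but costs far too much memory for large u. The class name `MinHashRarity` and the seed parameter are hypothetical.

```python
import random


class MinHashRarity:
    """Rarity estimate from k min-hash values and their occurrence counts."""

    def __init__(self, u, k, seed=0):
        rng = random.Random(seed)
        self.perms = []
        for _ in range(k):
            perm = list(range(u))   # perm[a - 1] is the hash value of a
            rng.shuffle(perm)
            self.perms.append(perm)
        self.min_hash = [None] * k  # h_i^*(t)
        self.count = [0] * k        # C_i(t)

    def update(self, a):
        for i, perm in enumerate(self.perms):
            h = perm[a - 1]
            if self.min_hash[i] is None or h < self.min_hash[i]:
                self.min_hash[i] = h    # new minimum: restart the count
                self.count[i] = 1
            elif h == self.min_hash[i]:
                self.count[i] += 1      # another occurrence of the minimum
            # a larger hash value leaves both variables unchanged

    def estimate(self):
        return sum(1 for c in self.count if c == 1) / len(self.count)


once = MinHashRarity(u=10, k=20)
for a in [3, 1, 4, 5, 9, 2, 6]:
    once.update(a)

twice = MinHashRarity(u=10, k=20)
for a in [3, 1, 4, 3, 1, 4]:
    twice.update(a)
```

When every item that appears is seen exactly once the estimate is 1, and when every item is seen twice it is 0, whichever permutations are drawn.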
• $\Pr[C_i(t) = 1]$ is the probability that the element among
$a_1,\ldots,a_t$ which has the minimum hash value appeared
only once.
• From the property of min-wise independent hash
functions, the probability that an element has the
minimum hash value is $\frac{1}{|\{j \mid c_t[j] > 0\}|}$
• Therefore $\Pr[C_i(t) = 1] = \frac{|\{j \mid c_t[j] = 1\}|}{|\{j \mid c_t[j] > 0\}|} = \rho[t]$
• $\hat{\rho}[t]$ is a good approximating value of $\rho[t]$
Example of Min-wise Independent
Hash Functions
• A family of functions representing all permutations
of [u] = {1,...,u}
• $\Omega(u \lg u)$ bits are necessary to represent one
permutation. → too much memory consumption
$\varepsilon$-min-wise Independent
Hash Functions
Definition: a family of hash functions $H \subseteq [u] \to [u]$
is $\varepsilon$-min-wise independent ⇔
For any $X \subseteq [u]$ and $x \in X$,
$\Pr_{h \in H}\left[ h(x) = \min\{h(y) \mid y \in X\} \right] = \frac{1}{|X|}(1 \pm \varepsilon)$
cf.: a polynomial of degree $O(\log(1/\varepsilon))$ on GF(u)
– can be represented in $O(\log u \log(1/\varepsilon))$ bits
– function value can be computed in $O(\log(1/\varepsilon))$ time
– Note: u is a prime
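The representation and evaluation cost of such a polynomial hash can be sketched as follows. The function names, the seed parameter, and the choice of degree 4 with p = 101 are hypothetical; the sketch shows only how the function is stored and evaluated, and does not by itself establish the $\varepsilon$-min-wise property.

```python
import random


def poly_eval(coeffs, x, p):
    """Evaluate a polynomial over GF(p) by Horner's rule.

    coeffs lists the coefficients from the highest degree down,
    so the cost is one multiply-add per coefficient.
    """
    acc = 0
    for c in coeffs:
        acc = (acc * x + c) % p
    return acc


def random_poly_hash(degree, p, seed=0):
    """Pick degree + 1 random coefficients in GF(p); p is assumed prime."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(p) for _ in range(degree + 1)]
    return lambda x: poly_eval(coeffs, x, p)


h = random_poly_hash(degree=4, p=101)
values = [h(x) for x in range(101)]
```

Storing the function takes degree + 1 coefficients of about log p bits each, in line with the bit bound stated above.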
Performance of
Approximation Algorithm
• With high probability, $\hat{\rho}[t] \in (1 \pm \varepsilon)\rho[t]$
• By using large k, the success probability
will increase
• Memory: $O(k \log u \log(1/\varepsilon))$ bits
Generalization of the Problem
• Input: integer ai at time i (i = 1, 2,…)
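– $a_i \in U = \{1,2,\ldots,u\}$
• $c_t[j] = |\{a_i \mid a_i = j,\ i \le t\}|$ : number of occurrences of j until time t
• Query: compute $\alpha$-rarity $\rho_\alpha[t]$ at time t
$\rho_\alpha[t] = \frac{|\{j \mid c_t[j] = \alpha\}|}{|\{j \mid c_t[j] > 0\}|}$
• Memory: o(u) bits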
Approximation Algorithm
• Choose k $\varepsilon$-min-wise independent hash functions
• Variable $h_i^*(t) = \min_{j \le t} h_i(a_j)$
$C_i(t)$: #elements having current hash value $h_i^*(t)$
• Update variables for each input $a_t$
$h_i^*(t) = \min\{h_i(a_t), h_i^*(t-1)\}$
If $h_i(a_t) = h_i^*(t-1)$ then $C_i(t) = C_i(t-1) + 1$
If $h_i(a_t) < h_i^*(t-1)$ then $C_i(t) = 1$
otherwise $C_i(t) = C_i(t-1)$
• For a query, return $\hat{\rho}_\alpha[t] = \frac{|\{i \mid 1 \le i \le k,\ C_i(t) = \alpha\}|}{k}$
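Lemma: For $0 < \varepsilon < 1$, $0 < \tau < 1$, $1 > p > 0$, and $k \ge \frac{2}{\varepsilon^3 p}\log\frac{1}{\tau}$,
prob. that $\hat{\rho}_\alpha[t] \in (1 \pm \varepsilon)\rho_\alpha[t] + \varepsilon p$ is at least $1 - \tau$
It is not necessary to fix $\alpha$ in advance in the algorithm.
This means rarities for various values of $\alpha$ can be
obtained simultaneously.
The ratio of elements with frequency between 1 and
$\alpha$ is also obtained.
$\hat{\rho}[t] = \frac{|\{i \mid 1 \le i \le k,\ C_i(t) \le \alpha\}|}{k}$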
Rarity over Windowed Streams
• Input: integer ai at time i (i = 1, 2,…)
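• $c_t[j] = |\{a_i \mid a_i = j,\ t-N+1 \le i \le t\}|$ : number of occurrences of j
in latest N elements
• Query: compute $\alpha$-rarity $\rho_\alpha[t]$ at time t
$\rho_\alpha[t] = \frac{|\{j \mid c_t[j] = \alpha\}|}{|\{j \mid c_t[j] > 0\}|}$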
How to Store Minimum Values
in a Window
• Def: $a_j$ is active ⇔ $j \in \{t-N+1,\ldots,t\}$
• Store candidates of minimum hash values in a
window
• If x < y and $h_i(a_x) > h_i(a_y)$, x is not stored
– $a_x$ is active ⇒ $a_y$ is also active
– $h_i(a_x)$ cannot be a minimum value
(Figure: example window; the stored candidates are at times 11, 16, 17, 20, 24 with hash values 12, 15, 29, 32, 34.)
time t:      10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
$h_i(a_t)$:  20 12 75 26 23 20 15 29 40 45 32 66 51 47 34
• Store list of candidate minimum values
$(h_i(a_{j_1}), j_1), (h_i(a_{j_2}), j_2), \ldots, (h_i(a_{j_l}), j_l)$ with $j_1 < j_2 < \cdots < j_l$
• It holds $h_i(a_{j_1}) < h_i(a_{j_2}) < \cdots < h_i(a_{j_l})$ ⇒ $h_i^*(t) = h_i(a_{j_1})$
• How to update list: When $a_{t+1}$ arrives,
– Find maximum $m$ such that $h_i(a_{j_m}) \le h_i(a_{t+1})$
– Remove all elements to the right of $(h_i(a_{j_m}), j_m)$ (all elements if no such $m$ exists)
– If $h_i(a_{j_m}) < h_i(a_{t+1})$ or no such $m$ exists, add $(h_i(a_{t+1}), t+1)$ at the tail
– If $h_i(a_{j_m}) = h_i(a_{t+1})$, change $(h_i(a_{j_m}), j_m)$ to $(h_i(a_{j_m}), t+1)$
– If $(h_i(a_{j_1}), j_1)$ becomes inactive (if $j_1 < t-N+2$), delete it
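The list update can be sketched with a double-ended queue. The class name `WindowMin` is hypothetical; the test data are the hash values at times 10 to 24 from the example on the previous slide, with the window length N = 15 chosen here so that all of them are active.

```python
from collections import deque


class WindowMin:
    """Candidate list for the minimum hash value over the latest N items."""

    def __init__(self, N):
        self.N = N
        # Pairs (hash value, time); hash values strictly increase from
        # head to tail, so the head is the current window minimum.
        self.cands = deque()

    def push(self, h, t):
        # Drop candidates that the new item makes useless (larger hash,
        # earlier time): they can never be the minimum again.
        while self.cands and self.cands[-1][0] > h:
            self.cands.pop()
        if self.cands and self.cands[-1][0] == h:
            self.cands[-1] = (h, t)      # same hash value: keep latest time
        else:
            self.cands.append((h, t))
        # Expire the head once it leaves the window {t-N+1,...,t}.
        while self.cands[0][1] < t - self.N + 1:
            self.cands.popleft()

    def minimum(self):
        return self.cands[0][0]


hashes = [20, 12, 75, 26, 23, 20, 15, 29, 40, 45, 32, 66, 51, 47, 34]
wm = WindowMin(N=15)
for t, h in zip(range(10, 25), hashes):
    wm.push(h, t)

short = WindowMin(N=5)
for t, h in zip(range(10, 25), hashes):
    short.push(h, t)
```

With N = 15 the list ends as (12, 11), (15, 16), (29, 17), (32, 20), (34, 24), matching the stored candidates in the example; with N = 5 only (32, 20) and (34, 24) remain active.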
Lemma: The length of the list is with high probability
$\Theta(H_N) = \Theta\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{N}\right) = \Theta(\log N)$
Proof: From the definition of min-wise hash functions,
the probability that the i-th element from the right has
the minimum hash value among the i elements is 1/i.
The length of the list is equal to the number of updates
of minimum values when we scan the list from the
right. Therefore the expected value is $H_N$. From
Chernoff Bound, the probability the value is within
a constant factor of the expected value is high.
• Store list of lists of candidate minimum values
$(h_i(a_{j_1}), L_{i,j_1}), (h_i(a_{j_2}), L_{i,j_2}), \ldots, (h_i(a_{j_l}), L_{i,j_l})$
$L_{i,j_m} = \{x \mid h_i(a_x) = h_i(a_{j_m})\}$
• How to update list. When $a_{t+1}$ arrives,
– Find maximum $m$ such that $h_i(a_{j_m}) \le h_i(a_{t+1})$
– Delete all elements in the list to the right of $(h_i(a_{j_m}), L_{i,j_m})$
– If $h_i(a_{j_m}) < h_i(a_{t+1})$, add $(h_i(a_{t+1}), L_{i,t+1})$ at the tail,
where $L_{i,t+1} = \{t+1\}$
– If $h_i(a_{j_m}) = h_i(a_{t+1})$, add t+1 at the tail of $L_{i,j_m}$
– If the head element of $L_{i,j_1}$ becomes inactive, delete it
• The number of elements in the list is with
high probability $O(\alpha \log N)$
• A $(1 \pm \varepsilon)$-approximate solution is obtained
with high probability.