Discrete Methods in
Mathematical Informatics
Kunihiko Sadakane
The University of Tokyo
http://researchmap.jp/sada/resources/
Input Stream
• $(i_t, c_t)$ arrives at time $t$ ($t = 1, 2, \ldots$)
– $i_t \in U = \{1, 2, \ldots, u\}$
– It means $a_{i_t}(t) = a_{i_t}(t-1) + c_t$
• Range of $c_t$
– $c_t$ is non-negative (cash register model)
– $c_t$ can be negative (turnstile model)
• Range of $a_i(t)$
– always non-negative (strict turnstile model)
– can be negative (non-strict turnstile model)
Queries
• Point query $Q(i)$: approximate value of $a_i$
• Range query $Q(l, r)$: approximate value of $\sum_{i=l}^{r} a_i$
• Inner product query $Q(a, b)$: approx. value of $\sum_{i=1}^{n} a_i b_i$
• Heavy hitters/frequent items
– output all elements with $a_i \ge (\phi - \varepsilon)\|a\|_1$, where $\|a\|_1 = \sum_{i=1}^{n} |a_i|$
• $\phi$-Quantile: output $j_k$ ($k = 0, \ldots, 1/\phi$) such that
$(k\phi - \varepsilon)\|a\|_1 \le \sum_{i \le j_k} a_i \le (k\phi + \varepsilon)\|a\|_1$
Count-Min Sketches
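• Data structure of CM Sketch for parameters $(\varepsilon, \delta)$
– 2D array count[1,1]...count[d,w] (initially 0), where $d = \lceil \ln(1/\delta) \rceil$, $w = \lceil e/\varepsilon \rceil$
– d pairwise independent hash functions $h_1, \ldots, h_d : \{1, \ldots, u\} \to \{1, \ldots, w\}$
• Space: $(2 + e/\varepsilon) \ln(1/\delta)$ words
• When an element $(i_t, c_t)$ arrives
– for each $1 \le j \le d$, $\mathrm{count}[j, h_j(i_t)] := \mathrm{count}[j, h_j(i_t)] + c_t$
[Figure: a d × w array of counters; each of $h_1, \ldots, h_d$ maps $i_t$ to one column (1..w) of its row, and $c_t$ is added to that counter.]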
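The update rule above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the class name `CountMinSketch`, the modulus `PRIME`, and the hash form `((p*x + q) mod PRIME) mod w` are assumptions made for the example (reducing mod w afterwards is only approximately pairwise independent), and columns are indexed from 0 rather than 1. `point_query` returns the minimum over the d rows.

```python
import math
import random


class CountMinSketch:
    """Count-Min sketch with d = ceil(ln(1/delta)) rows and w = ceil(e/eps) columns."""

    PRIME = (1 << 61) - 1  # assumed modulus (a Mersenne prime larger than any item id)

    def __init__(self, eps, delta, seed=0):
        self.d = math.ceil(math.log(1.0 / delta))
        self.w = math.ceil(math.e / eps)
        self.count = [[0] * self.w for _ in range(self.d)]
        rng = random.Random(seed)
        # One (p, q) pair per row for h_j(x) = ((p*x + q) mod PRIME) mod w.
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.d)]

    def _h(self, j, item):
        p, q = self.coeffs[j]
        return ((p * item + q) % self.PRIME) % self.w

    def update(self, item, c=1):
        # count[j, h_j(i_t)] += c_t for every row j.
        for j in range(self.d):
            self.count[j][self._h(j, item)] += c

    def point_query(self, item):
        # Minimum over rows; never below the true value if all updates are non-negative.
        return min(self.count[j][self._h(j, item)] for j in range(self.d))
```

With eps = delta = 0.01 this allocates 5 rows of 272 counters, independent of the universe size u.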
Pairwise Independent Hash Functions
Definition
[n] = {1, 2, …, n}
A family of functions $H = \{h \mid h: [N] \to [M]\}$ is called
a family of pairwise independent hash functions if
for all $i \ne j \in [N]$ and $k, l \in [M]$,
$\Pr_{h \in H}[h(i) = k \wedge h(j) = l] = \frac{1}{M^2}$
Example (M: prime)
$h(x) = (px + q) \bmod M$ ($p, q \in \{0, 1, \ldots, M-1\}$)
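The example family can be checked exhaustively for a small modulus. The sketch below assumes inputs and hash values are taken from {0, ..., M-1} (rather than {1, ..., M}) and that the domain size equals M; the function names are hypothetical. For every pair of distinct inputs it counts how often each pair of hash values occurs over all M² choices of (p, q); pairwise independence means every pair occurs exactly once.

```python
from itertools import product


def joint_counts(M, i, j):
    """Count, over all (p, q), how often (h(i), h(j)) equals each pair (k, l)."""
    counts = {}
    for p, q in product(range(M), repeat=2):
        key = ((p * i + q) % M, (p * j + q) % M)
        counts[key] = counts.get(key, 0) + 1
    return counts


def is_pairwise_independent(M):
    """True if every (k, l) has probability exactly 1/M^2 for all inputs i != j."""
    for i, j in product(range(M), repeat=2):
        if i == j:
            continue
        counts = joint_counts(M, i, j)
        if len(counts) != M * M or any(c != 1 for c in counts.values()):
            return False
    return True
```

Running the check with a composite modulus such as M = 4 fails, which illustrates why the example requires M to be prime.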
Point Query
• For strict turnstile model
$\hat{a}_i = \min_j \mathrm{count}[j, h_j(i)]$
Theorem 1: the approx. value by CM sketch satisfies
$\hat{a}_i \ge a_i$ (always)
$\hat{a}_i \le a_i + \varepsilon\|a\|_1$ (with Prob. at least $1 - \delta$)
Proof: every time element $i$ arrives, $\mathrm{count}[j, h_j(i)]$ is increased
by $c_t$ ($1 \le j \le d$). This counter may also be increased by
other elements with the same hash value.
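Define random variables $I_{i,j,k}$ as
$I_{i,j,k} = 1$ if $(i \ne k) \wedge (h_j(i) = h_j(k))$, and $I_{i,j,k} = 0$ otherwise.
From pairwise independence of hash functions
$E[I_{i,j,k}] = \Pr[h_j(i) = h_j(k)] \le \frac{1}{w} \le \frac{\varepsilon}{e}$
Let $X_{i,j}$ be a random variable with $X_{i,j} = \sum_{k=1}^{u} I_{i,j,k}\, a_k$.
Then $X_{i,j} \ge 0$ and $\mathrm{count}[j, h_j(i)] = a_i + X_{i,j}$.
$E[X_{i,j}] = E\left[\sum_{k=1}^{u} I_{i,j,k}\, a_k\right] = \sum_{k=1}^{u} a_k E[I_{i,j,k}] \le \frac{\varepsilon}{e}\|a\|_1$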
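From Markov's inequality
$\Pr[\hat{a}_i > a_i + \varepsilon\|a\|_1] = \Pr[\forall j:\ \mathrm{count}[j, h_j(i)] > a_i + \varepsilon\|a\|_1]$
$= \Pr[\forall j:\ a_i + X_{i,j} > a_i + \varepsilon\|a\|_1]$
$\le \Pr[\forall j:\ X_{i,j} > e\,E[X_{i,j}]] < e^{-d} \le \delta$
(the d hash functions are chosen independently, so the d events each have probability less than 1/e and multiply).
The algorithm has the same approximation ratio for
any b > 1 by setting $d = \lceil \log_b(1/\delta) \rceil$, $w = \lceil b/\varepsilon \rceil$, but
the space (O(wd)) is minimized if b = e.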
Point Query
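• For non-strict turnstile model
$\hat{a}_i = \mathrm{median}_j\, \mathrm{count}[j, h_j(i)]$ (median of d values)
Theorem 2: The approx. value by CM sketch satisfies
$a_i - 3\varepsilon\|a\|_1 \le \hat{a}_i \le a_i + 3\varepsilon\|a\|_1$ (with Pr. at least $1 - \delta^{1/4}$)
Proof: $E[|\mathrm{count}[j, h_j(i)] - a_i|] = E[|X_{i,j}|] \le \frac{\varepsilon}{e}\|a\|_1$
$\Pr[|\mathrm{count}[j, h_j(i)] - a_i| > 3\varepsilon\|a\|_1] \le \frac{E[|X_{i,j}|]}{3\varepsilon\|a\|_1} \le \frac{\varepsilon\|a\|_1}{e} \cdot \frac{1}{3\varepsilon\|a\|_1} = \frac{1}{3e} < \frac{1}{8}$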
(Pr. the error of approx. value of $a_i$ is outside range)
= (Pr. median of d values is outside range)
≤ (Pr. at least d/2 values are to the right of range)
+ (Pr. at least d/2 values are to the left of range)
≤ (Pr. at least d/2 values are outside range)
Let $Y_j$ be a random variable taking 1 if $|\mathrm{count}[j, h_j(i)] - a_i| > 3\varepsilon\|a\|_1$, and 0 otherwise.
Then $E[Y_j] < 1/8$. Let $Y = \sum_{j=1}^{d} Y_j$, then $E[Y] < d/8$.
From Chernoff bound,
$\Pr[Y > d/2] \le \Pr[Y > 4E[Y]] < \left(\frac{e^3}{4^4}\right)^{d/8} < \delta^{1/4}$
The Chernoff Bound
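Theorem: Let $X_1, X_2, \ldots, X_n$ be independent 0/1 random
variables and assume $\Pr[X_i = 1] = p_i$ ($0 < p_i < 1$).
Let $X = \sum_{i=1}^{n} X_i$ and $\mu = E[X] = \sum_{i=1}^{n} p_i$. Then, for $\delta > 0$,
$\Pr[X > (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}$
and, for $0 < \delta < 1$,
$\Pr[X < (1-\delta)\mu] \le \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}$
(Here $\delta$ is the relative deviation, not the failure probability of the sketch.)
Note: If all $p_i$ are the same, the $X_i$ are called Bernoulli trials.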
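As a numeric sanity check of how the upper-tail bound is used in the median analysis, the sketch below evaluates $(e^{3}/4^{4})^{d/8}$ and compares it with the fourth root of the failure probability. The function names are hypothetical, and d is taken as ln(1/δ) without the ceiling for simplicity.

```python
import math


def chernoff_upper(mu, dev):
    """Upper-tail bound (e^dev / (1+dev)^(1+dev))^mu for Pr[X > (1+dev) * mu]."""
    return (math.exp(dev) / (1.0 + dev) ** (1.0 + dev)) ** mu


def median_failure_bound(fail_prob):
    """Bound on Pr[Y > d/2] using mean d/8 and deviation 3 (so that 4 * d/8 = d/2)."""
    d = math.log(1.0 / fail_prob)  # number of rows, ignoring the ceiling
    return chernoff_upper(d / 8.0, 3.0)
```

Since $e^{3}/4^{4} \approx 0.078$, the bound equals the failure probability raised to a power of about 0.318, which is smaller than its fourth root whenever the failure probability is below 1.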
Inner Product Query
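• a, b: non-negative vectors (both sketches use the same hash functions)
• Let $\hat{Q}_j(a, b) = \sum_{k=1}^{w} \mathrm{count}_a[j, k] \cdot \mathrm{count}_b[j, k]$ and
$\hat{Q}(a, b) = \min_j \hat{Q}_j(a, b)$
Theorem 3: The approx. value of $Q(a, b) = \sum_{i=1}^{n} a_i b_i$ satisfies
$\hat{Q}(a, b) \ge Q(a, b)$ (always)
$\hat{Q}(a, b) \le Q(a, b) + \varepsilon\|a\|_1\|b\|_1$ (Pr. at least $1 - \delta$)
Proof:
$\hat{Q}_j(a, b) = \sum_{i=1}^{n} a_i b_i + \sum_{p \ne q,\ h_j(p) = h_j(q)} a_p b_q$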
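Because the vectors are non-negative, $\hat{Q}(a, b) \ge Q(a, b)$.
From pairwise independence of hash functions
$E[\hat{Q}_j(a, b) - Q(a, b)] = \sum_{p \ne q} \Pr[h_j(p) = h_j(q)]\, a_p b_q \le \sum_{p \ne q} \frac{\varepsilon\, a_p b_q}{e} \le \frac{\varepsilon\|a\|_1\|b\|_1}{e}$
From Markov's inequality
$\Pr[\hat{Q}(a, b) - Q(a, b) > \varepsilon\|a\|_1\|b\|_1] = \Pr[\forall j:\ \hat{Q}_j(a, b) - Q(a, b) > \varepsilon\|a\|_1\|b\|_1] \le e^{-d} \le \delta$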
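The inner product estimate can be illustrated with two count arrays built from the same hash functions. This is a sketch under assumptions: the helper names `make_hashes`, `sketch`, and `inner_product_estimate`, the modulus `PRIME`, and the hash form are all hypothetical choices for the example, and vectors are given as dictionaries from item to non-negative value.

```python
import random

PRIME = (1 << 61) - 1  # assumed modulus for the illustrative hash family


def make_hashes(d, w, seed=0):
    """Return d hash functions x -> ((p*x + q) mod PRIME) mod w."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(d)]
    return [lambda x, p=p, q=q: ((p * x + q) % PRIME) % w for p, q in coeffs]


def sketch(vector, hashes, w):
    """vector: dict item -> non-negative value. Returns a d x w count array."""
    count = [[0] * w for _ in hashes]
    for item, value in vector.items():
        for j, h in enumerate(hashes):
            count[j][h(item)] += value
    return count


def inner_product_estimate(count_a, count_b):
    """min over rows j of sum_k count_a[j][k] * count_b[j][k]."""
    return min(sum(x * y for x, y in zip(row_a, row_b))
               for row_a, row_b in zip(count_a, count_b))
```

Because both vectors are non-negative, every row sum contains the true inner product plus non-negative collision terms, so the estimate never falls below the exact value.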
Range Query
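• We want to approximate $Q(l, r) = \sum_{i=l}^{r} a_i$
• Using a point query for every point in the range
$a_i \le \hat{a}_i \le a_i + \varepsilon\|a\|_1$ (Pr. $1 - \delta$)
$Q(l, r) \le \hat{Q}(l, r) \le Q(l, r) + (r - l + 1)\varepsilon\|a\|_1$ (Pr. $1 - \delta$)
• We use dyadic ranges
[Figure: dyadic ranges for u = 16, with the query range [4,11] as an example]
[1,8] [9,16]
[1,4] [5,8] [9,12] [13,16]
[1,2] [3,4] [5,6] [7,8] [9,10] [11,12] [13,14] [15,16]
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16]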
Any range is represented by a union of at most $2\log_2 u$
dyadic ranges.
Any point in [1...u] is contained in $\log_2 u$ dyadic ranges
with distinct lengths.
We say a range is of level k if its length is $2^k$.
We use a CM sketch with parameters $(\varepsilon, \delta)$ for
each level k.
• Input: $A_k[1 \ldots u/2^k]$. Each element is the sum of
$2^k$ consecutive $a_i$'s
$A_k[j] \le \hat{A}_k[j] \le A_k[j] + \varepsilon\|A_k\|_1 = A_k[j] + \varepsilon\|a\|_1$ (Pr. $1 - \delta$)
• When $a_i$ comes, update the $\log_2 u$ entries $A_k[\cdot]$ that include it.
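The decomposition of a query range into dyadic ranges can be sketched as a greedy scan from the left end. The function name `dyadic_decompose` is hypothetical; ranges are 1-based and inclusive, and a dyadic range of length $2^k$ is assumed to start at a position of the form $j \cdot 2^k + 1$.

```python
def dyadic_decompose(l, r):
    """Split [l, r] (1-based, inclusive) into maximal aligned dyadic ranges."""
    ranges = []
    while l <= r:
        size = 1
        # Grow the block while it stays aligned at l and still fits inside [l, r].
        while (l - 1) % (2 * size) == 0 and l + 2 * size - 1 <= r:
            size *= 2
        ranges.append((l, l + size - 1))
        l += size
    return ranges
```

For the query range [4,11] this yields [4,4], [5,8], [9,10], [11,11], so the range sum is obtained from four point queries on the per-level sketches instead of eight.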
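For a range query, we use point queries on the dyadic
ranges representing the query range.
It holds $Q(l, r) = \sum_k A_k[l_k, r_k]$, where $A_k[l_k, r_k]$ denotes the level-k dyadic ranges used.
$\sum_k A_k[l_k, r_k] \le \sum_k \hat{A}_k[l_k, r_k] \le \sum_k \left(A_k[l_k, r_k] + \varepsilon\|a\|_1\right) \le \sum_k A_k[l_k, r_k] + 2\varepsilon\|a\|_1 \log_2 u$
(at most $2\log_2 u$ dyadic ranges are used in total, each with error at most $\varepsilon\|a\|_1$)
Theorem 4: The approx. value of $Q(l, r) = \sum_{i=l}^{r} a_i$ satisfies
$\hat{Q}(l, r) \ge Q(l, r)$ (always)
$\hat{Q}(l, r) \le Q(l, r) + 2\varepsilon\|a\|_1 \log_2 u$ (Pr. at least $1 - \delta$)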
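Proof: The expected error of each $\hat{A}_k[l_k, r_k]$ is at most $\frac{\varepsilon}{e}\|a\|_1$.
Thus the expected error of $\sum_k \hat{A}_k[l_k, r_k]$ is at most $\frac{2\log_2 u}{e}\,\varepsilon\|a\|_1$.
The probability that it exceeds $2\log_2 u \cdot \varepsilon\|a\|_1$ is at most 1/e.
By using d hash functions, the failure probability is
bounded by $\delta$.
Time to update the data structure: $O\left(\log u \log\frac{1}{\delta}\right)$
Space: $O\left(\frac{1}{\varepsilon} \log u \log\frac{1}{\delta}\right)$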
Improving Approximation Precision
• To make $Q(l, r) \le \hat{Q}(l, r) \le Q(l, r) + \varepsilon\|a\|_1$, set $\varepsilon' = \frac{\varepsilon}{2\log_2 u}$
– space: $O\left(\frac{\log^2 u}{\varepsilon} \log\frac{1}{\delta}\right)$
• For levels with large k, instead of using CM sketch,
store exact values
• If a query range is represented by a small number of
dyadic ranges, the precision is high.
Frequent Item Query
• In strict turnstile model, if $c_t < 0$, $\|a\|_1$ will decrease.
The algorithms so far cannot be used.
• We use dyadic ranges
• For each level, use a CM sketch with parameters $\left(\varepsilon, \frac{\delta\phi}{2\log u}\right)$
• Updating is the same as for point queries
• Query processing: parallel binary search
– Search from the highest level of dyadic ranges
– If the freq. in a range is at least $(\phi + \varepsilon)\|a\|_1$, continue
– Output elements with freq. at least $(\phi + \varepsilon)\|a\|_1$ at the
lowest level
Theorem 6: Any element with freq. at least $(\phi + \varepsilon)\|a\|_1$
is output. With Pr. at least $1 - \delta$, any element with
freq. less than $\phi\|a\|_1$ is not output.
Space: $O\left(\frac{1}{\varepsilon} \log u \log\frac{2\log u}{\delta\phi}\right)$
Update time: $O\left(\log u \log\frac{2\log u}{\delta\phi}\right)$
Proof: In each level, the number of elements with
freq. at least $\phi\|a\|_1$ is at most $1/\phi$. In each level, the
number of point queries is at most $2/\phi$. Therefore
in total $(2/\phi) \cdot \log_2 u$ times.
The probability that a query fails is at most $\frac{\delta\phi}{2\log_2 u}$.
The probability that some of the $(2\log_2 u)/\phi$ queries
fail is at most $\delta$.
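The level-by-level search can be sketched as follows. To keep the example short, exact range counts stand in for the per-level CM sketch point queries; that substitution, the function name `heavy_hitters`, and the assumption that u is a power of two with u ≥ 2 are all choices made for illustration only.

```python
def heavy_hitters(freq, u, threshold):
    """Return items in 1..u whose count reaches threshold, searching dyadic ranges top-down.

    freq: dict item -> count. u: power of two, u >= 2.
    """
    def range_count(lo, hi):
        # Stand-in for a point query on the sketch of the level containing [lo, hi].
        return sum(c for i, c in freq.items() if lo <= i <= hi)

    candidates = [(1, u)]
    # All candidates are on the same level, so inspecting the first one is enough.
    while candidates and candidates[0][0] != candidates[0][1]:
        survivors = []
        for lo, hi in candidates:
            mid = (lo + hi) // 2
            for child in ((lo, mid), (mid + 1, hi)):
                if range_count(*child) >= threshold:
                    survivors.append(child)
        candidates = survivors
    return [lo for lo, _ in candidates]
```

A range can only contain a frequent item if the range itself is frequent, so at most 1/φ ranges survive per level and only their 2/φ children are queried on the next level.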
Quantile Query
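• Return $j_k$ ($k = 0, \ldots, 1/\phi$) with
$(k\phi - \varepsilon)\|a\|_1 \le \sum_{i \le j_k} a_i \le (k\phi + \varepsilon)\|a\|_1$
• We use $\left(\frac{\varepsilon}{\log_2 u}, \frac{\delta}{\log_2 u}\right)$ CM sketches for range queries.
$a_i \le \hat{a}_i \le a_i + \frac{\varepsilon}{\log_2 u}\|a\|_1$ (Pr. $1 - \frac{\delta}{\log_2 u}$)
$Q(l, r) \le \hat{Q}(l, r) \le Q(l, r) + 2\varepsilon\|a\|_1$ (Pr. $1 - \frac{\delta}{\log_2 u}$)
• Because the left end point of query ranges is 1,
we use only one dyadic range for each level.
$Q(1, r) \le \hat{Q}(1, r) \le Q(1, r) + \varepsilon\|a\|_1$ (Pr. $1 - \frac{\delta}{\log_2 u}$)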
Data Structures
• We consider an imaginary complete binary tree
for U = {1,2,...,u}
• Each leaf stores the frequency of one element
• We maintain the data structure so that each node v
satisfies
$\mathrm{count}(v) \le n/m$ (1)
$\mathrm{count}(v) + \mathrm{count}(v_p) + \mathrm{count}(v_s) > n/m$ (2)
Do not apply (1) on leaves ($v_p$: parent of v, $v_s$: sibling of v)
Do not apply (2) on root
(m: parameter, $n = \|a\|_1$)
• If there is a node v not satisfying invariant (2)
– count($v_p$) += count(v) + count($v_s$)
– delete v and $v_s$
[Figure: example of merging nodes in the tree, with n = 15, m = 5, u = 8]
• When an element $a_t$ comes
– increase count by 1 for the corresponding leaf
– check if the leaf count satisfies (1), (2)
– if count is small, it is merged with others
Lemma 1: The number of nodes of the tree is
at most 3m.
Proof: Let Q be the node set, n be the total freq.
From invariant (2),
$\sum_{v \in Q} \left(\mathrm{count}(v) + \mathrm{count}(v_p) + \mathrm{count}(v_s)\right) > |Q| \cdot \frac{n}{m}$
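On the left side, the count value of each node appears
at most three times.
$\sum_{v \in Q} \left(\mathrm{count}(v) + \mathrm{count}(v_p) + \mathrm{count}(v_s)\right) \le 3 \sum_{v \in Q} \mathrm{count}(v) = 3n$
Therefore
$|Q| \cdot \frac{n}{m} < 3n$, that is, $|Q| < 3m$.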
• When an element comes, if its frequency is small,
it is accumulated in an ancestor node.
• If all the ancestors of a leaf have 0 count values,
the leaf count is correct.
• Otherwise the leaf count is at most the real value.
Lemma 2: The error of a leaf count value is at most $\frac{\log u}{m}\, n$
Proof: The error is at most the summation of count
values in ancestors of the leaf.
$\mathrm{error}(v) \le \sum_{x \in \mathrm{ancestor}(v)} \mathrm{count}(x) \le \sum_{x \in \mathrm{ancestor}(v)} \frac{n}{m} \le \frac{\log u}{m}\, n$
Consider all leaves $a_l, \ldots, a_r$ in the subtree rooted at an
internal node v.
If the count values of all ancestors of v are 0,
the summation of frequencies of $a_l, \ldots, a_r$ is equal to
the summation of count values in the subtree.
The error in the sum of count values in a subtree is
at most $\frac{\log u}{m}\, n$.
[Figure: complete binary tree over U = {1,...,8}]
[1,8]
[1,4] [5,8]
[1,2] [3,4] [5,6] [7,8]
1 2 3 4 5 6 7 8
Query Algorithm
• Traverse tree nodes in post-order
• c = (sum of count values in visited nodes)
• If c becomes at least $k\phi\|a\|_1$, output the node
• If the node is not a leaf, output the rightmost value
in the range corresponding to the node.
[Figure: example query on the tree over U = {1,...,8}, with n = 15, m = 5, u = 8, $\phi$ = 0.5, k = 1]
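The query procedure can be sketched as below, assuming the stored nodes are given as a dictionary from dyadic range (lo, hi) to count. The function name `quantile_query` and the example tree used afterwards are hypothetical and are not taken from the figure. Sorting the stored ranges by right end point, with shorter ranges first on ties, reproduces the post-order of the complete binary tree.

```python
def quantile_query(nodes, target):
    """Return the right end of the first node at which the running count reaches target.

    nodes: dict (lo, hi) -> count over dyadic ranges; target: k * phi * ||a||_1.
    """
    # Post-order: descendants and left siblings come before a node.
    order = sorted(nodes, key=lambda rng: (rng[1], rng[1] - rng[0]))
    c = 0
    for lo, hi in order:
        c += nodes[(lo, hi)]
        if c >= target:
            return hi  # rightmost value of the node's range (the leaf itself if lo == hi)
    return None
```

For the illustrative node set {[1,2]: 2, [3,3]: 4, [1,4]: 1, [5,6]: 2, [8,8]: 6} with n = 15, φ = 0.5 and k = 1, the running count reaches 7.5 at node [5,6], so the value 6 is returned.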
Theorem 1: If s nodes are used, the output satisfies
$k\phi\|a\|_1 \le \sum_{i \le j_k} a_i \le \left(k\phi + \frac{3\log u}{s}\right)\|a\|_1$
Proof: Assume that an element x is output at node v.
Consider nodes storing counts of elements 1,...,x.
Because nodes are visited in post-order, counts which
are not stored in nodes before v are stored in ancestors
of v. Therefore the error is at most $\frac{\log u}{m}\, n$.
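From Lemma 1, $s \le 3m$. Thus
$k\phi\|a\|_1 \le \sum_{i \le j_k} a_i \le \left(k\phi + \frac{3\log u}{s}\right)\|a\|_1$
Required memory to bound the error by $\varepsilon$ is $O\left(\frac{\log u}{\varepsilon}\right)$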
Relation between
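Heavy Hitters and $\phi$-Quantile
• $(\phi, \varepsilon)$-heavy hitters
– output all elements with $a_i \ge \phi\|a\|_1$
– do not output any element with $a_i < (\phi - \varepsilon)\|a\|_1$
– error is at most $\varepsilon\|a\|_1$
• $(\phi, \varepsilon)$-quantile: output $j_k$ ($k = 0, \ldots, 1/\phi$) satisfying
$(k\phi - \varepsilon)\|a\|_1 \le \sum_{i \le j_k} a_i \le (k\phi + \varepsilon)\|a\|_1$
• Assume that the answers are obtained with Pr. $1 - \delta$.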
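Lemma: $(\phi + \varepsilon, 2\varepsilon)$-heavy hitters can be obtained from
$(\varepsilon, \varepsilon/2)$-quantile.
Proof: Let $j_k$ ($k = 0, \ldots, 1/\varepsilon$) be outputs of $(\varepsilon, \varepsilon/2)$-quantile.
If $j_k = j_{k+1}$, output $j_k$ as a heavy hitter. Estimated
frequency is (number of equal $j_k$'s − 1) $\cdot\, \varepsilon\|a\|_1$
(error $\le 2\varepsilon\|a\|_1$)
For an element with frequency less than $(\phi - \varepsilon)\|a\|_1$,
it does not happen that $j_k = j_{k+1}$, and it is not output.
Any element with freq. at least $(\phi + \varepsilon)\|a\|_1$ is output.
[Figure: the sorted stream aaaabbbbbbcccdddddddddeeeeffffff of total length $\|a\|_1$, with the quantile positions for k = 0, 1, 2, 3, 4 marked along it]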
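The reduction can be tried on the example stream from the figure. In the sketch below an exact quantile oracle stands in for the approximate quantile summary, with $j_k$ taken as the item of rank $\lceil kn/q \rceil$ where q = 1/ε; this choice, the function name, and the parameter q = 8 in the usage note are assumptions for illustration only.

```python
def heavy_hitters_from_quantiles(sorted_items, q):
    """Report items that occur as at least two of the q+1 quantile outputs.

    sorted_items: the stream in sorted order; q: number of quantile steps (1/eps).
    Returns a dict item -> estimated frequency (number of equal j_k's - 1) * n / q.
    """
    n = len(sorted_items)
    # Exact quantile oracle as a stand-in: j_k is the item of rank ceil(k*n/q), at least 1.
    quantiles = [sorted_items[max(1, -(-k * n // q)) - 1] for k in range(q + 1)]
    runs = {}
    for item in quantiles:
        runs[item] = runs.get(item, 0) + 1
    return {item: (run - 1) * n / q for item, run in runs.items() if run >= 2}
```

On the 32-character stream with q = 8, the items a, d, and f are each hit by two consecutive quantiles and get an estimated frequency of 4, while their true frequencies are 4, 9, and 6.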
Lemma: $(\phi, \varepsilon)$-quantile is obtained from $\log u$
$(\varepsilon/\log u, \varepsilon/\log u)$-heavy hitters (u: #distinct elements)
Proof: Partition range [1,u] into dyadic intervals of
length $2^i$ ($i = 0, \ldots, \log u - 1$).
For each i, use an $(\varepsilon/\log u, \varepsilon/\log u)$-heavy hitter
[Figure: dyadic intervals over U = {1,...,8}]
[1,8]
[1,4] [5,8]
[1,2] [3,4] [5,6] [7,8]
1 2 3 4 5 6 7 8
To obtain quantiles, do a binary search using
estimated frequencies of the ranges [1,i].
A range [1,i] is represented by at most log u
dyadic intervals. For each dyadic interval, if it is
output as a heavy hitter, we add its estimated
frequency to the estimate for [1,i].
The error in heavy hitters is at most $\varepsilon/\log u$.
⇒ the error in quantile is at most $\varepsilon$.
Finding Rare Items
• Input: integer $a_i$ at time i (i = 1, 2,…)
– $a_i \in U = \{1, 2, \ldots, u\}$
• Let $c_t[j] = |\{a_i \mid a_i = j,\ i \le t\}|$: number of occurrences
of j until time t
• Query: compute rarity $\rho[t]$ at time t
$\rho[t] = \frac{|\{j \mid c_t[j] = 1\}|}{u}$
• Memory: o(u) bits
If large memory can be used
• For each $i \in U = \{1, 2, \ldots, u\}$, use a 2-bit counter
– 0: i never appeared
– 1: i appeared once
– 2: i appeared more than once
• Using 2u bits of memory, $\rho[t]$ is computed exactly.
• Is it possible using o(u) bits of memory?
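A minimal sketch of the 2-bit counter scheme follows. The function name `exact_rarity` is hypothetical, and a Python list of small integers stands in for the packed 2u-bit array.

```python
def exact_rarity(stream, u):
    """Exact rarity |{j : c_t[j] = 1}| / u using one saturating counter per element."""
    state = [0] * (u + 1)  # each entry only ever holds 0, 1 or 2, so it fits in 2 bits
    for x in stream:
        state[x] = min(state[x] + 1, 2)  # 2 means "appeared more than once"
    return sum(1 for s in state[1:] if s == 1) / u
```

For the stream 1, 2, 2, 5, 7, 7, 7 over U = {1,...,8}, only 1 and 5 appear exactly once, so the rarity is 2/8.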
Lower bound of required memory
Proposition: Any deterministic algorithm using o(u)
bits of working space cannot compute exact solutions.
Proof: Assume to the contrary that there exists a
deterministic algorithm using o(u) bits which can compute $\rho[t]$
exactly. Also assume that, until time t, each element of a
set $S \subseteq U$ has appeared exactly once and no other element has appeared.
S can be recovered using the following algorithm.
• Put each element $i \in U$ in the stream at time t+i
• $\rho[t+i] < \rho[t+i-1] \Leftrightarrow i \in S$
Hence the o(u)-bit state at time t determines S. To represent an arbitrary S, we need u bits. Contradiction.
Approximation Algorithm
• At the start time, we choose k elements from U
uniformly at random, then count frequencies of
only those elements.
– $X_1[t], \ldots, X_k[t]$: freq. of each element at time t
• For a query, return $\hat{\rho}[t] = \frac{|\{i \mid X_i[t] = 1\}|}{k}$
• For each i, it holds $\Pr[X_i[t] = 1] = \frac{|\{j \mid c_t[j] = 1\}|}{u} = \rho[t]$,
where the probability is over the choice of the sampled elements, so
$\hat{\rho}[t]$ is a good approximation of $\rho[t]$.
• It is a good approximation only if $\rho[t] \ge 1/k$.
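The sampling estimator can be sketched as below. It assumes the k elements are drawn independently with replacement before the stream starts; the function name `sampled_rarity` and the `seed` parameter are hypothetical.

```python
import random


def sampled_rarity(stream, u, k, seed=0):
    """Estimate rarity by counting only k elements sampled uniformly from {1, ..., u}."""
    rng = random.Random(seed)
    sample = [rng.randint(1, u) for _ in range(k)]  # chosen before the stream starts
    watched = {x: 0 for x in sample}
    for x in stream:
        if x in watched:
            watched[x] += 1
    # Fraction of sampled elements that appeared exactly once.
    return sum(1 for x in sample if watched[x] == 1) / k
```

Only the k watched counters are stored, so the memory does not depend on u, at the cost of a poor estimate when the rarity is below 1/k.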
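Let $Y_i$ be a random variable for a trial of choosing
an element of U, and let $Y_i = 1$ if it is rare. Then
$\Pr[Y_i = 1] = \rho[t]$, and if we define $Y = \sum_{i=1}^{k} Y_i$, then $\hat{\rho}[t] = \frac{Y}{k}$ and $E[Y] = k\rho[t]$.
$\Pr[\hat{\rho}[t] > (1+\varepsilon)\rho[t]] = \Pr[Y > (1+\varepsilon)E[Y]] \le \left(\frac{e^{\varepsilon}}{(1+\varepsilon)^{(1+\varepsilon)}}\right)^{k\rho[t]}$
$\Pr[\hat{\rho}[t] < (1-\varepsilon)\rho[t]] = \Pr[Y < (1-\varepsilon)E[Y]] \le \left(\frac{e^{-\varepsilon}}{(1-\varepsilon)^{(1-\varepsilon)}}\right)^{k\rho[t]}$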
The probability that $\hat{\rho}[t]$ is not a good approximation
of $\rho[t]$ decreases exponentially in k.
However, if $\rho[t] < 1/k$, the error probability is large.
In general, u is huge (IP addresses, etc.).
Therefore k should be large.
Change of Definition of Rarity
• The ratio of rare items to items that actually appeared
$\rho[t] = \frac{|\{j \mid c_t[j] = 1\}|}{|\{j \mid c_t[j] > 0\}|}$
• In this case, too, it is not possible to solve the
problem exactly using o(u) bits of space.
Min-wise Independent Hash Functions
Definition: [u] = {1,...,u}
Definition: A family of hash functions
$H \subseteq [u] \to [u]$ is min-wise independent
⇔ For any $X \subseteq [u]$ and $x \in X$,
$\Pr_{h \in H}[h(x) = \min\{h(y) \mid y \in X\}] = \frac{1}{|X|}$
Approximation Algorithm
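• Choose k min-wise independent hash functions $h_1, \ldots, h_k$ randomly
• Let $h_i^*(t) = \min_{j \le t} h_i(a_j)$
$C_i(t)$: number of occurrences of the element with the current
minimum hash value $h_i^*(t)$
• For each input $a_t$, update the variables
$h_i^*(t) = \min\{h_i(a_t), h_i^*(t-1)\}$
If $h_i(a_t) = h_i^*(t-1)$, $C_i(t) = C_i(t-1) + 1$
If $h_i(a_t) < h_i^*(t-1)$, $C_i(t) = 1$
otherwise $C_i(t) = C_i(t-1)$
• For a query, return $\hat{\rho}[t] = \frac{|\{i \mid 1 \le i \le k,\ C_i(t) = 1\}|}{k}$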
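The update rule can be sketched with random permutations of U playing the role of min-wise independent hash functions. Storing full permutations takes far more memory than a practical hash family would, so this is for illustration only; the function name `minwise_rarity` and the `seed` parameter are hypothetical.

```python
import random


def minwise_rarity(stream, u, k, seed=0):
    """Estimate |{j : c[j] = 1}| / |{j : c[j] > 0}| with k random permutations of 1..u."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(range(u))
        rng.shuffle(perm)  # h_i(x) = perm[x - 1]
        perms.append(perm)
    h_min = [None] * k  # h_i^*(t): current minimum hash value
    count = [0] * k     # C_i(t): occurrences of the element attaining that minimum
    for x in stream:
        for i, perm in enumerate(perms):
            hx = perm[x - 1]
            if h_min[i] is None or hx < h_min[i]:
                h_min[i], count[i] = hx, 1  # new minimum: restart the counter
            elif hx == h_min[i]:
                count[i] += 1               # same element as the current minimum
            # a larger hash value leaves both variables unchanged
    return sum(1 for c in count if c == 1) / k
```

Each hash function effectively samples one element uniformly from the distinct elements seen so far, and the counter records whether that element is rare.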
• $\Pr[C_i(t) = 1]$ is the probability that the element among
$a_1, \ldots, a_t$ which has the minimum hash value appeared
only once.
• From the property of min-wise independent hash
functions, the probability that a given element has the
minimum hash value is $\frac{1}{|\{j \mid c_t[j] > 0\}|}$
• Therefore $\Pr[C_i(t) = 1] = \frac{|\{j \mid c_t[j] = 1\}|}{|\{j \mid c_t[j] > 0\}|} = \rho[t]$
• $\hat{\rho}[t]$ is a good approximation of $\rho[t]$
Example of Min-wise Independent
Hash Functions
• A family of functions representing all permutations
of [u] = {1,...,u}
• $\Theta(u \lg u)$ bits are necessary to represent one
permutation. → too much memory consumption
$\varepsilon$-min-wise Independent
Hash Functions
Definition: a family of hash functions $H \subseteq [u] \to [u]$
is $\varepsilon$-min-wise independent ⇔
For any $X \subseteq [u]$ and $x \in X$,
$\Pr_{h \in H}[h(x) = \min\{h(y) \mid y \in X\}] = \frac{1}{|X|}(1 \pm \varepsilon)$
cf.: a polynomial of degree $O(\log(1/\varepsilon))$ on GF(u)
– can be represented in $O(\log u \log(1/\varepsilon))$ bits
– function value can be computed in $O(\log(1/\varepsilon))$ time
– Note: u is a prime
Performance of
Approximation Algorithm
• With high probability, $\hat{\rho}[t] \in (1 \pm \varepsilon)\rho[t]$
• By using a larger k, the success probability
increases
• Memory: $O(k \log u \log(1/\varepsilon))$ bits
Generalization of the Problem
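• Input: integer $a_i$ at time i (i = 1, 2,…)
– $a_i \in U = \{1, 2, \ldots, u\}$
• $c_t[j] = |\{a_i \mid a_i = j,\ i \le t\}|$: number of occurrences of j until time t
• Query: compute $\alpha$-rarity $\rho_\alpha[t]$ at time t
$\rho_\alpha[t] = \frac{|\{j \mid c_t[j] = \alpha\}|}{|\{j \mid c_t[j] > 0\}|}$
• Memory: o(u) bits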
Approximation Algorithm
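• Choose k $\varepsilon$-min-wise independent hash functions
• Variable $h_i^*(t) = \min_{j \le t} h_i(a_j)$
$C_i(t)$: number of occurrences of the element having the current minimum hash value $h_i^*(t)$
• Update variables for each input $a_t$
$h_i^*(t) = \min\{h_i(a_t), h_i^*(t-1)\}$
If $h_i(a_t) = h_i^*(t-1)$ then $C_i(t) = C_i(t-1) + 1$
If $h_i(a_t) < h_i^*(t-1)$ then $C_i(t) = 1$
otherwise $C_i(t) = C_i(t-1)$
• For a query, return $\hat{\rho}_\alpha[t] = \frac{|\{i \mid 1 \le i \le k,\ C_i(t) = \alpha\}|}{k}$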
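Lemma: For $0 < \varepsilon < 1$, $1 > p > 0$, and $k \ge \frac{2}{\varepsilon^3 p} \log\frac{1}{\delta}$,
the prob. that $\hat{\rho}_\alpha[t] \in (1 \pm \varepsilon)\rho_\alpha[t] + \varepsilon p$ is at least $1 - \delta$.
It is not necessary to fix $\alpha$ in advance in the algorithm.
This means rarities for various values of $\alpha$ can be
obtained simultaneously.
The ratio of elements with frequency between 1 and
$\alpha$ is also obtained.
$\hat{\rho}_{\le \alpha}[t] = \frac{|\{i \mid 1 \le i \le k,\ C_i(t) \le \alpha\}|}{k}$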
Rarity over Windowed Streams
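• Input: integer $a_i$ at time i (i = 1, 2,…)
• $c_t[j] = |\{a_i \mid a_i = j,\ t-N+1 \le i \le t\}|$: number of occurrences of j
in the latest N elements
• Query: compute $\alpha$-rarity $\rho_\alpha[t]$ at time t
$\rho_\alpha[t] = \frac{|\{j \mid c_t[j] = \alpha\}|}{|\{j \mid c_t[j] > 0\}|}$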
How to Store Minimum Values
in a Window
• Def: $a_j$ is active ⇔ $j \in \{t-N+1, \ldots, t\}$
• Store candidates of minimum hash values in a
window
• If x < y and $h_i(a_x) > h_i(a_y)$, x is not stored
– $a_x$ is active ⇒ $a_y$ is also active
– $h_i(a_x)$ cannot be a minimum value
Example (current time t = 24):
time t:         10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
h_i(a_t):       20 12 75 26 23 20 15 29 40 45 32 66 51 47 34
stored times:   11 16 17 20 24
stored values:  12 15 29 32 34
• Store list of candidate minimum values
$(h_i(a_{j_1}), j_1), (h_i(a_{j_2}), j_2), \ldots, (h_i(a_{j_l}), j_l)$ with $j_1 < j_2 < \cdots < j_l$
• It holds $h_i(a_{j_1}) < h_i(a_{j_2}) < \cdots < h_i(a_{j_l})$ ⇒ $h_i^*(t) = h_i(a_{j_1})$
• How to update list: When $a_{t+1}$ arrives,
– Find maximum $\beta$ such that $h_i(a_{j_\beta}) \le h_i(a_{t+1})$
– Remove all elements to the right of $\beta$
– If $h_i(a_{j_\beta}) < h_i(a_{t+1})$, add $(h_i(a_{t+1}), t+1)$ at the tail
– If $h_i(a_{j_\beta}) = h_i(a_{t+1})$, change $(h_i(a_{j_\beta}), j_\beta)$ to $(h_i(a_{j_\beta}), t+1)$
– If $(h_i(a_{j_1}), j_1)$ becomes inactive (if $j_1 < t-N+2$), delete it
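The list update can be sketched with a double-ended queue holding (hash value, time) pairs in increasing order of both components. The function name `update_candidates` is hypothetical, and the hash values are passed in directly rather than computed from stream elements.

```python
from collections import deque


def update_candidates(cands, hash_value, time, N):
    """Insert the arrival (hash_value, time) into the candidate list for window size N.

    cands: deque of (hash, time) pairs, increasing in both hash and time.
    """
    # Entries with a larger hash than the new arrival can never be the minimum again.
    while cands and cands[-1][0] > hash_value:
        cands.pop()
    if cands and cands[-1][0] == hash_value:
        cands[-1] = (hash_value, time)  # keep only the most recent occurrence
    else:
        cands.append((hash_value, time))
    # Drop the head once it has left the window {time-N+1, ..., time}.
    while cands and cands[0][1] < time - N + 1:
        cands.popleft()
    return cands
```

Feeding the hash values 20, 12, 75, 26, 23, 20, 15, 29, 40, 45, 32, 66, 51, 47, 34 at times 10 to 24 with N = 15 leaves the candidates (12, 11), (15, 16), (29, 17), (32, 20), (34, 24); the head of the deque is the window minimum.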
Lemma: The length of the list is, with high probability,
$\Theta(H_N)$, where $H_N = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{N} = \Theta(\log N)$.
Proof: From the definition of min-wise hash functions,
the probability that the i-th element from the right has
the minimum hash value among the i rightmost elements is 1/i.
The length of the list is equal to the number of updates
of the minimum value when we scan the window from the
right. Therefore the expected value is $H_N$. From the
Chernoff bound, the probability that the value is within
a constant factor of the expected value is high.
• Store list of lists of candidate minimum values
$(h_i(a_{j_1}), L_{i,j_1}), (h_i(a_{j_2}), L_{i,j_2}), \ldots, (h_i(a_{j_l}), L_{i,j_l})$
$L_{i,j_m} = \{x \mid h_i(a_x) = h_i(a_{j_m})\}$
• How to update list. When $a_{t+1}$ arrives,
– Find maximum $\beta$ such that $h_i(a_{j_\beta}) \le h_i(a_{t+1})$
– Delete all elements in the list to the right of $\beta$
– If $h_i(a_{j_\beta}) < h_i(a_{t+1})$, add $(h_i(a_{t+1}), L_{i,t+1})$ at the tail, with
$L_{i,t+1} = \{t+1\}$
– If $h_i(a_{j_\beta}) = h_i(a_{t+1})$, add t+1 at the tail of $L_{i,j_\beta}$
– If the head element of $L_{i,j_1}$ becomes inactive, delete it
• The number of elements in the list is with
high probability O( log N)
• A $(1 \pm \varepsilon)$-approximate solution is obtained
with high probability.