Various Sketching Techniques
DYLAN GRAY
CS 240B
Outline
Distinct Count Methods
Comparing Data Streams
Count-Min Sketch
Stream Count Distinct [3]
s = stream of values (i, v)
◦ i = id ∈ {1, …, N-1}
◦ v = number of i’s to insert or delete
Algorithm 1: k-set
Algorithm 2: table
Stream Count Distinct
Ex.
s = (1, 3), (2, 4), (1, -1), (3, 7), (2, -5), (1, 1)
res (count per ID after each update):
Update     ID 1   ID 2   ID 3
(start)    0      0      0
(1, 3)     3      0      0
(2, 4)     3      4      0
(1, -1)    2      4      0
(3, 7)     2      4      7
(2, -5)    2      -1     7
(1, 1)     3      -1     7
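The example stream can be replayed exactly with a plain dictionary, which is the baseline the sketches try to approximate in less space. This is a minimal sketch; the function names `apply_updates` and `distinct_count` are illustrative and not from the slides, and F0 is taken to be the number of IDs whose net count is non-zero.

```python
from collections import defaultdict


def apply_updates(stream):
    """Apply (i, v) updates and return the net count of every id seen."""
    counts = defaultdict(int)
    for item_id, delta in stream:
        counts[item_id] += delta
    return dict(counts)


def distinct_count(counts):
    """F0 (assumed definition): number of ids whose net count is non-zero."""
    return sum(1 for c in counts.values() if c != 0)


stream = [(1, 3), (2, 4), (1, -1), (3, 7), (2, -5), (1, 1)]
res = apply_updates(stream)
print(res)
print(distinct_count(res))
```

This exact version needs one counter per ID, which is the cost the k-set and table structures are designed to avoid.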
TESTSINGLETON data structure
TSUPDATE: handles insertions and deletions
TSCARD: tests cardinality (F0) of data stream
◦ Returns EMPTY (F0 = 0), SINGLETON (F0 = 1), or COLLISION (F0 > 1)
◦ If it returns SINGLETON, it also returns the element and its frequency (i, fi)
Naïve TESTSINGLETON
Keep 3 counters:
◦ $m = \sum_i f_i$
◦ $U = \sum_i f_i \cdot i$
◦ $V = \sum_i f_i \cdot i^2$
TSUPDATE(i, v)
◦ m := m + v; U := U + v*i; V := V + v*i^2;
TSCARD()
◦ If m == 0, return EMPTY
◦ Else If (U^2 == m*V), return (SINGLETON, U/m, m)
◦ Else, return COLLISION
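The three counters and the two operations above can be sketched in a few lines. The class name `SingletonTester` and the tuple return shape are illustrative choices, not from the slides. The check U^2 == m*V is only reliable here under the assumption that net frequencies stay non-negative, which is why the slide calls it naïve.

```python
class SingletonTester:
    """Naive TESTSINGLETON: three counters m, U, V."""

    def __init__(self):
        self.m = 0  # sum of f_i
        self.U = 0  # sum of f_i * i
        self.V = 0  # sum of f_i * i^2

    def update(self, i, v):
        """TSUPDATE: insert (v > 0) or delete (v < 0) v copies of id i."""
        self.m += v
        self.U += v * i
        self.V += v * i * i

    def card(self):
        """TSCARD: return (status, id, frequency); id and frequency are None unless SINGLETON."""
        if self.m == 0:
            return ("EMPTY", None, None)
        if self.U * self.U == self.m * self.V:
            # A single id i with frequency f gives U = f*i and V = f*i^2, so U/m = i.
            return ("SINGLETON", self.U // self.m, self.m)
        return ("COLLISION", None, None)


ts = SingletonTester()
ts.update(5, 3)
print(ts.card())
ts.update(7, 2)
print(ts.card())
```

Python integers do not overflow, so the counters stay exact; a fixed-width implementation would have to size them for the largest possible V.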
Analysis: TESTSINGLETON
Space: Ω(log n) bits to store counters
Time: O(1) for insert, deletion, or cardinality check
Problem: It only identifies the item when exactly one distinct ‘i’ value is present
◦ Need to extend it to keep track of all distinct values
The k-set Data Structure
A k-set is a dictionary with a RETURNSET function
◦ returns the set of all values in dictionary S if |S| ≤ k
Implementation
◦ H: 2D matrix (size R × B)
◦ R = log(k/δ); B = 2k
◦ Each row is a hash table of B buckets indexed (0 … B-1).
◦ Each bucket is a TESTSINGLETON
◦ Hash functions hr : {0, …, N-1} → {0, …, B-1} for all r
◦ $m = \sum_i f_i$
k-set Functions
KSETUPDATE(i, v)
◦ H[r, hr(i)].TSUPDATE(i,v) for 1 ≤ r ≤ R
[Figure: the R × B matrix H; in row r, the bucket in column hr(i) holds the counters (m, U, V)]
RETURNSET()
◦ For every bucket:
◦ Call H[r, b].TSCARD
◦ If SINGLETON, store ID and frequency
◦ Sum all frequencies of distinct items retrieved and check against m
◦ If equal, return distinct IDs found
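KSETUPDATE and RETURNSET can be sketched as below. Several details are assumptions rather than anything the slides specify: the hash family ((a*i + b) mod p) mod B with a fixed prime p, the base-2 logarithm in R, returning `None` when the frequency check fails, and the names `KSet` and `SingletonTester`. Frequencies are assumed to stay non-negative.

```python
import math
import random


class SingletonTester:
    """Naive TESTSINGLETON bucket (counters m, U, V)."""

    def __init__(self):
        self.m = self.U = self.V = 0

    def update(self, i, v):
        self.m += v
        self.U += v * i
        self.V += v * i * i

    def card(self):
        if self.m == 0:
            return ("EMPTY", None, None)
        if self.U * self.U == self.m * self.V:
            return ("SINGLETON", self.U // self.m, self.m)
        return ("COLLISION", None, None)


class KSet:
    PRIME = 2**31 - 1  # assumption: every id is smaller than this prime

    def __init__(self, k, delta, seed=0):
        self.R = max(1, math.ceil(math.log2(k / delta)))  # rows (log base assumed)
        self.B = 2 * k                                    # buckets per row
        rng = random.Random(seed)
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.R)]
        self.H = [[SingletonTester() for _ in range(self.B)] for _ in range(self.R)]
        self.m = 0  # total frequency over all ids

    def _bucket(self, r, i):
        a, b = self.hashes[r]
        return ((a * i + b) % self.PRIME) % self.B

    def update(self, i, v):
        """KSETUPDATE: forward the update to one bucket in every row."""
        self.m += v
        for r in range(self.R):
            self.H[r][self._bucket(r, i)].update(i, v)

    def return_set(self):
        """RETURNSET: ids and frequencies if all of them were isolated, else None."""
        found = {}
        for row in self.H:
            for cell in row:
                status, i, f = cell.card()
                if status == "SINGLETON":
                    found[i] = f
        return found if sum(found.values()) == self.m else None


ks = KSet(k=4, delta=0.01, seed=1)
for i, v in [(10, 2), (20, 5), (30, 1), (30, -1)]:
    ks.update(i, v)
print(ks.return_set())
```

The final comparison against m is what detects an overfull structure: if some item never lands alone in a bucket, the recovered frequencies sum to less than m.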
k-set Analysis
Space: O(k(log m + log N) log(k/δ))
Update time: O(log(k/δ))
RETURNSET time: O(k log(k/δ))
Algorithm 2: Table
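Goal: Estimate distinct count $\hat{F}_0$ s.t. $\Pr\left[\,|\hat{F}_0 - F_0| \le \varepsilon\,\right] \ge 1 - \delta$
Space Complexity:
◦ $O\left(\frac{1}{\varepsilon^2}\,(\log m + \log N)\,\log N\,\log\frac{1}{\delta} + \log\frac{1}{\varepsilon}\,\log\frac{1}{\delta}\right)$
Query Time Complexity:
◦ $O\left(\log\frac{1}{\delta} + \log\frac{1}{\varepsilon}\right)$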
Data Structure
Table T of size L × K
◦ F = all possible values for queries
◦ $L = \log |F|$
◦ $K = \frac{720}{\varepsilon^2}$
Each cell T[l,r] is a TESTSINGLETON
Two hash functions
◦ ℎ: 𝐹 → 𝐹
◦ 𝑔: 𝐹 → {0, … , 𝐾 − 1}
level(i) = index of least significant bit in h(i) which is a 1
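The level function can be computed with one bit trick. The sketch below assumes 0-based bit indices and treats h(i) = 0 as an error, since the slides do not say how that case is handled. The histogram illustrates why the levels act as a sampling scheme: about half of all hash values land on level 0, a quarter on level 1, and so on.

```python
from collections import Counter


def level(h_value):
    """0-based index of the least significant 1 bit of h_value."""
    if h_value <= 0:
        raise ValueError("level is undefined when h(i) has no 1 bit")
    # h_value & -h_value keeps only the lowest set bit.
    return (h_value & -h_value).bit_length() - 1


# Stand-in for hash outputs: every value in 1..1024 once.
histogram = Counter(level(h) for h in range(1, 1025))
print(histogram[0], histogram[1], histogram[2])
```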
Algorithm & Performance
UPDATE(i,v)
◦ T[level(i), g(i)].TSUPDATE
◦ Space: $O\left(\frac{1}{\varepsilon^2}\,(\log m + \log N)\,\log N\,\log\frac{1}{\delta}\right)$
◦ Time: $O\left(\log\frac{1}{\delta} + \log\frac{1}{\varepsilon}\right)$
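CARDINALITY()
◦ Find the smallest l s.t. at least $x \ge \frac{12}{\varepsilon^2}$ buckets return SINGLETON and at least $\frac{7K}{8}$ buckets are non-empty.
◦ Solve for $\hat{F}_0$ in the equation $x = \frac{\hat{F}_0}{2^l}\left(1 - \frac{1}{K}\right)^{\frac{\hat{F}_0}{2^l} - 1}$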
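The equation in the last step has no closed form, so a numerical solve is one option. The curve n(1 - 1/K)^(n - 1) rises to a peak near n = K and then falls, so a given x generally has two roots. The slides do not say which one is intended, so this sketch (function names are illustrative) returns both candidates, each scaled by 2^l.

```python
import math


def expected_singletons(n, K):
    """Right-hand side of the equation with n = F0 / 2^l."""
    return n * (1.0 - 1.0 / K) ** (n - 1.0)


def _bisect(x, K, lo, hi, increasing):
    """Bisection on a branch where the curve is monotone."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        below = expected_singletons(mid, K) < x
        if below == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def solve_f0(x, level, K):
    """Return (low_root, high_root) estimates of F0 for x singleton buckets at this level."""
    peak = -1.0 / math.log(1.0 - 1.0 / K)  # where the curve is maximal
    if not 0.0 < x <= expected_singletons(peak, K):
        raise ValueError("no solution for this x")
    hi = 2.0 * peak
    while expected_singletons(hi, K) > x:
        hi *= 2.0
    low_root = _bisect(x, K, 0.0, peak, increasing=True)
    high_root = _bisect(x, K, peak, hi, increasing=False)
    scale = 2 ** level
    return low_root * scale, high_root * scale


K = 720
x = expected_singletons(100.0, K)
print(solve_f0(x, 0, K))
```

In practice the count of non-empty buckets at the chosen level would indicate which branch applies: a lightly loaded level points to the low root, a heavily loaded one to the high root.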
Hamming Norm [1]
Counts the number of values in a stream whose counts are non-zero
If a, b are count vectors, can perform unions and differences easily
◦ count vector: [a1 … aN] where ai is the number of occurrences of value i
|a|h is the Hamming norm of a
◦ Equal to distinct count for count vectors
Distinct count of Union: |a+b|h
Uses the same stream model as for distinct count
Computing the Hamming Norm
Lp Norm
◦ $L_p = \|a\|_p = \left(\sum_i |a_i|^p\right)^{1/p}$
◦ L0 = Hamming Norm
Lp Sketch
◦ $Sk(a)_j = \sum_i x_{ij}\, a_i$ where $x_{ij}$ is iid sampled from a random, stable distribution
◦ $j \in \{1, \ldots, m\}$; $m = C\,\frac{\log(1/\delta)}{\varepsilon^2}$ for a large constant C
◦ Lp is roughly $\sum_j |Sk(a)_j|^p$
Streaming UPDATE(i,v)
◦ $Sk(a)_j \leftarrow Sk(a)_j + v \cdot x_{ij}$ for all 1 ≤ j ≤ m
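The streaming update can be illustrated for p = 1, where the stable distribution is the standard Cauchy and is easy to sample. This is only a sketch under stated assumptions: the Hamming norm needs a p close to 0, which is not attempted here; the median estimator is the usual choice for Cauchy sketches and is not taken from the slide; and deriving x_ij from a seeded generator keyed by (seed, i, j) is an illustrative way to avoid storing the matrix.

```python
import math
import random
import statistics


def stable_entry(seed, i, j):
    """x_ij: a standard Cauchy (1-stable) draw, reproducible from (seed, i, j)."""
    u = random.Random(f"{seed}:{i}:{j}").random()
    return math.tan(math.pi * (u - 0.5))


class L1Sketch:
    def __init__(self, m, seed=0):
        self.m = m
        self.seed = seed
        self.sk = [0.0] * m

    def update(self, i, v):
        """Sk(a)_j <- Sk(a)_j + v * x_ij for every j."""
        for j in range(self.m):
            self.sk[j] += v * stable_entry(self.seed, i, j)

    def estimate(self):
        """Median of |Sk(a)_j|; the median of |Cauchy| is 1, so this tracks the L1 norm."""
        return statistics.median(abs(s) for s in self.sk)


sketch = L1Sketch(m=1000, seed=42)
for i, v in [(1, 3), (2, 4), (5, 10)]:
    sketch.update(i, v)
print(sketch.estimate())  # should land near the true L1 norm, 17
```

Because the update is linear, a deletion is just an update with negative v, and an item inserted and then fully deleted leaves no trace in the sketch.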
Applications
DistinctCount(a):
◦ $|a|_h = L_0(a) = \sum_j |Sk(a)_j|^0$
Union(a, b):
◦ $|a + b|_h = L_0(a + b)$
◦ $Sk(a + b)_j = Sk(a)_j + Sk(b)_j$
SetDifference(a, b):
◦ $|a - b|_h = L_0(a - b)$
◦ $Sk(a - b)_j = Sk(a)_j - Sk(b)_j$
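Union and SetDifference work because the sketch is linear in the count vector. The check below confirms this numerically. Standard normal entries (a 2-stable distribution) are used purely for convenience, since linearity holds for any choice of x_ij; the helper names are illustrative, and both sketches must share the same seed so that they use the same x_ij.

```python
import math
import random


def x_entry(seed, i, j):
    """x_ij drawn from a standard normal, reproducible from (seed, i, j)."""
    return random.Random(f"{seed}:{i}:{j}").gauss(0.0, 1.0)


def sketch(counts, m, seed=0):
    """Sk(a)_j = sum_i x_ij * a_i for a count vector given as {i: a_i}."""
    return [sum(v * x_entry(seed, i, j) for i, v in counts.items())
            for j in range(m)]


def combine(sk_a, sk_b, sign=1):
    """Sk(a + b) when sign = 1, Sk(a - b) when sign = -1."""
    return [sa + sign * sb for sa, sb in zip(sk_a, sk_b)]


a = {1: 3, 2: 4}
b = {2: 1, 5: 2}
a_plus_b = {1: 3, 2: 5, 5: 2}
a_minus_b = {1: 3, 2: 3, 5: -2}

m = 16
union_sk = combine(sketch(a, m), sketch(b, m))
diff_sk = combine(sketch(a, m), sketch(b, m), sign=-1)
print(all(math.isclose(u, d, abs_tol=1e-9)
          for u, d in zip(union_sk, sketch(a_plus_b, m))))
```

This is what lets two sites sketch their streams separately and compare them later without exchanging the raw data.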
Supported Queries [2]
Let a = [a1, …, aN] where ai is the number of occurrences of value i
Point Query(i)
◦ Return ai
Range(l,r)
◦ Return $\sum_{i=l}^{r} a_i$
DotProduct(a,b)
◦ Return $\sum_i a_i b_i$
Count-Min Sketch
count = D × W matrix of counters
◦ D = ⌈ln(1/δ)⌉; W = ⌈e/ε⌉; (ε, δ) are hyper-parameters
Hash functions h1 … hD: {1 … n} → {1 … W} where n is the number of possible input values
UPDATE(i,v)
◦ count[j, hj(i)] ← count[j, hj(i)] + v for all 1 ≤ j ≤ D
Estimate Queries
Point(i)
◦ $\min_j \mathrm{count}[j, h_j(i)]$
Range(l,r)
◦ Keep (log n) sketches for dyadic ranges
◦ Can express range query as sum of ranges (next slide)
DotProduct(a,b)
◦ $\min_j \sum_k \mathrm{count}_a[j, k] \cdot \mathrm{count}_b[j, k]$
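The update, point query, and dot product can be sketched as follows. The hash family ((a*i + b) mod p) mod W, the prime p, and the class name `CountMin` are assumptions for illustration. Counts are assumed to be non-negative, which is what makes every estimate an overestimate.

```python
import math
import random


class CountMin:
    PRIME = 2**31 - 1  # assumption: every id is smaller than this prime

    def __init__(self, eps, delta, seed=0):
        self.D = max(1, math.ceil(math.log(1.0 / delta)))  # rows
        self.W = math.ceil(math.e / eps)                    # counters per row
        rng = random.Random(seed)
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.D)]
        self.count = [[0] * self.W for _ in range(self.D)]

    def _h(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.PRIME) % self.W

    def update(self, i, v):
        """Add v to one counter in every row."""
        for j in range(self.D):
            self.count[j][self._h(j, i)] += v

    def point(self, i):
        """Estimate a_i as the smallest counter that i maps to."""
        return min(self.count[j][self._h(j, i)] for j in range(self.D))

    def dot(self, other):
        """Estimate sum_i a_i * b_i; both sketches must share hash functions."""
        assert self.hashes == other.hashes, "sketches must use the same seed and size"
        return min(sum(x * y for x, y in zip(row_a, row_b))
                   for row_a, row_b in zip(self.count, other.count))


a = CountMin(eps=0.01, delta=0.01, seed=7)
b = CountMin(eps=0.01, delta=0.01, seed=7)
for i, v in [(3, 5), (7, 2), (3, 1)]:
    a.update(i, v)
for i, v in [(3, 4), (9, 6)]:
    b.update(i, v)
print(a.point(3), a.dot(b))  # true values are 6 and 24; estimates never fall below them
```

Taking the minimum over rows works because collisions can only add to a counter, so the least contaminated row gives the tightest upper bound.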
Range(l,r) Example
Query: Range(48, 107)
◦ Let n = 256
◦ [48 … 107] is covered by summing the ranges below
Keep local sketches for the ranges:
◦ [48…48][49…64][65…96][97…104][105…106][107…107]
◦ These are called dyadic ranges.
◦ Any range query can be answered with at most 2(log n) dyadic range queries
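The decomposition in the example can be reproduced with a greedy pass: starting at l, repeatedly take the largest aligned power-of-two block that still fits inside [l, r]. This sketch assumes 1-indexed values, matching the ranges listed above; the function name `dyadic_cover` is illustrative.

```python
def dyadic_cover(l, r):
    """Split the 1-indexed inclusive range [l, r] into dyadic ranges."""
    ranges = []
    while l <= r:
        size = 1
        # Grow the block while it stays aligned to its size and inside [l, r].
        while (l - 1) % (2 * size) == 0 and l + 2 * size - 1 <= r:
            size *= 2
        ranges.append((l, l + size - 1))
        l += size
    return ranges


print(dyadic_cover(48, 107))
# [(48, 48), (49, 64), (65, 96), (97, 104), (105, 106), (107, 107)]
```

A range query then sums one point query per returned block, each answered by the sketch kept for that block size.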
Analysis
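Queries are correct within accuracy ε with probability at least 1 − δ (e.g. a point query overestimates ai by at most ε times the sum of all counts)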
Space Complexity: O(log(1/δ) (e/ε))
Time Complexity:
◦ Point Query: O(log (1/δ))
◦ Range Query: O((log n) (log 1/ δ))
◦ Dot Product: O(log(1/δ) (e/ε))
Advantages
◦ Supports Quantile & Heavy Hitter Queries
References
[1] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. “Comparing Data Streams Using
Hamming Norms (How to Zero In)”. IEEE Transactions on Knowledge and Data Engineering,
Vol 15, 529-540, June 2003.
[2] G. Cormode and S. Muthukrishnan. “An improved data stream summary: the count-min sketch
and its applications”. Journal of Algorithms, Vol 55, 58-75, 2005.
[3] S. Ganguly. “Counting distinct items over update streams”. Theoretical Computer Science, Vol
378, 211-222, 2007.