Sublinear Algorithms for Big Data
Lecture 3
Grigory Yaroslavtsev
http://grigory.us
SOFSEM 2015
• URL: http://www.sofsem.cz
• 41st International Conference on Current Trends in
Theory and Practice of Computer Science (SOFSEM’15)
• When and where?
– January 24-29, 2015. Czech Republic, Pec pod Snezkou
• Deadlines:
– August 1st (tomorrow!): Abstract submission
– August 15th: Full papers (proceedings in LNCS)
• I am on the Program Committee ;)
Today
• Count-Min (continued), Count Sketch
• Sampling methods
– ℓ2-sampling
– ℓ0-sampling
• Sparse recovery
– ℓ1-sparse recovery
• Graph sketching
Recap
• Stream: m elements from universe [n] = {1, 2, …, n}, e.g.
⟨x_1, x_2, …, x_m⟩ = ⟨5, 8, 1, 1, 1, 4, 3, 5, …, 10⟩
• f_i = frequency of i in the stream = # of occurrences of value i; f = ⟨f_1, …, f_n⟩
Count-Min
• H_1, …, H_d : [n] → [w] are 2-wise independent hash functions
• Maintain d ⋅ w counters with values:
c_{i,j} = # elements e in the stream with H_i(e) = j
• For every x the value c_{i,H_i(x)} ≥ f_x, and so:
f_x ≤ f̂_x = min(c_{1,H_1(x)}, …, c_{d,H_d(x)})
• If w = 2/ε and d = log_2(1/δ), then:
Pr[f_x ≤ f̂_x ≤ f_x + ε‖f‖_1] ≥ 1 − δ
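• A minimal Python sketch of the scheme above (illustrative, not from the lecture): w = ⌈2/ε⌉ columns, d = ⌈log_2(1/δ)⌉ rows, and pairwise-independent hashes implemented as random linear functions modulo a prime (one standard choice, assumed here).

import random

class CountMin:
    P = 2**31 - 1  # prime, assumed larger than the universe size n

    def __init__(self, w, d):
        self.w, self.d = w, d
        # one random linear hash h(x) = ((a*x + b) mod P) mod w per row
        self.params = [(random.randrange(1, self.P), random.randrange(self.P))
                       for _ in range(d)]
        self.c = [[0] * w for _ in range(d)]  # d x w counters

    def _h(self, i, x):
        a, b = self.params[i]
        return ((a * x + b) % self.P) % self.w

    def update(self, x):
        # each occurrence of x increments one counter per row
        for i in range(self.d):
            self.c[i][self._h(i, x)] += 1

    def estimate(self, x):
        # f_x <= min_i c_{i,H_i(x)} <= f_x + eps * ||f||_1 w.h.p.
        return min(self.c[i][self._h(i, x)] for i in range(self.d))

For instance, CountMin(w=200, d=10) targets ε = 0.01 and δ = 2^{−10}.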
More about Count-Min
• Authors: Graham Cormode, S. Muthukrishnan [LATIN’04]
• Count-Min is linear:
Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2)
• Deterministic version: CR-Precis
• Count-Min vs. Bloom filters
– Approximates frequencies, not just 0/1 set membership
– Doesn’t require mutually independent hash functions (2-wise independence suffices)
• FAQ and Applications:
– https://sites.google.com/site/countminsketch/home/
– https://sites.google.com/site/countminsketch/home/faq
Fully Dynamic Streams
• Stream: m updates (x_i, Δ_i) ∈ [n] × ℝ that define a vector f where f_j = Σ_{i : x_i = j} Δ_i.
• Example: for n = 4,
⟨(1,3), (3,0.5), (1,2), (2,−2), (2,1), (1,−1), (4,1)⟩
f = (4, −1, 0.5, 1)
• Count Sketch: Count-Min with random signs and a median instead of a min:
Pr[|f̂_x − f_x| ≤ ε‖f‖_2] ≥ 1 − δ
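• To make the update model concrete, a toy aggregator for the slide’s example (it stores f explicitly, so it illustrates the model rather than a small-space algorithm):

def frequency_vector(updates, n):
    # f_j = sum of Delta_i over updates (x_i, Delta_i) with x_i = j
    f = [0.0] * (n + 1)  # index 0 unused; universe is [n] = {1, ..., n}
    for x, delta in updates:
        f[x] += delta
    return f[1:]

# the slide's example for n = 4
updates = [(1, 3), (3, 0.5), (1, 2), (2, -2), (2, 1), (1, -1), (4, 1)]
assert frequency_vector(updates, 4) == [4, -1, 0.5, 1]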
Count Sketch
• In addition to H_i : [n] → [w], use random signs r_i : [n] → {−1, 1}
c_{i,j} = Σ_{x : H_i(x) = j} r_i(x) f_x
• Estimate:
f̂_x = median(r_1(x) c_{1,H_1(x)}, …, r_d(x) c_{d,H_d(x)})
• Parameters: d = O(log(1/δ)), w = 3/ε²
Pr[|f̂_x − f_x| ≤ ε‖f‖_2] ≥ 1 − δ
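• A matching toy implementation (a hypothetical construction: signs and hashes from random linear functions modulo a prime; only pairwise independence is needed for the guarantee above):

import random
import statistics

class CountSketch:
    P = 2**31 - 1

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.h_params = [(random.randrange(1, self.P), random.randrange(self.P))
                         for _ in range(d)]
        self.r_params = [(random.randrange(1, self.P), random.randrange(self.P))
                         for _ in range(d)]
        self.c = [[0.0] * w for _ in range(d)]

    def _h(self, i, x):
        a, b = self.h_params[i]
        return ((a * x + b) % self.P) % self.w

    def _r(self, i, x):
        # pseudo-random sign in {-1, +1} per (row, item)
        a, b = self.r_params[i]
        return 1 if ((a * x + b) % self.P) % 2 == 0 else -1

    def update(self, x, delta=1.0):
        # fully dynamic: delta may be negative
        for i in range(self.d):
            self.c[i][self._h(i, x)] += self._r(i, x) * delta

    def estimate(self, x):
        # median over rows of r_i(x) * c_{i,H_i(x)}
        return statistics.median(self._r(i, x) * self.c[i][self._h(i, x)]
                                 for i in range(self.d))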
ℓp-Sampling
• Stream: m updates (x_i, Δ_i) ∈ [n] × ℝ that define a vector f where f_j = Σ_{i : x_i = j} Δ_i.
• ℓ_p-Sampling: return a random I ∈ [n] and R ∈ ℝ such that:
Pr[I = i] = (1 ± ε) |f_i|^p / ‖f‖_p^p + n^{−c}
R = (1 ± ε) f_I
Application: Social Networks
• Each of n people in a social network is friends with an arbitrary subset of the other n − 1 people
• Each person knows only about their friends
• With no communication in the network, each
person sends a postcard to Mark Z.
• If Mark wants to know if the graph is
connected, how long should the postcards be?
Optimal 𝐹𝑘 estimation
• Yesterday: (ε, δ)-approximation of F_k = Σ_i f_i^k
– O(n^{1−1/k}) space for F_k
– O(log n) space for F_2
• New algorithm: let (I, R) be an ℓ2-sample.
Return T = F̂_2 R^{k−2}, where F̂_2 is an e^{±ε} estimate of F_2
• Expectation:
𝔼[T] = F̂_2 Σ_{i∈[n]} Pr[I = i] (e^{±ε} f_i)^{k−2} = e^{±εk} F_2 Σ_{i∈[n]} (f_i²/F_2) f_i^{k−2} = e^{±εk} F_k
Optimal 𝐹𝑘 estimation
• New algorithm: let (I, R) be an ℓ2-sample.
Return T = F̂_2 R^{k−2}, where F̂_2 is an e^{±ε} estimate of F_2
• Variance:
Var[T] ≤ 𝔼[T²] = Σ_i Pr[I = i] 𝔼[T² | I = i] = e^{±2εk} Σ_{i∈[n]} (f_i²/F_2) F_2² f_i^{2(k−2)} = e^{±2εk} F_2 F_{2k−2}
• Exercise: show that F_2 F_{2k−2} ≤ n^{1−2/k} F_k²
• Overall: 𝔼[T] = e^{±εk} F_k, Var[T] ≤ e^{±2εk} F_2 F_{2k−2} ≤ e^{±2εk} n^{1−2/k} F_k²
– Apply average + median to O(n^{1−2/k} ε^{−2} log(1/δ)) copies
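• A toy combiner for the average + median step (hypothetical helper names; assumes ℓ2-samples (I, R) from a sampler such as the one sketched after the next slide, plus an e^{±ε} estimate of F_2):

import statistics

def fk_estimate(sample_groups, f2_est, k):
    # each group is a list of l2-samples (I, R); T = f2_est * |R|^(k-2)
    group_means = []
    for group in sample_groups:
        ts = [f2_est * abs(r) ** (k - 2) for (_i, r) in group]
        group_means.append(sum(ts) / len(ts))
    # averaging controls the variance, the median boosts confidence
    return statistics.median(group_means)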
ℓ2-Sampling: Basic Overview
• Assume F_2(f) = 1. Weight f_i by w_i = 1/√u_i, where u_i ∈_R [0, 1]:
f = (f_1, f_2, …, f_n)
g = (g_1, g_2, …, g_n) where g_i = w_i f_i
• For some value t, return (i, f_i) if there is a unique i such that g_i² ≥ t
• Probability that (i, f_i) is returned, if t is large enough:
Pr[g_i² ≥ t and ∀j ≠ i: g_j² < t] = Pr[g_i² ≥ t] Π_{j≠i} Pr[g_j² < t] = Pr[u_i ≤ f_i²/t] Π_{j≠i} Pr[u_j > f_j²/t] ≈ f_i²/t
• Probability that some value is returned:
Σ_i f_i²/t = 1/t; repeat O(t log(1/δ)) times.
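• A toy, non-streaming rendering of this overview (assumes F_2(f) = 1 and exact access to f; the streaming version replaces the exact values below with Count Sketch estimates, as in Part 1):

import random

def l2_sample_toy(f, t, max_rounds=10**4):
    # assumes F_2(f) = sum_i f_i^2 = 1, as on the slide
    n = len(f)
    for _ in range(max_rounds):
        u = [1.0 - random.random() for _ in range(n)]  # u_i in (0, 1]
        g2 = [f[i] ** 2 / u[i] for i in range(n)]      # g_i^2 = (f_i / sqrt(u_i))^2
        crossing = [i for i in range(n) if g2[i] >= t]
        if len(crossing) == 1:                         # unique i with g_i^2 >= t
            i = crossing[0]
            return i, f[i]  # Pr[return i] ~ f_i^2; a round succeeds w.p. ~1/t
    return None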
ℓ2-Sampling: Part 1
• Use Count Sketch with parameters (m, d) to sketch g
• To estimate f_i²:
ĝ_i² = median_j c²_{j,h_j(i)} and f̂_i² = ĝ_i²/w_i²
• Lemma: with high probability, if d = O(log n):
ĝ_i² = g_i² e^{±ε} ± O(F_2(g)/(εm))
• Corollary: with high probability, if d = O(log n) and m ≫ F_2(g)/ε:
f̂_i² = f_i² e^{±ε} ± 1/w_i²
• Exercise: Pr[F_2(g) ≤ c log n] ≥ 99/100 for large enough c > 0
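• Continuing the toy code, a helper matching this estimate (assumes the CountSketch class sketched earlier, holding a sketch of g):

import statistics

def estimate_fi_squared(cs, i, w_i):
    # median over rows of c_{j,h_j(i)}^2 estimates g_i^2 ...
    g2_est = statistics.median(cs.c[j][cs._h(j, i)] ** 2 for j in range(cs.d))
    # ... and g_i = w_i * f_i, so rescale by 1/w_i^2
    return g2_est / w_i ** 2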
Proof of Lemma
• Let c_{j,h_j(i)} = r_j(i) g_i + Z_j, where Z_j is the contribution of the items colliding with i
• By the analysis of Count Sketch, 𝔼[Z_j²] ≤ F_2(g)/m, and by Markov:
Pr[Z_j² ≤ 3F_2(g)/m] ≥ 2/3
• If |g_i| ≥ (2/ε)|Z_j|, then c²_{j,h_j(i)} = e^{±ε} g_i²
• If |g_i| ≤ (2/ε)|Z_j|, then
|c²_{j,h_j(i)} − g_i²| = |Z_j² + 2 g_i Z_j| ≤ (6/ε) Z_j² ≤ 18 F_2(g)/(εm),
where the last inequality holds with probability 2/3
• Take median over 𝑑 = 𝑂(log 𝑛) repetitions ⇒ high probability
ℓ2-Sampling: Part 2
• Let s_i = 1 if f̂_i² w_i² ≥ 4/ε and s_i = 0 otherwise
• If there is a unique i with s_i = 1, then return (i, f̂_i)
• Note that if f̂_i² w_i² ≥ 4/ε, then 1/w_i² ≤ ε f̂_i²/4 and so
f̂_i² = f_i² e^{±ε} ± ε f̂_i²/4, therefore f̂_i² = e^{±4ε} f_i²
• Lemma: with probability Ω(ε) there is a unique i such that s_i = 1. If so, then Pr[i = i*] = e^{±8ε} f²_{i*}
• Thm: repeat Ω(ε^{−1} log n) times. Space: O(ε^{−2} polylog n)
Proof of Lemma
• Let t = 4/ε. We can upper-bound Pr[s_i = 1]:
Pr[s_i = 1] = Pr[f̂_i² w_i² ≥ t] ≤ Pr[u_i ≤ e^{4ε} f_i²/t] ≤ e^{4ε} f_i²/t
Similarly, Pr[s_i = 1] ≥ e^{−4ε} f_i²/t.
• Using independence of the w_i, the probability of a unique i with s_i = 1 is
Σ_i Pr[s_i = 1, ∀j ≠ i: s_j = 0] ≥ Σ_i Pr[s_i = 1] (1 − Σ_{j≠i} Pr[s_j = 1])
≥ Σ_i (e^{−4ε} f_i²/t) (1 − Σ_{j≠i} e^{4ε} f_j²/t) ≥ (e^{−4ε}/t)(1 − e^{4ε}/t) ≈ 1/t,
using F_2(f) = Σ_i f_i² = 1 in the last step.
Proof of Lemma
• Let t = 4/ε. We can upper-bound Pr[s_i = 1]:
Pr[s_i = 1] = Pr[f̂_i² w_i² ≥ t] ≤ Pr[u_i ≤ e^{4ε} f_i²/t] ≤ e^{4ε} f_i²/t
Similarly, Pr[s_i = 1] ≥ e^{−4ε} f_i²/t.
• We just showed:
Σ_i Pr[s_i = 1, ∀j ≠ i: s_j = 0] ≈ 1/t
• If there is a unique i, the probability that i = i* is:
Pr[s_{i*} = 1, ∀j ≠ i*: s_j = 0] / Σ_i Pr[s_i = 1, ∀j ≠ i: s_j = 0] = e^{±8ε} f²_{i*}
ℓ0-sampling
• Maintain F̂_0, a (1 ± 0.1)-approximation to F_0
• Hash items using h_j : [n] → [0, 2^j − 1] for j ∈ [log n]
• For each j, maintain:
D_j = (1 ± 0.1) |{t : h_j(t) = 0}|
S_j = Σ_{t : h_j(t) = 0} f_t ⋅ t
C_j = Σ_{t : h_j(t) = 0} f_t
• Lemma: at level j = 2 + ⌈log F̂_0⌉ there is a unique element of the stream that maps to 0 (with constant probability)
• Uniqueness is verified if D_j = 1 ± 0.1. If so, output S_j/C_j as the index and C_j as the count.
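• A toy single-level version (illustrative names; D_j is kept exactly via a set here, whereas the real sketch uses a (1 ± 0.1)-approximate distinct counter in small space):

class L0SamplerLevel:
    def __init__(self, j, seed=42):
        self.mask = (1 << j) - 1  # h_j maps [n] into [0, 2^j - 1]
        self.seed = seed
        self.survivors = set()    # toy stand-in for the approximate counter D_j
        self.S = 0.0              # sum of f_t * t over items with h_j(t) = 0
        self.C = 0.0              # sum of f_t over items with h_j(t) = 0

    def _h(self, t):
        return hash((self.seed, t)) & self.mask

    def update(self, t, delta):
        if self._h(t) == 0:
            self.survivors.add(t)  # may keep items whose f_t cancels to 0
            self.S += delta * t
            self.C += delta

    def sample(self):
        # a unique survivor makes S/C recover its index exactly
        if len(self.survivors) == 1 and self.C != 0:
            return round(self.S / self.C), self.C
        return None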
Proof of Lemma
• Let j = 2 + ⌈log F̂_0⌉ and note that 2F_0 < 2^j < 12F_0
• For any i, Pr[h_j(i) = 0] = 1/2^j
• Probability there exists a unique i such that h_j(i) = 0:
Σ_i Pr[h_j(i) = 0 and ∀k ≠ i: h_j(k) ≠ 0]
= Σ_i Pr[h_j(i) = 0] Pr[∀k ≠ i: h_j(k) ≠ 0 | h_j(i) = 0]
≥ Σ_i Pr[h_j(i) = 0] (1 − Σ_{k≠i} Pr[h_j(k) = 0 | h_j(i) = 0])
= Σ_i Pr[h_j(i) = 0] (1 − Σ_{k≠i} Pr[h_j(k) = 0])
≥ (F_0/2^j)(1 − F_0/2^j) ≥ 1/24
• Holds even if the h_j are only 2-wise independent
Sparse Recovery
• Goal: find g̃ such that ‖f − g̃‖_1 is minimized among all g̃ with at most k non-zero entries
• Definition: Err^k(f) = min_{g : ‖g‖_0 ≤ k} ‖f − g‖_1
• Exercise: Err^k(f) = Σ_{i∉S} |f_i|, where S are the indices of the k largest |f_i|
• Using O(ε^{−1} k log n) space we can find g such that ‖g‖_0 ≤ k and
‖g − f‖_1 ≤ (1 + ε) Err^k(f)
Count-Min Revisited
• Use Count-Min with d = O(log n), w = 4k/ε
• For i ∈ [n], let f̂_i = c_{j,h_j(i)} for some row j ∈ [d]
• Let S = {i_1, …, i_k} be the indices with the largest frequencies. Let A_i be the event that there is no s ∈ S ∖ {i} with h_j(i) = h_j(s)
• Then for i ∈ [n]:
Pr[f̂_i − f_i ≥ ε Err^k(f)/k]
= Pr[not A_i] ⋅ Pr[f̂_i − f_i ≥ ε Err^k(f)/k | not A_i] + Pr[A_i] ⋅ Pr[f̂_i − f_i ≥ ε Err^k(f)/k | A_i]
≤ Pr[not A_i] + Pr[f̂_i − f_i ≥ ε Err^k(f)/k | A_i] ≤ k/w + 1/4 ≤ 1/2
• Because d = O(log n), w.h.p. all f̂_i approximate f_i up to ε Err^k(f)/k
Sparse Recovery Algorithm
• Use Count-Min with d = O(log n), w = 4k/ε
• Let f′ = (f̂_1, f̂_2, …, f̂_n) be frequency estimates with
|f̂_i − f_i| ≤ ε Err^k(f)/k
• Let g be f′ with all but the k largest entries replaced by 0
• Lemma: ‖g − f‖_1 ≤ (1 + 3ε) Err^k(f)
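• The recovery step in code (assumes the CountMin class sketched earlier; the O(n)-time scan is for clarity only, real implementations locate heavy entries faster):

import heapq

def sparse_recovery(cm, n, k):
    # estimate every coordinate, keep the k largest, zero out the rest
    estimates = {i: cm.estimate(i) for i in range(1, n + 1)}
    top_k = heapq.nlargest(k, estimates, key=estimates.get)
    g = [0.0] * (n + 1)
    for i in top_k:
        g[i] = estimates[i]
    return g[1:]  # ||g - f||_1 <= (1 + 3*eps) * Err^k(f) by the lemma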
Proof: ‖g − f‖_1 ≤ (1 + 3ε) Err^k(f)
• Let S, T ⊆ [n] be the indices of the k largest values of |f_i| and |f̂_i| respectively, so that g = f′_T
• For a vector x ∈ ℝ^n and I ⊆ [n], denote by x_I the vector formed by zeroing out all entries of x except for those in I. Then:
‖f − f′_T‖_1 ≤ ‖f − f_T‖_1 + ‖f_T − f′_T‖_1
= ‖f‖_1 − ‖f_T‖_1 + ‖f_T − f′_T‖_1
≤ ‖f‖_1 − ‖f′_T‖_1 + 2‖f_T − f′_T‖_1
≤ ‖f‖_1 − ‖f′_S‖_1 + 2‖f_T − f′_T‖_1 (T maximizes ‖f′_I‖_1 over all |I| = k)
≤ ‖f‖_1 − ‖f_S‖_1 + ‖f_S − f′_S‖_1 + 2‖f_T − f′_T‖_1
= Err^k(f) + ‖f_S − f′_S‖_1 + 2‖f_T − f′_T‖_1
≤ Err^k(f) + kε Err^k(f)/k + 2kε Err^k(f)/k = (1 + 3ε) Err^k(f)
Thank you!
• Questions?