New Algorithms for Heavy
Hitters in Data Streams
David Woodruff
IBM Almaden
Joint work with Arnab Bhattacharyya, Vladimir Braverman,
Stephen R. Chestnut, Palash Dey, Nikita Ivkin, Jelani Nelson, and
Zhengyu Wang
Streaming Model
[Figure: an example stream of items, e.g., 4, 3, 7, 3, 1, 1, 2, …]
• Stream of elements a₁, …, a_m in [n] = {1, …, n}. Assume m = poly(n)
• Arbitrary order
• One pass over the data
• Minimize memory usage (space complexity) in bits for solving a task
• Let f_j be the number of occurrences of item j
• Heavy Hitters Problem: find those j for which f_j is large
Guarantees
• ℓ₁-guarantee
• output a set containing all items j for which f_j ≥ φm
• the set should not contain any j with f_j ≤ (φ − ε)m
• ℓ₂-guarantee
• F₂ = Σ_j f_j²
• output a set containing all items j for which f_j² ≥ φF₂
• the set should not contain any j with f_j² ≤ (φ − ε)F₂
• The ℓ₂-guarantee can be much stronger than the ℓ₁-guarantee
• Suppose the frequency vector is (√n, 1, 1, 1, …, 1)
• Item 1 is an ℓ₂-heavy hitter for constant φ, ε, but not an ℓ₁-heavy hitter: f₁² = n and F₂ < 2n, so f₁² > F₂/2, while f₁/m ≈ 1/√n → 0
Outline
• Optimal algorithm in all parameters φ, ε for the ℓ₁-guarantee
• Optimal algorithm for the ℓ₂-guarantee for constant φ, ε
Misra-Gries
• Maintain a list L of c = O(1/ε) pairs of the form (key, value)
• Given an update to item i:
• If i is in L, increment its value by 1
• If i is not in L, and there are fewer than c pairs in L, put (i, 1) in L
• Otherwise, subtract 1 from all values in L, and remove pairs with value 0
• (Intuition) If an item i is never a key, charge its updates to c − 1 distinct updates of other items: f_i · (c − 1) ≤ m, so f_i ≤ εm
• Charge each update not included in the value f'_i of a key i to c − 1 updates of other items: f_i − εm ≤ f'_i ≤ f_i
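To make the update rule concrete, here is a minimal Python sketch of Misra-Gries (the function name and the test stream are illustrative):

```python
def misra_gries(stream, c):
    """One-pass Misra-Gries summary keeping at most c (key, value) pairs.
    For c = O(1/eps), each stored value f'_i satisfies f_i - eps*m <= f'_i <= f_i."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # i is in L: increment its value
        elif len(counters) < c:
            counters[item] = 1             # room left in L: insert (i, 1)
        else:
            for key in list(counters):     # no room: decrement everything
                counters[key] -= 1
                if counters[key] == 0:     # remove pairs that hit value 0
                    del counters[key]
    return counters

# Example on the stream pictured earlier, keeping c = 2 pairs
print(misra_gries([4, 3, 7, 3, 1, 1, 2], c=2))
```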
Space Complexity of Misra-Gries
• c · log n = O(ε^{-1} log n) bits, assuming stream length m ≤ poly(n)
• Optimal if φ = Θ(ε), since the output size is O(φ^{-1} log n) bits
• But what if, say, φ = ½ and ε = 1/log n?
• Misra-Gries uses O(log² n) bits, but the lower bound is only Ω(log n) bits
Our Results
• Obtain an optimal algorithm using O(ε^{-1} log(1/φ) + φ^{-1} log n) bits
• If φ = ½ and ε = 1/log n, we obtain the optimal O(log n) bits!
• For general stream lengths m, there is an additive O(log log m) in upper
and lower bounds, so also optimal
• O(1) update and reporting times provided m > poly(1/ε)
A Simple Initial Improvement
• First show an O(ε^{-1} log(1/ε) + φ^{-1} log n) bit algorithm, then improve it to the optimal O(ε^{-1} log(1/φ) + φ^{-1} log n) bits
• Idea: use the same number c of (key, value) pairs, but compress each pair
• Compress the values by sampling O(1/ε²) random stream positions. If we sample items with probability p = C/(ε²m), then for all i in [n] the new frequency g_i satisfies |g_i/p − f_i| ≤ εm, i.e., |g_i − p·f_i| ≤ εpm
• O(1/ε²) distinct keys after sampling, so hash the identities to a universe of size O(1/ε⁴)
Why Sampling Works
• Compress the values by sampling O(1/ε²) random stream positions. If we sample items with probability p = C/(ε²m), then for all i in [n] the new frequency g_i satisfies |g_i/p − f_i| ≤ εm
• E[g_i/p] = f_i and Var[g_i/p] = (1/p²) · f_i · p(1 − p) ≤ f_i/p
• (Chebyshev) Pr[|g_i/p − f_i| ≥ εm] ≤ Var[g_i/p]/(ε²m²) ≤ f_i/(p · ε²m²) = f_i/(Cm)
• Pr[∃ i for which |g_i/p − f_i| ≥ εm] ≤ Σ_i f_i/(Cm) ≤ 1/C
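A quick way to sanity-check this bound is to simulate the subsampling (the stream, ε, and C below are illustrative):

```python
import random
from collections import Counter

random.seed(0)
n, m = 1000, 200_000
stream = [random.randrange(n) for _ in range(m)]

eps, C = 0.05, 8
p = C / (eps ** 2 * m)                    # expected sample size ~ C/eps^2
f = Counter(stream)                       # true frequencies f_i
g = Counter(x for x in stream if random.random() < p)   # sampled g_i

# By Chebyshev + a union bound, this holds with probability >= 1 - 1/C
worst = max(abs(g[i] / p - f[i]) for i in range(n))
print(f"max |g_i/p - f_i| = {worst:.0f}, bound eps*m = {eps * m:.0f}")
```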
Misra-Gries after Hashing
• Stream length is O(1/ε²) after sampling
• O(1/ε²) distinct keys after sampling, so hash the identities with a hash function h: [n] -> [1/ε⁴]
• Misra-Gries on the (key, value) pairs takes O(ε^{-1} log(1/ε)) bits of space
• Heavy hitters in the sampled stream correspond to heavy hitters in the original stream, and frequencies are preserved up to additive εm
• Problem: we want the original (non-hashed) identities of the heavy hitters!
Maintaining Identities
• For the O(φ^{-1}) items with the largest counts, as reported by our data structure, maintain their actual log n bit identities
• Always possible to maintain since if we sample an insertion of an item i, we
have its actual identity in hand
[Figure: a table of (hashed key, value) pairs, e.g., (3, 1000), (1, 20), (4, 33), (6, 5000), (5, 1), (9, 1), next to the corresponding (actual key, value) table, where full identities such as 458938 and 30903 are kept only for the items with the largest counts]
Summary of Initial Improvement
• O(ε^{-1} log(1/ε) + φ^{-1} log n) bit algorithm
• Update and reporting time can be made O(1) provided m > poly(1/ε)
• Most stream updates are not sampled, so we do nothing for them!
• Spread out computation of expensive operations over future updates
for which you do nothing
An Optimal Algorithm
• O(ε^{-1} log(1/ε) + φ^{-1} log n) space so far, but we want O(ε^{-1} log(1/φ) + φ^{-1} log n)
• Too much space for the (key, value) pairs in Misra-Gries!
• Instead, run Misra-Gries to find the items with frequency > φm, then use a separate data structure to estimate their frequencies up to additive εm
• The Misra-Gries data structure takes O(φ^{-1} log n) bits of space
• The separate data structure will be O(log(1/φ)) independent repetitions of a data structure using O(ε^{-1}) bits. What can you do with O(ε^{-1}) bits?
An Optimal Algorithm
• Want to use O(ε^{-1}) bits so that for any given item i, we can report an εm additive approximation to f_i with probability > 2/3
• The median of the estimates across O(log(1/φ)) repetitions is an εm additive approximation with probability 1 − φ/100. Union bound over the 1/φ items
• Keep O(ε^{-1}) counters as in Misra-Gries, but each uses only O(1) bits on average!
• Can't afford to keep item identifiers, even hashed ones…
• Can't afford to keep exact counts, even on the sampled stream…
Dealing with Item Identifiers
• Choose a pairwise-independent hash function h:[n] -> {1, 2, …, 1/ε}
• Don’t keep item identifiers, just treat all items that go to the same
hash bucket as one item
• Expected "noise" in a bucket is ε · (sampled stream length) = ε · O(1/ε²) = O(1/ε)
• Solves the problem with item identifiers, but what about counts?
Dealing with Item Counts
• We have r = O(1/ε) counters c₁, c₂, …, c_r with Σ_i c_i = O(1/ε²), and want to store each c_i up to additive error 1/ε
• Round each c_i to its nearest integer multiple of 1/ε
• Gives O(1/ε) bits of space
• But how to maintain this as the stream progresses?
• classic “probabilistic counters” do not work
• design “accelerated counters” which are more accurate as count increases
• For more details, please see the paper!
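One way to see the O(1/ε) bit figure from the rounding step above: after rounding, we have r = O(1/ε) counters, each a multiple of 1/ε, whose multipliers sum to O(1/ε), and a list of r non-negative integers with sum S can be stored in S + r bits with a unary-with-separator code. A minimal sketch (not the paper's exact encoding):

```python
def encode(multiples):
    """Unary-with-separator code: '1'*k followed by '0' for each counter,
    where k is the counter value in units of 1/eps."""
    return "".join("1" * k + "0" for k in multiples)

def decode(bits):
    return [len(run) for run in bits.split("0")[:-1]]

multiples = [3, 0, 5, 1]      # rounded counters, in units of 1/eps
enc = encode(multiples)       # length = sum + count = 9 + 4 = 13 bits
assert decode(enc) == multiples and len(enc) == sum(multiples) + len(multiples)
```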
Conclusions on ℓ₁-guarantee
• O(ε^{-1} log(1/φ) + φ^{-1} log n) bits of space
• If m > poly(1/ε), then update and reporting times are O(1)
• Show a matching lower bound
• Is this also a significant practical improvement over Misra-Gries?
Outline
• Optimal algorithm in all parameters φ, ε for the ℓ₁-guarantee
• Optimal algorithm for the ℓ₂-guarantee for constant φ, ε
CountSketch achieves the ℓ₂-guarantee [CCFC]
• Assign each coordinate i a random sign σ(i) ∈ {-1, 1}
• Randomly partition the coordinates into B buckets, maintaining c_j = Σ_{i: h(i) = j} σ(i)·f_i in the j-th bucket
• Estimate f_i as σ(i)·c_{h(i)}
• E[σ(i)·c_{h(i)}] = f_i, since the cross terms σ(i)σ(i')·f_{i'} have zero expectation
• The noise in a bucket is σ(i) · Σ_{i' ≠ i: h(i') = h(i)} σ(i')·f_{i'}
• Repeat this hashing scheme O(log n) times
• Output the median of the estimates
• Ensures every f_i is approximated up to an additive (F₂/B)^{1/2}
• Gives O(log² n) bits of space
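A compact Python sketch of CountSketch (seeded uses of Python's built-in hash stand in for the pairwise-independent hash and sign functions; B and R are illustrative):

```python
import random
from statistics import median

class CountSketch:
    """R independent repetitions of B signed-counter buckets; each
    repetition approximates f_i up to additive ~(F2/B)^(1/2)."""
    def __init__(self, B, R, seed=0):
        self.B, self.R = B, R
        rng = random.Random(seed)
        self.seeds = [(rng.random(), rng.random()) for _ in range(R)]
        self.tables = [[0] * B for _ in range(R)]

    def _h(self, r, i):                     # bucket of item i in repetition r
        return hash((self.seeds[r][0], i)) % self.B

    def _sigma(self, r, i):                 # random sign of item i
        return 1 if hash((self.seeds[r][1], i)) % 2 == 0 else -1

    def update(self, i, delta=1):
        for r in range(self.R):
            self.tables[r][self._h(r, i)] += self._sigma(r, i) * delta

    def estimate(self, i):                  # median over the repetitions
        return median(self._sigma(r, i) * self.tables[r][self._h(r, i)]
                      for r in range(self.R))
```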
Known Space Bounds for ℓ₂-heavy hitters
• CountSketch achieves O(log² n) bits of space
• If the stream is allowed to have deletions, this is optimal [DPIW]
• What about insertion-only streams?
• This is the model originally introduced by Alon, Matias, and Szegedy
• Models internet search logs, network traffic, databases, scientific data, etc.
• The only known lower bound is Ω(log n) bits, just to report the
identity of the heavy hitter
Our Results [BCIW]
• We give an algorithm using O(log n log log n) bits of space!
• Same techniques give a number of other results:
• (F₂ at all times) Estimate F₂ at all times in a stream with O(log n log log n) bits of space
• Improves over the union bound, which would take O(log² n) bits of space
• (L∞-Estimation) Compute max_i f_i up to additive (εF₂)^{1/2} using O(log n log log n) bits of space (resolves IITK Open Question 3)
Simplifications
• Output a set containing all items i for which f_i² ≥ φF₂, for constant φ
• There are at most O(1/φ) = O(1) such items i
• Hash the items into O(1) buckets
• All items i for which f_i² ≥ φF₂ will go to different buckets with good probability
• The problem reduces to having a single i* in {1, 2, …, n} with f_{i*} ≥ (φF₂)^{1/2}
Intuition
• Suppose first that f_{i*} ≥ n^{1/2} log n and f_i ∈ {0, 1} for all i in {1, 2, …, n} \ {i*}
• For the moment, let us also not count the space to store random hash functions
• Assign each coordinate i a random sign σ(i) ∈ {-1, 1}
• Randomly partition the items into 2 buckets
• Maintain c₁ = Σ_{i: h(i) = 1} σ(i)·f_i and c₂ = Σ_{i: h(i) = 2} σ(i)·f_i
• Suppose h(i*) = 1. What do the values c₁ and c₂ look like?
• c₁ = σ(i*)·f_{i*} + Σ_{i ≠ i*, h(i) = 1} σ(i)·f_i and c₂ = Σ_{i: h(i) = 2} σ(i)·f_i
• c₁ − σ(i*)·f_{i*} and c₂ evolve as random walks as the stream progresses
• (Random Walks) There is a constant C > 0 so that with probability 9/10, at all times, |c₁ − σ(i*)·f_{i*}| < Cn^{1/2} and |c₂| < Cn^{1/2}
• Eventually f_{i*} > 2Cn^{1/2}, and then we know which bucket contains i*!
• This only gives 1 bit of information about i*. We can't repeat log n times in parallel, but we can repeat log n times sequentially!
Repeating Sequentially
• Wait until either |c₁| or |c₂| exceeds Cn^{1/2}
• If |c₁| > Cn^{1/2} then h(i*) = 1, otherwise h(i*) = 2
• This gives 1 bit of information about i*
• (Repeat) initialize 2 new counters to 0 and perform the procedure again!
• Assuming f_{i*} = Ω(n^{1/2} log n), we will have at least 10 log n repetitions, and we will be correct in a 2/3 fraction of them
• (Chernoff) only a single value of i* has hash values matching a 2/3 fraction of the repetitions
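A toy simulation of one such round, under the simplifying assumptions above (the constants and the stream are illustrative):

```python
import random

def one_round(stream, n, C):
    """Learn one hash bit of the heavy item: run two signed counters until
    one escapes the +/- C*sqrt(n) band that the pure random walks stay in."""
    sigma = [random.choice((-1, 1)) for _ in range(n)]
    h = [random.randint(1, 2) for _ in range(n)]
    c, threshold = {1: 0, 2: 0}, C * n ** 0.5
    for i in stream:
        c[h[i]] += sigma[i]
        for b in (1, 2):
            if abs(c[b]) > threshold:
                return b, h
    return None, h

random.seed(1)
n, C = 10_000, 3
# items 1..n-1 appear once (f_i in {0,1}); item 0 is the heavy hitter
stream = list(range(1, n)) + [0] * (4 * int(C * n ** 0.5))
random.shuffle(stream)
guess, h = one_round(stream, n, C)
print("guessed bucket:", guess, "true bucket h(i*):", h[0])
```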
Gaussian Processes
• We don't actually have f_{i*} ≥ n^{1/2} log n and f_i ∈ {0, 1} for all i in {1, 2, …, n} \ {i*}
• Fix both problems using Gaussian processes
• (Gaussian Process) A collection {X_t}_{t in T} of random variables, for an index set T, for which every finite linear combination of the variables is Gaussian
• Assume E[X_t] = 0 for all t
• The process is entirely determined by the covariances E[X_s X_t]
• The distance function d(s, t) = (E[(X_s − X_t)²])^{1/2} is a pseudo-metric on T
• (Connection to Data Streams) Suppose we replace the signs σ(i) with standard normal random variables g(i), and consider a counter c at time t: c(t) = Σ_i g(i)·f_i(t)
• f_i(t) is the frequency of item i after processing t stream insertions
• c(t) is a Gaussian process!
Chaining Inequality [Fernique, Talagrand]
• Let {X_t}_{t in T} be a Gaussian process, and let T₀ ⊆ T₁ ⊆ T₂ ⊆ ⋯ ⊆ T be such that |T₀| = 1 and |T_i| ≤ 2^{2^i} for i ≥ 1. Then
  E sup_{t∈T} |X_t| ≤ O(1) · sup_{t∈T} Σ_{i≥0} 2^{i/2} · d(t, T_i)
• How can we apply this to c(t) = Σ_i g(i)·f_i(t)?
• Let F₂(t) be the value of F₂ after t stream insertions
• Let the T_i be a recursive partitioning of the stream where F₂(t) changes by a factor of 2
• T₀ = {t}, where a_t is the first point in the stream for which F₂(m)/2 ≤ F₂(t). Apply the chaining inequality!
• Let T_i be the set of 2^{2^i} times t₁, t₂, …, t_{2^{2^i}} in the stream such that t_j is the first point in the stream with j · F₂(m)/2^{2^i} ≤ F₂(t_j)
• Then T₀ ⊆ T₁ ⊆ T₂ ⊆ ⋯ ⊆ T, |T₀| = 1, and |T_i| ≤ 2^{2^i} for i ≥ 1
Applying the Chaining Inequality
• Let {X_t}_{t in T} be a Gaussian process, and let T₀ ⊆ T₁ ⊆ T₂ ⊆ ⋯ ⊆ T be such that |T₀| = 1 and |T_i| ≤ 2^{2^i} for i ≥ 1. Then
  E sup_{t∈T} |X_t| ≤ O(1) · sup_{t∈T} Σ_{i≥0} 2^{i/2} · d(t, T_i)
• d(t, T_i) = min_{t_j ∈ T_i} (E[(c(t) − c(t_j))²])^{1/2} ≤ (F₂(m)/2^{2^i})^{1/2}
• Hence, E sup_{t∈T} |X_t| ≤ O(1) · Σ_{i≥0} 2^{i/2} · (F₂(m)/2^{2^i})^{1/2} = O(F₂^{1/2}), since the doubly exponential decay dominates the 2^{i/2} growth
• Same behavior as for the random walks!
Removing Frequency Assumptions
• We don't actually have f_{i*} ≥ n^{1/2} log n and f_j ∈ {0, 1} for all j in {1, 2, …, n} \ {i*}
• The Gaussian process removes the restriction that f_j ∈ {0, 1} for all j in {1, 2, …, n} \ {i*}
• The random walk bound of Cn^{1/2} that we needed on the counters holds without this restriction
• But we still need f_{i*} ≥ (F₂(-i*))^{1/2} · log n to learn log n bits about the heavy hitter
• How do we replace this restriction with f_{i*} ≥ (φ F₂(-i*))^{1/2}?
• Assume φ > 1/log log n, by hashing into log log n buckets and incurring a log log n factor in space
Amplification
• Create O(log log n) pairs of streams from the input stream:
  (stream_{L,1}, stream_{R,1}), (stream_{L,2}, stream_{R,2}), …, (stream_{L,log log n}, stream_{R,log log n})
• For each j, choose a hash function h_j: {1, …, n} -> {0, 1}
• stream_{L,j} is the original stream restricted to the items i with h_j(i) = 0
• stream_{R,j} is the remaining part of the input stream
• Maintain the counters c_L = Σ_{i: h_j(i) = 0} g(i)·f_i and c_R = Σ_{i: h_j(i) = 1} g(i)·f_i
• (Chaining Inequality + Chernoff) the larger counter usually belongs to the substream containing i*
• The larger counter stays larger forever if the Chaining Inequality holds
• Run the algorithm on the items corresponding to the larger counters, as sketched below
• The expected F₂ value of these items, excluding i*, is F₂/poly(log n), so i* is now much heavier
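A minimal sketch of one amplification round (the Gaussian signs and the hash bit are drawn with Python's random module; names are illustrative):

```python
import random

def amplification_round(stream, n, seed):
    """Split the items by one hash bit h_j and keep a Gaussian-signed
    counter per side; with good probability the side with the larger
    |counter| is the side containing the heavy item i*."""
    rng = random.Random(seed)
    bit = [rng.randint(0, 1) for _ in range(n)]    # h_j(i)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]    # Gaussian "signs" g(i)
    c = [0.0, 0.0]
    for i in stream:
        c[bit[i]] += g[i]
    heavier = 0 if abs(c[0]) > abs(c[1]) else 1
    # keep only items i with bit[i] == heavier for the next round
    return heavier, bit
```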
Derandomization
• We have to account for the randomness in our algorithm
• We need to
(1) derandomize a Gaussian process
(2) derandomize the hash functions used to sequentially learn bits of i*
• We achieve (1) by
• (Derandomized Johnson Lindenstrauss) defining our counters by first applying a
Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n
dimensions to log n, then taking the inner product with fully independent
Gaussians
• (Slepian’s Lemma) counters don’t change much because a Gaussian process is
determined by its covariances and all covariances are roughly preserved by JL
• For (2), derandomize an auxiliary algorithm via Nisan’s pseudorandom generator [I]
An Optimal Algorithm [BCINWW]
• Want O(log n) bits instead of O(log n log log n) bits
• The O(log log n) factor comes from multiple sources:
• Amplification
• Use a tree-based scheme, exploiting the fact that the heavy hitter becomes heavier!
• Derandomization
• Show O(1)-wise independence suffices for derandomizing a Gaussian process!
Conclusions on ℓ₂-guarantee
• Beat CountSketch for finding ℓ₂-heavy hitters in a data stream
• Achieve O(log n) bits of space instead of O(log² n) bits
• New results for estimating F₂ at all points and for L∞-estimation
• Questions:
• Is this a significant practical improvement over CountSketch as well?
• Can we use Gaussian processes for other insertion-only stream problems?
Accelerated Counters
• What if we update a counter c for f_i with probability p = ε?
• E[c/p] = f_i
• The sum of the counts is expected to be O(1/ε)
• With counters summing to O(1/ε), we can store all of them in O(1/ε) bits
• Problem: very inaccurate if f_i = Θ(1/ε²)
Accelerated Counters
• Instead, suppose you knew a value r with r = Θ(f_i)
• Update a counter c with probability p = ε²r, and output c/p
• Var[c/p] = (1/p²) · f_i · p(1 − p) ≤ f_i/p
• (Chebyshev) Pr[|f_i − c/p| > 1/ε] ≤ ε² · f_i/p = f_i/r = Θ(1)
• Problem: we don't know r = Θ(f_i) in advance!
Accelerated Counters
• Solution: increase the sampling probability as the counter increases!
• Opposite of standard probabilistic counters
• An item with frequency f_i will in expectation have a count value of about O(ε²f_i²)
• With the counters subject to Σ_i f_i = 1/ε², the space is maximized at O(1/ε) bits
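A toy version of such an accelerated counter, where the sampling probability grows with the current count (a simplified stand-in for the paper's data structure, not its exact rule):

```python
import random

class AcceleratedCounter:
    """Sample with probability p = eps^2 * r, where r is the current
    frequency estimate; the count c then hovers around eps^2 * f^2 / 2."""
    def __init__(self, eps):
        self.eps, self.c = eps, 0

    def update(self):
        # current estimate of f_i, floored at the 1/eps accuracy target
        r = max(1.0 / self.eps, (2 * self.c) ** 0.5 / self.eps)
        p = min(1.0, self.eps ** 2 * r)
        if random.random() < p:
            self.c += 1

    def estimate(self):
        return (2 * self.c) ** 0.5 / self.eps   # invert c ~ eps^2 * f^2 / 2

random.seed(2)
ctr = AcceleratedCounter(eps=0.01)
for _ in range(5_000):
    ctr.update()
print(round(ctr.estimate()))   # typically in the ballpark of 5000
```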