Continuous Distributed Counting
for Non-monotonic Streams
Milan Vojnovic
Microsoft Research
Joint work with Zhenming Liu and Bozidar Radunovic
Workshop “Big Data in Digital Life”, Paris, June 20-21, 2012
Big Data Algorithmic Challenges
• Massive scale of data
• May arrive in a continuous and distributed fashion
• Efficiency: time, space, communication
• A wide range of computational problems: database queries, machine learning, combinatorial optimization, distributed computing systems
Functional Monitoring
• Distributed streams: k sites connected to a coordinator
• Traditional streaming computation
[Figure: sites 1, 2, 3, …, k each receive value updates 𝑋1, 𝑋2, 𝑋3, … and communicate with the coordinator]
SUM Tracking Problem
• SUM: 𝑆𝑡 = Σ𝑖≤𝑡 𝑋𝑖
• Maintain an estimate Ŝ𝑡 such that (1 − 𝜖)𝑆𝑡 ≤ Ŝ𝑡 ≤ (1 + 𝜖)𝑆𝑡
[Figure: updates 𝑋1, …, 𝑋5 arriving at sites 1, 2, 3, …, k]
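The guarantee above can be phrased as a one-line check. A minimal sketch (the function name is mine), using the equivalent form |Ŝ𝑡 − 𝑆𝑡| ≤ 𝜖|𝑆𝑡| so that it also covers negative sums:

```python
def within_tolerance(estimate, true_sum, eps):
    # (1 - eps) * S_t <= estimate <= (1 + eps) * S_t, written with
    # absolute values so the check also makes sense when S_t < 0.
    return abs(estimate - true_sum) <= eps * abs(true_sum)

# A non-monotonic stream of deltas and its running sum S_t.
running = 0
for x in [+1, -2, -2, +3, +1]:
    running += x
```

Because the updates 𝑋𝑖 can be negative, 𝑆𝑡 can hover near zero, where the tolerance 𝜖|𝑆𝑡| shrinks to nothing; this is exactly what makes the non-monotonic case harder than monotonic counting.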
Outline
• Applications
• Overview of Related Work
• Sum Tracking Algorithm for Non-Monotonic
Randomized Streams
• Back to Applications
• Conclusion
Applications
• Network monitoring
• Frequency moments
• Machine learning
Ex 1: Real-Time Monitoring
• A financial product traded in different markets
  – Hedge funds want to track bid-offer queues in all the markets
  – The queues change frequently (high frequency)
[Figure: k market sites streaming updates 𝑋1, …, 𝑋5 to a coordinator]
Ex 2: 𝐹2 Tracking
• 𝑎𝑡 = (𝛼𝑡, 𝑧𝑡), 𝛼𝑡 ∈ [𝑚], 𝑧𝑡 ∈ {−1, 1}
• 𝑚𝑖(𝑡) = Σ𝑠≤𝑡: 𝛼𝑠=𝑖 𝑧𝑠,  𝐹2(𝑡) = Σ𝑖∈[𝑚] 𝑚𝑖(𝑡)²
• Problem: track 𝐹2(𝑡) within a prescribed relative tolerance 𝜖 > 0 with high probability
AMS Sketch
• Maintain 𝑠1 × 𝑠2 counters 𝑆𝑡^(𝑖,𝑗) = Σ𝑠≤𝑡 𝑧𝑠 ℎ𝑖,𝑗(𝛼𝑠) = Σ𝑎∈[𝑚] ℎ𝑖,𝑗(𝑎) 𝑚𝑎(𝑡)
• 𝑆𝑡^𝑖 = (1/𝑠1) Σ𝑗 (𝑆𝑡^(𝑖,𝑗))²
• 𝐹2 estimate: Ŝ𝑡 = median𝑖(𝑆𝑡^𝑖)
• Hash functions ℎ𝑖,𝑗: [𝑚] → {−1, 1}
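As a concrete illustration, the sketch can be coded in a few lines. This is a toy version under assumptions of mine: the class and parameter names are invented, and a full random sign table stands in for the 4-wise independent hash family that AMS actually requires.

```python
import random
import statistics

class AMSSketch:
    """Toy AMS sketch for F2 (illustrative, not the talk's implementation)."""

    def __init__(self, s1, s2, m, seed=0):
        rng = random.Random(seed)
        # One +/-1 hash h_{i,j}: [m] -> {-1, +1} per counter, stored as a
        # lookup table in place of a 4-wise independent hash family.
        self.h = [[[rng.choice((-1, 1)) for _ in range(m)]
                   for _ in range(s1)] for _ in range(s2)]
        self.S = [[0] * s1 for _ in range(s2)]  # s1 x s2 counters S^{i,j}

    def update(self, alpha, z):
        # Process an update a_t = (alpha_t, z_t), alpha in [m], z in {-1, +1}.
        for i, row in enumerate(self.S):
            for j in range(len(row)):
                row[j] += z * self.h[i][j][alpha]

    def estimate(self):
        # S^i = (1/s1) * sum_j (S^{i,j})^2, then the median over i.
        return statistics.median(
            sum(c * c for c in row) / len(row) for row in self.S)
```

Averaging squared counters reduces variance and the median over independent groups boosts the success probability; each counter is itself a non-monotonic sum of ±1 updates, which is how 𝐹2 tracking reduces to the sum tracking problem.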
Ex 3: Bayesian Linear Regression
• Feature vector 𝒙𝑡 ∈ 𝑅^𝑑, output 𝑦𝑡 ∈ 𝑅
• 𝑦𝑡 = 𝒘^𝑇 𝒙𝑡 + 𝑁(0, 𝛽⁻¹)
• Prior 𝒘 ∼ 𝑁(𝒎0, 𝑺0), posterior 𝒘 ∼ 𝑁(𝒎𝑡, 𝑺𝑡):
  𝒎𝑡 = 𝑺𝑡(𝑺0⁻¹ 𝒎0 + 𝛽 𝑨𝑡^𝑇 𝒚𝑡)
  𝑺𝑡⁻¹ = 𝑺0⁻¹ + 𝛽 𝑨𝑡^𝑇 𝑨𝑡
  𝑨𝑡 = (𝒙1, …, 𝒙𝑡)^𝑇
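The posterior update can be sketched directly in numpy (function and variable names are mine; the formulas are the ones above):

```python
import numpy as np

def posterior(A, y, beta, m0, S0):
    # S_t^{-1} = S_0^{-1} + beta * A^T A
    # m_t      = S_t (S_0^{-1} m_0 + beta * A^T y)
    S0_inv = np.linalg.inv(S0)
    St = np.linalg.inv(S0_inv + beta * A.T @ A)
    mt = St @ (S0_inv @ m0 + beta * A.T @ y)
    return mt, St

# Recover a known weight vector from noisy observations.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])
A = rng.normal(size=(500, 2))            # rows are the feature vectors x_t
y = A @ w + rng.normal(scale=0.1, size=500)
mt, St = posterior(A, y, beta=100.0, m0=np.zeros(2), S0=np.eye(2))
```

Note that the posterior depends on the stream only through 𝑨𝑡^𝑇𝑨𝑡 and 𝑨𝑡^𝑇𝒚𝑡, both entrywise sums over updates, so maintaining the posterior across distributed sites reduces to tracking 𝑂(𝑑²) non-monotonic sums.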
Outline
• Applications
• Overview of Related Work
• Sum Tracking Algorithm for Non-Monotonic
Randomized Streams
• Back to Applications
• Conclusion
Key Notation
• Number of sites: 𝑘
• Total length of stream: 𝑛
• Multiplicative error guarantee 1 ± 𝜖:
  (1 − 𝜖)𝑆𝑡 ≤ Ŝ𝑡 ≤ (1 + 𝜖)𝑆𝑡
• Worst-case input arrival
  – Adversary decides when and where the next input arrives
Related Work
• Count tracking [Huang, Yi and Zhang, 2011]
  – Monotonic sum: all 𝑋𝑖 either positive or negative
  – Expected communication cost: 𝑂((√𝑘/𝜖) log 𝑛)
• Lower bound for adversarial value updates [Arackaparambil, Brody and Chakrabarti, 2009]
  – Expected communication cost: Ω(𝑛/𝑘)
Related Work (cont’d)
• Lower bound (single-site case)
• Input: +1, −1, +1, −1, +1, −1, …
• Sum: 𝑆1 = 1, 𝑆2 = 0, 𝑆3 = 1, 𝑆4 = 0, …
• Suppose at 𝑡 = 4 the site does not report
  – Need: (1 − 𝜖)𝑆𝑡 ≤ Ŝ𝑡 ≤ (1 + 𝜖)𝑆𝑡
  – Estimate Ŝ4 = 1 while 𝑆4 = 0: a bad estimate
[Figure: a single site reporting Ŝ𝑡 to the coordinator]
Outline
• Applications
• Overview of Related Work
• Sum Tracking Algorithm for Non-Monotonic
Randomized Streams
• Back to Applications
• Conclusion
Our Work
• Relaxed adversarial setting
  – Random input values
  – Adversarial assignment of values to sites
• Random input streams
  – Random permutation
  – I.i.d. random values
  – Fractional Brownian motion
Our Work (cont’d)
• An algorithm with sub-linear communication
• For drift 𝜇 = E[𝑋𝑖], the total communication cost: 𝑂((√𝑘/𝜖) min{1/|𝜇|, √𝑛})
• Matching lower bounds: our algorithm is optimal
Our Tracker Algorithm
[Figure: 𝑆𝑡 versus 𝑡, with a band of width √𝑘/𝜖 around zero. Labels: “Always report” and “Use two monotonic counters” for the band ±√𝑘/𝜖; “Sample based algorithm” outside it; time scale 1/(𝜇²𝜖)]
Our Tracker Algorithm (cont’d)
• Each site reports to the coordinator upon receiving a value update at time 𝑡 with probability
  𝑝𝑡 = min{𝛼 log^𝛽 𝑛 / (𝜖𝑆𝑡)², 1}
• Sync all sites whenever the coordinator receives an update from a site
[Figure: each site 𝑖 holds a local sum 𝑆𝑖; on a sync, the coordinator computes 𝑆 = 𝑆1 + … + 𝑆𝑘 and sends 𝑆 back to the sites]
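The reporting rule can be sketched directly. The constants 𝛼 and 𝛽 are left unspecified on the slide, so the defaults below are placeholders of mine, and the site plugs in its current view of 𝑆𝑡:

```python
import math
import random

def report_probability(s_t, eps, n, alpha=1.0, beta=1.0):
    # p_t = min{ alpha * log(n)^beta / (eps * S_t)^2, 1 }.
    # When S_t = 0 the relative-error band has zero width, so always report.
    denom = (eps * s_t) ** 2
    if denom == 0:
        return 1.0
    return min(alpha * math.log(n) ** beta / denom, 1.0)

def maybe_report(s_t, eps, n, rng=random):
    # A site flips a coin on each value update; on success the coordinator
    # syncs all sites: it gathers S_1, ..., S_k, forms S = S_1 + ... + S_k,
    # and broadcasts S back.
    return rng.random() < report_probability(s_t, eps, n)
```

The probability falls off as 1/𝑆𝑡²: the larger the running sum, the wider the allowed band 𝜖|𝑆𝑡|, so the lazier the sites can afford to be.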
Design Intuition
• At time 𝑡, the site syncs with the coordinator
• Q: what is the laziest way to sync next time?
  – A: sync just before exiting the safe region
  – Estimate the exit time using a random input model (e.g., those listed above)
[Figure: trajectory of 𝑆𝑡 inside a safe region]
Fractional Brownian Motion
• Updates according to a fractional Brownian motion with Hurst parameter 1/2 ≤ 𝐻 ≤ 1/𝛿
• Sampling probability:
  𝑝𝑡 = min{𝛼𝛿 log^(1+𝛿/2) 𝑛 / (𝜖|𝑆𝑡|)^𝛿, 1}
• Continual tracking within relative accuracy 𝜖 > 0 with high probability
• Expected communication cost: 𝑂(min{(𝑘^((3−𝛿)/2)/𝜖) 𝑛^(1−𝐻), 𝑛})
Lower Bounds
• Single site
• I.i.d. Bernoulli input: 𝐏(𝑋𝑖 = −1) = 𝐏(𝑋𝑖 = 1) = 1/2
• Expected total communication cost: Ω(min{√𝑛/𝜖, 𝑛})
[Figure: a ±1 random walk inside the band from −1/𝜖 to 1/𝜖]
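The √𝑛 term can be seen in a small simulation (entirely illustrative, not from the slides). A drift-free ±1 walk keeps revisiting a band around zero, and while the sum is that small even a tiny additive error breaks the relative guarantee:

```python
import random

def steps_in_band(n, band, seed=0):
    # Count the steps at which a +/-1 random walk lies within [-band, band];
    # at those steps the relative-error guarantee is hardest to maintain,
    # since the allowed band eps*|S_t| is narrow.
    rng = random.Random(seed)
    s = 0
    count = 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        if abs(s) <= band:
            count += 1
    return count
```

The expected occupation time of a band of width 1/𝜖 by an 𝑛-step walk grows like √𝑛/𝜖, which is the shape of the Ω(min{√𝑛/𝜖, 𝑛}) bound above.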
Lower Bounds (cont’d)
• Multiple sites
• Input: i.i.d. Bernoulli with 𝐏(𝑋𝑖 = −1) = 𝐏(𝑋𝑖 = 1) = 1/2, or a random permutation
• Expected total communication: Ω(min{(√𝑘/𝜖)√𝑛, 𝑛})
Outline
• Applications
• Overview of Related Work
• Sum Tracking Algorithm for Non-Monotonic
Randomized Streams
• Back to Applications
• Conclusion
F2 Frequency Moment
• 𝑚𝑖(𝑡) = Σ𝑠≤𝑡: 𝛼𝑠=𝑖 𝑧𝑠,  𝐹2(𝑡) = Σ𝑖∈[𝑚] 𝑚𝑖(𝑡)²
• Expected number of messages: 𝑂((√𝑘/𝜖²)√𝑛)
Bayesian Linear Regression
𝒎𝑡 = 𝑺𝑡(𝑺0⁻¹ 𝒎0 + 𝛽 𝑨𝑡^𝑇 𝒚𝑡)
𝑺𝑡⁻¹ = 𝑺0⁻¹ + 𝛽 𝑨𝑡^𝑇 𝑨𝑡
• Expected number of messages: 𝑂(𝑑²(√𝑘/𝜖)√𝑛 log 𝑛)
Conclusion
• First results for sum tracking with non-monotonic distributed stream inputs
• Practical algorithms
• Matching lower bounds
• Implications for other computational problems
Some Open Questions
• Sliding windows?
• Other classes of random inputs?
• Other queries with non-monotonic distributed
stream inputs?
Proof Key Ideas
[Figure: the stream of length 𝑛 partitioned into blocks at 0, 𝑘, 2𝑘, …, 𝑗𝑘, (𝑗 + 1)𝑘, …; block 𝑗 covers updates 𝑗𝑘 + 1, 𝑗𝑘 + 2, …, (𝑗 + 1)𝑘]
• 𝐼𝑗 = 𝐼(𝑆𝑗𝑘 ∈ [−min{√𝑘/𝜖, √(𝑗𝑘)}, min{√𝑘/𝜖, √(𝑗𝑘)}])
• Under 𝐼𝑗 = 1, the maximum deviation satisfies 𝜖|𝑆𝑗𝑘| ≤ √𝑘
Proof Key Ideas (cont’d)
• k-input problem: 𝑋1, 𝑋2, …, 𝑋𝑘 ∼ i.i.d. with 𝐏𝐫(𝑋𝑖 = −1) = 𝐏𝐫(𝑋𝑖 = 1) = 1/2
• Query: H0: Σ𝑖 𝑋𝑖 > √𝑘, or H1: Σ𝑖 𝑋𝑖 < −√𝑘?
• Answer: incorrect if |Σ𝑖 𝑋𝑖| > √𝑘 and the answer is H1 while Σ𝑖 𝑋𝑖 > √𝑘, or H0 while Σ𝑖 𝑋𝑖 < −√𝑘
• Lemma: Ω(𝑘) messages are necessary to answer the query correctly with a constant positive probability