A 1 − ε Approximate Streaming Algorithm for the Densest Interval

Guntash Singh Arora, Sai Praneeth Reddy
1 Points in sorted order
We look at streaming algorithms for the densest interval of size α when the points arrive in sorted order. More precisely, let X = {x_1, …, x_n} be an ordered set of n points, where x_i is the position of the ith point, i.e. x_i ≤ x_j if i < j. In a streaming setting, the algorithm has very limited memory and can only pass over the data set X from x_1 through x_n a constant number of times. With such restrictions, the problem is to identify the densest interval of size α, i.e. to compute

    OPT = max_{y ∈ X} |{x_i : y ≤ x_i < y + α}|
Let the interval of size α starting at y be I_α(y). A (1 − ε)-approximate algorithm outputs y′ such that OPT(1 − ε) ≤ |I_α(y′)| ≤ OPT.
To get a 1/2-approximation to the problem, we can simply divide the line into segments each of size α and then output the densest among them. To convert this into a streaming algorithm, we maintain a counter and an active interval I_α(y). If the point x_i ∈ I_α(y), increment the counter. Otherwise, compare the size of I_α(y) with the best seen so far and store the denser of the two; then update the active interval to I_α(x_i) and restart the counter.
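A minimal sketch of this one-pass 1/2-approximation (the function name and return convention are ours, not from the text):

```python
def half_approx(points, alpha):
    """One-pass 1/2-approximation for the densest interval of length alpha.

    `points` is an iterable of positions in sorted order.
    Returns (start, count) for the densest greedy interval seen.
    """
    best_start, best_count = None, 0
    start, count = None, 0
    for x in points:
        if start is not None and x < start + alpha:
            count += 1           # x falls in the active interval [start, start + alpha)
        else:
            if count > best_count:
                best_start, best_count = start, count
            start, count = x, 1  # open a new interval at x, counting x itself
    if count > best_count:       # flush the last active interval
        best_start, best_count = start, count
    return best_start, best_count
```

Note that any interval of length α overlaps at most two consecutive greedy intervals, which is where the factor 1/2 comes from.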
We generalize the above idea to the subroutine (k, B)-process, with integer parameters k and B, as follows. Maintain up to B active counters in parallel, one for each interval of size α starting at every kth stream element. When a new point x_{ik} arrives, update the counters of the intervals it lies in, close all intervals with x_{ik} ∉ I_α(y) while updating the record of the densest interval seen so far, and then try to create a new counter for I_α(x_{ik}). If there are already B active counters, abort. Thus the process either aborts, in which case OPT > kB, or terminates with OPT − k ≤ n_max ≤ OPT, where n_max is the size of the densest interval recorded.
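The (k, B)-process can be sketched as below. This is our reading of the subroutine — every arriving point increments the open counters whose interval contains it, and a new counter is opened only at every kth element — so the helper name and return convention are assumptions:

```python
def kB_process(points, alpha, k, B):
    """(k, B)-process sketch: open an interval counter at every k-th stream
    element, keep at most B counters, and track the densest interval seen.

    `points` must arrive in sorted order.  Returns (n_max, best_start) on
    success, or None if the process aborts (which implies OPT > k * B).
    """
    counters = []            # (start, count) pairs; starts are increasing
    n_max, best_start = 0, None
    for i, x in enumerate(points):
        # close counters whose interval [y, y + alpha) now lies behind x
        while counters and x >= counters[0][0] + alpha:
            y, c = counters.pop(0)
            if c > n_max:
                n_max, best_start = c, y
        # x lies in every remaining open interval, since starts are sorted
        counters = [(y, c + 1) for y, c in counters]
        if i % k == 0:       # every k-th element starts a new interval
            if len(counters) >= B:
                return None  # abort: more than B counters would be needed
            counters.append((x, 1))
    for y, c in counters:    # flush the counters still open at end of stream
        if c > n_max:
            n_max, best_start = c, y
    return n_max, best_start
```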
Assume that ∆ is an input parameter and let ε be the allowed error. We simply output the result of the (ε∆, 1/ε + 1)-process. Since kB = ε∆(1/ε + 1) ≥ ∆, we get an answer as long as OPT ≤ ∆, with an approximation factor of (1 − ε∆/OPT). If we can ensure that OPT ≤ ∆ ≤ 2OPT, this factor is at least 1 − 2ε, so we get a (1 − 2ε)-approximation in O(1/ε) space.
We can simply run the above algorithm with log n estimates of OPT (namely 2^i for i = 0, …, log n). Of all the processes which do not abort, choose the one with the smallest estimate (smallest i). All (2^i ε, 1/ε + 1)-processes with 2^i ≥ OPT do not abort, and the smallest such estimate satisfies 2^i ≤ 2OPT, so we get the desired O((log n)/ε) space algorithm with approximation factor 1 − 2ε.
Instead of running log n processes to estimate the value of ∆, we can do it in an online manner. We run a (2^i ε, 1/ε + 1)-process, and if we are about to abort, this means 2^i ≤ OPT and we need to increment i. Incrementing i is equivalent to just dropping every alternate counter we are maintaining. Note that at all times (even after doubling), 2^i ≤ 2OPT. Thus the final O(1/ε) space algorithm is:
Algorithm 1 new-process(ε, α)
1: Initialize k ← 1, a set of active counters C, and n_max
2: for each new point s′ ∈ {s_{ik} | i = 1, …} do
3:     Close all active counters whose intervals I_α(y) do not contain s′ and update n_max
4:     if |C| > 1/ε then
5:         k ← 2k
6:         Keep only the counters started at points s_{ik}, i.e. drop every other counter from C
7:     If s′ ∈ {s_{ik} | i = 1, …}, add a counter corresponding to s′
Theorem 1.1. The algorithm new-process uses O(1/ε) words of storage and outputs an I_α(y) such that OPT(1 − ε) ≤ |I_α(y)| ≤ OPT.
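A self-contained sketch of new-process follows. It folds the doubling rule into a single pass; the exact point at which the budget check fires, and the convention of incrementing every open counter on every arriving point, are our choices rather than details fixed by the pseudocode:

```python
def new_process(points, alpha, eps):
    """Sketch of Algorithm 1 (new-process): one pass over sorted points with
    O(1/eps) interval counters, doubling the sampling gap k whenever the
    counter budget is exceeded.  Returns (n_max, best_start)."""
    B = int(1 / eps) + 1
    k = 1
    counters = []                 # (start, count) pairs; starts increasing
    n_max, best_start = 0, None
    for i, x in enumerate(points):
        # close counters whose interval [y, y + alpha) no longer contains x
        while counters and x >= counters[0][0] + alpha:
            y, c = counters.pop(0)
            if c > n_max:
                n_max, best_start = c, y
        counters = [(y, c + 1) for y, c in counters]  # x hits all open intervals
        if len(counters) > B:     # budget exceeded: double k and
            k *= 2                # drop every other counter
            counters = counters[::2]
        if i % k == 0:            # x is a sampled (every k-th) element
            counters.append((x, 1))
    for y, c in counters:         # flush counters still open at end of stream
        if c > n_max:
            n_max, best_start = c, y
    return n_max, best_start
```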
2 Unsorted Points
We now look at a natural generalization of the above problem where the point set X = {x_1, …, x_n} is unordered. The above idea of maintaining counters crucially
relied on the fact that the points are sorted. Here, we will prove some lower
bounds for any streaming algorithm in terms of the minimum storage required.
Note that while the algorithm was in terms of words, our lower bound will be
in terms of bits.
Setting α = 0 immediately reduces the problem to that of estimating the F∞ of the set X, i.e. identifying a value x_i which is repeated the most number of times. Thus any lower bound on estimating F∞ also applies to our setting. To prove such a lower bound, we use the following theorem.
Theorem 2.1 (Chakrabarti et al. [1]). The communication complexity of the set disjointness problem in the one-way communication model, in which t players are required to speak in a predetermined order, each holding a subset of the universe [n], is Ω(n/t) bits.

[1] A. Chakrabarti, S. Khot, and X. Sun. "Near-optimal lower bounds on the multi-party communication complexity of set disjointness." Proceedings of the 18th IEEE Annual Conference on Computational Complexity, 2003.
Suppose we have an algorithm A that, given the set X and a parameter ∆, distinguishes between the two cases where the maximum frequency F∞ of X is 1 or is ∆, using memory MEM. Then to solve the set disjointness problem with ∆ parties, each holding a subset S_i, let X = S_1 ∪ ⋯ ∪ S_∆. Party i communicates to party i + 1 the MEM bits stored in the memory of A. If the sets are disjoint, then F∞ is 1; otherwise it is ∆, and A can distinguish between the two. Thus with ∆ · MEM bits of communication, we can solve the set disjointness problem. Note that the above proof works even if the algorithm A makes c passes over X. This gives us:
Theorem 2.2. Any streaming algorithm which, given a set X and a parameter ∆, can distinguish between the cases where the maximum frequency F∞ of X is 1 or is ∆ in c passes over X requires memory Ω(n/(c∆²)) bits.
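The reduction behind this theorem can be toy-simulated as follows. Here an exact Counter stands in for A's internal state; a real streaming algorithm A would keep only MEM bits, which is exactly what each party would forward. The function name and the use of Counter are ours:

```python
from collections import Counter

def set_disjointness_via_A(sets):
    """Toy simulation of the reduction: each party feeds its subset into the
    frequency-distinguishing algorithm A and hands A's state to the next
    party in order.  Returns True iff the sets are pairwise disjoint."""
    state = Counter()             # stand-in for A's MEM-bit memory
    for S in sets:                # party i streams its subset S_i into A
        for x in S:
            state[x] += 1
    f_inf = max(state.values())   # F_infinity of the concatenated stream X
    return f_inf == 1             # disjoint iff no element ever repeats
```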
In particular, any constant-factor approximation of F∞ requires Ω(n) space. The above theorem has another implication as well. Consider the problem (called the threshold query) where we are given a set X and have to answer whether there exists an element x_i whose frequency is greater than ∆. This is a harder problem than just distinguishing between the cases where the frequency is either 1 or ∆. Let ∆ > log n, and suppose we have to distinguish between the cases where the maximum frequency is log n and where it is 1. We duplicate every element ∆/log n times and run the threshold algorithm to check if there exists an element with frequency more than ∆. Thus, if we can answer the threshold query in memory MEM, we can solve this distinguishing problem (on an input of size n∆/log n) in memory MEM.
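The duplication step is simple enough to sketch directly (the helper name is ours). With r = ∆/log n copies per element, an element of frequency log n reaches frequency ∆, while a frequency-1 element stays at ∆/log n < ∆:

```python
from collections import Counter

def duplicate(stream, r):
    """Repeat every stream element r times: an element with frequency f in
    the input has frequency r * f in the duplicated stream."""
    for x in stream:
        for _ in range(r):
            yield x

# toy instance: log n = 4, delta = 12, so r = delta // (log n) = 3; the
# element 1 (frequency 4 = log n) reaches the threshold, 2 and 3 do not
counts = Counter(duplicate([1, 1, 1, 1, 2, 3], 3))
```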
Theorem 2.3. Any randomized algorithm which, given a set X and a parameter ∆, can decide if there is an element whose frequency is greater than ∆ in c passes over X requires memory Ω(n/(c∆ log n)) bits.
This theorem, however, is not tight: there exist algorithms which give a (1 − ε)-approximation to the distinguishing problem with memory O((n log n)/(ε³∆)) words, and hence answer the threshold query with O((n log n)/(ε²∆)) words of space.
To move the upper bounds closer to the lower bounds would require improving Theorem 2.2. However, Theorem 2.1 is known to be tight. To improve upon Theorem 2.2, we instead try to prove a theorem similar to Theorem 2.2, but in an alternate model.
Conjecture 2.4. Any algorithm A which solves the set disjointness problem in the one-way communication model, in which t players are required to speak in a predetermined order, each holding a subset of the universe [n], requires at least one party i to communicate Ω(n/t) bits to party i + 1.

Conjecture 2.4 directly implies Theorem 2.1 and is stronger (and hence also tight). Conjecture 2.4 would also prove an almost optimal lower bound of Ω(n/(c∆)) bits for the distinguishing problem as well as for the threshold query.