
Tight Lower Bounds for the Distinct Elements Problem
Piotr Indyk*
MIT
[email protected]

David Woodruff†
MIT
[email protected]

* This work was supported in part by NSF ITR grant CCR-0220280 and a Sloan Research Fellowship.
† Supported by an Akamai Presidential Fellowship for the duration of this work.
Abstract
We prove strong lower bounds on the space complexity of (ε, δ)-approximating the number of distinct elements F_0 in a data stream. Let n be the size of the universe from which the stream elements are drawn. We show that any one-pass streaming algorithm for (ε, δ)-approximating F_0 must use Ω(1/ε²) space when ε = Ω(n^(-1/(9+k))), for any constant k > 0, improving upon the known lower bound of Ω(1/ε) for this range of ε. This lower bound is tight up to a factor of log log n. Our lower bound is derived from a reduction from the one-way communication complexity of approximating a boolean function in Euclidean space. The reduction makes use of a low-distortion embedding from an ℓ_2 to an ℓ_1 norm.
1 Introduction

Let a = a_1, ..., a_q be a sequence of elements, which we will refer to as a stream, from a universe of size n, which we denote by [n] = {1, 2, ..., n}. In this paper we examine the space complexity of algorithms that count the number of distinct elements F_0 = F_0(a) in a. All algorithms are given one pass over the elements of a, which are arranged in adversarial order. An algorithm A is said to (ε, δ)-approximate F_0 on stream a if A outputs a number F̂_0 such that Pr[|F̂_0 − F_0| ≤ ε·F_0] ≥ 1 − δ. Since there is a provable deterministic space lower bound of Ω(n) for computing, or even approximating, F_0 within a small constant multiplicative factor [1], there has been considerable effort to devise randomized approximation algorithms.
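To make the object of study concrete, the following is a minimal sketch (not the algorithm of [4], and not part of this paper's results) of a classical small-space randomized estimator in the spirit of [7]: it keeps only the K smallest hash values seen and estimates F_0 from the K-th smallest. All names and parameter choices below are illustrative assumptions.

```python
import random

def approx_distinct(stream, eps=0.1, seed=0):
    """K-minimum-values sketch: estimate the number of distinct elements
    using only O(1/eps^2) stored hash values (illustrative only; this is
    not the algorithm analyzed in this paper)."""
    k = max(1, int(9 / eps**2))          # sketch size, a standard heuristic choice
    rng = random.Random(seed)
    salt = rng.getrandbits(64)
    seen = []                             # k smallest hash values, kept sorted
    for item in stream:
        h = hash((salt, item)) % (2**61 - 1) / float(2**61 - 1)  # pseudo-uniform in [0,1)
        if len(seen) < k:
            if h not in seen:
                seen.append(h)
                seen.sort()
        elif h < seen[-1] and h not in seen:
            seen[-1] = h
            seen.sort()
    if len(seen) < k:                     # fewer than k distinct items: count is exact
        return len(seen)
    return int((k - 1) / seen[-1])        # k-th smallest value is roughly k / F_0

# Example: a stream over a universe of size 10^4 with 5000 distinct values.
stream = [i % 5000 for i in range(20000)]
print(approx_distinct(stream, eps=0.1))   # typically within ~10% of 5000
```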
There are several practical motivations for designing space-efficient algorithms for approximating F_0. Computing the number of distinct elements
is very valuable to the database community. Query
optimizers can use F_0 to find the number of unique values of a certain attribute in a database without having to perform an expensive sort on those values. With commercial databases approaching the
size of 100 terabytes, it is infeasible to make multiple passes over the data since the sheer amount of
time spent looking at the data is prohibitive.
Other applications include networking: internet routers that get only one quick view of incoming packets can gather the number of distinct destination addresses passing through them with only limited memory. For an application of F_0 algorithms to detecting Denial of Service attacks, see [2].
There have been a multitude of algorithms (e.g., [7, 1, 4]) proposed for computing F_0 in a data stream, beginning with the work of Flajolet and Martin [7]. The best known algorithm was presented in [4]; it (ε, δ)-approximates F_0 in space O(ε^(-2) log log n + log n · log(1/ε)). In this paper we will take δ to be constant.
For reasonable values of n and ε, the storage bound of these algorithms is dominated by the ε^(-2) term. If one wants very good approximation quality (that is, very small ε), the quadratic dependence on 1/ε constitutes a severe drawback of the existing algorithms. This raises the question (posed in [4]) of whether it is possible to reduce the dependence on 1/ε. Until now, the best known lower bound for the problem was Ω(log n + 1/ε) [1, 3].
In this paper we answer this question in the negative. In particular, we show that any algorithm for approximating F_0 up to a factor of 1 ± ε requires Ω(1/ε²) storage, as long as ε = Ω(n^(-1/(9+k))) for any constant k > 0. This matches the upper bound of [4] up to a factor of log log n for small ε, and a factor of log(1/ε) for large ε.
1.1 Overview of techniques and technical results

One way of establishing a lower bound is to lower bound the one-way communication complexity of computing certain boolean functions and reduce computing those functions to computing F_0 [3]. In this model there are two parties, Alice and Bob, who have inputs x and y respectively and wish to compute the value of a boolean function f(x, y) with probability at least 1 − δ. The idea is that Alice can run a distinct-elements algorithm A on x and transmit the state S of A to Bob. Then Bob can plug S into his copy of A and continue the computation on y. This computation returns a number F̂_0 which is a (1 ± ε)-approximation to F_0(x ∘ y) with probability at least 1 − δ. If F̂_0 can be used to determine f(x, y) with error probability less than δ, then the communication cost is just the space used by A, which must be at least the one-way communication complexity of computing f(x, y). In [1] the authors reduce the communication complexity of computing equality EQ(x, y) to that of computing F_0(x ∘ y). Since the randomized communication complexity of EQ(x, y) is Θ(log n), this established an Ω(log n) lower bound for computing F_0.
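The following toy sketch illustrates this generic reduction template. The StreamingF0 class and its state()/load_state() methods are hypothetical stand-ins for an arbitrary one-pass algorithm; the point is only that whatever Alice sends is exactly the algorithm's memory state.

```python
class StreamingF0:
    """Placeholder one-pass F_0 estimator; here it is exact, but any
    (eps, delta)-approximation algorithm with this interface works."""
    def __init__(self):
        self.mem = set()          # the "state" whose size we are lower bounding
    def process(self, stream):
        for item in stream:
            self.mem.add(item)
    def state(self):
        return self.mem
    def load_state(self, s):
        self.mem = set(s)
    def estimate(self):
        return len(self.mem)

def alice(stream_x):
    algo = StreamingF0()
    algo.process(stream_x)
    return algo.state()             # the single message Alice -> Bob

def bob(message, stream_y, decide):
    algo = StreamingF0()
    algo.load_state(message)        # resume from Alice's state ...
    algo.process(stream_y)          # ... and continue on Bob's input
    return decide(algo.estimate())  # use F_0(x ∘ y) to answer f(x, y)

# Example: decide whether Bob's elements are all already present in Alice's.
x, y = [1, 2, 3, 4], [2, 3]
print(bob(alice(x), y, decide=lambda f0: f0 == len(set(x))))  # True
```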
We would like to obtain lower bounds in terms of the approximation error ε. In [3] the one-way communication complexity of the ε-set-disjointness problem is used to derive an Ω(1/ε) space lower bound for approximating F_0. Here Alice is given a set x ⊆ [n] with |x| = k and Bob is given a set y ⊆ [n] with |y| = εn. Furthermore, both parties are given the promise that either y ⊆ x or y ∩ x = ∅. In the case of the former, F_0(x ∘ y) = k, and in the latter, F_0(x ∘ y) = k + εn. Hence, an algorithm which (ε, δ)-approximates F_0 can distinguish these two cases. However, this reduction is particularly weak, since Alice can replace the F_0-approximation algorithm with a protocol in which she simply samples O(1/ε) of the elements e_i of [n], checks whether e_i ∈ x, and sends the results to Bob. This motivates the search for promise problems which use the full power of a distinct-elements algorithm as a subroutine.
Given a stream a, its characteristic vector χ_a is the n-dimensional vector whose i-th coordinate is 1 iff element i of [n] appears in a. Note that F_0(a) is just the Hamming weight of χ_a. One natural boolean function to consider is the following: Alice and Bob are given x, y ∈ {0, 1}^n with the promise that either Δ(x, y) ≥ k + εn, in which case f(x, y) = 0, or Δ(x, y) ≤ k, in which case f(x, y) = 1. Here Δ(x, y) denotes the Hamming distance between x and y, that is, the number of bit positions in which x and y differ. Alice and Bob view their inputs x and y as characteristic vectors of certain streams a_x and a_y. The value F_0(a_x ∘ a_y) is then just the Hamming weight of x ∨ y, the bitwise OR of x and y. By constraining the weights of the inputs x and y to be close to each other, one can use an (ε, δ) F_0-approximation algorithm to determine the value of f(x, y) with error probability at most δ. Unfortunately, it is rather difficult to lower bound the one-way communication complexity of f directly.
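The following is a purely illustrative check of the two facts this paragraph relies on: the distinct-element count of the concatenated streams equals the Hamming weight of the bitwise OR of the characteristic vectors, and that weight is determined by the individual weights together with the Hamming distance (the identity used later in Claim 8).

```python
def f0_of_concatenation(x_bits, y_bits):
    """F_0(a_x ∘ a_y) = Hamming weight of the bitwise OR of the
    characteristic vectors x and y."""
    return sum(xb | yb for xb, yb in zip(x_bits, y_bits))

def hamming(x_bits, y_bits):
    return sum(xb ^ yb for xb, yb in zip(x_bits, y_bits))

x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0, 0, 1]
w = sum  # Hamming weight of a 0/1 list

# weight(x ∨ y) = (Δ(x, y) + w(x) + w(y)) / 2
assert f0_of_concatenation(x, y) == (hamming(x, y) + w(x) + w(y)) // 2
print(f0_of_concatenation(x, y))  # 6 distinct elements appear in a_x ∘ a_y
```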
Fix ε and set t = t(ε) = 1/ε², where we assume t is a power of 2 for convenience. In this paper we consider the one-way communication complexity of computing the following related promise problem G_ε(t). Alice has a vector x ∈ [0, 1]^t with small rational coordinates and with ||x|| = 1. Bob has a basis vector y drawn from the standard basis {e_1, ..., e_t} of R^t. Both parties are given the promise that either ⟨x, y⟩ = 0 or ⟨x, y⟩ ≥ 1/√t, where ⟨·,·⟩ denotes the Euclidean inner product. We show that the one-way communication complexity of deciding G_ε(t) with error probability at most δ, when x, y are drawn from a uniform distribution, is Ω(t). We do this by using tools from information theory developed in [5] which generalize the notion of VC-dimension [10] to shatter coefficients.
We then reduce deciding G_ε(t) to approximating F_0 of a certain stream. In the reduction we use a deterministic (1 + γ)-distortion embedding φ between an ℓ_2 and an ℓ_1 norm (defined below) developed in [6], where the target dimension is d = O(t log(1/γ)/γ²). Alice computes φ(x) and Bob computes φ(y). Depending on whether ⟨x, y⟩ = 0 or ⟨x, y⟩ ≥ 1/√t, the squared distance ||x − y||² = ||x||² + ||y||² − 2⟨x, y⟩ will be 2 or at most 2(1 − 1/√t). By choosing γ appropriately small, |φ(x) − φ(y)| ≈ ||x − y||. We then rationally approximate the coordinates of φ(x) and φ(y) and scale all coordinates by a common denominator. We convert the integer coordinates of the scaled φ(x) and φ(y) into their unary equivalents, obtaining bit strings x' and y' of length n, where n is determined by parameters chosen in the reduction (specifically, n is a function of ε and γ). We show how a (Θ(ε), δ)-approximation algorithm computing F_0(a_{x'} ∘ a_{y'}) can decide G_ε(t) with probability at least 1 − δ. Since the communication complexity of G_ε(t) is Ω(t) = Ω(1/ε²), the space complexity of (ε, δ)-approximating F_0 in a universe of dimension n is Ω(1/ε²). The goal is then to find the smallest n, for a fixed ε, so that an (ε, δ) F_0-approximation algorithm can decide G_ε(t). We determine n to be O(ε^(-9) log²(1/ε)), which shows an Ω(1/ε²) space lower bound for ε = Ω(n^(-1/(9+k))), for any constant k > 0, and hence an Ω(n^(2/(9+k))) lower bound for all smaller ε.
2 Preliminaries

2.1 Communication Complexity

Let f : X × Y → {0, 1} be a boolean function. We will consider two parties, Alice and Bob, receiving x ∈ X and y ∈ Y respectively, who try to compute f(x, y). In the protocols we consider, Alice computes some function A(x) of x and sends the result to Bob. Bob then attempts to compute f(x, y) from A(x) and y. Note that only one message is sent, and it can only be from Alice to Bob.

Definition 1 For each randomized protocol P as described above for computing f, the communication cost of P is the expected length of the longest message sent from Alice to Bob over all inputs. The δ-error randomized communication complexity of f, R_δ(f), is the communication cost of the optimal protocol computing f with error probability δ (that is, Pr[P(x, y) ≠ f(x, y)] ≤ δ).

For deterministic protocols with input distribution μ on X × Y, define D_{μ,δ}(f), the δ-error μ-distributional communication complexity of f, to be the communication cost of an optimal such protocol (one that errs with probability at most δ over μ). Using the Yao Minimax Principle, R_δ(f) is bounded from below by D_{μ,δ}(f) for any μ [12].

2.2 VC dimension

Let F = {g : X → {0, 1}} be a family of boolean functions on a domain X. Each g ∈ F can be viewed as a |X|-bit string g(x_1) g(x_2) ··· g(x_{|X|}).

Definition 2 For a subset S ⊆ X, the shatter coefficient SC(S) of S is |{g|_S : g ∈ F}|, the number of distinct bit strings obtained by restricting F to S. The k-th shatter coefficient SC(F, k) of F is the largest number of different bit patterns one can obtain by considering all possible restrictions g|_S, where S ranges over all subsets of X of size k. If the shatter coefficient of S is 2^{|S|}, then S is shattered by F. The VC dimension of F, VCD(F), is the size of the largest subset S ⊆ X shattered by F.

A function f : X × Y → {0, 1} gives rise to the family F = {f_x : x ∈ X}, where f_x(y) = f(x, y). The connection between VC dimension and randomized one-way communication complexity was first explored in [9]. The following theorem, proved with information-theoretic tools in [5], lower bounds the one-way communication complexity of f in terms of VCD(F). Below, H denotes the binary entropy function.

Theorem 3 For every function f : X × Y → {0, 1} and every δ > 0, there exists a distribution μ on X × Y such that

    D_{μ,δ}(f) ≥ d · (1 − H(δ)), where d = VCD(F).

The following generalization of this theorem [5] is useful when computing VCD(F) is difficult.

Theorem 4 For every function f : X × Y → {0, 1}, every k ≤ |Y|, and every δ > 0, there exists a distribution μ on X × Y such that

    D_{μ,δ}(f) ≥ log SC(F, k) − k · H(δ).

2.3 Embeddings

For a survey on low-distortion embeddings the reader is referred to [8]. The ℓ_p norm of a k-dimensional vector v is defined to be ||v||_p = (∑_{i=1}^{k} |v_i|^p)^{1/p}. A (1 + γ)-distortion embedding φ : ℓ_p^k → ℓ_q^d is a mapping such that for any u, v ∈ ℓ_p^k,

    ||u − v||_p ≤ ||φ(u) − φ(v)||_q ≤ (1 + γ) · ||u − v||_p.

We will need the following theorem [6] in our reduction:

Theorem 5 For every γ > 0 and every k, there exists a deterministic linear (1 + γ)-distortion embedding φ : ℓ_2^k → ℓ_1^d with d = O(k log(1/γ)/γ²).

We will use the notation ||v|| to mean the ℓ_2 norm of v, and |v| to mean the ℓ_1 norm of v.
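As a concrete (purely illustrative) companion to Definition 2, the following brute-force sketch computes the k-th shatter coefficient and the VC dimension of a small explicitly-given family of boolean functions; it is exponential in |X| and only meant to make the definitions executable.

```python
from itertools import combinations

def shatter_coefficient(family, domain, k):
    """SC(F, k): max over size-k subsets S of the number of distinct
    restrictions g|_S with g in F (Definition 2)."""
    best = 0
    for S in combinations(domain, k):
        patterns = {tuple(g(x) for x in S) for g in family}
        best = max(best, len(patterns))
    return best

def vc_dimension(family, domain):
    """Largest k such that some size-k subset is shattered (SC = 2^k)."""
    return max((k for k in range(len(domain) + 1)
                if shatter_coefficient(family, domain, k) == 2 ** k), default=0)

# Toy family: threshold functions g_a(x) = 1 iff x >= a, on domain {0,...,5}.
domain = range(6)
family = [lambda x, a=a: int(x >= a) for a in range(7)]
print([shatter_coefficient(family, domain, k) for k in range(1, 4)])  # [2, 3, 4]
print(vc_dimension(family, domain))                                   # 1
```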
3 Reduction

We proceed as outlined in Section 1.1. Recall that ε is fixed, and we set t = t(ε) = 1/ε², which we assume to be a power of 2 for convenience. We first define and lower bound the communication complexity of G_ε(t).
3.1 Complexity of G_ε(t)

Let E = {e_1, ..., e_t} be the standard basis of unit vectors in R^t. We define the promise problem G'_ε(t) as follows:

    G'_ε(t) = { (x, y) ∈ [0, 1]^t × E : ||x|| = 1 and either ⟨x, y⟩ = 0 or ⟨x, y⟩ ≥ 1/√t }.

We will only be interested in tuples (x, y) ∈ G'_ε(t) in which the descriptions of the coordinates of x and y are finite, so we impose the constraint that these coordinates be rational. We will also need to assign probability distributions on X × Y in order to use Theorem 4, so it will be convenient to assume that these rational numbers have size bounded by some fixed M = Θ(log t), where the size of a rational number p/q is ⌈log p⌉ + ⌈log q⌉. We define the promise problem G_ε(t):

    G_ε(t) = { (x, y) : (x, y) ∈ G'_ε(t) and, for all i, x_i and y_i are rational with size ≤ M }.

We let X denote the set of all x for which there exists a y such that (x, y) ∈ G_ε(t), and we define Y similarly. For (x, y) ∈ G_ε(t), we define h(x, y) = 0 if ⟨x, y⟩ = 0 and h(x, y) = 1 if ⟨x, y⟩ ≥ 1/√t. As stated in the preliminaries, we can view h as a family of functions F = {h_x : Y → {0, 1} | x ∈ X}, where h_x(y) = h(x, y).
Theorem 6 For every k ≤ t, the t-th shatter coefficient of F is at least (t choose k) ≥ 2^(k log(t/k)).

Proof For any subset S = {e_{i_1}, ..., e_{i_k}} ⊆ E of k vectors, we define x_S to be the normalized average (1/√k) ∑_{e ∈ S} e. From our assumptions on t, the coordinates of x_S are rational with size bounded above by M. We define the set X_1 ⊆ X as X_1 = {x_S : S ⊆ E, |S| = k}. The claim is that every length-t bit string with exactly k 1s occurs as the truth table of some h_{x_S} restricted to E. Consider any such string with 1s in positions i_1, ..., i_k, and let S = {e_{i_1}, ..., e_{i_k}}. Then for all e ∈ S,

    ⟨x_S, e⟩ = ⟨(1/√k) ∑_{e' ∈ S} e', e⟩ = 1/√k ≥ 1/√t,

so that h_{x_S}(e) = 1, while for all e ∉ S we have ⟨x_S, e⟩ = 0, since e lies in a subspace orthogonal to x_S, and hence h_{x_S}(e) = 0. Since there are (t choose k) ≥ 2^(k log(t/k)) such strings, the theorem follows.
Corollary 7 For every constant δ < 1/2, the one-way communication complexity R_δ(G_ε(t)) is Ω(t).

Proof We can apply Theorem 4 so long as k ≤ |Y|, but this is clear since |Y| = |E| = t. By Theorem 6 (applied with subsets of size t/2), SC(F, t) ≥ (t choose t/2) ≥ 2^t/(t + 1). We deduce that there exists an input distribution μ such that D_{μ,δ}(h) is at least log SC(F, t) − t·H(δ) ≥ t(1 − H(δ)) − log(t + 1) = Ω(t). By the Yao minimax principle [12] the corollary follows.
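A small, purely illustrative check of the counting argument behind Theorem 6: for subset-average vectors x_S, the pattern of inner products with the standard basis is exactly the indicator of S, so distinct subsets give distinct truth tables. (The rationality bookkeeping is ignored here, and the parameter values are assumptions.)

```python
from itertools import combinations
from math import isclose, sqrt, comb

t, k = 8, 4                      # t = 1/eps^2 in the paper; small values for illustration

def x_S(S):
    """Normalized average of the basis vectors indexed by S (a unit vector)."""
    return [1 / sqrt(len(S)) if i in S else 0.0 for i in range(t)]

patterns = set()
for S in combinations(range(t), k):
    x = x_S(set(S))
    assert isclose(sum(c * c for c in x), 1.0)              # ||x_S|| = 1
    # h_x(e_i) = 1 iff <x_S, e_i> >= 1/sqrt(t), which holds exactly for i in S.
    row = tuple(int(x[i] >= 1 / sqrt(t) - 1e-12) for i in range(t))
    assert row == tuple(int(i in S) for i in range(t))
    patterns.add(row)

print(len(patterns), comb(t, k))  # 70 70: (t choose k) distinct truth tables
```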
3.2 Embedding ℓ_2^t into ℓ_1^d

Let γ = γ(ε) be a function of ε to be specified later. Let φ be a (1 + γ)-distortion embedding φ : ℓ_2^t → ℓ_1^d with d = O(t log(1/γ)/γ²), as per Theorem 5. Alice and Bob can construct the same embedding φ locally without any communication overhead. Let (x, y) be an instance of G_ε(t), and let Alice possess x and Bob possess y. Alice computes φ(x), Bob computes φ(y).

We will need the following connection between ℓ_2 distances and dot products. Let (x, y) ∈ G_ε(t). Using the relation ||y − x||² = ||y||² + ||x||² − 2⟨x, y⟩, the property that ||y|| = ||x|| = 1, and the inequality √(1 − z) ≤ 1 − z/2 for 0 ≤ z ≤ 1, we see:

    h(x, y) = 0  ⟹  ||y − x||² = 2           ⟹  ||y − x|| = √2,
    h(x, y) = 1  ⟹  ||y − x||² ≤ 2 − 2/√t   ⟹  ||y − x|| ≤ √2 · (1 − 1/(2√t)).

By the distance-distorting properties of φ (using that φ can be taken to be linear, so that φ(0) = 0), we have:

    1 ≤ |φ(x)|, |φ(y)| ≤ 1 + γ,                               (1)
    ||y − x|| ≤ |φ(y) − φ(x)| ≤ (1 + γ) · ||y − x||.          (2)
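An illustrative numeric check (assumed values only) of the gap that the embedding must preserve: for unit vectors, ⟨x, y⟩ = 0 gives distance exactly √2, while ⟨x, y⟩ ≥ 1/√t shrinks the distance by roughly ε/√2, since 1/√t = ε.

```python
from math import sqrt

eps = 0.1
t = int(1 / eps**2)                      # t = 100, so 1/sqrt(t) = eps

def dist_from_inner_product(ip):
    # ||y - x||^2 = ||x||^2 + ||y||^2 - 2<x,y> = 2 - 2<x,y> for unit x, y
    return sqrt(2 - 2 * ip)

far = dist_from_inner_product(0.0)           # case h = 0
near = dist_from_inner_product(1 / sqrt(t))  # boundary of case h = 1

print(far, near)                             # 1.41421...  1.34164...
print(far - near >= eps / sqrt(2) * 0.9)     # the gap is on the order of eps/sqrt(2): True
```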
3.3 Rational Approximation

We will need the coordinates of φ(x) and φ(y) to be rational. This will change |φ(x)|, |φ(y)|, and |φ(x) − φ(y)|, but not by much if we choose a good rational approximation. We fix a positive-integer-valued function m = m(ε) to be specified later. Let [m] denote the set of nonnegative integers less than m. Let {a} denote the fractional part of a real number a, so that a − {a} is an integer. Then it easily follows that for any real number a, there exists a unique c = c(a) ∈ [m] with 0 ≤ {a} − c/m < 1/m. For this value of c we define the rational approximation R(a) of a to be R(a) = (a − {a}) + c/m. For a d-dimensional real vector v = (a_1, ..., a_d), we define R(v) = (R(a_1), ..., R(a_d)).

By our choice of rational approximation, each coordinate changes by less than 1/m, so we have:

    |φ(x)| − d/m ≤ |R(φ(x))| ≤ |φ(x)| + d/m,                          (3)
    |φ(y)| − d/m ≤ |R(φ(y))| ≤ |φ(y)| + d/m,                          (4)
    |φ(y) − φ(x)| − d/m ≤ |R(φ(y)) − R(φ(x))|,                         (5)
    |R(φ(y)) − R(φ(x))| ≤ |φ(y) − φ(x)| + d/m.                         (6)
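A minimal executable sketch of this rounding step (the parameter values are assumptions): each coordinate is snapped down to a multiple of 1/m, so the ℓ_1 norm of a d-dimensional vector moves by at most d/m, as in (3)-(6).

```python
from math import floor

def rational_approx(v, m):
    """Round each coordinate of v down to a multiple of 1/m (a sketch of the map R)."""
    return [floor(a * m) / m for a in v]

def l1(v):
    return sum(abs(a) for a in v)

m, d = 1000, 5
v = [0.31415, -0.27182, 0.57721, 0.12345, -0.66667]
rv = rational_approx(v, m)

print(rv)                                  # coordinates are exact multiples of 1/m
print(abs(l1(rv) - l1(v)) <= d / m)        # True: |R(v)| is within d/m of |v|
```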
3.4 Reduction to Distinct Elements

Alice and Bob now convert their transformed inputs to integer vectors. To do this, they scale each coordinate by m, which scales the ℓ_1 norm of their vectors by m. For a d-dimensional vector v = (a_1, ..., a_d), let I(v) denote the vector (m·a_1, ..., m·a_d), so that Alice and Bob now hold the integer vectors I(R(φ(x))) and I(R(φ(y))) respectively.

The idea is to convert each of these integer coordinates to its unary representation so that we can reduce this problem to that of computing Hamming distances between bit strings. Since |R(φ(x))| ≤ |φ(x)| + d/m ≤ 2, each coordinate of R(φ(x)) is rational with absolute value at most 2. Hence, each coordinate of I(R(φ(x))) is an integer with absolute value at most 2m. For each integer z with −2m ≤ z ≤ 2m, we define its unary equivalent U(z) to be a bit string of length 4m in which, if z ≥ 0, positions 2m + 1, ..., 2m + z are 1s, and, if z < 0, positions 2m + z + 1, ..., 2m are 1s, with all remaining positions 0s; thus U(z) contains exactly |z| ones. For a d-dimensional integer vector v = (a_1, ..., a_d) with coordinates in the range [−2m, 2m], we define U(v) to be (U(a_1), ..., U(a_d)). For any two such vectors v, v', it is easy to see that w(U(v)) = |v| and that |v − v'| = Δ(U(v), U(v')), where w denotes the number of 1s in a bit string and Δ refers to Hamming distance. Let x' = U(I(R(φ(x)))) and y' = U(I(R(φ(y)))). Note that both x' and y' are bit strings of length 4md.

Since w(x') = |I(R(φ(x)))| = m·|R(φ(x))| and w(y') = |I(R(φ(y)))| = m·|R(φ(y))|, combining (1), (3), and (4), we see that:

    m − d ≤ w(x') ≤ m(1 + γ) + d,
    m − d ≤ w(y') ≤ m(1 + γ) + d.

Also, since Δ(x', y') = |I(R(φ(x))) − I(R(φ(y)))| = m·|R(φ(x)) − R(φ(y))|, combining (2), (5), and (6), we observe:

    m·||y − x|| − d ≤ Δ(x', y') ≤ m(1 + γ)·||y − x|| + d.            (7)

We now transform these observations on Hamming distances into observations on the number of distinct elements of streams. Alice and Bob can pretend their bit strings x' and y' are the characteristic vectors of certain streams a_{x'} and a_{y'} in a universe of size 4md. There is an unbounded number of streams that Alice and Bob could choose which have characteristic vectors x' and y'. They choose two such streams arbitrarily, say a_{x'} and a_{y'}, which will be fed into local copies of an F_0-approximation algorithm. Let a_{x'} ∘ a_{y'} be the stream which is the concatenation of streams a_{x'} and a_{y'}, so that its characteristic vector is just the bitwise OR x' ∨ y' of x' and y'.
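An illustrative sketch (toy parameters) of the unary encoding step: placing the |z| ones of each block to the right or left of the center position makes the block weight equal to |z|, and the ℓ_1 distance between integer vectors becomes exactly the Hamming distance between their encodings.

```python
m = 4                                     # toy scaling parameter; coordinates lie in [-2m, 2m]

def unary(z):
    """Signed unary block of length 4m: positions 2m+1..2m+z are 1 if z >= 0,
    positions 2m+z+1..2m are 1 if z < 0, so the block has exactly |z| ones."""
    assert -2 * m <= z <= 2 * m
    block = [0] * (4 * m)
    lo, hi = (2 * m, 2 * m + z) if z >= 0 else (2 * m + z, 2 * m)
    for i in range(lo, hi):
        block[i] = 1
    return block

def encode(vec):
    return [b for z in vec for b in unary(z)]     # concatenate the d blocks

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

v1, v2 = [3, -2, 0, 7], [1, 4, 0, -5]
x_prime, y_prime = encode(v1), encode(v2)

print(sum(x_prime) == sum(abs(z) for z in v1))                               # True: weight = l1 norm
print(hamming(x_prime, y_prime) == sum(abs(a - b) for a, b in zip(v1, v2)))  # True: Hamming = l1 distance
```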
Claim 8 Let x', y', a_{x'}, a_{y'} be as above, and suppose B_1(m) ≤ w(x'), w(y') ≤ B_2(m) for some functions B_1, B_2 of m. Then

    B_1(m) + Δ(x', y')/2 ≤ F_0(a_{x'} ∘ a_{y'}) ≤ B_2(m) + Δ(x', y')/2.

Proof As noted above, F_0(a_{x'} ∘ a_{y'}) is just w(x' ∨ y'). Let r be the number of positions which are 1 in y' but 0 in x'. Then Δ(x', y') = (w(x') − (w(y') − r)) + r = w(x') − w(y') + 2r, so that r = ½(Δ(x', y') + w(y') − w(x')) and w(x' ∨ y') = w(x') + r = ½(Δ(x', y') + w(x') + w(y')). The claim follows from the bounds on w(x') and w(y').
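Anticipating the case analysis that follows, this sketch (with assumed constants, not the paper's exact thresholds) shows how Bob can decide h(x, y) from an estimate of F_0(a_{x'} ∘ a_{y'}) that is accurate to within a (1 ± cε) factor, using Claim 8 together with (7).

```python
from math import sqrt

def decide_h(f0_estimate, m, d, gamma, eps):
    """Illustrative decision rule based on Claim 8 and inequality (7)."""
    # Bounds on the weights: m - d <= w(x'), w(y') <= m*(1+gamma) + d.
    B1, B2 = m - d, m * (1 + gamma) + d
    # Case h = 0: Delta >= m*sqrt(2) - d, so F_0 >= B1 + (m*sqrt(2) - d)/2.
    f0_if_h0 = B1 + (m * sqrt(2) - d) / 2
    # Case h = 1: Delta <= m*(1+gamma)*sqrt(2)*(1 - eps/2) + d, so F_0 <= B2 + that/2.
    f0_if_h1 = B2 + (m * (1 + gamma) * sqrt(2) * (1 - eps / 2) + d) / 2
    threshold = (f0_if_h0 + f0_if_h1) / 2
    return 0 if f0_estimate >= threshold else 1

# With gamma and the approximation error both small constant multiples of eps,
# and m a large constant multiple of d/eps, the two ranges separate by Theta(m*eps)
# while F_0 itself is Theta(m), so a (1 +- c*eps)-approximation lands on the right side.
eps = 0.05
d = 50                       # stand-in value; in the paper d = O(eps^-4 * log(1/eps))
m, gamma = int(40 * d / eps), eps / 16
print(decide_h(f0_estimate=m * (1 + sqrt(2) / 2), m=m, d=d, gamma=gamma, eps=eps))  # 0
```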
Now suppose h(x, y) = 0. Then ||x − y|| = √2, so that m√2 − d ≤ Δ(x', y') by (7). By Claim 8 we conclude

    F_0(a_{x'} ∘ a_{y'}) ≥ (m − d) + (m√2 − d)/2.

On the other hand, if h(x, y) = 1, we have ||x − y|| ≤ √(2 − 2/√t) ≤ √2·(1 − ε/2), so that

    Δ(x', y') ≤ m(1 + γ)·√2·(1 − ε/2) + d

by (7), and hence by Claim 8 we have

    F_0(a_{x'} ∘ a_{y'}) ≤ m(1 + γ) + d + ( m(1 + γ)·√2·(1 − ε/2) + d )/2.

It remains to choose γ and m so that these two cases can be distinguished by an (ε', δ)-approximation algorithm for distinct elements. Let ε' = cε for some small constant c to be determined shortly. Suppose we have an (ε', δ)-approximation algorithm. For its output to separate the two cases, it suffices that the two ranges of possible estimates be disjoint, that is,

    (1 + ε')·( m(1 + γ) + d + ( m(1 + γ)√2(1 − ε/2) + d )/2 ) < (1 − ε')·( (m − d) + (m√2 − d)/2 ).    (8)

If m is a sufficiently large constant multiple of d/ε, the additive terms involving d change each side of (8) by only a small constant fraction of mε, and (8) reduces to

    (1 + ε')·(1 + γ)·( 1 + (√2/2)(1 − ε/2) ) < (1 − ε')·( 1 + √2/2 ).

One finds after some algebra that for c a sufficiently small constant, this holds if one sets γ = Θ(ε). Our computations mean that d = O(t log(1/γ)/γ²) = O(ε^(-4) log(1/ε)), so that it suffices to set m = Θ(d/ε) = Θ(ε^(-5) log(1/ε)), and hence the universe dimension is n = 4md = O(ε^(-9) log²(1/ε)). In this dimension, an (Ω(ε), δ)-approximator for F_0 can distinguish the above two cases with error probability at most δ, which means, by the reduction, that it must use Ω(t) = Ω(1/ε²) bits of space.

Let n = Θ(ε^(-9) log²(1/ε)). Then we see that for all ε = Ω(n^(-1/(9+k))), for any constant k > 0, there is a space lower bound of Ω(1/ε²) on (ε, δ)-approximating the number of distinct elements in a universe of size n.

4 Conclusions

We have shown a tight space lower bound of Ω(1/ε²) on (ε, δ)-approximating the number of distinct elements in a universe of size n when ε = Ω(n^(-1/(9+k))), for any constant k > 0.

The upper bound of roughly n^(1/9) on 1/ε can be somewhat relaxed by strengthening the analysis presented in this paper. For example, one could use a randomized embedding of the relevant ℓ_2 vectors into ℓ_1; this would reduce the dimension d and lead to a somewhat higher upper bound on 1/ε. Instead of following along these lines, we mention that, very recently, the second author managed to improve the upper bound on 1/ε to n^(1/2) [11], which is optimal. The approach of that paper is as follows. Observe that one can reformulate the reductions given in Section 3 as a method for constructing a set of vectors in {0, 1}^n that results in a large bound for the shatter coefficient of the function h(x, y). The set is constructed indirectly: first, the vectors are chosen in ℓ_2, then they are mapped into ℓ_1, and then finally into the Hamming space. This indirect route blows up the dimension of the vectors by a polynomial factor. Instead, in [11], the vectors are constructed directly in the Hamming space via a (fairly involved) probabilistic argument.
5 Acknowledgments

For helpful discussions, we thank Johnny Chen, Ron Rivest, Naveen Sunkavally, and Hoeteck Wee. Also many thanks to Hoeteck for reading and commenting on previous drafts of this paper.
References
[1] N. Alon, Y. Matias, and M. Szegedy. The
space complexity of approximating the frequency moments. In Proceedings of the 28th
Annual ACM Symposium on the Theory of Computing, pp. 20-29, 1996.
[2] A. Akella, A. R. Bharambe, M. Reiter and S.
Seshan. Detecting DDoS attacks on ISP networks. In Proceedings of the Workshop on
Management and Processing of Data Streams,
2003.
[3] Z. Bar Yossef. The complexity of massive data
set computations. Ph.D. Thesis, U.C. Berkeley,
2002.
[4] Z. Bar Yossef, T.S. Jayram, R. Kumar, D.
Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. RANDOM 2002, 6th International Workshop on Randomization
and Approximation Techniques in Computer
Science, p. 1-10, 2002.
[5] Z. Bar Yossef, T.S. Jayram, R. Kumar, and
D. Sivakumar. Information Theory Methods in
Communication Complexity. 17th IEEE Annual Conference on Computational Complexity, p. 93, 2002.
[6] T. Figiel, J. Lindenstrauss, and V. D. Milman.
The Dimension of Almost Spherical Sections
of Convex Bodies. Acta Mathematica, 139:53-94, 1977.
[7] P. Flajolet and G.N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.
[8] P. Indyk. Algorithmic applications of lowdistortion geometric embeddings. In Proceedings of the 42nd Annual IEEE Symposium on
Foundations of Computer Science, invited talk,
Las Vegas, Nevada, 14-17 October, 2001.
[9] I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity.
Computational Complexity, 8(1):21-49, 1999.
[10] V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264-280,
1971.
[11] D. P. Woodruff. Optimal space lower
bounds for all frequency moments. Available:
http://web.mit.edu/dpwood/www
[12] A. C-C. Yao. Lower bounds by probabilistic
arguments. In Proceedings of the 24th Annual
IEEE Symposium on Foundations of Computer
Science, p. 420-428, 1983.