IBM Research Report RJ 10040 (90536), August 22, 1996 (Revised 7/16/99)
Computer Science
HOEFFDING INEQUALITIES FOR JOIN-SELECTIVITY ESTIMATION
AND ONLINE AGGREGATION
Peter J. Haas
IBM Research Division
Almaden Research Center
650 Harry Road
San Jose, CA 95120-6099
e-mail: [email protected]
ABSTRACT: We extend Hoeffding's inequalities for simple averages of random variables to the case of cross-product averages. We also survey some new and existing Hoeffding inequalities for estimators of the mean, variance, and standard deviation of a subpopulation. These results are applicable to two problems in object-relational database management systems: fixed-precision estimation of the selectivity of a join and online processing of aggregation queries. For the first problem, the new results can be used to modify the asymptotically efficient sampling-based procedures of Haas, Naughton, Seshadri, and Swami so that there is a guaranteed upper bound on the number of sampling steps. For the second problem, the inequalities can be used to develop conservative confidence intervals for online aggregation; such intervals avoid the large intermediate storage requirements and undercoverage problems of intervals based on large-sample theory.

Keywords: Hoeffding inequality, join-selectivity estimation, query optimization, database sampling, online aggregation
1. Introduction and Summary

In many applications, it is necessary to estimate a population mean
$$\mu = \frac{1}{m} \sum_{i=1}^{m} v(i)$$
using random sampling; here $m > 1$ is a fixed integer and $v$ is a real-valued function defined on the set $\{1, 2, \ldots, m\}$. Denote by $L_1, L_2, \ldots, L_n$ a random sample drawn uniformly with replacement from the set $\{1, 2, \ldots, m\}$ and by $L'_1, L'_2, \ldots, L'_n$ a random sample drawn uniformly without replacement. For $n \ge 1$, the estimators
$$\bar{Y}_n = \frac{1}{n} \sum_{i=1}^{n} v(L_i) \tag{1.1}$$
and
$$\bar{Y}'_n = \frac{1}{n \wedge m} \sum_{i=1}^{n \wedge m} v(L'_i) \tag{1.2}$$
are each unbiased for $\mu$ in that $E[\bar{Y}_n] = E[\bar{Y}'_n] = \mu$. (Here $n \wedge m$ denotes the minimum of $n$ and $m$.) In a famous paper, Hoeffding [11, Theorem 4] shows that
$$E[f(\bar{Y}'_n)] \le E[f(\bar{Y}_n)] \tag{1.3}$$
for $n \ge 1$ and any convex function $f$. In particular, it follows by taking $f(x) = x^2 - \mu^2$ in (1.3) that $\mathrm{Var}[\bar{Y}'_n] \le \mathrm{Var}[\bar{Y}_n]$.

It is frequently useful to bound the probability that $\bar{Y}_n$ (resp., $\bar{Y}'_n$) deviates from $\mu$ by more than a specified amount. Suppose that the only information available prior to sampling consists of lower and upper bounds $a$ and $b$, respectively, on the function $v$:
$$a \le v(i) \le b \tag{1.4}$$
for $1 \le i \le m$. In [11], Hoeffding not only establishes (1.3), but also shows that
$$P\{\, |\bar{Y}_n - \mu| \ge t \,\} \le 2e^{-2nt^2/(b-a)^2} \tag{1.5}$$
and
$$P\{\, |\bar{Y}'_n - \mu| \ge t \,\} \le 2e^{-2n't^2/(b-a)^2} \tag{1.6}$$
for $t > 0$ and $n \ge 1$, where
$$n' = \begin{cases} n & \text{if } n < m \\ +\infty & \text{if } n \ge m. \end{cases}$$
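As a concrete illustration (this sketch is ours, in Python, and is not part of the original report; all names are hypothetical), the following code estimates $\mu$ by sampling with replacement and evaluates the Hoeffding bound (1.5) on the deviation probability:

import math
import random

def hoeffding_bound(n, t, a, b):
    # Right side of (1.5): bound on P{ |Ybar_n - mu| >= t }.
    return 2.0 * math.exp(-2.0 * n * t**2 / (b - a)**2)

def estimate_mean(v, m, n):
    # Ybar_n as in (1.1): average of v over n indices drawn
    # uniformly with replacement from {1, ..., m}.
    return sum(v(random.randint(1, m)) for _ in range(n)) / n

# Toy example: v(i) = i/m, so mu = (m + 1)/(2m); v is bounded by a = 0, b = 1.
m = 10_000
v = lambda i: i / m
print(estimate_mean(v, m, n=1_000))
print(hoeffding_bound(n=1_000, t=0.05, a=0.0, b=1.0))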
In the following, we extend Hoeffding's results by allowing $v$ to be a function of more than one variable and establishing analogues of inequalities (1.3), (1.5), and (1.6) for "cross-product averages." We also survey some new and existing Hoeffding inequalities for the case in which we wish to estimate the mean, variance, or standard deviation of the numbers $\{\, v(i) \colon i \in S \,\}$, where $S \subseteq \{1, 2, \ldots, m\}$. We refer to $S$ as a "subpopulation" and assume that the membership or non-membership in $S$ of an element $i$ is discovered only when $i$ is sampled.
Our results, which are stated formally in the remainder of this section, have a number of applications in object-relational database management systems (ordbms's). In Section 2 we consider the problem of using sampling to estimate the selectivity of a join to within a prespecified precision. Haas, Naughton, Seshadri, and Swami [7] have previously provided an asymptotically efficient procedure called fp_cross for fixed-precision estimation of selectivities. We show how the new inequalities can be used to modify fp_cross so that there is a guaranteed upper bound on the number of sampling steps executed by the procedure. In Section 3, we show that our results are pertinent to online aggregation processing as described in Hellerstein, Haas, and Wang [9]. Section 4 contains the proofs of our results.
1.1. Cross-Product Averages

Let $m_1, m_2, \ldots, m_K$ ($K \ge 1$) be finite positive integers and $v$ be a real-valued function defined on the set $\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_K$, where $\Lambda_k = \{1, 2, \ldots, m_k\}$ for $1 \le k \le K$. (The symbol $\Lambda$ for this product set is supplied here; the original glyph was lost in extraction.) To avoid trivialities we assume throughout that $m_k > 1$ for $1 \le k \le K$. Suppose we wish to estimate the population mean
$$\mu = \frac{1}{m_1 m_2 \cdots m_K} \sum_{l_1=1}^{m_1} \sum_{l_2=1}^{m_2} \cdots \sum_{l_K=1}^{m_K} v(l_1, l_2, \ldots, l_K) \tag{1.7}$$
using random sampling. For $1 \le k \le K$, denote by $L_{k1}, L_{k2}, \ldots, L_{kn}$ a random sample of size $n$ drawn uniformly with replacement from the set $\Lambda_k$. Similarly, denote by $L'_{k1}, L'_{k2}, \ldots, L'_{kn}$ a random sample drawn uniformly without replacement. We assume throughout that the sampling mechanisms for sets $\Lambda_1, \Lambda_2, \ldots, \Lambda_K$ are mutually independent. Set $N = \{1, 2, \ldots\}^K$ and, for $n = (n_1, n_2, \ldots, n_K) \in N$, define the cross-product averages $\tilde{Y}_n$ and $\tilde{Y}'_n$ by
$$\tilde{Y}_n = \frac{1}{n_1 n_2 \cdots n_K} \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \cdots \sum_{i_K=1}^{n_K} v(L_{1i_1}, L_{2i_2}, \ldots, L_{Ki_K}) \tag{1.8}$$
and
$$\tilde{Y}'_n = \left( \prod_{k=1}^{K} (n_k \wedge m_k) \right)^{-1} \sum_{i_1=1}^{n_1 \wedge m_1} \sum_{i_2=1}^{n_2 \wedge m_2} \cdots \sum_{i_K=1}^{n_K \wedge m_K} v(L'_{1i_1}, L'_{2i_2}, \ldots, L'_{Ki_K}), \tag{1.9}$$
respectively. It is straightforward to show that $E[\tilde{Y}_n] = E[\tilde{Y}'_n] = \mu$ for $n \in N$. Moreover, $\tilde{Y}'_n = \mu$ for all $n = (n_1, n_2, \ldots, n_K) \in N$ such that $n_k \ge m_k$ for $1 \le k \le K$.

Although $\tilde{Y}_n$ can be viewed as a simple average of $n_1 n_2 \cdots n_K$ random variables (and similarly for $\tilde{Y}'_n$), the inequalities (1.3), (1.5), and (1.6) are not directly applicable. The problem is that the random variables that make up the average are not mutually independent: in general, $v(L_{1i_1}, L_{2i_2}, \ldots, L_{Ki_K})$ and $v(L_{1j_1}, L_{2j_2}, \ldots, L_{Kj_K})$ are dependent unless $i_k \ne j_k$ for $1 \le k \le K$. An analogous remark applies to $\tilde{Y}'_n$.
Our first result extends the inequality in (1.3).

Theorem 1. Let $\tilde{Y}_n$ and $\tilde{Y}'_n$ be defined as in (1.8) and (1.9), respectively. Then
$$E[f(\tilde{Y}'_n)] \le E[f(\tilde{Y}_n)] \tag{1.10}$$
for $n \in N$ and any convex function $f$. In particular, $\mathrm{Var}[\tilde{Y}'_n] \le \mathrm{Var}[\tilde{Y}_n]$.

Since $\tilde{Y}_n$ and $\tilde{Y}'_n$ are each unbiased, it follows from Theorem 1 that $\tilde{Y}'_n$ has a lower mean squared error than $\tilde{Y}_n$.
Our next result, Theorem 2 below, generalizes the inequalities in (1.5) and (1.6) and bounds the probability that $\tilde{Y}_n$ (resp., $\tilde{Y}'_n$) deviates from $\mu$ by more than a specified amount. The inequalities require a priori knowledge only of lower and upper bounds $a$ and $b$, respectively, on the function $v$:
$$a \le v(l_1, l_2, \ldots, l_K) \le b \tag{1.11}$$
for $(l_1, l_2, \ldots, l_K) \in \Lambda$. For $n \in N$, set
$$m(n) = \min_{1 \le k \le K} n_k$$
and
$$m'(n) = \min_{1 \le k \le K} n'_k,$$
where
$$n'_k = \begin{cases} n_k & \text{if } n_k < m_k \\ +\infty & \text{if } n_k \ge m_k \end{cases}$$
for $1 \le k \le K$.
Theorem 2. Let $\tilde{Y}_n$ and $\tilde{Y}'_n$ be defined as in (1.8) and (1.9), respectively, and let $a$ and $b$ satisfy (1.11). Then
$$P\{\, |\tilde{Y}_n - \mu| \ge t \,\} \le 2e^{-2m(n)t^2/(b-a)^2} \tag{1.12}$$
and
$$P\{\, |\tilde{Y}'_n - \mu| \ge t \,\} \le 2e^{-2m'(n)t^2/(b-a)^2} \tag{1.13}$$
for $t > 0$ and $n \in N$.
Clearly, the tighter the bounds $a$ and $b$ on the function $v$, the tighter the above inequalities. Suppose as a worst-case scenario that the function $v$ depends only upon the first of its $K$ arguments and that $n_1 \le n_k$ for $2 \le k \le K$. Then the cross-product averages defined in (1.8) and (1.9) reduce to ordinary averages as in (1.1) and (1.2), and the inequalities (1.12) and (1.13) reduce to (1.5) and (1.6). These latter inequalities represent the best available Hoeffding bounds for this situation. In this sense, the inequalities in (1.12) and (1.13) can be viewed as tight worst-case Hoeffding bounds.
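To make the use of Theorem 2 concrete, the following Python sketch (our illustration, not from the report; the toy function $v$ is hypothetical) forms the cross-product average (1.8) for $K = 2$ and evaluates the bound (1.12):

import math
import random

def cross_product_average(v, m1, m2, n1, n2):
    # Ytilde_n as in (1.8) for K = 2: independent with-replacement
    # samples from {1,...,m1} and {1,...,m2}, averaged over all pairs.
    s1 = [random.randint(1, m1) for _ in range(n1)]
    s2 = [random.randint(1, m2) for _ in range(n2)]
    return sum(v(i, j) for i in s1 for j in s2) / (n1 * n2)

def theorem2_bound(n, t, a, b):
    # Right side of (1.12); m(n) is the smallest per-dimension sample size.
    return 2.0 * math.exp(-2.0 * min(n) * t**2 / (b - a)**2)

v = lambda i, j: 1.0 if i == j else 0.0   # toy 0/1-valued function
print(cross_product_average(v, m1=1000, m2=1000, n1=50, n2=80))
print(theorem2_bound(n=(50, 80), t=0.01, a=0.0, b=1.0))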
The bound in (1.13) sometimes can be tightened as follows. Fix $n \in N$ and suppose that for some positive integer $r = r(n) < K$ we have $n_k < m_k$ for $1 \le k \le r$ and $n_k \ge m_k$ for $r < k \le K$. Set
$$w_n(l_1, l_2, \ldots, l_r) = \frac{1}{m_{r+1} m_{r+2} \cdots m_K} \sum_{l_{r+1}=1}^{m_{r+1}} \sum_{l_{r+2}=1}^{m_{r+2}} \cdots \sum_{l_K=1}^{m_K} v(l_1, l_2, \ldots, l_r, l_{r+1}, l_{r+2}, \ldots, l_K) \tag{1.14}$$
for $(l_1, l_2, \ldots, l_r) \in \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_r$. Let $a(n)$ and $b(n)$ be lower and upper bounds, respectively, on the function $w_n$. Applying Theorem 2 to the $r$-dimensional cross-product average of the function $w_n$, we find that
$$P\{\, |\tilde{Y}'_n - \mu| \ge t \,\} \le 2e^{-2m'(n)t^2/(b(n)-a(n))^2} \tag{1.15}$$
for $t > 0$ and $n \in N$. The key point is that it is sometimes possible to choose $a(n)$ and $b(n)$ such that $a(n) > a$ and/or $b(n) < b$, so that the bound in (1.15) is tighter than the bound in (1.13); see Section 3.1 below for an example. Of course, this approach to tightening the bound in (1.13) can be applied with obvious modifications when $\{\, k \colon n_k < m_k \,\}$ is an arbitrary strict subset of $\{1, 2, \ldots, K\}$.
When $n = (n, n, \ldots, n)$ for some $n \ge 1$, we write $\tilde{Y}_n$ and $\tilde{Y}'_n$ for the cross-product averages indexed by this vector, so that
$$\tilde{Y}_n = \frac{1}{n^K} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} \cdots \sum_{i_K=1}^{n} v(L_{1i_1}, L_{2i_2}, \ldots, L_{Ki_K}) \tag{1.16}$$
and
$$\tilde{Y}'_n = \left( \prod_{k=1}^{K} (n \wedge m_k) \right)^{-1} \sum_{i_1=1}^{n \wedge m_1} \sum_{i_2=1}^{n \wedge m_2} \cdots \sum_{i_K=1}^{n \wedge m_K} v(L'_{1i_1}, L'_{2i_2}, \ldots, L'_{Ki_K}). \tag{1.17}$$
In this case, the inequalities (1.12) and (1.13) take the form
$$P\{\, |\tilde{Y}_n - \mu| \ge t \,\} \le 2e^{-2nt^2/(b-a)^2} \tag{1.18}$$
and
$$P\{\, |\tilde{Y}'_n - \mu| \ge t \,\} \le \begin{cases} 2e^{-2nt^2/(b-a)^2} & \text{if } n < \max(m_1, m_2, \ldots, m_K) \\ 0 & \text{if } n \ge \max(m_1, m_2, \ldots, m_K) \end{cases}$$
for $t > 0$ and $n \ge 1$.
Hoeffding actually establishes his inequalities (1.3), (1.5), and (1.6) for an arbitrary collection of mutually independent, real-valued random variables (not necessarily discrete). The inequalities in (1.10), (1.12), and (1.13) also can be shown to hold at this level of generality. Hoeffding also establishes one-sided bounds, such as the following one-sided analogues of (1.5):
$$P\{\, \bar{Y}_n - \mu \ge t \,\} \le e^{-2nt^2/(b-a)^2}$$
and
$$P\{\, \mu - \bar{Y}_n \ge t \,\} \le e^{-2nt^2/(b-a)^2}.$$
For each two-sided bound presented in this paper, there exists a pair of one-sided analogues as in the above example.
In applications, the foregoing bounds are often "inverted" for purposes of obtaining confidence intervals. For example, it follows from (1.13) that for fixed $p \in (0, 1)$ and $n \in N$,
$$P\{\, |\tilde{Y}'_n - \mu| \le \epsilon \,\} \ge p, \tag{1.19}$$
where
$$\epsilon = (b - a) \left( \frac{1}{2m'(n)} \ln \frac{2}{1 - p} \right)^{1/2}. \tag{1.20}$$
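In code, the inversion (1.19)-(1.20) amounts to a one-line computation of the half-width $\epsilon$; a minimal Python sketch (ours, with hypothetical inputs) follows:

import math

def half_width(m_prime, a, b, p):
    # Half-width epsilon of a conservative 100p% confidence interval,
    # as in (1.20); m_prime is m'(n), the smallest "unexhausted"
    # per-dimension sample size (infinite if every set is exhausted).
    if math.isinf(m_prime):
        return 0.0
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * m_prime))

# Example: K = 2, sample sizes (50, 80), neither set exhausted, v in [0, 1].
print(half_width(m_prime=min(50, 80), a=0.0, b=1.0, p=0.95))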
Our final result gives an analogue of (1.19) for the ratio of two cross-product averages. Consider two real-valued functions $f$ and $g$, each defined on $\Lambda$, along with finite constants $a_f$, $b_f$, $a_g$ ($\ge 0$), and $b_g$ such that
$$a_f \le f(l_1, \ldots, l_K) \le b_f \quad \text{and} \quad a_g \le g(l_1, \ldots, l_K) \le b_g \tag{1.21}$$
for $(l_1, \ldots, l_K) \in \Lambda$. For $p \in (0, 1)$ and $n \in N$, set
$$\epsilon_{n,p} = \left( \frac{1}{2m'(n)} \ln \frac{4}{1 - p} \right)^{1/2}.$$
Also set
$$\delta_{n,p} = \epsilon_{n,p} \left( \frac{\tilde{Y}'_n(g)(b_f - a_f) + |\tilde{Y}'_n(f)|(b_g - a_g)}{\tilde{Y}'_n(g) \bigl( \tilde{Y}'_n(g) - (b_g - a_g)\epsilon_{n,p} \bigr)} \right) \tag{1.22}$$
if $\tilde{Y}'_n(g) > (b_g - a_g)\epsilon_{n,p}$; otherwise, set $\delta_{n,p} = \infty$. In (1.22), $\tilde{Y}'_n(f)$ is defined as in (1.9), but with $v$ replaced by $f$, and similarly for $\tilde{Y}'_n(g)$. Define corresponding population averages $\mu(f)$ and $\mu(g)$ analogously. Take $0/0 = 0$.
Theorem 3. Suppose that (1.21) holds and $\mu(g) > 0$. Then
$$P\left\{\, \left| \frac{\tilde{Y}'_n(f)}{\tilde{Y}'_n(g)} - \frac{\mu(f)}{\mu(g)} \right| \le \delta_{n,p} \,\right\} \ge p$$
for $p \in (0, 1)$ and $n \in N$, where $\delta_{n,p}$ is defined by (1.22).

Theorem 3 applies to the case of sampling without replacement, but analogous results hold for sampling with replacement. The previously mentioned techniques for tightening bounds also can be applied in the current setting. Confidence intervals for other aggregates such as VARIANCE also can be obtained by adapting the techniques used to establish Theorem 3.
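A small Python sketch of the half-width computation in (1.22) may help; it is our own illustration, with all numerical inputs assumed to come from the sampling process:

import math

def ratio_half_width(yf, yg, af, bf, ag, bg, m_prime, p):
    # delta_{n,p} as in (1.22): half-width of a conservative 100p%
    # interval for the ratio mu(f)/mu(g), centered at yf/yg.
    eps = math.sqrt(math.log(4.0 / (1.0 - p)) / (2.0 * m_prime))
    if yg <= (bg - ag) * eps:
        return math.inf          # too little data: interval is unbounded
    num = yg * (bf - af) + abs(yf) * (bg - ag)
    den = yg * (yg - (bg - ag) * eps)
    return eps * num / den

# Hypothetical running estimates Ytilde'_n(f) = 3.2 and Ytilde'_n(g) = 0.4.
print(ratio_half_width(yf=3.2, yg=0.4, af=0.0, bf=10.0,
                       ag=0.0, bg=1.0, m_prime=5000, p=0.95))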
1.2. Subpopulations

We now consider the problem of estimating the mean, variance, and standard deviation of the numbers $\{\, v(i) \colon i \in S \,\}$, where $v$ is a real-valued function defined on $\{1, 2, \ldots, m\}$ (with $m > 1$) and $S$ is a nonempty subset of $\{1, 2, \ldots, m\}$. We assume that the set $S$ is specified by means of an indicator function $u$:
$$u(i) = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \in \{1, 2, \ldots, m\} - S \end{cases} \tag{1.23}$$
for $1 \le i \le m$. We also assume that it is not possible to sample directly from $S$; as in previous sections, each element in the sample is selected randomly and uniformly from the set $\{1, 2, \ldots, m\}$. The function $u$ is then applied to determine whether the sampled element is a member of $S$.
1.2.1. Mean

Denote by $\mu(S)$ the average of the function $v$ over the set $S$:
$$\mu(S) = \frac{1}{|S|} \sum_{i \in S} v(i) = \frac{\sum_{i=1}^{m} u(i)v(i)}{\sum_{i=1}^{m} u(i)}, \tag{1.24}$$
where $|S|$ denotes the number of elements in $S$. We refer to $\mu(S)$ as the subpopulation mean over $S$.

Define random indexes $L_1, L_2, \ldots, L_n$ and $L'_1, L'_2, \ldots, L'_n$ as at the beginning of Section 1. For $n \ge 1$, two possible estimators of $\mu(S)$ are
$$\bar{Y}_n(S) = \frac{1}{I_n} \sum_{i=1}^{n} u(L_i)v(L_i) \tag{1.25}$$
and
$$\bar{Y}'_n(S) = \frac{1}{I'_n} \sum_{i=1}^{n \wedge m} u(L'_i)v(L'_i), \tag{1.26}$$
where
$$I_n = \sum_{i=1}^{n} u(L_i) \tag{1.27}$$
and
$$I'_n = \sum_{i=1}^{n \wedge m} u(L'_i). \tag{1.28}$$
We take $\bar{Y}_n(S) = 0$ when $I_n = 0$ and similarly for $\bar{Y}'_n(S)$. Neither $\bar{Y}_n(S)$ nor $\bar{Y}'_n(S)$ is unbiased for $\mu(S)$. As $n$ becomes large, however, the bias of each estimator approaches 0 and each estimator converges to $\mu(S)$ with probability 1.

Because $\bar{Y}_n(S)$ and $\bar{Y}'_n(S)$ are each biased estimators, it appears quite difficult to develop Hoeffding inequalities analogous to (1.12) and (1.13). In practice, however, it suffices to bound the conditional probability that $\bar{Y}_n(S)$ (resp., $\bar{Y}'_n(S)$) deviates from $\mu(S)$ by more than a specified amount, given the observed value of $I_n$ (resp., $I'_n$). The key observation (cf. Section 2.12 in Cochran [2]) is as follows: given that $I_n = k$ (where $k > 0$), the estimator $\bar{Y}_n(S)$ is distributed as $(1/k)\sum_{i=1}^{k} v(L_i)$, where $\{L_1, L_2, \ldots, L_k\}$ is a random sample of size $k$ drawn from the set $S$ uniformly with replacement. An analogous statement holds for the conditional distribution of $\bar{Y}'_n(S)$ given $I'_n = k$. Thus, $\bar{Y}_n(S)$ and $\bar{Y}'_n(S)$ are conditionally unbiased for $\mu(S)$ and, using the inequalities in (1.5) and (1.6), we obtain the bounds
$$P\{\, |\bar{Y}_n(S) - \mu(S)| \ge t \mid I_n = k \,\} \le 2e^{-2kt^2/(b-a)^2}$$
for $t > 0$, $n \ge 1$, and $1 \le k \le n$, and
$$P\{\, |\bar{Y}'_n(S) - \mu(S)| \ge t \mid I'_n = k \,\} \le 2e^{-2kt^2/(b-a)^2} \tag{1.29}$$
for $t > 0$, $n \ge 1$, and $1 \le k \le |S| \wedge n$.
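As an illustration of the conditional bound above, the following Python sketch (ours; the toy predicates are hypothetical) scans a with-replacement sample, keeps only subpopulation members, and reports the estimate together with the Hoeffding bound given the observed count $k$:

import math

def subpop_mean_and_bound(sample, u, v, t, a, b):
    # Ybar_n(S) as in (1.25): average of v over sampled elements that
    # belong to S, plus the conditional bound given I_n = k.
    members = [v(i) for i in sample if u(i)]
    k = len(members)
    if k == 0:
        return 0.0, 1.0          # no members seen yet; bound is vacuous
    bound = 2.0 * math.exp(-2.0 * k * t**2 / (b - a)**2)
    return sum(members) / k, bound

u = lambda i: i % 7 == 0          # toy subpopulation S
v = lambda i: (i % 100) / 100.0   # toy function with a = 0, b = 1
sample = list(range(1, 5000, 3))  # stand-in for a uniform random sample
print(subpop_mean_and_bound(sample, u, v, t=0.05, a=0.0, b=1.0))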
1.2.2. Variance and Standard Deviation: Sampling with Replacement

Denote by $\sigma^2(S)$ the variance of the function $v$ over the set $S$:
$$\sigma^2(S) = \frac{1}{|S|} \sum_{i \in S} \bigl( v(i) - \mu(S) \bigr)^2 = \frac{\sum_{i=1}^{m} u(i)\bigl( v(i) - \mu(S) \bigr)^2}{\sum_{i=1}^{m} u(i)}. \tag{1.30}$$
We refer to $\sigma^2(S)$ and $\sigma(S)$ as the subpopulation variance over $S$ and subpopulation standard deviation over $S$, respectively.
First suppose that $S = \{1, 2, \ldots, m\}$ and that this fact is known a priori. (The value of $m$ need not be known.) In this case we write $\sigma$ for $\sigma(S)$ and $\sigma^2$ for $\sigma^2(S)$. Fix $n \ge 2$ and define the sample average $\bar{Y}_n$ as in (1.1). It is well known that the estimator
$$Z_n = \frac{1}{n-1} \sum_{i=1}^{n} (v(L_i) - \bar{Y}_n)^2 \tag{1.31}$$
is unbiased for $\sigma^2$; see, for example, Cramer [3, p. 347]. As pointed out by Hoeffding [10], $Z_n$ can be written in the form
$$Z_n = \frac{1}{n(n-1)} \sum_{i \ne j} g\bigl( v(L_i), v(L_j) \bigr),$$
where $g(x, y) = (x - y)^2/2$; that is, $Z_n$ is a "one-sample $U$-statistic" as defined in [10, 11]. Hoeffding's inequalities for $U$-statistics [11, Section 5a] therefore can be applied in a straightforward manner to yield inequalities for $Z_n$. These computations, which have been carried out by Krafft and Schmitz [13], yield the inequalities
$$P\{\, Z_n - \sigma^2 \ge t \,\} \le e^{-8\lfloor n/2 \rfloor t^2/(b-a)^4}, \tag{1.32}$$
$$P\{\, \sigma^2 - Z_n \ge t \,\} \le e^{-8\lfloor n/2 \rfloor t^2/(b-a)^4}, \tag{1.33}$$
and
$$P\{\, |Z_n - \sigma^2| \ge t \,\} \le 2e^{-8\lfloor n/2 \rfloor t^2/(b-a)^4} \tag{1.34}$$
for $t > 0$ and $n \ge 2$, where $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$.
The random variable $\sqrt{Z_n}$ often serves as an estimator of the standard deviation $\sigma$. To obtain a bound analogous to that in (1.34), observe that
$$\sqrt{x} - \sqrt{y} = \bigl( x - 2\sqrt{xy} + y \bigr)^{1/2} \le \sqrt{x - y}$$
for $0 \le y \le x$. Using (1.32), we find that
$$P\{\, \sqrt{Z_n} - \sigma \ge t \,\} \le P\{\, \sqrt{Z_n - \sigma^2} \ge t \,\} = P\{\, Z_n - \sigma^2 \ge t^2 \,\} \le e^{-8\lfloor n/2 \rfloor t^4/(b-a)^4}. \tag{1.35}$$
A similar argument using (1.33) yields the same bound for $P\{\, \sigma - \sqrt{Z_n} \ge t \,\}$, and combination of the two bounds yields the inequality
$$P\{\, |\sqrt{Z_n} - \sigma| \ge t \,\} \le 2e^{-8\lfloor n/2 \rfloor t^4/(b-a)^4}. \tag{1.36}$$
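The following Python sketch (ours, not part of the original development) computes $Z_n$ and the two-sided bounds (1.34) and (1.36):

import math

def variance_bounds(sample_values, t, a, b):
    # Z_n as in (1.31), plus the right sides of (1.34) and (1.36);
    # n // 2 is the floor appearing in the exponents.
    n = len(sample_values)
    mean = sum(sample_values) / n
    z_n = sum((x - mean) ** 2 for x in sample_values) / (n - 1)
    var_bound = 2.0 * math.exp(-8.0 * (n // 2) * t**2 / (b - a)**4)
    std_bound = 2.0 * math.exp(-8.0 * (n // 2) * t**4 / (b - a)**4)
    return z_n, var_bound, std_bound

values = [0.1, 0.9, 0.4, 0.6, 0.5, 0.2, 0.8, 0.3]   # toy data in [0, 1]
print(variance_bounds(values, t=0.1, a=0.0, b=1.0))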
Now suppose that $S$ is an arbitrary nonempty subset of $\{1, 2, \ldots, m\}$. The quantity $\sigma^2(S)$ can be estimated by
$$Z_n(S) = \frac{1}{I_n - 1} \sum_{i=1}^{n} u(L_i)\bigl( v(L_i) - \bar{Y}_n(S) \bigr)^2,$$
where $u$, $\bar{Y}_n(S)$, and $I_n$ are defined as in (1.23), (1.25), and (1.27), respectively. Using the above results and arguing as in the case of the subpopulation mean, we obtain the conditional inequalities
$$P\{\, |Z_n(S) - \sigma^2(S)| \ge t \mid I_n = k \,\} \le 2e^{-8\lfloor k/2 \rfloor t^2/(b-a)^4}$$
and
$$P\{\, |\sqrt{Z_n(S)} - \sigma(S)| \ge t \mid I_n = k \,\} \le 2e^{-8\lfloor k/2 \rfloor t^4/(b-a)^4}$$
for $t > 0$, $n \ge 2$, and $2 \le k \le n$.
1.2.3. Variance and Standard Deviation: Sampling without Replacement

Suppose at first that $S = \{1, 2, \ldots, m\}$, that this fact is known a priori, and that $m$ is known. Fix $n \ge 2$ and define the sample average $\bar{Y}'_n$ as in (1.2). It is well known that the estimator
$$Z'_n = \frac{m-1}{m} \cdot \frac{1}{(n \wedge m) - 1} \sum_{i=1}^{n \wedge m} (v(L'_i) - \bar{Y}'_n)^2 \tag{1.37}$$
is unbiased for $\sigma^2$; see, for example, [2, Theorem 2.4]. In analogy with (1.3), we have the following result.

Proposition 1. Let $Z_n$ and $Z'_n$ be defined as in (1.31) and (1.37), respectively. Then
$$E[f(Z'_n)] \le E[f(Z_n)]$$
for $n \ge 2$ and any convex function $f$. In particular, $\mathrm{Var}[Z'_n] \le \mathrm{Var}[Z_n]$.

Pathak [18] gives a proof of this result based on the Rao-Blackwell inequality. In Section 4 we give a "first principles" proof along the lines of the original argument used by Hoeffding to establish (1.3).

As in Hoeffding's proof of (1.6), the inequality in Proposition 1 can be combined with the results in Section 1.2.2 to establish inequalities for $Z'_n$ and $\sqrt{Z'_n}$.
Theorem 4. Let $Z'_n$ be defined as in (1.37), and let $a$ and $b$ satisfy (1.4). Then
$$P\{\, |Z'_n - \sigma^2| \ge t \,\} \le 2e^{-8\lfloor n/2 \rfloor t^2/(b-a)^4} \tag{1.38}$$
and
$$P\{\, |\sqrt{Z'_n} - \sigma| \ge t \,\} \le 2e^{-8\lfloor n/2 \rfloor t^4/(b-a)^4} \tag{1.39}$$
for $t > 0$ and $2 \le n \le m$.
The inequality (1.38) is stated in [13], but the supporting proof is incomplete.

Now suppose that $S$ is an arbitrary nonempty subset of $\{1, 2, \ldots, m\}$. If $|S|$ is known, then for $n \ge 2$ the estimator
$$Z'_n(S) = \frac{|S| - 1}{|S|} \cdot \frac{1}{I'_n - 1} \sum_{i=1}^{n \wedge m} u(L'_i)\bigl( v(L'_i) - \bar{Y}'_n(S) \bigr)^2$$
is conditionally unbiased for $\sigma^2(S)$, given the value of $I'_n$. Arguing as in previous sections, we obtain the conditional inequalities
$$P\{\, |Z'_n(S) - \sigma^2(S)| \ge t \mid I'_n = k \,\} \le 2e^{-8\lfloor k/2 \rfloor t^2/(b-a)^4}$$
and
$$P\{\, |\sqrt{Z'_n(S)} - \sigma(S)| \ge t \mid I'_n = k \,\} \le 2e^{-8\lfloor k/2 \rfloor t^4/(b-a)^4}$$
for $t > 0$, $n \ge 2$, and $2 \le k \le |S| \wedge n$. If $|S|$ is unknown, we cannot use the estimator $Z'_n(S)$. If $S$ is known to contain a sizable number of elements, however, a reasonable estimator can be obtained by ignoring the troublesome bias-correction term $(|S| - 1)/|S|$ and simply computing the sample variance over all elements of $S$ observed so far. This approach yields a biased estimator of $\sigma^2(S)$ given by
$$Z''_n(S) = \frac{1}{I'_n - 1} \sum_{i=1}^{n \wedge m} u(L'_i)\bigl( v(L'_i) - \bar{Y}'_n(S) \bigr)^2, \tag{1.40}$$
where $\bar{Y}'_n(S)$ and $I'_n$ are defined as in (1.26) and (1.28), respectively. Because $Z''_n(S)$ also is biased when given the value of $I'_n$, it is difficult to obtain even conditional Hoeffding inequalities without some extra information. When $|S|$ is large relative to $(b - a)^2$ and a good lower bound $d \le |S|$ is available, the following inequality can be useful.
Theorem 5. Let $Z''_n(S)$ be defined as in (1.40) and let $d \le |S|$. Then
$$P\{\, |Z''_n(S) - \sigma^2(S)| \ge t \mid I'_n = k \,\} \le 2e^{-8\lfloor k/2 \rfloor \alpha^2/(b-a)^4} \tag{1.41}$$
and
$$P\{\, |\sqrt{Z''_n(S)} - \sigma(S)| \ge t \mid I'_n = k \,\} \le 2e^{-8\lfloor k/2 \rfloor \beta^2/(b-a)^4} \tag{1.42}$$
for $t > 0$, $n \ge 2$, and $2 \le k \le |S| \wedge n$, where
$$\alpha = \alpha(t, d, a, b) = \frac{d-1}{d}\, t - \frac{(b-a)^2}{4(d-1)}$$
and
$$\beta = \beta(t, d, a, b) = \frac{d-1}{d}\, t^2 - \frac{(b-a)^2}{4(d-1)}.$$
Note that, given $I'_n = k$, we can set $d = k$ if $k$ is sufficiently large.
1.3. Some Further Refinements

Krafft and Schmitz [13] provide several techniques that can be used to tighten the various bounds given in this section. The first (trivial) observation is that if $a \le v \le b$, then, for example, $P\{\, |\tilde{Y}_n - \mu| \ge t \,\} = 0$ for $t > b - a$, where $\tilde{Y}_n$ and $\mu$ are as in Theorem 2. Similarly, it follows from the inequality in (4.5) below that $|Z_n - \sigma^2| \le t_n$ for $n \ge 2$, where $Z_n$ and $\sigma^2$ are as in (1.34) and
$$t_n = \frac{n(b-a)^2}{4(n-1)}.$$
Thus, $P\{\, |Z_n - \sigma^2| \ge t \,\} = 0$ for $t > t_n$ and $P\{\, |\sqrt{Z_n} - \sigma| \ge t \,\} = 0$ for $t > \sqrt{t_n}$. All of the other inequalities in this section can be tightened in a similar manner.

A less trivial result is obtained after rewriting (1.12), (1.34), and (1.36) in the form
$$P\{\, |\tilde{Y}_n - \mu| \ge t \,\} \le 2e^{-m(n)\,\phi(t/(b-a))},$$
$$P\{\, |Z_n - \sigma^2| \ge t \,\} \le 2e^{-\lfloor n/2 \rfloor\,\phi(2t/(b-a)^2)},$$
and
$$P\{\, |\sqrt{Z_n} - \sigma| \ge t \,\} \le 2e^{-\lfloor n/2 \rfloor\,\phi(2t^2/(b-a)^2)},$$
respectively, where $\phi(x) = 2x^2$. It follows from results in [13] that the above inequalities hold with $\phi$ replaced by $\phi_0$, where
$$\phi_0(x) = 2x^2 + \frac{4}{9}x^4 + \frac{2}{9}x^6.$$
All of the other inequalities in this section can be rewritten and tightened in a similar manner.
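A short Python sketch (ours; the $\phi_0$ coefficients are taken from the reconstruction above) compares the exponents $\phi$ and $\phi_0$ for a given normalized deviation; the refined bound is strictly tighter whenever $x > 0$:

import math

def phi(x):
    return 2.0 * x**2                                  # Hoeffding exponent

def phi0(x):
    return 2.0 * x**2 + (4/9) * x**4 + (2/9) * x**6    # refined exponent

# Bound on P{ |Ytilde_n - mu| >= t } with m(n) = 100 and x = t/(b - a) = 0.1.
m_n, x = 100, 0.1
print(2.0 * math.exp(-m_n * phi(x)), 2.0 * math.exp(-m_n * phi0(x)))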
2. Application to Join-Selectivity Estimation

A (select-)join query with $K$ ($\ge 2$) input relations $R_1, R_2, \ldots, R_K$ specifies a subset of the cross product $R_1 \times R_2 \times \cdots \times R_K$. For each element $(j_1, j_2, \ldots, j_K)$ in this subset, tuples $j_1, j_2, \ldots, j_K$ are concatenated to form a tuple of the output relation $R_{12 \cdots K}$. The "selectivity of the join" is the number of tuples in $R_{12 \cdots K}$ divided by the number of elements in the cross product of the input relations. Selectivity estimates play a key role in query optimization for ordbms's as well as for capacity planning, cost estimation for online queries, system access control, load balancing, and statistical studies.

Our formulation of the selectivity-estimation problem follows [7]. For $j_1 \in R_1, j_2 \in R_2, \ldots, j_K \in R_K$, set $v_0(j_1, j_2, \ldots, j_K) = 1$ if tuples $j_1, j_2, \ldots, j_K$ join (that is, if these tuples are concatenated to form a tuple of $R_{12 \cdots K}$); otherwise, set $v_0(j_1, j_2, \ldots, j_K) = 0$. The function $v_0$ is determined by the join and selection predicates that make up the query. We wish to estimate the selectivity $\mu$, defined as
$$\mu = \frac{|R_{12 \cdots K}|}{|R_1 \times \cdots \times R_K|} = \frac{1}{|R_1| \cdots |R_K|} \sum_{j_1 \in R_1} \sum_{j_2 \in R_2} \cdots \sum_{j_K \in R_K} v_0(j_1, j_2, \ldots, j_K).$$
K
2
A na!ve method for estimating the selectivity is to use a \tuple level, independent"
sampling scheme, denoted by t indep. At the nth sampling step of t indep, a tuple is
selected randomly and uniformly from each input relation. The observation Xn is then
computed from the K selected tuples, where Xn = 1 if the tuples join, and Xn = 0
otherwise. In other words, Xn is the selectivity of the K -way join of the randomly selected
tuples. The tuples are then discarded prior to the next sampling step. For each n 1, the observations X1 X2 : : : Xn are identically distributed, and the average of these
observations is an unbiased estimator of if samples are drawn with replacement, then
these observations are also mutually independent.
An alternative approach is to estimate $\mu$ using the "page-level, cross-product" sampling scheme proposed in Hou, Ozsoyoglu, and Taneja [12] and analyzed in [7, 19]. We denote this scheme by p_cross. At each sampling step of p_cross and for each of the $K$ input relations, a page of tuples is selected randomly and uniformly from among the pages that make up the relation; this randomly selected page is stored in main memory. At the $n$th sampling step, $n^K - (n-1)^K$ observations are generated by computing
(i) the selectivity of the $K$-way join of the pages selected at the current sampling step and
(ii) the selectivities of all possible $K$-way joins among pages selected at the current sampling step and pages selected at previous sampling steps.
Although p_cross examines many more tuples per sampling step than t_indep, the resulting observations are not mutually independent, even when samples are drawn with replacement.

Haas et al. [7] compare p_cross and t_indep when samples are drawn with replacement. They show that, for any fixed number of sampling steps, selectivity estimators based on the p_cross scheme always have variance less than or equal to that of estimators based on the t_indep scheme. In practice, the variance of the selectivity estimator can be smaller by orders of magnitude when p_cross is used. We therefore focus throughout on the p_cross sampling scheme and its variants.
The sampling cost of p_cross sometimes can be reduced by using "index-assisted" sampling as proposed by Lipton, Naughton, and Schneider [14, 15] and extended in [7]. Suppose, for example, that we wish to estimate the selectivity of an equijoin of relations $R_1$ and $R_2$ using the t_indep scheme and that $R_2$ has an index on its join attribute; that is, the join predicate is "$R_1.a = R_2.a$" and $R_2$ has an index on attribute $a$. At each sampling step, one tuple is selected randomly and uniformly from relation $R_1$ and the total number of tuples from $R_2$ that join with this random tuple is computed using the index; no sampling from $R_2$ is required. Thus, we obtain $|R_2|$ observations at each sampling step. This idea extends in a straightforward manner to the p_cross scheme and to general $K$-way joins when one or more of the input relations has a combined index on (the concatenation of) all relevant join and selection attributes.

In the following, we assume that samples are drawn with replacement. We also assume that tuples are stored and brought into main memory in pages, where each page contains $N$ ($\ge 1$) tuples. To discuss page-level and index-assisted sampling schemes in a unified way, we consider a generalized scheme in which, for $1 \le k \le K$, the tuples in relation $R_k$ are partitioned into $m_k$ blocks $B(k, 1), B(k, 2), \ldots, B(k, m_k)$ with $t_k$ ($= |R_k|/m_k$) tuples per block. For each relation $R_k$, a block of tuples is selected at each sampling step, and all of the tuples in the block are brought into main memory. When $t_1 = \cdots = t_K = N$ we have pure page-level sampling. When $t_j = |R_j|$ for one or more values of $j$ we have index-assisted sampling. (As indicated above, we don't actually bring all of the tuples in an indexed relation into memory; we just perform an index lookup. Such a lookup, however, is equivalent to examining all of the tuples in the indexed relation, as far as their join and selection attributes are concerned. In general, one or more I/O's may be required to perform the lookup; see [7] for a detailed discussion of sampling costs.) We assume that at most $K - 1$ indexes are available; Ganguly, Gibbons, Matias, and Silberschatz [4] provide an estimation procedure when $K = 2$ and there is an index on each input relation.
For $(l_1, l_2, \ldots, l_K) \in \Lambda$, denote by $v(l_1, l_2, \ldots, l_K)$ the selectivity of the join of blocks $B(1, l_1), B(2, l_2), \ldots, B(K, l_K)$:
$$v(l_1, l_2, \ldots, l_K) = \frac{1}{t_1 t_2 \cdots t_K} \sum_{j_1 \in B(1, l_1)} \sum_{j_2 \in B(2, l_2)} \cdots \sum_{j_K \in B(K, l_K)} v_0(j_1, j_2, \ldots, j_K).$$
The function $v$ is called the selectivity function for the join. Observe that the selectivity $\mu$ can be represented as a cross-product average of the form (1.7). Denote by $L_{ki}$ (resp., $L'_{ki}$) the random index of the block of tuples selected from relation $R_k$ at the $i$th sampling step when samples are drawn with (resp., without) replacement. Then for $n \ge 1$ the estimators $\tilde{Y}_n$ and $\tilde{Y}'_n$ defined by (1.16) and (1.17), respectively, are each unbiased for $\mu$.
Suppose that we wish to estimate the selectivity $\mu$ to within $\pm\epsilon$ with probability $p$, where $\epsilon > 0$ and $p \in (0, 1)$. In general, it is impossible to satisfy the precision criterion with a probability exactly equal to $p$. To address this problem, Haas et al. [7] introduced a fixed-precision estimation procedure called fp_cross. In this procedure, the basic p_cross sampling procedure is executed for a random number of sampling steps. After each sampling step of p_cross, a stopping rule is used to determine whether to continue sampling or to stop sampling and return the cross-product average based on all of the blocks sampled so far. The final estimate $\tilde{Y}(\epsilon)$ is given by $\tilde{Y}(\epsilon) = \tilde{Y}_{N(\epsilon)}$, where $\tilde{Y}_n$ is defined by (1.16) for $n \ge 1$ and $N(\epsilon)$ is the random number of sampling steps executed by the procedure.

The stopping rule for fp_cross is obtained by approximating the distribution of $\tilde{Y}(\epsilon)$ by a normal distribution. Use of this approximation leads to an estimate of the probability that the precision criterion is satisfied. If the estimated probability is greater than or equal to the prespecified probability, then no further samples are taken; see [7] for details. It is shown in [7] that the fp_cross procedure is "asymptotically consistent" in the sense that the probability of satisfying the precision criterion converges to the prespecified value as the precision criterion becomes increasingly stringent: $\lim_{\epsilon \to 0} P\{\, |\tilde{Y}(\epsilon) - \mu| \le \epsilon \,\} = p$. Moreover, fp_cross is "asymptotically efficient" in the sense that the total number of sampling steps converges to the theoretical minimum number of required steps as $\epsilon$ becomes small. The number of sampling steps required by fp_cross is independent of the size of the input relations. Thus, for a fixed precision criterion, the cost of sampling relative to the cost of computing $\mu$ exactly decreases as the size of input relations increases; see Haas and Naughton [8] for further discussion.
One shortcoming of fp_cross is that there is no guaranteed upper bound on the number of sampling steps. This can be an issue because the normal approximation that underlies the stopping rule is exact only in the limit as $\epsilon \to 0$. For fixed positive $\epsilon$, the probability that the precision criterion is satisfied can be underestimated, resulting in too many sampling steps.
We can use the results in Section 1.1 to alleviate this problem and provide a guaranteed upper bound on the number of sampling steps. (See [15] for a related discussion of "sanity bounds.") Set
$$n_0 = n_0(\epsilon, p) = \left\lceil \frac{1}{2\epsilon^2} \ln \frac{2}{1 - p} \right\rceil,$$
where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$. Then the idea is to stop sampling in fp_cross after $\min\bigl( n_0, N(\epsilon) \bigr)$ steps, where $N(\epsilon)$ is the number of steps executed in the original fp_cross procedure. It follows from (1.12) that no more than $n_0$ sampling steps are ever needed to satisfy the precision criterion.
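For example, a minimal Python computation of the sanity bound $n_0$ (our sketch, not part of the original procedure):

import math

def sanity_bound(eps, p):
    # n0 as above: number of sampling steps after which the Hoeffding
    # bound (1.12) alone guarantees |Ytilde_n - mu| <= eps with
    # probability >= p. Selectivities lie in [0, 1], so b - a = 1.
    return math.ceil(math.log(2.0 / (1.0 - p)) / (2.0 * eps**2))

print(sanity_bound(eps=0.01, p=0.95))   # 18445 steps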
The bound $n_0$ can be improved upon when there is additional a priori information on the selectivity of the join. For example, consider a join query consisting of the single join predicate
$$R_1.a_1 = R_2.a_1 \ \text{and}\ R_2.a_2 = R_3.a_2 \ \text{and}\ \cdots \ \text{and}\ R_{K-1}.a_{K-1} = R_K.a_{K-1}.$$
Suppose it is known a priori that each tuple in relation $R_k$ ($1 \le k \le K-1$) joins with at most $\nu_{k+1}$ tuples in relation $R_{k+1}$, where $\nu_{k+1} < t_{k+1}$. (The symbol $\nu$ is supplied here; the original glyph was lost in extraction.) Then $0 \le v(l_1, l_2, \ldots, l_K) \le u$ for all $l_1, l_2, \ldots, l_K$, where $u = \prod_{k=2}^{K} (\nu_k/t_k) < 1$. It follows from (1.18) that
$$P\{\, |\tilde{Y}_{n_0(u)} - \mu| \le \epsilon \,\} \ge p,$$
where
$$n_0(u) = n_0(u, \epsilon, p) = \left\lceil \frac{u^2}{2\epsilon^2} \ln \frac{2}{1 - p} \right\rceil \le n_0.$$
3. Application to Online Aggregation

Users of an ordbms often execute "aggregation queries" in order to obtain statistical summaries of large, complex data sets. Aggregation queries are processed by first executing a (typically complex) sequence of joins and selections on the base relations to create an output relation; each tuple of the output relation is then mapped to a real number, and aggregate quantities such as the sum, mean, or variance are computed from the numbers. The rows of the output table sometimes are divided into groups based on values of the data attributes, and aggregates are computed separately for each group.

We focus on scientific and decision-support applications in which the user explores a data set by executing a sequence of aggregation queries in an interactive manner. The formulation of each successive query is influenced by the results of previous queries. In this setting it is crucial that the processing time for each query be as short as possible. Moreover, since the typical goal is simply to get a rough feel for the data, approximate results often suffice. Unfortunately, current database systems do not adequately support such interactive exploration: the result of an aggregation query is not returned to the user until the query has run to completion and the exact answer has been computed. Such an exact computation, which can involve the processing of many terabytes of data, can take an extremely long time. Moreover, there is no feedback during processing; a user can lose much valuable time before discovering that a particular query is misguided or uninformative.

In an effort to address these problems, Hellerstein, Haas, and Wang [9] have proposed an online aggregation interface to an ordbms. This interface lets users both observe the progress of their aggregation queries and control the execution of these queries on the fly. The idea is to retrieve the pages of each base table in random order, so that the rows retrieved so far can be viewed as a random sample. At each time point, the system displays a running estimate of the final value of the aggregate based on all of the pages retrieved so far. The system also indicates the estimated proximity of each running estimate to the corresponding final result by means of a running confidence interval. The user can abort processing of the query as soon as the intervals become sufficiently short. When aggregates are being computed for multiple groups, some online-aggregation prototype systems maintain a running confidence interval for each group and permit the user to increase or decrease the relative processing speed for an individual group on the fly or abort processing for an individual group.
Classical confidence-interval formulas based on results for i.i.d. observations cannot be applied in the setting of online aggregation because of the complicated correlation structure induced by the joins and selections executed prior to the final aggregation step. Large-sample confidence intervals suitable for online aggregation are given in [5, 6]; such intervals are based on central limit theorems. Although the large-sample intervals are useful in the earlier stages of query processing, the set of auxiliary statistics needed to compute the intervals can become very large as the processing proceeds. Also, the intervals rest on the approximating assumption that samples are obtained with replacement, and this assumption becomes untenable as the sample size (number of records scanned) becomes large. Finally, the actual coverage probability for the intervals can be less than the nominal value.

The results in Section 1 can be used to obtain conservative running confidence intervals for a variety of online aggregation queries. That is, for a prespecified parameter $p \in (0, 1)$, a number $\epsilon$ is displayed such that the current value of the running estimate is within $\pm\epsilon$ of the final answer with probability $p$. Although conservative confidence intervals are wider in general than large-sample intervals, the computations for the conservative intervals require minimal memory and CPU time and avoid the undercoverage problem alluded to above.

In the following subsections, we illustrate the application of our results to online aggregation processing by means of an extended example. Consider an online aggregation interface to a relational database system containing the following three relation schemes:

Supplier-scheme = (part-num, supplier, price)
Inventory-scheme = (part-num, location, quantity)
Sales-scheme = (item, month, day, location, number-sold)

Assume that there are $m_1$ pages of tuples (with $t_1$ tuples per page) in the Supplier relation, $m_2$ pages of tuples (with $t_2$ tuples per page) in the Inventory relation, and $m_3$ pages of tuples (with $t_3$ tuples per page) in the Sales relation. Tuples are retrieved from each input relation a page at a time. Denote by Supplier($i$) the $i$th tuple in the Supplier relation, and similarly for the Inventory and Sales relations.
3.1. Complex SUM, COUNT, and AVERAGE Queries

As a first example, consider the query that returns the total value of inventory stored at the San Jose warehouse:

SELECT SUM(Supplier.price * Inventory.quantity)
FROM Supplier, Inventory
WHERE Supplier.part-num = Inventory.part-num
AND Inventory.location = 'San Jose'

For $1 \le i \le m_1 t_1$ and $1 \le j \le m_2 t_2$, set
$$v_0(i, j) = \text{Supplier}(i).\text{price} \times \text{Inventory}(j).\text{quantity}$$
if Supplier($i$).part-num = Inventory($j$).part-num and Inventory($j$).location = "San Jose"; otherwise, set $v_0(i, j) = 0$. Also set
$$v(l_1, l_2) = m_1 m_2 \sum_{i \in B(1, l_1)} \sum_{j \in B(2, l_2)} v_0(i, j), \tag{3.1}$$
where $B(1, k)$ and $B(2, k)$ denote the $k$th page of tuples from the Supplier relation and Inventory relation, respectively. Observe that the final answer $\mu$ is of the form (1.7) with $v$ defined as above. Suppose it is known that
(i) no part has a price greater than $p$ and
(ii) no more than $q$ units of a given part type are stored at the San Jose warehouse simultaneously.
Such information often can be deduced from statistics that are maintained in the database system catalog for use by the query optimizer. It follows that $\max_{i,j} v_0(i, j) \le pq$ and the function $v$ satisfies the bounds $a \le v \le b$ with $a = 0$ and $b = m_1 m_2 t_1 t_2 pq$. Tighter bounds on $v$ can be obtained by observing that the part-num attribute is in fact a key for the Supplier table; that is, there are no duplicate part-num values in Supplier. (Otherwise, the query is not well posed.) We can take $b = m_1 m_2 t_2 pq$, since each row in the Inventory table joins with at most one row in the Supplier table.
Suppose that at a given time point $n_1$ pages have been retrieved from the Supplier relation and $n_2$ pages have been retrieved from the Inventory relation. Then the estimator $\tilde{Y}'_n$ defined by (1.9) is unbiased for $\mu$, where $n = (n_1, n_2)$. It follows from (1.13) that a conservative $100p\%$ confidence interval for $\mu$ is given by $[\tilde{Y}'_n - \epsilon, \tilde{Y}'_n + \epsilon]$, where $\epsilon$ is defined by (1.20) with $a$ and $b$ as above. If the assumptions in (i) and (ii) hold and
(iii) all of the pages in the Supplier relation and a portion of the pages in the Inventory relation have been retrieved, so that $n_1 = m_1$ and $n_2 < m_2$,
then we can obtain a tighter interval than the one given above, as shown in the sketch following this discussion. In analogy to (1.14), set
$$w_n(l_2) = \frac{1}{m_1} \sum_{l_1=1}^{m_1} v(l_1, l_2) = m_2 \sum_{l_1=1}^{m_1} \sum_{i \in B(1, l_1)} \sum_{j \in B(2, l_2)} v_0(i, j)$$
for $1 \le l_2 \le m_2$. Since at most $t_2$ terms in the above sum are positive (with the value of a positive term equal to at most $pq$), we can obtain a tighter interval $[\tilde{Y}'_n - \epsilon(n), \tilde{Y}'_n + \epsilon(n)]$ by setting
$$\epsilon(n) = \bigl( b(n) - a(n) \bigr) \left( \frac{1}{2m'(n)} \ln \frac{2}{1 - p} \right)^{1/2}, \tag{3.2}$$
where $a(n) = 0$ and $b(n) = m_2 t_2 pq$.
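A Python sketch (ours; the running sums and catalog constants are hypothetical inputs that would come from the scan) of the conservative interval for this SUM query:

import math

def sum_query_interval(v0_total, m1, m2, n1, n2, b, p):
    # Conservative 100p% interval for the SUM query, assuming at least
    # one relation is not yet exhausted. v0_total is the sum of v0 over
    # all pairs of retrieved pages, so Ytilde'_n = v0_total*m1*m2/(n1*n2)
    # for v as in (3.1) (no page is retrieved twice, so n_k ^ m_k = n_k).
    # b is the a priori bound, e.g. m1*m2*t2*p*q, or m2*t2*p*q as b(n)
    # in (3.2) once n1 = m1.
    estimate = v0_total * m1 * m2 / (n1 * n2)
    unexhausted = [nk for nk, mk in ((n1, m1), (n2, m2)) if nk < mk]
    if not unexhausted:
        return estimate, estimate        # scan complete: exact answer
    eps = b * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * min(unexhausted)))
    return estimate - eps, estimate + eps

# Hypothetical running state: 120 of 500 Supplier pages and 80 of 2000
# Inventory pages retrieved; v0-sum over retrieved page pairs is 4.2e4.
print(sum_query_interval(4.2e4, m1=500, m2=2000, n1=120, n2=80,
                         b=500 * 2000 * 100 * 50.0 * 10.0, p=0.95))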
The formula (1.20) for the precision parameter $\epsilon$ assumes that pages, but not necessarily tuples, are retrieved in random order. That is, the attribute values of a tuple may depend on the page on which the tuple resides. When such dependence is present, the tuples are said to be clustered on the pages; otherwise, they are said to be unclustered. If it is known that rows are unclustered, the lengths of the conservative confidence intervals can be considerably reduced. To see this, suppose that the assumptions in (i) and (ii) hold. Also suppose that $n_1$ ($< m_1$) pages have been retrieved from the Supplier table and $n_2$ ($< m_2$) pages have been retrieved from the Inventory table. If the rows are clustered on the pages, then $\epsilon$ is computed from (1.20) with $n = (n_1, n_2)$ as discussed above. If the rows are unclustered, however, then rows (and not just pages) are retrieved in random order, and $\epsilon$ can be computed using the same formula as in the clustered case except that $n = (t_1 n_1, t_2 n_2)$. Similarly, if the assumptions in (i)-(iii) hold, then $\epsilon$ is computed from (3.2) with $n = (n_1, n_2)$ in the clustered case and $n = (t_1 n_1, t_2 n_2)$ in the unclustered case.
As a second example of an aggregation query, consider the query that returns the total number of part types at the San Jose warehouse that are supplied by the Acme company:

SELECT COUNT(*)
FROM Supplier, Inventory
WHERE Supplier.part-num = Inventory.part-num
AND Supplier.supplier = 'Acme' AND Inventory.location = 'San Jose'

We assume here that the combined attribute (part-num, location) is a key for the Inventory relation. Set
$$v_0(i, j) = 1$$
if Supplier($i$).part-num = Inventory($j$).part-num, Supplier($i$).supplier = "Acme", and Inventory($j$).location = "San Jose"; otherwise, set $v_0(i, j) = 0$. Defining $v(l_1, l_2)$ as in (3.1), we see that the final answer $\mu$ is of the form (1.7). With no additional assumptions about the data, the function $v$ satisfies the bounds $a \le v \le b$ with $a = 0$ and $b = m_1 m_2 t_1 t_2$. As with the SUM query discussed above, tighter bounds on $v$ may be obtainable. Since $\mu$ has the same mathematical form as the result of the SUM query discussed above, the previous discussion for the SUM query carries over to the current setting almost unchanged; only the specific form of the function $v$ and the value of $b$ are different. In particular, the estimator $\tilde{Y}'_n$ defined by (1.9) is unbiased for $\mu$, and conservative choices for the confidence-interval half-width $\epsilon$ are of the form (1.20). Moreover, the result in (1.15) can be applied as described for SUM queries.
Finally, consider the query that returns the average price of the part types supplied by the Acme company and stored in San Jose:

SELECT AVERAGE(Supplier.price)
FROM Supplier, Inventory
WHERE Supplier.part-num = Inventory.part-num
AND Supplier.supplier = 'Acme' AND Inventory.location = 'San Jose'

For $1 \le i \le m_1 t_1$ and $1 \le j \le m_2 t_2$, set
$$f_0(i, j) = \text{Supplier}(i).\text{price}$$
if Supplier($i$).part-num = Inventory($j$).part-num, Supplier($i$).supplier = "Acme", and Inventory($j$).location = "San Jose"; otherwise, set $f_0(i, j) = 0$. Also set
$$f(l_1, l_2) = \sum_{i \in B(1, l_1)} \sum_{j \in B(2, l_2)} f_0(i, j).$$
Similarly, set
$$g_0(i, j) = 1$$
if Supplier($i$).part-num = Inventory($j$).part-num, Supplier($i$).supplier = "Acme", and Inventory($j$).location = "San Jose"; otherwise, set $g_0(i, j) = 0$. Also set
$$g(l_1, l_2) = \sum_{i \in B(1, l_1)} \sum_{j \in B(2, l_2)} g_0(i, j).$$
Then the final answer is of the form $\mu(f)/\mu(g)$ as in Theorem 3. Under assumption (i) above, a conservative $100p\%$ confidence interval is given by $[\tilde{Y}'_n(f)/\tilde{Y}'_n(g) - \delta_{n,p},\ \tilde{Y}'_n(f)/\tilde{Y}'_n(g) + \delta_{n,p}]$, where $\delta_{n,p}$ is given by (1.22) with $a_f = 0$, $b_f = t_2 p$, $a_g = 0$, and $b_g = t_2$. (As above, we have used the fact that the part-num attribute is a key for the Supplier table.)
3.2. Summary Statistics Defined on Selections

We assume throughout that the online aggregation system retrieves tuples from each relation in random order. As discussed above, this assumption is stronger than the assumption that the system retrieves pages in random order, and implies that tuples are unclustered. Consider the query that returns the average daily number $\mu(S)$ of widgets sold in December:

SELECT AVG(Sales.number-sold)
FROM Sales
WHERE Sales.month = 'December'
AND Sales.item = 'widget'

For $1 \le i \le m_3 t_3$, set $v(i) = \text{Sales}(i).\text{number-sold}$. Also set $u(i) = 1$ if Sales($i$).item = "widget" and Sales($i$).month = "December"; otherwise, set $u(i) = 0$. Then $\mu(S)$ is of the form (1.24) with $u$ and $v$ defined as above. Suppose that no location stocks more than $w$ widgets on any day. Then the function $v$ satisfies the bounds $a \le v \le b$ with $a = 0$ and $b = w$.

Suppose that at a given time point $n$ tuples (equivalently, $n/t_3$ pages) have been retrieved from the Sales relation and $k$ ($> 0$) of these tuples correspond to widgets sold in December.
The estimator $\bar{Y}'_n(S)$ defined by (1.26) (with $I'_n = k$) is conditionally unbiased for $\mu(S)$. By (1.29), a conservative $100p\%$ confidence interval is given by $[\bar{Y}'_n(S) - \epsilon, \bar{Y}'_n(S) + \epsilon]$, where
$$\epsilon = (b - a) \left( \frac{1}{2k} \ln \frac{2}{1 - p} \right)^{1/2}.$$
Note that an alternative confidence interval is given by Theorem 3. Neither interval dominates the other in all cases, and the shorter of the two intervals should be used.
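A Python sketch (ours; the row layout is hypothetical) of this running interval, fed by the tuples retrieved so far:

import math

def selection_avg_interval(rows, is_member, value, b, p):
    # Ybar'_n(S) as in (1.26) and the half-width from (1.29),
    # with a = 0 and b an a priori bound on the values.
    members = [value(r) for r in rows if is_member(r)]
    k = len(members)
    if k == 0:
        return None                      # nothing to report yet
    estimate = sum(members) / k
    eps = b * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * k))
    return estimate - eps, estimate + eps

# Hypothetical retrieved rows: (item, month, number_sold) triples.
rows = [("widget", "December", 7), ("gadget", "May", 3),
        ("widget", "December", 12), ("widget", "July", 5)]
member = lambda r: r[0] == "widget" and r[1] == "December"
print(selection_avg_interval(rows, member, lambda r: r[2], b=50.0, p=0.95))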
Now consider the query that returns the variance $\sigma^2(S)$ of the daily number of widgets sold in December:

SELECT VARIANCE(Sales.number-sold)
FROM Sales
WHERE Sales.month = 'December'
AND Sales.item = 'widget'

Defining $u$ and $v$ as above, we see that $\sigma^2(S)$ is of the form (1.30). As above, suppose that $n$ tuples have been retrieved from the Sales relation and $k$ of these tuples correspond to widgets sold in December. We can estimate $\sigma^2(S)$ by $Z''_n(S)$, where this estimator is defined by (1.40) (with $I'_n = k$). It follows from (1.41) that a conservative $100p\%$ confidence interval is given by $[Z''_n(S) - \epsilon', Z''_n(S) + \epsilon']$, where
$$\epsilon' = \frac{d(b - a)^2}{4(d - 1)^2} + \frac{d}{d - 1}\, (b - a)^2 \left( \frac{1}{8\lfloor k/2 \rfloor} \ln \frac{2}{1 - p} \right)^{1/2}.$$
In the above expression, $d$ is a lower bound on the total number $|S|$ of tuples that correspond to widgets sold in December. The parameter $d$ can be taken equal to $k$; if a larger lower bound on $|S|$ is available, then $d$ should be taken equal to this lower bound. If the quantity of interest is the standard deviation rather than the variance, then $\sqrt{Z''_n(S)}$ can be used to estimate $\sigma(S)$. By (1.42), a conservative $100p\%$ confidence interval is given by $[\sqrt{Z''_n(S)} - \epsilon'', \sqrt{Z''_n(S)} + \epsilon'']$, where $\epsilon'' = \sqrt{\epsilon'}$.
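A Python sketch (ours) of the variance interval; the member values are the ones kept by the scan described above, and $d$ defaults to the observed count $k$:

import math

def variance_interval(member_values, b, p, d=None):
    # Z''_n(S) as in (1.40) and the half-width eps' derived from (1.41),
    # with a = 0. Requires k >= 2 members and d >= 2.
    k = len(member_values)
    mean = sum(member_values) / k
    z = sum((x - mean) ** 2 for x in member_values) / (k - 1)
    d = k if d is None else d
    eps = (d * b**2 / (4.0 * (d - 1) ** 2)
           + (d / (d - 1)) * b**2
           * math.sqrt(math.log(2.0 / (1.0 - p)) / (8.0 * (k // 2))))
    return z - eps, z + eps

print(variance_interval([7.0, 12.0, 9.0, 4.0, 11.0, 8.0], b=50.0, p=0.95))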
4. Proofs

Proof of Theorem 1. Fix a vector $n \in N$. For $1 \le k \le K$ let $N_k$ ($\ge n_k$) be the unique random integer such that $S_k = \{\, L_{ki} \colon 1 \le i \le N_k \,\}$ contains exactly $n_k$ distinct values; denote the set of these distinct values by $T_k = \{\, L'_{ki} \colon 1 \le i \le n_k \,\}$. Observe that $T_k$ coincides with a random sample of size $n_k$ drawn uniformly without replacement from $\Lambda_k$. Set $S = (S_1, S_2, \ldots, S_K)$ and $T = (T_1, T_2, \ldots, T_K)$. We can view $\tilde{Y}_n$, $\tilde{Y}'_n$, and $T$ as functions of $S$. Write
$$E[\tilde{Y}_n \mid T] = \frac{1}{n_1 \cdots n_K} \sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} E[v(L_{1i_1}, \ldots, L_{Ki_K}) \mid T]. \tag{4.1}$$
It follows from symmetry considerations that
$$P\{\, L_{ki} = l \mid l \in T_k \,\} = \frac{1}{n_k}$$
for $1 \le l \le m_k$, $1 \le i \le n_k$, and $1 \le k \le K$, so that
$$E[v(L_{1i_1}, \ldots, L_{Ki_K}) \mid T] = \frac{1}{n_1 \cdots n_K} \sum_{j_1=1}^{n_1} \cdots \sum_{j_K=1}^{n_K} v(L'_{1j_1}, \ldots, L'_{Kj_K}). \tag{4.2}$$
Substituting (4.2) into (4.1), we find that $\tilde{Y}'_n = E[\tilde{Y}_n \mid T]$. The desired result now follows from the Rao-Blackwell Theorem, because $T$ is a sufficient statistic for the sampling scheme that generates $S$; see Pathak [17] for definitions and details.
We now prove Theorem 2 using an approach developed by Hoeffding in the setting of $U$-statistics and averages of $m$-dependent random variables; see [11, Section 5].

We first recall the following elementary but useful inequality, attributed by Hoeffding [11] to S. N. Bernstein:
$$P\{\, X \ge 0 \,\} \le \inf_{h \ge 0} E\bigl[ e^{hX} \bigr] \tag{4.3}$$
for any random variable $X$. The inequality in (4.3) holds since $P\{\, X \ge 0 \,\} = E[g(X)]$, where
$$g(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$
and $g(x) \le e^{hx}$ for all real $x$ and $h \ge 0$. We also need the following proposition.
Proposition 2. Let $Y = p_1 U_1 + p_2 U_2 + \cdots + p_m U_m$, where each of $U_1, U_2, \ldots, U_m$ is the average of $n$ ($\ge 1$) independent random variables taking values in $[0, 1]$ and $p_1, p_2, \ldots, p_m$ are nonnegative numbers such that $p_1 + p_2 + \cdots + p_m = 1$. The random variables $U_1, U_2, \ldots, U_m$ need not be mutually independent, but are assumed to have common mean $\mu$. Then
$$\inf_{h \ge 0} E\bigl[ e^{h(Y - \mu - t)} \bigr] \le e^{-2nt^2}$$
for $t > 0$.

This proposition follows almost immediately from results in Sections 1, 4, and 5 in [11]. Indeed,
$$E\bigl[ e^{h(nY - n\mu - nt)} \bigr] \le \sum_{i=1}^{m} p_i E\bigl[ e^{h(nU_i - n\mu - nt)} \bigr]$$
for $h \ge 0$ by the inequality that is stated just prior to (5.2) in [11]. On the other hand, it follows from (1.8) and (4.16) in [11] that
$$E\bigl[ e^{h(nU_i - n\mu - nt)} \bigr] \le e^{-hnt + h^2 n/8}$$
for $1 \le i \le m$ and $h \ge 0$. The right side of the above inequality is minimized when $h = 4t$, so that
$$\inf_{h \ge 0} E\bigl[ e^{h(nU_i - n\mu - nt)} \bigr] \le e^{-2nt^2}$$
for $1 \le i \le m$, and the asserted inequality follows directly.
Proof of Theorem 2. Fix $t > 0$ and $n \in N$, and set
$$m = m(n) = \min(n_1, n_2, \ldots, n_K).$$
Assume without loss of generality that $n_1 = m$. Also assume without loss of generality that $0 \le v \le 1$; the general result follows by considering the function $v'(\cdot) = \bigl( v(\cdot) - a \bigr)/(b - a)$. Set
$$U(i_2, i_3, \ldots, i_K) = \frac{1}{n_1} \sum_{j=1}^{n_1} v(L_{1j}, L_{2,i_2+j}, \ldots, L_{K,i_K+j})$$
for $(i_2, i_3, \ldots, i_K) \in \{1, 2, \ldots, n_2\} \times \{1, 2, \ldots, n_3\} \times \cdots \times \{1, 2, \ldots, n_K\}$, where an index of the form "$i_k + j$" is interpreted as $\bigl( (i_k + j - 1) \bmod n_k \bigr) + 1$ (that is, the indices "wrap around"). For example, when $K = 3$, $n_1 = 3$, $n_2 = 4$, and $n_3 = 3$, we have
$$U(3, 1) = \frac{v(L_{11}, L_{24}, L_{32}) + v(L_{12}, L_{21}, L_{33}) + v(L_{13}, L_{22}, L_{31})}{3}.$$
Observe that
$$\tilde{Y}_n = \frac{1}{n_2 n_3 \cdots n_K} \sum_{i_2=1}^{n_2} \sum_{i_3=1}^{n_3} \cdots \sum_{i_K=1}^{n_K} U(i_2, i_3, \ldots, i_K).$$
Indeed, for specified values of $i_1, i_2, \ldots, i_K$ the quantity $v(L_{1i_1}, L_{2i_2}, \ldots, L_{Ki_K})$ appears exactly once, as the $i_1$st term in the representation of $U(i_2 - i_1, i_3 - i_1, \ldots, i_K - i_1)$, where indices wrap around as described above. By construction, each $U(i_2, i_3, \ldots, i_K)$ is the average of $n_1$ i.i.d. random variables. It follows from (4.3) and Proposition 2 that
$$P\{\, \tilde{Y}_n - \mu \ge t \,\} \le \inf_{h \ge 0} E\bigl[ e^{h(\tilde{Y}_n - \mu - t)} \bigr] \le e^{-2mt^2}.$$
A symmetric argument establishes the same bound for $P\{\, \mu - \tilde{Y}_n \ge t \,\}$, and (1.12) follows.
We can now use (1.12) to establish (1.13) in the same way that Hoeffding uses (1.5) to establish (1.6). First suppose that $n_k < m_k$ for $1 \le k \le K$, and define $m = m(n)$ as above. Observe that $m$ is also equal to $m'(n)$ in this case. Since the function $g(x) = e^{hx}$ is convex for $h \ge 0$, it follows from (4.3) and Theorem 1 that
$$P\{\, \tilde{Y}'_n - \mu \ge t \,\} \le E\bigl[ e^{h(\tilde{Y}'_n - \mu - t)} \bigr] = e^{-h(\mu + t)} E\bigl[ e^{h\tilde{Y}'_n} \bigr] \le e^{-h(\mu + t)} E\bigl[ e^{h\tilde{Y}_n} \bigr] = E\bigl[ e^{h(\tilde{Y}_n - \mu - t)} \bigr]$$
for $h \ge 0$, so that, by Proposition 2,
$$P\{\, \tilde{Y}'_n - \mu \ge t \,\} \le \inf_{h \ge 0} E\bigl[ e^{h(\tilde{Y}_n - \mu - t)} \bigr] \le e^{-2mt^2}.$$
A symmetric argument establishes the same bound for $P\{\, \mu - \tilde{Y}'_n \ge t \,\}$, and (1.13) follows. Now suppose without loss of generality that for some integer $r = r(n)$ with $1 \le r < K$ we have $n_k < m_k$ for $1 \le k \le r$ and $n_k \ge m_k$ for $r + 1 \le k \le K$. Observe that we can write
$$\tilde{Y}'_n = \frac{1}{n_1 n_2 \cdots n_r} \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \cdots \sum_{i_r=1}^{n_r} w_n(L'_{1i_1}, L'_{2i_2}, \ldots, L'_{ri_r}),$$
where $w_n$ is defined as in (1.14). The desired result follows by applying the previous argument to the $r$-dimensional cross-product average of the function $w_n$. The only case not considered so far is when $n_k \ge m_k$ for $1 \le k \le K$, but for this case the desired result follows trivially.
To prove Theorem 3, we need the following lemma.

Lemma 1. Suppose that there exist real numbers $Y_i$, $\mu_i$, $\beta_i$ ($i = 1, 2$) such that $\beta_1, \beta_2 > 0$, $Y_2 > \beta_2$, and $Y_i - \beta_i \le \mu_i \le Y_i + \beta_i$ for $i = 1, 2$. Then
$$\left| \frac{\mu_1}{\mu_2} - \frac{Y_1}{Y_2} \right| \le \frac{Y_2 \beta_1 + |Y_1| \beta_2}{Y_2 (Y_2 - \beta_2)}.$$
Proof. We have
$$\left| \frac{\mu_1}{\mu_2} - \frac{Y_1}{Y_2} \right| = \frac{|Y_2 \mu_1 - Y_1 \mu_2|}{Y_2 \mu_2} \le \frac{|Y_2 \mu_1 - Y_1 \mu_2|}{Y_2 (Y_2 - \beta_2)}.$$
The set $A = \{\, Y_1 - \beta_1, Y_1 + \beta_1 \,\} \times \{\, Y_2 - \beta_2, Y_2 + \beta_2 \,\}$ contains the value of $(\mu_1, \mu_2)$ that maximizes the numerator of the rightmost term. Observe that $|Y_2 \mu_1 - Y_1 \mu_2| \le Y_2 \beta_1 + |Y_1| \beta_2$ for any $(\mu_1, \mu_2) \in A$.

Proof of Theorem 3. By (1.19), we have
$$P\bigl\{\, |\tilde{Y}'_n(h) - \mu(h)| \le (b_h - a_h)\epsilon_{n,p} \,\bigr\} \ge (1 + p)/2$$
for $h = f, g$. It then follows from Bonferroni's inequality (see Miller [16, p. 8]) that
$$P\bigl\{\, |\tilde{Y}'_n(h) - \mu(h)| \le (b_h - a_h)\epsilon_{n,p} \text{ for } h = f, g \,\bigr\} \ge p.$$
The desired result now follows from Lemma 1.
Proof of Proposition 1. Set $Y_i = v(L_i)$ and $Y'_i = v(L'_i)$ for $1 \le i \le n$. Throughout, we use the algebraic identities
$$Z_n = \frac{1}{n} \sum_{i=1}^{n} Y_i^2 - \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} Y_i Y_j$$
and
$$Z'_n = \frac{m-1}{m} \left( \frac{1}{n'} \sum_{i=1}^{n'} (Y'_i)^2 - \frac{2}{n'(n'-1)} \sum_{i=1}^{n'} \sum_{j=i+1}^{n'} Y'_i Y'_j \right),$$
where $n' = n \wedge m$.

Fix a convex function $f$ and assume first that $n \ge m$. Using Jensen's inequality (see [1, p. 283]) and the fact that $Z'_n = \sigma^2$ when $n \ge m$, we have
$$E[f(Z'_n)] = f(\sigma^2) = f(E[Z_n]) \le E[f(Z_n)].$$
Now assume that $n < m$. We establish the desired inequality by mimicking the argument in Section 6 of [11]. For $y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$, set
$$z_n(y) = \frac{1}{n} \sum_{i=1}^{n} y_i^2 - \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} y_i y_j$$
and
$$z'_n(y) = \frac{m-1}{m}\, z_n(y).$$
Thus, $Z_n = z_n(Y)$ and $Z'_n = z'_n(Y')$, where $Y = (Y_1, Y_2, \ldots, Y_n)$ and $Y' = (Y'_1, Y'_2, \ldots, Y'_n)$. Also set
$$Q_n = \{\, q = (q_1, q_2, \ldots, q_n) \in \{0, 1, \ldots, n\}^n \colon q_1 + q_2 + \cdots + q_n = n \,\}$$
and denote by $\Pi_n$ the set of permutations of $\{1, 2, \ldots, n\}$. For $\pi \in \Pi_n$ and $y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$, we abuse notation slightly and denote the vector $(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(n)})$ by $\pi(y)$. Finally, for $y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$ and $q = (q_1, q_2, \ldots, q_n) \in Q_n$, set $\hat{z}_n(q, y) = z_n(\hat{y})$, where $\hat{y}_i = y_1$ for $1 \le i \le q_1$, $\hat{y}_i = y_2$ for $q_1 < i \le q_1 + q_2$, and so forth. For example, when $n = 5$ and $q = (2, 0, 2, 1, 0)$, we have $\hat{z}_n(q, y) = z_n(y_1, y_1, y_3, y_3, y_4)$. Similarly to Equation (6.6) in [11], we can write
$$E[f(Z_n)] = E\bigl[ f\bigl( z_n(Y) \bigr) \bigr] = E\bigl[ \bar{g}_n(Y', f) \bigr], \tag{4.4}$$
where $\bar{g}_n$ is a function of the form
$$\bar{g}_n(y, f) = \sum_{q \in Q_n} \sum_{\pi \in \Pi_n} p(q, \pi)\, f\bigl( \hat{z}_n(q, \pi(y)) \bigr)$$
with each coefficient $p(q, \pi)$ independent of $f$ and
$$\sum_{q \in Q_n} \sum_{\pi \in \Pi_n} p(q, \pi) = 1.$$
The representation in (4.4) can be interpreted as follows: a realization $y$ of the random vector $Y$ can be obtained by generating a realization $y'$ of the random vector $Y'$, permuting the components of $y'$ according to a randomly selected permutation $\pi \in \Pi_n$, and then forming a new vector of length $n$ by replacing each component of $\pi(y')$ by 0 or more copies of the component in accordance with a randomly selected vector $q \in Q_n$. The quantity $p(q, \pi)$ is the probability that the components of $y'$ are permuted according to $\pi$ and then the components of $\pi(y')$ are duplicated in accordance with $q$.

Symmetry considerations imply that there exist numbers $\{\, p(q) \colon q \in Q_n \,\}$ such that $p(q, \pi) = p(q)$ for $\pi \in \Pi_n$ and $q \in Q_n$, so that
$$\bar{g}_n(y, f) = \sum_{q \in Q_n} p(q) \left( \sum_{\pi \in \Pi_n} f\bigl( \hat{z}_n(q, \pi(y)) \bigr) \right).$$
For fixed $q = (q_1, q_2, \ldots, q_n) \in Q_n$ and $y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$, observe that
$$\begin{aligned}
\sum_{\pi \in \Pi_n} \hat{z}_n\bigl( q, \pi(y) \bigr)
&= \sum_{\pi \in \Pi_n} \left( \frac{1}{n} \sum_{i=1}^{n} q_i y_{\pi(i)}^2 - \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} q_i q_j y_{\pi(i)} y_{\pi(j)} - \frac{1}{n(n-1)} \sum_{i=1}^{n} q_i (q_i - 1) y_{\pi(i)}^2 \right) \\
&= (n-1)! \sum_{i=1}^{n} y_i^2 - \left( \frac{4(n-2)!}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} q_i q_j \right) \sum_{i=1}^{n} \sum_{j=i+1}^{n} y_i y_j - \frac{(n-1)!}{n(n-1)} \left( \sum_{i=1}^{n} q_i (q_i - 1) \right) \sum_{i=1}^{n} y_i^2 \\
&= \frac{n!}{n-1} \left( n - \frac{1}{n} \sum_{i=1}^{n} q_i^2 \right) z_n(y).
\end{aligned}$$
Denoting the identity function by $h$, we see that $\bar{g}_n(y, h) = c\, z_n(y)$ for some constant $c$. It follows from (4.4) and the unbiasedness of both $Z_n$ and $Z'_n$ for $\sigma^2$ that
$$E\bigl[ \bar{g}_n(Y', h) \bigr] = E[z_n(Y)] = \sigma^2 = \frac{m-1}{m}\, E\bigl[ z_n(Y') \bigr],$$
so that $c = (m-1)/m$ and thus $\bar{g}_n(y, h) = z'_n(y)$. Using Jensen's inequality, we have
$$\bar{g}_n(y, f) = \sum_{q \in Q_n} \sum_{\pi \in \Pi_n} p(q, \pi)\, f\bigl( \hat{z}_n(q, \pi(y)) \bigr) \ge f\left( \sum_{q \in Q_n} \sum_{\pi \in \Pi_n} p(q, \pi)\, \hat{z}_n(q, \pi(y)) \right) = f\bigl( \bar{g}_n(y, h) \bigr) = f\bigl( z'_n(y) \bigr).$$
Since $y$ is arbitrary,
$$E[f(Z_n)] = E\bigl[ \bar{g}_n(Y', f) \bigr] \ge E\bigl[ f\bigl( z'_n(Y') \bigr) \bigr] = E[f(Z'_n)],$$
and the desired result follows.
Proof of Theorem 4. Fix $t > 0$ and $n \in \{2, 3, \ldots, m\}$. We have
$$P\{\, Z'_n - \sigma^2 \ge t \,\} \le e^{-h(\sigma^2 + t)} E\bigl[ e^{hZ'_n} \bigr] \le e^{-h(\sigma^2 + t)} E\bigl[ e^{hZ_n} \bigr]$$
for $h > 0$, where the first inequality follows from (4.3) and the second inequality follows from Proposition 1. Krafft and Schmitz [13] show that
$$\inf_{h \ge 0} e^{-h(\sigma^2 + t)} E\bigl[ e^{hZ_n} \bigr] \le e^{-8\lfloor n/2 \rfloor t^2/(b-a)^4},$$
so that
$$P\{\, Z'_n - \sigma^2 \ge t \,\} \le e^{-8\lfloor n/2 \rfloor t^2/(b-a)^4}.$$
A symmetric argument establishes the same bound for $P\{\, \sigma^2 - Z'_n \ge t \,\}$, and the inequality in (1.38) follows. An argument as in (1.35) then establishes (1.39).
Proof of Theorem 5. Fix $t > 0$, $n \in \{2, 3, \ldots, m\}$, and $k \in \{2, 3, \ldots, |S| \wedge n\}$. Since for any random variable $0 \le X \le 1$ we have
$$\mathrm{Var}[X] = E[X^2] - E^2[X] \le E[X] - E^2[X] \le \max_{0 \le u \le 1} (u - u^2) = \frac{1}{4}, \tag{4.5}$$
it follows that
$$\sigma^2(S) \le \frac{(b-a)^2}{4} \le \frac{|S|(b-a)^2}{4(|S|-1)} \le \frac{d(b-a)^2}{4(d-1)}.$$
Using this result, Theorem 4, and an argument as in the derivation of (1.29), we find that
$$\begin{aligned}
&P\left\{ Z''_n(S) - \sigma^2(S) \ge \frac{d}{d-1}\,t + \frac{d(b-a)^2}{4(d-1)^2} \,\Big|\, I'_n = k \right\} \\
&\qquad \le P\left\{ Z''_n(S) - \sigma^2(S) \ge \frac{d}{d-1}\,t + \frac{\sigma^2(S)}{d-1} \,\Big|\, I'_n = k \right\} \\
&\qquad = P\left\{ \frac{d-1}{d}\, Z''_n(S) - \sigma^2(S) \ge t \,\Big|\, I'_n = k \right\} \\
&\qquad \le P\left\{ \frac{|S|-1}{|S|}\, Z''_n(S) - \sigma^2(S) \ge t \,\Big|\, I'_n = k \right\} \le e^{-8\lfloor k/2 \rfloor t^2/(b-a)^4}.
\end{aligned}$$
Similarly, since $\frac{|S|-1}{|S|}\, Z''_n(S) \le Z''_n(S)$,
$$\begin{aligned}
&P\left\{ Z''_n(S) - \sigma^2(S) \le -\frac{d}{d-1}\,t - \frac{d(b-a)^2}{4(d-1)^2} \,\Big|\, I'_n = k \right\} \\
&\qquad \le P\left\{ \frac{|S|-1}{|S|}\, Z''_n(S) - \sigma^2(S) \le -t \,\Big|\, I'_n = k \right\} \le e^{-8\lfloor k/2 \rfloor t^2/(b-a)^4}.
\end{aligned}$$
Thus,
$$P\left\{ |Z''_n(S) - \sigma^2(S)| \ge \frac{d}{d-1}\,t + \frac{d(b-a)^2}{4(d-1)^2} \,\Big|\, I'_n = k \right\} \le 2e^{-8\lfloor k/2 \rfloor t^2/(b-a)^4},$$
and (1.41) follows directly upon substituting $t = \alpha$. The inequality in (1.42) now follows by an argument as in (1.35).
Acknowledgements

The author wishes to thank Joe Hellerstein, whose work helped motivate this paper. The author also benefited from comments by Hadas Shachnai and an anonymous referee of an earlier version of this paper.
References

[1] P. Billingsley, "Probability and Measure," second ed., Wiley, New York, 1986.
[2] W. G. Cochran, "Sampling Techniques," third ed., Wiley, New York, 1977.
[3] H. Cramer, "Mathematical Methods of Statistics," Princeton University Press, Princeton, New Jersey, 1946.
[4] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz, Bifocal sampling for skew-resistant join size estimation, in "Proc. 1996 ACM SIGMOD Intl. Conf. Management of Data," pp. 271-281, ACM Press, 1996.
[5] P. J. Haas, Large-sample and deterministic confidence intervals for online aggregation, in "Proc. Ninth Intl. Conf. Scientific and Statist. Database Management," pp. 51-63, IEEE Computer Society Press, 1997.
[6] P. J. Haas and J. M. Hellerstein, Ripple joins for online aggregation, in "Proc. 1999 ACM SIGMOD Intl. Conf. Management of Data," pp. 287-298, ACM Press, 1999.
[7] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N. Swami, Selectivity and cost estimation for joins based on random sampling, J. Comput. System Sci. 52 (1996), 550-569.
[8] P. J. Haas, J. F. Naughton, and A. N. Swami, On the relative cost of sampling for join selectivity estimation, in "Proc. Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems," pp. 14-24, ACM Press, 1994.
[9] J. M. Hellerstein, P. J. Haas, and H. J. Wang, Online aggregation, in "Proc. 1997 ACM SIGMOD Intl. Conf. Management of Data," pp. 171-182, ACM Press, 1997.
[10] W. Hoeffding, A class of statistics with asymptotically normal distribution, Ann. Math. Statist. 19 (1948), 293-325.
[11] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. 58 (1963), 13-30.
[12] W. Hou, G. Ozsoyoglu, and B. Taneja, Processing aggregate relational queries with hard time constraints, in "Proc. 1989 ACM SIGMOD Intl. Conf. Management of Data," pp. 68-77, ACM Press, 1989.
[13] O. Krafft and N. Schmitz, A note on Hoeffding's inequality, J. Amer. Statist. Assoc. 64 (1969), 907-912.
[14] R. J. Lipton and J. F. Naughton, Query size estimation by adaptive sampling, in "Proc. Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems," pp. 40-46, ACM Press, 1990.
[15] R. J. Lipton, J. F. Naughton, and D. A. Schneider, Practical selectivity estimation through adaptive sampling, in "Proc. 1990 ACM SIGMOD Intl. Conf. Management of Data," pp. 1-11, ACM Press, 1990.
[16] R. G. Miller, Jr., "Simultaneous Statistical Inference," second ed., Springer-Verlag, New York, 1981.
[17] P. K. Pathak, Sufficiency in sampling theory, Ann. Math. Statist. 35 (1964), 795-808.
[18] P. K. Pathak, An extension of an inequality of Hoeffding, Studia Sci. Math. Hungar. 10 (1975), 73-74.
[19] S. Seshadri, "Probabilistic Methods in Query Processing," Ph.D. Dissertation, Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, 1992.