RJ 10040 (90536) August 22, 1996
Computer Science (Revised 7/16/99)

Research Report

HOEFFDING INEQUALITIES FOR JOIN-SELECTIVITY ESTIMATION AND ONLINE AGGREGATION

Peter J. Haas
IBM Research Division
Almaden Research Center
650 Harry Road
San Jose, CA 95120-6099
e-mail: [email protected]

LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

ABSTRACT: We extend Hoeffding's inequalities for simple averages of random variables to the case of cross-product averages. We also survey some new and existing Hoeffding inequalities for estimators of the mean, variance, and standard deviation of a subpopulation. These results are applicable to two problems in object-relational database management systems: fixed-precision estimation of the selectivity of a join and online processing of aggregation queries. For the first problem, the new results can be used to modify the asymptotically efficient sampling-based procedures of Haas, Naughton, Seshadri, and Swami so that there is a guaranteed upper bound on the number of sampling steps. For the second problem, the inequalities can be used to develop conservative confidence intervals for online aggregation; such intervals avoid the large intermediate storage requirements and undercoverage problems of intervals based on large-sample theory.

Keywords: Hoeffding inequality, join-selectivity estimation, query optimization, database sampling, online aggregation

1. Introduction and Summary

In many applications, it is necessary to estimate a population mean
$$ \mu = \frac{1}{m} \sum_{i=1}^m v(i) $$
using random sampling; here $m > 1$ is a fixed integer and $v$ is a real-valued function defined on the set $\{1, 2, \ldots, m\}$. Denote by $L_1, L_2, \ldots, L_n$ a random sample drawn uniformly with replacement from the set $\{1, 2, \ldots, m\}$ and by $L'_1, L'_2, \ldots, L'_n$ a random sample drawn uniformly without replacement. For $n \ge 1$, the estimators
$$ \bar Y_n = \frac{1}{n} \sum_{i=1}^n v(L_i) \qquad (1.1) $$
and
$$ \bar Y'_n = \frac{1}{n \wedge m} \sum_{i=1}^{n \wedge m} v(L'_i) \qquad (1.2) $$
are each unbiased for $\mu$, in that $E[\bar Y_n] = E[\bar Y'_n] = \mu$. (Here $n \wedge m$ denotes the minimum of $n$ and $m$.) In a famous paper, Hoeffding [11, Theorem 4] shows that
$$ E[f(\bar Y'_n)] \le E[f(\bar Y_n)] \qquad (1.3) $$
for $n \ge 1$ and any convex function $f$. In particular, it follows by taking $f(x) = (x - \mu)^2$ in (1.3) that $\mathrm{Var}[\bar Y'_n] \le \mathrm{Var}[\bar Y_n]$. It is frequently useful to bound the probability that $\bar Y_n$ (resp., $\bar Y'_n$) deviates from $\mu$ by more than a specified amount. Suppose that the only information available prior to sampling consists of lower and upper bounds $a$ and $b$, respectively, on the function $v$:
$$ a \le v(i) \le b \qquad (1.4) $$
for $1 \le i \le m$.
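The following minimal Python sketch (not part of the original report) illustrates the two estimators (1.1) and (1.2) and the variance ordering that follows from (1.3). The population values are synthetic, and the function names are ours.

```python
import random
import statistics

# Hypothetical population: m = 100 values v(1), ..., v(m) in [a, b] = [0, 10].
random.seed(0)
m = 100
v = [random.uniform(0, 10) for _ in range(m)]
mu = sum(v) / m

def y_bar(n):
    """Estimator (1.1): average of n draws made uniformly with replacement."""
    return statistics.mean(random.choices(v, k=n))

def y_bar_prime(n):
    """Estimator (1.2): average of min(n, m) draws made uniformly without replacement."""
    return statistics.mean(random.sample(v, min(n, m)))

# Both estimators are unbiased; (1.3) implies the without-replacement
# version has the smaller variance.  Check by simulation:
reps = [(y_bar(30), y_bar_prime(30)) for _ in range(20000)]
with_rep, without_rep = zip(*reps)
print(mu, statistics.mean(with_rep), statistics.mean(without_rep))
print(statistics.variance(with_rep), statistics.variance(without_rep))
```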
In [11], Hoeffding not only establishes (1.3), but also shows that
$$ P\{ |\bar Y_n - \mu| \ge t \} \le 2 e^{-2 n t^2 / (b-a)^2} \qquad (1.5) $$
and
$$ P\{ |\bar Y'_n - \mu| \ge t \} \le 2 e^{-2 n' t^2 / (b-a)^2} \qquad (1.6) $$
for $t > 0$ and $n \ge 1$, where
$$ n' = \begin{cases} n & \text{if } n < m \\ +\infty & \text{if } n \ge m. \end{cases} $$

In the following, we extend Hoeffding's results by allowing $v$ to be a function of more than one variable and establishing analogues of inequalities (1.3), (1.5), and (1.6) for "cross-product averages." We also survey some new and existing Hoeffding inequalities for the case in which we wish to estimate the mean, variance, or standard deviation of the numbers $\{ v(i) : i \in S \}$, where $S \subseteq \{1, 2, \ldots, m\}$. We refer to $S$ as a "subpopulation" and assume that the membership or non-membership in $S$ of an element $i$ is discovered only when $i$ is sampled.

Our results, which are stated formally in the remainder of this section, have a number of applications in object-relational database management systems (ORDBMSs). In Section 2 we consider the problem of using sampling to estimate the selectivity of a join to within a prespecified precision. Haas, Naughton, Seshadri, and Swami [7] have previously provided an asymptotically efficient procedure called fp_cross for fixed-precision estimation of selectivities. We show how the new inequalities can be used to modify fp_cross so that there is a guaranteed upper bound on the number of sampling steps executed by the procedure. In Section 3, we show that our results are pertinent to online aggregation processing as described in Hellerstein, Haas, and Wang [9]. Section 4 contains the proofs of our results.

1.1. Cross-Product Averages

Let $m_1, m_2, \ldots, m_K$ ($K \ge 1$) be finite positive integers and $v$ be a real-valued function defined on the set $\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_K$, where $\Lambda_k = \{1, 2, \ldots, m_k\}$ for $1 \le k \le K$. To avoid trivialities we assume throughout that $m_k > 1$ for $1 \le k \le K$. Suppose we wish to estimate the population mean
$$ \mu = \frac{1}{m_1 m_2 \cdots m_K} \sum_{l_1=1}^{m_1} \sum_{l_2=1}^{m_2} \cdots \sum_{l_K=1}^{m_K} v(l_1, l_2, \ldots, l_K) \qquad (1.7) $$
using random sampling. For $1 \le k \le K$, denote by $L_{k1}, L_{k2}, \ldots, L_{kn}$ a random sample of size $n$ drawn uniformly with replacement from the set $\Lambda_k$. Similarly, denote by $L'_{k1}, L'_{k2}, \ldots, L'_{kn}$ a random sample drawn uniformly without replacement. We assume throughout that the sampling mechanisms for sets $\Lambda_1, \Lambda_2, \ldots, \Lambda_K$ are mutually independent. Set $N = \{1, 2, \ldots\}^K$ and, for $n = (n_1, n_2, \ldots, n_K) \in N$, define the cross-product averages $\tilde Y_n$ and $\tilde Y'_n$ by
$$ \tilde Y_n = \frac{1}{n_1 n_2 \cdots n_K} \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \cdots \sum_{i_K=1}^{n_K} v(L_{1 i_1}, L_{2 i_2}, \ldots, L_{K i_K}) \qquad (1.8) $$
and
$$ \tilde Y'_n = \Bigl( \prod_{k=1}^K (n_k \wedge m_k) \Bigr)^{-1} \sum_{i_1=1}^{n_1 \wedge m_1} \sum_{i_2=1}^{n_2 \wedge m_2} \cdots \sum_{i_K=1}^{n_K \wedge m_K} v(L'_{1 i_1}, L'_{2 i_2}, \ldots, L'_{K i_K}), \qquad (1.9) $$
respectively. It is straightforward to show that $E[\tilde Y_n] = E[\tilde Y'_n] = \mu$ for $n \in N$. Moreover, $\tilde Y'_n = \mu$ for all $n = (n_1, n_2, \ldots, n_K) \in N$ such that $n_k \ge m_k$ for $1 \le k \le K$. Although $\tilde Y_n$ can be viewed as a simple average of $n_1 n_2 \cdots n_K$ random variables (and similarly for $\tilde Y'_n$), the inequalities (1.3), (1.5), and (1.6) are not directly applicable. The problem is that the random variables that make up the average are not mutually independent: in general, $v(L_{1 i_1}, L_{2 i_2}, \ldots, L_{K i_K})$ and $v(L_{1 j_1}, L_{2 j_2}, \ldots, L_{K j_K})$ are dependent unless $i_k \ne j_k$ for $1 \le k \le K$. An analogous remark applies to $\tilde Y'_n$. Our first result extends the inequality in (1.3).

Theorem 1. Let $\tilde Y_n$ and $\tilde Y'_n$ be defined as in (1.8) and (1.9), respectively. Then
$$ E[f(\tilde Y'_n)] \le E[f(\tilde Y_n)] \qquad (1.10) $$
for $n \in N$ and any convex function $f$. In particular, $\mathrm{Var}[\tilde Y'_n] \le \mathrm{Var}[\tilde Y_n]$.

Since $\tilde Y_n$ and $\tilde Y'_n$ are each unbiased, it follows from Theorem 1 that $\tilde Y'_n$ has a lower mean squared error than $\tilde Y_n$.
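As an illustration of the cross-product averages (1.8) and (1.9) and of the variance ordering in Theorem 1, here is a small Python simulation (ours, not from the report) for $K = 2$ with a synthetic 0/1-valued function $v$:

```python
import itertools
import random
import statistics

# Hypothetical two-dimensional population (K = 2): v is defined on
# {1,...,m1} x {1,...,m2}; here v(l1, l2) mimics a 0/1 join indicator.
random.seed(1)
m1, m2 = 20, 30
table = {(i, j): float(random.random() < 0.1)
         for i in range(1, m1 + 1) for j in range(1, m2 + 1)}
mu = sum(table.values()) / (m1 * m2)

def y_tilde(n1, n2):
    """Cross-product average (1.8): sample each coordinate independently
    with replacement, then average v over all n1*n2 index pairs."""
    rows = [random.randrange(1, m1 + 1) for _ in range(n1)]
    cols = [random.randrange(1, m2 + 1) for _ in range(n2)]
    return statistics.mean(table[(i, j)] for i, j in itertools.product(rows, cols))

def y_tilde_prime(n1, n2):
    """Cross-product average (1.9): same, but each coordinate is sampled
    without replacement (capped at the population sizes)."""
    rows = random.sample(range(1, m1 + 1), min(n1, m1))
    cols = random.sample(range(1, m2 + 1), min(n2, m2))
    return statistics.mean(table[(i, j)] for i, j in itertools.product(rows, cols))

est = [y_tilde(5, 5) for _ in range(5000)]
est_p = [y_tilde_prime(5, 5) for _ in range(5000)]
print(mu, statistics.mean(est), statistics.mean(est_p))      # all close (unbiasedness)
# Theorem 1 implies Var[Y~'] <= Var[Y~]; the sample variances typically agree:
print(statistics.variance(est), statistics.variance(est_p))
```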
Our next result, Theorem 2 below, generalizes the inequalities in (1.5) and (1.6) and bounds the probability that $\tilde Y_n$ (resp., $\tilde Y'_n$) deviates from $\mu$ by more than a specified amount. The inequalities require a priori knowledge only of lower and upper bounds $a$ and $b$, respectively, on the function $v$:
$$ a \le v(l_1, l_2, \ldots, l_K) \le b \qquad (1.11) $$
for $(l_1, l_2, \ldots, l_K) \in \Lambda$. For $n \in N$, set
$$ m(n) = \min_{1 \le k \le K} n_k \qquad \text{and} \qquad m'(n) = \min_{1 \le k \le K} n'_k, $$
where
$$ n'_k = \begin{cases} n_k & \text{if } n_k < m_k \\ +\infty & \text{if } n_k \ge m_k \end{cases} $$
for $1 \le k \le K$.

Theorem 2. Let $\tilde Y_n$ and $\tilde Y'_n$ be defined as in (1.8) and (1.9), respectively, and let $a$ and $b$ satisfy (1.11). Then
$$ P\{ |\tilde Y_n - \mu| \ge t \} \le 2 e^{-2 m(n) t^2 / (b-a)^2} \qquad (1.12) $$
and
$$ P\{ |\tilde Y'_n - \mu| \ge t \} \le 2 e^{-2 m'(n) t^2 / (b-a)^2} \qquad (1.13) $$
for $t > 0$ and $n \in N$.

Clearly, the tighter the bounds $a$ and $b$ on the function $v$, the tighter the above inequalities. Suppose as a worst-case scenario that the function $v$ depends only upon the first of its $K$ arguments and that $n_1 \le n_k$ for $2 \le k \le K$. Then the cross-product averages defined in (1.8) and (1.9) reduce to ordinary averages as in (1.1) and (1.2), and the inequalities (1.12) and (1.13) reduce to (1.5) and (1.6). These latter inequalities represent the best available Hoeffding bounds for this situation. In this sense, the inequalities in (1.12) and (1.13) can be viewed as tight worst-case Hoeffding bounds.

The bound in (1.13) sometimes can be tightened as follows. Fix $n \in N$ and suppose that for some positive integer $r = r(n) < K$ we have $n_k < m_k$ for $1 \le k \le r$ and $n_k \ge m_k$ for $r < k \le K$. Set
$$ w_n(l_1, l_2, \ldots, l_r) = \frac{1}{m_{r+1} m_{r+2} \cdots m_K} \sum_{l_{r+1}=1}^{m_{r+1}} \sum_{l_{r+2}=1}^{m_{r+2}} \cdots \sum_{l_K=1}^{m_K} v(l_1, l_2, \ldots, l_r, l_{r+1}, l_{r+2}, \ldots, l_K) \qquad (1.14) $$
for $(l_1, l_2, \ldots, l_r) \in \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_r$. Let $a(n)$ and $b(n)$ be lower and upper bounds, respectively, on the function $w_n$. Applying Theorem 2 to the $r$-dimensional cross-product average of the function $w_n$, we find that
$$ P\{ |\tilde Y'_n - \mu| \ge t \} \le 2 e^{-2 m'(n) t^2 / (b(n)-a(n))^2} \qquad (1.15) $$
for $t > 0$ and $n \in N$. The key point is that it is sometimes possible to choose $a(n)$ and $b(n)$ such that $a(n) > a$ and/or $b(n) < b$, so that the bound in (1.15) is tighter than the bound in (1.13); see Section 3.1 below for an example. Of course, this approach to tightening the bound in (1.13) can be applied with obvious modifications when $\{ k : n_k < m_k \}$ is an arbitrary strict subset of $\{1, 2, \ldots, K\}$.

When $n = (n, n, \ldots, n)$ for some $n \ge 1$, we write $\tilde Y_n$ and $\tilde Y'_n$ instead of $\tilde Y_{(n,\ldots,n)}$ and $\tilde Y'_{(n,\ldots,n)}$, respectively, so that
$$ \tilde Y_n = \frac{1}{n^K} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} \cdots \sum_{i_K=1}^{n} v(L_{1 i_1}, L_{2 i_2}, \ldots, L_{K i_K}) \qquad (1.16) $$
and
$$ \tilde Y'_n = \Bigl( \prod_{k=1}^K (n \wedge m_k) \Bigr)^{-1} \sum_{i_1=1}^{n \wedge m_1} \sum_{i_2=1}^{n \wedge m_2} \cdots \sum_{i_K=1}^{n \wedge m_K} v(L'_{1 i_1}, L'_{2 i_2}, \ldots, L'_{K i_K}). \qquad (1.17) $$
In this case, the inequalities (1.12) and (1.13) take the form
$$ P\{ |\tilde Y_n - \mu| \ge t \} \le 2 e^{-2 n t^2 / (b-a)^2} \qquad (1.18) $$
and
$$ P\{ |\tilde Y'_n - \mu| \ge t \} \le \begin{cases} 2 e^{-2 n t^2 / (b-a)^2} & \text{if } n < \max(m_1, m_2, \ldots, m_K) \\ 0 & \text{if } n \ge \max(m_1, m_2, \ldots, m_K) \end{cases} $$
for $t > 0$ and $n \ge 1$.
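The quantities $m(n)$, $m'(n)$ and the resulting tail bounds of Theorem 2 are easy to compute. A small Python helper (ours, with illustrative arguments) makes the bookkeeping explicit:

```python
import math

def hoeffding_bound(n, m, t, a, b, with_replacement=True):
    """Right-hand side of (1.12) / (1.13) for sample sizes n = (n1,...,nK)
    and population sizes m = (m1,...,mK), with a <= v <= b."""
    if with_replacement:
        exponent_n = min(n)                                    # m(n)
    else:
        # m'(n): coordinates whose population is exhausted contribute +infinity
        primes = [nk if nk < mk else math.inf for nk, mk in zip(n, m)]
        exponent_n = min(primes)
    if math.isinf(exponent_n):
        return 0.0                                             # estimator equals mu exactly
    return 2.0 * math.exp(-2.0 * exponent_n * t**2 / (b - a)**2)

# Example: K = 3, v bounded in [0, 1], deviation t = 0.2.
print(hoeffding_bound((50, 80, 200), (100, 100, 100), 0.2, 0.0, 1.0))
# Without replacement, exhausting every coordinate drives the bound to 0:
print(hoeffding_bound((150, 200, 300), (100, 100, 100), 0.2, 0.0, 1.0,
                      with_replacement=False))
```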
Hoeffding actually establishes his inequalities (1.3), (1.5), and (1.6) for an arbitrary collection of mutually independent, real-valued random variables (not necessarily discrete). The inequalities in (1.10), (1.12), and (1.13) also can be shown to hold at this level of generality. Hoeffding also establishes one-sided bounds, such as the following one-sided analogues of (1.5):
$$ P\{ \bar Y_n - \mu \ge t \} \le e^{-2 n t^2 / (b-a)^2} \qquad \text{and} \qquad P\{ \mu - \bar Y_n \ge t \} \le e^{-2 n t^2 / (b-a)^2}. $$
For each two-sided bound presented in this paper, there exists a pair of one-sided analogues as in the above example.

In applications, the foregoing bounds are often "inverted" for purposes of obtaining confidence intervals. For example, it follows from (1.13) that for fixed $p \in (0,1)$ and $n \in N$,
$$ P\{ |\tilde Y'_n - \mu| \le \epsilon \} \ge p, \qquad (1.19) $$
where
$$ \epsilon = (b-a) \Bigl( \frac{1}{2 m'(n)} \ln \frac{2}{1-p} \Bigr)^{1/2}. \qquad (1.20) $$

Our final result gives an analogue of (1.19) for the ratio of two cross-product averages. Consider two real-valued functions $f$ and $g$, each defined on $\Lambda$, along with finite constants $a_f$, $b_f$, $a_g$ ($\ge 0$), and $b_g$ such that
$$ a_f \le f(l_1, \ldots, l_K) \le b_f \qquad \text{and} \qquad a_g \le g(l_1, \ldots, l_K) \le b_g \qquad (1.21) $$
for $(l_1, \ldots, l_K) \in \Lambda$. For $p \in (0,1)$ and $n \in N$, set
$$ \epsilon_{n,p} = \Bigl( \frac{1}{2 m'(n)} \ln \frac{4}{1-p} \Bigr)^{1/2}. $$
Also set
$$ \delta_{n,p} = \epsilon_{n,p} \left( \frac{\tilde Y'_n(g)\,(b_f - a_f) + |\tilde Y'_n(f)|\,(b_g - a_g)}{\tilde Y'_n(g) \bigl( \tilde Y'_n(g) - (b_g - a_g)\, \epsilon_{n,p} \bigr)} \right) \qquad (1.22) $$
if $\tilde Y'_n(g) > (b_g - a_g)\, \epsilon_{n,p}$; otherwise, set $\delta_{n,p} = \infty$. In (1.22), $\tilde Y'_n(f)$ is defined as in (1.9), but with $v$ replaced by $f$, and similarly for $\tilde Y'_n(g)$. Define corresponding population averages $\mu(f)$ and $\mu(g)$ analogously. Take $0/0 = 0$.

Theorem 3. Suppose that (1.21) holds and $\mu(g) > 0$. Then
$$ P\left\{ \left| \frac{\tilde Y'_n(f)}{\tilde Y'_n(g)} - \frac{\mu(f)}{\mu(g)} \right| \le \delta_{n,p} \right\} \ge p $$
for $p \in (0,1)$ and $n \in N$, where $\delta_{n,p}$ is defined by (1.22).

Theorem 3 applies to the case of sampling without replacement, but analogous results hold for sampling with replacement. The previously-mentioned techniques for tightening bounds also can be applied in the current setting. Confidence intervals for other aggregates such as VARIANCE also can be obtained by adapting the techniques used to establish Theorem 3.
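The inverted bounds (1.20) and (1.22) translate directly into conservative confidence-interval half-widths. The following Python sketch (ours; the numeric arguments are illustrative) computes both:

```python
import math

def half_width(n_prime_min, a, b, p):
    """Half-width (1.20): with probability >= p, |Y~'_n - mu| <= eps.
    Here n_prime_min plays the role of m'(n)."""
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n_prime_min))

def ratio_half_width(yf, yg, af, bf, ag, bg, n_prime_min, p):
    """Half-width delta_{n,p} of (1.22) for the ratio Y~'_n(f) / Y~'_n(g)."""
    eps = math.sqrt(math.log(4.0 / (1.0 - p)) / (2.0 * n_prime_min))
    if yg <= (bg - ag) * eps:
        return math.inf                      # interval not yet informative
    return eps * (yg * (bf - af) + abs(yf) * (bg - ag)) / (yg * (yg - (bg - ag) * eps))

# 95% interval for a mean with v bounded in [0, 10] after m'(n) = 400 steps:
print(half_width(400, 0.0, 10.0, 0.95))
# Ratio estimate 7.5 = Y~'(f)/Y~'(g) with f in [0, 20] and g in [0, 1]:
print(ratio_half_width(3.0, 0.4, 0.0, 20.0, 0.0, 1.0, 400, 0.95))
```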
1.2. Subpopulations

We now consider the problem of estimating the mean, variance, and standard deviation of the numbers $\{ v(i) : i \in S \}$, where $v$ is a real-valued function defined on $\{1, 2, \ldots, m\}$ (with $m > 1$) and $S$ is a nonempty subset of $\{1, 2, \ldots, m\}$. We assume that the set $S$ is specified by means of an indicator function $u$:
$$ u(i) = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \in \{1, 2, \ldots, m\} - S \end{cases} \qquad (1.23) $$
for $1 \le i \le m$. We also assume that it is not possible to sample directly from $S$; as in previous sections, each element in the sample is selected randomly and uniformly from the set $\{1, 2, \ldots, m\}$. The function $u$ is then applied to determine whether the sampled element is a member of $S$.

1.2.1. Mean

Denote by $\mu(S)$ the average of the function $v$ over the set $S$:
$$ \mu(S) = \frac{1}{|S|} \sum_{i \in S} v(i) = \frac{\sum_{i=1}^m u(i) v(i)}{\sum_{i=1}^m u(i)}, \qquad (1.24) $$
where $|S|$ denotes the number of elements in $S$. We refer to $\mu(S)$ as the subpopulation mean over $S$. Define random indexes $L_1, L_2, \ldots, L_n$ and $L'_1, L'_2, \ldots, L'_n$ as at the beginning of Section 1. For $n \ge 1$, two possible estimators of $\mu(S)$ are
$$ \bar Y_n(S) = \frac{1}{I_n} \sum_{i=1}^n u(L_i) v(L_i) \qquad (1.25) $$
and
$$ \bar Y'_n(S) = \frac{1}{I'_n} \sum_{i=1}^{n \wedge m} u(L'_i) v(L'_i), \qquad (1.26) $$
where
$$ I_n = \sum_{i=1}^n u(L_i) \qquad (1.27) $$
and
$$ I'_n = \sum_{i=1}^{n \wedge m} u(L'_i). \qquad (1.28) $$
We take $\bar Y_n(S) = 0$ when $I_n = 0$, and similarly for $\bar Y'_n(S)$. Neither $\bar Y_n(S)$ nor $\bar Y'_n(S)$ is unbiased for $\mu(S)$. As $n$ becomes large, however, the bias of each estimator approaches 0 and each estimator converges to $\mu(S)$ with probability 1. Because $\bar Y_n(S)$ and $\bar Y'_n(S)$ are each biased estimators, it appears quite difficult to develop Hoeffding inequalities analogous to (1.12) and (1.13). In practice, however, it suffices to bound the conditional probability that $\bar Y_n(S)$ (resp., $\bar Y'_n(S)$) deviates from $\mu(S)$ by more than a specified amount, given the observed value of $I_n$ (resp., $I'_n$). The key observation (cf. Section 2.12 in Cochran [2]) is as follows: given that $I_n = k$ (where $k > 0$), the estimator $\bar Y_n(S)$ is distributed as $(1/k) \sum_{i=1}^k v(\hat L_i)$, where $\{ \hat L_1, \hat L_2, \ldots, \hat L_k \}$ is a random sample of size $k$ drawn from the set $S$ uniformly with replacement. An analogous statement holds for the conditional distribution of $\bar Y'_n(S)$ given $I'_n = k$. Thus, $\bar Y_n(S)$ and $\bar Y'_n(S)$ are conditionally unbiased for $\mu(S)$ and, using the inequalities in (1.5) and (1.6), we obtain the bounds
$$ P\{ |\bar Y_n(S) - \mu(S)| \ge t \mid I_n = k \} \le 2 e^{-2 k t^2 / (b-a)^2} $$
for $t > 0$, $n \ge 1$, and $1 \le k \le n$, and
$$ P\{ |\bar Y'_n(S) - \mu(S)| \ge t \mid I'_n = k \} \le 2 e^{-2 k t^2 / (b-a)^2} \qquad (1.29) $$
for $t > 0$, $n \ge 1$, and $1 \le k \le |S| \wedge n$.

1.2.2. Variance and Standard Deviation: Sampling with Replacement

Denote by $\sigma^2(S)$ the variance of the function $v$ over the set $S$:
$$ \sigma^2(S) = \frac{1}{|S|} \sum_{i \in S} \bigl( v(i) - \mu(S) \bigr)^2 = \frac{\sum_{i=1}^m u(i) \bigl( v(i) - \mu(S) \bigr)^2}{\sum_{i=1}^m u(i)}. \qquad (1.30) $$
We refer to $\sigma^2(S)$ and $\sigma(S)$ as the subpopulation variance over $S$ and subpopulation standard deviation over $S$, respectively.

First suppose that $S = \{1, 2, \ldots, m\}$ and that this fact is known a priori. (The value of $m$ need not be known.) In this case we write $\mu$ for $\mu(S)$ and $\sigma^2$ for $\sigma^2(S)$. Fix $n \ge 2$ and define the sample average $\bar Y_n$ as in (1.1). It is well known that the estimator
$$ Z_n = \frac{1}{n-1} \sum_{i=1}^n \bigl( v(L_i) - \bar Y_n \bigr)^2 \qquad (1.31) $$
is unbiased for $\sigma^2$; see, for example, Cramér [3, p. 347]. As pointed out by Hoeffding [10], $Z_n$ can be written in the form
$$ Z_n = \frac{1}{n(n-1)} \sum_{i \ne j} g\bigl( v(L_i), v(L_j) \bigr), $$
where $g(x,y) = (x-y)^2/2$; that is, $Z_n$ is a "one-sample $U$-statistic" as defined in [10, 11]. Hoeffding's inequalities for $U$-statistics [11, Section 5a] therefore can be applied in a straightforward manner to yield inequalities for $Z_n$. These computations, which have been carried out by Krafft and Schmitz [13], yield the inequalities
$$ P\{ Z_n - \sigma^2 \ge t \} \le e^{-8 \lfloor n/2 \rfloor t^2 / (b-a)^4}, \qquad (1.32) $$
$$ P\{ \sigma^2 - Z_n \ge t \} \le e^{-8 \lfloor n/2 \rfloor t^2 / (b-a)^4}, \qquad (1.33) $$
and
$$ P\{ |Z_n - \sigma^2| \ge t \} \le 2 e^{-8 \lfloor n/2 \rfloor t^2 / (b-a)^4} \qquad (1.34) $$
for $t > 0$ and $n \ge 2$, where $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$.

The random variable $\sqrt{Z_n}$ often serves as an estimator of the standard deviation $\sigma$. To obtain a bound analogous to that in (1.34), observe that $\sqrt{x} - \sqrt{y} = (x - 2\sqrt{xy} + y)^{1/2} \le \sqrt{x - y}$ for $0 \le y \le x$. Using (1.32), we find that
$$ P\bigl\{ \sqrt{Z_n} - \sigma \ge t \bigr\} \le P\bigl\{ \sqrt{Z_n - \sigma^2} \ge t \bigr\} = P\{ Z_n - \sigma^2 \ge t^2 \} \le e^{-8 \lfloor n/2 \rfloor t^4 / (b-a)^4}. \qquad (1.35) $$
A similar argument using (1.33) yields the same bound for $P\{ \sigma - \sqrt{Z_n} \ge t \}$, and combination of the two bounds yields the inequality
$$ P\bigl\{ |\sqrt{Z_n} - \sigma| \ge t \bigr\} \le 2 e^{-8 \lfloor n/2 \rfloor t^4 / (b-a)^4}. \qquad (1.36) $$

Now suppose that $S$ is an arbitrary nonempty subset of $\{1, 2, \ldots, m\}$. The quantity $\sigma^2(S)$ can be estimated by
$$ Z_n(S) = \frac{1}{I_n - 1} \sum_{i=1}^n u(L_i) \bigl( v(L_i) - \bar Y_n(S) \bigr)^2, $$
where $u$, $\bar Y_n(S)$, and $I_n$ are defined as in (1.23), (1.25), and (1.27), respectively. Using the above results and arguing as in the case of the subpopulation mean, we obtain the conditional inequalities
$$ P\{ |Z_n(S) - \sigma^2(S)| \ge t \mid I_n = k \} \le 2 e^{-8 \lfloor k/2 \rfloor t^2 / (b-a)^4} $$
and
$$ P\bigl\{ |\sqrt{Z_n(S)} - \sigma(S)| \ge t \mid I_n = k \bigr\} \le 2 e^{-8 \lfloor k/2 \rfloor t^4 / (b-a)^4} $$
for $t > 0$, $n \ge 2$, and $2 \le k \le n$.

1.2.3. Variance and Standard Deviation: Sampling without Replacement

Suppose at first that $S = \{1, 2, \ldots, m\}$, that this fact is known a priori, and that $m$ is known. Fix $n \ge 2$ and define the sample average $\bar Y'_n$ as in (1.2). It is well known that the estimator
$$ Z'_n = \frac{m-1}{m} \cdot \frac{1}{(n \wedge m) - 1} \sum_{i=1}^{n \wedge m} \bigl( v(L'_i) - \bar Y'_n \bigr)^2 \qquad (1.37) $$
is unbiased for $\sigma^2$; see, for example, [2, Theorem 2.4]. In analogy with (1.3), we have the following result.

Proposition 1. Let $Z_n$ and $Z'_n$ be defined as in (1.31) and (1.37), respectively. Then
$$ E[f(Z'_n)] \le E[f(Z_n)] $$
for $n \ge 2$ and any convex function $f$. In particular, $\mathrm{Var}[Z'_n] \le \mathrm{Var}[Z_n]$.

Pathak [18] gives a proof of this result based on the Rao–Blackwell inequality. In Section 4 we give a "first principles" proof along the lines of the original argument used by Hoeffding to establish (1.3).
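The unbiasedness of (1.31) and (1.37), including the $(m-1)/m$ correction, can be verified exactly on a tiny population by enumerating all samples. A minimal Python check (ours):

```python
import itertools
import statistics

# Tiny population so that exact expectations can be computed by enumeration.
v = [1.0, 3.0, 4.0, 8.0]
m = len(v)
sigma2 = statistics.pvariance(v)     # population variance sigma^2
n = 3

def z_n(sample):
    """Sample variance (1.31), unbiased under with-replacement sampling."""
    return statistics.variance(sample)

def z_n_prime(sample):
    """Estimator (1.37): the (m-1)/m factor corrects the without-replacement
    sample variance so that it is unbiased for sigma^2."""
    return (m - 1) / m * statistics.variance(sample)

# Exact expectation over all ordered with-replacement samples of size n:
e_zn = statistics.mean(z_n(s) for s in itertools.product(v, repeat=n))
# Exact expectation over all ordered without-replacement samples of size n:
e_znp = statistics.mean(z_n_prime(s) for s in itertools.permutations(v, n))
print(sigma2, e_zn, e_znp)   # all three agree
```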
As in Hoeffding's proof of (1.6), the inequality in Proposition 1 can be combined with the results in Section 1.2.2 to establish inequalities for $Z'_n$ and $\sqrt{Z'_n}$.

Theorem 4. Let $Z'_n$ be defined as in (1.37), and let $a$ and $b$ satisfy (1.4). Then
$$ P\{ |Z'_n - \sigma^2| \ge t \} \le 2 e^{-8 \lfloor n/2 \rfloor t^2 / (b-a)^4} \qquad (1.38) $$
and
$$ P\bigl\{ |\sqrt{Z'_n} - \sigma| \ge t \bigr\} \le 2 e^{-8 \lfloor n/2 \rfloor t^4 / (b-a)^4} \qquad (1.39) $$
for $t > 0$ and $2 \le n \le m$.

The inequality (1.38) is stated in [13], but the supporting proof is incomplete.

Now suppose that $S$ is an arbitrary nonempty subset of $\{1, 2, \ldots, m\}$. If $|S|$ is known, then for $n \ge 2$ the estimator
$$ Z'_n(S) = \frac{|S| - 1}{|S|} \cdot \frac{1}{I'_n - 1} \sum_{i=1}^{n \wedge m} u(L'_i) \bigl( v(L'_i) - \bar Y'_n(S) \bigr)^2 $$
is conditionally unbiased for $\sigma^2(S)$, given the value of $I'_n$. Arguing as in previous sections, we obtain the conditional inequalities
$$ P\{ |Z'_n(S) - \sigma^2(S)| \ge t \mid I'_n = k \} \le 2 e^{-8 \lfloor k/2 \rfloor t^2 / (b-a)^4} $$
and
$$ P\bigl\{ |\sqrt{Z'_n(S)} - \sigma(S)| \ge t \mid I'_n = k \bigr\} \le 2 e^{-8 \lfloor k/2 \rfloor t^4 / (b-a)^4} $$
for $t > 0$, $n \ge 2$, and $2 \le k \le |S| \wedge n$.

If $|S|$ is unknown, we cannot use the estimator $Z'_n(S)$. If $S$ is known to contain a sizable number of elements, however, a reasonable estimator can be obtained by ignoring the troublesome bias-correction term $(|S|-1)/|S|$ and simply computing the sample variance over all elements of $S$ observed so far. This approach yields a biased estimator of $\sigma^2(S)$ given by
$$ Z''_n(S) = \frac{1}{I'_n - 1} \sum_{i=1}^{n \wedge m} u(L'_i) \bigl( v(L'_i) - \bar Y'_n(S) \bigr)^2, \qquad (1.40) $$
where $\bar Y'_n(S)$ and $I'_n$ are defined as in (1.26) and (1.28), respectively. Because $Z''_n(S)$ is biased even given the value of $I'_n$, it is difficult to obtain even conditional Hoeffding inequalities without some extra information. When $|S|$ is large relative to $(b-a)^2$ and a good lower bound $d \le |S|$ is available, the following inequality can be useful.

Theorem 5. Let $Z''_n(S)$ be defined as in (1.40) and let $d \le |S|$. Then
$$ P\{ |Z''_n(S) - \sigma^2(S)| \ge t \mid I'_n = k \} \le 2 e^{-8 \lfloor k/2 \rfloor \alpha^2 / (b-a)^4} \qquad (1.41) $$
and
$$ P\bigl\{ |\sqrt{Z''_n(S)} - \sigma(S)| \ge t \mid I'_n = k \bigr\} \le 2 e^{-8 \lfloor k/2 \rfloor \beta^2 / (b-a)^4} \qquad (1.42) $$
for $t > 0$, $n \ge 2$, and $2 \le k \le |S| \wedge n$, where
$$ \alpha = \alpha(t, d, a, b) = \frac{d-1}{d}\, t - \frac{(b-a)^2}{4(d-1)} $$
and
$$ \beta = \beta(t, d, a, b) = \frac{d-1}{d}\, t^2 - \frac{(b-a)^2}{4(d-1)}. $$
Note that, given $I'_n = k$, we can set $d = k$ if $k$ is sufficiently large.

1.3. Some Further Refinements

Krafft and Schmitz [13] provide several techniques that can be used to tighten the various bounds given in this section. The first (trivial) observation is that if $a \le v \le b$, then, for example, $P\{ |\tilde Y_n - \mu| \ge t \} = 0$ for $t > b-a$, where $\tilde Y_n$ and $\mu$ are as in Theorem 2. Similarly, it follows from the inequality in (4.5) below that $|Z_n - \sigma^2| \le t_n$ for $n \ge 2$, where $Z_n$ and $\sigma^2$ are as in (1.34) and
$$ t_n = \frac{n (b-a)^2}{4 (n-1)}. $$
Thus, $P\{ |Z_n - \sigma^2| \ge t \} = 0$ for $t > t_n$ and $P\{ |\sqrt{Z_n} - \sigma| \ge t \} = 0$ for $t > \sqrt{t_n}$. All of the other inequalities in this section can be tightened in a similar manner.

A less trivial result is obtained after rewriting (1.12), (1.34), and (1.36) in the form
$$ P\{ |\tilde Y_n - \mu| \ge t \} \le 2 e^{-m(n)\, \phi(t/(b-a))}, $$
$$ P\{ |Z_n - \sigma^2| \ge t \} \le 2 e^{-\lfloor n/2 \rfloor\, \phi(2t/(b-a)^2)}, $$
and
$$ P\bigl\{ |\sqrt{Z_n} - \sigma| \ge t \bigr\} \le 2 e^{-\lfloor n/2 \rfloor\, \phi(2t^2/(b-a)^2)}, $$
respectively, where $\phi(x) = 2x^2$. It follows from results in [13] that the above inequalities hold with $\phi$ replaced by $\phi'$, where
$$ \phi'(x) = 2x^2 + \tfrac{4}{9} x^4 + \tfrac{2}{9} x^6. $$
All of the other inequalities in this section can be rewritten and tightened in a similar manner.
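To give a feel for how much the refinement buys, the following Python sketch (ours) compares the two exponent functions, assuming the coefficients $4/9$ and $2/9$ of $\phi'$ as reconstructed above:

```python
import math

def phi(x):
    return 2.0 * x**2

def phi_refined(x):
    # Refined exponent attributed to Krafft and Schmitz in Section 1.3.
    return 2.0 * x**2 + (4.0 / 9.0) * x**4 + (2.0 / 9.0) * x**6

def bound(n_eff, x, refined=False):
    """2 * exp(-n_eff * phi(x)); e.g., x = t/(b-a) and n_eff = m(n) for (1.12)."""
    f = phi_refined if refined else phi
    return 2.0 * math.exp(-n_eff * f(x))

for x in (0.1, 0.3, 0.5):
    print(x, bound(100, x), bound(100, x, refined=True))
```

The refinement matters most for large relative deviations $x$, where the higher-order terms dominate.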
2. Application to Join-Selectivity Estimation

A (select-)join query with $K$ ($\ge 2$) input relations $R_1, R_2, \ldots, R_K$ specifies a subset of the cross product $R_1 \times R_2 \times \cdots \times R_K$. For each element $(j_1, j_2, \ldots, j_K)$ in this subset, tuples $j_1, j_2, \ldots, j_K$ are concatenated to form a tuple of the output relation $R_{12 \cdots K}$. The "selectivity of the join" is the number of tuples in $R_{12 \cdots K}$ divided by the number of elements in the cross product of the input relations. Selectivity estimates play a key role in query optimization for ORDBMSs, as well as in capacity planning, cost estimation for online queries, system access control, load balancing, and statistical studies.

Our formulation of the selectivity-estimation problem follows [7]. For $j_1 \in R_1, j_2 \in R_2, \ldots, j_K \in R_K$, set $v_0(j_1, j_2, \ldots, j_K) = 1$ if tuples $j_1, j_2, \ldots, j_K$ join (that is, if these tuples are concatenated to form a tuple of $R_{12 \cdots K}$); otherwise, set $v_0(j_1, j_2, \ldots, j_K) = 0$. The function $v_0$ is determined by the join and selection predicates that make up the query. We wish to estimate the selectivity $\rho$, defined as
$$ \rho = \frac{|R_{12 \cdots K}|}{|R_1 \times \cdots \times R_K|} = \frac{1}{|R_1| \cdots |R_K|} \sum_{j_1 \in R_1} \sum_{j_2 \in R_2} \cdots \sum_{j_K \in R_K} v_0(j_1, j_2, \ldots, j_K). $$

A naïve method for estimating the selectivity is to use a "tuple-level, independent" sampling scheme, denoted by t_indep. At the $n$th sampling step of t_indep, a tuple is selected randomly and uniformly from each input relation. The observation $X_n$ is then computed from the $K$ selected tuples, where $X_n = 1$ if the tuples join, and $X_n = 0$ otherwise. In other words, $X_n$ is the selectivity of the $K$-way join of the randomly selected tuples. The tuples are then discarded prior to the next sampling step. For each $n \ge 1$, the observations $X_1, X_2, \ldots, X_n$ are identically distributed, and the average of these observations is an unbiased estimator of $\rho$; if samples are drawn with replacement, then these observations are also mutually independent.

An alternative approach is to estimate $\rho$ using the "page-level, cross-product" sampling scheme proposed in Hou, Ozsoyoglu, and Taneja [12] and analyzed in [7, 19]. We denote this scheme by p_cross. At each sampling step of p_cross and for each of the $K$ input relations, a page of tuples is selected randomly and uniformly from among the pages that make up the relation; this randomly selected page is stored in main memory. At the $n$th sampling step, $n^K - (n-1)^K$ observations are generated by computing (i) the selectivity of the $K$-way join of the pages selected at the current sampling step and (ii) the selectivities of all possible $K$-way joins among pages selected at the current sampling step and pages selected at previous sampling steps. Although p_cross examines many more tuples per sampling step than t_indep, the resulting observations are not mutually independent, even when samples are drawn with replacement.

Haas et al. [7] compare p_cross and t_indep when samples are drawn with replacement. They show that, for any fixed number of sampling steps, selectivity estimators based on the p_cross scheme always have variance less than or equal to that of estimators based on the t_indep scheme. In practice, the variance of the selectivity estimator can be smaller by orders of magnitude when p_cross is used. We therefore focus throughout on the p_cross sampling scheme and its variants.
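The variance advantage of p_cross over t_indep is easy to see on a toy equijoin. The following Python sketch (ours; the data, page size, and names are illustrative, and this simplified p_cross re-samples the pages at each repetition rather than growing one sample incrementally):

```python
import itertools
import random
import statistics

# Toy equijoin: relations are lists of join-attribute values, partitioned
# into fixed-size "pages".
random.seed(2)
R1 = [random.randrange(50) for _ in range(400)]
R2 = [random.randrange(50) for _ in range(600)]
page = 4
P1 = [R1[i:i + page] for i in range(0, len(R1), page)]
P2 = [R2[i:i + page] for i in range(0, len(R2), page)]
rho = sum(x == y for x, y in itertools.product(R1, R2)) / (len(R1) * len(R2))

def t_indep(steps):
    """One tuple per relation per step; average of 0/1 join indicators."""
    return statistics.mean(
        float(random.choice(R1) == random.choice(R2)) for _ in range(steps))

def p_cross(steps):
    """One page per relation per step; cross-product average of page-pair
    selectivities over all steps^2 page combinations (cf. (1.16))."""
    pages1 = random.choices(P1, k=steps)
    pages2 = random.choices(P2, k=steps)
    sel = [statistics.mean(float(x == y) for x, y in itertools.product(b1, b2))
           for b1, b2 in itertools.product(pages1, pages2)]
    return statistics.mean(sel)

a = [t_indep(20) for _ in range(500)]
b = [p_cross(20) for _ in range(500)]
print(rho, statistics.mean(a), statistics.mean(b))
# Per [7], the p_cross variance is no larger for the same number of steps,
# and here it is typically smaller by a large factor:
print(statistics.variance(a), statistics.variance(b))
```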
The sampling cost of p_cross sometimes can be reduced by using "index-assisted" sampling as proposed by Lipton, Naughton, and Schneider [14, 15] and extended in [7]. Suppose, for example, that we wish to estimate the selectivity of an equijoin of relations $R_1$ and $R_2$ using the t_indep scheme and that $R_2$ has an index on its join attribute; that is, the join predicate is "$R_1.a = R_2.a$" and $R_2$ has an index on attribute $a$. At each sampling step, one tuple is selected randomly and uniformly from relation $R_1$, and the total number of tuples from $R_2$ that join with this random tuple is computed using the index; no sampling from $R_2$ is required. Thus, we obtain $|R_2|$ observations at each sampling step. This idea extends in a straightforward manner to the p_cross scheme and to general $K$-way joins when one or more of the input relations has a combined index on (the concatenation of) all relevant join and selection attributes.

In the following, we assume that samples are drawn with replacement. We also assume that tuples are stored and brought into main memory in pages, where each page contains $N$ ($\ge 1$) tuples. To discuss page-level and index-assisted sampling schemes in a unified way, we consider a generalized scheme in which, for $1 \le k \le K$, the tuples in relation $R_k$ are partitioned into $m_k$ blocks $B(k,1), B(k,2), \ldots, B(k,m_k)$ with $t_k$ ($= |R_k|/m_k$) tuples per block. For each relation $R_k$, a block of tuples is selected at each sampling step, and all of the tuples in the block are brought into main memory. When $t_1 = \cdots = t_K = N$ we have pure page-level sampling. When $t_j = |R_j|$ for one or more values of $j$ we have index-assisted sampling. (As indicated above, we don't actually bring all of the tuples in an indexed relation into memory; we just perform an index lookup. Such a lookup, however, is equivalent to examining all of the tuples in the indexed relation, as far as their join and selection attributes are concerned. In general, one or more I/Os may be required to perform the lookup; see [7] for a detailed discussion of sampling costs.) We assume that at most $K-1$ indexes are available; Ganguly, Gibbons, Matias, and Silberschatz [4] provide an estimation procedure for the case in which $K = 2$ and there is an index on each input relation.

For $(l_1, l_2, \ldots, l_K) \in \Lambda$, denote by $v(l_1, l_2, \ldots, l_K)$ the selectivity of the join of blocks $B(1,l_1), B(2,l_2), \ldots, B(K,l_K)$:
$$ v(l_1, l_2, \ldots, l_K) = \frac{1}{t_1 t_2 \cdots t_K} \sum_{j_1 \in B(1,l_1)} \sum_{j_2 \in B(2,l_2)} \cdots \sum_{j_K \in B(K,l_K)} v_0(j_1, j_2, \ldots, j_K). $$
The function $v$ is called the selectivity function for the join. Observe that the selectivity $\rho$ can be represented as a cross-product average of the form (1.7). Denote by $L_{ki}$ (resp., $L'_{ki}$) the random index of the block of tuples selected from relation $R_k$ at the $i$th sampling step when samples are drawn with (resp., without) replacement. Then for $n \ge 1$ the estimators $\tilde Y_n$ and $\tilde Y'_n$ defined by (1.16) and (1.17), respectively, are each unbiased for $\rho$.

Suppose that we wish to estimate the selectivity to within $\pm\epsilon$ with probability $p$, where $\epsilon > 0$ and $p \in (0,1)$. In general, it is impossible to satisfy the precision criterion with a probability exactly equal to $p$. To address this problem, Haas et al. [7] introduced a fixed-precision estimation procedure called fp_cross. In this procedure, the basic p_cross sampling procedure is executed for a random number of sampling steps. After each sampling step of p_cross, a stopping rule is used to determine whether to continue sampling or to stop sampling and return the cross-product average based on all of the blocks sampled so far. The final estimate $\tilde Y(\epsilon)$ is given by $\tilde Y(\epsilon) = \tilde Y_{N(\epsilon)}$, where $\tilde Y_n$ is defined by (1.16) for $n \ge 1$ and $N(\epsilon)$ is the random number of sampling steps executed by the procedure. The stopping rule for fp_cross is obtained by approximating the distribution of $\tilde Y(\epsilon)$ by a normal distribution. Use of this approximation leads to an estimate of the probability that the precision criterion is satisfied.
If the estimated probability is greater than or equal to the prespecified probability, then no further samples are taken; see [7] for details. It is shown in [7] that the fp_cross procedure is "asymptotically consistent" in the sense that the probability of satisfying the precision criterion converges to the prespecified value as the precision criterion becomes increasingly stringent: $\lim_{\epsilon \downarrow 0} P\{ |\tilde Y(\epsilon) - \rho| \le \epsilon \} = p$. Moreover, fp_cross is "asymptotically efficient" in the sense that the total number of sampling steps converges to the theoretical minimum number of required steps as $\epsilon$ becomes small. The number of sampling steps required by fp_cross is independent of the size of the input relations. Thus, for a fixed precision criterion, the cost of sampling relative to the cost of computing $\rho$ exactly decreases as the size of the input relations increases; see Haas and Naughton [8] for further discussion.

One shortcoming of fp_cross is that there is no guaranteed upper bound on the number of sampling steps. This can be an issue because the normal approximation that underlies the stopping rule is exact only in the limit as $\epsilon \to 0$. For fixed positive $\epsilon$, the probability that the precision criterion is satisfied can be underestimated, resulting in too many sampling steps. We can use the results in Section 1.1 to alleviate this problem and provide a guaranteed upper bound on the number of sampling steps. (See [15] for a related discussion of "sanity bounds.") Set
$$ n_0 = n_0(\epsilon, p) = \left\lceil \frac{1}{2 \epsilon^2} \ln \frac{2}{1-p} \right\rceil, $$
where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$. Then the idea is to stop sampling in fp_cross after $\min\bigl( n_0, N(\epsilon) \bigr)$ steps, where $N(\epsilon)$ is the number of steps executed in the original fp_cross procedure. It follows from (1.12) that no more than $n_0$ sampling steps are ever needed to satisfy the precision criterion.

The bound $n_0$ can be improved upon when there is additional a priori information on the selectivity of the join. For example, consider a join query consisting of the single join predicate
$$ R_1.a_1 = R_2.a_1 \ \text{ and } \ R_2.a_2 = R_3.a_2 \ \text{ and } \ \cdots \ \text{ and } \ R_{K-1}.a_{K-1} = R_K.a_{K-1}. $$
Suppose it is known a priori that each tuple in relation $R_k$ ($1 \le k \le K-1$) joins with at most $\beta_{k+1}$ tuples in relation $R_{k+1}$, where $\beta_{k+1} < t_{k+1}$. Then $0 \le v(l_1, l_2, \ldots, l_K) \le u$ for all $l_1, l_2, \ldots, l_K$, where $u = \prod_{k=2}^K (\beta_k / t_k) < 1$. It follows from (1.18) that $P\{ |\tilde Y_{n_0(u)} - \rho| \le \epsilon \} \ge p$, where
$$ n_0(u) = n_0(u, \epsilon, p) = \left\lceil \frac{u^2}{2 \epsilon^2} \ln \frac{2}{1-p} \right\rceil \le n_0. $$
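A minimal Python sketch (ours) of the two sanity bounds, with illustrative precision targets:

```python
import math

def n0(eps, p):
    """Sanity bound on the number of p_cross sampling steps: by (1.12) with
    a = 0 and b = 1, n0 steps always suffice to estimate rho to within eps
    with probability at least p."""
    return math.ceil(math.log(2.0 / (1.0 - p)) / (2.0 * eps**2))

def n0_u(u, eps, p):
    """Improved bound (from (1.18)) when 0 <= v <= u < 1 is known a priori."""
    return math.ceil(u**2 * math.log(2.0 / (1.0 - p)) / (2.0 * eps**2))

# Estimate a selectivity to within 0.001 with probability 0.95:
print(n0(0.001, 0.95))           # about 1.84 million steps in the worst case
print(n0_u(0.01, 0.001, 0.95))   # 185 steps if v is known to be at most 0.01
```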
3. Application to Online Aggregation

Users of an ORDBMS often execute "aggregation queries" in order to obtain statistical summaries of large, complex data sets. Aggregation queries are processed by first executing a (typically complex) sequence of joins and selections on the base relations to create an output relation; each tuple of the output relation is then mapped to a real number, and aggregate quantities such as the sum, mean, or variance are computed from the numbers. The rows of the output table sometimes are divided into groups based on values of the data attributes, and aggregates are computed separately for each group. We focus on scientific and decision-support applications in which the user explores a data set by executing a sequence of aggregation queries in an interactive manner. The formulation of each successive query is influenced by the results of previous queries. In this setting it is crucial that the processing time for each query be as short as possible. Moreover, since the typical goal is simply to get a rough feel for the data, approximate results often suffice.

Unfortunately, current database systems do not adequately support such interactive exploration: the result of an aggregation query is not returned to the user until the query has run to completion and the exact answer has been computed. Such an exact computation, which can involve the processing of many terabytes of data, can take an extremely long time. Moreover, there is no feedback during processing; a user can lose much valuable time before discovering that a particular query is misguided or uninformative. In an effort to address these problems, Hellerstein, Haas, and Wang [9] have proposed an online aggregation interface to an ORDBMS. This interface lets users both observe the progress of their aggregation queries and control the execution of these queries on the fly. The idea is to retrieve the pages of each base table in random order, so that the rows retrieved so far can be viewed as a random sample. At each time point, the system displays a running estimate of the final value of the aggregate based on all of the pages retrieved so far. The system also indicates the estimated proximity of each running estimate to the corresponding final result by means of a running confidence interval. The user can abort processing of the query as soon as the intervals become sufficiently short. When aggregates are being computed for multiple groups, some online-aggregation prototype systems maintain a running confidence interval for each group and permit the user to increase or decrease the relative processing speed for an individual group on the fly, or to abort processing for an individual group.

Classical confidence-interval formulas based on results for i.i.d. observations cannot be applied in the setting of online aggregation because of the complicated correlation structure induced by the joins and selections executed prior to the final aggregation step. Large-sample confidence intervals suitable for online aggregation are given in [5, 6]; such intervals are based on central limit theorems. Although the large-sample intervals are useful in the earlier stages of query processing, the set of auxiliary statistics needed to compute the intervals can become very large as the processing proceeds. Also, the intervals rest on the approximating assumption that samples are obtained with replacement, and this assumption becomes untenable as the sample size (number of records scanned) becomes large. Finally, the actual coverage probability for the intervals can be less than the nominal value.

The results in Section 1 can be used to obtain conservative running confidence intervals for a variety of online aggregation queries. That is, for a prespecified parameter $p \in (0,1)$, a number $\epsilon$ is displayed such that the current value of the running estimate is within $\pm\epsilon$ of the final answer with probability at least $p$. Although conservative confidence intervals are in general wider than large-sample intervals, the computations for the conservative intervals require minimal memory and CPU time and avoid the undercoverage problem alluded to above. In the following subsections, we illustrate the application of our results to online aggregation processing by means of an extended example.
Consider an online aggregation interface to a relational database system containing the following three relation schemes:

    Supplier-scheme = (part-num, supplier, price)
    Inventory-scheme = (part-num, location, quantity)
    Sales-scheme = (item, month, day, location, number-sold)

Assume that there are $m_1$ pages of tuples (with $t_1$ tuples per page) in the Supplier relation, $m_2$ pages of tuples (with $t_2$ tuples per page) in the Inventory relation, and $m_3$ pages of tuples (with $t_3$ tuples per page) in the Sales relation. Tuples are retrieved from each input relation a page at a time. Denote by Supplier($i$) the $i$th tuple in the Supplier relation, and similarly for the Inventory and Sales relations.

3.1. Complex SUM, COUNT, and AVERAGE Queries

As a first example, consider the query that returns the total value of inventory stored at the San Jose warehouse:

    SELECT SUM(Supplier.price * Inventory.quantity)
    FROM Supplier, Inventory
    WHERE Supplier.part-num = Inventory.part-num
    AND Inventory.location = 'San Jose'

For $1 \le i \le m_1 t_1$ and $1 \le j \le m_2 t_2$, set $v_0(i,j) = \text{Supplier}(i).\text{price} \times \text{Inventory}(j).\text{quantity}$ if Supplier($i$).part-num = Inventory($j$).part-num and Inventory($j$).location = "San Jose"; otherwise, set $v_0(i,j) = 0$. Also set
$$ v(l_1, l_2) = m_1 m_2 \sum_{i \in B(1,l_1)} \sum_{j \in B(2,l_2)} v_0(i,j), \qquad (3.1) $$
where $B(1,k)$ and $B(2,k)$ denote the $k$th page of tuples from the Supplier relation and Inventory relation, respectively. Observe that the query result $\mu$ is of the form (1.7) with $v$ defined as above.

Suppose it is known that (i) no part has a price greater than $\bar p$ and (ii) no more than $\bar q$ units of a given part type are stored at the San Jose warehouse simultaneously. Such information often can be deduced from statistics that are maintained in the database system catalog for use by the query optimizer. It follows that $\max_{i,j} v_0(i,j) \le \bar p \bar q$ and the function $v$ satisfies the bounds $a \le v \le b$ with $a = 0$ and $b = m_1 m_2 t_1 t_2 \bar p \bar q$. Tighter bounds on $v$ can be obtained by observing that the part-num attribute is in fact a key for the Supplier table; that is, there are no duplicate part-num values in Supplier. (Otherwise, the query is not well posed.) We can take $b = m_1 m_2 t_2 \bar p \bar q$, since each row in the Inventory table joins with at most one row in the Supplier table.

Suppose that at a given time point $n_1$ pages have been retrieved from the Supplier relation and $n_2$ pages have been retrieved from the Inventory relation. Then the estimator $\tilde Y'_n$ defined by (1.9) is unbiased for $\mu$, where $n = (n_1, n_2)$. It follows from (1.13) that a conservative $100p\%$ confidence interval for $\mu$ is given by $[\tilde Y'_n - \epsilon,\ \tilde Y'_n + \epsilon]$, where $\epsilon$ is defined by (1.20) with $a$ and $b$ as above.

If the assumptions in (i) and (ii) hold and (iii) all of the pages in the Supplier relation and a portion of the pages in the Inventory relation have been retrieved, so that $n_1 = m_1$ and $n_2 < m_2$, then we can obtain a tighter interval than the one given above. In analogy to (1.14), set
$$ w_n(l_2) = \frac{1}{m_1} \sum_{l_1=1}^{m_1} v(l_1, l_2) = m_2 \sum_{l_1=1}^{m_1} \sum_{i \in B(1,l_1)} \sum_{j \in B(2,l_2)} v_0(i,j) $$
for $1 \le l_2 \le m_2$. Since at most $t_2$ terms in the above sum are positive (with the value of a positive term equal to at most $\bar p \bar q$), we can obtain a tighter interval $[\tilde Y'_n - \epsilon(n),\ \tilde Y'_n + \epsilon(n)]$ by setting
$$ \epsilon(n) = \bigl( b(n) - a(n) \bigr) \Bigl( \frac{1}{2 m'(n)} \ln \frac{2}{1-p} \Bigr)^{1/2}, \qquad (3.2) $$
where $a(n) = 0$ and $b(n) = m_2 t_2 \bar p \bar q$.
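A Python sketch (ours) of the running estimate and conservative interval for this SUM query; the function and parameter names (sum_query_interval, p_max, q_max) and the numeric values are hypothetical:

```python
import math

def sum_query_interval(partial_sum, n1, n2, m1, m2, t2, p_max, q_max, p):
    """Conservative running interval for the SUM query, following (1.9) and
    (1.20).  partial_sum is the sum of v0 over all pairs of retrieved pages;
    p_max and q_max are the assumed price/quantity bounds from the catalog."""
    k1, k2 = min(n1, m1), min(n2, m2)
    estimate = m1 * m2 * partial_sum / (k1 * k2)   # Y~'_n with v as in (3.1)
    b = m1 * m2 * t2 * p_max * q_max               # key-based bound on v
    n_prime = min(k1 if n1 < m1 else math.inf,     # m'(n)
                  k2 if n2 < m2 else math.inf)
    if math.isinf(n_prime):
        return estimate, 0.0                       # both relations exhausted
    eps = b * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n_prime))
    return estimate, eps

est, eps = sum_query_interval(partial_sum=812.0, n1=40, n2=25, m1=100,
                              m2=200, t2=50, p_max=99.0, q_max=1000.0, p=0.95)
print(est, "+/-", eps)
```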
The formula (1.20) for the precision parameter $\epsilon$ assumes that pages, but not necessarily tuples, are retrieved in random order. That is, the attribute values of a tuple may depend on the page on which the tuple resides. When such dependence is present, the tuples are said to be clustered on the pages; otherwise, they are said to be unclustered. If it is known that rows are unclustered, the lengths of the conservative confidence intervals can be considerably reduced. To see this, suppose that the assumptions in (i) and (ii) hold. Also suppose that $n_1$ ($< m_1$) pages have been retrieved from the Supplier table and $n_2$ ($< m_2$) pages have been retrieved from the Inventory table. If the rows are clustered on the pages, then $\epsilon$ is computed from (1.20) with $n = (n_1, n_2)$, as discussed above. If the rows are unclustered, however, then rows (and not just pages) are retrieved in random order, and $\epsilon$ can be computed using the same formula as in the clustered case except that $n = (t_1 n_1, t_2 n_2)$. Similarly, if the assumptions in (i)-(iii) hold, then $\epsilon(n)$ is computed from (3.2) with $n = (n_1, n_2)$ in the clustered case and $n = (t_1 n_1, t_2 n_2)$ in the unclustered case.

As a second example of an aggregation query, consider the query that returns the total number of part types at the San Jose warehouse that are supplied by the Acme company:

    SELECT COUNT(*)
    FROM Supplier, Inventory
    WHERE Supplier.part-num = Inventory.part-num
    AND Supplier.supplier = 'Acme'
    AND Inventory.location = 'San Jose'

We assume here that the combined attribute (part-num, location) is a key for the Inventory relation. Set $v_0(i,j) = 1$ if Supplier($i$).part-num = Inventory($j$).part-num, Supplier($i$).supplier = "Acme", and Inventory($j$).location = "San Jose"; otherwise, set $v_0(i,j) = 0$. Defining $v(l_1, l_2)$ as in (3.1), we see that the query result $\mu$ is of the form (1.7). With no additional assumptions about the data, the function $v$ satisfies the bounds $a \le v \le b$ with $a = 0$ and $b = m_1 m_2 t_1 t_2$. As with the SUM query discussed above, tighter bounds on $v$ may be obtainable. Since $\mu$ has the same mathematical form as the result of the SUM query discussed above, the previous discussion for the SUM query carries over to the current setting almost unchanged; only the specific form of the function $v$ and the value of $b$ are different. In particular, the estimator $\tilde Y'_n$ defined by (1.9) is unbiased for $\mu$, and conservative choices for the confidence-interval half-width are of the form (1.20). Moreover, the result in (1.15) can be applied as described for SUM queries.

Finally, consider the query that returns the average price of the part types supplied by the Acme company and stored in San Jose:

    SELECT AVERAGE(Supplier.price)
    FROM Supplier, Inventory
    WHERE Supplier.part-num = Inventory.part-num
    AND Supplier.supplier = 'Acme'
    AND Inventory.location = 'San Jose'

For $1 \le i \le m_1 t_1$ and $1 \le j \le m_2 t_2$, set $f_0(i,j) = \text{Supplier}(i).\text{price}$ if Supplier($i$).part-num = Inventory($j$).part-num, Supplier($i$).supplier = "Acme", and Inventory($j$).location = "San Jose"; otherwise, set $f_0(i,j) = 0$. Also set
$$ f(l_1, l_2) = \sum_{i \in B(1,l_1)} \sum_{j \in B(2,l_2)} f_0(i,j). $$
Similarly, set $g_0(i,j) = 1$ if Supplier($i$).part-num = Inventory($j$).part-num, Supplier($i$).supplier = "Acme", and Inventory($j$).location = "San Jose"; otherwise, set $g_0(i,j) = 0$. Also set
$$ g(l_1, l_2) = \sum_{i \in B(1,l_1)} \sum_{j \in B(2,l_2)} g_0(i,j). $$
Then the query result is of the form $\mu(f)/\mu(g)$ as in Theorem 3. Under assumption (i) above, a conservative $100p\%$ confidence interval is given by $[\tilde Y'_n(f)/\tilde Y'_n(g) - \delta_{n,p},\ \tilde Y'_n(f)/\tilde Y'_n(g) + \delta_{n,p}]$, where $\delta_{n,p}$ is given by (1.22) with $a_f = 0$, $b_f = t_2 \bar p$, $a_g = 0$, and $b_g = t_2$. (As above, we have used the fact that the part-num attribute is a key for the Supplier table.)
3.2. Summary Statistics Defined on Selections

We assume throughout this subsection that the online aggregation system retrieves tuples from each relation in random order. As discussed above, this assumption is stronger than the assumption that the system retrieves pages in random order, and it implies that tuples are unclustered.

Consider the query that returns the average daily number $\mu(S)$ of widgets sold in December:

    SELECT AVG(Sales.number-sold)
    FROM Sales
    WHERE Sales.month = 'December'
    AND Sales.item = 'widget'

For $1 \le i \le m_3 t_3$, set $v(i) = \text{Sales}(i).\text{number-sold}$. Also set $u(i) = 1$ if Sales($i$).item = "widget" and Sales($i$).month = "December"; otherwise, set $u(i) = 0$. Then $\mu(S)$ is of the form (1.24) with $u$ and $v$ defined as above. Suppose that no location stocks more than $w$ widgets on any day. Then the function $v$ satisfies the bounds $a \le v \le b$ with $a = 0$ and $b = w$.

Suppose that at a given time point $n$ tuples (equivalently, $n/t_3$ pages) have been retrieved from the Sales relation and $k$ ($> 0$) of these tuples correspond to widgets sold in December. The estimator $\bar Y'_n(S)$ defined by (1.26) (with $I'_n = k$) is conditionally unbiased for $\mu(S)$. By (1.29), a conservative $100p\%$ confidence interval is given by $[\bar Y'_n(S) - \epsilon,\ \bar Y'_n(S) + \epsilon]$, where
$$ \epsilon = (b-a) \Bigl( \frac{1}{2k} \ln \frac{2}{1-p} \Bigr)^{1/2}. $$
Note that an alternative confidence interval is given by Theorem 3. Neither interval dominates the other in all cases, and the shorter of the two intervals should be used.

Now consider the query that returns the variance $\sigma^2(S)$ of the daily number of widgets sold in December:

    SELECT VARIANCE(Sales.number-sold)
    FROM Sales
    WHERE Sales.month = 'December'
    AND Sales.item = 'widget'

Defining $u$ and $v$ as above, we see that $\sigma^2(S)$ is of the form (1.30). As above, suppose that $n$ tuples have been retrieved from the Sales relation and $k$ of these tuples correspond to widgets sold in December. We can estimate $\sigma^2(S)$ by $Z''_n(S)$, where this estimator is defined by (1.40) (with $I'_n = k$). It follows from (1.41) that a conservative $100p\%$ confidence interval is given by $[Z''_n(S) - \epsilon',\ Z''_n(S) + \epsilon']$, where
$$ \epsilon' = \frac{d (b-a)^2}{4 (d-1)^2} + \frac{d (b-a)^2}{d-1} \Bigl( \frac{1}{8 \lfloor k/2 \rfloor} \ln \frac{2}{1-p} \Bigr)^{1/2}. $$
In the above expression, $d$ is a lower bound on the total number $|S|$ of tuples that correspond to widgets sold in December. The parameter $d$ can be taken equal to $k$; if a larger lower bound on $|S|$ is available, then $d$ should be taken equal to this lower bound.

If the quantity of interest is the standard deviation rather than the variance, then $\sqrt{Z''_n(S)}$ can be used to estimate $\sigma(S)$. By (1.42), a conservative $100p\%$ confidence interval is given by $[\sqrt{Z''_n(S)} - \epsilon'',\ \sqrt{Z''_n(S)} + \epsilon'']$, where $\epsilon'' = \sqrt{\epsilon'}$.
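The half-widths for these selection queries are cheap to maintain at each retrieval step. A minimal Python sketch (ours; $k$, $d$, and the value bound are illustrative):

```python
import math

def avg_half_width(k, a, b, p):
    """Half-width for the subpopulation mean, by inverting (1.29);
    k is the number of retrieved tuples that satisfy the predicate."""
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * k))

def var_half_width(k, d, a, b, p):
    """Half-width eps' for the subpopulation variance, by inverting (1.41);
    d <= |S| is a lower bound on the subpopulation size (d >= 2, e.g. d = k)."""
    r = (b - a) ** 2
    log_term = math.sqrt(math.log(2.0 / (1.0 - p)) / (8.0 * (k // 2)))
    return d * r / (4.0 * (d - 1) ** 2) + d * r / (d - 1) * log_term

k = 500                    # December widget rows seen so far
print(avg_half_width(k, 0.0, 200.0, 0.95))
print(var_half_width(k, k, 0.0, 200.0, 0.95))
print(math.sqrt(var_half_width(k, k, 0.0, 200.0, 0.95)))  # eps'' for std. dev.
```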
4. Proofs

Proof of Theorem 1. Fix a vector $n \in N$. For $1 \le k \le K$ let $N_k$ ($\ge n_k$) be the unique random integer such that $S_k = \{ L_{ki} : 1 \le i \le N_k \}$ contains exactly $n_k$ distinct values; denote the set of these distinct values by $T_k = \{ L'_{ki} : 1 \le i \le n_k \}$. Observe that $T_k$ coincides with a random sample of size $n_k$ drawn uniformly without replacement from $\Lambda_k$. Set $S = (S_1, S_2, \ldots, S_K)$ and $T = (T_1, T_2, \ldots, T_K)$. We can view $\tilde Y_n$, $\tilde Y'_n$, and $T$ as functions of $S$. Write
$$ E[\tilde Y_n \mid T] = \frac{1}{n_1 \cdots n_K} \sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} E\bigl[ v(L_{1 i_1}, \ldots, L_{K i_K}) \mid T \bigr]. \qquad (4.1) $$
It follows from symmetry considerations that $P\{ L_{ki} = l \mid l \in T_k \} = 1/n_k$ for $1 \le l \le m_k$, $1 \le i \le n_k$, and $1 \le k \le K$, so that
$$ E\bigl[ v(L_{1 i_1}, \ldots, L_{K i_K}) \mid T \bigr] = \frac{1}{n_1 \cdots n_K} \sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} v(L'_{1 i_1}, \ldots, L'_{K i_K}). \qquad (4.2) $$
Substituting (4.2) into (4.1), we find that $\tilde Y'_n = E[\tilde Y_n \mid T]$. The desired result now follows from the Rao–Blackwell Theorem, because $T$ is a sufficient statistic for the sampling scheme that generates $S$; see Pathak [17] for definitions and details.

We now prove Theorem 2 using an approach developed by Hoeffding in the setting of $U$-statistics and averages of $m$-dependent random variables; see [11, Section 5]. We first recall the following elementary but useful inequality, attributed by Hoeffding [11] to S. N. Bernstein:
$$ P\{ X \ge 0 \} \le \inf_{h \ge 0} E\bigl[ e^{hX} \bigr] \qquad (4.3) $$
for any random variable $X$. The inequality in (4.3) holds since $P\{ X \ge 0 \} = E[g(X)]$, where
$$ g(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} $$
and $g(x) \le e^{hx}$ for all real $x$ and $h \ge 0$. We also need the following proposition.

Proposition 2. Let $Y = p_1 U_1 + p_2 U_2 + \cdots + p_m U_m$, where each of $U_1, U_2, \ldots, U_m$ is the average of $n$ ($\ge 1$) independent random variables taking values in $[0,1]$, and $p_1, p_2, \ldots, p_m$ are nonnegative numbers such that $p_1 + p_2 + \cdots + p_m = 1$. The random variables $U_1, U_2, \ldots, U_m$ need not be mutually independent, but are assumed to have common mean $\mu$. Then
$$ \inf_{h \ge 0} E\bigl[ e^{h(Y - \mu - t)} \bigr] \le e^{-2 n t^2} $$
for $t > 0$.

This proposition follows almost immediately from results in Sections 1, 4, and 5 of [11]. Indeed,
$$ E\bigl[ e^{h(nY - n\mu - nt)} \bigr] \le \sum_{i=1}^m p_i\, E\bigl[ e^{h(n U_i - n\mu - nt)} \bigr] $$
for $h \ge 0$, by the inequality that is stated just prior to (5.2) in [11]. On the other hand, it follows from (1.8) and (4.16) in [11] that
$$ E\bigl[ e^{h(n U_i - n\mu - nt)} \bigr] \le e^{-hnt + h^2 n / 8} $$
for $1 \le i \le m$ and $h \ge 0$. The right side of the above inequality is minimized when $h = 4t$, so that
$$ \inf_{h \ge 0} E\bigl[ e^{h(n U_i - n\mu - nt)} \bigr] \le e^{-2 n t^2} $$
for $1 \le i \le m$, and the asserted inequality follows directly.

Proof of Theorem 2. Fix $t > 0$ and $n \in N$, and set $m = m(n) = \min(n_1, n_2, \ldots, n_K)$. Assume without loss of generality that $n_1 = m$. Also assume without loss of generality that $0 \le v \le 1$; the general result follows by considering the function $v'(\cdot) = \bigl( v(\cdot) - a \bigr)/(b-a)$. Set
$$ U(i_2, i_3, \ldots, i_K) = \frac{1}{n_1} \sum_{j=1}^{n_1} v(L_{1 j}, L_{2, i_2+j}, \ldots, L_{K, i_K+j}) $$
for $(i_2, i_3, \ldots, i_K) \in \{1, 2, \ldots, n_2\} \times \{1, 2, \ldots, n_3\} \times \cdots \times \{1, 2, \ldots, n_K\}$, where an index of the form "$i_k + j$" is interpreted as $\bigl( (i_k + j - 1) \bmod n_k \bigr) + 1$ (that is, the indices "wrap around"). For example, when $K = 3$, $n_1 = 3$, $n_2 = 4$, and $n_3 = 3$, we have
$$ U(3,1) = \frac{ v(L_{11}, L_{24}, L_{32}) + v(L_{12}, L_{21}, L_{33}) + v(L_{13}, L_{22}, L_{31}) }{3}. $$
Observe that
$$ \tilde Y_n = \frac{1}{n_2 n_3 \cdots n_K} \sum_{i_2=1}^{n_2} \sum_{i_3=1}^{n_3} \cdots \sum_{i_K=1}^{n_K} U(i_2, i_3, \ldots, i_K). $$
Indeed, for specified values of $i_1, i_2, \ldots, i_K$ the quantity $v(L_{1 i_1}, L_{2 i_2}, \ldots, L_{K i_K})$ appears exactly once, as the $i_1$st term in the representation of $U(i_2 - i_1, i_3 - i_1, \ldots, i_K - i_1)$, where indices wrap around as described above. By construction, each $U(i_2, i_3, \ldots, i_K)$ is the average of $n_1$ i.i.d. random variables. It follows from (4.3) and Proposition 2 that
$$ P\{ \tilde Y_n - \mu \ge t \} \le \inf_{h \ge 0} E\bigl[ e^{h(\tilde Y_n - \mu - t)} \bigr] \le e^{-2 m t^2}. $$
A symmetric argument establishes the same bound for $P\{ \mu - \tilde Y_n \ge t \}$, and (1.12) follows.

We can now use (1.12) to establish (1.13) in the same way that Hoeffding uses (1.5) to establish (1.6). First suppose that $n_k < m_k$ for $1 \le k \le K$, and define $m = m(n)$ as above. Observe that $m$ is also equal to $m'(n)$ in this case. Since the function $g(x) = e^{hx}$ is convex for $h \ge 0$, it follows from (4.3) and Theorem 1 that
$$ P\{ \tilde Y'_n - \mu \ge t \} \le E\bigl[ e^{h(\tilde Y'_n - \mu - t)} \bigr] = e^{-h(\mu + t)} E\bigl[ e^{h \tilde Y'_n} \bigr] \le e^{-h(\mu + t)} E\bigl[ e^{h \tilde Y_n} \bigr] $$
for $h \ge 0$, so that, by Proposition 2,
$$ P\{ \tilde Y'_n - \mu \ge t \} \le \inf_{h \ge 0} E\bigl[ e^{h(\tilde Y_n - \mu - t)} \bigr] \le e^{-2 m t^2}. $$
A symmetric argument establishes the same bound for $P\{ \mu - \tilde Y'_n \ge t \}$, and (1.13) follows. Now suppose without loss of generality that for some integer $r = r(n)$ with $1 \le r < K$ we have $n_k < m_k$ for $1 \le k \le r$ and $n_k \ge m_k$ for $r+1 \le k \le K$. Observe that we can write
$$ \tilde Y'_n = \frac{1}{n_1 n_2 \cdots n_r} \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \cdots \sum_{i_r=1}^{n_r} w_n(L'_{1 i_1}, L'_{2 i_2}, \ldots, L'_{r i_r}), $$
where $w_n$ is defined as in (1.14). The desired result follows by applying the previous argument to the $r$-dimensional cross-product average of the function $w_n$. The only case not considered so far is when $n_k \ge m_k$ for $1 \le k \le K$, but in this case $\tilde Y'_n = \mu$ and the desired result follows trivially.
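The wrap-around decomposition at the heart of this proof can be checked numerically. A small Python sketch (ours) for $K = 2$, using 0-based indices so that the wrap-around is a plain modulus:

```python
import itertools
import random
import statistics

# Numerical check: Y~_n equals the average of the U(i2) terms, each of which
# is an average of n1 mutually independent summands (here K = 2, n1 <= n2).
random.seed(3)
n1, n2 = 3, 4
v = {(i, j): random.random() for i in range(1, 11) for j in range(1, 11)}
L1 = [random.randrange(1, 11) for _ in range(n1)]
L2 = [random.randrange(1, 11) for _ in range(n2)]

def U(i2):
    # The index "i2 + j" wraps around modulo n2, as in the proof.
    return statistics.mean(v[(L1[j], L2[(i2 + j) % n2])] for j in range(n1))

y_tilde = statistics.mean(
    v[(L1[i], L2[j])] for i, j in itertools.product(range(n1), range(n2)))
print(y_tilde, statistics.mean(U(i2) for i2 in range(n2)))  # equal up to rounding
```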
To prove Theorem 3, we need the following lemma.

Lemma 1. Suppose that there exist real numbers $Y_i$, $\mu_i$, $\delta_i$ ($i = 1, 2$) such that $\delta_1, \delta_2 > 0$, $Y_2 > \delta_2$, and $Y_i - \delta_i \le \mu_i \le Y_i + \delta_i$ for $i = 1, 2$. Then
$$ \left| \frac{Y_1}{Y_2} - \frac{\mu_1}{\mu_2} \right| \le \frac{Y_2 \delta_1 + |Y_1| \delta_2}{Y_2 (Y_2 - \delta_2)}. $$

Proof. We have
$$ \left| \frac{Y_1}{Y_2} - \frac{\mu_1}{\mu_2} \right| = \frac{ |Y_1 \mu_2 - \mu_1 Y_2| }{ Y_2 \mu_2 }. $$
The set $A = [Y_1 - \delta_1, Y_1 + \delta_1] \times [Y_2 - \delta_2, Y_2 + \delta_2]$ contains the value of $(\mu_1, \mu_2)$ that maximizes the numerator of the rightmost term. Observe that $|Y_2 \mu_1 - Y_1 \mu_2| \le Y_2 \delta_1 + |Y_1| \delta_2$ for any $(\mu_1, \mu_2) \in A$; since also $\mu_2 \ge Y_2 - \delta_2 > 0$, the asserted inequality follows.

Proof of Theorem 3. By (1.19), we have
$$ P\bigl\{ |\tilde Y'_n(h) - \mu(h)| \le (b_h - a_h)\, \epsilon_{n,p} \bigr\} \ge (1+p)/2 \qquad \text{for } h = f, g. $$
It then follows from Bonferroni's inequality (see Miller [16, p. 8]) that
$$ P\bigl\{ |\tilde Y'_n(h) - \mu(h)| \le (b_h - a_h)\, \epsilon_{n,p} \text{ for } h = f, g \bigr\} \ge p. $$
The desired result now follows from Lemma 1.
; p(q) f zbn q (y) : n 2 2 For xed q = (q1 q2 : : : qn ) 2 Qn and y = (y1 y2 : : : yn ) 2 <n, observe that X ; zbn q (y) n 1 0 n n X n n X X X @1 X 2 2 q ( q ; 1) i i qiy2 (i) ; n(n ; 1) qi qj y(i) y(j ) ; n(n ; 1) y2 (i)A = n 2 i=1 i=1 j =i+1 i=1 n 0 1n n n X n n X XX X 4( n ; 2)! qiqj A yi yj = (n ; 1)! yi2 ; @ n(n ; 1) i=1 j =i+1 i=1 j =i+1 i=1 !X n n X 2 2 ; n(n(n;;1)!1) qi(qi ; 1) i=1 n X i=1 ! yi2 = n n;! 1 n ; n1 qi2 zn (y): i=1 Denoting the identity function by h, we see that g n (y h) = czn (y) for some constant c. It follows from (4.4) and the unbiasedness of both Zn and Zn for 2 that E g (Y h) = E z (Y )] = m ; 1 E z (Y ) 0 n 0 n m 25 n 0 so that c = (m ; 1)=m and thus gn (y h) = zn (y). Using Jensen's inequality, we have 0 ; g n (y f ) = p(q )f zbn q (y) q Qn n 0 1 X X ; f@ p(q )zbn q (y) A q Qn n ; = f gn (y h) ; = f z (y) : X X 2 2 2 2 n 0 Since y is arbitrary, ; E f (Zn )] = E g n (Y f ) E f zn(Y ) = E f (Zn ) 0 0 0 0 and the desired result follows. Proof of Theorem 4. Fix t > 0 and n 2 f 2 3 : : : m g. We have h i h i P Zn ; 2 t e h( 2 +t) E ehZn e h( 2 +t) E ehZn 0 0 ; ; for h > 0, where the rst inequality follows from (4.3) and the second inequality follows from Proposition 1. Krat and Schmitz 13] show that h i inf e h( 2 +t) E ehZn e 8 n=2 t2 =(b a)4 h 0 ; ; b c ; so that P Zn ; 2 t e 8 n=2 t2 =(b a)4 : 0 ; b c ; A symmetric argument establishes the same bound for P 2 ; Zn t , and the inequality in (1.38) follows. An argument as in (1.35) then establishes (1.39). 0 Proof of Theorem 5. Fix > 0, n 2 f 2 3 : : : m g, and k 2 f 2 3 : : : jS j ^ n g. Since for any random variable 0 X 1 we have 2) = 1 Var X ] = E X 2 ; E 2 X ] E X ] ; E 2 X ] 0max ( u ; u (4.5) u 1 4 it follows that 2 S j(b ; a)2 d(b ; a)2 : 2 (S ) (b ;4 a) j4( jS j ; 1) 4(d ; 1) 26 Using this result, Theorem 4, and an argument as in the derivation of (1.29), we nd that 2 d d ( b ; a ) 2 P Zn (S ) ; (S ) d ; 1 + 4(d ; 1)2 In = k 2 (S ) d 2 P Zn (S ) ; (S ) d ; 1 + (d ; 1) In = k = P d ;d 1 Zn (S ) ; 2 (S ) In = k jS j ; 1 2 P Z (S ) ; (S ) I = k 00 0 00 0 00 ; b Similarly, n 2 8 k=2 =(b a)4 00 jS j e 0 c 0 ; d ; d(b ; a)2 I = k P Zn d ; 1 4(d ; 1)o2 n n P Zn (S ) ; 2 (S ) ; In = k jS j ; 1 2 P jS j Zn (S ) ; (S ) ; In = k e 8 k=2 2=(b a)4 00 (S ) ; 2 (S ) ; 0 00 ; b P jZn 00 (S ) ; 2 (S )j 0 00 Thus, n c 0 ; d + d(b ; a)2 I = k 2e 8 k=2 2 =(b a)4 d ; 1 4(d ; 1)2 n 0 ; b c ; and (1.41) follows directly. The inequality in (1.42) now follows by an argument as in (1.35). Acknowledgements The author wishes to thank Joe Hellerstein, whose work helped motivate this paper. The author also beneted from comments by Hadas Shachnai and an anonymous referee of an earlier version of this paper. References 1] P. Billingsley, \Probability and Measure," second ed., Wiley, New York, 1986. 2] W. G. Cochran, \Sampling Techniques," third ed., Wiley, New York, 1977. 3] H. Cramer, \Mathematical Methods of Statistics," Princeton University Press, Princeton, New Jersey, 1946. 27 4] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz, Bifocal sampling for skewresistant join size estimation, in \Proc. 1996 ACM SIGMOD Intl. Conf. Managment of Data," pp. 271{281, ACM Press, 1996. 5] P. J. Haas, Large-sample and deterministic condence intervals for online aggregation, in \Proc. Ninth Intl. Conf. Scientic and Statist. Database Management," pp. 51{63, IEEE Computer Society Press, 1997. 6] P. J. Haas and J. M. Hellerstein, Ripple joins for online aggregation, in \Proc. 1999 ACM SIGMOD Intl. Conf. Managment of Data," pp. 
References

[1] P. Billingsley, "Probability and Measure," second ed., Wiley, New York, 1986.
[2] W. G. Cochran, "Sampling Techniques," third ed., Wiley, New York, 1977.
[3] H. Cramér, "Mathematical Methods of Statistics," Princeton University Press, Princeton, New Jersey, 1946.
[4] S. Ganguly, P. B. Gibbons, Y. Matias, and A. Silberschatz, Bifocal sampling for skew-resistant join size estimation, in "Proc. 1996 ACM SIGMOD Intl. Conf. Management of Data," pp. 271-281, ACM Press, 1996.
[5] P. J. Haas, Large-sample and deterministic confidence intervals for online aggregation, in "Proc. Ninth Intl. Conf. Scientific and Statist. Database Management," pp. 51-63, IEEE Computer Society Press, 1997.
[6] P. J. Haas and J. M. Hellerstein, Ripple joins for online aggregation, in "Proc. 1999 ACM SIGMOD Intl. Conf. Management of Data," pp. 287-298, ACM Press, 1999.
[7] P. J. Haas, J. F. Naughton, S. Seshadri, and A. N. Swami, Selectivity and cost estimation for joins based on random sampling, J. Comput. System Sci. 52 (1996), 550-569.
[8] P. J. Haas, J. F. Naughton, and A. N. Swami, On the relative cost of sampling for join selectivity estimation, in "Proc. Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems," pp. 14-24, ACM Press, 1994.
[9] J. M. Hellerstein, P. J. Haas, and H. J. Wang, Online aggregation, in "Proc. 1997 ACM SIGMOD Intl. Conf. Management of Data," pp. 171-182, ACM Press, 1997.
[10] W. Hoeffding, A class of statistics with asymptotically normal distribution, Ann. Math. Statist. 19 (1948), 293-325.
[11] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. 58 (1963), 13-30.
[12] W. Hou, G. Ozsoyoglu, and B. Taneja, Processing aggregate relational queries with hard time constraints, in "Proc. 1989 ACM SIGMOD Intl. Conf. Management of Data," pp. 68-77, ACM Press, 1989.
[13] O. Krafft and N. Schmitz, A note on Hoeffding's inequality, J. Amer. Statist. Assoc. 64 (1969), 907-912.
[14] R. J. Lipton and J. F. Naughton, Query size estimation by adaptive sampling, in "Proc. Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems," pp. 40-46, ACM Press, 1990.
[15] R. J. Lipton, J. F. Naughton, and D. A. Schneider, Practical selectivity estimation through adaptive sampling, in "Proc. 1990 ACM SIGMOD Intl. Conf. Management of Data," pp. 1-11, ACM Press, 1990.
[16] R. G. Miller, Jr., "Simultaneous Statistical Inference," second ed., Springer-Verlag, New York, 1981.
[17] P. K. Pathak, Sufficiency in sampling theory, Ann. Math. Statist. 35 (1964), 795-808.
[18] P. K. Pathak, An extension of an inequality of Hoeffding, Studia Sci. Math. Hungar. 10 (1975), 73-74.
[19] S. Seshadri, "Probabilistic Methods in Query Processing," Ph.D. Dissertation, Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, 1992.