Optimal Histograms for Limiting Worst-Case
Error Propagation in the Size of Join Results
YANNIS E. IOANNIDIS
University of Wisconsin
and
STAVROS CHRISTODOULAKIS
Technical University of Crete

Many current relational database systems use some form of histograms to approximate the frequency distribution of values in the attributes of relations and on this basis estimate query result sizes and access plan costs. The errors that exist in the histogram approximations directly or transitively affect many estimates derived by the database system. We identify the class of serial histograms and its subclass of end-biased histograms; the latter is of particular interest because such histograms are used in several database systems. We concentrate on equality join queries without function symbols where each relation is joined on the same attribute(s) for all joins in which it participates. Join queries of this restricted type are called t-clique queries. We show that the optimal histogram for reducing the worst-case error in the result size of such a query is always serial. For queries with one join and no function symbols (all of which are vacuously t-clique queries), we present results on finding the optimal serial histogram and the optimal end-biased histogram based on the query characteristics and the frequency distributions of values in the join attributes of the query relations. Finally, we prove that for t-clique queries with a very large number of joins, high-biased histograms (which form a subclass of end-biased histograms) are always optimal. To construct a histogram for the join attribute(s) of a relation, the values in the attribute(s) must first be sorted based on their frequency and then assigned into buckets according to the optimality results above.

Categories and Subject Descriptors: G.1.0 [Numerical Analysis]: General—error analysis; H.1.1 [Models and Principles]: Systems and Information Theory; H.2.4 [Database Management]: Systems—query processing
General Terms: Performance, Theory
Additional Key Words and Phrases: Histograms, join size estimation, query optimization, vector majorization
This work was partially supported by the National Science Foundation under grants IRI-9113736 and IRI-9157368 (PYI Award) and by grants from DEC, HP, and AT&T.
Authors' addresses: Y. E. Ioannidis, Computer Sciences Dept., University of Wisconsin, Madison, WI 53706; S. Christodoulakis, Electronic and Computer Eng. Dept., Technical University of Crete, 73100 Chania, Crete, Greece.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1993 ACM 0362-5915/93/1200-0709 $3.50
ACM Transactions on Database Systems, Vol. 18, No. 4, December 1993, Pages 709-748.
1. INTRODUCTION
Query optimizers of relational database systems decide on the most efficient access plan for a given query based on a variety of statistics on the contents of the database relations that the system maintains. These are used to estimate the values of several parameters of interest that affect the decision of the optimizer [SAC+79]. Histograms are the most common type of maintained statistics, containing the number of tuples in a relation for each of several subsets of values (buckets) in an attribute.

Usually, the information contained in a histogram represents an inaccurate picture of the actual contents of the database. This is due to two reasons: first, for each subset of values in an attribute, only aggregate information is captured in the histogram; second, as the database is updated, the information becomes obsolete if it is not appropriately updated as well. Hence, the query optimizer uses erroneous data to accomplish its task.

This would not be much of a problem if the desired estimates were derived by applying some simple functions on the erroneous statistics only once. This is not the case, however, for many complex queries that are processed as a sequence of many simpler operations, e.g., multi-join queries processed as a sequence of 2-way joins.
In that case, the query optimizer
must estimate
various
parameters
of the intermediate
results
of the operations,
and then
use the obtained
values
to estimate
the corresponding
parameters
of the
results
of subsequent
operations.
Even if the original
errors in the statistics
maintained
by the database
system
are small,
their
transitive
effect on
estimates
derived
for parameters
of the complete
query may be devastating.
Consequently, the decision of the query optimizer may be wrong since it is based on data with large errors. This phenomenon where the errors in the original system statistics affect the error in the derived estimates is called error propagation and is one of the main issues that challenge current query optimizer technology.

There are several parameters whose inaccurate estimation can lead a query optimizer to wrong decisions.
Moreover,
there are several
operators
that may be present
in a query,
and each one is affected
by errors in its
operands
differently.
In this paper, we concentrate
on the relation
size and on
the join as the parameter
and the operator
of interest,
respectively.
This
choice is motivated
by their importance
in query optimization
and by their
sensitivity
to error propagation.

We investigate the optimality of histograms for limiting the error propagation in the estimates
of the result
sizes of a restricted
type of join queries.
Specifically,
we study equality
join queries without
function
symbols,
where
each relation
is joined
on the same attribute(s)
for all joins
in which
it
participates.
The focus of our work is on histograms
that accurately
record
the average
frequency
within
each bucket.
We identify
a class of such
histograms
and show that, for the specific type of join queries
studied,
the
optimal
histogram
for reducing
the worst-case
error in the size of such a
query is always in that class. For 2-way join queries with no function
symbols
(all of which are vacuously of the query type studied), we present several results on finding the optimal histogram in that class based on the query characteristics and the frequency distributions of values in the join attributes of the query relations. In that class, we also identify a specific type of histogram that is always optimal for very large queries. Since all of these results discuss histogram optimality with respect to a specific query, we conclude with some possible heuristics on choosing a histogram that is potentially effective on both large and small queries. In practice, constructing a histogram is straightforward.
The values in the join attribute(s)
of interest
must first be sorted based on their frequency,
and then assigned into buckets
according
to the criteria specified by the formal results presented in this paper.

We are aware of no work in the area of database query optimization, other than our own [4, 5], that deals with optimality or heuristics in the context of error propagation. There is an extensive literature on deriving good estimates for the parameters of the result of database operations, which has been surveyed by Mannino et al. [10]. This is not the case, however, for the effect of the errors in these estimates on the estimates for a sequence of such operations. The folklore has been that errors propagate exponentially, and therefore, beyond a certain point, the computed estimates are unreliable, but the problem has essentially been ignored. The primary reason has been the low complexity of the queries that current database systems have needed to face. As the query complexity increases in future database applications, however, this can no longer be the case. In hindsight, it becomes apparent that a better understanding of error propagation is needed even for the currently common, low complexity queries, where errors can grow enough to cause erroneous decisions by the optimizers [3, 8, 9, 17].

Although used in many systems, the formal
properties
of histograms
have
not been studied
extensively.
In addition,
the few pieces of work of which we
are aware deal with histograms
in the context of single operations,
primarily
selection.
Specifically,
Piatetsky-Shapiro
and Connell
[15] dealt
with
the
effect of histograms on reducing the error for selection queries. They studied two classes of histograms: in an "equi-width" histogram, the number of attribute values associated with each bucket is the same; in an "equi-depth" (or "equi-height") histogram, the total number of tuples having the attribute values associated with each bucket is the same. Their main result showed that equi-width histograms have a much higher worst-case and average error for a variety of selection queries than equi-depth histograms.
Muralikrishna
and DeWitt
[13] extended
this study to include
multidimensional
histograms
that are appropriate
for multiattribute
selection
queries. The details of those
studies and the assumptions
under which the above statement
holds are very
different
from
the
foundations
of our
work.
Several
other
researchers
have
dealt with “variable-width”
histograms
for selection
queries, where the buckets are chosen based on various
criteria
[6, 12, 14]. The survey by Mannino
et al. [10]
contains
various
references
to work
on choosing
the
appropriate
number
of buckets in a histogram
for sufficient
error reduction
in the area of
statistics.
That work deals primarily
with selections
as well. Histograms
for
single-join
queries
have been minimally
studied,
and then again without
emphasis on optimality [1, 7, 14]. Our work is different from all the above in that it deals with arbitrarily large join queries and for the most part discusses properties of histograms that are optimal for such queries.

This paper is organized as follows. Section 2 introduces some notation and states the assumptions made for the study of error propagation in this paper. It also gives some of the results from mathematics (majorization theory) that are used throughout the paper. Section 3 contains the basic definitions on histograms and some of their fundamental properties. Section 4 identifies a characteristic property of all histograms that are optimal for some query. The class of histograms that have this property is studied further in subsequent sections. In Section 5, criteria are provided for identifying the optimal histogram within that class for queries with one join and no function symbols. Similar results are also obtained for a special subclass of histograms of interest. In Section 6, the properties of optimal histograms for queries with a very large number of joins are studied, and a class of histograms that are asymptotically optimal (as the number of joins tends to infinity) is identified. Section 7 makes some general recommendations on how to choose histograms that limit the worst-case error propagation, based on the results of the previous sections. Finally, Section 8 summarizes our results and gives directions for future work.

2. MATHEMATICAL FOUNDATIONS AND PROBLEM FORMULATION

2.1 Majorization Theory

We present some important results from the mathematical theory of majorization, which will prove to be useful in studying the effect of histograms on limiting the error propagation. They are all taken from Marshall and
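Olkin [11]. In what follows, an $M$-vector $\underline{a}$ whose components are $a_i$, $1 \le i \le M$, is denoted by $\underline{a} = (a_1, \ldots, a_M)$ or by $\underline{a} = (a_i)$. The components of all vectors are nonnegative reals. A vector $\underline{a}$ is nonincreasing when, for all $1 \le i < M$, the inequality $a_i \ge a_{i+1}$ holds. Finally, for two vectors $\underline{a}$ and $\underline{b}$, their inner product is defined as

$$\underline{a} * \underline{b} = \sum_{i=1}^{M} a_i b_i .$$

This may also be generalized for an arbitrary number $N$ of vectors $\underline{a}^{(j)}$, $1 \le j \le N$, whose inner product is defined as

$$*(\underline{a}^{(1)}, \ldots, \underline{a}^{(N)}) = \sum_{i=1}^{M} \prod_{j=1}^{N} a_i^{(j)} .$$

We have taken the liberty to use $*$ both in the infix notation as a binary operator on vectors and in the functional notation as an $N$-ary operator on vectors for arbitrary $N$. The relationship between the two uses is straightforward: $\underline{a} * \underline{b} = *(\underline{a}, \underline{b})$.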
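The two uses of the inner product operator can be sketched in a few lines of Python. This is only an illustration: the function names `inner_product` and `is_nonincreasing` are hypothetical and do not come from the paper.

```python
from math import prod


def inner_product(*vectors):
    """N-ary inner product: the sum, over all positions i, of the product
    of the i-th components of every vector (hypothetical helper name)."""
    if not vectors:
        raise ValueError("at least one vector is required")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same number of components")
    # zip(*vectors) groups the i-th components of all vectors together.
    return sum(prod(components) for components in zip(*vectors))


def is_nonincreasing(vector):
    """True when each component is at least as large as the next one."""
    return all(x >= y for x, y in zip(vector, vector[1:]))
```

With two arguments the function plays the role of the infix operator, and with more arguments it plays the role of the functional notation.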
Definition 2.1. For two $M$-vectors $\underline{a} = (a_i)$ and $\underline{b} = (b_i)$ with nonnegative components, $\underline{a}$ majorizes $\underline{b}$ if

$$\sum_{i=1}^{k} a_i \ge \sum_{i=1}^{k} b_i \ \text{ for all } 1 \le k < M, \qquad \sum_{i=1}^{M} a_i = \sum_{i=1}^{M} b_i .$$

There are several important inequalities that can be derived based on the majorization property. The most significant one for this paper is expressed by the following theorem.

THEOREM 2.1. If $\underline{x}$ is a nonincreasing vector and $\underline{a}$ majorizes $\underline{b}$, then $\underline{a} * \underline{x} \ge \underline{b} * \underline{x}$.

Example 2.1. As an example of the above theorem, consider the vectors $\underline{x} = (3, 2, 1)$, $\underline{a} = (10, 5, 1)$, and $\underline{b} = (1, 5, 10)$. One can easily verify that the premises of Theorem 2.1 are satisfied. The same holds for the conclusion of the theorem, since $\underline{a} * \underline{x} = 41$, whereas $\underline{b} * \underline{x} = 23$.

The above theorem can be extended to inner products of multiple vectors.

THEOREM 2.2. If for all $1 \le j \le N$, $\underline{a}^{(j)}$ majorizes $\underline{b}^{(j)}$, then the inequality $*(\underline{a}^{(1)}, \ldots, \underline{a}^{(N)}) \ge *(\underline{b}^{(1)}, \ldots, \underline{b}^{(N)})$ holds.

2.2 Problem Formulation

In this paper, we use the term join as an operator that combines tuples of two relations based on some condition satisfied by the tuples. In general, such a condition will be a conjunction of simple relationships, each relationship involving an attribute of the first relation and an attribute of the second relation. For example, if R0, R1, R2 are relation names and a, b are attribute names of these relations, a query whose qualification is

    (R0.a = R1.a and R0.b = R1.b)        (1)

is a query with one join, whereas a query whose qualification is

    (R0.a = R1.a) and (R0.b = R2.b)      (2)

is a query with two joins. In both qualifications above, the condition inside each parenthesis is a join. To avoid potential confusion with multiple use of the term "value," we refer to the values of the join attributes of relations as join elements.

Consider a tree query of N joins in which relations R0, ..., RN participate. In this study, we make the following assumptions about the form of the query:

(A1) All joins are equality joins with no function symbols.

(A2) Each relation is joined on exactly the same attribute(s) for all joins in which it participates. That is, even if a relation is joined with multiple other relations, each join is always on the same attribute(s).
Queries that satisfy the above assumptions are called t-clique queries, since their query graph would be a clique due to the transitivity of equality. For example, a query with qualification (1) is a t-clique query, but a query with qualification (2) is not, because R0 participates in the join with R1 with attribute a and in the join with R2 with attribute b, i.e., (A2) is violated. Similarly, a query with qualification

    (R0.a = R1.a and R0.b = R1.b) and (R0.a = R2.a and R0.b = R2.b)        (3)

is a t-clique query, because R0 is joined with both R1 and R2 based on both attributes a and b. Finally, a query with qualification

    (R0.a + R0.b = R1.a + R1.b)

is not a t-clique query, because the join involves the arithmetic function +, i.e., (A1) is violated. Note that all queries with one join (often referred to as 2-way join queries because they involve two relations) satisfy (A2) by definition. Hence, for
2-way equality join queries with no function symbols, the results presented in this paper are completely general.

An obvious implication of (A2) is that all relations participate in joins with the same number of attributes of compatible types. We view this common set of attributes as a single attribute of potentially tuple form and refer to it as the join attribute of the query. Based on the above, the values of the join attribute, i.e., the join elements, may be of tuple form as well. For example, in both (1) and (3), the combination of attributes a, b is considered as a single attribute whose values are tuples (pairs) of atomic values. The join domain $\mathcal{D}$ of a t-clique query is the set of all join elements that could potentially appear in the join attribute of any relation of the query. All forthcoming results are independent of the number of attributes that form the join attribute. We assume some arbitrary numbering of the join elements in the join domain, so that referring to the i-th join element is meaningful. The following parameters are of interest:

    M         The size of the join domain.
    t_{ij}    The number of tuples in R_j whose join attribute contains the i-th join element of the join domain, 1 <= i <= M, 0 <= j <= N. This is called the frequency of the i-th join element of the join domain in R_j.
    S         The size of the result relation of the query.
For simplicity, given the bijection between $\mathcal{D}$ and the set $\{1, 2, \ldots, M\}$, we treat the latter as the join domain itself, i.e., $\mathcal{D} = \{1, 2, \ldots, M\}$. Whenever we need to distinguish between the two, we use the term "actual join domain" to refer to the former. We should point out, however, that the bijection is arbitrary and thus not necessarily order preserving. For example, if the actual join domain is a finite subset of the integers, reals, or bounded-length character strings, the i-th join element based on the natural ordering of the join domain is not necessarily associated with i.

For each relation $R_j$, $0 \le j \le N$, the vector $\underline{t}_j = (t_{1j}, \ldots, t_{Mj})$ is called the frequency distribution in $R_j$. Occasionally, it is also useful to treat all the frequencies in $\underline{t}_j$ as a collection, ignoring the join element with which each frequency is associated. That collection is in general a multiset (i.e., it may contain duplicates), is called the frequency set of $R_j$, and is denoted by $\mathcal{M}_j$. For example, if $\underline{t}_0 = (10, 3, 7, 3)$ and $\underline{t}_1 = (3, 3, 10, 7)$, then $\mathcal{M}_0 = \mathcal{M}_1 = \{10, 7, 3, 3\}$.

Clearly, for t-clique queries, S is given by the following formula:

$$S = *(\underline{t}_0, \ldots, \underline{t}_N) = \sum_{i=1}^{M} \prod_{j=0}^{N} t_{ij} . \qquad (4)$$

That is, the size S of the result of a t-clique query is equal to the inner product of the frequency distributions of the participating relations. Theorem 2.2 can be directly applied on (4) to derive the following:

THEOREM 2.3. Consider a t-clique query Q with relations $R_j$, $0 \le j \le N$. Let the frequency sets $\mathcal{M}_j$, $0 \le j \le N$, be given for all relations. The result size of Q is maximized when, for all $0 \le j \le N$, the corresponding frequency vector $\underline{t}_j$ is a nonincreasing vector.

PROOF. Let $\underline{t}_j$, $\underline{t}_j'$ be two potential frequency vectors for $R_j$ (whose components are the elements of $\mathcal{M}_j$). If $\underline{t}_j$ is a nonincreasing vector, then clearly, $\underline{t}_j$ majorizes $\underline{t}_j'$. Hence, by Theorem 2.2 and formula (4), the result size of Q is maximized when the frequency vectors of all relations are nonincreasing. □

Most often database systems have inaccurate knowledge of the frequency distributions of the query relations. Therefore, the estimate that they derive for the query size S is inaccurate as well, and this affects the decisions of their optimizers.
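Example 2.1, formula (4), and Theorem 2.3 can be checked numerically with a small sketch. It assumes that majorization is tested on prefix sums taken in the given component order, which is the reading under which a nonincreasing arrangement majorizes every other arrangement of the same frequency set; the names `majorizes`, `result_size`, and `max_result_size` are hypothetical.

```python
from itertools import accumulate, permutations
from math import prod


def majorizes(a, b):
    """Prefix-sum test (an assumed reading of Definition 2.1): every prefix
    sum of a is at least the matching prefix sum of b, and totals are equal."""
    if len(a) != len(b) or not a:
        return False
    prefix_a, prefix_b = list(accumulate(a)), list(accumulate(b))
    return prefix_a[-1] == prefix_b[-1] and all(
        x >= y for x, y in zip(prefix_a, prefix_b)
    )


def result_size(*freq_vectors):
    """Formula (4): sum over join elements of the product of frequencies."""
    return sum(prod(column) for column in zip(*freq_vectors))


def max_result_size(freq_set_0, freq_set_1):
    """Brute force over all arrangements of two frequency sets (2-way join)."""
    return max(
        result_size(p0, p1)
        for p0 in permutations(freq_set_0)
        for p1 in permutations(freq_set_1)
    )


# Example 2.1: x is nonincreasing and a majorizes b.
x, a, b = (3, 2, 1), (10, 5, 1), (1, 5, 10)
example_values = (majorizes(a, b), result_size(a, x), result_size(b, x))

# Frequency set {10, 7, 3, 3}: compare one arbitrary arrangement with the
# arrangement in which both frequency vectors are nonincreasing.
arbitrary = result_size((10, 3, 7, 3), (3, 3, 10, 7))
both_sorted = result_size((10, 7, 3, 3), (10, 7, 3, 3))
best = max_result_size((10, 7, 3, 3), (10, 7, 3, 3))
```

For this frequency set the exhaustive maximum coincides with the size obtained when both vectors are nonincreasing, which is what Theorem 2.3 states.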
Definition
2.2.
Suppose
that
a certain
quantity
has
a definite
value
A
whereas the database
system approximates
it by the value A’. The difference
A – A’ is the exact error and the fraction
( A – A’ )/A’
is the relative error in
the approximate
value A’.
One could have used the fraction
(A – A’)/A
in defining
the relative
error
in A’, instead
of following
Definition
2.2. Our choice was motivated
by a
desire to measure
error based on the value that the database
system knows,
which is A’. In addition,
there is a very simple relationship
between
the two
types of relative
error, which can be used to derive the value of one given the
value of the other. Specifically, if $\varepsilon_1 = (A - A')/A'$ and $\varepsilon_2 = (A - A')/A$, then the following holds:

$$\varepsilon_2 = \frac{\varepsilon_1}{1 + \varepsilon_1} .$$

Note that as $\varepsilon_1$ varies from $-1$ to $\infty$, $\varepsilon_2$ varies from $-\infty$ to 1. Also note that the values of the two types of relative errors are very close to each other when they are very small (close to 0). Hence, when the database system achieves its goal of maintaining small enough errors, the definition that is adopted for the relative error plays no significant role. For the above reasons, all the results on relative error in this paper are derived based on Definition 2.2.

For any quantity of interest, the potentially erroneous value used by the database system is denoted by the same symbol as the correct value with an additional prime superscript. For example, the approximation of the frequency distribution $\underline{t}_j$ used by the database system is denoted by $\underline{t}_j' = (t_{1j}', \ldots, t_{Mj}')$ and the corresponding estimated result size is denoted by $S'$. In the sequel, we concentrate on relative errors. If no confusion arises, we occasionally use the term "error" alone, with the intended meaning being "relative error."

For a given collection of $\underline{t}_j'$, $0 \le j \le N$, let $D = (S/S') - 1$ be the corresponding relative error in the estimated size of the query result. By (4), this implies that for t-clique queries

$$D = \frac{\sum_{i=1}^{M} \prod_{j=0}^{N} t_{ij}}{\sum_{i=1}^{M} \prod_{j=0}^{N} t_{ij}'} - 1 . \qquad (5)$$

Note that for any fixed value of the estimated result size $S'$, Theorem 2.3 and (5) imply that the error D is maximized when for all $0 \le j \le N$, $\underline{t}_j$ is a nonincreasing vector. Moreover, it has been shown that, under the uniform distribution assumption, the error grows exponentially with N [4, 5]. (This holds for many other approximations of the frequency distributions as well.)

The focus of our attention is on reducing D for databases where such worst-case behavior is exhibited, i.e., where all frequency vectors are nonincreasing. The particular method that we study is maintaining appropriately chosen histograms on the frequency distributions.
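The two notions of relative error and the result size error D can be illustrated with a short sketch. The function names and the sample frequencies below are assumptions made for illustration only.

```python
from math import isclose, prod


def relative_error(actual, approx):
    """Definition 2.2: error relative to the value the system knows, A'."""
    return (actual - approx) / approx


def relative_error_wrt_actual(actual, approx):
    """The alternative definition, relative to the true value A."""
    return (actual - approx) / actual


def result_size(freq_vectors):
    """Formula (4) for a list of frequency vectors."""
    return sum(prod(column) for column in zip(*freq_vectors))


def size_error(actual_vectors, approx_vectors):
    """D = S / S' - 1 for a t-clique query."""
    return result_size(actual_vectors) / result_size(approx_vectors) - 1.0


# A = 120 approximated by A' = 100.
eps1 = relative_error(120, 100)
eps2 = relative_error_wrt_actual(120, 100)

# A relation with frequencies (40, 30, 20, 10) joined with itself, where
# the system assumes a uniform distribution (25 tuples per join element).
actual = [(40, 30, 20, 10)] * 2
uniform = [(25, 25, 25, 25)] * 2
d = size_error(actual, uniform)
```

In this sample, the actual result size is 3000 and the estimate is 2500, so the system underestimates the size by a relative error of 0.2.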
2.3 An Example
Formula
(5) holds for arbitrary
frequency
distributions.
To obtain
a better
feeling for the magnitude
of the error propagation,
we apply (5) to a specific
database
instance,
which
will also be our running
example
for the entire
paper. In particular,
we examine
the case where the actual frequency
distributions
are Zipf [2, 18]. The main characteristic
of the Zipf distribution
is
that it assigns high frequencies
to few join elements
and low frequencies
to
most join elements.
Thus, this example
deals with a quite common
special
case, since the above is claimed to be a characteristic of the distribution in many databases.

Assume that all relations in the database are equal to each other and that their frequency distribution is Zipf, i.e., for all j,

$$t_{ij} = T_j \, \frac{1/i^{z}}{\sum_{k=1}^{M} 1/k^{z}}, \quad \text{for all } 1 \le i \le M. \qquad (6)$$

In (6), $T_j$ is the size of $R_j$ in tuples, and we assume that it is equal to 10000 for all relations. Furthermore, we assume that the join domain contains M = 100 join elements. Figure 1 is a graphical representation of (6) for z = 0.0, 0.02, ..., 0.1. One can see that the deviation from the uniform distribution increases with z, but it is not very dramatic for the range depicted.

[Fig. 1. Zipf frequency distribution. Horizontal axis: Join Element (0 to 100); vertical axis: Relative Error (-0.1 to 0.5); one curve for each of z = 0.00, 0.02, 0.04, 0.06, 0.08, 0.10.]

Suppose that the database system uses the Zipf distribution with z = 0 (uniform) as the approximation to the actual distribution. Figure 2 is a graphical representation of equation (5) for that case. Specifically, the relative error in the query result size is shown as a function of the number of joins for various values of z. The observed results are rather discouraging. Even small errors in the individual relations propagate in the query result growing at an exponential rate and generating a final error that very quickly becomes intolerable. Note that by Theorem 2.3, this situation produces worst-case error, since all frequency distributions are nonincreasing vectors.

[Fig. 2. Join result size error under uniform approximation for Zipf distributions. Horizontal axis: Number of Relations (N+1), from 2 to 30; vertical axis: Result Size Error; curves labeled by z.]

3. BASIC DEFINITIONS AND PROPERTIES OF HISTOGRAMS

Among commercial systems today, maintaining histograms is a very common approach to approximating frequency distributions [17]. In histograms, the join domain is partitioned into buckets, and a uniform distribution is assumed within each bucket. Hence, based on our convention about the join domain, a bucket is a subset of $\mathcal{D} = \{1, 2, \ldots, M\}$. The approximate frequency distribution captured by a histogram is called the histogram distribution.
We
should emphasize
the fact that buckets
that do not represent
a contiguous
range
in the join
domain
are perfectly
valid,
e.g., bucket
{1, 3}. The join
domain
as defined
provides
no indication
of how the join elements
should be
grouped
in buckets.
The numbering
of the join elements
based on the bijection between
the actual join domain
and {1,2, . . . . M} (Section
2.2) has been
arbitrary and does not reflect, for example, a natural value-based ordering of the join domain. Also note that maintaining the necessary information for the uniform distribution assumption over the entire join domain is equivalent to maintaining a histogram with a single bucket. Such a histogram is called trivial.
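The trivial histogram is the uniform approximation used in the running Zipf example of Section 2.3, so its error growth can be reproduced with a small sketch. The sketch assumes the usual normalized form of the Zipf distribution (frequency of the i-th element proportional to 1/i^z) and identical distributions in all relations; the function names are hypothetical, and the plotted values of the original figures are not reproduced here.

```python
def zipf_frequencies(total_tuples, num_elements, z):
    """Frequencies proportional to 1 / i**z, scaled to sum to total_tuples
    (assumed form of the Zipf distribution; z = 0 gives the uniform case)."""
    weights = [1.0 / (i ** z) for i in range(1, num_elements + 1)]
    norm = sum(weights)
    return [total_tuples * w / norm for w in weights]


def trivial_histogram(freqs):
    """Single-bucket histogram: every join element gets the average frequency."""
    average = sum(freqs) / len(freqs)
    return [average] * len(freqs)


def uniform_approximation_error(freqs, num_relations):
    """D = S / S' - 1 when all relations share the same frequency
    distribution and the system uses the trivial histogram for each."""
    approx = trivial_histogram(freqs)
    actual_size = sum(t ** num_relations for t in freqs)
    estimated_size = sum(t ** num_relations for t in approx)
    return actual_size / estimated_size - 1.0


skewed = zipf_frequencies(10000, 100, 0.1)
errors = [uniform_approximation_error(skewed, n) for n in range(2, 31, 4)]
uniform_case = uniform_approximation_error(zipf_frequencies(10000, 100, 0.0), 10)
```

The list `errors` holds the relative error for 2, 6, ..., 30 relations and grows with the number of relations, while the error is zero when the actual distribution is itself uniform.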
In this study, we assume that the histograms maintained by the system have an accurate record of the average frequency within each bucket, i.e., for any bucket b in the histogram of $R_j$, $0 \le j \le N$, for all $i \in b$, $t_{ij}' = \sum_{k \in b} t_{kj} / |b|$. Thus, we concentrate on errors that arise from aggregating the frequency distributions and not on ones that arise from delaying the propagation of database updates to the histogram. The reason is that the former are the only types of error that can be controlled by an appropriate choice of histograms. The latter can only be controlled by an appropriate schedule of database update propagation to histograms, which is beyond the scope of this paper.

The following lemma establishes a majorization relationship between a frequency distribution and any corresponding histogram distribution.

LEMMA 3.1. If $\underline{t}$ is nonincreasing, then $\underline{t}$ majorizes any corresponding histogram distribution $\underline{t}'$.

PROOF. Within each bucket, the histogram distribution is constant and equal to the average of the frequencies in that bucket, so the sum of the frequencies in each bucket remains the same. Because $\underline{t}$ is nonincreasing, the members of a bucket that fall among the first k join elements, $1 \le k \le M$, carry the largest frequencies of that bucket, so their sum is at least the sum of the corresponding averages. Hence, $(t_i)$ majorizes $(t_i')$. □

Combining Theorems 2.2 and 2.3 with Lemma 3.1 yields the following implications. For any given collection of frequency sets, among all possible associations of frequencies to join elements, the one in which the frequency distributions of all relations are nonincreasing corresponds to the maximum actual result size (Theorem 2.3), and in that case no histogram produces an approximate result size that is larger than the actual one. Hence, if we concentrate on the worst-case error, i.e., when the frequency distributions of all relations are nonincreasing, reducing the error reduces to identifying the histograms that produce the maximum approximate result size, and a histogram is optimal in that sense. Unfortunately, this is not generally applicable to other associations of frequencies to join elements, for which a different histogram may be optimal; we attempt no investigation of those cases.

For simplicity of presentation and without loss of generality, we choose
the bijection between the actual join domain and the set {1, 2, ..., M} (Section 2.2) so that it preserves the frequency-based ordering of the join elements. Thus, the k-th most frequent join element is associated with k. Because of the decision to use {1, 2, ..., M} as the join domain itself, the above can be restated simply as the following convention, which holds for the entire paper:

    The frequency distributions of all relations are nonincreasing.

Recall that, in general, the frequency-based ordering captured by the above convention is completely unrelated to any value-based ordering of the actual join domain.

Example 3.1. To illustrate the above definitions, conventions, and assumptions on histograms, consider the "canonical" EMP relation and focus on the dept attribute, which contains the name of the department of each employee. For simplicity, assume that there are four different departments and that the frequency of each department in the EMP relation is given in the following table.

Table I. Frequencies of Department Names in the EMP Relation

    Department Name    Number of Employees    Frequency-based Rank
    candy              10                     4
    jewelry            30                     2
    shoe               20                     3
    toy                40                     1

The assumption that the frequency distributions of all relations are nonincreasing implies the following: first, the frequency-based ordering of departments is toy > jewelry > shoe > candy for all relations that have the dept attribute; second, the bijection between the actual join domain and {1, 2, 3, 4}, which by convention is used as the join domain, associates each department to the frequency-based rank mentioned in Table I. Note that this frequency-based ordering is completely unrelated to the alphabetical ordering of departments based on their names. Figure 3(a) is a graphical representation of the nonincreasing frequency distribution of department names in EMP. In the x-axis, both the actual and the conventional join domains are given. Based on the usual convention [9, 15], if one were to build a histogram on the dept attribute, buckets would be formed by departments which are close in the alphabetical ordering of their names. An example of such a histogram with two 2-element buckets is shown in Figures 3(b) and 3(c). In the former, we show the original distribution and which join elements (or equivalently frequencies) are placed in which bucket. In the latter, we show the actual histogram distribution that is the result of averaging the frequencies in each bucket. Note that the buckets b1 = {1, 3} and b2 = {2, 4} are not contiguous ranges within {1, 2, 3, 4}. As another example, consider the histogram with two 2-element buckets that is shown in Figures 3(d) and 3(e), again depicted in the two ways discussed for the first histogram. In this case, the buckets b1 = {1, 2} and b2 = {3, 4} are contiguous ranges within {1, 2, 3, 4}, but do not group departments based on their names. In principle, all possible histograms are equally valid for the purposes of this paper, including the two above.
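The two histograms of Example 3.1 can be reproduced with a short sketch that averages the frequencies in each bucket. The function name `histogram_distribution` and the bucket representation (sets of 1-based join elements) are choices made for this illustration.

```python
def histogram_distribution(freqs, buckets):
    """Replace each frequency by the average frequency of its bucket.

    freqs   -- frequency distribution, position i-1 holds join element i
    buckets -- collections of 1-based join elements partitioning {1, ..., M}
    """
    m = len(freqs)
    covered = sorted(i for bucket in buckets for i in bucket)
    if covered != list(range(1, m + 1)):
        raise ValueError("buckets must partition the join domain {1, ..., M}")
    approx = [0.0] * m
    for bucket in buckets:
        average = sum(freqs[i - 1] for i in bucket) / len(bucket)
        for i in bucket:
            approx[i - 1] = average
    return approx


# Frequencies of toy, jewelry, shoe, candy (join elements 1 to 4).
emp = [40, 30, 20, 10]

# Buckets grouping alphabetically close names: {toy, shoe} and {jewelry, candy}.
alphabetical = histogram_distribution(emp, [{1, 3}, {2, 4}])

# Buckets grouping join elements that are close in frequency.
by_frequency = histogram_distribution(emp, [{1, 2}, {3, 4}])
```

The first call yields the distribution (30, 20, 30, 20) and the second yields (35, 35, 15, 15); both preserve the total of 100 employees.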
Based on the above, histogram optimality is defined as follows.

Definition 3.1. Consider a query Q with relations $R_j$, $0 \le j \le N$, associated with nonincreasing frequency distributions. For each relation $R_j$, let $\mathcal{H}_j$ be a collection of histograms of interest. The (N + 1)-vector $(H_j)$, where $H_j \in \mathcal{H}_j$, $0 \le j \le N$, is an optimal histogram vector for Q within $(\mathcal{H}_j)$, if the approximate result size of Q that it generates is greater than or equal to the approximate result size of Q that any other such histogram vector generates.
[Fig. 3. Frequency distribution and two histograms of department names in EMP. Panels (a) through (e); x-axis: Join Element (actual department names toy, jewelry, shoe, candy and conventional elements 1 to 4); y-axis: Number of Tuples.]
Note that optimality is defined per query and per collection of frequency distributions, and for the histograms of all relations together. The reason is that the optimal histograms differ for different queries and for different frequency distributions. In reality, one would like to identify one histogram per relation that is optimal for all queries and all frequency distributions of the remaining relations. Whenever possible, we present results of that form, but in general, we follow Definition 3.1 for optimality. Before we proceed with the investigation of optimal histograms, we define several types of histograms that play some significant roles in the rest of this paper.

Definition 3.2. A histogram bucket is called univalued if the frequencies associated with all the join elements in it are equal. Otherwise, the bucket is called multivalued. Note that any histogram captures the frequencies associated with a univalued bucket accurately.

Definition 3.3. A histogram with $L$ univalued buckets and one multivalued bucket is called biased. If the $L$ univalued buckets correspond to the join elements with the $L_1$ highest distinct frequencies and the $L_2$ lowest distinct frequencies for some $L_1$ and $L_2$ such that $L = L_1 + L_2$, then it is called end-biased. If $L_1 = L$ and $L_2 = 0$, then it is called high-biased. If $L_1 = 0$ and $L_2 = L$, then it is called low-biased.

We should point out that high-biased histograms are being used by some current systems for approximating the frequency distributions of some relations [Se189].
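Definitions 3.2 and 3.3 can be made concrete with a short classifier. The following is only a sketch: the function names and the dictionary representation of frequencies are assumptions of this illustration, and it further assumes that every distinct frequency held by a univalued bucket has its own bucket.

```python
from typing import Dict, List, Set


def bucket_is_univalued(bucket: Set[int], freq: Dict[int, float]) -> bool:
    """Definition 3.2: all join elements in the bucket share one frequency."""
    return len({freq[e] for e in bucket}) == 1


def classify_biased(buckets: List[Set[int]], freq: Dict[int, float]) -> str:
    """Classify a histogram following Definition 3.3 (illustrative names)."""
    multi = [b for b in buckets if not bucket_is_univalued(b, freq)]
    if len(multi) != 1:
        return "not biased"
    m = multi[0]
    top = max(freq[e] for e in m)      # highest frequency inside the multivalued bucket
    bottom = min(freq[e] for e in m)   # lowest frequency inside the multivalued bucket
    univalued = [b for b in buckets if b is not m]
    # Univalued buckets strictly above / strictly below the multivalued bucket.
    high = sum(1 for b in univalued if freq[next(iter(b))] > top)
    low = sum(1 for b in univalued if freq[next(iter(b))] < bottom)
    if high + low != len(univalued):
        return "biased"
    if low == 0:
        return "high-biased"
    if high == 0:
        return "low-biased"
    return "end-biased"


if __name__ == "__main__":
    freq = {1: 10, 2: 9, 3: 8, 4: 1}   # hypothetical frequencies
    print(classify_biased([{1}, {2, 3}, {4}], freq))
    print(classify_biased([{1}, {2, 3, 4}], freq))
```

A histogram whose univalued buckets sit strictly between frequencies of the multivalued bucket is reported as merely "biased", which matches the distinction the definition draws between biased and end-biased histograms.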
Definition 3.4. A histogram is called serial with respect to its buckets $b_1$ and $b_2$, if either $\forall i \in b_1, k \in b_2$, the inequality $i < k$ holds, or $\forall i \in b_1, k \in b_2$, the inequality $i > k$ holds. It is called serial if it is serial with respect to all pairs of its buckets.

For example, the histogram of Figures 3(d)–(e) is serial, while that of Figures 3(b)–(c) is not. Note that because we are dealing with nonincreasing frequency distributions, the inequality $i > k$ above is equivalent to the reverse inequality between the corresponding frequencies, i.e., $t_i \le t_k$. Also, note that end-biased histograms are serial.

4. CLASS OF OPTIMAL HISTOGRAMS

In searching for the optimal histograms, the following general result is useful.

LEMMA 4.1. Consider a t-clique query with relations $R_j$, $0 \le j \le N$, and assume that the histogram distributions of $R_j$, $1 \le j \le N$, are nonincreasing. For two different histograms $G$ and $H$ for $R_0$, if the histogram distribution under $G$ majorizes that of $H$, then the error under $G$ is no higher than the error under $H$.
PROOF. Since we are dealing with the worst case (nonincreasing frequency distributions), a straightforward consequence of Theorem 2.1 is that the approximate size that one obtains under $G$ must be larger than or equal to that under $H$. The approximate size is always less than or equal to the actual size $S$, which is its highest possible value. This implies that the error under $G$ is less than (or equal to) the error under $H$. □

Clearly, the optimal histogram is always the one with singleton buckets, one for each join element. To avoid such vacuous arguments and to make fair comparisons, we define several interesting spaces of histograms, where each space contains histograms that are equivalent with respect to the amount of information that they maintain. Then we identify the optimal histograms within each such space.

Definition 4.1. The bucket template $\mathcal{T}$ of a histogram is a pair $\mathcal{T} = (\beta, B)$, where $B$ is a set of integers that denote the cardinalities of the buckets and whose sum is equal to $M$, and $\beta = |B|$ is the histogram size, i.e., the number of buckets in the histogram. As an example, the bucket template of both histograms of Example 3.1 is $(2, \{2, 2\})$, since they have two buckets, each one of which has two join elements.

Definition 4.2. For any bucket template $\mathcal{T}$, the space $\mathcal{H}_{\mathcal{T}}$ is the set of histograms that conform to $\mathcal{T}$. Also, the space $\mathcal{H}_\beta$ is the set of histograms that have $\beta$ buckets, and the space $\mathcal{B}_\beta$ is the set of biased histograms that have $\beta$ buckets.
Note that $\mathcal{H}_{\mathcal{T}}$ is closed under join element exchange between buckets. That is, given $H \in \mathcal{H}_{\mathcal{T}}$ and two of its buckets $b_1$ and $b_2$, consider any histogram $H'$ that is identical to $H$ in all remaining buckets, and has $b_1$ and $b_2$ replaced by $b'_1$ and $b'_2$ such that $|b'_1| = |b_1|$, $|b'_2| = |b_2|$, and $b'_1 \cup b'_2 = b_1 \cup b_2$. Then $H' \in \mathcal{H}_{\mathcal{T}}$. Also note that $\mathcal{H}_{\mathcal{T}} \subseteq \mathcal{H}_\beta$, since $\mathcal{H}_{\mathcal{T}}$ requires that not only the number but also the size of the buckets be the same. Given the above, the following lemma and the subsequent theorem shed some light on histogram majorization within $\mathcal{H}_{\mathcal{T}}$. (Optimality results within $\mathcal{H}_{\mathcal{T}}$ will be given in the next section.)

LEMMA 4.2. Consider a bucket template $\mathcal{T}$ and a histogram $H \in \mathcal{H}_{\mathcal{T}}$ with two buckets $b_s$ and $b_t$, and let $s$ and $t$ be the average of the corresponding join element frequencies, respectively. Consider a histogram $G \in \mathcal{H}_{\mathcal{T}}$ that is serial with respect to $b_s$ and $b_t$ and is constructed from $H$ by exchanging elements of $b_s$ and $b_t$ and leaving all other buckets unchanged. Let the corresponding frequency averages for $b_s$ and $b_t$ be $s_1$ and $t_1$, respectively. Without loss of generality, assume that in $G$, $\forall i \in b_s, k \in b_t$, the inequality $i < k$ holds. If $s_1 \ge s \ge t_1$ and $s_1 \ge t \ge t_1$, then $G$ majorizes $H$.$^1$

$^1$ For any histogram $H$, three of the four inequalities always hold.

PROOF. Obviously, in what follows, we can ignore all other buckets and concentrate on $b_s$ and $b_t$. Thus, without loss of generality, we assume that $b_s$ and $b_t$ contain the first $|b_s| + |b_t|$ elements of the join domain. Assume that the frequencies in $H$ and $G$ are denoted by $h_i$ and $g_i$, respectively. Then the following three facts can be derived.

(a) Since the set of elements remains unchanged in the two histograms, the following holds:

$$\sum_{i=1}^{|b_s|+|b_t|} h_i = \sum_{i=1}^{|b_s|+|b_t|} g_i.$$

(b) If $K \le |b_s|$, then $\sum_{i=1}^{K} g_i \ge \sum_{i=1}^{K} h_i$, since for all $1 \le i \le K$, $g_i = s_1$ and $h_i$ either takes the value $s$ or the value $t$, both of which are no greater than $s_1$ by the premises of the lemma.

(c) If $|b_s| < K < |b_s| + |b_t|$, then the following holds:

$$\sum_{i=1}^{K} g_i = \sum_{i=1}^{|b_s|+|b_t|} g_i - \sum_{i=K+1}^{|b_s|+|b_t|} g_i \ge \sum_{i=1}^{|b_s|+|b_t|} h_i - \sum_{i=K+1}^{|b_s|+|b_t|} h_i = \sum_{i=1}^{K} h_i.$$

In the above, the inequality is due to two observations: the first sums of the expressions on its two sides are equal (fact (a)); for the second sums, for all $K + 1 \le i \le |b_s| + |b_t|$, $g_i = t_1$ and $h_i$ either takes the value $s$ or the value $t$, both of which are no less than $t_1$ by the premises of the lemma.

By Definition 2.1, points (a)–(c) prove that $G$ majorizes $H$. □
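The construction used in Lemma 4.2 can be tried out numerically. The sketch below assumes that Definition 2.1 is the usual partial-sum notion of majorization that facts (a) through (c) establish (equal totals, and every prefix sum of the sorted distribution of $G$ at least that of $H$); the function names and the sample frequencies are illustrative only.

```python
from itertools import accumulate
from typing import Dict, List, Set


def histogram_distribution(buckets: List[Set[int]], freq: Dict[int, float]) -> List[float]:
    """Replace each frequency by its bucket average; return in nonincreasing order."""
    approx: List[float] = []
    for b in buckets:
        avg = sum(freq[e] for e in b) / len(b)
        approx.extend([avg] * len(b))
    return sorted(approx, reverse=True)


def majorizes(g: List[float], h: List[float], eps: float = 1e-9) -> bool:
    """Partial-sum test: equal totals and every prefix sum of g >= that of h."""
    pg, ph = list(accumulate(g)), list(accumulate(h))
    return (len(g) == len(h)
            and abs(pg[-1] - ph[-1]) <= eps
            and all(a >= b - eps for a, b in zip(pg, ph)))


def serialize_pair(buckets: List[Set[int]], s: int, t: int) -> List[Set[int]]:
    """Exchange elements between buckets s and t, keeping their sizes, so that
    bucket s receives the smallest join elements (the highest frequencies)."""
    pooled = sorted(buckets[s] | buckets[t])
    out = [set(b) for b in buckets]
    k = len(buckets[s])
    out[s], out[t] = set(pooled[:k]), set(pooled[k:])
    return out


if __name__ == "__main__":
    freq = {1: 10, 2: 9, 3: 8, 4: 1}          # hypothetical nonincreasing frequencies
    H = [{1, 3}, {2, 4}]                       # nonserial, as in Figures 3(b)-(c)
    G = serialize_pair(H, 0, 1)                # serial, as in Figures 3(d)-(e)
    print(histogram_distribution(H, freq), histogram_distribution(G, freq))
    print(majorizes(histogram_distribution(G, freq), histogram_distribution(H, freq)))
```

With these sample frequencies the serialized histogram has distribution (9.5, 9.5, 4.5, 4.5) against (9, 9, 5, 5) for the original, so the prefix sums of the former dominate.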
THEOREM 4.1. For any bucket template $\mathcal{T}$ and any histogram $H \in \mathcal{H}_{\mathcal{T}}$, there is a serial histogram in $\mathcal{H}_{\mathcal{T}}$ that majorizes it.

PROOF. Consider two arbitrary buckets $b_s$ and $b_t$ in $H$. We first show that there is a histogram in $\mathcal{H}_{\mathcal{T}}$ that is serial with respect to $b_s$ and $b_t$ and that majorizes $H$. Specifically, consider histograms that can be constructed from $H$ by exchanging elements of $b_s$ and $b_t$ and leaving all other buckets unchanged. Again, in what follows, we ignore all other buckets and concentrate on $b_s$ and $b_t$. We distinguish two cases.

Case 1. $|b_s| \ne |b_t|$. In this case, there are only two histograms in $\mathcal{H}_{\mathcal{T}}$ that can be constructed as above and be serial with respect to $b_s$ and $b_t$: one where $\forall i \in b_s, \forall k \in b_t$, $i < k$ (denoted by $H_1$) and one where $\forall i \in b_s, \forall k \in b_t$, $i > k$ (denoted by $H_2$). We prove that if one of them does not majorize $H$, then the other one does. Without loss of generality, assume that $H_1$ does not majorize $H$ and that $|b_s| < |b_t|$. We use the following notation:

$s$: The average of the frequencies in bucket $b_s$ in $H$.
$t$: The average of the frequencies in bucket $b_t$ in $H$.
$s_j$: The average of the frequencies in bucket $b_s$ in $H_j$, $j = 1, 2$.
$t_j$: The average of the frequencies in bucket $b_t$ in $H_j$, $j = 1, 2$.

Figure 4 can be used to illustrate the situation by showing on the original frequency distribution which join elements are placed in which bucket.

[Fig. 4. Serializations of two buckets by element exchange. Panels $H$, $H_1$, $H_2$; x-axis: Join Element; y-axis: Number of Tuples.]

By the manner in which the $b_s$ and $b_t$ buckets have been constructed in the various histograms, the following inequalities can be derived:

$$s_1 \ge s \ge s_2, \qquad (7)$$

$$s_1 \ge t_2 \ge t \ge t_1 \ge s_2. \qquad (8)$$

The inequalities in (7) are derived from the fact that all three $s$'s are averages of the same number $|b_s|$ of frequencies, with $s_1$ being the average of the $|b_s|$ largest frequencies among those in the two buckets, and $s_2$ being the average of the $|b_s|$ smallest such frequencies. The inequalities in (8) among the $t$'s are derived similarly. Finally, the outermost inequalities $s_1 \ge t_2$ and $t_1 \ge s_2$ are due to the fact that $b_s$ is smaller (has fewer join elements) than $b_t$.

The fact that $H_1$ does not majorize $H$ implies that $t_1 > s$. Given the inequalities in (7) and (8), the above is derived as the contrapositive of Lemma 4.2. That inequality combined with (7) and (8) yields the following: $t_2 \ge t \ge s_2$ and $t_2 \ge s \ge s_2$. By Lemma 4.2, these imply that $H_2$ majorizes $H$.

Case 2. $|b_s| = |b_t|$. In this case, there is a unique histogram in $\mathcal{H}_{\mathcal{T}}$ (denoted by $H_1$) that can be constructed as above and be serial with respect to $b_s$ and $b_t$. (Whether it is $b_s$ that ends up with the high frequencies or $b_t$ is immaterial, since both buckets contain the same number of elements.) We
prove that $H_1$ majorizes $H$, using the same notation as in case 1. Because $|b_s| = |b_t|$, both $s$ and $t$ can be no greater than $s_1$, since the latter is the average of the $|b_s|$ largest frequencies in the two buckets. Similarly, both $s$ and $t$ can be no less than $t_1$, since the latter is the average of the smallest frequencies in the two buckets. Therefore, Lemma 4.2 is directly applicable and implies that $H_1$ majorizes $H$.

The above concludes the proof that for any pair of buckets in $H$, a histogram in $\mathcal{H}_{\mathcal{T}}$ can be constructed that is serial with respect to that pair and majorizes $H$. Repeatedly applying the above construction to all pairs of buckets yields a serial histogram with the desired properties. Since there is a finite number of such pairs, this process is guaranteed to complete. □

The above theorem can now be used to derive results on the optimality of serial histograms with respect to the worst-case result size error.

THEOREM 4.2. Consider a t-clique query $Q$ with relations $R_j$, $0 \le j \le N$, and an $(N+1)$-vector of bucket templates $\langle \mathcal{T}_j \rangle$. There exists an optimal histogram vector for $Q$ within $\langle \mathcal{H}_{\mathcal{T}_j} \rangle$ where all histograms in it are serial.

PROOF. By Theorem 4.1, for any histogram $H_j \in \mathcal{H}_{\mathcal{T}_j}$, there is a serial histogram $G_j \in \mathcal{H}_{\mathcal{T}_j}$ that majorizes it. Since $G_j$ is serial, the corresponding histogram distribution is nonincreasing. By applying Lemma 4.1 the theorem is proved. □

COROLLARY 4.1. Consider a t-clique query $Q$ with relations $R_j$, $0 \le j \le N$, and an $(N+1)$-vector of histogram sizes $\langle \beta_j \rangle$. There exists an optimal (biased) histogram vector for $Q$ within $\langle \mathcal{H}_{\beta_j} \rangle$ ($\langle \mathcal{B}_{\beta_j} \rangle$) where all histograms in it are serial (end-biased).

PROOF. Every histogram in $\mathcal{H}_{\beta_j}$ belongs to $\mathcal{H}_{\mathcal{T}_j}$ for some bucket template $\mathcal{T}_j$ with $\beta_j$ buckets, so the result follows from Theorem 4.2. □

An interesting observation concerning the above results is the following. Histograms are usually constructed in a way so that each bucket stores join elements that belong to a certain range in the natural total order of the actual join domain. The above theorem indicates that this traditional approach may be far from optimal: what is of importance is the closeness of the frequencies of the elements that are grouped in a bucket, not the order of the elements in the actual join domain.

Example 4.1. As an example of the importance of serial histograms, consider joining two relations. Assume that the join domain contains 100 elements and that both relations have 10000 tuples in total and identical frequency distributions (Zipf with z = 0.2, Section 2.3). We calculated the error for the following histograms: (a) the trivial histogram (uniform approximation); (b) a nonserial histogram with five buckets, where the $i$-th bucket $b_i$, $1 \le i \le 5$, is equal to $b_i = \{5x + i \mid 0 \le x \le 19\}$; (c) the unique serial histogram with five buckets with twenty elements each; and (d) the unique high-biased histogram with five buckets (four univalued buckets that contain the join elements with the four highest frequencies and one that contains the remaining join elements). Note that (b) and (c) conform to the same bucket template, i.e., they belong to $\mathcal{H}_{\mathcal{T}}$ for $\mathcal{T} = (5, \{20, 20, 20, 20, 20\})$, whereas (b), (c), and (d) have the same number of buckets, i.e., they belong to $\mathcal{H}_5$. The corresponding results on the error based on (5) are shown in Table II.

Table II. Error in a 2-Way Join Query for Various Histograms

Histogram      Error
Trivial        4.64%
Nonserial      4.60%
Serial         1.10%
High-biased    2.15%

As expected, the serial and the high-biased histograms are better than the trivial and the nonserial ones. There are two more interesting points to note, however. First, the nonserial and the trivial histograms generate almost identical errors, although the former maintains five times more information than the latter. Second, the high-biased histogram generates a larger error than the serial one. Hence, in this case, the relatively common practice of maintaining the highest frequencies accurately is far from optimal. □
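The four figures of Table II can be approximated with a few lines of code. This is a sketch under two assumptions that are not stated in this section: Zipf frequencies are taken proportional to $1/i^z$ (Section 2.3 is not reproduced here), and the error of formula (5) is taken as $(S - S')/S'$, the difference between actual and approximate size relative to the approximate size. Under these assumptions the output should come out close to the table.

```python
from typing import Dict, List


def zipf(m: int, z: float, total: float = 10000.0) -> List[float]:
    """Assumed Zipf form: frequency of element i proportional to 1 / i**z."""
    weights = [1.0 / (i ** z) for i in range(1, m + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def relative_error(buckets: List[List[int]], t: List[float]) -> float:
    """Error of a 2-way join of two relations that share the distribution t and
    the same histogram, in the worst-case (identically ordered) arrangement.
    Assumed error measure: (actual - approximate) / approximate."""
    actual = sum(x * x for x in t)
    # Each bucket contributes size * average**2 = (bucket total)**2 / size.
    approx = sum(sum(t[e - 1] for e in b) ** 2 / len(b) for b in buckets)
    return (actual - approx) / approx


M = 100
t = zipf(M, 0.2)
histograms: Dict[str, List[List[int]]] = {
    "Trivial": [list(range(1, M + 1))],
    "Nonserial": [[5 * x + i for x in range(20)] for i in range(1, 6)],
    "Serial": [list(range(20 * k + 1, 20 * k + 21)) for k in range(5)],
    "High-biased": [[1], [2], [3], [4], list(range(5, M + 1))],
}

if __name__ == "__main__":
    for name, buckets in histograms.items():
        print(f"{name:12s} {100 * relative_error(buckets, t):.2f}%")
```

The total number of tuples only scales all frequencies, so it does not affect the relative error in this sketch.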
Theorem 4.2 is the most important result of this section and states that the optimal histograms are serial. It does not offer, however, any indication as to which of the possibly many serial histograms is the optimal one in each case. Unfortunately, we are aware of no general results in that direction. The optimal serial histogram depends not only on the specific frequency distributions of all relations, but also on the query result size. That is, even for the same frequency distribution for all relations, the number of relations joined in the query significantly affects the optimal serial histogram. The intuition behind these dependencies is the following:

(i) The frequencies that are rarer should be known more accurately, i.e., the associated join elements should be placed in univalued buckets or at least buckets with few join elements that have similar frequencies. The reason is that these frequencies offer more information about the overall distribution. The following artificial example will help drive the point home. Consider the frequency distribution (10, 9, 8, 1). The frequency of the fourth join element is much lower than the other three frequencies, which are very close to each other. Thus, for small join queries, histogram $H_1$ with buckets {1, 2, 3} and {4} is more preferable than, say, histogram $H_2$ with buckets {1} and {2, 3, 4}. The histogram distribution of $H_1$ is very close to the actual one, (9, 9, 9, 1) instead of (10, 9, 8, 1), whereas the histogram distribution of $H_2$ is quite different, (10, 6, 6, 6) instead of (10, 9, 8, 1).

(ii) As the number of joins in queries grows, large frequencies become more dominant in the computation of the query result sizes and therefore should be known more accurately. The reason is that the number of tuples in the results associated with the most frequent join elements becomes a larger fraction of the overall result sizes as the number of joins in queries grows. Consider a relation having the frequency distribution of the example in (i) above. For a 2-way join query of the relation with itself, the first join element contributes to the result 100 tuples out of a total 246 tuples, i.e., approximately 40%. On the other hand, for a 5-way join query of the relation with itself, the corresponding percentage is approximately 52%. Thus, as the query size grows, $H_2$ becomes more competitive and beyond a certain number of joins becomes more preferable than $H_1$.

The results presented in the next section quantify (i) for 2-way join queries. The results presented in the subsequent section quantify (ii) for very large queries (in the limit). A general result that captures the precise balance between (i) and (ii) for arbitrary values of $N$ still escapes our efforts.

5. OPTIMAL SERIAL HISTOGRAMS FOR 2-WAY JOIN QUERIES

In this section, we discuss histogram optimality for 2-way join queries. As mentioned earlier, for such queries assumption (A2) does not restrict generality, i.e., the results hold for arbitrary equality join queries with no function symbols. Hence, in the rest of the paper, we concentrate on serial histograms. Let $\mathcal{S}_\beta$ be the subset of $\mathcal{H}_\beta$ that contains the serial histograms with $\beta$ buckets. For a given $\beta$, we attempt to identify the optimal histogram within $\mathcal{S}_\beta$. The following notation needs to be introduced for serial histograms:

$p_i$: The maximum join element in bucket $b_i$, $1 \le i \le \beta$, of a serial histogram of size $\beta$. (Clearly, $p_\beta = M$, whereas we also define $p_0 = 0$.)

Note that the set $\{p_i \mid 0 \le i \le \beta\}$ completely specifies a serial histogram of size $\beta$.
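The break-point representation just introduced is easy to exercise on the (10, 9, 8, 1) example of Section 4. The sketch below is illustrative only: the function names are hypothetical, a serial histogram is passed as the list $[p_0, p_1, \ldots, p_\beta]$, and the approximate size of joining $n$ copies of the relation is taken as the sum of the $n$-th powers of the histogram distribution.

```python
from typing import List


def histogram_distribution(p: List[int], t: List[float]) -> List[float]:
    """Histogram distribution of the serial histogram with break points
    p = [p_0 = 0, p_1, ..., p_beta = M] over nonincreasing frequencies t."""
    out: List[float] = []
    for lo, hi in zip(p, p[1:]):
        avg = sum(t[lo:hi]) / (hi - lo)      # bucket holds elements lo+1 .. hi
        out.extend([avg] * (hi - lo))
    return out


def approx_size(p: List[int], t: List[float], n_relations: int) -> float:
    """Approximate size of joining n_relations copies of the relation."""
    return sum(x ** n_relations for x in histogram_distribution(p, t))


def top_share(t: List[float], n_relations: int) -> float:
    """Fraction of the actual result contributed by the first join element."""
    return t[0] ** n_relations / sum(x ** n_relations for x in t)


def first_n_where_h2_wins(t: List[float], h1: List[int], h2: List[int], limit: int = 50) -> int:
    """Smallest number of joined relations for which h2 gives the larger
    (hence closer) approximation; -1 if none is found up to `limit`."""
    for n in range(2, limit + 1):
        if approx_size(h2, t, n) > approx_size(h1, t, n):
            return n
    return -1


if __name__ == "__main__":
    t = [10, 9, 8, 1]
    H1, H2 = [0, 3, 4], [0, 1, 4]            # buckets {1,2,3},{4} and {1},{2,3,4}
    print(histogram_distribution(H1, t), histogram_distribution(H2, t))
    print(top_share(t, 2), top_share(t, 5))
    print(first_n_where_h2_wins(t, H1, H2))
```

The point at which $H_2$ overtakes $H_1$ is a by-product of this particular computation and is not a figure taken from the text.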
As mentioned above, in this section, we concentrate on 2-way join queries. In two different subsections, we investigate optimal serial and optimal end-biased histograms, respectively. Before proceeding in that direction, however, we present the following general and rather unexpected result, which implies that for maximal error reduction, the same histograms should be used for both relations in a 2-way join query.
THEOREM 5.1. Consider a function-free equality join query $Q$ of two relations $R_0$ and $R_1$ and an integer $\beta \ge 1$. If $H \in \mathcal{S}_\beta$ is the histogram used for $R_0$, then for $Q$, $H$ is optimal within $\bigcup_{\beta' \le \beta} \mathcal{S}_{\beta'}$ for $R_1$ as well.

PROOF. Let $H \in \mathcal{S}_\beta$ be the histogram used for $R_0$, characterized by the set $\{p_i \mid 0 \le i \le \beta\}$ of maximum join elements in its buckets. Consider an arbitrary histogram $G \in \bigcup_{\beta' \le \beta} \mathcal{S}_{\beta'}$, characterized by the set $\{q_i \mid 0 \le i \le \beta'\}$ of maximum join elements in its buckets. Let $T_{ij}$ (resp. $U_{ij}$), $j = 0, 1$, be equal to $T_{ij} = \sum_{k=1}^{p_i} t_{kj}$ (resp., $U_{ij} = \sum_{k=1}^{q_i} t_{kj}$). When $j$ is omitted, $T_i$ (resp. $T'_i$) denotes the sum of the frequencies of the first $p_i$ join elements in the result relation of $Q$ under the histogram pair $(H, H)$ (resp. $(H, G)$). We prove that the approximate size $T_\beta$ under $(H, H)$ is never less than the approximate size $T'_\beta$ under $(H, G)$.

Concentrate on two arbitrary consecutive members $p_{l-1}$ and $p_l$, $1 \le l \le \beta$, of $\{p_i \mid 0 \le i \le \beta\}$. Assume that $q_i \le p_{l-1} < q_{i+1}$ and $q_j \le p_l < q_{j+1}$ for some $0 \le i, j < \beta'$. Consider the join elements between $p_{l-1}$ (exclusive) and $p_l$ (inclusive) and their contribution $T_l - T_{l-1}$ (resp. $T'_l - T'_{l-1}$) to the approximate size of the result under the histogram pair $(H, H)$ (resp. $(H, G)$). Let $\Delta$ be defined as $\Delta = (T_l - T_{l-1}) - (T'_l - T'_{l-1})$. To find an equivalent expression for $\Delta$, we distinguish two cases.

Case 1. $q_i < q_j$ (or equivalently $q_{i+1} \le q_j$). The relative ordering of the $p$'s and $q$'s for an arbitrary instance of this case is shown in Figure 5. From Figure 5, the following holds:

$$\Delta = \frac{T_{l0} - T_{(l-1)0}}{p_l - p_{l-1}} \left[ (T_{l1} - T_{(l-1)1}) - (q_{i+1} - p_{l-1}) \frac{U_{(i+1)1} - U_{i1}}{q_{i+1} - q_i} - (U_{j1} - U_{(i+1)1}) - (p_l - q_j) \frac{U_{(j+1)1} - U_{j1}}{q_{j+1} - q_j} \right]. \qquad (9)$$

[Fig. 5. Relative placement of maximum join elements of buckets.]

Case 2. $q_i = q_j$ (or equivalently $q_{i+1} > p_l$). Similarly to case 1, the following holds:

$$\Delta = \frac{T_{l0} - T_{(l-1)0}}{p_l - p_{l-1}} \left[ (T_{l1} - T_{(l-1)1}) - (p_l - p_{l-1}) \frac{U_{(i+1)1} - U_{i1}}{q_{i+1} - q_i} \right]. \qquad (10)$$

Formulas (9) and (10) can be captured uniformly by a single formula. Specifically, assume that $q_j$ is the largest among the $q$'s that is no larger than $p_l$, i.e., $q_j \le p_l < q_{j+1}$. In addition, let $x_l$, $y_l$, and $z_l$, $0 \le l \le \beta$, be defined as follows:

$$x_l = \frac{T_{l0} - T_{(l-1)0}}{p_l - p_{l-1}}, \qquad (11)$$

$$y_l = T_{l1} - U_{j1}, \qquad (12)$$

$$z_l = (p_l - q_j) \frac{U_{(j+1)1} - U_{j1}}{q_{j+1} - q_j}. \qquad (13)$$

Then both (9) and (10) are equivalent to

$$\Delta = x_l \big( (y_l - z_l) - (y_{l-1} - z_{l-1}) \big). \qquad (14)$$

Formula (14) can now be used to capture the difference between $T_\beta$ and $T'_\beta$. Clearly, $T_\beta = \sum_{l=1}^{\beta} (T_l - T_{l-1})$ and $T'_\beta = \sum_{l=1}^{\beta} (T'_l - T'_{l-1})$. By (14) we obtain the following:

$$T_\beta - T'_\beta = \sum_{l=1}^{\beta} \big( (T_l - T_{l-1}) - (T'_l - T'_{l-1}) \big) = \sum_{l=1}^{\beta} x_l \big( (y_l - z_l) - (y_{l-1} - z_{l-1}) \big)$$

$$= -x_1 (y_0 - z_0) + \sum_{l=1}^{\beta - 1} (x_l - x_{l+1})(y_l - z_l) + x_\beta (y_\beta - z_\beta) = \sum_{l=1}^{\beta - 1} (x_l - x_{l+1})(y_l - z_l). \qquad (15)$$

The last equality is due to the fact that all four of $y_0$, $z_0$, $y_\beta$, and $z_\beta$ are equal to 0. This can be easily verified by using $p_0 = q_0 = T_{01} = U_{01} = 0$ and $p_\beta = q_{\beta'} = M$ and $T_{\beta 1} = U_{\beta' 1} = S_1$ (the size of $R_1$) in (12) and (13).

For each term of (15) we make the following two observations. First, the inequality $x_l \ge x_{l+1}$ holds, since $x_l$ is the average of some frequencies that are all no less than the frequencies whose average is equal to $x_{l+1}$ (the frequency distribution of $R_0$ is nonincreasing). Second, for similar reasons, the inequality $y_l \ge z_l$ holds. Specifically, (12) and (13) yield

$$y_l - z_l = (p_l - q_j) \left( \frac{T_{l1} - U_{j1}}{p_l - q_j} - \frac{U_{(j+1)1} - U_{j1}}{q_{j+1} - q_j} \right).$$

The first term inside the rightmost parenthesis is the average frequency in $R_1$ of the join elements between $q_j$ and $p_l$ while the second term is the corresponding average of the join elements between $q_j$ and $q_{j+1}$. Since $q_{j+1} > p_l$, the parenthesis is non-negative. In conjunction with the fact that $p_l \ge q_j$ (by definition), this implies that $y_l \ge z_l$. Applying the above two observations to (15) yields the theorem. □
Example 5.1. Consider a relation $R_0$ that follows the Zipf frequency distribution with z = 0.2. Assume that a 2-bucket serial histogram is used for $R_0$ such that the maximum join element in the high-frequency bucket is 10. Figure 6 shows the error as a function of the corresponding join element $p$ of a 2-bucket histogram for $R_1$. Three different Zipf frequency distributions for $R_1$ are shown with z = 0.2, 0.5, and 1.0, respectively. In all cases, $p = 10$ generates the least error. It is interesting to note that the error grows quite fast on the two sides of the optimal $p$ value, indicating the importance of choosing the appropriate histogram. As expected, the more skewed distributions (e.g., Zipf with z = 1.0) are affected more severely.

[Fig. 6. Choosing a histogram for one relation given the histogram of the other. Fixed relation: z = 0.2, p = 10; x-axis: Join Element p (0 to 100); y-axis: error (scale up to 0.50).]

As mentioned before Theorem 5.1, its most important implication is that for maximal error reduction, the same histograms should be used for both relations in a 2-way join query. There is no point in having more buckets in the histogram of one relation than in the other or having different buckets.
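The behavior described in Example 5.1 can be checked with a small scan over the break point of the second histogram. This is a sketch with stated assumptions: Zipf frequencies are taken proportional to $1/i^z$ over a join domain of 100 elements, the error is measured as $(S - S')/S'$, both relations are arranged in the same nonincreasing order, and all function names are hypothetical.

```python
from typing import List


def zipf(m: int, z: float, total: float = 10000.0) -> List[float]:
    """Assumed Zipf form: frequency of element i proportional to 1 / i**z."""
    weights = [1.0 / (i ** z) for i in range(1, m + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def two_bucket_distribution(t: List[float], p: int) -> List[float]:
    """Histogram distribution of the serial 2-bucket histogram with break point p."""
    m = len(t)
    high = sum(t[:p]) / p
    low = sum(t[p:]) / (m - p)
    return [high] * p + [low] * (m - p)


def error_for_break(t0: List[float], p0: int, t1: List[float], p1: int) -> float:
    """Assumed error measure (actual - approximate) / approximate for the
    2-way join of R0 (break point p0) with R1 (break point p1)."""
    h0 = two_bucket_distribution(t0, p0)
    h1 = two_bucket_distribution(t1, p1)
    actual = sum(a * b for a, b in zip(t0, t1))
    approx = sum(a * b for a, b in zip(h0, h1))
    return (actual - approx) / approx


def best_break(t0: List[float], p0: int, t1: List[float]) -> int:
    """Break point for R1 that minimizes the error, given R0's histogram."""
    return min(range(1, len(t1)), key=lambda p: error_for_break(t0, p0, t1, p))


if __name__ == "__main__":
    r0 = zipf(100, 0.2)
    for z in (0.2, 0.5, 1.0):
        print(z, best_break(r0, 10, zipf(100, z)))
```

For each of the three skews the scan should return the break point already used for $R_0$, in line with Theorem 5.1.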
Hence, we use this result in the following subsections and only search for a single histogram for both relations that would be optimal for such a query.

5.1 Optimality Among All Serial Histograms
In this section, we identify the optimal histogram within $\mathcal{S}_\beta$ for a histogram size $\beta \ge 2$. We assume that for both $R_0$ and $R_1$, not all frequencies are equal in each one of them, because otherwise all histograms are optimal and generate zero error. The following notation needs to be introduced for relation $R_j$, $j = 0, 1$:

$S_j$: The size of $R_j$ measured in tuples.
$T_{ij}$: The sum of the $i$ highest frequencies in the frequency distribution of $R_j$, i.e., $T_{ij} = \sum_{k=1}^{i} t_{kj}$. By definition, $S_j = T_{Mj}$.
$T_j(x)$: A monotonically increasing differentiable function on the reals that agrees with $T_{ij}$ on the integers. (Its derivative on $x$ is denoted by $\tau_j(x)$.)

(Note the difference in the definition of $T_{ij}$ compared to the proof of Theorem 5.1.) In the following result, $T_{ij}$ is approximated by $T_j(x)$ so that the latter's derivatives can be obtained. Although based on such an approximation, the outcome of the theorem works well in practice. We first present the special case for $\beta = 2$ so that intuition is developed.

THEOREM 5.2. Consider a function-free equality join query $Q$ of two relations $R_0$ and $R_1$ such that not all frequencies in $R_0$ (resp., $R_1$) are equal. For $Q$, the optimal serial histogram with 2 buckets for $R_0$ and $R_1$ satisfies the following:

$$\frac{2\tau_0(p_1)}{\gamma_0^-(p_1)} + \frac{2\tau_1(p_1)}{\gamma_1^-(p_1)} = \frac{\gamma_0^+(p_1)}{\gamma_0^-(p_1)} + \frac{\gamma_1^+(p_1)}{\gamma_1^-(p_1)}, \qquad (16)$$

where for $j = 0, 1$,

$$\gamma_j^-(p_1) = \frac{T_j(p_1)}{p_1} - \frac{S_j - T_j(p_1)}{M - p_1} \quad \text{and} \quad \gamma_j^+(p_1) = \frac{T_j(p_1)}{p_1} + \frac{S_j - T_j(p_1)}{M - p_1}.$$
PROOF. The size $S'(p_1)$ of the join result generated by a 2-bucket serial histogram as a function of $p_1$ is equal to

$$S'(p_1) = \frac{T_0(p_1) T_1(p_1)}{p_1} + \frac{(S_0 - T_0(p_1))(S_1 - T_1(p_1))}{M - p_1}.$$

To find the optimal value for $p_1$, we differentiate $S'(p_1)$ and equate with 0:

$$\frac{\tau_0(p_1) T_1(p_1) + T_0(p_1) \tau_1(p_1)}{p_1} - \frac{T_0(p_1) T_1(p_1)}{p_1^2} - \frac{\tau_0(p_1)(S_1 - T_1(p_1)) + (S_0 - T_0(p_1)) \tau_1(p_1)}{M - p_1} + \frac{(S_0 - T_0(p_1))(S_1 - T_1(p_1))}{(M - p_1)^2} = 0$$

$$\Rightarrow \left( \frac{T_1(p_1)}{p_1} - \frac{S_1 - T_1(p_1)}{M - p_1} \right) \tau_0(p_1) + \left( \frac{T_0(p_1)}{p_1} - \frac{S_0 - T_0(p_1)}{M - p_1} \right) \tau_1(p_1) = \frac{T_0(p_1) T_1(p_1)}{p_1^2} - \frac{(S_0 - T_0(p_1))(S_1 - T_1(p_1))}{(M - p_1)^2}$$

$$\Rightarrow 2\gamma_1^-(p_1) \tau_0(p_1) + 2\gamma_0^-(p_1) \tau_1(p_1) = \gamma_0^+(p_1) \gamma_1^-(p_1) + \gamma_0^-(p_1) \gamma_1^+(p_1).$$

The last equality is derived by the definition of $\gamma_j^-(p_1)$ and $\gamma_j^+(p_1)$. Dividing both sides of it by $\gamma_0^-(p_1) \gamma_1^-(p_1)$ yields (16). The division is indeed allowed since both $\gamma_0^-(p_1)$ and $\gamma_1^-(p_1)$ are nonzero. This is due to the fact that not all frequencies in each relation are equal, which forces the average of the frequencies for elements up to $p_1$ to be greater than the average of those after $p_1$.

We have shown that $S'(p_1)$ takes an extreme value when $p_1$ satisfies (16). To see that it is indeed a maximum, it is enough to observe the following. (We avoid an argument based on the formula for the second derivative of $S'(p_1)$ due to its complexity.) As $p_1$ grows, $T_j(p_1)$ increases at a rate determined by the frequency that is added to $T_j$, so as $p_1$ grows, $\tau_j(p_1)$ decreases, since smaller and smaller frequencies are added to $T_j$. At the same time, the value of $\gamma_j^+(p_1) = T_j(p_1)/p_1 + (S_j - T_j(p_1))/(M - p_1)$ also decreases (both terms decrease), but at a slower rate due to averaging. Hence, considering the value of $p_1$ that satisfies (16), the derivative of $S'$ is $> 0$ to the left of it and $< 0$ to the right of it. Hence, $S'$ has a maximum. □

To use Theorem 5.2 in database histograms, $\tau_j(p_1)$ must be approximated by a discrete quantity. We adopt one of the canonical methods that uses the average of the value differences of $T_j(x)$ between $p_1 - 1$ and $p_1$ and between $p_1$ and $p_1 + 1$, i.e., $2\tau_j(p_1) = (T_j(p_1) - T_j(p_1 - 1)) + (T_j(p_1 + 1) - T_j(p_1)) = T_j(p_1 + 1) - T_j(p_1 - 1) = t_{p_1 j} + t_{(p_1+1) j}$. Hence, the following corollary of Theorem 5.2 can be obtained.

COROLLARY 5.1. Consider a function-free equality join query $Q$ of two relations $R_0$ and $R_1$ such that not all frequencies in $R_0$ (resp., $R_1$) are equal. For $Q$, the optimal serial histogram with 2 buckets for $R_0$ and $R_1$ satisfies the following:

$$\frac{t_{p_1 0} + t_{(p_1+1) 0}}{\gamma_{10}^-} + \frac{t_{p_1 1} + t_{(p_1+1) 1}}{\gamma_{11}^-} = \frac{\gamma_{10}^+}{\gamma_{10}^-} + \frac{\gamma_{11}^+}{\gamma_{11}^-}, \qquad (17)$$

where $\gamma_{1j}^-$ and $\gamma_{1j}^+$ are derived from $\gamma_j^-(p_1)$ and $\gamma_j^+(p_1)$, respectively, by replacing $T_j(p_1)$ with $T_{p_1 j}$.

Applying the corollary to identify the optimal 2-bucket serial histogram can be done in time linear in the number of join elements. Roughly, this would involve scanning the join elements from 1 to $M$, and for each one calculating the expressions in the two sides of (17) and comparing them, until (17) is satisfied. The following example illustrates how Corollary 5.1 can be applied on specific frequency distributions; the results have been obtained by using such an algorithm.

Example 5.2. Consider again the Zipf distribution of Section 2.3 with z = 0.1 and assume that both relations follow it. In this case, (17) is satisfied for $p_1 = 19$. One can verify that this choice generates the largest approximation for the 2-way join result size. Note that the break point between the two buckets is far from the median (which would be 50 in this case). This is due to the fact that the distribution is skewed, with its high frequencies being rarer than the lower ones (Figure 1). Hence, this example illustrates how (17) captures point (i) of the intuition described in the end of Section 4 on the dependency of optimal serial histograms on the rarity of frequencies.
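A linear scan of the kind described before Example 5.2 can be sketched as follows. Note the swap of technique: rather than testing equation (17) at each element, this illustration directly maximizes the approximate size $S'(p_1)$ from the proof of Theorem 5.2 using running prefix sums, which is also linear in $M$. The Zipf form (frequencies proportional to $1/i^z$ over 100 join elements) and all names are assumptions of the sketch.

```python
from typing import List, Tuple


def zipf(m: int, z: float, total: float = 10000.0) -> List[float]:
    """Assumed Zipf form: frequency of element i proportional to 1 / i**z."""
    weights = [1.0 / (i ** z) for i in range(1, m + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def best_two_bucket_break(t0: List[float], t1: List[float]) -> Tuple[int, float]:
    """Scan p_1 = 1 .. M-1 and return the break point that maximizes
    S'(p_1) = T0*T1/p_1 + (S0-T0)*(S1-T1)/(M-p_1), with both relations
    sharing the same 2-bucket serial histogram."""
    m = len(t0)
    s0, s1 = sum(t0), sum(t1)
    run0 = run1 = 0.0                    # running prefix sums T_0(p), T_1(p)
    best_p, best_size = 0, float("-inf")
    for p in range(1, m):
        run0 += t0[p - 1]
        run1 += t1[p - 1]
        size = run0 * run1 / p + (s0 - run0) * (s1 - run1) / (m - p)
        if size > best_size:
            best_p, best_size = p, size
    return best_p, best_size


if __name__ == "__main__":
    t = zipf(100, 0.1)
    p1, size = best_two_bucket_break(t, t)
    actual = sum(x * x for x in t)
    print(p1, size, actual)
```

Under these assumptions the reported break point should land at or next to the value given in Example 5.2, well below the median of the join domain.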
Optimal Histogram for Limmng Worst-Case Error Propagation
Theorem
5.2
buckets
~. This
follows
the steps
THEOREM
tions
can
where
Consider
5.3.
forj
generalized
to
version
with
equality
not all frequencies
histogram
an
below
arbitrary
without
number
its
proof,
of
which
5.2.
a fllnction-free
such that
serial
deal
is given
of the one of Theorem
RO and RI
Q, the optimal
following:
be
generalized
735
.
with
join
query
in RO (resp,,
~ buckets
Q of two rela-
Rl)
are equal.
for RO and
RI
satisfies
For
the
= 0, 1,
~(Pl~
y;(p~)
– q(P/.l)
+
~J(Pl+l)
q(Pt)
and
Pl+l
P1 –PI-I
q(pl)
yj-(pi)
–
q(P1.lJ
_
–P1
q(Pl+l)
–
~’(Pi)
=
Pl
Applying
–
=
–Pi-l
the above theorem
P1+l–
to identify
the optimal
P1
“
serial
histogram
for RO
and R ~ still requires
polynomial
time in the number
of join elements,
where
the degree of the polynomial
is the number
of desired buckets minus
1. The
reason for the complication
compared
to the 2-bucket
case is that the optimal
break
points
possible
buckets
algorithm
cannot
5.2 Optimality
The
above
Because
be identified
independently
of each other,
so roughly
combinations
of points must be examined.
If the number
is considered
as a parameter
of the size of the input,
requires
exponential
time in that parameter.
Among
results
of the
End-Biased
identify
complexity
the
all
of desired
then the
Histograms
optimal
involved
in
histograms
applying
these
among
all
results,
serial
ones.
a reasonable
alternative
that several
database
systems
have adopted
is to only support
end-biased
histograms
[17]. In that
case, the candidate
break
points
for
buckets
are much
fewer.
Unfortunately,
differentiation
has not been as
effective in identifying
the optimal
end-biased
histograms
as before. Although
it results
in a theorem
similar
to Theorem
5.2 or 5.3, the formulas
involved
are complex
and unintuitive.
As an alternative,
we explore
an inductive
approach
in which
the univalued
buckets
of the end-biased
histogram
are
chosen one at a time. Given a fixed set of ~ – 1 univalued
buckets,
choosing
the
optimal
next
univalued
always
be the optimal
provides
a satisfactory
first
k,
bucket
results
in
a histogram
that
may
not
end-biased
histogram
with
~ buckets,
but it often
approximation.
According
to the above approach,
the
1< k < p – 1, univalued
buckets
that
are
chosen
can
be ignored
when
choosing
the (k + 1)-th one. This is done by concentrating
on the
unique multi valued bucket that has been formed at that point and identifying
which of the two (sets of) join elements
that are associated
with its highest
and lowest frequency
should be placed in a univalued
bucket. Hence, without
loss of generality and to simplify presentation, the following series of results only addresses the first step in the histogram formation process, i.e., the case of finding the first frequency to be placed in a univalued bucket.
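The inductive approach described above can be illustrated with a minimal sketch for the special case of a relation joined with itself. The function names are hypothetical, and two simplifying assumptions are made that are not in the text: each step moves exactly one join element (ties at an extreme are not grouped), and the quality of a candidate is measured by the approximate self-join size, with frequencies in the multivalued bucket replaced by the bucket average. Under these assumptions the approximate size never exceeds the actual size, so the larger estimate is taken as the better one.

```python
def approx_self_join_size(freqs, n_high, n_low):
    """Approximate self-join size under an end-biased histogram.

    freqs must be sorted in descending order. The n_high largest and the
    n_low smallest frequencies are kept exactly (univalued buckets); the
    remaining ones share a single multivalued bucket and are replaced by
    the bucket average.
    """
    m = len(freqs)
    exact = freqs[:n_high] + freqs[m - n_low:]
    middle = freqs[n_high:m - n_low]
    size = sum(t * t for t in exact)
    if middle:
        # len(middle) elements, each approximated by the bucket average
        size += sum(middle) ** 2 / len(middle)
    return size


def greedy_end_biased(freqs, n_univalued):
    """Choose univalued buckets one at a time (a greedy sketch).

    At each step only the highest and the lowest frequency of the current
    multivalued bucket are candidates. Returns (n_high, n_low): how many
    elements were taken from each end.
    """
    freqs = sorted(freqs, reverse=True)
    n_high = n_low = 0
    for _ in range(n_univalued):
        if n_high + n_low >= len(freqs):
            break
        take_high = approx_self_join_size(freqs, n_high + 1, n_low)
        take_low = approx_self_join_size(freqs, n_high, n_low + 1)
        if take_high >= take_low:
            n_high += 1
        else:
            n_low += 1
    return n_high, n_low
```

As the text notes, a greedy choice of this kind may miss the optimal end-biased histogram for a given number of buckets; the sketch only shows the mechanics of comparing the two extremes of the multivalued bucket.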
Based on Theorem 5.1, for optimal results the two relations must have the same histogram. Therefore, there are only two alternatives for the optimal end-biased histogram with two buckets for a 2-way join query. The first such histogram places in a univalued bucket the maximum possible number of join elements that are all associated with the highest frequency in both $R_0$ and $R_1$. The second such histogram does the same for elements associated with the lowest frequency in both $R_0$ and $R_1$. Let $k$ and $k'$ be the number of join elements placed in the univalued bucket in the two candidate histograms, respectively. In an effort to obtain succinct and usable criteria, and for simplicity, we assume that $k = k'$. The following theorem establishes a relationship between the optimal histogram and properties of the specific frequency distributions involved.

THEOREM 5.4. Consider a function-free equality join query $Q$ of two relations $R_0$ and $R_1$ such that not all frequencies in $R_0$ (resp., $R_1$) are equal. Consider the end-biased histogram with 2 buckets that is optimal for $Q$. The frequency that is accurately maintained in its univalued bucket is chosen as follows:

\[
\text{Frequency in univalued bucket of } R_j =
\begin{cases}
t_{1j} & \text{if } \alpha_0 + \alpha_1 > 0 \\
t_{Mj} & \text{if } \alpha_0 + \alpha_1 < 0
\end{cases}
\]

where, for $j = 0, 1$,

\[
\alpha_j = \frac{(t_{1j} + t_{Mj})/2 - S_j/M}{t_{1j} - t_{Mj}}
\]

and $S_j$, $j = 0, 1$, denotes the size of $R_j$.

PROOF. Let $S'_1$ and $S'_M$ denote the result size approximations when $t_{1j}$ and $t_{Mj}$ are chosen as the frequencies to be maintained in the univalued buckets, respectively. Based on (4), and assuming $k$ join elements in the univalued buckets, a comparison of the two sizes yields the following:

\[
k t_{10} t_{11} + \frac{(S_0 - k t_{10})(S_1 - k t_{11})}{M - k} \ge k t_{M0} t_{M1} + \frac{(S_0 - k t_{M0})(S_1 - k t_{M1})}{M - k}
\]
\[
\Rightarrow (M - k) k t_{10} t_{11} - (M - k) k t_{M0} t_{M1} - S_0 k (t_{11} - t_{M1}) - S_1 k (t_{10} - t_{M0}) + k^2 t_{10} t_{11} - k^2 t_{M0} t_{M1} \ge 0
\]
\[
\Rightarrow t_{10} t_{11} - t_{M0} t_{M1} - \frac{S_0}{M}(t_{11} - t_{M1}) - \frac{S_1}{M}(t_{10} - t_{M0}) \ge 0.
\]

Based on the premises of the theorem, $t_{1j} > t_{Mj}$ for $j = 0, 1$. Hence, dividing the above formula by the nonzero $(t_{10} - t_{M0})(t_{11} - t_{M1})$ completes the proof. ❑
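The criterion of Theorem 5.4 can be tried out on small inputs with the following sketch. It assumes that $\alpha_j$ is the ratio of $(t_{1j} + t_{Mj})/2 - S_j/M$ to $t_{1j} - t_{Mj}$, and that the approximate result size of a two-bucket end-biased histogram is $k t_0 t_1 + (S_0 - k t_0)(S_1 - k t_1)/(M - k)$, as in the comparison used in the proof. The function names and the choice $k = 1$ are illustrative assumptions only.

```python
def alpha(freqs):
    """alpha for one relation: ((t1 + tM)/2 - S/M) / (t1 - tM)."""
    t_high, t_low = max(freqs), min(freqs)
    if t_high == t_low:
        raise ValueError("all frequencies are equal; alpha is undefined")
    average = sum(freqs) / len(freqs)
    return ((t_high + t_low) / 2 - average) / (t_high - t_low)


def choose_univalued_end(freqs0, freqs1):
    """Apply the sign test on alpha_0 + alpha_1."""
    total = alpha(freqs0) + alpha(freqs1)
    if total > 0:
        return "highest"
    if total < 0:
        return "lowest"
    return "either"


def approx_join_size(freqs0, freqs1, end, k=1):
    """Approximate 2-way join size with k elements in the univalued bucket.

    Both relations are assumed to range over the same M join elements and
    to keep the same end ("highest" or "lowest") in the univalued bucket.
    """
    m = len(freqs0)
    pick = max if end == "highest" else min
    t0, t1 = pick(freqs0), pick(freqs1)
    s0, s1 = sum(freqs0), sum(freqs1)
    return k * t0 * t1 + (s0 - k * t0) * (s1 - k * t1) / (m - k)
```

For example, with frequencies `[10, 3, 2, 1]` and `[10, 9, 9, 1]` the two values of alpha have opposite signs and their sum is slightly negative, so the sketch selects the lowest frequency; comparing the two approximate sizes directly gives the same answer.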
If $R_0$ and $R_1$ are the same relation $R$ with size $S$, the following simple corollary can be obtained.

COROLLARY 5.2. Consider a function-free equality join query $Q$ of relation $R$ with itself, whose frequencies satisfy $t_{i+1} \le t_i$, $1 \le i < M$, and suppose that not all frequencies in $R$ are equal. Consider the end-biased histogram with 2 buckets that is optimal for $Q$. The frequency that is accurately maintained in its univalued bucket is chosen as follows:

\[
\text{Frequency in univalued bucket of } R =
\begin{cases}
t_1 & \text{if } \dfrac{S}{M} < \dfrac{t_1 + t_M}{2} \\[2ex]
t_M & \text{if } \dfrac{S}{M} > \dfrac{t_1 + t_M}{2}
\end{cases}
\]

PROOF. If $\alpha_0 = \alpha_1 = \alpha$, Theorem 5.4 implies that the optimal choice depends on the sign of $\alpha$. Since $t_1$ is always greater than $t_M$, the denominator of $\alpha$ is always positive. Therefore, the optimal choice depends on the sign of its numerator alone. ❑

There is a rather intuitive explanation of Corollary 5.2, which captures point (i) of Section 4. When $\alpha > 0$, the frequency distribution is more skewed towards lower values (the average of the distribution is lower than the average of the two extremes). In that case, the highest frequency is, in some sense, more distinguished (rarer) than the lowest one, and is therefore more valuable to maintain in a univalued bucket. A similar argument holds for the opposite case as well.
The criterion provided by Theorem 5.4 is rather simple to apply for any specific 2-way join between two relations, since $\alpha_j$ can be easily computed.
Unfortunately,
it does not provide
a general
answer
when considering
all
possible 2-way joins that can be formed
between
relations
from a large set.
Depending
on the particular
join, the answer for any relation
may be different. Below we offer a general heuristic
that will work well in many cases. For that we need to first prove the following simple proposition.

PROPOSITION 5.1. If not all frequencies in a relation $R$ are equal, then $|\alpha| < 1/2$.

PROOF. We prove that $\alpha \le 1/2$; the other inequality can be proved similarly. By the definition of $\alpha$, the inequality $\alpha \le 1/2$ is equivalent to $(t_1 + t_M)/2 - S/M \le (t_1 - t_M)/2$, i.e., to $t_M \le S/M$. Since there is at least one frequency ($t_1$) that is higher than $t_M$, the last inequality always holds in the strict sense. Equality can be obtained at the limit $M \to \infty$. ❑
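Proposition 5.1 can be checked numerically with a short sketch. It assumes that $\alpha$ is the ratio of $(t_1 + t_M)/2 - S/M$ to $t_1 - t_M$, and that the Zipf distribution of Section 2.3 assigns to the $i$-th of $M$ join elements a frequency proportional to $1/i^z$. The domain size $M = 100$ and the function names are assumptions made for this illustration.

```python
def zipf_frequencies(z, m=100):
    """Relative frequencies proportional to 1 / i**z for i = 1..m (assumed form)."""
    weights = [i ** -z for i in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def alpha(freqs):
    """((t1 + tM)/2 - average frequency) / (t1 - tM) for one relation."""
    t_high, t_low = max(freqs), min(freqs)
    if t_high == t_low:
        raise ValueError("all frequencies are equal; alpha is undefined")
    average = sum(freqs) / len(freqs)
    return ((t_high + t_low) / 2 - average) / (t_high - t_low)


if __name__ == "__main__":
    for z in (0.02, 0.04, 0.06, 0.08, 0.10, 0.5, 1.0):
        a = alpha(zipf_frequencies(z))
        # Proposition 5.1: alpha always lies strictly between -1/2 and 1/2.
        print(f"z = {z:4.2f}  alpha = {a:.3f}  |alpha| < 1/2: {abs(a) < 0.5}")
```

Since $\alpha$ is a ratio of differences of frequencies, it is unaffected by scaling all frequencies by the same factor, so the normalization used in the sketch does not matter.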
Table III. Values of $\alpha$ for the Zipf Distribution for Various Values of $z$

$z$         0.0+$\epsilon$   0.02    0.04    0.06    0.08    0.10
$\alpha$    0.290            0.296   0.301   0.307   0.312   0.318

The above proposition can be used in applying the following heuristic. For each relation $R_j$ of interest, $\alpha_j$ should be calculated. Based on the values that are obtained, the following decision can be made: (a) if $\alpha_j$ is positive (resp. negative) for most relations, then $t_{1j}$ (resp. $t_{Mj}$) should be chosen for all histograms; (b) if $\alpha_j$ has a large absolute value (e.g., above 0.3) whenever it is positive (resp. negative) and a small absolute value (e.g., below 0.1) whenever it is negative (resp. positive), then $t_{1j}$ (resp. $t_{Mj}$) should be chosen for all histograms. If none of the above holds and there is a varied mix of positive and negative values of $\alpha_j$, then additional considerations should be taken into account, e.g., frequency or importance of queries, which are beyond the scope of this paper.

Example 5.3. Consider again the Zipf distribution of Section 2.3 for various values of $z$. The corresponding values of $\alpha$ are given in Table III. In the above, $\epsilon$ is an arbitrarily small number. There are three points that we want to emphasize for this example. First, the value of $\alpha$ does not change very dramatically as $z$ increases. Hence, one should expect to see large differences in the values of this parameter only when there are dramatic differences in the skew of distributions. Second, for the Zipf distribution, which is claimed to be quite common in "natural" data, $\alpha$ is always positive. Third, even for extremely small values of $z$, the value of $\alpha$ is relatively high.
The combination of the three points above implies that, for data that follows Zipf distributions, the univalued buckets of end-biased histograms should contain the high frequencies in the distributions.

6. ASYMPTOTICALLY OPTIMAL SERIAL HISTOGRAMS

The techniques used in the previous section to characterize optimal histograms for 2-way join queries cannot be generally applied to obtain similar characterizations for larger queries. Moreover, even in the cases where they can, the resulting criteria are rather cumbersome and are not easily applicable in practice. Nevertheless, they have shown a clear trend of the optimal histograms
when
the number
of relations
in a t-clique
query
increases.
Specifically,
the optimal
serial histograms
tend to have many buckets
each
one of which contains
few elements
associated
with high frequencies
and one
large
one with
the remaining
elements
of low frequency.
Similarly,
the
optimal
end-biased
histograms
tend to have increasingly
more of their univalued buckets
containing
elements
with high frequencies.
The following
theorem formalizes
the above observation
in the limit.
We first introduce
some
.4’ is the
set of
notation
about
positive
natural
0]
a family
The minimum
join
of a high-biased
dj – 1 denotes
highest
0
element,
the number
{O}} (where
U
of i, in the multivalued
H] ● ‘2P
of join
for
R,j,
elements
bucket
j ● M U {O}. That
associated
with
the
is,
~1 – 1
in RJ.
of all values
Consider
6.1.
i.e., value
histogram
frequencies
The minimum
THEOREM
{Rj Ij G .V
of relations
numbers):
a family
of 8~, i.e., 6’ = minj
of t-clique
queries
e,~,,
(oj~J}.
{Q~ IN ~.fl,
QN
where
is on relations
R], O < j < N. Let ( H~N) ) be the (N + 1)-uector
of serial
histograms
that is optimal
for Q~ within
(P; ), where all histogram
sizes
satisfy ~1 > 1 and the approximate
frequencies
&i-e denoted by t ~~N’. Let ~H] )
be the corresponding
high-biased
histogram
vector in ( ?[O ). If for an infinite
{O}, t$N) < tl,, then at the ~imit,
number
of values of j G.KU
the following
holds:
lim (H~N))
N+ x
PROOF.
under
the
Let
S’
and
high-biased
S’(N’
denote
histograms
= (HJ).
the
and
result
under
size
arbitrary
~1 buckets
for
R], O < j < N, respectively.
showing
that
limN - .(S’ – S’(N’) > 0. Observe
high-biased.
Formula
(4) yields the following:
MN
By the premises
of the theorem,
M
is clearly
non-negative.
for each product
Combining
for
histograms
theorem
QN
G~N )
is proved by
Hj is
tfj= t~j,since
N
in the rightmost
above formula,
an arbitrary
number
of its fractions
N ~ w, that sum tends to 0. In addition,
the limit
its left
serial
The
that
with
approximations
these
sum of the
are less than 1. Hence,
of the sum immediately
two facts
as
to
yields
We
claim that the quantity
in the parenthesis
is always
non-negative
and
therefore
the same holds for its limit. Assume the worst case that maximizes
j
Y, E. Ioannldis
740
.
the
rightmost
that
6 –
sum
1
is
in
the
and S, Chrlstodoukalis
the
parenthesis
highest
join
(with
the
element
in
negative
some
sign),
bucket
i.e.,
of
assume
G~N),
for
all
{O}. (This represents
a worst case because then t~,~~~)j is the average
of frequencies
that are no lower than t(fl. 1~l.) Also, by the definition
of O, for
Hence,
the claim
is equivalent
to
alll<i<
O–landj=ti~u
{O}, t~j= t,j.
=JI’
U
with t,~N) representing
the worst case discussed above. By Lemma
3.1, for all
j =,7’ u {O}, the M-vector
( t,j ) majorizes
the M-vector
( tf~N)).In conjunction
with
Theorem
The
above
optimal
2.2, this
was
vector
is necessary
that
the
case
where
the
tions
are
uniform.
in the
the
We
limit
Then
expect
that
in
the
behavior
of optimal
histograms.
include
the
where
for
each
relation
where
for
all
G.I’U
Example
large
are
6.1.
join
2-way
j
identical,
and
their
that
the
have
calculated
were
used
Table
IV.
Again
the
nonserial
the
errors
exponential
in
error
Example
serial
and
ones.
are
the
examples
condi-
consider
the
distribu-
distributions
be
used
for
that
theorem
would
of the
asymptotic
where
are
rather
histograms
frequency
essence
are
a
this
example,
should
relations
the
be
Theorem
same,
distinct,
for
and
The
instead
join
query.
Recall
join
6.1
the
case
the
case
by
for
same
types
the
evident
additional
than
once
5-way
are
join
better
interesting
those
again
that
contains
are
the
histograms
of dealing
100
elements,
of the
2.3).
query
than
2-way
[4, 5]. Secondj
Assume
as well.
histograms
are
the
points
a
relations
identical
of
in
with
the
z = 0.2 (Section
relations
histograms
two
domain
with
five
results
high-biased
where
Zipf
the
of
4.1,
whose
are
larger
is very
in
of the
frequencies
a 5-way
tuples
are
significantly
growth
with
high-biased
There
under
non-uniform
5.4
the
importance
generated
4.1.
expressed
reality,
two
condition
all
Example
maintained
the
z O.
for the
0.
the
10000
In
but
two
captures
t) highest
(3J =
distributions
histograms
the
deal
the
the
where
continue
contain
frequency
– S’(N))
so it holds
holds
For
all
Characteristic
the
we
- .(S’
variability
Theorem
cases
case
{O},
we
query,
only
result
To illustrate
queries,
join
above
it
satisfy.
asymptotic.
because
and
most
that
is enough
essentially
histogram
Therefore,
applicable
there
is
0 must
is indeed
satisfied.
is
6.1
that
is violated
optimal
lim~
vector,
case, it can be equivalently
Theorem
that
condition
Therefore,
histogram
In that
condition
to ensure
behavior
purpose.
of
unnatural
tion
(18).
❑
criticism
and
determine
as well.
= (HJ).
A reasonable
technical
proves
for an arbitrary
( H} N’)
(HjN’)
as limN+.
fact
shown
shown
trivial
to note.
join
We
that
and
First,
query.
contrary
in
-The
to what
was observed in Example
4.1, the high-biased
histogram
generates
a smaller
error than the serial one. This was to be expected at the limit due to Theorem
6.1. In this case, however,
the distributions
are skewed
enough
that the
cross-over
point comes early, at relatively
small queries.
Table IV. Error in a 5-Way Join Query for Various Histograms

Histogram      Error
Trivial        79.42%
Nonserial      78.79%
Serial         25.00%
High-biased    16.43%

Example 6.2. As another
example, we show the effect on the error of using, in all relations of the example, the high-biased histogram with L + 1 buckets introduced in Section 2.3 (for various values of L). That is, we assume that the join elements of the relations follow a Zipf distribution with z = 0.02 and z = 0.1, and show the effect on the error when there are L = 1, 5, and 10 univalued buckets in the histogram. Note that L = 0 corresponds to the trivial histogram (uniform distribution). Figure 7 shows a graphical representation of (5) for these cases.
The results are rather impressive. We observe that, in both cases, even maintaining a single element has tremendous impact in reducing the total error. An even more surprising result is that, in all cases with L > 0, the error as a function of N has a maximum. That is, beyond a certain point, as the query size grows, the error decreases. The reason is that with more relations, the value of the frequency distribution for the most common elements becomes an increasingly larger fraction of the total size of the query result, thus reducing the error (point (ii) of the intuition described in the end of Section 4). As expected, this is more dramatic for the more skewed distribution (z = 0.1). We must emphasize that, by Theorem 2.3, the case presented deals with the largest possible result size (worst-case error given a fixed value for the estimated result size). If the frequencies given by the Zipf distributions were associated with the join elements in a different way, then the original error (for L = 0) would be less than what is shown in Figure 7, but the error under the high-biased histograms for each value of L could be larger. Nevertheless, the improvement for the worst-case error gives much hope for being able to optimize very large queries in some cases, without being overwhelmed by the errors in the query relations.
The phenomenon of the worst-case error having a maximum as a function of the number of joins N under high-biased histograms is not unique to the above example. The next theorem provides general conditions for when this happens, and also extends the conditions to capture the case where the error tends to a finite number other than 0. Thus, it complements Theorem 6.1, which showed that for large join queries, high-biased histograms are optimal: not only are they optimal, but often the corresponding error tends to zero as well.
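The behavior described in Example 6.2 can be reproduced qualitatively with a small sketch. Several things here are assumptions rather than the paper's definitions: all relations are taken to be identical with Zipf frequencies proportional to $1/i^z$ over 100 join elements, the worst case is taken to be the one where equal ranks are joined together, and the error is measured as $S/S' - 1$, where $S$ is the actual and $S'$ the approximate result size. The function names are hypothetical.

```python
def zipf_weights(z, m=100):
    """Unnormalized Zipf frequencies 1 / i**z for i = 1..m (assumed form)."""
    return [i ** -z for i in range(1, m + 1)]


def join_error(weights, n_univalued, n_relations):
    """Relative error S/S' - 1 for a join of n_relations identical relations.

    The n_univalued highest frequencies are kept exactly (high-biased
    histogram); all other frequencies are replaced by their average.
    """
    w = sorted(weights, reverse=True)
    exact, rest = w[:n_univalued], w[n_univalued:]
    actual = sum(t ** n_relations for t in w)
    approx = sum(t ** n_relations for t in exact)
    if rest:
        average = sum(rest) / len(rest)
        approx += len(rest) * average ** n_relations
    return actual / approx - 1.0


if __name__ == "__main__":
    w = zipf_weights(0.1)
    for n in (2, 5, 10, 20, 50, 100, 200):
        row = "  ".join(f"L={l}: {join_error(w, l, n):10.4g}" for l in (0, 1, 5, 10))
        print(f"relations = {n:3d}  {row}")
```

Under these assumptions the error with L = 0 keeps growing with the number of relations, whereas with one or more univalued buckets it rises to a maximum and then decays, because the highest frequency comes to dominate both the actual and the approximate size.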
[Figure 7. Graphical representation of (5) for the cases of Example 6.2.]

THEOREM 6.2. Consider a family of t-clique queries $\{Q_N \mid N \in \mathcal{N}\}$, where $Q_N$ is on relations $R_j$, $0 \le j \le N$. Let $(H'_j)$ be an $(N + 1)$-vector of arbitrary
histograms in $(\mathcal{H}_{\beta_j})$ with approximate frequencies denoted by $t'_{ij}$. Suppose that $Z$ (resp. $Z'$) denotes the minimum join element, i.e., value of $i$, such that for an infinite number of values of $j \in \mathcal{N} \cup \{0\}$, $t_{ij} < t_{1j}$ (resp., $t'_{ij} < t'_{1j}$), and let $K'_i$ denote $K'_i = \{j \mid t'_{ij} < t'_{1j} \text{ and } j \in \mathcal{N} \cup \{0\}\}$. Then
limll=
c
N+.
where
c is a constant
Z – 1, Vj = K;.
(5),
we
obtain
the
Z and
of
and
the
for
all
nominator
Z’
If
= 1, then
therefore
tends
finiteness
of
therefore
equal
c
>
()
is
Next
Clearly,
value
of
the
are
no
> 1 and
VI
< i <
c = O.
the
error
~:
to itself
that
the
a sufficient
hold:
Z’
of the
rightmost
O. Similarly
positive,
in
– 1), K,
since
for
sum
of
in
This
(20)
is
implies
fraction
condition
s Z, K, L K:,
let
K,
in
both
the
K, =
denote
(resp.
K:) is finite.
The
i = 1, ~,~= o(tlJ/tlJ)
= 1.
the
denominator,
~%D = CO.If Z’
fraction
to that
sums
K;,
to
that
the
limN
limit.
equal
is
limit
terms
the
in
(19)
1 s i s Z’
that
K{,
and
present
always
obtain
if Z = Z’
for
(19) implies
strictly
to O, implying
constant
we
following
we
a
there
is
1 (resp.
is always
K,
the
{0}}. Then
1 s i s Z –
in (20)
Z’,
denominator
{jIt,J < tl, and j G. J’u
where
= 1,
then
following
holds:
—
definition
nominator
Z’>1
Z’
t l] = t~j holds,
1 +D=
By the
if
if
O < c < ~. Moreover,
~, the equality
Using
PROOF.
and
{
Y.
the following
for
independent
that
minus
when
K, c K,,
> 1, then
lim~
which
due
of
- .D
to the
iV,
and
= c, where
1.
c = O. Observe
~, and
that
the
K; c K~+ ~. From
(20)
following:
F >0. Also,
E is obtained
for
any
when
given
VI
collection
< i < Z‘
{K: 11 s i < Z’ – 1}, the lowest
– 1, (i) K, = K;, and (ii) Z’ – 1 is
the highest
join
computed
t ,Z.
element
without
~)j).
( t,J/’tl,
In
that
E = 1 and
The
first
(Note
case,
–
fraction
‘dl
< i < Z‘
C=o.
any
3.1,
bounds
in turn
since
it
E.z
defining
– 1, K, = K:.)
are
lower
(Z’
–
~, the
by
that
~)1 is
than
1)-vector
Theorem
2.2,
c = O is equivalent
the following.
from
the
c = O.) The
equates
two
that
F imply
derived
for
t~z
= lY~ (so that
●&.
j
E and
(Note
The
all
satisfy
necessary
j
tfJ\t
~J). Therefore,
(
on
all
frequencies
for
I)-vector
is immediately
is also
straightforward
the
account
Lemma
F = O. These
Z = Z’
of
the
that
nominator
this
together
of F in
definition
second
reverse
and
condition
provide
(21).
implication
the
denominator
also
implies
a sufficient
that
condition
for
n
Example
Example
6.3.
a large
number
low-biased
can
low-biased
by
histograms
Assume
are
ones
that
frequency
all
The
numbers
with
the
=
relations
the
maximum
‘Some
necessary
practice.
several
so
close
is
almost
that
we
sufficient
conditions
functions
do exist
c = O They
In addition,
that
TransactIons
results
covers
on
< N
are very
they
would
from
mathematics
many
Database
require
common
Systems,
use
small
same
where
if
81sis
for
value
a
ones.
a different
number
to each
of joins
of N
beyond
as
one
and
the
however,
us to introduce
common
2.3,
of
most
can
easily
to present
cases.
4, December
1993
that
strict
and
impossible
notation
Zipf
is
a
distribu-
inequahty
sufficient
to use m
together
simpler
the
and
S(a)
the
Implies
a much
of
skewed
a necessary
be rather
additional
size
elements,
Figure
see
to derwe
would
we decided
No
the
142).
majorizatlon
much
characteristics
the
100
vs.
and
i.e.,
contains
that
be used
(22)
100”
(143
vector
could
18,
other
<80
some
domain
for when
Hence,
l<z
Section
examined
They
equal
if
are
in
the
Vol
even
low-biased
therefore
is some
are
there
join
complex,
additional
high-biased
distribution,
a
to O
is as follows:
the
of (22),
and
for
database
that
10000,
one
under
over
in
tended
choice.
discussed
to
for
there
+ 1)/2]
of vector
other
condition
chosen
representation
(or equahty)
condition
(
( .z = O. 1)
graphical
the
O < j
present
We
if
observed
and
error
Zipf
5.2.)
even
right
143 – [(i
value
distribution
all
distributions
is very
the
We
the
to be preferred
101 – i
were
Zipf
in
for
ZJ
With
preferable,
the
relations
general.
of
that,
more
are
distribution
t
more
Corollary
fact
the phenomenon
a maximum
behavior
are
applying
the
that
presented
histograms.
expose
high-biased
is
the
histograms
be verified
to
shows
error
compare
high-biased
distribution
which
corrected
of relations
we
under
join,
(This
the
where
versus
2-way
The above theorem
6.2 where
example,
ACM
by
of H~ for
bucket
into
the (Z’
these lower
equivalence
that
is again
for
in some
taking
) majorizes
E > 1.By (21),
to
and S. Chrlstodoukalls
with
sufficient
.
distribution is skewed towards the high frequencies, i.e., high frequencies are more common. Consider the two cases where the database system uses a high-biased
histogram with L + 1 buckets for L = 10 or a low-biased histogram with L + 1 buckets for L = 10. Figure 8(b) is a graphical representation of (5) for the two cases. The relative error in the query result size is shown as a function of the number of relations in the query. As expected from Corollary 5.2, for a small number of relations the error is smaller under the low-biased than under the high-biased histogram. Nevertheless, the result of Theorem 6.2 can also be observed, since the error under the high-biased histogram has a maximum beyond which it tends to zero, whereas the error under the low-biased histogram tends to infinity.

7. HISTOGRAM RECOMMENDATION

In this section, we reflect on the entire set of results presented in this paper and, based on them, make some concrete recommendations on what types of histograms should be maintained for database relations and how these should be chosen. To put the recommendations that follow in the right perspective, however, we should emphasize the fact that the results in this paper have not addressed the entire issue of histogram optimality with respect to limiting worst-case error propagation. They are based on certain specific assumptions, the most restrictive of which is the form of the queries (t-clique queries). Nevertheless, we believe that they shed enough light on the problem and that the directions to which they point with respect to how histograms should be chosen may be useful in general.
To reduce the worst-case error, the optimal histograms are always serial. For any specific join query, the specific serial histogram that is optimal depends
on the number
of participating
relations
in the query
and their frequency distributions. In reality, the database administrator decides on the maintained
histogram
having
a collection
of queries
in mind,
with an overall
goal of obtaining
good estimates
for all of them, large and
small.
The results
presented
in this paper
showed
that
high-biased
histograms
are the optimal
choice for large queries.
On the contrary,
for small
queries, the correct decision depends on the specific frequency
distributions
of
the participating
relations
in ways that one cannot deal with each individual
relation
independently.
Clearly,
the
database
administrator
would
like
to be
able to decide on the optimal
histogram
for each relation
by looking
at the
characteristics
of that relation
alone.
Given the above, for an overall
effective
estimation,
we suggest the following heuristic
approach
for choosing a histogram
for a relation.
Independent
of
the frequency
distribution
of the relation,
its histogram
has some univalued
buckets with the join elements
associated
with the largest
frequencies.
Further, if the distribution
is very skewed towards
large frequencies
(which
is
not very common),
the remaining
elements
should be divided
among another
set of buckets based on the specific distribution
of the individual
relations,
so
that
small
queries,
e.g., 2-way join
queries,
do not suffer
as well.
The effectiveness of the above heuristic approach on real databases is an issue that requires further investigation.
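The first part of the recommendation, keeping the join elements with the largest frequencies in univalued buckets, can be sketched as follows. The representation of a histogram as a pair (exact frequencies, average of the remaining frequencies) and the function names are assumptions made for this illustration; the optional further subdivision of the remaining elements is not shown.

```python
def high_biased_histogram(freq_by_value, n_univalued):
    """Build a high-biased histogram from a {join value: frequency} mapping.

    The n_univalued most frequent values keep their exact frequency; all
    remaining values share one multivalued bucket, summarized by its
    average frequency.
    """
    ranked = sorted(freq_by_value.items(), key=lambda item: item[1], reverse=True)
    exact = dict(ranked[:n_univalued])
    rest = ranked[n_univalued:]
    rest_average = sum(f for _, f in rest) / len(rest) if rest else 0.0
    return exact, rest_average


def estimated_frequency(histogram, value):
    """Frequency the histogram reports for a value of the relation."""
    exact, rest_average = histogram
    return exact.get(value, rest_average)


if __name__ == "__main__":
    frequencies = {"a": 50, "b": 30, "c": 10, "d": 6, "e": 4}
    hist = high_biased_histogram(frequencies, n_univalued=2)
    for value in frequencies:
        print(value, frequencies[value], round(estimated_frequency(hist, value), 2))
```

Note that the sketch reports the bucket average for any value that is not in a univalued bucket, so it is only meaningful for values that actually occur in the relation.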
8. SUMMARY

Error propagation in the context of query optimization is one of the most significant challenges facing efforts to effectively optimize queries of much higher complexity than those with which conventional technology can deal. Maintaining histograms to approximate frequency distributions in relations is a common technique used by database systems to limit the propagation of errors. In this paper, we have studied histograms and how they reduce errors that represent the worst case. Specifically, we have introduced serial and end-biased histograms and showed that for the restricted class of t-clique queries, the optimal histogram for errors in the query result size is always serial. For 2-way equality join queries with no function symbols, we have presented results on finding the optimal serial histogram and the optimal end-biased histogram based on the precise characteristics of the frequency distributions of values in the join attributes of the query relations. We have also examined histogram optimality for very large t-clique queries (in the limit) and showed that high-biased histograms are always optimal.
This work has raised several interesting questions and issues. What is the characterization of optimal histograms for queries that have more than one join? How many buckets should an optimal histogram have in order for the error to be within certain prespecified bounds? How is histogram optimality defined with respect to multiple queries and which histograms are to be preferred for a variety of queries? Is it reasonable to use histograms that are optimal in reducing the variance of the error instead of the worst-case error and what are the characteristics of such histograms? What are the characteristics of optimal histograms for non-t-clique queries, primarily arbitrary equality join queries with more than one join? How do the results of this paper change when considering completely different types of queries (e.g., nonequality joins or selections) and different parameters of interest (e.g., operator cost or ranking of alternative access plans, which determines the final decision of the optimizer)? Many of these questions are part of our current and future work.
REFERENCES

1. CHRISTODOULAKIS, S. Estimating block transfers and join sizes. In Proceedings of the 1983 ACM-SIGMOD Conference on the Management of Data (San Jose, Calif., May 1983), 40-54.
2. CHRISTODOULAKIS, S. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June 1984), 163-186.
3. CHRISTODOULAKIS, S. On the estimation and use of selectivities in database performance evaluation. Res. Rep. CS-89-24, Dept. of Computer Science, Univ. of Waterloo, June 1989.
4. IOANNIDIS, Y., AND CHRISTODOULAKIS, S. On the propagation of errors in the size of join results. In Proceedings of the 1991 ACM-SIGMOD Conference on the Management of Data (Denver, Colo., May 1991), 268-277.
5. IOANNIDIS, Y., AND CHRISTODOULAKIS, S. Error propagation under the uniform distribution assumption. In preparation, 1993.
6. KAMEL, N., AND KING, R. A model of data distribution based on texture analysis. In Proceedings of the 1985 ACM-SIGMOD Conference on the Management of Data (Austin, Tex., May 1985), 319-325.
7. KOOI, R. P. The optimization of queries in relational databases. Ph.D. dissertation, Case Western Reserve Univ., Sep. 1980.
8. MACKERT, L. F., AND LOHMAN, G. M. R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the 12th International VLDB Conference (Kyoto, Aug. 1986), 149-159.
9. MACKERT, L. F., AND LOHMAN, G. M. R* optimizer validation and performance evaluation for local queries. In Proceedings of the 1986 ACM-SIGMOD Conference on the Management of Data (Washington, D.C., May 1986), 84-95.
10. MANNINO, M. V., CHU, P., AND SAGER, T. Statistical profile estimation in database systems. ACM Comput. Surv. 20, 3 (Sep. 1988), 192-221.
11. MARSHALL, A. W., AND OLKIN, I. Inequalities: Theory of Majorization and Its Applications. Academic Press, New York, 1979.
12. MERRETT, T. H., AND OTOO, E. Distribution models of relations. In Proceedings of the 5th International VLDB Conference (Rio de Janeiro, Oct. 1979), 418-425.
13. MURALIKRISHNA, M., AND DEWITT, D. J. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the 1988 ACM-SIGMOD Conference on the Management of Data (Chicago, Ill., June 1988), 28-36.
14. MUTHUSWAMY, B., AND KERSCHBERG, L. A DDSM for relational query optimization. In Proceedings of the ACM Annual Conference (Denver, Colo., Oct. 1985).
15. PIATETSKY-SHAPIRO, G., AND CONNELL, C. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of the 1984 ACM-SIGMOD Conference on the Management of Data (Boston, Mass., June 1984), 256-276.
16. SELINGER, P. G., ASTRAHAN, M. M., CHAMBERLIN, D. D., LORIE, R. A., AND PRICE, T. G. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Symposium on Management of Data (Boston, Mass., June 1979), 23-34.
17. SELINGER, P. G. Personal communication. September 1989.
18. ZIPF, G. K. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, Mass., 1949.

Received June 1991; revised August 1992; accepted December 1992

ACM Transactions on Database Systems, Vol. 18, No. 4, December 1993.