Join Synopses

Join Synopses for Approximate Query
Answering
Swarup Achrya
Philip B. Gibbons
Viswanath Poosala
Sridhar Ramaswamy
Presented by
Bhushan Pachpande
Contents
1.
2.
3.
4.
5.
6.
7.
Introduction
Need for approximate answers
Problem with joins
Join synopses
Allocation
Maintenance of join synopses
Experimental Evaluation
Introduction
This paper
 demonstrates difficulty of providing good approximate
answers to join queries
 proposes join synopses as the efficient solution for this
problem
 presents strategy for allocating available space for join
synopses
 provides efficient algorithm for maintaining join
synopses in presence of updates to the base relations
Why Approximate answers ?



Reduce overhead for large DBs and improve response
time
Reduce access to the base relation
Example of Approximate answers



Initial queries in the data mining which are used to
determine what the interesting queries are
Queries requesting numerical answers and full precision of
exact answer not needed e.g. total, average
The research in this paper was conducted while
developing efficient approximate query answering
system, Aqua.
Aqua System




improve response time by avoiding frequent access to
original data
maintains smaller sized statistical summaries, called
synopses, on warehouse.
sits on the top of the DBMS.
There key components



Statistic Collection
Query Rewriting
Maintenance
Collects all synopses,
uses it to answer
queries posed by user
parses sql input, rewrite
queries for scaling certain
operators to fit for synopses
Keep synopses up to date during
updating of original data
Problem with Joins



Natural set of synopses for an approximate query includes uniform
random samples of each base relations
Non-uniform result samples - For the join to be uniform random
sample, probabilities of tuples in join samples must be equal
Small join result
Join of relations R & S on attribute X
R.X
S.X
a
a
a1
a2
b
b
b1
Probabilities of tuples a1 and a2 being selected
should be same as prob. of tuples a1 and b1
selected in join
a
prob. (a1,a2)=(1/r)*(1/r)*(1/r)=1/r3
b
prob. (a1,b1)=(1/r)*(1/r)*(1/r) *(1/r)=1/r4
To get uniform join samples is very difficult
uniform random sampling
Join Synopses





Naïve way - execute all possible join queries and collect samples
Join synopses - samples are taken from small set of distinguished
joins
Can obtain random samples of all possible joins in the schema
This is scheme is for foreign key joins
Modeled database schema as a
graph


vertex - base relation
directed edge (u to v) – if u has at
least one attribute which is foreign
key in v
Join Synopses

Key result proved - There is 1-1 correspondence
between a tuple in relation ‘r’ & a tuple in the output of
any foreign key join involving ‘r’ & any of its descendents
in the graph.

A sample Sr of a relation ‘r’ can be used to produce
another relation J(Sr) called a join synopsis of ‘r’. (
provides random samples).

Join synopses of R is simply a sample of R where as for
C it is the join of N, R and sample of C.
Join Synopses

For each node u in database schema G, corresponding to a relation r1,
define J(u) to be the output of the maximum foreign key join r1xr2x..xrk
with source r1.

Let Su be a uniform random sample of r1.

The join synopsis J(Su) is the output of Suxr2xr3…..xrk.

J(Su) is a uniform random sample of J(u) with |Su| tuples.

Thus we can extract from our synopsis a uniform random sample of the
output of any k-way foreign key join.

From 1 join synopsis for a node whose foreign key join has k relations,
we can extract a URS of the output of between k-1 & pow(2,k-1)-1
distinct foreign key joins.
Allocation

Allocate space among various join synopses when certain properties of
query workload are known.

Identify heuristics for the common case when such properties are not
known.

Let ‘S’ be a set of queries with selects, aggregates, group by’s & foreign
key joins.

For each relation Ri, find fraction Fi of queries in S for which Ri is the
source relation in a foreign key join.

It is known that the error bounds are inversely proportional to sqrt(n).(nnumber of tuples in join sample).

Select join synopsis sizes so as to minimize the average relative error.
Allocation

The average relative error bound over the queries is proportional to
sum(fi/sqrt(ni))

ni is selected so as to minimize the above equation for the total
memory allocated for join synopses

For each relation Ri if si = size of single join synopses tuple then join
synopses size is chosen so as
sum (nisi) <= Total memory allocated

In the absence of query work load information heuristic strategies
can be used.



EqJoin
CubeJoin
PropJoin
divides
dividesspace
spaceequally
proportionalto cube
divides
space
proportional
amongst
the join
relations
to their
join
synopses
tupletuple
root
of their
synopses
sizes
sizes
Maintenance of Join Synopses



Need to maintain the join synopses when base
relation is updated (insert or delete)
does not require frequent access to base relation
If a new tuple is inserted




Let Pu be the probability of newly arrived tuple for relation u
in random sample Su
Let uxr2xr3x….xrk be the max foreign key join with source
u.
We add ‘T’ (new tuple) to Su with probability Pu.
If ‘T’ is added to Su, we add to J(Su) the tuple Txr2xr3x….rk
Maintenance of Join Synopses


On delete of a tuple T from u



If T is added to Su and Su exceeds its target size, then
select uniformly at random a tuple T’ to evict from Su and
remove the tuple in J(Su) corresponding to T’.
T is in Su delete the tuple from Su and remove the tuple
from J(Su) corresponding to T
If sample becomes too small due to many deletions
repopulate by scanning relation u.
This algorithm performs lookups with the base
relation with small probability Pu
Experimental Evaluation

Two classes of experiments
 Accuracy experiments
 Maintenance experiments

Accuracy Experiments
 Compares accuracy of techniques based on join synopses and
based on base samples
 parameters varied - query selectivity and total space allocated to
precomputed summaries (summary size/join synopses size)

Maintenance Experiments
 Study cost of keeping join synopses up to date in presence of
insertions/deletions to the underlying data.
Experimental Evaluation

Ran results on TPC-D decision support benchmark

Query used is an aggregate that is computed on join of Lineitem,
Customer, Order, Supplier, Nation and Region.

The query used is

region parameter is set to ‘ASIA’ and selection predicate is on
o_orderdate column to the range [1/1/94, 1/1/95]
Query selectivity is varied using
these parameters
Experimental Evaluation
Accuracy Experiments
Experimental Evaluation
Maintenance Experiments
tuples inserted
in lineitem table
Conclusion

Provides uniform random sampling for joins in the
database having foreign key joins.

Focus on computing approximate answers to aggregates
computed on multi-way joins.

Join synopses can be maintained effectively during
updates.