
A Robust, Optimization-Based Approach
for Approximate Answering of
Aggregate Queries
By:
Surajit Chaudhuri
Gautam Das
Vivek Narasayya
Presented by: Sayed Muchallil
September 21st, 2010
CONTENTS
1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY
PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY
Pre-computed samples
• Can give approximate answers very efficiently.
• Workloads are used to make sure that errors
are acceptable.
Previous Studies
• The solution is difficult to evaluate theoretically.
• Do not formally deal with uncertainty in the
expected workload.
• Ignore the variance in the data distribution.
Sample
Table R
Product ID    Revenue
1             10
2             10
3             10
4             1000
Only 50% of the records in R can be used as a sample.
Query: “SELECT SUM(Revenue) FROM R”
The exact answer is 1030.
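Sample (cont.)
Sample Table S1
(rows not shown; an answer of 40 corresponds to two records with Revenue 10)
The answer for the query for table S1 is 40.
Sample Table S2
Product ID    Revenue
1             10
4             1000
The answer for the query for table S2 is 2020.
How are these answers obtained?
The SUM over the sample is scaled up by the inverse of the
sampling fraction (1 / 0.5 = 2).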
Sample (cont.)
• A large variance in the aggregate column can
lead to large relative errors.
• Relative error = |y - y’| / y
• Relative error for S1 = |1030 – 40| / 1030 = 990 / 1030 ≈ 0.96
• Relative error for S2 = |1030 – 2020| / 1030 = 990 / 1030 ≈ 0.96
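The estimates for S1 and S2 can be reproduced by scaling each sample's SUM by the inverse of the sampling fraction. The Python sketch below is a minimal illustration, not code from the paper. It assumes S1 holds two of the records with Revenue 10 (the slides give only its answer, 40), and the function names are made up for the example.

```python
def scaled_sum(sample_values, sampling_fraction):
    """Estimate SUM over the full table from a uniform sample."""
    return sum(sample_values) / sampling_fraction


def relative_error(exact, estimate):
    """Relative error |y - y'| / y of an estimate y' against the exact answer y."""
    return abs(exact - estimate) / exact


revenue = [10, 10, 10, 1000]      # Table R
exact = sum(revenue)              # 1030

s1 = [10, 10]                     # assumed contents of S1 (not shown on the slide)
s2 = [10, 1000]                   # S2: products 1 and 4

est_s1 = scaled_sum(s1, 0.5)      # 20 / 0.5
est_s2 = scaled_sum(s2, 0.5)      # 1010 / 0.5

err_s1 = relative_error(exact, est_s1)
err_s2 = relative_error(exact, est_s2)
print(est_s1, est_s2, round(err_s1, 3), round(err_s2, 3))
```

Both samples miss by the same absolute amount (990), one by under-counting and the other by over-counting the single large value.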
What’s New ?
The goal is to pick a sample that minimizes the error.
If the actual workload is identical to the given
workload (fixed), the error will be smaller.
Works for queries that are identical or similar to
those in the given workload.
Sampling
• Two ways for selecting samples
– Randomized
– Deterministic
• A workload W is a set of pairs of queries and
their weights.
– W = {<Q1, w1>, <Q2, w2>, …, <Qq, wq>}
– Σi wi = 1.
Architecture
for
Approximate Query Processing
Architecture (cont.)
Offline Component
Selects a sample of records from relation R
Online Component
Rewrites an incoming query to use the sample.
What does “rewrite” mean?
Reports the answer with an error estimate
Architecture (cont.)
A new method for automatically lifting a given
workload.
It is unrealistic to assume that the incoming
queries will be identical to the given workload.
The key: the ability to compute a probability
distribution pW over incoming queries.
Error Metrics
• Relative Error: |y - y’| / y
• Squared Error: SE(Q) = (|y - y’| / y)²
• Squared Error for a GROUP BY query with g groups:
SE(Q) = (1/g) Σi ((yi – yi’) / yi)²
• Given a probability distribution of queries pW
• Mean squared error for the distribution:
MSE(pW) = ΣQ pW(Q) * SE(Q)
• Root mean squared error:
RMSE(pW) = √MSE(pW)
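The error metrics above translate directly into code. The following Python sketch is illustrative only: the function names, the two-query distribution, and its numbers are hypothetical and do not come from the paper.

```python
import math


def squared_error(exact, estimate):
    """SE(Q) for a scalar aggregate: the squared relative error."""
    return ((exact - estimate) / exact) ** 2


def squared_error_group_by(exact_groups, estimate_groups):
    """SE(Q) for a GROUP BY query: mean squared relative error over the g groups."""
    g = len(exact_groups)
    return sum(((y - y_est) / y) ** 2
               for y, y_est in zip(exact_groups, estimate_groups)) / g


def mse(weighted_errors):
    """MSE: sum of p(Q) * SE(Q) over (probability, squared error) pairs."""
    return sum(p * se for p, se in weighted_errors)


def rmse(weighted_errors):
    """RMSE: square root of the MSE."""
    return math.sqrt(mse(weighted_errors))


# Hypothetical distribution over two queries with probabilities 0.75 and 0.25.
errors = [
    (0.75, squared_error(100.0, 90.0)),                          # scalar query
    (0.25, squared_error_group_by([10.0, 20.0], [10.0, 30.0])),  # GROUP BY, g = 2
]
print(mse(errors), rmse(errors))
```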
Fixed Workload
Special case?
The incoming queries are “identical” to the given
workload.
Problem: FIXEDSAMP
Input: R, W, k
Output: A sample of k records (with appropriate
additional columns) such that MSE(W) is minimized.
Fundamental Regions
Relation R contains 9 records
W consists of 2 queries
• Q1 = select records with C values between 10 and 50
• Q2 = select records with C values between 40 and 70
These queries divide relation R into 4
fundamental regions.
Fundamental Regions (cont.)
• Partition the records in R into a minimum
number of regions R1, R2, …, Rr such that for
any region Rj, each query in W selects either
all records in Rj or none.
• Total number of fundamental regions = ?
At most min(2^|W|, n), where n is the number of records in R
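One way to find fundamental regions is to group records by the exact set of workload queries that select them. The Python sketch below is a minimal illustration, not the paper's procedure. The nine C values are hypothetical (the slides do not list them), and "between" is assumed to be inclusive.

```python
from collections import defaultdict


def fundamental_regions(records, queries):
    """Group records by the exact set of workload queries that select them."""
    regions = defaultdict(list)
    for rec in records:
        # One boolean per query: records with equal signatures share a region.
        signature = tuple(pred(rec) for pred in queries)
        regions[signature].append(rec)
    return dict(regions)


# Hypothetical C values for the 9 records of R.
c_values = [5, 15, 25, 35, 45, 50, 60, 70, 90]

workload = [
    lambda c: 10 <= c <= 50,   # Q1
    lambda c: 40 <= c <= 70,   # Q2
]

regions = fundamental_regions(c_values, workload)
for signature, members in regions.items():
    print(signature, members)
```

With these values the two queries produce four regions: records selected by neither query, by Q1 only, by both, and by Q2 only.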
FIXEDSAMP Solution
Step 1. Identify fundamental regions in R
• r ≤ k
• r > k
Step 2. Pick sample records
Step 3. Assign values to additional columns
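For the case where the number of regions does not exceed k, one possible reading of Steps 2 and 3 is to keep one representative record per fundamental region and store that region's record count and aggregate sum in the additional columns. The Python sketch below assumes this reading, which the slides do not spell out. The column names `region_count` and `region_sum` and the data are hypothetical, and the case where r exceeds k is not covered.

```python
from collections import defaultdict


def fixed_sample(records, queries, value_of):
    """Build one sample row per fundamental region (assumes r <= k)."""
    regions = defaultdict(list)
    for rec in records:
        regions[tuple(q(rec) for q in queries)].append(rec)
    sample = []
    for members in regions.values():
        sample.append({
            "record": members[0],                              # representative record
            "region_count": len(members),                      # additional column
            "region_sum": sum(value_of(m) for m in members),   # additional column
        })
    return sample


def answer_sum(sample, query):
    """Answer a SUM workload query from the sample's additional columns."""
    return sum(row["region_sum"] for row in sample if query(row["record"]))


# Hypothetical data: the aggregate column is the C value itself.
c_values = [5, 15, 25, 35, 45, 50, 60, 70, 90]
q1 = lambda c: 10 <= c <= 50
q2 = lambda c: 40 <= c <= 70

sample = fixed_sample(c_values, [q1, q2], value_of=lambda c: c)
exact_q1 = sum(c for c in c_values if q1(c))
print(len(sample), answer_sum(sample, q1), exact_q1)
```

Because every workload query selects whole regions, a query identical to one in the workload is answered without error under this scheme.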
LIFTING WORKLOAD TO QUERY
DISTRIBUTION
If a query Q’ is not identical to a workload query, pW(Q’) is
high if Q’ is similar to queries in the workload, and low if
not.
Q’ and Q are similar if the records they select overlap
significantly.
LIFTED WORKLOAD
p{Q}(R’) is the probability of occurrence of any query
that selects exactly the set of records R’.
For any given record inside (resp. outside) RQ, the set of
records selected by Q, the parameter δ (resp. γ) represents
the probability that an incoming query will select this record.
LIFTED WORKLOAD (Cont.)
δ → 1 and γ → 0: implies that incoming queries are
identical to workload queries.
δ → 1 and γ → ½: implies that incoming queries are
supersets of workload queries.
δ → ½ and γ → 0: implies that incoming queries are
subsets of workload queries.
δ → ½ and γ → ½: implies that incoming queries are
unrestricted.
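The roles of δ and γ can be made concrete. If each record is selected independently (an assumption of this sketch), the probability that a lifted query selects exactly R’ is a product over four groups of records: inside RQ and selected, inside RQ and missed, outside RQ and selected, outside RQ and missed. The function name and the example data below are hypothetical.

```python
def lifted_probability(all_records, r_q, r_prime, delta, gamma):
    """p{Q}(R'): probability that a query lifted from Q selects exactly r_prime."""
    all_records, r_q, r_prime = set(all_records), set(r_q), set(r_prime)
    in_q_selected = len(r_q & r_prime)
    in_q_missed = len(r_q - r_prime)
    out_q_selected = len(r_prime - r_q)
    out_q_missed = len(all_records - r_q - r_prime)
    return (delta ** in_q_selected * (1 - delta) ** in_q_missed
            * gamma ** out_q_selected * (1 - gamma) ** out_q_missed)


records = {1, 2, 3, 4}
r_q = {3, 4}                      # records selected by workload query Q

# delta near 1 and gamma near 0: almost all probability mass is on R' == RQ.
p_same = lifted_probability(records, r_q, {3, 4}, delta=0.99, gamma=0.01)
p_superset = lifted_probability(records, r_q, {2, 3, 4}, delta=0.99, gamma=0.01)
print(p_same, p_superset)
```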
RATIONALE FOR STRATIFIED SAMPLING
A population is partitioned into multiple
strata, and samples are selected uniformly
from each stratum.
STRATIFIED SAMPLING
A stratified sampling scheme partitions R into r
strata containing n1, …, nr records (where Σnj = n),
with k1, …, kr records uniformly sampled from
each stratum (where Σkj = k).
Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN (3,4);
Product ID    Revenue
1             10
2             10
3             10
4             1000
POPQ is the population of query Q.
POPQ1 = {0,0,1,1}, which has non-zero variance.
Divided into two strata, {0,0} and {1,1}, each with zero variance.
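The COUNT example above can be checked numerically. Within each of the two strata of POPQ1 all values are equal, so a stratified sample returns the exact count no matter which records are drawn. The Python sketch below is a minimal illustration; the function name and the choice of one record per stratum are assumptions.

```python
import random


def stratified_estimate(strata, sizes, rng):
    """Estimate a population total: scale each stratum's sample sum by n_j / k_j."""
    total = 0.0
    for values, k_j in zip(strata, sizes):
        picked = rng.sample(values, k_j)          # uniform sample within the stratum
        total += len(values) / k_j * sum(picked)
    return total


# POPQ1 for Q1: 1 if the record satisfies ProductID IN (3, 4), else 0.
strata = [[0, 0], [1, 1]]
rng = random.Random(0)

estimate = stratified_estimate(strata, sizes=[1, 1], rng=rng)
print(estimate)
```

The exact answer to Q1 is 2, and the estimate equals it for any random seed because neither stratum has any variance.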
SOLUTION FOR SINGLE-TABLE SELECTION
QUERIES WITH AGGREGATION
Stratification
How many strata
How many records for each stratum
Allocation
Determines how to divide k
Sampling
Forms the final sample of k records
SOLUTION FOR COUNT AGGREGATE
Stratification (lemma 1)
r is not known: divide R into fundamental regions
and treat them as strata.
Allocation (lemma 2)
MSE(pW) = Σi wi MSE(p{Qi})
MSE(pW) can be expressed as a weighted sum of
the MSE of each query in the workload
SOLUTION FOR COUNT AGGREGATE (Cont.)
For any Q ∈ W, we express MSE(p{Q}) as a function of
the kj’s
Lemma 3 :
ApproxMSE(p{Q}) =
Then,
SOLUTION FOR COUNT AGGREGATE (Cont.)
• Since we have an (approximate) formula for MSE(p{Q}),
we can express MSE(pW) as a function of the kj
variables.
Corollary 1: MSE(pW) = Σj (αj / kj), where each αj is a
function of n1, …, nr, δ, and γ.
αj captures the “importance” of a region; it is positively
correlated with nj as well as with the frequency of queries in
the workload that access Rj.

Now we can minimize MSE(pw).
SOLUTION FOR COUNT AGGREGATE (Cont.)
Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k
if kj = k * ( sqrt(αj) / Σi sqrt(αi) )
• This provides a closed-form and computationally
inexpensive solution to the allocation problem,
since αj depends only on δ, γ, and the number of
tuples in each fundamental region.
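The closed form in Lemma 4 is easy to evaluate once the αj values are known. The Python sketch below uses hypothetical αj values (the slides do not give the formula for computing them) and compares the resulting objective Σj (αj / kj) against an even split of k.

```python
import math


def allocate(alphas, k):
    """Lemma 4 allocation: k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]


def objective(alphas, ks):
    """The quantity being minimized: sum_j alpha_j / k_j."""
    return sum(a / kj for a, kj in zip(alphas, ks))


alphas = [1.0, 4.0, 9.0]          # hypothetical importance values
k = 60

ks = allocate(alphas, k)          # proportional to sqrt(alpha), i.e. 1 : 2 : 3
uniform = [k / len(alphas)] * len(alphas)

print(ks, objective(alphas, ks), objective(alphas, uniform))
```

In general the kj produced this way are not integers, which is why obtaining an integer solution appears among the pragmatic issues.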
SOLUTION FOR SUM AGGREGATE
Stratification
Bucketing technique
Divide fundamental regions with large variance into a
set of finer regions.
Treat each region as a stratum
Allocation
yj (Yj) is the average (sum) of the aggregate column values
of all records in region Rj
SOLUTION FOR SUM AGGREGATE (Cont.)
Each value in the region can be approximated
as yj
An approximate formula for MSE(p{Q}) for a
SUM query Q in W
Pragmatic Issues
Identifying Fundamental Regions
Handling Large Number of Fundamental Regions
Obtaining Integer Solution
Obtaining unbiased error
STRAT ALGORITHM
IMPLEMENTATION AND EXPERIMENTAL RESULT
This experiment compares the STRAT method
to other methods.
• USAMP – uniform random sampling
• WSAMP – weighted sampling
• OTLIDX – outlier indexing combined with
weighted sampling
• CONG – Congressional sampling
COUNT AGGREGATE
SUM AGGREGATE
COUNT AGGREGATE
THANK YOU