A Robust, Optimization-Based Approach
for Approximate Answering of
Aggregate Queries
By :
Surajid Chaudhuri
Gautam Das
Vivek Narasayya
Presented by :Sayed Muchallil
September 21st, 2010
CONTENTS
1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY
PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY
Pre-computed samples
Can give approximate answer very efficiently.
Workload are used to make sure that errors
are acceptable.
Previous Studies
Solution is difficult to evaluate theoretically.
Do not formally deal with uncertainty in the
expected workload.
Ignoring the variance in the data distribution.
Sample
Product ID
Revenue
1
10
2
10
3
10
4
1000
Only 50% of R records can be used as
sample
Query : “SELECT SUM(Revenue) FROM
R”
The answer for is 1030
Table R
Sample (cont.)
The answer for the query for table S1 is
40.
Sample Table S1
Product
ID
Revenue
1
10
4
1000
Sample Table S2
The answer for the query for table S2 is
2020.
How to get these answer?
Sample (cont.)
large variance in the aggregate column can
lead to large relative errors.
Relative error = |y - y’| / y
Relative error for S1 = |1030 – 40| / 1030
Relative error for S2 = |1030 – 2020| / 1030
What’s New ?
The goal is to pick sample that minimize error.
If actual workload is identical to the given
workload (fixed), error will be smaller.
Can work for identical and similar query to the
given workload.
Sampling
• Two ways for selecting samples
– Randomized
– Deterministic
• A Workload W is a set of pairs of queries and
their weight.
– W = {<Q1, w1>,<Q2, w2>,…<Qq, wq>}
– Σiwi = 1.
Architecture
for
Approximate Query Processing
Architecture (cont.)
Offline Component
Selects sample or records from relation R
Online Component
Rewrites an incoming query to use the sample.
What is “rewrites” means?
Reports answer with an estimate error
Architecture (cont.)
New method for automatically lifting a given
workload.
It is unrealistic to assume that the incoming
queries will be identical to the given workload.
The key : the ability to compute a probability
distribution Pw.
Error Metrics
Relative Error : |y - y’| / y
Squared Error : SE(Q) = (|y - y’| / y)²
Squared Error for GROUP BY query
SE(Q) = (1/g) Σi ((yi – yi’)/ yi)²
a probability distribution of queries pw
Mean squared error for the distribution:
MSE(pw) =ΣQ pw(Q)*SE(Q)
Root mean squared error :
RMSE(pw) = √MSE(pw)
Fixed Workload
Special case ?
A given workload are “identical” to the incoming
queries.
Problem: FIXEDSAMP
Input: R, W, k
Output: A sample of k records (with appropriate
additional columns) such that MSE(W) is minimized.
Fundamental Regions
Relation R contains 9 records
W consists of 2 queries
Q1 = select records with C values between 10 -50
Q2 = select records with C values between 40 -70
These queries divide Relation R into 4
fundamental regions.
Fundamental Regions (cont.)
Fundamental Regions (cont.)
• partitioning the records in R into a minimum
number of regions R1, R2, …, Rr such that for
any region Rj, each query in W selects either
all records in Rj or none.
• Total number fundamental regions =?
Min(2|W|, n)
FIXEDSAMP Solution
Step 1. Identify Fundamental Regions in R
r <= k
r>k
Step 2 Pick Sample Records
Step 3 Assign values to additional columns
LIFTING WORKLOAD TO QUERY
DISTRIBUTION
Query Q’ is not identical, Pw(Q’) is high if Q’ is
similar to queries in the workload, and Low if
not.
Q’ and Q are similar if selected records have
significant overlap.
LIFTED WORKLOAD
P{Q}(R’) is the probability of occurrence of any query
that selects exactly the set of records R’.
For any given record inside (resp. outside) RQ, the
parameter δ (resp. γ) represents the probability that
an incoming query will select this record
LIFTED WORKLOAD (Cont.)
LIFTED WORKLOAD (Cont.)
δ → 1 and γ → 0: implies that incoming queries are
identical to workload queries.
δ → 1 and γ → ½: implies that incoming queries are
supersets of workload queries.
δ → ½ and γ → 0: implies that incoming queries are
subsets of workload queries.
δ → ½ and γ → ½: implies that incoming queries are
unrestricted.
RATIONALE FOR STRATIFIED SAMPLING
A population is partitioned into multiple
strata, and samples are selected uniformly
from each stratum.
STRATIFIED SAMPLING
a stratified sampling scheme partitions R into r
strata containing n1, ., nr records (where Σnj = n),
with k1, …, kr records uniformly sampled from
each stratum (where Σkj = k).
Q1 = SELECT COUNT(*) FROM R WHERE
ProductID IN(3,4);
Product ID
Revenue
1
10
2
10
3
10
POPQ1 = {0,0,1,1} = non-zero variance
4
1000
Divided into two strata {0,0} and {1,1}
POPQ is population of query Q
SOLUTION FOR SINGLE-TABLE SELECTION
QUERIES WITH AGGREGATION
Stratification
How many strata
How many records for each stratum
Allocation
Determines how to divide k
Sampling
Forms the final sample of k record
SOLUTION FOR COUNT AGGREGATE
Stratification (lemma 1)
r is not known, divide R into fundamental regions
and treat them as strata.
Allocation (lemma 2)
MSE(pW) = Σi wi MSE(p{Q})
MSE(pW) can be expressed as a weighted sum of
the MSE of each query in the workload
SOLUTION FOR COUNT AGGREGATE (Cont.)
For any Q ε W, we express MSE(p{Q}) as a function of
the kj’s
Lemma 3 :
ApproxMSE(p{Q}) =
Then,
SOLUTION FOR COUNT AGGREGATE (Cont.)
Since we have an (approximate) formula for MSE(p{Q}),
we can express MSE(pw) as a function of the kj’s
variables.
Corollary 1 : MSE(pw) = Σj(αj / kj), where each αj is a
function of n1,…,nr, δ, and γ.
αj captures the “importance” of a region; it is positively
correlated with nj as well as the frequency of queries in
the workload that access Rj.
Now we can minimize MSE(pw).
SOLUTION FOR COUNT AGGREGATE (Cont.)
Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k
if kj = k * ( sqrt(αj) / Σi sqrt(αi) )
This provides a closed-form and computationally
inexpensive solution to the allocation problem
since αj depends only on δ, γ and the number of
tuples in each fundamental region
SOLUTION FOR SUM AGGREGATE
Stratification
Bucketing technique
Divide fundamental regions with large variance into a
set of finer regions.
Treat each region as strata
Allocation
Yj is average (sum) of the aggregate column values
of all records in region Rj
SOLUTION FOR SUM AGGREGATE (Cont.)
Each value in the region can be approximated
as yj
An approximate formula for MSE(P{Q}) for
SUM query Q in W
Pragmatic Issues
Identifying Fundamental Regions
Handling Large Number of Fundamental Regions
Obtaining Integer Solution
Obtaining unbiased error
STRAT ALGORITHM
IMPLEMENTATION AND EXPERIMENTAL RESULT
This experiment compares the STRAT method
to other methods.
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with
weighted sampling
CONG – Congressional sampling
COUNT AGGREGATE
SUM AGGREGATE
COUNT AGGREGATE
THANK YOU
© Copyright 2026 Paperzz