Query Optimization

Based on slides from:
1. Ramakrishnan and Gehrke
2. H. Garcia-Molina
3. Surajit Chaudhuri

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Similar to old schema; rname added for variations.
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5

RA Tree:
π sname
  σ bid=100 ∧ rating > 5
    ⋈ sid=sid
      Reserves
      Sailors

Plan:
π sname (On-the-fly)
  σ bid=100 ∧ rating > 5 (On-the-fly)
    ⋈ sid=sid (Simple Nested Loops)
      Reserves
      Sailors

• Cost: 500+500*1000 I/Os
• By no means the worst plan!
• Misses several opportunities: selections could have been "pushed" earlier, no use is made of any available indexes, etc.
• Goal of optimization: To find more efficient plans that compute the same answer.

Alternative Plans 1 (No Indexes)
π sname (On-the-fly)
  ⋈ sid=sid (Sort-Merge Join)
    σ bid=100 (Scan; write to temp T1)
      Reserves
    σ rating > 5 (Scan; write to temp T2)
      Sailors

• Main difference: push selects.
• With 5 buffers, cost of plan:
– Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution).
– Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
– Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)
– Total: 3560 page I/Os.
• If we used BNL join, join cost = 10+4*250, total cost = 2770.
• If we "push" projections, T1 has only sid, T2 only sid and sname:
– T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.

Alternative Plans 2 (With Indexes)
π sname (On-the-fly)
  σ rating > 5 (On-the-fly)
    ⋈ sid=sid (Index Nested Loops, with pipelining)
      σ bid=100 (Use hash index; do not write result to temp)
        Reserves
      Sailors

• With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
• INL with pipelining (outer is not materialized).
– Projecting out unnecessary fields from outer doesn't help.
• Join column sid is a key for Sailors.
– At most one matching tuple, unclustered index on sid OK.
• Decision not to push rating>5 before the join is based on availability of sid index on Sailors.
• Cost: Selection of Reserves tuples (10 I/Os); for each, must get matching Sailors tuple (1000*1.2); total 1210 I/Os.
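
The page-I/O totals quoted for the three plans can be re-derived with a few lines of arithmetic. The sketch below is only a check of those numbers: the function names are made up for illustration, the sort pass counts (2 and 3) are taken as quoted on the slide rather than computed, and the block nested loops formula assumes 5 buffers with 3 pages used for the outer block.

```python
import math

RESERVES_PAGES = 1000      # from the schema slide
SAILORS_PAGES = 500        # from the schema slide
RESERVES_TUPLES = 100_000  # 1000 pages * 100 tuples per page


def simple_nested_loops():
    # Cost as quoted on the slide: 500 + 500*1000.
    return SAILORS_PAGES + SAILORS_PAGES * RESERVES_PAGES


def push_selects(join="sort-merge", buffers=5):
    t1 = RESERVES_PAGES // 100  # bid=100 with 100 boats -> 10 pages
    t2 = SAILORS_PAGES // 2     # rating > 5 with 10 ratings -> 250 pages
    materialize = RESERVES_PAGES + t1 + SAILORS_PAGES + t2
    if join == "sort-merge":
        # Pass counts (2 for T1, 3 for T2) are the slide's figures.
        join_cost = 2 * 2 * t1 + 2 * 3 * t2 + (t1 + t2)
    else:
        # Block nested loops: scan T1 once, scan T2 once per outer block.
        join_cost = t1 + math.ceil(t1 / (buffers - 2)) * t2
    return materialize + join_cost


def index_nested_loops():
    selected_pages = RESERVES_PAGES // 100    # clustered index on bid
    selected_tuples = RESERVES_TUPLES // 100  # 1000 matching tuples
    probes = selected_tuples * 12 // 10       # 1.2 I/Os per probe, as quoted
    return selected_pages + probes


print(simple_nested_loops(), push_selects("sort-merge"),
      push_selects("bnl"), index_nested_loops())
```

The four values line up with the slide totals (500+500*1000, 3560, 2770, 1210), which makes the gap between the naive plan and the index plan easy to see.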

Overview of Query Optimization
• Ideally: Want to find best plan.
• Practically: Avoid worst plans!
• Two main issues:
– For a given query, what is the search space?
– How is the search implemented?
• Algorithm to search plan space for cheapest (estimated) plan.
• How is the cost of a plan estimated?

Outline
• Query Representation
• Search space
• Search algorithm
– System Examples
• Cost estimation
• Advanced topics

Query Representation
• Query Tree
π sname (On-the-fly)
  ⋈ sid=sid (Sort-Merge Join)
    σ bid=100 (Scan; write to temp T1)
      Reserves
    σ rating > 5 (Scan; write to temp T2)
      Sailors
• Query graph
Reserves (bid=100) --- sid=sid --- Sailors (rating > 5)

Query Representation
• A query is decomposed into blocks
– Aggregation
– Order by
– SPJ
– Relations
• Each block is represented and optimized independently
• Nested sub-queries discussed later

Search Space
• Given an SPJ block:
– Equivalence Transformation
– Choice of physical operators
– Use of existing indexes
– Building indexes or sorting on the fly

SPJ block Transformations (Select)
$\sigma_{p_1 \wedge p_2}(R) = \sigma_{p_1}[\sigma_{p_2}(R)]$
$\sigma_{p_1 \vee p_2}(R) = [\sigma_{p_1}(R)] \cup [\sigma_{p_2}(R)]$
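
The two selection rules can be checked on a small example. This is a minimal sketch under set semantics (the union in the second rule removes duplicates); the relation contents and the predicates p1 and p2 are invented for illustration.

```python
def select(pred, rel):
    """Selection over a relation modelled as a set of tuples."""
    return {t for t in rel if pred(t)}


# Hypothetical Sailors-like tuples: (sid, rating, age).
R = {(1, 7, 35.0), (2, 3, 22.0), (3, 9, 51.0), (4, 4, 60.0)}

p1 = lambda t: t[1] > 5      # rating > 5
p2 = lambda t: t[2] < 40.0   # age < 40

# Conjunction: one combined selection equals a cascade of two selections.
conj_once = select(lambda t: p1(t) and p2(t), R)
conj_cascade = select(p1, select(p2, R))

# Disjunction: one combined selection equals the union of two selections.
disj_once = select(lambda t: p1(t) or p2(t), R)
disj_union = select(p1, R) | select(p2, R)

print(conj_once == conj_cascade, disj_once == disj_union)
```

Under bag semantics the disjunction rule would need a duplicate-aware union, which this sketch does not model.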
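
SPJ block Transformations (Project)
Let: X = set of attributes
Y = set of attributes
XY = X ∪ Y
$\pi_{XY}(R) = \pi_X[\pi_Y(R)]$ does not hold in general; the valid cascade rule is $\pi_X[\pi_Y(R)] = \pi_X(R)$ for $X \subseteq Y$.

SPJ block Transformations (Join - all set operations)
$R \bowtie S = S \bowtie R$
$(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)$
Can also write as trees.
[Figure: the same rules drawn as operator trees]

SPJ block Transformations (Project and Select)
Let x = subset of R attributes
z = attributes in predicate p (subset of R attributes)
$\pi_x[\sigma_p(R)] = \pi_x\{\sigma_p[\pi_{xz}(R)]\}$

SPJ block Transformations (Join)
Let p = predicate with only R attribs
q = predicate with only S attribs
m = predicate with only R,S attribs
$\sigma_p(R \bowtie S) = [\sigma_p(R)] \bowtie S$
$\sigma_q(R \bowtie S) = R \bowtie [\sigma_q(S)]$

SPJ block Transformations (Join)
Let x = subset of R attributes
y = subset of S attributes
z = intersection of R,S attributes
$\pi_{xy}(R \bowtie S) = \pi_{xy}\{[\pi_{xz}(R)] \bowtie [\pi_{yz}(S)]\}$
$\pi_{xy}\{\sigma_p(R \bowtie S)\} = \pi_{xy}\{\sigma_p[\pi_{xz'}(R) \bowtie \pi_{yz'}(S)]\}$
$z' = z \cup \{\text{attributes used in } p\}$

Which are "good" transformations?
$\sigma_{p_1 \wedge p_2}(R) \rightarrow \sigma_{p_1}[\sigma_{p_2}(R)]$
$\sigma_p(R \bowtie S) \rightarrow [\sigma_p(R)] \bowtie S$
$R \bowtie S \rightarrow S \bowtie R$
$\pi_x[\sigma_p(R)] \rightarrow \pi_x\{\sigma_p[\pi_{xz}(R)]\}$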
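
To make the "push the selection below the join" rule concrete, the sketch below evaluates both sides on toy data and compares the results. The helper names (natural_join, select, as_set) and the sample tuples are assumptions for illustration; the join is a plain nested loop, not a model of any particular system.

```python
def natural_join(r, s):
    """Join two lists of dicts on every attribute name they share."""
    out = []
    for a in r:
        for b in s:
            shared = a.keys() & b.keys()
            if all(a[k] == b[k] for k in shared):
                out.append({**a, **b})
    return out


def select(pred, rel):
    return [t for t in rel if pred(t)]


def as_set(rel):
    """Order-insensitive form of a relation, used only for comparison."""
    return {tuple(sorted(t.items())) for t in rel}


reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}, {"sid": 3, "bid": 100}]
sailors = [
    {"sid": 1, "sname": "Ann", "rating": 7},
    {"sid": 2, "sname": "Bob", "rating": 9},
    {"sid": 3, "sname": "Cy", "rating": 3},
]

p = lambda t: t["bid"] == 100   # uses only Reserves attributes
q = lambda t: t["rating"] > 5   # uses only Sailors attributes

late = select(p, natural_join(reserves, sailors))     # select after join
early = natural_join(select(p, reserves), sailors)    # select pushed down
both_late = select(lambda t: p(t) and q(t), natural_join(reserves, sailors))
both_early = natural_join(select(p, reserves), select(q, sailors))
commuted = natural_join(sailors, reserves)

print(as_set(late) == as_set(early), as_set(both_late) == as_set(both_early))
```

The pushed-down forms feed fewer tuples into the join, which is why these rewrites are usually counted among the "good" ones.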
Search Algorithm
Naïve 1
– Enumerate all possible plans (O(n!))
– Pick the best plan
– Intractable
Naïve 2
– Order of relations fixed by the query
– Selections are pushed
• No further transformations
– Single multiway nested loop join for each block
• Indexes used if they exist
• Star tree
Search Algorithm
Semi-Naïve
– Order of relations fixed by the query
– Selections are pushed
• No further transformations
– Nested loop vs. sort merge join
– Left-deep tree
Implementation problems:
• expressions reference columns of tables
• expressions must be adapted to the position of tables in the tree (including intermediate tables)
Search Algorithm
Dynamic programming (System R)
• Enumerated using N passes (if N relations joined):
– Pass 1: Find best 1-relation plan for each relation.
– Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
– Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N'th relation. (All N-relation plans.)
• For each subset of relations, retain only:
– Cheapest plan overall,
– Cheapest plan for each interesting order of the tuples.

Search Algorithm
Greedy
– Cost model
• Based on statistics (size of relation, distribution of attribute values)
• I/O cost of each operator
– Choice of join order using a greedy approach
• For each outermost table
– Find best join operator with one of the remaining tables
– Repeat until no table remains
• Keep best plan (left deep plan)
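
The difference between the dynamic-programming and greedy searches can be sketched on a toy cost model. Everything concrete here is an assumption for illustration: the relation names, cardinalities and selectivities are invented, the cost of a plan is taken to be the sum of its estimated intermediate result sizes, and interesting orders are not modelled.

```python
from itertools import combinations

# Hypothetical statistics: cardinalities and pairwise join selectivities.
CARD = {"A": 1000, "B": 100, "C": 10, "D": 5000}
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1, frozenset("CD"): 0.001}


def size(rels):
    """Estimated cardinality of joining rels (independence assumed)."""
    names = sorted(rels)
    n = 1.0
    for r in names:
        n *= CARD[r]
    for a, b in combinations(names, 2):
        n *= SEL.get(frozenset((a, b)), 1.0)  # 1.0 means cross product
    return n


def dp_left_deep(relations):
    """Pass k keeps the cheapest left-deep plan for each k-relation subset."""
    best = {frozenset([r]): (0.0, (r,)) for r in relations}  # pass 1
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            candidates = []
            for inner in subset:  # relation joined last, as the inner
                cost, order = best[subset - {inner}]
                candidates.append((cost + size(subset), order + (inner,)))
            best[subset] = min(candidates)
    return best[frozenset(relations)]


def greedy_left_deep(relations):
    """For each outermost table, keep adding the locally cheapest join."""
    best = None
    for first in relations:
        order, cost = (first,), 0.0
        remaining = set(relations) - {first}
        while remaining:
            nxt = min(sorted(remaining), key=lambda r: size(order + (r,)))
            order += (nxt,)
            cost += size(order)
            remaining.remove(nxt)
        if best is None or (cost, order) < best:
            best = (cost, order)
    return best


dp_cost, dp_order = dp_left_deep(list(CARD))
greedy_cost, greedy_order = greedy_left_deep(list(CARD))
print(dp_order, dp_cost, greedy_order, greedy_cost)
```

Under this cost model the dynamic-programming result is never worse than the greedy one, since it considers every left-deep order while keeping only one plan per subset.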

Starburst Optimizer
• Query Graph Model (QGM) for representation of queries
• Rewrite Rule Engine
– Condition -> action rules where LHS and RHS are arbitrary C functions on QGM representation
– Rules and Classes for search control
– Conflict Resolution Schemes
• Bottom up enumeration of plans
• Grammar-like set of production rules to generate execution plans
– LOLEPOP: terminals (physical operators)
– STAR: production rules (alternative implementations of query graph blocks)
– GLUE: additional rules for achieving a given property (order)

Volcano Optimizer
• Query as an algebraic tree
• Transformation rules
– Logical rules, implementation rules
• Optimization goal
– Logical expression, physical properties, estimated cost
• Top down algorithm
– Logical expressions optimized on demand
– Enumerate possible moves
  • Implement operator
  • Enforce property
  • Apply transformation rules
– Select move based on promise
– Branch and bound

Cost Estimation
• For each plan:
– Estimate cost of each operation in plan tree.
  • Cost is IO cost + w * CPU cost
  • Depends on input cardinalities.
  • We've already discussed how to estimate IO cost of different operations (sequential scan, index scan, joins, etc.)
– Estimate size of result for each operation in tree!
  • Use information about the input relations.
  • For selections and joins, assume independence of predicates.

Size Estimation and Reduction Factors
• Consider a query block:
SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk
• Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause.
• Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RF's.
– Implicit assumption that terms are independent!
– Term col=value has RF 1/NKeys(I), given index I on col
– Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
– Term col>value has RF (High(I)-value)/(High(I)-Low(I))
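
As a worked instance of these formulas, the sketch below estimates the result size of the motivating query. The tuple counts come from the schema slide (1000 pages * 100 tuples and 500 pages * 80 tuples); the NKeys and Low/High values are assumptions chosen for illustration (100 boats, ratings from 1 to 10, 40,000 distinct sid values in both indexes).

```python
import math


def rf_col_equals_value(nkeys):
    return 1 / nkeys


def rf_col_equals_col(nkeys1, nkeys2):
    return 1 / max(nkeys1, nkeys2)


def rf_col_greater_than(value, low, high):
    return (high - value) / (high - low)


def result_cardinality(cardinalities, reduction_factors):
    """Max # tuples (product of cardinalities) times the product of all RFs."""
    return math.prod(cardinalities) * math.prod(reduction_factors)


reserves_tuples = 1000 * 100   # pages * tuples per page
sailors_tuples = 500 * 80

rfs = [
    rf_col_equals_col(40_000, 40_000),   # R.sid = S.sid (assumed NKeys)
    rf_col_equals_value(100),            # R.bid = 100 (assumed 100 boats)
    rf_col_greater_than(5, 1, 10),       # S.rating > 5 (assumed Low=1, High=10)
]

estimate = result_cardinality([reserves_tuples, sailors_tuples], rfs)
print(round(estimate, 1))
```

Multiplying the factors is exactly where the independence assumption enters: correlated terms would make the product too small or too large.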

Statistics and Catalogs
• Need information about the relations and indexes involved. Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) for each relation.
– # distinct key values (NKeys) and NPages for each index.
– Index height, low/high key values (Low/High) for each tree index.
• Catalogs updated periodically.
– Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok.

Reduction Factor Estimation
• Assume uniform distribution of values
• Execute query on a sample database
• Pre-compute statistical descriptors
– Histograms: range predicates
– Frequent values, number of distinct values:
equality predicates
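
A histogram can stand in for the uniform-distribution assumption when estimating the reduction factor of a range predicate. The sketch below builds two common bucket layouts (equal value ranges and equal counts) and estimates the RF of col > value by assuming values are spread evenly inside each bucket. The function names and the sample data are assumptions for illustration; the code expects at least two distinct values and at least k values.

```python
def equi_width(values, k):
    """k buckets covering equal value ranges; returns (low, high, count)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # max value goes in last bucket
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


def equi_depth(values, k):
    """k buckets holding (almost) equal numbers of values."""
    s = sorted(values)
    n = len(s)
    buckets = []
    for i in range(k):
        chunk = s[i * n // k:(i + 1) * n // k]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


def rf_greater(buckets, value):
    """Approximate fraction of tuples with col > value."""
    total = sum(c for _, _, c in buckets)
    qualifying = 0.0
    for lo, hi, c in buckets:
        if value <= lo:
            qualifying += c                             # whole bucket counted
        elif value < hi:
            qualifying += c * (hi - value) / (hi - lo)  # partial bucket
    return qualifying / total


ages = list(range(100))  # hypothetical column values 0..99
width_hist = equi_width(ages, 4)
depth_hist = equi_depth(ages, 4)
print(rf_greater(width_hist, 49.5), rf_greater(depth_hist, 49.5))
```

On skewed data the two layouts give different estimates, and a larger k buys accuracy at the price of more catalog space.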

Histograms
• Histogram structure
– Divides the values of a column into k buckets
– k trades off between accuracy and space occupation
– Equi-width histogram
  • All buckets have same value range
– Equi-depth histogram
  • All buckets have same number of values
[Figure: two bar charts of Frequency per bucket; one chart uses the buckets 0-65, 65-140, 140-200]

Optimizing multi-block queries
• Multiblock query
– View with aggregates
– Table expression
– Nested subqueries
• Examples
SELECT c_custkey
FROM customer
WHERE 100000 <
(SELECT sum(o_price)
FROM orders
WHERE o_custkey = c_custkey)

SQL Server Solution
SELECT
  CUSTOMER
  1000000 < SUBQUERY(X)
    ScalarGb (X := SUM(o_price))
      SELECT (o_custkey = c_custkey)
        ORDERS

SQL Server Solution
Select (1000<X)
  APPLY (bind: c_custkey)
    CUSTOMER
    SGb (X=sum(o_price))
      Select (o_custkey = c_custkey)
        ORDERS
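
The effect of removing the correlated subquery can be tried out with Python's built-in sqlite3 module. The sketch below runs the nested query from the example and a join followed by grouping, then compares the answers. The sample rows are invented, c_custkey is assumed to be a key of customer, and the second query is one possible flattened form written for this illustration, not SQL Server's actual output.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY);
CREATE TABLE orders (o_custkey INTEGER, o_price REAL);
INSERT INTO customer VALUES (1), (2), (3);
INSERT INTO orders VALUES (1, 60000), (1, 50000), (2, 99999);
""")

# Nested form: the subquery is re-evaluated for each customer row.
correlated = """
SELECT c_custkey
FROM customer
WHERE 100000 < (SELECT sum(o_price)
                FROM orders
                WHERE o_custkey = c_custkey)
ORDER BY c_custkey
"""

# Flattened form: join, group by the customer key, filter on the aggregate.
flattened = """
SELECT c_custkey
FROM customer JOIN orders ON o_custkey = c_custkey
GROUP BY c_custkey
HAVING 100000 < sum(o_price)
ORDER BY c_custkey
"""

rows_correlated = [row[0] for row in con.execute(correlated)]
rows_flattened = [row[0] for row in con.execute(flattened)]
print(rows_correlated, rows_flattened)
```

Customer 3 has no orders: the subquery yields NULL for it and the join drops it, so both forms leave it out. With a different comparison (for example one that accepts NULL) the two forms would no longer be interchangeable.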

SQL Server Solution
Select (1000<X)
  SGb (X=sum(o_price))
    ⋈ o_custkey = c_custkey
      CUSTOMER
      ORDERS

Magic sets
CREATE VIEW deptavgsal AS
(SELECT E.did, avg(E.sal) as avgsal
FROM Emp E
GROUP BY E.did);

SELECT E.eid, E.sal
FROM Emp E, Dept D, deptavgsal V
WHERE E.did = D.did
AND E.did = V.did
AND E.age < 30
AND D.budget > 100000
AND E.sal > V.avgsal;

Magic sets
CREATE VIEW PartialResult AS
(SELECT E.eid, E.sal, E.did
FROM Emp E, Dept D
WHERE E.did = D.did
AND E.age < 30
AND D.budget > 100000);

CREATE VIEW Filter AS
(SELECT DISTINCT P.did
FROM PartialResult P);

CREATE VIEW LimitedDeptAvgSal AS
(SELECT E.did, avg(E.sal) as avgsal
FROM Emp E, Filter F
WHERE E.did = F.did
GROUP BY E.did);

SELECT P.eid, P.sal
FROM PartialResult P, LimitedDeptAvgSal V
WHERE P.did = V.did
AND P.sal > V.avgsal;
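
The magic-sets rewrite can be checked end to end with Python's built-in sqlite3 module. This is a sketch on invented data: the table contents are hypothetical, the original query is assumed to join Emp and Dept on did and to return eid and sal, and the Filter view is renamed FilterDids here only to stay clear of the SQL keyword.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Emp (eid INTEGER, did INTEGER, sal INTEGER, age INTEGER);
CREATE TABLE Dept (did INTEGER, budget INTEGER);
INSERT INTO Emp VALUES (1, 10, 100, 25), (2, 10, 300, 50), (3, 10, 250, 28),
                       (4, 20, 500, 22), (5, 20, 100, 60),
                       (6, 30, 900, 29), (7, 30, 100, 40);
INSERT INTO Dept VALUES (10, 200000), (20, 50000), (30, 150000);

-- Original formulation: average salary computed for every department.
CREATE VIEW deptavgsal AS
  SELECT E.did, avg(E.sal) AS avgsal FROM Emp E GROUP BY E.did;

-- Magic-sets formulation: averages only for departments that can qualify.
CREATE VIEW PartialResult AS
  SELECT E.eid, E.sal, E.did FROM Emp E, Dept D
  WHERE E.did = D.did AND E.age < 30 AND D.budget > 100000;
CREATE VIEW FilterDids AS
  SELECT DISTINCT P.did FROM PartialResult P;
CREATE VIEW LimitedDeptAvgSal AS
  SELECT E.did, avg(E.sal) AS avgsal FROM Emp E, FilterDids F
  WHERE E.did = F.did GROUP BY E.did;
""")

original = con.execute("""
SELECT E.eid, E.sal FROM Emp E, Dept D, deptavgsal V
WHERE E.did = D.did AND E.did = V.did
  AND E.age < 30 AND D.budget > 100000 AND E.sal > V.avgsal
ORDER BY E.eid
""").fetchall()

rewritten = con.execute("""
SELECT P.eid, P.sal FROM PartialResult P, LimitedDeptAvgSal V
WHERE P.did = V.did AND P.sal > V.avgsal
ORDER BY P.eid
""").fetchall()

limited_depts = [row[0] for row in
                 con.execute("SELECT did FROM LimitedDeptAvgSal ORDER BY did")]
print(original, rewritten, limited_depts)
```

Department 20 never reaches the aggregation in the rewritten form, which is the saving the filter is meant to provide.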

[Figure: filter-join plan over V, E, and D]