
Relational Query Optimization

Big Picture Overview

SQL Query:

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
AND R.bid = 100
AND S.rating > 5

The Query Parser translates the SQL into relational algebra:

π_S.sname(σ_bid=100 ∧ rating>5(Reserves ⋈_R.sid=S.sid Sailors))

(Logical) Query Plan: Reserves and Sailors at the leaves, joined by
⋈_R.sid=S.sid, then the selection σ_R.bid=100 ∧ S.rating>5, then the
projection π_S.sname at the root.

The Query Optimizer turns this into an Optimized (Physical) Query Plan,
choosing operator code for each node:
- Reserves: B+-tree indexed scan iterator, applying σ_R.bid=100
- Sailors: heap scan iterator
- ⋈_R.sid=S.sid: indexed nested loop join iterator
- σ_S.rating>5: on-the-fly select iterator
- π_S.sname: on-the-fly project iterator
Relational Operators as a Graph

π_sname(π_sid(π_bid(σ_color='red'(Boats)) ⋈ Reserves) ⋈ Sailors)

- Query plan
  - Vertices = operators
  - Edges encode "flow" of tuples
  - Source vertices = table access operators
- Also called a dataflow graph

As a dataflow graph: σ_color='red' reads Boats and feeds π_bid; its
output joins (⋈) with Reserves and feeds π_sid; that output joins (⋈)
with Sailors and feeds π_sname at the root.
Query Executor Instantiates Operators

π_sname(π_sid(π_bid(σ_color='red'(Boats)) ⋈ Reserves) ⋈ Sailors)

- Query planner selects operators
- Each operator instance:
  - Implements the iterator interface
  - Efficiently executes operator logic, forwarding tuples to the next operator

Operator code chosen for each node of the dataflow graph:
- Boats, Reserves, Sailors: B+-tree indexed scan iterators
- σ_color='red': on-the-fly select iterator
- π_bid, π_sid: on-the-fly project iterators
- Joins (⋈): indexed nested loop join iterators
- π_sname: on-the-fly project iterator
Query Optimization Overview

- Query can be converted to relational algebra
- Relational algebra converts to a tree
- Each operator has implementation choices
- Operators can also be applied in different orders!

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5

π_sname(σ_bid=100 ∧ rating>5(Reserves ⋈_sid=sid Sailors))

As a tree: Reserves ⋈_sid=sid Sailors at the bottom, then σ_bid=100
and σ_rating>5, then π_sname at the root.
Query Optimization Overview (cont.)

- Plan: Tree of R.A. ops (and some others) with a choice of algorithm for each op.
  - Recall: Iterator interface (next()!)
- Three main issues:
  - For a given query, what plans are considered?
  - How is the cost of a plan estimated?
  - How do we "search" in the "plan space"?
- Ideally: Want to find the best plan.
- Reality: Avoid the worst plans!
Cost-based Query Sub-System

Queries (e.g.):

Select *
From Blah B
Where B.blah = blah

Dataflow: queries enter the Query Parser, then the Query Optimizer
(made up of a Plan Generator and a Plan Cost Estimator), and the chosen
plan is handed to the Query Executor. The Catalog Manager supplies the
schema and statistics that the optimizer consults.

Note: usually there is a rewriting step before the cost-based steps to
simplify queries (esp. to "flatten" views and subqueries).
Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)

- As seen in previous lectures…
- Reserves:
  - Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
  - Assume there are 100 boats.
- Sailors:
  - Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
  - Assume there are 10 different ratings.
- Assume we have 5 pages in our buffer pool!
Motivating Example

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5

Plan: Reserves ⋈_sid=sid Sailors via page-oriented nested loops, then
σ_bid=100 and σ_rating>5 on-the-fly, then π_sname on-the-fly.

- Cost: 500 + 500*1000 I/Os
- By no means the worst plan!
- Misses several opportunities:
  - selections could be 'pushed' down
  - no use made of indexes
- Goal of optimization: Find faster plans that compute the same answer.
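The cost figure above can be checked with a tiny calculator (a minimal sketch; the formula and the page counts are the slide's, the function name is illustrative):

```python
# Sketch: I/O cost of a page-oriented nested loops join,
# cost = [outer] + [outer] * [inner].

def pnl_join_cost(outer_pages: int, inner_pages: int) -> int:
    """Read each outer page once, and rescan the whole inner
    relation once per outer page."""
    return outer_pages + outer_pages * inner_pages

# Sailors as outer (500 pages), Reserves as inner (1000 pages):
print(pnl_join_cost(500, 1000))  # 500 + 500*1000 = 500500
```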
Sorting: 2-Way (a strawman)

- Pass 0 (conquer):
  - read a page, sort it in place, write it.
  - only one buffer page is used
  - a repeated "batch job"

(Diagram: pages flow from disk via I/O into a single RAM buffer, are
sorted in place, and are written back out.)
Sorting: 2-Way (a strawman)

- Pass 0 (conquer):
  - read a page, sort it, write it.
  - only one buffer page is used
  - a repeated "batch job"
- Pass 1, 2, 3, …, etc. (merge):
  - requires 3 buffer pages
    - note: this has nothing to do with double buffering!
  - merge pairs of runs into runs twice as long
  - a streaming algorithm, as in the previous slide!

(Diagram: two input buffers feed one output buffer in RAM.)
Two-Way External Merge Sort

- Conquer and Merge: sort subfiles and merge

Example: a 7-page input file, two records per page ("|" separates runs):

Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2
PASS 2 (4-page runs):  2,3,4,4,6,7,8,9 | 1,2,3,5,6
PASS 3 (8-page run):   1,2,2,3,3,4,4,5,6,6,7,8,9

- Each pass we read + write each page in the file (2N I/Os)
- N pages in the file, so the number of passes is: ⌈log₂ N⌉ + 1
- So total cost is: 2N(⌈log₂ N⌉ + 1)
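The passes above can be simulated at page granularity (a minimal sketch; `sorted()` on the concatenated pair stands in for a streaming 2-way merge):

```python
# Sketch of two-way external merge sort, mirroring the worked example:
# pass 0 sorts each page, later passes merge pairs of runs.

def two_way_sort(pages):
    # Pass 0 (conquer): sort each page individually -> 1-page runs.
    runs = [sorted(p) for p in pages]
    passes = 1
    # Passes 1, 2, ...: merge pairs of runs into runs twice as long.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            pair = runs[i] + runs[i + 1] if i + 1 < len(runs) else runs[i]
            merged.append(sorted(pair))  # stand-in for a streaming merge
        runs = merged
        passes += 1
    return runs[0], passes

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
result, passes = two_way_sort(pages)
print(result)  # [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]
print(passes)  # 4, matching ceil(log2(7)) + 1
```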
General External Merge Sort

- More than 3 buffer pages. How can we utilize them?
- To sort a file with N pages using B buffer pages:
  - Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
  - Pass 1, 2, …, etc.: merge B-1 runs at a time.

(Diagram: the conquer phase reads B pages at a time, producing ⌈N/B⌉
sorted runs of length B; the merge phase combines B-1 runs through B-1
input buffers and one output buffer, producing sorted runs of length B(B-1).)
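The merge step can be sketched with the standard library's k-way merge (a minimal sketch; `heapq.merge` keeps only one "current" element per run in memory, like one input buffer per run):

```python
# Sketch: merging B-1 already-sorted runs into one longer sorted run.
import heapq

def merge_runs(runs):
    """Stream-merge any number of sorted runs into a single sorted list."""
    return list(heapq.merge(*runs))

# With B = 4 buffers we can merge B-1 = 3 runs at a time:
runs = [[1, 4, 9], [2, 3, 7], [5, 6, 8]]
print(merge_runs(runs))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```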
Cost of External Merge Sort

- Number of passes: 1 + ⌈log_(B-1) ⌈N/B⌉⌉
- Cost = 2N * (# of passes)
- E.g., with 5 buffer pages, to sort a 108-page file:
  - Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
  - Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
  - Pass 2: 2 sorted runs, 80 pages and 28 pages
  - Pass 3: Sorted file of 108 pages
- Formula check: 1 + ⌈log₄ 22⌉ = 1 + 3 = 4 passes ✓
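The pass-count formula can be computed by repeating the merge step rather than taking a floating-point log (a minimal sketch; function names are illustrative):

```python
# Sketch: passes and total I/O cost of general external merge sort.
from math import ceil

def sort_passes(N: int, B: int) -> int:
    """1 pass to make ceil(N/B) runs, then merge B-1 runs at a time."""
    runs = ceil(N / B)
    passes = 1
    while runs > 1:
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

def sort_cost(N: int, B: int) -> int:
    # Read + write every page, once per pass.
    return 2 * N * sort_passes(N, B)

print(sort_passes(108, 5))  # 4, matching 1 + ceil(log4(ceil(108/5)))
print(sort_cost(108, 5))    # 2 * 108 * 4 = 864
```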
Memory Requirement for External Sorting

- How big of a table can we sort in two passes?
  - Each "sorted run" after Phase 0 is of size B
  - Can merge up to B-1 sorted runs in Phase 1
- Answer: B(B-1).
  - So we can sort N pages of data with about B = √N buffer pages

(Diagram: same conquer/merge picture as the previous slide — ⌈N/B⌉
sorted runs of length B, merged B-1 at a time into runs of length B(B-1).)
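The B(B-1) bound gives the smallest buffer count for a two-pass sort (a minimal sketch; the function name is illustrative):

```python
# Sketch: smallest B such that B*(B-1) >= N, i.e. the fewest buffer
# pages needed to sort an N-page file in two passes.
from math import isqrt

def min_buffers_two_pass(N: int) -> int:
    B = isqrt(N)  # B is about sqrt(N); adjust upward if needed
    while B * (B - 1) < N:
        B += 1
    return B

print(min_buffers_two_pass(108))  # 11, since 11 * 10 = 110 >= 108
```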
Block Nested Loop Join (Assume B Pages of Memory)

foreach block bR of B-2 pages in R do
  foreach page bS in S do
    foreach record r in bR do
      foreach record s in bS do
        if θ(r, s) then add <r, s> to result

(B-2 memory pages hold the current block of R; the remaining pages are
used for scanning S and for output.)

[R]=1000, pR=100, |R| = 100,000
[S]=500, pS=80, |S| = 40,000

Cost (assume B = 102):
- ([R]/(B-2)) * [S] + [R] = 10 * 500 + 1,000 = 6,000
- Swap R and S? ([S]/(B-2)) * [R] + [S] = 5 * 1,000 + 500 = 5,500
- @10ms per I/O → 55 seconds!
- Can we do better?
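The loop structure above can be sketched over in-memory "pages" (a minimal sketch; relations are lists of pages, pages are lists of tuples, and the example data is illustrative):

```python
# Sketch of block nested loops join: load B-2 pages of the outer per
# block, and rescan the inner relation once per block.

def block_nl_join(outer_pages, inner_pages, B, theta):
    result = []
    block_size = B - 2  # B-2 pages for the outer block, 1 in, 1 out
    for i in range(0, len(outer_pages), block_size):
        # Flatten the current block of outer pages into one record list.
        block = [r for page in outer_pages[i:i + block_size] for r in page]
        for page in inner_pages:  # one scan of the inner per outer block
            for s in page:
                for r in block:
                    if theta(r, s):
                        result.append((r, s))
    return result

R = [[(1, 'a'), (2, 'b')], [(3, 'c')]]   # 2 pages
S = [[(2, 'x')], [(3, 'y'), (4, 'z')]]   # 2 pages
out = block_nl_join(R, S, B=3, theta=lambda r, s: r[0] == s[0])
print(out)  # [((2, 'b'), (2, 'x')), ((3, 'c'), (3, 'y'))]
```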
Cost of Sort-Merge Join

(Diagram, sorting phase: R and S are each read, sorted in memory using
B-2 buffers, and written out as sorted runs; the runs are then merged
through B-1 input buffers and an output buffer into sorted files.
Merging phase: the sorted files are read through B-1 input buffers and
merged, producing the join output.)

- Cost: Sort R + Sort S + ([R]+[S])
  - But in the worst case, the last term could be [R]*[S] (very unlikely!)
  - Q: what is the worst case?
- Suppose buffer B > sqrt(max([R], [S])):
  - Both R and S can be sorted in 2 passes
  - Total join cost = 4*1000 + 4*500 + (1000 + 500) = 7,500
- Comparisons:
  - Indexed Nested Loop: 80,500
  - Block Nested Loop: 5,500
  - Grace Hash Join: 4,500

[R]=1000, pR=100, |R| = 100,000
[S]=500, pS=80, |S| = 40,000
Sort-Merge Join Considerations

(Same sorting-phase/merging-phase diagram as the previous slide.)

- An important refinement (needs 2x buffers):
  - Do the join during the final merging pass of the sort:
    1. Read R and write out sorted runs (pass 0)
    2. Read S and write out sorted runs (pass 0)
    3. Merge R-runs and S-runs, while finding R ⋈ S matches
  - Cost = 3*[R] + 3*[S]
- Sort-merge join is an especially good choice if:
  - one or both inputs are already sorted on the join attribute(s)
  - output is required to be sorted on the join attribute(s)
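The merge-and-match step of sort-merge join can be sketched as follows (a minimal sketch; it assumes each input has already been merged into a single sorted list, and the tuple data is illustrative):

```python
# Sketch: final merge of sort-merge join over inputs sorted on the
# join key. Equal-key groups produce a cross product of matches.

def sort_merge_join(R, S, key=lambda t: t[0]):
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1
        elif key(R[i]) > key(S[j]):
            j += 1
        else:
            k = key(R[i])
            j0 = j  # remember the start of the matching S group
            while i < len(R) and key(R[i]) == k:
                j = j0  # rewind S to the group start for each R tuple
                while j < len(S) and key(S[j]) == k:
                    out.append((R[i], S[j]))
                    j += 1
                i += 1
    return out

R = [(1, 'r1'), (2, 'r2'), (2, 'r3'), (5, 'r4')]
S = [(2, 's1'), (2, 's2'), (3, 's3')]
print(sort_merge_join(R, S))  # four pairs: r2/r3 each match s1/s2
```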
More Alternative Plans (No Indexes)

- Sort-Merge Join plan: scan Reserves applying σ_bid=100 and write to
  temp T1; scan Sailors applying σ_rating>5 and write to temp T2;
  sort-merge join T1 and T2 on sid=sid; π_sname on-the-fly.
- With 5 buffers, cost of plan:
  - Scan Reserves (1000) + write temp T1 (10 pages) = 1010.
  - Scan Sailors (500) + write temp T2 (250 pages) = 750.
  - Sort T1 (2*2*10) + sort T2 (2*4*250) + merge (10+250) = 2300.
  - Total: 4060 page I/Os.
- If we use Block NL join instead, join = 10 + 4*250, total cost = 2770.
- Can also 'push' projections, but must be careful!
  - T1 has only sid; T2 has only sid, sname.
  - T1 fits in 3 pages, cost of Block NL join under 250 pages, total < 2000.
Index Nested Loops Join

foreach tuple r in R do
  foreach tuple s in S where ri == sj do
    add <r, s> to result

Cost = [R] + ([R] * pR) * (cost to find matching S tuples)

- If the index uses Alt. 1 → cost to traverse the tree from root to
  leaf (e.g., 2-4 I/Os).
- For Alt. 2 or 3:
  1. Cost to look up RID(s); typically 2-4 I/Os for a B+-tree.
  2. Cost to retrieve records from RID(s); depends on clustering:
     - Clustered index: 1 I/O per page of matching S tuples.
     - Unclustered: up to 1 I/O per matching S tuple (w/o sorting).
Index Nested Loops Join

foreach tuple r in R do
  foreach tuple s in S where ri == sj do
    add <r, s> to result

Cost = [R] + ([R] * pR) * (cost to find matching S tuples)

- Assume Alternative 2 (by reference), unclustered:
  - 1 matching tuple (e.g., primary key lookup)
  - only leaves and data entries not in cache → 2 I/Os per lookup
- Cost:
  - R ⋈ S: 1000 + (1,000*100)*2 = 201,000
  - S ⋈ R: 500 + (500*80)*2 = 80,500 → ~13 minutes
  - Block Nested Loop: 5,500 → 55 seconds…

[R]=1,000, pR=100, |R| = 100,000
[S]=500, pS=80, |S| = 40,000
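The cost comparison above follows directly from the formula (a minimal sketch; the 2-I/Os-per-probe assumption is the slide's, the function name is illustrative):

```python
# Sketch: I/O cost of index nested loops join under Alternative 2,
# unclustered, one matching tuple, 2 I/Os per index probe.

def inl_cost(outer_pages: int, tuples_per_page: int,
             io_per_lookup: int = 2) -> int:
    # Scan the outer once, then probe the index for every outer tuple.
    return outer_pages + outer_pages * tuples_per_page * io_per_lookup

print(inl_cost(1000, 100))  # R join S: 1000 + 100,000*2 = 201000
print(inl_cost(500, 80))    # S join R:  500 +  40,000*2 =  80500
```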
More Alt Plans: Indexes

Plan: use the B+-tree on bid to apply σ_bid=100 to Reserves (do not
write to temp); Index Nested Loops join with Sailors on sid=sid, with
pipelining; then σ_rating>5 and π_sname on-the-fly.

- With a clustered index on bid of Reserves, we access:
  - 100,000/100 = 1,000 tuples on 1000/100 = 10 pages.
- INL with outer not materialized.
  - Projecting out unnecessary fields from the outer doesn't make an
    I/O difference.
- Join column sid is a key for Sailors.
  - At most one matching tuple, so an unclustered index on sid is OK.
- Decision not to push rating>5 before the join is based on
  availability of the sid index on Sailors.
- Cost: selection of Reserves tuples (10 I/Os); then, for each, must
  get the matching Sailors tuple (1000*1.2); total 1210 I/Os.
Summing up
• There are lots of plans
– Even for a relatively simple query
• Engineers often think they can pick good ones
– MapReduce API was based on that assumption
• Not so clear that’s true!
– Manual query planning can be tedious, technical
• Witness the rise of Hive, SparkSQL, despite immature optimizers
– Machines are better at enumerating options than people
– But we will see soon how optimizers make simplifying
assumptions
What is needed for optimization?
• Given: A closed set of operators
– Relational ops (table in, table out)
– Encapsulation (e.g. based on iterators)
1. Plan space
– Based on relational equivalences, different implementations
2. Cost Estimation based on
– Cost formulas
– Size estimation, in turn based on
• Catalog information on base tables
• Selectivity (Reduction Factor) estimation
3. A search algorithm
– To sift through the plan space and find lowest cost option!
Query Optimization
• We’ll focus on “System R” (“Selinger”) optimizers
• “Cascades” optimizer is the other common one
– Later, with notable differences,
but similar big picture
Highlights of System R Optimizer

Works well for 10-15 joins.

1. Plan Space: Too large, must be pruned.
   - Many plans share common, "overpriced" subtrees → ignore them all!
   - In some implementations, only left-deep plans are considered
   - In some implementations, Cartesian products are avoided
2. Cost estimation
   - Very inexact, but works OK in practice.
   - Stats in system catalogs used to estimate sizes & costs
   - Considers combination of CPU and I/O costs.
   - System R's scheme has been improved since that time.
3. Search Algorithm: Dynamic Programming
Query Optimization

1. Plan Space
2. Cost Estimation
3. Search Algorithm
Query Blocks: Units of Optimization

- Break query into query blocks
- Optimize one block at a time
- Uncorrelated nested blocks computed once
- Correlated nested blocks like function calls
  - But sometimes can be "decorrelated"
  - Beyond the scope of CS150!

SELECT S.sname            -- outer block
FROM Sailors S
WHERE S.age IN
   (SELECT MAX (S2.age)   -- nested block
    FROM Sailors S2
    GROUP BY S2.rating)

For each block, the plans considered are:
- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (e.g., ((A ⋈ B) ⋈ C) ⋈ D):
  - right branch always a base table
  - consider all join orders and join methods
Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)

- Reserves:
  - Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
    100 distinct bids.
- Sailors:
  - Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
    10 ratings, 40,000 sids.
Translating SQL to Relational Algebra
SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”
GROUP BY S.sid
HAVING COUNT (*) >= 2
For each sailor with at least two reservations for red boats,
find the sailor id and the earliest date
on which the sailor has a reservation for a red boat.
Translating SQL to "Relational Algebra"

SELECT S.sid, MIN (R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = "red"
GROUP BY S.sid
HAVING COUNT (*) >= 2

π_S.sid, MIN(R.day)(
  HAVING_COUNT(*)>=2(
    GROUP BY_S.sid(
      σ_B.color="red"(
        Sailors ⋈ Reserves ⋈ Boats))))
Recall R. A. Equivalences

- Allow us to choose different join orders and to "push" selections and
  projections ahead of joins.
- Selections:
  - σ_c1∧…∧cn(R) ≡ σ_c1(…(σ_cn(R))…)  (cascade)
  - σ_c1(σ_c2(R)) ≡ σ_c2(σ_c1(R))  (commute)
- Projections:
  - π_a1(R) ≡ π_a1(…(π_a1,…,an-1(R))…)  (cascade)
- Cartesian Product:
  - R × (S × T) ≡ (R × S) × T  (associative)
  - R × S ≡ S × R  (commutative)
  - This means we can do joins in any order.
    - But… beware of Cartesian product!
    - (R ⋈ S) ⋈ T ≡ (R × T) ⋈ S — the right-hand side introduces a
      cross product if R and T share no join predicate.
More R. A. Equivalences

- Eager projection
  - Can cascade and "push" (some) projections thru selection
  - Can cascade and "push" (some) projections below one side of a join
  - Rule of thumb: can project anything not needed "downstream"
  - π_a1(σ_c1(R)) ≡ π_a1(σ_c1(π_a1,c1(R)))
- Selection on a cross-product is equivalent to a join.
  - If the selection is comparing attributes from each side
  - E.g. σ_R.a=S.b(R × S) ≡ R ⋈_R.a=S.b S
- A selection only on attributes of R commutes with R ⋈ S.
  - i.e., σ(R ⋈ S) ≡ σ(R) ⋈ S
  - but only if the selection doesn't refer to S!
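The last equivalence can be checked mechanically on tiny in-memory relations (a minimal sketch; relations are sets of tuples, and the example schema is illustrative):

```python
# Sketch: verifying that a selection touching only R's attributes
# commutes with R join S, i.e. sigma(R join S) == sigma(R) join S.

def join(R, S, pred):
    return {(r, s) for r in R for s in S if pred(r, s)}

def select(rel, pred):
    return {t for t in rel if pred(t)}

R = {(1, 10), (2, 20), (3, 30)}   # tuples (a, b)
S = {(1, 'x'), (3, 'y')}          # tuples (a, c)
on_a = lambda r, s: r[0] == s[0]
r_only = lambda r: r[1] > 10      # refers only to R's attributes

lhs = select(join(R, S, on_a), lambda t: r_only(t[0]))  # sigma(R join S)
rhs = join(select(R, r_only), S, on_a)                  # sigma(R) join S
print(lhs == rhs)  # True
```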
“Physical” Equivalences
• Certain “physical” operators not in the Algebra
can affect plan cost without affecting answers
– E.g. Index scan (iteration through result is sorted)
– E.g. Sort (iteration through result is sorted)
– E.g. Hash (iteration through result is grouped)
• Some are tradeoffs
– E.g. Index scan vs. Heap scan
– E.g. Sort-Merge join vs. Index NL join
• Others are optional cost/benefit add-ons
– E.g. Insert a Sort iterator into a plan
Queries Over Multiple Relations

- A System R heuristic: only left-deep join trees are considered.
  - Restricts the search space
  - Left-deep trees allow us to generate all fully pipelined plans.
    - Intermediate results not written to temporary files.
    - Not all left-deep trees are fully pipelined (e.g., SM join).

(Figure: example join trees over relations A, B, C, D — the left-deep
tree (((A ⋈ B) ⋈ C) ⋈ D) versus non-left-deep alternatives.)
Query Optimization
1. Plan Space
2. Cost Estimation
3. Search Algorithm
Cost Estimation

- For each plan considered, must estimate total cost:
  - Must estimate cost of each operation in plan tree.
    - Depends on input cardinalities.
    - We've already discussed this for various operators (sequential
      scan, index scan, joins, etc.)
  - Must estimate size of result for each operation in tree!
    - Because it determines downstream input cardinalities!
    - Use information about the input relations.
    - For selections and joins, assume independence of predicates.
  - In System R, cost is boiled down to a single number consisting of
    #I/O + CPU-factor * #tuples
- Q: Is "cost" the same as estimated "run time"?
Statistics and Catalogs

- Need info on relations and indexes involved.
- Catalogs typically contain at least:

  Statistic | Meaning
  ----------|-------------------------------------
  NTuples   | # of tuples in a table (cardinality)
  NPages    | # of disk pages in a table
  Low/High  | min/max value in a column
  NKeys     | # of distinct values in a column
  IHeight   | the height of an index
  INPages   | # of disk pages in an index

- Catalogs updated periodically.
  - Too expensive to do continuously
  - Lots of approximation anyway, so a little slop here is ok.
- Modern systems do more
  - Esp. keep more detailed statistical information on data values
    (e.g., histograms)
Size Estimation and Selectivity

SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk

- Max output cardinality = product of input cardinalities
- Selectivity (sel) associated with each term
  - reflects the impact of the term in reducing result size:
    sel = |output| / |input|
- Result cardinality = Max # tuples * ∏ sel_i
  - Book calls selectivity "Reduction Factor" (RF)
- Avoid confusion: "highly selective" in common English is the opposite
  of a high selectivity value (where |output|/|input| is high!)
Result Size Estimation

- Result cardinality = Max # tuples * product of all RFs.
- Term col=value (given NKeys(I) on col):
  RF = 1/NKeys(I)
- Term col1=col2 (handy for joins too…):
  RF = 1/MAX(NKeys(I1), NKeys(I2))
  - Why MAX? See bunnies on next slide…
- Term col>value:
  RF = (High(I) - value) / (High(I) - Low(I) + 1)
- Implicit assumptions: values are uniformly distributed and terms are
  independent!
- Note: if missing the needed stats, assume 1/10!
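The reduction factors above can be written directly as small helpers (a minimal sketch; the function names are illustrative, and the example numbers come from the Reserves/Sailors schema earlier in the notes):

```python
# Sketch: System R-style reduction factors from catalog statistics
# (NKeys, High, Low), with the 1/10 fallback when stats are missing.

DEFAULT_RF = 1 / 10  # assumed when the needed stats are unavailable

def rf_eq_value(nkeys):                 # term: col = value
    return 1 / nkeys if nkeys else DEFAULT_RF

def rf_eq_cols(nkeys1, nkeys2):         # term: col1 = col2
    if not (nkeys1 and nkeys2):
        return DEFAULT_RF
    return 1 / max(nkeys1, nkeys2)

def rf_gt_value(low, high, value):      # term: col > value
    return (high - value) / (high - low + 1)

# Reserves.bid = 100, with NKeys(bid) = 100 distinct boats:
print(rf_eq_value(100))       # 0.01
# Sailors.rating > 5, ratings uniform over [1, 10]:
print(rf_gt_value(1, 10, 5))  # (10-5)/(10-1+1) = 0.5
```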