
IBM Research
Consistently Estimating the Selectivity of
Conjuncts of Predicates
Volker Markl, Nimrod Megiddo, Marcel Kutsch,
Tam Minh Tran, Peter Haas, Utkarsh Srivastava
Non-Confidential | 7-12-2005 | Volker Markl
© 2005 IBM Corporation
Agenda
 Consistency and Bias Problems in Cardinality Estimation
 The Maximum Entropy Solution
 Iterative Scaling
 Performance Analysis
 Related Work
 Conclusions
Consistently Estimating the Selectivity of Conjuncts of Predicates | Volker Markl | Non-Confidential
What is the problem?

Consider the following three attributes: Make, Model, Color.

[Diagram: Venn diagram of the three attributes; the overlapping regions mark their correlation.]
How to estimate the cardinality of the predicate …

… Make = 'Mazda' AND Model = '323' AND Color = 'red' ?

[Diagram: the three predicates as overlapping sets, each labeled with attribute, value, and cardinality:
 Make = 'Mazda': 100,000; Model = '323': 200,000; Color = 'red': 200,000]

(real cardinality of the conjunction: 49,000)
Without any additional knowledge

Base cardinality: 1,000,000
(Make = 'Mazda': 100,000; Model = '323': 200,000; Color = 'red': 200,000)

Legend: s(P) denotes the selectivity of predicate P

Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )

• Independence assumption:
  s( Make = 'Mazda' ) * s( Model = '323' ) * s( Color = 'red' )
   = (100,000 / 1,000,000) * (200,000 / 1,000,000) * (200,000 / 1,000,000) = 0.004

Estimated cardinality: 0.004 * 1,000,000 = 4,000
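In code, the independence assumption amounts to multiplying the per-predicate selectivities. A small illustrative sketch (`independence_estimate` is a made-up helper for this deck, not optimizer code):

```python
def independence_estimate(base_card, *pred_cards):
    """Selectivity and cardinality of a conjunction under independence."""
    sel = 1.0
    for c in pred_cards:
        sel *= c / base_card      # s(pred) = card(pred) / base cardinality
    return sel, sel * base_card

# Make = 'Mazda' (100,000), Model = '323' (200,000), Color = 'red' (200,000)
sel, card = independence_estimate(1_000_000, 100_000, 200_000, 200_000)
print(round(sel, 6), round(card))   # 0.004 4000 -- far below the real 49,000
```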
Additional knowledge given (1):

Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )

Additional knowledge:
• Make AND Model: card('Mazda' AND '323') = 50,000

case 1: s( Make AND Model ) * s( Color ) =
  (50,000 / 1,000,000) * (200,000 / 1,000,000) = 0.01 → estimated card: 10,000

Legend: a conjunct "Pred X AND Pred Y" is annotated with its cardinality
Additional knowledge given (2):

Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )

Additional knowledge:
• Make AND Model: card('Mazda' AND '323') = 50,000
• Make AND Color: card('Mazda' AND 'red') = 90,000

case 1: s( Make AND Model ) * s( Color ) = 0.01 → estimated card: 10,000
case 2: s( Make AND Color ) * s( Model ) =
  (90,000 / 1,000,000) * (200,000 / 1,000,000) = 0.018 → estimated card: 18,000
Additional knowledge given (3):

Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )

Additional knowledge:
• Make AND Model: card('Mazda' AND '323') = 50,000
• Make AND Color: card('Mazda' AND 'red') = 90,000
• Model AND Color: card('323' AND 'red') = 150,000

case 1: s( Make AND Model ) * s( Color ) = 0.01 → estimated card: 10,000
case 2: s( Make AND Color ) * s( Model ) = 0.018 → estimated card: 18,000
case 3: s( Model AND Color ) * s( Make ) =
  (150,000 / 1,000,000) * (100,000 / 1,000,000) = 0.015 → estimated card: 15,000
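The three cases can be reproduced in a few lines, an illustrative sketch using the numbers from these slides:

```python
base = 1_000_000
s = {name: card / base for name, card in {
    'make': 100_000, 'model': 200_000, 'color': 200_000,
    'make&model': 50_000, 'make&color': 90_000, 'model&color': 150_000,
}.items()}

estimates = {
    'case 1': s['make&model'] * s['color'],   # 0.010 -> 10,000 rows
    'case 2': s['make&color'] * s['model'],   # 0.018 -> 18,000 rows
    'case 3': s['model&color'] * s['make'],   # 0.015 -> 15,000 rows
}
# Three different answers for the same conjunction, depending on which
# multivariate statistic the optimizer happens to apply first.
for name, sel in estimates.items():
    print(name, round(sel * base))
```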
Why is this a problem?

case 0: s( Make ) * s( Model ) * s( Color ) = 0.004 → estimated card: 4,000
case 2: s( Make AND Color ) * s( Model ) = 0.018 → estimated card: 18,000
case 3: s( Model AND Color ) * s( Make ) = 0.015 → estimated card: 15,000

Cardinality bias: "fleeing from knowledge to ignorance" — the optimizer prefers
the plan whose estimate ignores the most knowledge.

[Diagram: three plans for the same query, with different estimates for the same
 intermediate result: an index intersection of Make, Model, and Color (est. 4,000);
 an index scan on (Model, Color) returning 150,000 rows followed by a FETCH
 applying Make (est. 15,000); and an index scan on (Make, Color) returning
 90,000 rows followed by a FETCH applying Model (est. 18,000).]
What has happened?

• Inconsistent model:
  – different estimates for the same intermediate result,
  – due to multivariate statistics with overlapping information
• Bias during plan selection:
  – results in the selection of sub-optimal plans
• Bias avoidance means keeping the model consistent
  – State of the art is to do bookkeeping of the first multivariate statistic used, and to ignore
    further overlapping multivariate statistics
  – This does not solve the problem, as ignoring knowledge also introduces bias
  – The bias is arbitrary: it depends on which statistics are used first during optimization
• The only viable solution is to exploit all knowledge consistently
Problem: Only partial knowledge of the DNF atoms

[Diagram: Venn diagram of Mazda, 323, red; each of the 8 regions is one DNF atom,
 e.g. Mazda & 323 & ¬red, Mazda & ¬323 & red, …, ¬Mazda & ¬323 & ¬red.]

Additional knowledge:
• Make AND Model: p('Mazda' AND '323') = 50,000
• Make AND Color: p('Mazda' AND 'red') = 90,000
• Model AND Color: p('323' AND 'red') = 150,000

Legend:
 DNF = disjunctive normal form
 ¬X denotes "not X"
How to compute the missing values of the distribution?

Probability( Make = 'Mazda' AND Model = '323' AND Color = 'red' )

[Diagram: the same Venn diagram of DNF atoms; the sought probability is the
 center atom Mazda & 323 & red, which is not covered by the known statistics.]
Solution: Information Entropy H( X ) = -∑ x_i log( x_i )

• Entropy is a measure of the "uninformedness" of a probability distribution
  X = (x_1, …, x_m) with x_1 + … + x_m = 1
• Maximizing information entropy
  – for the unknown selectivities,
  – using the known selectivities as constraints,
  avoids bias
• The less is known about a probability distribution, the larger its entropy:
  – Nothing known → uniformity: s(X = ?) = 1/m
  – Marginals known → independence: s(X = ? AND Y = ?) = s(X = ?) * s(Y = ?)
• Thus the principle of maximum entropy generalizes the uniformity and
  independence assumptions used in today's query optimizers
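A quick numeric check of the claim that less knowledge means larger entropy (an illustrative sketch, not part of the talk):

```python
import math

def entropy(dist):
    """H(X) = -sum x_i * log(x_i), skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in dist if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # knowing nothing -> uniform distribution
skewed  = [0.70, 0.10, 0.10, 0.10]   # extra knowledge skews the distribution
print(entropy(uniform) > entropy(skewed))   # True: uniformity maximizes entropy
```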
Entropy Maximization for Cardinality Estimation

• Given some selectivities (single and conjunctive)
  over a space of n predicates p1, …, pn:
• choose a model that is consistent with this knowledge,
  but otherwise as uniform as possible
• i.e., maximize the entropy of the probability distribution X = (x_b | b ∈ {0,1}^n):

   max H( X ) = - ∑_{b ∈ {0,1}^n} x_b log x_b

• x_b is the selectivity of the DNF atom ⋀_{i ∈ {1,…,n}} p_i^{b_i}
  – b_i = 0 means that predicate p_i is negated in the DNF atom
  – b_i = 1 means that predicate p_i occurs as a positive term in the DNF atom

Legend:
 {0,1}^n denotes the n-fold cross product of the set {0,1}, i.e., {0,1} × … × {0,1} (n times)
 For a predicate p: p^1 = p, p^0 = not p
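Written out in full, the optimization problem is (a compact restatement using the slide's symbols):

```latex
\max_{X} \; H(X) = -\sum_{b \in \{0,1\}^n} x_b \log x_b
\qquad \text{subject to} \qquad
\sum_{\substack{b \in \{0,1\}^n \\ b_i = 1 \;\forall i \in Y}} x_b = s_Y
\quad \text{for every known selectivity } s_Y,\; Y \subseteq \{1,\dots,n\}.
```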
Maximum Entropy Principle – Example:

Knowledge s_Y, Y ∈ T:
 s1 = s(Mazda) = 0.1
 s2 = s(323) = 0.2
 s3 = s(red) = 0.2
 s1,2 = s(Mazda & 323) = 0.05
 s1,3 = s(Mazda & red) = 0.09
 s2,3 = s(red & 323) = 0.15
 T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅}

[Diagram: Venn diagram of Mazda, 323, red; each region is labeled with its
 atom index b ∈ {000, 001, …, 111}.]

Constraint for s1 (sum over all atoms in which Mazda occurs positively):
 s1 = x100 + x101 + x110 + x111
Maximum Entropy Principle – Example:

Knowledge s_Y, Y ∈ T:
 s1 = s(Mazda) = 0.1
 s2 = s(323) = 0.2
 s3 = s(red) = 0.2
 s1,2 = s(Mazda & 323) = 0.05
 s1,3 = s(Mazda & red) = 0.09
 s2,3 = s(red & 323) = 0.15
 T = {{1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅}

Constraints:
 0.10 = s1 = x100 + x101 + x110 + x111
 0.20 = s2 = x010 + x011 + x110 + x111
 0.20 = s3 = x001 + x011 + x101 + x111
 0.05 = s1,2 = x110 + x111
 0.09 = s1,3 = x101 + x111
 0.15 = s2,3 = x011 + x111
 1.00 = s∅ = x000 + x001 + x010 + x011 + x100 + x101 + x110 + x111

Objective function:
 max H( X ) = - ∑_{b ∈ {0,1}³} x_b log x_b
Solving the Constrained Optimization Problem

• Minimize the objective function:

   ∑_{b ∈ {0,1}^n} x_b log x_b

• satisfying the |T| ≤ |2^{1,…,n}| constraints:

   for all Y ∈ T:  ∑_{b ∈ C(Y)} x_b = s_Y

• General solution: Iterative Scaling

[Diagram: the 8 DNF atoms p1^{b1} ∧ p2^{b2} ∧ p3^{b3}, b ∈ {0,1}³.]

Legend:
 2^{1,…,n} denotes the powerset of {1,…,n}
 C(Y) denotes all DNF atoms that contribute to Y, i.e., formally,
  C(Y) := {b ∈ {0,1}^n | ∀ i ∈ Y : b_i = 1} and C(∅) := {0,1}^n
Maximum Entropy and Lagrange Multipliers

• The objective function ∑_{b ∈ {0,1}^n} x_b log x_b is convex.
• We can build a Lagrangian function by associating a multiplier λ_Y
  with each constraint ∑_{b ∈ C(Y)} x_b = s_Y and subtracting the constraints
  from the objective function:

   L( X, λ ) = ∑_{b ∈ {0,1}^n} x_b log x_b - ∑_{Y ∈ T} λ_Y ( ∑_{b ∈ C(Y)} x_b - s_Y )

• Differentiating with respect to x_b and equating to zero yields the conditions for
  the minimum:

   for each b ∈ {0,1}^n:  ∂L/∂x_b = ln x_b + 1 - ∑_{Y ∈ P(b,T)} λ_Y = 0

• Exponentiating the Lagrange multipliers in the derivatives yields a product form:

   z_Y = e^{λ_Y}  →  x_b = (1/e) ∏_{Y ∈ P(b,T)} z_Y

• Replacing x_b in each constraint yields a condition in the exponentiated
  Lagrange multipliers z_Y:

   ∑_{b ∈ C(Y)} ∏_{W ∈ P(b,T)} z_W = s_Y · e  for each Y ∈ T

Legend:
 P(b,T) ⊆ T denotes the indexes Y of all known selectivities s_Y to which
 DNF atom b contributes its value x_b:
  P(b,T) = {Y ∈ T | ∀ i ∈ Y : b_i = 1} ∪ {∅}
Iterative Scaling

• We can now isolate z_Y for a particular Y ∈ T:

   z_Y = s_Y · e / ( ∑_{b ∈ C(Y)} ∏_{W ∈ P(b,T)\{Y}} z_W )

• and thus iteratively compute z_Y from all z_W, W ∈ T\{Y}
• This algorithm is called iterative scaling (Darroch and Ratcliff, 1972); it
  converges to a stable set of exponentiated Lagrange multipliers z_Y, Y ∈ T
• This stable point minimizes the objective function and satisfies all
  constraints
• We can compute all DNF atoms x_b from these stable multipliers using

   x_b = (1/e) ∏_{Y ∈ P(b,T)} z_Y

• and can in turn compute all missing selectivities s_Y = ∑_{b ∈ C(Y)} x_b
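The update rule above is directly implementable. The following minimal Python sketch (an assumption of this write-up, not the paper's implementation: constraint sets Y are encoded as frozensets of 0-based predicate indexes, atoms b as 0/1 tuples) reproduces the running example:

```python
import math
from itertools import product

def iterative_scaling(n, constraints, tol=1e-10, max_iters=10_000):
    """constraints maps frozenset(Y) -> s_Y and must contain frozenset() -> 1.0."""
    atoms = list(product((0, 1), repeat=n))

    def contributes(b, Y):          # b is in C(Y)  <=>  b_i = 1 for all i in Y
        return all(b[i] == 1 for i in Y)

    z = {Y: 1.0 for Y in constraints}
    for _ in range(max_iters):
        delta = 0.0
        for Y, s_Y in constraints.items():
            # z_Y = s_Y * e / sum over b in C(Y) of the product of the other multipliers
            denom = sum(
                math.prod(z[W] for W in constraints if W != Y and contributes(b, W))
                for b in atoms if contributes(b, Y)
            )
            new_z = s_Y * math.e / denom
            delta = max(delta, abs(new_z - z[Y]))
            z[Y] = new_z
        if delta < tol:             # all multipliers stable -> converged
            break

    # x_b = (1/e) * product of z_Y over all constraints that b contributes to
    return {b: math.prod(z[W] for W in constraints if contributes(b, W)) / math.e
            for b in atoms}

# The example from these slides (predicates: 0 = Mazda, 1 = 323, 2 = red):
x = iterative_scaling(3, {
    frozenset():       1.00,   # s_empty
    frozenset({0}):    0.10,   # s(Mazda)
    frozenset({1}):    0.20,   # s(323)
    frozenset({2}):    0.20,   # s(red)
    frozenset({0, 1}): 0.05,   # s(Mazda & 323)
    frozenset({0, 2}): 0.09,   # s(Mazda & red)
    frozenset({1, 2}): 0.15,   # s(323 & red)
})
print(round(x[(1, 1, 1)], 6))   # s_{1,2,3}, approx. 0.049918
```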
Maximum Entropy Solution of the Example

Knowledge:
 s(Mazda) = s1 = 0.1
 s(323) = s2 = 0.2
 s(red) = s3 = 0.2
 s(Mazda & 323) = s1,2 = 0.05
 s(Mazda & red) = s1,3 = 0.09
 s(red & 323) = s2,3 = 0.15

Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
 s1,2,3 = x111 = ???

[Diagram: Venn diagram with the atoms 000–111.]
Iterative Scaling – 1st Iteration, update z1

Knowledge: s1 = 0.1, s2 = 0.2, s3 = 0.2, s1,2 = 0.05, s1,3 = 0.09, s2,3 = 0.15, s∅ = 1

 z1 = s1 · e / ( ∑_{b ∈ C({1})} ∏_{W ∈ P(b,T)\{{1}}} z_W )

Multipliers after the update:
 z1 = 0.067957, z2 = 1, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1

Resulting selectivity estimates (the constraint for s1 is now satisfied):
 s1 = 0.1, s2 = 0.785759, s1,2 = 0.05, s3 = 0.785759, s1,3 = 0.05, s2,3 = 0.392879, s∅ = 1.571518

[Diagram: the update touches the atoms x100, x101, x110, x111.]
Iterative Scaling – 1st Iteration, update z2

 z2 = s2 · e / ( ∑_{b ∈ C({2})} ∏_{W ∈ P(b,T)\{{2}}} z_W )

Multipliers after the update:
 z1 = 0.067957, z2 = 0.254531, z1,2 = 1, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1

Resulting selectivity estimates (the constraint for s2 is now satisfied):
 s1 = 0.062727, s2 = 0.2, s1,2 = 0.012727, s3 = 0.492879, s1,3 = 0.031363, s2,3 = 0.1, s∅ = 0.985759

[Diagram: the update touches the atoms x010, x011, x110, x111.]
Iterative Scaling – 1st Iteration, update z1,2

 z1,2 = s1,2 · e / ( ∑_{b ∈ C({1,2})} ∏_{W ∈ P(b,T)\{{1,2}}} z_W )

Multipliers after the update:
 z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 1, z1,3 = 1, z2,3 = 1, z∅ = 1

Resulting selectivity estimates (the constraint for s1,2 is now satisfied):
 s1 = 0.1, s2 = 0.237273, s1,2 = 0.05, s3 = 0.511516, s1,3 = 0.05, s2,3 = 0.118637, s∅ = 1.023032

[Diagram: the update touches the atoms x110, x111.]
Iterative Scaling – 1st Iteration, update z3

 z3 = s3 · e / ( ∑_{b ∈ C({3})} ∏_{W ∈ P(b,T)\{{3}}} z_W )

Multipliers after the update:
 z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 1, z2,3 = 1, z∅ = 1

Resulting selectivity estimates (the constraint for s3 is now satisfied):
 s1 = 0.069550, s2 = 0.165023, s1,2 = 0.034775, s3 = 0.2, s1,3 = 0.019550, s2,3 = 0.046386, s∅ = 0.711516

[Diagram: the update touches the atoms x001, x011, x101, x111.]
Iterative Scaling – 1st Iteration, update z1,3

 z1,3 = s1,3 · e / ( ∑_{b ∈ C({1,3})} ∏_{W ∈ P(b,T)\{{1,3}}} z_W )

Multipliers after the update:
 z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1, z∅ = 1

Resulting selectivity estimates (the constraint for s1,3 is now satisfied):
 s1 = 0.14, s2 = 0.200248, s1,2 = 0.07, s3 = 0.27045, s1,3 = 0.09, s2,3 = 0.081611, s∅ = 0.781966

[Diagram: the update touches the atoms x101, x111.]
Iterative Scaling – 1st Iteration, update z2,3

 z2,3 = s2,3 · e / ( ∑_{b ∈ C({2,3})} ∏_{W ∈ P(b,T)\{{2,3}}} z_W )

Multipliers after the update:
 z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, z∅ = 1

Resulting selectivity estimates (the constraint for s2,3 is now satisfied):
 s1 = 0.177709, s2 = 0.268637, s1,2 = 0.107709, s3 = 0.338839, s1,3 = 0.127709, s2,3 = 0.15, s∅ = 0.850355

[Diagram: the update touches the atoms x011, x111.]
Iterative Scaling – 1st Iteration, update z∅

 z∅ = s∅ · e / ( ∑_{b ∈ C(∅)} ∏_{W ∈ P(b,T)\{∅}} z_W )

Multipliers after the update:
 z1 = 0.067957, z2 = 0.254531, z1,2 = 3.928794, z3 = 0.390994, z1,3 = 4.603645, z2,3 = 1.837978, z∅ = 1.175979

Resulting selectivity estimates (the constraint for s∅ is now satisfied):
 s1 = 0.208982, s2 = 0.315911, s1,2 = 0.126664, s3 = 0.398468, s1,3 = 0.150183, s2,3 = 0.176397, s∅ = 1

DNF atoms after the 1st iteration:
 x000 = 0.432619, x001 = 0.169152, x010 = 0.110115, x011 = 0.079133,
 x100 = 0.029399, x101 = 0.052919, x110 = 0.029399, x111 = 0.097264
Maximum Entropy Solution of the Example

Knowledge:
 s(Mazda) = s1 = 0.1
 s(323) = s2 = 0.2
 s(red) = s3 = 0.2
 s(Mazda & 323) = s1,2 = 0.05
 s(Mazda & red) = s1,3 = 0.09
 s(red & 323) = s2,3 = 0.15

Converged DNF atoms (iterations: 241):
 x000 = 0.740082, x001 = 0.009918, x010 = 0.049918, x011 = 0.100082,
 x100 = 0.009918, x101 = 0.040082, x110 = 0.000082, x111 = 0.049918

Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
 s1,2,3 = x111 = 0.049918
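As a quick sanity check, the converged atom probabilities can be verified against the constraints. The sketch below reads the atom values off this slide, keyed as b1 b2 b3 for Mazda/323/red (an assumed labeling):

```python
x = {   # maximum-entropy DNF atom probabilities (from this slide)
    '000': 0.740082, '001': 0.009918, '010': 0.049918, '011': 0.100082,
    '100': 0.009918, '101': 0.040082, '110': 0.000082, '111': 0.049918,
}
s1  = sum(v for b, v in x.items() if b[0] == '1')            # s(Mazda)
s12 = sum(v for b, v in x.items() if b[0] == b[1] == '1')    # s(Mazda & 323)
s23 = sum(v for b, v in x.items() if b[1] == b[2] == '1')    # s(323 & red)
print(round(s1, 6), round(s12, 6), round(s23, 6), round(sum(x.values()), 6))
```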
Let's compare:

Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )

Real:   s( Make AND Model AND Color ) = 0.049 → actual card: 49,000

case 0: s( Make ) * s( Model ) * s( Color ) = 0.004 → estimated card: 4,000 (error: ~10x)
case 1: s( Make AND Model ) * s( Color ) = 0.010 → estimated card: 10,000 (error: ~5x)
case 2: s( Make AND Color ) * s( Model ) = 0.018 → estimated card: 18,000 (error: ~2.5x)
case 3: s( Model AND Color ) * s( Make ) = 0.015 → estimated card: 15,000 (error: ~3x)

ME:     s( Make AND Model AND Color ) = 0.049918 → estimated card: 49,918 (almost no error)
Forward Estimation: Predicting s1,2,3, given …

[Chart: box plots (1st–4th quartile, median) of the absolute estimation error
 over 200 queries when predicting s1,2,3 from increasing knowledge: from the
 single selectivities s1, s2, s3 alone (median 788, 75th percentile 2138,
 maximum 9583), through one known pairwise selectivity (cases 2.1a–c) and two
 known pairwise selectivities (cases 2.2a–c), to all three pairs s1,2, s1,3,
 s2,3 (case 2.3). The medians fall from 788 through 79, 42, 65, 11, 9 and 6
 down to 3, and to 0 when s1,2,3 itself is known.]
Comparing DB2 and ME: Predicting s1,2,3, given …

[Chart: box plots (quartiles, median, mean) of the absolute estimation error
 over 200 queries, DB2's state-of-the-art bookkeeping (SOTA) vs. maximum
 entropy (ME):
  2.2a, given s1,3 and s2,3: DB2 median 44 vs. ME median 11;
  2.2b, given s1,2 and s1,3: DB2 44 vs. ME 9;
  2.2c, given s1,2 and s2,3: DB2 79 vs. ME 65;
  2.3, given s1,2, s1,3 and s2,3: DB2 43 vs. ME 6.]
Backward Estimation: Given s1,2,3, predicting …

[Chart: box plots (quartiles, median, mean) of the absolute estimation error
 over 200 queries, DB2 (SOTA) vs. ME, when predicting the pairwise
 selectivities s1,2 (MAKE = ? AND MODEL = ?), s1,3, and s2,3 from a known
 s1,2,3.]
Computation Cost

[Chart: time until convergence of iterative scaling as a function of the number
 of predicates |P| (5 to 18), with one curve per number of known selectivities
 |T| (1 to 10); the cost grows steeply with |P|.]
Related Work
 Selectivity Estimation
SAC+79 P.G. Selinger et al: Access Path Selection in a Relational DBMS. SIGMOD 1979
Chr83 S. Christodoulakis: Estimating record selectivities. Inf. Syst. 8(2): 105-115 (1983)
Lyn88 C. A. Lynch: Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB 1988: 240-251
PC84 G. Piatetsky-Shapiro, C. Connell: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD Conference 1984: 256-276
PIH+96 V. Poosala et al.: Improved histograms for selectivity estimation of range predicates. SIGMOD 1996
 Recommending, Constructing, and Maintaining Multivariate Statistics
AC99 A. Aboulnaga, S. Chaudhuri: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD 1999: 181-192
BCG01 N. Bruno, S. Chaudhuri, L. Gravano: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD 2001
BC02 N. Bruno and S. Chaudhuri: Exploiting Statistics on Query Expressions for Optimization. SIGMOD 2002
BC03 N. Bruno, S. Chaudhuri: Efficient Creation of Statistics over Query Expressions. ICDE 2003
BC04 N. Bruno, S. Chaudhuri: Conditional Selectivity for Statistics on Query Expressions. SIGMOD 2004: 311-322
SLM+01 M. Stillger, G. Lohman, V. Markl, and M. Kandil: LEO – DB2’s Learning Optimizer. VLDB 2001
IMH+04 I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, A. Aboulnaga: CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD 2004
CN00 S. Chaudhuri, V. Narasayya: Automating Statistics Management for Query Optimizers. ICDE 2000: 339-348
DGR01 A. Deshpande, M. Garofalakis, R. Rastogi: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD 2001
GJW+03 C. Galindo-Legaria, M. Joshi, F. Waas, et al: Statistics on Views. VLDB 2003: 952-962
GTK01 L. Getoor, B. Taskar, D. Koller: Selectivity Estimation using Probabilistic Models. SIGMOD 2001
PI97 V. Poosala and Y. Ioannidis: Selectivity Estimation without value independence. VLDB 1997
 Entropy and Maximum Entropy
Sha48 C. E. Shannon: A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October 1948
DR72 J.N. Darroch and D. Ratcliff: Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics (43), 1972:1470–1480.
GP00 W. Greiff, J. Ponte: The maximum-entropy approach and probabilistic IR models. ACM TIS. 18(3): 246-287, 2000
GS85 S. Guiasu and A. Shenitzer: The principle of maximum-entropy. The Mathematical Intelligencer, 7(1), 1985.
Conclusions

• Problem: inconsistent cardinality model and bias in today's query optimizers,
  due to overlapping multivariate statistics (multidimensional histograms, etc.)
  – To reduce bias, today's optimizers use only a consistent subset of the available multivariate statistics
    → cardinality estimates are suboptimal despite better information
    → bias towards plans without proper statistics ("fleeing from knowledge to ignorance")
• Solution: maximizing information entropy
  – Generalizes the concepts of uniformity and independence used in today's query optimizers
  – All statistics are utilized → cardinality estimates improve, some by orders of magnitude
  – The cardinality model is consistent → no bias towards particular plans
  – Consistent estimates are computed in subsecond time for up to 10 predicates per table;
    however, the algorithm is exponential in the number of predicates
• Not covered in this talk (see the paper):
  – Reducing algorithm complexity through pre-processing
  – Impact on query performance → speedups, sometimes by orders of magnitude
• Future work:
  – Extension to join estimates