IBM Research

Consistently Estimating the Selectivity of Conjuncts of Predicates
Volker Markl, Nimrod Megiddo, Marcel Kutsch, Tam Minh Tran, Peter Haas, Utkarsh Srivastava
Non-Confidential | 7-12-2005 | Volker Markl
© 2005 IBM Corporation

Agenda
- Consistency and Bias Problems in Cardinality Estimation
- The Maximum Entropy Solution
- Iterative Scaling
- Performance Analysis
- Related Work
- Conclusions

What is the problem?
Consider three correlated attributes: Make, Model, and Color.

How to estimate the cardinality of the predicate
  Make = 'Mazda' AND Model = '323' AND Color = 'red'?
Known cardinalities: card(Make = 'Mazda') = 100,000; card(Model = '323') = 200,000; card(Color = 'red') = 200,000.
(Real cardinality of the conjunct: 49,000.)

Without any additional knowledge
Base cardinality: 1,000,000. Legend: denote by s(?) the selectivity of ?.
Independence assumption:
  Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
  = s(Make = 'Mazda') * s(Model = '323') * s(Color = 'red')
  = (100,000 / 1,000,000) * (200,000 / 1,000,000) * (200,000 / 1,000,000)
  = 0.004
Estimated cardinality: 0.004 * 1,000,000 = 4,000
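The independence computation above can be reproduced in a few lines. A minimal sketch in Python, using the slide's numbers (variable names are mine):

```python
# Single-predicate selectivities from the example (base cardinality 1,000,000).
s_make  = 100_000 / 1_000_000   # s(Make  = 'Mazda')
s_model = 200_000 / 1_000_000   # s(Model = '323')
s_color = 200_000 / 1_000_000   # s(Color = 'red')

# Independence assumption: the conjunct's selectivity is the product.
s_conj = s_make * s_model * s_color
est_card = round(s_conj * 1_000_000)
print(round(s_conj, 6), est_card)   # 0.004 and 4000 -- far below the real 49,000
```

The estimate of 4,000 rows is more than an order of magnitude below the true 49,000 because the attributes are correlated.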
Additional knowledge given (1):
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
Additional knowledge: card('Mazda' AND '323') = 50,000
case 1: s( Make AND Model ) * s( Color )
  = (50,000 / 1,000,000) * (200,000 / 1,000,000) = 0.01     estimated card: 10,000

Additional knowledge given (2):
Additional knowledge: card('Mazda' AND '323') = 50,000; card('Mazda' AND 'red') = 90,000
case 1: s( Make AND Model ) * s( Color ) = 0.01      estimated card: 10,000
case 2: s( Make AND Color ) * s( Model )
  = (90,000 / 1,000,000) * (200,000 / 1,000,000) = 0.018    estimated card: 18,000

Additional knowledge given (3):
Additional knowledge: card('Mazda' AND '323') = 50,000; card('Mazda' AND 'red') = 90,000; card('323' AND 'red') = 150,000
case 1: s( Make AND Model ) * s( Color ) = 0.01      estimated card: 10,000
case 2: s( Make AND Color ) * s( Model ) = 0.018     estimated card: 18,000
case 3: s( Model AND Color ) * s( Make )
  = (150,000 / 1,000,000) * (100,000 / 1,000,000) = 0.015   estimated card: 15,000

Why is this a problem?
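The inconsistency is easy to reproduce: the three pairwise statistics yield three different selectivities for the same conjunct. A quick sketch (dictionary keys and names are mine):

```python
N = 1_000_000
s = {('make',): 0.1, ('model',): 0.2, ('color',): 0.2,
     ('make', 'model'): 0.05, ('make', 'color'): 0.09, ('model', 'color'): 0.15}

# Each case combines one pairwise statistic with independence for the third
# predicate -- three equally defensible estimates, three different answers.
cases = {
    'case 1': s[('make', 'model')] * s[('color',)],
    'case 2': s[('make', 'color')] * s[('model',)],
    'case 3': s[('model', 'color')] * s[('make',)],
}
for name, sel in cases.items():
    print(name, round(sel * N))     # 10000, 18000, 15000 respectively
```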
case 0: s( Make ) * s( Model ) * s( Color )  = 0.004    estimated card: 4,000
case 1: s( Make AND Model ) * s( Color )     = 0.010    estimated card: 10,000
case 2: s( Make AND Color ) * s( Model )     = 0.018    estimated card: 18,000
case 3: s( Model AND Color ) * s( Make )     = 0.015    estimated card: 15,000
Depending on which statistic the optimizer applies, the same intermediate result receives a different estimate. Plans that use fewer statistics look artificially cheap, so the optimizer ends up "fleeing from knowledge to ignorance."
[Plan diagram: an Index Intersect plan over index scans on (Model, Color) (150,000 rows) and (Make, Color) (90,000 rows), estimated at 15,000 rows, vs. FETCH plans on Make (estimated 18,000) and Model; the inconsistent estimates bias the choice.]

What has happened?
- Inconsistent model: different estimates for the same intermediate result, due to multivariate statistics with overlapping information
- Bias during plan selection results in the selection of sub-optimal plans
- Bias avoidance means keeping the model consistent
- State of the art is to do bookkeeping of the first multivariate statistic used, and to ignore further overlapping multivariate statistics
- This does not solve the problem, as ignoring knowledge also means bias
- The bias is arbitrary: it depends on which statistics happen to be used first during optimization
- The only real solution is to exploit all knowledge consistently

Problem: Only partial knowledge of the DNF atoms
The conjunct Make = 'Mazda' AND Model = '323' AND Color = 'red' decomposes into 2^3 DNF atoms (Mazda & 323 & red, Mazda & 323 & ¬red, ..., ¬Mazda & ¬323 & ¬red), but only some sums over these atoms are known.
Additional knowledge:
  Make AND Model:  p('Mazda' AND '323') = 50,000
  Make AND Color:  p('Mazda' AND 'red') = 90,000
  Model AND Color: p('323' AND 'red') = 150,000
Legend: DNF = disjunctive normal form; ¬X denotes "not X".

How to compute the missing values of the distribution?
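Why the known statistics alone cannot answer this: with three predicates there are 2^3 = 8 atom probabilities but only seven constraints, leaving one degree of freedom. A small sketch (the parametrization follows from the constraint equations; function and variable names are mine) enumerates all consistent distributions by the free value t = x_111:

```python
# All distributions over the 2^3 DNF atoms that are consistent with the known
# statistics of the deck's example, parametrized by t = x_111.
def atoms_from(t, s1=0.1, s2=0.2, s3=0.2, s12=0.05, s13=0.09, s23=0.15):
    x = {
        (1, 1, 1): t,
        (1, 1, 0): s12 - t,                # Mazda & 323 & not red
        (1, 0, 1): s13 - t,                # Mazda & not 323 & red
        (0, 1, 1): s23 - t,                # not Mazda & 323 & red
        (1, 0, 0): s1 - s12 - s13 + t,
        (0, 1, 0): s2 - s12 - s23 + t,
        (0, 0, 1): s3 - s13 - s23 + t,
    }
    x[(0, 0, 0)] = 1.0 - sum(x.values())   # s_empty = 1 fixes the last atom
    return x

# Any t in [0.04, 0.05] gives a valid (nonnegative) distribution matching
# every known selectivity, so the statistics alone cannot pin down x_111.
for t in (0.04, 0.045, 0.05):
    x = atoms_from(t)
    assert all(v >= -1e-9 for v in x.values())
    assert abs(sum(v for b, v in x.items() if b[0]) - 0.1) < 1e-9   # s_1 holds
```

The deck's question is therefore which point in this feasible family to pick; the next slides argue for the one of maximum entropy.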
[Venn diagram: the eight DNF atoms over Mazda, 323, and red]
Probability( Make = 'Mazda' AND Model = '323' AND Color = 'red' ) = ?

Solution: Information Entropy
H( X ) = -∑_i x_i log( x_i )
- Entropy is a measure of the "uninformedness" of a probability distribution X = (x_1, ..., x_m) with x_1 + ... + x_m = 1
- The less is known about a probability distribution, the larger its entropy
- Maximizing information entropy for the unknown selectivities, with the known selectivities as constraints, avoids bias
- Knowing nothing yields uniformity: s(X = ?) = 1/m
- Knowing only marginals yields independence: s(X = ? AND Y = ?) = s(X = ?) * s(Y = ?)
- Thus the principle of maximum entropy generalizes the uniformity and independence assumptions used in today's query optimizers

Entropy Maximization for Cardinality Estimation
- Given some selectivities (single and conjunctive) over a space of n predicates p_1, ..., p_n, choose a model that is consistent with this knowledge but otherwise as uniform as possible
- Maximize the entropy of the probability distribution X = (x_b | b ∈ {0,1}^n):
    max H(X) = max ( -∑_{b ∈ {0,1}^n} x_b log x_b )
- x_b is the selectivity of the DNF atom ∧_{i ∈ {1,...,n}} p_i^{b_i}, where b_i = 0 means predicate p_i is negated in the atom and b_i = 1 means it occurs as a positive term
- Legend: {0,1}^n denotes the n-fold cross product {0,1} × ... × {0,1} (n times); for a predicate p, p^1 = p and p^0 = NOT p

Maximum Entropy Principle – Example
Knowledge s_Y, Y ∈ T, with T = { {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, ∅ }:
  s_1 = s(Mazda) = 0.1
  s_2 = s(323) = 0.2
  s_3 = s(red) = 0.2
  s_{1,2} = s(Mazda & 323) = 0.05
  s_{1,3} = s(Mazda & red) = 0.09
  s_{2,3} = s(323 & red) = 0.15
Each known selectivity constrains a sum of DNF atoms, e.g. s_1 = x_100 + x_101 + x_110 + x_111.
Constraints:
  0.10 = s_1     = x_100 + x_101 + x_110 + x_111
  0.20 = s_2     = x_010 + x_011 + x_110 + x_111
  0.20 = s_3     = x_001 + x_011 + x_101 + x_111
  0.05 = s_{1,2} = x_110 + x_111
  0.09 = s_{1,3} = x_101 + x_111
  0.15 = s_{2,3} = x_011 + x_111
  1.00 = s_∅     = x_000 + x_001 + x_010 + x_011 + x_100 + x_101 + x_110 + x_111
Objective function:
  max H(X) = max ( -∑_{b ∈ {0,1}^3} x_b log x_b )

Solving the Constrained Optimization Problem
Minimize the objective function
  ∑_{b ∈ {0,1}^n} x_b log x_b
subject to the |T| ≤ |2^{{1,...,n}}| constraints
  for all Y ∈ T: ∑_{b ∈ C(Y)} x_b = s_Y
Legend: 2^{{1,...,n}} denotes the power set of {1,...,n}; C(Y) denotes all DNF atoms that contribute to Y, i.e., formally
  C(Y) := { b ∈ {0,1}^n | ∀ i ∈ Y: b_i = 1 } and C(∅) := {0,1}^n
General solution: Iterative Scaling

Maximum Entropy and Lagrange Multipliers
The objective function ∑_{b ∈ {0,1}^n} x_b log x_b is convex. We build a Lagrangian function by associating a multiplier λ_Y with each constraint ∑_{b ∈ C(Y)} x_b = s_Y and subtracting the constraints from the objective function:
  L(X, λ) = ∑_{b ∈ {0,1}^n} x_b log x_b − ∑_{Y ∈ T} λ_Y ( ∑_{b ∈ C(Y)} x_b − s_Y )
Differentiating with respect to x_b and equating to zero yields the conditions for a minimum:
  for each b ∈ {0,1}^n:  ln x_b + 1 − ∑_{Y ∈ P(b,T)} λ_Y = 0
Exponentiating the Lagrange multipliers, z_Y = e^{λ_Y}, yields the product form
  x_b = (1/e) ∏_{Y ∈ P(b,T)} z_Y
Legend: P(b,T) denotes the indexes Y of all known selectivities s_Y to which DNF atom b contributes its value x_b:
  P(b,T) = { Y ∈ T | ∀ i ∈ Y: b_i = 1 } ∪ {∅}
Replacing x_b in each constraint yields a condition in the exponentiated Lagrange multipliers:
  ∑_{b ∈ C(Y)} ∏_{W ∈ P(b,T)} z_W = s_Y · e   for each Y ∈ T

Iterative Scaling
We can now isolate z_Y for a particular Y ∈ T:
  z_Y = s_Y · e / ∑_{b ∈ C(Y)} ∏_{W ∈ P(b,T)\{Y}} z_W
and thus iteratively compute z_Y from all z_W, W ∈ T\{Y}.
- This algorithm is called Iterative Scaling (Darroch and Ratcliff, 1972) and converges to a stable set of multipliers z_Y, Y ∈ T
- The stable point minimizes the objective function and satisfies all constraints
- We can compute all DNF atoms x_b from the stable multipliers via x_b = (1/e) ∏_{Y ∈ P(b,T)} z_Y, and in turn all missing selectivities s_Y = ∑_{b ∈ C(Y)} x_b

Maximum Entropy Solution of the Example
Knowledge:
  s(Mazda) = s_1 = 0.1, s(323) = s_2 = 0.2, s(red) = s_3 = 0.2
  s(Mazda & 323) = s_{1,2} = 0.05, s(Mazda & red) = s_{1,3} = 0.09, s(323 & red) = s_{2,3} = 0.15
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' ):
  s_{1,2,3} = x_111 = ???
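Before walking through the deck's numeric iterations, the whole procedure can be sketched end to end. A minimal Python implementation of the Darroch-Ratcliff update rule as stated on the slides (all identifiers are mine; T, C(Y), and P(b,T) are modeled with frozensets):

```python
import math
from itertools import product

# Known selectivities s_Y for predicates 1 = Mazda, 2 = '323', 3 = red;
# the empty set carries the constraint that all atom probabilities sum to 1.
s = {
    frozenset(): 1.0,
    frozenset({1}): 0.1, frozenset({2}): 0.2, frozenset({3}): 0.2,
    frozenset({1, 2}): 0.05, frozenset({1, 3}): 0.09, frozenset({2, 3}): 0.15,
}
T = list(s)
atoms = list(product((0, 1), repeat=3))               # all 2^3 DNF atoms b

def P(b):
    """P(b,T): indexes Y whose constraint atom b contributes to."""
    ones = {i + 1 for i, bit in enumerate(b) if bit}
    return [Y for Y in T if Y <= ones]                # always includes frozenset()

# C(Y): atoms with b_i = 1 for every i in Y; C(empty set) is all atoms.
C = {Y: [b for b in atoms if all(b[i - 1] for i in Y)] for Y in T}

z = {Y: 1.0 for Y in T}                               # exponentiated multipliers z_Y
for _ in range(5000):                                 # iterative scaling sweeps
    delta = 0.0
    for Y in T:                                       # rescale one z_Y at a time
        denom = sum(math.prod(z[W] for W in P(b) if W != Y) for b in C[Y])
        new = s[Y] * math.e / denom
        delta = max(delta, abs(new - z[Y]))
        z[Y] = new
    if delta < 1e-12:                                 # stable set of multipliers
        break

x = {b: math.prod(z[Y] for Y in P(b)) / math.e for b in atoms}
print(round(x[(1, 1, 1)], 6))                         # ME estimate of s_{1,2,3}
```

On the example's seven constraints this converges to the deck's reported solution x_111 = 0.049918 (the slides report 241 iterations for their convergence threshold).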
Iterative Scaling – 1st Iteration on the Example
Knowledge: s_1 = 0.1, s_2 = 0.2, s_3 = 0.2, s_{1,2} = 0.05, s_{1,3} = 0.09, s_{2,3} = 0.15, s_∅ = 1
All multipliers start at z_Y = 1; each step rescales one z_Y so that its constraint holds exactly:
  z_1     = s_1     · e / ∑_{b ∈ C({1})}   ∏_{W ∈ P(b,T)\{1}}   z_W = 0.067957
  z_2     = s_2     · e / ∑_{b ∈ C({2})}   ∏_{W ∈ P(b,T)\{2}}   z_W = 0.254531
  z_{1,2} = s_{1,2} · e / ∑_{b ∈ C({1,2})} ∏_{W ∈ P(b,T)\{1,2}} z_W = 3.928794
  z_3     = 0.390994
  z_{1,3} = 4.603645
  z_{2,3} = 1.837978
  z_∅     = 1.175979
[Each slide in this sequence shows the Venn diagram of the eight atoms and the intermediate selectivities after the corresponding scaling step.]

Maximum Entropy Solution of the Example
Knowledge: s(Mazda) = s_1 = 0.1, s(323) = s_2 = 0.2, s(red) = s_3 = 0.2, s(Mazda & 323) = s_{1,2} = 0.05, s(Mazda & red) = s_{1,3} = 0.09, s(323 & red) = s_{2,3} = 0.15
Converged atom values (after 241 iterations):
  x_111 = 0.049918   x_110 = 0.000082   x_101 = 0.040082   x_011 = 0.100082
  x_100 = 0.009918   x_010 = 0.049918   x_001 = 0.009918   x_000 = 0.740082
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' ):
  s_{1,2,3} = x_111 = 0.049918

Let's compare:
Selectivity( Make = 'Mazda' AND Model = '323' AND Color = 'red' )
Real:   s( Make AND Model AND Color )        = 0.049    actual card: 49,000
case 0: s( Make ) * s( Model ) * s( Color )  = 0.004    estimated card: 4,000    Error: 10x
case 1: s( Make AND Model ) * s( Color )     = 0.010    estimated card: 10,000   Error: 5x
case 2: s( Make AND Color ) * s( Model )     = 0.018    estimated card: 18,000   Error: 2.5x
case 3: s( Model AND Color ) * s( Make )     = 0.015    estimated card: 15,000   Error: 3x
ME:     s( Make AND Model AND Color )        = 0.049918 estimated card: 49,918   Almost no error

Forward Estimation: Predicting s_{1,2,3}, given …
[Box plots of the absolute estimation error over 200 queries, for different subsets of known statistics (cases 2.1a-2.3), with quartiles and median shown]
[The forward-estimation errors shrink as more multivariate statistics become known: the median error drops by roughly two orders of magnitude, down to 6 when all three pairwise selectivities s_{1,2}, s_{1,3}, s_{2,3} are known.]

Comparing DB2 and ME: Predicting s_{1,2,3}, given …
[Box plots over 200 queries: for each available combination of multivariate statistics, the maximum entropy (ME) estimate has a lower absolute error than the DB2 state-of-the-art (SOTA) estimate, e.g., median 11 vs. 44 given s_{1,3} and s_{2,3}, and 6 vs. 43 given s_{1,2}, s_{1,3}, and s_{2,3}]

Backward Estimation: Given s_{1,2,3}, predicting …
[Box plots over 200 queries: absolute estimation error when predicting the pairwise selectivities s_{1,2} (Make AND Model), s_{1,3} (Model AND Color), and s_{2,3} (Make AND Color); ME again beats the DB2 state of the art]

Computation Cost
[Chart: time until convergence of iterative scaling as a function of the number of predicates |P| (5 to 18) and the number of known statistics |T| (0 to 10)]

Related Work
Selectivity Estimation
- SAC+79: P. G. Selinger et al.: Access Path Selection in a Relational DBMS. SIGMOD 1979
- Chr83: S. Christodoulakis: Estimating record selectivities. Inf. Syst. 8(2): 105-115 (1983)
- Lyn88: C. A. Lynch: Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distribution of Column Values. VLDB 1988: 240-251
- PC84: G. Piatetsky-Shapiro, C. Connell: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD 1984: 256-276
- PIH+96: V. Poosala et al.: Improved histograms for selectivity estimation of range predicates. SIGMOD 1996
Recommending, Constructing, and Maintaining Multivariate Statistics
- AC99: A. Aboulnaga, S. Chaudhuri: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD 1999: 181-192
- BCG01: N. Bruno, S. Chaudhuri, L. Gravano: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD 2001
- BC02: N. Bruno, S. Chaudhuri: Exploiting Statistics on Query Expressions for Optimization. SIGMOD 2002
- BC03: N. Bruno, S. Chaudhuri: Efficient Creation of Statistics over Query Expressions. ICDE 2003
- BC04: N. Bruno, S. Chaudhuri: Conditional Selectivity for Statistics on Query Expressions. SIGMOD 2004: 311-322
- SLM+01: M. Stillger, G. Lohman, V. Markl, M. Kandil: LEO – DB2's Learning Optimizer. VLDB 2001
- IMH+04: I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, A. Aboulnaga: CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD 2004
- CN00: S. Chaudhuri, V. Narasayya: Automating Statistics Management for Query Optimizers. ICDE 2000: 339-348
- DGR01: A. Deshpande, M. Garofalakis, R. Rastogi: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD 2001
- GJW+03: C. Galindo-Legaria, M. Joshi, F. Waas, et al.: Statistics on Views. VLDB 2003: 952-962
- GTK01: L. Getoor, B. Taskar, D. Koller: Selectivity Estimation using Probabilistic Models. SIGMOD 2001
- PI97: V. Poosala, Y. Ioannidis: Selectivity Estimation without value independence. VLDB 1997
Entropy and Maximum Entropy
- Sha48: C. E. Shannon: A mathematical theory of communication. Bell System Technical Journal 27: 379-423 and 623-656, July and October 1948
- DR72: J. N. Darroch, D. Ratcliff: Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43, 1972: 1470-1480
- GP00: W. Greiff, J. Ponte: The maximum-entropy approach and probabilistic IR models. ACM TOIS 18(3): 246-287, 2000
- GS85: S. Guiasu, A. Shenitzer: The principle of maximum entropy. The Mathematical Intelligencer 7(1), 1985

Conclusions
Problem: Inconsistent cardinality model and bias in today's query optimizers, due to overlapping multivariate statistics (MD histograms, etc.)
- To reduce bias, today's optimizers use only a consistent subset of the available multivariate statistics
- Cardinality estimates remain suboptimal despite better information
- Bias towards plans without proper statistics ("fleeing from knowledge to ignorance")
Solution: Maximizing information entropy
- Generalizes the concepts of uniformity and independence used in today's query optimizers
- All statistics are utilized, and the cardinality model is consistent
- Cardinality estimates improve, some by orders of magnitude
- No bias towards particular plans
- Consistent estimates are computed in subsecond time for up to 10 predicates per table; however, the algorithm is exponential in the number of predicates
Not covered in the talk (see paper):
- Reducing algorithm complexity through pre-processing
- Impact on query performance: speedup, sometimes by orders of magnitude
Future work: Extension to join estimates