University of Stuttgart as Anwendungssoftware A Statistics Propagation Approach to Enable Cost-Based Optimization of Statement Sequences Tobias Kraft, Holger Schwarz, Bernhard Mitschang Institute of Parallel and Distributed Systems University of Stuttgart University of Stuttgart as Anwendungssoftware Overview • Motivation • Cost Estimation Approach • Histogram Propagation • Related Work • Experiments • Conclusion & Future Work 2 University of Stuttgart as Motivation Anwendungssoftware The Case for Optimization • Many of today’s applications embed query generators. • Some of these generators not only produce a single query but a sequence of SQL statements (e.g. MicroStrategy DSS tools). • Rewriting these sequences may lead to significant performance improvements! 3 • Development of an optimizer based on rewrite rules and a heuristic priority-based control strategy Coarse-Grained Optimization (CGO) [VLDB03] University of Stuttgart as Motivation Anwendungssoftware Statement Sequences We focus on statement sequences that: CREATE TABLE q1 (custkey INTEGER, turnover1990 FLOAT); CREATE TABLE q2 (custkey INTEGER, turnover1991 FLOAT); • compute the final result of a request in a set of subsequent steps, • allow to share intermediate results by multiple subsequent steps, • temporarily store intermediate results in tables that are being created and dropped within the sequence, • store the final result in a table that is not being dropped within the sequence. 4 CREATE TABLE q3 (custkey INTEGER, name VARCHAR(25)); INSERT INTO q1 SELECT o.custkey, SUM(o.totalprice) FROM orders o WHERE o.orderyear = 1990 GROUP BY o.custkey; INSERT INTO q2 SELECT o.custkey, SUM(o.totalprice) FROM orders o WHERE o.orderyear = 1991 GROUP BY o.custkey; INSERT INTO q3 SELECT c.custkey, c.name FROM q1, q2, customer c WHERE q1.custkey = c.custkey AND q1.custkey = q2.custkey AND q2.turnover1991 > q1.turnover1990; DROP TABLE q1; DROP TABLE q2; q1 q2 q3 University of Stuttgart as Motivation Anwendungssoftware Problems of the Heuristic Control Strategy • In some scenarios a rule application may lead to deterioration of performance. • Different behavior on different platforms and database management systems. • Alternative sequences of rule applications lead to different results. Need for a cost-based approach. 5 University of Stuttgart as Cost Estimation Approach Anwendungssoftware Problems & Solutions • Cost estimates depend on the physical layout of the database and the capabilities and strategies of the DBMS‘s query optimizer. A cost model on top of the DBMS is no feasible solution. Make use of the cost estimates provided by the DBMS‘s query optimizer. • DBMSs only provide cost estimates for statements on existing tables. Execute CREATE TABLES statements before cost estimation. • Missing statistics for the created tables causes the DBMS‘s query optimizer to use default values for cardinality and selectivity. Propagate statistics through the INSERT statements. Make these propagated statistics available to the DBMS‘s optimizer. 6 University of Stuttgart as Cost Estimation Approach Anwendungssoftware Algorithm of Cost Estimation Input: Output: A statement sequence S. A cost estimate for S. totalcosts = 0 foreach CREATE TABLE statement c in S Execute c on the underlying database system. foreach INSERT statement i in S (in the order given by S) Retrieve a cost estimate for i from the optimizer of the underlying database system. Add this cost estimate to totalcosts. Translate i into an algebraic tree. Retrieve histograms for the base tables from the underlying database system and propagate them through the algebraic tree to retrieve histograms for the target table of i. Store the resulting histograms in the catalog of the underlying database system. foreach CREATE TABLE statement c in S Drop the table that has been created by c. return totalcosts. 7 University of Stuttgart as Cost Estimation Approach Anwendungssoftware Architectural Overview sequence optimized sequence cost-estimation component for statement sequences statement cost aggregation histogram propagation Statistics API cost-based CGO optimizer cost-based control strategy CGO ruleset JDBC histograms cost estimates for statements DBMS query optimizer DBMS execution engine and data management 8 DDL statements University of Stuttgart Histogram Propagation as Anwendungssoftware Overview • We have adopted techniques from approximate query answering and added some extensions: T4 interval arithmetic for arithmetic terms, heuristics for grouping and aggregation, unification of comparison operators, normalization of histograms, common subexpressions • SQL statement sequence algebraic operator trees. • Everything is a bucket / histogram: NULL values are represented by a special bucket. Constant values (used in arithmetic terms) are represented by a histogram with a single bucket. Sets of constants (used in IN-predicates) are represented by a histogram. 9 X T1 T2 T3 University of Stuttgart as Histogram Propagation Anwendungssoftware Histograms • Buckets are 4-tuples: (low, high, card, dv ) Histogram #properties: #dataType: • The NULL value: (null, null, card, 1) 1 * #buckets Bucket • A constant value c : (c, c, card, 1) #properties: #dataType: #cardinality: #distinctValues: 1 1 CharValue -value: String Properties int double double 1 #low 1 Value #dataType: 10 Properties int #high int DateValue DecimalValue IntegerValue -value: Date -value: double -value: long University of Stuttgart as Histogram Propagation Anwendungssoftware Algebra and Propagation • Operators: projection, selection, cartesian product, union, difference and grouping (including aggregation). • A join can be represented by a cartesian product followed by a selection that contains the join condition. • Arithmetic terms (projection / selection) and predicates (selection) are also represented by operator trees. • Propagation is done by recursively traversing the algebraic operator tree and the arithmetic trees in a post order manner. 11 University of Stuttgart Histogram Propagation as Anwendungssoftware Algebra and Propagation • Arithmetic Operators: histograms as input a result histogram as output iterate over all bucket combinations compute a result bucket for each bucket combination • Comparison Operators: A3 histograms as input selectivity as output some operators also provide modified histograms iterate over all bucket combinations compute a selectivity for each bucket combination (and adapt the buckets) A1 A2 A1 A2 A1 ‘ A2 ‘ selectivity A1 A2 A1 12 A2 University of Stuttgart as Histogram Propagation Anwendungssoftware Interval Arithmetic for Arithmetic Terms • Example of using interval arithmetic for adding two buckets: BO.low = BI1.low + BI2.low BO.high = BI1.high + BI2.high BI1 + 0 10 20 30 40 50 = BI2 0 10 20 30 40 50 + 0 10 20 30 40 50 BO 0 10 20 30 40 50 = 0 10 20 30 40 50 0 10 20 30 40 50 60 Serialization 13 10 20 30 40 50 60 University of Stuttgart Histogram Propagation as Anwendungssoftware Heuristics for Grouping and Aggregation • Determine amount of groups and average group sizes. • Use these values together with the histogram of the attribute that should be aggregated to compute the aggregate buckets. cardinality 30 100 groups with group size 50 50 tuple SUM(A) ? 100 20 10 histogram of attribute A 0 10 20 30 40 50 60 70 80 90 100 110 lower bound of SUM(A) = 450 + 500 = 950 14 cardinality 50 tuple value upper bound of SUM(A) = 2100 + 1900 + 875 = 4875 histogram of SUM(A) 950 4875 value University of Stuttgart as Histogram Propagation Anwendungssoftware Unification of Comparison Operators • Comparison operators compare histograms. A single operator implementation can be used for different purposes. No separate join implementation is necessary. • E.g., the comparison operator ‘=‘ can be used in: a predicate comparing two attributes, a predicate comparing an attribute and a constant, a join condition of an equi-join cartesian product followed by a selection that contains the join-condition as predicate, an IN predicate comparing an attribute with a set of values join with a table (represented by a histogram) that contains the values of the value set. 15 C C1 … Cn University of Stuttgart as Related Work Anwendungssoftware on Histogram Propagation • Papers on Approximate Query Answering: Yannis E. Ioannidis, Viswanath Poosala: Histogram-Based Approximation of Set-Valued Query-Answers. VLDB 1999 Viswanath Poosala, Venkatesh Ganti, Yannis E. Ioannidis: Approximate Query Answering using Histograms. IEEE Data Eng. Bull. 1999 • Transformation of SQL queries on tables into SQL queries on histograms. • Join is done by creating two tables under the uniform spread assumption such that each table represents a value distribution which fits to the respective input histogram. • Only equi-joins, no support for predicates that compare two attributes, no grouping and no support of arithmetic terms. 16 University of Stuttgart Experiments as Anwendungssoftware Experimental Setup • Database: TPC-H benchmark database on IBM DB2 V9. • Sample sequences: Sequence S1 generated by the MicroStrategy DSS tool suite. A set of semantically equivalent sequences that result from the application of the CGO rewrite rules. Variants of those sequences resulting from the use of different values in the filter predicates of the sequence. S1 S1 Q5 S2 S3 S4 S5 Q4 Q1 17 Q2 Q3 S6 S7 S8 S10 S9 S10 Q7 Q6 University of Stuttgart Experiments as Anwendungssoftware Results for Alternative Sequences execution time (s) cost estimate (timerons / 1000) S1 9,455 2,379 2,345 S1 S2 6,862 S2 1,826 1,803 S3 6,859 S3 1,810 1,777 S4 6,858 S4 1,791 1,765 9,375 S5 S6 3,791 2,356 2,344 S5 S6 1,103 1,059 16,815 S7 6,805 S7 S8 6,803 S8 1,763 S9 6,802 S9 1,766 S10 3,735 S10 1,777 16,887 16,887 55,170 1,083 without histogram propagation with histogram propagation 18 University of Stuttgart Experiments as Anwendungssoftware Results for Different Selectivities 9,500 cost estimate (timerons / 1000) for sequence S1 9,475 9,450 9,425 9,400 100,000 2,475 300,000 parameter value 500,000 execution time (s) for sequence S1 2,425 2,375 2,325 2,275 100,000 300,000 parameter value without histogram propagation 19 500,000 with histogram propagation University of Stuttgart as Anwendungssoftware Conclusion & Future Work • Cost estimates for statement sequences are necessary to avoid rule applications that lead to performance deterioration. • Making use of the cost estimates of the underlying DBMS is a feasible solution. • Histogram propagation is necessary to get useable cost estimates for statements that access intermediate-result tables and to avoid bad plans. • Future Work: A cost-based control strategy. Extensive measurements. 20 University of Stuttgart as Anwendungssoftware Thank you for your attention! 21 University of Stuttgart 22 as Anwendungssoftware University of Stuttgart as Cost Estimation Approach Anwendungssoftware Statistics API Statistics API DB2 implementation Oracle implementation SQL Server implementation JDBC DB2 database 23 Oracle database SQL Server database • is an interface that offers uniform DBMS-independent access to DBMS statistics, meta data and optimizer estimates • provides a flexible histogram format that abstracts from proprietary data structures used in different DBMSs • implementations exist for IBM DB2, Oracle and MS SQL Server University of Stuttgart as Histogram Propagation Anwendungssoftware Common Subexpressions • Identify arithmetic terms that appear multiple times in the different clauses of an SQL query. • Otherwise, the arithmetic term will be recomputed for each appearance and modifications of histograms may get lost. $1, $2 $1, $2 = count(*) INSERT INTO temptable SELECT year(o.o_orderdate), count(*) FROM orders o WHERE year(o.o_orderdate) > 1992 AND o.o_orderpriority = ‘1-URGENT‘ GROUP BY year(o.o_orderdate) o_orderpriority = ‘1-URGENT‘ $1 > 1992 $1 = year(o_orderdate) 24 orders o_orderdate, o_orderpriority University of Stuttgart as Histogram Propagation Anwendungssoftware Normalization of Histograms • Worst case: The number of buckets in the output histogram is the product of the number of buckets in the input histograms. Serializing a histogram may double the number of buckets. Normalization: prior to the enumeration phase and / or after an output histogram has been produced, reduces the number of buckets by merging adjacent buckets, trade-off between complexity / performance and quality. Merge 10 buckets 4 buckets 25 0 10 20 30 40 50 0 10 20 30 40 50
© Copyright 2026 Paperzz