QMapper for Smart Grid: Migrating SQL-based Application to Hive
Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen and Songlin Hu
SIGMOD’15, May 31–June 4, 2015

Content
• Introduction
• System Overview
• Query Rewriting
• Cost Model
• Implementation
• Experiments

Introduction
• High-level query languages built on MapReduce, such as Hive and Pig, are widely used
• Traditional enterprises are hitting performance bottlenecks in their current RDBMS-based infrastructure
• Hive cannot fully support the SQL syntax at the moment
• Even if some SQL queries used in an RDBMS are directly accepted by Hive, their performance on Hadoop might be very low

Contribution
• Translate SQL to optimized HiveQL
• A cost model is proposed to reflect the execution time of MapReduce jobs
• An algorithm is designed to reorganize the join structure so as to construct a near-optimal query

SECICS System
• The database holds 20 TB of data in total, and about 30 GB of new data is added every day
• Three kinds of data in SECICS:
  ▫ Meter data: collected by smart meters
  ▫ Archive data: records the detailed archived information of meter data
  ▫ Statistic data: the result of offline batch analysis

Background
• Low data write throughput
  ▫ An RDBMS with complex indexes cannot provide enough write throughput
• Unsatisfactory statistical analysis capability
  ▫ The average processing time reaches 3 to 4 hours
• Weak scalability
  ▫ Scaling out an RDBMS mostly leads to a redesign of the sharding strategies as well as a lot of application logic
• Uncontrollable resource competition

The migration of Stored Procedures

Overview

Four Components
• SQL Interpreter
  ▫ Resolves the SQL query provided by a user and parses it into an Abstract Syntax Tree
• Query Rewriter
  ▫ The Rule-Based Rewriter (RBR) checks whether a query matches a series of static rules; if it does, new equivalent queries are generated
  ▫ The Cost-Based Optimizer (CBO) further optimizes the join structure of each query
• Statistics Collector
  ▫ Collects statistics of the related tables and their columns
• Plan Evaluator
  ▫ Queries with equivalent join cost generated by the RBR are sent to it

QUERY REWRITING
• Rule-based Rewriter
  ▫ Detects the SQL clauses that are not well supported by Hive and transforms them into HiveQL
  ▫ Initial rules are invoked first to check whether the query can be rewritten
  ▫ The RBR traverses the subqueries of each query and applies the rules to them recursively
  ▫ All rewritten queries are generated and sent to the CBO

Example
• lvRate(uid,deviceid,isMissing,date,type)
• dataProfile(dataid,uid,isActive)
• dataRecord(dataid,date,consumption)
• powerCut(uid,date)
• gprsUsage(deviceid,dataid,date,gprs)
• deviceInfo(deviceid,region,type)

Basic UPDATE Rule
• This rule translates an UPDATE into a SELECT statement by moving the simpleCondition into the selectList
• UPDATE lvRate a SET a.isMissing=true
  LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
  LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date
  WHERE c.dataid IS NULL
• INSERT OVERWRITE TABLE lvRate
  SELECT a.uid,a.deviceid,IF(c.dataid IS NULL,true,false) as isMissing,a.date,a.type
  FROM lvRate a
  LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
  LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date

(NOT) EXISTS Rule
• Transforms the subquery into a LEFT OUTER JOIN and replaces the (NOT) EXISTS condition with joinColumn IS (NOT) NULL
• DELETE FROM lvRate a
  WHERE NOT EXISTS (SELECT 1 FROM powerCut b WHERE a.uid=b.uid AND a.date=b.date)
• INSERT OVERWRITE TABLE lvRate
  SELECT a.uid,a.deviceid,a.isMissing,a.date,a.type
  FROM lvRate a
  LEFT OUTER JOIN (SELECT uid,date FROM powerCut) b ON a.uid=b.uid AND a.date=b.date
  WHERE b.uid IS NULL

Cost-based Optimizer
• SELECT sum(gprs), type
  FROM gprsUsage A
  JOIN deviceInfo B ON A.deviceid = B.deviceid
  JOIN dataRecord C ON A.dataid = C.dataid AND A.date = C.date
  JOIN dataProfile D ON C.dataid = D.dataid
  LEFT OUTER JOIN powerCut E ON D.uid = E.uid AND A.date = E.date
  WHERE E.uid IS NULL AND A.date='2014-01-01'
  GROUP BY B.type
• SELECT sum(gprs), type
  FROM( SELECT T1.gprs, T1.date, T1.type, T2.uid
  FROM (SELECT A.gprs, A.dataid, A.date, B.type
        FROM gprsUsage A JOIN deviceInfo B ON A.deviceid = B.deviceid
        WHERE A.date='2014-01-01' )T1
  JOIN (SELECT C.dataid, C.date, D.uid
        FROM dataRecord C JOIN dataProfile D ON C.dataid = D.dataid)T2
  ON T1.dataid = T2.dataid

Cost-based Optimizer
• Unlike traditional databases, MapReduce-based query processing writes the intermediate results of a join back to HDFS, and the next join operation reads them from HDFS again, causing high I/O costs
• The main difference in intermediate results is that the left-deep plan generates A ⋈ B ⋈ C and the bushy plan generates C ⋈ D
• The bushy plan may perform worse, as its jobs will compete for computing resources

COST MODEL
• Cost of MapReduce
  ▫ The Map phase can be divided into three subphases: Map, Spill and Merge
  ▫ The Reduce phase also includes three parts: Shuffle, Merge and Reduce
• Map
  ▫ For each Mapper:

Mapper Cost Model
• Spill
• Merge
• Unlike normal MapReduce jobs, in Hive the internal logic of mappers may vary depending on the specific table to be processed

Reduce
• In the reduce phase, shuffle is responsible for fetching the mappers' outputs to their corresponding reducers
• Merge
• Reduce
• Total Cost

Cost of Operators in Map and Reduce
• In order to calculate the costs, a few sample queries based on TPC-H are designed as probes to collect the execution time of operators
• Given a chain with n operators, the cost is evaluated as:

Cost of Workflow
• A HiveQL query is finally compiled to MapReduce workflows (a directed acyclic graph) where each node is a single MapReduce job and each edge represents the dataflow

Experiments
• Evaluate the correctness and efficiency of QMapper
• Compare QMapper with manually translated work on the efficiency of translating SQL into HiveQL and the efficiency of HiveQL execution
• TPC-H demonstrates the execution efficiency of the HiveQL generated by QMapper
• The Smart Grid application shows the correctness and translation efficiency of QMapper

Join Performance

Scalability

Accuracy of Cost Model
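
Sketch: checking the (NOT) EXISTS rule on toy data

The (NOT) EXISTS rule from the Query Rewriting slides replaces a NOT EXISTS predicate with a LEFT OUTER JOIN followed by joinColumn IS NULL. The sketch below checks only that predicate equivalence, as a plain SELECT, on a few made-up rows. It is not QMapper code: SQLite stands in for both the RDBMS and Hive, the table contents are invented for illustration, and the INSERT OVERWRITE step is left out because SQLite has no such statement.

```python
import sqlite3

# In-memory SQLite database as a stand-in (assumption: not Hive, not the original RDBMS).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lvRate(uid, deviceid, isMissing, date, type);
CREATE TABLE powerCut(uid, date);
-- Invented sample rows, for illustration only.
INSERT INTO lvRate VALUES
  (1, 'd1', 0, '2014-01-01', 'A'),
  (2, 'd2', 0, '2014-01-01', 'B'),
  (3, 'd3', 1, '2014-01-02', 'A');
INSERT INTO powerCut VALUES
  (1, '2014-01-01'),
  (3, '2014-01-05');
""")

# Original form: correlated NOT EXISTS subquery.
not_exists_sql = """
SELECT a.uid, a.deviceid, a.isMissing, a.date, a.type
FROM lvRate a
WHERE NOT EXISTS (
    SELECT 1 FROM powerCut b
    WHERE a.uid = b.uid AND a.date = b.date)
ORDER BY a.uid
"""

# Rewritten form: LEFT OUTER JOIN plus an IS NULL test on the join column.
left_join_sql = """
SELECT a.uid, a.deviceid, a.isMissing, a.date, a.type
FROM lvRate a
LEFT OUTER JOIN (SELECT uid, date FROM powerCut) b
  ON a.uid = b.uid AND a.date = b.date
WHERE b.uid IS NULL
ORDER BY a.uid
"""

rows_not_exists = conn.execute(not_exists_sql).fetchall()
rows_left_join = conn.execute(left_join_sql).fetchall()

# Both forms should select the same lvRate rows: those with no matching powerCut row.
print(rows_not_exists == rows_left_join)
```

An unmatched lvRate row produces exactly one NULL-padded row in the outer join, so the IS NULL form returns no duplicates even when powerCut contains repeated (uid, date) pairs.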