QMapper for Smart Grid: Migrating SQL-based Application to Hive
Yue Wang, Yingzhong Xu, Yue Liu,
Jian Chen and Songlin Hu
SIGMOD’15, May 31–June 4, 2015
Content
• Introduction
• System Overview
• Query Rewriting
• Cost Model
• Implementation
• Experiments
Introduction
• High-level query languages built on MapReduce, such as
Hive and Pig, are widely used
• Traditional enterprises are hitting performance bottlenecks
in their current RDBMS-based infrastructure
• Hive does not yet fully support SQL syntax
• Even when an SQL query written for an RDBMS is accepted
by Hive as-is, it may perform very poorly on Hadoop
Contributions
• Translate SQL to optimized HiveQL
• A cost model is proposed to reflect the execution
time of MapReduce jobs
• An algorithm is designed to reorganize the join
structure so as to construct a near-optimal
query
SECICS System
• The database holds 20 TB of data in total, and about
30 GB of new data is added every day
• Three kinds of data in SECICS:
▫ Meter data: collected by smart meters
▫ Archive data: records the detailed archived
information of meter data
▫ Statistic data: the results of offline batch analysis
Background
• Low data write throughput
▫ An RDBMS with complex indexes cannot provide enough
write throughput
• Unsatisfactory statistical analysis capability
▫ Average processing time reaches 3 to 4 hours
• Weak scalability
▫ Scaling out an RDBMS usually means redesigning the
sharding strategy as well as much of the application logic
• Uncontrollable resource competition
The migration of Stored Procedures
Overview
Four Components
• SQL Interpreter:
▫ Resolves the SQL query provided by a user and
parses it into an Abstract Syntax Tree
• Query Rewriter:
▫ A Rule-Based Rewriter (RBR) checks whether a query
matches a series of static rules; for each match, new
equivalent queries are generated
▫ A Cost-Based Optimizer (CBO) then further
optimizes the join structure of each query
• Statistics Collector:
▫ Collects statistics on the related tables and their
columns
• Plan Evaluator:
▫ Queries with equivalent join cost generated by the
RBR are sent to it
Query Rewriting
• Rule-based Rewriter
▫ Detects SQL clauses that Hive does not support well
and transforms them into HiveQL
▫ Initial rules are invoked first to check whether the query
can be rewritten
▫ The RBR then traverses the subqueries of each query
and applies the rules to them recursively
▫ All rewritten queries are generated and sent to the
CBO
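The recursive traversal described above can be sketched as follows. This is a minimal illustration, not QMapper's implementation: the `Query` class, the two toy rules and the `rewrite` function are all hypothetical, and real rules operate on a full syntax tree rather than on string tags.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Toy stand-in for a parsed query (hypothetical, not QMapper's AST)."""
    kind: str                              # e.g. "UPDATE", "DELETE", "SELECT"
    predicate: str = ""                    # e.g. "EXISTS", "NOT EXISTS"
    subqueries: Tuple["Query", ...] = ()


Rule = Callable[[Query], Optional[Query]]  # returns None when the rule does not match


def dml_to_insert_overwrite(q: Query) -> Optional[Query]:
    # UPDATE / DELETE are expressed as INSERT OVERWRITE ... SELECT.
    if q.kind in {"UPDATE", "DELETE"}:
        return replace(q, kind="INSERT OVERWRITE")
    return None


def exists_to_join(q: Query) -> Optional[Query]:
    # (NOT) EXISTS becomes a join plus a joinColumn IS (NOT) NULL test.
    mapping = {"EXISTS": "joinColumn IS NOT NULL",
               "NOT EXISTS": "joinColumn IS NULL"}
    if q.predicate in mapping:
        return replace(q, predicate=mapping[q.predicate])
    return None


RULES = (dml_to_insert_overwrite, exists_to_join)


def rewrite(q: Query, rules=RULES) -> Query:
    # Apply every matching rule to this node until none matches any more.
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_q = rule(q)
            if new_q is not None:
                q, changed = new_q, True
    # Then recurse into the subqueries.
    return replace(q, subqueries=tuple(rewrite(s, rules) for s in q.subqueries))


query = Query("DELETE", "NOT EXISTS", (Query("SELECT", "EXISTS"),))
rewritten = rewrite(query)
print(rewritten)
```

Each toy rule stops matching once it has fired, which is what guarantees the loop terminates.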
Example
• lvRate(uid,deviceid,isMissing,date,type)
• dataProfile(dataid,uid,isActive)
• dataRecord(dataid,date,consumption)
• powerCut(uid,date)
• gprsUsage(deviceid,dataid,date,gprs)
• deviceInfo(deviceid,region,type)
Basic UPDATE Rule
• This rule translates an UPDATE into a SELECT statement by
moving the simpleCondition into the selectList
• UPDATE lvRate a SET a.isMissing=true
LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid
AND a.date=c.date
WHERE c.dataid IS NULL
• INSERT OVERWRITE TABLE lvRate
SELECT a.uid,a.deviceid,IF(c.dataid IS NULL,true,false) AS isMissing,
a.date,a.type
FROM lvRate a
LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid
AND a.date=c.date
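To make the rule concrete, here is a small sketch that runs the rewritten SELECT in SQLite and compares it with an in-place UPDATE. It is only an illustration under assumptions: SQLite stands in for both systems, `CASE WHEN` replaces Hive's `IF`, the UPDATE is expressed with a correlated subquery because SQLite does not accept the join syntax on the slide, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lvRate(uid, deviceid, isMissing, date, type);
    CREATE TABLE dataProfile(dataid, uid, isActive);
    CREATE TABLE dataRecord(dataid, date, consumption);
    INSERT INTO lvRate VALUES (1, 'd1', 0, '2014-01-01', 'x'),
                              (2, 'd2', 0, '2014-01-01', 'x');
    INSERT INTO dataProfile VALUES (10, 1, 1), (20, 2, 1);
    INSERT INTO dataRecord VALUES (10, '2014-01-01', 5.0);
""")

# Rewritten form: the UPDATE condition moves into the select list.
rewritten = conn.execute("""
    SELECT a.uid, a.deviceid,
           CASE WHEN c.dataid IS NULL THEN 1 ELSE 0 END AS isMissing,
           a.date, a.type
    FROM lvRate a
    LEFT OUTER JOIN dataProfile b ON a.uid = b.uid
    LEFT OUTER JOIN dataRecord c ON b.dataid = c.dataid AND a.date = c.date
    ORDER BY a.uid
""").fetchall()

# Reference: an in-place UPDATE with the same intent.
conn.execute("""
    UPDATE lvRate SET isMissing = 1
    WHERE NOT EXISTS (
        SELECT 1 FROM dataProfile b
        JOIN dataRecord c ON b.dataid = c.dataid
        WHERE b.uid = lvRate.uid AND c.date = lvRate.date)
""")
updated = conn.execute(
    "SELECT uid, deviceid, isMissing, date, type FROM lvRate ORDER BY uid"
).fetchall()

print(rewritten == updated)
```

The two results agree here because every row starts with isMissing = 0 and each uid has a single dataProfile row. The rewritten form assigns false to every row that does not satisfy the condition, so rows already marked true would be reset.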
(NOT) EXISTS Rule
• Transforms the subquery into a LEFT OUTER JOIN
and replaces the (NOT) EXISTS condition with
joinColumn IS (NOT) NULL
• DELETE FROM lvRate a WHERE EXISTS
(SELECT 1 FROM powerCut b WHERE a.uid=b.uid
AND a.date=b.date)
• INSERT OVERWRITE TABLE lvRate
SELECT a.uid,a.deviceid,a.isMissing,a.date,a.type
FROM lvRate a
LEFT OUTER JOIN (SELECT uid,date FROM powerCut) b
ON a.uid=b.uid AND a.date=b.date
WHERE b.uid IS NULL
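The subquery-to-join step of this rule can be checked on a plain SELECT. The sketch below uses SQLite as a stand-in and invented rows; it compares a (NOT) EXISTS filter with the LEFT OUTER JOIN plus IS (NOT) NULL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lvRate(uid, deviceid, isMissing, date, type);
    CREATE TABLE powerCut(uid, date);
    INSERT INTO lvRate VALUES (1, 'd1', 0, '2014-01-01', 'x'),
                              (2, 'd2', 1, '2014-01-01', 'x'),
                              (3, 'd3', 1, '2014-01-01', 'x');
    INSERT INTO powerCut VALUES (2, '2014-01-01');
""")


def uids(sql):
    """Run a query and return the uid column as a list."""
    return [row[0] for row in conn.execute(sql)]


JOIN = """
    SELECT a.uid FROM lvRate a
    LEFT OUTER JOIN (SELECT uid, date FROM powerCut) b
      ON a.uid = b.uid AND a.date = b.date
    WHERE b.uid IS {null_test} ORDER BY a.uid
"""
SUBQUERY = """
    SELECT a.uid FROM lvRate a
    WHERE {exists_test} (SELECT 1 FROM powerCut b
                         WHERE a.uid = b.uid AND a.date = b.date)
    ORDER BY a.uid
"""

not_exists = uids(SUBQUERY.format(exists_test="NOT EXISTS"))
join_is_null = uids(JOIN.format(null_test="NULL"))
exists = uids(SUBQUERY.format(exists_test="EXISTS"))
join_not_null = uids(JOIN.format(null_test="NOT NULL"))

print(not_exists, join_is_null, exists, join_not_null)
```

The IS NOT NULL form matches EXISTS only when powerCut has at most one row per (uid, date); with duplicates, the join would repeat rows of lvRate.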
Cost-based Optimizer
• SELECT sum(gprs), type
FROM gprsUsage A
JOIN deviceInfo B ON A.deviceid = B.deviceid
JOIN dataRecord C ON A.dataid = C.dataid AND A.date = C.date
JOIN dataProfile D ON C.dataid = D.dataid
LEFT OUTER JOIN powerCut E ON D.uid = E.uid AND A.date = E.date
WHERE E.uid IS NULL AND A.date='2014-01-01'
GROUP BY B.type
• SELECT sum(gprs), type FROM(
SELECT T1.gprs, T1.date, T1.type, T2.uid FROM
(SELECT A.gprs, A.dataid, A.date, B.type
FROM gprsUsage A JOIN deviceInfo B ON A.deviceid = B.deviceid
WHERE A.date='2014-01-01') T1
JOIN
(SELECT C.dataid, C.date, D.uid
FROM dataRecord C JOIN dataProfile D ON C.dataid = D.dataid) T2
ON T1.dataid = T2.dataid
Cost-based Optimizer
• Unlike traditional databases, MapReduce-based query
processing writes each join's intermediate result back
to HDFS, and the next join reads it from HDFS again,
incurring heavy I/O costs
• The main difference in intermediate results is
that the left-deep plan generates A ⋈ B ⋈ C while the
bushy plan generates C ⋈ D
• Plan B may perform worse, as its jobs will
compete for computing resources
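A rough way to see how join structure changes the volume of intermediate data is to estimate the rows each join writes. The sketch below is an assumption-laden toy, not QMapper's optimizer: the cardinalities and selectivities are invented, and join size uses the textbook estimate |L| × |R| × selectivity.

```python
from fractions import Fraction
from itertools import product

# Hypothetical table cardinalities and pairwise join selectivities.
CARD = {"A": 1000, "B": 100, "C": 5000, "D": 200}
SEL = {
    frozenset("AB"): Fraction(1, 100),
    frozenset("AC"): Fraction(1, 5000),
    frozenset("CD"): Fraction(1, 200),
}


def tables(plan):
    """Set of base tables under a plan (a table name or a (left, right) pair)."""
    if isinstance(plan, str):
        return {plan}
    left, right = plan
    return tables(left) | tables(right)


def estimate(plan):
    """Return (output rows, rows written by every join in this subtree)."""
    if isinstance(plan, str):
        return Fraction(CARD[plan]), Fraction(0)
    left, right = plan
    l_rows, l_written = estimate(left)
    r_rows, r_written = estimate(right)
    rows = l_rows * r_rows
    # Apply the selectivity of every join predicate that spans both sides.
    for a, b in product(tables(left), tables(right)):
        rows *= SEL.get(frozenset((a, b)), 1)
    return rows, l_written + r_written + rows


def intermediate_rows(plan):
    """Rows written by all joins except the final one (plan must contain a join)."""
    rows, written = estimate(plan)
    return written - rows


left_deep = ((("A", "B"), "C"), "D")   # intermediates: A⋈B and A⋈B⋈C
bushy = (("A", "B"), ("C", "D"))       # intermediates: A⋈B and C⋈D

print(intermediate_rows(left_deep), intermediate_rows(bushy))
```

With these invented numbers the left-deep plan writes fewer intermediate rows; other statistics can reverse the ranking, which is why the choice is driven by a cost model rather than a fixed shape.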
Cost Model
• Cost of MapReduce
▫ The Map phase can be divided into three subphases:
Map, Spill and Merge
▫ The Reduce phase also has three parts: Shuffle,
Merge and Reduce
• Map
▫ For each Mapper:
Mapper Cost Model
• Spill
• Merge
• Different from normal MapReduce jobs, in Hive,
the internal logic of mappers may vary
depending on the specific table to be processed.
Reduce
• In the reduce phase, Shuffle is responsible for
fetching the mappers' outputs to their corresponding
reducers
• Merge
• Reduce
• Total Cost
Cost of Operators in Map and Reduce
• In order to calculate the costs, a few sample
queries based on TPC-H are designed as probes
to collect the execution time of operators
• Given a chain of n operators, the cost is
evaluated as:
Cost of Workflow
• A HiveQL query is ultimately compiled into a
MapReduce workflow (a directed acyclic graph),
where each node is a single MapReduce job and
each edge represents dataflow
Experiments
• Evaluate the correctness and efficiency of
QMapper
• Two measures of efficiency: translating SQL into
HiveQL, and executing the resulting HiveQL,
comparing QMapper with manually translated
work
• TPC-H demonstrates the execution efficiency
of the HiveQL generated by QMapper
• The Smart Grid application shows the correctness
and translation efficiency of QMapper
Join Performance
Scalability
Accuracy of Cost Model