Apache Tez: Accelerating Hadoop Data Processing
Bikas Saha
@bikassaha
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework
for Hadoop.
• Open source Apache project and Apache licensed.
Tez – Hadoop 1 ---> Hadoop 2
Hadoop 1 (Monolithic)
• Resource Management – MR
• Execution Engine – MR

Hadoop 2 (Layered)
• Resource Management – YARN
• Engines – Hive, Pig, Cascading, Your App!
Tez – Empowering Applications
• Tez solves the hard problems of running in a distributed Hadoop environment
• Applications can focus on solving their domain specific problems
• This design is what allows Tez to serve as a platform for a variety of applications
App
• Custom application logic
• Custom data format
• Custom data transfer technology
Tez
• Distributed parallel execution
• Negotiating resources from the Hadoop framework
• Fault tolerance and recovery
• Horizontal scalability
• Resource elasticity
• Shared library of ready-to-use components
• Built-in performance optimizations
• Security
Tez – End User Benefits
• Better performance of applications
• Built-in performance + application-defined optimizations
• Better predictability of results
• Minimization of overheads and queuing delays
• Better utilization of compute capacity
• Efficient use of allocated resources
• Reduced load on distributed filesystem (HDFS)
• Reduce unnecessary replicated writes
• Reduced network usage
• Better locality and data transfer using new data patterns
• Higher application developer productivity
• Focus on application business logic rather than Hadoop internals
Tez – Design considerations
Look to the Future with an eye on the Past
Don’t solve problems that have already been solved. Or else
you will have to solve them again!
• Leverage discrete task based compute model for elasticity, scalability
and fault tolerance
• Leverage the several man-years of work in Hadoop MapReduce data
shuffling operations
• Leverage proven resource sharing and multi-tenancy model for Hadoop
and YARN
• Leverage built-in security mechanisms in Hadoop for privacy and
isolation
Tez – Problems that it addresses
• Expressing the computation
• Direct and elegant representation of the data processing flow
• Interfacing with application code and new technologies
• Performance
• Late Binding : Make decisions as late as possible, using real data observed
at runtime
• Leverage the resources of the cluster efficiently
• Just work out of the box!
• Customizable to let applications tailor the job to meet their specific
requirements
• Operation simplicity
• Painless to operate, experiment and upgrade
Tez – Simplifying Operations
• No deployments to do. No side effects. Easy and safe to try it out!
• Tez is a completely client side application.
• Simply upload to any accessible FileSystem and change local Tez
configuration to point to that.
• Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
• Leverages YARN local resources.
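For example, pointing a client at an uploaded Tez tarball is a one-property change; a minimal sketch assuming the Tez 0.5 client API, with the HDFS path chosen for illustration:

  // Upload once: hadoop fs -put tez-0.5.0.tar.gz /apps/tez/
  TezConfiguration tezConf = new TezConfiguration();
  // tez.lib.uris tells YARN where to localize the Tez jars from
  tezConf.set(TezConfiguration.TEZ_LIB_URIS, "hdfs:///apps/tez/tez-0.5.0.tar.gz");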
[Diagram: Tez Lib 1 and Tez Lib 2 stored on HDFS; each TezClient on a client machine points at one version, and the TezTasks launched on the Node Managers localize that version via YARN local resources]
Tez – Expressing the computation
Distributed data processing jobs typically look like DAGs (Directed Acyclic
Graph).
• Vertices in the graph represent data transformations
• Edges represent data movement from producers to consumers
[Diagram: Distributed Sort as a DAG – Preprocessor Stage tasks (Task-1, Task-2) send Samples to a Sampler, which sends partition Ranges to the Partition Stage tasks (Task-1, Task-2); their output feeds the Aggregate Stage tasks (Task-1, Task-2)]
Tez – Expressing the computation
Tez provides the following APIs to define the processing
• DAG API
• Defines the structure of the data processing and the relationship
between producers and consumers
• Enable definition of complex data flow pipelines using simple graph
connection APIs. Tez expands the logical DAG at runtime
• This is how the connection of tasks in the job gets specified
• Runtime API
• Defines the interfaces using which the framework and app code interact
with each other
• App code transforms data and moves it between tasks
• This is how we specify what actually executes in each task on the cluster
nodes
Tez – DAG API
Defines the global processing flow
// Define DAG
DAG dag = DAG.create("dag");

// Define Vertex
Vertex Map1 = Vertex.create("Map1",
    ProcessorDescriptor.create(Processor.class.getName()));

// Define Edge
Edge edge = Edge.create(Map1, Reduce1, EdgeProperty.create(
    SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    OutputDescriptor.create(Output.class.getName()),
    InputDescriptor.create(Input.class.getName())));

// Connect them
dag.addVertex(Map1).addEdge(edge)…

[Diagram: Map1 → Reduce1 and Map2 → Reduce2 over Scatter-Gather edges; Reduce1 and Reduce2 → Join over Scatter-Gather edges]
Tez – DAG API
Edge properties define the connection between producer and
consumer tasks in the DAG
Data movement – Defines routing of data between
tasks
• One-To-One : Data from the ith producer task routes to
the ith consumer task.
• Broadcast : Data from a producer task routes to all
consumer tasks.
• Scatter-Gather : Producer tasks scatter data into shards
and consumer tasks gather the data. The ith shard from
all producer tasks routes to the ith consumer task.
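As an illustration, this is how a Scatter-Gather edge is declared with the 0.5 API; the sorted Output/Input pairing comes from the Tez runtime library, and the vertex variables are assumed to be defined as on the previous slide:

  Edge shuffle = Edge.create(Map1, Reduce1,
      EdgeProperty.create(
          DataMovementType.SCATTER_GATHER, // shard i of every producer goes to consumer i
          DataSourceType.PERSISTED,        // outputs outlive the producer task, for fault tolerance
          SchedulingType.SEQUENTIAL,       // consumers are scheduled after producers
          OutputDescriptor.create(OrderedPartitionedKVOutput.class.getName()),
          InputDescriptor.create(OrderedGroupedKVInput.class.getName())));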
Tez – Logical DAG expansion at Runtime
[Diagram: each logical vertex (Map1, Map2, Reduce1, Reduce2, Join) expands into multiple parallel tasks at runtime, with the edges fanned out between the task instances]
Tez – Runtime API
Flexible Inputs-Processor-Outputs Model
• Thin API layer to wrap around arbitrary application code
• Compose inputs, processor and outputs to execute
arbitrary processing
• Applications decide logical data format and data
transfer technology
• Customize for performance
• Built-in implementations for Hadoop 2.0 data services –
HDFS and YARN ShuffleService. Built on the same API.
Your implementations are as first class as ours!
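A skeleton of a custom processor against this API; a sketch, where the class name and empty method bodies are illustrative and the base class comes from org.apache.tez.runtime.api:

  import java.util.List;
  import java.util.Map;
  import org.apache.tez.runtime.api.*;

  public class MyProcessor extends AbstractLogicalIOProcessor {
    public MyProcessor(ProcessorContext context) { super(context); }

    @Override
    public void initialize() throws Exception { }

    @Override
    public void run(Map<String, LogicalInput> inputs,
                    Map<String, LogicalOutput> outputs) throws Exception {
      // read from inputs, apply application logic, write to outputs
    }

    @Override
    public void handleEvents(List<Event> processorEvents) { }

    @Override
    public void close() throws Exception { }
  }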
Tez – Task Composition
Logical DAG : V-A feeds V-B over Edge AB, and V-C over Edge AC

V-A = { Processor-A.class }
V-B = { Processor-B.class }
V-C = { Processor-C.class }
Edge AB = { V-A, V-B, Output-1.class, Input-2.class }
Edge AC = { V-A, V-C, Output-3.class, Input-4.class }

[Diagram: Task A runs Processor-A writing to Output-1 and Output-3; Task B reads Input-2 into Processor-B; Task C reads Input-4 into Processor-C]
Tez – Library of Inputs and Outputs
Classical ‘Map’ : HDFS Input → Map Processor → Sorted Output
Classical ‘Reduce’ : Shuffle Input → Reduce Processor → HDFS Output
Intermediate ‘Reduce’ for Map-Reduce-Reduce : Shuffle Input → Reduce Processor → Sorted Output

• What is built in?
– Hadoop InputFormat/OutputFormat
– SortedGroupedPartitioned Key-Value Input/Output
– UnsortedGroupedPartitioned Key-Value Input/Output
– Key-Value Input/Output
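For instance, the Hadoop InputFormat support is exposed through the MRInput helper; a minimal sketch, with the vertex, path, and input name chosen for illustration:

  DataSourceDescriptor source = MRInput.createConfigBuilder(
      new Configuration(), TextInputFormat.class, "/data/in").build();
  // attach HDFS as a data source of the map vertex
  Map1.addDataSource("MRInput", source);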
Tez – Composable Task Model
Adopt : HDFS Input + Remote File Server Input → Hive Processor → HDFS Output + Local Disk Output
Evolve : HDFS Input + Remote File Server Input → Custom Processor → HDFS Output + Local Disk Output
Optimize : RDMA Input + Native DB Input → Custom Processor → Kafka Pub-Sub Output + Amazon S3 Output
Tez – Performance
• Benefits of expressing the data processing as a DAG
• Reducing overheads and queuing effects
• Gives system the global picture for better planning
• Efficient use of resources
• Re-use resources to maximize utilization
• Pre-launch, pre-warm and cache
• Locality & resource aware scheduling
• Support for application defined DAG modifications at runtime
for optimized execution
– Change task concurrency
– Change task scheduling
– Change DAG edges
– Change DAG vertices
Tez – Benefits of DAG execution
Faster Execution and Higher Predictability
• Eliminate replicated write barrier between successive computations.
• Eliminate job launch overhead of workflow jobs.
• Eliminate extra stage of map reads in every workflow job.
• Eliminate queue and resource contention suffered by workflow jobs that
are started after a predecessor job completes.
• Better locality because the engine has the global picture
[Diagram: the same Pig/Hive workflow as a chain of MR jobs vs. a single Tez DAG]
Tez – Container Re-Use
• Reuse YARN containers/JVMs to launch new tasks
• Reduce scheduling and launching delays
• Shared in-memory data across tasks
• JVM JIT friendly execution
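Re-use is switched on via AM configuration; a minimal sketch using the TezConfiguration constant, with the surrounding setup assumed:

  // let the AM hold on to finished containers and run new tasks in them
  tezConf.setBoolean(TezConfiguration.TEZ_AM_CONTAINER_REUSE_ENABLED, true);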
[Diagram: the Tez Application Master starts a task in a YARN container; on Task Done the same container/JVM is given the next task (TezTask1, TezTask2), with Shared Objects kept in the JVM]
Tez – Sessions
[Diagram: on Start Session the Client connects to an Application Master holding a Task Scheduler, a Container Pool of pre-warmed JVMs, and a Shared Object Registry; DAGs are then submitted to the same AM]
• Standard concepts of pre-launch
and pre-warm applied
• Key for interactive queries
• Represents a connection between
the user and the cluster
• Multiple DAGs executed in the
same session
• Containers re-used across queries
• Takes care of data locality and
releasing resources when idle
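A sketch of session usage with the 0.5 client API; buildQueries() is a hypothetical helper standing in for DAG construction:

  TezConfiguration tezConf = new TezConfiguration();
  tezConf.setBoolean(TezConfiguration.TEZ_AM_SESSION_MODE, true);
  TezClient tezClient = TezClient.create("InteractiveSession", tezConf);
  tezClient.start();
  try {
    for (DAG dag : buildQueries()) {        // multiple DAGs share one AM
      DAGClient dagClient = tezClient.submitDAG(dag);
      dagClient.waitForCompletion();        // containers re-used across DAGs
    }
  } finally {
    tezClient.stop();
  }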
Tez – Re-Use in Action

[Chart: Task Execution Timeline – containers (y-axis) over time (x-axis)]
Tez – Customizable Core Engine
• Vertex Manager
– Determines task parallelism
– Determines when tasks in a vertex can start
• DAG Scheduler
– Determines the priority of tasks
• Task Scheduler
– Allocates containers from YARN and assigns them to tasks

[Diagram: starting Vertex-1 and Vertex-2 goes through their Vertex Managers; the DAG Scheduler answers Get Priority requests and the Task Scheduler answers Get container requests before tasks start]
Tez – Event Based Control Plane
• Events used to communicate
between the tasks and between task
and framework
• Data Movement Event used by
producer task to inform the
consumer task about data location,
size etc.
• Input Error event sent by task to the
engine to inform about errors in
reading input. The engine then takes
action by re-generating the input
• Other events to send task completion
notification, data statistics and other
control plane information
[Diagram: Map Task 1 and Map Task 2 each produce Output1–Output3 on a Scatter-Gather edge; a Data Event is routed through the AM Router to Input1/Input2 of Reduce Task 2, and an Error Event flows back from the reduce input]
Tez – Automatic Reduce Parallelism
Event Model : Map tasks send data statistics events to the Reduce Vertex Manager.

Vertex Manager : Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.

[Diagram: the Map Vertex sends Data Size Statistics to the Reduce Vertex's Vertex Manager, which calls Set Parallelism on the Vertex State Machine in the App Master; surplus tasks are cancelled (Cancel Task) and pending outputs re-routed (Re-Route)]
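The built-in ShuffleVertexManager implements this pattern; a hedged sketch of the relevant properties applied to a Hadoop Configuration (conf), with illustrative values:

  // allow the vertex manager to lower reduce parallelism from observed output sizes
  conf.setBoolean("tez.shuffle-vertex-manager.enable.auto-parallel", true);
  // desired shuffle input per reduce task after the adjustment
  conf.setLong("tez.shuffle-vertex-manager.desired-task-input-size", 256L * 1024 * 1024);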
Tez – Reduce Slow Start/Pre-launch
Event Model : Map completion events are sent to the Reduce Vertex Manager.

Vertex Manager : Pluggable application logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed, so that shuffle can start.

[Diagram: Task Completed events from the Map Vertex reach the Reduce Vertex's Vertex Manager, which tells the Vertex State Machine in the App Master to Start Tasks on the Reduce Vertex early]
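The same built-in manager drives slow start through source-completion fractions; a sketch with illustrative values:

  // begin launching reducers when 20% of maps are done; all launched by 40%
  conf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.2f);
  conf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.4f);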
Tez – Theory to Practice
• Performance
• Scalability
Tez – Hive TPC-DS Scale 200GB latency
Tez – Pig performance gains
• Demonstrates performance gains from a basic translation to a
Tez DAG
• Deeper integration in the works for further improvements

[Chart: runtime in seconds, MR vs. Tez – Replicated Join (2.8x speedup), Join + Groupby (1.5x), Join + Groupby + Orderby (1.5x), 3-way Split + Join + Groupby + Orderby (2.6x)]
Tez – Observations on Performance
• Number of stages in the DAG
– The more stages in the DAG, the better Tez performs (over MR).
• Cluster/queue capacity
– The more congested a queue is, the better Tez performs (over MR), due to
container re-use.
• Size of intermediate output
– The larger the intermediate output, the better Tez performs (over MR), due
to reduced HDFS usage.
• Size of data in the job
– For smaller data and more stages, Tez performs better (over MR) because
launch overhead is a larger fraction of total runtime for small jobs.
• Offload work to the cluster
– Move as much work as possible to the cluster by modelling it via the job
DAG. Exploit the parallelism and resources of the cluster. E.g. MR split
calculation.
• Vertex caching
– The more re-computation can be avoided, the better the performance.
Tez – Data at scale
Hive TPC-DS Scale 10TB
Tez – DAG definition at scale
Hive : TPC-DS Query 64 Logical DAG
Tez – Container Reuse at Scale
78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4)
Tez – Real World Use Cases for the API
Tez – Broadcast Edge
SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk from store_sales
group by ss_item_sk) ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
Hive : Broadcast Join

Hive – MR : [Diagram: job 1 scans store_sales and does the group-by and aggregation (maps → reduces), writing to HDFS; job 2 scans inventory together with the aggregated store_sales output and performs a shuffle join (maps → reduces) through HDFS]

Hive – Tez : [Diagram: the store_sales scan with group-by and aggregation reduces the size of its input and sends its reducer output over a Broadcast edge directly to the inventory scan and join tasks – no HDFS hop in between]
Tez – Multiple Outputs
f = LOAD ‘foo’ AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;
Pig : Split & Group-by

Pig – MR : [Diagram: Load foo, then split multiplex/de-multiplex to run Group by y and Group by z through HDFS; a second job loads g1 and g2 and joins them]

Pig – Tez : [Diagram: Load foo writes multiple outputs that feed Group by y and Group by z directly; the Join then runs as reduce-follows-reduce within the same DAG]
Tez – Multiple Outputs
FROM (SELECT * FROM store_sales, date_dim WHERE ss_sold_date_sk = d_date_sk and
d_year = 2000)
INSERT INTO TABLE t1 SELECT distinct ss_item_sk
INSERT INTO TABLE t2 SELECT distinct ss_customer_sk;
Hive : Multi-insert queries

Hive – MR : [Diagram: a map join of date_dim and store_sales is materialized on HDFS, then two separate MR jobs compute the distinct item and customer keys]

Hive – Tez : [Diagram: a single DAG – a broadcast join (scan date_dim, join store_sales) feeds the distinct computations for customers and items directly]
Tez – One to One Edge
Pig : Skewed Join
l = LOAD ‘left’ AS (x, y);
r = LOAD ‘right’ AS (x, z);
j = JOIN l BY x, r BY x USING ‘skewed’;
Pig – MR : [Diagram: a Load & Sample job writes to HDFS; an Aggregate job builds the sample map and stages it on the distributed cache; Partition L and Partition R jobs then feed the Join]

Pig – Tez : [Diagram: Sample L feeds Aggregate, which broadcasts the sample map; the sampled input is passed through to Partition L via a 1-1 edge, and Partition L and Partition R feed the Join in the same DAG]
Tez – Custom Edge
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
Hive : Dynamically Partitioned Hash Join

Hive – MR : [Diagram: the inventory scan runs as a single local map task; its hash table is read as a side file by the store_sales scan and join job, through HDFS]

Hive – Tez : [Diagram: the inventory scan runs on the cluster, potentially with more than one mapper; a custom edge routes the outputs of the previous stage to the correct mappers of the next stage, so the custom join vertex reads both inputs directly – no side-file reads]
Tez – Bringing it all together
TPC-DS Query-27 with Hive on Tez
• Tez Session populates the container pool
• Dimension table calculation and HDFS split generation run in parallel
• Dimension tables are broadcast to the Hive MapJoin tasks
• The final reducer is pre-launched and fetches completed inputs
Tez – Bridging the Data Spectrum
• Broadcast join for small data sets
• Typical pattern in a TPC-DS query : a chain of broadcast joins of the fact
table with several dimension tables, feeding a shuffle join
• Based on data size, the query optimizer can run either plan as a single Tez job

[Diagram: Fact Table + Dimension Table 1 → Broadcast Join → Result Table 1; further Broadcast Joins with Dimension Tables 2 and 3, then a Shuffle Join, produce Result Tables 2 and 3]
Tez – Current status
• Apache Project
– Growing community of contributors and users
– Latest release is 0.5.0
• Focus on stability
– Testing and quality are highest priority
– Code ready and deployed on multi-node environments at scale
• Support for a vast variety of DAG topologies
– Already functionally equivalent to Map Reduce. Existing Map Reduce
jobs can be executed on Tez with few or no changes
– Rapid adoption by important open source projects and by vendors
Tez – Developer Release
• API stability – Intuitive and clean APIs that will be supported
and be backwards compatible with future releases
• Local Mode Execution – Allows the application to run entirely
in the same JVM without any need for a cluster. Enables easy
debugging using IDEs on the developer machine
• Performance Debugging – Adds swim-lanes analysis tool to
visualize job execution and help identify performance
bottlenecks
• Documentation and tutorials to learn the APIs
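A minimal sketch of turning local mode on, assuming the 0.5 client API:

  TezConfiguration tezConf = new TezConfiguration();
  tezConf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, true); // run in this JVM
  tezConf.set("fs.defaultFS", "file:///");                   // use the local filesystem
  TezClient tezClient = TezClient.create("LocalDebug", tezConf);
  tezClient.start();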
Tez – Adoption
• Apache Hive
• Hadoop standard for declarative access via SQL-like interface (Hive 13)
• Apache Pig
• Hadoop standard for procedural scripting and pipeline processing (Pig 14)
• Cascading
• Developer friendly Java API and SDK
• Scalding (Scala API on Cascading) (3.0)
• Commercial Vendors
• ETL : Use Tez instead of MR or custom pipelines
• Analytics Vendors : Use Tez as a target platform for scaling parallel
analytical tools to large data-sets
Tez – Adoption Path
Pre-requisite : Hadoop 2 with YARN
Tez has zero deployment pain. No side effects or traces left
behind on your cluster. Low risk and low effort to try out.
• Using Hive, Pig, Cascading, Scalding
• Try them with Tez as execution engine
• Already have MapReduce based pipeline
• Use configuration to change MapReduce to run on Tez by setting
‘mapreduce.framework.name’ to ‘yarn-tez’ in mapred-site.xml (see the
snippet after this list)
• Consolidate MR workflow into MR-DAG on Tez
• Change MR-DAG to use more efficient Tez constructs
• Have custom pipeline
• Wrap custom code/scripts into Tez inputs-processor-outputs
• Translate your custom pipeline topology into Tez DAG
• Change custom topology to use more efficient Tez constructs
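The MapReduce-on-Tez switch from the list above is a single property in mapred-site.xml:

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
  </property>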
Tez – Roadmap
• Richer DAG support
– Addition of vertices at runtime
– Shared edges for shared outputs
– Enhance Input/Output library
• Performance optimizations
– Improve support for high concurrency
– Improve locality aware scheduling
– Add framework level data statistics
– HDFS memory storage integration
• Usability
– Stability and testability
– UI integration with YARN
– Tools for performance analysis and debugging
Tez – Community
• Early adopters and code contributors welcome
– Adopters to drive more scenarios. Contributors to make them happen.
• Tez meetup for developers and users
– http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/tez
– Developer list: [email protected]
– User list: [email protected]
– Issues list: [email protected]
Tez – Takeaways
• Distributed execution framework that models processing as
dataflow graphs
• Customizable execution architecture designed for extensibility
and user defined performance optimizations
• Works out of the box with the platform figuring out the hard
stuff
• Zero install – Safe and Easy to Evaluate and Experiment
• Addresses common needs of a broad spectrum of applications
and usage patterns
• Open source Apache project – your use-cases and code are
welcome
• It works and is already being used by Apache Hive and Pig
Tez
Thanks for your time and attention!
Video with Deep Dive on Tez
http://goo.gl/BL67o7
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=b4382d626867fe57de70b218c74c03e4
– WordCount code walk-through with 0.5 APIs. Debugging in local mode.
Execution and profiling in cluster mode. (Starts at ~27 min. Ends at ~37 min)
Questions?
@bikassaha