Apache Tez: Accelerating Hadoop Data Processing
Bikas Saha (@bikassaha)
Hortonworks Inc., 2013

Tez – Introduction
• Distributed execution framework targeted at data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache project, Apache licensed.

Tez – Hadoop 1 → Hadoop 2
• Hadoop 1 (monolithic): resource management and the execution engine are both MapReduce.
• Hadoop 2 (layered): resource management is handled by YARN, with engines such as Hive, Pig, Cascading, and your own app running on top.

Tez – Empowering Applications
• Tez solves the hard problems of running on a distributed Hadoop environment, so apps can focus on solving their domain-specific problems.
• This split of responsibilities is what makes Tez a platform for a variety of applications.
• The application brings: custom application logic, custom data formats, custom data transfer technology.
• Tez brings: distributed parallel execution, negotiating resources from the Hadoop framework, fault tolerance and recovery, horizontal scalability, resource elasticity, a shared library of ready-to-use components, built-in performance optimizations, and security.

Tez – End User Benefits
• Better performance of applications
– Built-in plus application-defined optimizations
• Better predictability of results
– Minimization of overheads and queuing delays
• Better utilization of compute capacity
– Efficient use of allocated resources
• Reduced load on the distributed filesystem (HDFS)
– Fewer unnecessary replicated writes
• Reduced network usage
– Better locality and data transfer using new data patterns
• Higher application developer productivity
– Focus on application business logic rather than Hadoop internals

Tez – Design considerations
Look to the future with an eye on the past: don't solve problems that have already been solved, or else you will have to solve them again!
• Leverage the discrete, task-based compute model for elasticity, scalability and fault tolerance
• Leverage several man-years of work on Hadoop MapReduce data shuffling operations
• Leverage the proven resource sharing and multi-tenancy model of Hadoop and YARN
• Leverage the built-in security mechanisms in Hadoop for privacy and isolation

Tez – Problems that it addresses
• Expressing the computation
– Direct and elegant representation of the data processing flow
– Interfacing with application code and new technologies
• Performance
– Late binding: make decisions as late as possible, using real data available at runtime
– Leverage the resources of the cluster efficiently
– Just work out of the box!
– Customizable, so applications can tailor the job to their specific requirements
• Operational simplicity
– Painless to operate, experiment with, and upgrade

Tez – Simplifying Operations
• No deployments to do. No side effects. Easy and safe to try it out!
• Tez is a completely client-side application.
• Simply upload the Tez libraries to any accessible filesystem and change the local Tez configuration to point to them.
• Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
• Leverages YARN local resources.
[Diagram: Tez Lib 1 and Tez Lib 2 sit on HDFS; the TezClient on each client machine points the TezTasks launched by the Node Managers at whichever library version it has configured.]
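To make the client-side model concrete, the "configuration change" amounts to pointing the Tez client at the uploaded libraries. A minimal Java sketch, assuming a hypothetical HDFS path for the uploaded tarball; TezConfiguration.TEZ_LIB_URIS is the relevant property key:

    import org.apache.tez.client.TezClient;
    import org.apache.tez.dag.api.TezConfiguration;

    public class TezClientSetup {
      public static TezClient createStartedClient() throws Exception {
        TezConfiguration tezConf = new TezConfiguration();
        // Point the client at the Tez libraries previously uploaded to HDFS.
        // Switching this URI is all it takes to run a different Tez version.
        tezConf.set(TezConfiguration.TEZ_LIB_URIS,
            "hdfs:///apps/tez-0.5.0/tez.tar.gz"); // hypothetical upload location
        TezClient tezClient = TezClient.create("MyTezApp", tezConf);
        tezClient.start(); // YARN localizes the libraries as local resources
        return tezClient;
      }
    }

Running two versions side by side is then just two client configurations with different URIs; nothing is installed on the cluster nodes themselves.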
Tez – Expressing the computation
Distributed data processing jobs typically look like DAGs (directed acyclic graphs).
• Vertices in the graph represent data transformations.
• Edges represent data movement from producers to consumers.
[Diagram: a distributed sort as a DAG; preprocessor-stage tasks feed a sampler that produces range samples, partition-stage tasks repartition the data, and aggregate-stage tasks produce the sorted output.]

Tez – Expressing the computation
Tez provides the following APIs to define the processing.
• DAG API
– Defines the structure of the data processing and the relationship between producers and consumers.
– Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime.
– This is how the connection of tasks in the job gets specified.
• Runtime API
– Defines the interfaces through which the framework and application code interact with each other.
– Application code transforms data and moves it between tasks.
– This is how we specify what actually executes in each task on the cluster nodes.

Tez – DAG API
Defines the global processing flow (simplified; a fuller sketch against the 0.5 API appears after the Library of Inputs and Outputs slide below):
// Define DAG
DAG dag = DAG.create();
// Define Vertex
Vertex Map1 = Vertex.create(Processor.class);
// Define Edge
Edge edge = Edge.create(Map1, Reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
// Connect them
dag.addVertex(Map1).addEdge(edge)…
[Diagram: Map1 and Map2 feed Reduce1 and Reduce2 over Scatter-Gather edges, and Reduce1 and Reduce2 feed a Join vertex over Scatter-Gather edges.]

Tez – DAG API
Edge properties define the connection between producer and consumer tasks in the DAG. The data movement property defines the routing of data between tasks:
• One-To-One: data from the i-th producer task routes to the i-th consumer task.
• Broadcast: data from a producer task routes to all consumer tasks.
• Scatter-Gather: producer tasks scatter data into shards and consumer tasks gather the data; the i-th shard from all producer tasks routes to the i-th consumer task.

Tez – Logical DAG expansion at Runtime
[Diagram: the logical vertices Map1, Map2, Reduce1, Reduce2 and Join expand into their parallel tasks at runtime, with edges connecting the corresponding tasks.]

Tez – Runtime API
Flexible Input-Processor-Output model.
• Thin API layer to wrap around arbitrary application code.
• Compose inputs, a processor and outputs to execute arbitrary processing.
• Applications decide the logical data format and data transfer technology.
• Customize for performance.
• Built-in implementations for the Hadoop 2.0 data services (HDFS and the YARN Shuffle Service) are built on the same API. Your implementations are as first class as ours!

Tez – Task Composition
[Diagram: a logical DAG with vertex V-A feeding V-B over Edge AB and V-C over Edge AC, and the physical tasks it composes into; Task A runs Processor-A with Output-1 and Output-3, Task B runs Input-2 and Processor-B, Task C runs Input-4 and Processor-C.]
V-A = { Processor-A.class }
V-B = { Processor-B.class }
V-C = { Processor-C.class }
Edge AB = { V-A, V-B, Output-1.class, Input-2.class }
Edge AC = { V-A, V-C, Output-3.class, Input-4.class }

Tez – Library of Inputs and Outputs
[Diagram: a classical 'Map' is HDFS Input → Map Processor → Sorted Output; a classical 'Reduce' is Shuffle Input → Reduce Processor → HDFS Output; an intermediate 'Reduce' for Map-Reduce-Reduce is Shuffle Input → Reduce Processor → Sorted Output.]
What is built in?
– Hadoop InputFormat/OutputFormat
– SortedGroupedPartitioned Key-Value Input/Output
– UnsortedGroupedPartitioned Key-Value Input/Output
– Key-Value Input/Output
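Putting the DAG API, the edge properties and the built-in library together, a two-vertex scatter-gather pipeline could be assembled roughly as below. This is a sketch against the 0.5-style Java API; the processor class names are hypothetical, and a real job would also attach configuration payloads (key/value types, partitioner) to the input/output descriptors via the runtime library's edge-config helpers.

    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.Edge;
    import org.apache.tez.dag.api.EdgeProperty;
    import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
    import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
    import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
    import org.apache.tez.dag.api.InputDescriptor;
    import org.apache.tez.dag.api.OutputDescriptor;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;

    public class ScatterGatherDag {
      public static DAG build() {
        DAG dag = DAG.create("WordCount");

        // Hypothetical application processors implementing the Runtime API.
        Vertex tokenizer = Vertex.create("Tokenizer",
            ProcessorDescriptor.create("com.example.TokenProcessor"), 10);
        Vertex summation = Vertex.create("Summation",
            ProcessorDescriptor.create("com.example.SumProcessor"), 2);

        // Scatter-gather edge wired with the built-in sorted/partitioned
        // key-value output and the matching grouped key-value input.
        EdgeProperty shuffle = EdgeProperty.create(
            DataMovementType.SCATTER_GATHER,
            DataSourceType.PERSISTED,
            SchedulingType.SEQUENTIAL,
            OutputDescriptor.create(
                "org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput"),
            InputDescriptor.create(
                "org.apache.tez.runtime.library.input.OrderedGroupedKVInput"));

        return dag.addVertex(tokenizer)
                  .addVertex(summation)
                  .addEdge(Edge.create(tokenizer, summation, shuffle));
      }
    }

Submitting the DAG is then a single call on a started TezClient (tezClient.submitDAG(dag)).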
Tez – Composable Task Model
• Adopt: HDFS Input + Remote File Server Input → Hive Processor → HDFS Output + Local Disk Output
• Evolve: HDFS Input + Remote File Server Input → Custom Processor → HDFS Output + Local Disk Output
• Optimize: RDMA Input + Native DB Input → Custom Processor → Kafka Pub-Sub Output + Amazon S3 Output

Tez – Performance
• Benefits of expressing the data processing as a DAG
– Reduces overheads and queuing effects
– Gives the system the global picture for better planning
• Efficient use of resources
– Re-use resources to maximize utilization
– Pre-launch, pre-warm and cache
– Locality- and resource-aware scheduling
• Support for application-defined DAG modifications at runtime for optimized execution
– Change task concurrency
– Change task scheduling
– Change DAG edges
– Change DAG vertices

Tez – Benefits of DAG execution
Faster execution and higher predictability:
• Eliminates the replicated write barrier between successive computations.
• Eliminates the job launch overhead of workflow jobs.
• Eliminates the extra stage of map reads in every workflow job.
• Eliminates the queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
• Better locality, because the engine has the global picture.
[Diagram: the same Pig/Hive workflow as a chain of MapReduce jobs versus a single Tez DAG.]

Tez – Container Re-Use
• Re-use YARN containers/JVMs to launch new tasks
• Reduces scheduling and launching delays
• Shared in-memory data across tasks
• JVM JIT-friendly execution
[Diagram: when a TezTask reports "task done" to the Tez Application Master, the next task is started in the same YARN container/JVM, with shared objects kept across tasks.]

Tez – Sessions
• Standard concepts of pre-launch and pre-warm applied
• Key for interactive queries
• A session represents a connection between the user and the cluster
• Multiple DAGs are executed in the same session
• Containers are re-used across queries
• Takes care of data locality and of releasing resources when idle
• (A client-side sketch of session usage appears after the Event Based Control Plane slide below.)
[Diagram: the client starts a session and submits DAGs to a shared Application Master, whose task scheduler draws on a container pool of pre-warmed JVMs with a shared object registry.]

Tez – Re-Use in Action
[Chart: task execution timeline showing tasks from successive queries packed onto a fixed pool of containers over time.]

Tez – Customizable Core Engine
• Vertex Manager
– Determines task parallelism
– Determines when tasks in a vertex can start
• DAG Scheduler
– Determines the priority of a task
• Task Scheduler
– Allocates containers from YARN and assigns them to tasks
[Diagram: the Vertex Manager starts vertices and their tasks; each vertex gets its priority from the DAG Scheduler and its containers from the Task Scheduler.]

Tez – Event Based Control Plane
• Events are used to communicate between tasks, and between tasks and the framework.
• Data Movement Event: used by a producer task to inform the consumer task about data location, size, etc.
• Input Error Event: sent by a task to the engine to report errors in reading its input; the engine then takes action by re-generating that input.
• Other events carry task completion notifications, data statistics and other control plane information.
[Diagram: outputs of map tasks send Data Events through the AM router to the inputs of reduce tasks across a scatter-gather edge; an Error Event for a bad input flows back through the AM to the producer.]
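As referenced on the Sessions slide, session mode is driven from the client. A minimal sketch against the 0.5-style API, assuming the two DAGs are built elsewhere; the boolean argument to TezClient.create switches on session mode so that both DAGs share one Application Master and container pool:

    import org.apache.tez.client.TezClient;
    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.TezConfiguration;
    import org.apache.tez.dag.api.client.DAGClient;

    public class SessionExample {
      public static void runInteractive(TezConfiguration tezConf,
                                        DAG query1, DAG query2) throws Exception {
        // 'true' = session mode: one AM and container pool for all submitted DAGs.
        TezClient session = TezClient.create("InteractiveSession", tezConf, true);
        session.start();
        try {
          // Later DAGs reuse the containers warmed up by earlier ones.
          DAGClient first = session.submitDAG(query1);
          first.waitForCompletion();
          DAGClient second = session.submitDAG(query2);
          second.waitForCompletion();
        } finally {
          session.stop(); // releases the AM and the pooled containers
        }
      }
    }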
Tez – Automatic Reduce Parallelism
• Event model: map tasks send data statistics events to the Reduce vertex's Vertex Manager.
• Vertex Manager: pluggable application logic that understands the data statistics and can formulate the correct parallelism. It advises the vertex controller on parallelism.
[Diagram: data size statistics from the Map vertex reach the Vertex Manager, which tells the vertex state machine in the App Master to set the Reduce vertex's parallelism, re-route the data and cancel the now-unneeded tasks.]

Tez – Reduce Slow Start / Pre-launch
• Event model: map task completion events are sent to the Reduce vertex's Vertex Manager.
• Vertex Manager: pluggable application logic that understands the data size. It advises the vertex controller to launch the reducers before all maps have completed, so that the shuffle can start early.
[Diagram: task-completed events from the Map vertex reach the Vertex Manager, which asks the App Master's vertex state machine to start tasks of the Reduce vertex.]

Tez – Theory to Practice
• Performance
• Scalability

Tez – Hive TPC-DS Scale 200GB latency
[Chart: Hive query latency on TPC-DS at 200GB scale, MR versus Tez.]

Tez – Pig performance gains
• Demonstrates performance gains from a basic translation to a Tez DAG.
• Deeper integration is in the works for further improvements.
[Chart: running time in seconds, MR versus Tez – Replicated Join (2.8x), Join + Groupby (1.5x), Join + Groupby + Orderby (1.5x), 3-way Split + Join + Groupby + Orderby (2.6x).]

Tez – Observations on Performance
• Number of stages in the DAG
– The more stages in the DAG, the better Tez performs relative to MR.
• Cluster/queue capacity
– The more congested a queue is, the better Tez performs relative to MR, due to container reuse.
• Size of intermediate output
– The larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage.
• Size of data in the job
– For smaller data and more stages, Tez performs better relative to MR, because launch overhead is a larger fraction of total time for small jobs.
• Offload work to the cluster
– Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster, e.g. MR split calculation.
• Vertex caching
– The more re-computation can be avoided, the better the performance.

Tez – Data at scale
Hive TPC-DS at 10TB scale.

Tez – DAG definition at scale
Hive: TPC-DS Query 64 logical DAG.

Tez – Container Reuse at Scale
78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4).

Tez – Real World Use Cases for the API

Tez – Broadcast Edge
Hive: Broadcast Join
SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk
      from store_sales group by ss_item_sk) ss
JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
• Hive – MR: one MR job scans store_sales and does the group-by/aggregation, writing its output to HDFS; a second job scans inventory plus the aggregated store_sales output and performs a shuffle join, writing to HDFS again.
• Hive – Tez: the store_sales scan with group-by/aggregation reduces the size of that input and ships it over a broadcast edge directly to the inventory scan-and-join tasks, avoiding the intermediate HDFS write and the extra shuffle.
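In DAG API terms, the broadcast edge above is simply an EdgeProperty with BROADCAST data movement. A minimal sketch, assuming hypothetical processor classes for the aggregation and the hash join, and using the unsorted key-value output/input from the Tez runtime library (0.5 naming) as the transport:

    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.Edge;
    import org.apache.tez.dag.api.EdgeProperty;
    import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
    import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
    import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
    import org.apache.tez.dag.api.InputDescriptor;
    import org.apache.tez.dag.api.OutputDescriptor;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;

    public class BroadcastJoinDag {
      public static DAG build() {
        // Hypothetical processors: one aggregates store_sales, the other scans
        // inventory and builds the hash join against the broadcast data.
        Vertex aggregate = Vertex.create("StoreSalesAggregate",
            ProcessorDescriptor.create("com.example.AggregateProcessor"), 5);
        Vertex scanJoin = Vertex.create("InventoryScanJoin",
            ProcessorDescriptor.create("com.example.HashJoinProcessor"), 100);

        // BROADCAST: every scan-join task receives the complete aggregated
        // output, so no shuffle or intermediate HDFS write is needed.
        EdgeProperty broadcast = EdgeProperty.create(
            DataMovementType.BROADCAST,
            DataSourceType.PERSISTED,
            SchedulingType.SEQUENTIAL,
            OutputDescriptor.create(
                "org.apache.tez.runtime.library.output.UnorderedKVOutput"),
            InputDescriptor.create(
                "org.apache.tez.runtime.library.input.UnorderedKVInput"));

        return DAG.create("BroadcastJoin")
                  .addVertex(aggregate)
                  .addVertex(scanJoin)
                  .addEdge(Edge.create(aggregate, scanJoin, broadcast));
      }
    }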
Tez – Multiple Outputs
Pig: Split & Group-by
f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;
• Pig – MR: load foo, split (multiplex/de-multiplex) and write to HDFS; two jobs then group by y and by z, writing to HDFS again; a final job loads g1 and g2 and joins them.
• Pig – Tez: load foo once, with multiple outputs feeding the group-by-y and group-by-z vertices directly; the join follows the group-bys in the same DAG ("reduce follows reduce"), with no intermediate HDFS writes.

Tez – Multiple Outputs
Hive: Multi-insert queries
FROM (SELECT * FROM store_sales, date_dim
      WHERE ss_sold_date_sk = d_date_sk and d_year = 2000)
INSERT INTO TABLE t1 SELECT distinct ss_item_sk
INSERT INTO TABLE t2 SELECT distinct ss_customer_sk;
• Hive – MR: a map join of date_dim and store_sales materializes the join result on HDFS; two further MR jobs then compute the two distincts.
• Hive – Tez: a broadcast join (scan date_dim, join store_sales) feeds both distinct computations (items and customers) within a single DAG, writing only the final tables to HDFS.

Tez – One to One Edge
Pig: Skewed Join
l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x USING 'skewed';
• Pig – MR: load and sample L, write to HDFS, aggregate the sample, and stage the sample map on the distributed cache; separate jobs then partition L and partition R and perform the join.
• Pig – Tez: Sample L and Aggregate run in the same DAG; the sampled input is passed through via a 1-1 edge, the sample map is broadcast to the Partition L and Partition R vertices, and the join follows directly.

Tez – Custom Edge
Hive: Dynamically Partitioned Hash Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
• Hive – MR: the inventory scan runs as a single local map task; its hash table is read as a side file (via HDFS) by the store_sales scan-and-join tasks.
• Hive – Tez: the inventory scan runs on the cluster (potentially with more than one mapper), and a custom edge routes the outputs of that stage to the correct mappers of the next stage; a custom vertex reads both inputs, with no side-file reads.

Tez – Bringing it all together
TPC-DS Query 27 with Hive on Tez:
• The Tez session populates the container pool.
• Dimension table calculation and HDFS split generation run in parallel.
• Dimension tables are broadcast to the Hive MapJoin tasks.
• The final reducer is pre-launched and fetches completed inputs.

Tez – Bridging the Data Spectrum
• Broadcast join for small data sets; the shuffle join is the typical pattern in a TPC-DS query.
[Diagram: a fact table is joined to dimension tables 1 and 2 via successive broadcast joins producing result tables 1 and 2, and then to dimension table 3 via a shuffle join producing result table 3.]
• Based on data size, the query optimizer can run either plan as a single Tez job.

Tez – Current status
• Apache project
– Growing community of contributors and users
– Latest release is 0.5.0
• Focus on stability
– Testing and quality are the highest priority
– Code is ready and deployed on multi-node environments at scale
• Support for a vast topology of DAGs
– Already functionally equivalent to MapReduce; existing MapReduce jobs can be executed on Tez with few or no changes
– Rapid adoption by important open source projects and by vendors

Tez – Developer Release
• API stability
– Intuitive and clean APIs that will be supported and remain backwards compatible in future releases
• Local mode execution
– Allows the application to run entirely in the same JVM without any need for a cluster; enables easy debugging from an IDE on the developer machine
• Performance debugging
– Adds a swim-lane analysis tool to visualize job execution and help identify performance bottlenecks
• Documentation and tutorials to learn the APIs
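Local mode execution, mentioned above, is driven purely by configuration. A minimal sketch, assuming the DAG is built elsewhere; the property names follow the 0.5 release, but defaults such as the staging directory may need adjusting for your environment:

    import org.apache.tez.client.TezClient;
    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.TezConfiguration;

    public class LocalModeRunner {
      public static void runLocally(DAG dag) throws Exception {
        TezConfiguration tezConf = new TezConfiguration();
        // Run the AM and all tasks inside this JVM: no cluster required,
        // so breakpoints set in an IDE are hit directly.
        tezConf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, true);
        // Use the local filesystem instead of HDFS for staging and data.
        tezConf.set("fs.defaultFS", "file:///");

        TezClient tezClient = TezClient.create("LocalDebugRun", tezConf);
        tezClient.start();
        try {
          tezClient.submitDAG(dag).waitForCompletion();
        } finally {
          tezClient.stop();
        }
      }
    }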
Tez – Adoption
• Apache Hive
– The Hadoop standard for declarative access via a SQL-like interface (Hive 13)
• Apache Pig
– The Hadoop standard for procedural scripting and pipeline processing (Pig 14)
• Cascading
– Developer-friendly Java API and SDK
– Scalding (Scala API on Cascading) (3.0)
• Commercial vendors
– ETL: use Tez instead of MR or custom pipelines
– Analytics vendors: use Tez as a target platform for scaling parallel analytical tools to large data sets

Tez – Adoption Path
Pre-requisite: Hadoop 2 with YARN. Tez has zero deployment pain: no side effects or traces are left behind on your cluster, so it is low risk and low effort to try out.
• Using Hive, Pig, Cascading, Scalding
– Try them with Tez as the execution engine
• Already have a MapReduce-based pipeline
– Use configuration to run MapReduce on Tez by setting 'mapreduce.framework.name' to 'yarn-tez' in mapred-site.xml (a programmatic sketch of this switch appears after the final slide)
– Consolidate the MR workflow into an MR DAG on Tez
– Change the MR DAG to use more efficient Tez constructs
• Have a custom pipeline
– Wrap custom code/scripts into Tez Inputs, Processors and Outputs
– Translate your custom pipeline topology into a Tez DAG
– Change the custom topology to use more efficient Tez constructs

Tez – Roadmap
• Richer DAG support
– Addition of vertices at runtime
– Shared edges for shared outputs
– Enhance the Input/Output library
• Performance optimizations
– Improve support for high concurrency
– Improve locality-aware scheduling
– Add framework-level data statistics
– HDFS memory storage integration
• Usability
– Stability and testability
– UI integration with YARN
– Tools for performance analysis and debugging

Tez – Community
• Early adopters and code contributors welcome
– Adopters to drive more scenarios; contributors to make them happen
• Tez meetup for developers and users
– http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/tez
– Developer list: [email protected]
– User list: [email protected]
– Issues list: [email protected]

Tez – Takeaways
• Distributed execution framework that models processing as dataflow graphs
• Customizable execution architecture designed for extensibility and user-defined performance optimizations
• Works out of the box, with the platform figuring out the hard stuff
• Zero install – safe and easy to evaluate and experiment with
• Addresses the common needs of a broad spectrum of applications and usage patterns
• Open source Apache project – your use cases and code are welcome
• It works, and it is already being used by Apache Hive and Pig

Tez – Thanks for your time and attention!
Video with a deep dive on Tez:
– http://goo.gl/BL67o7
– https://hortonworks.webex.com/hortonworks/lsr.php?RCID=b4382d626867fe57de70b218c74c03e4
– WordCount code walk-through with the 0.5 APIs; debugging in local mode; execution and profiling in cluster mode (starts at ~27 min, ends at ~37 min).
Questions? @bikassaha
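As a closing note on the Adoption Path slide: the 'mapreduce.framework.name' = 'yarn-tez' switch can also be made programmatically instead of in mapred-site.xml. A minimal sketch, assuming the Tez client jars and configuration are on the job's classpath; the job name and paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RunExistingMrJobOnTez {
      public static boolean run() throws Exception {
        Configuration conf = new Configuration();
        // Same effect as setting this in mapred-site.xml: the unchanged
        // MapReduce job is translated onto Tez at submission time.
        conf.set("mapreduce.framework.name", "yarn-tez");

        Job job = Job.getInstance(conf, "existing-mr-job");
        job.setJarByClass(RunExistingMrJobOnTez.class);
        // ... mapper/reducer/key/value classes exactly as before ...
        FileInputFormat.addInputPath(job, new Path("/data/in"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/data/out")); // placeholder
        return job.waitForCompletion(true);
      }
    }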