Hadoop Ecosystem Overview
CMSC 491 Hadoop-Based Distributed Computing
Spring 2016
Adam Shook

Agenda
• Introduce Hadoop projects to prepare you for your group work
  – Intimate detail will be provided in future lectures
• Discuss potential use cases for each project

Topics
• HDFS
• MapReduce
• YARN
• Sqoop
• Flume
• NiFi
• Pig
• Hive
• Streaming
• HBase
• Accumulo
• Avro
• Parquet
• Mahout
• Oozie
• Storm
• ZooKeeper
• Spark
• SQL-on-Hadoop
• In-Memory Stores
• Cassandra
• Kafka
• Crunch
• Azkaban

HDFS
• Hadoop Distributed File System
  – High-performance file system for storing data
• We’ve talked about this enough

Hadoop MapReduce
• High-performance, fault-tolerant data processing system
• We’ve also talked about this enough

YARN
• Abstract framework for distributed application development
• Split functionality of JobTracker into two components
  – ResourceManager
  – ApplicationMaster
• TaskTracker becomes NodeManager
  – Containers instead of map and reduce slots
• Configurable amount of memory per NodeManager

MapReduce 2.x on YARN
• MapReduce API has not changed
  – Binary-level backwards compatible (no recompile)
• Application Master launches and monitors job via YARN
• MapReduce History Server to store… history
• Enabled Yahoo! to scale beyond 4,000 nodes

Hadoop Ecosystem
• Core Technologies
  – Hadoop Distributed File System
  – Hadoop MapReduce
• Many other tools…
  – Which we will be discussing… now

Apache Sqoop
• Apache project designed for efficient transfer of data between Apache Hadoop and structured data stores
• Used through a CLI; extensible
• Use cases?

Apache Flume
• Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data
• Agents are configured using simple files; extensible
• Use cases?

Apache NiFi
• A service to reliably move and manipulate files between clusters using a web front-end
• Uses a GUI to drop processors and connect them to build workflows
• Use cases?

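Returning to the YARN slide: it says containers replace fixed map and reduce slots, and that each NodeManager is configured with an amount of memory. As a rough illustration only, the Python sketch below packs container requests of different sizes into one node's memory budget. The function name `pack_containers` and the sizes used are hypothetical, and this is not the ResourceManager's actual scheduling algorithm, which also weighs things like CPU, queues, and data locality.

```python
def pack_containers(node_memory_mb, requests_mb):
    """Greedily grant container requests until the node's memory runs out.

    node_memory_mb: memory one NodeManager offers (hypothetical value).
    requests_mb: requested container sizes in MB, in arrival order.
    Returns (granted, pending) lists of request sizes.
    """
    free = node_memory_mb
    granted, pending = [], []
    for size in requests_mb:
        if size <= free:
            granted.append(size)
            free -= size
        else:
            pending.append(size)
    return granted, pending


if __name__ == "__main__":
    # One 8 GB node hosts a mix of container sizes rather than a fixed slot count.
    granted, pending = pack_containers(8192, [1024, 2048, 4096, 2048, 1024])
    print("granted:", granted)
    print("pending:", pending)
```

The point of the sketch is that the same node can run many small containers or a few large ones, which fixed map and reduce slots could not express.
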
Apache Pig
• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Infrastructure compiles the language to a sequence of MapReduce programs
• Use cases?

Apache Hive
• Data warehouse facilitating querying and managing large datasets
• Compiles SQL-like queries into MapReduce programs
• Use cases?

Hadoop Streaming
• Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer
• Just a jar file, not a real project
• Use cases?

Which high-level API is for you?
• What are you comfortable with?
• What are you being told to use?

Apache HBase
• Distributed, scalable, big data store
• Data stored as sorted key/value pairs, with the key consisting of a row and column
• Use cases?

Apache Accumulo
• Robust, scalable, high-performance data storage and retrieval key/value store
• Cell-based access controls
  – i.e. cell-level security
• Use cases?

Apache Avro
• Data serialization system for the Hadoop ecosystem
• Use cases?

Apache Parquet
• Columnar storage format for Hadoop
• Use cases?

Apache Mahout
• Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce
• Use cases?

Apache Oozie
• Workflow scheduler system to manage Apache Hadoop jobs
• Use cases?

Apache Storm
• Distributed real-time computation system
• Didn’t have a logo until June 2014
• How is this different from MapReduce?
• Use cases?

Apache ZooKeeper
• Effort to develop and maintain an open-source server enabling highly reliable distributed coordination
• Use cases?

Apache Spark
• Fast and general engine for large-scale data processing
• Write applications in Java, Scala, or Python
• Use cases?

SQL on Hadoop
• Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.
• SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store
• Use cases? Non use cases?

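Returning to the Hadoop Streaming slide: because any script can act as the mapper or reducer, a word count can be written in plain Python. The sketch below is a minimal illustration, assuming the usual Streaming convention of one tab-separated key/value pair per line. The names `map_words`, `reduce_counts`, and `run_locally` are hypothetical, and `run_locally` only imitates the sort between the map and reduce phases on one machine.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_lines):
    """Reducer: sum the counts for each word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map -> sort -> reduce in one process, for testing only."""
    return reduce_counts(sorted(map_words(lines)))


if __name__ == "__main__":
    for result in run_locally(["the cat", "The dog the"]):
        print(result)
```

In a real Streaming job the mapper and reducer would be separate scripts that read lines from standard input and print to standard output, with Hadoop performing the sort and shuffle between them.
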
Sample Architecture
[Diagram; labels: Webserver, Website, Sales, Call Center, Flume Agent (×3), HDFS, HBase, MapReduce, Pig, Storm, Oozie, SQL (×2)]

OTHER HADOOP PROJECTS
We [maybe] won’t be covering these in detail later on

Redis, Memcached, etc.
• Open-source in-memory key/value stores
• Use cases?

Apache Cassandra
• NoSQL database for managing large amounts of structured, semi-structured, and unstructured data
• Support for clusters spanning multiple datacenters
• Unlike HBase and Accumulo, data is not stored on HDFS
• Use cases? Non use cases?

Apache Crunch
• Java framework for writing, testing, and running MapReduce pipelines with a simple API
• Same code executes as a local job, as a MapReduce job, or as a streaming Spark job
• Use cases?
* Not the real logo, but truly fantastic

Apache Kafka
• High-throughput distributed publish-subscribe message service
• Use cases?

Azkaban
• Batch workflow job scheduler to run Hadoop jobs
• Use cases?

Review
• A lot of projects are available to you for your group project
• Think of a problem you are interested in, then choose the appropriate projects to solve it
• Keep in mind data ingest, storage, processing, and egress
• Feel free to explore and use projects other than the ones I have listed here
  – Get permission if you plan on using it as part of your project quota

References
• All those logos are the property of their owners
• *.apache.org
• redis.io