Hadoop Overview

Hadoop Ecosystem Overview
CMSC 491
Hadoop-Based Distributed Computing
Spring 2016
Adam Shook
Agenda
• Introduce Hadoop projects to prepare you for
your group work
– Intimate detail will be provided in future lectures
• Discuss potential use cases for each project
Topics
• HDFS
• MapReduce
• YARN
• Sqoop
• Flume
• NiFi
• Pig
• Hive
• Streaming
• HBase
• Accumulo
• Avro
• Parquet
• Mahout
• Oozie
• Storm
• ZooKeeper
• Spark
• SQL-on-Hadoop
• In-Memory Stores
• Cassandra
• Kafka
• Crunch
• Azkaban
HDFS
• Hadoop Distributed File System
– High-performance file system for storing data
• We’ve talked about this enough
Hadoop MapReduce
• High-performance, fault-tolerant data
processing system
• We’ve also talked about this enough
YARN
• Abstract framework for distributed application
development
• Splits the functionality of the JobTracker into
two components
– ResourceManager
– ApplicationMaster
• TaskTracker becomes NodeManager
– Containers instead of map and reduce slots
• Configurable amount of memory per
NodeManager
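As a rough illustration of containers replacing fixed map and reduce slots, the sketch below estimates how many containers of a given size fit in one NodeManager's memory. The function name and sample numbers are hypothetical, and rounding each request up to a multiple of the scheduler's minimum allocation is a simplifying assumption about scheduler behavior, not a description of any particular scheduler.

```python
import math

def containers_per_node(node_memory_mb: int, request_mb: int,
                        min_allocation_mb: int = 1024) -> int:
    """Estimate how many containers fit in one NodeManager's memory.

    node_memory_mb    -- memory the NodeManager offers to containers
                         (compare yarn.nodemanager.resource.memory-mb)
    request_mb        -- memory each container asks for
    min_allocation_mb -- smallest grant the scheduler hands out
                         (compare yarn.scheduler.minimum-allocation-mb)
    """
    if min(node_memory_mb, request_mb, min_allocation_mb) <= 0:
        raise ValueError("all sizes must be positive")
    # Assumption: each request is rounded up to a multiple of the minimum.
    granted_mb = math.ceil(request_mb / min_allocation_mb) * min_allocation_mb
    return node_memory_mb // granted_mb
```

With these assumptions, a node offering 8192 MB holds four containers that each request 1536 MB, because each request is granted 2048 MB.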
MapReduce 2.x on YARN
• MapReduce API has not changed
– Binary-level backwards compatible (no recompile)
• Application Master launches and monitors job
via YARN
• MapReduce History Server to store… history
• Enabled Yahoo! to scale beyond 4,000 nodes
Hadoop Ecosystem
• Core Technologies
– Hadoop Distributed File System
– Hadoop MapReduce
• Many other tools…
– Which we will be discussing… now
Apache Sqoop
• Apache project designed for efficient transfer
between Apache Hadoop and structured data
stores
• Used through a CLI; extensible
• Use cases?
Apache Flume
• Distributed, reliable, available service for
collecting, aggregating, and moving large
amounts of log data
• Agents are configured using simple files;
extensible
• Use cases?
Apache NiFi
• A service to reliably move and manipulate files
between clusters using a web front-end
• Uses a GUI to drop processors and connect
them to build workflows
• Use cases?
Apache Pig
• Platform for analyzing large data sets that
consists of a high-level language for expressing
data analysis programs
• Infrastructure compiles language to a
sequence of MapReduce programs
• Use cases?
Apache Hive
• Data warehouse facilitating querying and
managing large datasets
• Compiles SQL-like queries into MapReduce
programs
• Use cases?
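To make "compiles queries into MapReduce programs" more concrete, here is a sketch of how an aggregation such as summing sales per region could decompose into map, shuffle, and reduce phases. It is plain single-process Python, not what Hive actually generates; the function names and the (region, amount) record shape are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs: the grouping column and the summed column."""
    for region, amount in records:
        yield region, amount

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values into a single result."""
    return {key: sum(values) for key, values in sorted(groups.items())}

def total_by_region(records):
    """Roughly: SELECT region, SUM(amount) FROM sales GROUP BY region."""
    return reduce_phase(shuffle(map_phase(records)))
```

For example, `total_by_region([("east", 10), ("west", 5), ("east", 2)])` yields one total per region.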
Hadoop Streaming
• Utility to create and run MapReduce jobs with
any executable or script as the mapper or
reducer
• Just a jar file, not a real project
• Use cases?
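A minimal sketch of a word count written for Hadoop Streaming, assuming the usual contract: the mapper reads lines on stdin and writes tab-separated key/value lines to stdout, and the reducer receives those lines sorted by key. The script layout and the `map` / `reduce` command-line argument are choices made for this example, not something Streaming requires.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Sum counts per word; input lines must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Run as "python wordcount.py map" or "python wordcount.py reduce".
if len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same script can be tried locally without a cluster by piping a file through the mapper, then `sort`, then the reducer.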
Which high-level API is for you?
• What are you comfortable with?
• What are you being told to use?
Apache HBase
• Distributed, scalable, big data store
• Data stored as sorted key/value pairs, with the
key consisting of a row and column
• Use cases?
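The sorted key/value model can be sketched with an in-memory toy store. This is not the HBase client API: the class and method names are hypothetical, and real HBase keys carry more parts (column family, qualifier, timestamp) than the (row, column) pair used here. The point is that keeping keys sorted makes a row-range scan a contiguous read.

```python
import bisect

class SortedKeyValueStore:
    """Toy store whose keys are (row, column) tuples kept in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value

    def get(self, row, column):
        return self._values.get((row, column))

    def scan(self, start_row, stop_row):
        """Yield (key, value) for start_row <= row < stop_row, in key order."""
        i = bisect.bisect_left(self._keys, (start_row, ""))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            key = self._keys[i]
            yield key, self._values[key]
            i += 1
```

Because related rows sort next to each other, row key design (for example, a shared prefix) decides which reads are cheap scans.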
Apache Accumulo
• Robust, scalable, high-performance data
storage and retrieval key/value store
• Cell-based access controls
– i.e. cell-level security
• Use cases?
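Cell-level security can be sketched as a filter applied to every cell at read time. This is a simplification under stated assumptions: each cell here carries a plain set of labels that must all be held by the reader, whereas Accumulo's visibility expressions also allow OR and nesting. The `Cell` type and `visible_cells` function are hypothetical names, not Accumulo's API.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    row: str
    column: str
    required_labels: frozenset  # labels a reader must hold to see the cell
    value: str

def visible_cells(cells, authorizations):
    """Return only the cells whose required labels the reader holds."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.required_labels <= auths]

# Hypothetical sample data: two cells in the same row, different visibility.
cells = [
    Cell("patient1", "name", frozenset({"staff"}), "Ada"),
    Cell("patient1", "diagnosis", frozenset({"staff", "medical"}), "flu"),
    Cell("patient1", "room", frozenset(), "12B"),
]
```

A reader holding only `staff` sees the name and room cells but not the diagnosis, even though all three live in the same row.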
Apache Avro
• Data serialization system for the Hadoop
ecosystem
• Use cases?
Apache Parquet
• Columnar storage format for Hadoop
• Use cases?
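To show what "columnar" means, the sketch below pivots row-oriented records into one list per column, so a query touching a single column reads only that column's values. It illustrates the layout idea only and says nothing about Parquet's actual file structure, encodings, or compression. The function names are hypothetical, and every row is assumed to have the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def column_sum(columns, name):
    """Aggregate one column without touching the others."""
    return sum(columns[name])

# Hypothetical sample data.
rows = [
    {"id": 1, "item": "disk", "price": 2.5},
    {"id": 2, "item": "cable", "price": 4.0},
]
```

Storing values of the same column together also tends to compress well, since neighboring values share a type and often a range.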
Apache Mahout
• Machine learning library to build scalable
machine learning algorithms implemented on
top of Hadoop MapReduce
• Use cases?
Apache Oozie
• Workflow scheduler system to manage
Apache Hadoop jobs
• Use cases?
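The core idea behind a workflow scheduler, running jobs only after the jobs they depend on, can be sketched with a topological sort. Oozie itself defines workflows in XML and does far more (triggers, retries, monitoring); the job names and the `execution_order` helper below are hypothetical, and the sketch needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

def execution_order(dependencies):
    """Return a valid run order.

    dependencies maps each job name to the set of jobs it must wait for.
    Raises graphlib.CycleError if the jobs depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical workflow: ingest, clean, aggregate, then two independent outputs.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "export": {"aggregate"},
}
```

`report` and `export` have no dependency on each other, so a scheduler is free to run them in either order or in parallel.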
Apache Storm
• Distributed real-time computation system
• Didn’t have a logo until June 2014
• How is this different from MapReduce?
• Use cases?
Apache ZooKeeper
• Effort to develop and maintain an open-source
server enabling highly reliable distributed
coordination
• Use cases?
Apache Spark
• Fast and general engine for large-scale data
processing
• Write applications in Java, Scala, or Python
• Use cases?
SQL on Hadoop
• Apache Drill, Cloudera Impala, Facebook’s
Presto, Hortonworks’s Hive Stinger, Pivotal
HAWQ, etc.
• SQL-like or ANSI SQL compliant MPP execution
engines using HDFS as a data store
• Use cases? Non use cases?
Sample Architecture
[Figure: sample architecture diagram. Labeled components: Webserver, Sales, Call Center, three Flume Agents, HDFS, HBase, MapReduce, Pig, Storm, Oozie, SQL (twice), Website.]
We [maybe] won’t be covering these in detail later on
OTHER HADOOP PROJECTS
Redis, Memcached, etc.
• Open-source in-memory key/value stores
• Use cases?
Apache Cassandra
• NoSQL database for managing large amounts of
structured, semi-structured, and unstructured data
• Support for clusters spanning multiple datacenters
• Unlike HBase and Accumulo, data is not stored on
HDFS
• Use cases? Non use cases?
Apache Crunch
• Java framework for writing, testing, and
running MapReduce pipelines with a simple
API
• Same code executes as a local job, as a
MapReduce job, or as a streaming Spark job
• Use cases?
Apache Kafka
• High-throughput distributed publish-subscribe
message service
• Use cases?
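Publish-subscribe over a log can be sketched in a few lines: publishers append to an ordered log, and each subscriber tracks its own read position, so every subscriber sees every message independently. This is an in-memory toy, not the Kafka client API, and it leaves out partitions, replication, and persistence. The `Topic` class and its methods are hypothetical names.

```python
class Topic:
    """Toy append-only log with a separate read offset per subscriber."""

    def __init__(self):
        self._log = []      # messages in publish order
        self._offsets = {}  # subscriber name -> next index to read

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, subscriber, max_messages=10):
        """Return the subscriber's next unread messages and advance its offset."""
        start = self._offsets.get(subscriber, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch
```

Because reading does not remove anything from the log, a slow subscriber does not block a fast one, and a new subscriber can start from the beginning.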
Azkaban
• Batch workflow job scheduler to run Hadoop
jobs
• Use cases?
Review
• A lot of projects are available to you for your
group project
• Think of a problem you are interested in, then
choose the appropriate projects to solve it
• Keep in mind data ingest, storage, processing,
and egress
• Feel free to explore and use projects other than
the ones I have listed here
– Get permission if you plan on using one as part of
your project quota
References
• All those logos are the property of their
owners
• *.apache.org
• redis.io