
COSC6376 Cloud Computing
Lecture 4. HDFS, MapReduce
Implementation, and Spark
Instructor: Weidong Shi (Larry), PhD
Computer Science Department
University of Houston
Outline
• GFS/HDFS
• MapReduce Implementation
• Spark
Distributed File System
Goals of HDFS
• Very Large Distributed File System
 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
 Files are replicated to handle hardware failure
 Detect failures and recover from them
• Optimized for Batch Processing
 Data locations exposed so that computations can
move to where data resides
 Provides very high aggregate bandwidth
Assumptions
• Inexpensive components that often fail
• Large files
• Large streaming reads and small
random reads
• Large sequential writes
• Multiple users append to the same file
• High bandwidth is more important than
low latency
Hadoop Cluster
Failure Trends in a Large Disk Drive Population
• The data are broken down by the age a drive was when it failed.
Architecture
• Chunks
 File → chunks → locations of chunks (replicas)
• Master server
 Single master
 Keeps metadata
 Accepts requests on metadata
 Most management activities
• Chunk servers
 Multiple chunk servers
 Keep chunks of data
 Accept requests on chunk data
User Interface
• Commands for HDFS User:
 hadoop dfs -mkdir /foodir
 hadoop dfs -cat /foodir/myfile.txt
 hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator
 hadoop dfsadmin -report
 hadoop dfsadmin -decommission datanodename
• Web Interface
 http://host:port/dfshealth.jsp
Design decisions
• Single master
 Simplify design
 Single point-of-failure
 Limited number of files
• Meta data kept in memory
• Large chunk size: e.g., 64M
 Advantages
• Reduce client-master traffic
• Reduce network overhead – less network interactions
• Chunk index is smaller
 Disadvantages
• Does not favor small files
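A quick back-of-envelope calculation shows why a large chunk size keeps the master's chunk index small enough to hold in memory. The ~64 bytes of metadata per chunk is an illustrative figure in the ballpark of the GFS paper, not an exact number:

```python
# Back-of-envelope: master metadata for a 10 PB file system with 64 MB
# chunks, assuming roughly 64 bytes of in-memory metadata per chunk
# (an illustrative figure, not an exact GFS/HDFS number).

def metadata_bytes(total_bytes, chunk_size, bytes_per_chunk=64):
    """Return (number of chunks, metadata size in bytes)."""
    chunks = total_bytes // chunk_size
    return chunks, chunks * bytes_per_chunk

PB = 10**15
MB = 10**6

chunks, meta = metadata_bytes(10 * PB, 64 * MB)
print(chunks)           # 156,250,000 chunks
print(meta / 10**9)     # 10.0 -> about 10 GB, fits in one master's RAM
```

With a 1 MB chunk size the same file system would need 64× more index entries, which is the client-master-traffic and chunk-index argument in the bullets above.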
Master: metadata
• Metadata is stored in memory
• Namespaces
 Directory → physical location
• Files → chunks → chunk locations
• Chunk locations
 Not stored persistently by the master; reported by chunk servers
• Operation log
Master Operations
• All namespace operations
 Name lookup
 Create/remove directories/files, etc
• Manage chunk replicas
 Placement decisions
 Create new chunks & replicas
 Balance load across all chunkservers
 Garbage collection
Master: chunk replica placement
• Goals: maximize reliability, availability and
bandwidth utilization
• Physical location matters
 Lowest cost within the same rack
 “Distance”: # of network switches
• In practice (hadoop)
 If we have 3 replicas
 Two chunks in the same rack
 The third one in another rack
• Choice of chunkservers
 Low average disk utilization
 Limited # of recent writes → distributes write traffic
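The placement rule above can be sketched as a small function. Everything here (the server records, the `utilization` field) is hypothetical, not a real HDFS API; it only illustrates "two replicas in one rack, the third in another, preferring lightly loaded chunkservers":

```python
# Hypothetical sketch of the rack-aware rule above: two replicas in one
# rack, the third in a different rack, preferring chunkservers with low
# disk utilization. Server records and "utilization" are illustrative.

def place_replicas(servers, n=3):
    """servers: list of dicts like {"id": "cs1", "rack": "r1", "utilization": 0.4}.
    Returns the ids of up to n chosen chunkservers."""
    by_util = sorted(servers, key=lambda s: s["utilization"])
    first = by_util[0]                       # least-loaded chunkserver overall
    same_rack = [s for s in by_util
                 if s["rack"] == first["rack"] and s["id"] != first["id"]]
    other_rack = [s for s in by_util if s["rack"] != first["rack"]]
    chosen = [first]
    if same_rack:
        chosen.append(same_rack[0])          # second replica in the same rack
    if other_rack:
        chosen.append(other_rack[0])         # third replica in another rack
    return [s["id"] for s in chosen[:n]]

servers = [
    {"id": "cs1", "rack": "r1", "utilization": 0.9},
    {"id": "cs2", "rack": "r1", "utilization": 0.2},
    {"id": "cs3", "rack": "r1", "utilization": 0.5},
    {"id": "cs4", "rack": "r2", "utilization": 0.3},
]
print(place_replicas(servers))  # ['cs2', 'cs3', 'cs4']
```

Two replicas land in rack r1 (cheap intra-rack traffic) and one in rack r2 (survives a whole-rack failure), matching the tradeoff described above.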
Master: chunk replica placement
• Re-replication
 Lost replicas for many reasons
 Prioritized: low # of replicas, live files, actively used
chunks
 Re-replicated chunks are placed following the same principles
• Rebalancing
 Redistribute replicas periodically
• Better disk utilization
• Load balancing
Master: garbage collection
• Lazy mechanism
 Mark as deleted immediately
 Reclaim resources later
• Regular namespace scan
 For deleted files: remove metadata after three days
(full deletion)
 For orphaned chunks, let chunkservers know they are
deleted
• Stale replica
 Use chunk version numbers
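Stale-replica detection with version numbers can be sketched as follows (names are illustrative, not the real GFS/HDFS code): the master bumps a chunk's version when it grants a new lease, so any replica reporting an older version missed a write and is scheduled for garbage collection.

```python
# Minimal sketch of stale-replica detection with chunk version numbers.
# Field names are illustrative, not from a real GFS/HDFS API.

def find_stale(master_version, replica_versions):
    """replica_versions: {chunkserver_id: version reported at heartbeat}.
    Returns the chunkservers holding stale copies of this chunk."""
    return [cs for cs, v in replica_versions.items() if v < master_version]

# cs2 missed a mutation while it was down, so its version lags the master's
print(find_stale(7, {"cs1": 7, "cs2": 6, "cs3": 7}))  # ['cs2']
```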
Read/Write

File Read
1. Client opens the file (DistributedFileSystem)
2. DistributedFileSystem gets block locations from the NameNode
3. Client reads through an FSDataInputStream
4. The stream reads each block from the closest DataNode
5. The next block is read from the next closest DataNode
6. Client closes the stream

File Write
1. Client creates the file (DistributedFileSystem)
2. DistributedFileSystem asks the NameNode to create it
3. Client writes through an FSDataOutputStream
4. The stream gets a list of 3 DataNodes forming a pipeline
5. Data packets are written down the pipeline
6. Acks flow back up the pipeline
7. Client closes the stream
8. DistributedFileSystem signals completion to the NameNode

If a DataNode crashes, the crashed node is removed, the current block receives a newer id (so the partial data on the crashed node can be deleted later), and the NameNode allocates another node.
Consistency
• It is expensive to maintain strict
consistency
• GFS uses a relaxed consistency
 Better support for appending
 checkpointing
Fault Tolerance
• High availability
 Fast recovery
 Chunk replication
 Master replication: inactive backup
• Data integrity
 Checksumming
• Use CRC32
 Incremental update checksum to improve
performance
• A chunk is split into 64K-byte units
• Update checksum after adding a unit
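The incremental scheme above can be sketched with Python's `zlib.crc32` standing in for GFS's own CRC implementation: each 64 KB unit gets its own checksum, so an append only checksums the new bytes rather than re-reading the whole chunk.

```python
# Sketch of per-unit checksumming: a chunk is split into 64 KB units and
# each unit gets its own CRC32, so appending only checksums the new bytes.
# zlib.crc32 stands in for GFS's own CRC code.
import zlib

UNIT = 64 * 1024  # 64 KB units, as described above

def unit_checksums(chunk):
    """One CRC32 per 64 KB unit of `chunk`."""
    return [zlib.crc32(chunk[i:i + UNIT]) for i in range(0, len(chunk), UNIT)]

def append_unit(checksums, new_unit):
    """Incremental update: only the appended unit is checksummed."""
    return checksums + [zlib.crc32(new_unit)]

data = bytes(range(256)) * 512             # 128 KB -> exactly two units
sums = unit_checksums(data)
sums = append_unit(sums, b"appended record")
print(len(sums))                           # 3
```

On a read, a chunkserver verifies only the units covering the requested range, which is why the 64 KB granularity matters for small reads.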
Cost
Discussion
• Advantages
 Works well for large data processing
 Using cheap commodity servers
• Tradeoffs
 Single-master design
 Workload is mostly reads and appends
• Latest upgrades (GFS II)
 Distributed masters
 Introduce the “cell” – a number of racks in the same
data center
 Improved performance of random r/w
MapReduce Implementation
Execution
• How is this distributed?
1. Partition input key/value pairs into chunks, run map() tasks in parallel
2. After all map()s are complete, consolidate all emitted values for each unique emitted key
3. Now partition the space of output map keys, and run reduce() in parallel
• If map() or reduce() fails, re-execute!
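The three steps above can be simulated in a few lines of single-process Python; word count is the standard example. A real system runs the map and reduce steps on many machines in parallel, with the shuffle happening over the network:

```python
# Toy, single-process sketch of the execution steps above:
# 1) run map() over input chunks, 2) group emitted values by key
# (the shuffle), 3) run reduce() once per key group.
from collections import defaultdict

def map_fn(chunk):
    for word in chunk.split():
        yield (word, 1)                     # emit (key, value) pairs

def reduce_fn(key, values):
    return (key, sum(values))

def mapreduce(chunks):
    intermediate = defaultdict(list)
    for chunk in chunks:                    # step 1: map tasks
        for k, v in map_fn(chunk):
            intermediate[k].append(v)       # step 2: consolidate by key
    # step 3: reduce tasks, one per unique key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

print(mapreduce(["the cat", "the dog the cat"]))
# {'the': 3, 'cat': 2, 'dog': 1}
```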
MapReduce - Dataflow
Map Implementation
• Each map process reads one chunk (block) and partitions its output by key ranges (k1–ki, ki+1–kj, …) into R parts in the map's local storage
• Map processes are allocated as close to the chunks as possible
• One node can run a number of map processes, depending on the configuration
Reducer Implementation
• Each of the R reducers pulls its key range (k1–ki, ki+1–kj, …) from the output of every mapper (mapper 1 … mapper n)
• The R final output files are stored in the user-designated directory
Execution
Execution Initialization
• Split input file into 64MB sections (GFS)
 Read in parallel by multiple machines
• Fork off program onto multiple machines
• One machine is Master
• Master assigns idle machines to either Map or
Reduce tasks
• Master coordinates data communication between map and reduce machines
Overview of Hadoop/MapReduce
Partition Function
• Inputs to map tasks are created by contiguous
splits of input file
• For reduce, we need to ensure that records with
the same intermediate key end up at the same
worker
• System uses a default partition function e.g.,
hash(key) mod R
Partition Key-Value Pairs for Reduce
• Hadoop uses a hash code as its default method to partition key-value pairs
• The default hash code of a Java string (used by Hadoop) is
 hash = w1·31^(n-1) + w2·31^(n-2) + … + wn
 where wn represents the nth character in the string
• A partition function typically uses the hash code of the key modulo the number of reducers to determine which reducer to send the key-value pair to
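Sketched in Python, Java's `String.hashCode` and the resulting default partitioning look like this. Hadoop's `HashPartitioner` masks off the sign bit before taking the modulo so the reducer index is always non-negative:

```python
# Java's String.hashCode() (sum of w_n * 31^(n-1-pos), truncated to a
# signed 32-bit int) and Hadoop's default HashPartitioner, in Python.

def java_string_hash(s):
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF   # keep the low 32 bits
    return h - 2**32 if h >= 2**31 else h     # reinterpret as signed int

def partition(key, num_reducers):
    """(hashCode & Integer.MAX_VALUE) % R, as in Hadoop's HashPartitioner."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

print(java_string_hash("ab"))   # 3105, matching Java's "ab".hashCode()
print(partition("hello", 4))    # 2
```

Because the partition depends only on the key, every occurrence of a given key, from every mapper, lands on the same reducer, which is the guarantee the previous slide asks for.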
Dataflow
• Input, final output are stored on a
distributed file system
 Scheduler tries to schedule map tasks “close” to
physical storage location of input data
• Intermediate results are stored on local FS
of map and reduce workers
• Output is often input to another map
reduce task
How Many Map and Reduce jobs?
• M map tasks, R reduce tasks
• Rule of thumb
 Make M and R much larger than the number of
nodes in cluster
 One DFS chunk per map is common
 Improves dynamic load balancing and speeds
recovery from worker failure
• Usually R is smaller than M, because output
is spread across R files
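Plugging hypothetical numbers into the rule of thumb makes it concrete: with one map task per 64 MB DFS chunk, 1 TB of input yields 16,384 map tasks, and R is chosen much smaller:

```python
# Concrete sizing for the rule of thumb above (hypothetical workload):
# one map task per DFS chunk, and R chosen much smaller than M.
TB = 2**40
MB = 2**20

input_size = 1 * TB
chunk_size = 64 * MB

M = input_size // chunk_size   # one map task per chunk
R = 500                        # e.g., a small multiple of the node count
print(M)                       # 16384 map tasks
```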
Misc. Remarks
Parallelism
• map() functions run in parallel, creating different
intermediate values from different input data
sets
• reduce() functions also run in parallel, each
working on a different output key
• All values are processed independently
• Bottleneck: reduce phase can’t start until
map phase is completely finished.
Locality
• Master program divides up tasks based on
location of data: tries to have map() tasks on
same machine as physical file data, or at least
same rack
• map() task inputs are divided into 64 MB blocks
in HDFS (by default): same size as Google File
System chunks
Job Processing
(Diagram: a JobTracker coordinating six TaskTrackers, 0–5, running the "wordcount" job)
1. Client submits the "wordcount" job, indicating code and input files
2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to TaskTrackers
3. After map(), TaskTrackers exchange map output to build the reduce() keyspace
4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
5. reduce() output may go to HDFS
Fault Tolerance
• Worker failure
 Map/reduce fails  reassign to other workers
 Node failure  redo jobs in other nodes
• Chunk replicas make it possible
• Master failure
 Log/checkpoint
 Master process and GFS master server
• Backup tasks for "stragglers":
 The whole job is often delayed by a few slow workers, for various reasons (network, disk I/O, node failure, …)
 As the job nears completion, the master starts backup workers for each in-progress task
• Without backup tasks, the sort benchmark runs 44% longer
Experiment: Effect of Backup Tasks,
Reliability (example: sorting)
MapReduce Chaining
MapReduce Summary
MapReduce Conclusions
• MapReduce has proven to be a useful
abstraction
• Greatly simplifies large-scale computations at
Google
• Functional programming paradigm can be
applied to large-scale applications
• Fun to use: focus on problem, let the
middleware deal with messy details
In Real World
• Applied to various domains in Google
 Machine learning
 Clustering
 Reports
 Web page processing
 Indexing
 Graph computation
 …
• Mahout library
 Scalable machine learning and data mining
• Research projects
MapReduce: The Good
• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple(?) API
MapReduce: The Bad
• Optimized for disk IO
 Doesn’t leverage memory well
 Iterative algorithms go through disk IO path again
and again
• Primitive API
 Developers have to build on a very simple abstraction
 Key/value in, key/value out
 Even basic things like joins require extensive code
• The result is often many files that need to be combined appropriately
Hadoop Acceleration
FPGA Cluster
FPGA for MapReduce
Density and Energy
ARM Cluster
6-node, 6 TB Hadoop cluster on ARM
MapReduce Using GPUs
• Graphics Processing Unit (GPU) is a specialized
processor that offloads 3D or 2D graphics
rendering from the CPU
• GPUs’ highly parallel structure makes them more
effective than general-purpose CPUs for a range
of complex algorithms
GPGPU Programming
• Each block can have
up to 512 threads that
synchronize
• Millions of blocks can
be issued
Programming Environment: CUDA
• Compute Unified Device
Architecture (CUDA)
• A parallel computing
architecture developed by
NVIDIA
• The computing engine in
GPUs is accessible to
software developers through
industry-standard programming languages
Mars: A MapReduce Framework on
Graphics Processors
• System Workflow and Configuration
Applications
Hardware
Mars: Performance
• Performance speedup of the optimized Mars over Phoenix as the data size is varied
MapReduce
Advantages/Disadvantages
• Now it’s easy to program for many CPUs
 Communication management effectively gone
• I/O scheduling done for us
 Fault tolerance, monitoring
• machine failures, suddenly-slow machines, etc are handled
 Can be much easier to design and program!
 Can cascade several MapReduce tasks
• But … it further restricts solvable
problems
 Might be hard to express problem in MapReduce
 Data parallelism is key
• Need to be able to break up a problem by data chunks
In-Memory MapReduce
In-Memory Data Grid
Speed of In-Memory Data Grid
https://www.datanami.com/2013/10/21/accelerating_hadoop_mapreduce_using_an_in-memory_data_grid/
Real-time
Motivation
 Many important applications must process large streams of live data and provide results in near-real-time
 - Social network trends
 - Website statistics
 - Intrusion detection systems
 - etc.
 Require large clusters to handle workloads
 Require latencies of a few seconds
Tathagata Das. Spark Streaming: Large-scale near-real-time stream processing.
Available data source
• Twitter public status updates
Jagane Sundar. Realtime Sentiment Analysis Application Using Hadoop and HBase
Firehose
• “Firehose” is the name given to the massive, real-time stream of Tweets that flows from Twitter each day
• Twitter provides access to this “firehose” using a streaming technology called XMPP
Who are using it
Requirements
 Scalable to large clusters
 Second-scale latencies
 Simple programming model
Apache Spark
Apache Spark
• Originally developed in UC Berkeley’s AMP
Lab
• Fully open sourced – now at Apache
Software Foundation
• Commercial Vendor Developing/Supporting
Apache Spark. MapR Technologies.
Spark: Easy and Fast Big Data
• Easy to Develop
 Rich APIs in Java, Scala, Python
 Interactive shell
• Fast to Run
 General execution graphs
 In-memory storage
• 2-5× less code
Apache Spark. MapR Technologies.