Cloudera Certification

Cloudera Certification for
Apache Hadoop Admin
Curriculum
• HDFS (17%)
• YARN and MapReduce version 2 (MRv2) (17%)
• Hadoop Cluster Planning (16%)
• Hadoop Cluster Installation and Administration (25%)
• Resource Management (10%)
• Monitoring and Logging (15%)
• Miscellaneous
HDFS (17%)
• Describe the function of HDFS Daemons
• Describe the normal operation of an Apache Hadoop cluster, both in data storage
and in data processing.
• Identify current features of computing systems that motivate a system like
Apache Hadoop.
• Classify major goals of HDFS Design
• Given a scenario, identify appropriate use case for HDFS Federation
• Identify components and daemons of an HDFS HA-Quorum cluster
• Analyze the role of HDFS security (Kerberos)
• Determine the best data serialization choice for a given scenario
• Describe file read and write paths
• Identify the commands to manipulate files in the Hadoop File System Shell
Describe the function of HDFS Daemons
• HDFS Daemons
Datanode (Stores data in the form of files)
Namenode (In memory representation of HDFS file metadata)
Secondary namenode (Helper to Namenode)
Hadoop Architecture
(diagram: slave nodes provide storage and processing; master nodes provide the metadata, helper and processing-master roles)
Hadoop Architecture
(diagram: each slave node runs HDFS (Datanode) and Map Reduce; the masters run the Namenode, the Secondary Namenode and the Map Reduce master)
Typical Hadoop Cluster
(diagram: many nodes running HDFS, connected through network switches)
Typical Hadoop Cluster
(diagram: many slave nodes running Datanodes (DN), plus the Namenode (NN) and Secondary Namenode (SNN), connected through network switches)
Describe the normal operation of an Apache Hadoop
cluster, both in data storage and in data processing.
• Hadoop Cluster
 Data Storage (HDFS)
 Files and Blocks
 Fault Tolerance – Replication Factor
 Metadata
 Datanode
 Namenode and Secondary Namenode
 Heartbeat
 Checksum
 Namenode recovery (fsimage, editlogs and safemode)
 Data Processing (Map Reduce – classic/YARN)
 Mappers and Reducers
 MRv1/Classic
 Job Tracker
 Task Tracker
 MRv2/YARN
 Resource Manager
 Node Manager
Hadoop Cluster
(Single node/Cloudera VM)
(diagram: a single node hosts the Namenode (metadata), Secondary Namenode (helper), and the storage and processing roles)
Hadoop Cluster
(Single node/Cloudera VM)
(diagram: the single node runs the Namenode, Secondary Namenode and a Datanode providing storage)
File Name: deckofcards.txt
Block Name: blk_XXX1
Contents:
BLACK|SPADE|2
BLACK|SPADE|3
BLACK|SPADE|4
BLACK|SPADE|5
BLACK|SPADE|6
BLACK|SPADE|7
BLACK|SPADE|8
BLACK|SPADE|9
BLACK|SPADE|10
Hadoop Cluster
(Single node/Cloudera VM)
(diagram: the Namenode tracks File Name|Block Name|Location, e.g. deckofcards.txt|blk_XXX1|node01; the Datanode provides storage and processing)
File Name: deckofcards.txt
Block Name: blk_XXX1
Block size: default (128 MB)
Replication Factor: 3 (but only one copy will exist on a single node)
Contents: BLACK|SPADE|2 through BLACK|SPADE|10 (as above)
The Namenode will contain the file name, all block names and block locations (in memory).
There will be one or more files created with the prefix blk_*.
A file larger than the block size will be split into multiple blocks.
Processing will be covered later.
Hadoop Cluster
(diagram: multiple slave nodes providing storage and processing; master nodes providing the metadata, helper and processing-master roles)
Hadoop Cluster
(Storage)
Namenode metadata (File Name|Block Name|Location):
deckofcards.txt|blk_XXX1|node01
deckofcards.txt|blk_XXX1|node02
deckofcards.txt|blk_XXX1|node03
(diagram: three Datanodes, each storing a copy of blk_XXX1, plus the Namenode and Secondary Namenode)
File Name: deckofcards.txt (which is a few bytes)
Block Name: blk_XXX1
Block size: default (128 MB)
Replication Factor: 3 (now there will be 3 copies of each block)
Contents (sample): BLACK|SPADE|2 through BLACK|SPADE|10 (as above)
The Namenode will contain the file name, all block names and block locations (in memory).
Files with the prefix blk_* will be created depending upon the size of the file.
A file larger than the block size will be split into multiple blocks.
Processing will be covered later.
Hadoop Cluster
(Storage)
Namenode metadata (File Name|Block Name|Location):
deckofcards.txt|blk_XXX1|node01
deckofcards.txt|blk_XXX1|node02
deckofcards.txt|blk_XXX1|node03
deckofcards.txt|blk_XXX2|node01
deckofcards.txt|blk_XXX2|node02
deckofcards.txt|blk_XXX2|node03
(diagram: three Datanodes, each storing copies of blk_XXX1 and blk_XXX2, plus the Namenode and Secondary Namenode)
File Name: deckofcards.txt (200 MB)
Block Name: blk_XXX1 (128 MB), blk_XXX2 (72 MB)
Block size: default (128 MB)
Replication Factor: 3 (now there will be 3 copies of each block)
Contents (sample): BLACK|SPADE|2 through BLACK|SPADE|10 (as above)
The Namenode will contain the file name, all block names and block locations (in memory).
Files with the prefix blk_* will be created depending upon the size of the file.
A file larger than the block size will be split into multiple blocks.
Processing will be covered later.
Files and Blocks
• File abstraction using blocks
• File abstraction means a file can be larger than any one hard disk in the
cluster
• It can be achieved by a network file system as well as HDFS
• HDFS and other distributed file systems typically use the local file system on each node rather than a network file system
• Files are distributed on HDFS based on dfs.blocksize (see the commands below)
Fault Tolerance – Replication Factor
• Fault tolerance – HDFS is fault tolerant
• HDFS does not use RAID (RAID only solves Hard disk failure, mirroring is
expensive and striping is slow)
• HDFS uses mirroring and dfs.replication controls how many copies should be
made (default 3).
• HDFS mirroring/replication solves
• Disk failure as well as any other hardware failure (except network failures)
• Network failures are addressed using multiple racks with multiple switches
Metadata
• Files are divided into blocks based on dfs.blocksize (default 128 MB)
• Each block will have multiple copies, stored on the servers designated as datanodes. This is controlled by the parameter dfs.replication (default 3)
• What is file metadata?
• An HDFS file is logical
• Each block will have a block id and multiple copies
• Each copy will be stored on a separate data node
• The mapping between file, block and block location is the metadata of a file
• Also file permissions, directories etc.
• All of this is stored in memory on the Namenode
Data node
• Actual contents of the files are stored as blocks on the slave nodes
• Blocks are simply files on the slave nodes’ underlying file system
• Named blk_xxxxxxx
• Nothing on the slave node provides information about what underlying file the block is a part of
• That information is only stored in the NameNode’s metadata
• Each block is stored on multiple different nodes for redundancy
• Default is three replicas
• Each slave node runs a DataNode (DN) daemon
• Controls access to the blocks
• Communicates with the NameNode
Data node (Slave)
Files (uses replication factor)
1) Blocks
2) Checksum
Processes (Stand Alone)
1) Data Node
Data node (Slave)
Files (uses replication factor)
1) dfs.datanode.data.dir
Processes (Stand Alone)
1) proc_datanode
Name node
• The Name node is a single point of failure
• The NameNode (NN) stores all metadata (in memory)
• Information about file locations in HDFS
• Information about file ownership and permissions
• Names of the individual blocks
• Locations of the blocks
• Metadata is stored on disk and read when the NameNode daemon starts up
• Filename is fsimage
• Note: block locations are not stored in fsimage
• Changes to the metadata are made in RAM
• Changes are also written to a log file on disk called edits – Full details later
Name node (Master)
Namespace (Memory)
1) File locations in HDFS
2) File ownership
3) File permissions
4) Names of the individual blocks
5) Locations of the blocks
Processes (Stand Alone)
1) Name Node (proc_namenode)
Files (Must be mirrored)
1) FS Image
2) Edit Logs
Name node (Master)
Namespace (Memory)
Processes (Stand Alone)
1) proc_namenode
Files (Must be mirrored)
1) dfs.namenode.name.dir
Name node (Master)
• Configuration file for the name node: hdfs-site.xml (typically located at /etc/hadoop/conf)
• The dfs.namenode.name.dir parameter in hdfs-site.xml determines the location of the edit logs and FS image
• proc_namenode is the name of the process
Secondary Name node (Helper)
Namespace (Memory)
Files
1) Edit logs
2) FS Image
Processes (Stand Alone)
1) proc_secondarynamenode
Secondary Name node (Helper)
• Configuration file for the secondary name node: hdfs-site.xml (typically located at /etc/hadoop/conf)
• The dfs.namenode.checkpoint.* parameters in the name node's hdfs-site.xml determine the interoperability between the name node and the secondary name node
• proc_secondarynamenode is the name of the process
Secondary name node (Helper)
• The Secondary NameNode (2NN) is not a failover NameNode!
– It performs memory-intensive administrative functions for the NameNode
– The NameNode keeps information about files and blocks (the metadata) in memory
– The NameNode writes metadata changes to an edit log
– The Secondary NameNode periodically combines a prior filesystem snapshot and the edit log into a new snapshot
– The new snapshot is transmitted back to the NameNode
– Note that the fsimage does not contain the locations of the blocks. The Namenode namespace is built in memory during safe mode as data nodes report in to the cluster.
– The Secondary NameNode should run on a separate machine in a large installation
• It requires as much RAM as the NameNode
Determine how HDFS stores, reads, and writes files.
Heartbeat and block report
• Datanode sends heartbeat every 3 seconds to Namenode
• Heartbeat interval is controlled by dfs.heartbeat.interval
• Along with heartbeat, Datanode sends information such as
 Disk capacity
 Current activity
• The data node also sends a periodic block report (default: every 6 hours) to the Namenode (dfs.blockreport.*)
Checksum
• Checksum is used to ensure blocks or files are not corrupted while
files are being read from HDFS or written to HDFS
Namenode Recovery and Secondary
Namenode
• Edit logs
• FSImage
• It only contains files and blocks (to reduce the size of the FSImage and improve restore time, which is serial in nature)
• It does not contain block locations
• Secondary Namenode
• A helper process which merges the latest edit log with the last snapshot of the FSImage and creates a new one
• Recovery process (see the commands below)
• Namenode starts in safemode
• Restores the latest FSImage
• Recovers using the latest edit log
• Namenode does a roll call to the datanodes to determine the locations of the blocks
HDFS - Important parameters (Hadoop cluster
with one name node)
File Name | Parameter Name | Parameter value | Description
core-site.xml | fs.defaultFS / fs.default.name | hdfs://<namenode_ip>:8020 | Namenode IP address or nameservice (HA config)
hdfs-site.xml | dfs.block.size, dfs.blocksize | 128 MB | Block size at which files will be stored physically
hdfs-site.xml | dfs.replication | 3 | Number of copies per block of a file, for fault tolerance
hdfs-site.xml | dfs.namenode.http-address | 0.0.0.0:50070 | Namenode Web UI. By default it might use the IP address of the namenode.
hdfs-site.xml | dfs.datanode.http.address | 0.0.0.0:50075 | Datanode Web UI
hdfs-site.xml | dfs.name.dir, dfs.namenode.name.dir | <directory_location> | Directory location for the FS Image and edit logs on the name node
hdfs-site.xml | dfs.data.dir, dfs.datanode.data.dir | <directory_location> | Directory location for storing blocks on data nodes
hdfs-site.xml | fs.checkpoint.dir, dfs.namenode.checkpoint.dir | <directory_location> | Directory location used by the secondary namenode for checkpoints
hdfs-site.xml | fs.checkpoint.period, dfs.namenode.checkpoint.period | 1 hour | Checkpoint (merging edit logs with the current FS image to create a new FS image) interval
hdfs-site.xml | dfs.namenode.checkpoint.txns | 1000000 | Checkpoint (merging edit logs with the current FS image to create a new FS image) transaction threshold
Describe the normal operation of an Apache Hadoop cluster, both in data storage
and in data processing.
• Data processing
• Distributed
• Scalable
• Data locality
• Fault tolerant
* Will be covered later
Describe the normal operation of an Apache Hadoop cluster, both in data storage
and in data processing.
• Mappers and Reducers
• MRv1/Classic
• MRv2/YARN
Hadoop Cluster
(Processing)
• Mappers
• Each map task operates on one block (typical) or more (split size)
• Typically tries to process block on the same node where map is running.
• Shuffle & Sort
• Happens after all mappers are done before reduce phase is started.
• Sorts and consolidates all the intermediate data for the reducer
• Reducers
• Operates on shuffled/sorted map output
• Writes back the output to HDFS (typically)
Hadoop Cluster
(Processing)
• There are two frameworks to transition the job into tasks to process the data
• MRv1 “Classic”
• Job Tracker (permanent – per cluster)
• Task Tracker (permanent – per node)
• Predetermined number of mappers and reducers
• MRv2/YARN
• Resource Manager (permanent – per cluster)
• Node Manager (permanent – per node)
• Application Master (transient – per job)
• Container (transient – per job per node)
• There is a separate item which covers MRv1 and MRv2 in detail; for now, just understand that there are 2 frameworks to process data, along with their daemon processes
Mappers and Reducers
• Mappers
– The number of mappers is determined by the framework based on block size and split size
– Uses data locality
– Logic to filter and perform row-level transformations is implemented in the map function
– Mapper tasks execute the map function
• Shuffle & Sort
– Typically taken care of by the Hadoop MapReduce framework
– The capability can be enhanced or customized in the form of custom partitioners and custom comparators
• Reducers
– Developers need to determine the number of reducers (see the example below)
– It can be pre-determined in some cases
• If the report has to be generated by year, then the number of reducers can be the number of years you want to generate the report for
• If the report has to be generated for a number of regions or states, then the number of reducers can be the number of regions or states
– Logic to implement aggregations, joins etc. is implemented in the reduce function
– Reducer tasks execute the reduce function
Identify current features of computing systems
that motivate a system like Apache Hadoop.
• RDBMS (Relational Database Management Systems)
Designed and developed for operational and transactional applications
Not efficient for batch processing
Not linearly scalable
• Grid Computing (In-memory)
• MPP (Massively Parallel Processing)
RDBMS
Aspect | Traditional RDBMS | Hadoop
Data size | Gigabytes | Petabytes
Access | Interactive and batch (small) | Batch (large)
Updates | Read and write many times | Write once, read many times
Structure | Static schema | Dynamic schema
Integrity | High | Low
Scaling | Nonlinear | Linear
Apache Hadoop
• Distributed File System
• Distributed Processing
• Data Locality
• Scalable
• Supports Structured, Unstructured and Semi-structured data
• Cost effective
• Open source
• Proven on commodity hardware
Classify major goals of HDFS Design
• Distributed – using block size, default 128 MB
• Hardware Failure – detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS – using replication factor, default 3.
• Streaming Data Access
 Applications that run on HDFS need streaming access to their data sets. They are not general
purpose applications that typically run on general purpose file systems. HDFS is designed more for
batch processing rather than interactive use by users. The emphasis is on high throughput of data
access rather than low latency of data access. POSIX imposes many hard requirements that are not
needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been
traded to increase data throughput rates.
• Large Data sets – tuned for large data sets
• Simple Coherency Model – write-once-read-many, HDFS files are immutable
• Data Locality (moving computation to data)
• Portability Across Heterogeneous Hardware and Software Platforms (logical file system)
Given a scenario, identify appropriate use
case for HDFS Federation
• HDFS (two main layers)
• Namespace manages directories, files and blocks. It supports file system
operations such as creation, modification, deletion and listing of files and
directories.
• Block Storage
• Block Management maintains the membership of datanodes in the cluster. It supports
block-related operations such as creation, deletion, modification and getting location of
the blocks. It also takes care of replica placement and replication.
• Physical Storage stores the blocks and provides read/write access to it.
Given a scenario, identify appropriate use
case for HDFS Federation
Given a scenario, identify appropriate use
case for HDFS Federation
• HDFS (Limitations)
• Namespace Scalability
• Performance
• Isolation
Given a scenario, identify appropriate use
case for HDFS Federation
• HDFS Federation (Namenode)
Namenode Scalability
Better Performance
Isolation
• HDFS Federation (implementation)
Multiple namespaces
Multiple namenodes
Same set of datanodes for all namespaces
Block Pool
Namespace Volume (Block Pool and associated Namespace)
 Self contained
Given a scenario, identify appropriate use
case for HDFS Federation
Identify components and daemons of an
HDFS HA-Quorum cluster
• Namenode recovery and secondary namenode
• Editlogs
• FSImage
• It only contains files and blocks (to reduce the size of the FSImage and improve restore time –
which is serial in nature)
• It does not contain block locations
• Editlogs are merged into FSImage at regular intervals (checkpointing)
• Secondary Namenode
• A helper process which merges latest edit log with last snapshot of FSImage and create new
one
• Recovery process
• Namenode starts in safemode
• Restores the latest FSImage
• Recovers using the latest edit log
• Namenode does a roll call to the datanodes to determine the locations of the blocks
Identify components and daemons of an HDFS
HA-Quorum cluster
• Namenode recovery and secondary namenode (limitations)
• Checkpointing is resource intensive
• If the IP address changes, failover might not be transparent
• Recovery is time consuming
Identify components and daemon of an HDFS
HA-Quorum cluster
• HDFS HA – Quorum cluster components
Active (one) and Standby (one) Namenodes
Journal Nodes (Journal directories – at least 3 or more in odd number)
Zookeeper (quorum)
• HDFS HA – Quorum cluster scenarios
High Availability
Transparent Failover
Identify components and daemon of an HDFS
HA-Quorum cluster
• HDFS HA – Quorum cluster components
 Active and Standby Namenodes
 HA is different from the Secondary namenode and from Federation
 Only one namenode will be active
 The standby node gets edit logs at regular intervals from the journal nodes (the journal nodes get edit logs from the active namenode)
 Shared edits
 Shared Storage
 Uses NFS to store edit logs in a shared location accessed by both Namenodes (Active and Passive)
 The Active Namenode writes to the shared edit logs
 The Passive Namenode reads from the shared edit logs and applies them
 Journal Nodes (Journal directories)
 Typically 3 (when greater than 3, it needs to be an odd number)
 The active namenode writes edit logs to a majority of the configured journal nodes
 The standby namenode reads edit logs from any of the surviving journal nodes
 Zookeeper (quorum)
 It will typically be running on 3 or 5 nodes (proc_zkfc)
 As proc_zkfc is lightweight, it can be deployed on both Namenodes and the ResourceManager
Identify components and daemon of an HDFS
HA-Quorum cluster
Analyze the role of HDFS security (Kerberos)
Determine the best data serialization choice
for a given scenario
• Serialization
• Writable
• Avro – Typically to use other languages to store data in HDFS
• Java Serialization – will not be used as it is heavy compared to
Writable and Avro (it is not compact, fast, extensible and
interoperable – will see these characteristics later)
Serialization and Deserialization
• Serialization is the process of turning structured objects into a byte stream
for transmission over a network or for writing to persistent storage.
• Deserialization is the reverse process of turning a byte stream back into a
series of structured objects.
• In the context of Hadoop, Serialization is used for inter-process
communication (between mappers and reducers) as well as while storing
data persistently.
• RPC (Remote Procedure Calls) is used for inter-process communication. The
RPC protocol uses serialization to render the message into a binary stream
to be sent to the remote node, which then deserializes the binary stream
into the original message.
• Serialization for RPC should be Compact, Fast, Extensible and Interoperable
Serialization in Hadoop
• Writable Interface
The Writable interface defines two methods—one for writing its state to a
DataOutput binary stream and one for reading its state from a DataInput
binary stream:
 write
 readFields
There are a bunch of classes in the Hadoop API which implement the Writable interface, such as IntWritable, Text etc.
Writable types
Serialization Frameworks (eg: avro)
• It is not mandatory to implement or use writable
• Hadoop has API for pluggable serialization framework
• Package: org.apache.hadoop.io.serializer
• It has class WritableSerialization for implementing Serialization for
Writable types
• Parameter for customizing serialization: io.serializations
• Cloudera sets this value to both Writable and Avro serialization, which means that both Hadoop Writable objects and Avro objects can be serialized and deserialized out of the box
Avro primitive types
Describe file read and write paths
Determine how HDFS stores, reads, and writes files.
• If a file f1 of 400 MB has to be stored on a cluster with a block size of 128 MB, the file will be divided into 4 blocks (three of 128 MB and one of 16 MB).
• HDFS permits to read a file that is being written.
• HDFS uses checksum to validate both reads and writes at each block level. Checksums are stored
along with the blocks.
• HDFS logs the verification details persistently which assists in identifying bad disks.
Checksum
• Checksum files are used for data integrity.
• Whenever a block of a file is written a checksum file will be
generated.
• When a client reads a block, HDFS passes the pre-computed checksum to the client to ensure data integrity.
• HDFS logs the verification details persistently which assists in
identifying bad disks.
Describe file read and write paths
• Anatomy of file read
• Anatomy of file write
Anatomy of file read
Anatomy of file read
• The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an
instance of DistributedFileSystem (step 1) in Figure 3-2). DistributedFileSystem calls the namenode, using RPC,
to determine the locations of the blocks for the first few blocks in the file (step 2). For each block, the namenode
returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted
according to their proximity to the client (according to the topology of the cluster’s network; see Network
Topology and Hadoop). If the client is itself a datanode (in the case of a MapReduce task, for instance), the
client will read from the local datanode if that datanode hosts a copy of the block.
• The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client
for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and
namenode I/O.
Anatomy of file read
• The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses
for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is
streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the
end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best
datanode for the next block (step 5). This happens transparently to the client, which from its point of view is just
reading a continuous stream.
• Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads
through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks
as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
Anatomy of file read
• During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the
next closest one for that block. It will also remember datanodes that have failed so that it doesn’t needlessly
retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the
datanode. If a corrupted block is found, it is reported to the namenode before the DFSInputStream attempts to
read a replica of the block from another datanode.
• One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided
by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of
concurrent clients because the data traffic is spread across all the datanodes in the cluster. Meanwhile, the
namenode merely has to service block location requests (which it stores in memory, making them very efficient)
and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.
Anatomy of file write
Anatomy of file write
• The client creates the file by calling create() on DistributedFileSystem (step 1)
• DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no
blocks associated with it (step 2). The name node performs checks such as permissions and whether the file already exists.
• As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the
data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate
new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here
we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets
to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
Anatomy of file write
• DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by
datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged
by all the datanodes in the pipeline (step 5).
• If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to
the client writing the data. First, the pipeline is closed, and any packets in the ack queue are added to the front
of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The
current block on the good datanodes is given a new identity, which is communicated to the namenode, so that
the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed
datanode is removed from the pipeline, and the remainder of the block’s data is written to the two good
datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further
replica to be created on another node. Subsequent blocks are then treated as normal.
Anatomy of file write
• It's possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (which defaults to one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).
• When the client has finished writing data, it calls close() on the stream (step 6). This action flushes all the remaining
packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is
complete (step 7). The namenode already knows which blocks the file is made up of (via DataStreamer asking for block
allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.
Identify the commands to manipulate files in
the Hadoop File System Shell
• hadoop fs (Used to manage user spaces, directories and files)
• hadoop jar (Used to submit map reduce jobs)
• hdfs fsck (Used for administration of the cluster)
Exercise
• Understand daemon processes (Namenode, Secondary Namenode,
Datanode)
• Commands to stop and start HDFS daemons
• Copying data back and forth to HDFS
• Understand parameter files and data files
• Restore and recovery of Namenode
• Important parameters and their defaults (dfs.blocksize,
dfs.replication)
• Namenode Web UI
Interview questions
• What are different Hadoop, HDFS and Map Reduce daemons?
• How data can be copied in and out of HDFS?
• What is Namenode Web UI and what is default port number?
• How do you restore and recover namenode?
YARN and MapReduce version 2 (MRv2) (17%)
• Understand how upgrading a cluster from Hadoop 1 to Hadoop 2
affects cluster settings
• Understand how to deploy MapReduce v2 (MRv2 / YARN), including
all YARN daemons
• Understand basic design strategy for MapReduce v2 (MRv2)
• Determine how YARN handles resource allocations
• Identify the workflow of MapReduce job running on YARN
• Determine which files you must change and how in order to migrate a
cluster from MapReduce version 1 (MRv1) to MapReduce version 2
(MRv2) running on YARN.
Hadoop Cluster
(Processing)
• Mappers and Reducers are the tasks which processes data
• There are two frameworks to transition the job into tasks to process the data
• MRv1 “Classic” (Not covered in detail)
• Job Tracker (permanent – per cluster)
• Task Tracker (permanent – per node)
• Predetermined number of mappers and reducers
• MRv2/YARN
• Resource Manager (permanent – per cluster)
• Node Manager (permanent – per node)
• Application Master (transient – per job)
• Container (transient – per job per node)
• Here we cover YARN in more detail as it is the path forward and the default starting from Hadoop 2.x.
Understand how upgrading a cluster from
Hadoop 1 to Hadoop 2 affects cluster settings
Component | Hadoop 1 | Hadoop 2
HDFS | Single Namenode | HA and Federation
Map Reduce Job Management | MRv1 | YARN
Understand how upgrading a cluster from
Hadoop 1 to Hadoop 2 affects cluster settings
• Hadoop 1 – By default uses MRv1/Classic for job management
Parameter files – mapred-site.xml
Daemon Processes (classic) – Job Tracker, Task Tracker
• Hadoop 2 – By default uses MRv2/YARN for job management
Parameter files – mapred-site.xml and yarn-site.xml
Daemon Processes (YARN) – Resource Manager, Node Manager
Understand how to deploy MapReduce v2
(MRv2 / YARN), including all YARN daemons
• YARN Daemons
Resource Manager (typically 1, but can configure HA)
Node Manager
App timeline server
Job history server
Understand how to deploy MapReduce v2
(MRv2 / YARN), including all YARN daemons
• Parameter files
mapred-site.xml
yarn-site.xml
• Important parameters for YARN
• Starting YARN daemons
Using Cloudera Manager
Using command line
Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.resourcemanager.address | <ip_address>:<port> | Resource Manager IP and port
yarn-site.xml | yarn.resourcemanager.webapp.address | <ip_address>:<port> | Resource Manager web UI IP and port
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | 1024 | Minimum memory allocation for a container
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | 4096 | Maximum memory allocation for a container
yarn-site.xml | yarn.scheduler.minimum-allocation-vcores | 1 | Minimum number of virtual cores for a container
yarn-site.xml | yarn.scheduler.maximum-allocation-vcores | 4 | Maximum number of virtual cores for a container
yarn-site.xml | yarn.resourcemanager.scheduler.class | | Class which determines the scheduler – Fair or Capacity
Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapreduce.framework.name | yarn | Set to yarn to use MRv2/YARN for job management
mapred-site.xml | mapreduce.jobhistory.webapp.address | <ip_address>:<port> | Job history server Web UI IP address and port number
mapred-site.xml | yarn.app.mapreduce.am.* | | Parameters related to the application master
mapred-site.xml | mapreduce.map.java.opts | | JVM heap size for the child task of a map container
mapred-site.xml | mapreduce.reduce.java.opts | | JVM heap size for the child task of a reduce container
mapred-site.xml | mapreduce.map.memory.mb | | Size of the container for a map task
mapred-site.xml | mapreduce.map.cpu.vcores | 1 | Number of virtual cores required to run each map task
mapred-site.xml | mapreduce.reduce.memory.mb | | Size of the container for a reduce task
mapred-site.xml | mapreduce.reduce.cpu.vcores | 1 | Number of virtual cores required to run each reduce task
Understand basic design strategy for
MapReduce v2 (MRv2)
Hadoop 1.0: MapReduce (cluster resource management & data processing) running on HDFS (distributed, redundant and reliable storage)
Hadoop 2.0: MapReduce (data processing) and others (non map reduce based data processing) running on YARN (cluster resource management) over HDFS2 (distributed, redundant and reliable storage with highly available namenode)
Determine how YARN handles resource
allocations
• Question: How does YARN handle resource allocations?
• Answer: Using the Resource Manager, Node Manager and a per-job application master (unlike the job tracker and task tracker in MRv1/classic). We need to define several parameters for resource allocation (CPU/cores and memory); see the sketch below.
• yarn-site.xml has parameters at the node level
• mapred-site.xml has parameters at the task level
Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.resourcemanager.address | <ip_address>:<port> | Resource Manager IP and port
yarn-site.xml | yarn.resourcemanager.webapp.address | <ip_address>:<port> | Resource Manager web UI IP and port
yarn-site.xml | yarn.nodemanager.resource.memory-mb | 8096 | Total memory available for containers on each node
yarn-site.xml | yarn.nodemanager.resource.cpu-vcores | 4 | Total virtual cores available for containers on each node
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | 1024 | Minimum memory allocation for a container
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | 4096 | Maximum memory allocation for a container
yarn-site.xml | yarn.scheduler.minimum-allocation-vcores | 1 | Minimum number of virtual cores for a container
yarn-site.xml | yarn.scheduler.maximum-allocation-vcores | 4 | Maximum number of virtual cores for a container
yarn-site.xml | yarn.resourcemanager.scheduler.class | | Class which determines the scheduler – Fair or Capacity
Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapreduce.framework.name | yarn | Set to yarn to use MRv2/YARN for job management
mapred-site.xml | mapreduce.jobhistory.webapp.address | <ip_address>:<port> | Job history server Web UI IP address and port number
mapred-site.xml | yarn.app.mapreduce.am.* | | Parameters related to the application master
mapred-site.xml | mapreduce.map.java.opts | 0.8 * mapreduce.map.memory.mb | JVM heap size for the child task of a map container
mapred-site.xml | mapreduce.reduce.java.opts | 0.8 * mapreduce.reduce.memory.mb | JVM heap size for the child task of a reduce container
mapred-site.xml | mapreduce.map.memory.mb | | Size of the container for a map task
mapred-site.xml | mapreduce.map.cpu.vcores | 1 | Number of virtual cores required to run each map task
mapred-site.xml | mapreduce.reduce.memory.mb | | Size of the container for a reduce task
mapred-site.xml | mapreduce.reduce.cpu.vcores | 1 | Number of virtual cores required to run each reduce task
Hadoop Cluster – Processing
(MRv1)
(diagram: a Job Tracker on the master node; each slave node runs a Task Tracker that executes mapper and reducer tasks)
Hadoop Cluster – Processing
(MRv2/YARN)
(diagram: a Resource Manager on the master node; each slave node runs a Node Manager hosting containers for the App Master, mappers and reducers)
Hadoop Cluster – Processing
(MRv2/YARN)
Resource Manager
• It manages nodes by tracking heartbeats from NodeManagers
• It manages containers
Handles application master requests for resources (like providing inputs for
creation of containers)
De-allocates expired or completed containers
• It manages per job application masters
Creates containers for application masters and also tracks their heartbeats
• It also manages security (if Kerberos is enabled)
Node Manager
• Communicates with Resource Manager. It sends information about
node resources, heartbeats, container status etc.
• Manages processes in containers
Launches Application Masters on request from Resource Manager
Launches containers (mappers/reducers) on request from Application Master
Monitors resource usage by containers (mappers/reducers)
• Provides logging services to applications. It aggregates logs for an
application and saves those logs to HDFS.
• Runs auxiliary services
• Maintains node level security (ACLs)
Application Master
• It is created per job
• It keeps track of the progress of the job
Identify the workflow of MapReduce job
running on YARN
Determine which files you must change and how in order to migrate a
cluster from MapReduce version 1 (MRv1) to MapReduce version 2
(MRv2) running on YARN.
• MRv1
mapred-site.xml
• MRv2
mapred-site.xml and yarn-site.xml
• MRv1 to MRv2 (see the verification sketch below)
Set the framework to yarn in mapred-site.xml
The parameter file mapred-site.xml should not contain any parameters that belong in yarn-site.xml
Define the resource manager, node manager and other YARN related parameters in yarn-site.xml
Define core mapper and reducer related parameters in mapred-site.xml
The job history server needs to be defined in mapred-site.xml to aggregate logs to the job history server.
Important Parameters in MRv1/Classic
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapred.job.tracker | <ip_address>:8021 | Job Tracker IP address and port number
mapred-site.xml | mapred.job.tracker.http.address | <ip_address>:50030 | Job Tracker web UI IP address and port number
mapred-site.xml | mapred.system.dir | | HDFS directory to store Map Reduce control files
mapred-site.xml | mapred.local.dir | | Local directory to store intermediate data files (map output)
mapred-site.xml | mapred.jobtracker.taskScheduler | | Default is FIFO – Fair and Capacity are the viable options for production deployments
mapred-site.xml | mapred.queue.names | default | Can provide multiple queue names to set priorities while submitting jobs
mapred-site.xml | mapred.tasktracker.map.tasks.maximum | | Maximum map slots per task tracker
mapred-site.xml | mapred.tasktracker.reduce.tasks.maximum | | Maximum reduce slots per task tracker
mapred-site.xml | mapred.reduce.tasks | | Reduce tasks per job
Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.resourcemanager.address | <ip_address>:<port> | Resource Manager IP and port
yarn-site.xml | yarn.resourcemanager.webapp.address | <ip_address>:<port> | Resource Manager web UI IP and port
yarn-site.xml | yarn.nodemanager.resource.memory-mb | 8096 | Memory allocated for containers on each node manager
yarn-site.xml | yarn.nodemanager.resource.cpu-vcores | 4 | Vcores allocated for containers on each node manager
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | 1024 | Minimum memory allocation for a container
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | 4096 | Maximum memory allocation for a container
yarn-site.xml | yarn.scheduler.minimum-allocation-vcores | 1 | Minimum number of virtual cores for a container
yarn-site.xml | yarn.scheduler.maximum-allocation-vcores | 4 | Maximum number of virtual cores for a container
yarn-site.xml | yarn.resourcemanager.scheduler.class | | Class which determines the scheduler – Fair or Capacity
Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapreduce.framework.name | yarn | Set to yarn to use MRv2/YARN for job management
mapred-site.xml | mapreduce.jobhistory.webapp.address | <ip_address>:<port> | Job history server Web UI IP address and port number
mapred-site.xml | yarn.app.mapreduce.am.* | | Parameters related to the application master
mapred-site.xml | mapreduce.map.java.opts | | JVM heap size for the child task of a map container
mapred-site.xml | mapreduce.reduce.java.opts | | JVM heap size for the child task of a reduce container
mapred-site.xml | mapreduce.map.memory.mb | | Size of the container for a map task
mapred-site.xml | mapreduce.map.cpu.vcores | 1 | Number of virtual cores required to run each map task
mapred-site.xml | mapreduce.reduce.memory.mb | | Size of the container for a reduce task
mapred-site.xml | mapreduce.reduce.cpu.vcores | 1 | Number of virtual cores required to run each reduce task
Hadoop Cluster Planning (16%)
• Principal points to consider in choosing the hardware and operating systems to host an Apache
Hadoop cluster.
• Analyze the choices in selecting an OS
• Understand kernel tuning and disk swapping
• Given a scenario and workload pattern, identify a hardware configuration appropriate to the
scenario
• Given a scenario, determine the ecosystem components your cluster needs to run in order to
fulfill the SLA
• Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload,
including CPU, memory, storage, disk I/O
• Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing
requirements in a cluster
• Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and
propose or identify key network design components for a given scenario
Typical Hadoop Cluster
(diagram: many slave nodes, each running HDFS and YARN daemons, connected through network switches)
Typical Hadoop Cluster
(diagram: many slave nodes, each running a Datanode (DN) and Node Manager (NM); master nodes running the Namenode (NN), Secondary Namenode (SNN) and Resource Manager (RM); all connected through network switches)
Principal points to consider in choosing the hardware and
operating systems to host an Apache Hadoop cluster.
• Hardware (Hadoop 2.x)
Different hardware for gateway/client nodes, master nodes and slave nodes
Slave nodes will have both Datanodes and Nodemanagers
Master nodes will have Namenode and Resourcemanager on different nodes
 More than 1 node for masters in production
 Typical Configuration: One for Namenode, one for secondary namenode and one for
resourcemanager
 HA configuration: One for Namenode, one for standby namenode and one or more
Resourcemanagers
 Federation configuration: More than one namenode, secondary or standby for each
namenode and one or more resourcemanagers
Principal points to consider in choosing the hardware and
operating systems to host an Apache Hadoop cluster.
• Slave Configuration
4x1TB or 4x2TB hard drives (just a bunch of disks) without RAID configuration
At least 2 quad-core CPUs
24 to 32 GB RAM
Gigabit Ethernet
• Multiples of 1 hard drive, 2 cores and 6-8 GB RAM work well for I/O bound applications
• Buy as many nodes as the budget allows, while choosing components based on performance
• The more nodes, the better the performance
Principal points to consider in choosing the hardware and
operating systems to host an Apache Hadoop cluster.
• Slave Configuration
 Quad core or hex core
 Enable Hyper threading
 Slave nodes are typically not CPU bound
 Containers needs to be configured for processing the data (YARN)
 Each container can take up to 2 GB of RAM for Map and Reducer tasks
 Slaves should not use virtual memory
 Need to consider other Hadoop eco system tools such as HBase, Impala etc while configuring
YARN
 More spindles will be better and more hard disks might be better
 3.5 inch disks are better than 2.5 inch disks
 7,200 RPM disks should be fine compared to SSD and 15,000 RPM disks
 24 TB is a reasonable maximum on each of the slave nodes
 Do not use virtualization
 Blade servers are not recommended
Principal points to consider in choosing the hardware and
operating systems to host an Apache Hadoop cluster.
• Master Configuration
Spend more money on master nodes compared to slaves
Carrier class (instead of commodity hardware) unlike slaves
Dual power supplies
Dual Ethernet cards
RAID configuration for hard drives which store FS Image and Edit logs
More memory is better (depends on how much data is stored in the cluster)
*Network is covered later
Principal points to consider in choosing the hardware and
operating systems to host an Apache Hadoop cluster.
• Typically Linux based systems are used
• Disable SELinux
• Increase nofile ulimit for hadoop users such as mapred and hdfs to at
least 32k
• Disable IPv6
• Install and configure ntp daemon – to synchronize time
Analyze the choices in selecting an OS
• Operating systems
CentOS (slaves) and RHEL (masters)
Fedora Core: typically used for individual workstations also can be used
Ubuntu (uses Debian)
SUSE (popular in Europe)
Solaris (not popular in production clusters)
Understand kernel tuning and disk swapping
• Kernel tuning – important when deploying any server or database (see the sketch below)
/etc/sysctl.conf
Reduce vm.swappiness (set it to 0 or a low value)
vm.overcommit_memory (needs to be enabled for Hadoop streaming jobs)
Disable IPv6
Increase ulimit parameters for the users who own the hadoop daemons
Mount data disks with noatime (access time need not be updated for blocks that are stored as physical files)
TCP tuning
And many more
• Disk Swapping
Make sure memory is configured properly to reduce swapping between main memory and virtual memory
Given a scenario and workload pattern, identify a
hardware configuration appropriate to the scenario
• Hardware Configuration
• You have to understand the scenario and then come up with
hardware configuration
Given a scenario, determine the ecosystem components
your cluster needs to run in order to fulfill the SLA
• Hadoop (HDFS and YARN) – Hadoop core components
• Hive – Logical database which can define tables/structure on data in HDFS and
queried using SQL type syntax
• Pig – Data flow language which can process structured, semi-structured as well as
unstructured data
• Sqoop – Import and export tool to copy data from relational databases to HDFS
and vice versa
• Impala – Ad hoc querying
• Oozie – workflow tool
• Spark – in memory processing
• Flume – to get data from weblogs into HDFS
• etc
Cluster sizing: given a scenario and frequency of execution, identify the
specifics for the workload, including CPU, memory, storage, disk I/O
• Workload – Identify the workload on the cluster including all the
applications that are part of Hadoop eco system as well as
complement applications.
• CPU – Need to count the number of cores configured in the cluster
• Memory – Total memory in the cluster
• Storage – Amount of data that can be stored in the cluster
• Disk I/O – Amount of read and write operations in the cluster
• Cloudera displays all this information as charts in cloudera manager
home page
Disk Sizing and Configuration, including JBOD versus RAID,
SANs, virtualization, and disk sizing requirements in a cluster
• JBOD – Just Bunch Of Disks
 JBOD should be used to mount storage on to slave nodes for HDFS
 RAID should not be used as fault tolerance is implemented by replication factor
 LVM should be disabled
• RAID
 RAID configuration might be considered to store edit logs and fs image of name node.
• SAN (network storage)
 SAN might be used for a copy of edit logs and fs image but not for HDFS
• Virtualization
 Virtualization should not be used.
• Disk Sizing Requirements
 One hard drive (1-2 TB), 2 cores and 6-8 GB RAM works well for most of the configurations.
 Disk sizing requirements for HDFS = Size of data that needs to be stored * average replication factor
 If you want to store 100 TB of data with average replication factor of 3.5, then 350 TB of storage needs to be
provisioned
Network Topologies: understand network usage in Hadoop (for both
HDFS and MapReduce) and propose or identify key network design
components for a given scenario
• Network usage in Hadoop
 HDFS
 Cluster housekeeping traffic (minimal)
 Client metadata operations on namenode (minimal)
 Block data transfer (can be network intensive, eg: disk/node failure)
 Map Reduce
 Shuffle and Sort phase between mapper and reducer will use network
• Network design
 1 Gb – Cheaper
 10 Gb – expensive but performance might not benefit much for HDFS and Map Reduce
(might help HBase)
 Fiber optics need not be necessary
 North/South traffic pattern
 East/West traffic pattern (Hadoop exhibits)
 Tree structure vs. Spine Fabric
Network Design – Tree Structure
Network Design – Spine Fabric
Hadoop Cluster Installation and
Administration (25%)
• Given a scenario, identify how the cluster will handle disk and machine
failures
• Analyze a logging configuration and logging configuration file format
• Understand the basics of Hadoop metrics and cluster health monitoring
• Identify the function and purpose of available tools for cluster monitoring
• Be able to install all the ecosystem components in CDH 5, including (but not
limited to): Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and
Pig
• Identify the function and purpose of available tools for managing the
Apache Hadoop file system
Given a scenario, identify how the cluster will
handle disk and machine failures
• Handling disk and machine failures
HDFS
 Replication factor is used to address both disk and machine failures.
 In multi rack configuration, it can effectively handle network switch failures as well.
Map Reduce
 MRv1/Classic
 MRv2/YARN
Rack awareness (HDFS and Map Reduce)
Rack Awareness
Given a scenario, identify how the cluster will
handle disk and machine failures
• MapReduce v1 (MRv1) – Fault Tolerance
• Task Failure
• Failed due to a bug in the mapper/reducer code
• Bugs in the JVM
• Hung tasks
• The number of task attempts is controlled by mapred.map.max.attempts and mapred.reduce.max.attempts (default 4)
• If some failures of a job can be ignored, use mapred.*.max.failures.percent (* => map/reduce)
• Speculative execution – enabled by default; multiple tasks might process the same data in case of slowness due to hardware-related failures (servers, memory, network etc.)
• Task Tracker Failure
• If there are no heartbeats from a task tracker to the job tracker for 10 minutes, that task tracker will be removed from the pool
• If there are too many failures (default 4) for a task tracker, it will be blacklisted - mapred.max.tracker.blacklists
• If there are too many failures (default 4) for a task tracker per job, it will be blacklisted - mapred.max.tracker.failures
• Job Tracker Failure
• The Job Tracker is the master for scheduling all jobs
• The Job Tracker is a single point of failure
• If it fails, no jobs can be run
Given a scenario, identify how the cluster will
handle disk and machine failures
• MapReduce v2 (MRv2/YARN) – Fault Tolerance
• Task Failure (mostly the same as classic/MRv1)
• Application Master Failure
• If the application master fails, the job fails. This can be controlled by yarn.resourcemanager.am.max.retries (default 1)
• Node Manager Failure
• If there are no heartbeats from a Node Manager to the Resource Manager for 10 minutes (default), that node manager will be removed from the pool
• Resource Manager Failure
• Although the probability of Resource Manager failure is relatively low, no jobs can be submitted until the RM is brought back up and running.
• High availability can be configured in YARN, which means multiple RMs run in the cluster. There is no high availability in MRv1 (only one job tracker)
Analyze a logging configuration and logging
configuration file format
• Hadoop 1.x
Log files are stored locally where map/reduce tasks run
In some cases, it used to be tedious to troubleshoot using logs that are
scattered across multiple nodes in the cluster
• Hadoop 2.x
Log files are stored locally where map/reduce tasks run
Provides additional features to store logs in HDFS. So if any job fails we need
not go through multiple nodes to troubleshoot the issue. We can get the
details from HDFS.
Analyze a logging configuration and logging
configuration file format
• Default Hadoop logs location $HADOOP_HOME/logs
 In hadoop-env.sh, HADOOP_LOG_DIR has to be set to a different value in production clusters
 Typically under /var/log/hadoop
 CDH5 stores under /var/log and directory for each of the sub process
 Two log files *.log and *.out
 Log files are rotated daily (depending up on log4j configuration)
 Out files are rotated on restarts. They typically store information during daemon startup and do not contain much information.
 Log file naming convention: hadoop-<user-running-hadoop>-<daemon>-<hostname>.{log|out}
 Default log level – INFO
 Log level can be set for any specific class with log4j.logger.class.name = LEVEL
 Valid log levels: FATAL, ERROR, WARN, INFO, DEBUG, TRACE
• HDFS
 Settings in log4j.properties
• MRv2/YARN
 Settings in log4j.properties
 Settings in yarn-site.xml
HDFS – Log Configuration
• Update log4j.properties
MRv2/YARN – Log Configuration
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.log-aggregation-enable | true or false | Enable or disable log aggregation
yarn-site.xml | yarn.log-aggregation.retain-seconds | 604800 | How long (in seconds) to retain aggregated logs
yarn-site.xml | yarn.nodemanager.log-dirs | | Local directories where container logs are written
yarn-site.xml | yarn.nodemanager.log.retain-seconds | 10800 | How long (in seconds) to retain logs on the local node when aggregation is disabled
yarn-site.xml | yarn.nodemanager.remote-app-log-dir | | HDFS directory where aggregated logs are stored
yarn-site.xml | yarn.nodemanager.remote-app-log-dir-suffix | | Suffix appended under the remote log directory
Daemon log levels and appenders are configured in log4j.properties.
Understand the basics of Hadoop metrics and
cluster health monitoring
• Hadoop Metrics
 jvm
 dfs
 mapred
 rpc
 Source and sink
• Metrics will be collected from various sources (daemons) and pushed to sink (eg: Ganglia). Rules can be
defined to filter out the metrics that are not required by the sinks.
• Sample hadoop-metrics2.properties
# hadoop-metrics2.properties
# By default, send metrics from all sources to the sink
# named 'file', using the implementation class FileSink.
*.sink.file.class = org.apache.hadoop.metrics2.sink.FileSink
# Override the parameter 'filename' in 'file' for the namenode.
namenode.sink.file.filename = namenode-metrics.log
# Send the jobtracker metrics into a separate file.
jobtracker.sink.file.filename = jobtracker-metrics.log
Understand the basics of Hadoop metrics and
cluster health monitoring
• Cluster Health Monitoring examples
 Monitoring hadoop daemons
 Alert if daemon goes down
 Monitoring disks
 Alert immediately if disk fails
 Warn when usage on disk reaches 80%
 Critical alert when usage on disk reaches 90%
 Monitoring CPU on master nodes
 Alert excessive CPU usage on masters
 Excessive CPU usage on slaves is typical
 Monitor swap usage on all nodes
 Alert if swap partition is used
 Monitor network transfer speeds
 Monitor checkpoints by secondary namenode
 Age of fsimage
 Size of edit logs
Understand the basics of Hadoop metrics and
cluster health monitoring
• Health Monitoring
 Thresholds are fluid
 Start conservative and change them over time
 Avoid unnecessary alerting
 Alerting can be set at host level, overall, HDFS specific and Map Reduce specific
• Host level checks
• Overall Hadoop checks
• HDFS checks
• Map Reduce checks
• CDH5 – monitoring and alerts
http://www.cloudera.com/content/cloudera/en/documentation/clouderamanager/v5-latest/Cloudera-Manager-DiagnosticsGuide/cm5dg_monitoring_settings.html
Identify the function and purpose of available
tools for cluster monitoring
• Ganglia
• Nagios
• Cacti
• Hyperic
• Zabbix
• Cloudera Manager
• Ambari
• Many more
Be able to install all the ecoystem components in CDH 5,
including (but not limited to): Impala, Flume, Oozie, Hue,
Cloudera Manager, Sqoop, Hive, and Pig
• Cloudera Manager
• Hive
• Impala
• Flume
• Sqoop
• Pig
• Zookeeper
• HBase
• Oozie
• Hue
Cloudera Manager
Hive (Architecture)
Hive
• Architecture
• Dependencies
 HDFS
 Map Reduce
 Relational DB for Metastore (MySQL or PostgreSQL)
• Daemon Processes
 No additional daemon processes on Hadoop cluster
 Zookeeper
 Hive Server
 Hive Metastore
• Configuration using CDH5
 Need to configure Hive Metastore (after relational database is already installed)
• Validation
 Logs
 Running simple queries
Impala
• Architecture
• Dependencies
HDFS only (does not require map reduce)
Metastore (Hive)
• Daemon Processes
Impalad
Statestore (statestored)
Catalog Server (catalogd)
• Configuration using CDH5
• Validation
Logs
Running simple queries
Flume
• Architecture
• Dependencies
HDFS
Map Reduce
• Daemon Processes
• Configuration using CDH5
• Validation
Sqoop (Architecture)
Sqoop Architecture
Sqoop2 (Architecture)
Sqoop
• Architecture
• Dependencies
HDFS
Map Reduce
• Daemon Processes
• Configuration using CDH5
• Validation
Pig
• Architecture
Install pig binaries on the gateway node
• Dependencies
HDFS
Map Reduce
• Daemon Processes
None
• Configuration using CDH5
• Validation
Zookeeper
• Architecture
• Dependencies
• Daemon Processes
• Configuration using CDH5
• Validation
HBase (Architecture)
(diagram: HBase Masters and Zookeeper run alongside the Namenode and Resource Manager; HBase Region Servers run alongside the Datanodes and Node Managers)
HBase
• Architecture
• Dependencies
HDFS
Zookeeper
• Daemon Processes
Masters (at least 3)
Region Servers
• Configuration using CDH5
• Validation
Oozie
• Architecture
• Dependencies
HDFS
Map Reduce
Require other components to run different workflows (for eg: Hive, Pig etc)
• Daemon Processes
Oozie Server
• Configuration using CDH5
• Validation
Hue
• Architecture
• Dependencies
All Hadoop eco system tools that needs to be accessed using Hue UI
• Daemon Processes
• Configuration using CDH5
• Validation
Identify the function and purpose of available
tools for managing the Apache Hadoop file system
• HDFS Federation
Load balancing Namenodes
• HDFS HA (Active/Passive)
Transparent fail over using journal nodes
• Namenode UI (default port: 50070)
• Datanode UI (default port: 50075)
• hadoop fs (command line utility)
• hdfs (command line utility)
Resource Management (10%)
• Understand the overall design goals of each of Hadoop schedulers
• Given a scenario, determine how the FIFO Scheduler allocates cluster
resources
• Given a scenario, determine how the Fair Scheduler allocates cluster
resources under YARN
• Given a scenario, determine how the Capacity Scheduler allocates
cluster resources
Understand the overall design goals of each
of Hadoop schedulers
• FIFO Scheduler – First In First Out
Default scheduler
Not suitable for production deployments
• Fair Scheduler
Uses available containers as criteria
• Capacity Scheduler
Uses available capacity as criteria
Given a scenario, determine how the FIFO
Scheduler allocates cluster resources
• FIFO – First in First out
Given a scenario, determine how the Fair
Scheduler allocates cluster resources under YARN
Given a scenario, determine how the Capacity
Scheduler allocates cluster resources
Monitoring and Logging (15%)
• Understand the functions and features of Hadoop’s metric collection
abilities
• Analyze the NameNode and JobTracker Web UIs
• Understand how to monitor cluster Daemons
• Identify and monitor CPU usage on master nodes
• Describe how to monitor swap and memory allocation on all nodes
• Identify how to view and manage Hadoop’s log files
• Interpret a log file
Understand the functions and features of
Hadoop’s metric collection abilities
• Hadoop Metrics
 jvm
 dfs
 mapred
 rpc
 Source and sink
• Metrics will be collected from various sources (daemons) and pushed to sink (eg: Ganglia). Rules can be
defined to filter out the metrics that are not required by the sinks.
• Sample hadoop-metrics2.properties
# hadoop-metrics2.properties
# By default, send metrics from all sources to the sink
# named 'file', using the implementation class FileSink.
*.sink.file.class = org.apache.hadoop.metrics2.sink.FileSink
# Override the parameter 'filename' in 'file' for the namenode.
namenode.sink.file.filename = namenode-metrics.log
# Send the jobtracker metrics into a separate file.
jobtracker.sink.file.filename = jobtracker-metrics.log
Analyze the NameNode and JobTracker Web
UIs
• Namenode Web UI
Architecture recap
• JobTracker Web UI
Architecture recap
• ResourceManager Web UI
Architecture recap
Resource Manager
Application Master
Job History Server
Understand how to monitor cluster Daemons
• Cluster Daemons
HDFS
 proc_namenode
 proc_secondarynamenode
 proc_datanode
Map Reduce (MRv1/classic)
 proc_jobtracker
 proc_tasktracker
Map Reduce (MRv2/YARN)
 proc_resourcemanager
 proc_nodemanager
• ps command (ps -fu, ps -ef)
• Service command
• Cloudera Manager
Identify and monitor CPU usage on master
nodes
• Uptime
• Top
• Cpustat
• Cloudera Manager
Describe how to monitor swap and memory
allocation on all nodes
• Top command
• Cloudera Manager
Identify how to view and manage Hadoop’s
log files
• HDFS
Command Line
Web UI
• MapReduce
Command Line
Web UI
Interpret a log file
• Using Web UI navigate through logs and interpret the information to
monitor the cluster, running jobs and troubleshoot any issues.
Miscellaneous
• Compression