COSC6376 Cloud Computing
Lecture 4. HDFS, MapReduce Implementation, and Spark
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Outline
• GFS/HDFS
• MapReduce Implementation
• Spark

Distributed File System

Goals of HDFS
• Very large distributed file system
  - 10K nodes, 100 million files, 10 PB
• Assumes commodity hardware
  - Files are replicated to handle hardware failure
  - Detect failures and recover from them
• Optimized for batch processing
  - Data locations are exposed so that computations can move to where the data resides
  - Provides very high aggregate bandwidth

Assumptions
• Inexpensive components that often fail
• Large files
• Large streaming reads and small random reads
• Large sequential writes
• Multiple users append to the same file
• High bandwidth is more important than low latency

Hadoop Cluster

Failure Trends in a Large Disk Drive Population
• The data are broken down by the age a drive was when it failed.

Architecture
• Chunks
  - File chunks and the locations of chunks (replicas)
• Master server
  - A single master
  - Keeps metadata and accepts requests on metadata
  - Performs most management activities
• Chunk servers
  - Multiple servers
  - Keep chunks of data
  - Accept requests on chunk data

User Interface
• Commands for the HDFS user:
  - hadoop dfs -mkdir /foodir
  - hadoop dfs -cat /foodir/myfile.txt
  - hadoop dfs -rm /foodir/myfile.txt
• Commands for the HDFS administrator:
  - hadoop dfsadmin -report
  - hadoop dfsadmin -decommission datanodename
• Web interface
  - http://host:port/dfshealth.jsp

Design Decisions
• Single master
  - Simplifies the design
  - Single point of failure
  - Limits the number of files
• Metadata kept in memory
• Large chunk size (e.g., 64 MB)
  - Advantages: reduces client-master traffic, reduces network overhead (fewer network interactions), keeps the chunk index small
  - Disadvantage: does not favor small files

Master: Metadata
• Metadata is stored in memory
• Namespaces
  - Directories and their physical locations
• Files
  - Chunks and chunk locations
• Chunk locations
  - Not stored by the master; sent by the chunk servers
• Operation log

Master Operations
• All namespace operations
  - Name lookup
  - Create/remove directories/files, etc.
• Manage chunk replicas
  - Placement decisions
  - Create new chunks and replicas
  - Balance load across all chunkservers
  - Garbage reclamation

Master: Chunk Replica Placement
• Goals: maximize reliability, availability, and bandwidth utilization
• Physical location matters
  - Lowest cost within the same rack
  - "Distance": number of network switches
• In practice (Hadoop), with 3 replicas:
  - Two replicas in the same rack
  - The third one in another rack
• Choice of chunkservers
  - Low average disk utilization
  - Limited number of recent writes, to distribute write traffic

Master: Chunk Replica Placement (continued)
• Re-replication
  - Replicas are lost for many reasons
  - Prioritized: chunks with few replicas, chunks of live files, actively used chunks
  - Placement follows the same principles as above
• Rebalancing
  - Redistribute replicas periodically
  - Better disk utilization
  - Load balancing

Master: Garbage Collection
• Lazy mechanism
  - Mark deletion at once, reclaim resources later
• Regular namespace scan
  - For deleted files: remove metadata after three days (full deletion)
  - For orphaned chunks: let chunkservers know they are deleted
• Stale replicas
  - Detected using chunk version numbers

File Read and Write
[Figure: HDFS file read path — 1. the client opens the file through the DistributedFileSystem; 2. the NameNode returns the block locations; 3. the client reads through an FSDataInputStream; 4. data is read from the closest DataNode, then 5. from the next closest DataNode for later blocks; 6. the client closes the stream.]
[Figure: HDFS file write path — 1. the client creates the file through the DistributedFileSystem; 2. the NameNode records the new file; 3. the client writes through an FSDataOutputStream; 4. the NameNode provides a list of 3 DataNodes; 5. packets are written along the DataNode pipeline and 6. acknowledged back; 7. the client closes the stream; 8. the NameNode marks the file complete.]
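To make the read path above concrete, here is a minimal client-side sketch using the standard Hadoop FileSystem API. The NameNode address and the /foodir/myfile.txt path (reused from the User Interface slide) are illustrative placeholders, not part of the original slides.

```java
// Minimal sketch of the HDFS client-side read path, assuming a standard
// Hadoop client installation. Addresses and paths are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // placeholder NameNode address

        // 1. open: the client asks the DistributedFileSystem for the file;
        // 2-3. the NameNode returns block locations and the client reads
        //      through an FSDataInputStream
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // 4-5. bytes are streamed from the closest available DataNodes
                System.out.println(line);
            }
        }  // 6. close
    }
}
```

The write path is symmetric from the client's point of view: fs.create(new Path(...)) returns an FSDataOutputStream, and the DataNode pipeline in the figure is handled transparently by the client library.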
If a DataNode crashes during a write, the crashed node is removed from the pipeline, the current block receives a new ID so that the partial data on the crashed node can be deleted later, and the NameNode allocates another node.

Consistency
• It is expensive to maintain strict consistency
• GFS uses a relaxed consistency model
  - Better support for appending and checkpointing

Fault Tolerance
• High availability
  - Fast recovery
  - Chunk replication
  - Master replication: inactive backup
• Data integrity
  - Checksumming (CRC32)
  - Incremental checksum updates to improve performance: a chunk is split into 64 KB units and the checksum is updated after each unit is added

Cost

Discussion
• Advantages
  - Works well for large data processing
  - Uses cheap commodity servers
• Tradeoffs
  - Single-master design
  - Mostly reads and appends
• Latest upgrades (GFS II)
  - Distributed masters
  - Introduce the "cell": a number of racks in the same data center
  - Improved performance for random reads/writes

MapReduce Implementation

Execution
• How is this distributed?
  1. Partition the input key/value pairs into chunks and run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Partition the space of output map keys and run reduce() in parallel
• If a map() or reduce() task fails, re-execute it!

MapReduce - Dataflow

Map Implementation
[Figure: a map process reads a chunk (block) and partitions its output by key range (k1~ki, ki+1~kj, …) into R parts in the map task's local storage.]
• Map processes are allocated as close to the chunks as possible
• One node can run a number of map processes, depending on the configuration

Reducer Implementation
[Figure: each mapper's output is partitioned by key range (k1~ki, ki+1~kj, …); each of the R reducers pulls the partition for its key range from every mapper's output and writes one of the R final output files stored in the user-designated directory.]

Execution

Initialization
• Split the input file into 64 MB sections (GFS)
  - Read in parallel by multiple machines
• Fork the program off onto multiple machines
• One machine is the master
• The master assigns idle machines to either map or reduce tasks
• The master coordinates data communication between map and reduce machines

Overview of Hadoop/MapReduce

Partition Function
• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• The system uses a default partition function, e.g., hash(key) mod R

Partition Key-Value Pairs for Reduce
• Hadoop uses a hash code as its default method to partition key-value pairs.
• The default hash code of a string object in Java (used by Hadoop) is
  hash = w1·31^(n-1) + w2·31^(n-2) + … + wn
  where wn represents the nth character of a string of length n.
• A partition function typically takes the hash code of the key modulo the number of reducers to determine which reducer to send the key-value pair to.
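As a sketch of the hash(key) mod R scheme just described, the partitioner below mirrors the behavior of Hadoop's default HashPartitioner; the class itself is only an illustration written against the standard org.apache.hadoop.mapreduce API.

```java
// Illustration of partitioning by hash(key) mod R: each key-value pair is
// routed to one of the R reducers based on the key's hash code.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashModPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative,
        // then take hash(key) mod R to pick one of the R reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

A job would opt into such a class with job.setPartitionerClass(HashModPartitioner.class); without it, Hadoop applies the equivalent default behavior.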
Dataflow
• The input and final output are stored on a distributed file system
  - The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
• Intermediate results are stored on the local file systems of the map and reduce workers
• The output is often the input to another MapReduce task

How Many Map and Reduce Tasks?
• M map tasks, R reduce tasks
• Rule of thumb
  - Make M and R much larger than the number of nodes in the cluster
  - One DFS chunk per map is common
  - This improves dynamic load balancing and speeds up recovery from worker failure
• Usually R is smaller than M, because the output is spread across R files

Misc. Remarks

Parallelism
• map() functions run in parallel, creating different intermediate values from different input data sets
• reduce() functions also run in parallel, each working on a different output key
• All values are processed independently
• Bottleneck: the reduce phase cannot start until the map phase is completely finished

Locality
• The master program divides up tasks based on the location of the data: it tries to place map() tasks on the same machine as the physical file data, or at least in the same rack
• map() task inputs are divided into 64 MB blocks in HDFS (by default): the same size as Google File System chunks

Job Processing
[Figure: a JobTracker coordinating six TaskTrackers (TaskTracker 0-5) for a "wordcount" job.]
1. The client submits the "wordcount" job, indicating the code and input files
2. The JobTracker breaks the input file into k chunks (in this case 6) and assigns work to the TaskTrackers
3. After map(), the TaskTrackers exchange map output to build the reduce() keyspace
4. The JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns the work
5. The reduce() output may go to HDFS

Fault Tolerance
• Worker failure
  - If a map/reduce task fails, reassign it to other workers
  - If a node fails, redo its jobs on other nodes (chunk replicas make this possible)
• Master failure
  - Log/checkpoint the master process and the GFS master server
• Backup tasks for "stragglers"
  - The whole process is often delayed by a few workers, for various reasons (network, disk I/O, node failure, …)
  - As the job gets close to completion, the master starts multiple backup workers for each in-progress task
  - Without backup tasks, the job takes 44% longer than with them

Experiment: Effect of Backup Tasks, Reliability (example: sorting)

MapReduce Chaining

MapReduce Summary

MapReduce Conclusions
• MapReduce has proven to be a useful abstraction
• It greatly simplifies large-scale computations at Google
• The functional programming paradigm can be applied to large-scale applications
• Fun to use: focus on the problem and let the middleware deal with the messy details

In the Real World
• Applied to various domains in Google
  - Machine learning, clustering, reports, web page processing, indexing, graph computation, …
• Mahout library
  - Scalable machine learning and data mining
• Research projects

MapReduce: The Good
• Built-in fault tolerance
• Optimized I/O path
• Scalable
• The developer focuses on map/reduce, not infrastructure
• Simple(?) API
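To ground the "Simple(?) API" point and the "wordcount" job used in the Job Processing slide, here is a minimal word count sketch against the standard org.apache.hadoop.mapreduce API; the input and output paths are supplied as command-line arguments.

```java
// Minimal Hadoop word count: map() emits (word, 1), reduce() sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // map(): emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(): sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even this small job needs a mapper class, a reducer class, and driver boilerplate, which leads into the criticisms on the next slide.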
MapReduce: The Bad
• Optimized for disk I/O
  - Does not leverage memory well
  - Iterative algorithms go through the disk I/O path again and again
• Primitive API
  - Developers have to build on a very simple abstraction: key/value in, key/value out
  - Even basic things like joins require extensive code
• The result is often many files that need to be combined appropriately

Hadoop Acceleration

FPGA Cluster

FPGA for MapReduce

Density and Energy

ARM Cluster
• A 6-node, 6 TB Hadoop cluster on ARM

MapReduce Using GPUs
• A Graphics Processing Unit (GPU) is a specialized processor that offloads 3D or 2D graphics rendering from the CPU
• GPUs' highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms

GPGPU Programming
• Each block can have up to 512 threads that synchronize
• Millions of blocks can be issued

Programming Environment: CUDA
• Compute Unified Device Architecture (CUDA)
• A parallel computing architecture developed by NVIDIA
• The computing engine in GPUs is made accessible to software developers through industry-standard programming languages

Mars: A MapReduce Framework on Graphics Processors
[Figure: system workflow and configuration, from applications down to hardware.]

Mars: Performance
[Figure: performance speedup of the optimized Mars over Phoenix as the data size is varied.]

MapReduce Advantages/Disadvantages
• Now it is easy to program for many CPUs
  - Communication management is effectively gone
  - I/O scheduling is done for us
  - Fault tolerance and monitoring: machine failures, suddenly slow machines, etc. are handled
  - Can be much easier to design and program!
  - Can cascade several MapReduce tasks
• But … it further restricts the set of solvable problems
  - It might be hard to express a problem in MapReduce
  - Data parallelism is key: you need to be able to break the problem up by data chunks

In-Memory MapReduce

In-Memory Data Grid

Speed of In-Memory Data Grid
https://www.datanami.com/2013/10/21/accelerating_hadoop_mapreduce_using_an_in-memory_data_grid/

Real-time

Motivation
• Many important applications must process large streams of live data and provide results in near-real-time
  - Social network trends
  - Website statistics
  - Intrusion detection systems
  - etc.
• They require large clusters to handle the workloads
• They require latencies of a few seconds
(Source: Tathagata Das, Spark Streaming: Large-scale near-real-time stream processing)

Available Data Source
• Twitter public status updates
(Source: Jagane Sundar, Realtime Sentiment Analysis Application Using Hadoop and HBase)

Firehose
• "Firehose" is the name given to the massive, real-time stream of Tweets that flows from Twitter each day.
• Twitter provides access to this "firehose" using a streaming technology called XMPP.

Who Is Using It

Requirements
• Scalable to large clusters
• Second-scale latencies
• Simple programming model

Apache Spark
• Originally developed in UC Berkeley's AMP Lab
• Fully open sourced; now at the Apache Software Foundation
• Commercial vendors are developing and supporting Apache Spark
(Source: Apache Spark, MapR Technologies)

Spark: Easy and Fast Big Data
• Easy to develop
  - Rich APIs in Java, Scala, and Python
  - Interactive shell
• Fast to run
  - General execution graphs
  - In-memory storage
• 2-5× less code
(Source: Apache Spark, MapR Technologies)
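To illustrate the "2-5× less code" claim, here is the same word count expressed as a sketch with Spark's Java RDD API (Spark 2.x+); the Scala and Python versions are terser still. The HDFS input and output paths are placeholders.

```java
// Word count with Spark's Java RDD API. Intermediate RDDs can stay in
// memory across stages instead of going through the disk I/O path.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///input/text");   // placeholder input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                       // (word, 1)
                    .reduceByKey((a, b) -> a + b);                                  // sum per word
            counts.saveAsTextFile("hdfs:///output/wordcount");           // placeholder output path
        }
    }
}
```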