Carson_MapReduce

By Carson Gallimore
Overview
 Introduction
 Implementation
 Execution




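MapReduce is a programming model used to process large data sets.
Google needed an abstraction to hide the messy details of distributed computation.
The result was a library that handles parallelization, fault tolerance, data distribution, and load balancing.
The interface is simple and powerful: the user supplies a map function and a reduce function.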
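To make the interface concrete, here is a minimal word-count sketch in Python. The names map_fn, reduce_fn, and run_word_count are hypothetical and are not Google's API; the driver runs in a single process and only stands in for the library, which would distribute the same two functions across a cluster.

```python
from collections import defaultdict
from typing import Iterator


def map_fn(doc_name: str, contents: str) -> Iterator[tuple[str, int]]:
    """Emit (word, 1) for every word in one input document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: list[int]) -> tuple[str, int]:
    """Sum all counts emitted for a single word."""
    return word, sum(counts)


def run_word_count(docs: dict[str, str]) -> dict[str, int]:
    """Hypothetical single-process driver standing in for the library."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for name, text in docs.items():
        for word, count in map_fn(name, text):
            grouped[word].append(count)  # group values by intermediate key
    return dict(reduce_fn(w, c) for w, c in grouped.items())
```

The user writes only map_fn and reduce_fn; everything in run_word_count is work the library would take over.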


The right choice of MapReduce implementation depends on the environment.
Google’s setup:
 Machines are typically dual-processor x86 with 2-4 GB of memory
 100 Mb/s or 1 Gb/s networking hardware
 Hundreds or thousands of machines per cluster
 Inexpensive IDE disks attached directly to individual machines
 Input data is partitioned into M splits, which map tasks can process in parallel.
 The intermediate key space is partitioned into R pieces by a partitioning function, e.g. hash(key) mod R.
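As a rough illustration of where M and R come from, the sketch below computes M from an input size and a split size, and assigns an intermediate key to one of R pieces. The function names and the use of CRC32 as the hash are assumptions made for this example, not details of Google's library.

```python
import zlib


def num_splits(input_bytes: int, split_bytes: int = 64 * 1024 * 1024) -> int:
    """M: number of splits, rounding up so the last partial split counts."""
    return (input_bytes + split_bytes - 1) // split_bytes


def partition(key: str, r: int) -> int:
    """Index (0..r-1) of the reduce piece that receives this key."""
    # CRC32 is used here because Python's built-in hash() for strings
    # changes between processes, and every worker must agree on the result.
    return zlib.crc32(key.encode("utf-8")) % r
```

With 64 MB splits, 1 GB of input gives num_splits(1024**3) == 16 map tasks. Every occurrence of the same key lands in the same piece, which is what lets one reduce task see all values for that key.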
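1) The MapReduce library splits the input files into pieces of 16 to 64 MB each and starts copies of the program on the cluster.
2) The master assigns map and reduce tasks to idle workers.
3) A map worker reads its input split, passes each key/value pair to the user's Map function, and buffers the output pairs in memory.
4) Buffered pairs are periodically written to local disk, partitioned into R regions.
5) A reduce worker uses remote procedure calls to read the buffered data from the map workers’ local disks, then sorts it by intermediate key.
6) The reduce worker passes each key and its values to the user's Reduce function; the results are appended to a final output file for that reduce partition.
7) When all map and reduce tasks have completed, the master wakes the user program, and the output is available in R output files.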
Image source: Dean and Ghemawat, Commun. ACM 51, 1 (2008), p107-dean.pdf (ACM Digital Library)
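The seven execution steps can be walked through in a single-process simulation. This is only a sketch under simplifying assumptions: there is no master, no network, and no real disk; Python lists stand in for the local-disk regions, dicts stand in for output files, and run_mapreduce, word_map, and word_sum are hypothetical names.

```python
import zlib
from collections import defaultdict
from typing import Callable, Iterable

KV = tuple[str, int]


def run_mapreduce(
    splits: list[str],
    map_fn: Callable[[str], Iterable[KV]],
    reduce_fn: Callable[[str, list[int]], int],
    r: int,
) -> list[dict[str, int]]:
    # Steps 1-2: 'splits' is the already-split input; each split is one map task.
    # Steps 3-4: each map task writes its pairs into R regions (lists stand in
    # for files on the map worker's local disk).
    map_outputs: list[list[list[KV]]] = []
    for split in splits:
        regions: list[list[KV]] = [[] for _ in range(r)]
        for key, value in map_fn(split):
            regions[zlib.crc32(key.encode("utf-8")) % r].append((key, value))
        map_outputs.append(regions)

    # Steps 5-6: reduce task i collects region i from every map task,
    # sorts by key, groups the values, and writes one output "file" (a dict).
    outputs: list[dict[str, int]] = []
    for i in range(r):
        pairs = sorted(pair for regions in map_outputs for pair in regions[i])
        grouped: dict[str, list[int]] = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        outputs.append({k: reduce_fn(k, v) for k, v in grouped.items()})

    # Step 7: returning here corresponds to the MapReduce call returning.
    return outputs


def word_map(split: str) -> Iterable[KV]:
    """Example Map function: emit (word, 1) per word."""
    return [(w, 1) for w in split.split()]


def word_sum(key: str, values: list[int]) -> int:
    """Example Reduce function: total count for one word."""
    return sum(values)
```

Calling run_mapreduce(["a b a", "b c"], word_map, word_sum, 3) returns three output dicts, one per reduce partition, and each word appears in exactly one of them.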

Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113; doi.acm.org/10.1145/1327452.1327492.