By Carson Gallimore

Introduction

MapReduce is used to process large data sets.
Google needed an abstraction to hide the messy details.
It created a library that handles parallelization, fault tolerance, data distribution, and load balancing.
The interface is simple and powerful.

Implementation

The right choice of implementation depends on the environment.
Google's setup:
    Typically dual x86 processors with 2-4 GB of memory
    100 Mb/s or 1 Gb/s networking hardware
    Hundreds or thousands of machines per cluster
    Inexpensive IDE disks attached directly to the machines

Execution Overview

Input data is partitioned into M splits.
The intermediate key space is partitioned into R pieces.

1) The input files are split into pieces of 16 to 64 MB each.
2) The master assigns tasks.
3) A worker reads its input split.
4) Buffered pairs are written to local disk.
5) A reduce worker makes a remote procedure call to retrieve data from the map worker's local disk.
6) Results are written to the final output file.
7) The master wakes the user program and returns all output files.

Image source: http://delivery.acm.org.lib-proxy.radford.edu/10.1145/1330000/1327492/p107-dean.pdf?ip=137.45.30.160&id=1327492&acc=ACTIVE%20SERVICE&key=C2716FEBFA981EF12121826FBBBC8745CAA78830607D395F&CFID=251780983&CFTOKEN=84609340&__acm__=1381072513_fc0731bbf986fa5f3ab0ebb5c02dc88f

Reference

Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008): 107–113; doi.acm.org/10.1145/1327452.1327492.
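
The steps in the Execution Overview can be sketched as a small single-process simulation. This is a minimal sketch under stated assumptions, not Google's implementation: the word-count job, the function names (split_input, partition, run_map_task, run_reduce_task, map_reduce), the CRC32-modulo-R partitioning, and the use of in-memory lists in place of local disks and remote procedure calls are all hypothetical choices made for illustration.

```python
from collections import defaultdict
import zlib


def split_input(records, m):
    """Partition the input records into m splits (step 1)."""
    return [records[i::m] for i in range(m)]


def partition(key, r):
    """Choose one of r reduce partitions for a key.

    Assumption of this sketch: a stable CRC32 hash of the key, modulo r.
    """
    return zlib.crc32(key.encode("utf-8")) % r


def map_words(line):
    """Hypothetical user map function: emit (word, 1) for each word."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(key, values):
    """Hypothetical user reduce function: sum the counts for one word."""
    return sum(values)


def run_map_task(split, map_fn, r):
    """Read one split and bucket its pairs into r regions (steps 3 and 4).

    The returned lists stand in for the map worker's local disk.
    """
    regions = [[] for _ in range(r)]
    for record in split:
        for key, value in map_fn(record):
            regions[partition(key, r)].append((key, value))
    return regions


def run_reduce_task(index, map_outputs, reduce_fn):
    """Fetch region `index` from every map task, group by key, and reduce
    (steps 5 and 6). The returned dict stands in for one output file.
    """
    grouped = defaultdict(list)
    for regions in map_outputs:
        for key, value in regions[index]:
            grouped[key].append(value)
    return {key: reduce_fn(key, grouped[key]) for key in sorted(grouped)}


def map_reduce(records, map_fn, reduce_fn, m=3, r=2):
    """Stand-in for the master: run m map tasks, then r reduce tasks
    (step 2), and hand back the r outputs (step 7).
    """
    splits = split_input(records, m)
    map_outputs = [run_map_task(s, map_fn, r) for s in splits]
    return [run_reduce_task(i, map_outputs, reduce_fn) for i in range(r)]


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    for i, part in enumerate(map_reduce(lines, map_words, reduce_counts)):
        print(f"output {i}: {part}")
```

Because every occurrence of a key hashes to the same partition, each word ends up in exactly one of the R outputs, which is what lets the reduce tasks run independently of one another.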