Final Exam Topics
CprE 419, Spring 2015
Iowa State University

HDFS and MapReduce
• Big data storage: how HDFS works, the alternatives to HDFS, the limitations of HDFS, and the fault tolerance of HDFS.
• The MapReduce model of computation and programming: how to cast a problem in the map/reduce framework, the examples discussed in class, and analysis of computation and communication cost in MapReduce.

MapReduce
• Hadoop MapReduce, including combiners and sorting using MapReduce (Lab 4).
• 1-2 problems on writing pseudocode for a problem using MapReduce.
• The use of different input formats in MapReduce. You don’t need to know specific syntax, but you do need to know the purpose of an input format and how you used custom input formats in Lab 5 (JSON parsing).

Pig/Hive/HBase
• You don’t need to remember syntax, but you do need to know the concepts. For example, we may give you an example program and ask you to tell us what it does and perhaps how to make it more efficient (work through the lab on Pig).
• Basic concepts of Hive and HBase: what are they, and how are they useful? How are they different from Pig and MapReduce?

Apache Spark
• How is it different from Hadoop? What are its advantages?
• Concept of an RDD.

Stream Processing Concepts
• How is it different from batch processing? Why can’t we use Hadoop for this, and why do we need new software tools?
• Algorithm design for stream processing: write pseudocode for data processing operators in a stream processing system. Concept of sliding and tumbling windows.

Infosphere Streams
• What does the operator graph mean?
• What are the semantics of some popular operators?
• We may show you some code and ask you to describe what it does and how it can be improved.

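As a reminder of the map/combine/reduce pattern and the communication-cost analysis listed under the MapReduce topics above, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `combine`, `run_job`) and the word-count task are assumptions chosen for illustration, and "communication cost" is approximated as the number of key-value pairs that would be shuffled from mappers to reducers.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def run_job(splits, use_combiner):
    """Run the toy job; return (word totals, number of pairs shuffled)."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)          # pairs that would cross the network
    groups = defaultdict(list)
    for word, count in shuffled:        # shuffle: group values by key
        groups[word].append(count)
    # Reducer: sum the values for each key.
    totals = {word: sum(counts) for word, counts in groups.items()}
    return totals, len(shuffled)

# Two hypothetical input splits, one per mapper.
splits = [["a b a", "a c"], ["b b", "a"]]
plain_totals, plain_cost = run_job(splits, use_combiner=False)
combined_totals, combined_cost = run_job(splits, use_combiner=True)
```

Both runs produce the same totals, but the run with the combiner shuffles fewer pairs, because each mapper sends at most one pair per distinct word instead of one pair per word occurrence.
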
MapReduce 1
Consider a dataset with one record for each person in a country, with the following information per person:
<latitude> <longitude> <person name>
Design a map-reduce algorithm to list all pairs of entries who live within D miles of each other and have the same name. Analyze the communication cost of the algorithm in terms of the input size.

MapReduce 2
Consider a dataset with one record for each person in a country, with the following information per person:
<latitude> <longitude> <zipcode> <person name>
Design a map-reduce algorithm to list all pairs of people who live in the same zipcode and whose names differ in at most two characters. Analyze the communication cost of the algorithm in terms of the input size.

Streams Background
Give examples of data streams with arrival rates of approximately the following:
1. one hundred items per second
2. one thousand items per second
3. one million items per second

Streams 1
Consider a stream of telephone call records, where every element of the stream is a tuple <src, dest, duration>, where src is the source phone number, dest is the destination phone number, and duration is the length of the call in seconds. Devise an algorithm to compute the (src, dest) pair that has the maximum number of calls among all possible pairs of phone numbers, over:
1. a count-based tumbling window of size N
2. a count-based sliding window of size N
For each, write the algorithm in pseudocode, and analyze the processing time per element and the memory consumption of the operator.

Streams 2
• Write an algorithm for each of the following stream processing tasks, run at a busy web server. The input stream, Requests, is a sequence of URLs requested by HTTP clients, in the format <rstring URL, rstring client_IP_Address>.
– The first output stream, Frequent, contains the most frequently requested URL so far. An item should be output into Frequent every 100 URLs seen in the input stream.
  Note that the scope of the aggregation is the entire stream so far.
– The next output stream, RecentFrequent, contains the most frequently requested URL over the last 10,000 URL requests. Further, an item should be output into RecentFrequent every 10,000 URLs, i.e., the aggregation must be performed over a tumbling window of size 10,000.
• For each of the above tasks, describe (1) the state your algorithm maintains and how it is initialized, (2) the action taken upon a tuple arrival, and (3) the action taken when there is a trigger for a query or when the window moves. Also analyze the space taken to maintain the state and the time taken for each of the above actions.
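
One possible way to approach the two Streams 2 tasks is sketched below in Python rather than in a stream processing language. This is an illustrative sketch, not a model answer: the class and method names (`Frequent`, `RecentFrequent`, `on_tuple`) are assumptions, ties are broken in favor of the URL that reached the maximum count first, and the small `period` and `size` values in the usage lines stand in for the 100 and 10,000 given in the problem.

```python
from collections import Counter

class Frequent:
    """Most frequent URL over the whole stream, output every `period` tuples."""
    def __init__(self, period=100):
        self.period = period
        self.counts = Counter()   # state: one counter per distinct URL seen
        self.seen = 0             # number of tuples processed so far
        self.best_url, self.best_count = None, 0

    def on_tuple(self, url, client_ip):
        self.seen += 1
        self.counts[url] += 1
        # Counts only grow, so only the updated URL can overtake the maximum.
        if self.counts[url] > self.best_count:
            self.best_url, self.best_count = url, self.counts[url]
        # Trigger: emit an output tuple every `period` input tuples.
        return self.best_url if self.seen % self.period == 0 else None

class RecentFrequent:
    """Most frequent URL in each count-based tumbling window of `size` tuples."""
    def __init__(self, size=10_000):
        self.size = size
        self._reset()

    def _reset(self):
        # State is re-initialized whenever the window tumbles.
        self.counts = Counter()
        self.seen = 0
        self.best_url, self.best_count = None, 0

    def on_tuple(self, url, client_ip):
        self.seen += 1
        self.counts[url] += 1
        if self.counts[url] > self.best_count:
            self.best_url, self.best_count = url, self.counts[url]
        if self.seen == self.size:   # window is full: emit and tumble
            result = self.best_url
            self._reset()
            return result
        return None

# Hypothetical request stream; the client IP is not used by either operator.
stream = ["/b", "/b", "/b", "/a", "/a", "/c"]
frequent = Frequent(period=3)
recent = RecentFrequent(size=3)
frequent_out = [frequent.on_tuple(url, "10.0.0.1") for url in stream]
recent_out = [recent.on_tuple(url, "10.0.0.1") for url in stream]
```

Under these assumptions, each tuple arrival takes expected constant time (one hash-table update and one comparison). The state of `Frequent` grows with the number of distinct URLs in the entire stream, while the state of `RecentFrequent` is bounded by the number of distinct URLs in one window, which is at most the window size.
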