Final Exam Topics

CprE 419, Spring 2015
Iowa State University
HDFS and MapReduce
• Big Data storage: how HDFS works, what the
alternatives to HDFS are, the limitations of
HDFS, and the fault tolerance of HDFS.
• MapReduce model of
computation/programming: understanding
how to cast a problem in the map/reduce
framework, examples discussed in class, and
analysis of computation and communication
cost in MapReduce.
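As a rough illustration of casting a problem in the map/reduce framework, the sketch below simulates the map, shuffle, and reduce phases for word count inside a single Python process. It is not Hadoop code: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and the number of shuffled pairs is used only as a simple proxy for communication cost.

```python
from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for every word in one input record.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Combine all values that share a key.
    return key, sum(values)

def run_mapreduce(records):
    # Map phase, followed by a shuffle that groups values by key.
    groups = defaultdict(list)
    shuffled_pairs = 0
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
            shuffled_pairs += 1  # each emitted pair would cross the network
    # Reduce phase: one call per distinct key.
    output = dict(reduce_fn(k, v) for k, v in groups.items())
    return output, shuffled_pairs

counts, cost = run_mapreduce(["a b a", "b c"])
```

Here the communication cost grows with the number of pairs the mappers emit, which for word count is the number of words in the input.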
MapReduce
• Hadoop MapReduce, including Combiners and
sorting using MapReduce (Lab 4).
• 1-2 problems on writing pseudocode for a
problem using MapReduce.
• Also the use of different input formats in
MapReduce – you don’t need to know specific
syntax, but you do need to know the purpose
of an input format and how you used custom
input formats in Lab 5 (JSON parsing).
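To make the Combiner topic above concrete, the following sketch compares how many key-value pairs leave the mappers with and without local combining for word count. It is a plain Python simulation under assumed names (map_with_combiner, shuffled_pair_counts), where each inner list stands for the input split of one mapper; it does not use the Hadoop API.

```python
from collections import Counter

def map_with_combiner(lines):
    # Combine locally: one (word, count) pair per distinct word per mapper.
    local = Counter()
    for line in lines:
        local.update(line.split())
    return list(local.items())

def shuffled_pair_counts(splits):
    # Without a combiner, every word occurrence is shuffled as (word, 1).
    without = sum(len(line.split()) for split in splits for line in split)
    # With a combiner, each mapper ships one pair per distinct word.
    with_combiner = sum(len(map_with_combiner(split)) for split in splits)
    return without, with_combiner

splits = [["a b a", "a a"], ["b c b"]]
without, with_combiner = shuffled_pair_counts(splits)
```

Local combining is valid here because addition is associative and commutative, so partial sums can be merged in any order at the reducer.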
Pig/Hive/HBase
• You don’t need to remember syntax, but do need
to know concepts. For example, we may give you
an example program and ask you to tell us what it
does and perhaps how we can make it more
efficient (work through the lab on Pig).
• Basic concepts of Hive and HBase – what are
they, and how are they useful? How are they
different from Pig and MapReduce?
Apache Spark
• How is it different from Hadoop? What are its
advantages?
• Concept of an RDD.
Stream Processing Concepts
• How is it different from batch processing?
Why can’t we use Hadoop for this, and why do
we need new software tools?
• Algorithm design for stream processing: writing
pseudocode for data processing operators in a
stream processing system. Concept of sliding
and tumbling windows.
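The difference between tumbling and sliding windows can be sketched with a running sum over count-based windows of size n. The function names tumbling_sums and sliding_sums are illustrative only, and the stream is modeled as a Python iterable rather than a real stream processing system.

```python
from collections import deque

def tumbling_sums(stream, n):
    # Emit one sum per non-overlapping block of n items.
    out, window = [], []
    for x in stream:
        window.append(x)
        if len(window) == n:
            out.append(sum(window))
            window = []  # the whole window is discarded at once
    return out

def sliding_sums(stream, n):
    # Emit one sum per arrival once the window holds n items.
    out, window, total = [], deque(), 0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > n:
            total -= window.popleft()  # evict only the oldest item
        if len(window) == n:
            out.append(total)
    return out
```

A tumbling window needs only the aggregate if the aggregate can be updated incrementally, while a sliding window must remember the individual items so that it can later evict them.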
InfoSphere Streams
• What does the operator graph mean?
• What are the semantics of some popular
operators?
• We may show you some code and ask you to
describe what it does, and how it can be
improved.
MapReduce 1
Consider a dataset with one record for each person
in a country, with the following information per
person: <latitude> <longitude> <person name>
Design a map-reduce algorithm to list all pairs of
people who live within D miles of each other and
have the same name. Analyze the communication
cost of the algorithm in terms of the input size.
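One possible approach, sketched here as a single-process simulation and not as the expected exam answer, is to use the person name as the map output key and to compare locations pairwise in the reducer. The haversine distance, the Earth radius constant, and all function names are assumptions made for illustration; records are modeled as (latitude, longitude, name) tuples.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points given in degrees.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def map_fn(record):
    # Key each record by name so that namesakes meet at one reducer.
    lat, lon, name = record
    yield name, (lat, lon)

def reduce_fn(name, locations, d_miles):
    # Compare every pair of people who share this name.
    for a, b in combinations(locations, 2):
        if haversine_miles(a[0], a[1], b[0], b[1]) <= d_miles:
            yield name, a, b

def nearby_same_name_pairs(records, d_miles):
    groups = defaultdict(list)  # stands in for the shuffle
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [p for name, locs in groups.items() for p in reduce_fn(name, locs, d_miles)]
```

In this sketch each input record is emitted exactly once, so the communication cost is linear in the input size. The reducer work is quadratic in the number of people sharing a name, so very common names would create skew; adding a coarse grid cell to the key is one refinement to consider.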
MapReduce 2
Consider a dataset with one record for each person
in a country, with the following information per
person: <latitude> <longitude> <zipcode>
<person name>
Design a map-reduce algorithm to list all pairs of
people who live in the same zipcode, and whose
names differ in at most two characters. Analyze the
communication cost of the algorithm in terms of
the input size.
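A possible sketch for this problem keys each record by zipcode and compares names within each reducer. It assumes that "differ in at most two characters" means an edit (Levenshtein) distance of at most 2, which is only one reading of the problem statement; a Hamming-distance reading would need a different comparison function. All names are hypothetical, and records are modeled as (latitude, longitude, zipcode, name) tuples.

```python
from collections import defaultdict
from itertools import combinations

def edit_distance(a, b):
    # Standard dynamic program for Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def map_fn(record):
    # Key each record by zipcode; only the name is needed downstream.
    _lat, _lon, zipcode, name = record
    yield zipcode, name

def reduce_fn(zipcode, names, max_diff=2):
    # Compare every pair of names within one zipcode.
    for a, b in combinations(names, 2):
        if edit_distance(a, b) <= max_diff:
            yield zipcode, a, b

def similar_name_pairs(records):
    groups = defaultdict(list)  # stands in for the shuffle
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [p for z, names in groups.items() for p in reduce_fn(z, names)]
```

As in the previous problem, each record is shuffled once, so the communication cost is linear in the input size, while the reducer for a zipcode with m residents performs on the order of m squared comparisons.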
Streams Background
Give examples of data streams with arrival rates
of approximately the following.
1. one hundred items per second
2. one thousand items per second, and
3. one million items per second.
Streams 1
Consider a stream of telephone call records, where every element of the
stream is a tuple <src, dest, duration>, where src is the source phone number,
dest is the destination phone number, and duration is the length of the call in
seconds. Devise an algorithm to compute the (src, dest) pair that has the
maximum number of calls among all possible pairs of phone numbers, over:
1. count-based tumbling window of size N
2. count-based sliding window of size N
For each, write the algorithm in pseudocode, analyze the processing time per
element, and the memory consumption of the operator.
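The two window variants in this problem can be sketched as follows. This is one straightforward approach under assumed names (tumbling_top_pair, sliding_top_pair), with the stream modeled as an iterable of (src, dest, duration) tuples; ties are broken arbitrarily, and faster data structures for the sliding case are possible.

```python
from collections import Counter, deque

def tumbling_top_pair(calls, n):
    # State: per-pair counts for the current window and a tuple counter.
    counts, seen, out = Counter(), 0, []
    for src, dest, _duration in calls:
        counts[(src, dest)] += 1
        seen += 1
        if seen == n:
            out.append(max(counts, key=counts.get))  # scan once per window
            counts.clear()                           # window tumbles
            seen = 0
    return out

def sliding_top_pair(calls, n):
    # State: the last n pairs in arrival order plus their counts.
    counts, window, out = Counter(), deque(), []
    for src, dest, _duration in calls:
        pair = (src, dest)
        window.append(pair)
        counts[pair] += 1
        if len(window) > n:
            old = window.popleft()  # evict the oldest call
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        if len(window) == n:
            out.append(max(counts, key=counts.get))  # scan per arrival
    return out
```

In this sketch the tumbling version does expected constant work per element plus one scan of at most N distinct pairs per window, and uses memory proportional to the number of distinct pairs in the window. The sliding version stores all N pairs and, as written, scans the counts on every arrival, so its per-element time is proportional to the number of distinct pairs in the window.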
Streams 2
• Write an algorithm for each of the following stream processing tasks, run at a busy web server.
The input stream, Requests, is a sequence of URLs requested by HTTP clients, in the format
<rstring URL, rstring client_IP_Address>
– The first output stream, Frequent, contains the most frequently requested URL so far. An
item should be output into Frequent every 100 URLs seen in the input stream. Note that
the scope of the aggregation is the entire stream so far.
– The next output stream, RecentFrequent, contains the most frequently requested URL
over the last 10,000 URL requests. Further, an item should be output into
RecentFrequent every 10,000 URLs, i.e., the aggregation must be performed over a
tumbling window of size 10,000.
• For each of the above tasks, describe (1) the state your algorithm maintains, and how it is
initialized, (2) the action taken upon a tuple arrival, and (3) the action taken when there is a
trigger for a query, or when the window moves. Also analyze the space taken to maintain the
state, and the time taken for processing each of the above actions.
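Both output streams can be sketched with one small operator that differs only in whether its state is reset when it emits. The class name TopUrlOperator, its parameters, and the run helper are hypothetical and written in plain Python, not in the language of any particular stream processing system; ties go to the URL that reached the top count first.

```python
from collections import Counter

class TopUrlOperator:
    def __init__(self, emit_every, reset_on_emit):
        # State: per-URL counts, the current best URL, and a tuple counter.
        self.emit_every = emit_every
        self.reset_on_emit = reset_on_emit
        self._reset()

    def _reset(self):
        self.counts = Counter()
        self.best_url, self.best_count = None, 0
        self.seen = 0

    def on_tuple(self, url, client_ip):
        # Counts only grow between resets, so the best can be kept incrementally.
        self.seen += 1
        self.counts[url] += 1
        if self.counts[url] > self.best_count:
            self.best_url, self.best_count = url, self.counts[url]
        if self.seen % self.emit_every != 0:
            return None
        result = self.best_url
        if self.reset_on_emit:
            self._reset()  # tumbling window: start over
        return result

def run(op, urls):
    # Feed URLs with a placeholder client address and collect emitted items.
    results = (op.on_tuple(u, "0.0.0.0") for u in urls)
    return [r for r in results if r is not None]

frequent = TopUrlOperator(emit_every=100, reset_on_emit=False)
recent_frequent = TopUrlOperator(emit_every=10_000, reset_on_emit=True)
```

Under these assumptions each arriving tuple costs expected constant time, and emitting is also constant time because the best URL is maintained incrementally. The space for Frequent grows with the number of distinct URLs seen so far, while RecentFrequent holds at most 10,000 distinct URLs at a time.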