CS236620 – hw1-2017

CS236620 – Introduction to Big Data Technology
HW Assignment #1 – due June 5, 2017
Please submit electronically, in pairs.
This assignment will use Spark to compute item-item lifts on a publicly available
user-movie ratings dataset.
Technicalities:
1. Download and install the Hortonworks Sandbox on your machine. This
involves setting up a virtual machine environment – additional software (such
as VirtualBox or VMware Fusion) is required.
2. Read the tutorials about using Spark on the sandbox.
3. Download the MovieLens 1M dataset.
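Before computing anything, the rating records have to be parsed. The sketch below assumes the layout described in the dataset's own README, namely a ratings.dat file with one UserID::MovieID::Rating::Timestamp record per line. The names Rating and parse_rating are hypothetical helpers for illustration, not part of the assignment, and a Spark job would apply the same function to each line of the input.

```python
from typing import NamedTuple, Optional


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    stars: float


def parse_rating(line: str) -> Optional[Rating]:
    """Parse one 'UserID::MovieID::Rating::Timestamp' record.

    Returns None for malformed lines so that a caller can filter them out
    instead of failing the whole job. The timestamp is dropped because the
    lift definitions do not use it.
    """
    parts = line.strip().split("::")
    if len(parts) != 4:
        return None
    try:
        return Rating(int(parts[0]), int(parts[1]), float(parts[2]))
    except ValueError:
        return None


# Example record in the assumed format.
sample = parse_rating("1::1193::5::978300760")
```

Dropping malformed lines silently is itself a design choice; counting them and reporting the total is an equally reasonable alternative.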
In the second homework assignment, you will be given a user and multiple movies
that the user has rated, and will be asked to recommend additional movies to the user
based on those ratings. To enable that, the current assignment computes, for each
rated movie, which other movies should be recommended given the rating.
We define the PosLift and the NegLift of movie y given movie x as follows:


PosLift(y|x) = Prob[u rated y positively | u rated x positively] / Prob[u rated y positively]
NegLift(y|x) = Prob[u rated y positively | u rated x negatively] / Prob[u rated y positively]
Where:
- u denotes a user.
- A rating of 3 stars or higher is considered positive.
- A rating of 2.5 stars or lower is considered negative.
Intuitively, the PosLift of y given x indicates whether users who like x are more
inclined to like y than the general audience. The NegLift of y given x indicates
whether users who dislike x are more inclined to like y than the general audience.
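To make the two definitions concrete, here is a minimal single-machine sketch that evaluates them on a toy dataset. It is not a Spark solution and not a reference implementation. It assumes that Prob[u rated y positively] is taken over all users appearing in the data, which the definitions above do not spell out, and it returns None where a lift is undefined. The function name lifts and the toy data are hypothetical.

```python
from collections import defaultdict

POSITIVE_MIN = 3.0  # 3 stars or higher counts as positive
NEGATIVE_MAX = 2.5  # 2.5 stars or lower counts as negative


def lifts(ratings, x, y):
    """Return (PosLift(y|x), NegLift(y|x)) from (user, movie, stars) triples.

    Assumption: the unconditional probability is measured over every user
    that appears in `ratings`. A lift is None when it is undefined, i.e. the
    conditioning set is empty or nobody rated y positively.
    """
    liked, disliked, users = defaultdict(set), defaultdict(set), set()
    for user, movie, stars in ratings:
        users.add(user)
        if stars >= POSITIVE_MIN:
            liked[movie].add(user)
        elif stars <= NEGATIVE_MAX:
            disliked[movie].add(user)
    if not users:
        return None, None
    p_y = len(liked[y]) / len(users)

    def lift(given):
        if not given or p_y == 0:
            return None
        return (len(liked[y] & given) / len(given)) / p_y

    return lift(liked[x]), lift(disliked[x])


# Four users, two movies: x = "A", y = "B".
TOY = [
    (1, "A", 5), (1, "B", 4),
    (2, "A", 4), (2, "B", 5),
    (3, "A", 1), (3, "B", 2),
    (4, "A", 2), (4, "B", 4),
]
pos_lift, neg_lift = lifts(TOY, "A", "B")
```

In this toy data 3 of 4 users like B, both users who like A also like B, and 1 of the 2 users who dislike A likes B. PosLift(B|A) is therefore 1 / 0.75, about 1.33, and NegLift(B|A) is 0.5 / 0.75, about 0.67.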
Your Spark job(s) should produce the following two files, using threshold λ=15 and list length k=10:
1. A file where for each movie x that was rated positively more than λ times, the
top-k movies by PosLift given x are output in <x, y, PosLift> format.
2. A file where for each movie x that was rated negatively more than λ times, the
top-k movies by NegLift given x are output in <x, y, NegLift> format.
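The filtering and ranking step described above can be sketched as follows, again on a single machine rather than in Spark. The sketch assumes the lifts and the per-movie rating counts have already been computed. The function name top_k_rows and its parameters are hypothetical. How ties are broken and whether y may equal x are not specified above, so the sketch makes no special provision for either.

```python
import heapq


def top_k_rows(lift_by_pair, support_by_movie, lam=15, k=10):
    """Return (x, y, lift) rows: the k highest-lift movies y for each x.

    lift_by_pair maps (x, y) to a lift value; support_by_movie maps x to the
    number of relevant (positive or negative) ratings of x. A movie x is kept
    only if that count is strictly greater than lam.
    """
    candidates = {}
    for (x, y), lift in lift_by_pair.items():
        if support_by_movie.get(x, 0) > lam:
            candidates.setdefault(x, []).append((y, lift))
    rows = []
    for x in sorted(candidates):
        # nlargest avoids sorting the full candidate list for each x.
        for y, lift in heapq.nlargest(k, candidates[x], key=lambda t: t[1]):
            rows.append((x, y, lift))
    return rows


example = top_k_rows(
    {(1, 2): 1.5, (1, 3): 2.0, (1, 4): 0.5, (5, 1): 3.0},
    {1: 20, 5: 15},
    lam=15,
    k=2,
)
```

In the example, movie 5 is dropped because its count of 15 is not more than λ, and only the two best candidates for movie 1 survive.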
In addition to submitting working code and the two output files, please submit:
1. An executable script or a readme file with instructions on how to run your code.
2. Clear external documentation of your solution.
3. Internal documentation, especially of any transformations applied to the data.
Emphasis will be put on the elegance, efficiency and scalability of your solution. As with
any wet (programming) assignment, there are multiple ways to solve the given problem, often with
various trade-offs. The documentation should call out and justify the design choices
you made, and any assumptions they are based on.
Good Luck!