CS236620 – Introduction to Big Data Technology
HW Assignment #1 – due June 5, 2017

Please submit electronically, in pairs.

This assignment will use Spark to compute item-item lifts on a publicly available user-movie ratings dataset.

Technicalities:
1. Download and install the Hortonworks Sandbox on your machine. This involves setting up a virtual machine environment, which requires additional software (such as VirtualBox or VMware Fusion).
2. Read the tutorials about using Spark on the sandbox.
3. Download the MovieLens 1M dataset.

In the second homework assignment, you will be given a user and multiple movies that the user has rated, and will be asked to recommend additional movies to the user based on those ratings. To enable that, the current assignment computes, for each rated movie, which other movies should be recommended given the rating.

We define the PosLift and the NegLift of movie y given movie x as follows:

PosLift(y|x) = Prob[u rated y positively | u rated x positively] / Prob[u rated y positively]
NegLift(y|x) = Prob[u rated y positively | u rated x negatively] / Prob[u rated y positively]

Where:
u denotes a user.
A rating of 3 stars and higher will be considered positive.
A rating of 2.5 stars and below will be considered negative.

Intuitively, the PosLift of y given x indicates whether users who like x are more inclined to like y than the general audience. The NegLift of y given x indicates whether users who dislike x are more inclined to like y than the general audience.

Your Spark job(s) should compute, using λ=15 and k=10, the following files:
1. A file where, for each movie x that was rated positively more than λ times, the top-k movies by PosLift given x are output in <x, y, PosLift> format.
2. A file where, for each movie x that was rated negatively more than λ times, the top-k movies by NegLift given x are output in <x, y, NegLift> format.

In addition to submitting working code and the two output files, please submit:
1. An executable script or a readme file with instructions on how to run your code.
2. Clear external documentation of your solution.
3. Internal documentation, especially of any transformations applied to the data.

Emphasis will be put on the elegance, efficiency and scalability of your solution. Like any wet assignment, there are multiple ways to solve the given problem, often with various trade-offs. The documentation should call out and justify the design choices you made, and any assumptions they are based on.

Good Luck!
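To make the PosLift and NegLift definitions above concrete, here is a minimal single-machine sketch in plain Python on a toy dataset. It is not a Spark solution and not the expected implementation; it only illustrates the arithmetic. The function name `compute_lifts`, the toy ratings, and the following choices are assumptions not stated in the assignment: Prob[u rated y positively] is taken over all users appearing in the data, and y = x is excluded from each movie's own list.

```python
from collections import defaultdict

POSITIVE_THRESHOLD = 3  # ratings of 3 stars and higher count as positive


def compute_lifts(ratings, min_support=15, top_k=10):
    """Return (pos_lift, neg_lift), each mapping x -> [(y, lift), ...].

    ratings: iterable of (user, movie, stars) triples.
    min_support plays the role of lambda, top_k the role of k.
    """
    liked = defaultdict(set)      # movie -> users who rated it positively
    disliked = defaultdict(set)   # movie -> users who rated it negatively
    users = set()
    for user, movie, stars in ratings:
        users.add(user)
        if stars >= POSITIVE_THRESHOLD:
            liked[movie].add(user)
        else:
            disliked[movie].add(user)
    n_users = len(users)  # assumption: probabilities are over all users

    def top_lifts(condition_sets):
        result = {}
        for x, cond_users in condition_sets.items():
            if len(cond_users) <= min_support:  # need MORE than lambda ratings
                continue
            scored = []
            for y, y_users in liked.items():
                if y == x:  # assumption: a movie is not recommended for itself
                    continue
                cond_prob = len(cond_users & y_users) / len(cond_users)
                base_prob = len(y_users) / n_users
                scored.append((y, cond_prob / base_prob))
            # Highest lift first; ties broken by movie id for determinism.
            scored.sort(key=lambda pair: (-pair[1], pair[0]))
            result[x] = scored[:top_k]
        return result

    return top_lifts(liked), top_lifts(disliked)


# Hypothetical toy data: four users, three movies.
toy = [
    ("u1", "A", 5), ("u1", "B", 4),
    ("u2", "A", 4), ("u2", "B", 5),
    ("u3", "A", 1), ("u3", "C", 5),
    ("u4", "C", 4), ("u4", "B", 2),
]
pos, neg = compute_lifts(toy, min_support=0, top_k=2)
```

In this toy data both users who liked A also liked B, while only half of all users liked B, so PosLift(B|A) = 1 / 0.5 = 2. The only user who disliked A liked C, so NegLift(C|A) is likewise 2. A Spark solution would need to express the same counts as distributed transformations rather than in-memory sets.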