The Netflix Challenge: Parallel Collaborative Filtering
James Jolly and Ben Murrell
CS 387 Parallel Programming with MPI
Dr. Fikret Ercal

What is Netflix?
• subscription-based movie rental
• online frontend
• over 100,000 movies to pick from
• 8M subscribers
• 2007 net income: $67M

What is the Netflix Prize?
• an attempt to improve on the accuracy of Cinematch, Netflix's own recommender
• predict how users will rate movies they have not yet seen
• $1M for a 10% improvement

The contest dataset…
• contains 100,480,577 ratings
• from 480,189 users
• for 17,770 movies

Why is it hard?
• user tastes are difficult to model in general
• movies are tough to classify
• the volume of data is large

Sounds like a job for collaborative filtering!
• infer relationships between users
• leverage those relationships to make predictions

Why is it hard?

  User       Movie            Rating
  Dijkstra   Office Space     5
  Knuth      Office Space     5
  Turing     Office Space     5
  Knuth      Dr. Strangelove  4
  Turing     Dr. Strangelove  2
  Boole      Titanic          5
  Knuth      Titanic          1
  Turing     Titanic          2

What makes users similar?

[Scatter plot: Turing's ratings (y-axis, 0-5) against Knuth's ratings (x-axis, 0-5) for Office Space, Dr. Strangelove, and Titanic]

What makes users similar? The Pearson correlation coefficient!

[The same scatter plot, annotated with the correlation pc = .813]

Building a similarity matrix…

            Turing  Knuth  Boole  Chomsky
  Turing    1.000   0.813  0.750  0.125
  Knuth     0.813   1.000  0.325  0.500
  Boole     0.750   0.325  1.000  0.500
  Chomsky   0.125   0.500  0.500  1.000

Predicting user ratings…

Would Chomsky like "Grammar Rock"?

approach:
• use the matrix to find users similar to Chomsky
• drop ratings from those who haven't seen it
• take a weighted average of the remaining ratings

Predicting user ratings…

            Turing  Knuth  Boole  Chomsky
  Turing    1.000   0.813  0.750  0.125
  Knuth     0.813   1.000  0.325  0.500
  Boole     0.750   0.325  1.000  0.500
  Chomsky   0.125   0.500  0.500  1.000

Suppose Turing, Knuth, and Boole rated it 5, 3, and 1.
Since Chomsky's similarity weights sum to .125 + .5 + .5 = 1.125, we predict

  rChomsky = (.125/1.125)·5 + (.5/1.125)·3 + (.5/1.125)·1
  rChomsky ≈ 2.33
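To make the similarity computation concrete, here is a minimal C sketch of the Pearson correlation between two users, restricted to the movies both have rated. The parallel-array layout and the function name pearson are our own illustration, not contest code; on the toy table above it gives roughly .69 for Knuth and Turing, so the deck's pc = .813 evidently reflects its own plotted data.

    #include <math.h>
    #include <stdio.h>

    /* Pearson correlation between two users over the n movies
     * both have rated: x[i] and y[i] are their ratings of movie i. */
    double pearson(const double *x, const double *y, int n)
    {
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;

        double cov = 0.0, vx = 0.0, vy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            cov += dx * dy;
            vx  += dx * dx;
            vy  += dy * dy;
        }
        if (vx == 0.0 || vy == 0.0)
            return 0.0;                /* no variance: call it uncorrelated */
        return cov / sqrt(vx * vy);
    }

    int main(void)
    {
        /* Knuth's and Turing's ratings of the movies both rated:
         * Office Space, Dr. Strangelove, Titanic (from the table above). */
        double knuth[]  = {5, 4, 1};
        double turing[] = {5, 2, 2};
        printf("pc = %.3f\n", pearson(knuth, turing, 3));
        return 0;
    }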
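The prediction step is just as short. A minimal sketch, assuming neighbors who have not rated the movie were filtered out beforehand; predict and the variable names are illustrative. Run on the worked example, it reproduces rChomsky ≈ 2.33.

    #include <stdio.h>

    /* Weighted-average prediction: weights[i] is the target user's
     * similarity to neighbor i, ratings[i] that neighbor's rating
     * of the movie in question. */
    double predict(const double *weights, const double *ratings, int n)
    {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            num += weights[i] * ratings[i];
            den += weights[i];
        }
        return den != 0.0 ? num / den : 0.0;
    }

    int main(void)
    {
        /* Chomsky's similarities to Turing, Knuth, and Boole, and
         * their ratings of "Grammar Rock". */
        double w[] = {0.125, 0.5, 0.5};
        double r[] = {5, 3, 1};
        printf("rChomsky = %.2f\n", predict(w, r, 3));   /* 2.33 */
        return 0;
    }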
So how is the data really organized?

  movie file 1         movie file 2          movie file 3        …
  user 1, rating 5     user 13, rating 1     user 13, rating 5
  user 13, rating 3    user 42, rating 1     user 311, rating 4
  user 42, rating 2    user 1337, rating 2   user 666, rating 5
  …                    …                     …

Training Data
• 17,770 text files (one for each movie)
• more than 2 GB in total

Parallelization

Two-step process:
• learning step
• prediction step

Concerns:
• data distribution
• task distribution

Parallelizing the learning step…

[Figure: the 8×8 matrix of user-user correlations c(i,j) for users 1-8]

Parallelizing the learning step…

[Figure: the same matrix striped row-wise across four processors, n/p = 2 users per processor: P=1 holds the rows for users 1-2, P=2 users 3-4, P=3 users 5-6, P=4 users 7-8]

Parallelizing the learning step…
• store the data as user[movie] = rating
• each processor holds all rating data for n/p users
• calculate each c(i,j)
• the calculation requires message passing: only 1/p of the correlations can be computed from ratings local to a node (see the MPI sketch below)
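How that message passing might look is sketched below, assuming dense per-user rating vectors with 0 for "not rated" and a processor count that divides the user count. The all-gather strategy is our own illustration, and a deliberately simple one: it replicates the full ratings matrix on every node, so a run at Netflix scale would swap it for something like a ring exchange of one slice at a time.

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>

    #define N_USERS  8     /* toy sizes; the real data has 480,189 users */
    #define N_MOVIES 12    /* ... and 17,770 movies                      */

    /* Pearson over two dense rating rows, restricted to movies
     * both users rated (0 marks "not rated"). */
    double pearson_rows(const double *x, const double *y, int m)
    {
        double mx = 0, my = 0, cov = 0, vx = 0, vy = 0;
        int n = 0;
        for (int k = 0; k < m; k++)
            if (x[k] > 0 && y[k] > 0) { mx += x[k]; my += y[k]; n++; }
        if (n < 2) return 0.0;
        mx /= n; my /= n;
        for (int k = 0; k < m; k++)
            if (x[k] > 0 && y[k] > 0) {
                double dx = x[k] - mx, dy = y[k] - my;
                cov += dx * dy; vx += dx * dx; vy += dy * dy;
            }
        return (vx > 0 && vy > 0) ? cov / sqrt(vx * vy) : 0.0;
    }

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int local_users = N_USERS / nprocs;   /* n/p users per processor */

        /* This processor's slice of the ratings, local_users x N_MOVIES,
         * filled from its share of the movie files (omitted here). */
        double *local = calloc((size_t)local_users * N_MOVIES, sizeof *local);

        /* Gather every slice so each node can see all users' ratings. */
        double *all = malloc((size_t)N_USERS * N_MOVIES * sizeof *all);
        MPI_Allgather(local, local_users * N_MOVIES, MPI_DOUBLE,
                      all,   local_users * N_MOVIES, MPI_DOUBLE,
                      MPI_COMM_WORLD);

        /* Each processor fills in its own rows of the similarity matrix. */
        double *simrows = malloc((size_t)local_users * N_USERS * sizeof *simrows);
        for (int i = 0; i < local_users; i++) {
            int gi = rank * local_users + i;      /* global user index */
            for (int j = 0; j < N_USERS; j++)
                simrows[i * N_USERS + j] = pearson_rows(
                    &all[gi * N_MOVIES], &all[j * N_MOVIES], N_MOVIES);
        }

        free(simrows); free(all); free(local);
        MPI_Finalize();
        return 0;
    }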
Parallelizing the prediction step…
• data distribution directly affects task distribution
• Method 1: store all user information on each processor and stripe the movie information (less communication)

[Diagram: predict(user, movie) → rating estimate; P0-P3 each hold all user information, with movies 1-12 striped across them]

Parallelizing the prediction step…
• Method 2: store all movie information on each processor and stripe the user information (more communication); partial estimates must be gathered — a minimal MPI sketch of that gather closes the deck

[Diagram: predict(user, movie) → gather partial estimates; P0-P3 each hold all movie ratings, with users 1-12 striped across them]

Parallelizing the prediction step…
• Method 3: hybrid approach (lots of communication at high node counts)

[Diagram: a 2D decomposition across P0-P9 and beyond — each processor holds one block of users (e.g. users 1-3, users 4-6) crossed with one block of movies (e.g. movies 1-12, movies 13-24)]

Our Present Implementation
• operates on a trimmed-down dataset
• stripes the movie information and stores the full similarity matrix on each processor
• this won't scale well!
• storing all movie information on each node would be optimal, but nic.mst.edu can't handle it

In summary…
• tackling the Netflix Prize requires lots of data handling
• we are working toward an implementation that can operate on the entire training set
• simple collaborative filtering should get us close to the original Cinematch performance
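Finally, the Method 2 gather promised above: a minimal sketch, assuming each processor forms a partial weighted sum over its own stripe of users and rank 0 combines the partials. The two-element partial (numerator, denominator) and the reduce-based combination are our illustration, with dummy values standing in for the real local sums.

    #include <mpi.h>
    #include <stdio.h>

    /* Method 2: every processor holds all movie ratings but only a
     * stripe of the users, hence a stripe of the similarity rows.
     * To predict one (user, movie) pair, each processor computes a
     * partial weighted sum over its local neighbors who rated the
     * movie, and the partials are reduced at rank 0. */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* part[0] = sum of similarity * rating over local neighbors,
         * part[1] = sum of similarity (dummy values stand in here). */
        double part[2] = { 1.0 + rank, 0.5 };

        double total[2];
        MPI_Reduce(part, total, 2, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0 && total[1] != 0.0)
            printf("rating estimate = %.2f\n", total[0] / total[1]);

        MPI_Finalize();
        return 0;
    }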