Parallel Collaborative Filtering
The Netflix Challenge

James Jolly
Ben Murrell
CS 387: Parallel Programming with MPI
Dr. Fikret Ercal
What is Netflix?
• subscription-based movie rental
• online frontend
• over 100,000 movies to pick from
• 8M subscribers
• 2007 net income: $67M
What is the Netflix Prize?
• attempt to improve the accuracy of Cinematch, Netflix's own recommendation system
• predict how users will rate unseen movies
• $1M for 10% improvement
The contest dataset…
• contains 100,480,577 ratings
• from 480,189 users
• for 17,770 movies
Why is it hard?
• user tastes difficult to model in general
• movies tough to classify
• large volume of data
Sounds like a job for collaborative filtering!
• infer relationships between users
• leverage them to make predictions
Why is it hard?
User      Movie            Rating
Dijkstra  Office Space     5
Knuth     Office Space     5
Turing    Office Space     5
Knuth     Dr. Strangelove  4
Turing    Dr. Strangelove  2
Boole     Titanic          5
Knuth     Titanic          1
Turing    Titanic          2
What makes users similar?
[Figure: scatter plot of Turing's ratings (y-axis, 0–5) against Knuth's ratings (x-axis, 0–5), one point each for Office Space, Dr. Strangelove, and Titanic]
What makes users similar?
The Pearson Correlation Coefficient!
[Figure: the same Turing-vs.-Knuth scatter plot, annotated with the Pearson correlation pc = 0.813]
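A minimal sketch (in C, assuming dense per-user rating arrays where 0 means "unrated") of the Pearson correlation between two users, computed only over the movies both have rated; pearson, NUM_MOVIES, and UNRATED are illustrative names, not part of the actual implementation.

```c
#include <math.h>

#define NUM_MOVIES 17770
#define UNRATED    0          /* ratings are 1-5, so 0 marks "not rated" */

/* Pearson correlation between two users' rating vectors, computed only
 * over the movies that both users have rated.  Returns 0 when fewer than
 * two movies are co-rated or when either user has no rating variance. */
double pearson(const unsigned char *a, const unsigned char *b)
{
    double sum_a = 0, sum_b = 0, sum_aa = 0, sum_bb = 0, sum_ab = 0;
    int n = 0;

    for (int m = 0; m < NUM_MOVIES; m++) {
        if (a[m] == UNRATED || b[m] == UNRATED)
            continue;
        sum_a  += a[m];
        sum_b  += b[m];
        sum_aa += (double)a[m] * a[m];
        sum_bb += (double)b[m] * b[m];
        sum_ab += (double)a[m] * b[m];
        n++;
    }
    if (n < 2)
        return 0.0;

    double cov   = sum_ab - sum_a * sum_b / n;
    double var_a = sum_aa - sum_a * sum_a / n;
    double var_b = sum_bb - sum_b * sum_b / n;
    if (var_a <= 0.0 || var_b <= 0.0)
        return 0.0;
    return cov / sqrt(var_a * var_b);
}
```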
Building a similarity matrix…
          Turing   Knuth   Boole   Chomsky
Turing    1.000    0.813   0.750   0.125
Knuth     0.813    1.000   0.325   0.500
Boole     0.750    0.325   1.000   0.500
Chomsky   0.125    0.500   0.500   1.000
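Continuing the sketch above, the matrix can be filled by correlating every pair of users; build_similarity, sim, and ratings are assumed names, and the full matrix is stored even though it is symmetric.

```c
double pearson(const unsigned char *a, const unsigned char *b);  /* earlier sketch */

/* Fill the symmetric user-user similarity matrix; sim[i][j] is the
 * Pearson correlation between users i and j. */
void build_similarity(double **sim, unsigned char **ratings, int num_users)
{
    for (int i = 0; i < num_users; i++) {
        sim[i][i] = 1.0;                       /* a user is identical to itself */
        for (int j = i + 1; j < num_users; j++) {
            sim[i][j] = pearson(ratings[i], ratings[j]);
            sim[j][i] = sim[i][j];             /* matrix is symmetric */
        }
    }
}
```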
Predicting user ratings…
Would Chomsky like “Grammar Rock”?
approach:
• use matrix to find users like Chomsky
• drop ratings from those who haven’t seen it
• take weighted average of remaining ratings
Predicting user ratings…
          Turing   Knuth   Boole   Chomsky
Turing    1.000    0.813   0.750   0.125
Knuth     0.813    1.000   0.325   0.500
Boole     0.750    0.325   1.000   0.500
Chomsky   0.125    0.500   0.500   1.000
Suppose Turing, Knuth, and Boole rated it 5, 3, and 1.
Since 0.125 + 0.5 + 0.5 = 1.125, the similarity-weighted average is
r_Chomsky = (0.125·5 + 0.5·3 + 0.5·1) / 1.125
r_Chomsky ≈ 2.33
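The same weighted-average rule as a small C sketch, reusing the similarity matrix and dense rating arrays assumed earlier; predict and the neutral fallback rating of 3 are illustrative choices, not from the slides.

```c
#define UNRATED 0

/* Predict user u's rating of a movie as the similarity-weighted average of
 * the ratings given by other users who have seen it, exactly the rule worked
 * out above.  Users with non-positive correlation are skipped; if no similar
 * user has rated the movie, fall back to a neutral 3. */
double predict(int u, int movie, double **sim,
               unsigned char **ratings, int num_users)
{
    double num = 0.0, den = 0.0;

    for (int v = 0; v < num_users; v++) {
        if (v == u || ratings[v][movie] == UNRATED || sim[u][v] <= 0.0)
            continue;
        num += sim[u][v] * ratings[v][movie];
        den += sim[u][v];
    }
    return den > 0.0 ? num / den : 3.0;
}
```

With the numbers above (weights 0.125, 0.5, 0.5 and ratings 5, 3, 1) this returns 2.625 / 1.125 ≈ 2.33.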
So how is the data really organized?
movie file 1:
  user 1, rating '5'
  user 13, rating '3'
  user 42, rating '2'
  …
movie file 2:
  user 13, rating '1'
  user 42, rating '1'
  user 1337, rating '2'
  …
movie file 3:
  user 13, rating '5'
  user 311, rating '4'
  user 666, rating '5'
  …
Training Data
• 17,770 text files (one for each movie)
• > 2 GB
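A sketch of how one per-movie file could be loaded into the user[movie] = rating layout used later; the line format matches the simplified listing above rather than the real contest files, and load_movie_file is an assumed name.

```c
#include <stdio.h>

/* Read one movie file of the simplified form shown above, e.g.
 *   user 1, rating '5'
 * and record each rating in the per-user table (0 = unrated).  The actual
 * contest files are CSV records under a movie-id header line, so treat this
 * only as a sketch of the loading step. */
int load_movie_file(const char *path, int movie, unsigned char **ratings)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;

    int user, rating, count = 0;
    while (fscanf(fp, " user %d, rating '%d'", &user, &rating) == 2) {
        ratings[user][movie] = (unsigned char)rating;
        count++;
    }
    fclose(fp);
    return count;   /* number of ratings loaded for this movie */
}
```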
Parallelization
Two-Step Process:
• Learning Step
• Prediction Step
Concerns:
• Data Distribution
• Task Distribution
Parallelizing the learning step…
[Figure: the full 8 × 8 matrix of pairwise user correlations, one entry c(i,j) for every pair of users 1–8]
Parallelizing the learning step…
[Figure: the same 8 × 8 correlation matrix partitioned row-wise among four processors, P=1 through P=4, each processor computing the rows for two of the eight users]
Parallelizing the learning step…
• store data as user[movie] = rating
• each proc has all rating data for n/p users
• calculate each c(i,j)
• calculation requires message passing (only 1/p of the correlations can be calculated locally within a node); see the sketch below
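A minimal MPI sketch of that exchange, assuming the dense per-user rating blocks and the pearson() function from the earlier sketches; in practice the data would be stored sparsely (a full dense block would be far too large to broadcast in one call), and the buffer and function names here are illustrative.

```c
#include <mpi.h>

#define NUM_MOVIES 17770

double pearson(const unsigned char *a, const unsigned char *b);  /* earlier sketch */

/* Learning step: each rank owns the full rating vectors for a block of n/p
 * users.  Each rank's block is broadcast in turn; after every broadcast each
 * rank correlates its own users against the visiting block, so every c(i,j)
 * gets computed somewhere. */
void learning_step(unsigned char *local,     /* [users_per_rank * NUM_MOVIES] */
                   unsigned char *visiting,  /* scratch buffer, same size     */
                   double *sim_block,        /* [users_per_rank * num_users]  */
                   int users_per_rank, int num_users)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int src = 0; src < nprocs; src++) {
        /* Broadcast src's block; other ranks receive it into `visiting`. */
        unsigned char *block = (src == rank) ? local : visiting;
        MPI_Bcast(block, users_per_rank * NUM_MOVIES,
                  MPI_UNSIGNED_CHAR, src, MPI_COMM_WORLD);

        /* Correlate each local user with each user in the visiting block. */
        for (int i = 0; i < users_per_rank; i++)
            for (int j = 0; j < users_per_rank; j++) {
                int gj = src * users_per_rank + j;   /* global index of user j */
                sim_block[(long)i * num_users + gj] =
                    pearson(local + (long)i * NUM_MOVIES,
                            block + (long)j * NUM_MOVIES);
            }
    }
}
```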
Parallelizing the prediction step…
• Data distribution directly affects task distribution
• Method 1: store all user information on each processor and stripe movie information (less communication); a code sketch follows the figure
[Figure: P0–P3 each hold all user information plus a stripe of three movies (Movie1–Movie12 total); a predict(user, movie) request goes to the processor owning that movie, which returns the rating estimate]
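A hedged sketch of Method 1: the rank owning the requested movie can compute the entire estimate locally and ship back a single double. The round-robin owner_of() mapping, the movie_col rating column, and rank 0 acting as the front end are assumptions for illustration.

```c
#include <mpi.h>

#define UNRATED 0

static int owner_of(int movie, int nprocs) { return movie % nprocs; }

/* Every rank keeps the user-user similarity data; each movie's rating list
 * lives only on owner_of(movie).  That rank forms the whole weighted-average
 * estimate locally, so a query costs at most one small message. */
double serve_prediction(int user, int movie,
                        const double *sim_row,          /* sim[user][*]      */
                        const unsigned char *movie_col, /* ratings[*][movie] */
                        int num_users)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double estimate = 0.0;
    int owner = owner_of(movie, nprocs);

    if (rank == owner) {
        double num = 0.0, den = 0.0;
        for (int v = 0; v < num_users; v++) {
            if (v == user || movie_col[v] == UNRATED || sim_row[v] <= 0.0)
                continue;
            num += sim_row[v] * movie_col[v];
            den += sim_row[v];
        }
        estimate = den > 0.0 ? num / den : 3.0;
        if (owner != 0)
            MPI_Send(&estimate, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&estimate, 1, MPI_DOUBLE, owner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    return estimate;   /* meaningful on rank 0 and on the owning rank */
}
```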
Parallelizing the prediction step…
• Data distribution directly affects task distribution
• Method 2: store all movie information on each processor and stripe user information (more communication); a code sketch follows the figure
[Figure: users are striped across P0–P3 (three users each), with each processor holding all movie ratings for its own users; P0 receives predict(user, movie) and gathers partial estimates from P1–P3]
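A sketch of Method 2: each rank forms a partial weighted sum over its own stripe of users, and the partial estimates are combined with a reduction (the "gather partial estimates" step in the figure). sim_to_target (similarity of each local user to the query user) and local_ratings are assumed names.

```c
#include <mpi.h>

#define UNRATED 0

/* Each rank holds the full rating history for its local_users users.
 * It accumulates a partial numerator/denominator for the weighted average,
 * then the sums are combined across all ranks. */
double predict_striped_users(int movie,
                             const double *sim_to_target,
                             unsigned char **local_ratings,
                             int local_users)
{
    double part[2] = { 0.0, 0.0 };   /* partial numerator, denominator */
    double total[2];

    for (int v = 0; v < local_users; v++) {
        if (local_ratings[v][movie] == UNRATED || sim_to_target[v] <= 0.0)
            continue;
        part[0] += sim_to_target[v] * local_ratings[v][movie];
        part[1] += sim_to_target[v];
    }

    /* Every rank receives the combined sums; MPI_Reduce to rank 0 would also work. */
    MPI_Allreduce(part, total, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total[1] > 0.0 ? total[0] / total[1] : 3.0;
}
```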
Parallelizing the prediction step…
• Data distribution directly affects task distribution
• Method 3: hybrid approach (lots of communication, requires a high number of nodes); a code sketch follows the figure
[Figure: processors arranged in a 2-D grid (P0–P9, …); each processor holds one block of users (Users 1–3, Users 4–6, …) crossed with one block of three movies (Movie1–Movie3, …, Movie34–Movie36), so a predict(user, movie) request is served by the processors holding that user block]
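For Method 3, one way to picture the hybrid layout is a 2-D grid of ranks, each owning one block of users crossed with one block of movies; the block sizes and the row-major rank numbering below are assumptions, and the partial estimates within a user block would still be combined as in Method 2.

```c
/* Route a (user, movie) query to the single rank that owns both that user
 * block and that movie block.  Block sizes and the row-major numbering are
 * illustrative; with users_per_block = 3 and movies_per_block = 3 this
 * matches the spirit of the figure above. */
static int owner_of_pair(int user, int movie,
                         int users_per_block, int movies_per_block,
                         int movie_blocks)
{
    int user_block  = user  / users_per_block;
    int movie_block = movie / movies_per_block;
    return user_block * movie_blocks + movie_block;   /* row-major rank id */
}
```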
Our Present Implementation
• operates on a trimmed-down dataset
• stripes movie information and stores the similarity matrix on each processor
• this won't scale well!
• storing all movie information on each node would be optimal, but nic.mst.edu can't handle it
In summary…
• tackling the Netflix Prize requires lots of data handling
• we are working toward an implementation that can operate on the entire training set
• simple collaborative filtering should get us close to the old Cinematch performance