Case Study 4: Collaborative Filtering
Collaborative Filtering
Matrix Completion
Alternating Least Squares
Machine Learning/Statistics for Big Data
CSE599C1/STAT592, University of Washington
Carlos Guestrin
February 28th, 2013
Collaborative Filtering

- Goal: find movies of interest to a user, based on movies watched by the user and by others
- Methods: matrix factorization, GraphLab
[Figure: example movie recommendations (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita). What do I recommend?]

Cold-Start Problem

- Challenge: the cold-start problem (a new movie or user with no ratings yet)
- Methods: use features of the movie/user
Netflix Prize

- Given 100 million ratings on a scale of 1 to 5, predict 3 million ratings to highest accuracy
- 17,770 total movies
- 480,189 total users
- Over 8 billion total matrix entries, so the vast majority of ratings are missing
- How to fill in the blanks?

(Figures from Ben Recht)
Matrix Completion Problem

[Figure: partially observed ratings matrix X; rows index users, columns index movies; X_ij is known for black cells, unknown for white cells]

- How do we fill in the missing data?
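To make the setup concrete, here is a minimal NumPy sketch of one way to store a partially observed ratings matrix (the NaN-mask representation is an illustration, not something the slides prescribe):

```python
import numpy as np

# Toy ratings matrix: rows index users, columns index movies.
# NaN marks the "white cells" whose ratings are unobserved.
X = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 4.0, np.nan],
    [1.0, 2.0, np.nan],
])

observed = ~np.isnan(X)        # boolean mask of known entries
pairs = np.argwhere(observed)  # (user, movie) index pairs with ratings
ratings = X[observed]          # the known values r_uv
print(pairs, ratings)
```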
Interpreting Low-Rank Matrix Completion (aka Matrix Factorization)

X = L R'

- Each rating is approximated by an inner product of factors: X_uv is modeled as L_u · R_v, where row L_u holds user u's factors and row R_v holds movie v's factors
Matrix Completion via Rank Minimization

- Given observed values: the ratings r_uv with (u, v, r_uv) \in X, r_uv \neq ?
- Find matrix: \Theta
- Such that: \Theta_{uv} = r_{uv} for every observed entry
- But... many matrices satisfy these constraints, so we need a bias toward simple solutions
- Introduce bias: minimize the rank,
  \min_\Theta \mathrm{rank}(\Theta) \quad \text{s.t.} \quad \Theta_{uv} = r_{uv}, \; \forall r_{uv} \in X, \; r_{uv} \neq ?
- Two issues: rank minimization is NP-hard, and exact equality constraints overfit noisy ratings
Approximate Matrix Completion

- Minimize squared error over the observed entries
  - (Other loss functions are possible)
- Choose rank k: L gets one k-dimensional row per user, R one per movie
- Optimization problem:
  \min_{L,R} \sum_{(u,v,r_{uv}) \in X : r_{uv} \neq ?} (L_u \cdot R_v - r_{uv})^2
Coordinate Descent for Matrix Factorization

\min_{L,R} \sum_{(u,v,r_{uv}) \in X : r_{uv} \neq ?} (L_u \cdot R_v - r_{uv})^2

- Fix movie factors, optimize for user factors
- First observation: with R fixed, the objective decomposes into an independent problem for each user
Minimizing Over User Factors

- For each user u (with V_u the set of movies rated by u):
  \min_{L_u} \sum_{v \in V_u} (L_u \cdot R_v - r_{uv})^2
- In matrix form: stack the factors R_v for v \in V_u as rows of a matrix R_{V_u} and the ratings as a vector r_u, giving \min_{L_u} \|R_{V_u} L_u - r_u\|_2^2
- Second observation: solve by ordinary least squares, i.e., one small k x k linear system per user (see the sketch below)
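A minimal NumPy sketch of this per-user solve; the optional ridge term lam anticipates the regularization discussed later in the deck and is otherwise an assumption:

```python
import numpy as np

def solve_user_factors(R_rated, r_u, lam=0.0):
    """Least-squares solve for one user's factor vector L_u.

    R_rated: (n_rated, k) array of rows R_v for the movies this user rated.
    r_u:     (n_rated,) array of the user's observed ratings.
    lam:     optional ridge regularizer (the plain slide has lam = 0).
    """
    k = R_rated.shape[1]
    A = R_rated.T @ R_rated + lam * np.eye(k)  # k x k normal equations
    b = R_rated.T @ r_u
    return np.linalg.solve(A, b)
```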
Coordinate Descent for Matrix Factorization: Alternating Least-Squares

\min_{L,R} \sum_{(u,v,r_{uv}) \in X : r_{uv} \neq ?} (L_u \cdot R_v - r_{uv})^2

- Fix movie factors, optimize for user factors
  - Independent least-squares problems over users:
    \min_{L_u} \sum_{v \in V_u} (L_u \cdot R_v - r_{uv})^2
- Fix user factors, optimize for movie factors
  - Independent least-squares problems over movies:
    \min_{R_v} \sum_{u \in U_v} (L_u \cdot R_v - r_{uv})^2
- System may be underdetermined: e.g., a user with fewer than k ratings yields a rank-deficient least-squares problem (regularization fixes this)
- Converges to a local optimum of the objective, not necessarily the global one
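Putting the two sweeps together, a hedged sketch of the full ALS loop (the variable names and the small ridge term lam are illustrative assumptions; the slides' plain version has lam = 0):

```python
import numpy as np

def als(pairs, ratings, n_users, n_movies, k=10, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares on observed (user, movie, rating) triples."""
    rng = np.random.default_rng(seed)
    L = 0.1 * rng.standard_normal((n_users, k))
    R = 0.1 * rng.standard_normal((n_movies, k))

    # Index the observations by user and by movie for the two sweeps.
    by_user = [[] for _ in range(n_users)]
    by_movie = [[] for _ in range(n_movies)]
    for (u, v), r in zip(pairs, ratings):
        by_user[u].append((v, r))
        by_movie[v].append((u, r))

    ridge = lam * np.eye(k)
    for _ in range(n_iters):
        for u, obs in enumerate(by_user):        # fix R, solve each user
            if obs:
                V = np.array([R[v] for v, _ in obs])
                r = np.array([rv for _, rv in obs])
                L[u] = np.linalg.solve(V.T @ V + ridge, V.T @ r)
        for v, obs in enumerate(by_movie):       # fix L, solve each movie
            if obs:
                U = np.array([L[u] for u, _ in obs])
                r = np.array([ru for _, ru in obs])
                R[v] = np.linalg.solve(U.T @ U + ridge, U.T @ r)
    return L, R
```

Each inner solve is exactly the per-user (or per-movie) least-squares problem above, which is why the sweeps can run independently, and in parallel, across users and movies.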
Effect of Regularization

\min_{L,R} \sum_{(u,v,r_{uv}) \in X : r_{uv} \neq ?} (L_u \cdot R_v - r_{uv})^2 + \lambda_u \|L\|_F^2 + \lambda_v \|R\|_F^2

[Figure: X = L R']
What you need to know...

- Matrix completion problem for collaborative filtering
- Under-determined problem -> low-rank approximation
- Rank minimization is NP-hard
- Minimize least-squares prediction for known values, for a given rank of the matrix
  - Must use regularization
- Coordinate descent algorithm = "Alternating Least Squares"
Case Study 4: Collaborative Filtering
SGD for Matrix Completion
Matrix-norm Minimization
Machine Learning/Statistics for Big Data
CSE599C1/STAT592, University of Washington
Carlos Guestrin
March 7th, 2013
Stochastic Gradient Descent

\min_{L,R} \; \frac{1}{2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \frac{\lambda_u}{2} \|L\|_F^2 + \frac{\lambda_v}{2} \|R\|_F^2

- Observe one rating r_uv at a time
- Gradient observing r_uv: with error \epsilon_t = L_u^{(t)} \cdot R_v^{(t)} - r_{uv}, the gradient w.r.t. L_u is \epsilon_t R_v + \lambda_u L_u, and w.r.t. R_v is \epsilon_t L_u + \lambda_v R_v
- Updates:
  L_u^{(t+1)} = (1 - \eta_t \lambda_u) L_u^{(t)} - \eta_t \epsilon_t R_v^{(t)}
  R_v^{(t+1)} = (1 - \eta_t \lambda_v) R_v^{(t)} - \eta_t \epsilon_t L_u^{(t)}
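A sketch of a single SGD update implementing these equations (the function name and signature are illustrative):

```python
import numpy as np

def sgd_step(L, R, u, v, r_uv, eta, lam_u, lam_v):
    """One stochastic gradient update on a single observed rating r_uv."""
    eps = L[u] @ R[v] - r_uv   # prediction error on this rating
    Lu_old = L[u].copy()       # both updates must use the old L_u
    L[u] = (1 - eta * lam_u) * L[u] - eta * eps * R[v]
    R[v] = (1 - eta * lam_v) * R[v] - eta * eps * Lu_old
```

Sweeping this step over the observed ratings in random order, with a decaying step size eta, gives the full algorithm; only the two touched factor rows change per rating, which is what makes SGD cheap at Netflix scale.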
Local Optima v. Global Optima

- We are solving:
  \min_{L,R} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \lambda_u \|L\|_F^2 + \lambda_v \|R\|_F^2
- We (kind of) wanted to solve:
  \min_\Theta \mathrm{rank}(\Theta) \quad \text{s.t.} \quad \Theta_{uv} = r_{uv}, \; \forall r_{uv} \in X, \; r_{uv} \neq ?
- Which is NP-hard...
  - How do these things relate???
Eigenvalue Decompositions for PSD Matrices

- Given a (square) symmetric positive semidefinite matrix \Theta:
  - Eigenvalues: \Theta = \sum_i \lambda_i q_i q_i', with all \lambda_i \geq 0
- Thus rank is: the number of nonzero eigenvalues
- Approximation: keep only the top-k eigenvalue/eigenvector pairs for a rank-k approximation
- Property of trace: \mathrm{tr}(\Theta) = \sum_i \lambda_i
- Thus, approximate rank minimization by trace minimization: replace \mathrm{rank}(\Theta) by the linear (hence convex) function \mathrm{tr}(\Theta)
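Restating the trick compactly (standard facts about PSD matrices, not copied verbatim from the slides):

```latex
% For symmetric PSD \Theta with eigendecomposition \Theta = Q \Lambda Q^\top:
\Theta = \sum_i \lambda_i q_i q_i^\top, \quad \lambda_i \geq 0
\;\Longrightarrow\;
\operatorname{rank}(\Theta) = \#\{i : \lambda_i > 0\}, \quad
\operatorname{tr}(\Theta) = \sum_i \lambda_i .
% rank counts the nonzero eigenvalues, trace sums them; trace is linear,
% so we relax the combinatorial objective to a convex one:
\min_\Theta \operatorname{rank}(\Theta)
\;\rightsquigarrow\;
\min_\Theta \operatorname{tr}(\Theta)
```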
Generalizing the Trace Trick

- Non-square matrices ain't got no trace
- For (square) positive semidefinite matrices, matrix factorization:
  \Theta = Q \Lambda Q', \quad \mathrm{tr}(\Theta) = \sum_i \lambda_i
- For rectangular matrices, singular value decomposition:
  \Theta = U \Sigma V', \quad \sigma_i \geq 0
- Nuclear norm: \|\Theta\|_* = \sum_i \sigma_i(\Theta), the sum of the singular values
Nuclear Norm Minimization

- Optimization problem:
  \min_\Theta \|\Theta\|_* \quad \text{s.t.} \quad \Theta_{uv} = r_{uv}, \; \forall r_{uv} \in X, \; r_{uv} \neq ?
- Possible to relax the equality constraints:
  \min_\Theta \sum_{r_{uv}} (\Theta_{uv} - r_{uv})^2 + \lambda \|\Theta\|_*
- Both are convex problems! (solvable by semidefinite programming)
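One standard building block for the relaxed problem (a common approach, though not necessarily the solver used in the course) is the proximal operator of the nuclear norm: soft-thresholding of the singular values. A minimal sketch:

```python
import numpy as np

def svd_soft_threshold(Z, lam):
    """Prox operator of lam * ||.||_*: shrink every singular value by lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - lam, 0.0)   # soft-threshold the spectrum
    return (U * s) @ Vt            # rebuild with the shrunken singular values
```

Iterating "fill in the observed entries, then soft-threshold" is the soft-impute / singular value thresholding idea; the shrinkage zeroes out small singular values, which is how the nuclear norm induces low rank.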
Analysis of Nuclear Norm

- Nuclear norm minimization is a convex relaxation of the rank minimization problem:
  \min_\Theta \mathrm{rank}(\Theta) \quad \text{s.t.} \quad \Theta_{uv} = r_{uv}, \; \forall r_{uv} \in X, \; r_{uv} \neq ?
  relaxes to
  \min_\Theta \|\Theta\|_* \quad \text{s.t.} \quad \Theta_{uv} = r_{uv}, \; \forall r_{uv} \in X, \; r_{uv} \neq ?
- Theorem [Candes, Recht '08]:
  - If there is a true matrix of rank k,
  - and we observe at least C k n^{1.2} \log n random entries of the true matrix,
  - then the true matrix is recovered exactly, with high probability, by convex nuclear norm minimization!
  - (Under certain conditions)
Nuclear Norm Minimization versus Direct (Bilinear) Low Rank Solutions

- Nuclear norm minimization:
  \min_\Theta \sum_{r_{uv}} (\Theta_{uv} - r_{uv})^2 + \lambda \|\Theta\|_*
  - Annoying because: \Theta is a full (#users x #movies) matrix, and semidefinite programming does not scale
- Instead:
  \min_{L,R} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \lambda_u \|L\|_F^2 + \lambda_v \|R\|_F^2
  - Annoying because: the bilinear problem is non-convex, so we may only find local optima
- But:
  \|\Theta\|_* = \inf_{L,R} \left\{ \tfrac{1}{2}\|L\|_F^2 + \tfrac{1}{2}\|R\|_F^2 : \Theta = L R' \right\}
- So the Frobenius-norm regularizers implicitly penalize the nuclear norm of L R'
- And local optima of the bilinear problem coincide with the global optimum of the convex problem
  - Under certain conditions [Burer, Monteiro '04]
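A quick numerical sanity check of the variational identity (illustrative only): any factorization Theta = L R' upper-bounds the nuclear norm via (1/2)(||L||_F^2 + ||R||_F^2), and a balanced factorization built from the SVD attains the infimum.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_normal((50, 5))
R = rng.standard_normal((40, 5))
Theta = L @ R.T                  # a rank-5 matrix

nuc = np.linalg.svd(Theta, compute_uv=False).sum()  # ||Theta||_*
bound = 0.5 * (np.sum(L**2) + np.sum(R**2))         # any factorization bounds it

# A balanced factorization from the SVD attains the infimum:
U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
L_bal = U * np.sqrt(s)
R_bal = Vt.T * np.sqrt(s)
tight = 0.5 * (np.sum(L_bal**2) + np.sum(R_bal**2))

print(nuc <= bound + 1e-8, np.isclose(nuc, tight))  # True True
```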
What you need to know...

- Stochastic gradient descent for matrix factorization
- Norm minimization as convex relaxation of rank minimization
  - Trace norm for PSD matrices
  - Nuclear norm in general
- Intuitive relationship between nuclear norm minimization and direct (bilinear) minimization
Case Study 4: Collaborative Filtering
Nonnegative Matrix Factorization
Projected Gradient
Machine Learning/Statistics for Big Data
CSE599C1/STAT592, University of Washington
Carlos Guestrin
March 7th, 2013
Matrix factorization solutions can be unintuitive...

- Many, many, many applications of matrix factorization
- E.g., in text data, can do topic modeling (alternative to LDA):

  X = L R'
  [Figure: documents x words matrix X factored into documents x topics (L) and topics x words (R')]

- Would like: nonnegative factors, so each topic is an interpretable mixture of words
- But... an unconstrained factorization can produce negative entries
Nonnegative Matrix Factorization

X = L R'

- Just like before, but with nonnegativity constraints:
  \min_{L \geq 0, R \geq 0} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \lambda_u \|L\|_F^2 + \lambda_v \|R\|_F^2
- Constrained optimization problem
  - Many, many, many, many solution methods... we'll check out a simple one
Projected Gradient

- Standard (unconstrained) optimization:
  - Want to minimize: \min_\Theta f(\Theta)
  - Use gradient updates: \Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta_t \nabla f(\Theta^{(t)})
- Constrained optimization:
  - Given a convex set C of feasible solutions
  - Want to find the minimum within C: \min_{\Theta \in C} f(\Theta)
- Projected gradient (sketched below):
  - Take a gradient step, ignoring the constraints: \tilde\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta_t \nabla f(\Theta^{(t)})
  - Projection into the feasible set: \Theta^{(t+1)} \leftarrow \Pi_C(\tilde\Theta^{(t+1)}) = \arg\min_{\Theta \in C} \|\Theta - \tilde\Theta^{(t+1)}\|_2
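A generic sketch of projected gradient descent, assuming we are handed a gradient oracle and a Euclidean projection onto C (both function arguments are illustrative):

```python
import numpy as np

def projected_gradient(grad_f, project, theta0, eta=0.1, n_iters=100):
    """Minimize f over a convex set C: gradient step, then project back."""
    theta = project(theta0)
    for _ in range(n_iters):
        theta = project(theta - eta * grad_f(theta))
    return theta

# Example: minimize ||theta - c||^2 over the nonnegative orthant.
c = np.array([1.0, -2.0, 3.0])
sol = projected_gradient(lambda th: 2 * (th - c),          # gradient of f
                         lambda th: np.maximum(th, 0.0),   # projection onto C
                         theta0=np.zeros(3))
print(sol)  # approximately [1, 0, 3]
```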
Projected Stochastic Gradient Descent for Nonnegative Matrix Factorization

\min_{L \geq 0, R \geq 0} \; \frac{1}{2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \frac{\lambda_u}{2} \|L\|_F^2 + \frac{\lambda_v}{2} \|R\|_F^2

- Gradient step observing r_uv, ignoring the constraints (with \epsilon_t = L_u^{(t)} \cdot R_v^{(t)} - r_{uv}):
  \tilde{L}_u^{(t+1)} = (1 - \eta_t \lambda_u) L_u^{(t)} - \eta_t \epsilon_t R_v^{(t)}
  \tilde{R}_v^{(t+1)} = (1 - \eta_t \lambda_v) R_v^{(t)} - \eta_t \epsilon_t L_u^{(t)}
- Convex set: C = \{(L, R) : L \geq 0, R \geq 0\}, elementwise
- Projection step: clip negative entries to zero, L_u^{(t+1)} = \max(0, \tilde{L}_u^{(t+1)}) elementwise (likewise for R_v)
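Combining the SGD step with the clipping projection gives a compact sketch of the full update (illustrative names, same assumptions as the earlier SGD sketch):

```python
import numpy as np

def projected_sgd_step(L, R, u, v, r_uv, eta, lam_u, lam_v):
    """One projected SGD update for nonnegative matrix factorization."""
    eps = L[u] @ R[v] - r_uv
    Lu_old = L[u].copy()
    # Unconstrained gradient step, then Euclidean projection onto {>= 0},
    # which for the nonnegative orthant is just elementwise clipping.
    L[u] = np.maximum((1 - eta * lam_u) * L[u] - eta * eps * R[v], 0.0)
    R[v] = np.maximum((1 - eta * lam_v) * R[v] - eta * eps * Lu_old, 0.0)
```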
What you need to know...

- In many applications, we want the factors to be nonnegative
- This corresponds to a constrained optimization problem
- Many possible approaches to solve it, e.g., projected gradient