
PCA with Outliers and Missing Data
Sujay Sanghavi
Electrical and Computer Engineering
The University of Texas at Austin
Joint work with C. Caramanis, Y. Chen, H. Xu
Outline
PCA and Outliers
- Why SVD fails
- Corrupted features vs. corrupted points
Our idea + Algorithms
Results
- Full observation
- Missing Data
Framework for Robustness in High Dimensions
Principal Components Analysis
Given points that lie on/near a lower-dimensional subspace, find this subspace.
Classical technique (a minimal sketch follows below):
1.  Organize the points as a matrix
2.  Take its SVD
3.  The top singular vectors span the subspace
[Figure: data points arranged as the columns of a low-rank matrix]
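A minimal numpy sketch of these three steps (the function name and interface are my own illustration, not from the talk):

```python
import numpy as np

def pca_subspace(points, r):
    # points: list of equal-length 1-D arrays; r: target subspace dimension
    M = np.column_stack(points)                        # 1. organize points as a matrix
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # 2. take the SVD
    return U[:, :r]                                    # 3. top-r left singular vectors span the subspace
```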
Fragility
Gross errors in even one or a few points can completely throw off PCA.
Reason: classical PCA minimizes the ℓ₂ error, which is susceptible to gross outliers.
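A small synthetic demonstration of this fragility (my own example): grossly corrupting a single column of a rank-1 matrix moves the top singular vector nearly orthogonal to the true direction.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 1)) @ rng.normal(size=(1, 200))   # clean rank-1 data
u_true = np.linalg.svd(M, full_matrices=False)[0][:, 0]

M_bad = M.copy()
M_bad[:, 0] += 1e3 * rng.normal(size=50)                   # grossly corrupt one point
u_bad = np.linalg.svd(M_bad, full_matrices=False)[0][:, 0]

print(abs(u_true @ u_bad))   # near 0 rather than 1: the top direction is destroyed
```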
Two types of gross errors:
Corrupted Features
-  Individual entries corrupted
Outliers
-  Entire columns corrupted
... and missing-data versions of both
PCA with Outliers
[Figure: inlier points near a low-dimensional subspace, plus outliers]
Objective: find the identities of the outliers
(and hence the column space of the true matrix)
Outlier Pursuit - Idea
[Figure: inlier points and outliers]
Standard PCA:
    min_L ‖M − L‖_F   s.t.  rank(L) ≤ r
With outliers, what we would like is:
    min_{L,C} ‖M − L − C‖_F   s.t.  rank(L) ≤ r,  C has at most c nonzero columns
Outlier Pursuit - Method
[Figure: inlier points and outliers]
We propose:
    min_{L,C} ‖M − L − C‖_F + λ₁‖L‖_* + λ₂‖C‖_{1,2}
where the nuclear norm ‖L‖_* is a convex surrogate for the rank constraint, and ‖C‖_{1,2} (the sum of the columns' ℓ₂ norms) is a convex surrogate for column-sparsity.
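A short cvxpy sketch of this convex program (cvxpy is one way to solve it, not necessarily what the authors used; the λ defaults are placeholders to be tuned):

```python
import cvxpy as cp

def outlier_pursuit(M, lam1=1.0, lam2=1.0):
    L = cp.Variable(M.shape)
    C = cp.Variable(M.shape)
    objective = cp.Minimize(
        cp.norm(M - L - C, "fro")               # data fit
        + lam1 * cp.normNuc(L)                  # nuclear norm: surrogate for rank
        + lam2 * cp.sum(cp.norm(C, 2, axis=0))  # l_{1,2} norm: surrogate for column-sparsity
    )
    cp.Problem(objective).solve()
    return L.value, C.value
```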
When does it (not) work?
It fails when certain directions of the column space of L are poorly represented. Writing
    L = U Σ Vᵀ,
recovery is hard when some singular vector has a large inner product with a coordinate axis, i.e. when
    max_i ‖Vᵀ e_i‖   is large.
Results
Assumption: the columns of the true L are incoherent:
    max_i ‖Vᵀ e_i‖² ≤ μr / n
(Note: the average of ‖Vᵀ e_i‖² over i is r/n, so μ ≥ 1 always.)
First consider the noiseless case:
    min_{L,C} ‖L‖_* + λ‖C‖_{1,2}   s.t.  L + C = M
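For intuition, μ is just the largest leverage score of the row space, rescaled; a small numpy sketch (the function name is my own) that computes it:

```python
import numpy as np

def column_incoherence(L, r):
    # rows of V are the vectors V^T e_i for the top-r right singular vectors
    V = np.linalg.svd(L, full_matrices=False)[2][:r].T   # shape (n, r)
    n = V.shape[0]
    max_leverage = np.max(np.sum(V**2, axis=1))          # max_i ||V^T e_i||^2
    return max_leverage * n / r                          # mu, with 1 <= mu <= n/r
```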
Results
Theorem (noiseless case): with λ = 3 / (7√(γn)), the convex program can identify up to a fraction γ of outliers as long as
    γ / (1 − γ) ≤ c / (μr)
Outer bound: a fraction of outliers greater than roughly 1/(r+1) makes the problem un-identifiable.
Proof Technique
A point x is the optimum of a convex function f if and only if zero lies in the (sub)gradient ∂f(x) of f at x.
Steps:
1. Guess a “nice” point (the oracle problem).
2. Show it is the optimum, by showing that zero is in the subgradient ∂f(x).
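As background for step 2 (a standard convex-analysis fact, stated here for context rather than taken from the talk): for a rank-r matrix L = UΣVᵀ, the subdifferential of the nuclear norm is

```latex
\partial \|L\|_{*} \;=\; \left\{\, U V^{\top} + W \;:\; U^{\top} W = 0,\; W V = 0,\; \|W\|_{2} \le 1 \,\right\}
```

so showing 0 ∈ ∂f amounts to constructing such a dual certificate W (together with an analogous certificate for the ‖·‖₁,₂ term).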
Proof Technique
Guessing a “nice” optimum.
(Note: in “single structure” problems like matrix completion, compressed sensing, etc., this is not an issue.)
Oracle Problem:
    min_{L,C} ‖M − L − C‖_F + λ₁‖L‖_* + λ₂‖C‖_{1,2}
    s.t.  ColSupp(C) ⊆ ColSupp(C*),  ColSpace(L) ⊆ ColSpace(L*)
Its optimum (L̂, Ĉ) is, by definition, a nice point.
Rest of the proof: show it is also the optimum of the original program, under our assumption.
Performance
[Figure: two phase-transition plots of rank r (13 to 21) against the fraction of outliers γ (0.05 to 0.4). Left: our L + C formulation. Right: the L + S (sparse + low-rank) formulation of [Chandrasekaran et al.] and [Candès et al.].]
Another view…
The mean is the solution of
    min_x Σ_i (x_i − x)²
Fragile: it can easily be skewed by one or a few points.
The median is the solution of
    min_x Σ_i |x_i − x|
Robust: skewing it requires errors in a constant fraction of the points.
Standard PCA of M is the solution of
    min_L Σ_j ‖M_j − L_j‖²   s.t.  rank(L) ≤ r
Our method is (a convex relaxation of)
    min_L Σ_j ‖M_j − L_j‖   s.t.  rank(L) ≤ r
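A two-line numeric illustration of the mean/median half of this analogy (synthetic numbers of my own):

```python
import numpy as np

x = np.array([1.0, 0.9, 1.1, 1.05, 100.0])   # one gross outlier
print(np.mean(x))     # ~20.81: the mean is dragged far away from 1
print(np.median(x))   # 1.05: the median barely notices
```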
Collaborative Filtering w/ Adversaries
[Figure: users × items ratings matrix]
Low-rank matrix that
-  is partially observed
-  has some corrupted columns
== outliers with missing data!
Our setting:
-  Good users == random sampling of an incoherent matrix (as in matrix completion)
-  Manipulators == completely arbitrary sampling and values
Outlier Pursuit with Missing Data
    min_{L,C} ‖L‖_* + λ‖C‖_{1,2}
    s.t.  L_ij + C_ij = M_ij   for all observed (i, j)
Now we need the row space to be incoherent as well, since we are doing matrix completion and manipulator identification simultaneously (a solver sketch follows below).
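A cvxpy sketch of this partially observed variant (mask handling via elementwise multiplication; the function name and interface are my own):

```python
import cvxpy as cp

def outlier_pursuit_missing(M, mask, lam=1.0):
    # mask: 0/1 array with 1 where m_ij is observed
    L = cp.Variable(M.shape)
    C = cp.Variable(M.shape)
    constraints = [cp.multiply(mask, L + C) == mask * M]   # l_ij + c_ij = m_ij on observed entries
    objective = cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.norm(C, 2, axis=0)))
    cp.Problem(objective, constraints).solve()
    return L.value, C.value
```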
Our Result
Theorem: the convex program optimum (L̂, Ĉ) is such that L̂ has the correct column space and the column support of Ĉ is exactly the set of manipulators, w.h.p., provided
    sampling density:   p ≥ c₁ μ²r² log³(4n) / n
    fraction of users that are manipulators:   γ / (1 − γ) ≤ c₂ p / ( (1 + μr/p) μ²r² log⁶(4n) )
Note: no assumptions on the manipulators.
Robust Collaborative Filtering
[Figure: two phase plots of fraction of observed entries (0.3 to 1) against fraction of manipulators (0.02 to 0.4), both algorithms run on partially observed data. Left: low-rank + column-sparse. Right: sparse + low-rank.]
More generally …
Several methods in high-dimensional statistics take the form
    min_X  L(y, A; X) + λ r(X)
(a loss function plus a regularizer).
Our approach:
    min_{X₁,X₂}  L(y, A; X₁ + X₂) + λ₁ r₁(X₁) + λ₂ r₂(X₂)
(the same loss function, with a weighted sum of regularizers).
Yields robustness + flexibility in several settings.
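In code, the template is a one-for-one translation; a cvxpy sketch of the pattern (the callables and names are my own illustration), with Outlier Pursuit recovered as a special case:

```python
import cvxpy as cp

def split_regularized(loss, r1, r2, shape, lam1=1.0, lam2=1.0):
    X1 = cp.Variable(shape)
    X2 = cp.Variable(shape)
    cp.Problem(cp.Minimize(loss(X1 + X2) + lam1 * r1(X1) + lam2 * r2(X2))).solve()
    return X1.value, X2.value

# Usage (Outlier Pursuit): loss = lambda X: cp.norm(M - X, "fro"),
# r1 = cp.normNuc, r2 = lambda C: cp.sum(cp.norm(C, 2, axis=0))
```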
Today: PCA that is robust to outliers + missing data (NIPS 2010, Trans. IT)
Related work:
-  Latent factors in time series (ICML 2012)
-  Matrix completion from errors and erasures (ISIT 2011, SIAM J. Optim. 2011)
-  Graph clustering (ICML 2011)
-  Robust recommender systems (ICML 2011)
-  Multiple sparse regression (NIPS 2010)
All papers on my website and arXiv.
Collaborators: Ali Jalali, Yudong Chen, Huan Xu, Constantine Caramanis, Pradeep Ravikumar