slides

Coresets and Sketches for
High Dimensional Subspace
Approximation Problems
Morteza Monemizadeh
TU Dortmund
Joint work with: D. Feldman, C. Sohler, D. Woodruff
SODA 2010
Unbounded Precision
Input:
$P = \{p_1, p_2, \dots, p_n\} \subseteq \mathbb{R}^d$; $j \ge 0$
Insertion-only Streaming:
Head of stream
Unseen points
Seen points
Subspace Problem
Find a j-subspace F:
$\mathrm{OPT} = \min_{F \subseteq \mathbb{R}^d} \mathrm{cost}(P, F) = \min_{F \subseteq \mathbb{R}^d} \sum_{p_i \in P} \mathrm{dist}(p_i, F)$
[Figure: points $p_1, \dots, p_6$ and their Euclidean distances to $F$]
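The cost function on this slide can be sketched numerically as follows. This is a minimal illustration, assuming the j-subspace F is given by a matrix `B` with orthonormal columns; the names `dist_to_subspace`, `cost`, and `B` are illustrative and do not come from the talk.

```python
import numpy as np

def dist_to_subspace(P, B):
    """Euclidean distance from each row of P to span(B).

    P: (n, d) array of points.
    B: (d, j) array with orthonormal columns spanning the j-subspace F
       (assumed representation; j = 0 means F is just the origin).
    """
    P = np.asarray(P, dtype=float)
    residual = P - (P @ B) @ B.T  # remove the component lying inside F
    return np.linalg.norm(residual, axis=1)

def cost(P, B):
    """cost(P, F): sum of Euclidean distances from the points to F."""
    return float(dist_to_subspace(P, B).sum())

# Example with d = 2, j = 1: three points and F = the x-axis.
P = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]])
x_axis = np.array([[1.0], [0.0]])
total = cost(P, x_axis)  # distances are 2, 1 and 0.5
```

With `B` of shape `(d, 0)` the same function returns the sum of the point norms, which corresponds to the j = 0 case.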
Subspace Approximation
PTAS:
Find a j-subspace $F'$ such that
$\mathrm{cost}(P, F') \le (1 + \varepsilon) \cdot \mathrm{OPT}$
Simple Cases
$\mathrm{cost}(P, F) = \sum_{p_i \in P} \mathrm{dist}(p_i, F)$
$j = 0$:
1-median
j :
k-median
PCA/SVD
Machine Learning
LSI, PageRank, HITS
Collaborative Filtering, Recommendation Systems
Clustering
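For the PCA/SVD item, a short sketch may help. It assumes the item refers to the variant of the problem with squared distances (the extracted slide does not spell this out), where the top j right singular vectors of the point matrix span an optimal j-subspace. The function names are illustrative.

```python
import numpy as np

def best_subspace_squared(P, j):
    """Orthonormal basis (d, j) of a j-subspace minimizing the sum of
    SQUARED distances: the top-j right singular vectors of P."""
    _, _, Vt = np.linalg.svd(np.asarray(P, dtype=float), full_matrices=False)
    return Vt[:j].T

def squared_cost(P, B):
    """Sum of squared Euclidean distances from the rows of P to span(B)."""
    residual = P - (P @ B) @ B.T
    return float((residual ** 2).sum())

# Synthetic data with two dominant directions (assumed example, d = 4).
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
B = best_subspace_squared(P, 2)
```

The subspace passes through the origin (no centering), matching the subspace formulation of the talk. For the non-squared cost used elsewhere in the slides, this SVD solution is in general not optimal.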
Simple Cases
$j = d - 1$:
Linear regression
Nonlinear regression
Shape-fitting
Known Before
$d = O(1)$: Low-dimensions
Coresets
(Har-Peled)
Dynamic Programming (Arora, Mitchell)
$d = O(n)$: High-dimensions
Dimensionality Reduction (Indyk, Rabani, …)
$d = O(1)$: Low-dimensions
Simple PTAS
Centroid Set:
$\Gamma = \{F_1, F_2, \dots, F_i, \dots, F_{|\Gamma|}\}$
$\exists F_i : \mathrm{cost}(P, F_i) \le (1 + \varepsilon) \cdot \mathrm{OPT}$
PTAS: $O(nd \cdot |\Gamma|)$
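The simple PTAS amounts to evaluating every candidate in the centroid set and keeping the cheapest. The sketch below assumes each candidate is given as an orthonormal basis matrix; the candidate set used in the example (lines through the origin at evenly spaced angles) is a hypothetical stand-in, not the construction from the talk.

```python
import numpy as np

def cost(P, B):
    """Sum of Euclidean distances from the rows of P to span(B)."""
    residual = P - (P @ B) @ B.T
    return float(np.linalg.norm(residual, axis=1).sum())

def best_candidate(P, candidates):
    """Return (index, cost) of the cheapest subspace in the candidate list.

    One cost evaluation is linear in n * d for fixed j, so the whole scan
    is linear in n * d * len(candidates)."""
    costs = [cost(P, B) for B in candidates]
    i = int(np.argmin(costs))
    return i, costs[i]

# Hypothetical candidate set: 180 lines through the origin in the plane.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
candidates = [np.array([[np.cos(a)], [np.sin(a)]]) for a in angles]

# Points scattered around the line y = x.
rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
P = t * np.array([[1.0, 1.0]]) + 0.05 * rng.normal(size=(200, 2))
i, c = best_candidate(P, candidates)
```

If the candidate set contains a $(1 + \varepsilon)$-approximate subspace, the returned candidate is at least as good, which is all the slide's guarantee needs.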
PTAS
Weak Coreset:
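$S = \{s_1, s_2, \dots, s_i, \dots, s_{|S|}\}$
$|S| = O(|\Gamma|)$
$\forall F_i \in \Gamma : |\mathrm{cost}(S, F_i) - \mathrm{cost}(P, F_i)| \le \varepsilon \cdot \mathrm{cost}(P, F_i)$
PTAS:
$|\Gamma| = 2^{\mathrm{poly}(j/\varepsilon)}$
$O(d \cdot |\Gamma|^2)$
$O(d \cdot 2^{\mathrm{poly}(j/\varepsilon)})$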
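The weak coreset condition can be checked directly when the candidate set is small. The sketch below assumes the coreset is a weighted point set `(S, w)`, which the slide does not state explicitly; the function names are illustrative.

```python
import numpy as np

def weighted_cost(points, weights, B):
    """Weighted sum of Euclidean distances from the points to span(B)."""
    residual = points - (points @ B) @ B.T
    return float(weights @ np.linalg.norm(residual, axis=1))

def is_weak_coreset(P, S, w, candidates, eps):
    """True if (S, w) preserves cost(P, F) within a factor (1 +/- eps)
    for every F in the finite candidate set (not for all subspaces)."""
    ones = np.ones(len(P))
    for B in candidates:
        full = weighted_cost(P, ones, B)
        if abs(weighted_cost(S, w, B) - full) > eps * full:
            return False
    return True

# Toy example: P repeats each base point three times, so the base points
# with weight 3 reproduce every cost exactly.
base = np.array([[1.0, 2.0], [-2.0, 0.5], [0.3, -1.0]])
P = np.repeat(base, 3, axis=0)
candidates = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
exact = is_weak_coreset(P, base, np.full(3, 3.0), candidates, eps=1e-9)
unweighted = is_weak_coreset(P, base, np.ones(3), candidates, eps=0.5)
```

The check only quantifies over the candidate set, which is what distinguishes a weak coreset from the strong coreset defined later in the talk.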
Tools
Weak Coreset
Centroid Set
Coreset Construction
Assumptions:
d=2, j=1
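Fix a 1-subspace (line): $F_i$
Have a 1-subspace (line) $F^*$ with $\mathrm{cost}(P, F^*) \le O(1) \cdot \mathrm{OPT}$
GOAL:
With probability $\ge 1 - \delta$:
$\exists Q \subseteq \mathbb{R}^d : |\mathrm{cost}(P, F_i) - \mathrm{cost}(Q, F_i)| \le \varepsilon \cdot \mathrm{cost}(P, F^*) \le O(1) \cdot \varepsilon \cdot \mathrm{OPT} \le O(\varepsilon) \cdot \mathrm{cost}(P, F_i)$
$|Q| = O(\log(1/\delta) / \varepsilon^2)$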
1st Try
Sampling u.a.r. or even non-uniformly:
$\mathbb{E}[\mathrm{cost}(Q, F_i)] = \mathrm{cost}(P, F_i)$
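A small sketch of the uniform-sampling attempt follows, assuming sampling with replacement and a weight of n/m per sampled point (both assumptions, chosen so that the estimator is unbiased). The example data, with a single far point, is made up to show why an unbiased estimator alone is not enough.

```python
import numpy as np

def dists(P, B):
    """Euclidean distance from each row of P to span(B)."""
    residual = P - (P @ B) @ B.T
    return np.linalg.norm(residual, axis=1)

def uniform_sample_estimate(P, B, m, rng):
    """Draw m points u.a.r. with replacement, weight each by n/m, and
    return the weighted cost of the sample with respect to span(B)."""
    n = len(P)
    idx = rng.integers(0, n, size=m)
    return float((n / m) * dists(P[idx], B).sum())

# 999 points on the x-axis plus one point at distance 1000 from it.
on_line = np.column_stack([np.arange(999.0), np.zeros(999)])
P = np.vstack([on_line, [[0.0, 1000.0]]])
x_axis = np.array([[1.0], [0.0]])
true_cost = float(dists(P, x_axis).sum())  # 1000.0

rng = np.random.default_rng(0)
estimate = uniform_sample_estimate(P, x_axis, m=10, rng=rng)
```

Here every estimate is a multiple of 100000 (it is 0 whenever the far point is missed), so no single sample of size 10 comes close to the true cost of 1000 even though the expectation is correct.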
2nd Try
[Figure: weighted points $(p_1, 1), \dots, (p_j, 1)$ and the weighted copies $(\bar p_1, \pm 1), \dots, (\bar p_j, \pm 1)$, together with the lines $F_i$ and $F^*$]
$\mathrm{cost}(\bar P, F_i) = \sum_{p_j \in P} \mathrm{dist}(\bar p_j, F_i)$
$\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i)$
[Figure: sampled points carry larger weights, e.g. $(p_j, 2)$ and $(\bar p_j, -2)$]
$\mathbb{E}[\mathrm{cost}(S, F_i)] = \mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i)$
$\mathrm{cost}(S, F_i) \le |\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i)| \le \mathrm{cost}(P, F^*)$
Chernoff Bounds
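$\mathbb{E}[\mathrm{cost}(S, F_i)] = \mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i)$
$\mathrm{cost}(S, F_i) \le |\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i)| \le \mathrm{cost}(P, F^*)$
$|\mathrm{cost}(S, F_i) - (\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i))| \le \varepsilon \cdot \mathrm{cost}(P, F^*)$
$|S| = O(\log(1/\delta) / \varepsilon^2)$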
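One way to read the second attempt is as importance sampling of the difference between the cost of P and the cost of its projection. The sketch below rests on several assumptions not confirmed by the extracted slides: $\bar P$ is the orthogonal projection of P onto $F^*$, a point is drawn with probability proportional to its distance to $F^*$, and $\mathrm{cost}(P, F^*) > 0$. The constant in `sample_size` comes from Hoeffding's inequality, not from the talk.

```python
import math
import numpy as np

def dists(P, B):
    """Euclidean distance from each row of P to span(B)."""
    residual = P - (P @ B) @ B.T
    return np.linalg.norm(residual, axis=1)

def sample_size(eps, delta):
    """Hoeffding bound for terms in [-C, C] and additive error eps * C."""
    return math.ceil(2.0 * math.log(2.0 / delta) / eps ** 2)

def difference_estimate(P, B_star, B_i, m, rng):
    """Estimate cost(P, F_i) - cost(P_bar, F_i) from m sampled points.

    Point p is drawn with probability dist(p, F*) / cost(P, F*); its term
    dist(p, F_i) - dist(p_bar, F_i) is divided by that probability, so each
    sampled value has magnitude at most cost(P, F*) (triangle inequality)."""
    P_bar = (P @ B_star) @ B_star.T           # assumed: projection onto F*
    d_star = dists(P, B_star)
    prob = d_star / d_star.sum()              # requires cost(P, F*) > 0
    terms = dists(P, B_i) - dists(P_bar, B_i)
    idx = rng.choice(len(P), size=m, p=prob)
    return float((terms[idx] / prob[idx]).mean())

rng = np.random.default_rng(0)
P = rng.normal(size=(500, 2)) * np.array([10.0, 1.0])
B_star = np.array([[1.0], [0.0]])               # stand-in for F*
B_i = np.array([[np.cos(0.3)], [np.sin(0.3)]])  # some query line F_i
m = sample_size(eps=0.1, delta=0.01)
estimate = difference_estimate(P, B_star, B_i, m, rng)
exact = float(dists(P, B_i).sum() - dists((P @ B_star) @ B_star.T, B_i).sum())
```

Because every sampled value lies in $[-\mathrm{cost}(P, F^*), \mathrm{cost}(P, F^*)]$, a Chernoff-type bound gives additive error $\varepsilon \cdot \mathrm{cost}(P, F^*)$ with probability $1 - \delta$ for a sample of size $O(\log(1/\delta)/\varepsilon^2)$, in line with the bound on the slide.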
Recursion
[Figure: points $(\bar p_1, +1), \dots, (\bar p_j, +1)$, the origin $0$, and the lines $F^*$ and $F_i$; later frames add the copies $(\bar p_1, -1), \dots, (\bar p_j, -1)$ and the weighted origin $(0, n)$]
$\mathrm{cost}(\bar P, F_i) = \sum_{p_j \in P} \mathrm{dist}(\bar p_j, F_i)$
$\mathrm{cost}(\vec 0 \times n, F_i)$
$\mathrm{cost}(\bar P, F_i) - \mathrm{cost}(\vec 0 \times n, F_i)$
[Figure: sampled points carry larger weights, e.g. $(\bar p_j, +2)$]
$\mathbb{E}[\mathrm{cost}(S', F_i)] = \mathrm{cost}(\bar P, F_i) - \mathrm{cost}(\vec 0 \times n, F_i)$
$\mathrm{cost}(S', F_i) \le O(1) \cdot \mathrm{cost}(P, F^*)$
$|\mathrm{cost}(S', F_i) - (\mathrm{cost}(\bar P, F_i) - \mathrm{cost}(\vec 0 \times n, F_i))| \le \varepsilon \cdot \mathrm{cost}(P, F^*)$
$|S| = O(\log(1/\delta) / \varepsilon^2)$
$\mathrm{cost}(P, F_i) = \mathrm{cost}(\vec 0 \times n, F_i) + (\mathrm{cost}(\bar P, F_i) - \mathrm{cost}(\vec 0 \times n, F_i)) + (\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i))$
$|\mathrm{cost}(S', F_i) - (\mathrm{cost}(\bar P, F_i) - \mathrm{cost}(\vec 0 \times n, F_i))| \le \varepsilon \cdot \mathrm{cost}(P, F^*)$
$|\mathrm{cost}(S, F_i) - (\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar P, F_i))| \le \varepsilon \cdot \mathrm{cost}(P, F^*)$
Strong Coreset
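$S = \{s_1, s_2, \dots, s_i, \dots, s_{|S|}\}$
$\forall F_i \subseteq \mathbb{R}^d : |\mathrm{cost}(S, F_i) - \mathrm{cost}(P, F_i)| \le \varepsilon \cdot \mathrm{cost}(P, F_i)$
$|S| = O(d j^{O(j^2)} \cdot \varepsilon^{-2} \cdot \log n)$
Stream: $|S| = O\big(d \, (j \cdot 2^{\sqrt{\log n}} / \varepsilon^2)^{\mathrm{poly}(j)}\big)$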
Centroid Set
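In time $n \cdot 2^{\mathrm{poly}(j/\varepsilon)}$
Centroid Set:
$\Gamma = \{F_1, F_2, \dots, F_i, \dots, F_{|\Gamma|}\}$
$\exists F_i : \mathrm{cost}(P, F_i) \le (1 + \varepsilon) \cdot \mathrm{OPT}$
$|\Gamma| = 2^{\mathrm{poly}(j/\varepsilon)}$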
Centroid Set Construction
Bounded Precision
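The points $p_1, p_2, \dots, p_i, \dots, p_n$ are the rows of the matrix $A[n, d]$
$A[i, j] \in \{-M, \dots, +M\}$
Updates: $A[i, j] - 5$, $A[i, j] + 10$
Stream S: ..., (i, j, -5), ..., (i, j, +10), ... : $|S| = \mathrm{poly}(n, M)$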
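The update model on this slide can be illustrated by replaying a stream of (i, j, delta) triples on an explicit matrix. This sketch assumes 0-based indices and that entries must stay within $\{-M, \dots, +M\}$ after every update; it stores A in full, so it only shows the input model, not the small-space sketch from the talk.

```python
import numpy as np

def apply_updates(n, d, M, stream):
    """Replay (i, j, delta) updates on an all-zero n x d integer matrix A,
    checking that every entry stays within {-M, ..., +M}."""
    A = np.zeros((n, d), dtype=np.int64)
    for i, j, delta in stream:
        A[i, j] += delta
        if abs(A[i, j]) > M:
            raise ValueError(f"entry ({i}, {j}) left the range [-{M}, {M}]")
    return A

# Entry (0, 1) first decreases by 5, then increases by 10.
stream = [(0, 1, -5), (2, 0, 3), (0, 1, +10)]
A = apply_updates(n=3, d=2, M=10, stream=stream)
```

Because updates may be negative, the same entry can be revisited many times, which is why the stream length is bounded by poly(n, M) rather than by n.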
Bounded Precision
1-pass streaming algorithm
Space: $\tilde{O}(n j^4 \cdot \log(nd)) / \varepsilon^5$
Time: $M^{\mathrm{poly}(j/\varepsilon)}$
Open Problems
Coreset size:
$|S| = O(d j^{O(j^2)} \cdot \varepsilon^{-2} \cdot \log n)$
Stream: $|S| = O\big(d \, (j \cdot 2^{\sqrt{\log n}} / \varepsilon^2)^{\mathrm{poly}(j)}\big)$
PTAS: $O(nd \cdot \mathrm{poly}(j/\varepsilon) + (n + d) \cdot 2^{\mathrm{poly}(j/\varepsilon)})$
What other classes of Clustering?
Thanks!