Coresets and Sketches for High Dimensional Subspace Approximation Problems
Morteza Monemizadeh, TU Dortmund
Joint work with: D. Feldman, C. Sohler, D. Woodruff
SODA 2010

Unbounded Precision
Input: $P = \{p_1, p_2, \dots, p_n\} \subseteq \mathbb{R}^d$; $j \ge 0$
Insertion-only Streaming
[Figure: a stream of points, labeled head of stream, seen points, unseen points]

Subspace Problem
Find a j-subspace F:
$\mathrm{OPT} = \min_{F \subseteq \mathbb{R}^d} \mathrm{cost}(P, F) = \min_{F \subseteq \mathbb{R}^d} \sum_{p_i \in P} \mathrm{dist}(p_i, F)$
[Figure: points $p_1, \dots, p_6$ and their Euclidean distances to F]

Subspace Approximation
PTAS: Find a j-subspace $F'$ such that $\mathrm{cost}(P, F') \le (1 + \epsilon) \cdot \mathrm{OPT}$

Simple Cases
$\mathrm{cost}(P, F) = \sum_{p_i \in P} \mathrm{dist}(p_i, F)$
j = 0: 1-median
k-median
PCA/SVD
Machine Learning
LSI, PageRank, HITS
Collaborative Filtering, Recommendation Systems
Clustering

Simple Cases
j = d − 1:
Linear regression
Nonlinear regression
Shape-fitting

Known Before
d = O(1): Low-dimensions
Coresets (Har-Peled)
Dynamic Programming (Arora, Mitchell)
d = O(n): High-dimensions
Dimensionality Reduction (Indyk, Rabani, …)

Simple PTAS
Centroid Set: $\Gamma = \{F_1, F_2, \dots, F_i, \dots, F_{|\Gamma|}\}$
$\exists F_i : \mathrm{cost}(P, F_i) \le (1 + \epsilon) \cdot \mathrm{OPT}$
PTAS: $O(nd \cdot |\Gamma|)$

PTAS
Weak Coreset: $S = \{s_1, s_2, \dots, s_i, \dots, s_{|S|}\}$, $|S| = O(|\Gamma|)$
$\forall F_i \in \Gamma : |\mathrm{cost}(S, F_i) - \mathrm{cost}(P, F_i)| \le \epsilon \cdot \mathrm{cost}(P, F_i)$
PTAS: $|\Gamma| = 2^{\mathrm{poly}(j/\epsilon)}$; $O(d \cdot |\Gamma|) = O(d \cdot 2^{\mathrm{poly}(j/\epsilon)})$

Tools
Weak Coreset
Centroid Set

Coreset Construction
Assumptions: d = 2, j = 1
Fix a 1-subspace (line): $F_i$
Have a 1-subspace (line): $F^*$ with $\mathrm{cost}(P, F^*) \le O(1) \cdot \mathrm{OPT}$
GOAL: with probability $\ge 1 - \delta$, $\exists Q \subseteq \mathbb{R}^d$:
$|\mathrm{cost}(P, F_i) - \mathrm{cost}(Q, F_i)| \le \epsilon \cdot \mathrm{cost}(P, F^*) \le O(1) \cdot \epsilon \cdot \mathrm{OPT} \le O(\epsilon) \cdot \mathrm{cost}(P, F_i)$
$|Q| = O(\log(1/\delta) / \epsilon^2)$

1st Try
Sampling u.a.r. or even non-uniformly:
$E(\mathrm{cost}(Q, F_i)) = \mathrm{cost}(P, F_i)$

2nd Try
[Figure: lines $F_i$ and $F^*$; each input point $(p_1, 1), \dots, (p_j, 1)$ is shown with its projection $\bar{p}_j$ onto $F^*$, which appears twice with weights $(\bar{p}_j, +1)$ and $(\bar{p}_j, -1)$]
$\mathrm{cost}(P, F_i)$
$\mathrm{cost}(\bar{P}, F_i) = \sum_{p_j \in P} \mathrm{dist}(\bar{p}_j, F_i)$
$\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i)$
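The "2nd Try" slides pair every point $p_j$ (weight +1) with its projection $\bar{p}_j$ onto $F^*$ (weight −1), so that the signed pairs measure $\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i)$. The sketch below illustrates this for the slides' running case d = 2, j = 1. It rests on assumptions that are not on the slides: lines pass through the origin and are given by an angle, points are sampled with probability proportional to $\mathrm{dist}(p, F^*)$, and all function names are hypothetical. It is an illustration of the idea, not the authors' implementation.

```python
import math
import random


def dist(p, theta):
    """Distance from 2-D point p to the line through the origin at angle theta."""
    ux, uy = math.cos(theta), math.sin(theta)
    return abs(p[0] * uy - p[1] * ux)


def project(p, theta):
    """Orthogonal projection of p onto the line through the origin at angle theta."""
    ux, uy = math.cos(theta), math.sin(theta)
    t = p[0] * ux + p[1] * uy
    return (t * ux, t * uy)


def cost(points, theta):
    """cost(P, F): sum of point-to-line distances."""
    return sum(dist(p, theta) for p in points)


def signed_pair_sample(points, theta_star, m, rng):
    """Draw m signed pairs (p, +w), (p_bar, -w).

    Assumption: point i is drawn with probability dist(p_i, F*) / cost(P, F*),
    and w = cost(P, F*) / (m * dist(p_i, F*)) makes the estimate unbiased.
    Requires at least one point off F*, otherwise every sampling weight is zero.
    """
    d_star = [dist(p, theta_star) for p in points]
    total = sum(d_star)
    chosen = rng.choices(range(len(points)), weights=d_star, k=m)
    sample = []
    for i in chosen:
        w = total / (m * d_star[i])
        sample.append((points[i], +w))
        sample.append((project(points[i], theta_star), -w))
    return sample


def weighted_cost(sample, theta):
    """cost(S, F) for a sample of (point, signed weight) pairs."""
    return sum(w * dist(q, theta) for q, w in sample)


# Toy data: theta_star stands in for the O(1)-approximate line F*,
# theta_i for the fixed query line F_i.
P = [(1.0, 0.5), (2.0, -0.3), (-1.5, 0.4), (3.0, 0.2), (-2.0, -0.6)]
theta_star, theta_i = 0.0, 0.7
P_bar = [project(p, theta_star) for p in P]

S = signed_pair_sample(P, theta_star, m=50, rng=random.Random(0))
target = cost(P, theta_i) - cost(P_bar, theta_i)
print("estimate:", weighted_cost(S, theta_i), "target:", target)
```

Because $\|p - \bar{p}\| = \mathrm{dist}(p, F^*)$ and the distance to a line is 1-Lipschitz, each pair contributes at most $\mathrm{cost}(P, F^*)/m$ in absolute value. That bounded range is what makes a concentration bound over $O(\log(1/\delta)/\epsilon^2)$ samples applicable.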
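[Figure: the sampled pairs carry larger weights, e.g. $(p_j, 2)$ and $(\bar{p}_j, -2)$, next to unsampled points such as $(p_1, 1)$ and $(\bar{p}_1, -1)$; lines $F_i$ and $F^*$]
$E[\mathrm{cost}(S, F_i)] = \mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i)$
$\mathrm{cost}(S, F_i) \le |\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i)| \le \mathrm{cost}(P, F^*)$

Chernoff Bounds
$E[\mathrm{cost}(S, F_i)] = \mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i)$
$\mathrm{cost}(S, F_i) \le |\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i)| \le \mathrm{cost}(P, F^*)$
$|\mathrm{cost}(S, F_i) - (\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i))| \le \epsilon \cdot \mathrm{cost}(P, F^*)$
$|S| = O(\log(1/\delta) / \epsilon^2)$

Recursion
[Figure: the projected points $(\bar{p}_1, +1), \dots, (\bar{p}_j, +1)$ on $F^*$, the origin 0, and the line $F_i$]
$\mathrm{cost}(\bar{P}, F_i) = \sum_{p_j \in P} \mathrm{dist}(\bar{p}_j, F_i)$
[Figure: each projected point is paired with a copy at the origin with weights $+1$ and $-1$; the origin carries total weight n, written $(0, n)$]
$\mathrm{cost}(\vec{0} \times n, F_i)$
$\mathrm{cost}(\bar{P}, F_i) - \mathrm{cost}(\vec{0} \times n, F_i)$
[Figure: sampled projected points with larger weights, e.g. $(\bar{p}_j, +2)$]
$E(\mathrm{cost}(S', F_i)) = \mathrm{cost}(\bar{P}, F_i) - \mathrm{cost}(\vec{0} \times n, F_i)$
$\mathrm{cost}(S', F_i) \le O(1) \cdot \mathrm{cost}(P, F^*)$
$|\mathrm{cost}(S', F_i) - (\mathrm{cost}(\bar{P}, F_i) - \mathrm{cost}(\vec{0} \times n, F_i))| \le \epsilon \cdot \mathrm{cost}(P, F^*)$
$|S| = O(\log(1/\delta) / \epsilon^2)$

$\mathrm{cost}(P, F_i) = \mathrm{cost}(\vec{0} \times n, F_i) + [\mathrm{cost}(\bar{P}, F_i) - \mathrm{cost}(\vec{0} \times n, F_i)] + [\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i)]$
$|\mathrm{cost}(S', F_i) - (\mathrm{cost}(\bar{P}, F_i) - \mathrm{cost}(\vec{0} \times n, F_i))| \le \epsilon \cdot \mathrm{cost}(P, F^*)$
$|\mathrm{cost}(S, F_i) - (\mathrm{cost}(P, F_i) - \mathrm{cost}(\bar{P}, F_i))| \le \epsilon \cdot \mathrm{cost}(P, F^*)$

Strong Coreset
$S = \{s_1, s_2, \dots, s_i, \dots, s_{|S|}\}$
$\forall F_i \subseteq \mathbb{R}^d : |\mathrm{cost}(S, F_i) - \mathrm{cost}(P, F_i)| \le \epsilon \cdot \mathrm{cost}(P, F_i)$
$|S| = O(d j^{O(j^2)} \cdot \epsilon^{-2} \cdot \log n)$
Stream: $|S| = O\left(d \left(\frac{j \cdot 2^{\sqrt{\log n}}}{\epsilon^2}\right)^{\mathrm{poly}(j)}\right)$

Centroid Set
In time $n \cdot 2^{\mathrm{poly}(j/\epsilon)}$
Centroid Set: $\Gamma = \{F_1, F_2, \dots, F_i, \dots, F_{|\Gamma|}\}$
$\exists F_i : \mathrm{cost}(P, F_i) \le (1 + \epsilon) \cdot \mathrm{OPT}$
$|\Gamma| = 2^{\mathrm{poly}(j/\epsilon)}$

Centroid Set Construction

Bounded Precision
The points $p_1, p_2, \dots, p_i, \dots, p_n$ are the rows of a matrix $A[n, d]$
$A[i, j] \in \{-M, \dots, +M\}$
Updates: $A[i, j] - 5$, $A[i, j] + 10$
Stream S: …, (i, j, −5), …, (i, j, +10), … : $|S| = \mathrm{poly}(n, M)$

Bounded Precision
1-pass streaming algorithm
Space: $\tilde{O}(n j^4 \cdot \log(nd)) / \epsilon^5$
Time: $M^{\mathrm{poly}(j/\epsilon)}$

Open Problems
Coreset size: $|S| = O(d j^{O(j^2)} \cdot \epsilon^{-2} \cdot \log n)$
Stream: $|S| = O\left(d \left(\frac{j \cdot 2^{\sqrt{\log n}}}{\epsilon^2}\right)^{\mathrm{poly}(j)}\right)$
PTAS: $O(nd \cdot \mathrm{poly}(j/\epsilon) + (n + d) \cdot 2^{\mathrm{poly}(j/\epsilon)})$
What other classes of Clustering?

Thanks!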