Fast Monte-Carlo Algorithms for Matrix Multiplication

Computing Sketches of Matrices Efficiently
& (Privacy Preserving) Data Mining
Petros Drineas
Rensselaer Polytechnic Institute
[email protected]
(joint work with R. Kannan and M. Mahoney)
@ DIMACS Workshop on Privacy Preserving Data Mining
Motivation (Data Mining)
In many applications large matrices appear (too large to store in RAM).
• We can make a few “passes” (sequential READS) through the matrices.
• We can create and store a small “sketch” of the matrices in RAM.
• Computing the “sketch” should be a very fast process.
Discard the original matrix and work with the “sketch”.
Motivation (Privacy Preserving)
In many applications, instead of revealing a large matrix, we only reveal its
“sketch”.
• Intuition: The “sketch” is an approximation to the original matrix.
Instead of viewing the approximation as a “necessary evil”, we might be
able to use it to achieve privacy preservation (similar ideas in Feigenbaum
et al., ICALP 2001).
• Goal: Formulate a technical definition of privacy that might be
achievable by such “sketching” algorithms and provide meaningful and
quantifiable protection.
Achieving the goal is an open problem!
Our approach & our results
1. A “sketch” consisting of a few rows/columns of the matrix is adequate for efficient approximations.
[see D & Kannan ’03, and D, Kannan & Mahoney ’04]
2. We draw the rows/columns randomly, using adaptive sampling; e.g. rows/columns are picked with probability proportional to their lengths.
Create an approximation to the original matrix which can be stored in much less space.
Overview
• A Data Mining setup
• Approximating a large matrix
• Algorithm
• Error bounds
• Tightness of the results
• An alternative approach (Achlioptas and McSherry ’01 and ’03)
• Conclusions
Applications: Data Mining
We are given m (> 10^6) objects and n (> 10^5) features describing the objects.
Database
An m-by-n matrix A (A_ij shows the “importance” of feature j for object i).
Every row of A represents an object.
Queries
Given a new object x, find similar objects in the database (nearest
neighbors).
Applications (cont’d)
Two objects are “close” if the angle between their corresponding vectors is small. So, assuming that the vectors are normalized,
x^T · d = cos(x,d)
is high when the two objects are close.
[Figure: the object vectors x and d plotted in the (feature 1, feature 2) plane, separated by the angle between d and x.]
A · x computes all the angles and answers the query.
Key observation: The exact value x^T · d might not be necessary.
1. The feature values in the vectors are set by coarse heuristics.
2. It is in general enough to see if x^T · d > Threshold.
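As a toy illustration of this query model (the sizes, the random data, and the 0.9 threshold below are made up), with unit-length rows the exact query is one matrix-vector product followed by a threshold test:

```python
import numpy as np

# Toy database: m objects described by n features; rows are normalized to unit
# length, so A @ x gives the cosine cos(x, d) for every object d in the database.
rng = np.random.default_rng(0)
m, n = 1000, 50
A = rng.random((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)

x = rng.random(n)
x /= np.linalg.norm(x)

cosines = A @ x                            # all m angles at once
neighbors = np.where(cosines > 0.9)[0]     # the "x^T . d > Threshold" test
```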
Using an approximation to A
Assume that A’ = CUR is an approximation to A, such that A’ is stored
efficiently (e.g. in RAM).
Given a query vector x, instead of computing A · x, compute A’ · x to
identify its nearest neighbors.
The CUR algorithm guarantees a bound on the worst case choice of x.
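A minimal sketch of the approximate query, assuming C, U and R have already been computed and stored in RAM (as described on the following slides). Multiplying right-to-left never forms A’ explicitly; for constant c and r it costs O(m+n) arithmetic per query, versus O(mn) for the exact product A · x:

```python
def approximate_cosines(C, U, R, x):
    # A' @ x = C @ (U @ (R @ x)): an r-vector, then a c-vector, then an m-vector.
    return C @ (U @ (R @ x))

# approx_neighbors = np.where(approximate_cosines(C, U, R, x) > 0.9)[0]
```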
Approximating A efficiently
Given a large m-by-n matrix A (stored on disk), compute an
approximation A’ to A such that:
1. A’ can be stored in O(m+n) space, after making two passes through the entire matrix A, and using O(m+n) additional space and time.
2. A’ satisfies (with high probability)
||A-A’||_2^2 < ε ||A||_F^2
(and a similar bound with respect to the Frobenius norm).
Describing A’ = C · U · R
• C consists of c = Θ(1/ε^2) columns of A and R consists of r = Θ(1/ε^2) rows of A (the “description length” of A’ = CUR is O(m+n)).
• C and R are created using adaptive sampling.
Creating C and R
• Create C (R) by performing c (r) i.i.d. trials.
• In each trial, pick a column (row) of A with probability proportional to its squared length, i.e. pick column i with probability p_i = |A^(i)|^2 / ||A||_F^2 (and similarly for rows).
• Include A^(i) (A_(i)) as a column of C (a row of R).
[A^(i) (A_(i)) denotes the i-th column (row) of A.]
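A minimal in-memory sketch of this sampling step (illustrative code, not the authors’ implementation). It uses squared Euclidean lengths as the sampling probabilities and applies the usual 1/sqrt(c·p_i) rescaling to the picked columns/rows, a detail the slide omits:

```python
import numpy as np

def sample_columns(A, c, rng):
    """Pick c columns i.i.d. with probability proportional to their squared lengths;
    rescale each picked column so the sketch is correct in expectation."""
    p = np.sum(A * A, axis=0)
    p /= p.sum()                                  # p_i = |A^(i)|^2 / ||A||_F^2
    idx = rng.choice(A.shape[1], size=c, p=p)
    return A[:, idx] / np.sqrt(c * p[idx]), idx   # C is m-by-c

def sample_rows(A, r, rng):
    """Same procedure applied to the rows, producing the r-by-n matrix R."""
    q = np.sum(A * A, axis=1)
    q /= q.sum()
    idx = rng.choice(A.shape[0], size=r, p=q)
    return A[idx, :] / np.sqrt(r * q[idx])[:, None], idx

# rng = np.random.default_rng(0)
# C, col_idx = sample_columns(A, c=20, rng=rng)
# R, row_idx = sample_rows(A, r=20, rng=rng)
```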
Singular Value Decomposition (SVD)
A = U · S · V^T
U (V): orthogonal matrix containing the left (right) singular vectors of A.
S: diagonal matrix containing the singular values of A.
1. Exact computation of the SVD takes O(min(mn^2, m^2n)) time.
2. The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).
Rank k approximations (Ak)
U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.
S_k: diagonal matrix containing the top k singular values of A.
A_k = U_k · S_k · V_k^T is a matrix of rank k such that ||A-A_k||_{2,F} is minimized over all rank-k matrices!
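For reference, a short sketch of computing A_k directly via a truncated SVD. It uses np.linalg.svd for simplicity; for large sparse matrices one would instead use a Lanczos-type routine such as scipy.sparse.linalg.svds, as mentioned on the previous slide:

```python
import numpy as np

def best_rank_k(A, k):
    """A_k = U_k S_k V_k^T: the rank-k matrix minimizing ||A - A_k||
    in both the spectral (2-) norm and the Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```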
The CUR algorithm
Input:
1. The matrix A in “sparse unordered representation”.
(e.g. non-zero entries of A are presented as triples (i, j, A_ij) in any order)
2. Positive integers c < n and r < m (number of columns/rows that we pick).
3. Positive integer k (the rank of A’ = CUR).
Note: Since A’ is of rank k, ||A-A’||_{2,F} >= ||A-A_k||_{2,F}.
We choose a k such that ||A-A_k||_{2,F} is small. As k grows, for the Frobenius norm approximation, c and r grow as well.
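To illustrate this access model (two sequential passes over unordered triples, with only O(m+n) extra space after the first pass), here is a hedged sketch. The function name and the duplicate-index bookkeeping are mine; the rescaling follows the same convention as in the earlier sampling sketch:

```python
import numpy as np

def two_pass_CR(triples, m, n, c, r, rng):
    """triples: a re-readable collection of (i, j, A_ij) in arbitrary order."""
    # Pass 1: accumulate squared row/column lengths -> sampling probabilities.
    row_norm2, col_norm2 = np.zeros(m), np.zeros(n)
    for i, j, a in triples:
        row_norm2[i] += a * a
        col_norm2[j] += a * a
    p_col = col_norm2 / col_norm2.sum()
    p_row = row_norm2 / row_norm2.sum()

    cols = rng.choice(n, size=c, p=p_col)      # sampled column indices (may repeat)
    rows = rng.choice(m, size=r, p=p_row)      # sampled row indices (may repeat)
    col_slots, row_slots = {}, {}
    for t, j in enumerate(cols):
        col_slots.setdefault(j, []).append(t)
    for t, i in enumerate(rows):
        row_slots.setdefault(i, []).append(t)

    # Pass 2: fill in the sampled (and rescaled) columns of C and rows of R.
    C, R = np.zeros((m, c)), np.zeros((r, n))
    for i, j, a in triples:
        for t in col_slots.get(j, ()):
            C[i, t] = a / np.sqrt(c * p_col[j])
        for t in row_slots.get(i, ()):
            R[t, j] = a / np.sqrt(r * p_row[i])
    return C, R, cols, rows
```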
Computing U
Intuition:
The CUR algorithm essentially expresses every row of the matrix A as a
linear combination of a small subset of the rows of A.
• This small subset consists of the rows in R.
• Given a row of A – say A_(i) – the algorithm computes the “best fit” for the row A_(i) using the rows in R as the basis,
e.g. A_(i) ≈ Σ_j u_j R_(j) for some coefficient vector u.
Notice that only c = O(1) elements of the i-th row are given as input.
However, a vector of coefficients u can still be computed.
Creating U
Running time
Computing the elements of U amounts to a pseudo-inverse computation. It can be done in O(c^2 m + c^3 + r^3) time.
Thus, U can be computed in O(m) time.
Note on the rank of U and CUR
The rank of U (by construction) is k.
Thus, the rank of A’=CUR is at most k.
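A simplified sketch of one way to realize this “best fit via pseudo-inverse” idea: take U to be a rank-k pseudo-inverse of the block where the sampled rows and columns intersect. The construction in the D, Kannan & Mahoney papers goes through the singular vectors of C and handles the sampling rescalings more carefully, so treat this only as an illustration:

```python
import numpy as np

def build_U(C, row_idx, k):
    """C: m-by-c sampled columns; row_idx: indices of the sampled rows.
    W = C[row_idx, :] is the r-by-c block where the sampled rows and columns meet
    (up to the sampling rescaling). With U = rank-k pseudo-inverse of W, each row
    of A' = C U R is a least-squares fit of the corresponding row of A, computed
    from only its c sampled coordinates, using the rows of R as the basis."""
    W = C[row_idx, :]
    Uw, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = min(k, np.count_nonzero(s > 1e-12))       # guard against tiny singular values
    return (Vt[:k, :].T / s[:k]) @ Uw[:, :k].T    # c-by-r matrix of rank at most k

# U = build_U(C, row_idx, k=10)
# A_approx = C @ U @ R
```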
Error bounds (Frobenius norm)
Assume A_k is the “best” rank-k approximation to A (through SVD). Then, with high probability,
||A-CUR||_F^2 <= ||A-A_k||_F^2 + ε ||A||_F^2.
We need to pick O(k/ε^2) rows and O(k/ε^2) columns.
Error bounds (2-norm)
Assume A_k is the “best” rank-k approximation to A (through SVD). Then, with high probability,
||A-CUR||_2^2 <= ||A-A_k||_2^2 + ε ||A||_F^2 <= (1/(k+1) + ε) ||A||_F^2,
since ||A-A_k||_2^2 <= ||A||_F^2/(k+1).
We need to pick O(1/ε^2) rows and O(1/ε^2) columns.
Can we do better?
Lemma
For any e < 1, there is a set of Ω(e^{-n}) n-by-n matrices, such that for any two distinct matrices A, B in the set,
||A-B||_2^2 > (e/20) ||A||_F^2.
Lower bound Theorem
Any algorithm which approximates these matrices must output a different “sketch” for each one; thus it must output at least Ω(n log(1/e)) bits.
Tighter lower bounds, matching our upper bounds almost exactly, have been obtained by Ziv Bar-Yossef (STOC ’03).
A different technique
(D. Achlioptas and F. McSherry, ’01 and ’03)
The Algorithm in 2 lines:
• To approximate a matrix A, keep a few elements of the matrix (instead of rows or
columns) and zero out the remaining elements.
• Compute a rank k approximation to this sparse matrix (using Lanczos methods).
Comparing the two techniques:
• Their error bound w.r.t. the 2-norm is better, while the error bound w.r.t. the Frobenius norm is the same.
(weighted sampling is used: heavier elements are kept with higher probabilities)
• Running times are the same.
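A hedged sketch of the element-wise idea: keep each entry with probability proportional to its squared magnitude and rescale the kept entries so the sparse matrix equals A in expectation. The exact probabilities and the treatment of very small entries in the Achlioptas-McSherry papers are more careful than this:

```python
import numpy as np

def sparsify(A, keep_fraction, rng):
    """Keep entry A_ij with probability p_ij proportional to A_ij^2 (capped at 1),
    rescale kept entries by 1/p_ij so E[S] = A, and zero out the rest."""
    p = A * A
    p = np.minimum(1.0, keep_fraction * A.size * p / p.sum())
    mask = rng.random(A.shape) < p
    S = np.zeros(A.shape)
    S[mask] = A[mask] / p[mask]
    return S

# S = sparsify(A, keep_fraction=0.1, rng=np.random.default_rng(0))
# A rank-k approximation of the (now sparse) S then serves as the approximation to A.
```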
Conclusions
• Given the small “sketch” of a matrix A, a “friendly user” can
• reconstruct a (provably accurate) approximation A’ to the original matrix A and run, on A’, any algorithm that he would use to process the original matrix A,
• use the Frobenius and spectral norm bounds for A-A’ to argue about the approximation error of his algorithms.
• How do we ensure privacy for the object-vectors (rows) of A that are
revealed as part of R?
• Do such sketches offer some privacy-preserving guarantees, under some (relaxed) definition of privacy?