Singular value decomposition
Optimization Models
EE 127 / EE 227AT
Laurent El Ghaoui
EECS department
UC Berkeley
Spring 2015
LECTURE 5
Singular Value Decomposition
The license plate of Gene Golub (1932–2007).
Outline
1. The singular value decomposition (SVD)
   Motivation
   Dyads
   SVD theorem
2. Matrix properties via SVD
   Rank, nullspace and range
   Matrix norms, pseudo-inverses
   Projectors
3. Low-rank matrix approximation
4. Principal component analysis (PCA)
Motivation
Figure: Votes of US Senators, 2002-2004. The plot is impossible to read...
Can we make sense of this data?
For example, do the two political parties emerge from the data alone?
Could we further provide a “political spectrum score” for each Senator?
Dyads
A matrix $A \in \mathbb{R}^{m,n}$ is called a dyad if it can be written as
$$A = p q^\top$$
for some vectors $p \in \mathbb{R}^m$, $q \in \mathbb{R}^n$. Element-wise, the above reads
$$A_{ij} = p_i q_j, \quad 1 \le i \le m, \ 1 \le j \le n.$$
Interpretation:
The columns of $A$ are scaled copies of the same column $p$, with scaling factors given in vector $q$.
The rows of $A$ are scaled copies of the same row $q^\top$, with scaling factors given in vector $p$.
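As a small illustration (a minimal numpy sketch, not part of the original slides; the vectors p and q below are arbitrary), a dyad can be formed as an outer product and has rank at most one:

    import numpy as np

    # Hypothetical vectors p and q, chosen only for illustration.
    p = np.array([1.0, 2.0, 3.0])        # p in R^3
    q = np.array([4.0, 5.0])             # q in R^2

    A = np.outer(p, q)                   # A = p q^T, a 3x2 dyad
    print(A)
    print(np.linalg.matrix_rank(A))      # 1: a nonzero dyad has rank one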
Dyads
Example: video frames
We are given a set of image frames representing a video. Assuming that each image is
represented by a row vector of pixels, we can represent the whole video sequence as a
matrix A. Each row is an image.
Figure: Row vector representation of an image.
If the video shows a scene where no movement occurs, then the matrix A is a dyad.
Sums of dyads
The singular value decomposition theorem, seen next, states that any matrix can be written as a sum of dyads:
$$A = \sum_{i=1}^{r} p_i q_i^\top$$
for vectors $p_i$, $q_i$ that are mutually orthogonal.
This allows us to interpret data matrices as sums of “simpler” matrices (dyads).
The singular value decomposition (SVD)
SVD theorem
The singular value decomposition (SVD) of a matrix provides a three-term factorization which is similar to the spectral factorization, but holds for any, possibly non-symmetric and rectangular, matrix $A \in \mathbb{R}^{m,n}$.
Theorem 1 (SVD decomposition)
Any matrix $A \in \mathbb{R}^{m,n}$ can be factored as
$$A = U \tilde{\Sigma} V^\top$$
where $U \in \mathbb{R}^{m,m}$ and $V \in \mathbb{R}^{n,n}$ are orthogonal matrices (i.e., $U^\top U = I_m$, $V^\top V = I_n$), and $\tilde{\Sigma} \in \mathbb{R}^{m,n}$ is a matrix having the first $r \doteq \operatorname{rank} A$ diagonal entries $(\sigma_1, \ldots, \sigma_r)$ positive and decreasing in magnitude, and all other entries zero:
$$\tilde{\Sigma} = \begin{bmatrix} \Sigma & 0_{r,n-r} \\ 0_{m-r,r} & 0_{m-r,n-r} \end{bmatrix}, \qquad \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r) \succ 0.$$
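As a numerical illustration (a minimal numpy sketch with an arbitrary example matrix, not from the original slides), `np.linalg.svd` returns $U$, the diagonal of $\tilde{\Sigma}$, and $V^\top$, from which $A$ can be rebuilt:

    import numpy as np

    A = np.array([[1.0, 2.0, 0.0],
                  [2.0, 4.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])              # arbitrary 4x3 example

    U, s, Vt = np.linalg.svd(A)                  # full SVD: U is 4x4, Vt is 3x3
    Sigma_tilde = np.zeros(A.shape)
    Sigma_tilde[:len(s), :len(s)] = np.diag(s)   # embed singular values in an m x n matrix

    print(np.allclose(U @ Sigma_tilde @ Vt, A))  # True: A = U Sigma_tilde V^T
    print(np.allclose(U.T @ U, np.eye(4)), np.allclose(Vt @ Vt.T, np.eye(3)))  # orthogonality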
Compact-form SVD
Corollary 1 (Compact-form SVD)
Any matrix $A \in \mathbb{R}^{m,n}$ can be expressed as
$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top = U_r \Sigma V_r^\top$$
where $r = \operatorname{rank} A$, $U_r = [u_1 \cdots u_r]$ is such that $U_r^\top U_r = I_r$, $V_r = [v_1 \cdots v_r]$ is such that $V_r^\top V_r = I_r$, and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$.
The positive numbers $\sigma_i$ are called the singular values of $A$, vectors $u_i$ are called the left singular vectors of $A$, and $v_i$ the right singular vectors. These quantities satisfy
$$A v_i = \sigma_i u_i, \qquad u_i^\top A = \sigma_i v_i^\top, \qquad i = 1, \ldots, r.$$
Moreover, $\sigma_i^2 = \lambda_i(AA^\top) = \lambda_i(A^\top A)$, $i = 1, \ldots, r$, and $u_i$, $v_i$ are the eigenvectors of $AA^\top$ and of $A^\top A$, respectively.
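These relations are easy to check numerically; the sketch below (illustrative only, using a random example matrix) forms the compact factors and verifies $A v_i = \sigma_i u_i$ and $\sigma_i^2 = \lambda_i(A^\top A)$:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))                    # generic 5x3 matrix, rank 3

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact form: U is 5x3, Vt is 3x3
    r = np.sum(s > 1e-12)
    Ur, Vr, sr = U[:, :r], Vt[:r, :].T, s[:r]

    print(np.allclose(A, Ur @ np.diag(sr) @ Vr.T))     # A = U_r Sigma V_r^T
    print(np.allclose(A @ Vr, Ur * sr))                # A v_i = sigma_i u_i
    eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
    print(np.allclose(sr**2, eigvals[:r]))             # sigma_i^2 = lambda_i(A^T A)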
Interpretation
The singular value decomposition theorem allows us to write any matrix as a sum of dyads:
$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top,$$
where:
vectors $u_i$, $v_i$ are normalized, with $\sigma_i > 0$ providing the “strength” of the corresponding dyad;
the vectors $u_i$, $i = 1, \ldots, r$ (resp. $v_i$, $i = 1, \ldots, r$) are mutually orthogonal.
Matrix properties via SVD
Rank, nullspace and range
The rank $r$ of $A$ is the cardinality of the nonzero singular values, that is, the number of nonzero entries on the diagonal of $\tilde{\Sigma}$.
Since $r = \operatorname{rank} A$, by the fundamental theorem of linear algebra the dimension of the nullspace of $A$ is $\dim \mathcal{N}(A) = n - r$. An orthonormal basis spanning $\mathcal{N}(A)$ is given by the last $n - r$ columns of $V$, i.e.
$$\mathcal{N}(A) = \mathcal{R}(V_{nr}), \qquad V_{nr} \doteq [v_{r+1} \cdots v_n].$$
Similarly, an orthonormal basis spanning the range of $A$ is given by the first $r$ columns of $U$, i.e.
$$\mathcal{R}(A) = \mathcal{R}(U_r), \qquad U_r \doteq [u_1 \cdots u_r].$$
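A small numpy sketch (illustrative; the rank-deficient matrix below is an arbitrary example) extracting the rank, a nullspace basis, and a range basis from the SVD:

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [1.0, 0.0, 1.0]])            # rank 2: second row is twice the first

    U, s, Vt = np.linalg.svd(A)
    tol = max(A.shape) * np.finfo(float).eps * s[0]
    r = np.sum(s > tol)                        # numerical rank

    V_null = Vt[r:, :].T                       # last n-r columns of V span N(A)
    U_range = U[:, :r]                         # first r columns of U span R(A)

    print(r)                                   # 2
    print(np.allclose(A @ V_null, 0))          # columns of V_null lie in the nullspace
    print(np.linalg.matrix_rank(np.hstack([A, U_range])) == r)  # R(U_range) = R(A)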
Matrix properties via SVD
Matrix norms
The squared Frobenius matrix norm of a matrix $A \in \mathbb{R}^{m,n}$ can be defined as
$$\|A\|_F^2 = \operatorname{trace} A^\top A = \sum_{i=1}^{n} \lambda_i(A^\top A) = \sum_{i=1}^{n} \sigma_i^2,$$
where $\sigma_i$ are the singular values of $A$. Hence the squared Frobenius norm is nothing but the sum of the squares of the singular values.
The squared spectral matrix norm $\|A\|_2^2$ is equal to the maximum eigenvalue of $A^\top A$, therefore
$$\|A\|_2^2 = \sigma_1^2,$$
i.e., the spectral norm of $A$ coincides with the maximum singular value of $A$.
The so-called nuclear norm of a matrix $A$ is defined in terms of its singular values:
$$\|A\|_\ast = \sum_{i=1}^{r} \sigma_i, \qquad r = \operatorname{rank} A.$$
The nuclear norm appears in several problems related to low-rank matrix completion or rank minimization.
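These identities can be checked directly; a minimal sketch (random example matrix, for illustration only) comparing numpy's built-in norms with the singular-value formulas:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 6))

    s = np.linalg.svd(A, compute_uv=False)     # singular values only

    print(np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(s**2)))   # squared Frobenius norm
    print(np.isclose(np.linalg.norm(A, 2), s[0]))                  # spectral norm = sigma_1
    print(np.isclose(np.linalg.norm(A, 'nuc'), np.sum(s)))         # nuclear norm = sum of sigma_i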
Matrix properties via SVD
Condition number
The condition number of an invertible matrix $A \in \mathbb{R}^{n,n}$ is defined as the ratio between the largest and the smallest singular value:
$$\kappa(A) = \frac{\sigma_1}{\sigma_n} = \|A\|_2 \cdot \|A^{-1}\|_2.$$
This number provides a quantitative measure of how close $A$ is to being singular (the larger $\kappa(A)$ is, the closer $A$ is to being singular).
The condition number also provides a measure of the sensitivity of the solution of a system of linear equations to changes in the equation coefficients.
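A quick numerical check of the identity (illustrative, with a random invertible matrix); `np.linalg.cond` with the 2-norm computes the same ratio $\sigma_1/\sigma_n$:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((5, 5))            # almost surely invertible

    s = np.linalg.svd(A, compute_uv=False)
    kappa = s[0] / s[-1]

    print(np.isclose(kappa, np.linalg.cond(A, 2)))
    print(np.isclose(kappa, np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)))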
Matrix properties via SVD
Matrix pseudo-inverses
Given $A \in \mathbb{R}^{m,n}$, a pseudoinverse is a matrix $A^\dagger$ that satisfies
$$AA^\dagger A = A, \qquad A^\dagger A A^\dagger = A^\dagger, \qquad (AA^\dagger)^\top = AA^\dagger, \qquad (A^\dagger A)^\top = A^\dagger A.$$
A specific pseudoinverse is the so-called Moore-Penrose pseudoinverse:
$$A^\dagger = V \tilde{\Sigma}^\dagger U^\top \in \mathbb{R}^{n,m},$$
where
$$\tilde{\Sigma}^\dagger = \begin{bmatrix} \Sigma^{-1} & 0_{r,m-r} \\ 0_{n-r,r} & 0_{n-r,m-r} \end{bmatrix}, \qquad \Sigma^{-1} = \operatorname{diag}\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_r}\right) \succ 0.$$
Due to the zero blocks in $\tilde{\Sigma}^\dagger$, $A^\dagger$ can be written compactly as
$$A^\dagger = V_r \Sigma^{-1} U_r^\top.$$
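The compact formula can be checked against numpy's built-in pseudoinverse; a minimal sketch with an arbitrary rank-one example matrix:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])                 # 3x2 example of rank 1

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = np.sum(s > 1e-12)
    A_pinv = Vt[:r, :].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T   # A^dagger = V_r Sigma^{-1} U_r^T

    print(np.allclose(A_pinv, np.linalg.pinv(A)))              # matches the Moore-Penrose pseudoinverse
    print(np.allclose(A @ A_pinv @ A, A))                      # first Penrose condition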
Matrix properties via SVD
Moore-Penrose pseudo-inverse: special cases
If $A$ is square and nonsingular, then $A^\dagger = A^{-1}$.
If $A \in \mathbb{R}^{m,n}$ is full column rank, that is $r = n \le m$, then
$$A^\dagger A = V_r V_r^\top = V V^\top = I_n,$$
that is, $A^\dagger$ is a left inverse of $A$, and it has the explicit expression
$$A^\dagger = (A^\top A)^{-1} A^\top.$$
If $A \in \mathbb{R}^{m,n}$ is full row rank, that is $r = m \le n$, then
$$AA^\dagger = U_r U_r^\top = U U^\top = I_m,$$
that is, $A^\dagger$ is a right inverse of $A$, and it has the explicit expression
$$A^\dagger = A^\top (AA^\top)^{-1}.$$
Matrix properties via SVD
Projectors
Matrix $P_{\mathcal{R}(A)} \doteq AA^\dagger = U_r U_r^\top$ is an orthogonal projector onto $\mathcal{R}(A)$. This means that, for any $y \in \mathbb{R}^m$, the solution of the problem
$$\min_{z \in \mathcal{R}(A)} \|z - y\|_2$$
is given by $z^\ast = P_{\mathcal{R}(A)}\, y$.
Similarly, matrix $P_{\mathcal{N}(A^\top)} \doteq I_m - AA^\dagger$ is an orthogonal projector onto $\mathcal{R}(A)^\perp = \mathcal{N}(A^\top)$.
Matrix $P_{\mathcal{N}(A)} \doteq I_n - A^\dagger A$ is an orthogonal projector onto $\mathcal{N}(A)$.
Matrix $P_{\mathcal{N}(A)^\perp} \doteq A^\dagger A$ is an orthogonal projector onto $\mathcal{N}(A)^\perp = \mathcal{R}(A^\top)$.
Low-rank matrix approximation
Let $A \in \mathbb{R}^{m,n}$ be a given matrix, with $\operatorname{rank}(A) = r > 0$. We consider the problem of approximating $A$ with a matrix of lower rank. In particular, we consider the following rank-constrained approximation problem:
$$\min_{A_k \in \mathbb{R}^{m,n}} \|A - A_k\|_F^2 \quad \text{s.t.:} \quad \operatorname{rank}(A_k) = k,$$
where $1 \le k \le r$ is given.
Let
$$A = U \tilde{\Sigma} V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$$
be an SVD of $A$. An optimal solution of the above problem is simply obtained by truncating the previous summation to the $k$-th term, that is
$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top.$$
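A minimal sketch of the truncation (random example matrix, for illustration): keep the $k$ leading dyads and compare the error with the optimal value $\sigma_{k+1}^2 + \cdots + \sigma_r^2$:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((8, 6))
    k = 2

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k truncated SVD

    print(np.linalg.matrix_rank(A_k))                  # k
    err = np.linalg.norm(A - A_k, 'fro')**2
    print(np.isclose(err, np.sum(s[k:]**2)))           # optimal error: sum of discarded sigma_i^2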
Low-rank matrix approximation
The ratio
$$\eta_k = \frac{\|A_k\|_F^2}{\|A\|_F^2} = \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_r^2}$$
indicates what fraction of the total variance (Frobenius norm) in $A$ is explained by the rank-$k$ approximation of $A$.
A plot of $\eta_k$ as a function of $k$ may give useful indications of a good rank level $k$ at which to approximate $A$.
$\eta_k$ is related to the relative norm approximation error
$$e_k = \frac{\|A - A_k\|_F^2}{\|A\|_F^2} = \frac{\sigma_{k+1}^2 + \cdots + \sigma_r^2}{\sigma_1^2 + \cdots + \sigma_r^2} = 1 - \eta_k.$$
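The whole curve $\eta_k$ can be computed in one line from the singular values; a brief sketch with a random example matrix (illustrative only):

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.standard_normal((50, 30))

    s = np.linalg.svd(A, compute_uv=False)
    eta = np.cumsum(s**2) / np.sum(s**2)       # eta_k for k = 1, ..., r
    e = 1.0 - eta                              # relative approximation errors e_k

    print(eta[:5])                             # fraction of variance explained by the first ranks
    print(np.argmax(eta >= 0.95) + 1)          # smallest k explaining at least 95% of the variance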
Minimum “distance” to rank deficiency
Suppose $A \in \mathbb{R}^{m,n}$, $m \ge n$, is full rank, i.e., $\operatorname{rank}(A) = n$. We ask what is a minimal perturbation $\Delta A$ of $A$ that makes $A + \Delta A$ rank deficient. The Frobenius norm (or the spectral norm) of the minimal perturbation $\Delta A$ measures the “distance” of $A$ from rank deficiency.
Formally, we need to solve
$$\min_{\Delta A \in \mathbb{R}^{m,n}} \|\Delta A\|_F^2 \quad \text{s.t.:} \quad \operatorname{rank}(A + \Delta A) = n - 1.$$
This problem is equivalent to rank approximation, for $\Delta A = A_{n-1} - A$. The optimal solution is thus readily obtained as
$$\Delta A^\ast = A_{n-1} - A,$$
where $A_{n-1} = \sum_{i=1}^{n-1} \sigma_i u_i v_i^\top$. Therefore, we have
$$\Delta A^\ast = -\sigma_n u_n v_n^\top.$$
The minimal perturbation that leads to rank deficiency is a rank-one matrix. The distance to rank deficiency is $\|\Delta A^\ast\|_F = \|\Delta A^\ast\|_2 = \sigma_n$.
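A quick numerical check (random full-rank example, for illustration): subtracting the last dyad $\sigma_n u_n v_n^\top$ drops the rank by one, and the size of the perturbation is exactly $\sigma_n$:

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.standard_normal((7, 4))            # full column rank almost surely

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    dA = -s[-1] * np.outer(U[:, -1], Vt[-1, :])          # Delta A* = -sigma_n u_n v_n^T

    print(np.linalg.matrix_rank(A + dA))                 # n - 1 = 3
    print(np.isclose(np.linalg.norm(dA, 'fro'), s[-1]))  # distance to rank deficiency = sigma_n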
Example: Image compression
A $266 \times 400$ matrix $A$ of integers corresponding to the gray levels of the pixels in an image.
Compute the SVD of matrix $A$, and plot the ratio $\eta_k$, for $k$ from 1 to 266.
Figure: Plot of the ratio $\eta_k$ versus $k$.
k = 9 already captures 96% of the image variance.
Example: Image compression
Figure: Rank-$k$ approximations of the original image, for $k = 9$ (top left), $k = 23$ (top right), $k = 49$ (bottom left), and $k = 154$ (bottom right).
Principal component analysis (PCA)
Principal component analysis (PCA) is a technique of unsupervised learning, widely
used to “discover” the most important, or informative, directions in a data set, that
is the directions along which the data varies the most.
In the data cloud below, it is apparent that there exists a direction (at about 45 degrees from the horizontal axis) along which almost all of the variation of the data is contained. In contrast, the direction at about 135 degrees contains very little variation of the data.
The important direction was easy to spot in this two-dimensional example.
However, graphical intuition does not help when analyzing data in dimension n > 3,
which is where Principal Components Analysis (PCA) comes in handy.
Principal component analysis (PCA)
Let $x_i \in \mathbb{R}^n$, $i = 1, \ldots, m$, be the data points, $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$ the barycenter of the data points, and $\tilde{X}$ the $n \times m$ matrix containing the centered data points:
$$\tilde{X} = [\tilde{x}_1 \cdots \tilde{x}_m], \qquad \tilde{x}_i \doteq x_i - \bar{x}, \quad i = 1, \ldots, m.$$
We look for a normalized direction in data space, $z \in \mathbb{R}^n$, $\|z\|_2 = 1$, such that the variance of the projections of the centered data points on the line determined by $z$ is maximal.
The components of the centered data along direction $z$ are given by
$$\alpha_i = \tilde{x}_i^\top z, \quad i = 1, \ldots, m$$
($\alpha_i z$ are the projections of $\tilde{x}_i$ along the span of $z$).
The mean-square variation of the data along direction $z$ is thus given by
$$\frac{1}{m}\sum_{i=1}^{m} \alpha_i^2 = \frac{1}{m}\sum_{i=1}^{m} z^\top \tilde{x}_i \tilde{x}_i^\top z = \frac{1}{m}\, z^\top \tilde{X} \tilde{X}^\top z.$$
Principal component analysis (PCA)
The direction $z$ along which the data has the largest variation can thus be found as the solution to the following optimization problem:
$$\max_{z \in \mathbb{R}^n}\ z^\top (\tilde{X}\tilde{X}^\top) z \quad \text{s.t.:} \quad \|z\|_2 = 1.$$
Let us now solve the previous problem via the SVD of $\tilde{X}$: let
$$\tilde{X} = U_r \Sigma V_r^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top.$$
Then, $H \doteq \tilde{X}\tilde{X}^\top = U_r \Sigma^2 U_r^\top$.
From the variational representation, we have that the optimal solution of this problem is given by the column $u_1$ of $U_r$ corresponding to the largest eigenvalue of $H$, which is $\sigma_1^2$.
The direction of largest data variation is thus readily found as $z = u_1$, and the mean-square variation along this direction is proportional to $\sigma_1^2$.
Successive principal axes can be found by “removing” the first principal components, and applying the same approach again on the “deflated” data matrix.
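A compact sketch of the whole procedure (with synthetic two-dimensional data, chosen only for illustration): center the data, take the SVD of $\tilde{X}$, and read off the principal axes and the variation along them:

    import numpy as np

    rng = np.random.default_rng(7)
    # Synthetic cloud of m = 500 points in R^2, elongated along the 45-degree direction.
    m = 500
    X = rng.standard_normal((2, m)) * np.array([[3.0], [0.3]])
    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X = R @ X                                   # data matrix, one point per column

    xbar = X.mean(axis=1, keepdims=True)        # barycenter
    Xt = X - xbar                               # centered data matrix X~

    U, s, Vt = np.linalg.svd(Xt, full_matrices=False)
    print(U[:, 0])                              # first principal axis, close to [0.707, 0.707] up to sign
    print(s**2 / m)                             # mean-square variation along each principal axis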
Example: PCA of market data
Consider data consisting of the returns of six financial indices: (1) the MSCI US index, (2) the MSCI EUR index, (3) the MSCI JAP index, (4) the MSCI PACIFIC index, (5) the MSCI BOT liquidity index, and (6) the MSCI WORLD index.
Figure: Monthly returns of the six indices (MSCI US, MSCI EUR, MSCI JAP, MSCI PAC, MSCI BOT, MSCI WRLD) versus month.
We used monthly return data, from Feb. 26, 1993 to Feb. 28, 2007, for a total of 169 data points.
The data matrix X has thus m = 169 data points in dimension n = 6.
Example: PCA of market data
Centering the data, and performing the SVD on the centered data matrix $\tilde{X}$, we obtain the principal axes $u_i$ (the columns of the orthogonal matrix $U \in \mathbb{R}^{6,6}$) and the corresponding singular values
$$\sigma = [1.0765,\ 0.5363,\ 0.4459,\ 0.2519,\ 0.0354,\ 0.0114].$$
Computing the ratios $\eta_k$, we have
$$\eta \times 100 = [67.77\ \ 84.58\ \ 96.21\ \ 99.92\ \ 99.99\ \ 100].$$
We deduce, for instance, that over 96% of the variability in the returns of these six
assets can be explained in terms of only three implicit “factors”.
In statistical terms, this means that each realization of the return vector $x \in \mathbb{R}^6$ can be expressed (up to a 96% “approximation”) as
$$x = \bar{x} + U_3 z,$$
where $z$ is a zero-mean vector of random factors, and $U_3 = [u_1\ u_2\ u_3]$ is the factor loading matrix, composed of the first three principal directions of the data.
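As a sketch of this factor interpretation (with synthetic returns standing in for the MSCI data, which is not included here), one can center the data, project onto the first three principal directions, and check how much of the variance the three-factor model retains:

    import numpy as np

    rng = np.random.default_rng(8)
    # Hypothetical stand-in for the 6 x 169 matrix of index returns.
    m, n = 169, 6
    X = rng.standard_normal((n, m)) * np.array([[1.0], [0.8], [0.6], [0.3], [0.05], [0.02]])

    xbar = X.mean(axis=1, keepdims=True)
    Xt = X - xbar                                # centered data matrix X~

    U, s, Vt = np.linalg.svd(Xt, full_matrices=False)
    U3 = U[:, :3]                                # factor loading matrix
    Z = U3.T @ Xt                                # factor realizations z for each data point

    X3 = xbar + U3 @ Z                           # three-factor reconstruction x = xbar + U3 z
    eta3 = 1 - np.linalg.norm(Xt - U3 @ Z, 'fro')**2 / np.linalg.norm(Xt, 'fro')**2
    print(eta3)                                  # fraction of variance captured by three factors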
Example: PCA of voting data
Figure: Projection of US Senate voting data on a random direction (left panel) and on the direction of maximal variance (right panel). The latter reveals the party structure (party affiliations added after the fact). Note also the much higher range of values it provides.