Optimization Models
EE 127 / EE 227AT
Laurent El Ghaoui, EECS Department, UC Berkeley, Spring 2015

LECTURE 5: Singular Value Decomposition

Figure: The license plate of Gene Golub (1932–2007).

Outline

1. The singular value decomposition (SVD): motivation; dyads; SVD theorem
2. Matrix properties via SVD: rank, nullspace and range; matrix norms, pseudo-inverses; projectors
3. Low-rank matrix approximation
4. Principal component analysis (PCA)

Motivation

Figure: Votes of US Senators, 2002-2004. The plot is impossible to read...

Can we make sense of this data? For example, do the two political parties emerge from the data alone? Could we further provide a "political spectrum score" for each Senator?

Dyads

A matrix A ∈ R^{m,n} is called a dyad if it can be written as

    A = p q^T

for some vectors p ∈ R^m, q ∈ R^n. Element-wise, the above reads

    A_{ij} = p_i q_j,    1 ≤ i ≤ m,  1 ≤ j ≤ n.

Interpretation:
- The columns of A are scaled copies of the same column p, with scaling factors given in vector q.
- The rows of A are scaled copies of the same row q^T, with scaling factors given in vector p.

Dyads: example with video frames

We are given a set of image frames representing a video. Assuming that each image is represented by a row vector of pixels, we can represent the whole video sequence as a matrix A in which each row is an image.

Figure: Row-vector representation of an image.

If the video shows a scene where no movement occurs, then all rows of A are identical, and the matrix A is a dyad.

Sums of dyads

The singular value decomposition theorem, seen next, states that any matrix can be written as a sum of dyads:

    A = ∑_{i=1}^r p_i q_i^T

for vectors p_i, q_i that are mutually orthogonal. This allows us to interpret data matrices as sums of "simpler" matrices (dyads).

The singular value decomposition (SVD): SVD theorem

The singular value decomposition (SVD) of a matrix provides a three-term factorization which is similar to the spectral factorization, but holds for any, possibly non-symmetric and rectangular, matrix A ∈ R^{m,n}.

Theorem 1 (SVD decomposition). Any matrix A ∈ R^{m,n} can be factored as

    A = U Σ̃ V^T,

where U ∈ R^{m,m} and V ∈ R^{n,n} are orthogonal matrices (i.e., U^T U = I_m, V^T V = I_n), and Σ̃ ∈ R^{m,n} is a matrix having the first r = rank A diagonal entries (σ_1, ..., σ_r) positive and decreasing in magnitude, and all other entries zero:

    Σ̃ = [ Σ  0_{r,n-r} ; 0_{m-r,r}  0_{m-r,n-r} ],    Σ = diag(σ_1, ..., σ_r) ≻ 0.

Compact-form SVD

Corollary 1 (Compact-form SVD). Any matrix A ∈ R^{m,n} can be expressed as

    A = ∑_{i=1}^r σ_i u_i v_i^T = U_r Σ V_r^T,

where r = rank A, U_r = [u_1 ... u_r] is such that U_r^T U_r = I_r, V_r = [v_1 ... v_r] is such that V_r^T V_r = I_r, and σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0.

The positive numbers σ_i are called the singular values of A, the vectors u_i are called the left singular vectors of A, and the v_i the right singular vectors. These quantities satisfy

    A v_i = σ_i u_i,    u_i^T A = σ_i v_i^T,    i = 1, ..., r.

Moreover, σ_i^2 = λ_i(A A^T) = λ_i(A^T A), i = 1, ..., r, and u_i, v_i are eigenvectors of A A^T and of A^T A, respectively.

Interpretation

The singular value decomposition theorem allows us to write any matrix as a sum of dyads:

    A = ∑_{i=1}^r σ_i u_i v_i^T,

where
- the vectors u_i, v_i are normalized, with σ_i > 0 providing the "strength" of the corresponding dyad;
- the vectors u_i, i = 1, ..., r (resp. v_i, i = 1, ..., r) are mutually orthogonal.
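The factorization and the dyad expansion can be checked numerically. The following is a minimal NumPy sketch, not part of the original notes: the matrix A is an arbitrary illustrative example, and numpy.linalg.svd returns the factors U, the singular values, and V^T.

import numpy as np

# A small example matrix (made up for illustration); any real m x n matrix works.
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# Full SVD: A = U @ Sigma_tilde @ V.T, with U (m x m) and V (n x n) orthogonal.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

m, n = A.shape
r = np.sum(s > 1e-10)                 # numerical rank = number of nonzero singular values

# Rebuild the m x n matrix Sigma_tilde with the singular values on its diagonal.
Sigma_tilde = np.zeros((m, n))
Sigma_tilde[:r, :r] = np.diag(s[:r])
assert np.allclose(A, U @ Sigma_tilde @ Vt)

# Compact form: A equals the sum of r dyads sigma_i * u_i * v_i^T.
A_dyads = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
assert np.allclose(A, A_dyads)

# Orthogonality of the singular vectors.
assert np.allclose(U.T @ U, np.eye(m))
assert np.allclose(Vt @ Vt.T, np.eye(n))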
Matrix properties via SVD: rank, nullspace and range

- The rank r of A is the cardinality of the nonzero singular values, that is, the number of nonzero entries on the diagonal of Σ̃.
- Since r = rank A, by the fundamental theorem of linear algebra the dimension of the nullspace of A is dim N(A) = n - r.
- An orthonormal basis spanning N(A) is given by the last n - r columns of V, i.e.

    N(A) = R(V_{nr}),    V_{nr} = [v_{r+1} ... v_n].

- Similarly, an orthonormal basis spanning the range of A is given by the first r columns of U, i.e.

    R(A) = R(U_r),    U_r = [u_1 ... u_r].

Matrix properties via SVD: matrix norms

The squared Frobenius matrix norm of a matrix A ∈ R^{m,n} can be defined as

    ||A||_F^2 = trace(A^T A) = ∑_{i=1}^n λ_i(A^T A) = ∑_{i=1}^r σ_i^2,

where the σ_i are the singular values of A. Hence the squared Frobenius norm is nothing but the sum of the squares of the singular values.

The squared spectral matrix norm ||A||_2^2 is equal to the maximum eigenvalue of A^T A; therefore ||A||_2^2 = σ_1^2, i.e., the spectral norm of A coincides with its maximum singular value.

The so-called nuclear norm of a matrix A is also defined in terms of its singular values:

    ||A||_* = ∑_{i=1}^r σ_i,    r = rank A.

The nuclear norm appears in several problems related to low-rank matrix completion or rank minimization.

Matrix properties via SVD: condition number

The condition number of an invertible matrix A ∈ R^{n,n} is defined as the ratio between the largest and the smallest singular value:

    κ(A) = σ_1 / σ_n = ||A||_2 · ||A^{-1}||_2.

This number provides a quantitative measure of how close A is to being singular (the larger κ(A), the closer A is to being singular). The condition number also provides a measure of the sensitivity of the solution of a system of linear equations to changes in the equation coefficients.

Matrix properties via SVD: matrix pseudo-inverses

Given A ∈ R^{m,n}, a pseudoinverse is a matrix A† that satisfies

    A A† A = A,    A† A A† = A†,    (A A†)^T = A A†,    (A† A)^T = A† A.

A specific pseudoinverse is the so-called Moore-Penrose pseudoinverse:

    A† = V Σ̃† U^T ∈ R^{n,m},

where

    Σ̃† = [ Σ^{-1}  0_{r,m-r} ; 0_{n-r,r}  0_{n-r,m-r} ],    Σ^{-1} = diag(1/σ_1, ..., 1/σ_r) ≻ 0.

Due to the zero blocks in Σ̃†, A† can be written compactly as

    A† = V_r Σ^{-1} U_r^T.

Moore-Penrose pseudo-inverse: special cases

- If A is square and nonsingular, then A† = A^{-1}.
- If A ∈ R^{m,n} is full column rank, that is r = n ≤ m, then A† A = V_r V_r^T = V V^T = I_n, that is, A† is a left inverse of A, and it has the explicit expression A† = (A^T A)^{-1} A^T.
- If A ∈ R^{m,n} is full row rank, that is r = m ≤ n, then A A† = U_r U_r^T = U U^T = I_m, that is, A† is a right inverse of A, and it has the explicit expression A† = A^T (A A^T)^{-1}.

Matrix properties via SVD: projectors

- The matrix P_{R(A)} := A A† = U_r U_r^T is an orthogonal projector onto R(A). This means that, for any y ∈ R^m, the solution of the problem

    min_{z ∈ R(A)} ||z - y||_2

  is given by z* = P_{R(A)} y.
- Similarly, the matrix P_{N(A^T)} := I_m - A A† is an orthogonal projector onto R(A)^⊥ = N(A^T).
- The matrix P_{N(A)} := I_n - A† A is an orthogonal projector onto N(A).
- The matrix P_{N(A)^⊥} := A† A is an orthogonal projector onto N(A)^⊥ = R(A^T).
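The pseudo-inverse formulas and the projector identities above can be verified numerically as well. Below is a small NumPy sketch, again not from the notes; the matrix A is a random full-column-rank example (so the left-inverse expression applies), and numpy.linalg.pinv serves only as a cross-check.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))       # a generic full-column-rank matrix (illustrative)
m, n = A.shape

# Moore-Penrose pseudoinverse from the compact SVD: A_dagger = V_r Sigma^{-1} U_r^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-10)
A_dag = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T
assert np.allclose(A_dag, np.linalg.pinv(A))           # matches NumPy's pinv

# Full column rank: A_dagger is a left inverse, equal to (A^T A)^{-1} A^T.
assert np.allclose(A_dag @ A, np.eye(n))
assert np.allclose(A_dag, np.linalg.solve(A.T @ A, A.T))

# Orthogonal projector onto the range of A, and onto its orthogonal complement.
P_range = A @ A_dag                    # = U_r U_r^T
P_null_At = np.eye(m) - P_range
y = rng.standard_normal(m)
z_star = P_range @ y                   # closest point to y in R(A)
# z_star solves min_{z in R(A)} ||z - y||_2: the residual is orthogonal to R(A).
assert np.allclose(A.T @ (y - z_star), 0)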
Low-rank matrix approximation

Let A ∈ R^{m,n} be a given matrix, with rank(A) = r > 0. We consider the problem of approximating A with a matrix of lower rank. In particular, we consider the following rank-constrained approximation problem:

    min_{A_k ∈ R^{m,n}} ||A - A_k||_F^2    s.t.  rank(A_k) = k,

where 1 ≤ k ≤ r is given. Let

    A = U Σ̃ V^T = ∑_{i=1}^r σ_i u_i v_i^T

be an SVD of A. An optimal solution of the above problem is simply obtained by truncating the previous summation at the k-th term, that is,

    A_k = ∑_{i=1}^k σ_i u_i v_i^T.

The ratio

    η_k = ||A_k||_F^2 / ||A||_F^2 = (σ_1^2 + ... + σ_k^2) / (σ_1^2 + ... + σ_r^2)

indicates what fraction of the total variance (Frobenius norm) in A is explained by the rank-k approximation of A. A plot of η_k as a function of k may give useful indications on a good rank level k at which to approximate A. η_k is related to the relative norm approximation error

    e_k = ||A - A_k||_F^2 / ||A||_F^2 = (σ_{k+1}^2 + ... + σ_r^2) / (σ_1^2 + ... + σ_r^2) = 1 - η_k.

Minimum "distance" to rank deficiency

Suppose A ∈ R^{m,n}, m ≥ n, is full rank, i.e., rank(A) = n. We ask what is a minimal perturbation δA of A that makes A + δA rank deficient. The Frobenius norm (or the spectral norm) of the minimal perturbation δA measures the "distance" of A from rank deficiency.

Formally, we need to solve

    min_{δA ∈ R^{m,n}} ||δA||_F^2    s.t.  rank(A + δA) = n - 1.

This problem is equivalent to rank approximation, with A_{n-1} = A + δA. The optimal solution is thus readily obtained as δA* = A_{n-1} - A, where A_{n-1} = ∑_{i=1}^{n-1} σ_i u_i v_i^T. Therefore, we have

    δA* = -σ_n u_n v_n^T.

The minimal perturbation that leads to rank deficiency is a rank-one matrix. The distance to rank deficiency is ||δA*||_F = ||δA*||_2 = σ_n.

Example: image compression

Consider a 266 × 400 matrix A of integers corresponding to the gray levels of the pixels in an image. Compute the SVD of matrix A, and plot the ratio η_k for k from 1 to 266.

Figure: Plot of the ratio η_k as a function of k; the vertical axis ranges from about 0.86 to 1.

k = 9 already captures 96% of the image variance.

Figure: Rank-k approximations of the original image, for k = 9 (top left), k = 23 (top right), k = 49 (bottom left), and k = 154 (bottom right).
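The truncated SVD and the ratios η_k and e_k take only a few lines to compute. The sketch below is not part of the notes and uses a random matrix as a stand-in for the image data; numpy.linalg.svd returns the singular values in decreasing order, and the helper rank_k_approx is ad hoc.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))            # illustrative data matrix (e.g., a grayscale image)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    """Best rank-k approximation A_k: sum of the first k dyads."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

k = 10
A_k = rank_k_approx(k)

# Fraction of the total variance captured by the rank-k approximation.
eta_k = np.sum(s[:k] ** 2) / np.sum(s ** 2)
e_k = np.linalg.norm(A - A_k, 'fro') ** 2 / np.linalg.norm(A, 'fro') ** 2
assert np.isclose(eta_k + e_k, 1.0)          # e_k = 1 - eta_k

# In the spectral norm, the approximation error equals the (k+1)-th singular value.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])

The last assertion uses the standard fact that the best rank-k approximation error in the spectral norm equals σ_{k+1}, consistent with the distance-to-rank-deficiency result above.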
Principal component analysis (PCA)

Principal component analysis (PCA) is a technique of unsupervised learning, widely used to "discover" the most important, or informative, directions in a data set, that is, the directions along which the data varies the most.

In the data cloud below, it is apparent that there exists a direction (at about 45 degrees from the horizontal axis) along which almost all of the variation of the data is contained. In contrast, the direction at about 135 degrees contains very little variation of the data. The important direction was easy to spot in this two-dimensional example. However, graphical intuition does not help when analyzing data in dimension n > 3, which is where principal component analysis comes in handy.

Setup:
- x_i ∈ R^n, i = 1, ..., m: the data points;
- x̄ = (1/m) ∑_{i=1}^m x_i: the barycenter of the data points;
- X̃: the n × m matrix containing the centered data points,

    X̃ = [x̃_1 ... x̃_m],    x̃_i = x_i - x̄,  i = 1, ..., m.

We look for a normalized direction in data space, z ∈ R^n, ||z||_2 = 1, such that the variance of the projections of the centered data points on the line determined by z is maximal. The components of the centered data along direction z are given by

    α_i = x̃_i^T z,    i = 1, ..., m

(α_i z are the projections of the x̃_i along the span of z). The mean-square variation of the data along direction z is thus given by

    (1/m) ∑_{i=1}^m α_i^2 = (1/m) z^T ( ∑_{i=1}^m x̃_i x̃_i^T ) z = (1/m) z^T X̃ X̃^T z.

The direction z along which the data has the largest variation can thus be found (dropping the constant factor 1/m) as the solution to the following optimization problem:

    max_{z ∈ R^n} z^T (X̃ X̃^T) z    s.t.  ||z||_2 = 1.

Let us now solve this problem via the SVD of X̃: let

    X̃ = U_r Σ V_r^T = ∑_{i=1}^r σ_i u_i v_i^T.

Then H := X̃ X̃^T = U_r Σ^2 U_r^T. From the variational representation of eigenvalues, the optimal solution of this problem is given by the column u_1 of U_r corresponding to the largest eigenvalue of H, which is σ_1^2. The direction of largest data variation is thus readily found as z = u_1, and the mean-square variation along this direction is proportional to σ_1^2.

Successive principal axes can be found by "removing" the first principal components, and applying the same approach again on the "deflated" data matrix.

Example: PCA of market data

Consider data consisting of the returns of six financial indices: (1) the MSCI US index, (2) the MSCI EUR index, (3) the MSCI JAP index, (4) the MSCI PACIFIC index, (5) the MSCI BOT liquidity index, and (6) the MSCI WORLD index. We used monthly return data, from Feb. 26, 1993 to Feb. 28, 2007, for a total of 169 data points.

Figure: Monthly returns of the six MSCI indices (US, EUR, JAP, PAC, BOT, WRLD) plotted against time (month).

The data matrix X thus has m = 169 data points in dimension n = 6.

Centering the data, and performing the SVD on the centered data matrix X̃, we obtain the principal axes u_i (the columns of the orthogonal matrix U) and the corresponding singular values

    σ = [1.0765  0.5363  0.4459  0.2519  0.0354  0.0114].

Computing the ratios η_k, we have

    η × 100 = [67.77  84.58  96.21  99.92  99.99  100].

We deduce, for instance, that over 96% of the variability in the returns of these six assets can be explained in terms of only three implicit "factors". In statistical terms, this means that each realization of the return vector x ∈ R^6 can be expressed (up to a 96% "approximation") as

    x = x̄ + U_3 z,

where z is a zero-mean vector of random factors, and U_3 = [u_1 u_2 u_3] is the factor loading matrix, composed of the first three principal directions of the data.

Example: PCA of voting data

Figure: Projection of the US Senate voting data onto a random direction (left panel) and onto the direction of maximal variance (right panel). The latter reveals the party structure (party affiliations added after the fact). Note also the much wider range of values it provides.
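A PCA computation of this kind is equally short. The sketch below is not part of the notes and uses a synthetic 169 × 6 matrix as a stand-in for the MSCI return data, which is not reproduced here; the steps mirror the procedure above: center the data, take the SVD, read off the explained-variance ratios η_k, and project the data onto the first principal axis u_1.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a data matrix: m = 169 samples in dimension n = 6
# (the actual MSCI return data is not reproduced here).
m, n = 169, 6
X = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))   # correlated columns

# Arrange data as an n x m matrix of columns x_i, as in the notes.
Xc = X.T                                         # n x m
x_bar = Xc.mean(axis=1, keepdims=True)           # barycenter
X_tilde = Xc - x_bar                             # centered data

# SVD of the centered data matrix: principal axes u_i and singular values sigma_i.
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

# Fraction of total variance explained by the first k principal directions.
eta = np.cumsum(s ** 2) / np.sum(s ** 2)
print("eta x 100 =", np.round(100 * eta, 2))

# Direction of maximal variance and the "scores" of each data point along it.
z = U[:, 0]
scores = X_tilde.T @ z                           # alpha_i = x_tilde_i^T z, i = 1..m
# The mean-square variation along z equals sigma_1^2 / m.
assert np.isclose(scores.var(), s[0] ** 2 / m)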