
M532 Homework 1
Mark Blumstein
February 11, 2016
1) Graph the unit circle in R2 for the norms |x|1 , |x|2 , and |x|∞ .
Solution
Graphs made in matlab.
Figure: the unit circles for (a) the taxicab norm, (b) the Euclidean norm, and (c) the infinity norm.
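The plots are not reproduced here, but a minimal MATLAB sketch of how such figures can be generated (parametrizing each unit circle directly; variable names are my own) is:

% Unit circles in R^2 for the three norms
t = linspace(0, 2*pi, 400);                 % parameter for the Euclidean circle
figure; hold on; axis equal
plot(cos(t), sin(t), 'b')                   % |x|_2 = 1: the usual circle
plot([1 0 -1 0 1], [0 1 0 -1 0], 'r')       % |x|_1 = 1: diamond with vertices (+-1,0),(0,+-1)
plot([1 -1 -1 1 1], [1 1 -1 -1 1], 'g')     % |x|_inf = 1: square with vertices (+-1,+-1)
legend('|x|_2', '|x|_1', '|x|_inf')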
J
2) Consider the vector whose components w.r.t. the standard basis are

x = \begin{bmatrix} 0 \\ 0 \\ 1 \\ -1 \end{bmatrix}.

Determine the coordinates of this point w.r.t. the Walsh basis

U_1 = \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}

and the Haar basis

U_2 = \frac{1}{2}\begin{bmatrix} 1 & 1 & \sqrt{2} & 0 \\ 1 & 1 & -\sqrt{2} & 0 \\ 1 & -1 & 0 & \sqrt{2} \\ 1 & -1 & 0 & -\sqrt{2} \end{bmatrix}.
Which representation do you think is better for this point x and why?
Solution In general, suppose we have an n × 1 column vector x written in terms of the standard basis,
and we desire to write x in terms of some collection B of n column vectors of size n × 1.
Since multiplying a matrix with a column vector simply computes the linear combination of the columns
of the matrix scaled by the components of the vector, we need a vector u such that Bu = x. We may always
find such a u provided that the columns of B are linearly independent; in particular, u = B^{-1} x.
Using matlab we compute the coordinates of the given vector x in terms of the two given bases U1 and U2 (a short sketch of this computation follows the results). We find that

U1^{-1} x = [0, 0, -2, 2]^T,   U2^{-1} x = [0, 0, 0, \sqrt{2}]^T.
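A minimal MATLAB sketch of this computation, entering the bases as reconstructed above and using the backslash operator rather than an explicit inverse:

x  = [0; 0; 1; -1];
U1 = (1/4)*[1 1 1 1; 1 1 -1 -1; 1 -1 -1 1; 1 -1 1 -1];                        % Walsh basis
U2 = (1/2)*[1 1 sqrt(2) 0; 1 1 -sqrt(2) 0; 1 -1 0 sqrt(2); 1 -1 0 -sqrt(2)];  % Haar basis
a1 = U1\x   % coordinates w.r.t. U1: [0; 0; -2; 2]
a2 = U2\x   % coordinates w.r.t. U2: [0; 0; 0; sqrt(2)]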
Picking the “better” basis depends on what one means by the word better, and at this point I’m not
quite sure what that means. I might argue that U2 better represents the data point because the data point
lies on the line defined by the fourth basis vector of U2, as opposed to lying in the plane formed by the
third and fourth basis vectors of U1 . So in the sense that a line is simpler than a plane, U2 might be the
“better” basis.
On the other hand, there might be some physical reason where it would make more sense to express x in
terms of two vectors as opposed to one. I’m not sure what such a situation would be, and so I’m being very
vague, but maybe the fourth basis vector of U2 is in some way expensive, or maybe difficult to reproduce
physically, as opposed to a cheaper option provided by the third and fourth basis vectors of U1 .
J
3) Consider the two-dimensional subspace spanned by the columns of

U = \frac{1}{4}\begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix}.
Compute the projection of the point

x = \begin{bmatrix} 2 \\ 3 \\ 1 \\ -1 \end{bmatrix}
onto the subspace spanned by U . What is the novelty of this point? What is the representation of x in the
2D subspace w.r.t. the basis for the subspace?
Solution To compute the projection of a vector x onto a plane spanned by the vectors u and v, we can
just take the linear combination, proju x + projv x, i.e. project x separately onto u and v and then take the
vector sum.
To project the column vector x onto the column vector u, note that (uu^T)x = (u · x)u = ||u||^2 proj_u x. Using this idea we define the projection matrix P as

P = proj_u(-) + proj_v(-) = \frac{1}{\|u\|^2} uu^T + \frac{1}{\|v\|^2} vv^T.
For this example, u and v are the two column vectors of the given matrix U , and x is the given point. A
direct computation reveals the vector projection of x onto the given plane to be [2.5, 2.5, 0, 0].
The novelty is detected by (I − P )x, or in other words, the complementary projection. We compute that
(I − P )x = [−.5, .5, 1, −1].
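A short MATLAB sketch of this computation, using the entries of U as reconstructed above (variable names are my own):

u = (1/4)*[1; 1; 1; 1];  v = (1/4)*[1; 1; -1; -1];   % the two columns of U
x = [2; 3; 1; -1];
P = u*u'/norm(u)^2 + v*v'/norm(v)^2;     % projection onto span{u,v}
xhat    = P*x                            % projection of x:     [2.5; 2.5; 0; 0]
novelty = (eye(4) - P)*x                 % complementary part:  [-0.5; 0.5; 1; -1]
c = [u v]\xhat                           % representation of the projection w.r.t. the basis {u,v}: [5; 5]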
J
4) Use the QR algorithm to compute an orthonormal basis from the vectors

x_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ -2 \\ 3 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 0 \\ -1 \\ -1 \end{bmatrix}.

Verify that Q^T Q = I. Work through this problem by hand and verify via matlab.
Solution I’ve written the code to compute the QR algorithm on the given set of vectors. I’ll detail the
steps to execute the QR algorithm here. First pick any vector in the given set {xi } and normalize it. In
this example I began by normalizing the vector x1 and calling it u1 , where u1 is to be the first vector in the
orthonormal basis that the QR algorithm generates.
Since (u_1 u_1^T)x_2 is the projection of x_2 onto u_1, (I - u_1 u_1^T)x_2 = x_2 - proj_{u_1} x_2 is the complementary projection of x_2 against u_1 (the complementary projection picks up how much of x_2 points orthogonally to u_1). This is exactly the second step of the QR algorithm. We define u_2 to be the vector (I - u_1 u_1^T)x_2, normalized.
Clearly, we can repeat this process until we've exhausted the original list of vectors {x_i}. For this problem, since there are only 3 vectors, the final step is to define u_3 to be the normalized form of the vector (I - u_2 u_2^T)(I - u_1 u_1^T)x_3. We see that this has the effect of first performing the complementary projection of x_3 against u_1, then taking the resulting vector and performing the complementary projection against u_2. Thus, the resulting vector u_3 captures the component of x_3 pointing orthogonally to both u_1 and u_2.
Here is the matlab code:
%% define variables
x1 = [1,2,3]'; x2 = [1,-2,3]'; x3 = [0,-1,-1]';
I = eye(3);
%%% Run QR algorithm. u1, u2, and u3 will form the orthonormal basis.
u1 = x1./norm(x1,2)
u2 = (I - u1*u1')*x2./norm((I - u1*u1')*x2, 2)
u3 = (I - u2*u2')*(I - u1*u1')*x3./norm((I - u2*u2')*(I - u1*u1')*x3)
%% Verify u1, u2, u3 form an orthonormal set (both products should be the identity)
U = [u1, u2, u3];
U*U'
U'*U
J
5) Recall that the eigenvectors of symmetric matrices associated with distinct eigenvalues are orthogonal
and that eigenvalues are real. Prove these facts. (Try without looking at your old linear algebra book.)
Solution Let A be a symmetric linear operator; in matrix form, symmetry buys us that A = A^T. Letting u · v denote the dot product of two vectors, we have

(Ax) · y = (Ax)^T y = x^T A^T y = x^T A y = x · (Ay),

where the third equality uses the symmetry of A. Then, if x and y are eigenvectors with eigenvalues λ and δ respectively, the previous computation gives λ(x · y) = (Ax) · y = x · (Ay) = δ(x · y). This is possible if and only if λ = δ or x · y = 0. The first case corresponds to x and y belonging to the same eigenspace; if not, then x and y are orthogonal.
For the realness of the eigenvalues, suppose Ax = λx with x a (possibly complex) nonzero eigenvector, and use the Hermitian inner product ⟨u, v⟩ = ū^T v. Since A is real and symmetric, ⟨Ax, x⟩ = ⟨x, Ax⟩, so λ̄⟨x, x⟩ = ⟨Ax, x⟩ = ⟨x, Ax⟩ = λ⟨x, x⟩; as ⟨x, x⟩ > 0, this forces λ̄ = λ, i.e. λ is real.
Note that in general, a linear operator on an inner product space with the property that ⟨Ax, y⟩ = ⟨x, Ay⟩ is called symmetric, or self-adjoint. You saw an example of this in PDEs, where the solutions to the heat and wave equations, given symmetric boundary conditions, were expressed in terms of the eigenfunctions of the Laplacian, i.e. solutions of −Δu = λu on some n-dimensional domain D. This operator is symmetric, thus the eigenfunctions are orthogonal (using the inner product ⟨f, g⟩ = ∫_D f g dx). The general Fourier series provided the correct linear combination of eigenfunctions for the solution.
J
6) In class we showed that the optimization problem for the first principal vector,

\max_{\phi} \; \phi^T C\phi - \lambda(\phi^T\phi - 1),

leads to the necessary condition

C\phi = \lambda\phi

for the best eigenvector. Provide the details of this calculation.
Solution Suppose we are given a data set X of p column vectors, the idea behind finding the first principal
vector is to find the “best” 1-dimensional subspace representation of X. In this case, what is meant by “best”
is a vector u such that we maximize the statistical variance of the data set {u^T x^{(i)}}, where i ranges over the number of column vectors of X, subject to the constraint that ||u|| = 1. In other words, we consider the variance of the magnitude of the projection of each x^{(i)} onto the vector u. Thus, we need to maximize

\frac{1}{p}\sum_i (u^T x^{(i)})^2,

where u is a unit vector.
P
Using the method of Lagrange multipliers, we search for a u such that L(u) = p1 i (uT x(i) )2 − λ(uT u − 1)
∂L
∂L
th
spot is the derivative of L with
with ∂L
∂u = 0 and ∂λ = 0. Note that ∂u is defined to be the vector whose j
respect to uj .
P
Observe that (uT x)2 = uT xxT u, so we may rewrite the variance by defining the matrix C = i x(i) x(i)T .
Then, L(u) = uT Cu − λ(ut u − 1).
Now, let's compute the partial derivatives:

\frac{\partial L}{\partial u} = \frac{1}{p}\sum_i \frac{\partial}{\partial u}(u^T x^{(i)})^2 - \frac{\partial}{\partial u}\,\lambda(u^T u - 1) = \frac{2}{p}\sum_i (u \cdot x^{(i)})\,x^{(i)} - 2\lambda u,

\frac{\partial L}{\partial \lambda} = -(u^T u - 1).
Now suppose that u is a vector which satisfies the requirement that each partial derivative equals 0. That \partial L/\partial \lambda = 0 implies that u is a unit vector. That \partial L/\partial u = 0 implies that \sum_i (u \cdot x^{(i)})\,x^{(i)} = \lambda p u. It remains to show that such a u is an eigenvector for C with eigenvalue \lambda. We compute

Cu = \frac{1}{p}\sum_i \left(x^{(i)} x^{(i)T} u\right) = \frac{1}{p}\sum_i (u^T x^{(i)})\,x^{(i)} = \frac{1}{p}(p\lambda)u = \lambda u.
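This conclusion is easy to check numerically; a small MATLAB sketch (toy data and variable names are my own) comparing the variance captured by the top eigenvector of C with that of a random unit vector:

p = 500;  X = randn(3, p);                 % p toy data points in R^3
C = (1/p) * (X * X');                      % C = (1/p) * sum_i x^(i) x^(i)T
[V, D] = eig(C);
[~, k] = max(diag(D));
u_best   = V(:, k);                        % eigenvector with the largest eigenvalue
var_best = (1/p) * sum((u_best' * X).^2)   % variance along u_best (equals the largest eigenvalue)
u_rand   = randn(3, 1);  u_rand = u_rand/norm(u_rand);
var_rand = (1/p) * sum((u_rand' * X).^2)   % never exceeds var_best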
J
7) Convert the code below to the associated mathematical equations for implementing the QR matrix
factorization.
function [ Q R ] = MyQR( X )
% Modified Gram-Schmidt (see Trefethen and Bau,
% Numerical Linear Algebra p58.)
% Compute an o.n. basis for the cols of X and put in Q.
% X = QR is the QR factorization of X
n = size(X,2);
V = X;
R = zeros(n);
for i = 1:n
    R(i,i) = norm(V(:,i));
    Q(:,i) = V(:,i)/R(i,i);
    for j = i+1:n
        R(i,j) = Q(:,i)'*V(:,j);
        V(:,j) = V(:,j)-R(i,j)*Q(:,i);
    end
end
Solution The size command gives the number of columns of the data set X, V is used as a working copy of X, and R = zeros(n) creates the zero matrix of size n × n.
The outer loop defines the columns of Q and the diagonal entries of R. For example, when i = 1, the
loop begins by taking the first column of X,and computes its norm. The norm is stored in the (1, 1) spot
of the R matrix, and the normalized vector is stored as the first column of Q. Q will ultimately contain the
set of orthonormal basis vectors.
Fixing i and entering the inner loop, which runs over j = i+1, …, n, we fill in the ith row of R: its jth entry is the projection coefficient of the (partially reduced) jth column of V onto q^{(i)}. The second line in the inner loop then replaces the jth column of V with its complementary projection against q^{(i)}, so that after i passes the jth column is orthogonal to the first i columns of Q. When we return to the outer loop, we normalize the next column of V and store it as the next column of Q. The process continues until the original data set has been exhausted.
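In equation form (a sketch of what the loop computes, writing v_j for the working copy of the jth column of X and q_i for the ith orthonormal vector): for i = 1, …, n and j = i+1, …, n,

r_{ii} = \|v_i\|, \qquad q_i = \frac{v_i}{r_{ii}}, \qquad r_{ij} = q_i^T v_j, \qquad v_j \leftarrow v_j - r_{ij}\, q_i,

so that X = QR with Q = [q_1, …, q_n] and R upper triangular.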
J
Computing Problems:
The following ideas are used repeatedly throughout this assignment, so I thought I'd outline them here (a short MATLAB sketch illustrating them follows the list):
Suppose that the columns of an n × m matrix U form an orthonormal set, and x is an n × 1 vector. We use superscript notation for column vectors (e.g. x^{(i)} denotes the ith column of a matrix X).
• U^T x is the vector whose ith entry is u^{(i)T} x. Since every column of U has unit length, |u^{(i)T} x| = ||proj_{u^{(i)}} x||. We often use the notation α = U^T x. If X is a matrix with n rows, then U^T X is the matrix whose ith column is the coefficient vector α corresponding to the ith column vector x^{(i)}.
• The matrix U U^T is the projection matrix, i.e. U U^T x is the projection of x onto the hyperplane formed by the columns of U. This is because U U^T x = U α = α_1 u^{(1)} + · · · + α_m u^{(m)} = Σ_i proj_{u^{(i)}} x.
• If x̂ = U U^T x is the projection of x onto U, then x̂ = x if and only if x is a linear combination of the columns of U. This justifies the terminology "basis" for a data set, as U α = x is then a unique representation of x in terms of U, since the columns of U form an orthonormal set.
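A small MATLAB sketch of these facts (here U comes from an economy QR of random data; names are my own):

n = 5;  m = 2;
[U, ~] = qr(randn(n, m), 0);    % n x m matrix with orthonormal columns
x = randn(n, 1);
alpha = U' * x;                 % coefficient vector alpha = U^T x
xhat  = U * alpha;              % = U*U'*x, the projection of x onto span(U)
norm(x - xhat)                  % the part of x not captured by the columns of U
y = U * randn(m, 1);            % a point that lies in span(U)
norm(y - U*(U'*y))              % essentially zero: the projection reproduces y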
1) Display the first 4 “eigencats” and “eigendogs” for the SVD basis.
Solution To reshape one of the data vectors x into a 64 × 64 matrix of pixels and then display the image, use the code:
imagesc(reshape(x,64,64));
colormap(gray)
Images on next page.
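A sketch of how the first four eigencats might be produced, assuming (as in the code later in this assignment) that the cat images are columns 1–99 of the data matrix Y and that the SVD basis comes from an economy SVD:

[U1, S, V] = svd(Y(:,1:99), 'econ');    % cat SVD basis in the columns of U1
for k = 1:4
    subplot(2, 2, k)
    imagesc(reshape(U1(:,k), 64, 64))   % kth eigencat
    colormap(gray); axis off
end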
J
2) Compute a rank 25 and rank 99 projection of the first cat onto the QR basis and then the SVD basis.
Do the same for the first dog. Project the cat onto the dog basis and the dog onto the cat basis.
Solution All images are displayed on the next page. Recall that the QR algorithm starts by normalizing the first vector in the data set (in this case the first cat or dog picture), and that this normalized vector becomes the first vector in the orthonormal basis set. So when we project the first cat and dog onto their respective QR bases, the rank of the projection doesn't matter; in fact, we might as well have used a rank 1 projection. On the other hand, notice that the rank 25 projection onto the SVD bases didn't fully capture either the cat or the dog.
The pictures go from top to bottom, left to right: rank 25 QR, rank 99 QR, rank 25 SVD, rank 99
SVD. The 9th picture is the first cat projected onto the dog SVD basis, and the 10th picture is the first
dog projected onto the cat SVD basis. It’s interesting how one can see clearly the dog in picture 9 whose
features are very “cat like”.
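A sketch of how one such rank-k projection can be formed, assuming (as in the code elsewhere in this assignment) that QC and U1 hold the cat QR and SVD bases and that Y(:,1) is the first cat:

k = 25;                                  % or 99 for the rank 99 projection
x = Y(:,1);                              % first cat image
xQR  = QC(:,1:k) * (QC(:,1:k)' * x);     % rank-k projection onto the QR basis
xSVD = U1(:,1:k) * (U1(:,1:k)' * x);     % rank-k projection onto the SVD basis
imagesc(reshape(xSVD, 64, 64)); colormap(gray)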
J
3) Compute the most novel dog when compared to the cats and the most novel cat when compared to the
dogs.
Solution In the following matlab code, we form the complementary projector for the cat SVD basis and then run a loop that applies it to each dog image; within the loop we also compute the norm of each complementary projection. Then, the "max" function is used to determine which complementary projection has the largest norm. The code for projecting dogs onto the cat basis is below (note that U1 is the cat SVD basis, and columns 100–198 of the data matrix Y are the dog pictures).
P = eye(4096) - U1*U1';      % complementary projector w.r.t. the cat SVD basis
for i = 100:198
    n = norm(P*Y(:,i), 2);   % novelty of the ith dog
    N(1,i) = n;
end
[M, I] = max(N)              % I is the column of Y holding the most novel dog
Above are the most “novel” dog with respect to cats, and most “novel” cat with respect to dogs. This
may be interpreted as the dog sharing the least number of “features” with the cats, and vice versa.
J
4) Shannon's entropy is defined as

H = -\sum_{i=1}^{N} p_i \ln p_i

where \sum_i p_i = 1. If \alpha_i^{(\mu)} is the coefficient of x^{(\mu)} with respect to basis vector u^{(i)}, let the energy in the ith direction be defined as

\rho_i = \sum_{\mu} \left(\alpha_i^{(\mu)}\right)^2.

The total energy is defined as

T = \sum_j \rho_j.

Let

p_i = \rho_i / T.
a) Compute the QR (using the algorithm provided in Problem 7) and SVD basis for the cat data set and
compare the entropies H for each basis by computing the associated pi .
b) Compute the QR (using the algorithm provided in Problem 7) and SVD basis for the dog data set and
compare the entropies H for each basis by computing the associated pi .
c) Plot the values ρi for each basis as well as the singular values squared of the SVD and comment.
Solution Using the definitions above, I wrote the following code to compute energy and entropy. I’ll
display the code for just the cat data using the QR basis. In the code, QC is the matrix whose columns are
the orthonormal basis vectors from the QR decomposition of the cat data set.
To elaborate a bit on the process, suppose that U is a matrix whose column vectors form an orthonormal set. As was remarked earlier, U^T X is the matrix whose ith column contains the coefficients of the projection of x^{(i)} onto each of the columns of U. Next, we take the pointwise square of U^T X and call this matrix E.
The energy in the jth direction (\rho_j) is the sum over the jth row of E. Summing over the jth row sums the squares of the projection coefficients of each data point x onto the jth basis vector of U; in some sense, this computes the variance of the data set X around a specific basis vector.
Notice that computing the total energy T and considering the fractions \rho_j/T allows us to speak in terms of a probability distribution. A higher probability (or weight) should correspond to a basis vector which accurately captures the data set X.
%%% Compute energy matrix using QR basis for cats
E1 = (QC'*Y(:,1:99)).*(QC'*Y(:,1:99));
%% Total energy
T1 = sum(sum(E1));
%% Shannon's entropy
H1 = 0;
for i = 1:99
    p = sum(E1(i,:))/T1;    % energy fraction p_i for the ith basis vector
    hold on
    plot(i, p, 'g--o')      % plot the energy distribution as we go
    H1 = H1 - p*log(p);
end
I computed the following entropies:
H1 = 1.5974 (Cat entropy for QR basis)
H2 = 1.0313 (Cat entropy for SVD basis)
H3 = 1.9177 (Dog entropy for QR basis)
H4 = 1.0766 (Dog entropy for SVD basis)
The data verifies a theorem discussed in class: the SVD basis minimizes Shannon's entropy over all orthonormal bases.
Last are the plots of the energies for each basis compared to the singular values squared. Notice how the squared singular values coincide with the energies computed in the SVD basis. Also, if we zoom in very close, we notice that the energy levels in the leading directions are higher for the SVD basis than for the QR basis. We saw in problem 6 that PCA/SVD is constructed to maximize variance (or energy), so this makes sense.
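As a cross-check, since the SVD-basis energies equal the squared singular values, the SVD entropy can also be computed directly from the singular values; a sketch, assuming as above that Y(:,1:99) holds the cat images:

s = svd(Y(:,1:99));           % singular values of the cat data
p = s.^2 / sum(s.^2);         % energy fractions p_i
H = -sum(p .* log(p))         % should agree with the SVD-basis entropy H2 above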
J
5) This problem concerns estimating the dimension of data subspaces. A common measure is the number
of dimensions of the basis required to retain 95% of the energy, or statistical variance, of the data. In other
words, how many terms K are needed such that

E_K = \frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \ge 0.95 \qquad (1)

This is referred to as the energy dimension. Another measure deals with how quickly the eigenvalues are going to zero. Specifically, what is the smallest value of J such that

E_J = \tilde{\lambda}_J \le 0.01 \qquad (2)

This is sometimes referred to as the stretching dimension.
(a) Compute K for the cats using the basis from QR.
(b) Compute K for the cats using the basis from SVD and compare with the result in a).
(c) Repeat the above for the dogs data.
Solution The code below uses a while loop to compute the energy dimension; again, I've only included the code for the cat data with respect to the QR basis:
%%%%% Computing K for Cats using QR
i = 0;
j = 1;
while i <= 0
    if sum(sum(E1(1:j,1:99)))/T1 >= .95
        i = 1;
    else
        j = j+1;
    end
end
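An equivalent, loop-free computation (a sketch, assuming E1 is the energy matrix from Problem 4, so that its row sums are the energies ρ_i):

rho = sum(E1, 2);                              % energies rho_i (row sums of E1)
K   = find(cumsum(rho)/sum(rho) >= 0.95, 1)    % energy dimension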
The energy dimensions were found to be:
32 = Energy dimension for Cats, QR basis
14 = Energy dimension for Cats, SVD basis
51 = Energy dimension for Dogs, QR basis
19 = Energy dimension for Dogs, SVD basis
J
6) Using your entropy and energy calculations describe which data set, cats or dogs, is more complex.
Solution The dog data set seems more complex. By this I mean that the cat data set is easier to capture in lower dimensions. For example, the energy dimension using SVD for cats was 14 as opposed to 19 for dogs, meaning that 5 fewer basis vectors were required to preserve 95% of the energy. This is what we might expect, since dogs are the more diverse species: two cats of different breeds look roughly the same, whereas a chihuahua and a German shepherd hardly even look like the same species.
J