
APPENDIX D
PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a statistical technique that can be constructed in several ways; one commonly cited construction is given in this appendix. In stating a few properties of PCA that are directly useful for radar signal analysis, we by no means intend to give even a superficial survey of this ever-growing topic. For deeper and more complete coverage of PCA and its applications, please refer to [Jolliffe, 2002]. Shorter treatments of the general properties of PCA can be found in [pp. 266-272, Theodoridis and Koutroumbas, 2006; Chapter 8, Mardia et al., 1979; Chapter 11, Anderson, 2003]. To simplify the presentation, all of the following properties of PCA are proved under the assumption that all eigenvalues of whichever covariance matrix is concerned are positive and distinct.
One PCA construction: Assume a random vector $X$, taking values in $\mathbb{R}^m$, has mean and covariance matrix $\mu_X$ and $\Sigma_X$, respectively. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_m > 0$ denote the ordered eigenvalues of $\Sigma_X$, so that the $i$-th eigenvalue of $\Sigma_X$ means the $i$-th largest of them. Similarly, a vector $\phi_i$ is the $i$-th eigenvector of $\Sigma_X$ when it corresponds to the $i$-th eigenvalue of $\Sigma_X$. To derive the form of the principal components (PCs), consider the optimization problem of maximizing $\operatorname{var}[\phi_1^T X] = \phi_1^T \Sigma_X \phi_1$, subject to $\phi_1^T \phi_1 = 1$. The
Lagrange multiplier method is used to solve this problem.
$$L(\phi_1, \nu_1) = \phi_1^T \Sigma_X \phi_1 - \nu_1 (\phi_1^T \phi_1 - 1)$$
$$\frac{\partial L}{\partial \phi_1} = 2 \Sigma_X \phi_1 - 2 \nu_1 \phi_1 = 0 \;\Rightarrow\; \Sigma_X \phi_1 = \nu_1 \phi_1 \;\Rightarrow\; \operatorname{var}[\phi_1^T X] = \nu_1 \phi_1^T \phi_1 = \nu_1 .$$
Because $\nu_1$ is an eigenvalue of $\Sigma_X$, with $\phi_1$ being the corresponding normalized eigenvector, $\operatorname{var}[\phi_1^T X]$ is maximized by choosing $\phi_1$ to be the first eigenvector of $\Sigma_X$. In this case, $z_1 = \phi_1^T X$ is named the first PC of $X$, $\phi_1$ is the vector of coefficients for $z_1$, and $\operatorname{var}(z_1) = \lambda_1$.
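As a numerical illustration (not part of the construction above), the following Python sketch, which assumes NumPy and an arbitrary synthetic data matrix whose rows are realizations of $X$, computes $\phi_1$ from a sample covariance matrix and checks that $\operatorname{var}(z_1) \approx \lambda_1$:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))                   # mixing matrix for a synthetic example
X = rng.normal(size=(5000, 3)) @ A.T          # rows are realizations of X (mean ~ 0)

Sigma_X = np.cov(X, rowvar=False)             # sample covariance matrix of X
eigvals, eigvecs = np.linalg.eigh(Sigma_X)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # re-sort: largest eigenvalue first
lam, Phi = eigvals[order], eigvecs[:, order]

phi1 = Phi[:, 0]                              # first eigenvector of Sigma_X
z1 = X @ phi1                                 # realizations of the first PC z1 = phi1^T X
print(np.var(z1, ddof=1), lam[0])             # var(z1) is (approximately) lambda_1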
To find the second PC, $z_2 = \phi_2^T X$, we need to maximize $\operatorname{var}[\phi_2^T X] = \phi_2^T \Sigma_X \phi_2$ subject to $z_2$ being uncorrelated with $z_1$. Because $\operatorname{cov}(\phi_1^T X, \phi_2^T X) = 0 \Leftrightarrow \phi_1^T \Sigma_X \phi_2 = 0 \Leftrightarrow \phi_1^T \phi_2 = 0$, this problem is equivalently formulated as maximizing $\phi_2^T \Sigma_X \phi_2$, subject to $\phi_1^T \phi_2 = 0$ and $\phi_2^T \phi_2 = 1$. We again use the Lagrange multiplier method.
$$L(\phi_2, \nu_1, \nu_2) = \phi_2^T \Sigma_X \phi_2 - \nu_1 \phi_1^T \phi_2 - \nu_2 (\phi_2^T \phi_2 - 1)$$
$$\frac{\partial L}{\partial \phi_2} = 2 \Sigma_X \phi_2 - \nu_1 \phi_1 - 2 \nu_2 \phi_2 = 0$$
$$\Rightarrow\; \phi_1^T ( 2 \Sigma_X \phi_2 - \nu_1 \phi_1 - 2 \nu_2 \phi_2 ) = 0 \;\Rightarrow\; \nu_1 = 0$$
$$\Rightarrow\; \Sigma_X \phi_2 = \nu_2 \phi_2 \;\Rightarrow\; \phi_2^T \Sigma_X \phi_2 = \nu_2 .$$
Because $\nu_2$ is an eigenvalue of $\Sigma_X$, with $\phi_2$ being the corresponding normalized eigenvector, $\operatorname{var}[\phi_2^T X]$ is maximized under the constraint $\phi_1^T \phi_2 = 0$ by choosing $\phi_2$ to be the second eigenvector of $\Sigma_X$. In this case, $z_2 = \phi_2^T X$ is named the second PC of $X$, $\phi_2$ is the vector of coefficients for $z_2$, and $\operatorname{var}(z_2) = \lambda_2$. Continuing in this way, it can be shown that the $i$-th PC $z_i = \phi_i^T X$ is constructed by selecting $\phi_i$ to be the $i$-th eigenvector of $\Sigma_X$ and has variance $\lambda_i$. The key result regarding PCA is that the principal components are the only set of linear functions of the original data that are uncorrelated and have orthogonal vectors of coefficients.
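The construction of the full set of PCs can be illustrated in the same spirit. The sketch below (again with NumPy, synthetic data, and names of our own choosing) stacks $z_i = \phi_i^T X$ for all $i$ and verifies that their sample covariance matrix is approximately $\operatorname{diag}(\lambda_1, \ldots, \lambda_m)$, i.e., that the PCs are uncorrelated with variances $\lambda_i$:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3)).T   # synthetic realizations of X

Sigma_X = np.cov(X, rowvar=False)
lam, Phi = np.linalg.eigh(Sigma_X)            # ascending eigenvalues / eigenvectors
lam, Phi = lam[::-1], Phi[:, ::-1]            # reorder so that lambda_1 >= ... >= lambda_m

Z = X @ Phi                                   # i-th column holds realizations of z_i = phi_i^T X
print(np.round(np.cov(Z, rowvar=False), 4))   # ~ diag(lambda_1, ..., lambda_m): PCs are uncorrelated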
Proposition D.1 [Jolliffe, 2002]: For any positive integer $p \le m$, let $B = [\beta_1, \beta_2, \ldots, \beta_p]$ be a real $m \times p$ matrix with orthonormal columns, i.e., $\beta_i^T \beta_j = \delta_{ij}$, and let $Y = B^T X$. Then the trace of the covariance matrix of $Y$ is maximized by taking $B = [\phi_1, \phi_2, \ldots, \phi_p]$, where $\phi_i$ is the $i$-th eigenvector of $\Sigma_X$.
Proof:
Because $\Sigma_X$ is symmetric with all eigenvalues distinct, $\{\phi_1, \phi_2, \ldots, \phi_m\}$ is an orthonormal basis of $\mathbb{R}^m$, with $\phi_i$ being the $i$-th eigenvector of $\Sigma_X$, and we can represent the columns of $B$ as
$$\beta_i = \sum_{j=1}^{m} c_{ji} \phi_j , \quad i = 1, \ldots, p .$$
So we have
$$B = P C$$
where $P = [\phi_1, \ldots, \phi_m]$ and $C = \{c_{ij}\}$ is an $m \times p$ matrix. Then $P^T \Sigma_X P = \Lambda$, with $\Lambda$ being a diagonal matrix whose $k$-th diagonal element is $\lambda_k$, and the covariance matrix of $Y$ is
$$\Sigma_Y = B^T \Sigma_X B = C^T P^T \Sigma_X P C = C^T \Lambda C = \lambda_1 c_1 c_1^T + \cdots + \lambda_m c_m c_m^T ,$$
where $c_i^T$ is the $i$-th row of $C$. So,
$$\operatorname{trace}(\Sigma_Y) = \sum_{i=1}^{m} \lambda_i \operatorname{trace}(c_i c_i^T) = \sum_{i=1}^{m} \lambda_i \operatorname{trace}(c_i^T c_i) = \sum_{i=1}^{m} \lambda_i \, c_i^T c_i = \sum_{i=1}^{m} \Big( \sum_{j=1}^{p} c_{ij}^2 \Big) \lambda_i .$$
Because $C^T C = B^T P P^T B = B^T B = I$, we have $\operatorname{trace}(C^T C) = \sum_{i=1}^{m} \sum_{j=1}^{p} c_{ij}^2 = p$, and the columns of $C$ are orthonormal. By the Gram-Schmidt method, $C$ can be expanded to a matrix $D$ whose columns form an orthonormal basis of $\mathbb{R}^m$ and whose first $p$ columns are those of $C$. $D$ is square, hence an orthogonal matrix, so its rows form another orthonormal basis of $\mathbb{R}^m$. Each row of $C$ is a part of the corresponding row of $D$, so
$$\sum_{j=1}^{p} c_{ij}^2 \le 1 , \quad i = 1, \ldots, m .$$
Consider the constraints $\sum_{j=1}^{p} c_{ij}^2 \le 1$ and $\sum_{i=1}^{m} \sum_{j=1}^{p} c_{ij}^2 = p$ together with the objective $\sum_{i=1}^{m} \big( \sum_{j=1}^{p} c_{ij}^2 \big) \lambda_i$. Since the weights $\sum_{j=1}^{p} c_{ij}^2$ are each at most 1 and sum to $p$, while $\lambda_1 > \lambda_2 > \cdots > \lambda_m$, we derive that $\operatorname{trace}(\Sigma_Y)$ is maximized if $\sum_{j=1}^{p} c_{ij}^2 = 1$ for $i = 1, \ldots, p$ and $\sum_{j=1}^{p} c_{ij}^2 = 0$ for $i = p+1, \ldots, m$. When $B = [\phi_1, \phi_2, \ldots, \phi_p]$, a straightforward calculation yields that $C$ is an all-zero matrix except for $c_{ii} = 1$, $i = 1, \ldots, p$. This fulfills the maximization condition.
In fact, by taking $B = [\psi_1, \psi_2, \ldots, \psi_p]$, where $\{\psi_1, \psi_2, \ldots, \psi_p\}$ is any orthonormal basis of the subspace $\operatorname{span}\{\phi_1, \phi_2, \ldots, \phi_p\}$, the maximization condition is also satisfied, yielding the same trace of the covariance matrix of $Y$.
■
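Proposition D.1, together with the closing remark of its proof, can be checked numerically. The sketch below (NumPy assumed; the covariance matrix and the names are arbitrary choices of ours) compares $\operatorname{trace}(B^T \Sigma_X B)$ for the first $p$ eigenvectors, for an arbitrary orthonormal $B$, and for a rotated orthonormal basis of $\operatorname{span}\{\phi_1, \ldots, \phi_p\}$:

import numpy as np

rng = np.random.default_rng(1)
m, p = 5, 2
A = rng.normal(size=(m, m))
Sigma_X = A @ A.T                                   # a positive definite covariance matrix
lam, Phi = np.linalg.eigh(Sigma_X)
lam, Phi = lam[::-1], Phi[:, ::-1]                  # eigenvalues in decreasing order

def trace_cov_Y(B):
    # trace of the covariance matrix of Y = B^T X, i.e. trace(B^T Sigma_X B)
    return np.trace(B.T @ Sigma_X @ B)

B_opt = Phi[:, :p]                                  # first p eigenvectors of Sigma_X
B_arb = np.linalg.qr(rng.normal(size=(m, p)))[0]    # an arbitrary orthonormal m x p matrix
R = np.linalg.qr(rng.normal(size=(p, p)))[0]        # a p x p orthogonal (rotation) matrix
print(trace_cov_Y(B_opt), lam[:p].sum())            # maximum value lambda_1 + ... + lambda_p
print(trace_cov_Y(B_arb))                           # never exceeds the maximum
print(trace_cov_Y(B_opt @ R))                       # same maximum for any basis of span{phi_1,...,phi_p}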
Proposition D.2 [Jolliffe, 2002]: Suppose that we wish to approximate the random vector $X$ by its projection onto a subspace spanned by the columns of $B$, where $B = [\beta_1, \beta_2, \ldots, \beta_p]$ is a real $m \times p$ matrix with orthonormal columns, i.e., $\beta_i^T \beta_j = \delta_{ij}$. If $\sigma_i^2$ is the residual variance for the $i$-th component of $X$, then $\sum_{i=1}^{m} \sigma_i^2$ is minimized if $B = [\phi_1, \phi_2, \ldots, \phi_p]$, where $\{\phi_1, \phi_2, \ldots, \phi_p\}$ are the first $p$ eigenvectors of $\Sigma_X$. In other words, the trace of the covariance matrix of $X - B B^T X$ is minimized if $B = [\phi_1, \phi_2, \ldots, \phi_p]$. When $E(X) = 0$, which is a commonly applied preprocessing step in data analysis, this property says that $E\| X - B B^T X \|^2$ is minimized if $B = [\phi_1, \phi_2, \ldots, \phi_p]$.
Proof:
The projection of the random vector $X$ onto the subspace spanned by the columns of $B$ is $\hat{X} = B B^T X$. The residual vector is then $\varepsilon = X - B B^T X$, which has covariance matrix
$$\Sigma_\varepsilon = (I - B B^T) \, \Sigma_X \, (I - B B^T) .$$
Then,
$$\sum_{i=1}^{m} \sigma_i^2 = \operatorname{trace}(\Sigma_\varepsilon) = \operatorname{trace}(\Sigma_X - \Sigma_X B B^T - B B^T \Sigma_X + B B^T \Sigma_X B B^T) .$$
Also, we know
$$\operatorname{trace}(\Sigma_X B B^T) = \operatorname{trace}(B B^T \Sigma_X) = \operatorname{trace}(B^T \Sigma_X B) ,$$
$$\operatorname{trace}(B B^T \Sigma_X B B^T) = \operatorname{trace}(B^T \Sigma_X B B^T B) = \operatorname{trace}(B^T \Sigma_X B) .$$
The last equality comes from the fact that $B$ has orthonormal columns, so that $B^T B = I$.
So,
$$\sum_{i=1}^{m} \sigma_i^2 = \operatorname{trace}(\Sigma_X) - \operatorname{trace}(B^T \Sigma_X B) .$$
To minimize $\sum_{i=1}^{m} \sigma_i^2$, it suffices to maximize $\operatorname{trace}(B^T \Sigma_X B)$. This can be done by choosing $B = [\phi_1, \phi_2, \ldots, \phi_p]$, where $\{\phi_1, \phi_2, \ldots, \phi_p\}$ are the first $p$ eigenvectors of $\Sigma_X$, according to Proposition D.1 stated above.
■
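Proposition D.2 can be illustrated in the same way. In the sketch below (NumPy assumed; synthetic zero-mean data of our own making), the sample average of $\| X - B B^T X \|^2$ is close to $\lambda_{p+1} + \cdots + \lambda_m$ when $B$ consists of the first $p$ eigenvectors of $\Sigma_X$, and is at least as large for any other orthonormal $B$:

import numpy as np

rng = np.random.default_rng(2)
m, p, n = 5, 2, 20000
A = rng.normal(size=(m, m))
X = rng.normal(size=(n, m)) @ A.T                   # zero-mean samples with covariance ~ A A^T

Sigma_X = np.cov(X, rowvar=False)
lam, Phi = np.linalg.eigh(Sigma_X)
lam, Phi = lam[::-1], Phi[:, ::-1]                  # eigenvalues in decreasing order

def recon_error(B):
    # sample average of ||x - B B^T x||^2 over the rows (realizations) of X
    resid = X - X @ B @ B.T
    return np.mean(np.sum(resid**2, axis=1))

B_opt = Phi[:, :p]                                  # first p eigenvectors of Sigma_X
B_arb = np.linalg.qr(rng.normal(size=(m, p)))[0]    # arbitrary orthonormal columns
print(recon_error(B_opt), lam[p:].sum())            # ~ lambda_{p+1} + ... + lambda_m
print(recon_error(B_arb))                           # at least as large as the optimum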