PCA and LDA

Nuno Vasconcelos
ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it will look very flat when viewed from the full space, e.g. a 1D subspace in 2D, or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data, the equiprobability contours will be highly skewed ellipsoids
Principal component analysis
If y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ
by computing the eigenvalues we know if the data is flat:
• λ_1 >> λ_2: flat
• λ_1 = λ_2: not flat
Principal component analysis (learning)
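Since the principal components φ_i are the eigenvectors of Σ and the principal lengths λ_i its eigenvalues, PCA learning amounts to estimating the sample mean and covariance and eigendecomposing the latter. A minimal numpy sketch of this procedure (the function name pca_eig and the toy data are illustrative, not from the slides):

```python
import numpy as np

def pca_eig(X):
    """PCA by eigendecomposition of the sample covariance.
    X: d x n data matrix, one example per column."""
    n = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)      # sample mean
    Xc = X - mu                             # centered data
    Sigma = (Xc @ Xc.T) / n                 # sample covariance, 1/n convention
    lam, Phi = np.linalg.eigh(Sigma)        # eigenvalues / eigenvectors of Sigma
    order = np.argsort(lam)[::-1]           # sort by decreasing eigenvalue
    return lam[order], Phi[:, order], mu

# toy example: "flat" 2D data, so lambda_1 >> lambda_2
rng = np.random.default_rng(0)
X = np.array([[3.0], [1.0]]) * rng.standard_normal((2, 500))
lam, Phi, mu = pca_eig(X)
print(lam)   # roughly [9, 1]: lambda_1 >> lambda_2, i.e. the data is flat
```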
Principal component analysis
there is an alternative way to compute the principal components, based on the singular value decomposition (SVD):
• any real n x m matrix A (n > m) can be decomposed as
A = M Π N^T,    M^T M = I,    N^T N = I
• where M is an n x m column-orthonormal matrix whose columns are the left singular vectors
• Π is an m x m diagonal matrix of singular values
• N^T is an m x m row-orthonormal matrix whose rows are the right singular vectors (the columns of N)
PCA by SVD
to relate this to PCA, we consider the data matrix
X = [x_1 ... x_n]
the sample mean is
µ = (1/n) ∑_i x_i = (1/n) [x_1 ... x_n] [1 ... 1]^T = (1/n) X 1
PCA by SVD
we can center the data by subtracting the mean from each column of X; this gives the centered data matrix
X_c = [x_1 ... x_n] − [µ ... µ] = X − µ 1^T = X − (1/n) X 1 1^T = X (I − (1/n) 1 1^T)
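A quick numerical check of the identity X_c = X (I − (1/n) 1 1^T), with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.standard_normal((d, n))                    # one example per column
ones = np.ones((n, 1))

Xc_formula = X @ (np.eye(n) - ones @ ones.T / n)   # X (I - (1/n) 1 1^T)
Xc_direct = X - X.mean(axis=1, keepdims=True)      # subtract the mean from each column
print(np.allclose(Xc_formula, Xc_direct))          # True
```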
PCA by SVD
the sample covariance is
Σ = (1/n) ∑_i (x_i − µ)(x_i − µ)^T = (1/n) ∑_i x_i^c (x_i^c)^T
where x_i^c is the i-th column of X_c; this can be written as
Σ = (1/n) [x_1^c ... x_n^c] [(x_1^c)^T ; ... ; (x_n^c)^T] = (1/n) X_c X_c^T
PCA by SVD
the matrix
X_c^T = [(x_1^c)^T ; ... ; (x_n^c)^T]
is real n x d; assuming n > d, it has the SVD
X_c^T = M Π N^T,    M^T M = I,    N^T N = I
and
Σ = (1/n) X_c X_c^T = (1/n) N Π M^T M Π N^T = (1/n) N Π² N^T
PCA by SVD
Σ = N (Π²/n) N^T
noting that N is d x d and orthonormal, and Π² is diagonal, this is just the eigenvalue decomposition of Σ
it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are λ_i = π_i²/n
this gives an alternative algorithm for PCA
PCA by SVD
computation of PCA by SVD
given X with one example per column
• 1) create the centered data matrix X_c^T = (I − (1/n) 1 1^T) X^T
• 2) compute its SVD, X_c^T = M Π N^T
• 3) the principal components are the columns of N, and the eigenvalues are λ_i = π_i²/n
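A minimal numpy sketch of these three steps, checked against the eigenvalues of the sample covariance (the function name pca_svd is illustrative):

```python
import numpy as np

def pca_svd(X):
    """PCA by SVD of the centered data matrix.
    X: d x n data matrix, one example per column."""
    d, n = X.shape
    ones = np.ones((n, 1))
    XcT = (np.eye(n) - ones @ ones.T / n) @ X.T           # step 1: X_c^T, an n x d matrix
    M, pi, NT = np.linalg.svd(XcT, full_matrices=False)   # step 2: X_c^T = M Pi N^T
    lam = pi**2 / n                                       # step 3: lambda_i = pi_i^2 / n
    return lam, NT.T                                      # principal components = columns of N

# sanity check against the eigendecomposition of Sigma = (1/n) Xc Xc^T
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))
lam_svd, N = pca_svd(X)
Xc = X - X.mean(axis=1, keepdims=True)
lam_eig = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / X.shape[1]))[::-1]
print(np.allclose(lam_svd, lam_eig))                      # True
```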
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of
PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has lower dimension
• but could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the
worst possible thing we could do
Example
consider a problem with
• two n-D Gaussian classes with covariance Σ = σ²I, σ² = 10
X ~ N(µ_i, 10 I)
• we add an extra variable which is the class label itself
X' = [X, i]
• assuming that P_Y(0) = P_Y(1) = 0.5
E[Y] = 0.5 × 0 + 0.5 × 1 = 0.5
var[Y] = 0.5 × (0 − 0.5)² + 0.5 × (1 − 0.5)² = 0.25 < 10
• dimension n+1 has the smallest variance and is the first to be discarded!
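A small simulation of this example (n = 3 and the class means are illustrative): the label dimension has by far the smallest variance, so it is the first direction PCA discards, even though it carries all the class information.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 3, 5000

# two n-D Gaussian classes with covariance 10*I; illustrative means mu_0 = 0, mu_1 = (1,...,1)
y = rng.integers(0, 2, n_samples)                    # P_Y(0) = P_Y(1) = 0.5
mu = np.array([np.zeros(n), np.ones(n)])
X = mu[y] + np.sqrt(10) * rng.standard_normal((n_samples, n))
Xp = np.hstack([X, y[:, None]])                      # X' = [X, i]: the label as an extra variable

lam, Phi = np.linalg.eigh(np.cov(Xp.T))              # eigenvalues in ascending order
print(lam)                # smallest eigenvalue ~0.23, the others ~10
print(np.abs(Phi[:, 0]))  # the smallest-eigenvalue direction is essentially the label axis
```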
Example
this is
• a very contrived example
• but shows that PCA can throw away all the discriminant info
does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of
variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest
• picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than with all n = 100 features, and probably not much worse than one using the best 5 features
is there a rule of thumb for finding the number of PCA
components?
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio r_k as a function of k
r_k = ∑_{i=1}^{k} λ_i² / ∑_{i=1}^{n} λ_i²
• e.g. we need 3 eigenvectors to cover 70% of the variability of this
dataset
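A short sketch of this rule of thumb, using r_k exactly as defined above; the eigenvalue spectrum is illustrative:

```python
import numpy as np

def num_components(lam, p=0.7):
    """Smallest k with r_k = sum_{i<=k} lam_i^2 / sum_{i<=n} lam_i^2 >= p."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]   # decreasing eigenvalues
    r = np.cumsum(lam**2) / np.sum(lam**2)              # r_k, as defined on the slide
    return int(np.searchsorted(r, p) + 1), r

# an illustrative spectrum where 3 eigenvectors are needed to cover 70% of the variability
lam = [5.0, 3.0, 3.0, 2.0, 1.5, 1.0]
k, r = num_components(lam, p=0.7)
print(k)               # 3
print(np.round(r, 2))  # [0.5  0.68 0.86 0.94 0.98 1.  ]
```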
Fisher’s linear discriminant
what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions
one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes
(figure: a bad projection vs. a good projection of the two classes onto a line)
Linear discriminant analysis
we have two classes such that
E_{X|Y}[X | Y = i] = µ_i
E_{X|Y}[(X − µ_i)(X − µ_i)^T | Y = i] = Σ_i
and want to find the line
z = w^T x
that best separates them
one possibility would be to maximize
(E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])² = (E_{X|Y}[w^T X | Y = 1] − E_{X|Y}[w^T X | Y = 0])² = (w^T [µ_1 − µ_0])²
Linear discriminant analysis
however, this
(w^T [µ_1 − µ_0])²
can be made arbitrarily large by simply scaling w
we are only interested in the direction, not the magnitude, so we need some type of normalization
Fisher suggested maximizing the ratio of between-class scatter to within-class scatter:
max_w (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])² / (var[Z | Y = 1] + var[Z | Y = 0])
Linear discriminant analysis
we have already seen that
(E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])² = (w^T [µ_1 − µ_0])² = w^T [µ_1 − µ_0][µ_1 − µ_0]^T w
also
var[Z | Y = i] = E_{Z|Y}[(z − E_{Z|Y}[Z | Y = i])² | Y = i]
              = E_{Z|Y}[(w^T [x − µ_i])² | Y = i]
              = E_{Z|Y}[w^T [x − µ_i][x − µ_i]^T w | Y = i]
              = w^T Σ_i w
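A quick numerical check of the last identity, var[Z | Y = i] = w^T Σ_i w, with arbitrary illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w = rng.standard_normal(d)

A = rng.standard_normal((d, d))
Sigma_i = A @ A.T + d * np.eye(d)        # an arbitrary class-conditional covariance (SPD)
mu_i = rng.standard_normal(d)

x = rng.multivariate_normal(mu_i, Sigma_i, size=200_000)
z = x @ w                                # z = w^T x
print(z.var(), w @ Sigma_i @ w)          # nearly equal
```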
Linear discriminant analysis
and
J(w) = (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])² / (var[Z | Y = 1] + var[Z | Y = 0])
     = w^T (µ_1 − µ_0)(µ_1 − µ_0)^T w / w^T (Σ_1 + Σ_0) w
which can be written as
J(w) = w^T S_B w / w^T S_W w
with the between-class scatter S_B = (µ_1 − µ_0)(µ_1 − µ_0)^T and the within-class scatter S_W = Σ_1 + Σ_0
Linear discriminant analysis
maximizing the ratio
J(w) = w^T S_B w / w^T S_W w
• is equivalent to maximizing the numerator while keeping the denominator constant, i.e.
max_w w^T S_B w   subject to   w^T S_W w = K
• and can be accomplished using Lagrange multipliers
• define the Lagrangian
L = w^T S_B w − λ (w^T S_W w − K)
• and maximize with respect to both w and λ
Linear discriminant analysis
setting the gradient of
L = w^T (S_B − λ S_W) w + λ K
with respect to w to zero, we get
∇_w L = 2 (S_B − λ S_W) w = 0
or
S_B w = λ S_W w
this is a generalized eigenvalue problem
the solution is easy when S_W^{-1} = (Σ_1 + Σ_0)^{-1} exists
Linear discriminant analysis
in this case
S_W^{-1} S_B w = λ w
and, using the definition of S_B,
S_W^{-1} (µ_1 − µ_0)(µ_1 − µ_0)^T w = λ w
noting that (µ_1 − µ_0)^T w = α is a scalar, this can be written as
S_W^{-1} (µ_1 − µ_0) = (λ/α) w
and since we don't care about the magnitude of w,
w* = S_W^{-1} (µ_1 − µ_0) = (Σ_1 + Σ_0)^{-1} (µ_1 − µ_0)
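A minimal sketch that computes w* = S_W^{-1}(µ_1 − µ_0) from data and checks it against the leading generalized eigenvector of S_B w = λ S_W w from the previous slide (the function name lda_direction and the toy data are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def lda_direction(X0, X1):
    """Fisher LDA direction w* = S_W^{-1} (mu_1 - mu_0).
    X0, X1: n_i x d arrays of examples from class 0 and class 1."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = np.cov(X0.T) + np.cov(X1.T)            # within-class scatter Sigma_0 + Sigma_1
    return np.linalg.solve(S_W, mu1 - mu0)

# toy 2D data from two Gaussian classes
rng = np.random.default_rng(0)
C = [[2.0, 1.0], [1.0, 2.0]]
X0 = rng.multivariate_normal([0.0, 0.0], C, size=500)
X1 = rng.multivariate_normal([2.0, 1.0], C, size=500)
w = lda_direction(X0, X1)

# compare with the generalized eigenvalue problem S_B w = lambda S_W w
diff = X1.mean(axis=0) - X0.mean(axis=0)
S_B = np.outer(diff, diff)
S_W = np.cov(X0.T) + np.cov(X1.T)
lam, V = eigh(S_B, S_W)                          # generalized symmetric eigenproblem
w_eig = V[:, np.argmax(lam)]                     # eigenvector of the largest eigenvalue
cos = abs(w @ w_eig) / (np.linalg.norm(w) * np.linalg.norm(w_eig))
print(cos)                                       # ~1: same direction, up to sign and scale
```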
Linear discriminant analysis
note that we have seen this before
• for a classification problem with Gaussian classes of equal
covariance Σi = Σ, the BDR boundary is the plane of normal
w = Σ^{-1} (µ_i − µ_j)
• if Σ1 = Σ0, this is also the LDA solution
(figure: two Gaussian classes with equal covariance Σ and means µ_i, µ_j; the boundary plane passes through x_0 with normal w)
Linear discriminant analysis
this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have
equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after approximating the data by two Gaussians with equal covariance