Title

From Last Meeting
Studied Approximation of Corpora Callosa:
- by Fourier (raw and centered)
- by PCA
Fisher Linear Discrimination
Recall Toy Problem:
Show HDLSS/HDLSSod1Raw.ps
Want to find “separating direction vector”
Recall PCA didn’t work
Show HDLSS/HDLSSod1PCA.ps
Also “difference between means” doesn’t work:
Show HDLSS/HDLSSod1Mdif.ps
Fisher Linear Discrimination
A view of Fisher Linear Discrimination:
Adjust to “make covariance structure right”
Show HDLSS/HDLSSod1FLD.ps
Mathematical Notation (vectors of dimension $d$):
Class 1: $X_1^{(1)}, \ldots, X_{n_1}^{(1)}$        Class 2: $X_1^{(2)}, \ldots, X_{n_2}^{(2)}$
Class Centerpoints:
$\bar{X}^{(1)} = \frac{1}{n_1}\sum_{i=1}^{n_1} X_i^{(1)}$   and   $\bar{X}^{(2)} = \frac{1}{n_2}\sum_{i=1}^{n_2} X_i^{(2)}$
Fisher Linear Discrimination (cont.)
Covariances (outer products):
$\hat{\Sigma}^{(j)} = \tilde{X}^{(j)}\,\tilde{X}^{(j)t}$, for $j = 1, 2$
Based on “normalized, centered data matrices”:
$\tilde{X}^{(j)} = \frac{1}{\sqrt{n_j}}\left[\,X_1^{(j)} - \bar{X}^{(j)},\ \ldots,\ X_{n_j}^{(j)} - \bar{X}^{(j)}\,\right]$
note: Use “MLE” version of normalization, for simpler notation
Terminology (useful later): the $\hat{\Sigma}^{(j)}$ are “within class covariances”
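A minimal NumPy sketch of this construction (the array names, sizes and toy data below are illustrative, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
d, n1 = 5, 20
X1 = rng.normal(size=(d, n1))            # Class 1 observations X_i^(1) as columns

# Class centerpoint (mean vector)
Xbar1 = X1.mean(axis=1, keepdims=True)   # d x 1

# "Normalized, centered data matrix" (MLE normalization)
X1_tilde = (X1 - Xbar1) / np.sqrt(n1)    # d x n1

# Within-class covariance as an outer product of the centered matrix
Sigma1_hat = X1_tilde @ X1_tilde.T       # d x d

# Agrees with the usual MLE covariance estimate
assert np.allclose(Sigma1_hat, np.cov(X1, bias=True))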
Fisher Linear Discrimination (cont.)
Major assumption: Class covariances are the same (or “similar”)
Good estimate of “common within class covariance”?
Show HDLSS/HDLSSod1FLD.ps
Pooled (weighted average) within class covariance:
$\hat{\Sigma}_w = \frac{n_1\,\hat{\Sigma}^{(1)} + n_2\,\hat{\Sigma}^{(2)}}{n_1 + n_2} = \tilde{X}\,\tilde{X}^t$
for the “full data matrix”:
$\tilde{X} = \frac{1}{\sqrt{n}}\left[\,\sqrt{n_1}\,\tilde{X}^{(1)},\ \sqrt{n_2}\,\tilde{X}^{(2)}\,\right]$
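Continuing the sketch (names again illustrative), the pooled estimate is the sample-size weighted average of the two within-class covariances; dividing the outer product of the class-by-class centered “full data matrix” by $n_1 + n_2$ at the end plays the role of the $1/\sqrt{n}$ and $\sqrt{n_j}$ factors above:

import numpy as np

rng = np.random.default_rng(1)
d, n1, n2 = 5, 20, 30
X1 = rng.normal(size=(d, n1))                    # Class 1 observations as columns
X2 = rng.normal(size=(d, n2)) + 1.0              # Class 2 observations as columns

def mle_cov(X):
    """Within-class covariance with MLE (divide by n_j) normalization."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return (Xc @ Xc.T) / X.shape[1]

# Pooled (weighted average) within-class covariance
Sigma_w = (n1 * mle_cov(X1) + n2 * mle_cov(X2)) / (n1 + n2)

# Same matrix from the class-by-class centered full data matrix
Xc_full = np.hstack([X1 - X1.mean(axis=1, keepdims=True),
                     X2 - X2.mean(axis=1, keepdims=True)])
assert np.allclose(Sigma_w, (Xc_full @ Xc_full.T) / (n1 + n2))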
Fisher Linear Discrimination (cont.)
Note: $\hat{\Sigma}_w$ is similar to $\hat{\Sigma}$ from before
- i.e. the “covariance matrix ignoring class labels”
- important difference is the “class by class centering”
Again show HDLSS/HDLSSod1FLD.ps
Fisher Linear Discrimination (cont.)
Simple way to find “correct covariance adjustment”:
Individually transform the subpopulations so each is “spherical” about its mean
Show HDLSS\HDLSSod1egFLD.ps
$Y_i^{(j)} = \hat{\Sigma}_w^{-1/2}\, X_i^{(j)}$
then: the “best separating hyperplane” is the “perpendicular bisector of the line between the means”
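A sketch of this sphering step (the helper name inv_sqrt and the eigenvalue floor eps are my additions; in HDLSS settings $\hat{\Sigma}_w$ is singular, so some such regularization or a pseudo-inverse is needed):

import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Symmetric inverse square root S^{-1/2} via eigendecomposition
    (small eigenvalues are floored as a crude regularization)."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.maximum(vals, eps)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(2)
d, n1 = 3, 200
A = rng.normal(size=(d, d))
X1 = A @ rng.normal(size=(d, n1))          # an elongated toy point cloud
Sigma_w = np.cov(X1, bias=True)            # stand-in for the pooled estimate

W = inv_sqrt(Sigma_w)
Y1 = W @ X1                                # Y_i^(j) = Sigma_w^{-1/2} X_i^(j)

# The transformed cloud is spherical: its covariance is (numerically) the identity
print(np.round(np.cov(Y1, bias=True), 2))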
Fisher Linear Discrimination (cont.)
So in transformed space, the separating hyperplane has:
Transformed normal vector:
$n_{TFLD} = \hat{\Sigma}_w^{-1/2}\,\bar{X}^{(1)} - \hat{\Sigma}_w^{-1/2}\,\bar{X}^{(2)} = \hat{\Sigma}_w^{-1/2}\left(\bar{X}^{(1)} - \bar{X}^{(2)}\right)$
Transformed intercept:
$\mu_{TFLD} = \tfrac{1}{2}\,\hat{\Sigma}_w^{-1/2}\,\bar{X}^{(1)} + \tfrac{1}{2}\,\hat{\Sigma}_w^{-1/2}\,\bar{X}^{(2)} = \hat{\Sigma}_w^{-1/2}\left(\tfrac{1}{2}\,\bar{X}^{(1)} + \tfrac{1}{2}\,\bar{X}^{(2)}\right)$
Equation of the hyperplane:
$\left\{\, y :\ \left\langle y,\ n_{TFLD} \right\rangle = \left\langle \mu_{TFLD},\ n_{TFLD} \right\rangle \,\right\}$
Again show HDLSS\HDLSSod1egFLD.ps
Fisher Linear Discrimination (cont.)
Thus discrimination rule is:
Given a new data vector $X_0$, choose Class 1 when:
$\left\langle \hat{\Sigma}_w^{-1/2} X_0,\ n_{TFLD} \right\rangle > \left\langle \mu_{TFLD},\ n_{TFLD} \right\rangle$
i.e. (transforming back to original space)
$\left\langle X_0,\ \hat{\Sigma}_w^{-1/2} n_{TFLD} \right\rangle > \left\langle \hat{\Sigma}_w^{1/2}\mu_{TFLD},\ \hat{\Sigma}_w^{-1/2} n_{TFLD} \right\rangle$
i.e.
$\left\langle X_0,\ n_{FLD} \right\rangle > \left\langle \mu_{FLD},\ n_{FLD} \right\rangle$
where:
$n_{FLD} = \hat{\Sigma}_w^{-1/2}\, n_{TFLD} = \hat{\Sigma}_w^{-1}\left(\bar{X}^{(1)} - \bar{X}^{(2)}\right)$
$\mu_{FLD} = \hat{\Sigma}_w^{1/2}\, \mu_{TFLD} = \tfrac{1}{2}\,\bar{X}^{(1)} + \tfrac{1}{2}\,\bar{X}^{(2)}$
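Putting the pieces together, a minimal end-to-end sketch of this rule in the original space (the function names fld_rule / classify and the pseudo-inverse regularization are illustration choices, not part of the notes):

import numpy as np

def fld_rule(X1, X2, eps=1e-10):
    """Fit the FLD separating hyperplane; X1, X2 are d x n_j arrays with
    observations as columns.  Returns (n_fld, mu_fld); classify x0 as
    Class 1 when <x0, n_fld> > <mu_fld, n_fld>."""
    n1, n2 = X1.shape[1], X2.shape[1]
    xbar1, xbar2 = X1.mean(axis=1), X2.mean(axis=1)

    def mle_cov(X):
        Xc = X - X.mean(axis=1, keepdims=True)
        return (Xc @ Xc.T) / X.shape[1]

    Sigma_w = (n1 * mle_cov(X1) + n2 * mle_cov(X2)) / (n1 + n2)
    # regularized inverse (needed when Sigma_w is near-singular, e.g. HDLSS)
    Sigma_w_inv = np.linalg.pinv(Sigma_w + eps * np.eye(X1.shape[0]))

    n_fld = Sigma_w_inv @ (xbar1 - xbar2)   # normal vector
    mu_fld = 0.5 * (xbar1 + xbar2)          # intercept point
    return n_fld, mu_fld

def classify(x0, n_fld, mu_fld):
    return 1 if x0 @ n_fld > mu_fld @ n_fld else 2

# Toy usage: two shifted Gaussian clouds
rng = np.random.default_rng(3)
X1 = rng.normal(size=(2, 50)) + np.array([[2.0], [0.0]])
X2 = rng.normal(size=(2, 50))
n_fld, mu_fld = fld_rule(X1, X2)
print(classify(np.array([2.0, 0.0]), n_fld, mu_fld))   # expect 1
print(classify(np.array([0.0, 0.0]), n_fld, mu_fld))   # expect 2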
Fisher Linear Discrimination (cont.)
Thus (in original space) have separating hyperplane with:
Normal vector: $n_{FLD}$
Intercept: $\mu_{FLD}$
Again show HDLSS\HDLSSod1egFLD.ps
Next time: relate to Mahalanobis distance
FLD Likelihood View
Assume: Class distributions are multivariate Gaussian: $N\left(\mu^{(j)},\ \Sigma_w\right)$
(strong distributional assumption + common covariance)
At a location $x_0$, the likelihood ratio, for choosing between Class 1 and Class 2, is:
$LR\left(x_0,\ \mu^{(1)},\ \mu^{(2)},\ \Sigma_w\right) = \phi_{\Sigma_w}\left(x_0 - \mu^{(1)}\right) \big/\ \phi_{\Sigma_w}\left(x_0 - \mu^{(2)}\right)$
where $\phi_{\Sigma_w}$ is the Gaussian density with covariance $\Sigma_w$
FLD Likelihood View (cont.)
Simplifying, using the form of the Gaussian density:
$\phi_{\Sigma_w}(x) = \frac{1}{(2\pi)^{d/2}\left|\Sigma_w\right|^{1/2}}\ e^{-x^t \Sigma_w^{-1} x\,/\,2}$
Gives (critically using the common covariance):
$LR\left(x_0,\ \mu^{(1)},\ \mu^{(2)},\ \Sigma_w\right) = e^{-\left[\left(x_0 - \mu^{(1)}\right)^t \Sigma_w^{-1}\left(x_0 - \mu^{(1)}\right) - \left(x_0 - \mu^{(2)}\right)^t \Sigma_w^{-1}\left(x_0 - \mu^{(2)}\right)\right]/2}$
so
$-2\log LR\left(x_0,\ \mu^{(1)},\ \mu^{(2)},\ \Sigma_w\right) = \left(x_0 - \mu^{(1)}\right)^t \Sigma_w^{-1}\left(x_0 - \mu^{(1)}\right) - \left(x_0 - \mu^{(2)}\right)^t \Sigma_w^{-1}\left(x_0 - \mu^{(2)}\right)$
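A quick numerical check of this identity, using scipy.stats.multivariate_normal for the Gaussian densities (the toy parameters below are arbitrary, not from the notes):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma_w = A @ A.T + np.eye(d)              # a common covariance
x0 = rng.normal(size=d)

# Likelihood ratio from the two Gaussian densities
lr = (multivariate_normal.pdf(x0, mean=mu1, cov=Sigma_w) /
      multivariate_normal.pdf(x0, mean=mu2, cov=Sigma_w))

# The difference of quadratic forms above
Si = np.linalg.inv(Sigma_w)
q = (x0 - mu1) @ Si @ (x0 - mu1) - (x0 - mu2) @ Si @ (x0 - mu2)

assert np.isclose(-2 * np.log(lr), q)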
FLD Likelihood View (cont.)
But:
$\left(x_0 - \mu^{(j)}\right)^t \Sigma_w^{-1}\left(x_0 - \mu^{(j)}\right) = x_0^t\,\Sigma_w^{-1}\,x_0 - 2\,x_0^t\,\Sigma_w^{-1}\,\mu^{(j)} + \mu^{(j)t}\,\Sigma_w^{-1}\,\mu^{(j)}$
so:
$-2\log LR\left(x_0,\ \mu^{(1)},\ \mu^{(2)},\ \Sigma_w\right) = -2\,x_0^t\,\Sigma_w^{-1}\left(\mu^{(1)} - \mu^{(2)}\right) + \mu^{(1)t}\,\Sigma_w^{-1}\,\mu^{(1)} - \mu^{(2)t}\,\Sigma_w^{-1}\,\mu^{(2)}$
Thus $LR\left(x_0,\ \mu^{(1)},\ \mu^{(2)},\ \Sigma_w\right) > 1$ when $-2\log LR\left(x_0,\ \mu^{(1)},\ \mu^{(2)},\ \Sigma_w\right) < 0$, i.e. when
$x_0^t\,\Sigma_w^{-1}\left(\mu^{(1)} - \mu^{(2)}\right) > \tfrac{1}{2}\left(\mu^{(1)} + \mu^{(2)}\right)^t \Sigma_w^{-1}\left(\mu^{(1)} - \mu^{(2)}\right)$
FLD Likelihood View (cont.)
Replacing $\mu^{(1)}$, $\mu^{(2)}$ and $\Sigma_w$ by the maximum likelihood estimates $\bar{X}^{(1)}$, $\bar{X}^{(2)}$ and $\hat{\Sigma}_w$
gives the likelihood ratio discrimination rule:
Choose Class 1, when
$x_0^t\,\hat{\Sigma}_w^{-1}\left(\bar{X}^{(1)} - \bar{X}^{(2)}\right) > \tfrac{1}{2}\left(\bar{X}^{(1)} + \bar{X}^{(2)}\right)^t \hat{\Sigma}_w^{-1}\left(\bar{X}^{(1)} - \bar{X}^{(2)}\right)$
i.e. the same rule as FLD above
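A sketch of this plug-in rule (names are illustrative); the vector w computed below is exactly $n_{FLD}$ and the threshold is $\left\langle \mu_{FLD},\ n_{FLD} \right\rangle$, so the code reproduces the FLD rule:

import numpy as np

def lr_rule(x0, X1, X2):
    """Plug-in Gaussian likelihood ratio rule with a common covariance.
    Observations are columns of X1 and X2; returns the chosen class."""
    n1, n2 = X1.shape[1], X2.shape[1]
    xbar1, xbar2 = X1.mean(axis=1), X2.mean(axis=1)

    def mle_cov(X):
        Xc = X - X.mean(axis=1, keepdims=True)
        return (Xc @ Xc.T) / X.shape[1]

    Sigma_w = (n1 * mle_cov(X1) + n2 * mle_cov(X2)) / (n1 + n2)
    w = np.linalg.solve(Sigma_w, xbar1 - xbar2)     # Sigma_w^{-1} (Xbar1 - Xbar2)

    return 1 if x0 @ w > 0.5 * (xbar1 + xbar2) @ w else 2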
FLD Generalization I
Gaussian Likelihood Ratio Discrimination
(a. k. a. “nonlinear discriminant analysis”)
Idea: Assume class distributions are $N\left(\mu^{(j)},\ \Sigma^{(j)}\right)$
Different covariances!
Likelihood Ratio rule is straightforward numerical calculation
(thus can easily do discrimination)
FLD Generalization I (cont.)
But no longer have “separating hyperplane” representation
(instead “regions determined by quadratics”)
(fairly complicated case-wise calculations)
Graphical display: for each point, color as:
Yellow if assigned to Class 1
Cyan if assigned to Class 2
(“intensity” is “strength of assignment”)
show PolyEmbed\PEod1FLDe1.ps
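A minimal sketch of this likelihood ratio rule with unequal class covariances (quadratic decision regions); the function name and the tiny ridge added for numerical stability are my choices:

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_lr_classify(x0, X1, X2):
    """Gaussian likelihood ratio rule with different class covariances.
    Observations are columns of X1 and X2."""
    log_dens = []
    for X in (X1, X2):
        mu = X.mean(axis=1)
        Xc = X - mu[:, None]
        cov = (Xc @ Xc.T) / X.shape[1] + 1e-8 * np.eye(X.shape[0])   # MLE + ridge
        log_dens.append(multivariate_normal.logpdf(x0, mean=mu, cov=cov))
    return 1 if log_dens[0] > log_dens[1] else 2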
FLD Generalization I (cont.)
Toy Examples:
1. Standard Tilted Point clouds:
also show PolyEmbed\PEod1GLRe1.ps
- Both FLD and LR work well.
2. Donut:
Show PolyEmbed\PEdonFLDe1.ps & PEdonGLRe1.ps
- FLD poor (no separating plane can work)
- LR much better
FLD Generalization I (cont.)
3. Split X:
Show PolyEmbed\PExd3FLDe1.ps & PExd3GLRe1.ps
- neither works well
- although there exist good separating surfaces
- they are not “from Gaussian likelihoods”
- so this is not “general quadratic discrimination”
FLD Generalization II
Different prior probabilities
Main idea: Give different weights to the 2 classes
I.e. assume the classes are not a priori equally likely
Development is “straightforward”:
- modified likelihood
- change of intercept in FLD (as sketched below)
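For reference, a standard way to write the modified rule under the Gaussian, common-covariance model (the prior notation $\pi_1, \pi_2$ is mine, not from the notes): choose Class 1 when
$x_0^t\,\hat{\Sigma}_w^{-1}\left(\bar{X}^{(1)} - \bar{X}^{(2)}\right) > \tfrac{1}{2}\left(\bar{X}^{(1)} + \bar{X}^{(2)}\right)^t \hat{\Sigma}_w^{-1}\left(\bar{X}^{(1)} - \bar{X}^{(2)}\right) + \log\left(\pi_2 / \pi_1\right)$
so only the intercept moves; the normal vector is unchanged.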
Might explore with toy examples, but time is short
FLD Generalization III
Principal Discriminant Analysis
Idea: FLD-like approach to more than two classes
Assumption:
Class covariance matrices are the same (similar)
(but not Gaussian, as for FLD)
Main idea:
quantify “location of classes” by their means
 1,   2  , … ,   k 
FLD Generalization III (cont.)
Simple way to find “interesting directions” among the means:
PCA on set of means
i.e.
Eigen-analysis of “between class covariance matrix”
$\Sigma_B = M\,M^t$
where
$M = \frac{1}{\sqrt{k}}\left[\,\mu^{(1)} - \bar{\mu}_{\text{overall}},\ \ldots,\ \mu^{(k)} - \bar{\mu}_{\text{overall}}\,\right]$, with $\bar{\mu}_{\text{overall}}$ the overall mean
Aside: can show that the overall covariance $\hat{\Sigma}$ decomposes into between class ($\Sigma_B$) and within class ($\hat{\Sigma}_w$) parts
FLD Generalization III (cont.)
But PCA only works like the “mean difference”
Expect to improve by “taking covariance into account”
Again show HDLSS\HDLSSod1egFLD.ps
Blind application of above ideas suggests eigen-analysis of:
$\Sigma_w^{-1}\,\Sigma_B$
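A sketch of this eigen-analysis, computed as the equivalent generalized eigenproblem $\Sigma_B v = \lambda\,\Sigma_w v$ via scipy.linalg.eigh (the function name, the ridge term, and the use of sample class means and the grand mean in place of the $\mu^{(j)}$ are my assumptions):

import numpy as np
from scipy.linalg import eigh

def principal_discriminant_directions(class_samples):
    """Eigen-analysis of Sigma_w^{-1} Sigma_B for k classes, solved as the
    generalized eigenproblem Sigma_B v = lambda Sigma_w v.
    class_samples: list of d x n_j arrays (observations as columns)."""
    d = class_samples[0].shape[0]
    n = sum(X.shape[1] for X in class_samples)
    k = len(class_samples)
    means = [X.mean(axis=1) for X in class_samples]
    overall = sum(X.sum(axis=1) for X in class_samples) / n   # grand mean

    # Pooled within-class covariance (MLE normalization) + tiny ridge
    Sigma_w = sum((X - m[:, None]) @ (X - m[:, None]).T
                  for X, m in zip(class_samples, means)) / n
    Sigma_w = Sigma_w + 1e-8 * np.eye(d)

    # Between-class covariance: Sigma_B = M M^t, M = (1/sqrt(k)) [mean_j - overall]
    M = np.column_stack([m - overall for m in means]) / np.sqrt(k)
    Sigma_B = M @ M.T

    # Generalized eigenproblem; columns of V are the discriminant directions
    vals, V = eigh(Sigma_B, Sigma_w)
    order = np.argsort(vals)[::-1]
    return vals[order], V[:, order]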
FLD Generalization III (cont.)
There are:
- smarter ways to compute (a “generalized eigenvalue” problem)
- other representations (this solves optimization problems)
Special case: 2 classes reduces to standard FLD
Good reference for more: Section 3.8 of:
Duda, R. O., Hart, P. E. and Stork, D. G. (2001) Pattern Classification, Wiley.