
ECE 8443 – Pattern Recognition
LECTURE 03: GAUSSIAN CLASSIFIERS
• Objectives:
Normal Distributions
Whitening Transformations
Linear Discriminants
• Resources:
D.H.S: Chapter 2 (Part 3)
K.F.: Intro to PR
X. Z.: PR Course
M.B.: Gaussian Discriminants
E.M.: Linear Discriminants
Multicategory Decision Surfaces
• Define a set of discriminant functions: $g_i(\mathbf{x}),\ i = 1, \ldots, c$
• Define a decision rule:
  choose $\omega_i$ if: $g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \forall j \neq i$
• For a Bayes classifier, let $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$, because the maximum discriminant
  function will correspond to the minimum conditional risk.
• For the minimum error rate case, let $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$, so that the maximum
  discriminant function corresponds to the maximum posterior probability.
• Choice of discriminant function is not unique:
  – Multiply all $g_i(\mathbf{x})$ by the same positive constant, or add the same constant to each (see the sketch below).
  – Replace $g_i(\mathbf{x})$ with a monotonically increasing function, $f(g_i(\mathbf{x}))$.
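• A minimal Python sketch of this decision rule, assuming the caller supplies the discriminant functions (the names classify and discriminants are illustrative, not from the slides):

```python
import numpy as np

def classify(x, discriminants):
    """Pick the class whose discriminant g_i(x) is largest.

    discriminants: a list of callables g_i(x), e.g., log posteriors."""
    scores = np.array([g(x) for g in discriminants])
    # choose omega_i if g_i(x) > g_j(x) for all j != i
    return int(np.argmax(scores))
```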
Network Representation of a Classifier
• A classifier can be visualized as a connected graph with arcs and weights:
• What are the advantages of this type of visualization?
Log Probabilities
• Some monotonically increasing functions can simplify calculations
considerably:
(1) gi x   P(i x ) 
p x i P i 
c

  
 p x j P  j
j 1
(2) gi x   px i P i 
(3) f ( gi x )  ln( gi x )  ln( p x i )  ln( P i )
• What are some of the reasons (3) is particularly useful?
  – Computational complexity (e.g., Gaussian)
  – Numerical accuracy (e.g., probabilities tend to zero; see the sketch after this list)
  – Decomposition (e.g., likelihood and prior are separated and can be weighted differently)
  – Normalization (e.g., likelihoods are channel dependent).
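• A minimal Python/NumPy sketch of point (3), assuming univariate Gaussian class-conditionals (the helpers log_gaussian and log_posteriors, and all parameter values, are illustrative):

```python
import numpy as np

def log_gaussian(x, mu, var):
    # ln p(x | omega_i) for a univariate Gaussian: no products of tiny numbers
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

def log_posteriors(x, mus, vars_, priors):
    # Form (3): g_i(x) = ln p(x | omega_i) + ln P(omega_i); its argmax already
    # gives the decision, so the normalization below is optional.
    g = np.array([log_gaussian(x, m, v) + np.log(p)
                  for m, v, p in zip(mus, vars_, priors)])
    # Form (1) recovered stably via the log-sum-exp trick
    return g - (g.max() + np.log(np.sum(np.exp(g - g.max()))))

# Far from both means the raw likelihoods are tiny, but the log domain is safe.
print(np.exp(log_posteriors(25.0, mus=[0.0, 1.0], vars_=[1.0, 1.0], priors=[0.5, 0.5])))
```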
Decision Surfaces
• We can visualize our decision rule several ways:
choose i if: gi(x) > gj(x) j  i
Two-Category Case
• A classifier that places a pattern in one of two classes is often referred to as a
dichotomizer.
• We can reshape the decision rule:
  if $g_1(\mathbf{x}) > g_2(\mathbf{x})$, choose $\omega_1$; equivalently, define $g(\mathbf{x}) \equiv g_1(\mathbf{x}) - g_2(\mathbf{x})$ and choose $\omega_1$ if $g(\mathbf{x}) > 0$.
• Two convenient choices for the single discriminant are the difference of the posteriors, or the sum of the log likelihood ratio and the log prior ratio:
  $g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$
  $g(\mathbf{x}) = \ln\dfrac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$
• A dichotomizer can be viewed as a machine that computes a single
discriminant function and classifies x according to the sign (e.g., support
vector machines).
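• A minimal Python sketch of a dichotomizer, assuming two univariate Gaussian classes with a shared variance (the function name and all parameter values are illustrative):

```python
import numpy as np

def g(x, mu1, mu2, var, p1, p2):
    # g(x) = ln[p(x|w1)/p(x|w2)] + ln[P(w1)/P(w2)] for equal-variance Gaussians
    log_likelihood_ratio = ((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * var)
    return log_likelihood_ratio + np.log(p1 / p2)

x = 0.8
label = 1 if g(x, mu1=0.0, mu2=2.0, var=1.0, p1=0.5, p2=0.5) > 0 else 2
print(label)   # classification depends only on the sign of g(x)
```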
Normal Distributions
• Recall the definition of a normal distribution (Gaussian):
  $p(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{t}\,\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$
• Why is this distribution so important in engineering?
• Mean: $\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$
• Covariance: $\boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{t}] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{t}\, p(\mathbf{x})\, d\mathbf{x}$
• Statistical independence?
• Higher-order moments? Occam’s Razor?
• Entropy?
• Linear combinations of normal random variables?
• Central Limit Theorem?
Univariate Normal Distribution
• A normal or Gaussian density is a powerful model for continuous-valued feature vectors corrupted by noise, due to its analytical tractability.
• Univariate normal distribution:
  $p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^{2}\right]$
  where the mean and variance are defined by:
  $\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx$
  $\sigma^{2} = E[(x - \mu)^{2}] = \int_{-\infty}^{\infty} (x - \mu)^{2}\, p(x)\, dx$
• The entropy of a univariate normal distribution is given by:
  $H(p(x)) = -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx = \dfrac{1}{2}\log(2\pi e \sigma^{2})$
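• A small numerical check of these formulas in Python/NumPy (the grid, the guard constant, and the parameter values are illustrative):

```python
import numpy as np

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Riemann-sum checks of the density, its moments, and its entropy
print((p * dx).sum())                               # ~1.0: integrates to one
print((x * p * dx).sum())                           # ~mu
print(((x - mu) ** 2 * p * dx).sum())               # ~sigma**2
print((-p * np.log(p + 1e-300) * dx).sum())         # numerical entropy
print(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))  # closed form (1/2) ln(2*pi*e*sigma^2)
```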
Mean and Variance
• A normal distribution is completely specified by its mean and variance:
• The peak value occurs at the mean: $p(\mu) = \dfrac{1}{\sqrt{2\pi}\,\sigma}$
• Approximately 68% of the area lies within one $\sigma$ of the mean; 95% lies within
  two $\sigma$; 99.7% lies within three $\sigma$ (see the numerical check after these bullets).
• A normal distribution achieves the maximum entropy of all distributions
having a given mean and variance.
• Central Limit Theorem: the sum of a large number of small, independent
  random variables tends toward a Gaussian distribution.
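• The one-, two-, and three-sigma areas can be checked with the Gaussian error function, using the standard identity $P(|x - \mu| < k\sigma) = \operatorname{erf}(k/\sqrt{2})$ (a sketch, not from the slides):

```python
from math import erf, sqrt

for k in (1, 2, 3):
    # Fraction of a Gaussian's area within k standard deviations of the mean
    print(k, erf(k / sqrt(2)))   # ~0.683, ~0.954, ~0.997
```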
Multivariate Normal Distributions
• A multivariate normal distribution is defined as:
  $p(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{t}\,\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$
  where $\boldsymbol{\mu}$ represents the mean (vector) and $\boldsymbol{\Sigma}$ represents the covariance
  (matrix). (A numerical evaluation of this density appears after this slide's bullets.)
• Note the exponent term is really a dot product or
weighted Euclidean distance.
• The covariance is always symmetric
and positive semidefinite.
• How does the shape vary as a
function of the covariance?
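• A direct Python/NumPy evaluation of the density above (the mean, covariance, and test point are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^t Sigma^-1 (x-mu))."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(sigma) @ diff        # the weighted distance in the exponent
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -1.0]), mu, sigma))
```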
Support Regions
• A support region is obtained by intersecting a Gaussian distribution with a plane.
• For a horizontal plane, this generates
an ellipse whose points are of equal
probability density.
• The shape of the support region is
defined by the covariance matrix.
Derivation
Identity Covariance
Unequal Variances
Nonzero Off-Diagonal Elements
Unconstrained or “Full” Covariance
Coordinate Transformations
• Why is it convenient to convert an arbitrary distribution into a spherical one?
(Hint: Euclidean distance)
• Consider the transformation: $\mathbf{A}_w = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$,
  where $\boldsymbol{\Phi}$ is the matrix whose columns are the orthonormal eigenvectors of $\boldsymbol{\Sigma}$,
  and $\boldsymbol{\Lambda}$ is the diagonal matrix of its eigenvalues ($\boldsymbol{\Sigma} = \boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^{t}$). Note that $\boldsymbol{\Phi}$ is unitary.
• What is the covariance of $\mathbf{y} = \mathbf{A}_w^{t}\mathbf{x}$ (assuming, for simplicity, that $\mathbf{x}$ is zero mean)?
  $E[\mathbf{y}\mathbf{y}^{t}] = E[(\mathbf{A}_w^{t}\mathbf{x})(\mathbf{A}_w^{t}\mathbf{x})^{t}] = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Phi}^{t}\,E[\mathbf{x}\mathbf{x}^{t}]\,\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Phi}^{t}\,\boldsymbol{\Sigma}\,\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$
  $= \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Phi}^{t}(\boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^{t})\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2} = \mathbf{I}$
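• A small Python/NumPy check of the whitening transformation (the covariance matrix, sample size, and seed are illustrative; sampled data stand in for the expectation):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([[4.0, 1.5], [1.5, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=sigma, size=100_000)

# Eigendecomposition Sigma = Phi Lambda Phi^t, then A_w = Phi Lambda^(-1/2)
lam, phi = np.linalg.eigh(sigma)
A_w = phi @ np.diag(lam ** -0.5)

y = x @ A_w                     # applies y = A_w^t x to each sample (row)
print(np.cov(y, rowvar=False))  # ~ identity: the transformed data are "white"
```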
Mahalanobis Distance
• The weighted Euclidean distance:
(x  μ)t  1(x  μ)
is known as the Mahalanobis distance, and represents a statistically
normalized distance calculation that results from our whitening
transformation.
• Consider an example using our Java Applet.
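• As an alternative to the applet, a short Python/NumPy check that the Mahalanobis distance equals the ordinary Euclidean distance after whitening (all values are illustrative):

```python
import numpy as np

sigma = np.array([[4.0, 1.5], [1.5, 1.0]])
mu = np.array([1.0, 2.0])
x = np.array([3.0, 1.0])

# Squared Mahalanobis distance in the original coordinates
d_mahal = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)

# Whiten with A_w = Phi Lambda^(-1/2), then use the squared Euclidean distance
lam, phi = np.linalg.eigh(sigma)
A_w = phi @ np.diag(lam ** -0.5)
d_eucl = np.sum((A_w.T @ (x - mu)) ** 2)

print(d_mahal, d_eucl)   # the two values agree
```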
Discriminant Functions
• Recall our discriminant function for minimum error rate classification:
  $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$
• For a multivariate normal distribution:
  $g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^{t}\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) - \dfrac{d}{2}\ln(2\pi) - \dfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$
• Consider the case: $\boldsymbol{\Sigma}_i = \sigma^{2}\mathbf{I}$
  (statistical independence, equal variance, class-independent variance)
  $\boldsymbol{\Sigma}_i = \begin{bmatrix} \sigma^{2} & 0 & \cdots & 0 \\ 0 & \sigma^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^{2} \end{bmatrix}$
  $|\boldsymbol{\Sigma}_i| = \sigma^{2d}$ and is independent of $i$
  $\boldsymbol{\Sigma}_i^{-1} = (1/\sigma^{2})\,\mathbf{I}$
Gaussian Classifiers
• The discriminant function can be reduced to:
  $g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^{t}\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) - \dfrac{d}{2}\ln(2\pi) - \dfrac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$
• Since the $\frac{d}{2}\ln(2\pi)$ and $\frac{1}{2}\ln|\boldsymbol{\Sigma}_i|$ terms are constant with respect to the maximization:
  $g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^{t}\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i) = -\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^{2}}{2\sigma^{2}} + \ln P(\omega_i)$
• We can expand this:
  $g_i(\mathbf{x}) = -\dfrac{1}{2\sigma^{2}}\left(\mathbf{x}^{t}\mathbf{x} - 2\boldsymbol{\mu}_i^{t}\mathbf{x} + \boldsymbol{\mu}_i^{t}\boldsymbol{\mu}_i\right) + \ln P(\omega_i)$
• The term $\mathbf{x}^{t}\mathbf{x}$ is a constant with respect to $i$, and $\boldsymbol{\mu}_i^{t}\boldsymbol{\mu}_i$ is a constant that can be precomputed.
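• A minimal Python/NumPy classifier for this $\boldsymbol{\Sigma}_i = \sigma^{2}\mathbf{I}$ case (the function name, means, priors, and test point are illustrative):

```python
import numpy as np

def classify(x, means, priors, var):
    # g_i(x) = -||x - mu_i||^2 / (2*sigma^2) + ln P(omega_i); the shared x^t x
    # term cancels in the comparison, so it can be ignored.
    g = [-np.sum((x - m) ** 2) / (2 * var) + np.log(p)
         for m, p in zip(means, priors)]
    return int(np.argmax(g))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(classify(np.array([1.0, 1.2]), means, priors=[0.7, 0.3], var=1.0))  # -> 0
```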
Linear Machines
• We can use an equivalent linear discriminant function (verified numerically in the sketch at the end of this slide):
  $g_i(\mathbf{x}) = \mathbf{w}_i^{t}\mathbf{x} + w_{i0}$
  $\mathbf{w}_i = \dfrac{1}{\sigma^{2}}\,\boldsymbol{\mu}_i \qquad w_{i0} = -\dfrac{1}{2\sigma^{2}}\,\boldsymbol{\mu}_i^{t}\boldsymbol{\mu}_i + \ln P(\omega_i)$
• wi0 is called the threshold or bias for the ith category.
• A classifier that uses linear discriminant functions is called a linear machine.
• The decision surfaces are defined by the equation:
  $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$
  $-\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^{2}}{2\sigma^{2}} + \ln P(\omega_i) + \dfrac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^{2}}{2\sigma^{2}} - \ln P(\omega_j) = 0$
  $\|\mathbf{x} - \boldsymbol{\mu}_j\|^{2} - \|\mathbf{x} - \boldsymbol{\mu}_i\|^{2} = 2\sigma^{2}\ln\dfrac{P(\omega_j)}{P(\omega_i)}$
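• A short Python/NumPy check that the linear form $\mathbf{w}_i^{t}\mathbf{x} + w_{i0}$ matches the earlier discriminant up to a term shared by all classes (all parameter values are illustrative):

```python
import numpy as np

var = 1.0
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.7, 0.3]
x = np.array([1.0, 1.2])

for mu_i, p_i in zip(means, priors):
    w_i  = mu_i / var                                  # w_i = mu_i / sigma^2
    w_i0 = -mu_i @ mu_i / (2 * var) + np.log(p_i)      # threshold (bias) term
    linear    = w_i @ x + w_i0
    quadratic = -np.sum((x - mu_i) ** 2) / (2 * var) + np.log(p_i)
    print(linear, quadratic, linear - quadratic)       # differ only by x^t x / (2*sigma^2)
```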
Threshold Decoding
• This has a simple geometric interpretation:
  $\|\mathbf{x} - \boldsymbol{\mu}_j\|^{2} - \|\mathbf{x} - \boldsymbol{\mu}_i\|^{2} = 2\sigma^{2}\ln\dfrac{P(\omega_j)}{P(\omega_i)}$
• When the priors are equal and the support regions are spherical, the decision
  boundary lies halfway between the means (i.e., at equal Euclidean distance from each).
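• A quick Python/NumPy check of this interpretation (means, variance, and priors are illustrative): the boundary expression vanishes at the midpoint only when the priors are equal.

```python
import numpy as np

mu_i, mu_j, var = np.array([0.0, 0.0]), np.array([4.0, 0.0]), 1.0

def boundary_gap(x, p_i, p_j):
    # ||x - mu_j||^2 - ||x - mu_i||^2 - 2*sigma^2*ln[P(w_j)/P(w_i)]:
    # zero exactly on the decision boundary
    return (np.sum((x - mu_j) ** 2) - np.sum((x - mu_i) ** 2)
            - 2 * var * np.log(p_j / p_i))

midpoint = (mu_i + mu_j) / 2
print(boundary_gap(midpoint, 0.5, 0.5))   # 0.0: equal priors put the boundary at the midpoint
print(boundary_gap(midpoint, 0.8, 0.2))   # != 0: unequal priors shift the boundary toward mu_j
```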
Summary
• Decision Surfaces: geometric interpretation of a Bayesian classifier.
• Gaussian Distributions: how is the shape of the distribution influenced by the
mean and covariance?
• Bayesian classifiers for Gaussian distributions: how does the decision
surface change as a function of the mean and covariance?