Bayesian Classification
CS 650: Computer Vision
Bryan S. Morse
BYU Computer Science
Statistical Basis
Training: Class-Conditional Probabilities
- Suppose that we measure features for a large training set taken from class ωi.
- Each of these training patterns has a different value x for the features. This can be written as the class-conditional probability:
    p(x|ωi)
In other words: how often do things in class ωi exhibit features x?
Classification
When we classify, we measure the feature vector x, then we ask this question:
  “Given that this has features x, what is the probability that it belongs to class ωi?”
Mathematically, this is written as
  P(ωi|x)
Why We Care About Conditional Probabilities
- Training gives us p(x|ωi).
- But we want P(ωi|x).
These are not the same! How are they related?
Bayes’ Theorem (Revisited)
Generally:
  P(A|B) = P(B|A) P(A) / P(B)
For our purposes:
  P(ωi|x) = p(x|ωi) P(ωi) / p(x)
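As a quick sanity check, here is a tiny numeric example of the formula with made-up numbers; it also uses the fact that p(x) = Σi p(x|ωi) P(ωi):

```python
# Made-up two-class example of P(wi|x) = p(x|wi) P(wi) / p(x).
likelihoods = [0.60, 0.10]   # p(x|w1), p(x|w2) for the measured x
priors      = [0.30, 0.70]   # P(w1), P(w2)

evidence = sum(l * p for l, p in zip(likelihoods, priors))     # p(x) = sum_i p(x|wi) P(wi)
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
print(posteriors)   # [0.72, 0.28]: class w1 wins despite its smaller prior
```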
Definitions
p(x|ωi)  : class-conditional probability, or likelihood
P(ωi)    : a priori (prior) probability
p(x)     : evidence (usually ignored, since it is the same for every class)
P(ωi|x)  : measurement-conditioned (posterior) probability
The Bayesian Classifier
Structure of a Bayesian Classifier
Training:
  Measure p(x|ωi) for each class.
Prior Knowledge:
  Measure or estimate P(ωi) in the general population.
  (Can sometimes aggregate the training set if it is a reasonable sampling of the population.)
Classification:
  1. Measure the feature vector x for the new pattern.
  2. Calculate posterior probabilities P(ωi|x) for each class.
  3. Choose the class with the largest posterior P(ωi|x).
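Putting the three pieces together, here is a minimal sketch in Python. The function names and the choice of a 1-D Gaussian likelihood (which the next slide introduces) are my own illustration, not part of the slides:

```python
import numpy as np

# Training: estimate p(x|wi) for each class, here as a 1-D Gaussian fit to that class's samples.
def train(class_samples):
    return [(np.mean(s), np.std(s)) for s in class_samples]

def likelihood(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Classification: measure x, compute each posterior up to the common factor p(x), pick the largest.
def classify(x, params, priors):
    posteriors = [likelihood(x, mu, sigma) * prior
                  for (mu, sigma), prior in zip(params, priors)]
    return int(np.argmax(posteriors))

# Hypothetical usage with made-up training sets and equal priors.
params = train([[1.1, 0.9, 1.3, 0.8], [3.9, 4.2, 3.7, 4.1]])
print(classify(2.0, params, priors=[0.5, 0.5]))
```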
Example
Normally distributed class-conditional probabilities:
  p(x|ωi) = (1 / (√(2π) σi)) exp(−½ (x − µi)² / σi²)
From Probabilities to Discriminants: 1-D Case
Want to maximize
  P(ωi|x) = p(x|ωi) P(ωi) / p(x)
which is the same as maximizing
  p(x|ωi) P(ωi)
which for a normal distribution is
  (1 / (√(2π) σi)) exp(−½ (x − µi)² / σi²) P(ωi)
Applying the logarithm:
  log(1/√(2π)) − log σi − ½ (x − µi)² / σi² + log P(ωi)
Dropping the constant log(1/√(2π)):
  log P(ωi) − log σi − ½ (x − µi)² / σi²
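A minimal numeric sketch of this 1-D discriminant. The function and variable names are mine, and the class parameters and priors are assumed to have been estimated already:

```python
import numpy as np

def discriminant_1d(x, mu, sigma, prior):
    """g_i(x) = log P(wi) - log sigma_i - 1/2 (x - mu_i)^2 / sigma_i^2"""
    return np.log(prior) - np.log(sigma) - 0.5 * (x - mu) ** 2 / sigma ** 2

# Hypothetical two-class example: assign x to the class with the larger discriminant.
params = [(2.0, 1.0, 0.7), (5.0, 2.0, 0.3)]   # (mu_i, sigma_i, P(wi)) per class
x = 3.2
print(int(np.argmax([discriminant_1d(x, mu, s, p) for mu, s, p in params])))
```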
Extending to Multiple Features
- Note that the key term for a 1-D normal distribution is
    (x − µi)² / σi²
  the squared distance from the mean, measured in standard deviations.
- Can extend to multiple features by simply normalizing each feature’s “distance” by the respective standard deviation, then just use minimum-distance classification (remembering to use the priors as well); see the sketch below.
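Here is one way that could look in code. This is a sketch with names of my own choosing, and it keeps the per-class −log σ terms from the 1-D discriminant rather than using the normalized distance alone:

```python
import numpy as np

def normalized_sq_distance(x, mu, sigma):
    """Squared distance from the class mean, with each feature scaled by its own std. dev."""
    return np.sum(((x - mu) / sigma) ** 2)

def classify(x, means, stds, priors):
    # Smaller normalized distance <=> larger discriminant; priors enter as +log P(wi).
    scores = [np.log(p) - np.sum(np.log(s)) - 0.5 * normalized_sq_distance(x, m, s)
              for m, s, p in zip(means, stds, priors)]
    return int(np.argmax(scores))

# Hypothetical per-class parameters for two features.
means = [np.array([2.0, 0.5]), np.array([5.0, 1.5])]
stds  = [np.array([1.0, 0.2]), np.array([2.0, 0.4])]
print(classify(np.array([3.0, 0.7]), means, stds, priors=[0.6, 0.4]))
```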
Extending to Multiple Features
- Some call normalizing each feature by its variance naive Bayes.
- So what’s naive about it?
- It ignores relationships between features.
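In other words, the features are treated as independent, so the joint log-likelihood is just a sum of per-feature 1-D Gaussian terms (equivalently, a multivariate Gaussian with a diagonal covariance matrix). A small sketch of that assumption, with names of my own choosing:

```python
import numpy as np

def log_gauss_1d(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def naive_log_likelihood(x, mu, sigma):
    """Features assumed independent: log p(x|wi) is the sum of per-feature 1-D terms,
    i.e. a multivariate Gaussian whose covariance matrix is diagonal."""
    return np.sum(log_gauss_1d(np.asarray(x), np.asarray(mu), np.asarray(sigma)))

print(naive_log_likelihood([3.0, 0.7], [2.0, 0.5], [1.0, 0.2]))
```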
Multivariate Normal Distributions
The Multivariate Normal Distribution
In multiple dimensions, the normal distribution takes on the following form:
  p(x) = (1 / ((√(2π))^d |C|^(1/2))) exp(−½ (x − m)ᵀ C⁻¹ (x − m))
       = (2π)^(−d/2) |C|^(−1/2) exp(−½ (x − m)ᵀ C⁻¹ (x − m))
where m is the mean vector and C is the d × d covariance matrix.
[See examples in Mathematica]
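A direct sketch of this density in Python. This is a naive implementation for illustration; a numerically careful version would use a Cholesky factorization instead of an explicit inverse and determinant:

```python
import numpy as np

def mvn_pdf(x, m, C):
    """(2*pi)^(-d/2) |C|^(-1/2) exp(-1/2 (x - m)^T C^{-1} (x - m))"""
    d = len(m)
    diff = x - m
    exponent = -0.5 * diff @ np.linalg.inv(C) @ diff
    return (2 * np.pi) ** (-d / 2) * np.linalg.det(C) ** (-0.5) * np.exp(exponent)

# Hypothetical 2-D example.
m = np.array([0.0, 0.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -0.5]), m, C))
```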
Multivariate Normal Bayesian Classification
For multiple classes, each class ωi has its own
- mean vector mi
- covariance matrix Ci
The class-conditional probabilities are
  p(x|ωi) = (2π)^(−d/2) |Ci|^(−1/2) exp(−½ (x − mi)ᵀ Ci⁻¹ (x − mi))
From Probabilities to Discriminants
Want to maximize
  P(ωi|x) = p(x|ωi) P(ωi) / p(x)
so maximize
  p(x|ωi) P(ωi)
so maximize
  log p(x|ωi) + log P(ωi)
which for a normal distribution is
  −(d/2) log 2π − ½ log |Ci| − ½ (x − mi)ᵀ Ci⁻¹ (x − mi) + log P(ωi)
Dropping the constant −(d/2) log 2π, maximize
  log P(ωi) − ½ log |Ci| − ½ (x − mi)ᵀ Ci⁻¹ (x − mi)
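This discriminant translates almost line-for-line into code. A sketch with names of my own choosing, using np.linalg.slogdet for the log-determinant:

```python
import numpy as np

def discriminant(x, m, C, prior):
    """g_i(x) = log P(wi) - 1/2 log|C_i| - 1/2 (x - m_i)^T C_i^{-1} (x - m_i)"""
    diff = x - m
    _, logdet = np.linalg.slogdet(C)
    return np.log(prior) - 0.5 * logdet - 0.5 * diff @ np.linalg.inv(C) @ diff

def classify(x, means, covs, priors):
    # Assign x to the class whose discriminant is largest.
    return int(np.argmax([discriminant(x, m, C, p)
                          for m, C, p in zip(means, covs, priors)]))
```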
Mahalanobis Distance
- The expression
    (x − mi)ᵀ C⁻¹ (x − mi)
  can be thought of as
    ‖x − mi‖²_{C⁻¹}
- This looks like squared distance, but the inverse covariance matrix C⁻¹ acts like a metric (stretching factor) on the space.
- This is the (squared) Mahalanobis distance.
- Pattern recognition using multivariate normal distributions is simply a minimum (Mahalanobis) distance classifier.
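A small sketch of the squared Mahalanobis distance, showing how the covariance stretches the space (the numbers are made up and the names are mine):

```python
import numpy as np

def mahalanobis_sq(x, m, C_inv):
    """Squared Mahalanobis distance (x - m)^T C^{-1} (x - m)."""
    diff = x - m
    return diff @ C_inv @ diff

# Two points equally far from the mean in Euclidean terms, but not in Mahalanobis terms:
C = np.array([[4.0, 0.0],
              [0.0, 0.25]])
C_inv = np.linalg.inv(C)
m = np.zeros(2)
print(mahalanobis_sq(np.array([2.0, 0.0]), m, C_inv))   # 1.0  (along the high-variance axis)
print(mahalanobis_sq(np.array([0.0, 2.0]), m, C_inv))   # 16.0 (along the low-variance axis)
```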
Different Cases
Case 1: Identity Matrix
Suppose that the covariance matrix for all classes is the identity matrix I (or a scaled identity):
  Ci = I  or  Ci = σ²I
The discriminant becomes
  gi(x) = −½ (x − mi)ᵀ (x − mi) + log P(ωi)
Assuming all classes ωi are a priori equally likely,
  gi(x) = −½ (x − mi)ᵀ (x − mi)
Ignoring the constant ½, we can use
  gi(x) = −(x − mi)ᵀ (x − mi)
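As a sketch, Case 1 reduces to a nearest-mean (minimum Euclidean distance) classifier once the priors are equal (function and variable names are mine):

```python
import numpy as np

def classify_identity_cov(x, means, priors=None):
    """Case 1 (C_i = I): g_i(x) = -1/2 ||x - m_i||^2 + log P(wi);
    with equal priors this is just nearest-mean classification."""
    scores = []
    for i, m in enumerate(means):
        g = -0.5 * np.sum((x - m) ** 2)
        if priors is not None:
            g += np.log(priors[i])
        scores.append(g)
    return int(np.argmax(scores))

# Hypothetical class means in 2-D; with no priors given, this picks the nearest mean.
print(classify_identity_cov(np.array([1.0, 1.0]),
                            [np.array([0.0, 0.0]), np.array([3.0, 3.0])]))
```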
Example: Equal Priors
Examples: Different Priors
Case 2: Same Covariance Matrix
- If each class has the same covariance matrix C,
    gi(x) = −½ (x − mi)ᵀ C⁻¹ (x − mi) + log P(ωi)
- Loci of constant probability are hyperellipsoids oriented with the eigenvectors of C:
    eigenvectors: directions of the ellipsoid axes
    eigenvalues: variance (squared axis length) in the axis directions
- The decision boundaries are still hyperplanes, though they may no longer be normal to the lines between the respective class means; see the sketch below.
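The reason the boundaries are hyperplanes: expanding the shared-covariance discriminant, the quadratic term −½ xᵀC⁻¹x is the same for every class and cancels when classes are compared, leaving a linear function of x. A sketch with names of my own choosing:

```python
import numpy as np

def linear_discriminant_params(m, C_inv, prior):
    """With a shared C, g_i(x) reduces (up to a class-independent term) to w_i^T x + b_i,
    with w_i = C^{-1} m_i and b_i = -1/2 m_i^T C^{-1} m_i + log P(wi)."""
    w = C_inv @ m
    b = -0.5 * m @ C_inv @ m + np.log(prior)
    return w, b

# Hypothetical two-class example in 2-D: compare the two linear discriminants at a point.
C_inv = np.linalg.inv(np.array([[2.0, 0.6],
                                [0.6, 1.0]]))
w0, b0 = linear_discriminant_params(np.array([0.0, 0.0]), C_inv, 0.5)
w1, b1 = linear_discriminant_params(np.array([3.0, 1.0]), C_inv, 0.5)
x = np.array([1.0, 0.5])
print("class:", int(w1 @ x + b1 > w0 @ x + b0))
```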
Examples
Case 3: Different Covariances for Each Class
- Suppose that each class has its own arbitrary covariance matrix (the most general case):
    Ci ≠ Cj
- Loci of constant probability for each class are hyperellipsoids oriented with the eigenvectors of Ci for that class.
- Decision boundaries are quadratic, specifically hyperellipsoids or hyperhyperboloids.
[See examples in Mathematica]
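An end-to-end sketch of this most general case on synthetic data: estimate each class’s mean and covariance from training samples, then classify with the full quadratic discriminant. All names and numbers are mine, and quadratic_discriminant repeats the discriminant derived earlier:

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate a class's mean vector and covariance matrix from its training samples."""
    samples = np.asarray(samples)
    return samples.mean(axis=0), np.cov(samples, rowvar=False)

def quadratic_discriminant(x, m, C, prior):
    diff = x - m
    _, logdet = np.linalg.slogdet(C)
    return np.log(prior) - 0.5 * logdet - 0.5 * diff @ np.linalg.inv(C) @ diff

# Synthetic 2-D data: two classes with clearly different covariances.
rng = np.random.default_rng(0)
class0 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(200, 2))
class1 = rng.normal([3.0, 1.0], [0.4, 2.0], size=(200, 2))
params = [fit_gaussian(class0) + (0.5,), fit_gaussian(class1) + (0.5,)]

x = np.array([2.0, 0.0])
print("class:", int(np.argmax([quadratic_discriminant(x, m, C, p) for m, C, p in params])))
```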
Examples: 2-D
Examples: 3-D
Example: Multiple Classes