Pattern
Classification
All materials in these slides were taken
from
Pattern Classification (2nd ed) by R. O.
Duda, P. E. Hart and D. G. Stork, John Wiley
& Sons, 2000
with the permission of the authors and
the publisher
Chapter 3:
Maximum-Likelihood and Bayesian
Parameter Estimation (part 2)
Bayesian Estimation (BE)
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General
Estimation
Problems of Dimensionality
Computational Complexity
Component Analysis and Discriminants
Hidden Markov Models
Bayesian Estimation (Bayesian learning applied
to pattern classification problems)
In MLE, θ was assumed to be fixed
In BE, θ is a random variable
The computation of the posterior probabilities
P(ωi | x) lies at the heart of Bayesian
classification
Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be
written
P(ωi | x, D) = p(x | ωi, D) P(ωi | D) / Σj=1..c p(x | ωj, D) P(ωj | D)
To demonstrate the preceding equation, use:
P(x, D | ωi) = P(x | D, ωi) P(D | ωi)   (from def. of cond. prob.)
P(x | D) = Σj P(x, ωj | D)   (from law of total prob.)
P(ωi) = P(ωi | D)   (the training samples provide this!)
We assume that samples in different classes are independent
Thus:
P(ωi | x, D) = p(x | ωi, Di) P(ωi) / Σj=1..c p(x | ωj, Dj) P(ωj)
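As a minimal sketch (not from the text), the class posterior above amounts to normalizing the per-class terms p(x | ωi, Di) P(ωi); the function name and example numbers below are hypothetical:

```python
import numpy as np

# P(w_i | x, D) = p(x | w_i, D_i) P(w_i), normalized over the c classes
def class_posteriors(class_cond_densities, priors):
    unnormalized = np.asarray(class_cond_densities) * np.asarray(priors)
    return unnormalized / unnormalized.sum()

# Hypothetical two-class example: densities already evaluated at some x
print(class_posteriors([0.30, 0.05], [0.5, 0.5]))  # -> [0.857..., 0.142...]
```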
Bayesian Parameter Estimation: Gaussian
Case
Goal: Estimate μ using the a-posteriori density
P(μ | D)
The univariate case: P(μ | D)
μ is the only unknown parameter
P(x | μ) ~ N(μ, σ²)
P(μ) ~ N(μ0, σ0²)
(μ0 and σ0² are known!)
Bayes formula:
P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ
         = α ∏k=1..n P(xk | μ) P(μ)   (α: normalizing factor independent of μ)
But we know
p(xk | μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²)
Plugging in their Gaussian expressions and
extracting out factors not depending on μ yields:
p(μ | D) = α′ exp{ −½ [ (n/σ² + 1/σ0²) μ² − 2 ( (1/σ²) Σk=1..n xk + μ0/σ0² ) μ ] }
(from eq. 29, page 93)
Observation: p(μ | D) is an exponential of a
quadratic function of μ
p(μ | D) = α′ exp{ −½ [ (n/σ² + 1/σ0²) μ² − 2 ( (1/σ²) Σk=1..n xk + μ0/σ0² ) μ ] }
P(μ | D) ~ N(μn, σn²)
It is again normal! It is called a reproducing density
p(μ | D) = α′ exp{ −½ [ (n/σ² + 1/σ0²) μ² − 2 ( (1/σ²) Σk=1..n xk + μ0/σ0² ) μ ] }
Identifying coefficients in the top equation with those of
the generic Gaussian
p(μ | D) = (1 / (√(2π) σn)) exp[ −½ ((μ − μn) / σn)² ]
yields expressions for μn and σn²:
1/σn² = n/σ² + 1/σ0²   and   μn/σn² = (n/σ²) μ̂n + μ0/σ0²
(where μ̂n is the sample mean of the n observations)
Solving for μn and σn² yields:
μn = ( nσ0² / (nσ0² + σ²) ) μ̂n + ( σ² / (nσ0² + σ²) ) μ0
and
σn² = σ0² σ² / (nσ0² + σ²)
From these equations we see that as n increases:
the variance σn² decreases monotonically
the estimate p(μ | D) becomes more and more peaked
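A minimal numerical sketch of these update equations, assuming NumPy; the data and prior parameters below are hypothetical and the variable names are illustrative only:

```python
import numpy as np

def posterior_params(x, mu0, sigma0_sq, sigma_sq):
    """Posterior mean and variance of mu for the univariate Gaussian case."""
    n = len(x)
    sample_mean = np.mean(x)                       # mu_hat_n
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * sample_mean + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

# As n grows, sigma_n_sq shrinks and p(mu | D) peaks around mu_n
x = np.random.normal(loc=2.0, scale=1.0, size=50)
print(posterior_params(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0))
```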
The univariate case P(x | D)
P(μ | D) computed (in preceding discussion)
P(x | D) remains to be computed!
P(x | D) = ∫ P(x | μ) P(μ | D) dμ is Gaussian
It provides:
P(x | D) ~ N(μn, σ² + σn²)
We know μn and σn² and how to compute them
(the desired class-conditional density P(x | Dj, ωj))
Therefore: using P(x | Dj, ωj) together with P(ωj)
And using Bayes formula, we obtain the
Bayesian classification rule:
maxj P(ωj | x, D) ⟺ maxj P(x | ωj, Dj) P(ωj)
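Building on the previous sketch, a hedged illustration of the predictive density N(μn, σ² + σn²) and the resulting decision rule; SciPy is assumed and the helper names and example parameters are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def predictive_pdf(x, mu_n, sigma_n_sq, sigma_sq):
    """p(x | D) = N(mu_n, sigma^2 + sigma_n^2) for the univariate Gaussian case."""
    return norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma_sq + sigma_n_sq))

def classify(x, class_params, priors):
    """class_params[j] = (mu_n, sigma_n_sq, sigma_sq) for class j; pick the argmax."""
    scores = [predictive_pdf(x, *p) * prior for p, prior in zip(class_params, priors)]
    return int(np.argmax(scores))

# Hypothetical two-class setup with per-class posterior parameters
params = [(0.0, 0.1, 1.0), (2.0, 0.2, 1.0)]
print(classify(1.5, params, priors=[0.5, 0.5]))   # -> 1
```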
3.5 Bayesian Parameter Estimation: General
Theory
P(x | D) computation can be applied to any situation
in which the unknown density can be
parameterized; the basic assumptions are:
The form of P(x | θ) is assumed known, but the value of
θ is not known exactly
Our knowledge about θ is assumed to be contained in a
known prior density P(θ)
The rest of our knowledge about θ is contained in a set D of n
samples x1, x2, …, xn drawn independently according to P(x)
The basic problem is:
“Compute the posterior density P(θ | D)”
then “Derive P(x | D)”, where
p(x | D) = ∫ p(x | θ) p(θ | D) dθ
Using Bayes formula, we have:
P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ
And by the independence assumption:
p(D | θ) = ∏k=1..n p(xk | θ)
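A rough numerical illustration of this recipe for a one-dimensional θ (the unknown mean of a Gaussian with known σ = 1), approximating the integrals on a grid; the prior, data, and grid below are made up for the sketch:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: unknown mean theta, known sigma = 1, Gaussian prior on a grid
thetas = np.linspace(-5.0, 5.0, 1001)
dtheta = thetas[1] - thetas[0]
prior = norm.pdf(thetas, loc=0.0, scale=2.0)           # P(theta)
data = np.array([1.2, 0.7, 1.9, 1.4])                  # the sample D

# p(D | theta) = prod_k p(x_k | theta)  (independence assumption)
log_lik = norm.logpdf(data[:, None], loc=thetas[None, :], scale=1.0).sum(axis=0)

# P(theta | D) proportional to p(D | theta) P(theta), normalized over the grid
post = prior * np.exp(log_lik - log_lik.max())
post /= post.sum() * dtheta

# p(x | D) = integral of p(x | theta) P(theta | D) d(theta), at one point x
x = 1.0
p_x_given_D = (norm.pdf(x, loc=thetas, scale=1.0) * post).sum() * dtheta
print(p_x_given_D)
```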
Convergence (from notes)
Problems of Dimensionality
Problems involving 50 or 100 features are
common (usually binary valued)
Note: microarray data might entail ~20000
real-valued features
Classification accuracy depends on:
dimensionality
amount of training data
discrete vs. continuous features
Case of two-class multivariate normal
with the same covariance
P(x | ωj) ~ N(μj, Σ), j = 1, 2
Statistically independent features
If the priors are equal then:
P(error) = (1/√(2π)) ∫ from r/2 to ∞ of e^(−u²/2) du   (Bayes error)
where r² = (μ1 − μ2)ᵗ Σ⁻¹ (μ1 − μ2) is the squared Mahalanobis distance
lim as r → ∞ of P(error) = 0
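A small sketch, assuming equal priors and NumPy/SciPy, that evaluates r² and the corresponding Bayes error from the formula above; the example means and covariance are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def bayes_error(mu1, mu2, cov):
    """Equal priors: P(error) = area of the standard normal above r/2."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    r_sq = diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)  # squared Mahalanobis distance
    return norm.sf(np.sqrt(r_sq) / 2.0)

# Hypothetical example: r = 2 gives P(error) = 1 - Phi(1) ~ 0.159; error -> 0 as r -> infinity
print(bayes_error([0.0, 0.0], [2.0, 0.0], np.eye(2)))
```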
If features are conditionally independent then:
Σ = diag(σ1², σ2², …, σd²)
r² = Σi=1..d ( (μi1 − μi2) / σi )²
Do we remember what conditional independence is?
Example for binary features:
Let pi = Pr[xi = 1 | ω1]; then P(x | ω1) is the product over i of the per-feature probabilities P(xi | ω1)
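For the conditionally independent (diagonal Σ) case above, r² is a one-line computation; the sketch and values below are illustrative only:

```python
import numpy as np

def r_squared_independent(mu1, mu2, sigmas):
    """r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2 when Sigma = diag(sigma_i^2)."""
    d = (np.asarray(mu1) - np.asarray(mu2)) / np.asarray(sigmas)
    return float(np.sum(d ** 2))

print(r_squared_independent([1.0, 0.0, 3.0], [0.0, 0.0, 1.0], [1.0, 2.0, 2.0]))  # -> 2.0
```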
Most useful features are the ones for which the difference between
the means is large relative to the standard deviation
Doesn’t require independence
Adding independent features helps increase r and thus reduce error
Caution: adding features increases cost & complexity of feature
extractor and classifier
It has frequently been observed in practice that, beyond a certain
point, the inclusion of additional features leads to worse rather
than better performance:
we have the wrong model !
we don’t have enough training data to support the additional
dimensions
Computational Complexity
Our design methodology is affected by the
computational difficulty
“big oh” notation
f(x) = O(h(x)): “big oh of h(x)”
If ∃ (c0, x0) ∈ ℝ² such that |f(x)| ≤ c0 |h(x)| for all x > x0
(An upper bound on f(x) grows no worse than h(x) for
sufficiently large x!)
f(x) = 2 + 3x + 4x²
g(x) = x²
f(x) = O(x²)
“big oh” is not unique!
f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)
“big theta” notation
f(x) = Θ(h(x))
If ∃ (x0, c1, c2) ∈ ℝ³ such that for all x > x0: 0 ≤ c1 h(x) ≤ f(x) ≤ c2 h(x)
f(x) = Θ(x²) but f(x) ≠ Θ(x³)
Complexity of the ML Estimation
Gaussian priors in d dimensions, with n training
samples for each of c classes
For each category, we have to compute the
discriminant function
g(x) = −½ (x − μ̂)ᵗ Σ̂⁻¹ (x − μ̂) − (d/2) ln 2π − ½ ln |Σ̂| + ln P(ω)
Cost of the plug-in estimates from the n samples: μ̂ is O(dn), Σ̂ is O(d²n), P(ω) is O(n), the constant term is O(1)
Total = O(d²n)
Total for c classes = O(cd²n) = O(d²n)
Costs increase when d and n are large!
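A sketch of the plug-in discriminant g(x) with the dominant costs noted in comments; it assumes NumPy, n > d samples per class so that Σ̂ is invertible, and hypothetical inputs:

```python
import numpy as np

def discriminant(x, samples, prior):
    """Plug-in Gaussian discriminant for one class from its n x d training matrix."""
    mu_hat = samples.mean(axis=0)                 # O(dn)
    sigma_hat = np.cov(samples, rowvar=False)     # O(d^2 n) -- the dominant training cost
    diff = x - mu_hat
    d = samples.shape[1]
    return (-0.5 * diff @ np.linalg.solve(sigma_hat, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(sigma_hat)[1]
            + np.log(prior))

samples = np.random.randn(100, 3) + np.array([1.0, 0.0, -1.0])
print(discriminant(np.zeros(3), samples, prior=0.5))
```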
Overfitting
Dimensionality of model vs size of training data
Issue: not enough data to support the model
Possible solutions:
Reduce model dimensionality
Make (possibly incorrect) assumptions to better estimate Σ
Overfitting
Estimate Σ better:
use data pooled from all classes (normalization issues)
use the pseudo-Bayesian form λΣ0 + (1 − λ)Σ̂n
“doctor” Σ̂ by thresholding entries (reduces chance correlations)
assume statistical independence (zero all off-diagonal elements)
Shrinkage
Shrinkage: weighted combination of common and
individual covariances
Σi(α) = [ (1 − α) ni Σi + α n Σ ] / [ (1 − α) ni + α n ]   for 0 < α < 1
We can also shrink the estimated common covariance
toward the identity matrix:
Σ(β) = (1 − β) Σ + β I   for 0 < β < 1
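Both shrinkage estimators translate directly into code; a minimal sketch assuming NumPy, with hypothetical function and argument names:

```python
import numpy as np

def shrink_class_cov(cov_i, n_i, cov_common, n, alpha):
    """Sigma_i(alpha) = ((1-alpha) n_i Sigma_i + alpha n Sigma) / ((1-alpha) n_i + alpha n)."""
    return (((1 - alpha) * n_i * cov_i + alpha * n * cov_common)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_to_identity(cov, beta):
    """Sigma(beta) = (1 - beta) Sigma + beta I."""
    return (1 - beta) * cov + beta * np.eye(cov.shape[0])

print(shrink_to_identity(np.array([[2.0, 0.9], [0.9, 1.0]]), beta=0.5))
```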
Component Analysis and Discriminants
Combine features in order to reduce the
dimension of the feature space
Linear combinations are simple to compute and
tractable
Project high dimensional data onto a lower
dimensional space
Two classical approaches for finding “optimal”
linear transformation
PCA (Principal Component Analysis): “Projection that
best represents the data in a least-squares sense”
MDA (Multiple Discriminant Analysis): “Projection that
best separates the data in a least-squares sense”
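A minimal PCA sketch via the eigenvectors of the sample covariance matrix, one standard way to realize the least-squares-best linear projection; the names and random data are illustrative only:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k directions with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                        # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k principal directions
    return Xc @ W

X = np.random.randn(200, 5)
print(pca_project(X, 2).shape)                     # (200, 2)
```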
PCA (from notes)
Hidden Markov Models:
Markov Chains
Goal: make a sequence of decisions
Processes that unfold in time: states at time t are
influenced by the state at time t − 1
Applications: speech recognition, gesture recognition,
parts-of-speech tagging, DNA sequencing
Any temporal process without memory
ω^T = {ω(1), ω(2), ω(3), …, ω(T)}: a sequence of states
We might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}
The system can revisit a state at different steps, and not
every state need be visited
28
First-order Markov models
The production of any sequence is described by the
transition probabilities
P(ωj(t + 1) | ωi(t)) = aij
θ = (aij, ω^T)
P(ω^T | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)
Example: speech recognition
“production of spoken words”
Production of the word: “pattern” represented
by phonemes
/p/ /a/ /tt/ /er/ /n/ // ( // = silent state)
Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/,
/er/ to /n/, and /n/ to a silent state
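A small sketch of a first-order Markov chain, assuming a hypothetical 4-state transition matrix and uniform initial probabilities, that evaluates the probability of the example sequence ω^6 above (states 0-indexed in code):

```python
import numpy as np

# Hypothetical transition matrix: A[i, j] = a_ij = P(w_j at t+1 | w_i at t), rows sum to 1
A = np.array([[0.2, 0.3, 0.1, 0.4],
              [0.4, 0.1, 0.3, 0.2],
              [0.1, 0.5, 0.2, 0.2],
              [0.3, 0.3, 0.3, 0.1]])

def sequence_probability(states, A, p_initial):
    """P(w^T | theta) = P(w(1)) * product over t of a_{w(t), w(t+1)}."""
    prob = p_initial[states[0]]
    for s, s_next in zip(states[:-1], states[1:]):
        prob *= A[s, s_next]
    return prob

# The sequence w^6 = {w1, w4, w2, w2, w1, w4} from the earlier slide
print(sequence_probability([0, 3, 1, 1, 0, 3], A, p_initial=np.full(4, 0.25)))
```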