Pattern Classification
All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O.
Duda, P. E. Hart and D. G. Stork, John Wiley
& Sons, 2000
with the permission of the authors and the
publisher
Chapter 2 (Part 1):
Bayesian Decision Theory
(Sections 1-4)
1. Introduction – Bayesian Decision Theory
   • Pure statistics, probabilities known, optimal decision
2. Bayesian Decision Theory – Continuous Features
3. Minimum-Error-Rate Classification
4. Classifiers, Discriminant Functions and Decision Surfaces
1. Introduction
• The sea bass/salmon example
• State of nature, prior
• State of nature is a random variable
• The catch of salmon and sea bass is equiprobable
• P(ω1) = P(ω2) (uniform priors)
• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
• Decision rule with only the prior information
• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
• Use of the class-conditional information
• P(x | ω1) and P(x | ω2) describe the difference in lightness between
  populations of sea bass and salmon
• Posterior, likelihood, evidence
• P(ωj | x) = P(x | ωj) P(ωj) / P(x)        (Bayes rule)
• where, in the case of two categories, the evidence is

      P(x) = Σ P(x | ωj) P(ωj),   summed over j = 1, 2

• Posterior = (Likelihood × Prior) / Evidence
• Decision given the posterior probabilities

  x is an observation for which:
      if P(ω1 | x) > P(ω2 | x)    the true state of nature is ω1
      if P(ω1 | x) < P(ω2 | x)    the true state of nature is ω2

  Therefore, whenever we observe a particular x, the probability of error is:
      P(error | x) = P(ω1 | x)  if we decide ω2
      P(error | x) = P(ω2 | x)  if we decide ω1
• Minimizing the probability of error
• Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2

  Therefore:
      P(error | x) = min [ P(ω1 | x), P(ω2 | x) ]    (Bayes decision)
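A minimal sketch of this rule in Python, assuming made-up likelihood and prior values at a single observed x; it computes the posteriors via Bayes rule, decides the class with the larger posterior, and reports P(error | x).

```python
import numpy as np

# Illustrative numbers only (not from the slides):
priors = np.array([0.5, 0.5])          # P(w1), P(w2) -- uniform priors
likelihoods = np.array([0.30, 0.10])   # P(x | w1), P(x | w2) at the observed x

evidence = np.sum(likelihoods * priors)        # P(x)
posteriors = likelihoods * priors / evidence   # P(wj | x), Bayes rule

decision = np.argmax(posteriors)               # decide the class with max posterior
p_error = posteriors.min()                     # P(error | x) = min posterior
print(posteriors, decision + 1, p_error)
```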
2. Bayesian Decision Theory –
Continuous Features
• Generalization of the preceding ideas
  1. Use of more than one feature
  2. Use of more than two states of nature
  3. Allowing actions other than deciding the state of nature
  4. Introducing a loss function more general than the probability of error
• Allowing actions other than classification primarily
allows the possibility of rejection
• Refusing to make a decision in close or bad cases!
• The loss function states how costly each action
taken is
Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories")

Let {α1, α2, …, αa} be the set of a possible actions

Let λ(αi | ωj) be the loss incurred for taking action αi when the state of
nature is ωj
Overall risk

      R = the expected loss for a given decision rule α(x), i.e. the integral
          of R(α(x) | x) over all x, weighted by p(x)

Conditional risk

      R(αi | x) = Σ λ(αi | ωj) P(ωj | x)    (summed over j = 1, …, c),
      for i = 1, …, a

Minimizing R: for every x, minimize the conditional risk, i.e. choose among
R(α1 | x), …, R(αa | x) the smallest value
Select the action αi for which R(αi | x) is minimum.

The resulting overall risk R is then minimum; in this case R is called the
Bayes risk = best performance that can be achieved!
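A short sketch of this general rule, assuming an illustrative 2×2 loss matrix and posterior values: the conditional risks are the loss-weighted posteriors, and the chosen action is the one with minimum conditional risk.

```python
import numpy as np

# Illustrative loss matrix: lam[i, j] = loss for action a_i when the state is w_j
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
posteriors = np.array([0.7, 0.3])      # P(w1 | x), P(w2 | x) for one observation

cond_risk = lam @ posteriors           # R(a_i | x) = sum_j lam(a_i | w_j) P(w_j | x)
best_action = np.argmin(cond_risk)     # Bayes rule: take the minimum-risk action
print(cond_risk, best_action + 1)
```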
• Two-category classification
  α1: deciding ω1
  α2: deciding ω2
  λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of
  nature is ωj

  Conditional risk:
      R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
      R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Our rule is the following:
  if R(α1 | x) < R(α2 | x), take action α1: "decide ω1"

This results in the equivalent rule: decide ω1 if

      (λ21 - λ11) P(x | ω1) P(ω1)  >  (λ12 - λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise
Likelihood ratio:

The preceding rule is equivalent to the following: if

      P(x | ω1) / P(x | ω2)  >  [ (λ12 - λ22) / (λ21 - λ11) ] · P(ω2) / P(ω1)

then take action α1 (decide ω1); otherwise take action α2 (decide ω2)
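A minimal sketch of this likelihood-ratio test, assuming illustrative losses, priors, and likelihood values at the observed x.

```python
# Illustrative numbers only:
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0    # losses lam_ij
p1, p2 = 0.6, 0.4                                  # priors P(w1), P(w2)
px_w1, px_w2 = 0.25, 0.05                          # likelihoods P(x | w1), P(x | w2)

threshold = (lam12 - lam22) / (lam21 - lam11) * (p2 / p1)
print("decide w1" if px_w1 / px_w2 > threshold else "decide w2")
```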
Optimal decision property
“If the likelihood ratio exceeds a threshold value
independent of the input pattern x, we can take
optimal actions”
3. Minimum-Error-Rate Classification
• Actions are decisions on classes
  If action αi is taken and the true state of nature is ωj, then the decision
  is correct if i = j and in error if i ≠ j
• Seek a decision rule that minimizes the probability of error, which is the
  error rate
• Introduction of the zero-one loss function:

      λ(αi, ωj) = 0 if i = j,  1 if i ≠ j,        i, j = 1, …, c

  Therefore, the conditional risk is:

      R(αi | x) = Σ λ(αi | ωj) P(ωj | x)   (summed over j = 1, …, c)
                = Σ P(ωj | x)              (summed over j ≠ i)
                = 1 - P(ωi | x)

  "The risk corresponding to this loss function is the average probability of
  error"
• Minimizing the risk requires maximizing P(ωi | x)
  (since R(αi | x) = 1 - P(ωi | x))
• For minimum error rate:
  Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
• Regions of decision and the zero-one loss function, therefore:

  Let  θλ = [ (λ12 - λ22) / (λ21 - λ11) ] · P(ω2) / P(ω1);
  then decide ω1 if:  P(x | ω1) / P(x | ω2) > θλ

• If λ is the zero-one loss function, which means

      λ = | 0  1 |
          | 1  0 |

  then the threshold is  θa = P(ω2) / P(ω1)

  If instead

      λ = | 0  2 |
          | 1  0 |

  then the threshold is  θb = 2 P(ω2) / P(ω1)
4. Classifiers, Discriminant Functions
and Decision Surfaces
• The multi-category case
• Set of discriminant functions gi(x), i = 1,…, c
• The classifier assigns a feature vector x to class ωi if:
      gi(x) > gj(x) for all j ≠ i
• Let gi(x) = -R(αi | x)
  (maximum discriminant corresponds to minimum risk!)
• For the minimum error rate, we take
      gi(x) = P(ωi | x)
  (maximum discriminant corresponds to maximum posterior!)

  Equivalent choices (they give the same decisions):
      gi(x) = P(x | ωi) P(ωi)
      gi(x) = ln P(x | ωi) + ln P(ωi)
  (ln: natural logarithm!)
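A minimal multi-category sketch of the log-form discriminant, assuming illustrative likelihood and prior values for one observation x.

```python
import numpy as np

# Illustrative values for c = 3 classes at a single observation x:
likelihoods = np.array([0.05, 0.20, 0.10])   # P(x | w_i)
priors = np.array([0.5, 0.3, 0.2])           # P(w_i)

g = np.log(likelihoods) + np.log(priors)     # g_i(x) = ln P(x | w_i) + ln P(w_i)
print("assign x to class", np.argmax(g) + 1)
```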
• Feature space divided into c decision regions:
      if gi(x) > gj(x) for all j ≠ i, then x is in Ri
  (Ri means: assign x to ωi)

• The two-category case
• A classifier is a "dichotomizer" that has two discriminant functions g1 and g2

  Let g(x) ≡ g1(x) - g2(x)
  Decide ω1 if g(x) > 0; otherwise decide ω2
• The computation of g(x); two choices that give the same decisions are:

      g(x) = P(ω1 | x) - P(ω2 | x)

      g(x) = ln [ P(x | ω1) / P(x | ω2) ] + ln [ P(ω1) / P(ω2) ]
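A minimal sketch of the dichotomizer in its log form, assuming illustrative likelihood and prior values.

```python
import numpy as np

# Illustrative values at one observation x:
px_w1, px_w2 = 0.25, 0.05    # P(x | w1), P(x | w2)
p1, p2 = 0.5, 0.5            # P(w1), P(w2)

g = np.log(px_w1 / px_w2) + np.log(p1 / p2)   # g(x) = g1(x) - g2(x)
print("decide w1" if g > 0 else "decide w2")
```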
Chapter 2 (Part 2):
Bayesian Decision Theory
(Sections 5, 6, 9)
5. The Normal Density
6. Discriminant Functions for the Normal Density
9. Bayes Decision Theory – Discrete Features
5. The Normal Density

• Univariate density
  • A density that is analytically tractable
  • A continuous density
  • Many processes are asymptotically Gaussian
  • Handwritten characters and speech sounds are ideal or prototype patterns
    corrupted by a random process (central limit theorem)
      P(x) = 1 / (√(2π) σ) · exp[ -(1/2) ((x - μ) / σ)² ]

  where:
      μ  = mean (or expected value) of x
      σ² = expected squared deviation, i.e. the variance
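A minimal sketch evaluating this density, with an illustrative mean and standard deviation.

```python
import numpy as np

def univariate_normal(x, mu, sigma):
    """P(x) = exp(-0.5 * ((x - mu) / sigma)**2) / (sqrt(2*pi) * sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative values: mean 0, standard deviation 1, evaluated at x = 0
print(univariate_normal(0.0, 0.0, 1.0))   # ~0.3989
```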
• Multivariate normal density
      p(x) ~ N(μ, Σ)

• The multivariate normal density in d dimensions is:

      P(x) = 1 / ( (2π)^(d/2) |Σ|^(1/2) ) · exp[ -(1/2) (x - μ)^t Σ^-1 (x - μ) ]

  where:
      x = (x1, x2, …, xd)^t   (t stands for the transpose vector form)
      μ = (μ1, μ2, …, μd)^t   is the mean vector
      Σ  is the d×d covariance matrix
      |Σ| and Σ^-1 are its determinant and inverse, respectively
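A minimal sketch evaluating the d-dimensional density with NumPy; the mean, covariance, and evaluation point are illustrative.

```python
import numpy as np

def multivariate_normal(x, mu, sigma):
    """P(x) = exp(-0.5 (x-mu)^t Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

# Illustrative 2-d example: zero mean, identity covariance, evaluated at the mean
print(multivariate_normal(np.zeros(2), np.zeros(2), np.eye(2)))   # 1/(2*pi) ~ 0.159
```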
Properties of the Normal Density

Covariance Matrix Σ
• Always symmetric
• Always positive semi-definite, but for our purposes Σ is positive definite:
  • its determinant is positive
  • it is invertible
• Its eigenvalues are real and positive, and the principal axes of the
  hyperellipsoidal loci of points of constant density are the eigenvectors of Σ
6. Discriminant Functions for the
Normal Density
• We saw that minimum-error-rate classification can be achieved by the
  discriminant function
      gi(x) = ln P(x | ωi) + ln P(ωi)

• Case of the multivariate normal:
      p(x | ωi) ~ N(μi, Σi)

      gi(x) = -(1/2)(x - μi)^t Σi^-1 (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi)
1. Case Σi = σ²I   (I stands for the identity matrix)

      gi(x) = wi^t x + wi0    (linear discriminant function)

  where:
      wi  = μi / σ²
      wi0 = -(1/(2σ²)) μi^t μi + ln P(ωi)

  (wi0 is called the threshold for the i-th category!)
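A minimal sketch of this linear discriminant, assuming illustrative means, a shared σ², and priors for two classes.

```python
import numpy as np

def linear_discriminant(x, mu_i, sigma2, prior_i):
    """g_i(x) = w_i^t x + w_i0 for the case Sigma_i = sigma^2 I."""
    w_i = mu_i / sigma2                                      # w_i = mu_i / sigma^2
    w_i0 = -(mu_i @ mu_i) / (2 * sigma2) + np.log(prior_i)   # threshold term
    return w_i @ x + w_i0

# Illustrative two-class example:
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
x = np.array([0.3, 0.2])
g1 = linear_discriminant(x, mu1, 1.0, 0.5)
g2 = linear_discriminant(x, mu2, 1.0, 0.5)
print("decide w1" if g1 > g2 else "decide w2")
```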
• A classifier that uses linear discriminant functions is
called “a linear machine”
• The decision surfaces for a linear machine are
pieces of hyperplanes defined by:
gi(x) = gj(x)
• If the priors are equal for all classes, this reduces to the minimum-distance
  classifier: an unknown pattern is assigned to the class of the nearest mean
• The hyperplane separating Ri and Rj passes through the point

      x0 = (1/2)(μi + μj) - [ σ² / ||μi - μj||² ] · ln[ P(ωi) / P(ωj) ] · (μi - μj)

  and is always orthogonal to the line linking the means!

      if P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)
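A short sketch computing x0 for this case, with illustrative means, σ², and priors; the boundary point shifts toward the mean of the less probable class.

```python
import numpy as np

# Illustrative values:
mu_i, mu_j = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
sigma2, p_i, p_j = 1.0, 0.7, 0.3

diff = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(p_i / p_j) * diff
print(x0)   # moves toward mu_j, the mean of the less probable class
```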
2. Case Σi = Σ   (the covariance matrices of all classes are identical but
arbitrary!)

• The hyperplane separating Ri and Rj passes through the point

      x0 = (1/2)(μi + μj) - [ ln( P(ωi) / P(ωj) ) / ( (μi - μj)^t Σ^-1 (μi - μj) ) ] · (μi - μj)

  (the hyperplane separating Ri and Rj is generally not orthogonal to the line
  between the means!)

• If the priors are equal, classification reduces to assigning x to the class
  whose mean has the smallest squared Mahalanobis distance (x - μi)^t Σ^-1 (x - μi)
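A minimal sketch of the linear discriminant this shared-covariance case yields, using the standard form gi(x) = wi^t x + wi0 with wi = Σ^-1 μi and wi0 = -(1/2) μi^t Σ^-1 μi + ln P(ωi); the covariance, means, and priors below are illustrative.

```python
import numpy as np

def shared_cov_discriminant(x, mu_i, sigma_inv, prior_i):
    """g_i(x) = w_i^t x + w_i0 for the case Sigma_i = Sigma."""
    w_i = sigma_inv @ mu_i                                   # w_i = Sigma^-1 mu_i
    w_i0 = -0.5 * mu_i @ sigma_inv @ mu_i + np.log(prior_i)
    return w_i @ x + w_i0

# Illustrative shared covariance and two class means:
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
sigma_inv = np.linalg.inv(sigma)
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.5])
x = np.array([0.0, 0.8])
g1 = shared_cov_discriminant(x, mu1, sigma_inv, 0.5)
g2 = shared_cov_discriminant(x, mu2, sigma_inv, 0.5)
print("decide w1" if g1 > g2 else "decide w2")
```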
3. Case Σi = arbitrary

• The covariance matrices are different for each category:

      gi(x) = x^t Wi x + wi^t x + wi0

  where:
      Wi  = -(1/2) Σi^-1
      wi  = Σi^-1 μi
      wi0 = -(1/2) μi^t Σi^-1 μi - (1/2) ln |Σi| + ln P(ωi)

  (the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes,
  hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids)
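A minimal sketch of this quadratic discriminant, assuming illustrative means, covariances, and priors for two classes.

```python
import numpy as np

def quadratic_discriminant(x, mu_i, sigma_i, prior_i):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for arbitrary Sigma_i."""
    sigma_inv = np.linalg.inv(sigma_i)
    W_i = -0.5 * sigma_inv
    w_i = sigma_inv @ mu_i
    w_i0 = (-0.5 * mu_i @ sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))
    return x @ W_i @ x + w_i @ x + w_i0

# Illustrative two-class example with different covariances:
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
s1 = np.eye(2)
s2 = np.array([[2.0, 0.5],
               [0.5, 1.5]])
x = np.array([1.0, 1.2])
g1 = quadratic_discriminant(x, mu1, s1, 0.5)
g2 = quadratic_discriminant(x, mu2, s2, 0.5)
print("decide w1" if g1 > g2 else "decide w2")
```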