
Bayesian Decision Theory
Shyh-Kang Jeng
Department of Electrical Engineering/
Graduate Institute of Communication/
Graduate Institute of Networking and
Multimedia, National Taiwan University
Basic Assumptions
The decision problem is posed in probabilistic terms
All of the relevant probability values are known

State of Nature
State of nature
– ω = ω₁ (sea bass) or ω = ω₂ (salmon)
A priori probability (prior)
– P(ω₁): the next fish is sea bass
– P(ω₂): the next fish is salmon
Decision rule to judge just one fish
– Decide ω₁ if P(ω₁) > P(ω₂); otherwise decide ω₂

Class-Conditional Probability Density

Bayes Formula
P( j | x) 
p( x |  j ) P( j )
p( x)
2
p( x)   p( x |  j ) P( j )
j 1
likelihood  prior
posterior 
evidence
5
Posterior Probabilities
P(1 )  2 / 3, P(2 )  1/ 3
6
Bayes Decision Rule
Probability of error

P(\text{error} \mid x) =
\begin{cases}
P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\
P(\omega_2 \mid x) & \text{if we decide } \omega_1
\end{cases}

P(\text{error}) = \int p(\text{error}, x)\,dx = \int P(\text{error} \mid x)\,p(x)\,dx

Bayes decision rule
– Decide ω₁ if P(ω₁ | x) > P(ω₂ | x); otherwise decide ω₂
– Or, decide ω₁ if p(x | ω₁) P(ω₁) > p(x | ω₂) P(ω₂); otherwise decide ω₂

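To make the rule concrete, here is a minimal Python sketch for a one-dimensional two-category problem. The priors are the 2/3 and 1/3 values used on the posterior-probability slide; the Gaussian class-conditional densities (means 2 and 4, unit variance) are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of the two-category Bayes decision rule.
priors = {1: 2/3, 2: 1/3}                      # P(w1), P(w2) from the slides
likelihoods = {1: norm(loc=2.0, scale=1.0),    # p(x | w1), assumed for illustration
               2: norm(loc=4.0, scale=1.0)}    # p(x | w2), assumed for illustration

def decide(x):
    """Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); otherwise decide w2."""
    g1 = likelihoods[1].pdf(x) * priors[1]
    g2 = likelihoods[2].pdf(x) * priors[2]
    return 1 if g1 > g2 else 2

def prob_error_given_x(x):
    """P(error | x) = min[P(w1|x), P(w2|x)]."""
    g = np.array([likelihoods[i].pdf(x) * priors[i] for i in (1, 2)])
    return g.min() / g.sum()

print(decide(2.5), prob_error_given_x(2.5))
```
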
Bayes Decision Theory (1/3)
Categories: {ω₁, …, ω_c}
Actions: {α₁, …, α_a}
Loss function: λ(αᵢ | ωⱼ)
Feature vector: d-component vector x

Bayes Decision Theory (2/3)
Bayes formula

P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\,P(\omega_j)}{p(\mathbf{x})},
\qquad
p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)

Conditional risk

R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x})

Bayes Decision Theory (3/3)
Decision function α(x) assumes one of the a values α₁, …, α_a
Overall risk

R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}

Bayes decision rule: compute the conditional risk

R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x}), \quad i = 1, \ldots, a

then select the action αᵢ for which R(αᵢ | x) is minimum

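A minimal sketch of this rule in Python: given a loss matrix λ(αᵢ | ωⱼ) and the posteriors for an observed x, compute the conditional risks and pick the action that minimizes them. The loss values and posteriors below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the general Bayes decision rule with a loss matrix.
loss = np.array([[0.0, 2.0],    # lambda(alpha_i | omega_j); rows = actions, columns = states
                 [1.0, 0.0]])
posteriors = np.array([0.7, 0.3])          # P(omega_j | x) for the observed x (assumed)

cond_risk = loss @ posteriors              # R(alpha_i | x) = sum_j lambda_ij P(omega_j | x)
best_action = int(np.argmin(cond_risk))    # choose the action with minimum conditional risk
print(cond_risk, best_action)
```
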
Two-Category Classification
Conditional risk

R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})
R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x}),
\qquad \lambda_{ij} = \lambda(\alpha_i \mid \omega_j)

Decision rule: decide ω₁ if

(\lambda_{21} - \lambda_{11})\,p(\mathbf{x} \mid \omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x} \mid \omega_2)\,P(\omega_2)

Equivalently (likelihood ratio): decide ω₁ if

\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}\,\frac{P(\omega_2)}{P(\omega_1)}

Minimum-Error-Rate Classification
If action αᵢ is taken and the true state is ωⱼ, then the decision is correct if i = j and in error if i ≠ j
Error rate (the probability of error) is to be minimized
Symmetrical or zero-one loss function

\lambda(\alpha_i \mid \omega_j) =
\begin{cases}
0, & i = j \\
1, & i \neq j
\end{cases}
\qquad i, j = 1, \ldots, c

Conditional risk

R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})

Minimum-Error-Rate Classification
Mini-max Criterion
To perform well over a range of prior probabilities
Minimize the maximum possible overall risk
– So that the worst risk for any value of the priors is as small as possible

Mini-maximizing Risk
R = \int_{\mathcal{R}_1} \bigl[\lambda_{11} P(\omega_1)\,p(\mathbf{x} \mid \omega_1) + \lambda_{12} P(\omega_2)\,p(\mathbf{x} \mid \omega_2)\bigr]\,d\mathbf{x}
  + \int_{\mathcal{R}_2} \bigl[\lambda_{21} P(\omega_1)\,p(\mathbf{x} \mid \omega_1) + \lambda_{22} P(\omega_2)\,p(\mathbf{x} \mid \omega_2)\bigr]\,d\mathbf{x}

= \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}
  + P(\omega_1)\Bigl[(\lambda_{11} - \lambda_{22})
  + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,d\mathbf{x}
  - (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}\Bigr]

Choosing the decision boundary so that the bracketed term is zero makes the overall risk independent of P(ω₁); the resulting minimax risk is

R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,d\mathbf{x}

Searching for Mini-max Boundary
Neyman-Pearson Criterion
Minimize the overall risk subject to a constraint
Example
– Minimize the total risk subject to ∫ R(αᵢ | x) dx = constant

Discriminant Functions
A classifier assigns x to class ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i, where the gᵢ(x) are called discriminant functions
A discriminant function for a Bayes classifier: gᵢ(x) = −R(αᵢ | x)
Two discriminant functions for minimum-error-rate classification:

g_i(\mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)};
\qquad
g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)

Discriminant Functions
Two-Dimensional Two-Category Classifier

Dichotomizers
Place a pattern in one of only two categories
– cf. Polychotomizers
More common to define a single discriminant function

g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})

Some particular forms:

g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x});
\qquad
g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}

Univariate Normal PDF

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}\right] \;\sim\; N(\mu, \sigma^{2})

Distribution with Maximum Entropy and Central Limit Theorem
Entropy for a discrete distribution

H = -\sum_{i=1}^{m} P_i \log_2 P_i \quad \text{(bits)}

Entropy for a continuous distribution

H(p(x)) = -\int p(x) \ln p(x)\,dx \quad \text{(nats)}

Central limit theorem
– The aggregate effect of the sum of a large number of small, independent random disturbances will lead to a Gaussian distribution

Multivariate Normal PDF

p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{t} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right] \;\sim\; N(\boldsymbol{\mu}, \boldsymbol{\Sigma})

\boldsymbol{\mu} = E[\mathbf{x}]: d-component mean vector
\boldsymbol{\Sigma} = E\bigl[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{t}\bigr]: d-by-d covariance matrix

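A small sketch of evaluating this density numerically; the mean and covariance below are the example values that appear in the two-dimensional Gaussian example later in the slides.

```python
import numpy as np

# Minimal sketch of evaluating the multivariate normal density N(mu, Sigma).
def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff            # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([3.0, 6.0])
Sigma = np.array([[0.5, 0.0],
                  [0.0, 2.0]])
print(mvn_pdf(np.array([3.0, 5.0]), mu, Sigma))
```
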
Linear Combination of Gaussian Random Variables

p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}),\;\; \mathbf{y} = \mathbf{A}^{t}\mathbf{x}
\;\Rightarrow\;
p(\mathbf{y}) \sim N(\mathbf{A}^{t}\boldsymbol{\mu},\, \mathbf{A}^{t}\boldsymbol{\Sigma}\mathbf{A})

Whitening Transform
Φ: matrix whose columns are the orthonormal eigenvectors of Σ
Λ: diagonal matrix of the corresponding eigenvalues
Whitening transform

\mathbf{A}_w = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2},
\qquad
\mathbf{A}_w^{t}\,\boldsymbol{\Sigma}\,\mathbf{A}_w = \mathbf{I}

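A minimal sketch of the whitening transform using an eigendecomposition; the covariance matrix below is an illustrative assumption. The last line checks that A_wᵗ Σ A_w is (numerically) the identity.

```python
import numpy as np

# Minimal sketch of the whitening transform A_w = Phi Lambda^{-1/2}.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])           # assumed covariance matrix

eigvals, Phi = np.linalg.eigh(Sigma)     # columns of Phi are orthonormal eigenvectors
A_w = Phi @ np.diag(eigvals ** -0.5)     # A_w = Phi Lambda^{-1/2}

print(A_w.T @ Sigma @ A_w)               # should be (numerically) the identity matrix
```
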
Bivariate Gaussian PDF
Mahalanobis Distance
Squared Mahalanobis distance

r^{2} = (\mathbf{x} - \boldsymbol{\mu})^{t} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})

Volume of the hyperellipsoid of constant Mahalanobis distance r

V = V_d\,|\boldsymbol{\Sigma}|^{1/2}\,r^{d},
\qquad
V_d =
\begin{cases}
\pi^{d/2} / (d/2)! & d \text{ even} \\
2^{d}\,\pi^{(d-1)/2}\,\bigl(\tfrac{d-1}{2}\bigr)! \,/\, d! & d \text{ odd}
\end{cases}

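A short sketch computing the squared Mahalanobis distance and the hyperellipsoid volume; the gamma-function form of V_d used here is equivalent to the even/odd factorial expressions above. The mean and covariance are illustrative assumptions.

```python
import numpy as np
from math import pi, gamma

# Minimal sketch: squared Mahalanobis distance and constant-distance hyperellipsoid volume.
def mahalanobis_sq(x, mu, Sigma):
    diff = x - mu
    return diff @ np.linalg.inv(Sigma) @ diff

def hyperellipsoid_volume(r, Sigma):
    d = Sigma.shape[0]
    V_d = pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the d-dimensional unit sphere
    return V_d * np.sqrt(np.linalg.det(Sigma)) * r ** d

mu = np.array([0.0, 0.0])                    # assumed mean
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])               # assumed covariance
print(mahalanobis_sq(np.array([1.0, -1.0]), mu, Sigma))
print(hyperellipsoid_volume(1.0, Sigma))
```
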
Discriminant Functions for Normal Density
For minimum-error-rate classification

g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)

For the normal density

g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{t} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)

Case 1: Σᵢ = σ²I

g_i(\mathbf{x}) = -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^{2}}{2\sigma^{2}} + \ln P(\omega_i),
\qquad
\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^{2} = (\mathbf{x} - \boldsymbol{\mu}_i)^{t}(\mathbf{x} - \boldsymbol{\mu}_i)

g_i(\mathbf{x}) = -\frac{1}{2\sigma^{2}} \bigl[\mathbf{x}^{t}\mathbf{x} - 2\boldsymbol{\mu}_i^{t}\mathbf{x} + \boldsymbol{\mu}_i^{t}\boldsymbol{\mu}_i\bigr] + \ln P(\omega_i)

Dropping the term xᵗx, which is the same for every class, gives a linear discriminant:

g_i(\mathbf{x}) = \mathbf{w}_i^{t}\mathbf{x} + w_{i0},
\qquad
\mathbf{w}_i = \frac{1}{\sigma^{2}}\boldsymbol{\mu}_i,
\quad
w_{i0} = -\frac{1}{2\sigma^{2}}\boldsymbol{\mu}_i^{t}\boldsymbol{\mu}_i + \ln P(\omega_i)

Decision Boundaries
g_i(\mathbf{x}) = g_j(\mathbf{x}) \;\Rightarrow\; \mathbf{w}^{t}(\mathbf{x} - \mathbf{x}_0) = 0

\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j,
\qquad
\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j)
- \frac{\sigma^{2}}{\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert^{2}}
\ln \frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)

Decision Boundaries when P(ωᵢ) = P(ωⱼ)

Decision Boundaries when P(ωᵢ) and P(ωⱼ) are Unequal

Case 2: Σᵢ = Σ

g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{t} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)

g_i(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^{t}\boldsymbol{\Sigma}^{-1}\mathbf{x}
+ \boldsymbol{\mu}_i^{t}\boldsymbol{\Sigma}^{-1}\mathbf{x}
- \frac{1}{2}\boldsymbol{\mu}_i^{t}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)

Dropping the common term xᵗΣ⁻¹x gives a linear discriminant:

g_i(\mathbf{x}) = \mathbf{w}_i^{t}\mathbf{x} + w_{i0},
\qquad
\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i,
\quad
w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^{t}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)

Decision Boundaries
g_i(\mathbf{x}) = g_j(\mathbf{x}) \;\Rightarrow\; \mathbf{w}^{t}(\mathbf{x} - \mathbf{x}_0) = 0

\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j),
\qquad
\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j)
- \frac{\ln\bigl[P(\omega_i)/P(\omega_j)\bigr]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{t}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)

Decision Boundaries
Case 3: Σᵢ = arbitrary

g_i(\mathbf{x}) = \mathbf{x}^{t}\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^{t}\mathbf{x} + w_{i0}

\mathbf{W}_i = -\frac{1}{2}\boldsymbol{\Sigma}_i^{-1},
\qquad
\mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i,
\qquad
w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^{t}\boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)

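A minimal sketch of this quadratic discriminant; the means, covariances, and equal priors below are the example values used later in the slides for the two-dimensional Gaussian data.

```python
import numpy as np

# Minimal sketch of the quadratic discriminant for Gaussian classes with
# arbitrary covariance matrices (Case 3).
params = [
    (np.array([3.0, 6.0]), np.array([[0.5, 0.0], [0.0, 2.0]]), 0.5),   # class 1
    (np.array([3.0, -2.0]), np.array([[2.0, 0.0], [0.0, 2.0]]), 0.5),  # class 2
]

def g(x, mu, Sigma, prior):
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

x = np.array([3.0, 2.0])
scores = [g(x, mu, Sigma, prior) for mu, Sigma, prior in params]
print(np.argmax(scores) + 1)   # class with the largest discriminant value
```
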
Decision Boundaries for One-Dimensional Case

Decision Boundaries for Two-Dimensional Case

Decision Boundaries for Three-Dimensional Case (1/2)

Decision Boundaries for Three-Dimensional Case (2/2)

Decision Boundaries for Four Normal Distributions

Example: Decision Regions for Two-Dimensional Gaussian Data

Example: Decision Regions for Two-Dimensional Gaussian Data

\boldsymbol{\mu}_1 = \begin{pmatrix} 3 \\ 6 \end{pmatrix},\;
\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 2 \end{pmatrix};
\qquad
\boldsymbol{\mu}_2 = \begin{pmatrix} 3 \\ -2 \end{pmatrix},\;
\boldsymbol{\Sigma}_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}

\boldsymbol{\Sigma}_1^{-1} = \begin{pmatrix} 2 & 0 \\ 0 & 1/2 \end{pmatrix},
\qquad
\boldsymbol{\Sigma}_2^{-1} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}

P(\omega_1) = P(\omega_2) = 0.5

Decision boundary:

x_2 = 3.514 - 1.125\,x_1 + 0.1875\,x_1^{2},
\quad \text{not passing through the point } (3, 2)^{t} \text{ midway between the two means}

Bayes Decision Compared with Other Decision Strategies

P(\text{error}) = P(\mathbf{x} \in \mathcal{R}_2, \omega_1) + P(\mathbf{x} \in \mathcal{R}_1, \omega_2)
= \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\,P(\omega_1)\,d\mathbf{x}
+ \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\,P(\omega_2)\,d\mathbf{x}

Multicategory Case
Probability of being correct

P(\text{correct}) = \sum_{i=1}^{c} \int_{\mathcal{R}_i} p(\mathbf{x} \mid \omega_i)\,P(\omega_i)\,d\mathbf{x}

The Bayes classifier maximizes this probability by choosing the regions so that the integrand is maximal for all x
– No other partitioning can yield a smaller probability of error

Error Bounds for Normal Densities
Full calculation of the error probability is difficult for the Gaussian case
– Especially in high dimensions
– Because of the discontinuous nature of the decision regions
An upper bound on the error can be obtained for the two-category case
– By approximating the error integral analytically

Chernoff Bound
\min[a, b] \le a^{\beta}\,b^{1-\beta} \quad \text{for } a, b \ge 0 \text{ and } 0 \le \beta \le 1

P(\text{error} \mid \mathbf{x}) = \min\bigl[P(\omega_1 \mid \mathbf{x}),\, P(\omega_2 \mid \mathbf{x})\bigr],
\qquad
P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\,P(\omega_j)}{p(\mathbf{x})} \propto p(\mathbf{x} \mid \omega_j)\,P(\omega_j)

P(\text{error}) \le P^{\beta}(\omega_1)\,P^{1-\beta}(\omega_2) \int p^{\beta}(\mathbf{x} \mid \omega_1)\,p^{1-\beta}(\mathbf{x} \mid \omega_2)\,d\mathbf{x}

For normal densities,

\int p^{\beta}(\mathbf{x} \mid \omega_1)\,p^{1-\beta}(\mathbf{x} \mid \omega_2)\,d\mathbf{x} = e^{-k(\beta)}

k(\beta) = \frac{\beta(1-\beta)}{2} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^{t} \bigl[(1-\beta)\boldsymbol{\Sigma}_1 + \beta\boldsymbol{\Sigma}_2\bigr]^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)
+ \frac{1}{2} \ln \frac{\bigl|(1-\beta)\boldsymbol{\Sigma}_1 + \beta\boldsymbol{\Sigma}_2\bigr|}{|\boldsymbol{\Sigma}_1|^{1-\beta}\,|\boldsymbol{\Sigma}_2|^{\beta}}

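A minimal sketch of evaluating k(β) and the resulting bound; β = 1/2 gives the Bhattacharyya bound of the next slide, and a crude grid search over β approximates the tighter Chernoff bound. The parameters are the example values from the error-bounds example later in the slides.

```python
import numpy as np

# Minimal sketch of the Chernoff bound k(beta) for two Gaussian classes.
mu1, Sigma1 = np.array([3.0, 6.0]), np.array([[0.5, 0.0], [0.0, 2.0]])
mu2, Sigma2 = np.array([3.0, -2.0]), np.array([[2.0, 0.0], [0.0, 2.0]])
P1 = P2 = 0.5

def k(beta):
    S = (1 - beta) * Sigma1 + beta * Sigma2
    dmu = mu2 - mu1
    quad = 0.5 * beta * (1 - beta) * dmu @ np.linalg.inv(S) @ dmu
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          (np.linalg.det(Sigma1) ** (1 - beta) *
                           np.linalg.det(Sigma2) ** beta))
    return quad + logdet

def bound(beta):
    return P1 ** beta * P2 ** (1 - beta) * np.exp(-k(beta))

print(bound(0.5))                                  # Bhattacharyya bound (beta = 1/2)
betas = np.linspace(0.01, 0.99, 99)
print(min(bound(b) for b in betas))                # crude numerical Chernoff bound
```
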
Bhattacharyya Bound
Set β = 1/2:

P(\text{error}) \le \sqrt{P(\omega_1)\,P(\omega_2)} \int \sqrt{p(\mathbf{x} \mid \omega_1)\,p(\mathbf{x} \mid \omega_2)}\,d\mathbf{x}
= \sqrt{P(\omega_1)\,P(\omega_2)}\; e^{-k(1/2)}

k(1/2) = \frac{1}{8} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^{t} \left[\frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\right]^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)
+ \frac{1}{2} \ln \frac{\bigl|\tfrac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\bigr|}{\sqrt{|\boldsymbol{\Sigma}_1|\,|\boldsymbol{\Sigma}_2|}}

Chernoff Bound and Bhattacharyya Bound

Example: Error Bounds for Gaussian Distribution

\boldsymbol{\mu}_1 = \begin{pmatrix} 3 \\ 6 \end{pmatrix},\;
\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 2 \end{pmatrix};
\qquad
\boldsymbol{\mu}_2 = \begin{pmatrix} 3 \\ -2 \end{pmatrix},\;
\boldsymbol{\Sigma}_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}

\boldsymbol{\Sigma}_1^{-1} = \begin{pmatrix} 2 & 0 \\ 0 & 1/2 \end{pmatrix},
\qquad
\boldsymbol{\Sigma}_2^{-1} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}

P(\omega_1) = P(\omega_2) = 0.5

Example: Error Bounds for Gaussian Distribution
Bhattacharyya bound
– k(1/2) = 4.11157
– P(error) < 0.0087
Chernoff bound
– 0.008190, found by numerical search over β
Error rate by numerical integration
– 0.0021
– Impractical for higher dimensions

Signal Detection Theory
Internal signal in the detector: x
– Has mean μ₂ when the external signal (pulse) is present
– Has mean μ₁ when the external signal is not present
– p(x | ωᵢ) ~ N(μᵢ, σ²)

Signal Detection Theory

Discriminability:

d' = \frac{|\mu_2 - \mu_1|}{\sigma}

Four Probabilities
Hit: P(x > x* | x ∈ ω₂)
False alarm: P(x > x* | x ∈ ω₁)
Miss: P(x < x* | x ∈ ω₂)
Correct reject: P(x < x* | x ∈ ω₁)

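A minimal sketch computing these four probabilities and the discriminability d′ for Gaussian internal-signal distributions; the means, σ, and threshold x* below are illustrative assumptions.

```python
from scipy.stats import norm

# Minimal sketch of the four signal-detection probabilities and d'.
mu1, mu2, sigma = 0.0, 1.5, 1.0      # no-signal mean, signal mean, common std dev (assumed)
x_star = 0.75                        # decision threshold (assumed)

hit = norm.sf(x_star, loc=mu2, scale=sigma)              # P(x > x* | omega_2)
false_alarm = norm.sf(x_star, loc=mu1, scale=sigma)      # P(x > x* | omega_1)
miss = norm.cdf(x_star, loc=mu2, scale=sigma)            # P(x < x* | omega_2)
correct_reject = norm.cdf(x_star, loc=mu1, scale=sigma)  # P(x < x* | omega_1)

d_prime = abs(mu2 - mu1) / sigma     # discriminability
print(hit, false_alarm, miss, correct_reject, d_prime)
```
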
Receiver Operating Characteristic (ROC)

Bayes Decision Theory: Discrete Features

\int p(\mathbf{x} \mid \omega_i)\,d\mathbf{x} \;\rightarrow\; \sum_{\mathbf{x}} P(\mathbf{x} \mid \omega_i)

P(\omega_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{P(\mathbf{x})},
\qquad
P(\mathbf{x}) = \sum_{i=1}^{c} P(\mathbf{x} \mid \omega_i)\,P(\omega_i)

Select the action

\alpha^{*} = \arg\min_{i} R(\alpha_i \mid \mathbf{x})

Independent Binary Features
\mathbf{x} = (x_1, \ldots, x_d)^{t},
\qquad
p_i = \Pr[x_i = 1 \mid \omega_1],
\quad
q_i = \Pr[x_i = 1 \mid \omega_2]

Assume conditional independence:

P(\mathbf{x} \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1 - x_i},
\qquad
P(\mathbf{x} \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1 - x_i}

Likelihood ratio:

\frac{P(\mathbf{x} \mid \omega_1)}{P(\mathbf{x} \mid \omega_2)} = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i} \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i}

Discriminant Function

g(\mathbf{x}) = \sum_{i=1}^{d} \left[x_i \ln \frac{p_i}{q_i} + (1 - x_i) \ln \frac{1 - p_i}{1 - q_i}\right] + \ln \frac{P(\omega_1)}{P(\omega_2)}

g(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i + w_0,
\qquad
w_i = \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)},
\quad
w_0 = \sum_{i=1}^{d} \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}

Decide ω₁ if g(x) > 0 and ω₂ if g(x) ≤ 0

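A minimal sketch of this linear discriminant, using the three-feature example that follows (pᵢ = 0.8, qᵢ = 0.5, equal priors); the test pattern passed to decide() is an illustrative assumption.

```python
import numpy as np

# Minimal sketch of the linear discriminant for independent binary features.
p = np.array([0.8, 0.8, 0.8])   # Pr[x_i = 1 | omega_1]
q = np.array([0.5, 0.5, 0.5])   # Pr[x_i = 1 | omega_2]
P1 = P2 = 0.5

w = np.log(p * (1 - q) / (q * (1 - p)))                    # w_i, about 1.3863 each
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)   # about -2.75

def decide(x):
    """Decide omega_1 if g(x) > 0, otherwise omega_2."""
    g = w @ x + w0
    return 1 if g > 0 else 2

print(w, w0, decide(np.array([1, 1, 0])))   # example test pattern (assumed)
```
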
Example: Three-Dimensional Binary Data

P(\omega_1) = P(\omega_2) = 0.5,
\qquad
p_i = 0.8,\; q_i = 0.5,\quad i = 1, 2, 3

w_i = \ln \frac{0.8\,(1 - 0.5)}{0.5\,(1 - 0.8)} = 1.3863

w_0 = \sum_{i=1}^{3} \ln \frac{1 - 0.8}{1 - 0.5} + \ln \frac{0.5}{0.5} = -2.75

Example: Three-Dimensional Binary Data

P(\omega_1) = P(\omega_2) = 0.5,
\qquad
p_i = 0.8,\; q_i = 0.5,\quad i = 1, 2;
\qquad
p_3 = q_3 = 0.5

w_i = \ln \frac{0.8\,(1 - 0.5)}{0.5\,(1 - 0.8)} = 1.3863, \quad i = 1, 2;
\qquad
w_3 = 0

w_0 = \sum_{i=1}^{2} \ln \frac{1 - 0.8}{1 - 0.5} + \ln \frac{0.5}{0.5} = -1.83

Illustration of Missing Features
Decision with Missing Features
\mathbf{x} = [\mathbf{x}_g, \mathbf{x}_b] \quad \text{(good features and bad/missing features)}

P(\omega_i \mid \mathbf{x}_g) = \frac{P(\omega_i, \mathbf{x}_g)}{P(\mathbf{x}_g)}
= \frac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b)\,d\mathbf{x}_b}{\int p(\mathbf{x})\,d\mathbf{x}_b}
= \frac{\int P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b)\,p(\mathbf{x}_g, \mathbf{x}_b)\,d\mathbf{x}_b}{\int p(\mathbf{x})\,d\mathbf{x}_b}
= \frac{\int g_i(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}_b}{\int p(\mathbf{x})\,d\mathbf{x}_b}

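A minimal sketch of this marginalization for one good and one bad (missing) feature, integrating the joint over the missing feature on a grid; the Gaussian class models, priors, observed value, and grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal sketch: posterior given only the good feature, marginalizing the bad one.
priors = [0.5, 0.5]
models = [multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),   # assumed class models
          multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2))]

x_g = 1.5                                  # observed (good) feature value
xb_grid = np.linspace(-6.0, 8.0, 1401)     # grid over the missing (bad) feature

# numerator_i = P(w_i) * integral over x_b of p(x_g, x_b | w_i)
numerators = [prior * np.trapz([m.pdf([x_g, xb]) for xb in xb_grid], xb_grid)
              for m, prior in zip(models, priors)]
posteriors = np.array(numerators) / sum(numerators)
print(posteriors, np.argmax(posteriors) + 1)
```
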
Noisy Features
Noise model: p(x_b | x_t); assume x_b is independent of ωᵢ and x_g given x_t

P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b) = \frac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\,d\mathbf{x}_t}{p(\mathbf{x}_g, \mathbf{x}_b)}

p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)
= P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\,p(\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)
= P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_b \mid \mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_g, \mathbf{x}_t),
\qquad
p(\mathbf{x}_b \mid \mathbf{x}_g, \mathbf{x}_t) = p(\mathbf{x}_b \mid \mathbf{x}_t)

P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b)
= \frac{\int P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_b \mid \mathbf{x}_t)\,d\mathbf{x}_t}{\int p(\mathbf{x}_g, \mathbf{x}_t)\,p(\mathbf{x}_b \mid \mathbf{x}_t)\,d\mathbf{x}_t}
= \frac{\int g_i(\mathbf{x})\,p(\mathbf{x})\,p(\mathbf{x}_b \mid \mathbf{x}_t)\,d\mathbf{x}_t}{\int p(\mathbf{x})\,p(\mathbf{x}_b \mid \mathbf{x}_t)\,d\mathbf{x}_t},
\qquad
\mathbf{x} = [\mathbf{x}_g, \mathbf{x}_t]

Example of Statistical Dependence and Independence

p(x_1, x_3) = p(x_1)\,p(x_3)

Example of Causal Dependence
State of an automobile
– Temperature of engine
– Pressure of brake fluid
– Pressure of air in the tires
– Voltages in the wires
– Oil temperature
– Coolant temperature
– Speed of the radiator fan

Bayesian Belief Nets (Causal Networks)

Example: Belief Network for Fish
P(a_3, b_1, x_2, c_3, d_2) = P(a_3)\,P(b_1)\,P(x_2 \mid a_3, b_1)\,P(c_3 \mid x_2)\,P(d_2 \mid x_2)
= 0.25 \times 0.6 \times 0.4 \times 0.5 \times 0.4 = 0.012

Simple Belief Network 1
P(d) = \sum_{a, b, c} P(a, b, c, d)
= \sum_{a, b, c} P(a)\,P(b \mid a)\,P(c \mid b)\,P(d \mid c)
= \sum_{c} P(d \mid c) \sum_{b} P(c \mid b) \sum_{a} P(b \mid a)\,P(a)

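A minimal sketch of this sum-out computation for the chain a → b → c → d; all conditional probability tables below are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of P(d) in a chain belief net, summing out a, then b, then c.
P_a = np.array([0.6, 0.4])                    # P(a), assumed
P_b_a = np.array([[0.7, 0.3],                 # P(b | a), rows indexed by a, assumed
                  [0.2, 0.8]])
P_c_b = np.array([[0.9, 0.1],                 # P(c | b), rows indexed by b, assumed
                  [0.4, 0.6]])
P_d_c = np.array([[0.5, 0.5],                 # P(d | c), rows indexed by c, assumed
                  [0.1, 0.9]])

P_b = P_a @ P_b_a          # sum_a P(b|a) P(a)
P_c = P_b @ P_c_b          # sum_b P(c|b) P(b)
P_d = P_c @ P_d_c          # sum_c P(d|c) P(c)
print(P_d, P_d.sum())      # P(d); the probabilities sum to 1
```
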
Simple Belief Network 2
P(h) = \sum_{e, f, g} P(e, f, g, h)
= \sum_{e, f, g} P(e)\,P(f \mid e)\,P(g \mid e)\,P(h \mid f, g)
= \sum_{e} P(e) \sum_{f, g} P(f \mid e)\,P(g \mid e)\,P(h \mid f, g)

Use of Bayes Belief Nets
Seek to determine some particular configuration of other variables
– Given the values of some of the variables (evidence)
Determine the values of several query variables x given the evidence of all other variables e

P(\mathbf{x} \mid \mathbf{e}) = \frac{P(\mathbf{x}, \mathbf{e})}{P(\mathbf{e})} \propto P(\mathbf{x}, \mathbf{e})

Example
Example
P(x_1 \mid c_1, b_2) = \frac{P(x_1, c_1, b_2)}{P(c_1, b_2)}
\propto \sum_{a, d} P(x_1, a, b_2, c_1, d)
= \sum_{a, d} P(a)\,P(b_2)\,P(x_1 \mid a, b_2)\,P(c_1 \mid x_1)\,P(d \mid x_1)

= P(b_2)\,P(c_1 \mid x_1)
\bigl[P(a_1)\,P(x_1 \mid a_1, b_2) + P(a_2)\,P(x_1 \mid a_2, b_2)
+ P(a_3)\,P(x_1 \mid a_3, b_2) + P(a_4)\,P(x_1 \mid a_4, b_2)\bigr]
\bigl[P(d_1 \mid x_1) + P(d_2 \mid x_1)\bigr]
= \cdots = 0.114

P(x_2 \mid c_1, b_2) \propto \cdots = 0.066

After normalization: P(x_1 \mid c_1, b_2) = 0.63, \quad P(x_2 \mid c_1, b_2) = 0.37

Naïve Bayes’ Rule (Idiot Bayes’ Rule)
When the dependency relationships among the features are unknown, we generally take the simplest assumption
– Features are conditionally independent given the category

P(x \mid a, b) \approx P(x \mid a)\,P(x \mid b)

– Often works quite well

Applications in Medical Diagnosis
Uppermost nodes represent a fundamental biological agent
– Such as the presence of a virus or bacteria
Intermediate nodes describe diseases
– Such as flu or emphysema
Lowermost nodes describe the symptoms
– Such as high temperature or coughing
A physician enters measured values into the net and finds the most likely disease or cause

Compound Bayesian Decision

\boldsymbol{\omega} = (\omega(1), \ldots, \omega(n))^{t}, \quad \omega(i) \text{ takes one of the values } \omega_1, \ldots, \omega_c

\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)

P(\boldsymbol{\omega} \mid \mathbf{X})
= \frac{p(\mathbf{X} \mid \boldsymbol{\omega})\,P(\boldsymbol{\omega})}{p(\mathbf{X})}
= \frac{p(\mathbf{X} \mid \boldsymbol{\omega})\,P(\boldsymbol{\omega})}{\sum_{\boldsymbol{\omega}} p(\mathbf{X} \mid \boldsymbol{\omega})\,P(\boldsymbol{\omega})}

Simplification:

p(\mathbf{X} \mid \boldsymbol{\omega}) = \prod_{i=1}^{n} p(\mathbf{x}_i \mid \omega(i))