Bayes Decision Theory

Topics:
• Bayes Error Rate and Error Bounds
• Receiver Operating Characteristics
• Discrete Features
• Missing Features
Example of Bayes Decision Boundary

Two Gaussian distributions, each with four data points:

µ1 = [3, 6]^t,   µ2 = [3, −2]^t

Σ1 = diag(1/2, 2),   Σ2 = diag(2, 2)

Inverse matrices:

Σ1^−1 = diag(2, 1/2),   Σ2^−1 = diag(1/2, 1/2)

The resulting decision boundary, assuming P(ω1) = P(ω2) = 0.5, is the parabola

x2 = 3.514 − 1.125 x1 + 0.1875 x1^2

[Figure: the two point sets around µ1 and µ2 in the (x1, x2) plane, with the parabolic decision boundary between them.]
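As a check on the algebra, here is a minimal NumPy sketch (not from the lecture; the discriminant helper and the interpolation trick are illustrative choices) that recovers the boundary numerically from the slide's parameters:

```python
import numpy as np

# Means and covariances from the slide; equal priors P(w1) = P(w2) = 0.5
mu1, mu2 = np.array([3.0, 6.0]), np.array([3.0, -2.0])
S1, S2 = np.diag([0.5, 2.0]), np.diag([2.0, 2.0])

def g(x, mu, S, prior=0.5):
    """Gaussian discriminant g(x) = -1/2 (x-mu)^t S^-1 (x-mu) - 1/2 ln|S| + ln P."""
    d = x - mu
    return -0.5 * d @ np.linalg.inv(S) @ d - 0.5 * np.log(np.linalg.det(S)) + np.log(prior)

# For fixed x1, g1 - g2 is linear in x2 here (both classes have the same
# variance along x2, so the quadratic x2 terms cancel); two evaluations
# therefore locate the boundary exactly by linear interpolation.
for x1 in (0.0, 2.0, 4.0):
    f = lambda x2: (g(np.array([x1, x2]), mu1, S1) -
                    g(np.array([x1, x2]), mu2, S2))
    a, b = f(0.0), f(1.0)
    numeric = -a / (b - a)
    closed = 3.514 - 1.125 * x1 + 0.1875 * x1 ** 2
    print(f"x1 = {x1}: boundary x2 = {numeric:.3f} (closed form {closed:.3f})")
```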
Bayes Error Rate

Two-class case:

P(error) = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)
         = ∫_R2 p(x | ω1) P(ω1) dx + ∫_R1 p(x | ω2) P(ω2) dx

This is stated for the multidimensional case, where the optimal regions R1 and R2 are not easy to specify.

Multi-class case (it is simpler to compute the probability of being correct):

P(correct) = Σ_{i=1}^{c} ∫_Ri p(x | ωi) P(ωi) dx

[Figure: components of P(error) for equal priors and a non-optimal decision point x*.]
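The error integrals rarely have a closed form, but for a concrete case they are easy to estimate by simulation. A minimal Monte Carlo sketch (not from the lecture) using the two-Gaussian example above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-Gaussian example from the earlier slide, equal priors
mu = [np.array([3.0, 6.0]), np.array([3.0, -2.0])]
S = [np.diag([0.5, 2.0]), np.diag([2.0, 2.0])]
Sinv = [np.linalg.inv(s) for s in S]
logdet = [np.log(np.linalg.det(s)) for s in S]

def g(X, i):
    """Quadratic discriminant for class i, vectorized over rows of X
    (equal priors, so the ln P(w_i) term cancels and is omitted)."""
    D = X - mu[i]
    return -0.5 * np.einsum('nj,jk,nk->n', D, Sinv[i], D) - 0.5 * logdet[i]

n = 200_000
errors = 0
for i in (0, 1):
    X = rng.multivariate_normal(mu[i], S[i], size=n // 2)
    # Misclassified when the other class's discriminant is larger
    errors += np.sum(g(X, 1 - i) > g(X, i))

print("Monte Carlo estimate of P(error):", errors / n)
```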
Error Bounds for Normal Densities

• Direct evaluation of the error integrals is difficult: the decision regions may be discontinuous
• Instead, obtain bounds on the error rate
• A useful inequality for obtaining a bound: the minimum of two nonnegative numbers is no greater than the square root of their product (e.g., min[4, 9] = 4 ≤ √36 = 6); more generally,

  min[a, b] ≤ a^β b^(1−β)   for a, b ≥ 0 and 0 ≤ β ≤ 1

• Proof: if a ≥ b, then (a/b)^β ≥ 1, so (a/b)^β b ≥ b, i.e., a^β b^(1−β) ≥ b = min[a, b]
Chernoff Bound

P(error) ≤ P^β(ω1) P^(1−β)(ω2) ∫ p^β(x | ω1) p^(1−β)(x | ω2) dx   for 0 ≤ β ≤ 1

Note that the integral is over all of feature space; there is no need to impose integration limits. If the class-conditional densities are normal, the integral can be evaluated analytically, yielding

∫ p^β(x | ω1) p^(1−β)(x | ω2) dx = e^(−k(β))

where

k(β) = [β(1−β)/2] (µ2 − µ1)^t [βΣ1 + (1−β)Σ2]^(−1) (µ2 − µ1)
     + (1/2) ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) )

The tightest (Chernoff) bound is obtained by minimizing e^(−k(β)) over β.
Bhattacharyya Bound

• Special case of the Chernoff bound with β = 1/2:

P(error) ≤ √(P(ω1) P(ω2)) e^(−k(1/2))

where

k(1/2) = (1/8) (µ2 − µ1)^t [(Σ1 + Σ2)/2]^(−1) (µ2 − µ1) + (1/2) ln( |(Σ1 + Σ2)/2| / √(|Σ1| |Σ2|) )
Bhattacharyya versus Chernoff Error Bounds
(as the value of β is varied)

The Chernoff bound is never looser than the Bhattacharyya bound. Here the Chernoff bound occurs at β* = 0.66 and is slightly tighter than the Bhattacharyya bound (β = 0.5).

[Figure: e^(−k(β)) as a function of β, with the minimum at β* and the β = 0.5 value marked.]
Example of Bhattacharyya Bound with Normal Densities

Using the same two Gaussians as before:

µ1 = [3, 6]^t,   µ2 = [3, −2]^t
Σ1 = diag(1/2, 2),   Σ2 = diag(2, 2)

k(1/2) = 4.06, giving P(error) < 0.0087

[Figure: the two point sets and the decision boundary in the (x1, x2) plane, as in the earlier example.]
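A minimal sketch (NumPy; the grid search over β is an illustrative choice) that evaluates k(β) and the resulting bounds for this example; the numbers should come out close to the values quoted above:

```python
import numpy as np

mu1, mu2 = np.array([3.0, 6.0]), np.array([3.0, -2.0])
S1, S2 = np.diag([0.5, 2.0]), np.diag([2.0, 2.0])
P1 = P2 = 0.5

def k(beta):
    """Chernoff exponent k(beta) for two Gaussian class-conditionals."""
    S = beta * S1 + (1 - beta) * S2
    dm = mu2 - mu1
    quad = beta * (1 - beta) / 2 * dm @ np.linalg.inv(S) @ dm
    logterm = 0.5 * np.log(np.linalg.det(S) /
                           (np.linalg.det(S1)**beta * np.linalg.det(S2)**(1 - beta)))
    return quad + logterm

def bound(beta):
    """Chernoff bound: P(error) <= P1^beta * P2^(1-beta) * exp(-k(beta))."""
    return P1**beta * P2**(1 - beta) * np.exp(-k(beta))

print("k(1/2) =", k(0.5), " Bhattacharyya bound:", bound(0.5))

# Grid search for the beta that gives the tightest Chernoff bound
betas = np.linspace(0.01, 0.99, 99)
best = betas[np.argmin([bound(b) for b in betas])]
print("Chernoff bound minimized near beta* =", best, ":", bound(best))
```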
Receiver Operating Characteristics

• The distance between Gaussian distributions is useful in experimental psychology, radar detection, and medical diagnosis
• We are interested in detecting a weak pulse or a dim flash of light
• The detector output x has mean µ1 when the signal is absent and µ2 when the signal is present:

p(x | ωi) ~ N(µi, σ²)
Four Types of Probability in Two-Class Discrimination

When no signal is present: p(x | ω1) ~ N(µ1, σ²)
When the signal is present: p(x | ω2) ~ N(µ2, σ²)

For a decision threshold x* (decide "signal present" when x > x*):

• Hit: P(x > x* | ω2)
• False alarm: P(x > x* | ω1)
• Miss: P(x < x* | ω2)
• Correct rejection: P(x < x* | ω1)

Discriminability:

d' = |µ2 − µ1| / σ

The decision threshold determines the probabilities of hit and false alarm.
ROC Curve

Plotting the probability of a hit against the probability of a false alarm as the decision threshold x* is varied traces out the receiver operating characteristic (ROC) curve. Each value of the discriminability d' = |µ2 − µ1| / σ yields a different curve.

[Figure: ROC curves (probability of hit vs. probability of false alarm) for several values of d'.]
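A minimal sketch of this sweep (assuming SciPy is available for the normal CDF; the particular µ and σ values are illustrative, not from the lecture):

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters: noise-only N(mu1, sigma^2), signal N(mu2, sigma^2)
mu1, mu2, sigma = 0.0, 2.0, 1.0
d_prime = abs(mu2 - mu1) / sigma          # discriminability d'

# Sweep the decision threshold x*: decide "signal present" when x > x*
thresholds = np.linspace(-4, 6, 201)
p_hit = 1 - norm.cdf(thresholds, loc=mu2, scale=sigma)   # P(x > x* | signal)
p_fa = 1 - norm.cdf(thresholds, loc=mu1, scale=sigma)    # P(x > x* | no signal)

print("d' =", d_prime)
for t in (0.0, 1.0, 2.0):
    i = np.argmin(abs(thresholds - t))
    print(f"x* = {t}: hit = {p_hit[i]:.3f}, false alarm = {p_fa[i]:.3f}")
```

Plotting p_hit against p_fa gives the ROC curve itself; repeating the sweep for other values of d' gives the family of curves in the figure above.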
ROCs Need Not Be Symmetric

When the distributions are not Gaussian, the ROC curve (probability of hit vs. probability of false alarm) need not be symmetric.

[Figure: an asymmetric ROC curve for non-Gaussian distributions.]
Bayes Decision Theory: Discrete Features
Bayes Decision Theory – Discrete Features

• The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm
• Probability density functions are replaced by probabilities:

P(ωj | x) = P(x | ωj) P(ωj) / P(x)

where

P(x) = Σ_{j=1}^{c} P(x | ωj) P(ωj)
Independent Binary Features

Two-category problem. Let x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities

pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)

Assuming conditional independence,

P(x | ω1) = Π_{i=1}^{d} pi^xi (1 − pi)^(1−xi)

and

P(x | ω2) = Π_{i=1}^{d} qi^xi (1 − qi)^(1−xi)

so the likelihood ratio is

P(x | ω1) / P(x | ω2) = Π_{i=1}^{d} (pi / qi)^xi ((1 − pi) / (1 − qi))^(1−xi)
Bayes discriminant function for independent binary features:

g(x) = Σ_{i=1}^{d} wi xi + w0

where

wi = ln [ pi (1 − qi) / (qi (1 − pi)) ],   i = 1, …, d

and

w0 = Σ_{i=1}^{d} ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]

Decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0.
Bayesian Decisions for 3-D Binary Data

P(ω1) = P(ω2) = 0.5, with pi = 0.8 and qi = 0.5 for i = 1, 2, 3. If instead p3 = q3, then w3 = 0: the third feature gives no information and the decision surface g(x) = 0 becomes parallel to the x3 axis.

[Figure: the decision surface g(x) = 0 in the unit cube for the two cases.]
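A minimal NumPy sketch (not from the lecture) of this discriminant for the pi = 0.8, qi = 0.5 case; it enumerates all eight binary vectors and shows which side of g(x) = 0 each falls on:

```python
import numpy as np
from itertools import product

p = np.array([0.8, 0.8, 0.8])   # P(x_i = 1 | w1)
q = np.array([0.5, 0.5, 0.5])   # P(x_i = 1 | w2)
P1 = P2 = 0.5

w = np.log(p * (1 - q) / (q * (1 - p)))                   # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)  # bias w_0

for x in product([0, 1], repeat=3):
    gx = w @ np.array(x) + w0
    print(x, f"g = {gx:+.3f}", "-> w1" if gx > 0 else "-> w2")
# With these numbers, g(x) > 0 exactly when at least two features equal 1.
```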
Missing and Noisy Features

• Features may be corrupted by a known noise source
  Ex: variability of the light source may degrade a measurement of lightness
• Features may be missing
  Ex: occlusion prevents measurement of length
Example of Missing Feature

Four categories with equal priors and the class-conditional distributions shown. Here x1 is missing and the other feature has the value x2. We want to classify as ω2, since it has the largest likelihood at that value. Substituting the mean of the missing feature (taken over all classes) will result in worse performance!

[Figure: four class-conditional densities over (x1, x2), with the observed value of x2 marked.]
Missing Feature Analysis

Partition the feature vector into good features xg and bad (unknown or missing) features xb. Marginalize over all values of the missing features:

P(ωi | xg) = p(ωi, xg) / p(xg)
           = ∫ P(ωi | xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb
           = ∫ gi(x) p(x) dxb / ∫ p(x) dxb

where gi(x) is the Bayes discriminant function.
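A minimal numerical sketch of this marginalization (the two-class Gaussian setup, the observed value, and the grid integration are all illustrative assumptions, not from the lecture): it computes the posteriors from x2 alone by integrating out the missing x1.

```python
import numpy as np

# Two illustrative 2-D Gaussian classes with equal priors
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
S = [np.eye(2), np.eye(2)]

def pdf(x, m, s):
    """2-D Gaussian density at x."""
    d = x - m
    return np.exp(-0.5 * d @ np.linalg.inv(s) @ d) / (2 * np.pi * np.sqrt(np.linalg.det(s)))

x2 = 0.8                            # observed "good" feature; x1 is missing
grid = np.linspace(-10.0, 12.0, 2201)   # integration grid over the missing x1
dx = grid[1] - grid[0]

# P(w_i | x2) ∝ ∫ p([x1, x2] | w_i) dx1   (equal priors, so priors cancel)
post = np.array([sum(pdf(np.array([x1, x2]), mu[i], S[i]) for x1 in grid) * dx
                 for i in (0, 1)])
post /= post.sum()
print("P(w1 | x2), P(w2 | x2) =", post)
```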
Noisy Features

• Uncorrupted good features xg
• A noise model p(xb | xt), where xt is the true value underlying the observed value xb
• Assume that if xt were known, xb would be independent of ωi and xg; then

P(ωi | xg, xb) = ∫ gi(x) p(x) p(xb | xt) dxt / ∫ p(x) p(xb | xt) dxt

where x = (xg, xt). The integral is weighted by the noise model.