LECTURE 4: Bayesian Decision Theory
- The Likelihood Ratio Test
- The Probability of Error
- The Bayes Risk
- Bayes, MAP and ML Criteria
- Multi-class problems
- Discriminant Functions
Introduction to Pattern Analysis
Ricardo Gutierrez-Osuna
Texas A&M University
Likelihood Ratio Test (LRT)
- Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x
- Would you agree that a reasonable decision rule would be the following?
  "Choose the class that is most 'probable' given the observed feature vector x"
- More formally: evaluate the posterior probability of each class P(ωi|x) and choose the class with the largest P(ωi|x)
- Let us examine the implications of this decision rule for a 2-class problem
  - In this case the decision rule becomes: if P(ω1|x) > P(ω2|x) choose ω1, else choose ω2
  - Or, in a more compact form

    $$ P(\omega_1|x) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(\omega_2|x) $$
  - Applying Bayes Rule

    $$ \frac{P(x|\omega_1)P(\omega_1)}{P(x)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(x|\omega_2)P(\omega_2)}{P(x)} $$
  - P(x) does not affect the decision rule, so it can be eliminated*. Rearranging the previous expression

    $$ \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)} $$

  - The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test
*P(x) can be disregarded in the decision rule since it is constant regardless of the class ωi. However, P(x) will be needed if we want to estimate the posterior P(ωi|x) which, unlike P(x|ωi)P(ωi), is a true probability value and, therefore, gives us an estimate of the "goodness" of our decision.
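- As an illustration (not part of the original notes), here is a minimal Python sketch of the LRT for a generic 2-class problem; the likelihood functions and priors are placeholders to be supplied by the user.

```python
import numpy as np

def lrt_decide(x, p_x_given_w1, p_x_given_w2, prior_w1=0.5, prior_w2=0.5):
    """Likelihood Ratio Test for a 2-class problem.

    Chooses omega_1 (returns 1) wherever Lambda(x) = P(x|w1)/P(x|w2)
    exceeds the threshold P(w2)/P(w1); otherwise chooses omega_2 (returns 2).
    """
    lam = p_x_given_w1(x) / p_x_given_w2(x)   # likelihood ratio Lambda(x)
    threshold = prior_w2 / prior_w1           # P(w2)/P(w1)
    return np.where(lam > threshold, 1, 2)
```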
Likelihood Ratio Test: an example
- Given a classification problem with the following class conditional densities, derive a decision rule based on the Likelihood Ratio Test (assume equal priors)

  $$ P(x|\omega_1) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-4)^2} \qquad P(x|\omega_2) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-10)^2} $$
- Solution
  - Substituting the given likelihoods and priors into the LRT expression:

    $$ \Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-4)^2}}{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-10)^2}} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 1 $$

  - Simplifying the LRT expression:

    $$ \Lambda(x) = e^{-\frac{1}{2}(x-4)^2 + \frac{1}{2}(x-10)^2} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 1 $$

  - Changing signs and taking logs:

    $$ (x-4)^2 - (x-10)^2 \;\underset{\omega_2}{\overset{\omega_1}{\lessgtr}}\; 0 $$

  - Which yields:

    $$ x \;\underset{\omega_2}{\overset{\omega_1}{\lessgtr}}\; 7 $$

  - This LRT result makes sense from an intuitive point of view since the likelihoods are identical and differ only in their mean value

  [Figure: the two likelihoods P(x|ω1) and P(x|ω2), centered at x = 4 and x = 10, with decision regions R1 (say ω1, x < 7) and R2 (say ω2, x > 7)]
- How would the LRT decision rule change if, say, the priors were such that P(ω1) = 2P(ω2)?
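- A short numeric check (my own illustration, not from the slides): evaluating the log-likelihood ratio for this example confirms the x = 7 boundary for equal priors, and shows how the boundary shifts when P(ω1) = 2P(ω2).

```python
import numpy as np

# Log-likelihood ratio of the example: ln Lambda(x) = -0.5(x-4)^2 + 0.5(x-10)^2
def log_lrt(x):
    return -0.5 * (x - 4) ** 2 + 0.5 * (x - 10) ** 2

x = np.linspace(0, 14, 1401)

# Equal priors: choose omega_1 where ln Lambda(x) > ln(1) = 0
boundary_equal = x[np.argmin(np.abs(log_lrt(x) - np.log(1.0)))]

# P(w1) = 2 P(w2): the threshold becomes P(w2)/P(w1) = 1/2
boundary_double_prior = x[np.argmin(np.abs(log_lrt(x) - np.log(0.5)))]

print(boundary_equal)         # ~7.0
print(boundary_double_prior)  # ~7.12, i.e. the omega_1 region grows when its prior doubles
```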
The probability of error (1)
- The performance of any decision rule can be measured by its probability of error P[error] which, making use of the Theorem of total probability (Lecture 2), can be broken up as

  $$ P[error] = \sum_{i=1}^{C} P[error|\omega_i]\,P[\omega_i] $$
- The class-conditional probability of error P[error|ωi] can be expressed as

  $$ P[error|\omega_i] = P[\text{choose } \omega_j\,|\,\omega_i] = \int_{R_j} P(x|\omega_i)\,dx $$
- So, for our 2-class problem, the probability of error becomes

  $$ P[error] = P[\omega_1]\underbrace{\int_{R_2} P(x|\omega_1)\,dx}_{\varepsilon_1} + P[\omega_2]\underbrace{\int_{R_1} P(x|\omega_2)\,dx}_{\varepsilon_2} $$

  where εi is the integral of the likelihood P(x|ωi) over the region Rj where we choose ωj
- For the decision rule of the previous example, the integrals ε1 and ε2 are depicted below
  - Since we assumed equal priors, P[error] = (ε1 + ε2)/2
  [Figure: the two likelihoods P(x|ω1) and P(x|ω2) with decision regions R1 (say ω1) and R2 (say ω2); ε2 is the tail of P(x|ω2) falling in R1 and ε1 is the tail of P(x|ω1) falling in R2]
- Exercise: compute the probability of error for the example above
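- One possible way to carry out this computation (my own sketch, using the boundary x = 7 from the previous example): each ε term is a Gaussian tail probability, which can be evaluated with the standard normal CDF.

```python
from scipy.stats import norm

T = 7.0   # decision boundary from the previous example (equal priors)

# eps1: mass of P(x|w1) = N(4,1) falling in R2 = {x > T}
eps1 = 1.0 - norm.cdf(T, loc=4, scale=1)
# eps2: mass of P(x|w2) = N(10,1) falling in R1 = {x < T}
eps2 = norm.cdf(T, loc=10, scale=1)

p_error = 0.5 * (eps1 + eps2)   # equal priors
print(eps1, eps2, p_error)      # each tail is Phi(-3) ~ 0.00135, so P[error] ~ 0.00135
```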
The probability of error (2)
- Now that we can measure the performance of a decision rule, we can ask the following question: how good is the Likelihood Ratio Test decision rule?
  - For this purpose it is convenient to express P[error] in terms of the posterior P[error|x]

    $$ P[error] = \int_{-\infty}^{+\infty} P[error|x]\,P(x)\,dx $$
  - The optimal decision rule will minimize P[error|x] for every value of x, so that the integral above is minimized
  - At each point x', if we choose class ωj then P[error|x'] equals P(ωi|x'), the posterior of the class we did not choose
- This is depicted in the following figure:

  [Figure: posteriors P(ω1|x) and P(ω2|x) versus x; at an arbitrary point x', P[error|x'] for an alternative (ALT) decision rule, with regions R1,ALT and R2,ALT, is larger than P[error|x'] for the LRT decision rule, whose regions R1,LRT and R2,LRT split x at the posterior crossover]
  - From the figure it becomes clear that, for any value of x', the Likelihood Ratio Test decision rule will always have an equal or lower P[error|x']
- Therefore, when we integrate over the real line, the LRT decision rule will yield a lower P[error]
  For any given problem, the minimum probability of error is achieved by the Likelihood Ratio Test decision rule. This probability of error is called the Bayes Error Rate and is the BEST any classifier can do.
The Bayes Risk (1)
g
So far we have assumed that the penalty of misclassifying a class ω1 example
as class ω2 is the same as the reciprocal. In general, this is not the case:
n
g
g
For example, misclassifying a cancer sufferer as a healthy patient is a much more serious
problem than the other way around
This concept can be formalized in terms of a cost function Cij
n C represents the cost of choosing class ω when class ω is the true class
ij
i
j
- We define the Bayes Risk as the expected value of the cost

  $$ \Re = E[C] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\,P[\text{choose } \omega_i \text{ and } x \in \omega_j] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\,P[x \in R_i\,|\,\omega_j]\,P[\omega_j] $$
- What is the decision rule that minimizes the Bayes Risk?
  - First notice that

    $$ P[x \in R_i\,|\,\omega_j] = \int_{R_i} P(x|\omega_j)\,dx $$
  - We can express the Bayes Risk as

    $$ \Re = \int_{R_1} \big[ C_{11} P[\omega_1] P(x|\omega_1) + C_{12} P[\omega_2] P(x|\omega_2) \big]\,dx + \int_{R_2} \big[ C_{21} P[\omega_1] P(x|\omega_1) + C_{22} P[\omega_2] P(x|\omega_2) \big]\,dx $$
  - Then we note that, for either likelihood,

    $$ \int_{R_1} P(x|\omega_i)\,dx + \int_{R_2} P(x|\omega_i)\,dx = \int_{R_1 \cup R_2} P(x|\omega_i)\,dx = 1 $$
The Bayes Risk (2)
  - Merging the last equation into the Bayes Risk expression (adding and subtracting the same R1 integrals so that the R2 integrals can be grouped) yields

    $$ \begin{aligned} \Re = \; & C_{11} P[\omega_1]\!\int_{R_1}\! P(x|\omega_1)\,dx + C_{12} P[\omega_2]\!\int_{R_1}\! P(x|\omega_2)\,dx \\ + \; & C_{21} P[\omega_1]\!\int_{R_2}\! P(x|\omega_1)\,dx + C_{22} P[\omega_2]\!\int_{R_2}\! P(x|\omega_2)\,dx \\ + \; & C_{21} P[\omega_1]\!\int_{R_1}\! P(x|\omega_1)\,dx + C_{22} P[\omega_2]\!\int_{R_1}\! P(x|\omega_2)\,dx \\ - \; & C_{21} P[\omega_1]\!\int_{R_1}\! P(x|\omega_1)\,dx - C_{22} P[\omega_2]\!\int_{R_1}\! P(x|\omega_2)\,dx \end{aligned} $$
  - Now we group each R2 integral with its matching R1 integral (their sum is 1), which cancels out all the integrals over R2

    $$ \Re = C_{21} P[\omega_1] + C_{22} P[\omega_2] + \underbrace{(C_{12} - C_{22})}_{>\,0} P[\omega_2]\!\int_{R_1}\! P(x|\omega_2)\,dx - \underbrace{(C_{21} - C_{11})}_{>\,0} P[\omega_1]\!\int_{R_1}\! P(x|\omega_1)\,dx $$

    where both bracketed terms are positive, since an error is assumed to cost more than a correct decision
  - The first two terms are constant as far as our minimization is concerned, since they do not depend on R1, so we will be seeking a decision region R1 that minimizes

    $$ R_1 = \arg\min_{R_1} \left\{ \int_{R_1} \big[ (C_{12} - C_{22}) P[\omega_2] P(x|\omega_2) - (C_{21} - C_{11}) P[\omega_1] P(x|\omega_1) \big]\,dx \right\} = \arg\min_{R_1} \left\{ \int_{R_1} g(x)\,dx \right\} $$
The Bayes Risk (3)
- Let's forget about the actual expression of g(x) to develop some intuition for what kind of decision region R1 we are looking for
  - Intuitively, we will select for R1 those regions that minimize the integral of g(x) over R1
  - In other words, those regions where g(x) < 0

  [Figure: a curve g(x) crossing zero several times; R1 = R1A ∪ R1B ∪ R1C is the union of the intervals where g(x) < 0]
  - So we will choose R1 such that

    $$ (C_{21} - C_{11}) P[\omega_1] P(x|\omega_1) \;\overset{\omega_1}{>}\; (C_{12} - C_{22}) P[\omega_2] P(x|\omega_2) $$

  - And rearranging

    $$ \frac{P(x|\omega_1)}{P(x|\omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{(C_{12} - C_{22})\,P[\omega_2]}{(C_{21} - C_{11})\,P[\omega_1]} $$
  - Therefore, minimization of the Bayes Risk also leads to a Likelihood Ratio Test
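- As an illustration of this result (my own sketch, not part of the lecture), the risk-minimizing decision only changes the LRT threshold so that it includes the costs:

```python
import numpy as np

def bayes_risk_decide(x, p_x_given_w1, p_x_given_w2, prior_w1, prior_w2,
                      C11=0.0, C12=1.0, C21=1.0, C22=0.0):
    """Risk-minimizing LRT: the costs only move the decision threshold."""
    lam = p_x_given_w1(x) / p_x_given_w2(x)                          # Lambda(x)
    threshold = ((C12 - C22) * prior_w2) / ((C21 - C11) * prior_w1)
    return np.where(lam > threshold, 1, 2)                           # 1 -> omega_1, 2 -> omega_2
```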
The Bayes Risk: an example
- Consider a classification problem with two classes defined by the following likelihood functions

  $$ P(x|\omega_1) = \frac{1}{\sqrt{2\pi}\sqrt{3}}\,e^{-\frac{1}{2}\frac{x^2}{3}} \qquad P(x|\omega_2) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-2)^2} $$

  - Sketch the two densities
  - What is the likelihood ratio?
  - Assume P[ω1] = P[ω2] = 0.5, C11 = C22 = 0, C12 = 1 and C21 = √3. Determine a decision rule that minimizes the Bayes Risk
- Solution
  - Substituting the likelihoods, priors and costs into the LRT expression:

    $$ \Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}\sqrt{3}}\,e^{-\frac{1}{2}\frac{x^2}{3}}}{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-2)^2}} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{(C_{12}-C_{22})\,P[\omega_2]}{(C_{21}-C_{11})\,P[\omega_1]} = \frac{1}{\sqrt{3}} $$

  - Simplifying (the factor 1/√3 cancels against the threshold):

    $$ e^{-\frac{1}{2}\frac{x^2}{3} + \frac{1}{2}(x-2)^2} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 1 $$

  - Taking logs:

    $$ -\frac{1}{2}\frac{x^2}{3} + \frac{1}{2}(x-2)^2 \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 0 $$

  - Multiplying by 6 and expanding:

    $$ 2x^2 - 12x + 12 \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 0 \;\;\Rightarrow\;\; x = 1.27,\; 4.73 $$

  - So we choose ω1 for x < 1.27 or x > 4.73, and ω2 for 1.27 < x < 4.73

  [Figure: the two likelihoods, a broad zero-mean Gaussian P(x|ω1) and a narrower Gaussian P(x|ω2) centered at x = 2, plotted over −6 ≤ x ≤ 6, with the resulting decision regions R1 (x < 1.27 and x > 4.73) and R2 (1.27 < x < 4.73)]
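- A quick numeric check of this example (my own sketch, not from the slides): the quadratic's roots and a few sample decisions can be verified directly.

```python
import numpy as np

# Likelihoods of the example: omega_1 ~ N(0, variance 3), omega_2 ~ N(2, variance 1)
def p1(x): return np.exp(-x ** 2 / 6.0) / np.sqrt(2 * np.pi * 3)
def p2(x): return np.exp(-(x - 2) ** 2 / 2.0) / np.sqrt(2 * np.pi)

# Bayes threshold with P[w1]=P[w2]=0.5, C11=C22=0, C12=1, C21=sqrt(3)
threshold = (1 - 0) * 0.5 / ((np.sqrt(3) - 0) * 0.5)   # = 1/sqrt(3)

print(np.roots([2, -12, 12]))                          # boundaries ~ [4.73, 1.27]
for x in (0.0, 3.0, 6.0):
    choice = 1 if p1(x) / p2(x) > threshold else 2
    print(x, choice)                                   # omega_1, omega_2, omega_1
```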
Variations of the Likelihood Ratio Test (1)
- The LRT decision rule that minimizes the Bayes Risk is commonly called the Bayes Criterion

  $$ \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{(C_{12} - C_{22})\,P[\omega_2]}{(C_{21} - C_{11})\,P[\omega_1]} \qquad \text{(Bayes Criterion)} $$
- Many times we will simply be interested in minimizing the probability of error, which is a special case of the Bayes Criterion that uses the so-called symmetrical or zero-one cost function. This version of the LRT decision rule is referred to as the Maximum A Posteriori (MAP) Criterion, since it seeks to maximize the posterior P(ωi|x)

  $$ C_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \;\Rightarrow\; \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)} \;\Leftrightarrow\; P(\omega_1|x) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(\omega_2|x) \qquad \text{(MAP Criterion)} $$
- Finally, for the case of equal priors P(ωi) = 1/C and the zero-one cost function, the LRT decision rule is called the Maximum Likelihood (ML) Criterion, since it seeks to maximize the likelihood P(x|ωi)

  $$ \left. \begin{aligned} C_{ij} &= \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \\ P(\omega_i) &= \tfrac{1}{C} \;\; \forall i \end{aligned} \right\} \;\Rightarrow\; \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 1 \qquad \text{(ML Criterion)} $$
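- To make the difference concrete (my own sketch, with illustrative Gaussian classes and priors, not from the slides): with strongly unequal priors the MAP and ML criteria can disagree at the same point x.

```python
import numpy as np
from scipy.stats import norm

# Two hypothetical Gaussian classes with strongly unequal priors (illustrative values)
priors = np.array([0.9, 0.1])
def likelihoods(x):
    return np.array([norm.pdf(x, loc=0, scale=1), norm.pdf(x, loc=1, scale=1)])

x = 0.8
lik = likelihoods(x)
ml_choice  = np.argmax(lik)             # ML: ignores the priors
map_choice = np.argmax(lik * priors)    # MAP: weights the likelihoods by the priors
print(ml_choice, map_choice)            # ML picks class 1, MAP picks class 0 at this x
```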
Variations of the Likelihood Ratio Test (2)
- Two more decision rules are commonly cited in the related literature
  - The Neyman-Pearson Criterion, used in Detection and Estimation Theory, also leads to an LRT decision rule; it fixes one of the class error probabilities, say ε1 ≤ α, and seeks to minimize the other
    - For instance, for the sea-bass/salmon classification problem of Lecture 1, there may be some kind of government regulation stating that we must not misclassify more than 1% of salmon as sea bass
    - The Neyman-Pearson Criterion is very attractive since it does not require knowledge of the priors or the cost function
  - The Minimax Criterion, used in Game Theory, is derived from the Bayes Criterion and seeks to minimize the maximum Bayes Risk
    - The Minimax Criterion does not require knowledge of the priors, but it does need a cost function
  - For more information on these methods, the reader is referred to "Detection, Estimation and Modulation Theory" by H.L. van Trees, the classical reference in this field
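- As a sketch of the Neyman-Pearson idea (my own illustration, reusing the two unit-variance Gaussians of the earlier LRT example and a single-threshold rule): fix ε1 at α and place the threshold accordingly, without ever using priors or costs.

```python
from scipy.stats import norm

alpha = 0.01   # allowed error rate for omega_1 (e.g. misclassify at most 1% of salmon)

# With P(x|w1) = N(4,1), P(x|w2) = N(10,1) and R2 = {x > T}:
#   eps1 = P(x > T | w1) = alpha   =>   T = 4 + Phi^{-1}(1 - alpha)
T = 4 + norm.ppf(1 - alpha)
eps2 = norm.cdf(T, loc=10, scale=1)   # the omega_2 error that results from this choice

print(T, eps2)   # T ~ 6.33 and the resulting eps2 ~ 1.2e-4
```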
Minimum P[error] rule for multi-class problems
- The decision rule that minimizes P[error] generalizes very easily to multi-class problems
  - For clarity in the derivation, the probability of error is better expressed in terms of the probability of making a correct assignment

    $$ P[error] = 1 - P[correct] $$
n
The probability of making a correct assignment is
C
P[correct ] = ∑ P(ωi ) ∫ P( x |ωi )dx
i=1
n
Ri
  - The problem of minimizing P[error] is equivalent to that of maximizing P[correct]. Expressing P[correct] in terms of the posteriors:

    $$ P[correct] = \sum_{i=1}^{C} P(\omega_i) \int_{R_i} P(x|\omega_i)\,dx = \sum_{i=1}^{C} \int_{R_i} P(x|\omega_i) P(\omega_i)\,dx = \sum_{i=1}^{C} \underbrace{\int_{R_i} P(\omega_i|x) P(x)\,dx}_{\Im_i} $$
  - In order to maximize P[correct], we will have to maximize each of the integrals ℑi. In turn, each integral ℑi will be maximized by choosing the class ωi that yields the maximum P(ωi|x) ⇒ we define Ri to be the region where P(ωi|x) is maximum

  [Figure: posteriors P(ω1|x), P(ω2|x) and P(ω3|x) versus x; the feature space is partitioned into regions R1, R2, R3 (and back to R2, R1) according to which posterior is largest]
- Therefore, the decision rule that minimizes P[error] is the MAP Criterion
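- A minimal multi-class version of this rule (my own sketch, with hypothetical Gaussian classes): compute P(x|ωi)P(ωi) for every class and pick the argmax, which is equivalent to picking the largest posterior since P(x) is common to all classes.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 3-class problem: unit-variance Gaussian likelihoods, different means/priors
means  = np.array([0.0, 2.0, 5.0])
priors = np.array([0.5, 0.3, 0.2])

def map_decide(x):
    """Minimum-P[error] (MAP) rule: argmax_i of P(x|w_i) P(w_i)."""
    scores = norm.pdf(x, loc=means, scale=1.0) * priors
    return int(np.argmax(scores))

print([map_decide(x) for x in (-1.0, 1.5, 4.0)])   # -> [0, 1, 2]
```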
Minimum Bayes Risk for multi-class problems
- To determine which decision rule yields the minimum Bayes Risk for the multi-class problem we will use a slightly different formulation
  - We will denote by αi the decision to choose class ωi
  - We will denote by α(x) the overall decision rule that maps features x into classes: α(x) → {α1, α2, …, αC}
- The (conditional) risk ℜ(αi|x) of assigning a feature x to class ωi is

  $$ \Re(\alpha(x) \to \alpha_i) = \Re(\alpha_i|x) = \sum_{j=1}^{C} C_{ij}\,P(\omega_j|x) $$
- And the Bayes Risk associated with the decision rule α(x) is

  $$ \Re(\alpha(x)) = \int \Re(\alpha(x)|x)\,P(x)\,dx $$

- In order to minimize this expression, we will have to minimize the conditional risk ℜ(α(x)|x) at each point x in the feature space, which in turn is equivalent to choosing the class ωi for which ℜ(αi|x) is minimum

  [Figure: conditional risks ℜ(α1|x), ℜ(α2|x) and ℜ(α3|x) versus x; the feature space is partitioned into regions R2, R1, R2, R3, R2, R1, R2 according to which conditional risk is lowest]
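- As a sketch (my own illustration, with a hypothetical cost matrix): the minimum-risk decision at each x is the argmin of the cost-weighted posteriors.

```python
import numpy as np

def min_risk_decide(posteriors, cost):
    """Choose alpha_i minimizing the conditional risk R(a_i|x) = sum_j C_ij P(w_j|x).

    posteriors: shape (C,), the P(w_j|x) at the point x
    cost:       shape (C, C), the costs C_ij
    """
    cond_risk = cost @ posteriors      # R(a_i|x) for every candidate decision i
    return int(np.argmin(cond_risk))

# Hypothetical costs: deciding class 1 when class 0 is true (C_10) is ten times
# costlier than the converse error (C_01)
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])
print(min_risk_decide(np.array([0.3, 0.7]), cost))    # 0: class 1 is likelier but too risky
print(min_risk_decide(np.array([0.05, 0.95]), cost))  # 1: chosen only once P(w_1|x) is very high
```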
Discriminant functions
- All the decision rules we have presented in this lecture have the same structure
  - At each point x in feature space, choose the class ωi which maximizes (or minimizes) some measure gi(x)
- This structure can be formalized with a set of discriminant functions gi(x), i = 1..C, and the following decision rule
  "assign x to class ωi if gi(x) > gj(x) ∀ j ≠ i"
- Therefore, we can visualize the decision rule as a network or machine that computes C discriminant functions and selects the category corresponding to the largest discriminant. Such a network is depicted in the following figure (presented already in Lecture 1)
  [Figure: a network with inputs x1, x2, x3, …, xd (features), a bank of discriminant functions g1(x), g2(x), …, gC(x) that may incorporate costs, and a "select max" stage that produces the class assignment]
- Finally, we express the three basic decision rules, Bayes, MAP and ML, in terms of discriminant functions to show the generality of this formulation

  Criterion   Discriminant Function
  Bayes       gi(x) = -ℜ(αi|x)
  MAP         gi(x) = P(ωi|x)
  ML          gi(x) = P(x|ωi)
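- To close the loop (my own sketch, not from the slides), the table above maps directly onto code: the classifier is just an argmax over whatever discriminant functions gi we plug in; the Gaussian likelihoods below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def classify(x, discriminants):
    """Generic decision machine: evaluate every g_i(x) and select the largest."""
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))

# ML example with two hypothetical Gaussian likelihoods as the discriminants g_i(x) = P(x|w_i).
# For MAP one would pass posteriors P(w_i|x); for Bayes, the negated risks -R(alpha_i|x).
ml_discriminants = [lambda x: norm.pdf(x, loc=0, scale=1),
                    lambda x: norm.pdf(x, loc=3, scale=1)]
print(classify(2.0, ml_discriminants))   # 1: x = 2 is closer to the second class mean
```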