Bayesian Decision Theory
• Assume we know the probability distributions of the categories
• Almost never the case in real life!
• Nevertheless useful, since other cases can be reduced to this one after some work
• Does not even need training data
• Can design the optimal classifier
Bayesian Decision Theory
Fish Example:
• Each fish is in one of 2 states: sea bass or salmon
• Let w denote the state of nature
  • w = w1 for sea bass
  • w = w2 for salmon
• The state of nature is unpredictable, so w is a variable that must be described probabilistically
• If the catch produced as many salmon as sea bass, the next fish is equally likely to be sea bass or salmon

Define:
• P(w1): a priori probability that the next fish is sea bass
• P(w2): a priori probability that the next fish is salmon
Bayesian Decision Theory
• If other types of fish are irrelevant: P(w1) + P(w2) = 1
• Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, …)
• Simple decision rule: make a decision without seeing the fish.
  Decide w1 if P(w1) > P(w2); decide w2 otherwise
  • OK if deciding for one fish
  • If there are several fish, they are all assigned to the same class
• In general, we have some features and more information
Cats and Dogs
• Suppose we have these conditional probability mass functions for cats and dogs:
  • P(small ears | dog) = 0.1, P(large ears | dog) = 0.9
  • P(small ears | cat) = 0.8, P(large ears | cat) = 0.2
• We observe an animal with large ears. Dog or cat?
• It makes sense to say dog, because the probability of observing large ears in a dog is much larger than the probability of observing large ears in a cat:
  P(large ears | dog) = 0.9 > 0.2 = P(large ears | cat)
• We choose the event of larger probability, i.e. the maximum likelihood event (see the sketch below)
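A minimal sketch of this maximum-likelihood choice in Python. The dictionary mirrors the numbers above; the function name ml_class is just for illustration.

```python
# Conditional probability mass functions P(observation | class) from the example above.
likelihood = {
    "dog": {"small ears": 0.1, "large ears": 0.9},
    "cat": {"small ears": 0.8, "large ears": 0.2},
}

def ml_class(observation):
    """Return the class under which the observation has the largest probability."""
    return max(likelihood, key=lambda c: likelihood[c][observation])

print(ml_class("large ears"))  # -> dog (0.9 > 0.2)
print(ml_class("small ears"))  # -> cat (0.8 > 0.1)
```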
Example: Fish Sorting
• A respected fish expert says that
  • salmon's length has distribution N(5, 1)
  • sea bass's length has distribution N(10, 4)
• Recall that if a r.v. is N(\mu, \sigma^2), then its density is
  p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
Class Conditional Densities
• Here the class is fixed and the length l varies:
  p(l \mid salmon) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l-5)^2}{2}\right)
  p(l \mid bass) = \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(l-10)^2}{2 \cdot 4}\right)
Likelihood function
• Fix the length, let the fish class vary. Then we get the likelihood function (it is not a density and not a probability mass):
  p(l \mid class) =
  \begin{cases}
  \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l-5)^2}{2}\right) & \text{if } class = salmon \\
  \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(l-10)^2}{8}\right) & \text{if } class = bass
  \end{cases}
  (here l is fixed)
Likelihood vs. Class Conditional Density
[Figure: the two class conditional densities p(l | class) plotted against length, with length 7 marked]
Suppose a fish has length 7. How do we classify it?
ML (maximum likelihood) Classifier
• We would like to choose salmon if
  Pr(length = 7 | salmon) > Pr(length = 7 | bass)
• However, since length is a continuous r.v.,
  Pr(length = 7 | salmon) = Pr(length = 7 | bass) = 0
• Instead, we choose the class which maximizes the likelihood
  p(l \mid salmon) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l-5)^2}{2}\right), \qquad
  p(l \mid bass) = \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(l-10)^2}{2 \cdot 4}\right)
• ML classifier, for an observed l: compare p(l | salmon) with p(l | bass).
  In words: if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass (see the sketch below)
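A small sketch of this rule, assuming only the two Gaussian class conditionals above: it evaluates both likelihoods at the observed length and picks the larger one (the helper names are just for illustration).

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ml_classify(l):
    """ML rule: salmon if p(l | salmon) > p(l | bass), else bass."""
    p_salmon = normal_pdf(l, mean=5, var=1)   # p(l | salmon)
    p_bass = normal_pdf(l, mean=10, var=4)    # p(l | bass)
    return "salmon" if p_salmon > p_bass else "bass"

print(ml_classify(7))  # -> bass: p(7 | salmon) ~ 0.054 < p(7 | bass) ~ 0.065
```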
ML (maximum likelihood) Classifier
[Figure: at length 7, p(7 | bass) > p(7 | salmon)]
Thus we choose the class (bass) which is more likely to have generated the observation.
Decision Boundary
[Figure: along the length axis, lengths below 6.70 are classified as salmon and lengths above 6.70 as sea bass]
How Prior Changes Decision Boundary?
• Without priors:
  [Figure: decision boundary at 6.70; salmon to the left, sea bass to the right on the length axis]
• How should this change with the prior?
  • P(salmon) = 2/3
  • P(bass) = 1/3
  [Figure: same axis, with the new boundary location left as a question mark]
Bayes Decision Rule
1. Have likelihood functions p(length | salmon) and p(length | bass)
2. Have priors P(salmon) and P(bass)
• Question: having observed a fish of a certain length, do we classify it as salmon or bass?
• Natural idea:
  • salmon if P(salmon | length) > P(bass | length)
  • bass if P(bass | length) > P(salmon | length)
Posterior
• P(salmon | length) and P(bass | length) are called posterior distributions, because the data (length) has been revealed (post data)
• How to compute posteriors? Not obvious
• From Bayes rule:
  P(salmon \mid length) = \frac{p(length \mid salmon)\, P(salmon)}{p(length)}
• Similarly:
  P(bass \mid length) = \frac{p(length \mid bass)\, P(bass)}{p(length)}
MAP (maximum a posteriori) classifier
• Decide salmon if P(salmon | length) > P(bass | length), bass otherwise. Equivalently:
  \frac{p(length \mid salmon)\, P(salmon)}{p(length)} \gtrless \frac{p(length \mid bass)\, P(bass)}{p(length)}
• Cancelling the common factor p(length):
  p(length \mid salmon)\, P(salmon) \gtrless p(length \mid bass)\, P(bass)
Back to Fish Sorting Example
• Likelihoods:
  p(l \mid salmon) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l-5)^2}{2}\right), \qquad
  p(l \mid bass) = \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(l-10)^2}{8}\right)
• Priors: P(salmon) = 2/3, P(bass) = 1/3
• Solve the inequality (salmon if the left side is larger, bass otherwise):
  \frac{2}{3} \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l-5)^2}{2}\right) \gtrless \frac{1}{3} \cdot \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(l-10)^2}{8}\right)
  [Figure: the decision boundary moves from 6.70 to the new boundary 7.18; salmon to the left, sea bass to the right on the length axis]
• The new decision boundary makes sense, since we expect to see more salmon (see the sketch below)
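A small sketch, assuming the likelihoods and priors above, that locates the MAP boundary by bisection on the difference of prior-weighted likelihoods (the helper names are illustrative).

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def map_score_diff(l, p_salmon=2/3, p_bass=1/3):
    """Positive when the MAP rule prefers salmon, negative when it prefers bass."""
    return p_salmon * normal_pdf(l, 5, 1) - p_bass * normal_pdf(l, 10, 4)

# Bisection for the decision boundary between the two class means (5 and 10).
lo, hi = 5.0, 10.0
for _ in range(60):
    mid = (lo + hi) / 2
    if map_score_diff(mid) > 0:
        lo = mid   # still on the salmon side
    else:
        hi = mid   # already on the bass side
print(round((lo + hi) / 2, 2))  # -> 7.18, matching the new boundary above
```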
Prior P(s) = 2/3 and P(b) = 1/3 vs. Prior P(s) = 0.999 and P(b) = 0.001
[Figure: with the stronger salmon prior, the salmon/bass decision boundary moves from about 7.1 to about 8.9 on the length axis]
Likelihood vs Posteriors
[Figure: the likelihoods p(l | salmon), p(l | bass) and the posteriors P(salmon | l), P(bass | l) plotted against length]
• The likelihood p(l | fish class) is a density with respect to length: the area under each curve is 1
• The posterior P(fish class | l) is a mass function with respect to fish class, so for each l, P(salmon | l) + P(bass | l) = 1
More on Posterior
  P(c \mid l) = \frac{P(l \mid c)\, P(c)}{P(l)}
• P(c | l): the posterior (our goal)
• P(l | c): the likelihood (given)
• P(c): the prior (given)
• P(l): the normalizing factor; we often do not even need it for classification, since P(l) does not depend on the class c. If we do need it, from the law of total probability:
  P(l) = p(l \mid salmon)\, P(salmon) + p(l \mid bass)\, P(bass)
• Notice this formula consists of likelihoods and priors, which are given (see the sketch below)
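A minimal sketch of computing both posteriors with the law of total probability as the normalizer, assuming the same Gaussians and the 2/3 and 1/3 priors used earlier.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posteriors(l, p_salmon=2/3, p_bass=1/3):
    """Return (P(salmon | l), P(bass | l)) via Bayes rule."""
    lik_salmon = normal_pdf(l, 5, 1)   # p(l | salmon)
    lik_bass = normal_pdf(l, 10, 4)    # p(l | bass)
    evidence = lik_salmon * p_salmon + lik_bass * p_bass   # P(l), law of total probability
    return lik_salmon * p_salmon / evidence, lik_bass * p_bass / evidence

post_s, post_b = posteriors(7)
print(round(post_s, 3), round(post_b, 3))  # -> 0.625 0.375; the two posteriors sum to 1
```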
More on Priors
• The prior comes from prior knowledge; no data has been seen yet
• If there is a reliable source of prior knowledge, it should be used
• Some problems cannot even be solved reliably without a good prior
More on MAP Classifier
  P(c \mid l) = \frac{P(l \mid c)\, P(c)}{P(l)} \qquad \text{(posterior = likelihood } \times \text{ prior / evidence)}
• We do not care about P(l) when maximizing P(c | l):
  P(c \mid l) \propto P(l \mid c)\, P(c)
• If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier: P(c \mid l) \propto P(l \mid c)
• If for some observation l, P(l | salmon) = P(l | bass), then this observation is uninformative and the decision is based solely on the prior: P(c \mid l) \propto P(c)
Justification for MAP Classifier
• Let's compute the probability of error for the MAP rule: decide salmon if P(salmon | l) > P(bass | l), bass otherwise
• For any particular l, the probability of error is
  Pr[error | l] = P(bass | l) if we decide salmon, and P(salmon | l) if we decide bass
• Deciding the more probable class therefore makes Pr[error | l] = min{ P(salmon | l), P(bass | l) }
• Thus the MAP classifier is optimal for each individual l!
Justification for MAP Classifier
• We want to minimize the error not just for one l; we really want to minimize the average error over all l:
  Pr[error] = \int p(error, l)\, dl = \int Pr[error \mid l]\, p(l)\, dl
• If Pr[error | l] is as small as possible for every l, the integral is as small as possible
• But the Bayes rule makes Pr[error | l] as small as possible
• Thus the MAP classifier minimizes the probability of error! (See the simulation sketch below.)
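As a quick sanity check (not part of the argument above), here is a hedged Monte Carlo sketch under the fish model: it draws (class, length) pairs from the 2/3 and 1/3 priors and the two Gaussians, then compares the empirical error of the MAP rule against the prior-free ML rule.

```python
import math, random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

P_SALMON, P_BASS = 2/3, 1/3

def sample():
    """Draw a (true class, length) pair from the generative model."""
    if random.random() < P_SALMON:
        return "salmon", random.gauss(5, 1)   # N(5, 1)
    return "bass", random.gauss(10, 2)        # N(10, 4), i.e. std dev 2

def ml_rule(l):
    return "salmon" if normal_pdf(l, 5, 1) > normal_pdf(l, 10, 4) else "bass"

def map_rule(l):
    return "salmon" if P_SALMON * normal_pdf(l, 5, 1) > P_BASS * normal_pdf(l, 10, 4) else "bass"

trials = [sample() for _ in range(100_000)]
for name, rule in [("ML", ml_rule), ("MAP", map_rule)]:
    errors = sum(rule(l) != c for c, l in trials)
    print(name, errors / len(trials))
# The MAP error rate should come out slightly lower than the ML error rate.
```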
More General Case
• Let's generalize a little bit
• Have more than one feature: x = (x_1, x_2, ..., x_d)
• Have more than 2 classes: {c_1, c_2, ..., c_m}

More General Case
• As before, for each j we have
  • p(x | c_j): the likelihood of observation x given that the true class is c_j
  • P(c_j): the prior probability of class c_j
  • P(c_j | x): the posterior probability of class c_j given that we observed data x
• Evidence, or probability density for the data:
  p(x) = \sum_{j=1}^{m} p(x \mid c_j)\, P(c_j)
Minimum Error Rate Classification
• Want to minimize the average probability of error:
  Pr[error] = \int p(error, x)\, dx = \int Pr[error \mid x]\, p(x)\, dx
  (need to make Pr[error | x] as small as possible for every x)
• Pr[error | x] = 1 - P(c_i | x) if we decide class c_i
• Pr[error | x] is minimized with the MAP classifier: decide on class c_i if
  P(c_i \mid x) > P(c_j \mid x) \quad \forall j \neq i
• The MAP classifier is optimal if we want to minimize the probability of error (see the sketch below)
[Figure: the posteriors P(c_1 | x), P(c_2 | x), P(c_3 | x) and the corresponding errors 1 - P(c_i | x) plotted over x]
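A minimal multi-class sketch: given the likelihoods p(x | c_j) and priors P(c_j) for one observed x, the MAP decision is just an argmax over prior-weighted likelihoods. The specific numbers are made up for illustration.

```python
# Hypothetical likelihoods p(x | c_j) for one observed x, and priors P(c_j).
likelihoods = [0.05, 0.20, 0.10]   # classes c1, c2, c3
priors = [0.5, 0.2, 0.3]

# Unnormalized posteriors p(x | c_j) P(c_j); dividing by p(x) would not change the argmax.
scores = [lik * pri for lik, pri in zip(likelihoods, priors)]
decision = max(range(len(scores)), key=lambda j: scores[j])
print(f"decide c{decision + 1}")   # -> c2 here (0.04 beats 0.025 and 0.03)
```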
General Bayesian Decision Theory
• In some cases we may want to refuse to make a decision (let a human expert handle the tough case)
  • allow actions {α_1, α_2, ..., α_k}
• Suppose some mistakes are more costly than others (classifying a benign tumor as cancer is not as bad as classifying cancer as a benign tumor)
  • allow loss functions λ(α_i | c_j) describing the loss incurred when taking action α_i when the true class is c_j
Conditional Risk
• Suppose we observe x and wish to take action α_i
• If the true class is c_j, by definition we incur the loss λ(α_i | c_j)
• The probability that the true class is c_j after observing x is P(c_j | x)
• The expected loss associated with taking action α_i is called the conditional risk, and it is:
  R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\, P(c_j \mid x)
Conditional Risk
  R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\, P(c_j \mid x)
• R(α_i | x): the penalty for taking action α_i if we observe x
• the sum is over disjoint events (the different classes)
• λ(α_i | c_j): the part of the overall penalty which comes from the event that the true class is c_j
• P(c_j | x): the probability of class c_j given observation x
(See the sketch below.)
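A short sketch of the conditional risk computation; the loss matrix and posteriors are made-up numbers, with action α_i meaning "decide c_i".

```python
# Hypothetical loss matrix: loss[i][j] = lambda(alpha_i | c_j),
# the cost of taking action alpha_i when the true class is c_j.
loss = [
    [0.0, 2.0],   # action "decide c1"
    [1.0, 0.0],   # action "decide c2"
]
posterior = [0.7, 0.3]   # P(c1 | x), P(c2 | x) for the observed x

def conditional_risk(i):
    """R(alpha_i | x) = sum_j lambda(alpha_i | c_j) P(c_j | x)."""
    return sum(loss[i][j] * p for j, p in enumerate(posterior))

risks = [conditional_risk(i) for i in range(len(loss))]
best = min(range(len(risks)), key=lambda i: risks[i])
print(risks, "-> take action", best + 1)   # [0.6, 0.7] -> take action 1
```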
Example: Zero-One loss function
• Action α_i is the decision that the true class is c_i
• λ(α_i | c_j) = 0 if i = j (no mistake), 1 otherwise (mistake)
  R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\, P(c_j \mid x) = \sum_{j \neq i} P(c_j \mid x) = 1 - P(c_i \mid x) = \Pr[\text{error if we decide } c_i]
• Thus the MAP classifier optimizes R(α_i | x): decide c_i if
  P(c_i \mid x) > P(c_j \mid x) \quad \forall j \neq i
• The MAP classifier is the Bayes decision rule under the zero-one loss function
Overall Risk
• A decision rule is a function α(x) which for every x specifies an action out of {α_1, α_2, ..., α_k}
  [Figure: points x_1, x_2, x_3 in the feature space X mapped by α(·) to actions α(x_1), α(x_2), α(x_3)]
• The average risk for α(x):
  R = \int R(\alpha(x) \mid x)\, p(x)\, dx
  (need to make R(α(x) | x) as small as possible for every x)
• Bayes decision rule: for every x, α(x) is the action which minimizes the conditional risk
  R(\alpha_i \mid x) = \sum_{j=1}^{m} \lambda(\alpha_i \mid c_j)\, P(c_j \mid x)
• The Bayes decision rule α(x) is optimal, i.e. it gives the minimum possible overall risk R*
Bayes Risk: Example
• Salmon is more tasty and expensive than sea bass:
  λ_sb = λ(salmon | bass) = 2   (classify bass as salmon)
  λ_bs = λ(bass | salmon) = 1   (classify salmon as bass)
  λ_ss = λ_bb = 0               (no mistake, no loss)
• Likelihoods:
  p(l \mid salmon) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l-5)^2}{2}\right), \qquad
  p(l \mid bass) = \frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(l-10)^2}{2 \cdot 4}\right)
• Priors: P(salmon) = P(bass)
• Risk:
  R(\alpha \mid l) = \sum_{j=1}^{m} \lambda(\alpha \mid c_j)\, P(c_j \mid l)
  R(salmon \mid l) = \lambda_{ss} P(s \mid l) + \lambda_{sb} P(b \mid l) = \lambda_{sb} P(b \mid l)
  R(bass \mid l) = \lambda_{bs} P(s \mid l) + \lambda_{bb} P(b \mid l) = \lambda_{bs} P(s \mid l)
Bayes Risk: Example
  R(salmon \mid l) = \lambda_{sb} P(b \mid l)
  R(bass \mid l) = \lambda_{bs} P(s \mid l)
• Bayes decision rule (optimal for our loss function): decide salmon if R(salmon | l) < R(bass | l), i.e. if
  \lambda_{sb} P(b \mid l) < \lambda_{bs} P(s \mid l)
• Need to solve (salmon if the left side is smaller, bass otherwise):
  \frac{P(b \mid l)}{P(s \mid l)} \gtrless \frac{\lambda_{bs}}{\lambda_{sb}}
• Or, equivalently, since the priors are equal:
  \frac{P(l \mid b)\, P(b)}{p(l)} \cdot \frac{p(l)}{P(l \mid s)\, P(s)} = \frac{P(l \mid b)}{P(l \mid s)} \gtrless \frac{\lambda_{bs}}{\lambda_{sb}}
Bayes Risk: Example
• Need to solve (salmon if the left side is smaller, bass otherwise):
  \frac{P(l \mid b)}{P(l \mid s)} \gtrless \frac{\lambda_{bs}}{\lambda_{sb}}
• Substituting the likelihoods and losses:
  \frac{\frac{1}{2\sqrt{2\pi}} \exp\left(-\frac{(l-10)^2}{8}\right)}{\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l-5)^2}{2}\right)} \gtrless \frac{1}{2}
  \Leftrightarrow \exp\left(\frac{(l-5)^2}{2} - \frac{(l-10)^2}{8}\right) \gtrless 1
  \Leftrightarrow \frac{(l-5)^2}{2} - \frac{(l-10)^2}{8} \gtrless \ln 1 = 0
  \Leftrightarrow 4(l-5)^2 - (l-10)^2 \gtrless 0
  \Leftrightarrow 3l^2 - 20l \gtrless 0
  so we decide salmon for 0 < l < 6.6667 and bass otherwise (see the sketch below)
  [Figure: the decision boundary moves from 6.70 to the new boundary 6.67; sea bass to the right on the length axis]
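A small sketch under the same model (equal priors, losses λ_sb = 2 and λ_bs = 1): it takes the action with the smaller conditional risk and locates the resulting boundary by bisection.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

LAMBDA_SB, LAMBDA_BS = 2.0, 1.0   # loss(bass classified as salmon), loss(salmon classified as bass)

def risk_diff(l):
    """R(bass | l) - R(salmon | l), up to the common positive factor P(c)/p(l) (priors are equal)."""
    return LAMBDA_BS * normal_pdf(l, 5, 1) - LAMBDA_SB * normal_pdf(l, 10, 4)

def bayes_rule(l):
    # Take the action with the smaller conditional risk.
    return "salmon" if risk_diff(l) > 0 else "bass"

# Bisection for the boundary between the two class means.
lo, hi = 5.0, 10.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if risk_diff(mid) > 0 else (lo, mid)
print(round((lo + hi) / 2, 2))    # -> 6.67, the boundary derived above
print(bayes_rule(6.0), bayes_rule(8.0))   # -> salmon bass
```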
Likelihood Ratio Rule
• In the 2-category case, use the likelihood ratio rule: decide c_1 if
  \frac{P(x \mid c_1)}{P(x \mid c_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(c_2)}{P(c_1)}
  (the left side is the likelihood ratio; the right side is a fixed number, independent of x)
• If the above inequality holds, decide c_1; otherwise decide c_2 (see the sketch below)
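A tiny sketch of the likelihood ratio test; the losses and priors are placeholder numbers, used only to show how the fixed threshold is assembled.

```python
# Placeholder losses lambda_ij = loss of deciding c_i when the true class is c_j.
lam = {(1, 1): 0.0, (1, 2): 2.0, (2, 1): 1.0, (2, 2): 0.0}
prior = {1: 0.5, 2: 0.5}

# The threshold is fixed: it does not depend on the observation x.
threshold = (lam[(1, 2)] - lam[(2, 2)]) / (lam[(2, 1)] - lam[(1, 1)]) * prior[2] / prior[1]

def decide(p_x_given_c1, p_x_given_c2):
    """Decide c1 when the likelihood ratio exceeds the threshold, c2 otherwise."""
    return "c1" if p_x_given_c1 / p_x_given_c2 > threshold else "c2"

print(threshold)            # -> 2.0 with these losses and equal priors
print(decide(0.30, 0.10))   # ratio 3.0 > 2.0 -> c1
print(decide(0.15, 0.10))   # ratio 1.5 < 2.0 -> c2
```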
Discriminant Functions
• All of these decision rules have the same structure: at observation x, choose the class c_i whose discriminant function is largest,
  g_i(x) > g_j(x) \quad \forall j \neq i
• ML decision rule: g_i(x) = P(x | c_i)
• MAP decision rule: g_i(x) = P(c_i | x)
• Bayes decision rule: g_i(x) = -R(c_i | x) (minimizing the conditional risk is the same as maximizing its negative)
Decision Regions
• Discriminant functions split the feature vector space X into decision regions
[Figure: the feature space X partitioned into decision regions labeled c1, c2, c3, c3, c1; inside the c2 region, g_2(x) = max_i g_i(x)]
Important Points
• If we know the probability distributions for the classes, we can design the optimal classifier
• The definition of "optimal" depends on the chosen loss function
• Under the minimum error rate (zero-one loss function):
  • No prior: the ML classifier is optimal
  • With a prior: the MAP classifier is optimal
• Under a more general loss function:
  • The general Bayes classifier is optimal