Bayesian Decision Theory

Bayesian Decision Theory
We assume we know the probability distribution of the categories.
This is almost never the case in real life! Nevertheless the assumption is useful, since other cases can be reduced to this one after some work.
We do not even need training data.
We can design the optimal classifier.

Bayesian Decision Theory
Fish example: each fish is in one of 2 states: sea bass or salmon.
Let w denote the state of nature: w = w1 for sea bass, w = w2 for salmon.
The state of nature is unpredictable, so w is a variable that must be described probabilistically.
If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
Define:
P(w1): a priori probability that the next fish is sea bass
P(w2): a priori probability that the next fish is salmon

Bayesian Decision Theory
If other types of fish are irrelevant: P(w1) + P(w2) = 1.
Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, ...).
Simple decision rule (make a decision without seeing the fish): decide w1 if P(w1) > P(w2); otherwise decide w2.
This is fine if we decide for one fish, but if we classify several fish, they all get assigned to the same class.
In general, we have some features and more information.

Cats and Dogs
Suppose we have these conditional probability mass functions for cats and dogs:
P(small ears | dog) = 0.1, P(large ears | dog) = 0.9
P(small ears | cat) = 0.8, P(large ears | cat) = 0.2
We observe an animal with large ears. Is it a dog or a cat?
It makes sense to say dog, because the probability of observing large ears in a dog is much larger than the probability of observing large ears in a cat:
P(large ears | dog) = 0.9 > 0.2 = P(large ears | cat)
We choose the event of larger probability, i.e. the maximum likelihood event.

Example: Fish Sorting
A respected fish expert says that
salmon's length has distribution N(5, 1);
sea bass's length has distribution N(10, 4).
Recall that if a random variable is N(μ, σ²), then its density is
p(x) = 1/(√(2π)·σ) · exp(−(x − μ)²/(2σ²)).

Class Conditional Densities
For a fixed length l:
p(l | salmon) = 1/√(2π) · exp(−(l − 5)²/2)
p(l | bass) = 1/(2√(2π)) · exp(−(l − 10)²/(2·4))

Likelihood Function
Now fix the length and let the fish class vary. Then we get the likelihood function (it is not a density and not a probability mass function):
p(l | class) = 1/√(2π) · exp(−(l − 5)²/2) if class = salmon
p(l | class) = 1/(2√(2π)) · exp(−(l − 10)²/8) if class = bass
(with l fixed).

Likelihood vs. Class Conditional Density
(Figure: the class conditional densities p(l | class) plotted against length, with l = 7 marked.)
Suppose a fish has length 7. How do we classify it?

ML (Maximum Likelihood) Classifier
We would like to choose salmon if
Pr[length = 7 | salmon] > Pr[length = 7 | bass].
However, since length is a continuous random variable,
Pr[length = 7 | salmon] = Pr[length = 7 | bass] = 0.
Instead, we choose the class which maximizes the likelihood
p(l | salmon) = 1/√(2π) · exp(−(l − 5)²/2), p(l | bass) = 1/(2√(2π)) · exp(−(l − 10)²/(2·4)).
ML classifier: for an observed l, compare p(l | salmon) with p(l | bass).
In words: if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass.

ML (Maximum Likelihood) Classifier
(Figure: at length 7, p(7 | bass) > p(7 | salmon).)
Thus we choose the class (bass) which is more likely to have given the observation 7.

Decision Boundary
(Figure: classify as salmon for lengths below 6.70, as sea bass for lengths above 6.70.)
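A minimal numerical sketch of this ML rule, assuming the two Gaussian length models above (the gaussian_pdf and ml_classify helper names are ours, not from the slides):

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ml_classify(length):
    """ML rule: pick the class whose likelihood of the observed length is larger."""
    p_salmon = gaussian_pdf(length, mu=5, var=1)  # salmon length ~ N(5, 1)
    p_bass = gaussian_pdf(length, mu=10, var=4)   # sea bass length ~ N(10, 4)
    return "salmon" if p_salmon > p_bass else "bass"

print(ml_classify(7))  # bass, since p(7 | bass) > p(7 | salmon)
```

Scanning a range of lengths with ml_classify shows a single switch point from salmon to bass, which is the ML decision boundary sketched above.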
How Does the Prior Change the Decision Boundary?
Without priors, the boundary is at 6.70: salmon below, sea bass above.
How should this change with the prior P(salmon) = 2/3, P(bass) = 1/3?
(Figure: the boundary at 6.70, with question marks for where the new boundary should go.)

Bayes Decision Rule
1. We have the likelihood functions p(length | salmon) and p(length | bass).
2. We have the priors P(salmon) and P(bass).
Question: having observed a fish of a certain length, do we classify it as salmon or bass?
Natural idea: decide
salmon if P(salmon | length) ≥ P(bass | length),
bass if P(bass | length) > P(salmon | length).

Posterior
P(salmon | length) and P(bass | length) are called posterior distributions, because the data (length) has been revealed (post data).
How do we compute posteriors? It is not obvious. From Bayes rule:
P(salmon | length) = p(length | salmon) P(salmon) / p(length)
Similarly:
P(bass | length) = p(length | bass) P(bass) / p(length)

MAP (Maximum A Posteriori) Classifier
Decide salmon if P(salmon | length) > P(bass | length), else bass. Equivalently, compare
p(length | salmon) P(salmon) / p(length) vs. p(length | bass) P(bass) / p(length),
and, dropping the common denominator p(length),
p(length | salmon) P(salmon) vs. p(length | bass) P(bass).

Back to the Fish Sorting Example
Likelihoods:
p(l | salmon) = 1/√(2π) · exp(−(l − 5)²/2), p(l | bass) = 1/(2√(2π)) · exp(−(l − 10)²/8)
Priors: P(salmon) = 2/3, P(bass) = 1/3
Solve the inequality
(2/3) · 1/√(2π) · exp(−(l − 5)²/2) > (1/3) · 1/(2√(2π)) · exp(−(l − 10)²/8)
(Figure: the new decision boundary is at 7.18 instead of 6.70; salmon below, sea bass above.)
The new decision boundary makes sense, since we expect to see more salmon.

Prior P(s) = 2/3, P(b) = 1/3 vs. Prior P(s) = 0.999, P(b) = 0.001
(Figure: the salmon/bass boundary sits near 7.1 for the first prior and moves out to about 8.9 for the second.)

Likelihood vs. Posteriors
The likelihood p(l | fish class) is a density with respect to length: for each class, the area under the curve is 1.
(Figure: the densities p(l | salmon) and p(l | bass) against length, and the posteriors P(salmon | l), P(bass | l) against length.)
The posterior P(fish class | l) is a mass function with respect to the fish class: for each l, P(salmon | l) + P(bass | l) = 1.

More on the Posterior
posterior (our goal) = likelihood (given) × prior (given) / normalizing factor:
P(c | l) = p(l | c) P(c) / p(l)
We often do not even need the normalizing factor p(l) for classification, since it does not depend on the class c. If we do need it, the law of total probability gives
p(l) = p(l | salmon) P(salmon) + p(l | bass) P(bass).
Notice this formula consists of likelihoods and priors, which are given.

More on Priors
The prior comes from prior knowledge; no data has been seen yet.
If there is a reliable source of prior knowledge, it should be used.
Some problems cannot even be solved reliably without a good prior.

More on the MAP Classifier
posterior = likelihood × prior / evidence: P(c | l) = p(l | c) P(c) / p(l)
We do not care about p(l) when maximizing P(c | l), so P(c | l) ∝ p(l | c) P(c).
If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier: P(c | l) ∝ p(l | c).
If for some observation l we have p(l | salmon) = p(l | bass), then this observation is uninformative and the decision is based solely on the prior: P(c | l) ∝ P(c).

Justification for the MAP Classifier
Let's compute the probability of error for the MAP rule: decide salmon if P(salmon | l) > P(bass | l), else bass.
For any particular l, the probability of error is
Pr[error | l] = P(bass | l) if we decide salmon, and P(salmon | l) if we decide bass.
Thus the MAP classifier is optimal for each individual l!

Justification for the MAP Classifier
We are interested in minimizing the error not just for one l; we really want to minimize the average error over all l:
Pr[error] = ∫ p(error, l) dl = ∫ Pr[error | l] p(l) dl
If Pr[error | l] is as small as possible for every l, the integral is as small as possible.
But the MAP rule makes Pr[error | l] as small as possible.
Thus the MAP classifier minimizes the probability of error!
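A small sketch of this MAP rule on the fish example, assuming the same Gaussian length models and the 2/3 vs. 1/3 priors from the slides (gaussian_pdf and map_classify are our hypothetical helpers):

```python
import math

def gaussian_pdf(x, mu, var):
    # Density of N(mu, var) at x (same helper as in the ML sketch above).
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def map_classify(length, prior_salmon=2/3, prior_bass=1/3):
    """MAP rule: pick the class with the larger likelihood * prior (the evidence p(l) cancels)."""
    score_salmon = gaussian_pdf(length, mu=5, var=1) * prior_salmon  # salmon ~ N(5, 1)
    score_bass = gaussian_pdf(length, mu=10, var=4) * prior_bass     # sea bass ~ N(10, 4)
    return "salmon" if score_salmon > score_bass else "bass"

print(map_classify(7.0))  # salmon: the prior favoring salmon flips this length
print(map_classify(7.5))  # bass
```

Multiplying the likelihoods by the priors shifts the switch point up to roughly 7.18, the boundary quoted above.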
More General Case
Let's generalize a little bit:
we have more than one feature, x = (x1, x2, ..., xd);
we have more than 2 classes, c1, c2, ..., cm.

More General Case
As before, for each j we have:
p(x | cj), the likelihood of the observation x given that the true class is cj;
P(cj), the prior probability of class cj;
P(cj | x), the posterior probability of class cj given that we observed the data x.
The evidence, or probability density of the data, is
p(x) = Σ_{j=1..m} p(x | cj) P(cj).

Minimum Error Rate Classification
We want to minimize the average probability of error
Pr[error] = ∫ p(error, x) dx = ∫ Pr[error | x] p(x) dx,
so we need to make Pr[error | x] as small as possible for every x.
Pr[error | x] = 1 − P(ci | x) if we decide class ci.
Pr[error | x] is minimized by the MAP classifier: decide class ci if P(ci | x) ≥ P(cj | x) for all j ≠ i.
So the MAP classifier is optimal if we want to minimize the probability of error.
(Figure: for each candidate class ci, the error 1 − P(ci | x) next to the posteriors P(c1 | x), P(c2 | x), P(c3 | x).)

General Bayesian Decision Theory
In some cases we may want to refuse to make a decision (and let a human expert handle the tough case), so we allow actions α1, α2, ..., αk.
Some mistakes are more costly than others (classifying a benign tumor as cancer is not as bad as classifying cancer as a benign tumor).
We therefore allow a loss function λ(αi | cj) describing the loss incurred when taking action αi while the true class is cj.

Conditional Risk
Suppose we observe x and wish to take action αi.
If the true class is cj, by definition we incur the loss λ(αi | cj).
The probability that the true class is cj after observing x is P(cj | x).
The expected loss associated with taking action αi is called the conditional risk:
R(αi | x) = Σ_{j=1..m} λ(αi | cj) P(cj | x)

Conditional Risk
R(αi | x) is the penalty for taking action αi when we observe x. The sum is over the disjoint events that the true class is cj: each term λ(αi | cj) P(cj | x) is the part of the overall penalty that comes from the event that the true class is cj, weighted by the probability P(cj | x) of that class given the observation x.

Example: Zero-One Loss Function
Action αi is the decision that the true class is ci, with
λ(αi | cj) = 0 if i = j (no mistake), 1 otherwise (mistake).
Then
R(αi | x) = Σ_{j=1..m} λ(αi | cj) P(cj | x) = Σ_{j≠i} P(cj | x) = 1 − P(ci | x) = Pr[error if we decide ci].
Thus the MAP classifier minimizes R(αi | x): decide ci if P(ci | x) ≥ P(cj | x) for all j ≠ i.
The MAP classifier is the Bayes decision rule under the zero-one loss function.

Overall Risk
A decision rule is a function α(x) which for every x specifies an action out of α1, α2, ..., αk.
(Figure: observations x1, x2, x3 in X, each mapped by α(·) to one of the actions α1, ..., αk.)
The average risk for α(x) is
R = ∫ R(α(x) | x) p(x) dx,
so we need to make R(α(x) | x) as small as possible for every x.
Bayes decision rule: for every x, let α(x) be the action which minimizes the conditional risk
R(αi | x) = Σ_{j=1..m} λ(αi | cj) P(cj | x).
The Bayes decision rule α(x) is optimal, i.e. it gives the minimum possible overall risk R*.
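A brief sketch of this Bayes decision rule: compute the conditional risk of each action from a loss table and the posteriors, then take the action with the smallest risk. The loss values, posterior numbers, and the reject option below are made-up illustrations in the spirit of the tumor example, not numbers from the slides.

```python
def conditional_risk(action, loss, posterior):
    """R(action | x) = sum over classes c of loss[action][c] * P(c | x)."""
    return sum(loss[action][c] * p for c, p in posterior.items())

def bayes_decision(loss, posterior):
    """Bayes rule: choose the action with minimum conditional risk."""
    return min(loss, key=lambda a: conditional_risk(a, loss, posterior))

# Hypothetical 3-action example with a 'reject' option (hand the case to an expert).
loss = {
    "decide_benign": {"benign": 0.0, "cancer": 10.0},  # missing a cancer is very costly
    "decide_cancer": {"benign": 1.0, "cancer": 0.0},
    "reject":        {"benign": 0.5, "cancer": 0.5},   # fixed cost of asking an expert
}
posterior = {"benign": 0.9, "cancer": 0.1}  # P(c | x) for some observed x

# Prints 'reject': even at P(cancer | x) = 0.1, calling it benign outright is too risky under this loss.
print(bayes_decision(loss, posterior))
```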
Bayes Risk: Example
Salmon is more tasty and expensive than sea bass, so classifying a bass as a salmon is the more costly mistake:
λ_sb = λ(salmon | bass) = 2 (classify bass as salmon)
λ_bs = λ(bass | salmon) = 1 (classify salmon as bass)
λ_ss = λ_bb = 0 (no mistake, no loss)
Likelihoods: p(l | salmon) = 1/√(2π) · exp(−(l − 5)²/2), p(l | bass) = 1/(2√(2π)) · exp(−(l − 10)²/(2·4))
Priors: P(salmon) = P(bass)
The conditional risks R(α | l) = Σ_j λ(α | cj) P(cj | l) are
R(salmon | l) = λ_ss P(s | l) + λ_sb P(b | l) = λ_sb P(b | l)
R(bass | l) = λ_bs P(s | l) + λ_bb P(b | l) = λ_bs P(s | l)

Bayes Risk: Example
With R(salmon | l) = λ_sb P(b | l) and R(bass | l) = λ_bs P(s | l), the Bayes decision rule (optimal for our loss function) is: decide salmon if
λ_sb P(b | l) < λ_bs P(s | l),
else decide bass. So we need to solve
P(b | l) / P(s | l) < λ_bs / λ_sb.
Or, equivalently, expanding the posteriors with Bayes rule and using the fact that the priors are equal:
[p(l | b) P(b) / p(l)] / [p(l | s) P(s) / p(l)] = p(l | b) / p(l | s) < λ_bs / λ_sb.

Bayes Risk: Example
We need to solve p(l | b) / p(l | s) < λ_bs / λ_sb. Substituting the likelihoods and losses:
[1/(2√(2π)) · exp(−(l − 10)²/8)] / [1/√(2π) · exp(−(l − 5)²/2)] < 1/2
(1/2) · exp((l − 5)²/2 − (l − 10)²/8) < 1/2
exp((l − 5)²/2 − (l − 10)²/8) < 1
(l − 5)²/2 − (l − 10)²/8 < ln 1 = 0
3l² − 20l < 0
0 < l < 6.67
(Figure: the new decision boundary is at 6.67, just below 6.70; salmon below, sea bass above.)

Likelihood Ratio Rule
In the 2-category case we can use the likelihood ratio rule: decide c1 if
p(x | c1) / p(x | c2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(c2) / P(c1)],
otherwise decide c2. Here λij is the loss for deciding ci when the true class is cj.
The left-hand side is the likelihood ratio; the right-hand side is a fixed number, independent of x.

Discriminant Functions
All of these decision rules have the same structure: at observation x, choose the class ci such that
gi(x) > gj(x) for all j ≠ i,
where gi(x) is a discriminant function.
ML decision rule: gi(x) = p(x | ci)
MAP decision rule: gi(x) = P(ci | x)
Bayes decision rule: gi(x) = −R(ci | x)
(A small sketch of all three discriminant functions on the fish example follows the summary below.)

Decision Regions
The discriminant functions split the feature space X into decision regions.
(Figure: the feature space partitioned into regions labeled c1, c2, c3, possibly disconnected; inside a region labeled c2, g2(x) = max_i gi(x).)

Important Points
If we know the probability distributions for the classes, we can design the optimal classifier.
The definition of "optimal" depends on the chosen loss function.
Under the minimum error rate (zero-one loss function):
with no prior (i.e. a uniform prior), the ML classifier is optimal;
with a prior, the MAP classifier is optimal.
Under a more general loss function, the general Bayes classifier is optimal.
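As the closing sketch referenced above (our own illustration, not from the slides), here is how the three rules fit the common discriminant-function template gi(x): pick the class whose discriminant is largest. It reuses the Gaussian length models, the 2/3 vs. 1/3 priors, and the 2-vs-1 losses from the examples above; note that the slides' risk example used equal priors, so this particular mix is only illustrative.

```python
import math

CLASSES = ["salmon", "bass"]
PARAMS = {"salmon": (5, 1), "bass": (10, 4)}             # (mean, variance) of length
PRIORS = {"salmon": 2 / 3, "bass": 1 / 3}
LOSS = {("salmon", "salmon"): 0, ("salmon", "bass"): 2,  # lambda(decide a | true class c)
        ("bass", "salmon"): 1, ("bass", "bass"): 0}

def likelihood(l, c):
    mu, var = PARAMS[c]
    return math.exp(-(l - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(l, c):
    evidence = sum(likelihood(l, cj) * PRIORS[cj] for cj in CLASSES)
    return likelihood(l, c) * PRIORS[c] / evidence

def g_ml(l, c):                 # ML:    g_i(x) = p(x | c_i)
    return likelihood(l, c)

def g_map(l, c):                # MAP:   g_i(x) = P(c_i | x)
    return posterior(l, c)

def g_bayes(l, c):              # Bayes: g_i(x) = -R(c_i | x)
    return -sum(LOSS[(c, cj)] * posterior(l, cj) for cj in CLASSES)

def classify(l, g):
    """Discriminant-function template: choose the class with the largest g."""
    return max(CLASSES, key=lambda c: g(l, c))

l = 7.0
print(classify(l, g_ml), classify(l, g_map), classify(l, g_bayes))
# The three rules need not agree: ML ignores the prior, and the Bayes rule also weighs the losses.
```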