LECTURE 4: Bayesian Decision Theory

- The Likelihood Ratio Test
- The Probability of Error
- The Bayes Risk
- Bayes, MAP and ML Criteria
- Multi-class problems
- Discriminant Functions

Introduction to Pattern Analysis
Ricardo Gutierrez-Osuna, Texas A&M University


Likelihood Ratio Test (LRT)

- Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x.
- A reasonable decision rule would be: "Choose the class that is most probable given the observed feature vector x."
  More formally: evaluate the posterior probability of each class, P(ωi|x), and choose the class with the largest posterior.
- Let us examine the implications of this decision rule for a 2-class problem. In this case the decision rule becomes:
      if P(ω1|x) > P(ω2|x) choose ω1, else choose ω2
- Applying Bayes rule:
      P(x|ω1)P(ω1)/P(x) > P(x|ω2)P(ω2)/P(x)  ⇒  choose ω1, else ω2
  P(x) does not affect the decision, so it can be eliminated*. Rearranging the previous expression:
      Λ(x) = P(x|ω1)/P(x|ω2) > P(ω2)/P(ω1)  ⇒  choose ω1, else ω2
- The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test.

*P(x) can be disregarded in the decision rule since it is constant regardless of the class ωi. However, P(x) is needed if we want to estimate the posterior P(ωi|x), which, unlike the product P(x|ωi)P(ωi), is a true probability value and therefore gives us an estimate of the "goodness" of our decision.


Likelihood Ratio Test: an example

- Given a classification problem with the following class-conditional densities, derive a decision rule based on the Likelihood Ratio Test (assume equal priors):
      P(x|ω1) = (1/√(2π)) exp(−(x−4)²/2)
      P(x|ω2) = (1/√(2π)) exp(−(x−10)²/2)
- Solution
  - Substituting the given likelihoods and priors into the LRT expression:
        Λ(x) = [(1/√(2π)) exp(−(x−4)²/2)] / [(1/√(2π)) exp(−(x−10)²/2)] > 1  ⇒  choose ω1
  - Simplifying the LRT expression:
        Λ(x) = exp(−(x−4)²/2 + (x−10)²/2) > 1  ⇒  choose ω1
  - Changing signs and taking logs:
        (x−4)² − (x−10)² < 0  ⇒  choose ω1
  - Which yields:
        x < 7  ⇒  choose ω1 (region R1);  x > 7  ⇒  choose ω2 (region R2)
  [Figure: the two likelihoods P(x|ω1) and P(x|ω2), Gaussians centered at x = 4 and x = 10, with decision regions R1 (say ω1) and R2 (say ω2) split at x = 7.]
- This LRT result makes sense from an intuitive point of view, since the two likelihoods are identical in shape and differ only in their mean value.
- How would the LRT decision rule change if, say, the priors were such that P(ω1) = 2·P(ω2)?
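A minimal numerical sketch of this example (not part of the original slides), written in Python with SciPy's normal density; the means (4 and 10), unit variances and priors come from the slide above, while the function name lrt_decision is my own:

```python
import numpy as np
from scipy.stats import norm

def lrt_decision(x, prior1=0.5, prior2=0.5):
    """Classify a scalar observation x with the likelihood ratio test."""
    likelihood_ratio = norm.pdf(x, loc=4, scale=1) / norm.pdf(x, loc=10, scale=1)
    threshold = prior2 / prior1              # P(w2)/P(w1); equals 1 for equal priors
    return 1 if likelihood_ratio > threshold else 2

# Equal priors: the boundary sits halfway between the means, at x = 7
print([lrt_decision(x) for x in (5.0, 6.9, 7.1, 9.0)])   # -> [1, 1, 2, 2]

# With P(w1) = 2*P(w2) the threshold drops to 1/2, so the boundary shifts toward
# the w2 mean: solving -(x-4)^2/2 + (x-10)^2/2 = ln(1/2) gives x = 7 + ln(2)/6
print(7 + np.log(2) / 6)                                  # ~7.12
```

This also answers the closing question qualitatively: making ω1 twice as probable a priori enlarges its decision region, moving the boundary from 7 toward the ω2 mean.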
The probability of error (1)

- The performance of any decision rule can be measured by its probability of error, P[error], which, making use of the theorem of total probability (Lecture 2), can be broken up as:
      P[error] = Σ_{i=1..C} P[error|ωi] P[ωi]
- The class-conditional probability of error P[error|ωi] can be expressed as:
      P[error|ωi] = P[choose ωj|ωi] = ∫_{Rj} P(x|ωi) dx   (j ≠ i)
- So, for our 2-class problem, the probability of error becomes:
      P[error] = P[ω1] ∫_{R2} P(x|ω1) dx + P[ω2] ∫_{R1} P(x|ω2) dx = P[ω1]·ε1 + P[ω2]·ε2
  where εi is the integral of the likelihood P(x|ωi) over the region Rj (j ≠ i) in which we choose the other class ωj.
- For the decision rule of the previous example, ε1 and ε2 are the tails of the two likelihoods that fall on the wrong side of the boundary at x = 7. Since we assumed equal priors, P[error] = (ε1 + ε2)/2.
  [Figure: the two likelihoods with decision regions R1 (say ω1) and R2 (say ω2); ε1 is the area of P(x|ω1) inside R2 and ε2 the area of P(x|ω2) inside R1.]
- Exercise: compute the probability of error for the example above (a numerical sketch is given at the end of this discussion).


The probability of error (2)

- Now that we can measure the performance of a decision rule, we can ask the following question: how good is the Likelihood Ratio Test decision rule?
- For this purpose it is convenient to express P[error] in terms of the conditional error P[error|x]:
      P[error] = ∫_{−∞}^{+∞} P[error|x] P(x) dx
- The optimal decision rule is the one that minimizes P[error|x] at every value of x, so that the integral above is minimized.
- At each point x', P[error|x'] equals the posterior P(ωi|x') of the class we did not choose.
  [Figure: the two posteriors P(ω1|x) and P(ω2|x) with the decision regions produced by the LRT and by an alternative (ALT) rule; at every x' the conditional error P[error|x'] of the alternative rule is at least as large as that of the LRT rule.]
- From the figure it becomes clear that, for any value of x', the Likelihood Ratio Test decision rule always has the lower P[error|x']. Therefore, when we integrate over the real line, the LRT decision rule yields the lower P[error].
- For any given problem, the minimum probability of error is achieved by the Likelihood Ratio Test decision rule. This probability of error is called the Bayes Error Rate and is the best any classifier can do.
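A numerical answer to the exercise above (my own sketch, not from the slides), using the boundary x = 7 derived earlier; each error integral is a 3-sigma Gaussian tail:

```python
from scipy.stats import norm

# Decision boundary from the equal-prior LRT example: choose w1 if x < 7
boundary = 7.0

# eps1: mass of P(x|w1) = N(4,1) that falls in R2 (x > 7)
eps1 = norm.sf(boundary, loc=4, scale=1)      # survival function = 1 - CDF
# eps2: mass of P(x|w2) = N(10,1) that falls in R1 (x < 7)
eps2 = norm.cdf(boundary, loc=10, scale=1)

p_error = 0.5 * eps1 + 0.5 * eps2             # equal priors
print(eps1, eps2, p_error)                    # each tail is ~0.00135, so P[error] ~ 0.00135
```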
The Bayes Risk (1)

- So far we have assumed that the penalty of misclassifying a class ω1 example as class ω2 is the same as that of the opposite mistake. In general, this is not the case:
  - For example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around.
- This concept can be formalized in terms of a cost function Cij:
  - Cij represents the cost of choosing class ωi when class ωj is the true class.
- We define the Bayes Risk as the expected value of the cost:
      ℜ = E[C] = Σ_{i=1..2} Σ_{j=1..2} Cij · P[choose ωi and x∈ωj] = Σ_{i=1..2} Σ_{j=1..2} Cij · P[x∈Ri|ωj] · P[ωj]
- What is the decision rule that minimizes the Bayes Risk?
  - First notice that
        P[x∈Ri|ωj] = ∫_{Ri} P(x|ωj) dx
  - We can therefore express the Bayes Risk as
        ℜ = ∫_{R1} [C11·P[ω1]·P(x|ω1) + C12·P[ω2]·P(x|ω2)] dx + ∫_{R2} [C21·P[ω1]·P(x|ω1) + C22·P[ω2]·P(x|ω2)] dx
  - Then we note that, for either likelihood, the two decision regions cover the whole feature space:
        ∫_{R1} P(x|ωi) dx + ∫_{R2} P(x|ωi) dx = ∫_{R1∪R2} P(x|ωi) dx = 1


The Bayes Risk (2)

- Using this identity to replace the integrals over R2 by integrals over R1 yields
      ℜ = C11·P[ω1]·∫_{R1} P(x|ω1) dx + C12·P[ω2]·∫_{R1} P(x|ω2) dx
          + C21·P[ω1]·(1 − ∫_{R1} P(x|ω1) dx) + C22·P[ω2]·(1 − ∫_{R1} P(x|ω2) dx)
- Collecting terms:
      ℜ = C21·P[ω1] + C22·P[ω2] + (C12 − C22)·P[ω2]·∫_{R1} P(x|ω2) dx − (C21 − C11)·P[ω1]·∫_{R1} P(x|ω1) dx
  where both cost differences, (C12 − C22) and (C21 − C11), are positive, since a misclassification costs more than a correct decision.
- The first two terms are constant as far as our minimization is concerned, since they do not depend on R1, so we seek the decision region R1 that minimizes
      R1 = argmin_{R1} ∫_{R1} [(C12 − C22)·P[ω2]·P(x|ω2) − (C21 − C11)·P[ω1]·P(x|ω1)] dx = argmin_{R1} ∫_{R1} g(x) dx


The Bayes Risk (3)

- Let us forget about the actual expression of g(x) for a moment and develop some intuition about the kind of decision region R1 we are looking for.
- Intuitively, we will include in R1 those regions of the feature space that reduce the integral ∫_{R1} g(x) dx; in other words, those regions where g(x) < 0.
  [Figure: a curve g(x) crossing zero several times; R1 = R1A ∪ R1B ∪ R1C is the union of the intervals where g(x) < 0.]
- So we choose R1 (that is, we choose ω1) wherever
      (C21 − C11)·P[ω1]·P(x|ω1) > (C12 − C22)·P[ω2]·P(x|ω2)
- And rearranging:
      Λ(x) = P(x|ω1)/P(x|ω2) > [(C12 − C22)·P[ω2]] / [(C21 − C11)·P[ω1]]  ⇒  choose ω1, else ω2
- Therefore, minimization of the Bayes Risk also leads to a Likelihood Ratio Test.


The Bayes Risk: an example

- Consider a classification problem with two classes defined by the following likelihood functions:
      P(x|ω1) = (1/√(2π·3)) exp(−x²/(2·3))     (zero-mean Gaussian with variance 3)
      P(x|ω2) = (1/√(2π)) exp(−(x−2)²/2)       (unit-variance Gaussian with mean 2)
- Sketch the two densities. What is the likelihood ratio?
- Assume P[ω1] = P[ω2] = 0.5, C11 = C22 = 0, C12 = 1 and C21 = √3. Determine the decision rule that minimizes the Bayes Risk.
- Solution
  - The LRT threshold is (C12 − C22)·P[ω2] / ((C21 − C11)·P[ω1]) = 1/√3, so
        Λ(x) = [(1/√(2π·3)) exp(−x²/6)] / [(1/√(2π)) exp(−(x−2)²/2)] > 1/√3  ⇒  choose ω1
  - The factor 1/√3 appears on both sides and cancels; taking logs:
        −x²/6 + (x−2)²/2 > 0  ⇒  choose ω1
  - Multiplying by 6 and expanding:
        2x² − 12x + 12 > 0  ⇒  choose ω1, with roots x = 3 ± √3 ≈ 1.27 and 4.73
  - Hence we choose ω2 on the interval (1.27, 4.73), where the narrower class-2 density dominates, and ω1 elsewhere.
  [Figure: the two likelihoods, the broad zero-mean density P(x|ω1) and the narrow density P(x|ω2) centered at x = 2, with decision regions R1, R2, R1 split at x ≈ 1.27 and x ≈ 4.73.]
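A small numerical check of this example (my own sketch, not from the slides): solve the quadratic for the region boundaries and evaluate the resulting Bayes Risk with Gaussian CDFs.

```python
import numpy as np
from scipy.stats import norm

# Costs and priors from the example (C11 = C22 = 0)
C12, C21 = 1.0, np.sqrt(3)
P1, P2 = 0.5, 0.5

# Taking logs of the LRT reduces the rule to 2x^2 - 12x + 12 > 0  =>  choose w1
x_low, x_high = sorted(np.roots([2, -12, 12]).real)   # 3 - sqrt(3), 3 + sqrt(3)
print(x_low, x_high)                                   # ~1.27, ~4.73

# Bayes Risk of this rule: cost-weighted probability mass on the "wrong" side.
# w1 ~ N(0, var 3), w2 ~ N(2, var 1); we say w2 on (x_low, x_high), w1 elsewhere.
p_w2_in_R1 = norm.cdf(x_low, 2, 1) + norm.sf(x_high, 2, 1)           # w2 mass assigned to w1
p_w1_in_R2 = norm.cdf(x_high, 0, np.sqrt(3)) - norm.cdf(x_low, 0, np.sqrt(3))
risk = C12 * P2 * p_w2_in_R1 + C21 * P1 * p_w1_in_R2
print(risk)                                            # ~0.32 under these assumptions
```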
Variations of the Likelihood Ratio Test (1)

- The LRT decision rule that minimizes the Bayes Risk is commonly called the Bayes Criterion:
      Λ(x) = P(x|ω1)/P(x|ω2) > [(C12 − C22)·P[ω2]] / [(C21 − C11)·P[ω1]]  ⇒  choose ω1, else ω2     (Bayes Criterion)
- Many times we are simply interested in minimizing the probability of error, which is a special case of the Bayes Criterion that uses the so-called symmetric or zero-one cost function:
      Cij = 0 if i = j,  1 if i ≠ j
  This version of the LRT decision rule is referred to as the Maximum A Posteriori (MAP) Criterion, since it is equivalent to choosing the class that maximizes the posterior P(ωi|x):
      Λ(x) = P(x|ω1)/P(x|ω2) > P(ω2)/P(ω1)  ⇔  P(ω1|x) > P(ω2|x)  ⇒  choose ω1, else ω2     (MAP Criterion)
- Finally, for the case of equal priors, P(ωi) = 1/C for all i, together with the zero-one cost function, the LRT decision rule is called the Maximum Likelihood (ML) Criterion, since it chooses the class that maximizes the likelihood P(x|ωi):
      Λ(x) = P(x|ω1)/P(x|ω2) > 1  ⇒  choose ω1, else ω2     (ML Criterion)


Variations of the Likelihood Ratio Test (2)

- Two more decision rules are commonly cited in the related literature:
  - The Neyman-Pearson Criterion, used in Detection and Estimation Theory, also leads to an LRT decision rule; it fixes one of the class error probabilities, say ε1 ≤ α, and seeks to minimize the other (a numerical sketch follows below).
    - For instance, for the sea-bass/salmon classification problem of Lecture 1, there may be some kind of government regulation stating that we must not misclassify more than 1% of the salmon as sea bass.
    - The Neyman-Pearson Criterion is very attractive since it requires knowledge of neither the priors nor the cost function.
  - The Minimax Criterion, used in Game Theory, is derived from the Bayes Criterion and seeks to minimize the maximum Bayes Risk.
    - The Minimax Criterion does not require knowledge of the priors, but it does need a cost function.
- For more information on these methods, the reader is referred to "Detection, Estimation and Modulation Theory" by H. L. Van Trees, the classical reference in this field.
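A minimal Neyman-Pearson sketch (my own illustration, not from the slides) for the earlier N(4,1) vs N(10,1) problem; because the likelihood ratio is monotone in x here, fixing ε1 simply fixes a cutoff t on x:

```python
from scipy.stats import norm

# Fix eps1 = P[choose w2 | w1] at alpha and see what eps2 results.
alpha = 0.01
t = norm.isf(alpha, loc=4, scale=1)      # cutoff leaving exactly alpha of the w1 mass above it
eps2 = norm.cdf(t, loc=10, scale=1)      # w2 mass that ends up in the w1 region (x < t)
print(t, eps2)                           # t ~ 6.33, eps2 ~ 1.2e-4
```

Note that neither the priors nor a cost function enter this computation, which is what makes the criterion attractive in practice.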
Minimum P[error] rule for multi-class problems

- The decision rule that minimizes P[error] generalizes very easily to multi-class problems.
- For clarity in the derivation, the probability of error is better expressed in terms of the probability of making a correct assignment:
      P[error] = 1 − P[correct]
- The probability of making a correct assignment is
      P[correct] = Σ_{i=1..C} P(ωi) ∫_{Ri} P(x|ωi) dx
- The problem of minimizing P[error] is equivalent to that of maximizing P[correct]. Expressing P[correct] in terms of the posteriors:
      P[correct] = Σ_{i=1..C} P(ωi) ∫_{Ri} P(x|ωi) dx = Σ_{i=1..C} ∫_{Ri} P(x|ωi)P(ωi) dx = Σ_{i=1..C} ∫_{Ri} P(ωi|x)P(x) dx = Σ_{i=1..C} ℑi
- In order to maximize P[correct], we have to maximize each of the integrals ℑi. In turn, each integral ℑi is maximized by choosing, at every x, the class ωi that yields the maximum posterior P(ωi|x) ⇒ we define Ri as the region where P(ωi|x) is maximum.
  [Figure: three posteriors P(ω1|x), P(ω2|x) and P(ω3|x), with decision regions R1, R2, R3 covering the portions of the x axis where the corresponding posterior is the largest.]
- Therefore, the decision rule that minimizes P[error] is the MAP Criterion.


Minimum Bayes Risk for multi-class problems

- To determine which decision rule yields the minimum Bayes Risk for the multi-class problem we use a slightly different formulation:
  - We denote by αi the decision to choose class ωi.
  - We denote by α(x) the overall decision rule that maps features x into classes: α(x) → {α1, α2, …, αC}.
- The (conditional) risk ℜ(αi|x) of assigning a feature vector x to class ωi is
      ℜ(α(x) → αi) = ℜ(αi|x) = Σ_{j=1..C} Cij P(ωj|x)
- And the Bayes Risk associated with the decision rule α(x) is
      ℜ(α(x)) = ∫ ℜ(α(x)|x) P(x) dx
- In order to minimize this expression, we have to minimize the conditional risk ℜ(α(x)|x) at each point x in the feature space, which in turn is equivalent to choosing the class ωi for which ℜ(αi|x) is minimum.
  [Figure: three conditional risks ℜ(α1|x), ℜ(α2|x) and ℜ(α3|x), with decision regions R1, R2, R3 covering the portions of the x axis where the corresponding conditional risk is the smallest.]


Discriminant functions

- All the decision rules we have presented in this lecture have the same structure: at each point x in feature space, choose the class ωi that maximizes (or minimizes) some measure gi(x).
- This structure can be formalized with a set of discriminant functions gi(x), i = 1..C, and the decision rule
      "assign x to class ωi if gi(x) > gj(x) for all j ≠ i"
- Therefore, we can visualize the decision rule as a network or machine that computes C discriminant functions and selects the category corresponding to the largest discriminant.
  [Figure: the network presented in Lecture 1: features x1 … xd feed C discriminant functions g1(x) … gC(x); costs may enter the discriminants; a "select max" stage produces the class assignment.]
- Finally, we express the three basic decision rules, Bayes, MAP and ML, in terms of discriminant functions to show the generality of this formulation (a small sketch follows below):

      Criterion   Discriminant Function
      Bayes       gi(x) = −ℜ(αi|x)
      MAP         gi(x) = P(ωi|x)
      ML          gi(x) = P(x|ωi)
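A minimal sketch of the discriminant-function view (my own illustration; the 1-D, 3-class Gaussian parameters below are made up for demonstration, not taken from the lecture):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D, 3-class problem with Gaussian class-conditionals.
means  = [0.0, 3.0, 6.0]
sigmas = [1.0, 1.0, 2.0]
priors = [0.5, 0.3, 0.2]

def discriminants(x, rule="MAP"):
    """Return [g_1(x), ..., g_C(x)] for the chosen criterion."""
    lik = np.array([norm.pdf(x, m, s) for m, s in zip(means, sigmas)])
    if rule == "ML":                  # g_i(x) = P(x|w_i)
        return lik
    if rule == "MAP":                 # g_i(x) proportional to P(w_i|x); P(x) can be dropped
        return lik * np.array(priors)
    raise ValueError("unknown rule")

def classify(x, rule="MAP"):
    """Assign x to the class with the largest discriminant (classes numbered 1..C)."""
    return int(np.argmax(discriminants(x, rule))) + 1

print(classify(1.0), classify(4.0), classify(7.0))           # -> 1, 2, 3
print(classify(1.6, rule="ML"), classify(1.6, rule="MAP"))   # priors can change the label
```

The Bayes Criterion fits the same mold with gi(x) = −Σj Cij P(ωj|x); with zero-one costs it collapses to the MAP rule sketched above.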