No.4 Bayesian Decision Theory
Hui Jiang
Department of Electrical Engineering and Computer Science
Lassonde School of Engineering
York University, Toronto, Canada

Outline
• Pattern classification problems
• Bayesian decision theory – how to make the optimal decision?
• Generative models:
  – Maximum a posteriori (MAP) decision rule: leads to minimum-error-rate classification.
  – Model estimation: maximum likelihood, Bayesian learning, discriminative training, etc.

Pattern Classification Problem
• Given a fixed, finite set of classes: ω1, ω2, …, ωN.
• We have an unknown pattern/object P without its class identity, BUT we can measure (observe) some feature(s) X of P:
  – X is called the feature or observation of the pattern P.
  – X can be a scalar, a vector, or a sequence of vectors.
  – X can be continuous or discrete.
• Pattern classification problem: X → ωi
  – Determine the class identity of any pattern based on its observation or feature.
• Fundamental issues in pattern classification:
  – How do we make an optimal classification?
  – In what sense is it optimal?

Examples of Pattern Classification (I)
• Speech recognition:
  – Pattern: voice spoken by a human being
  – Classes: language words/sentences used by the speaker
  – Features: speech signal characteristics measured by a microphone → a sequence of feature vectors
    • Each vector: continuous, high-dimensional, real-valued numbers
• Natural language understanding:
  – Pattern: written or spoken language of humans
  – Classes: all possible semantic meanings or intentions
  – Features: the words or word sequences (sentences) used
    • Discrete, scalars or vectors

Examples of Pattern Classification (II)
• Image understanding:
  – Pattern: given images
  – Classes: all known object categories
  – Features: color or gray-scale values of all pixels
    • Continuous, multiple vectors/matrices
  – Examples: face recognition, OCR (optical character recognition)
• Gene finding in bioinformatics:
  – Pattern: a newly sequenced DNA sequence
  – Classes: all known genes
  – Features: all nucleotides in the sequence
    • Discrete; 4 types (adenine, guanine, cytosine, thymine)
• Protein classification in bioinformatics:
  – Pattern: protein primary 1-D sequence
  – Classes: all known protein families or domains
  – Features: all amino acids in the sequence: discrete; 20 types

Bayesian Decision Theory (I)
• Bayesian decision theory is a fundamental statistical approach to all pattern classification problems.
• The pattern classification problem is posed in probabilistic terms:
  – The observation X is viewed as a random variable (or random vector, …).
  – The class id ω is treated as a discrete random variable, which can take the values ω1, ω2, …, ωN.
  – Therefore, we are interested in the joint probability distribution of X and ω, which contains all information about X and ω:
      p(X, ω) = p(ω) · p(X | ω)
• If all the relevant probability values and densities of the problem are known (i.e., we have complete knowledge of the problem), Bayesian decision theory leads to the optimal classification.
  – Optimal → guarantees the minimum average classification error.
  – This minimum classification error is called the Bayes error.
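To make the factorization above concrete, here is a minimal sketch (plain Python, not from the slides) of a toy two-class problem with a discrete feature; all the numbers are assumed for illustration only. It builds the joint distribution p(X, ω) = P(ω) · p(X | ω) and checks that it sums to one over all (X, ω) pairs.

```python
# A minimal sketch (hypothetical numbers) of the factorization
# p(X, omega) = P(omega) * p(X | omega) for a toy problem with two classes
# and a discrete feature X taking values in {0, 1, 2}.

priors = {"w1": 0.6, "w2": 0.4}          # P(omega_i); the priors sum to 1

# Class-conditional p.m.f.s p(X | omega_i); each row sums to 1.
likelihoods = {
    "w1": {0: 0.7, 1: 0.2, 2: 0.1},
    "w2": {0: 0.1, 1: 0.3, 2: 0.6},
}

def joint(x, w):
    """Joint probability p(X = x, omega = w) = P(w) * p(x | w)."""
    return priors[w] * likelihoods[w][x]

# The joint distribution contains all information about X and omega;
# summing it over every (x, w) pair must give exactly 1.
total = sum(joint(x, w) for w in priors for x in likelihoods[w])
print(total)  # -> 1.0
```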
Bayesian Decision Theory (II)
• In pattern classification, "all relevant probabilities" means:
  – The prior probability of each class, P(ωi) (i = 1, 2, …, N): how likely a pattern is to come from each class before any features are observed → prior knowledge from previous experience.
    • All priors sum to one: ∑_{i=1}^{N} P(ωi) = 1
  – The class-conditional probability density function of the observed feature X, p(X | ωi) (i = 1, 2, …, N): how the feature is distributed over all patterns belonging to class ωi.
    • If X is continuous, p(X | ωi) is a continuous p.d.f. For every class ωi: ∫_X p(X | ωi) dX = 1
    • If X is discrete, p(X | ωi) is a discrete probability mass function (p.m.f.). For every class ωi: ∑_X p(X | ωi) = 1

Example of class-conditional p.d.f.
[Figure: class-conditional densities p(x|ω1) and p(x|ω2) plotted against x. From Duda et al., Pattern Classification, John Wiley & Sons, Inc.]

Bayes Decision Rule: Maximum a Posteriori (MAP) (I)
• If we do not observe any feature of an incoming unknown pattern P, we classify it based on prior knowledge only:
      ωP = arg max_{ωi} P(ωi)
  – Roughly, guess the class with the largest prior probability.
• If we observe some features X of the unknown pattern P, we can convert the prior probability P(ωi) into a posterior probability using Bayes' theorem:
      posterior = (prior × likelihood) / evidence

Bayes Decision Rule: Maximum a Posteriori (MAP) (II)
• Where:
  – Prior P(ωi): probability of receiving a pattern from class ωi before observing anything (prior knowledge).
  – Likelihood p(X | ωi): probability of observing feature X assuming it comes from a pattern of class ωi. (If X is treated as given and this is viewed as a function of ωi, it is called the likelihood function.)
  – Posterior p(ωi | X): probability that the pattern belongs to class ωi after observing its features X.
  – Evidence p(X): a scalar factor that guarantees the posterior probabilities sum to one over ωi.
      p(ωi | X) = P(ωi) · p(X | ωi) / p(X) = P(ωi) · p(X | ωi) / ∑_j P(ωj) · p(X | ωj)

Bayes Decision Rule: Maximum a Posteriori (MAP) (III)
• If we observe some features X of an unknown pattern, the observation converts the prior into a posterior. Intuitively, we can classify the pattern based on the posterior probabilities, which yields the maximum a posteriori (MAP) decision rule, also called the Bayes decision rule.
• For an unknown pattern P, after observing some features X, we classify it into the class with the largest posterior probability:
      ωP = arg max_{ωi} p(ωi | X) = arg max_{ωi} [ P(ωi) · p(X | ωi) / p(X) ] = arg max_{ωi} P(ωi) · p(X | ωi)

The MAP decision rule is optimal (I)
• How well does the MAP decision rule behave?
• Optimality: assuming we have complete knowledge, including P(ωi) and p(X | ωi) (i = 1, 2, …, N), the MAP decision rule is the optimal way to classify patterns, i.e., it achieves the lowest average classification error rate.
• Proof of optimality of the MAP rule:
  – Given a pattern P, if its true class id is ωi but we classify it as ωP, the classification error is counted by l(ωP | ωi):
      l(ωP | ωi) = 0 if ωP = ωi, and 1 if ωP ≠ ωi
    which is also known as the 0-1 loss function.

The MAP decision rule is optimal (II)
• Proof of optimality of the MAP rule (cont'd):
  – For a pattern P, after observing X, the posterior p(ωi | X) is the probability that the true class id of P is ωi. Thus the expected (average) classification error of classifying P as ωP is:
      R(ωP | X) = ∑_{i=1}^{N} l(ωP | ωi) · p(ωi | X) = ∑_{ωi ≠ ωP} p(ωi | X) = 1 − p(ωP | X)
  – The optimal classification minimizes this average classification error, i.e., given X, we classify P as the ωP that minimizes R(ωP | X) → maximizes p(ωP | X) → the MAP decision rule is optimal and achieves the minimum average error rate. This minimum error is called the Bayes error.
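As an illustration of the rule just proved optimal, the following sketch (plain Python, with the same assumed toy numbers as before) computes the posteriors p(ωi | X), makes the MAP decision, and reports the conditional risk 1 − p(ωP | X) under the 0-1 loss.

```python
# A minimal sketch of the MAP (Bayes) decision rule for a toy two-class problem;
# the priors and likelihoods are hypothetical numbers, not from the slides.

priors = {"w1": 0.6, "w2": 0.4}
likelihoods = {
    "w1": {0: 0.7, 1: 0.2, 2: 0.1},
    "w2": {0: 0.1, 1: 0.3, 2: 0.6},
}

def posteriors(x):
    """p(omega_i | x) = P(omega_i) * p(x | omega_i) / p(x)."""
    joint = {w: priors[w] * likelihoods[w][x] for w in priors}
    evidence = sum(joint.values())          # p(x): normalizes posteriors to 1
    return {w: j / evidence for w, j in joint.items()}

def map_decide(x):
    """Classify x into the class with the largest posterior probability.
    Since p(x) does not depend on omega, the argmax of the posterior equals
    the argmax of P(omega) * p(x | omega)."""
    return max(priors, key=lambda w: priors[w] * likelihoods[w][x])

x = 2
post = posteriors(x)
decision = map_decide(x)
# Conditional risk under the 0-1 loss: R(omega_P | x) = 1 - p(omega_P | x),
# which is minimized exactly by the MAP choice.
print(decision, post, 1.0 - post[decision])
```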
The MAP decision rule
• A general decision rule is a mapping function that, given any observation X, outputs a class id ωP: X → ωP.
• If there are N classes in total, a decision rule partitions the entire feature space of X into N regions, O1, O2, …, ON. If X falls in region Oi, we classify it as class ωi.
• Each region Oi may consist of several separate contiguous areas.
• The MAP decision rule (the Bayes decision rule) is optimal among all possible decision rules in terms of minimizing the average classification error, provided that we have complete and precise knowledge of the underlying problem.
[Figure: the feature space partitioned into regions for class ω1, class ω2, …, class ωN.]

The MAP decision rule: example
[Figure: decision regions O1 and O2 induced by the MAP rule for a two-class example. From Duda et al., Pattern Classification, John Wiley & Sons, Inc.]

Classification error probability of a decision rule
• Assume an N-class problem; any decision rule partitions the feature space into N regions, O1, O2, …, ON.
• Let Pr(X ∈ Oi, ωj) denote the probability that the observation X of a pattern whose true class id is ωj falls in region Oi.
• The overall classification error probability of the decision rule is:
      Pr(error) = 1 − Pr(correct)
                = 1 − ∑_{i=1}^{N} Pr(X ∈ Oi, ωi)
                = 1 − ∑_{i=1}^{N} Pr(X ∈ Oi | ωi) · P(ωi)
                = 1 − ∑_{i=1}^{N} ∫_{Oi} p(X | ωi) · P(ωi) dX

Example of error probability in the 2-class case
[Figure: the two error contributions, ω2 → ω1 and ω1 → ω2, shown as areas under the class-conditional densities. From Duda et al., Pattern Classification, John Wiley & Sons, Inc.]

Bayes error
• Bayes error: the error probability of the Bayes (MAP) decision rule.
• Since the Bayes decision rule guarantees the minimum error, the Bayes error is a lower bound on the error probability of any decision rule.
• It is difficult to calculate the Bayes error, even in very simple cases, because of the discontinuous nature of the decision regions in the integral, especially in high dimensions.
• Some approximation methods estimate an upper bound:
  – Chernoff bound
  – Bhattacharyya bound
• Alternatively, evaluate the error rate empirically on an independent test set.

Example: Bayes decision for independent binary features
• The Bayes decision rule (the MAP rule) is also applicable when the feature X is discrete.
• A simple case (binomial model): 2 classes (ω1, ω2); the feature vector is d-dimensional, with components that are binary-valued and conditionally independent:
      X = (x1, x2, …, xd)ᵗ, with xi ∈ {0, 1} (1 ≤ i ≤ d)
      pi = Pr(xi = 1 | ω1)  and  qi = Pr(xi = 1 | ω2)
      p(X | ω1) = ∏_{i=1}^{d} pi^{xi} (1 − pi)^{1 − xi}
      p(X | ω2) = ∏_{i=1}^{d} qi^{xi} (1 − qi)^{1 − xi}

Example: Bayes decision for independent binary features (cont'd)
• The MAP decision rule: classify X into ω1 if P(ω1) · p(X | ω1) ≥ P(ω2) · p(X | ω2), otherwise ω2.
• Equivalently, we have a linear decision function (see the sketch below):
      g(X) = ∑_{i=1}^{d} [ xi ln(pi / qi) + (1 − xi) ln((1 − pi) / (1 − qi)) ] + ln(P(ω1) / P(ω2)) = ∑_{i=1}^{d} λi xi + λ0
  where
      λi = ln[ pi (1 − qi) / ( qi (1 − pi) ) ],   λ0 = ∑_{i=1}^{d} ln((1 − pi) / (1 − qi)) + ln(P(ω1) / P(ω2))
• If g(X) ≥ 0, classify X into ω1; otherwise ω2.
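A minimal Python sketch of this binary-feature example follows; the probabilities pi, qi and the priors are made-up values, but the weights λi, λ0 and the decision function g(X) follow the formulas above.

```python
import math

# A minimal sketch of the linear decision function g(X) for two classes with
# d conditionally independent binary features. The values for p_i = P(x_i=1|w1),
# q_i = P(x_i=1|w2) and the priors are assumed for illustration only.

p = [0.8, 0.6, 0.7]          # P(x_i = 1 | omega_1)
q = [0.3, 0.4, 0.2]          # P(x_i = 1 | omega_2)
prior1, prior2 = 0.5, 0.5    # P(omega_1), P(omega_2)

# Weights of the linear discriminant derived from the MAP rule:
#   lambda_i = ln[ p_i (1 - q_i) / ( q_i (1 - p_i) ) ]
#   lambda_0 = sum_i ln[(1 - p_i) / (1 - q_i)] + ln[P(omega_1) / P(omega_2)]
lam = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
lam0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) \
       + math.log(prior1 / prior2)

def classify(x):
    """Return omega_1 if g(X) = sum_i lambda_i x_i + lambda_0 >= 0, else omega_2."""
    g = sum(li * xi for li, xi in zip(lam, x)) + lam0
    return "omega_1" if g >= 0 else "omega_2"

print(classify([1, 1, 0]))   # feature pattern favouring omega_1
print(classify([0, 0, 1]))   # feature pattern favouring omega_2
```

The point of the rewrite is visible in the code: under conditional independence, the MAP comparison reduces to thresholding a function that is linear in the binary features.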
Missing features/data (I)
• If we know the full probability structure of a problem, we can construct the optimal Bayes decision rule.
• In some practical situations, however, for some patterns we cannot observe the full feature vector described by that probability structure.
• Only partial information of the feature vector is observed; some components are missing.
• How do we classify such corrupted inputs so as to obtain the minimum average error?
• Let the full feature vector be X = [Xg, Xb], where Xg denotes the observed (good) features and Xb the missing (bad) ones.
• In this case, the optimal decision rule is constructed from the posterior p(ωi | Xg) as follows:
      ωP = arg max_{ωi} p(ωi | Xg)

Missing features/data (II)
• Where the posterior is obtained by marginalizing out the missing features Xb:
      p(ωi | Xg) = p(ωi, Xg) / p(Xg)
                 = ∫ p(ωi, Xg, Xb) dXb / p(Xg)
                 = ∫ p(ωi | Xg, Xb) · p(Xg, Xb) dXb / p(Xg)
                 = ∫ p(ωi | X) · p(X) dXb / ∫ p(X) dXb
                 = ∫ P(ωi) · p(Xg, Xb | ωi) dXb / ∑_{ωj} ∫ P(ωj) · p(Xg, Xb | ωj) dXb
                 = ∫ P(ωi) · p(X | ωi) dXb / ∑_{ωj} ∫ P(ωj) · p(X | ωj) dXb

Optimal Bayes decision rule is not achievable in practice
• The optimal Bayes decision rule is not feasible in practice:
  – In any practical problem, we cannot have complete knowledge of the problem.
  – For example, the class-conditional probability density functions (p.d.f.s), p(X | ωi), are almost never available and are extremely hard to estimate.
• However, it is usually possible to collect a set of sample data (a set of feature observations) for each class in question.
  – The sample data are rarely enough to estimate a reliable p.d.f. from the samples alone, e.g., via nonparametric methods → sampling density / histogram.
• Question: how do we build a reasonable classifier based on a limited set of sample data, instead of the true p.d.f.s?

Statistical data modeling
• For any real problem, the true p.d.f.s are always unknown – neither their functional form nor their parameters.
• Our approach – statistical data modeling: based on the available sample data, choose a proper statistical model and fit it to the data set.
  – Data modeling stage: once the statistical model is selected, its functional form becomes known, except for a set of model parameters that remain unknown to us.
  – Learning (training) stage: the unknown parameters are estimated by fitting the model to the data set according to some estimation criterion.
    • The estimated statistical model (assumed model form + estimated parameters) gives a parametric p.d.f. that approximates the real but unknown p.d.f. of each class.
  – Decision (test) stage: the estimated p.d.f.s are plugged into the optimal Bayes decision rule in place of the real p.d.f.s → the plug-in MAP decision rule.
    • No longer optimal, but performs reasonably well in practice.
    • Theoretical analysis → statistical learning theory.

Data modeling
[Figure: estimated model (p.d.f.) pλ1(x) for class ω1 and estimated model (p.d.f.) pλ2(x) for class ω2, each fitted to its sample data set.]

Plug-in MAP decision rule
• Once the statistical models are estimated, they are treated as if they were the true distributions of the data and plugged into the optimal Bayes (MAP) decision rule in place of the unknown true p.d.f.s.
• The plug-in MAP decision rule:
      ωP = arg max_{ωi} p(ωi | X) = arg max_{ωi} [ P(ωi) · p(X | ωi) / p(X) ] = arg max_{ωi} P(ωi) · p(X | ωi)
         ≈ arg max_{ωi} P_Γi(ωi) · p_Λi(X | ωi)

Some useful models (I)
• A proper model must be chosen based on the nature of the observation data.
• Some useful statistical models for a variety of data types (see the sketch after this list for a plug-in MAP example with a simple Gaussian model per class):
  – Normal (Gaussian) distribution → uni-modal continuous feature scalars.
  – Multivariate normal (Gaussian) distribution → uni-modal continuous feature vectors.
  – Gaussian mixture models (GMM) → continuous feature scalars/vectors with a multi-modal distribution → e.g., in speaker recognition/verification, the distribution of speech features over a large population.
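As a concrete illustration of the data modeling, learning, and decision stages above, here is a minimal Python sketch that fits a univariate Gaussian model per class by maximum likelihood and plugs the estimated densities and priors into the MAP rule. The training samples are invented for illustration, and a single Gaussian (rather than a GMM) is assumed for simplicity.

```python
import math

# A minimal sketch of the plug-in MAP rule with a univariate normal (Gaussian)
# model per class. The training samples below are made-up; in practice each
# class's samples would come from a labelled data set.
data = {
    "w1": [1.0, 1.2, 0.8, 1.1, 0.9],     # observed feature values for class w1
    "w2": [3.1, 2.8, 3.3, 2.9, 3.0],     # observed feature values for class w2
}

# Learning (training) stage: maximum-likelihood estimates of mean and variance,
# plus class priors estimated from relative sample counts.
models, priors = {}, {}
n_total = sum(len(xs) for xs in data.values())
for w, xs in data.items():
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    models[w] = (mu, var)
    priors[w] = len(xs) / n_total

def gaussian_pdf(x, mu, var):
    """Estimated class-conditional p.d.f. p(x | w) under the assumed model."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Decision (test) stage: plug the estimated p.d.f.s into the MAP rule.
def plug_in_map(x):
    return max(models, key=lambda w: priors[w] * gaussian_pdf(x, *models[w]))

print(plug_in_map(1.4))   # -> "w1"
print(plug_in_map(2.6))   # -> "w2"
```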
Some useful models (II)
• Some useful models (cont'd):
  – Markov chain model: discrete sequential data (a minimal sequence-classification sketch appears at the end of this section).
    • N-gram models in language modeling.
  – Hidden Markov models (HMM): ideal for various kinds of sequential observation data; provide better modeling capability than the simple Markov chain model.
    • Model speech signals for recognition (one of the most successful stories of data modeling).
    • Model language/text data for part-of-speech tagging, shallow language understanding, etc.
    • Model biological data (DNA & protein sequences): profile HMMs.
    • Many other application domains.

Some useful models (III)
• Some useful models (cont'd):
  – Markov random fields: multi-dimensional spatial data.
    • Model image data, e.g., for OCR.
    • The HMM is a special case of a Markov random field.
  – Graphical models (a.k.a. Bayesian networks, belief networks):
    • High-dimensional data (discrete or continuous).
    • Model very complex stochastic processes.
    • Automatically learn dependencies from data.
    • Widely used in machine learning and data mining.
    • The HMM is also a special case of a graphical model.
• Neural networks and support vector machines (SVM) do NOT fit here:
  – They do not model the distribution (p.d.f.) of the data directly.
  – They are discriminative methods: they model the boundaries between data sets.

Generative vs. discriminative models
• The posterior probability p(ωi | X) plays the key role in pattern classification.
  – Generative models: focus on the probability distribution of the data:
      p(ωi | X) ∝ P(ωi) · p(X | ωi) ≈ P'(ωi) · p'(X | ωi)   (the plug-in MAP rule)
  – Discriminative models: directly model a discriminant function:
      p(ωi | X) ~ gi(X)
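To make the generative route concrete for discrete sequential data, here is a minimal Python sketch that fits one Markov chain (bigram) model per class and plugs the resulting likelihoods into the MAP rule. The toy DNA-like training sequences, the add-one smoothing, and the equal priors are all assumptions for illustration, not part of the slides.

```python
import math
from collections import defaultdict

# A minimal sketch of a generative classifier for discrete sequences using one
# Markov chain (bigram) model per class, plugged into the MAP rule. The toy
# training sequences over the alphabet {A, C, G, T} are made-up examples.
train = {
    "gene":   ["ATGCGC", "ATGGGC", "ATGCCC"],
    "random": ["CTTAGA", "GATTAC", "TTAGCA"],
}

def fit_bigram(seqs, alphabet="ACGT", smooth=1.0):
    """Estimate P(next | current) with add-one smoothing (an assumption here,
    not something prescribed by the slides)."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1.0
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + smooth * len(alphabet)
        probs[a] = {b: (counts[a][b] + smooth) / total for b in alphabet}
    return probs

models = {w: fit_bigram(seqs) for w, seqs in train.items()}
priors = {w: 1.0 / len(train) for w in train}          # equal priors assumed

def log_likelihood(seq, model):
    """log p(X | w) of a sequence under a class's Markov chain model
    (the first symbol is treated as given; only transitions are scored)."""
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq):
    # Plug-in MAP rule: argmax_w  log P(w) + log p(X | w)
    return max(models, key=lambda w: math.log(priors[w]) + log_likelihood(seq, models[w]))

print(classify("ATGCGG"))   # resembles the "gene" training sequences
```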