No. 4
Bayesian Decision Theory
Hui Jiang
Department of Electrical Engineering and Computer Science
Lassonde School of Engineering
York University, Toronto, Canada
Outline
•  Pattern Classification Problems
•  Bayesian Decision Theory
   –  How to make the optimal decision?
•  Generative models:
   –  Maximum a posteriori (MAP) decision rule: leading to minimum error rate classification.
   –  Model estimation: maximum likelihood, Bayesian learning, discriminative training, etc.
Pattern Classification Problem
•  Given a fixed set of a finite number of classes: ω1, ω2, … , ωN.
•  We have an unknown pattern/object P without its class identity.
•  BUT we can measure (observe) some feature(s) X about P:
   –  X is called the feature or observation of the pattern P.
   –  X can be a scalar, a vector, or a vector sequence.
   –  X can be continuous or discrete.
•  Pattern classification problem: X → ωi
   –  Determine the class identity of any pattern based on its observation or feature.
•  Fundamental issues in pattern classification:
   –  How to make an optimal classification?
   –  In what sense is it optimal?
Examples of pattern classification (I)
•  Speech recognition:
   –  Pattern: voice spoken by a human being
   –  Classes: language words/sentences used by the speaker
   –  Features: speech signal characteristics measured by a microphone → a sequence of feature vectors
      •  Each vector: continuous, high-dimensional, real-valued numbers
•  Natural language understanding:
   –  Pattern: written or spoken language of humans
   –  Classes: all possible semantic meanings or intentions
   –  Features: the words or word sequences (sentences) used
      •  Discrete; scalars or vectors
Examples of pattern classification (II)
•  Image understanding:
   –  Pattern: given images
   –  Classes: all known object categories
   –  Features: color or gray scales of all pixels
      •  Continuous; multiple vectors/matrices
   –  Examples: face recognition, OCR (optical character recognition)
•  Gene finding in bioinformatics:
   –  Pattern: a newly sequenced DNA sequence
   –  Classes: all known genes
   –  Features: all nucleotides in the sequence
      •  Discrete; 4 types (adenine, guanine, cytosine, thymine)
•  Protein classification in bioinformatics:
   –  Pattern: a protein's primary 1-D sequence
   –  Classes: all known protein families or domains
   –  Features: all amino acids in the sequence: discrete; 20 types
Bayesian Decision Theory (I)
•  Bayesian decision theory is a fundamental statistical approach to all pattern classification problems.
•  The pattern classification problem is posed in probabilistic terms:
   –  The observation X is viewed as a random variable (vector, …).
   –  The class id ω is treated as a discrete random variable, which can take values ω1, ω2, … , ωN.
   –  Therefore, we are interested in the joint probability distribution of X and ω, which contains all information about X and ω:
         p(X, ω) = p(ω) ⋅ p(X | ω)
•  If all the relevant probability values and densities in the problem are known (we have complete knowledge of the problem), Bayesian decision theory leads to the optimal classification.
   –  Optimal → guarantees the minimum average classification error.
   –  The minimum classification error is called the Bayes error.
Bayesian Decision Theory (II)
•  In pattern classification, all relevant probabilities mean:
   –  Prior probabilities of each class, P(ωi) (i = 1, 2, … , N): how likely any pattern is to come from each class before observing any features → prior knowledge from previous experience.
      •  All priors sum to one:  ∑_{i=1}^{N} P(ωi) = 1
   –  Class-conditional probability density functions of the observed feature X, p(X | ωi) (i = 1, 2, … , N): how the feature is distributed over all patterns belonging to one class ωi.
      •  If X is continuous, p(X | ωi) is a continuous p.d.f. For every class ωi:  ∫_X p(X | ωi) dX = 1
      •  If X is discrete, p(X | ωi) is a discrete probability mass function (p.m.f.). For every class ωi:  ∑_X p(X | ωi) = 1
Example of class-conditional p.d.f.
[Figure: class-conditional densities p(x|ω1) and p(x|ω2) plotted over x. From Duda et al., Pattern Classification, John Wiley & Sons, Inc.]
Bayes Decision Rule:
Maximum a posteriori (MAP) (I)
•  If we do not observe any feature of an incoming unknown pattern P, we classify it based on prior knowledge only:
      ωP = arg max_{ωi} P(ωi)
   –  Roughly, guess it as the class with the largest prior probability.
•  If we observe some features X of the unknown pattern P, we can convert the prior probability P(ωi) into a posterior probability based on Bayes' theorem:
      posterior = (prior × likelihood) / evidence
Bayes decision rule:
Maximum a posteriori (MAP) (II)
•  Where:
   –  Prior P(ωi): probability of receiving a pattern from class ωi before observing anything (prior knowledge).
   –  Likelihood p(X | ωi): probability of observing feature X assuming it comes from a pattern in class ωi (when X is given and this is treated as a function of ωi, it is called the likelihood function).
   –  Posterior p(ωi | X): probability that the pattern comes from class ωi after observing its features X.
   –  Evidence p(X): a scalar factor that guarantees the posterior probabilities sum to one over ωi.

      p(ωi | X) = P(ωi) ⋅ p(X | ωi) / p(X) = P(ωi) ⋅ p(X | ωi) / ∑_{j} P(ωj) ⋅ p(X | ωj)
Bayes decision rule:
Maximum a posteriori (MAP) (III)
•  If we observe some features X of an unknown pattern, the observation converts the prior into the posterior. Intuitively, we can classify the pattern based on the posterior probabilities, resulting in the maximum a posteriori (MAP) decision rule, also called the Bayes decision rule.
•  For an unknown pattern P, after observing some features X, we classify it into the class with the largest posterior probability:
      ωP = arg max_{ωi} p(ωi | X) = arg max_{ωi} P(ωi) ⋅ p(X | ωi) / p(X) = arg max_{ωi} P(ωi) ⋅ p(X | ωi)
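As a concrete illustration, the sketch below applies the MAP rule to one observation using made-up priors and likelihood values (the numbers are hypothetical, not from the slides); note that dropping the evidence p(X) does not change the arg max.

import numpy as np

# Hypothetical priors P(w_i) and likelihood values p(X | w_i) for one observed X;
# the numbers are made up purely for illustration.
priors = np.array([0.5, 0.3, 0.2])          # P(w_1), P(w_2), P(w_3)
likelihoods = np.array([0.05, 0.20, 0.10])  # p(X | w_1), p(X | w_2), p(X | w_3)

# Posterior p(w_i | X) = P(w_i) * p(X | w_i) / p(X); the evidence p(X) is the
# normalizing sum and does not change the arg max.
joint = priors * likelihoods
posteriors = joint / joint.sum()

# MAP (Bayes) decision: pick the class with the largest posterior.
decision = np.argmax(posteriors)            # equals np.argmax(joint)
print(posteriors, "-> classify as class", decision + 1)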
The MAP decision rule is optimal (I)
•  How well does the MAP decision rule behave?
•  Optimality: assuming we have complete knowledge, including P(ωi) and p(X | ωi) (i = 1, 2, … , N), the MAP decision rule is the optimal way to classify patterns, i.e., it achieves the lowest average classification error rate.
•  Proof of optimality of the MAP rule:
   –  Given a pattern P, if its true class id is ωi but we classify it as ωP, the classification error is counted as l(ωP | ωi):
         l(ωP | ωi) = 0 if ωP = ωi,  1 if ωP ≠ ωi
      which is also known as the 0-1 loss function.
The MAP decision rule is optimal (II)
•  Proof of optimality of the MAP rule (cont'd):
   –  For a pattern P, after observing X, the posterior p(ωi | X) is the probability that the true class id of P is ωi. Thus the expected (average) classification error associated with classifying P as ωP is:
         R(ωP | X) = ∑_{i=1}^{N} l(ωP | ωi) ⋅ p(ωi | X) = ∑_{ωi ≠ ωP} p(ωi | X) = 1 − p(ωP | X)
   –  The optimal classification minimizes the above average classification error, i.e., after observing X, we classify P as the ωP that minimizes R(ωP | X) → maximizes p(ωP | X).
   → The MAP decision rule is optimal; it achieves the minimum average error rate. The minimum error is called the Bayes error.
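A minimal numeric check of this step, using hypothetical posterior values: under the 0-1 loss, the conditional risk is 1 − p(ωP | X), so the risk-minimizing decision coincides with the MAP decision.

import numpy as np

# Hypothetical posteriors p(w_i | X) for one observation (illustration only).
posteriors = np.array([0.25, 0.55, 0.20])   # sums to one

# Conditional risk of deciding w_P under the 0-1 loss:
# R(w_P | X) = sum_i l(w_P | w_i) p(w_i | X) = 1 - p(w_P | X).
risks = 1.0 - posteriors

# Minimizing the risk is the same as maximizing the posterior (MAP).
assert np.argmin(risks) == np.argmax(posteriors)
print("risks:", risks, "-> optimal decision: class", np.argmin(risks) + 1)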
The MAP decision rule
•  A general decision rule is a mapping function that, given any observation X, outputs a class id ωP: X → ωP.
•  If we have N classes in total, a decision rule partitions the entire feature space of X into N different regions, O1, O2, … , ON. If X is located in region Oi, we classify it as class ωi.
•  Each region Oi may consist of many contiguous areas.
•  The MAP decision rule (the Bayes decision rule) is optimal among all possible decision rules in terms of minimizing the average classification error, provided that we have complete and precise knowledge about the underlying problem.
[Diagram: the feature space of X is partitioned into regions mapped to classes ω1, ω2, … , ωN.]
The MAP decision rule example
[Figure: one-dimensional example in which the feature space is partitioned into decision regions O1 and O2, each consisting of several disjoint intervals. From Duda et al., Pattern Classification, John Wiley & Sons, Inc.]
Classification Error Probability
of a decision rule
•  Assume an N-class problem; any decision rule partitions the feature space into N regions, O1, O2, … , ON.
•  Pr(X ∈ Oi, ωj) denotes the probability that the observation X of a pattern whose true class id is ωj falls in the region Oi.
•  The overall classification error probability of the decision rule is:
      Pr(error) = 1 − Pr(correct)
                = 1 − ∑_{i=1}^{N} Pr(X ∈ Oi, ωi)
                = 1 − ∑_{i=1}^{N} Pr(X ∈ Oi | ωi) ⋅ P(ωi)
                = 1 − ∑_{i=1}^{N} ∫_{Oi} p(X | ωi) ⋅ P(ωi) dX
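The integral above is rarely computed in closed form; a Monte Carlo sketch such as the one below can estimate it by sampling from the joint p(X, ω). The two-class 1-D Gaussian setup (priors, means, standard deviations) is an assumed toy example, not taken from the slides, and since the rule being evaluated is the MAP rule itself, the estimate also approximates the Bayes error.

import numpy as np

rng = np.random.default_rng(0)

# Assumed two-class 1-D Gaussian problem (illustrative choice, not from the slides).
priors = np.array([0.6, 0.4])
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def map_decide(x):
    # Unnormalized posteriors P(w_i) * p(x | w_i) for the MAP rule.
    dens = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return np.argmax(priors * dens, axis=1)

# Draw labeled samples from the joint p(x, w) and count how often the rule errs;
# this Monte Carlo average approximates Pr(error) = 1 - sum_i ∫_{O_i} p(x|w_i) P(w_i) dx.
n = 200_000
labels = rng.choice(2, size=n, p=priors)
x = rng.normal(means[labels], stds[labels])
print("estimated Pr(error):", np.mean(map_decide(x) != labels))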
Example of Error Probability in the 2-class case
[Figure: the two error contributions, ω2 → ω1 and ω1 → ω2, in a two-class problem. From Duda et al., Pattern Classification, John Wiley & Sons, Inc.]
Bayes Error
•  Bayes error: the error probability of the Bayes (MAP) decision rule.
•  Since the Bayes decision rule guarantees the minimum error, the Bayes error is the lower bound of all possible error probabilities.
•  It is difficult to calculate the Bayes error, even for very simple cases, because of the discontinuous nature of the decision regions in the integral, especially in high dimensions.
•  Some approximation methods estimate an upper bound:
   –  Chernoff bound
   –  Bhattacharyya bound
•  Alternatively, evaluate on an independent test set.
Example: Bayes decision for
independent binary features
•  The Bayes decision rule (the MAP rule) is also applicable when the feature X is discrete.
•  A simple case (binomial model): 2 classes (ω1, ω2); the feature vector is d-dimensional, with binary-valued and conditionally independent components:
      X = (x1, x2, … , xd)^t,  xi ∈ {0, 1} (1 ≤ i ≤ d)
      pi = Pr(xi = 1 | ω1)  and  qi = Pr(xi = 1 | ω2)
      p(X | ω1) = ∏_{i=1}^{d} pi^{xi} (1 − pi)^{1−xi}
      p(X | ω2) = ∏_{i=1}^{d} qi^{xi} (1 − qi)^{1−xi}
Example: Bayes decision for
independent binary features
•  The MAP decision rule: classify as ω1 if P(ω1) ⋅ p(X | ω1) ≥ P(ω2) ⋅ p(X | ω2), otherwise ω2.
   Equivalently, we have the decision function:
      g(X) = ∑_{i=1}^{d} [ xi ln(pi / qi) + (1 − xi) ln((1 − pi) / (1 − qi)) ] + ln(P(ω1) / P(ω2)) = ∑_{i=1}^{d} λi xi + λ0
   where
      λi = ln[ pi (1 − qi) / (qi (1 − pi)) ],   λ0 = ∑_{i=1}^{d} ln((1 − pi) / (1 − qi)) + ln(P(ω1) / P(ω2))
   If g(X) ≥ 0, classify as ω1, otherwise ω2.
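A short sketch of this linear discriminant, with hypothetical values for pi, qi and the priors (d = 4 here is arbitrary):

import numpy as np

# Hypothetical model parameters for d = 4 binary features (illustration only).
p = np.array([0.8, 0.6, 0.7, 0.3])   # p_i = Pr(x_i = 1 | w_1)
q = np.array([0.3, 0.5, 0.2, 0.6])   # q_i = Pr(x_i = 1 | w_2)
prior1, prior2 = 0.5, 0.5

# Linear discriminant from the slide: g(X) = sum_i lambda_i x_i + lambda_0.
lam = np.log(p * (1 - q) / (q * (1 - p)))
lam0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

def classify(x):
    """Return 1 (class w_1) if g(X) >= 0, else 2 (class w_2)."""
    g = lam @ x + lam0
    return 1 if g >= 0 else 2

print(classify(np.array([1, 1, 1, 0])))  # mostly 'class-1-like' features
print(classify(np.array([0, 0, 0, 1])))  # mostly 'class-2-like' features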
Missing features/data (I)
•  If we know the full probability structure of a problem, we can construct the optimal Bayes decision rule.
•  In some practical situations, for some patterns we cannot observe the full feature vector described in the probability structure. Only partial information of the feature vector is observed; some components are missing.
•  How do we classify such corrupted inputs to obtain the minimum average error?
•  Let the full feature vector be X = [Xg, Xb], where Xg represents the observed or good features and Xb represents the missing or bad ones.
•  In this case, the optimal decision rule is constructed from the posterior p(ωi | Xg), as follows:
      ωP = arg max_{ωi} p(ωi | Xg)
Missing features/data (II)
•  Where
      p(ωi | Xg) = p(ωi, Xg) / p(Xg)
                 = ∫ p(ωi, Xg, Xb) dXb / p(Xg)
                 = ∫ p(ωi | Xg, Xb) ⋅ p(Xg, Xb) dXb / p(Xg)
                 = ∫ p(ωi | X) ⋅ p(X) dXb / ∫ p(X) dXb
                 = ∫ P(ωi) ⋅ p(Xg, Xb | ωi) dXb / ∑_{ωj} ∫ P(ωj) ⋅ p(Xg, Xb | ωj) dXb
                 = ∫ P(ωi) ⋅ p(X | ωi) dXb / ∑_{ωj} ∫ P(ωj) ⋅ p(X | ωj) dXb
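A minimal sketch of this marginalization for a fully discrete toy case (two binary features, one of them missing; the p.m.f. tables and priors are assumed purely for illustration). For discrete Xb, the integrals above become sums over its possible values.

import numpy as np

# Illustrative discrete setting (not from the slides): 2 classes, X = (x_g, x_b),
# each component taking values {0, 1}, with assumed class-conditional p.m.f. tables.
priors = np.array([0.5, 0.5])
# pmf[i, a, b] = p(x_g = a, x_b = b | w_{i+1}); each 2x2 table sums to one.
pmf = np.array([[[0.4, 0.3],
                 [0.2, 0.1]],
                [[0.1, 0.1],
                 [0.3, 0.5]]])

def map_with_missing(x_g):
    # Marginalize the missing component: p(x_g | w_i) = sum_b p(x_g, x_b | w_i),
    # then apply the MAP rule to P(w_i) * p(x_g | w_i).
    marginal = pmf[:, x_g, :].sum(axis=1)
    return np.argmax(priors * marginal) + 1

print(map_with_missing(0))  # decide using only the observed (good) feature
print(map_with_missing(1))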
Optimal Bayes decision rule
is not achievable in practice
•  The optimal Bayes decision rule is not feasible in practice.
   –  In any practical problem, we cannot have complete knowledge about the problem.
   –  e.g., the class-conditional probability density functions (p.d.f.), p(X | ωi), are always unavailable and extremely hard to estimate.
•  However, it is possible to collect a set of sample data (a set of feature observations) for each class in question.
   –  The sample data are almost always far from enough to estimate a reliable p.d.f. using the sample data themselves ONLY, e.g., with some nonparametric methods → sampling density / histogram.
•  Question: How do we build a reasonable classifier based on a limited set of sample data, instead of the true p.d.f.'s?
Statistical Data Modeling
•  For any real problem, the true p.d.f.'s are always unknown, in both their functional form and their parameters.
•  Our approach – statistical data modeling: based on the available sample data set, choose a proper statistical model to fit to the available data.
   –  Data modeling stage: once the statistical model is selected, its functional form becomes known, except that a set of model parameters associated with the model remains unknown to us.
   –  Learning (training) stage: the unknown parameters are estimated by fitting the model to the data set based on a certain estimation criterion.
      •  The estimated statistical model (assumed model form + estimated parameters) gives a parametric p.d.f. to approximate the real but unknown p.d.f. of each class.
   –  Decision (test) stage: the estimated p.d.f.'s are plugged into the optimal Bayes decision rule in place of the real p.d.f.'s
      → plug-in MAP decision rule
      •  Not optimal any more, but performs reasonably well in practice.
      •  Theoretical analysis → Statistical Learning Theory
Data modeling
[Figure: estimated model (p.d.f.) pλ1(x) for class ω1 and estimated model (p.d.f.) pλ2(x) for class ω2.]
Plug-in MAP decision rule
•  Once the statistical models are estimated, they are treated as if they were the true distributions of the data and plugged into the form of the optimal Bayes (MAP) decision rule in place of the unknown true p.d.f.'s.
•  The plug-in MAP decision rule:
      ωP = arg max_{ωi} p(ωi | X) = arg max_{ωi} P(ωi) ⋅ p(X | ωi) / p(X) = arg max_{ωi} P(ωi) ⋅ p(X | ωi)
         ≈ arg max_{ωi} P_Γi(ωi) ⋅ p_Λi(X | ωi)
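A sketch of the full pipeline under an assumed 1-D Gaussian model per class: the learning stage fits each class model by maximum likelihood, and the decision stage plugs the estimates into the MAP rule. The toy data and parameter choices are illustrative only.

import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D training samples for two classes (assumed Gaussian data, illustration only).
train = {1: rng.normal(0.0, 1.0, 500), 2: rng.normal(2.5, 0.8, 300)}
classes = sorted(train)

# Learning stage: maximum-likelihood estimates of each Gaussian's mean/std,
# with priors estimated from class frequencies.
n_total = sum(len(v) for v in train.values())
priors = np.array([len(train[c]) / n_total for c in classes])
means = np.array([train[c].mean() for c in classes])
stds = np.array([train[c].std() for c in classes])

def plug_in_map(x):
    # Decision stage: plug the estimated p.d.f.'s into the MAP rule
    # in place of the unknown true densities.
    dens = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return classes[np.argmax(priors * dens)]

print(plug_in_map(0.2), plug_in_map(2.3))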
Some useful models (I)
•  A proper model must be chosen based on the nature of the observation data.
•  Some useful statistical models for a variety of data types:
   –  Normal (Gaussian) distribution
      → uni-modal continuous feature scalars
   –  Multivariate normal (Gaussian) distribution
      → uni-modal continuous feature vectors
   –  Gaussian mixture models (GMM)
      → continuous feature scalars/vectors with a multi-modal distribution
      → e.g., for speaker recognition/verification: the distribution of speech features over a large population
Some useful models (II)
•  Some useful models (cont'd):
   –  Markov chain model: discrete sequential data
      •  N-gram model in language modeling
   –  Hidden Markov Models (HMM): ideal for various kinds of sequential observation data; provide better modeling capability than a simple Markov chain model.
      •  Model speech signals for recognition (one of the most successful stories of data modeling)
      •  Model language/text data for part-of-speech tagging, shallow language understanding, etc.
      •  Model biological data (DNA & protein sequences): profile HMM
      •  Lots of other application domains
Some useful models (III)
•  Some useful models (cont'd):
   –  Markov random field: multi-dimensional spatial data
      •  Model image data: e.g., used for OCR, etc.
      •  HMM is a special case of a Markov random field
   –  Graphical models (a.k.a. Bayesian networks, belief networks)
      •  High-dimensional data (discrete or continuous)
      •  Model very complex stochastic processes
      •  Automatically learn dependencies from data
      •  Used widely in machine learning and data mining
      •  HMM is also a special case of a graphical model
•  Neural networks and support vector machines (SVM) DON'T fit here.
   –  They do not model the distribution (p.d.f.) of the data directly.
   –  Discriminative methods: model the boundaries between data sets
Generative vs. discriminative
models
•  The posterior probability p(ωi|X) plays the key role in pattern classification.
   –  Generative models: focus on the probability distribution of the data
         p(ωi|X) ∝ p(ωi) ⋅ p(X|ωi) ≈ p'(ωi) ⋅ p'(X|ωi)   (the plug-in MAP rule)
   –  Discriminative models: directly model the discriminant function:
         p(ωi|X) ~ gi(X)
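For contrast with the generative plug-in sketch earlier, the following toy example models the posterior directly with a logistic discriminant g(X) fitted by gradient descent (logistic regression is used here only as a simple stand-in for discriminative methods; it is not discussed in these slides, and the data are synthetic).

import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D two-class data (illustrative only).
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Discriminative approach: model the posterior p(w_2 | x) directly with a
# logistic discriminant g(x) = sigmoid(w*x + b), fitted by gradient descent;
# no class-conditional p.d.f. is estimated.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # current estimate of p(w_2 | x)
    grad_w, grad_b = np.mean((p - y) * x), np.mean(p - y)
    w, b = w - lr * grad_w, b - lr * grad_b

# Classify by thresholding the modeled posterior.
pred = (1.0 / (1.0 + np.exp(-(w * x + b))) >= 0.5).astype(int)
print("training accuracy:", np.mean(pred == y))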