Bayesian decision theory: introduction
Overview
• Bayesian decision theory allows us to make optimal decisions in a fully probabilistic setting
• It assumes all relevant probabilities are known
• It provides a lower bound on the achievable error (the Bayes error), against which classifiers can be evaluated
• Bayesian reasoning can be generalized to cases in which the probabilistic structure is not entirely known
Bayesian decision theory: introduction
Binary classification
• Assume examples (x, y) ∈ X × {−1, 1} are drawn from a known distribution p(x, y).
• The task is predicting the class y of examples given the input x.
• Bayes rule allows us to write it in probabilistic terms as:
P(y|x) = p(x|y) P(y) / p(x)
Bayesian decision theory: introduction
Bayes rule
Bayes rule allows us to compute the posterior probability from the likelihood, the prior and the evidence:
posterior = (likelihood × prior) / evidence
posterior P(y|x) is the probability that the class is y given that x was observed
likelihood p(x|y) is the probability of observing x given that its class is y
prior P (y) is the prior probability of the class, without any evidence
evidence p(x) is the probability of the observation, and by the law of total probability can be computed as:
p(x) = Σ_{i=1}^{2} p(x|y_i) P(y_i)
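As a small illustration of Bayes rule and the law of total probability, the sketch below computes the class posteriors for a binary problem; the one-dimensional Gaussian likelihoods and the priors are assumptions made up for the example.

```python
# Minimal sketch of Bayes rule for binary classification; the Gaussian
# class-conditional densities and the priors are illustrative assumptions.
from scipy.stats import norm

priors = {+1: 0.6, -1: 0.4}                    # P(y)
likelihoods = {+1: norm(loc=2.0, scale=1.0),   # p(x|y = +1)
               -1: norm(loc=-1.0, scale=1.5)}  # p(x|y = -1)

def posterior(x):
    """Return P(y|x) for both classes via Bayes rule."""
    joint = {y: likelihoods[y].pdf(x) * priors[y] for y in priors}
    evidence = sum(joint.values())             # p(x) by total probability
    return {y: joint[y] / evidence for y in joint}

print(posterior(0.5))                          # the two posteriors sum to 1
```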
Bayesian decision theory: introduction
Probability of error
• Probability of error given x:
P(error|x) = P(y_2|x) if we decide y_1,
             P(y_1|x) if we decide y_2
• Average probability of error:
P(error) = ∫_{−∞}^{+∞} P(error|x) p(x) dx
Bayesian decision theory: introduction
Bayes decision rule
y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)
• The probability of error given x is:
P(error|x) = min[P(y_1|x), P(y_2|x)]
• The Bayes decision rule minimizes the probability of error
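A possible sketch of this rule, assuming the posteriors have already been computed (e.g. as in the previous snippet): predict the class with the largest posterior and report min[P(y_1|x), P(y_2|x)] as the conditional probability of error.

```python
# Bayes decision rule for the binary case, given precomputed posteriors.
def bayes_decision(posteriors):
    """posteriors: dict {class: P(class|x)}. Returns (prediction, P(error|x))."""
    y_hat = max(posteriors, key=posteriors.get)   # argmax_y P(y|x)
    p_error = min(posteriors.values())            # min[P(y_1|x), P(y_2|x)]
    return y_hat, p_error

print(bayes_decision({+1: 0.7, -1: 0.3}))         # (1, 0.3)
```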
Bayesian decision theory: Continuous features
Setting
• Inputs are vectors in a d-dimensional Euclidean space ℝ^d called the feature space
• The output is one of c possible categories or classes Y = {y1 , . . . , yc } (binary/multiclass classification)
• A decision rule is a function f : X → Y telling what class to predict for a given observation
• A loss function ℓ : Y × Y → ℝ measures the cost of predicting y_i when the true class is y_j
• The conditional risk of predicting y given x is:
R(y|x) = Σ_{i=1}^{c} ℓ(y, y_i) P(y_i|x)
Bayesian decision theory: Continuous features
Bayes decision rule
y* = argmin_{y ∈ Y} R(y|x)
• The overall risk of a decision rule f is given by
R[f] = ∫ R(f(x)|x) p(x) dx = ∫ Σ_{i=1}^{c} ℓ(f(x), y_i) p(y_i, x) dx
• The Bayes decision rule minimizes the overall risk.
• The Bayes risk R∗ is the overall risk of a Bayes decision rule, and it’s the best performance achievable.
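To make the conditional risk concrete, the sketch below evaluates R(y_i|x) for a two-class problem with an asymmetric loss matrix; the loss values and the posteriors are illustrative assumptions.

```python
# Conditional risk R(y_i|x) = sum_j loss(y_i, y_j) P(y_j|x), minimized over y_i.
import numpy as np

loss = np.array([[0.0, 1.0],        # loss[i, j]: cost of predicting class i
                 [5.0, 0.0]])       # when the true class is j (asymmetric costs)
posteriors = np.array([0.8, 0.2])   # P(y_j|x)

risk = loss @ posteriors            # conditional risk of each possible prediction
print(risk)                         # [0.2, 4.0]
print(int(risk.argmin()))           # 0: the prediction minimizing the risk
```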
Minimum-error-rate classification
Zero-one loss
ℓ(y, y_i) = 0 if y = y_i,  1 if y ≠ y_i
• It assigns a unit cost to a misclassified example, and a zero cost to a correctly classified one
• It assumes that all errors have the same cost
• Its risk corresponds to the average probability of error:
R(y_i|x) = Σ_{j=1}^{c} ℓ(y_i, y_j) P(y_j|x) = Σ_{j≠i} P(y_j|x) = 1 − P(y_i|x)
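A quick numerical check of the identity above, with made-up posteriors: under zero-one loss the conditional risk is 1 − P(y_i|x), so minimizing the risk is the same as maximizing the posterior.

```python
# Zero-one loss: the risk equals one minus the posterior of the predicted class.
import numpy as np

posteriors = np.array([0.7, 0.2, 0.1])       # P(y_j|x), assumed values
zero_one = 1.0 - np.eye(3)                   # loss(y_i, y_j) = 0 iff i == j
risk = zero_one @ posteriors                 # R(y_i|x) = sum_{j != i} P(y_j|x)
assert np.allclose(risk, 1.0 - posteriors)
assert risk.argmin() == posteriors.argmax()  # minimum risk = maximum posterior
```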
Minimum-error-rate classification: binary case
[Figure: class-conditional densities p(x|ω_1) and p(x|ω_2) with decision thresholds θ_a and θ_b and the corresponding decision regions R_1 and R_2]
Bayes decision rule
y_B = argmax_{y_i ∈ {−1,1}} P(y_i|x) = argmax_{y_i ∈ {−1,1}} p(x|y_i) P(y_i)
• Likelihood ratio:
Decide y_1 if p(x|y_1) P(y_1) > p(x|y_2) P(y_2), i.e. if
p(x|y_1) / p(x|y_2) > P(y_2) / P(y_1) = θ_a
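A sketch of the likelihood-ratio test, assuming one-dimensional Gaussian class-conditional densities and fixed priors (all values invented for the example): y_1 is predicted exactly when the ratio exceeds θ_a = P(y_2)/P(y_1).

```python
# Likelihood-ratio test against the prior ratio theta_a = P(y2)/P(y1).
from scipy.stats import norm

p1, p2 = norm(loc=2.0, scale=1.0), norm(loc=-1.0, scale=1.0)  # p(x|y1), p(x|y2)
P1, P2 = 0.3, 0.7                                             # priors
theta_a = P2 / P1                                             # decision threshold

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)            # likelihood ratio p(x|y1)/p(x|y2)
    return +1 if ratio > theta_a else -1     # predict y1 iff ratio > theta_a

print(decide(1.5), decide(-0.5))             # 1 -1 with these parameters
```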
Representing classifiers
Discriminant functions
• A classifier can be represented as a set of discriminant functions g_i(x), i = 1, . . . , c, giving:
y = argmax_{i=1,...,c} g_i(x)
• A discriminant function is not unique ⇒ the most convenient one for computational or explanatory reasons can be used:
g_i(x) = P(y_i|x) = p(x|y_i) P(y_i) / p(x)
g_i(x) = p(x|y_i) P(y_i)
g_i(x) = ln p(x|y_i) + ln P(y_i)
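The three forms are monotonically related, so they produce the same decision. The snippet below checks this on a toy binary problem; the Gaussian likelihoods and the priors are assumptions made for the example.

```python
# The three discriminant functions above yield identical argmax decisions.
import numpy as np
from scipy.stats import norm

classes = [0, 1]
priors = np.array([0.4, 0.6])
lik = [norm(0.0, 1.0), norm(3.0, 2.0)]       # p(x|y_i)

def decision(x, g):
    return int(np.argmax([g(x, i) for i in classes]))

g_post  = lambda x, i: lik[i].pdf(x) * priors[i] / sum(lik[j].pdf(x) * priors[j] for j in classes)
g_joint = lambda x, i: lik[i].pdf(x) * priors[i]
g_log   = lambda x, i: lik[i].logpdf(x) + np.log(priors[i])

for x in [-1.0, 1.0, 4.0]:
    assert decision(x, g_post) == decision(x, g_joint) == decision(x, g_log)
```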
Representing classifiers
[Figure: joint densities p(x|ω_1)P(ω_1) and p(x|ω_2)P(ω_2) over a two-dimensional feature space, with decision regions R_1, R_2 separated by the decision boundary]
Decision regions
• The feature space is divided into decision regions R1 , . . . , Rc such that:
x ∈ R_i   if   g_i(x) > g_j(x) ∀ j ≠ i
• Decision regions are separated by decision boundaries, regions in which ties occur among the largest discriminant functions
Normal density
Multivariate normal density
p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
• The covariance matrix Σ is always symmetric and positive semi-definite
• Σ is strictly positive definite as long as the distribution really spans all d dimensions of the feature space; otherwise |Σ| = 0 and the density above is not well defined
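A small check of the density formula against scipy, with an invented mean and covariance.

```python
# Multivariate normal density: explicit formula vs scipy.stats.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                      # symmetric, positive definite
x = np.array([0.0, 0.0])

d = len(mu)
diff = x - mu
maha2 = diff @ np.linalg.inv(Sigma) @ diff          # squared Mahalanobis distance
pdf_manual = np.exp(-0.5 * maha2) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
assert np.isclose(pdf_manual, pdf_scipy)
```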
Normal density
[Figure: equal-density contours of a bivariate normal in the (x_1, x_2) plane, centered at µ]
Hyperellipsoids
• The loci of points of constant density are hyperellipsoids of constant Mahalanobis distance from x to µ.
• The principal axes of such hyperellipsoids are the eigenvectors of Σ, their lengths are given by the corresponding
eigenvalues
Discriminant functions for normal density
Discriminant functions
g_i(x) = ln p(x|y_i) + ln P(y_i)
       = −(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(y_i)
Discarding terms which are independent of i we obtain:
g_i(x) = −(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i) − (1/2) ln |Σ_i| + ln P(y_i)
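A direct implementation of the discriminant above; the class means, covariances and priors are illustrative assumptions.

```python
# Gaussian discriminant g_i(x) for one class, then compared across classes.
import numpy as np

def gaussian_discriminant(x, mu_i, Sigma_i, prior_i):
    """g_i(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln P(y_i)."""
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.inv(Sigma_i) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma_i))
            + np.log(prior_i))

x = np.array([0.5, 0.5])
g0 = gaussian_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g1 = gaussian_discriminant(x, np.array([2.0, 2.0]), np.eye(2), 0.5)
print(0 if g0 > g1 else 1)   # predicted class index (0 with these parameters)
```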
Discriminant functions for normal density
case Σi = σ 2 I
• Features are statistically independent
• All features have same variance σ 2
• The covariance determinant |Σ_i| = σ^{2d} can be ignored, being independent of i
• The inverse covariance is Σ_i^{−1} = (1/σ²) I
• The discriminant functions become:
g_i(x) = −||x − µ_i||² / (2σ²) + ln P(y_i)
Discriminant functions for normal density
[Figure: class-conditional densities and decision regions for the case Σ_i = σ²I in one, two and three dimensions, with equal priors P(ω_1) = P(ω_2) = 0.5]
case Σi = σ 2 I
• Expansion of the quadratic form leads to:
g_i(x) = −(1/(2σ²)) [x^T x − 2 µ_i^T x + µ_i^T µ_i] + ln P(y_i)
• Discarding terms which are independent of i we obtain linear discriminant functions:
g_i(x) = (1/σ²) µ_i^T x − (1/(2σ²)) µ_i^T µ_i + ln P(y_i) = w_i^T x + w_{i0}
with w_i = µ_i / σ² and w_{i0} = −µ_i^T µ_i / (2σ²) + ln P(y_i)
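A sketch of the linear form w_i^T x + w_{i0}; the means, variance and priors are illustrative assumptions.

```python
# Linear discriminant for Sigma_i = sigma^2 I, written as g_i(x) = w_i^T x + w_i0.
import numpy as np

def linear_discriminant(mu_i, sigma2, prior_i):
    w_i = mu_i / sigma2                                      # (1/sigma^2) mu_i
    w_i0 = -(mu_i @ mu_i) / (2 * sigma2) + np.log(prior_i)   # bias term
    return w_i, w_i0

mu0, mu1, sigma2 = np.array([0.0, 0.0]), np.array([2.0, 2.0]), 1.0
w0, b0 = linear_discriminant(mu0, sigma2, 0.5)
w1, b1 = linear_discriminant(mu1, sigma2, 0.5)
x = np.array([0.5, 0.5])
print(0 if w0 @ x + b0 > w1 @ x + b1 else 1)   # predicted class index (0 here)
```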
case Σi = σ 2 I
Separating hyperplane
• Setting gi (x) = gj (x) we note that the decision boundaries are pieces of hyperplanes:
w^T (x − x_0) = 0, with
w = µ_i − µ_j
x_0 = (1/2)(µ_i + µ_j) − (σ² / ||µ_i − µ_j||²) ln(P(y_i)/P(y_j)) (µ_i − µ_j)
• The hyperplane is orthogonal to vector w ⇒ orthogonal to the line linking the means
• The hyperplane passes through x0 :
– if the prior probabilities of classes are equal, x0 is halfway between the means
– otherwise, x_0 shifts away from the more likely mean, enlarging the decision region of the more likely class (see the sketch below)
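A sketch of the hyperplane parameters, with invented means, variance and priors: with equal priors x_0 is the midpoint of the means; with P(y_i) = 0.9 it moves away from µ_i.

```python
# Separating hyperplane w^T (x - x0) = 0 for the Sigma_i = sigma^2 I case.
import numpy as np

def hyperplane(mu_i, mu_j, sigma2, P_i, P_j):
    w = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - sigma2 / (w @ w) * np.log(P_i / P_j) * w
    return w, x0

mu_i, mu_j, sigma2 = np.array([2.0, 0.0]), np.array([-2.0, 0.0]), 1.0
_, x0_equal = hyperplane(mu_i, mu_j, sigma2, 0.5, 0.5)
_, x0_skewed = hyperplane(mu_i, mu_j, sigma2, 0.9, 0.1)
print(x0_equal)    # [0. 0.]: halfway between the means
print(x0_skewed)   # approx. [-0.55, 0]: shifted away from mu_i, the more likely mean
```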
case Σi = σ 2 I
[Figure: for Σ_i = σ²I, one-, two- and three-dimensional examples of how the decision boundary moves as the priors become more unequal (e.g. P(ω_1) = 0.7, 0.8, 0.9, 0.99), shifting away from the more likely mean]
Discriminant functions for normal density
case Σi = Σ
• All classes have same covariance matrix
• The discriminant functions become:
g_i(x) = −(1/2) (x − µ_i)^T Σ^{−1} (x − µ_i) + ln P(y_i)
• Expanding the quadratic form and discarding terms independent of i we again obtain linear discriminant functions:
g_i(x) = µ_i^T Σ^{−1} x − (1/2) µ_i^T Σ^{−1} µ_i + ln P(y_i) = w_i^T x + w_{i0}
with w_i = Σ^{−1} µ_i and w_{i0} = −(1/2) µ_i^T Σ^{−1} µ_i + ln P(y_i)
• The separating hyperplanes are not necessarily orthogonal to the line linking the means:
w^T (x − x_0) = 0, with
w = Σ^{−1} (µ_i − µ_j)
x_0 = (1/2)(µ_i + µ_j) − [ln(P(y_i)/P(y_j)) / ((µ_i − µ_j)^T Σ^{−1} (µ_i − µ_j))] (µ_i − µ_j)
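A sketch with an invented shared covariance and means, showing that w = Σ^{−1}(µ_i − µ_j) is generally not parallel to µ_i − µ_j, so the boundary is not orthogonal to the line linking the means.

```python
# Hyperplane parameters for the shared-covariance case Sigma_i = Sigma.
import numpy as np

mu_i, mu_j = np.array([2.0, 0.0]), np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
P_i, P_j = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
diff = mu_i - mu_j
w = Sigma_inv @ diff                                   # normal to the decision boundary
x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) / (diff @ Sigma_inv @ diff) * diff

print(w, diff)   # approx. [1.33 -0.67] vs [2. 0.]: not parallel
print(x0)        # [1. 0.]: midpoint of the means, since the priors are equal
```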
case Σi = Σ
[Figure: decision boundaries for the shared-covariance case Σ_i = Σ, with equal priors P(ω_1) = P(ω_2) = 0.5 and with unequal priors P(ω_1) = 0.1, P(ω_2) = 0.9; the boundary is a hyperplane, generally not orthogonal to the line linking the means]
Discriminant functions for normal density
case Σi = arbitrary
• The discriminant functions are inherently quadratic:
g_i(x) = x^T W_i x + w_i^T x + w_{i0}
with W_i = −(1/2) Σ_i^{−1},  w_i = Σ_i^{−1} µ_i,  and  w_{i0} = −(1/2) µ_i^T Σ_i^{−1} µ_i − (1/2) ln |Σ_i| + ln P(y_i)
• In the two-category case, the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, etc.
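A sketch of the quadratic discriminant with arbitrary per-class covariances; the two classes below (isotropic vs elongated covariance) are illustrative assumptions.

```python
# Quadratic discriminant g_i(x) = x^T W_i x + w_i^T x + w_i0 for arbitrary Sigma_i.
import numpy as np

def quadratic_discriminant(mu_i, Sigma_i, prior_i):
    Sigma_inv = np.linalg.inv(Sigma_i)
    W_i = -0.5 * Sigma_inv
    w_i = Sigma_inv @ mu_i
    w_i0 = (-0.5 * mu_i @ Sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(Sigma_i))
            + np.log(prior_i))
    return lambda x: x @ W_i @ x + w_i @ x + w_i0

g0 = quadratic_discriminant(np.array([0.0, 0.0]), np.eye(2), 0.5)
g1 = quadratic_discriminant(np.array([3.0, 0.0]), np.array([[4.0, 0.0], [0.0, 0.25]]), 0.5)
x = np.array([1.0, 1.0])
print(0 if g0(x) > g1(x) else 1)   # class with the larger discriminant (0 here)
```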
case Σi = arbitrary
[Figure: examples of decision regions for Gaussians with arbitrary covariance matrices]
Reducible error
[Figure: joint densities p(x|ω_i)P(ω_i) with the Bayes boundary x_B and a suboptimal boundary x*; the resulting error is ∫_{R_1} p(x|ω_2)P(ω_2) dx + ∫_{R_2} p(x|ω_1)P(ω_1) dx, and the extra area due to using x* instead of x_B is the reducible error]
Probability of error in binary classification
P(error) = P(x ∈ R_1, y_2) + P(x ∈ R_2, y_1)
         = ∫_{R_1} p(x|y_2) P(y_2) dx + ∫_{R_2} p(x|y_1) P(y_1) dx
The reducible error is the error resulting from a suboptimal choice of the decision boundary
Bayesian decision theory: arbitrary inputs and outputs
Setting
• Examples are input-output pairs (x, y) ∈ X × Y generated with probability p(x, y).
• The conditional risk of predicting y ∗ given x is:
R(y*|x) = ∫_Y ℓ(y*, y) p(y|x) dy
• The overall risk of a decision rule f is given by
R[f] = ∫_X R(f(x)|x) p(x) dx = ∫_X ∫_Y ℓ(f(x), y) p(y, x) dy dx
• Bayes decision rule
y_B = argmin_{y ∈ Y} R(y|x)
Handling missing features
Marginalize over missing variables
• Assume input x consists of an observed part xo and missing part xm .
• Posterior probability of yi given the observation can be obtained from probabilities over entire inputs by
marginalizing over the missing part:
P(y_i|x_o) = p(y_i, x_o) / p(x_o) = ∫ p(y_i, x_o, x_m) dx_m / p(x_o)
           = ∫ P(y_i|x_o, x_m) p(x_o, x_m) dx_m / ∫ p(x_o, x_m) dx_m
           = ∫ P(y_i|x) p(x) dx_m / ∫ p(x) dx_m
where x = (x_o, x_m) denotes the full input.
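A sketch of the marginalization with a discrete missing feature, using a small made-up joint table p(y, x_o, x_m) so that the integral becomes a sum.

```python
# Posterior given only the observed part x_o, marginalizing the missing x_m.
import numpy as np

# joint[y, x_o, x_m]: toy joint distribution, two classes and two binary features
joint = np.array([[[0.20, 0.05],
                   [0.05, 0.10]],
                  [[0.05, 0.10],
                   [0.15, 0.30]]])
assert np.isclose(joint.sum(), 1.0)

x_o = 1                                  # observed feature value; x_m is missing
p_y_xo = joint[:, x_o, :].sum(axis=1)    # p(y, x_o) = sum_{x_m} p(y, x_o, x_m)
posterior = p_y_xo / p_y_xo.sum()        # P(y|x_o) = p(y, x_o) / p(x_o)
print(posterior)                         # [0.25 0.75]
```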
Handling noisy features
Marginalize over true variables
• Assume x consists of a clean part xc and noisy part xn .
• Assume we have a noise model for the probability of the noisy feature given its true version p(xn |xt ).
• Posterior probability of yi given the observation can be obtained from probabilities over clean inputs by marginalizing over true variables via the noise model:
P(y_i|x_c, x_n) = p(y_i, x_c, x_n) / p(x_c, x_n) = ∫ p(y_i, x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
                = ∫ P(y_i|x_c, x_n, x_t) p(x_c, x_n, x_t) dx_t / ∫ p(x_c, x_n, x_t) dx_t
                = ∫ P(y_i|x_c, x_t) p(x_n|x_c, x_t) p(x_c, x_t) dx_t / ∫ p(x_n|x_c, x_t) p(x_c, x_t) dx_t
                = ∫ P(y_i|x) p(x_n|x_t) p(x) dx_t / ∫ p(x_n|x_t) p(x) dx_t
where x = (x_c, x_t) denotes the clean input; the last steps use the fact that the class does not depend on the noisy observation once the true value is known, and that the noisy feature depends only on its true version.
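A sketch of the noisy-feature case for a single noisy feature and no clean part, so that x = x_t; the Gaussian class models, the Gaussian noise model p(x_n|x_t) and the grid-based integration are all illustrative assumptions.

```python
# Posterior given a noisy observation x_n, marginalizing the true value x_t.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])
class_models = [norm(0.0, 1.0), norm(3.0, 1.0)]                 # p(x_t|y_i)
noise_pdf = lambda x_n, x_t: norm.pdf(x_n, loc=x_t, scale=2.0)  # p(x_n|x_t)

x_n = 2.0                                # observed noisy value
x_t = np.linspace(-10.0, 13.0, 2001)     # grid over the unobserved true value

# numerator_i ~ integral p(x_t|y_i) P(y_i) p(x_n|x_t) dx_t  (grid sum; dx cancels)
num = np.array([np.sum(m.pdf(x_t) * priors[i] * noise_pdf(x_n, x_t))
                for i, m in enumerate(class_models)])
posterior = num / num.sum()              # P(y_i|x_n)
print(posterior)                         # class 1 is slightly more probable here
```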