Optimal Classification with Bayes Rule

METU Informatics Institute
Min 720
Pattern Classification with Bio-Medical Applications
PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule
Statistical Approach to P.R.
X = [X1, X2, ..., Xd]
Dimension of the feature space: d
Set of different states of nature: {ω1, ω2, ..., ωc}; number of categories: c
Goal: find regions Ri such that Ri ∩ Rj = ∅ (i ≠ j) and ∪i Ri = R^d
Set of possible actions (decisions): {α1, α2, ..., αa}
Here, a decision might include a 'reject option'.
A discriminant function gi(X), 1 ≤ i ≤ c, satisfies gi(X) ≥ gj(X) in region Ri; decision rule: αk if gk(X) > gj(X) for all j ≠ k.
[Figure: the feature space partitioned into regions R1, R2, R3 by the discriminant functions g1, g2, g3.]
A Pattern Classifier
[Block diagram: input X feeds the discriminant functions g1(X), g2(X), ..., gc(X); a Max selector outputs the decision αk.]
So our aim now will be to define these functions g1, g2, ..., gc so as to minimize or optimize a criterion.
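To make the block diagram concrete, here is a minimal Python sketch (not part of the original notes) of a classifier that evaluates a list of discriminant functions and returns the index of the largest one; the two toy discriminants are assumed placeholders.

```python
from typing import Callable, List, Sequence

def classify(x: Sequence[float],
             discriminants: List[Callable[[Sequence[float]], float]]) -> int:
    """Evaluate g_1(X), ..., g_c(X) and return the index k of the maximum."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical example: two toy discriminant functions of a 2-d feature vector.
def g1(x): return -(x[0] - 1.0) ** 2 - (x[1] - 1.0) ** 2
def g2(x): return -(x[0] + 1.0) ** 2 - (x[1] + 1.0) ** 2

print(classify([0.8, 1.2], [g1, g2]))  # -> 0, i.e. decision alpha_1
```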
Parametric Approach to Classification
• 'Bayes Decision Theory' is used for minimum-error / minimum-risk pattern classifier design.
• Here, it is assumed that a sample X drawn from a class ωi is a random variable represented by a multivariate probability density function, the 'class-conditional density function' P(X | ωi).
• We also know the a-priori probabilities P(ωi), 1 ≤ i ≤ c (c is the number of classes).
• Then, we can talk about a decision rule that minimizes the probability of error.
• Suppose we have the observation X.
• This observation changes the a-priori probability into the a-posteriori probability P(ωi | X), which can be found by the Bayes Rule:
P(ωi | X) = P(ωi, X) / P(X) = P(X | ωi) P(ωi) / P(X)
• P(X) can be found by the Total Probability Rule: when the ωi's are disjoint,
P(X) = Σ_{i=1..c} P(ωi, X) = Σ_{i=1..c} P(X | ωi) P(ωi)
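As a quick numeric illustration (all values assumed, not from the notes), the posteriors follow directly from the two formulas above:

```python
def posteriors(priors, likelihoods):
    """Bayes rule with the total probability rule in the denominator.

    priors[i]      = P(omega_i)
    likelihoods[i] = P(X | omega_i) evaluated at the observed X
    returns          [P(omega_i | X) for each i]
    """
    p_x = sum(p * l for p, l in zip(priors, likelihoods))  # P(X) = sum_i P(X|omega_i) P(omega_i)
    return [p * l / p_x for p, l in zip(priors, likelihoods)]

# Hypothetical 3-class example.
print(posteriors([0.5, 0.3, 0.2], [0.1, 0.4, 0.4]))  # -> [0.2, 0.48, 0.32]
```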
• Decision Rule: Choose the category with the highest a-posteriori probability, calculated as above using the Bayes Rule. Then,
gi(X) = P(ωi | X)
Decision boundary: g1 = g2, with g1 > g2 in region R1 and g2 > g1 in region R2; or, in general, decision boundaries are where
gi(X) = gj(X)
between regions Ri and Rj.
• Single feature – the decision boundary is a point
• 2 features – a curve
• 3 features – a surface
• More than 3 – a hypersurface
Equivalent discriminant functions:
gi(X) = P(X | ωi) P(ωi) / P(X)
gi(X) = P(X | ωi) P(ωi)
• Sometimes it is easier to work with logarithms:
gi(X) = log[P(X | ωi) P(ωi)] = log P(X | ωi) + log P(ωi)
• Since the logarithmic function is monotonically increasing, the log form gives the same decision.
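A tiny sketch (assumed numbers) showing that ranking classes by log P(X | ωi) + log P(ωi) gives the same decision as ranking by the posterior:

```python
import math

priors = [0.7, 0.3]            # P(omega_1), P(omega_2) -- assumed values
likelihoods = [0.02, 0.05]     # P(X | omega_i) at the observed X -- assumed values

post = [p * l for p, l in zip(priors, likelihoods)]                  # proportional to P(omega_i | X)
log_g = [math.log(l) + math.log(p) for p, l in zip(priors, likelihoods)]

# Both rankings pick the same class because log is monotonically increasing.
assert max(range(2), key=lambda i: post[i]) == max(range(2), key=lambda i: log_g[i])
```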
2-Category Case (c1, c2):
Assign to c1 (decision α1) if P(ω1 | X) > P(ω2 | X)
Assign to c2 (decision α2) if P(ω1 | X) < P(ω2 | X)
But this is the same as:
c1 if P(X | ω1) P(ω1) / P(X) > P(X | ω2) P(ω2) / P(X)
By throwing away the P(X)'s, we end up with:
c1 if P(X | ω1) P(ω1) > P(X | ω2) P(ω2)
which is the same as the likelihood-ratio test:
P(X | ω1) / P(X | ω2) > P(ω2) / P(ω1) = k
Example: a single-feature, 2-category problem with Gaussian densities: diagnosis of diabetes using the sugar count X.
c1: state of being healthy, P(c1) = 0.7
c2: state of being sick (diabetes), P(c2) = 0.3
P(X | c1) = (1 / √(2π σ1²)) e^{-(X - m1)² / 2σ1²}
P(X | c2) = (1 / √(2π σ2²)) e^{-(X - m2)² / 2σ2²}
[Figure: the weighted densities P(X | c1) P(c1) and P(X | c2) P(c2), with means m1, m2, standard deviations σ1, σ2, and decision boundary d.]
The decision rule: c1 if P(X | c1) P(c1) > P(X | c2) P(c2), or
0.7 P(X | c1) > 0.3 P(X | c2)
Assume now: m1 = 10, m2 = 20, σ1 = σ2 = 2, and we measured X = 17. Assign the unknown sample X to the correct category.
Find the likelihood ratio at X = 17:
P(X | c1) / P(X | c2) = e^{-(X - 10)² / 8} / e^{-(X - 20)² / 8} = e^{-5} ≈ 0.007
Compare with:
P(c2) / P(c1) = 0.3 / 0.7 ≈ 0.43 > 0.007
So assign X = 17 to c2.
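The worked example can be checked numerically; this sketch simply re-evaluates the likelihood ratio and the prior ratio for X = 17 under the stated means and standard deviations:

```python
import math

def gauss(x, m, s):
    """Univariate Gaussian density N(m, s^2)."""
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / math.sqrt(2 * math.pi * s ** 2)

m1, m2, s = 10.0, 20.0, 2.0
p1, p2 = 0.7, 0.3
x = 17.0

likelihood_ratio = gauss(x, m1, s) / gauss(x, m2, s)   # = e^{-5} ~ 0.007
threshold = p2 / p1                                    # ~ 0.43
print(likelihood_ratio, threshold)
print("decide c1" if likelihood_ratio > threshold else "decide c2")  # -> decide c2
```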
Example: Consider a 2-feature, 3-category problem where:
P(X1, X2 | ci) = 1 / (bi - ai)²  for ai < X1 < bi and ai < X2 < bi
              = 0               otherwise
with P(c1) = 0.4, P(c2) = 0.4, P(c3) = 0.2, and
a1 = -1,  b1 = 1
a2 = 0.5, b2 = 3.5
a3 = 3,   b3 = 4
Find the decision boundaries and regions.
Solution: Compare P(X | ci) P(ci) on each class's support square:
c1: (1/4) × 0.4 = 0.1
c2: (1/9) × 0.4 = 0.4/9 ≈ 0.044
c3: 1 × 0.2 = 0.2
So R1 is the square (-1, 1) × (-1, 1) (c1 beats c2 where they overlap, since 0.1 > 0.044), R3 is the square (3, 4) × (3, 4) (c3 beats c2 where they overlap, since 0.2 > 0.044), and R2 is the remainder of the square (0.5, 3.5) × (0.5, 3.5).
[Figure: the three overlapping support squares along the X1-X2 axes with the resulting regions R1, R2, R3.]
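A short sketch of the same comparison: for any point (X1, X2), evaluate P(X | ci) P(ci) for each class under the given uniform densities and pick the largest; the region boundaries are where the winner changes.

```python
def weighted_density(x1, x2, a, b, prior):
    """P(X1, X2 | c_i) * P(c_i) for the uniform density on the square (a, b) x (a, b)."""
    inside = a < x1 < b and a < x2 < b
    return prior / (b - a) ** 2 if inside else 0.0

params = [(-1.0, 1.0, 0.4),   # class c1: a1, b1, P(c1)  -> 0.4 / 4  = 0.1
          (0.5, 3.5, 0.4),    # class c2: a2, b2, P(c2)  -> 0.4 / 9  ~ 0.044
          (3.0, 4.0, 0.2)]    # class c3: a3, b3, P(c3)  -> 0.2 / 1  = 0.2

def decide(x1, x2):
    scores = [weighted_density(x1, x2, a, b, p) for a, b, p in params]
    return max(range(3), key=lambda i: scores[i]) + 1  # class label 1..3

print(decide(0.8, 0.8))   # overlap of c1 and c2 supports -> 1, since 0.1 > 0.044
print(decide(3.2, 3.2))   # overlap of c2 and c3 supports -> 3, since 0.2 > 0.044
```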
Remember now that for the 2-class case, decide c1 if
P(X | c1) P(c1) > P(X | c2) P(c2)
or, in terms of the likelihood ratio,
P(X | c1) / P(X | c2) > P(c2) / P(c1) = k
Error probabilities and a simple proof of minimum error
Consider again a 2-class, 1-d problem.
[Figure: the curves P(X | c1) P(c1) and P(X | c2) P(c2) over regions R1 and R2, with the intersection point d and an arbitrary threshold d'.]
Let's show that if the decision boundary is d (the intersection point) rather than any arbitrary point d', then P(E) (the probability of error) is minimum.
P(E) = P(X ∈ R2, c1) + P(X ∈ R1, c2)
     = P(X ∈ R2 | c1) P(c1) + P(X ∈ R1 | c2) P(c2)
     = [∫_{R2} P(X | c1) dX] P(c1) + [∫_{R1} P(X | c2) dX] P(c2)
     = ∫_{R2} P(X | c1) P(c1) dX + ∫_{R1} P(X | c2) P(c2) dX
At each X the error integrand is the weighted density of the "other" class, so moving the boundary from d to d' forces the larger of the two curves to be integrated over part of the wrong region, adding the area between the curves to the error. Hence P(E) is minimum when d' = d.
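For the earlier diabetes example this claim can be checked directly. The sketch below (standard-library erf only) evaluates P(E) as a function of the threshold d' and finds the minimum near the intersection point of the two weighted densities (about 15.3):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

m1, m2, s = 10.0, 20.0, 2.0
p1, p2 = 0.7, 0.3

def p_error(t):
    """P(E) when R1 = (-inf, t) and R2 = (t, inf)."""
    miss_c1 = 1.0 - phi((t - m1) / s)   # P(X in R2 | c1)
    miss_c2 = phi((t - m2) / s)         # P(X in R1 | c2)
    return p1 * miss_c1 + p2 * miss_c2

# Scan candidate thresholds d'; the minimum lands near the intersection point.
best = min((t / 100.0 for t in range(1000, 2000)), key=p_error)
print(best, p_error(best))
```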
Minimum Risk Classification
The risk associated with an incorrect decision might be more important than the probability of error, so our decision criterion might be modified to minimize the average risk of making an incorrect decision.
We define the conditional risk (expected loss) for decision αi when X occurs as:
Ri(X) = Σ_{j=1..c} λ(αi | ωj) P(ωj | X)
where λ(αi | ωj) is defined as the conditional loss associated with decision αi when the true class is ωj. It is assumed that λ is known.
The decision rule: decide ci if Ri(X) < Rj(X) for all 1 ≤ j ≤ c, i ≠ j.
The discriminant function here can be defined as:
gi(X) = -Ri(X)
• We can show that the minimum-error decision is a special case of the above rule, where:
λ(αi | ωi) = 0, λ(αi | ωj) = 1 for j ≠ i
Then,
Ri(X) = Σ_{j≠i} P(ωj | X) = 1 - P(ωi | X)
so the rule is: αi if 1 - P(ωi | X) < 1 - P(ωj | X), which is equivalent to P(ωi | X) > P(ωj | X).
For the 2-category case, the minimum-risk classifier becomes:
Rα1(X) = λ11 P(ω1 | X) + λ12 P(ω2 | X)
Rα2(X) = λ21 P(ω1 | X) + λ22 P(ω2 | X)
α1 if Rα1(X) < Rα2(X), i.e.
λ11 P(ω1 | X) + λ12 P(ω2 | X) < λ21 P(ω1 | X) + λ22 P(ω2 | X)
⇒ (λ21 - λ11) P(ω1 | X) > (λ12 - λ22) P(ω2 | X)
⇒ (λ21 - λ11) P(X | ω1) P(ω1) > (λ12 - λ22) P(X | ω2) P(ω2)
α1 if
P(X | ω1) / P(X | ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)]
Otherwise, α2.
This is the same as the likelihood-ratio rule if λ11 = λ22 = 0 and λ12 = λ21 = 1.
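A minimal sketch of the 2-category minimum-risk rule with an assumed loss matrix λ; it compares the likelihood ratio against the threshold derived above (all numbers are made up for illustration):

```python
def min_risk_decision(lik1, lik2, prior1, prior2, loss):
    """Decide alpha_1 or alpha_2 with the 2-category minimum-risk rule.

    lik1, lik2     : P(X | omega_1), P(X | omega_2) at the observed X
    prior1, prior2 : P(omega_1), P(omega_2)
    loss[i][j]     : lambda(alpha_i | omega_j), 0-indexed
    """
    threshold = ((loss[0][1] - loss[1][1]) / (loss[1][0] - loss[0][0])) * (prior2 / prior1)
    return "alpha_1" if lik1 / lik2 > threshold else "alpha_2"

# Assumed losses: missing class omega_2 is 5x costlier than missing omega_1.
loss = [[0.0, 5.0],
        [1.0, 0.0]]
print(min_risk_decision(lik1=0.05, lik2=0.02, prior1=0.7, prior2=0.3, loss=loss))
```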
Discriminant Functions so far
For Minimum Error:
gi(X) = P(ωi | X)
gi(X) = P(X | ωi) P(ωi)
gi(X) = log P(X | ωi) + log P(ωi)
For Minimum Risk:
gi(X) = -Ri(X), where Ri(X) = Σ_{j=1..c} λ(αi | ωj) P(ωj | X)
Bayes (Maximum Likelihood) Decision:
• Most general optimal solution
• Provides an upper limit on performance (you cannot do better with any other rule)
• Useful in comparing with other classifiers
Special Cases of Discriminant Functions
Multivariate Gaussian (Normal) Density N(M, Σ):
The general density form:
P(X) = [1 / ((2π)^{d/2} |Σ|^{1/2})] e^{-(1/2)(X - M)^T Σ^{-1} (X - M)}
Here X is the feature vector of size d.
M: d-element mean vector, E(X) = M = [µ1, µ2, ..., µd]^T
Σ: d×d covariance matrix, with
Σij = E[(Xi - µi)(Xj - µj)]
Σii = E[(Xi - µi)²] = σi²  (the variance of feature Xi)
Σ is symmetric; Σij = 0 when Xi and Xj are statistically independent.
|Σ|: determinant of Σ
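In practice M and Σ are usually estimated from samples; a short numpy sketch with illustrative (assumed) data:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=[1.0, 2.0],
                                  cov=[[2.0, 0.5], [0.5, 1.0]], size=500)  # (N, d) array

M_hat = samples.mean(axis=0)               # estimate of the mean vector M
Sigma_hat = np.cov(samples, rowvar=False)  # estimate of the d x d covariance matrix

print(M_hat)        # close to [1, 2]
print(Sigma_hat)    # symmetric; diagonal entries near the variances 2 and 1
```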
General shape: the equal-density contours are hyperellipsoids on which the Mahalanobis distance
(X - M)^T Σ^{-1} (X - M)
is constant.
[Figure: an elliptical equal-density contour centered at (µ1, µ2) in the X1-X2 plane.]
2-d problem (features X1, X2):
M = [µ1, µ2]^T
Σ = [σ1²  Σ12]
    [Σ21  σ2²]
If Σ12 = Σ21 = 0 (statistically independent features), then the major axes of the ellipsoids are parallel to the feature axes; if in addition σ1² = σ2², the equal-density contours become circular.
In general, the equal-density curves are hyperellipsoids. Now
gi(X) = log_e P(X | ωi) + log_e P(ωi)
is used for N(Mi, Σi) because of its ease of manipulation:
gi(X) = -(1/2)(X - Mi)^T Σi^{-1} (X - Mi) - (1/2) log|Σi| + log P(ωi)
gi(X) is a quadratic function of X, as will now be shown. Expanding,
gi(X) = -(1/2) X^T Σi^{-1} X - (1/2) Mi^T Σi^{-1} Mi + (1/2) X^T Σi^{-1} Mi + (1/2) Mi^T Σi^{-1} X - (1/2) log|Σi| + log P(ωi)
Define:
Wi = -(1/2) Σi^{-1}
Vi = Mi^T Σi^{-1}
Wi0 = -(1/2) Mi^T Σi^{-1} Mi - (1/2) log|Σi| + log P(ωi)   (a scalar)
Then,
gi(X) = X^T Wi X + Vi X + Wi0
On the decision boundary, gi(X) = gj(X):
X^T Wi X - X^T Wj X + Vi X - Vj X + Wi0 - Wj0 = 0
X^T (Wi - Wj) X + (Vi - Vj) X + (Wi0 - Wj0) = 0
X^T W X + V X + W0 = 0
The decision boundary function is hyperquadratic in general.
Example in 2-d:
W = [ω11  ω12]      V = [v1  v2]      X = [x1  x2]^T
    [ω21  ω22]
Then the above boundary becomes
[x1  x2] [ω11  ω12] [x1] + [v1  v2] [x1] + W0 = 0
         [ω21  ω22] [x2]            [x2]
ω11 x1² + 2ω12 x1 x2 + ω22 x2² + v1 x1 + v2 x2 + W0 = 0
which is the general form of a hyperquadratic boundary in 2-d.
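A sketch that plugs assumed class parameters (Mi, Σi, P(ωi) below are made-up numbers) into the Wi, Vi, Wi0 definitions, evaluates gi(X), and forms the hyperquadratic boundary coefficients W, V, W0 as the differences:

```python
import numpy as np

def quad_params(M, Sigma, prior):
    """W_i, V_i, W_i0 for g_i(X) = X^T W_i X + V_i X + W_i0 (Gaussian class)."""
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    V = M @ Sinv
    W0 = -0.5 * (M @ Sinv @ M) - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return W, V, W0

def g(X, W, V, W0):
    return X @ W @ X + V @ X + W0

# Assumed 2-d example parameters.
M1, S1, P1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 2.0]]), 0.6
M2, S2, P2 = np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.4

W1, V1, W10 = quad_params(M1, S1, P1)
W2, V2, W20 = quad_params(M2, S2, P2)

X = np.array([1.0, 1.5])
print("decide omega_1" if g(X, W1, V1, W10) > g(X, W2, V2, W20) else "decide omega_2")

# Boundary coefficients: X^T (W1 - W2) X + (V1 - V2) X + (W10 - W20) = 0
print(W1 - W2, V1 - V2, W10 - W20)
```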
The special cases of Gaussian:
Assume Σi = σ² I, where I is the unit (identity) matrix. Then
Σi = diag(σ², σ², ..., σ²)   (d×d),   |Σi| = σ^{2d},   Σi^{-1} = (1/σ²) I
gi(X) = -(1/(2σ²)) (X - Mi)^T (X - Mi) - (1/2) log σ^{2d} + log P(ωi)
      = -(1/(2σ²)) ‖X - Mi‖² + log P(ωi)
(the term -(1/2) log σ^{2d} is the same for every class, i.e. not a function of X or i, so it can be removed)
Now assume P(ωi) = P(ωj). Then
gi(X) = -(1/(2σ²)) ‖X - Mi‖²
which is equivalent to -d²(X, Mi), the squared Euclidean distance between X and Mi.
Then the decision boundary is linear!
Decision Rule: Assign the unknown sample to the closest mean's category.
[Figure: an unknown sample at distances d1 and d2 from the means M1 and M2; the boundary d is the perpendicular bisector of the segment M1M2.]
If P(ωi) ≠ P(ωj), the perpendicular-bisector boundary moves toward the less probable category.
Minimum Distance Classifier
• Classify an unknown sample X to the category with the closest mean!
• Optimum when the densities are Gaussian with equal variances and equal a-priori probabilities.
[Figure: means M1-M4 with their regions R1-R4.]
The boundary is piecewise linear in the case of more than 2 categories.
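A minimal nearest-mean (minimum distance) classifier sketch with assumed class means:

```python
import numpy as np

def min_distance_classify(x, means):
    """Assign x to the category whose mean is closest in Euclidean distance."""
    dists = [np.linalg.norm(x - m) for m in means]
    return int(np.argmin(dists))

# Assumed means for 4 categories, as in the piecewise-linear sketch.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0]),
         np.array([4.0, 4.0]), np.array([0.0, 4.0])]
print(min_distance_classify(np.array([3.2, 0.5]), means))  # -> 1, i.e. closest to M2
```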
• Another special case: it can be shown that when the covariance matrices are the same, Σi = Σ (the samples of all classes fall in clusters of equal size and shape), the discriminant becomes
gi(X) = -(1/2)(X - Mi)^T Σ^{-1} (X - Mi) + log P(ωi)
where the quadratic form (X - Mi)^T Σ^{-1} (X - Mi) is called the Mahalanobis distance between X and Mi.
Then, if P(ωi) = P(ωj), the decision rule is:
αi if (Mahalanobis distance of the unknown sample to Mi) < (Mahalanobis distance of the unknown sample to Mj)
If P(ωi) ≠ P(ωj), the boundary moves toward the less probable one.
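A sketch of the shared-covariance case: compute the Mahalanobis distance to each mean, form gi(X) with the log priors, and take the maximum; with equal priors this reduces to picking the smallest Mahalanobis distance. All numbers are assumed.

```python
import numpy as np

def mahalanobis_sq(x, m, Sigma_inv):
    d = x - m
    return float(d @ Sigma_inv @ d)

def classify(x, means, Sigma, priors):
    """g_i(X) = -1/2 (X - M_i)^T Sigma^{-1} (X - M_i) + log P(omega_i); pick the max."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [-0.5 * mahalanobis_sq(x, m, Sigma_inv) + np.log(p)
              for m, p in zip(means, priors)]
    return int(np.argmax(scores))

# Assumed parameters: two classes sharing the same covariance matrix.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
means = [np.array([0.0, 0.0]), np.array([3.0, 2.0])]
print(classify(np.array([1.0, 1.2]), means, Sigma, priors=[0.5, 0.5]))
```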
Binary Random Variables
• Discrete features: features can take only discrete values; integrals are replaced by summations.
• Binary features: each Xi takes the value 0 or 1, with
pi = P(Xi = 1 | ω1)
qi = P(Xi = 1 | ω2)
[Figure: P(Xi | ω1) equals pi at Xi = 1 and 1 - pi at Xi = 0.]
• Assume the binary features are statistically independent, where X = [X1, X2, ..., Xd]^T and each Xi is binary.
Binary Random Variables
Example: Bit-matrix for machine-printed characters.
[Figure: a 10 × 10 bit matrix of a printed character; each cell (pixel) Xi is 1 or 0.]
Here, each pixel may be taken as a feature Xi. For the above problem we have d = 10 × 10 = 100, and pi is the probability that Xi = 1 for letter A, B, ...
P(xi) = (pi)^{xi} (1 - pi)^{1-xi}, defined for xi = 0, 1 and undefined elsewhere.
• If statistical independence of the features is assumed,
P(X) = Π_{i=1..d} P(xi) = Π_{i=1..d} (pi)^{xi} (1 - pi)^{1-xi}
gk(X) = log P(X | ωk) + log P(ωk) = Σ_{i=1..d} xi log pi + Σ_{i=1..d} (1 - xi) log(1 - pi) + log P(ωk)
(with pi understood as P(Xi = 1 | ωk) for the class in question)
• Consider the 2-category problem; assume:
pi = P(Xi = 1 | ω1)
qi = P(Xi = 1 | ω2)
Then the decision boundary is:
Σi xi log pi + Σi (1 - xi) log(1 - pi) - Σi xi log qi - Σi (1 - xi) log(1 - qi) + log P(ω1) - log P(ω2) = 0
So if
Σi xi log(pi / qi) + Σi (1 - xi) log((1 - pi) / (1 - qi)) + log(P(ω1) / P(ω2)) > 0, decide category 1; else category 2.
The decision boundary g(X) = Σi Wi Xi + W0 is linear in X: a weighted sum of the inputs, where
Wi = ln[pi (1 - qi) / (qi (1 - pi))]
and
W0 = Σi ln[(1 - pi) / (1 - qi)] + ln[P(ω1) / P(ω2)]
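A sketch that builds the weights Wi and W0 from assumed pi, qi values (a 4-pixel toy problem, not the 10 × 10 example) and classifies a binary feature vector with the linear discriminant above:

```python
import math

def binary_weights(p, q, prior1, prior2):
    """W_i = ln[p_i(1-q_i) / (q_i(1-p_i))]; W_0 = sum_i ln[(1-p_i)/(1-q_i)] + ln[P(w1)/P(w2)]."""
    W = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    W0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) \
         + math.log(prior1 / prior2)
    return W, W0

def decide(x, W, W0):
    g = sum(wi * xi for wi, xi in zip(W, x)) + W0
    return 1 if g > 0 else 2   # category 1 if g(X) > 0, else category 2

# Assumed pixel probabilities for a 4-pixel toy problem.
p = [0.9, 0.8, 0.2, 0.1]   # P(X_i = 1 | omega_1)
q = [0.2, 0.3, 0.7, 0.8]   # P(X_i = 1 | omega_2)
W, W0 = binary_weights(p, q, prior1=0.5, prior2=0.5)
print(decide([1, 1, 0, 0], W, W0))   # pattern typical of omega_1 -> 1
```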