L5: Quadratic classifiers

β€’ Bayes classifiers for Normally distributed classes
– Case 1: $\Sigma_i = \sigma^2 I$
– Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ diagonal)
– Case 3: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)
– Case 4: $\Sigma_i = \sigma_i^2 I$
– Case 5: $\Sigma_i \neq \Sigma_j$ (general case)
β€’ Numerical example
β€’ Linear and quadratic classifiers: conclusions
Bayes classifiers for Gaussian classes
β€’ Recap
– On L4 we showed that the decision rule that minimized $P[error]$ could be formulated in terms of a family of discriminant functions
[Figure: discriminant-function classifier: features $x_1, x_2, x_3, \ldots, x_d$ feed discriminant functions $g_1(x), g_2(x), \ldots, g_C(x)$, costs are applied, and the maximum is selected to produce the class assignment]
β€’ For normally distributed (Gaussian) classes, these DFs reduce to simple expressions
– The multivariate Normal pdf is
$f_X(x) = (2\pi)^{-N/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
– Using Bayes rule, the DFs become
$g_i(x) = P(\omega_i|x) = \dfrac{p(x|\omega_i)\,P(\omega_i)}{p(x)} = (2\pi)^{-N/2}\, |\Sigma_i|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right) \dfrac{P(\omega_i)}{p(x)}$
– Eliminating constant terms
$g_i(x) = |\Sigma_i|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right) P(\omega_i)$
– And taking natural logs
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$
– This expression is called a quadratic discriminant function
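The quadratic DF above maps directly to code. Below is a minimal NumPy sketch (not part of the original lecture; function and variable names are illustrative) that evaluates $g_i(x)$ for each class and assigns the class with the largest value:

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """Evaluate g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log P(w_i)."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(Sigma, d)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            + np.log(prior))

def bayes_classify(x, mus, Sigmas, priors):
    """Assign x to the class with the largest discriminant value."""
    scores = [quadratic_discriminant(x, mu, S, p)
              for mu, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))
```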
Case 1: $\Sigma_i = \sigma^2 I$
β€’ This situation occurs when features are statistically independent with
equal variance for all classes
– In this case, the quadratic DFs become
𝑔𝑖 π‘₯ = βˆ’
1
π‘₯ βˆ’ πœ‡π‘–
2
𝑇
𝜎 2𝐼
βˆ’1
1
1
π‘₯ βˆ’ πœ‡π‘– βˆ’ log 𝜎 2 𝐼 + π‘™π‘œπ‘”π‘ƒπ‘– ≑ βˆ’ 2 π‘₯ βˆ’ πœ‡π‘–
2
2𝜎
𝑇
π‘₯ βˆ’ πœ‡π‘– + π‘™π‘œπ‘”π‘ƒπ‘–
– Expanding this expression
𝑔𝑖 π‘₯ = βˆ’
1
π‘₯ 𝑇 π‘₯ βˆ’ 2πœ‡π‘–π‘‡ π‘₯ + πœ‡π‘–π‘‡ πœ‡π‘– + π‘™π‘œπ‘”π‘ƒπ‘–
2
2𝜎
– Eliminating the term $x^T x$, which is constant for all classes
𝑔𝑖 π‘₯ = βˆ’
1
𝑇
𝑇
𝑇
βˆ’2πœ‡
π‘₯
+
πœ‡
πœ‡
+
π‘™π‘œπ‘”π‘ƒ
=
𝑀
π‘₯ + 𝑀0
𝑖
𝑖
𝑖
𝑖
𝑖
2𝜎 2
– So the DFs are linear, and the boundaries $g_i(x) = g_j(x)$ are hyper-planes
– If we assume equal priors
$g_i(x) = -\dfrac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i)$
β€’ This is called a minimum-distance or nearest-mean classifier (a code sketch follows the example below)
β€’ The equiprobable contours are hyper-spheres
β€’ For unit variance ($\sigma^2 = 1$), $g_i(x)$ is, up to a factor of $-\tfrac{1}{2}$, the squared Euclidean distance to $\mu_i$
[Figure: minimum-distance classifier: distances from $x$ to each class mean $\mu_1, \ldots, \mu_C$ feed a minimum selector, which outputs the class; after Schalkoff, 1992]
β€’ Example
– Three-class 2D problem with equal priors:
$\mu_1 = [3\ 2]^T$, $\mu_2 = [7\ 4]^T$, $\mu_3 = [2\ 5]^T$; $\Sigma_1 = \Sigma_2 = \Sigma_3 = 2I$
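As an illustration (a sketch of mine, not from the slides), the example above can be classified with a plain nearest-mean rule; since $\Sigma_i = 2I$ for every class and the priors are equal, this is equivalent to the Bayes rule:

```python
import numpy as np

# Class means from the example; Sigma_i = 2*I and equal priors are assumed
mus = np.array([[3.0, 2.0], [7.0, 4.0], [2.0, 5.0]])

def nearest_mean(x, mus):
    """Minimum Euclidean distance (nearest-mean) classifier."""
    d2 = np.sum((mus - x) ** 2, axis=1)  # squared distance to each mean
    return int(np.argmin(d2))

print(nearest_mean(np.array([4.0, 3.0]), mus))  # -> 0, i.e. closest to mu_1
```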
Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ diagonal)
β€’ Classes still have the same covariance, but features are
allowed to have different variances
– In this case, the quadratic DF becomes
𝑔𝑖 π‘₯ = βˆ’
1
βˆ’ βˆ‘π‘
2 π‘˜=1
1
1
π‘₯ βˆ’ πœ‡π‘– 𝑇 Ξ£π‘–βˆ’1 π‘₯ βˆ’ πœ‡π‘– βˆ’ log Σ𝑖 + π‘™π‘œπ‘”π‘ƒπ‘– =
2
2
2
π‘₯π‘˜ βˆ’ πœ‡π‘–,π‘˜
1
2
βˆ’
π‘™π‘œπ‘”βˆπ‘
π‘˜=1 πœŽπ‘˜ + π‘™π‘œπ‘”π‘ƒπ‘–
2
2
πœŽπ‘˜
– Eliminating the term $x_k^2$, which is constant for all classes
$g_i(x) = -\dfrac{1}{2}\sum_{k=1}^{N}\dfrac{-2 x_k \mu_{i,k} + \mu_{i,k}^2}{\sigma_k^2} - \dfrac{1}{2}\log\prod_{k=1}^{N}\sigma_k^2 + \log P_i$
– This discriminant is also linear, so the decision boundaries $g_i(x) = g_j(x)$ will also be hyper-planes
– The equiprobable contours are hyper-ellipses aligned with the
reference frame
– Note that the only difference with the previous classifier is that the distance along each axis is normalized by the variance of that axis
β€’ Example
– Three-class 2D problem with equal priors:
$\mu_1 = [3\ 2]^T$, $\mu_2 = [5\ 4]^T$, $\mu_3 = [2\ 5]^T$; $\Sigma_1 = \Sigma_2 = \Sigma_3 = \mathrm{diag}(1, 2)$
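A minimal sketch of the Case 2 rule for this example (assuming the shared diagonal covariance $\mathrm{diag}(1,2)$ and equal priors; the helper name is illustrative): classification reduces to a minimum distance in which each axis is normalized by its variance:

```python
import numpy as np

mus = np.array([[3.0, 2.0], [5.0, 4.0], [2.0, 5.0]])  # class means from the example
var = np.array([1.0, 2.0])                            # shared diagonal of Sigma

def normalized_distance_classify(x, mus, var):
    """Minimum distance with each axis normalized by its variance (Case 2)."""
    d2 = np.sum((x - mus) ** 2 / var, axis=1)
    return int(np.argmin(d2))

print(normalized_distance_classify(np.array([4.5, 3.0]), mus, var))  # -> 1, i.e. mu_2
```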
Case 3: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)
β€’ Classes have equal covariance matrix, but no longer diagonal
– The quadratic discriminant becomes
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma| + \log P_i$
– Eliminating the term $\log|\Sigma|$, which is constant for all classes, and assuming equal priors
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$
– The quadratic term is called the Mahalanobis distance, a very important
concept in statistical pattern recognition
– The Mahalanobis distance is a vector distance that uses a $\Sigma^{-1}$ norm
– $\Sigma^{-1}$ acts as a stretching factor on the space
– Note that when $\Sigma = I$, the Mahalanobis distance becomes the familiar Euclidean distance
[Figure: points in the $(x_1, x_2)$ plane at constant Mahalanobis distance $\|x_i-\mu\|^2_{\Sigma^{-1}} = K$ vs. constant Euclidean distance $\|x_i-\mu\|^2 = K$]
– Expanding the quadratic term
$g_i(x) = -\tfrac{1}{2}\left[x^T \Sigma^{-1} x - 2\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i\right]$
– Removing the term $x^T \Sigma^{-1} x$, which is constant for all classes
$g_i(x) = -\tfrac{1}{2}\left[-2\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i\right] = w_i^T x + w_{i0}$
– So the DFs are still linear, and the decision boundaries will also be
hyper-planes
– The equiprobable contours are hyper-ellipses aligned with the
eigenvectors of Ξ£
– This is known as a minimum (Mahalanobis) distance classifier
å
1
2
C
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU
Distance
Distance
Minimum Selector
x
class
Distance
8
β€’ Example
– Three-class 2D problem with equal priors:
$\mu_1 = [3\ 2]^T$, $\mu_2 = [5\ 4]^T$, $\mu_3 = [2\ 5]^T$; $\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 2 \end{bmatrix}$
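Here is a hedged sketch (not from the slides) of the minimum Mahalanobis distance classifier for this example, with the shared covariance and equal priors assumed:

```python
import numpy as np

mus = np.array([[3.0, 2.0], [5.0, 4.0], [2.0, 5.0]])  # class means from the example
Sigma = np.array([[1.0, 0.7], [0.7, 2.0]])            # shared, non-diagonal covariance
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_classify(x, mus, Sigma_inv):
    """Assign x to the class mean with the smallest Mahalanobis distance."""
    diffs = mus - x
    d2 = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)  # (x-mu_i)^T S^-1 (x-mu_i)
    return int(np.argmin(d2))

print(mahalanobis_classify(np.array([4.0, 2.5]), mus, Sigma_inv))  # -> 0, i.e. mu_1
```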
Case 4: $\Sigma_i = \sigma_i^2 I$
β€’ In this case, each class has a different covariance matrix,
which is proportional to the identity matrix
– The quadratic discriminant becomes
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \sigma_i^{-2} (x-\mu_i) - \tfrac{N}{2}\log\sigma_i^2 + \log P_i$
– This expression cannot be reduced further
– The decision boundaries are quadratic; for two classes with $\sigma_i \neq \sigma_j$ they are hyper-spheres
– The equiprobable contours are hyper-spheres, centered at each class mean
β€’ Example
– Three-class 2D problem with equal priors:
$\mu_1 = [3\ 2]^T$, $\mu_2 = [5\ 4]^T$, $\mu_3 = [2\ 5]^T$; $\Sigma_1 = 0.5I$, $\Sigma_2 = I$, $\Sigma_3 = 2I$
[Figure: decision regions, shown with a zoomed-out view]
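A short sketch (illustrative names, using the example's values) of the Case 4 discriminant, which must keep the $\log\sigma_i^2$ term because it now differs across classes:

```python
import numpy as np

mus = np.array([[3.0, 2.0], [5.0, 4.0], [2.0, 5.0]])  # class means from the example
sig2 = np.array([0.5, 1.0, 2.0])                      # per-class variances, Sigma_i = sig2[i]*I
priors = np.array([1/3, 1/3, 1/3])
N = mus.shape[1]                                      # dimensionality

def case4_discriminants(x):
    """g_i(x) = -||x - mu_i||^2 / (2 sig2_i) - (N/2) log sig2_i + log P_i."""
    d2 = np.sum((x - mus) ** 2, axis=1)
    return -d2 / (2 * sig2) - 0.5 * N * np.log(sig2) + np.log(priors)

x = np.array([4.0, 3.0])
print(int(np.argmax(case4_discriminants(x))))  # -> 1, although x is equidistant from mu_1 and mu_2
```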
Case 5: $\Sigma_i \neq \Sigma_j$ (general case)
β€’ We already derived the expression for the general case
$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P_i$
– Reorganizing terms in a quadratic form yields
$g_i(x) = x^T W_{2,i}\, x + w_{1,i}^T x + w_{0,i}$
where
$W_{2,i} = -\tfrac{1}{2}\Sigma_i^{-1}, \quad w_{1,i} = \Sigma_i^{-1}\mu_i, \quad w_{0,i} = -\tfrac{1}{2}\mu_i^T \Sigma_i^{-1}\mu_i - \tfrac{1}{2}\log|\Sigma_i| + \log P_i$
– The equiprobable contours are hyper-ellipses, oriented with the eigenvectors of $\Sigma_i$ for that class
– The decision boundaries are again quadratic: hyper-ellipses or hyper-paraboloids
– Notice that the quadratic term in the discriminant is proportional to the squared Mahalanobis distance for covariance $\Sigma_i$
β€’ Example
– Three-class 2D problem with equal priors:
$\mu_1 = [3\ 2]^T$, $\Sigma_1 = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix}$; $\quad \mu_2 = [5\ 4]^T$, $\Sigma_2 = \begin{bmatrix} 1 & -1 \\ -1 & 7 \end{bmatrix}$; $\quad \mu_3 = [3\ 4]^T$, $\Sigma_3 = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 3 \end{bmatrix}$
[Figure: decision regions, shown with a zoomed-out view]
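The general-case discriminant can be precomputed into the quadratic form $x^T W_{2,i}\,x + w_{1,i}^T x + w_{0,i}$. A sketch under the example's parameters (names are illustrative):

```python
import numpy as np

def quadratic_form_params(mu, Sigma, prior):
    """Return (W2, w1, w0) such that g_i(x) = x^T W2 x + w1^T x + w0."""
    Sinv = np.linalg.inv(Sigma)
    W2 = -0.5 * Sinv
    w1 = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior)
    return W2, w1, w0

mu1 = np.array([3.0, 2.0])
Sigma1 = np.array([[1.0, -1.0], [-1.0, 2.0]])
W2, w1, w0 = quadratic_form_params(mu1, Sigma1, prior=1/3)
x = np.array([4.0, 3.0])
print(x @ W2 @ x + w1 @ x + w0)  # same value as the original quadratic DF at x
```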
Numerical example
β€’ Derive a linear DF for the following 2-class 3D problem
$\mu_1 = [0\ 0\ 0]^T; \quad \mu_2 = [1\ 1\ 1]^T; \quad \Sigma_1 = \Sigma_2 = \mathrm{diag}(0.25, 0.25, 0.25); \quad P_2 = 2P_1$
– Solution
β€’ Since $\Sigma_i = \sigma^2 I$ with $\sigma^2 = 0.25$, and $P_1 = 1/3$, $P_2 = 2/3$, the DFs are
$g_1(x) = -\dfrac{1}{2\sigma^2}(x-\mu_1)^T(x-\mu_1) + \log P_1 = -2\left[(x-0)^2 + (y-0)^2 + (z-0)^2\right] + \log\tfrac{1}{3}$
$g_2(x) = -\dfrac{1}{2\sigma^2}(x-\mu_2)^T(x-\mu_2) + \log P_2 = -2\left[(x-1)^2 + (y-1)^2 + (z-1)^2\right] + \log\tfrac{2}{3}$
β€’ The decision rule is $g_1(x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} g_2(x)$
β€’ Expanding the squares and canceling the term $x^2+y^2+z^2$, which appears on both sides,
$g_1(x) - g_2(x) = -4(x+y+z) + 6 - \log 2 \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; 0 \;\;\Rightarrow\;\; x+y+z \;\underset{\omega_1}{\overset{\omega_2}{\gtrless}}\; \dfrac{6 - \log 2}{4} \approx 1.33$
– Classify the test example $x_u = [0.1\ 0.7\ 0.8]^T$
$0.1 + 0.7 + 0.8 = 1.6 > 1.33 \;\Rightarrow\; x_u \in \omega_2$
Conclusions
β€’ The examples in this lecture illustrate the following points
– The Bayes classifier for Gaussian classes (general case) is quadratic
– The Bayes classifier for Gaussian classes with equal covariance is linear
– The Mahalanobis distance classifier is Bayes-optimal for
β€’ normally distributed classes and
β€’ equal covariance matrices and
β€’ equal priors
– The Euclidean distance classifier is Bayes-optimal for
β€’ normally distributed classes and
β€’ equal covariance matrices proportional to the identity matrix and
β€’ equal priors
– Both Euclidean and Mahalanobis distance classifiers are linear classifiers
β€’ Thus, some of the simplest and most popular classifiers can be
derived from decision-theoretic principles
– Using a specific (Euclidean or Mahalanobis) minimum distance classifier
implicitly corresponds to certain statistical assumptions
– Whether these assumptions actually hold can rarely be answered in practice; in most cases we can
only determine whether the classifier solves the problem at hand