Multivariate Analysis (slides 8)
• Today we consider Linear Discriminant Analysis (LDA) and Quadratic
Discriminant Analysis (QDA).
• These are used if it is assumed that there exists a set of k groups within the
data and that there is a subset of the data that is labelled, i.e., whose group
membership is known.
• Discriminant analysis refers to a set of ‘supervised’ statistical techniques
where the class information is used to help reveal the structure of the data.
• This structure then allows the ‘classification’ of future observations.
Discriminant Analysis
• We want to be able to use knowledge of labelled data (i.e., those whose
group membership is known) in order to classify the group membership of
unlabelled data.
• We previously considered the k-nearest neighbours technique for this
problem.
• We shall now consider the alternative approaches of:
– LDA (linear discriminant analysis)
– QDA (quadratic discriminant analysis)
LDA & QDA
• Unlike k-nearest neighbours (and all the other techniques covered so far), both LDA and QDA assume a distribution over the data.
• Once we introduce distributions (and parameters of those distributions), we
can start to quantify uncertainty over the structure of the data.
• As far as classification is concerned, this means that we can start to talk
about the probability of group assignment.
• The distinction between a point that is assigned a probability of 0.51 to one
group and 0.49 to another, against a point that is assigned a probability of
0.99 to one group and 0.01 to another, can be quite important.
Multivariate Normal Distribution
• Let xT = (x1 , x2 , ..., xm ), where x1 , x2 , ..., xm are random variables.
• The Multi-Variate Normal (MVN) distribution has two parameters:
– Mean µ, an m-dimensional vector.
– Covariance matrix Σ, with dimension m × m.
• A vector x is said to follow a MVN distribution, denoted x ∼ M V N (µ, Σ),
if it has the following probability density function:
f(x|µ, Σ) = (1 / ((2π)^(m/2) |Σ|^(1/2))) exp( −(1/2) (x − µ)T Σ−1 (x − µ) )
• Here |Σ| is used to denote the determinant of Σ.
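• A minimal sketch of evaluating this density in Python (numpy and scipy assumed to be available; the mean, covariance and evaluation point are illustrative, not taken from the slides):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Illustrative parameters for a 2-dimensional MVN (assumed values).
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.8],
                      [0.8, 3.0]])
    x = np.array([1.0, 0.5])

    # Density via scipy, for comparison.
    pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

    # Density computed directly from the formula above.
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff        # (x - mu)^T Sigma^{-1} (x - mu)
    pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))

    print(pdf_scipy, pdf_manual)                     # the two values agree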
Multivariate Normal Distribution
• The MVN distribution is very useful when modelling multivariate data.
• Notice:
{x : f(x|µ, Σ) > C} = {x : (x − µ)T Σ−1 (x − µ) < −2 log[ C (2π)^(m/2) |Σ|^(1/2) ]}
• This corresponds to an m-dimensional ellipsoid centered at point µ.
• If it is assumed that the data within a group k follows a MVN distribution
with mean µk and covariance Σk , then the scatter of the data should be
roughly elliptical.
• The mean fixes the location of the scatter and the covariance affects the
shape of the ellipsoid.
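• As a quick numerical check of the level-set identity above, the two conditions can be compared directly in Python (numpy and scipy assumed; the parameters, threshold C and test point are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Illustrative parameters and threshold (assumed values).
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.8],
                      [0.8, 3.0]])
    C = 0.05
    x = np.array([1.0, -0.5])
    m = len(mu)

    # Left-hand condition: f(x | mu, Sigma) > C.
    lhs = multivariate_normal(mean=mu, cov=Sigma).pdf(x) > C

    # Right-hand condition: (x - mu)^T Sigma^{-1} (x - mu) < -2 log[ C (2 pi)^(m/2) |Sigma|^(1/2) ].
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    rhs = quad < -2 * np.log(C * (2 * np.pi) ** (m / 2) * np.linalg.det(Sigma) ** 0.5)

    print(lhs, rhs)    # the two booleans always agree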
Normal Contours
• For example, the contour plot of a MVN with mean (0, 0)T and covariance matrix (1, 0.8; 0.8, 3) is:
[Figure: contour plot of this MVN density.]
Normal Contours: Data
• Sampling from this distribution and overlaying the results on the contour plot gives:
[Figure: simulated points from the MVN overlaid on its density contours.]
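• A minimal sketch of this sampling step in Python (numpy assumed; the parameters repeat the illustrative contour-plot values):

    import numpy as np

    # Illustrative MVN parameters (assumed, matching the contour-plot example).
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.8],
                      [0.8, 3.0]])

    rng = np.random.default_rng(seed=1)
    samples = rng.multivariate_normal(mean=mu, cov=Sigma, size=500)

    # Each row of `samples` is one draw; a scatter plot of these points
    # fills out the elliptical contours of the density.
    print(samples.shape)    # (500, 2)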
Shape of Scatter
• If we assume that the data within each group follows a MVN distribution
with mean µk and covariance Σk , then we also assume that the scatter is
roughly elliptical.
• The mean sets the location of this scatter and the covariance sets the shape
of the ellipse.
[Figure: elliptical scatter of data simulated from a MVN distribution.]
Mahalanobis Distance
• The Mahalanobis distance from a point x to a mean µ is D, where
D² = (x − µ)T Σ−1 (x − µ).
• Two points have the same Mahalanobis distance if they are on the same ellipsoid centered on µ (as defined earlier).
[Figure: ellipsoidal contours of equal Mahalanobis distance around the point µ.]
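• A minimal sketch of this computation in Python (numpy assumed; µ and Σ are illustrative values):

    import numpy as np

    def mahalanobis_sq(x, mu, Sigma):
        """Squared Mahalanobis distance D^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
        diff = x - mu
        # Solving Sigma z = diff avoids forming the explicit inverse.
        return float(diff @ np.linalg.solve(Sigma, diff))

    # Illustrative mean and covariance (assumed values).
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.8],
                      [0.8, 3.0]])

    print(np.sqrt(mahalanobis_sq(np.array([1.0, 1.0]), mu, Sigma)))    # the distance D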
Which Is Closest?
• Suppose we wish to find the mean µk that a point x is closest to as
measured by Mahalanobis distance.
• That is, we want to find the k that minimizes the expression:
(x − µk)T Σk−1 (x − µk)
• The point x is closer to µk than it is to µl (under Mahalanobis distance)
when:
(x − µk)T Σk−1 (x − µk) < (x − µl)T Σl−1 (x − µl).
• Note that this is a quadratic expression in x.
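• A minimal sketch of this ‘closest mean’ rule in Python (numpy assumed; the group means and covariances are illustrative):

    import numpy as np

    # Illustrative per-group parameters for two groups in two dimensions.
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    covs = [np.array([[1.0, 0.2], [0.2, 1.0]]),
            np.array([[2.0, -0.5], [-0.5, 1.5]])]

    def closest_group(x, means, covs):
        """Index k minimising (x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
        d2 = [(x - mu) @ np.linalg.solve(S, x - mu) for mu, S in zip(means, covs)]
        return int(np.argmin(d2))

    print(closest_group(np.array([2.0, 1.0]), means, covs))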
When Covariance is Equal
• If Σk = Σ for all k, then the previous expression becomes:
(x − µk )T Σ−1 (x − µk ) < (x − µl )T Σ−1 (x − µl ).
• This can be simplified as:
−2xT Σ−1 µk + µTk Σ−1 µk < −2xT Σ−1 µl + µTl Σ−1 µl
⇔ −2µTk Σ−1 x + µTk Σ−1 µk < −2µTl Σ−1 x + µTl Σ−1 µl
• This is now a linear expression in x.
• Hence the names ‘linear’ discriminant analysis and ‘quadratic’ discriminant analysis.
Estimating Equal Covariance
• In LDA we need to pool the covariance matrices of individual classes.
• Remember that the sample covariance matrix Q for a set of n observations
of dimension m is the matrix whose elements are
qij = (1/(n − 1)) ∑_{k=1}^{n} (xki − x̄i)(xkj − x̄j)
for i = 1, 2, . . . , m and j = 1, 2, . . . , m.
• Then the pooled covariance matrix is defined as:
Qp = (1/(n − g)) ∑_{l=1}^{g} (nl − 1) Ql
where g is the number of classes, Ql is the estimated sample covariance matrix for class l, nl is the number of data points in class l, whilst n is the total number of data points.
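• A minimal sketch of this pooling step in Python (numpy assumed; the data array and labels are placeholders supplied by the user):

    import numpy as np

    def pooled_covariance(X, labels):
        """Pooled covariance: sum over classes of (nl - 1) Ql, divided by (n - g).

        X      : (n, m) array of observations
        labels : length-n array of class labels
        """
        labels = np.asarray(labels)
        classes = np.unique(labels)
        n, m = X.shape
        g = len(classes)
        W = np.zeros((m, m))
        for c in classes:
            Xc = X[labels == c]
            nl = len(Xc)
            # np.cov with rowvar=False uses the (nl - 1) denominator by default.
            W += (nl - 1) * np.cov(Xc, rowvar=False)
        return W / (n - g)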
Estimating Equal Covariance
• This formula arises from summing the squares and cross products over data
points in all classes:
Wij = ∑_{l=1}^{g} ∑_{k=1}^{nl} (xki − x̄li)(xkj − x̄lj)
for i = 1, . . . , m and j = 1, . . . , m.
• Hence:
W = ∑_{l=1}^{g} (nl − 1) Ql
• Given n data points falling in g groups, we have n − g degrees of freedom
because we need to estimate the g group means.
• This results in the previous formula for the pooled covariance matrix:
Qp = W / (n − g)
Modelling Assumptions
• Both LDA and QDA are parametric statistical methods.
• In order to classify a new observation x into one of the known K groups, we
need to know P(x ∈ k|x) for k = 1, . . . , K.
• That is to say, we need to know the posterior probability of belonging to
each of the possible groups given the data.
• Then classify the new observation as belonging to the class which has the largest posterior probability.
• Bayes’ Theorem states that the posterior probability of observation x
belonging to group k is:
P(x ∈ k|x) = πk f(x|x ∈ k) / ∑_{l=1}^{K} πl f(x|x ∈ l)
Modelling Assumptions
• Discriminant analysis assumes that observations from group k follow a
MVN distribution with mean µk and covariance Σk .
• That is
f(x|x ∈ k) = f(x|µk, Σk) = (1 / ((2π)^(m/2) |Σk|^(1/2))) exp( −(1/2) (x − µk)T Σk−1 (x − µk) )
• Discriminant analysis (as presented here) also assumes values for
πk = P(x ∈ k), which is the proportion of population objects belonging to
class k (this can be known or estimated).
• Note that ∑_{k=1}^{K} πk = 1.
• Typically, πk = 1/K is used.
• πk are sometimes referred to as prior probabilities.
• Using all this we can compute P(x ∈ k|x) and assign data points to groups
so as to maximise this probability.
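• A minimal sketch of this posterior calculation in Python (numpy and scipy assumed; the group parameters and priors are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Illustrative parameters for K = 2 groups (assumed values).
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
    priors = np.array([0.5, 0.5])    # pi_k; here the usual 1/K choice

    def posteriors(x, means, covs, priors):
        """P(x in k | x) via Bayes' theorem with MVN class densities."""
        dens = np.array([multivariate_normal(mean=m, cov=S).pdf(x)
                         for m, S in zip(means, covs)])
        unnorm = priors * dens
        return unnorm / unnorm.sum()

    x = np.array([1.0, 1.0])
    post = posteriors(x, means, covs, priors)
    print(post, post.argmax())    # posterior probabilities and the assigned group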
Some Calculations
• The probability of x belonging to group k conditional on x being known
satisfies:
P(x ∈ k|x) ∝ πk f (x|µk , Σk ).
• Hence,
P(x ∈ k|x) > P(x ∈ l|x) ⇔ πk f (x|µk , Σk ) > πl f (x|µl , Σl )
• Taking logarithms and substituting in the probability density function for a
MVN distribution we find after simplification:
log πk − (1/2) log |Σk| − (1/2) (x − µk)T Σk−1 (x − µk)
> log πl − (1/2) log |Σl| − (1/2) (x − µl)T Σl−1 (x − µl)
Linear Discriminant Analysis
• If equal covariances are assumed then P(x ∈ k|x) > P(x ∈ l|x) if and only if:
log πk + xT Σ−1 µk − (1/2) µkT Σ−1 µk > log πl + xT Σ−1 µl − (1/2) µlT Σ−1 µl
• Hence the name linear discriminant analysis.
• If πk = 1/K for all k, then this reduces further to:
(x − (1/2)(µk + µl))T Σ−1 (µk − µl) > 0
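• A minimal sketch of an LDA classifier built from these discriminant scores in Python (numpy assumed; the fitted means, pooled covariance and priors are treated as given, with illustrative values):

    import numpy as np

    def lda_scores(x, means, Sigma_pooled, priors):
        """Linear scores: log pi_k + x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k."""
        Sinv = np.linalg.inv(Sigma_pooled)
        return np.array([np.log(pi) + x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu
                         for mu, pi in zip(means, priors)])

    # Illustrative fitted quantities for two groups (assumed values).
    means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
    Sigma_pooled = np.array([[1.0, 0.3],
                             [0.3, 1.0]])
    priors = [0.5, 0.5]

    x = np.array([1.5, 0.5])
    print(lda_scores(x, means, Sigma_pooled, priors).argmax())    # assigned group index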
Quadratic Discriminant Analysis
• No simplification arises in the unequal covariance case, hence
P(x ∈ k|x) > P(x ∈ l|x) if and only if:
log πk − (1/2) log |Σk| − (1/2) (x − µk)T Σk−1 (x − µk)
> log πl − (1/2) log |Σl| − (1/2) (x − µl)T Σl−1 (x − µl)
• Hence the name quadratic discriminant analysis.
• If πk = 1/K for all k, then some simplification arises.
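• A minimal sketch of the corresponding QDA scores in Python (numpy assumed; the per-group means, covariances and priors are illustrative):

    import numpy as np

    def qda_scores(x, means, covs, priors):
        """Quadratic scores: log pi_k - 0.5 log|Sigma_k| - 0.5 (x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
        scores = []
        for mu, S, pi in zip(means, covs, priors):
            diff = x - mu
            quad = diff @ np.linalg.solve(S, diff)
            _, logdet = np.linalg.slogdet(S)
            scores.append(np.log(pi) - 0.5 * logdet - 0.5 * quad)
        return np.array(scores)

    # Illustrative parameters: two groups with different covariance structures.
    means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
    covs = [np.eye(2), np.array([[2.0, 0.8], [0.8, 1.5]])]
    priors = [0.5, 0.5]

    print(qda_scores(np.array([1.0, 1.5]), means, covs, priors).argmax())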
Summary
• In LDA the decision boundary between class k and class l is given by:
log( P(k|x) / P(l|x) ) = log( πk / πl ) + log( f(x|k) / f(x|l) ) = 0
• Unlike k-nearest neighbours, both LDA and QDA are model-based classifiers where P(data|group) is assumed to follow a MVN distribution:
– The model based assumption allows for the generation of the probability
for class membership.
– The MVN assumption means that groups are assumed to follow an
elliptical shape.
• Whilst LDA assumes groups have the same covariance matrix, QDA
permits different covariance structures between groups.