
Generative classifiers: The
Gaussian classifier
Ata Kaban
School of Computer Science
University of Birmingham
Outline
• We have already seen how Bayes rule can be
turned into a classifier
• In all our examples so far we had discrete-valued attributes (e.g. in {‘sunny’,’rainy’}, {+,-})
• Today we learn how to do this when the data attributes are continuous-valued
Example
• Task: predict gender of individuals based on their heights
• Given: 100 height examples of women and 100 height examples of men

[Figure: histograms of the empirical height data for males and females; x-axis: Height (meters), y-axis: Frequency]
Class priors
• We can encode the values of the hypothesis (class) as 1 (male) and 0 (female). So, ℎ ∈ {0, 1}.
• Since in this example we had the same number of
males and females, we have P(h=1)=P(h=0)=0.5.
These are the prior probabilities of class
membership because they can be set before
measuring any data.
• Note that in cases when the class proportions are
imbalanced, we can use the priors to make
predictions even before seeing any data.
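As a minimal sketch (the label array below is made up to mirror the example), the priors can be estimated from the class proportions in the training set:

    import numpy as np

    # Hypothetical label vector: 1 = male, 0 = female (100 examples of each, as in the example)
    labels = np.array([1] * 100 + [0] * 100)

    prior_male = np.mean(labels == 1)     # P(h=1) = 0.5 here
    prior_female = np.mean(labels == 0)   # P(h=0) = 0.5 here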
Class-conditional likelihood
• Our measurements are heights. This is our
data, 𝑥.
• Class-conditional likelihoods:
p(x|h=1): probability that a male has height x
meters
p(x|h=0): probability that a female has height x meters
Class posterior
• As before, from Bayes rule we can obtain the class posterior:

$$P(h=1 \mid x) = \frac{p(x \mid h=1)\,P(h=1)}{p(x \mid h=1)\,P(h=1) + p(x \mid h=0)\,P(h=0)}$$
The denominator is the probability of measuring the height value x irrespective of the class.
• If we can compute this then we can use it for
predicting the gender from the height
measurement
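As a minimal sketch (the likelihood values below are made up for illustration), the posterior follows directly from Bayes rule once the class-conditional likelihoods and the priors are available:

    import numpy as np

    def posterior_male(lik_male, lik_female, prior_male=0.5, prior_female=0.5):
        """Bayes rule: P(h=1|x) from the class-conditional likelihoods and the priors."""
        evidence = lik_male * prior_male + lik_female * prior_female  # p(x), the denominator
        return lik_male * prior_male / evidence

    # Hypothetical values of p(x|h=1) and p(x|h=0) at some measured height x
    print(posterior_male(lik_male=2.1, lik_female=0.4))  # ~0.84, so we would predict male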
Discriminant function
• When does our prediction switch from predicting h=0 to predicting h=1?

[Figure: histograms of the empirical height data for males and females; x-axis: Height (meters), y-axis: Frequency]

• "When the measured height passes a certain threshold … "
• … more precisely, when 𝑃(ℎ = 0 | 𝑥) = 𝑃(ℎ = 1 | 𝑥)
Discriminant function
• If we make a measurement, say we get 𝑥 = 1.7 m
• We compute the posteriors and find 𝑃(ℎ = 1 | 𝑥 = 1.7) > 𝑃(ℎ = 0 | 𝑥 = 1.7)
• Then we decide to predict ‘ℎ = 1’, i.e., male
• If we measured 𝑥 = 1.2 m, we would get 𝑃(ℎ = 1 | 𝑥 = 1.2) < 𝑃(ℎ = 0 | 𝑥 = 1.2), and so we would predict ‘ℎ = 0’, i.e., female
Discriminant function
• We can define a discriminant function as:
$$f_1(x) = \frac{P(h=1 \mid x)}{P(h=0 \mid x)}$$

and compare the function value to 1.
• More convenient to have the switching at 0 rather than at 1. Define the discriminant function as the log of f₁:

$$f(x) = \log \frac{P(h=1 \mid x)}{P(h=0 \mid x)}$$

• Then the sign of this function defines the prediction (if f(x) > 0 => male, if f(x) < 0 => female)
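A minimal sketch of this decision rule (the posterior values are hypothetical, purely for illustration):

    import numpy as np

    def discriminant(post_male, post_female):
        """Log-odds discriminant f(x) = log[ P(h=1|x) / P(h=0|x) ]."""
        return np.log(post_male / post_female)

    post_male = 0.84               # hypothetical P(h=1|x)
    post_female = 1.0 - post_male  # P(h=0|x)
    f = discriminant(post_male, post_female)
    prediction = 'male' if f > 0 else 'female'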
How do we compute it?
• Let’s write it out using Bayes rule:
$$f(x) = \log \frac{P(h=1 \mid x)}{P(h=0 \mid x)} = \log \frac{p(x \mid h=1)\,P(h=1)}{p(x \mid h=0)\,P(h=0)}$$
• Now, we need the class-conditional likelihood terms, 𝑝(𝑥 | ℎ = 0) and 𝑝(𝑥 | ℎ = 1). Note that 𝑥 now takes continuous real values.
• We will model each class by a Gaussian distribution.
(Note: there are other ways to do this; it is a generic problem that density estimation deals with. Here we consider the specific case of using a Gaussian, which is fairly commonly done in practice.)
Illustration – our 1D example
[Figure: histograms of the empirical height data for males and females, with the fitted Gaussian distribution for each class overlaid; x-axis: Height (meters), y-axis: Frequency]
Gaussian - univariate
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-m)^2}{2\sigma^2}\right)$$
where 𝑚 is the mean (center), and 𝜎² is the variance (spread). These are the parameters that describe the distributions.

We will have a separate Gaussian for each class. So, the female class will have 𝑚₀ as its mean, and 𝜎₀² as its variance. The male class will have 𝑚₁ as its mean, and 𝜎₁² as its variance. We need to estimate these parameters from the data.
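A minimal sketch in Python/NumPy of fitting one univariate Gaussian per class and evaluating its density; the height samples are simulated here purely for illustration:

    import numpy as np

    def fit_gaussian(samples):
        """Estimate the mean and variance of a univariate Gaussian from data."""
        return np.mean(samples), np.var(samples)

    def gaussian_pdf(x, m, var):
        """Univariate Gaussian density with mean m and variance var."""
        return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    # Hypothetical training data: 100 heights per class, in meters
    rng = np.random.default_rng(0)
    heights_male = rng.normal(1.75, 0.07, 100)    # class h=1
    heights_female = rng.normal(1.62, 0.06, 100)  # class h=0

    m1, var1 = fit_gaussian(heights_male)
    m0, var0 = fit_gaussian(heights_female)

    x = 1.7
    lik_male = gaussian_pdf(x, m1, var1)    # p(x|h=1)
    lik_female = gaussian_pdf(x, m0, var0)  # p(x|h=0)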
Gaussian - multivariate
Let 𝑥 = (𝑥₁, 𝑥₂, … , 𝑥𝑑). So x has d attributes.
Let k ∈ {0, 1}.

$$p(x \mid h=k) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma_k|}}\,\exp\!\left\{-\frac{1}{2}(x - m_k)^T \Sigma_k^{-1} (x - m_k)\right\}$$
where 𝑚ₖ is the mean vector and Σₖ is the covariance matrix of class k. These are the parameters that describe the distributions, and they are estimated from the data.
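A minimal sketch of evaluating this multivariate density with NumPy, given an estimated mean vector and covariance matrix (the second attribute and all numbers are hypothetical):

    import numpy as np

    def multivariate_gaussian_pdf(x, m, cov):
        """Multivariate Gaussian density with mean vector m and covariance matrix cov."""
        d = len(m)
        diff = x - m
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
        return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

    # Hypothetical parameters for one class, e.g. attributes (height in m, weight in kg)
    m_k = np.array([1.75, 70.0])
    cov_k = np.array([[0.005, 0.2],
                      [0.2, 25.0]])
    print(multivariate_gaussian_pdf(np.array([1.7, 68.0]), m_k, cov_k))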
Gaussian - multivariate
[Figure: 2D example with 2 classes; axes: Attribute 1, Attribute 2]
Naïve Bayes
• Notice that the full covariance matrices are 𝑑 × 𝑑.
• In many situations there is not enough data to estimate the full covariance – e.g. when d is large.
• The Naïve Bayes assumption is again an easy simplification that we can make, and it tends to work well in practice. In the Gaussian model it means that the covariance matrix is diagonal.
• For the brave: Check this last statement for
yourself! – 3% extra credit if you hand in a correct
solution to me before next Thursday’s class!
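As an illustrative sketch (not a solution to the exercise above): under the Naïve Bayes assumption the covariance is diagonal, so the class-conditional density is a product of univariate Gaussians, one per attribute:

    import numpy as np

    def naive_bayes_gaussian_pdf(x, means, variances):
        """Diagonal-covariance Gaussian: product of per-attribute univariate densities."""
        per_dim = np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
        return np.prod(per_dim)

    # Hypothetical per-attribute means and variances for one class (d = 2)
    means_k = np.array([1.75, 70.0])
    variances_k = np.array([0.005, 25.0])
    print(naive_bayes_gaussian_pdf(np.array([1.7, 68.0]), means_k, variances_k))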
Are we done?
• How do we estimate the parameters, i.e. the
means 𝑚𝑘 and the variance/ covariance Σ𝑘 ?
• If we use the Naïve Bayes assumption, we can
compute the estimates of the mean and variance
in each class separately for each feature.
• If d is small, and you have many points in your
training set, then working with full covariance is
expected to work better.
• In MATLAB there are built-in functions that you can use: mean, cov, var.
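The same estimates can be computed in Python/NumPy as a sketch (the data matrix below is made up; np.cov expects one column per attribute when rowvar=False):

    import numpy as np

    # Hypothetical training data for one class: 100 examples, 2 attributes
    X_k = np.random.default_rng(1).normal(size=(100, 2))

    m_k = X_k.mean(axis=0)                # mean vector (cf. MATLAB's mean)
    cov_full = np.cov(X_k, rowvar=False)  # full d x d covariance matrix (cf. cov)
    var_diag = X_k.var(axis=0)            # per-attribute variances (cf. var) - the Naive Bayes case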
Multi-class classification
• We may have more than 2 classes – e.g.
‘healthy’, ‘disease type 1’, ‘disease type 2’.
• Our Gaussian classifier is easy to use in multi-class problems.
• We compute the posterior probability for each
of the classes
• We predict the class whose posterior
probability is highest.
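A minimal sketch of this multi-class decision rule (the likelihood and prior values are made up for illustration):

    import numpy as np

    def predict_class(likelihoods, priors):
        """Return the class with the highest posterior P(h=k|x), plus all posteriors."""
        unnormalised = np.asarray(likelihoods) * np.asarray(priors)
        posteriors = unnormalised / unnormalised.sum()
        return int(np.argmax(posteriors)), posteriors

    # Hypothetical class-conditional densities p(x|h=k) for 3 classes at some x
    k, post = predict_class(likelihoods=[0.3, 1.2, 0.5], priors=[0.5, 0.3, 0.2])
    print(k, post)  # predicts class 1 here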
Summing up
• This type of classifier is called ‘generative’, because it rests
on the assumption that the cloud of points in each class can
be seen as generated by some distribution, e.g. a Gaussian,
and works out its decisions based on estimating these
distributions.
• One could instead model the discriminant function directly!
That type of classifier is called ‘discriminative’.
• For the brave: Try to work out the form of the discriminant
function by plugging into it the form of the Gaussian class
conditional densities. You will get a quadratic function of x
in general. When does it reduce to a linear function?
• Recommended reading: Rogers & Girolami, Chapter 5.