unit #3
Giansalvo EXIN Cirrincione
PROBABILITY DENSITY ESTIMATION
Training data may be:
• labelled
• unlabelled
Density estimation from a finite number of training samples. Three families of methods:
• parametric methods
• non-parametric methods
• semi-parametric methods
Parametric methods: a specific functional form for the density model is assumed. This form contains a number of parameters which are then optimized by fitting the model to the training set. Drawback: the chosen functional form may be incorrect, i.e. incapable of representing the true density.
Non-parametric methods: no particular functional form is assumed; the form of the density is determined entirely by the data. Drawback: the number of parameters grows with the size of the training set.
Semi-parametric methods: a very general class of functional forms is allowed, in which the number of adaptive parameters can be increased in a systematic way to build ever more flexible models, but where the total number of parameters in the model can be varied independently of the size of the data set.
Parametric model: the normal or Gaussian distribution.
Techniques for estimating its parameters:
• maximum likelihood
• Bayesian inference
• stochastic techniques for on-line learning
A general Gaussian in d dimensions has d(d+3)/2 independent parameters: d for the mean vector and d(d+1)/2 for the symmetric covariance matrix.
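For reference, the functional form assumed here is the standard multivariate normal density (a textbook result, written out for completeness):

p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathrm T} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}

with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}.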
The quadratic form \Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) appearing in the exponent is the squared Mahalanobis distance from \mathbf{x} to \boldsymbol{\mu}.
Writing the covariance matrix in terms of its eigenvectors and eigenvalues, \boldsymbol{\Sigma} = \sum_i \lambda_i \mathbf{u}_i \mathbf{u}_i^{\mathrm T}, the contours of constant probability density are hyperellipsoids whose axes are the eigenvectors \mathbf{u}_i, with lengths proportional to \sqrt{\lambda_i}; on the contour \Delta^2 = 1 the density is smaller than at the mean by a factor \exp(-1/2).
If the covariance matrix is diagonal, the components of \mathbf{x} are statistically independent and the model has 2d parameters (d means and d variances).
If, further, \boldsymbol{\Sigma} = \sigma^2 \mathbf{I} (the same variance for every component, \sigma_j = \sigma for all j), the model has d+1 parameters.
Some properties of the Gaussian:
• any moment can be expressed as a function of \boldsymbol{\mu} and \boldsymbol{\Sigma}
• under rather general assumptions, the mean of M random variables tends to be distributed normally in the limit as M tends to infinity (central limit theorem); example: the sum of a set of variables drawn independently from the same distribution
• under any non-singular linear transformation of the coordinate system, the pdf is again normal, but with different parameters
• the marginal and conditional densities are normal.
Discriminant functions for normal class-conditional pdf's: in general, the decision boundaries are quadratic.
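A standard way to see this (stated here for completeness, following the usual textbook derivation): taking y_k(\mathbf{x}) = \ln p(\mathbf{x}|C_k) + \ln P(C_k) for Gaussian class-conditionals gives

y_k(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\mathrm T}\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_k| + \ln P(C_k),

which is quadratic in \mathbf{x}, so the decision boundaries y_k(\mathbf{x}) = y_j(\mathbf{x}) are quadratic surfaces.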
If the normal class-conditional pdf's share the same covariance matrix, \boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}, the quadratic terms cancel and the decision boundaries become linear.
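In that case (again a standard result, included for completeness) the discriminant can be written in linear form:

y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm T}\mathbf{x} + w_{k0}, \qquad \mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k, \qquad w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^{\mathrm T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln P(C_k).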
(Figure: decision boundary for two Gaussian classes with equal priors, P(C1) = P(C2).)
(Figure: decision boundaries for three Gaussian classes with equal priors, P(C1) = P(C2) = P(C3).)
If, in addition, \boldsymbol{\Sigma} = \sigma^2\mathbf{I} and the priors are equal, the rule reduces to assigning \mathbf{x} to the class with the nearest mean: template matching (a minimal sketch follows below).
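A minimal sketch of this nearest-mean (template matching) rule, assuming Euclidean distance and illustrative function names:

import numpy as np

def fit_class_means(X, y):
    # one template per class: the mean of that class's training vectors
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_mean_classify(x, means):
    # assign x to the class whose template is closest in Euclidean distance
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

X = np.array([[0.0, 0.0], [0.2, -0.1], [3.0, 3.1], [2.9, 3.2]])
y = np.array([0, 0, 1, 1])
means = fit_class_means(X, y)
print(nearest_mean_classify(np.array([2.5, 2.5]), means))   # -> 1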
MAXIMUM LIKELIHOOD
ML finds the optimum values for the parameters by maximizing a likelihood function derived from the training data, which is assumed to be drawn independently from the required distribution.
The likelihood of the parameters \boldsymbol{\theta} for the given TS is the joint probability density of the TS, viewed as a function of \boldsymbol{\theta}.
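Under the independence assumption above this takes the usual product form (a standard expression, written out here for reference):

L(\boldsymbol{\theta}) = \prod_{n=1}^{N} p(\mathbf{x}^n|\boldsymbol{\theta}).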
It is convenient to minimize the error function E = -\ln L rather than to maximize L (homework: write E explicitly). For a Gaussian pdf, the ML estimates of the mean and covariance turn out to be the corresponding sample averages (homework).
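For reference, the standard results that the homework asks to derive are

\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^n, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}^n-\hat{\boldsymbol{\mu}})(\mathbf{x}^n-\hat{\boldsymbol{\mu}})^{\mathrm T}.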
BAYESIAN INFERENCE
Rather than a single best estimate, Bayesian inference keeps a probability distribution over the parameters, expressing the uncertainty in their values.
The density at a new point \mathbf{x} is obtained by averaging p(\mathbf{x}|\boldsymbol{\theta}) over all parameter values, with the posterior distribution p(\boldsymbol{\theta}|X) acting as a weighting factor:

p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|X)\,d\boldsymbol{\theta}, \qquad p(\boldsymbol{\theta}|X) = \frac{p(X|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(X)},

where the data points in X are assumed to be drawn independently from the underlying distribution, so that p(X|\boldsymbol{\theta}) = \prod_n p(\mathbf{x}^n|\boldsymbol{\theta}).
For large numbers of observations, the Bayesian representation
of the density approaches the maximum likelihood solution.
A prior which gives rise to a posterior having the same functional form
is said to be a conjugate prior (reproducing densities, e.g. Gaussian).
Example: a normal distribution with known variance \sigma^2 and unknown mean \mu. Find the posterior for \mu, given the sample mean \bar{x} of N data points and a normal prior for \mu; the posterior is again normal (homework). (Figure: the normal prior and the successively sharper normal posteriors as N grows.)
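For reference, with a normal prior p(\mu) = \mathcal{N}(\mu_0, \sigma_0^2) the standard result (which the homework asks to verify) is p(\mu|X) = \mathcal{N}(\mu_N, \sigma_N^2) with

\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\bar{x} + \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},

so that as N grows the posterior mean approaches the sample mean and its width shrinks, consistent with the convergence to the ML solution noted above.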
STOCHASTIC TECHNIQUES FOR ON-LINE LEARNING
Iterative techniques:
• no storage of a complete TS
• on-line learning in real-time adaptive systems
• tracking of slowly varying systems
Starting point: the ML estimate of the mean of a normal distribution can be rewritten as a sequential update, \hat{\mu}_N = \hat{\mu}_{N-1} + \frac{1}{N}\left(\mathbf{x}^N - \hat{\mu}_{N-1}\right).
The Robbins-Monro algorithm
Consider a pair of random variables g and \theta which are correlated. The regression function is defined as f(\theta) \equiv \mathcal{E}[g\,|\,\theta]. Assume g has finite variance, \mathcal{E}[(g-f)^2\,|\,\theta] < \infty; the goal is to find the root \theta^* of f(\theta) = 0.
The root is found by successive corrections \theta^{(N+1)} = \theta^{(N)} + a_N\, g(\theta^{(N)}), where the positive coefficients a_N satisfy:
• \lim_{N\to\infty} a_N = 0 : successive corrections decrease in magnitude, for convergence
• \sum_N a_N = \infty : corrections are sufficiently large that the root is found
• \sum_N a_N^2 < \infty : the accumulated noise has finite variance, so the noise does not spoil convergence.
The ML parameter estimate can be formulated as a sequential update method using the Robbins-Monro formula: the ML solution is a root of \frac{\partial}{\partial\theta}\left[\frac{1}{N}\sum_n \ln p(x^n|\theta)\right] = 0, which in the limit of large N is a root of the regression function \mathcal{E}\!\left[\frac{\partial}{\partial\theta}\ln p(x|\theta)\right].
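Applying the Robbins-Monro formula to this regression function gives the standard sequential ML update (stated here for reference):

\theta^{(N+1)} = \theta^{(N)} + a_N \frac{\partial}{\partial\theta^{(N)}} \ln p\!\left(x^{N+1}\,\middle|\,\theta^{(N)}\right).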
homework
Consider the case where the pdf is taken to be a normal distribution, with known standard deviation \sigma and unknown mean \mu. Show that, by choosing a_N = \sigma^2/(N+1), the one-dimensional iterative version of the ML estimate of the mean is recovered by using the Robbins-Monro formula for sequential ML. Obtain the corresponding formula for the iterative estimate of \sigma^2 and repeat the same analysis.
(Figure: Robbins-Monro applied to the Gaussian mean, with g = (x - \hat{\mu})/\sigma^2 and regression function f = (\mu - \hat{\mu})/\sigma^2.)
SUPERVISED LEARNING
histograms
We can choose both the number of bins M and their starting position on the axis. The number of bins (i.e. the bin width) acts as a smoothing parameter. Curse of dimensionality: in d dimensions the number of bins grows as M^d.
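A minimal 1-D sketch of the idea, assuming illustrative names and numpy; M (equivalently the bin width) is the smoothing parameter:

import numpy as np

def histogram_density(samples, M, lo, hi):
    # piecewise-constant density estimate on [lo, hi] with M bins
    counts, edges = np.histogram(samples, bins=M, range=(lo, hi))
    width = (hi - lo) / M
    density = counts / (len(samples) * width)   # normalize so the estimate integrates to (approximately) one
    return edges, density

x = np.random.normal(0.0, 1.0, size=500)
edges, p_hat = histogram_density(x, M=20, lo=-4.0, hi=4.0)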
Density estimation in general
The probability that a new vector \mathbf{x}, drawn from the unknown pdf p(\mathbf{x}), will fall inside some region R of x-space is given by P = \int_R p(\mathbf{x}')\,d\mathbf{x}'. If we have N points drawn independently from p(\mathbf{x}), the probability that K of them will fall within R is given by the binomial law \Pr(K) = \binom{N}{K} P^K (1-P)^{N-K}. The distribution is sharply peaked around K \simeq NP as N tends to infinity. Assume p(\mathbf{x}) is continuous and varies only slightly over the region R, of volume V.
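Putting these together (a standard step, spelled out for clarity): for large N, K \simeq NP, and for a small region, P \simeq p(\mathbf{x})\,V, so that

p(\mathbf{x}) \simeq \frac{K}{N\,V}.

Whether K or V is held fixed distinguishes the two approaches summarized next.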
Assumption #1: R relatively large, so that P will be large and the binomial distribution will be sharply peaked.
Assumption #2: R small, to justify the assumption that p(\mathbf{x}) is nearly constant inside the integration region.
K-nearest-neighbours: K is FIXED (assumption #1) and V is DETERMINED FROM THE DATA (assumption #2).
Kernel-based methods: V is FIXED (assumption #2) and K is DETERMINED FROM THE DATA (assumption #1).
Kernel-based methods
Take R to be a hypercube of side h centred on \mathbf{x}, so that V = h^d. We can find an expression for K by defining a kernel function H(\mathbf{u}), also known as a Parzen window, equal to 1 inside the unit hypercube centred at the origin and 0 outside; it acts as a zero-order-hold (ZOH) interpolation function. The resulting estimate is a superposition of N cubes of side h, with each cube centred on one of the data points.
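For reference, the standard form of the resulting estimator is

\hat{p}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{h^d}\,H\!\left(\frac{\mathbf{x}-\mathbf{x}^n}{h}\right),

with H(\mathbf{u}) = 1 if |u_j| \le 1/2 for all j = 1,\dots,d and H(\mathbf{u}) = 0 otherwise, so that K = \sum_n H\big((\mathbf{x}-\mathbf{x}^n)/h\big).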
Kernel-based methods
Replacing the hypercube kernel by a smooth kernel function, e.g. a Gaussian, gives a smoother estimate. (Figure: estimates from 30 samples using the ZOH kernel and a Gaussian kernel.)
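A minimal sketch of a Gaussian-kernel estimate in one dimension (illustrative names; numpy assumed), showing how the width h controls the smoothness:

import numpy as np

def gaussian_kde(x, data, h):
    # kernel density estimate at the points x from 1-D samples `data`, width h
    x = np.asarray(x)[:, None]          # shape (P, 1)
    data = np.asarray(data)[None, :]    # shape (1, N)
    bumps = np.exp(-0.5 * ((x - data) / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    return bumps.mean(axis=1)           # average of N Gaussian bumps

samples = np.random.normal(0.0, 1.0, size=30)   # 30 samples, as in the figure
grid = np.linspace(-4.0, 4.0, 200)
p_hat = gaussian_kde(grid, samples, h=0.4)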
Kernel-based methods
Taking the expectation over different selections of the data points \mathbf{x}^n, the expectation of the estimated density is a convolution of the true pdf with the kernel function, and so represents a smoothed version of the true pdf. For a finite data set there is no non-negative estimator which is unbiased for all continuous pdf's (Rosenblatt, 1956). Note also that all of the data points must be stored!
K-nearest neighbours
Consider a small hypersphere centred at a point x and allow the
radius of the sphere to grow until it contains precisely K data
points. The estimate of the density is then given by K / NV.
One of the potential problems with the kernel-based approach arises from the use of a fixed width parameter (h) for all of the data points: if h is too large, there may be regions of x-space in which the estimate is over-smoothed, while reducing h may lead to problems in regions of lower density, where the model density will become noisy. The optimum choice of h may therefore be a function of position, which the K-nearest-neighbour estimate achieves implicitly by letting the volume V adapt to the data.
K-nearest neighbours
The estimate is not a true
probability density since its
integral over all x-space diverges.
All of the data points must be stored !
The cost of searching for the nearest neighbours can be reduced using branch-and-bound methods.
K-nearest neighbour classification rule
Kk
p x Ck
N kV
The data set contains Nk
points in class Ck and N
points in total.
Nk
K
p x
Draw a hypersphere p Ck
NV
N
around x which
encompasses K points
irrespective of their class.
PCk x
px Ck PCk
px
Kk
K
To minimize the probability of misclassification, assign \mathbf{x} to the class with the largest K_k/K. The special case K = 1 gives the nearest-neighbour rule (1-NNR): assign \mathbf{x} to the class of its single nearest training point.
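A minimal sketch of this rule (illustrative names; Euclidean distance assumed):

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K):
    # majority class among the K training points nearest to x, i.e. the largest K_k / K
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:K]]
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([2.8, 3.2]), X_train, y_train, K=3))   # -> 1 ; K=1 gives the 1-NNR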
Kullback-Leibler distance (or asymmetric divergence): a measure of the distance between two density functions. L \ge 0, with equality if and only if the two pdf's are equal (homework).
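For reference, the standard definition between the true density p(\mathbf{x}) and a model \tilde{p}(\mathbf{x}) is

L = -\int p(\mathbf{x}) \ln\frac{\tilde{p}(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x},

and it is asymmetric because interchanging p and \tilde{p} generally changes the value of L.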
SEMI-PARAMETRIC METHODS
Techniques not restricted to specific functional forms, in which the size of the model grows only with the complexity of the problem being solved, and not simply with the size of the data set. Drawback: they are computationally intensive.
MIXTURE MODEL
Training methods based on ML:
• nonlinear optimization
• re-estimation (EM algorithm)
• stochastic sequential estimation
MIXTURE DISTRIBUTION

p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x}|j)\,P(j), \qquad \sum_{j=1}^{M} P(j) = 1, \quad 0 \le P(j) \le 1.

The mixing parameters P(j) can be interpreted as the prior probability of a data point having been generated from component j of the mixture. The training data are incomplete data: the component label is not observed.
To generate a data point from the pdf, one of the components j is first selected at random with probability P(j), and then a data point is generated from the corresponding component density p(\mathbf{x}|j); a sketch of this generative process follows below. A mixture model can approximate any CONTINUOUS density to arbitrary accuracy, provided the model has a sufficiently large number of components, and provided the parameters of the model are chosen correctly.
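A minimal sketch of that two-stage sampling procedure for spherical Gaussian components (illustrative names; numpy assumed):

import numpy as np

def sample_mixture(priors, means, sigmas, n_samples, seed=0):
    # pick component j with probability P(j), then draw x from the spherical Gaussian p(x|j)
    rng = np.random.default_rng(seed)
    priors, means, sigmas = np.asarray(priors), np.asarray(means), np.asarray(sigmas)
    d = means.shape[1]
    comps = rng.choice(len(priors), size=n_samples, p=priors)
    return means[comps] + sigmas[comps][:, None] * rng.standard_normal((n_samples, d))

X = sample_mixture(priors=[0.5, 0.5],
                   means=[[0.0, 0.0], [3.0, 3.0]],
                   sigmas=[1.0, 0.5],
                   n_samples=1000)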
The corresponding posterior probability, P(j|\mathbf{x}) = \frac{p(\mathbf{x}|j)\,P(j)}{p(\mathbf{x})}, gives the probability that component j was responsible for generating \mathbf{x}. Here each component is taken to be a spherical Gaussian, p(\mathbf{x}|j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\left\{-\frac{\lVert\mathbf{x}-\boldsymbol{\mu}_j\rVert^2}{2\sigma_j^2}\right\}. (Figure: the mixture shown as a network diagram whose weights are the component means \mu_{ji}.)
MAXIMUM LIKELIHOOD
Adjustable parameters:
• P(j), j = 1, …, M
• \mu_j, j = 1, …, M
• \sigma_j, j = 1, …, M
Problems:
• singular solutions: the likelihood goes to infinity when one of the Gaussian components collapses onto one of the data points (\sigma_j \to 0)
• local minima
Possible solutions:
• constrain the components to have equal variance
• impose a minimum (underflow) threshold on the variance
To enforce the constraints on the mixing coefficients automatically, P(j) can be represented through a softmax (normalized exponential) transformation of unconstrained auxiliary variables \gamma_j: P(j) = \frac{\exp(\gamma_j)}{\sum_{k=1}^{M}\exp(\gamma_k)}.
Expressions for the parameters at a minimum of E:
• \mu_j : the mean of the data vectors, weighted by the posterior probabilities that the corresponding data points were generated from that component
• \sigma_j^2 : the variance of the data w.r.t. the mean of that component, again weighted with the posterior probabilities
• P(j) : the posterior probabilities for that component, averaged over the data set
These are highly non-linear coupled equations, because the posterior probabilities themselves depend on all of the parameters; the formulas are collected below.
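For reference, the standard re-estimation formulas for spherical Gaussian components (the equations summarized above) are

\boldsymbol{\mu}_j = \frac{\sum_n P(j|\mathbf{x}^n)\,\mathbf{x}^n}{\sum_n P(j|\mathbf{x}^n)}, \qquad \sigma_j^2 = \frac{1}{d}\,\frac{\sum_n P(j|\mathbf{x}^n)\,\lVert\mathbf{x}^n-\boldsymbol{\mu}_j\rVert^2}{\sum_n P(j|\mathbf{x}^n)}, \qquad P(j) = \frac{1}{N}\sum_n P(j|\mathbf{x}^n).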
Expectation-maximization (EM) algorithm: evaluate the posterior probabilities P(j|\mathbf{x}^n) with the old parameter values, then use them in the expressions above to obtain the new parameter values, and repeat. The error function decreases at each iteration until a local minimum is found.
Proof sketch. Jensen's inequality: given a set of non-negative numbers \lambda_j that sum to one,

\ln\left(\sum_j \lambda_j x_j\right) \ge \sum_j \lambda_j \ln x_j.

Applying it with \lambda_j = P^{\mathrm{old}}(j|\mathbf{x}^n) gives a bound of the form E^{\mathrm{new}} \le E^{\mathrm{old}} + Q, so minimizing Q with respect to the new parameters leads to a decrease in the value of E^{\mathrm{new}} unless E^{\mathrm{new}} is already at a local minimum. For the Gaussian mixture model, minimizing Q with respect to the new parameters yields exactly the re-estimation formulas above. End of proof.
Example: the EM algorithm applied to 1000 data points drawn from a uniform distribution, using a mixture of seven spherical Gaussian components initialized with equal priors P(j) = 1/M. (Figure: contours of constant probability density of the fitted mixture after 20 cycles; a runnable sketch of the algorithm follows below.)
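A minimal, self-contained sketch of this EM procedure for spherical Gaussians (illustrative code, not the lecture's; it follows the re-estimation formulas given above and omits numerical safeguards):

import numpy as np

def em_spherical_gmm(X, M, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    means = X[rng.choice(N, M, replace=False)]     # initialise centres on random data points
    sigmas = np.full(M, X.std())                   # one width per spherical component
    priors = np.full(M, 1.0 / M)                   # P(j) = 1/M
    for _ in range(n_iter):
        # E-step: posteriors P(j|x^n) from the old parameter values
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)          # (N, M)
        p_xj = np.exp(-0.5 * sq / sigmas**2) / (2 * np.pi * sigmas**2) ** (d / 2)
        post = priors * p_xj
        post /= post.sum(axis=1, keepdims=True)
        # M-step: new parameter values from the posterior-weighted averages
        Nj = post.sum(axis=0)
        means = (post.T @ X) / Nj[:, None]
        sq_new = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        sigmas = np.sqrt((post * sq_new).sum(axis=0) / (d * Nj))
        priors = Nj / N
    return priors, means, sigmas

X = np.random.default_rng(1).uniform(0.0, 1.0, size=(1000, 2))   # 1000 uniform points
priors, means, sigmas = em_spherical_gmm(X, M=7, n_iter=20)      # 7 components, 20 cycles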
Why expectation-maximization?
(The mixture density has the same form as the class-conditional decomposition p(\mathbf{x}) = \sum_{k=1}^{c} p(\mathbf{x}|C_k)\,P(C_k), with the components playing the role of the classes.)
Hypothetical complete data set: for each data point \mathbf{x}^n, introduce a variable z^n, an integer in the range (1, M), specifying which component of the mixture generated \mathbf{x}^n. The distribution of the z^n is unknown.
First we guess some values for the parameters of the mixture model (the old parameter values) and then we use these, together with Bayes' theorem, to find the probability distribution of the {z^n}. We then compute the expectation of E_comp w.r.t. this distribution. This is the E-step of the EM algorithm. The new parameter values are then found by minimizing this expected error w.r.t. the parameters. This is the maximization or M-step of the EM algorithm (minimizing E is equivalent to maximizing the likelihood).
P^{\mathrm{old}}(z^n|\mathbf{x}^n) is the probability for z^n, given the value of \mathbf{x}^n and the old parameter values; it plays the role of the probability distribution for the {z^n}. Thus, the expectation of E_comp over the complete set of {z^n} values is given by

\mathcal{E}[E_{\mathrm{comp}}] = \sum_{\{z^n\}} P^{\mathrm{old}}(\{z^n\}|\{\mathbf{x}^n\})\, E_{\mathrm{comp}},

which (homework) can be shown to be equal, up to terms independent of the new parameters, to the bound Q used in the convergence proof above.
Stochastic estimation of the parameters
The batch re-estimation formulas require the storage of all previous data points. A stochastic sequential version updates the parameters as each new data point arrives, and no singular solutions arise in on-line problems.