Mixture of Gaussians

Sargur Srihari
[email protected]
Mixtures of Gaussians
•  Also called Finite Mixture Models (FMMs)
p(x) = ∑_{k=1}^{K} πk N(x | µk, Σk)
Goal: Determine the three sets of parameters π, µ, Σ using maximum likelihood
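As a concrete illustration (not from the slides), a minimal R sketch that evaluates this density for a one-dimensional mixture with made-up parameters; gmm_density is a hypothetical helper:

# Evaluate p(x) = sum_k pi_k N(x | mu_k, sigma_k^2) for a 1-D mixture
gmm_density <- function(x, pik, mu, sigma) {
  dens <- matrix(0, nrow = length(x), ncol = length(pik))
  for (k in seq_along(pik)) {
    dens[, k] <- pik[k] * dnorm(x, mean = mu[k], sd = sigma[k])  # pi_k N(x | mu_k, sigma_k^2)
  }
  rowSums(dens)  # sum over the K components
}

# Example with K = 3 components (illustrative values only)
gmm_density(c(-1, 0, 2.5), pik = c(0.3, 0.5, 0.2), mu = c(-2, 0, 3), sigma = c(1, 0.5, 1.5))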
Gaussian Mixture Model
•  A linear superposition of Gaussian components
•  Provides a richer class of density models than
the single Gaussian
•  GMMs are formulated in terms of discrete latent
variables
–  Provides deeper insight
–  Motivates EM algorithm
•  Which gives a maximum likelihood solution for the component
parameters (means, covariances and mixing coefficients)
GMM Formulation
•  Linear superposition of K Gaussians:
p(x) = ∑_{k=1}^{K} πk N(x | µk, Σk)
•  Introduce a latent variable z that has K values
–  Use 1-of-K representation
•  Let z = (z1,..,zK) whose elements are
zk ∈ {0,1} and ∑_k zk = 1
•  There are K possible states of z corresponding to K
components
•  Define joint distribution p(x,z) = p(x|z) p(z)
–  x is the observed variable
–  z is the hidden or missing variable
–  p(z) is the marginal and p(x|z) the conditional distribution
Specifying p(z), z = {z1,..,zK}
•  Associate a probability with each component zk
–  Denote p(zk = 1) = πk where the parameters satisfy
0 ≤ πk ≤ 1 and ∑_k πk = 1
•  Because z uses 1-of-K it follows that
p(z) = ∏_{k=1}^{K} πk^zk
With one component: p(z1) = π1^z1
With two components: p(z1, z2) = π1^z1 π2^z2
–  since components are mutually exclusive and hence
are independent
Conditional distribution p(x|z)
•  For a particular component (value of z)
p(x | zk = 1) = N(x | µk, Σk)
•  Thus p(x|z) can be written in the form
p(x | z) = ∏_{k=1}^{K} N(x | µk, Σk)^zk
Due to the exponent zk, all product terms except for one equal one
•  Thus marginal distribution of x is obtained by
summing over all possible states of z
p(x) = ∑_z p(z) p(x | z) = ∑_z ∏_{k=1}^{K} [πk N(x | µk, Σk)]^zk = ∑_{k=1}^{K} πk N(x | µk, Σk)
•  This is the standard form of a Gaussian mixture
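A small R sketch (hypothetical 1-D example, not from the slides) verifying numerically that summing p(z) p(x|z) over the K one-of-K states of z recovers the standard mixture form:

pik   <- c(0.3, 0.5, 0.2)   # mixing coefficients pi_k
mu    <- c(-2, 0, 3)        # component means
sigma <- c(1, 0.5, 1.5)     # component standard deviations
x     <- 1.2                # a single observation
K     <- length(pik)

# Enumerate the K one-hot states of z, using p(z) = prod_k pi_k^{z_k}
# and p(x|z) = prod_k N(x | mu_k, sigma_k)^{z_k}
p_marginal <- 0
for (k in 1:K) {
  z   <- as.numeric(1:K == k)             # 1-of-K representation of state k
  pz  <- prod(pik^z)                      # p(z)
  pxz <- prod(dnorm(x, mu, sigma)^z)      # p(x | z)
  p_marginal <- p_marginal + pz * pxz
}

# The standard Gaussian-mixture form gives the same value
all.equal(p_marginal, sum(pik * dnorm(x, mu, sigma)))   # TRUE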
Latent Variable
•  If we have observations x1,..,xN
•  Because marginal distribution is in the form
p(x) = ∑_z p(x,z)
–  It follows that for every observed data point xn there is
a corresponding latent vector zn , i.e., its sub-class
•  Thus we have found a formulation of Gaussian
mixture involving an explicit latent variable
–  We are now able to work with joint distribution p(x,z)
instead of marginal p(x)
•  Leads to significant simplification through
introduction of expectation maximization
Another conditional probability (Responsibility)
•  In EM, the posterior p(z|x) plays a key role
•  The probability p(zk=1|x) is denoted γ (zk )
•  From Bayes theorem
γ(zk) ≡ p(zk = 1 | x) = p(zk = 1) p(x | zk = 1) / ∑_{j=1}^{K} p(zj = 1) p(x | zj = 1)
                      = πk N(x | µk, Σk) / ∑_{j=1}^{K} πj N(x | µj, Σj)
(using p(x,z) = p(x|z) p(z))
•  View p(zk = 1) = πk as the prior probability of component k
and γ(zk) = p(zk = 1 | x) as the posterior probability
–  It is also the responsibility that component k takes for explaining
the observation x
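A minimal R sketch (1-D case, hypothetical parameters, not from the slides) of the responsibility computation; each row of the result is a posterior distribution over the K components for one data point:

responsibilities <- function(x, pik, mu, sigma) {
  unnorm <- matrix(0, nrow = length(x), ncol = length(pik))
  for (k in seq_along(pik)) {
    unnorm[, k] <- pik[k] * dnorm(x, mu[k], sigma[k])   # prior times likelihood for component k
  }
  unnorm / rowSums(unnorm)   # normalize each row (Bayes theorem)
}

# Each row sums to 1: a soft assignment of the point to the K components
responsibilities(c(-1.5, 0.2, 3.1), pik = c(0.3, 0.5, 0.2), mu = c(-2, 0, 3), sigma = c(1, 0.5, 1.5))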
Plan of Discussion
•  Next we look at
1.  How to get data from a mixture model
synthetically and then
2.  How to estimate the parameters from the data
Synthesizing data from mixture
•  Use ancestral sampling
–  Start with the lowest numbered node and draw a sample,
i.e., generate a sample of z, called ẑ
–  Move to the successor node and draw a sample given the parent value, etc.
•  Then generate a value for x from the conditional p(x | ẑ)
•  Samples from p(x,z) are plotted according to the value of x and colored
with the value of z (the complete data set)
•  Samples from the marginal p(x) are obtained by ignoring the values of z
(the incomplete data set)
[Figure: 500 points sampled from a mixture of three Gaussians, shown as the
complete data set (x, z) and the incomplete data set (x only)]
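A sketch in R (1-D mixture, hypothetical parameters) of ancestral sampling: first draw z from p(z), then x from p(x | z):

sample_gmm <- function(n, pik, mu, sigma) {
  z <- sample(seq_along(pik), size = n, replace = TRUE, prob = pik)  # z-hat ~ p(z)
  x <- rnorm(n, mean = mu[z], sd = sigma[z])                         # x ~ p(x | z-hat)
  data.frame(x = x, z = z)  # complete data (x, z); drop z to get the incomplete data set
}

set.seed(1)
samples <- sample_gmm(500, pik = c(0.3, 0.5, 0.2), mu = c(-2, 0, 3), sigma = c(1, 0.5, 1.5))
head(samples)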
Illustration of responsibilities
•  Evaluate for every data point
–  Posterior probability of each component
•  Responsibility γ(znk) is associated with data point xn
•  Color using proportions of red, blue and green ink
–  If for a data point γ(zn1) = 1 it is colored red
–  If for another point γ(zn2) = γ(zn3) = 0.5 it has equal blue and green
and will appear as cyan
Maximum Likelihood for GMM
•  We wish to model data set {x1,..,xN} using a mixture of Gaussians
(N items, each of dimension D)
•  Represent the data by an N x D matrix X whose nth row is given by xnT
•  Represent the N latent variables by an N x K matrix Z whose rows are znT
•  [Figure: Bayesian network for p(x,z), with z as the parent of x]
We are given N i.i.d. samples {xn} with unknown latent variables {zn}.
The mixture has parameters: mixing coefficients π, means µ and covariances Σ.
The goal is to estimate these three sets of parameters.
Likelihood Function for GMM
Mixture density function is
p(x) = ∑_z p(z) p(x | z) = ∑_{k=1}^{K} πk N(x | µk, Σk)
(the summation over z marginalizes the joint distribution)
Therefore the likelihood function is
p(X | π, µ, Σ) = ∏_{n=1}^{N} { ∑_{k=1}^{K} πk N(xn | µk, Σk) }
(the product is over the N i.i.d. samples)
Therefore the log-likelihood function is
ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} πk N(xn | µk, Σk) }
Which we wish to maximize
A more difficult problem than for a single Gaussian
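A minimal R sketch (1-D case, not from the slides) of this log-likelihood:

# ln p(X | pi, mu, Sigma) = sum_n ln { sum_k pi_k N(x_n | mu_k, sigma_k) }
gmm_loglik <- function(x, pik, mu, sigma) {
  dens <- matrix(0, nrow = length(x), ncol = length(pik))
  for (k in seq_along(pik)) {
    dens[, k] <- pik[k] * dnorm(x, mu[k], sigma[k])
  }
  sum(log(rowSums(dens)))   # log of the inner sum over components, then sum over the N points
}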
Maximization of Log-Likelihood
ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} πk N(xn | µk, Σk) }
•  Goal is to estimate the three sets of
parameters π , µ, Σ
•  Before proceeding with the m.l.e., we briefly
mention two technical issues:
–  Problem of singularities
–  Identifiability of mixtures
Problem of Singularities with Gaussian mixtures
•  Consider a Gaussian mixture whose components have
covariance matrices Σk = σk²I
•  A data point that falls on a mean, µj = xn, contributes to the
likelihood function the term
N(xn | xn, σj²I) = 1 / ((2π)^{1/2} σj)
since the exponential factor exp(−||xn − µj||²/2σj²) = 1
•  As σj → 0 this term goes to infinity
•  Therefore maximization of the log-likelihood
ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} πk N(xn | µk, Σk) }
is not well-posed
–  One component collapses onto a single point and contributes an arbitrarily
large value, while the other components assign finite values
–  Does not happen with a single Gaussian
•  The multiplicative factors go to zero
–  Does not happen in the Bayesian approach
•  The problem is avoided using heuristics
–  e.g., resetting the mean or covariance of the collapsed component
Problem of Identifiability
A density p(x | θ) is identifiable if, whenever θ ≠ θ', there is an x for which p(x | θ) ≠ p(x | θ')
A K-component mixture will have a total of K!
equivalent solutions
–  Corresponding to K! ways of assigning K sets of
parameters to K components
•  E.g., for K=3 K!=6: 123, 132, 213, 231, 312, 321
–  For any given point in the space of parameter values
there will be a further K!-1 additional points all giving
exactly same distribution
•  However, any of the equivalent solutions is as
good as any other
[Figure: two ways of labeling the same three Gaussian subclasses (A, B, C)]
EM for Gaussian Mixtures
•  EM is a method for finding maximum likelihood
solutions for models with latent variables
•  Begin with log-likelihood function
ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} πk N(xn | µk, Σk) }
•  We wish to find π , µ, Σ that maximize this quantity
•  Take derivatives in turn w.r.t.
–  the means µk and set to zero
–  the covariance matrices Σk and set to zero
–  the mixing coefficients πk and set to zero
EM for GMM: Derivative wrt µk
•  Begin with log-likelihood function
ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} πk N(xn | µk, Σk) }
•  Take the derivative w.r.t. the means µk and set to zero
–  Making use of the exponential form of the Gaussian
–  Use the formulas: d/dx ln u = u'/u and d/dx e^u = e^u u'
–  We get
0 = ∑_{n=1}^{N} [ πk N(xn | µk, Σk) / ∑_j πj N(xn | µj, Σj) ] Σk^{−1} (xn − µk)
where Σk^{−1} is the inverse of the covariance matrix and the ratio
in brackets is γ(znk), the posterior probability
M.L.E. solution for Means
•  Multiplying by Σk (assuming non-singularity) gives
µk = (1/Nk) ∑_{n=1}^{N} γ(znk) xn
•  Where we have defined
Nk = ∑_{n=1}^{N} γ(znk)
–  Which is the effective number of points assigned to cluster k
•  The mean of the kth Gaussian component is thus the weighted mean of all
the points in the data set, where data point xn is weighted by the
posterior probability that component k was responsible for generating xn
M.L.E. solution for Covariance
•  Set the derivative w.r.t. Σk to zero
–  Making use of the m.l.e. solution for the covariance matrix of a
single Gaussian
Σk = (1/Nk) ∑_{n=1}^{N} γ(znk) (xn − µk)(xn − µk)T
–  Similar to the result for a single Gaussian fitted to the data set, but
with each data point weighted by the corresponding posterior probability
–  The denominator Nk is the effective number of points in the component
M.L.E. solution for Mixing Coefficients
•  Maximize ln p(X | π, µ, Σ) w.r.t. πk
–  Must take into account that mixing coefficients
sum to one
–  Achieved using a Lagrange multiplier and maximizing
ln p(X | π, µ, Σ) + λ ( ∑_{k=1}^{K} πk − 1 )
–  Setting the derivative w.r.t. πk to zero and solving gives
πk = Nk / N
Summary of m.l.e. expressions
•  GMM maximum likelihood estimates
Means:                 µk = (1/Nk) ∑_{n=1}^{N} γ(znk) xn
Covariance matrices:   Σk = (1/Nk) ∑_{n=1}^{N} γ(znk) (xn − µk)(xn − µk)T
Mixing coefficients:   πk = Nk / N
where                  Nk = ∑_{n=1}^{N} γ(znk)
All three are in terms of the responsibilities γ(znk)
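A sketch in R (1-D case, so Σk reduces to a variance) of these three update formulas, given the data vector x and an N x K matrix gam of responsibilities γ(znk); m_step is a hypothetical helper:

m_step <- function(x, gam) {
  Nk  <- colSums(gam)                                 # N_k = sum_n gamma(z_nk)
  mu  <- colSums(gam * x) / Nk                        # mu_k: responsibility-weighted means
  s2  <- colSums(gam * (outer(x, mu, "-"))^2) / Nk    # 1-D analogue of Sigma_k
  pik <- Nk / length(x)                               # pi_k = N_k / N
  list(mu = mu, sigma = sqrt(s2), pik = pik)
}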
EM Formulation
•  The results for µk , Σ k , π k are not closed form
solutions for the parameters
–  Since the responsibilities γ(znk) depend on
those parameters in a complex way
•  Results suggest an iterative solution
•  An instance of EM algorithm for the
particular case of GMM
Informal EM for GMM
•  First choose initial values for means,
covariances and mixing coefficients
•  Alternate between following two updates
–  Called E step and M step
•  In E step use current value of parameters to
evaluate posterior probabilities, or
responsibilities
•  In the M step use these posterior probabilities
to re-estimate means, covariances and mixing
coefficients
EM using Old Faithful
[Figure: EM applied to the Old Faithful data. Panels show the data points with
the initial mixture model, the initial E step (determine responsibilities),
the result after the first M step (re-evaluate parameters), and the fits
after 2, 5 and 20 cycles]
Comparison with K-Means
[Figure: K-means result vs. E-M result on the same data]
Animation of EM for Old Faithful Data
•  http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
Code in R
# initial parameter estimates (chosen to be deliberately bad)
theta <- list(tau    = c(0.5, 0.5),
              mu1    = c(2.8, 75),
              mu2    = c(3.6, 58),
              sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),
              sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2))
Practical Issues with EM
•  Takes many more iterations than K-means
•  Each cycle requires significantly more
computation
•  Common to run K-means first in order to
find suitable initialization
•  Covariance matrices can be initialized to the
covariances of clusters found by K-means (see the sketch after this list)
•  EM is not guaranteed to find global
maximum of log likelihood function
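A sketch in R (1-D case, using base R's kmeans) of the initialization heuristic mentioned above; init_from_kmeans is a hypothetical helper:

init_from_kmeans <- function(x, K) {
  km <- kmeans(x, centers = K, nstart = 10)
  list(mu    = as.numeric(km$centers),                     # cluster centres as initial means
       sigma = as.numeric(tapply(x, km$cluster, sd)),      # within-cluster sd as initial spread
       pik   = as.numeric(table(km$cluster)) / length(x))  # cluster fractions as mixing weights
}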
Summary of EM for GMM
•  Given a Gaussian mixture model
•  Goal is to maximize the likelihood function
w.r.t. the parameters (means, covariances
and mixing coefficients)
Step 1: Initialize the means µk, covariances Σk
and mixing coefficients πk, and evaluate the initial
value of the log-likelihood
EM continued
•  Step 2: E step: Evaluate responsibilities using
current parameter values
γ(znk) = πk N(xn | µk, Σk) / ∑_{j=1}^{K} πj N(xn | µj, Σj)
•  Step 3: M Step: Re-estimate parameters using
current responsibilities
µk^new = (1/Nk) ∑_{n=1}^{N} γ(znk) xn

Σk^new = (1/Nk) ∑_{n=1}^{N} γ(znk) (xn − µk^new)(xn − µk^new)T

πk^new = Nk / N

where Nk = ∑_{n=1}^{N} γ(znk)
EM Continued
•  Step 4: Evaluate the log likelihood
ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} πk N(xn | µk, Σk) }
–  And check for convergence of either parameters
or log likelihood
•  If convergence not satisfied return to Step 2
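A sketch in R of the full loop (Steps 1-4) for a 1-D mixture, assembling the pieces above; em_gmm is a hypothetical helper, not the slides' own code:

em_gmm <- function(x, K, max_iter = 100, tol = 1e-8) {
  # Step 1: initialize means, standard deviations and mixing coefficients
  mu <- sample(x, K); sigma <- rep(sd(x), K); pik <- rep(1 / K, K)
  ll_old <- -Inf
  for (iter in 1:max_iter) {
    # Step 2 (E step): responsibilities gamma(z_nk) under current parameters
    dens <- sapply(1:K, function(k) pik[k] * dnorm(x, mu[k], sigma[k]))
    gam  <- dens / rowSums(dens)
    # Step 3 (M step): re-estimate parameters with current responsibilities
    Nk    <- colSums(gam)
    mu    <- colSums(gam * x) / Nk
    sigma <- sqrt(colSums(gam * (outer(x, mu, "-"))^2) / Nk)
    pik   <- Nk / length(x)
    # Step 4: evaluate log-likelihood with new parameters and check convergence
    dens_new <- sapply(1:K, function(k) pik[k] * dnorm(x, mu[k], sigma[k]))
    ll <- sum(log(rowSums(dens_new)))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(mu = mu, sigma = sigma, pik = pik, loglik = ll)
}

# e.g., on the eruption durations of the built-in Old Faithful data:
# fit <- em_gmm(faithful$eruptions, K = 2)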