Latent Variable Models and Expectation Maximization

K-Means
The Expectation Maximization Algorithm
EM Example: Gaussian Mixture Models
Latent Variable Models and Expectation Maximization
Oliver Schulte - CMPT 726
Bishop PRML Ch. 9
Learning Parameters to Probability Distributions
• We discussed probabilistic models at length
• Assignment 3: given fully observed training data, setting parameters θi for Bayes nets is straight-forward
• However, in many settings not all variables are observed (labelled) in the training data: xi = (xi , hi )
• e.g. Speech recognition: have speech signals, but not phoneme labels
• e.g. Object recognition: have object labels (car, bicycle), but not part labels (wheel, door, seat)
• Unobserved variables are called latent variables
[Figure: output of a feature detector on two typical motorbike images; figs from Fergus et al.]
Latent Variable Models: Pros
• Statistically powerful, often good predictions. Many
applications:
• Learning with missing data.
• Clustering: “missing” cluster label for data points.
• Principal Component Analysis: data points are
generated in linear fashion from a small set of unobserved
components. (more later)
• Matrix Factorization, Recommender Systems:
• Assign users to unobserved “user types”, assign items to
unobserved “item types”.
• Use similarity between user type, item type to predict
preference of user for item.
• Winner of $1M Netflix challenge.
• If latent variables have an intuitive interpretation (e.g.,
“action movies”, “factors”), discovers new features.
Latent Variable Models: Cons
• From a user’s point of view, like a black box if latent
variables don’t have an intuitive interpretation.
• Statistically, hard to guarantee convergence to a correct
model with more data (the identifiability problem).
• Harder computationally, usually no closed form for
maximum likelihood estimates.
• However, the Expectation-Maximization algorithm provides
a general-purpose local search algorithm for learning
parameters in probabilistic models with latent variables.
Outline
K-Means
The Expectation Maximization Algorithm
EM Example: Gaussian Mixture Models
Unsupervised Learning
• We will start with an unsupervised learning (clustering) problem:
• Given a dataset {x1 , . . . , xN }, each xi ∈ RD , partition the dataset into K clusters
• Intuitively, a cluster is a group of points, which are close together and far from others
[Figure (a): scatter plot of the unlabelled 2-D toy dataset]
Distortion Measure
• Formally, introduce prototypes (or cluster centers) µk ∈ RD
• Use binary rnk , 1 if point n is in cluster k, 0 otherwise (1-of-K coding scheme again)
• Find {µk }, {rnk } to minimize distortion measure:
J = Σ_{n=1}^{N} Σ_{k=1}^{K} rnk ||xn − µk ||^2
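To make the objective concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that evaluates J when the assignments are stored as an index vector r rather than a binary matrix rnk:

```python
import numpy as np

def distortion(X, mu, r):
    """J = sum_n ||x_n - mu_{r[n]}||^2, where r[n] is the cluster index of point n."""
    return float(((X - mu[r]) ** 2).sum())
```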
Minimizing Distortion Measure
• Minimizing J directly is hard
J = Σ_{n=1}^{N} Σ_{k=1}^{K} rnk ||xn − µk ||^2
• However, two things are easy
• If we know µk , minimizing J wrt rnk
• If we know rnk , minimizing J wrt µk
• This suggests an iterative procedure
• Start with initial guess for µk
• Iteration of two steps:
• Minimize J wrt rnk
• Minimize J wrt µk
• Rinse and repeat until convergence
Determining Membership Variables
• Step 1 in an iteration of K-means is to minimize distortion measure J wrt cluster membership variables rnk
J = Σ_{n=1}^{N} Σ_{k=1}^{K} rnk ||xn − µk ||^2
• Terms for different data points xn are independent; for each data point set rnk to minimize
Σ_{k=1}^{K} rnk ||xn − µk ||^2
• Simply set rnk = 1 for the cluster center µk with smallest distance
Determining Cluster Centers
• Step 2: fix rnk , minimize J wrt the cluster centers µk
J = Σ_{k=1}^{K} Σ_{n=1}^{N} rnk ||xn − µk ||^2 (switch order of sums)
• So we can minimize wrt each µk separately
• Take derivative, set to zero:
Σ_{n=1}^{N} rnk (xn − µk ) = 0
⇔ µk = (Σ_n rnk xn ) / (Σ_n rnk )
i.e. the mean of the datapoints xn assigned to cluster k
K-means Algorithm
• Start with initial guess for µk
• Iteration of two steps:
• Minimize J wrt rnk
• Assign points to nearest cluster center
• Minimize J wrt µk
• Set cluster center as average of points in cluster
• Rinse and repeat until convergence
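As a sketch of the two-step procedure above (an illustrative implementation, not course code: random initialization from the data points and a simple no-change convergence check are my own choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate the two minimization steps of J until the centers stop moving."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial guess for cluster centers
    for _ in range(n_iters):
        # Step 1: minimize J wrt r_nk -- assign each point to its nearest center.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        r = dists.argmin(axis=1)
        # Step 2: minimize J wrt mu_k -- set each center to the mean of its assigned points.
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):
            break   # assignments and centers no longer change
        mu = new_mu
    return mu, r
```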
K-means example
[Figures (a)-(i): successive assignment and re-estimation steps of K-means on the toy dataset]
Next step doesn't change membership – stop
K-means Convergence
• Repeat steps until no change in cluster assignments
• For each step, value of J either goes down, or we stop
• Finite number of possible assignments of data points to
clusters, so we are guaranteed to converge eventually
• Note it may be a local minimum of J rather than the global minimum to which we converge
K-means Example - Image Segmentation
Original image
• K-means clustering on pixel colour values
• Pixels in a cluster are coloured by cluster mean
• Represent each pixel (e.g. 24-bit colour value) by a cluster
number (e.g. 4 bits for K = 10), compressed version
• This technique is known as vector quantization
• Represent vector (in this case from RGB, R3 ) as a single
discrete value
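A hedged sketch of this vector-quantization idea (assuming scikit-learn is available; the random array stands in for a real image):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for an H x W x 3 RGB image with values in [0, 1].
image = np.random.rand(120, 160, 3)

# Cluster the pixel colours: each pixel is a point in R^3.
pixels = image.reshape(-1, 3)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(pixels)

# Vector quantization: store 10 cluster-mean colours plus one small label per pixel,
# and reconstruct the compressed image by painting each pixel with its cluster mean.
compressed = km.cluster_centers_[km.labels_].reshape(image.shape)
```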
K-means Generalized: the set-up
Let’s generalize the idea. Suppose we have the following
set-up.
• X denotes all observed variables (e.g., data points).
• Z denotes all latent (hidden, unobserved) variables (e.g.,
cluster means).
• J(X, Z|θ), where J measures the “goodness” of an assignment of latent variable values given the data points and parameters θ.
• e.g., J = -dispersion measure.
• parameters = assignment of points to clusters.
• It’s easy to maximize J(X, Z|θ) wrt θ for fixed Z.
• It’s easy to maximize J(X, Z|θ) wrt Z for fixed θ.
K-means Generalized: The Algorithm
The fact that conditional maximization is simple suggests an
iterative algorithm.
1. Guess an initial value for latent variables Z.
2. Repeat until convergence:
2.1 Find best parameter values θ given the current guess for
the latent variables. Update the parameter values.
2.2 Find best value for latent variables Z given the current
parameter values. Update the latent variable values.
Outline
K-Means
The Expectation Maximization Algorithm
EM Example: Gaussian Mixture Models
EM Algorithm: The set-up
• We assume a probabilistic model, specifically the
complete-data likelihood function p(X, Z|θ).
• “Goodness” of the model is the log-likelihood ln p(X, Z|θ).
• Key difference: instead of guessing a single best value for
latent variables given current parameter settings, we use
the conditional distribution p(Z|X, θ^old ) over latent variables.
• Given a latent variable distribution, parameter values θ are
evaluated by taking the expected “goodness” ln p(X, Z|θ)
over all possible latent variable settings.
EM Algorithm: The procedure
1. Guess an initial parameter setting θ^old .
2. Repeat until convergence:
3. The E-step: Evaluate p(Z|X, θ^old ). (Ideally, find a closed form as a function of Z).
4. The M-step:
4.1 Evaluate the function Q(θ, θ^old ) = Σ_Z p(Z|X, θ^old ) ln p(X, Z|θ).
4.2 Maximize Q(θ, θ^old ) wrt θ. Update θ^old .
5. This procedure is guaranteed to increase the data log-likelihood ln p(X|θ) = ln Σ_Z p(X, Z|θ) at each step.
6. Therefore it converges to a local maximum of the log-likelihood.
More theoretical analysis in text.
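The procedure can be written as a generic skeleton. This is only a sketch: e_step, m_step and log_likelihood are hypothetical callbacks that a user would supply for a particular model.

```python
def em(X, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iters=200):
    """Generic EM loop: E-step computes p(Z | X, theta_old), M-step maximizes Q(theta, theta_old)."""
    prev_ll = -float("inf")
    for _ in range(max_iters):
        posterior = e_step(X, theta)    # E-step: distribution over latent variables Z
        theta = m_step(X, posterior)    # M-step: maximize the expected complete-data log-likelihood
        ll = log_likelihood(X, theta)   # ln p(X | theta); does not decrease across iterations
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```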
Outline
K-Means
The Expectation Maximization Algorithm
EM Example: Gaussian Mixture Models
Hard Assignment vs. Soft Assignment
• In the K-means algorithm, a hard assignment of points to clusters is made
• However, for points near the decision boundary, this may not be such a good idea
• Instead, we could think about making a soft assignment of points to clusters
[Figure (i): the converged K-means clustering with its hard boundary]
Gaussian Mixture Model
• The Gaussian mixture model (or mixture of Gaussians, MoG) models the data as a combination of Gaussians.
• a: constant density contours. b: marginal probability p(x). c: surface plot.
• Widely used general approximation for multi-modal distributions.
[Figure: panels (a)-(c) for a three-component mixture]
Gaussian Mixture Model
[Figure: panels (a) and (b) showing the sampled dataset]
• Above shows a dataset generated by drawing samples from three different Gaussians
Generative Model
[Figure: graphical model z → x, and panel (a) of the sampled data]
• The mixture of Gaussians is a generative model
• To generate a datapoint xn , we first generate a value for a discrete variable zn ∈ {1, . . . , K}
• We then generate a value xn ∼ N (x|µk , Σk ) for the corresponding Gaussian
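A small sketch of ancestral sampling from this generative model (the parameter values are arbitrary illustrations, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D mixture with K = 3 components.
pi = np.array([0.5, 0.3, 0.2])                           # mixing coefficients
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])    # component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])])  # component covariances

def sample_mog(n):
    """First draw z_n from pi, then x_n ~ N(mu_{z_n}, Sigma_{z_n})."""
    z = rng.choice(len(pi), size=n, p=pi)
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

X, z = sample_mog(500)
```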
Graphical Model
[Figure: plate notation: π → zn → xn , with µ, Σ → xn , plate over n = 1, . . . , N]
• Full graphical model using plate notation
• Note zn is a latent variable, unobserved
• Need to give conditional distributions p(zn ) and p(xn |zn )
• The one-of-K representation is helpful here: znk ∈ {0, 1}, zn = (zn1 , . . . , znK )
Graphical Model - Latent Component Variable
• Use a categorical distribution (the 1-of-K generalization of the Bernoulli) for p(zn )
• i.e. p(znk = 1) = πk
• Parameters to this distribution: {πk }
• Must have 0 ≤ πk ≤ 1 and Σ_{k=1}^{K} πk = 1
• p(zn ) = Π_{k=1}^{K} πk^{znk}
Graphical Model - Observed Variable
• Use a Gaussian distribution for p(xn |zn )
• Parameters to this distribution: {µk , Σk }
p(xn |znk = 1) = N (xn |µk , Σk )
p(xn |zn ) = Π_{k=1}^{K} N (xn |µk , Σk )^{znk}
Graphical Model - Joint distribution
• The full joint distribution is given by:
p(x, z) = Π_{n=1}^{N} p(zn ) p(xn |zn ) = Π_{n=1}^{N} Π_{k=1}^{K} πk^{znk} N (xn |µk , Σk )^{znk}
MoG Marginal over Observed Variables
• The marginal distribution p(xn ) for this model is:
p(xn ) = Σ_{zn} p(xn , zn ) = Σ_{zn} p(zn ) p(xn |zn ) = Σ_{k=1}^{K} πk N (xn |µk , Σk )
• A mixture of Gaussians
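Evaluating this marginal is straightforward; a minimal sketch (assuming SciPy, with pi, mus, Sigmas as in the sampling sketch above):

```python
from scipy.stats import multivariate_normal

def mog_density(x, pi, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas))
```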
MoG Conditional over Latent Variable
• To apply EM, we need the conditional distribution p(znk = 1|xn , θ), where θ are the model parameters.
• It is denoted by γ(znk ) and can be computed as:
γ(znk ) ≡ p(znk = 1|xn ) = p(znk = 1) p(xn |znk = 1) / Σ_{j=1}^{K} p(znj = 1) p(xn |znj = 1)
= πk N (xn |µk , Σk ) / Σ_{j=1}^{K} πj N (xn |µj , Σj )
• γ(znk ) is the responsibility of component k for datapoint n
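Computed for all data points at once, the responsibilities look like this (a sketch assuming SciPy; X is an N x D array, and pi, mus, Sigmas are as before):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    weighted = np.column_stack([pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sigma_k)
                                for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas)])
    return weighted / weighted.sum(axis=1, keepdims=True)
```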
EM for Gaussian Mixtures: E-step
• The complete-data log-likelihood is
ln p(X, Z|θ) = Σ_{n=1}^{N} Σ_{k=1}^{K} znk [ln πk + ln N (xn |µk , Σk )].
• E step: Calculate responsibilities using current parameters θ^old :
γ(znk ) = πk N (xn |µk , Σk ) / Σ_{j=1}^{K} πj N (xn |µj , Σj )
• The zn vectors assigning data point n to components are independent of each other.
• Therefore under the posterior distribution p(znk = 1|xn , θ) the expected value of znk is γ(znk ).
EM for Gaussian Mixtures: M-step
• M step: Because the component assignments are independent, the expected complete-data log-likelihood is obtained by replacing each znk with its expected value γ(znk ).
• So Q(θ, θ^old ) = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk )[ln πk + ln N (xn |µk , Σk )].
• Maximizing Q(θ, θ^old ) with respect to the model parameters is more or less straightforward.
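For reference, a sketch of evaluating Q(θ, θ^old) given the responsibilities (assuming SciPy; gamma is an N x K array such as the output of the responsibilities sketch above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_ll(X, gamma, pi, mus, Sigmas):
    """Q = sum_n sum_k gamma_nk [ln pi_k + ln N(x_n | mu_k, Sigma_k)]."""
    log_terms = np.column_stack([np.log(pi_k) + multivariate_normal.logpdf(X, mean=mu_k, cov=Sigma_k)
                                 for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas)])
    return float((gamma * log_terms).sum())
```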
EM for Gaussian Mixtures II
• Initialize parameters, then iterate:
• E step: Calculate responsibilities using current parameters
γ(znk ) = πk N (xn |µk , Σk ) / Σ_{j=1}^{K} πj N (xn |µj , Σj )
• M step: Re-estimate parameters using these γ(znk ):
Nk = Σ_{n=1}^{N} γ(znk )
µk = (1/Nk ) Σ_{n=1}^{N} γ(znk ) xn
Σk = (1/Nk ) Σ_{n=1}^{N} γ(znk )(xn − µk )(xn − µk )^T
πk = Nk / N
• Think of Nk as the effective number of points in component k.
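Putting the two steps together, here is an illustrative sketch of the M-step updates and the full EM loop (it reuses the responsibilities function sketched earlier; a fixed iteration count stands in for a proper convergence check):

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate {pi_k, mu_k, Sigma_k} from responsibilities gamma (an N x K array)."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)                                  # effective number of points per component
    mus = (gamma.T @ X) / Nk[:, None]                       # responsibility-weighted means
    Sigmas = np.empty((K, X.shape[1], X.shape[1]))
    for k in range(K):
        d = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * d).T @ d / Nk[k]   # responsibility-weighted covariances
    pi = Nk / N                                             # mixing coefficients
    return pi, mus, Sigmas

def gmm_em(X, pi, mus, Sigmas, n_iters=50):
    """Alternate the E-step (responsibilities) and the M-step above."""
    for _ in range(n_iters):
        gamma = responsibilities(X, pi, mus, Sigmas)   # E-step, as sketched earlier
        pi, mus, Sigmas = m_step(X, gamma)             # M-step
    return pi, mus, Sigmas
```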
MoG EM - Example
[Figures (a)-(f): EM iterations on the toy dataset]
• (a) Same initialization as with K-means before
• Often, K-means is actually used to initialize EM
• (b) Calculate responsibilities γ(znk )
• (c) Calculate model parameters {πk , µk , Σk } using these responsibilities
• (d) Iteration 2
• (e) Iteration 5
• (f) Iteration 20 - converged
EM - Summary
• EM finds a local maximum of the likelihood
p(X|θ) = Σ_Z p(X, Z|θ)
• Iterates two steps:
• E step “fills in” the missing variables Z (calculates their distribution)
• M step maximizes the expected complete log likelihood (expectation wrt E step distribution)
Conclusion
• Readings: Ch. 9.1, 9.2, 9.4
• K-means clustering
• Gaussian mixture model
• What about K?
• Model selection: either cross-validation or Bayesian version
(average over all values for K)
• Expectation-maximization, a general method for learning
parameters of models when not all variables are observed