CS B553: ALGORITHMS FOR
OPTIMIZATION AND LEARNING
Parameter Learning with Hidden Variables &
Expectation Maximization
AGENDA
Learning probability distributions from data in
the setting of known structure but missing data
Expectation-maximization (EM) algorithm
BASIC PROBLEM
Given a dataset D={x[1],…,x[M]} and a Bayesian
model over observed variables X and hidden
(latent) variables Z
Fit the distribution P(X,Z) to the data
Interpretation: each example x[m] is an
incomplete view of the “underlying” sample
(x[m],z[m])
[Figure: two-node network, hidden Z with an edge to observed X]
APPLICATIONS
Clustering in data mining
Dimensionality reduction
Latent psychological traits (e.g., intelligence,
personality)
Document classification
Human activity recognition
HIDDEN VARIABLES CAN YIELD MORE
PARSIMONIOUS MODELS
Hidden variables induce conditional independences
Without Z, the observables become fully dependent
[Figure: left, naive Bayes network Z → X1, X2, X3, X4, with 1 + 4·2 = 9 parameters; right, the fully connected network over X1, …, X4 obtained by removing Z, with 1 + 2 + 4 + 8 = 15 parameters]
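Spelling out the parameter tally for binary variables (the counts shown in the figure; each CPT over a binary variable with k binary parents needs 2^k free parameters):

```latex
\begin{align*}
\text{with } Z:&\quad \underbrace{1}_{P(Z)} + \underbrace{4 \times 2}_{P(X_i \mid Z),\; i = 1,\dots,4} = 9\\
\text{without } Z:&\quad \underbrace{1}_{P(X_1)} + \underbrace{2}_{P(X_2 \mid X_1)} + \underbrace{4}_{P(X_3 \mid X_1, X_2)} + \underbrace{8}_{P(X_4 \mid X_1, X_2, X_3)} = 15
\end{align*}
```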
GENERATING MODEL
[Figure: unrolled network; θ_Z is the parent of z[1], …, z[M], and θ_X|Z is the parent of x[1], …, x[M], with each z[m] → x[m]; the CPTs of the z[m] are identical and given (by θ_Z), and likewise the CPTs of the x[m] (by θ_X|Z)]
EXAMPLE: DISCRETE VARIABLES
Categorical distribution for Z, given by parameters $\theta_Z$:
$P(z[i] \mid \theta_Z) = \mathrm{Categorical}(\theta_Z)$
Categorical distribution for X, given by parameters $\theta_{X|Z}$:
$P(x[i] \mid z[i], \theta_{X|Z}) = \mathrm{Categorical}(\theta_{X \mid z[i]})$
(in other words, z[i] multiplexes between Categorical distributions)
[Figure: same unrolled network as above]
MAXIMUM LIKELIHOOD ESTIMATION
Approach: find values of $\theta = (\theta_Z, \theta_{X|Z})$ and $\mathcal{D}_Z = (z[1], \dots, z[M])$ that maximize the likelihood of the data
$L(\theta, \mathcal{D}_Z; \mathcal{D}) = P(\mathcal{D}, \mathcal{D}_Z \mid \theta)$
Find $\arg\max_{\theta, \mathcal{D}_Z} L(\theta, \mathcal{D}_Z; \mathcal{D})$
MARGINAL LIKELIHOOD ESTIMATION
Approach: find values of $\theta = (\theta_Z, \theta_{X|Z})$ that maximize the likelihood of the data without assuming values of $\mathcal{D}_Z = (z[1], \dots, z[M])$
$L(\theta; \mathcal{D}) = \sum_{\mathcal{D}_Z} P(\mathcal{D}, \mathcal{D}_Z \mid \theta)$
Find $\arg\max_\theta L(\theta; \mathcal{D})$
(A partially Bayesian approach)
COMPUTATIONAL CHALLENGES
$P(\mathcal{D} \mid \theta, \mathcal{D}_Z)$ and $P(\mathcal{D}, \mathcal{D}_Z \mid \theta)$ are easy to evaluate, but…
Maximum likelihood $\arg\max L(\theta, \mathcal{D}_Z; \mathcal{D})$:
optimizing over the M assignments to Z ($|\mathrm{Val}(Z)|^M$ possible joint assignments) as well as the continuous parameters
Maximum marginal likelihood $\arg\max L(\theta; \mathcal{D})$:
optimizing locally over the continuous parameters, but the objective requires summing over assignments to Z
EXPECTATION MAXIMIZATION FOR ML
Idea: use a coordinate ascent approach
$\arg\max_{\theta, \mathcal{D}_Z} L(\theta, \mathcal{D}_Z; \mathcal{D}) = \arg\max_\theta \max_{\mathcal{D}_Z} L(\theta, \mathcal{D}_Z; \mathcal{D})$
Step 1: finding $\mathcal{D}_Z^* = \arg\max_{\mathcal{D}_Z} L(\theta, \mathcal{D}_Z; \mathcal{D})$ is easy given a fixed θ
Step 2: set $Q(\theta) = L(\theta, \mathcal{D}_Z^*; \mathcal{D})$; finding $\theta^* = \arg\max_\theta Q(\theta)$ is easy given that $\mathcal{D}_Z$ is fixed (fully observed ML parameter estimation)
Repeat steps 1 and 2 until convergence
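To make the two steps concrete, here is a minimal hard-EM sketch in Python for the two-variable example developed on the following slides (hidden Z ∈ {1,2}; binary X1, X2). The names (joint, assign, th_z, th_x) are mine, and the initialization matches the slides' first guess; hard EM is sensitive to initialization, and ties in step 1 are broken arbitrarily here, so the trajectory need not match the slides' exactly.

```python
# Minimal hard-EM sketch for a naive-Bayes model Z -> (X1, X2) with
# hidden Z in {1, 2} and binary X1, X2. Maximizes L(theta, DZ; D) by
# coordinate ascent: step 1 assigns each z[m], step 2 refits theta.

# Dataset from the example slides: counts of each observed (x1, x2).
counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}

def joint(x, z, th_z, th_x):
    """P(x, z | theta): P(z) times prod_i P(xi | z)."""
    p = th_z if z == 1 else 1.0 - th_z
    for i, xi in enumerate(x):
        q = th_x[i][z]                    # P(Xi = 1 | Z = z)
        p *= q if xi == 1 else 1.0 - q
    return p

# Initial guess (the slides' first parameter estimate).
th_z = 0.5
th_x = [{1: 0.4, 2: 0.3},                # P(X1 = 1 | Z = k)
        {1: 0.7, 2: 0.6}]                # P(X2 = 1 | Z = k)

for _ in range(100):
    # Step 1: hard assignments z*(x) = argmax_z P(x, z | theta)
    # (ties broken toward z = 1 here, which is arbitrary).
    assign = {x: max((1, 2), key=lambda z: joint(x, z, th_z, th_x))
              for x in counts}
    # Step 2: fully observed ML estimation on the completed data.
    M = sum(counts.values())
    M_z = {k: sum(c for x, c in counts.items() if assign[x] == k)
           for k in (1, 2)}
    new_th_z = M_z[1] / M
    new_th_x = [{k: sum(c for x, c in counts.items()
                        if assign[x] == k and x[i] == 1) / max(M_z[k], 1)
                 for k in (1, 2)} for i in range(2)]
    if (new_th_z, new_th_x) == (th_z, th_x):
        break                             # assignments and parameters stable
    th_z, th_x = new_th_z, new_th_x

print(th_z, th_x)
```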
EXAMPLE: CORRELATED VARIABLES
[Figure: left, the unrolled network (θ_Z → z[1], …, z[M]; θ_X1|Z → x1[1], …, x1[M]; θ_X2|Z → x2[1], …, x2[M]; each z[m] → x1[m], x2[m]); right, the equivalent plate notation (a plate of size M containing z, x1, x2, with parameter nodes θ_Z, θ_X1|Z, θ_X2|Z outside)]
EXAMPLE: CORRELATED VARIABLES
[Figure: plate notation as above]
Suppose there are 2 types:
1. X1 ≠ X2, chosen at random
2. X1,X2 = 1,1 with 90% chance, 0,0 otherwise
Type 1 is drawn 75% of the time
X Dataset (counts over 1000 samples):
• (1,1): 222
• (1,0): 382
• (0,1): 364
• (0,0): 32
EXAMPLE: CORRELATED VARIABLES
Setup, dataset, and plate notation as above
Parameter estimates (initial guess):
θ_Z = 0.5
θ_X1|Z=1 = 0.4,  θ_X1|Z=2 = 0.3
θ_X2|Z=1 = 0.7,  θ_X2|Z=2 = 0.6
EXAMPLE: CORRELATED VARIABLES
Setup, dataset, and plate notation as above
Step 1: estimated Z's
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2
Parameter estimates (unchanged):
θ_Z = 0.5
θ_X1|Z=1 = 0.4,  θ_X1|Z=2 = 0.3
θ_X2|Z=1 = 0.7,  θ_X2|Z=2 = 0.6
EXAMPLE: CORRELATED VARIABLES
Setup, dataset, and plate notation as above
Estimated Z's (unchanged):
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2
Step 2: updated parameter estimates
θ_Z = 0.604
θ_X1|Z=1 = 1,  θ_X1|Z=2 = 0
θ_X2|Z=1 = 0.368,  θ_X2|Z=2 = 0.919
EXAMPLE: CORRELATED VARIABLES
Setup, dataset, and plate notation as above
Estimated Z's:
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2
Parameter estimates:
θ_Z = 0.604
θ_X1|Z=1 = 1,  θ_X1|Z=2 = 0
θ_X2|Z=1 = 0.368,  θ_X2|Z=2 = 0.919
Converged (true ML estimate)
EXAMPLE: CORRELATED VARIABLES
[Figure: plate notation with θ_Z → z and θ_Xi|Z → xi, i = 1, …, 4, inside a plate of size M]
X Dataset (counts; rows: x1,x2; columns: x3,x4):
          0,0    0,1    1,0    1,1
  0,0     115    142     20     47
  0,1      32     16     37     75
  1,0      12    117     39     58
  1,1     133     92     45     20
Random initial guess:
θ_Z = 0.44
θ_X1|Z=1 = 0.97,  θ_X1|Z=2 = 0.07
θ_X2|Z=1 = 0.21,  θ_X2|Z=2 = 0.97
θ_X3|Z=1 = 0.87,  θ_X3|Z=2 = 0.71
θ_X4|Z=1 = 0.57,  θ_X4|Z=2 = 0.03
Log likelihood: -5176
EXAMPLE: E STEP
X dataset and plate notation as above
Random initial guess (parameters unchanged):
θ_Z = 0.44
θ_X1|Z=1 = 0.97,  θ_X1|Z=2 = 0.07
θ_X2|Z=1 = 0.21,  θ_X2|Z=2 = 0.97
θ_X3|Z=1 = 0.87,  θ_X3|Z=2 = 0.71
θ_X4|Z=1 = 0.57,  θ_X4|Z=2 = 0.03
Log likelihood: -4401
Z Assignments (rows: x1,x2; columns: x3,x4):
          0,0   0,1   1,0   1,1
  0,0      2     1     2     1
  0,1      2     2     2     2
  1,0      1     1     1     1
  1,1      2     1     1     1
EXAMPLE: M STEP
X dataset and plate notation as above
Current estimates (updated from the assignments):
θ_Z = 0.43
θ_X1|Z=1 = 0.67,  θ_X1|Z=2 = 0.31
θ_X2|Z=1 = 0.27,  θ_X2|Z=2 = 0.68
θ_X3|Z=1 = 0.37,  θ_X3|Z=2 = 0.31
θ_X4|Z=1 = 0.83,  θ_X4|Z=2 = 0.21
Log likelihood: -3033
Z Assignments (unchanged; rows: x1,x2; columns: x3,x4):
          0,0   0,1   1,0   1,1
  0,0      2     1     2     1
  0,1      2     2     2     2
  1,0      1     1     1     1
  1,1      2     1     1     1
EXAMPLE: E STEP
X dataset and plate notation as above
Current estimates (unchanged):
θ_Z = 0.43
θ_X1|Z=1 = 0.67,  θ_X1|Z=2 = 0.31
θ_X2|Z=1 = 0.27,  θ_X2|Z=2 = 0.68
θ_X3|Z=1 = 0.37,  θ_X3|Z=2 = 0.31
θ_X4|Z=1 = 0.83,  θ_X4|Z=2 = 0.21
Log likelihood: -2965
Z Assignments (updated; rows: x1,x2; columns: x3,x4):
          0,0   0,1   1,0   1,1
  0,0      2     1     2     1
  0,1      2     2     2     1
  1,0      1     1     1     1
  1,1      2     1     2     1
EXAMPLE: E STEP
X dataset and plate notation as above
Current estimates:
θ_Z = 0.40
θ_X1|Z=1 = 0.56,  θ_X1|Z=2 = 0.45
θ_X2|Z=1 = 0.31,  θ_X2|Z=2 = 0.66
θ_X3|Z=1 = 0.40,  θ_X3|Z=2 = 0.26
θ_X4|Z=1 = 0.92,  θ_X4|Z=2 = 0.04
Log likelihood: -2859
Z Assignments (rows: x1,x2; columns: x3,x4):
          0,0   0,1   1,0   1,1
  0,0      2     1     2     1
  0,1      2     2     2     2
  1,0      1     1     1     1
  1,1      2     1     1     1
EXAMPLE: LAST E-M STEP
X dataset and plate notation as above
Current estimates:
θ_Z = 0.43
θ_X1|Z=1 = 0.51,  θ_X1|Z=2 = 0.53
θ_X2|Z=1 = 0.36,  θ_X2|Z=2 = 0.57
θ_X3|Z=1 = 0.35,  θ_X3|Z=2 = 0.33
θ_X4|Z=1 = 1,  θ_X4|Z=2 = 0
Log likelihood: -2683
Z Assignments (rows: x1,x2; columns: x3,x4):
          0,0   0,1   1,0   1,1
  0,0      2     1     2     1
  0,1      2     1     2     1
  1,0      2     1     2     1
  1,1      2     1     2     1
(With θ_X4|Z=1 = 1 and θ_X4|Z=2 = 0, the assignment is determined entirely by x4: examples with x4 = 1 go to type 1, the rest to type 2)
PROBLEM: MANY LOCAL MINIMA
Flipping Z assignments causes large shifts in
likelihood, leading to a poorly behaved energy
landscape!
Solution: EM using the marginal likelihood
formulation
“Soft” EM
(This is the typical form of the EM algorithm)
EXPECTATION MAXIMIZATION FOR MML
$\arg\max_\theta L(\theta; \mathcal{D}) = \arg\max_\theta E_{\mathcal{D}_Z \mid \mathcal{D}, \theta}\,[L(\theta; \mathcal{D}_Z, \mathcal{D})]$
Do $\arg\max_\theta E_{\mathcal{D}_Z \mid \mathcal{D}, \theta}\,[\log L(\theta; \mathcal{D}_Z, \mathcal{D})]$ instead (justified later)
Step 1: given the current fixed $\theta^t$, find $P(\mathcal{D}_Z \mid \theta^t, \mathcal{D})$, i.e., compute a distribution over each Z[i]
Step 2: use these probabilities in the expectation $Q(\theta) = E_{\mathcal{D}_Z \mid \mathcal{D}, \theta^t}[\log L(\theta; \mathcal{D}_Z, \mathcal{D})]$, and find $\max_\theta Q(\theta)$: fully observed, weighted ML parameter estimation
Repeat steps 1 (expectation) and 2 (maximization) until convergence
E STEP IN DETAIL
Ultimately, want to maximize $Q(\theta \mid \theta^t) = E_{\mathcal{D}_Z \mid \mathcal{D}, \theta^t}[\log L(\theta; \mathcal{D}_Z, \mathcal{D})]$ over θ
$Q(\theta \mid \theta^t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t)\, \log P(x[m], z[m] \mid \theta)$
The E step computes the terms
$w_{m,z}(\theta^t) = P(Z[m] = z \mid \mathcal{D}, \theta^t)$
over all examples m and all $z \in \mathrm{Val}(Z)$
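As a minimal sketch for the running binary model (Z ∈ {1,2}), the E step just normalizes the joint P(x, z | θ) over z; the parameter values in the demo call are illustrative:

```python
# E step for one example x: w[z] = P(Z = z | x, theta), obtained by
# computing the joint P(x, z | theta) and normalizing over z.

def e_step_weights(x, th_z, th_x):
    """x: tuple of 0/1 observations; returns {z: P(Z = z | x, theta)}."""
    joint = {}
    for z in (1, 2):
        p = th_z if z == 1 else 1.0 - th_z     # P(Z = z)
        for i, xi in enumerate(x):
            q = th_x[i][z]                     # P(Xi = 1 | Z = z)
            p *= q if xi == 1 else 1.0 - q
        joint[z] = p
    total = sum(joint.values())
    return {z: p / total for z, p in joint.items()}

# Illustrative call: posterior over Z for the observation x = (1, 0).
print(e_step_weights((1, 0), th_z=0.5,
                     th_x=[{1: 0.4, 2: 0.3}, {1: 0.7, 2: 0.6}]))
```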
M STEP IN DETAIL
$\arg\max_\theta Q(\theta \mid \theta^t) = \arg\max_\theta \sum_m \sum_z w_{m,z}(\theta^t)\, \log P(x[m], z[m]{=}z \mid \theta)$
$= \arg\max_\theta \prod_m \prod_z P(x[m], z[m]{=}z \mid \theta)^{\,w_{m,z}(\theta^t)}$
This is weighted ML
Each completion z[m] = z is interpreted as being observed $w_{m,z}(\theta^t)$ times
Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can be adapted easily to the weighted case
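And a matching minimal sketch of the M step, assuming `weights[m]` is the dict {z: w_{m,z}} produced by the E step for example m; it implements the expected-count formulas derived on the next two slides:

```python
# M step: weighted ML estimation from E-step weights. `data` is a list
# of observed binary tuples; `weights[m]` is {z: w_{m,z}} for example m.
# Updates: theta_Z = Mbar[z=1] / M, theta_{Xi|z=k} = Mbar[xi=1, z=k] / Mbar[z=k].

def m_step(data, weights, n_vars):
    # Expected counts Mbar[z=k] = sum_m w_{m,k}
    M_z = {k: sum(w[k] for w in weights) for k in (1, 2)}
    th_z = M_z[1] / (M_z[1] + M_z[2])
    # Expected counts Mbar[xi=1, z=k] = sum_m w_{m,k} * I[xi[m] = 1]
    th_x = [{k: sum(w[k] for x, w in zip(data, weights) if x[i] == 1) / M_z[k]
             for k in (1, 2)}
            for i in range(n_vars)]
    return th_z, th_x
```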
EXAMPLE: BERNOULLI PARAMETER FOR Z
$\theta_Z^* = \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log P(x[m], z[m]{=}z \mid \theta_Z)$
$= \arg\max_{\theta_Z} \sum_m \sum_z w_{m,z} \log\big(\mathbb{I}[z{=}1]\,\theta_Z + \mathbb{I}[z{=}0]\,(1 - \theta_Z)\big)$  (terms not involving $\theta_Z$ drop out)
$= \arg\max_{\theta_Z} \big[\log(\theta_Z) \textstyle\sum_m w_{m,z=1} + \log(1 - \theta_Z) \sum_m w_{m,z=0}\big]$
$\Rightarrow \theta_Z^* = \sum_m w_{m,z=1} \,/\, \sum_m (w_{m,z=1} + w_{m,z=0})$
"Expected counts": $\bar{M}_{\theta^t}[z] = \sum_m w_{m,z}(\theta^t)$
Express $\theta_Z^* = \bar{M}_{\theta^t}[z{=}1] \,/\, \bar{M}_{\theta^t}[\cdot]$
EXAMPLE: BERNOULLI PARAMETERS FOR Xi | Z
$\theta^*_{X_i|z=k} = \arg\max_{\theta_{X_i|z=k}} \sum_m w_{m,z=k} \log P(x[m], z[m]{=}k \mid \theta_{X_i|z=k})$
$= \arg\max_{\theta_{X_i|z=k}} \sum_m \sum_z w_{m,z} \log\big(\mathbb{I}[x_i[m]{=}1, z{=}k]\,\theta_{X_i|z=k} + \mathbb{I}[x_i[m]{=}0, z{=}k]\,(1 - \theta_{X_i|z=k})\big)$
= … (similar derivation to the previous slide)
$\Rightarrow \theta^*_{X_i|z=k} = \bar{M}_{\theta^t}[x_i{=}1, z{=}k] \,/\, \bar{M}_{\theta^t}[z{=}k]$
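Putting the E and M steps together: a self-contained soft-EM sketch for the four-variable example on the next slide. The random initialization is illustrative rather than the slides' exact starting point, so the fitted values and final log likelihood need not match the slides' numbers.

```python
# Soft EM for the naive-Bayes model Z -> (X1, ..., X4), Z in {1, 2}.
# E step: w[m][z] = P(z | x[m], theta); M step: weighted ML via
# expected counts. (Expanding the count table into one row per example
# is wasteful but keeps the code close to the formulas.)
import math
import random

# Count table from the example slides: table[(x1,x2)][(x3,x4)] = count.
table = {
    (0, 0): {(0, 0): 115, (0, 1): 142, (1, 0): 20, (1, 1): 47},
    (0, 1): {(0, 0): 32,  (0, 1): 16,  (1, 0): 37, (1, 1): 75},
    (1, 0): {(0, 0): 12,  (0, 1): 117, (1, 0): 39, (1, 1): 58},
    (1, 1): {(0, 0): 133, (0, 1): 92,  (1, 0): 45, (1, 1): 20},
}
data = [a + b for a, row in table.items()
              for b, c in row.items() for _ in range(c)]

random.seed(0)
th_z = random.random()                         # P(Z = 1)
th_x = [{k: random.random() for k in (1, 2)}   # P(Xi = 1 | Z = k)
        for _ in range(4)]

def posterior(x):
    """Return ({z: P(z | x, theta)}, P(x | theta))."""
    joint = {}
    for z in (1, 2):
        p = th_z if z == 1 else 1.0 - th_z
        for i, xi in enumerate(x):
            p *= th_x[i][z] if xi else 1.0 - th_x[i][z]
        joint[z] = p
    s = joint[1] + joint[2]
    return {z: p / s for z, p in joint.items()}, s

for _ in range(200):
    # E step: soft weights, accumulating the log marginal likelihood.
    weights, ll = [], 0.0
    for x in data:
        w, s = posterior(x)
        weights.append(w)
        ll += math.log(s)
    # M step: weighted ML from expected counts.
    M_z = {k: sum(w[k] for w in weights) for k in (1, 2)}
    th_z = M_z[1] / len(data)
    th_x = [{k: sum(w[k] for x, w in zip(data, weights) if x[i]) / M_z[k]
             for k in (1, 2)} for i in range(4)]

print(round(ll), th_z, th_x)
```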
EM ON PRIOR EXAMPLE (100 ITERATIONS)
X dataset and plate notation as above
Final estimates:
θ_Z = 0.49
θ_X1|Z=1 = 0.64,  θ_X1|Z=2 = 0.38
θ_X2|Z=1 = 0.88,  θ_X2|Z=2 = 0.00
θ_X3|Z=1 = 0.41,  θ_X3|Z=2 = 0.27
θ_X4|Z=1 = 0.46,  θ_X4|Z=2 = 0.68
Log likelihood: -2833
P(Z = 2 | x) (rows: x1,x2; columns: x3,x4):
          0,0    0,1    1,0    1,1
  0,0     0.90   0.95   0.84   0.93
  0,1     0.00   0.00   0.00   0.00
  1,0     0.76   0.89   0.64   0.82
  1,1     0.00   0.00   0.00   0.00
CONVERGENCE
In general, there is no way to tell a priori how fast EM will converge
Soft EM is usually slower than hard EM
Still runs into local minima, but has more opportunities to coordinate parameter adjustments
[Figure: plot of log likelihood vs. iteration count (1 to ~200)]
WHY DOES IT WORK?
Why are we optimizing $Q(\theta \mid \theta^t) = \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t)\, \log P(x[m], z[m] \mid \theta)$
rather than the true marginalized likelihood
$L(\theta; \mathcal{D}) = \prod_m \sum_{z[m]} P(x[m], z[m] \mid \theta)$?
Can prove that:
The log likelihood is increased at every step
A stationary point of $\arg\max_\theta E_{\mathcal{D}_Z \mid \mathcal{D}, \theta}[L(\theta; \mathcal{D}_Z, \mathcal{D})]$ is a stationary point of $\log L(\theta; \mathcal{D})$
see K&F pp. 882-884
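A standard condensed sketch of the first claim (not the slides' full proof; see K&F for details): for any distribution q over the completions $\mathcal{D}_Z$,

```latex
\log P(\mathcal{D} \mid \theta)
  = E_q\!\left[\log P(\mathcal{D}, \mathcal{D}_Z \mid \theta)\right] + H(q)
    + \mathrm{KL}\!\left(q \,\big\|\, P(\mathcal{D}_Z \mid \mathcal{D}, \theta)\right)
  \;\ge\; E_q\!\left[\log P(\mathcal{D}, \mathcal{D}_Z \mid \theta)\right] + H(q)
```

Choosing $q = P(\mathcal{D}_Z \mid \mathcal{D}, \theta^t)$ makes the bound tight at $\theta = \theta^t$, and the θ-dependent part of the bound is exactly $Q(\theta \mid \theta^t)$; so any θ that improves Q cannot decrease the log likelihood.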
GAUSSIAN CLUSTERING USING EM
One of the first uses of EM, and still a widely used approach
Finding good starting points: the k-means algorithm (hard assignment)
Handling degeneracies: regularization
(a sketch combining these follows)
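As a rough sketch (names and defaults are mine, not a library API): EM for a spherical Gaussian mixture, with a k-means-style initialization for starting points and a variance floor as a simple guard against degeneracies (a component's variance collapsing onto a single data point).

```python
# EM for a K-component spherical Gaussian mixture: pi[k] = P(z = k),
# component k ~ N(mu[k], var[k] * I). Initialized with a few rounds of
# k-means (hard assignment); var_floor regularizes against collapse.
import numpy as np

def gmm_em(X, K, iters=100, var_floor=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    M, d = X.shape

    # k-means-style initialization for the means (hard assignments).
    mu = X[rng.choice(M, size=K, replace=False)]
    for _ in range(10):
        z = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    pi = np.bincount(z, minlength=K) / M
    var = np.full(K, X.var())

    for _ in range(iters):
        # E step: responsibilities w[m, k] = P(z = k | x[m], theta).
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)          # (M, K)
        logp = (np.log(pi + 1e-12)
                - 0.5 * d * np.log(2 * np.pi * var)
                - sq / (2 * var))
        logp -= logp.max(axis=1, keepdims=True)                 # stability
        w = np.exp(logp)
        w /= w.sum(axis=1, keepdims=True)
        # M step: weighted ML updates, with a variance floor.
        Nk = w.sum(axis=0) + 1e-12
        pi = Nk / M
        mu = (w.T @ X) / Nk[:, None]
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = np.maximum((w * sq).sum(axis=0) / (d * Nk), var_floor)
    return pi, mu, var

# Usage on synthetic data with two well-separated clusters:
X = np.vstack([np.random.randn(200, 2) + 3, np.random.randn(200, 2) - 3])
print(gmm_em(X, K=2))
```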
RECAP
Learning with hidden variables
Hidden variables are typically categorical