I529: Machine Learning in Bioinformatics (Spring 2013)
Expectation-Maximization (EM)
algorithm
Yuzhen Ye
School of Informatics and Computing
Indiana University, Bloomington
Spring 2013
Contents
Introduce the EM algorithm using a coin-flipping experiment
Formal definition of the EM algorithm
Two toy examples
– Coin toss with missing data
– Coin toss with hidden data
Applications of the EM algorithm
– Motif finding
– Baum-Welch algorithm
A coin-flipping experiment
θ: the probability of getting heads
θA: the probability of coin A landing on heads
θB: the probability of coin B landing on heads
Ref: What is the expectation maximization algorithm?
Nature Biotechnology 26, 897 - 899 (2008)
When the identities of the coins are unknown
Instead of picking the single best guess, the EM algorithm computes a probability for each possible completion of the missing data, using the current parameters
Main applications of the EM algorithm
When the data has missing values, due to problems with or limitations of the observation process
When optimizing the likelihood function is analytically intractable, but the likelihood can be simplified by assuming the existence of (and values for) additional missing (or hidden) parameters
The EM algorithm handles hidden data
Consider a model where, for observed data x and model
parameters θ:
p(x|θ)=∑z p(x,z|θ).
z is the “hidden” variable that is marginalized out
Finding the θ* that maximizes ∑z p(x,z|θ) directly is hard!
The EM algorithm reduces the difficult task of optimizing log P(x; θ) into a sequence of simpler optimization subproblems.
In each iteration, the EM algorithm receives parameters θ(t) and returns new parameters θ(t+1) such that p(x|θ(t+1)) ≥ p(x|θ(t)).
The EM algorithm
In each iteration the EM algorithm does the following.
E step: Calculate the function
Q_t(θ) = ∑z P(z|x; θ^(t)) log P(x, z; θ)
M step: Find θ^(t+1), the value of θ that maximizes the Q function:
θ^(t+1) = arg max_θ ∑z P(z|x; θ^(t)) log P(x, z; θ)
(The next iteration sets θ^(t) ← θ^(t+1) and repeats.)
The EM update rule
The EM update rule:
θ^(t+1) = arg max_θ ∑z P(z|x; θ^(t)) log P(x, z; θ)
The EM update rule maximizes the log-likelihood of a dataset expanded to contain all possible completions of the unobserved variables z, where each completion is weighted by its posterior probability P(z|x; θ^(t)).
Compare the Q function and the g function
Q_t(θ) = ∑z P(z|x; θ^(t)) log P(x, z; θ)
g_t(θ) = ∑z P(z|x; θ^(t)) log [ P(x, z; θ) / P(z|x; θ^(t)) ]
Convergence of the EM algorithm
[Fig. 1: Convergence of the EM algorithm.]
Fig. 1 demonstrates the convergence of the EM algorithm. Starting from initial parameters θ^(t), the E-step constructs a function g_t that lower-bounds the objective function log P(x; θ), i.e., g_t(θ) ≤ log P(x; θ), with g_t(θ^(t)) = log P(x; θ^(t)). In the M-step, θ^(t+1) is computed as the maximum of g_t. In the next E-step, a new lower bound g_{t+1} is constructed; maximization of g_{t+1} in the next M-step gives θ^(t+2), etc.
As the value of the lower bound g_t matches the objective function at θ^(t), it follows that
log P(x; θ^(t)) = g_t(θ^(t)) ≤ g_t(θ^(t+1)) ≤ log P(x; θ^(t+1))
So the objective function monotonically increases during each iteration of expectation maximization!
Templates for equations
A (phylogenetic) hidden Markov model M can be specified as θ = (S, p, A, b), where
– S = {S1, S2, ..., SM} is a set of states
– p = {p1, p2, ..., pM} is a set of associated phylogenetic models
– A = {a_jk} (1 ≤ j, k ≤ M) is a matrix of state-transition probabilities
– b = (b1, ..., bM) is a vector of state-initial probabilities
a_jk is the conditional probability of visiting state k at some site i given that state j is visited at site i−1; b_j is the probability that state j is visited first.
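To make the E-step/M-step alternation above concrete, here is a minimal sketch of the generic EM loop in Python. It is illustrative only and assumes the caller supplies an enumerable set of completions z, a posterior(z, x, theta) function for P(z|x; θ^(t)), and a model-specific m_step that maximizes Q_t; none of these names come from a particular library.

```python
# A minimal, illustrative sketch of the generic EM loop (not tied to any model).
def em(x, completions, theta0, posterior, m_step, n_iter=20):
    """Alternate the E-step (posterior weights) and the M-step (re-estimation)."""
    theta = theta0
    for _ in range(n_iter):
        # E-step: weight every possible completion z of the hidden data
        # by its posterior probability P(z | x; theta^(t)).
        weights = {z: posterior(z, x, theta) for z in completions}
        # M-step: choose theta^(t+1) that maximizes
        # Q_t(theta) = sum_z weights[z] * log P(x, z; theta).
        theta = m_step(weights, x)
    return theta
```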
Coin toss with missing data
• Given a coin with two possible outcomes: H (head) and T
(tail), with probabilities θ and 1-θ, respectively.
• The coin is tossed twice, but only the 1st outcome, T, is
seen. So the data is x = (T,*) (with incomplete data!)
• We wish to apply the EM algorithm to get parameters that
increase the likelihood of the data.
• Let the initial parameters be θ = ¼.
The EM algorithm at work
Observation: x = (T, *)
Hidden data: z1 = (T, T), z2 = (T, H)
Initial guess: θ^t = 1/4
Remember the EM update rule is Q_t(θ) = ∑z P(z|x; θ^t) log P(x, z; θ), where θ^t is the parameter for the current iteration (in this case, θ^t = 1/4).
Q_t(θ) = ∑z P(z|x; θ^t) log P(x, z; θ)
  = P(z1|x; θ^t) log P(x, z1; θ) + P(z2|x; θ^t) log P(x, z2; θ)
  = P(z1|x; θ^t) log P(z1; θ) + P(z2|x; θ^t) log P(z2; θ)   (note P(x, z; θ) = P(z; θ), since each completion z determines x)
  = P(z1|x; θ^t) log[θ^{n_H(z1)} (1−θ)^{n_T(z1)}] + P(z2|x; θ^t) log[θ^{n_H(z2)} (1−θ)^{n_T(z2)}]
  = P(z1|x; θ^t)[n_H(z1) log θ + n_T(z1) log(1−θ)] + P(z2|x; θ^t)[n_H(z2) log θ + n_T(z2) log(1−θ)]
  = [P(z1|x; θ^t) n_H(z1) + P(z2|x; θ^t) n_H(z2)] log θ + [P(z1|x; θ^t) n_T(z1) + P(z2|x; θ^t) n_T(z2)] log(1−θ)
where n_H(z1) is the number of heads in z1 (i.e., T, T), and so on.
The function n_H log θ + n_T log(1−θ) is maximized when θ = n_H / (n_H + n_T) (the MLE). So if we define
n_H = P(z1|x; θ^t) n_H(z1) + P(z2|x; θ^t) n_H(z2)
n_T = P(z1|x; θ^t) n_T(z1) + P(z2|x; θ^t) n_T(z2)
we have our solution!
Now let's calculate n_H and n_T.
P(x; θ^t) = P(z1; θ^t) + P(z2; θ^t) = (1−θ^t)^2 + (1−θ^t)θ^t = 3/4
P(z1|x; θ^t) = P(x, z1; θ^t) / P(x; θ^t) = (1−θ^t)^2 / P(x; θ^t) = (3/4 × 3/4) / (3/4) = 3/4
P(z2|x; θ^t) = 1 − P(z1|x; θ^t) = 1/4
And we have n_H(z1) = 0, n_T(z1) = 2, n_H(z2) = 1, and n_T(z2) = 1.
So we have
n_H = 3/4 × 0 + 1/4 × 1 = 1/4
n_T = 3/4 × 2 + 1/4 × 1 = 7/4
θ = n_H / (n_H + n_T) = (1/4) / (1/4 + 7/4) = 1/8
The EM algorithm at work: continued
Initial guess: θ = 1/4
After one iteration: θ = 1/8
…
Each iteration halves the estimate (θ^(t+1) = θ^t / 2), so θ approaches the optimal value θ = 0 (the MLE for the single observed tail), but the optimal parameter θ will never be reached by the EM algorithm!
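A small sketch (illustrative only, assuming the setup above: x = (T, *), with completions z1 = (T, T) and z2 = (T, H)) that reproduces these numbers; each printed estimate is half the previous one.

```python
# Single-coin example with one missing toss: x = (T, *), initial theta = 1/4.
def em_missing_toss(theta, n_iter=5):
    for t in range(n_iter):
        # joint probabilities of the data with each completion
        p_x_z1 = (1 - theta) * (1 - theta)   # P(x, z1; theta), z1 = (T, T)
        p_x_z2 = (1 - theta) * theta         # P(x, z2; theta), z2 = (T, H)
        p_x = p_x_z1 + p_x_z2                # P(x; theta)
        # E-step: posterior weights of the two completions
        w1, w2 = p_x_z1 / p_x, p_x_z2 / p_x
        # expected counts: z1 has 0 heads / 2 tails, z2 has 1 head / 1 tail
        n_h = w1 * 0 + w2 * 1
        n_t = w1 * 2 + w2 * 1
        # M-step: MLE of the expanded (weighted) dataset
        theta = n_h / (n_h + n_t)
        print(f"iteration {t + 1}: theta = {theta}")
    return theta

em_missing_toss(0.25)   # prints 0.125, 0.0625, ... (theta is halved each time)
```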
Coin toss with hidden data
Two-coin experiment with hidden data: two coins A and B, with parameters θ = {θA, θB}; compute the θ that maximizes the log likelihood of the observed data x = {x1, x2, ..., x5} (x1, x2, ..., x5 are independent observations).
E.g., initial parameters θ: θA = 0.60, θB = 0.50
For observation x1 (5 heads, 5 tails):
P(z1 = A | x; θ^t) = P(z1 = A | x1; θ^t)
  = P(z1 = A, x1; θ^t) / [P(z1 = A, x1; θ^t) + P(z1 = B, x1; θ^t)]
  = 0.6^5 × 0.4^5 / (0.6^5 × 0.4^5 + 0.5^5 × 0.5^5) = 0.58
Observation        nH   nT   P(A)   P(B)   Coin A nH   Coin A nT   Coin B nH   Coin B nT
x1: HTTTHHTHTH      5    5   0.58   0.42        2.9         2.9         2.1         2.1
x2: HHHHTHHHHH      9    1   0.84   0.16        7.6         0.8         1.4         0.2
x3: HTHHHHHTHH      8    2   0.81   0.19        6.4         1.6         1.6         0.4
x4: HTHTTTHHTT      4    6   0.25   0.75        1.0         1.5         3.0         4.5
x5: THHHTHHHTH      8    2   0.81   0.19        6.4         1.6         1.6         0.4
Totals                                        24.3H        8.4T        9.7H        7.6T
New parameter: θA = 24.3/(24.3+8.4) = 0.74, θB = 9.7/(9.7+7.6) = 0.56
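A hedged sketch of one E/M iteration for this two-coin setting, following the formulation in the cited Nature Biotechnology tutorial (each set of ten tosses comes from a single coin, with an equal prior on A and B); the posteriors and re-estimates it produces follow only from those assumptions, so they need not match the rounded values tabulated above exactly. All names are illustrative.

```python
# Illustrative sketch: one EM iteration for the two-coin experiment.
def two_coin_em_step(observations, theta_A, theta_B):
    heads_A = tails_A = heads_B = tails_B = 0.0
    for x in observations:
        h, t = x.count("H"), x.count("T")
        # E-step: posterior responsibility of each coin for this set of tosses
        like_A = theta_A ** h * (1 - theta_A) ** t
        like_B = theta_B ** h * (1 - theta_B) ** t
        p_A = like_A / (like_A + like_B)
        p_B = 1.0 - p_A
        # accumulate expected head/tail counts attributed to each coin
        heads_A += p_A * h
        tails_A += p_A * t
        heads_B += p_B * h
        tails_B += p_B * t
    # M-step: re-estimate each coin's head probability from the expected counts
    return heads_A / (heads_A + tails_A), heads_B / (heads_B + tails_B)

xs = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH", "HTHTTTHHTT", "THHHTHHHTH"]
print(two_coin_em_step(xs, 0.60, 0.50))   # approximately (0.71, 0.58)
```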
Motif finding problem
The motif finding problem is not that different from the coin toss problem!
Probabilistic approaches to motif finding
– EM
– Gibbs sampling (a generalized EM algorithm)
There are also combinatorial approaches
Motif finding problem
Given a set of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Find the motif in each of the individual sequences
The MEME algorithm
Collect all substrings of the same length w from the input sequences: X = (X1, ..., Xn)
Treat the sequences as bags of subsequences: a bag for the motif and a bag for the background
We need to estimate two models (one for the motif and one for the background) and assign each subsequence to one of the bags, such that the likelihood of the data (the subsequences) is maximized
– Difficult problem
– Solved by the EM algorithm
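A tiny sketch of the first step above, collecting every length-w subsequence into one pool X; the input strings shown are placeholders, not real data.

```python
# Illustrative: collect all length-w substrings (subsequences) from the inputs.
def collect_substrings(sequences, w):
    X = []
    for seq in sequences:
        X.extend(seq[i:i + w] for i in range(len(seq) - w + 1))
    return X

seqs = ["cctgatagacgctatctggctatccacgta",   # placeholder fragments
        "agtactggtgtacatttgatacgtacgtac"]
X = collect_substrings(seqs, 11)           # all 11-mers, X = (X1, ..., Xn)
```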
Motif finding vs coin toss
Subsequences are fractionally assigned to the motif model (M) and the background model (B):
tagacgctatc → 0.3× M, 0.7× B
gctatccacgt → 0.7× M, 0.3× B
gtaggtcctct → 0.2× M, 0.8× B
M: motif model; B: background model
Coin toss analogy:
θ: the probability of getting heads
θA: P(head) for coin A
θB: P(head) for coin B
Probability of a subsequence: P(x|M) or P(x|B)
Probability of an observation sequence: P(x|θ) = θ^#(heads) (1−θ)^#(tails)
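A hedged sketch of the two probabilities compared on this slide: a length-w subsequence is scored under a position-specific motif model M (one base distribution per motif column) and under a single background distribution B. The matrices below are illustrative placeholders, not fitted values.

```python
# Score a subsequence under the motif model M and the background model B.
def prob_under_motif(x, pwm):
    """P(x | M): product over positions of the column-specific base probability."""
    p = 1.0
    for i, base in enumerate(x):
        p *= pwm[i][base]
    return p

def prob_under_background(x, bg):
    """P(x | B): product of a single background base distribution."""
    p = 1.0
    for base in x:
        p *= bg[base]
    return p

bg = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
pwm = [{"a": 0.7, "c": 0.1, "g": 0.1, "t": 0.1}] * 5   # toy 5-column motif model
x = "acgta"
print(prob_under_motif(x, pwm), prob_under_background(x, bg))
```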
Fitting a mixture model by EM
A finite mixture model:
– Data X = (X1, ..., Xn) arises from two or more groups, with g models θ = (θ1, ..., θg).
– Indicator vectors Z = (Z1, ..., Zn), where Zi = (Zi1, ..., Zig), and Zij = 1 if Xi is from group j, and 0 otherwise.
– P(Zij = 1 | θj) = λj. For any given i, all Zij are 0 except for one j.
For motif finding, g = 2: class 1 (the motif) and class 2 (the background) are given by a position-specific multinomial distribution and a general multinomial distribution, respectively.
The E- and M-step
E-step: Since the log likelihood is the sum over i and j of terms multiplying Zij, and these are independent across i, we need only consider the expectation of one such term, given Xi. Using initial parameter values θ′ and λ′, and the fact that the Zij are binary, we get
E(Zij | X, θ′, λ′) = λ′j P(Xi|θ′j) / ∑k λ′k P(Xi|θ′k) = Z′ij
M-step: The maximization over λ is independent of the rest and is readily achieved with
λ″j = ∑i Z′ij / n
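A minimal sketch of these two steps, assuming the caller supplies the class likelihoods P(Xi | θ′j) as callables; only the mixing weights λ are re-estimated here, exactly as in the M-step formula above. Names are illustrative.

```python
# Mixture-model EM: responsibilities (E-step) and mixing weights (M-step).
def e_step(X, likelihood, lam):
    """Return Z'_ij = lam_j * P(X_i|theta_j) / sum_k lam_k * P(X_i|theta_k)."""
    Z = []
    for x in X:
        unnorm = [lam[j] * likelihood[j](x) for j in range(len(lam))]
        total = sum(unnorm)
        Z.append([u / total for u in unnorm])
    return Z

def m_step_lambda(Z):
    """Return lambda''_j = (1/n) * sum_i Z'_ij."""
    n, g = len(Z), len(Z[0])
    return [sum(Z[i][j] for i in range(n)) / n for j in range(g)]
```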
Baum-Welch algorithm for HMM parameter estimation
A_kl = ∑_{j=1..n} (1/p(x^j)) ∑_{i=1..L} p(s_{i−1} = k, s_i = l, x^j | θ)
     = ∑_{j=1..n} (1/p(x^j)) ∑_{i=1..L} f_k^j(i−1) · a_kl · e_l(x_i^j) · b_l^j(i)
E_k(b) = ∑_{j=1..n} (1/p(x^j)) ∑_{i: x_i^j = b} f_k^j(i) · b_k^j(i)
During each iteration, compute the expected number of transitions A_kl between any pair of states and the expected number of emissions E_k(b) from any state, averaging over all training sequences with the forward values f and backward values b (E-step); these expected counts are then used to compute the new parameters (M-step).
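A hedged sketch of these expected-count computations, assuming the forward values f[j][i][k], backward values bwd[j][i][k] (with the usual initialization stored at i = 0) and the sequence likelihoods p_x[j] have already been computed for every training sequence x^j; a[k][l] and e[l][symbol] are the current transition and emission parameters, and all variable names are illustrative. The M-step would then renormalize: a_kl = A_kl / ∑_l′ A_kl′ and e_k(b) = E_k(b) / ∑_b′ E_k(b′).

```python
# Illustrative sketch: accumulate Baum-Welch expected counts A[k][l] and E[k][s].
def expected_counts(seqs, f, bwd, p_x, a, e, states, alphabet):
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {s: 0.0 for s in alphabet} for k in states}
    for j, x in enumerate(seqs):
        for i in range(1, len(x) + 1):        # positions 1..L of sequence x^j
            sym = x[i - 1]
            for k in states:
                # expected emissions of `sym` from state k:
                # E_k(sym) += f_k(i) * b_k(i) / p(x^j)
                E[k][sym] += f[j][i][k] * bwd[j][i][k] / p_x[j]
                for l in states:
                    # expected k -> l transitions used at position i:
                    # A_kl += f_k(i-1) * a_kl * e_l(x_i) * b_l(i) / p(x^j)
                    A[k][l] += f[j][i - 1][k] * a[k][l] * e[l][sym] * bwd[j][i][l] / p_x[j]
    return A, E
```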
Pros and Cons
Cons
– Slow convergence
– Converges only to a local optimum
Pros
– The E-step and M-step are often easy to implement for many problems, thanks to the nice form of the complete-data likelihood function
– Solutions to the M-step often exist in closed form
Ref
– On the convergence properties of the EM algorithm. C. F. J. Wu, 1983.
– A gentle tutorial of the EM algorithm and its applications to parameter estimation for Gaussian mixture and hidden Markov models. J. A. Bilmes, 1998.
– What is the expectation maximization algorithm? Nature Biotechnology 26, 897-899, 2008.