Maximum Likelihood and
Parameter Estimation For HMM
Lecture #8
Background Readings: Chapter 3.3 in Biological Sequence Analysis, Durbin et al., 2001.
© Shlomo Moran, following Danny Geiger and Nir Friedman.
Parameter Estimation for HMM
[HMM diagram: hidden state path s1 → s2 → … → si → … → sL-1 → sL, emitting symbols X1, X2, …, Xi, …, XL-1, XL]
An HMM is defined by the parameters mkl (transition probabilities) and ek(b) (emission probabilities), for all states k, l and all symbols b.
Let θ denote the collection of these parameters:
θ = {mkl : k, l are states} ∪ {ek(b) : k is a state, b is a letter}
[Diagram: a transition k → l with probability mkl, and an emission of symbol b from state k with probability ek(b)]
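For concreteness, the parameter collection θ can be held as two tables, one row per state. The sketch below (added for illustration, not part of the original slides) uses made-up states and symbols:

```python
# A hypothetical 2-state HMM over the alphabet {'A', 'B'}.
# theta = (transition probabilities m_kl, emission probabilities e_k(b)).
transitions = {                 # m_kl: probability of moving from state k to state l
    'S1': {'S1': 0.9, 'S2': 0.1},
    'S2': {'S1': 0.2, 'S2': 0.8},
}
emissions = {                   # e_k(b): probability that state k emits symbol b
    'S1': {'A': 0.7, 'B': 0.3},
    'S2': {'A': 0.4, 'B': 0.6},
}

# Each row is a probability distribution, so it must sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in transitions.values())
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in emissions.values())
```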
Parameter Estimation for HMM
To determine the values of (the parameters in) θ, we use a training set {x1,...,xn}, where each xj is a sequence which is assumed to fit the model.
Given the parameters θ, each sequence xj has an assigned probability p(xj|θ).
Maximum Likelihood Parameter Estimation
for HMM
The elements of the training set {x1,...,xn} are assumed to be independent, so
p(x1,..., xn|θ) = ∏j p(xj|θ).
ML parameter estimation looks for the θ which maximizes this product.
The exact method for finding or approximating
this θ depends on the nature of the training set
used.
Data for HMM
The training set is characterized by:
1. For each xj, the information available on its states sij (the symbols xij are usually known).
2. Its size (the number of sequences in it).
Case 1: ML when Sequences are fully known
We know the complete structure of each sequence in the
training set {x1,...,xn}. We wish to estimate mkl and ek(b) for
all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which
maximize the probability of the sample set:
p(x1,...,xn|θ*) = maxθ p(x1,...,xn|θ).
Case 1: Sequences are fully known
For each xj we have:

prob(xj|θ) = ∏i=1..Lj  msi-1 si · esi(xi)

Let M^j_kl = |{i : s^j_i-1 = k, s^j_i = l}|  (the superscript j indicates xj),
and let E^j_k(b) = |{i : s^j_i = k, x^j_i = b}|.

Then:  prob(xj|θ) = ∏(k,l) mkl^(M^j_kl) · ∏(k,b) [ek(b)]^(E^j_k(b))
Case 1 (cont)
By the independence of the xj’s, p(x1,...,xn|θ) = ∏j p(xj|θ).
Thus, if Mkl = #(transitions from k to l) in the training set,
and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:

prob(x1,..,xn|θ) = ∏(k,l) mkl^Mkl · ∏(k,b) [ek(b)]^Ek(b)
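To make the counting step concrete, here is a minimal sketch (added for illustration; the data layout and function name are assumptions, not from the slides) that tallies Mkl and Ek(b) from fully labeled sequences:

```python
from collections import defaultdict

def count_transitions_emissions(training_set):
    """training_set: list of sequences, each a list of (state, symbol) pairs.
    Returns (M, E), where M[k][l] counts k->l transitions and
    E[k][b] counts emissions of symbol b from state k."""
    M = defaultdict(lambda: defaultdict(int))
    E = defaultdict(lambda: defaultdict(int))
    for seq in training_set:
        for i, (state, symbol) in enumerate(seq):
            E[state][symbol] += 1
            if i > 0:                          # transition from the previous state
                prev_state = seq[i - 1][0]
                M[prev_state][state] += 1
    return M, E

# Tiny made-up example: two labeled sequences over states {'k', 'l'} and symbols {'A', 'B'}.
train = [[('k', 'A'), ('k', 'B'), ('l', 'A')],
         [('l', 'B'), ('k', 'A')]]
M, E = count_transitions_emissions(train)
print(dict(M['k']), dict(E['k']))   # {'k': 1, 'l': 1} {'A': 2, 'B': 1}
```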
Case 1 (cont)
So we need to find mkl’s and ek(b)’s which maximize:

∏(k,l) mkl^Mkl · ∏(k,b) [ek(b)]^Ek(b)

Subject to: for all states k,  ∑l mkl = 1  and  ∑b ek(b) = 1   [with mkl, ek(b) ≥ 0].
Case 1 (cont)
Rewriting, we need to maximize:

F = ∏(k,l) mkl^Mkl · ∏(k,b) [ek(b)]^Ek(b) = ∏k [ ∏l mkl^Mkl ] × ∏k [ ∏b [ek(b)]^Ek(b) ]

Subject to: for all k,  ∑l mkl = 1, and ∑b ek(b) = 1.
Case 1 (cont)
If, for each k, we maximize ∏l mkl^Mkl subject to ∑l mkl = 1,
and also ∏b [ek(b)]^Ek(b) subject to ∑b ek(b) = 1,
then we also maximize F.
Each of the above is a simpler ML problem, which is similar to ML parameter estimation for a die, treated next.
ML parameter estimation for a die
Let X be a random variable with 6 values x1,…,x6 denoting the
six outcomes of a (possibly unfair) die. Here the parameters
are θ ={q1,q2,q3,q4,q5, q6} , ∑qi=1
Assume that the data is one sequence:
Data = (x6,x1,x1,x3,x2,x2,x3,x4,x5,x2,x6)
So we have to maximize
prob(Data|θ) = q1^2 · q2^3 · q3^2 · q4 · q5 · q6^2

Subject to: q1+q2+q3+q4+q5+q6 = 1   [and qi ≥ 0]

i.e., prob(Data|θ) = q1^2 · q2^3 · q3^2 · q4 · q5 · (1 − ∑i=1..5 qi)^2
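A small illustrative sketch of this setup (not part of the slides): count how often each face occurs in Data and evaluate the likelihood; the face labels follow the slide’s x1,…,x6.

```python
from collections import Counter
from math import prod

data = ['x6', 'x1', 'x1', 'x3', 'x2', 'x2', 'x3', 'x4', 'x5', 'x2', 'x6']
faces = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']

counts = Counter(data)                       # n_i: how many times face x_i was observed
n = len(data)

def likelihood(q):
    """prob(Data | theta) for parameters q = {face: probability}."""
    return prod(q[f] ** counts[f] for f in faces)

# The ML estimate is the relative frequency q_i = n_i / n (derived on the next slides).
q_ml = {f: counts[f] / n for f in faces}
print(q_ml)                  # 2/11, 3/11, 2/11, 1/11, 1/11, 2/11
print(likelihood(q_ml))      # the maximal likelihood value
```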
Side comment: Sufficient Statistics
To compute the probability of the data in the die example, we only need to record the number of times ni the die fell on side i (namely n1, n2,…,n6).
We do not need to recall the entire sequence of outcomes:
prob(Data|θ) = q1^n1 · q2^n2 · q3^n3 · q4^n4 · q5^n5 · (1 − ∑i=1..5 qi)^n6
{ni | i = 1,…,6} is called a sufficient statistic for multinomial sampling.
Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data':
s(Data) = s(Data') ⇒ P(Data|θ) = P(Data'|θ)
Exercise: define a concise “sufficient statistic” for the HMM model, when the sequences are fully known.
[Diagram: many datasets map to the same statistic]
Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function).
We will find the parameters by considering the corresponding log-likelihood function:

log(prob(Data|θ)) = log[ q1^n1 · q2^n2 · q3^n3 · q4^n4 · q5^n5 · (1 − ∑i=1..5 qi)^n6 ]
                  = ∑i=1..5 ni log qi + n6 log(1 − ∑i=1..5 qi)

A necessary condition for a (local) maximum is:

∂log(prob(Data|θ))/∂qj = nj/qj − n6/(1 − ∑i=1..5 qi) = 0
Finding the Maximum
Rearranging terms:

nj/qj = n6/(1 − ∑i=1..5 qi) = n6/q6

Divide the jth equation by the ith equation:  qj = (nj/ni) · qi

Sum from j=1 to 6:

1 = ∑j=1..6 qj = (∑j=1..6 nj / ni) · qi = (n/ni) · qi

The only local – and hence global – maximum is given by the relative frequency:

qi = ni/n,   i = 1,…,6
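As an informal sanity check (a sketch added here, not in the slides), one can compare the likelihood at the relative-frequency estimate against randomly perturbed probability vectors; the sampling scheme is arbitrary.

```python
import random
from math import prod

counts = [2, 3, 2, 1, 1, 2]                 # n_1..n_6 from the die example above
n = sum(counts)
q_ml = [c / n for c in counts]              # relative frequencies (the claimed maximizer)

def likelihood(q):
    return prod(qi ** ni for qi, ni in zip(q, counts))

random.seed(0)
for _ in range(1000):
    # Random alternative probability vector (normalized positive weights).
    w = [random.random() for _ in range(6)]
    q = [wi / sum(w) for wi in w]
    assert likelihood(q) <= likelihood(q_ml) + 1e-12
print("relative frequencies were never beaten:", likelihood(q_ml))
```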
Generalization to any number k of
outcomes
Let X be a random variable with k values x1,…,xk denoting the k outcomes of independently and identically distributed experiments, with parameters θ = {q1,q2,...,qk} (qi is the probability of xi).
Again, the data is one sequence of length n, in which xi appears ni times.
Then we have to maximize

prob(Data|θ) = q1^n1 · q2^n2 ··· qk^nk,   (n1 + … + nk = n)

Subject to: q1+q2+ … +qk = 1

i.e., prob(Data|θ) = q1^n1 ··· qk-1^nk-1 · (1 − ∑i=1..k-1 qi)^nk
Generalization for k outcomes (cont)
By treatment identical to the die case, the maximum is obtained when, for all i:

ni/qi = nk/qk

Hence the MLE is given by the relative frequencies:

qi = ni/n,   i = 1,…,k
Fractional Exponents
Some models allow the ni’s to be fractions (e.g., if we are uncertain of a die outcome, we may count it as “6” with 20% confidence and as “5” with 80%).
We still have, for θ = (q1,..,qk):

prob(Data|θ) = q1^n1 · q2^n2 ··· qk^nk,   (n1 + … + nk = n)

And the same analysis yields:

qi = ni/n,   i = 1,…,k
Summary: for an experiment with k outcomes x1,..,xk, the ML problem is the following:
Given a statistic (n1, n2,..,nk) of positive real numbers (ni is the number of observations of xi), find parameters θ = {q1,q2,...,qk} (qi is the probability of xi) which maximize the likelihood

prob(Data|θ) = q1^n1 · q2^n2 ··· qk^nk,   (n1 + … + nk = n)

Subject to: q1+q2+ … +qk = 1

The unique θ = (q1, q2,.., qk) which maximizes the likelihood is given by qi = ni/n, i = 1,…,k.
Frequency Vector of Statistics
(for a single random variable)
The frequency vector of a statistic (n1, n2,..,nk), where n1+…+nk = n, is the statistic (p1,..,pk) obtained by letting pi = ni/n (i = 1,..,k).
Consider two statistics of a random variable with 4 outcomes:
n = 125 and the statistic is (10, 25, 80, 10);
n = 250 and the statistic is (20, 50, 160, 20).
Both statistics give the same frequency vector (0.08, 0.2, 0.64, 0.08), and hence are optimized by the same parameters.
We can treat a frequency vector (p1,..,pk) as any other statistic, and find the parameters which maximize its likelihood.
ML for frequency vectors
For single-die experiments, the likelihood of any data which has frequency vector P is maximized by the same parameters which maximize the likelihood of the “statistic” P (prove!).
The “ML problem for frequency vectors” is the same as the original problem, but we restrict the statistics to frequency vectors:

Given a frequency vector of observed Data, (p1,.., pk), find parameters θ = (q1, q2,.., qk) which maximize the likelihood

prob(Data|θ) = q1^p1 · q2^p2 ··· qk^pk
ML for frequency vectors
Since a frequency vector can be viewed as a probability vector, our solution for the “single die” ML problem implies the following:

Let P = (p1,.., pk) be a probability k-vector. Then among all probability k-vectors Q = (q1,.., qk), the likelihood ∏i=1..k qi^pi is maximized iff Q = P.
The value of this maximum likelihood is ∏i=1..k pi^pi.
Side Trip:
Maximum Likelihood and Entropy
The logarithm of the maximum likelihood of the frequency vector (p1, p2,.., pk) is given by

log(p1^p1 · p2^p2 ··· pk^pk) = ∑i=1..k pi log pi.

This is a negative number, and its absolute value (−∑i=1..k pi log pi) is known as the entropy of (p1,.., pk).
ML and “Relative Entropy”
Thus, MLE for frequency vectors implies:

Let P = (p1,.., pk) be a probability vector. Then among all probability vectors Q = (q1,.., qk), the sum ∑i=1..k pi log qi attains its unique maximum when P = Q.

This is equivalent to the following:

Let P = (p1,.., pk) be a probability vector. Then for all probability vectors Q = (q1,.., qk) it holds that ∑i=1..k pi log(pi/qi) ≥ 0, with equality iff P = Q.
Relative Entropy
For P = (p1,.., pk) and Q = (q1,.., qk) as above, the sum D(P||Q) = ∑i=1..k pi log(pi/qi) is called the relative entropy, or the Kullback-Leibler distance, of P and Q.
So the ML formula for frequency vectors implies that the relative entropy of P and Q is nonnegative, and it equals 0 iff P = Q.
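A short illustrative sketch (added here) of D(P||Q); the two vectors are arbitrary examples, and the natural logarithm is used:

```python
from math import log

def relative_entropy(P, Q):
    """Kullback-Leibler distance D(P||Q) = sum_i p_i * log(p_i / q_i)."""
    return sum(p * log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.08, 0.2, 0.64, 0.08]                 # frequency vector from the earlier slide
Q = [0.25, 0.25, 0.25, 0.25]                # an arbitrary alternative distribution

print(relative_entropy(P, Q))               # positive, since P != Q
print(relative_entropy(P, P))               # 0.0, equality holds iff P == Q
```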
Relative entropy of scoring functions
Recall that we have defined the scoring function of alignments via

σ(a,b) = log[ P(a,b) / (Q(a)Q(b)) ] = log[ P(a,b) / Q(a,b) ]

where P(a,b), Q(a,b) are the probabilities of (a,b) in the “Match” and “Random” models.

The relative entropy between P and Q is:

D(P||Q) = ∑a,b P(a,b) log( P(a,b) / Q(a,b) ) = ∑a,b P(a,b) σ(a,b)

Large relative entropy means that P(a,b), the distribution of the “match” model, is significantly different from Q(a,b), the distribution of the “random” model. (This relative entropy is also called the mutual information of P and Q, denoted I(P,Q); I(P,Q) = 0 iff P = Q.)
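To illustrate, the sketch below (added, not from the slides) evaluates D(P||Q) = ∑a,b P(a,b)σ(a,b) for a toy two-letter alphabet; the joint “match” probabilities and background frequencies are invented for the example, and log base 2 is chosen so the scores are in bits.

```python
from math import log2

alphabet = ['A', 'C']

# Toy "match" model: joint probabilities P(a, b) favouring identical pairs.
P = {('A', 'A'): 0.4, ('A', 'C'): 0.1,
     ('C', 'A'): 0.1, ('C', 'C'): 0.4}

# Toy "random" model: independent background frequencies Q(a) * Q(b).
Qm = {'A': 0.5, 'C': 0.5}
Q = {(a, b): Qm[a] * Qm[b] for a in alphabet for b in alphabet}

# Scoring function sigma(a,b) = log2( P(a,b) / Q(a,b) ), i.e. scores in bits.
sigma = {pair: log2(P[pair] / Q[pair]) for pair in P}

# Relative entropy D(P||Q) = expected score under the match model.
D = sum(P[pair] * sigma[pair] for pair in P)
print(D)   # > 0, since the match distribution differs from the random one
```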
Back to ML of fully known training sets
Recall: Mkl = #(transitions from k to l) in the training set,
and Ek(b) = #(emissions of symbol b from state k) in the
training set. We need to maximize:
prob(x1,..,xn|θ) = ∏(k,l) mkl^Mkl · ∏(k,b) [ek(b)]^Ek(b)
Rearranging terms we need to:
Maximize

∏(k,l) mkl^Mkl · ∏(k,b) [ek(b)]^Ek(b)

Subject to: for all states k,  ∑l mkl = 1, and ∑b ek(b) = 1,  mkl, ek(b) ≥ 0.

For this, we need to maximize the likelihoods of two dice for each state k: one for the transitions {mkl | l = 1,..,m} and one for the emissions {ek(b) | b ∈ Σ}.
Apply to HMM (cont.)
We apply the “dice likelihood” technique to get, for each k, the parameters {mkl | l = 1,..,m} and {ek(b) | b ∈ Σ}:

mkl = Mkl / ∑l' Mkl' ,   and   ek(b) = Ek(b) / ∑b' Ek(b')

which gives the optimal ML parameters.
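A hedged sketch of this normalization step (the data layout mirrors the earlier counting sketch and is an assumption, not the slides’ notation):

```python
def ml_parameters(M, E):
    """Turn transition counts M[k][l] and emission counts E[k][b] into
    ML estimates m_kl = M_kl / sum_l' M_kl' and e_k(b) = E_k(b) / sum_b' E_k(b')."""
    m = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in M.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in E.items()}
    return m, e

# Toy counts (the same kind of counts the earlier counting sketch produces).
M = {'k': {'k': 1, 'l': 1}, 'l': {'k': 1}}
E = {'k': {'A': 2, 'B': 1}, 'l': {'A': 1, 'B': 1}}
m, e = ml_parameters(M, E)
print(m['k'])   # {'k': 0.5, 'l': 0.5}
print(e['k'])   # {'A': 0.666..., 'B': 0.333...}
```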
Adding pseudo counts in HMM
We may modify the actual counts by our prior knowledge/belief (e.g., when the sample set is too small):
rkl is our prior belief on transitions from k to l.
rk(b) is our prior belief on emissions of b from state k.

Then mkl = (Mkl + rkl) / ∑l' (Mkl' + rkl') ,   and   ek(b) = (Ek(b) + rk(b)) / ∑b' (Ek(b') + rk(b'))
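A minimal sketch of the pseudocount variant, assuming for illustration a uniform pseudocount r for every transition and emission (real priors rkl, rk(b) would be application specific):

```python
def ml_parameters_with_pseudocounts(M, E, states, alphabet, r=1.0):
    """ML estimates with a uniform pseudocount r added to every count.
    Missing (k, l) or (k, b) pairs are treated as count 0 before adding r."""
    m, e = {}, {}
    for k in states:
        trans = {l: M.get(k, {}).get(l, 0) + r for l in states}
        emit = {b: E.get(k, {}).get(b, 0) + r for b in alphabet}
        m[k] = {l: c / sum(trans.values()) for l, c in trans.items()}
        e[k] = {b: c / sum(emit.values()) for b, c in emit.items()}
    return m, e

# Toy counts where state 'l' was never observed emitting 'B'; the pseudocount keeps e_l(B) > 0.
M = {'k': {'k': 1, 'l': 1}, 'l': {'k': 1}}
E = {'k': {'A': 2, 'B': 1}, 'l': {'A': 1}}
m, e = ml_parameters_with_pseudocounts(M, E, states=['k', 'l'], alphabet=['A', 'B'])
print(e['l'])   # {'A': 0.666..., 'B': 0.333...} -- no zero probabilities
```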
Case 2: State paths are unknown
For a given θ we have:
prob(x1,..., xn|θ) = prob(x1|θ) ··· prob(xn|θ)
(since the xj are independent).
For each sequence x,
prob(x|θ) = ∑s prob(x,s|θ),
the sum taken over all state paths s which emit x.
Thus, for the n sequences (x1,..., xn) we have:

prob(x1,..., xn|θ) = ∑(s1,..., sn) prob(x1,..., xn, s1,..., sn|θ),

where the summation is taken over all tuples of n state paths (s1,..., sn) which generate (x1,..., xn).
For simplicity, we will assume that n = 1.
So we need to maximize prob(x|θ) = ∑s prob(x,s|θ),
where the summation is over all state paths s which produce the output sequence x.
Finding the θ* which maximizes ∑s prob(x,s|θ) is hard. [Unlike finding the θ* which maximizes prob(x,s|θ) for a single pair (x,s).]
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that prob(x|θ') > prob(x|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
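Purely to illustrate the loop structure described above (a sketch, with the improvement step left abstract, since the actual re-estimation step is the Baum-Welch update introduced on the following slides):

```python
def iterative_ml(theta0, improve, log_likelihood, tol=1e-6, max_iter=100):
    """Generic improvement loop: repeatedly replace theta by a theta' with
    higher likelihood until the gain falls below tol (or max_iter is hit).
    `improve` is an abstract callback, e.g. one Baum-Welch re-estimation step."""
    theta = theta0
    ll = log_likelihood(theta)
    for _ in range(max_iter):
        theta_new = improve(theta)          # step 2: find theta' with prob(x|theta') > prob(x|theta)
        ll_new = log_likelihood(theta_new)
        if ll_new - ll < tol:               # step 4: convergence criterion
            break
        theta, ll = theta_new, ll_new       # step 3: set theta = theta'
    return theta
```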
Baum Welch training
We start with some values of mkl and ek(b), which define
prior values of θ.
Then we use an iterative algorithm which attempts to
replace θ by a θ* s.t.
prob(x|θ*) > prob(x|θ)
This is done by “imitating” the algorithm for Case 1,
where all states are known:
Baum Welch training
[Diagram: a position with known states, si-1 = k and si = l, emitting xi-1 = b and xi = c]
In case 1 we computed the optimal values of mkl and ek(b),
(for the optimal θ) by simply counting the number Mkl of
transitions from state k to state l, and the number Ek(b) of
emissions of symbol b from state k, in the training set.
This was possible since we knew all the states.
Baum Welch training
[Diagram: a position with unknown states, si-1 = ? and si = ?, emitting xi-1 = b and xi = c]
When the states are unknown, the counting process is replaced by an averaging process:
For each edge si-1 → si we compute the average number of “k to l” transitions, for all possible pairs (k,l), over this edge. Then, for each k and l, we take Mkl to be the sum over all edges.
Baum Welch training
[Diagram: an unknown state si = ? with its observed emission b]
Similarly, for each emission edge si → b and each state k, we compute the average number of times that si = k, which is the expected number of “k → b” emissions on this edge. Then we take Ek(b) to be the sum over all such edges.
These expected values are computed by assuming the current parameters θ:
The expected values of Mkl and Ek(b)
Given the current distribution θ, Mkl and Ek(b) are defined by:

Mkl = ∑s M^s_kl · prob(s|x,θ),
where M^s_kl is the number of k to l transitions in the sequence s.

Ek(b) = ∑s E^s_k(b) · prob(s|x,θ),
where E^s_k(b) is the number of times k emits b in the sequence s with output x.
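For intuition, the brute-force sketch below (added for illustration, and feasible only for tiny examples) computes these expected counts by enumerating every state path, exactly as in the formulas above; real Baum-Welch training obtains the same quantities efficiently with the forward-backward algorithm. The uniform initial-state distribution is an added assumption.

```python
from collections import defaultdict
from itertools import product

# Tiny HMM (toy parameters); the uniform start distribution is an assumption.
states, alphabet = ['k', 'l'], ['A', 'B']
start = {'k': 0.5, 'l': 0.5}
m = {'k': {'k': 0.9, 'l': 0.1}, 'l': {'k': 0.2, 'l': 0.8}}
e = {'k': {'A': 0.7, 'B': 0.3}, 'l': {'A': 0.4, 'B': 0.6}}

x = ['A', 'B', 'A']                                    # observed output sequence

def joint_prob(s, x):
    """prob(x, s | theta) for a state path s and output x."""
    p = start[s[0]] * e[s[0]][x[0]]
    for i in range(1, len(x)):
        p *= m[s[i - 1]][s[i]] * e[s[i]][x[i]]
    return p

px = sum(joint_prob(s, x) for s in product(states, repeat=len(x)))   # prob(x | theta)

M_exp = defaultdict(float)                             # expected transition counts M_kl
E_exp = defaultdict(float)                             # expected emission counts E_k(b)
for s in product(states, repeat=len(x)):
    w = joint_prob(s, x) / px                          # prob(s | x, theta)
    for i in range(1, len(x)):
        M_exp[(s[i - 1], s[i])] += w                   # M^s_kl weighted by the posterior
    for state, symbol in zip(s, x):
        E_exp[(state, symbol)] += w                    # E^s_k(b) weighted by the posterior

print(dict(M_exp))
print(dict(E_exp))
```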