
Note on EM-training of IBM-model 1
INF5820 Language Technological Applications, Fall 2012
The slides on this subject (inf5820_2012_10_26.pdf), including the example, seem insufficient to give a
good grasp of what is going on. Hence here are some supplementary notes with more details.
Hopefully they make things clearer.
The main idea
There are two main items involved:
- Translation probabilities
- Word alignments
The translation probabilities are assigned to the bilingual lexicon:
For a pair of words (e,f) in the lexicon, how probable is it that e gets translated as f, expressed by t(f|e).
Beware, this is calculated from the whole corpus; we do not consider these probabilities for a single
sentence.
A word alignment is assigned to a pair of sentences (e, f). (We are using bold face to indicate that ‘e’
is a string (array) of words ‘e1, e2,… ek’, etc.) When we have a parallel corpus where the sentences are
sentence aligned – which may be expressed by (e1, f1), (e2, f2), …,(em, fm) – we are considering the
alignment of each sentence pair individually. Ideally, we are looking for the best alignment of each
sentence. But as we do not know it, we will instead consider the probability of the various alignments
of the sentence. For each sentence, the probability of the various alignments must add to 1.
The EM-training then goes as follows
0. Initializing
a. We start with initializing t. When we don’t have other information, we initialize t
uniformly. That is, t(f|e) = 1/s, where s is the number of F-words in the lexicon.
b. For each sentence in the corpus, we estimate the probability distribution for the
various alignments of the sentence. This is done on the basis of t, and should reflect t:
For example, if t(fi|ek) = 3 t(fj|ek), then alignments that align fi to ek should be 3
times more probable than those that align fj to ek.
(Well, actually, in round 0 this is more trivial, since all alignments are equally likely
when we start with a uniform t.)
1. Next round
a. We count how many times a word e is translated as f on the basis of the probability
distributions for the sentences. This is a fractional count. Given a sentence pair
(ej, fj), if e occurs in ej and f occurs in fj, we consider the alignments that align f to e.
Given such an alignment, a, we consider its probability P(a), and from this alignment
we count that e is translated as f “P(a) many times”. For example, if P(a) is 0.01, we
will add 0.01 to the count of how many times e is translated as f. After we have done
this for all alignments of all the sentences, we can recalculate t.
The notation in Koehn’s book for the different counts and measures is not stellar, but
as we adopted the same notation in the slides, we will stick to it to make the
similarities transparent. Koehn uses the notation c(f|e) for the fractional count of the
pair (e, f) in a particular sentence. To make it clear that it is the count in the specific
sentence pair (e, f), he also uses the notation c(f|e; f, e). To indicate the fractional
count of the word (type) pair (e, f) over the whole corpus, he uses ∑_(f,e) c(f|e; f, e)
(i.e. we add the fractional counts for all the sentences.) An alternative notation for the
same would have been
$$\sum_{i=1}^{m} c(f|e;\, \mathbf{f}_i, \mathbf{e}_i)$$
given there are m sentences in the corpus. We introduced the notation tc – for ‘total
count’ – for this on the slides.
$$tc(f|e) = \sum_{(\mathbf{f},\mathbf{e})} c(f|e;\, \mathbf{f}, \mathbf{e})$$
The reestimated translation probability can then be calculated from this
$$t(f|e) = \frac{\sum_{(\mathbf{f},\mathbf{e})} c(f|e;\, \mathbf{f}, \mathbf{e})}{\sum_{f'} \sum_{(\mathbf{f},\mathbf{e})} c(f'|e;\, \mathbf{f}, \mathbf{e})} = \frac{tc(f|e)}{\sum_{f'} tc(f'|e)}$$
Here f’ varies over all the F-words in the lexicon.
b. With these new translation probabilities, we may return to the alignments, and for
each sentence estimate the probability distribution over the possible alignments.
This time there is no simple shortcut as there was in round 0. For each alignment, we
calculate an unnormalized score on the basis of t, and then normalize to make sure that
the probabilities of the alignments of each sentence add up to 1.
2. Next round:
a. We go about exactly as in step (1a). On the basis of the alignment probabilities
estimated in step (1b), we may now calculate new translation probabilities t.
b. And on the basis of the new translation probabilities, we estimate new alignment probabilities.
3. And so we may repeat the two steps as long as we like … (a code sketch of the whole loop follows below).
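If it helps to see the recipe as running code, here is a minimal sketch of the loop in Python. The function and variable names (em_ibm1_explicit, e_vocab, f_vocab) are my own choices, not Koehn's pseudocode, and each E-sentence is assumed to already contain the NULL word 0 as its first element.

```python
# A minimal sketch of the EM loop described above, enumerating the alignments
# explicitly (the "slow" formulation). Names are my own, not Koehn's.
from itertools import product
from collections import defaultdict

def em_ibm1_explicit(corpus, e_vocab, f_vocab, rounds=2):
    """corpus: list of (e_words, f_words) sentence pairs, e_words including NULL."""
    # Step 0a: uniform initialization, t(f|e) = 1/s with s = number of F-words
    t = {(f, e): 1.0 / len(f_vocab) for e in e_vocab for f in f_vocab}
    for _ in range(rounds):
        tc = defaultdict(float)  # total fractional counts tc(f|e)
        for e_words, f_words in corpus:
            # Steps 0b/1b: score every alignment of this sentence pair by
            # multiplying translation probabilities, then normalize.
            alignments = list(product(range(len(e_words)), repeat=len(f_words)))
            scores = []
            for a in alignments:
                p = 1.0
                for j, i in enumerate(a):
                    p *= t[(f_words[j], e_words[i])]
                scores.append(p)
            z = sum(scores)
            # Steps 1a/2a: fractional counts c(f|e; f, e) from the alignments.
            for a, p in zip(alignments, scores):
                for j, i in enumerate(a):
                    tc[(f_words[j], e_words[i])] += p / z
        # Re-estimate: t(f|e) = tc(f|e) / sum_f' tc(f'|e)
        totals = defaultdict(float)
        for (f, e), count in tc.items():
            totals[e] += count
        t = {(f, e): tc[(f, e)] / totals[e] if totals[e] > 0 else 0.0
             for e in e_vocab for f in f_vocab}
    return t
```

Run for one round on the two-sentence toy corpus introduced below, this should reproduce the translation probabilities we work out by hand in the examples.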
Properties
What is nice about this algorithm is:
- We can prove that the likelihood of the training data gets better (or stays the same) after each round. It never
deteriorates.
- The result converges towards a local optimum.
- For IBM model 1 (but not in general) this local optimum is also a global optimum.
The fast way
We have described here the underlying idea of the algorithm. The description above is probably the
best for understanding what is going on. There is, however, a problem when applying it: there are far too
many different alignments. We therefore derived a modified algorithm where we do not calculate the
probabilities of the actual alignments. Instead, we calculate the translation probabilities in step (1a)
directly from the translation probabilities in step (0a), and the translation probabilities in step (2a)
directly from those in step (1a), without actually calculating the intermediate
alignment probabilities (steps 0b and 1b).
Examples
There is a very simple example in Jurafsky and Martin which illustrates the calculation with the
original algorithm. You should consult this first. In the example in the lecture, we followed the
modified algorithm where we sidestep the actual alignments. Let us now see how the example from
the lecture would go with the full algorithm first (similarly to the Jurafsky-Martin example), before we
compare it to the example from the lecture with some more details filled in.
We number the example sentences so that the simplest example comes first:
Sentence 1:
- e1: dog barked
- f1: hund bjeffet
Sentence 2:
- e2: dog bit dog
- f2: hund bet hund
The theoretically sound, but computationally intractable way:
Step 0a - Initialization.
Since there are 3 Norwegian words, all t(f|e) are set to 1/3.
t(hund|dog) = 1/3
t(bet|dog) = 1/3
t(bjeffet|dog) = 1/3
t(hund|bit) = 1/3
t(bet|bit) = 1/3
t(bjeffet|bit) = 1/3
t(hund|barked) = 1/3
t(bet|barked) = 1/3
t(bjeffet|barked) = 1/3
t(hund|0) = 1/3
t(bet|0) = 1/3
t(bjeffet|0) = 1/3
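As a small aside, Step 0a can be written as a one-line dictionary in Python; the representation (pairs (f, e) as keys, the string "0" for the NULL word) is my own choice, used again in the later snippets.

```python
# Step 0a: a uniform lexicon for the toy corpus (representation is my own).
E_VOCAB = ["0", "dog", "bit", "barked"]   # E-words, including the NULL word 0
F_VOCAB = ["hund", "bet", "bjeffet"]      # F-words
t0 = {(f, e): 1.0 / len(F_VOCAB) for e in E_VOCAB for f in F_VOCAB}   # all 1/3
```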
Step 0b – Alignments
We must also include 0 in the E-sentence to indicate that a word in the F-sentence may be aligned to
nothing. Each of the 2 words in the sentence f1 may come from one of 3 different words in sentence e1.
Hence there are 9 different alignments: <0,0>, <0,1>, <0,2>, <1,0>, <1,1>, <1,2>, <2,0>, <2,1>,
<2,2>. Since all translation probabilities are equal, each alignment will have the same
probability. Since there are 9 different alignments, each of them will have the probability 1/9. Writing
a1 for the alignment probability of the first sentence, we have a1(<0,0>)= a1(<0,1>)=…=
a1(<2,2>)=1/9.
For sentence 2, there are 3 words in f2. Each of them may be aligned to any of 4 different words in e2
(including 0). Hence there are 4*4*4=64 different alignments, ranging from <0,0,0> to <3,3,3>. We
could take the easy way out and say that each of them is equally likely, hence a2(<0,0,0>)=
a2(<0,0,1>)=…= a2(<3,3,3>)=1/64. But to prepare our understanding for later rounds, let us see what
happens if we follow the recipe. To calculate the probability of one particular alignment, we multiply
together the involved translation probabilities, e.g. P’(<1,2,0>) = t(hund|dog)*t(bet|bit)*t(hund|0)=1/27.
In this round, we get exactly the same result for all the alignments, 1/27. But that isn’t the same as
1/64. Has anything gone wrong here? No. The score 1/27 is not the probability of the alignment. To
get at the probability we must normalize. First we sum the scores for all the alignments, which
yields 64/27. Then, to get the probability for each alignment, we divide each score by this sum.
Hence the probability for each alignment is (1/27)/(64/27) – a complicated way to write 1/64.
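The same normalization can be spelled out in code. The sketch below (the helper name alignment_probs is mine) enumerates all alignments of a sentence pair, scores each one by multiplying the involved translation probabilities, and normalizes; it assumes t0 from the snippet above.

```python
# Step 0b spelled out as code: enumerate, score and normalize all alignments.
from itertools import product

def alignment_probs(e_words, f_words, t):
    """Return {alignment tuple: probability}, normalized per sentence pair."""
    scores = {}
    for a in product(range(len(e_words)), repeat=len(f_words)):
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(f_words[j], e_words[i])]
        scores[a] = p
    z = sum(scores.values())
    return {a: p / z for a, p in scores.items()}

e2, f2 = ["0", "dog", "bit", "dog"], ["hund", "bet", "hund"]
a2 = alignment_probs(e2, f2, t0)
print(len(a2))           # 64 alignments
print(a2[(1, 2, 0)])     # ≈ 1/64, i.e. (1/27)/(64/27)
```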
Step 1a – Maximize the translation probabilities
Then the show may start. We first calculate the fractional counts for the word pairs in the lexicon, and
we do this sentence by sentence, starting with sentence 1. To take one example, what is the fractional
count of (dog, hund) in sentence 1? We must see which alignments align the two words. There
are 3: <1,0>, <1,1>, <1,2>. (Good advice at this point is to draw the alignments while you read.) To
get the fractional count we must add the probabilities of these alignments, i.e.,
c(hund|dog; f1, e1) = a1(<1,0>)+ a1(<1,1>)+ a1(<1,2>)=3*(1/9) = 1/3.
We can repeat for the pair (hund, barked) and get
c(hund|barked; f1, e1) = a1(<2,0>)+ a1(<2,1>)+ a1(<2,2>)=3*(1/9) = 1/3,
and so on. We see that we get the same count for all word pairs in this sentence:
c(hund|dog)= 1/3
c(bjeffet|dog) = 1/3
c(hund|barked) = 1/3
c(bjeffet|barked) = 1/3
c(hund|0) = 1/3
c(bjeffet|0) = 1/3
(There is a typo in the lecture slides and in the first version of these notes, writing t instead of c in the
right column. The same for sentence 2.)
Sentence 2 is more exciting. Consider first the pair (bet, bit). They get aligned by all alignments of the
form <x, 2, y> where x and y are any of 0, 1, 2, 3. There are 16 such alignments. (We don’t bother to
write them out.) Each alignment has probability 1/64. Hence
c(bet|bit; f2, e2)= 16/64 = ¼
Similarly we get c(bet|0; f2, e2)= 16/64 = ¼. To count the pair (dog, bet), they are aligned by all
alignments of the form <x,1,y> and all alignments of the form <x,3,y>, hence
c(bet|dog; f2, e2)= 2*16/64 = ½
To count the pair (bit, hund), we must consider both alignments of the form <2,x,y> and of the form
<x,y,2>. (Observe that <2,x,2> should be counted twice since two occurrences of ‘hund’ are aligned to
‘bit’.) And to count the pair (hund, dog), we must consider all alignments <1,x,y>, <3,x,y>, <x,y,1>
and <x,y,3>. We get the following counts for sentence 2:
c(hund|dog)=1
c(bet|dog) = 1/2
c(hund|bit) = ½
c(bet|bit) = 1/4
c(hund|0) = 1/2
c(bet|0) = 1/4
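Continuing the sketch, the fractional counts of one sentence pair are obtained by summing alignment probabilities, exactly as in the hand calculation; a2 is the alignment distribution for sentence 2 from the previous snippet, and the helper name fractional_counts is mine.

```python
# Fractional counts c(f|e; f, e) for one sentence pair, summed from the
# alignment probabilities computed above.
from collections import defaultdict

def fractional_counts(e_words, f_words, aprobs):
    c = defaultdict(float)
    for a, p in aprobs.items():
        for j, i in enumerate(a):
            c[(f_words[j], e_words[i])] += p
    return c

c2 = fractional_counts(e2, f2, a2)
print(c2[("bet", "bit")])    # ≈ 0.25 (16 alignments of 1/64 each)
print(c2[("hund", "dog")])   # ≈ 1.0
```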
We get the total counts (tc) by adding the fractional counts for all the sentences in the corpus, resulting in:

          tc(hund|e)        tc(bet|e)   tc(bjeffet|e)   tc(*|e)
e=dog     1 + 1/3 = 4/3     1/2         1/3             4/3 + 1/2 + 1/3 = 13/6
e=bit     1/2               1/4         0               3/4
e=barked  1/3               0           1/3             2/3
e=0       1/2 + 1/3 = 5/6   1/4         1/3             5/6 + 1/4 + 1/3 = 17/12
In the last column we have added all the total counts for one E word, e.g. tc(*|dog) = ∑_f' tc(f'|dog).
We can then finally calculate the new translation probabilities:
e        f         t(f|e)           exact   decimal
0        hund      (5/6)/(17/12)    10/17   0.588235
0        bet       (1/4)/(17/12)    3/17    0.176471
0        bjeffet   (1/3)/(17/12)    4/17    0.235294
dog      hund      (4/3)/(13/6)     8/13    0.615385
dog      bet       (1/2)/(13/6)     3/13    0.230769
dog      bjeffet   (1/3)/(13/6)     2/13    0.153846
bit      hund      (1/2)/(3/4)      2/3     0.666667
bit      bet       (1/4)/(3/4)      1/3     0.333333
barked   hund      (1/3)/(2/3)      1/2     0.5
barked   bjeffet   (1/3)/(2/3)      1/2     0.5
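The re-estimation in this table is just a per-e normalization of the total counts, which is short to write in code; tc below is typed in by hand from the table above, and the helper name reestimate_t is mine.

```python
# Re-estimating t(f|e) = tc(f|e) / tc(*|e) from the total counts.
from collections import defaultdict

def reestimate_t(tc):
    totals = defaultdict(float)              # tc(*|e) = sum_f' tc(f'|e)
    for (f, e), count in tc.items():
        totals[e] += count
    return {(f, e): count / totals[e] for (f, e), count in tc.items()}

tc = {("hund", "dog"): 4/3, ("bet", "dog"): 1/2, ("bjeffet", "dog"): 1/3,
      ("hund", "bit"): 1/2, ("bet", "bit"): 1/4,
      ("hund", "barked"): 1/3, ("bjeffet", "barked"): 1/3,
      ("hund", "0"): 5/6, ("bet", "0"): 1/4, ("bjeffet", "0"): 1/3}
t1 = reestimate_t(tc)
print(t1[("hund", "dog")])   # 8/13 ≈ 0.615385
print(t1[("hund", "0")])     # 10/17 ≈ 0.588235
```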
Step 1b – Estimate alignment probabilities
It is time to estimate the alignment probabilities again. Remember this is done sentence by sentence,
starting with sentence 1. There are 9 different alignments to consider. For each of them we may
calculate an initial unnormalized probability, call it P’, on the basis of the last translation probabilities.
Alignment   Formula                            Value            P'         P = P'/1.514757
<0,0>       t(hund|0)*t(bjeffet|0)             (10/17)*(4/17)   0.138408   0.09137
<0,1>       t(hund|0)*t(bjeffet|dog)           (10/17)*(2/13)   0.090498   0.05974
<0,2>       t(hund|0)*t(bjeffet|barked)        (10/17)*(1/2)    0.294118   0.19417
<1,0>       t(hund|dog)*t(bjeffet|0)           (8/13)*(4/17)    0.144796   0.09559
<1,1>       t(hund|dog)*t(bjeffet|dog)         (8/13)*(2/13)    0.094675   0.06250
<1,2>       t(hund|dog)*t(bjeffet|barked)      (8/13)*(1/2)     0.307692   0.20313
<2,0>       t(hund|barked)*t(bjeffet|0)        (1/2)*(4/17)     0.117647   0.07767
<2,1>       t(hund|barked)*t(bjeffet|dog)      (1/2)*(2/13)     0.076923   0.05078
<2,2>       t(hund|barked)*t(bjeffet|barked)   (1/2)*(1/2)      0.25       0.16504
Sum of P's                                                      1.514757
We sum the P’ scores (last line) and normalize them in the last column to get the probability
distribution over the alignments.
We may do the same for sentence 2. But because there are 64 different alignments we refrain from
carrying out the details.
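In code, however, the 64 alignments are no obstacle: we can simply rerun the alignment_probs sketch from Step 0b with the re-estimated probabilities t1 from the previous snippet (again, these names are mine, not the notation of the slides).

```python
# Step 1b in code, reusing alignment_probs with the new translation probabilities.
e1, f1 = ["0", "dog", "barked"], ["hund", "bjeffet"]
a1_new = alignment_probs(e1, f1, t1)
print(a1_new[(2, 2)])                  # ≈ 0.165, P(<2,2>) in the table above
a2_new = alignment_probs(e2, f2, t1)   # all 64 alignments of sentence 2
```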
Step 2a – Maximize the translation probabilities
We proceed exactly as in step 1a. We first collect the fractional counts sentence by sentence, starting
with sentence 1. For example, we get
c(hund|barked; f1, e1) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>) = 0.07767 + 0.05078 + 0.16504 = 0.29349
And similarly for the other fractional counts in sentence 1.
Since we have not calculated the alignments for sentence 2, we stop here.
Hopefully the idea is clear by now.
The fast lane
By hand we refrain from calculating 64 alignments, but it wouldn’t have been a problem for a
machine. However, a short sentence of 10 words has roughly 10^10 alignments, and soon the
machines must give in as well.
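The count of alignments is easy to check: with the NULL word included, an E-sentence of length k and an F-sentence of length m have (k+1)^m alignments (the tiny helper below is mine, for illustration only).

```python
def n_alignments(k, m):
    """IBM-1 alignments of an m-word F-sentence against a k-word E-sentence plus NULL."""
    return (k + 1) ** m

print(n_alignments(3, 3))     # 64, as for sentence 2 above
print(n_alignments(10, 10))   # 25937424601, about 2.6 * 10**10
```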
Let us repeat the calculations from the slides from the lecture. The point is that we skip the alignments
and pass directly from step 0a to step 1a and then to step 2a etc. The key is the formula
$$c(f|e;\, \mathbf{e}, \mathbf{f}) = \frac{t(f|e)}{\sum_{i=0}^{k} t(f|e_i)} \sum_{j=1}^{m} \delta(f, f_j) \sum_{i=0}^{k} \delta(e, e_i),$$
which lets us calculate fractional counts directly from the (last round of) translation probabilities. Here k is the length of the E-sentence e (counting the NULL word as position 0) and m is the length of the F-sentence f.
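As a sketch, the formula translates directly into code; the helper name fast_fractional_counts is mine, and the variables e2, f2 and t0 come from the earlier snippets.

```python
# The closed-form fractional count: no enumeration of alignments needed.
from collections import defaultdict

def fast_fractional_counts(e_words, f_words, t):
    """c(f|e; e, f) computed directly from t, as in the formula above."""
    c = defaultdict(float)
    for f in set(f_words):
        z = sum(t[(f, e)] for e in e_words)        # sum_{i=0..k} t(f|e_i)
        for e in set(e_words):
            # t(f|e)/z  *  (occurrences of f in f)  *  (occurrences of e in e)
            c[(f, e)] = t[(f, e)] / z * f_words.count(f) * e_words.count(e)
    return c

print(fast_fractional_counts(e2, f2, t0)[("hund", "dog")])   # ≈ 1.0, as before
print(fast_fractional_counts(e2, f2, t0)[("bet", "bit")])    # ≈ 0.25
```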
Step 1a – Maximize the translation probabilities
To understand the formula, fj refers to the word at position j in sentence f. Thus in sentence 1, if f is
‘hund’, δ(f, fj) = 1 for j = 1, while δ(f, fj) = 0 for j = 2. Similarly, ei refers to the word in position i
in the English string. Hence,
$$c(hund|barked;\, \mathbf{e}_1, \mathbf{f}_1) = \frac{t(hund|barked)}{\sum_{i=0}^{2} t(hund|e_i)} \sum_{j=1}^{2} \delta(hund, f_j) \sum_{i=0}^{2} \delta(barked, e_i)$$
$$= \frac{1/3}{\sum_{i=0}^{2} (1/3)} \bigl(\delta(hund, hund) + \delta(hund, bjeffet)\bigr) \bigl(\delta(barked, 0) + \delta(barked, dog) + \delta(barked, barked)\bigr) = 1/3$$
and similarly for the other word pairs. We get the same fractional counts for sentence 1 as when we
used explicit alignments.
Then sentence 2. To take two examples
$$c(bet|bit;\, \mathbf{e}_2, \mathbf{f}_2) = \frac{t(bet|bit)}{\sum_{i=0}^{3} t(bet|e_i)} \sum_{j=1}^{3} \delta(bet, f_j) \sum_{i=0}^{3} \delta(bit, e_i) = \frac{1/3}{\sum_{i=0}^{3} (1/3)} \times 1 \times 1 = 1/4$$
$$c(hund|dog;\, \mathbf{e}_2, \mathbf{f}_2) = \frac{t(hund|dog)}{\sum_{i=0}^{3} t(hund|e_i)} \sum_{j=1}^{3} \delta(hund, f_j) \sum_{i=0}^{3} \delta(dog, e_i) = \frac{1/3}{\sum_{i=0}^{3} (1/3)} \times 2 \times 2 = 1$$
Hurray – we get the same fractional counts as with the explicit use of alignments. And we may
proceed as we did there, calculating first the total fractional counts, tc, and then the translation
probabilities, t.
Step 2a – Maximize the translation probabilities
We can harvest the award when we come to the next round and want to calculate the fractional counts.
Take an example from sentence 1:
$$c(hund|barked;\, \mathbf{e}_1, \mathbf{f}_1) = \frac{t(hund|barked)}{\sum_{i=0}^{2} t(hund|e_i)} \sum_{j=1}^{2} \delta(hund, f_j) \sum_{i=0}^{2} \delta(barked, e_i) = \frac{0.5}{0.588235 + 0.615385 + 0.5} = \frac{0.5}{1.70362} = 0.2934927$$
This is close enough to the result we got by taking the long route (given that we use a calculator and
round off in each round). The miracle is that this works equally well on sentence 2, for example:
$$c(hund|dog;\, \mathbf{e}_2, \mathbf{f}_2) = \frac{t(hund|dog)}{\sum_{i=0}^{3} t(hund|e_i)} \sum_{j=1}^{3} \delta(hund, f_j) \sum_{i=0}^{3} \delta(dog, e_i) = \frac{0.615385}{0.588235 + 0.615385 + 0.666667 + 0.615385} \times 2 \times 2 = ?$$
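For completeness, here are the same two round-2 counts in code, reusing the fast_fractional_counts sketch and the round-1 probabilities t1 from the earlier snippets; this also shows the value left as "?" above.

```python
print(fast_fractional_counts(e1, f1, t1)[("hund", "barked")])   # ≈ 0.293493
print(fast_fractional_counts(e2, f2, t1)[("hund", "dog")])      # ≈ 0.99, the value behind the "?"
```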
Summing up
This concludes the examples. Hopefully it is now possible to better see:
• The motivation behind the original approach, where we explicitly calculate alignments
• That the faster algorithm yields the same results as the original algorithm, at least on the
example we calculated explicitly. And even though it may be hard to verify step by step that
the two algorithms produce the same results in general, we may at least open up to the idea.
• That the fast algorithm is computationally tractable.