Note on EM-training of IBM model 1
INF5820 Language Technological Applications, Fall 2012

The slides on this subject (inf5820_2012_10_26.pdf), including the example, seem insufficient to give a good grasp of what is going on. Hence here are some supplementary notes with more details. Hopefully they make things clearer.

The main idea
There are two main items involved:
- Translation probabilities
- Word alignments

The translation probabilities are assigned to the bilingual lexicon: for a pair of words (e, f) in the lexicon, how probable is it that e gets translated as f, expressed by t(f|e). Beware, this is calculated from the whole corpus; we do not consider these probabilities for a single sentence.

A word alignment is assigned to a pair of sentences (e, f). (We are using bold face to indicate that 'e' is a string (array) of words 'e1, e2, …, ek', etc.) When we have a parallel corpus where the sentences are sentence aligned – which may be expressed by (e1, f1), (e2, f2), …, (em, fm) – we consider the alignment of each sentence pair individually. Ideally, we are looking for the best alignment of each sentence pair. But as we do not know it, we will instead consider the probabilities of the various alignments of the sentence pair. For each sentence pair, the probabilities of the various alignments must add up to 1.

The EM-training then goes as follows:

0. Initialization
a. We start by initializing t. When we have no other information, we initialize t uniformly, that is, t(f|e) = 1/s, where s is the number of F-words in the lexicon.
b. For each sentence pair in the corpus, we estimate the probability distribution over the various alignments of the pair. This is done on the basis of t, and should reflect t: for example, if t(fi|ek) = 3*t(fj|ek), then alignments which align fi to ek should be 3 times more probable than those which align fj to ek. (Well, actually, in round 0 this is trivial, since all alignments are equally likely when we start with a uniform t.)

1. Next round
a. We count how many times a word e is translated as f on the basis of the probability distributions for the sentence pairs. This is a fractional count. Given a sentence pair (ej, fj), if e occurs in ej and f occurs in fj, we consider the alignments which align f to e. Given such an alignment a, we consider its probability P(a), and from this alignment we count that e is translated as f "P(a) many times". For example, if P(a) is 0.01, we add 0.01 to the count of how many times e is translated as f. After we have done this for all alignments of all the sentence pairs, we can recalculate t.

The notation in Koehn's book for the different counts and measures is not stellar, but as we adopted the same notation in the slides, we stick to it to make the similarities transparent. Koehn uses the notation c(f|e) for the fractional count of the pair (e, f) in a particular sentence pair. To make it clear that it is the count in the specific sentence pair (e, f), he also uses the notation c(f|e; f, e). To indicate the fractional count of the word (type) pair (e, f) over the whole corpus, he uses

   Σ_(f,e) c(f|e; f, e)

(i.e. we add the fractional counts for all the sentence pairs). An alternative notation for the same would have been

   Σ_{i=1..m} c(f|e; f_i, e_i)

given there are m sentence pairs in the corpus. We introduced the notation tc – for 'total count' – for this on the slides:

   tc(f|e) = Σ_(f,e) c(f|e; f, e)

The reestimated translation probability can then be calculated from this:

   t(f|e) = Σ_(f,e) c(f|e; f, e) / Σ_f' Σ_(f,e) c(f'|e; f, e) = tc(f|e) / Σ_f' tc(f'|e)

Here f' varies over all the F-words in the lexicon.

b. With these new translation probabilities, we may return to the alignments, and for each sentence pair estimate the probability distribution over the possible alignments. This time there is no simple shortcut as there was in round 0: for each alignment, we calculate a score on the basis of t, and normalize to make sure that the probabilities for each sentence pair add up to 1.

2. Next round
a. We proceed exactly as in step (1a). On the basis of the alignment probabilities estimated in step (1b), we may now calculate new translation probabilities t.
b. And on the basis of the translation probabilities we estimate new alignment probabilities.

3. And so we may repeat the two steps as long as we like … (A code sketch of one such round follows below.)
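To make the two alternating steps concrete, here is that sketch, in Python. The function name em_round and the dictionary-based representation of t (a dict with t[(f, e)] = t(f|e), defined for every pair in the lexicon, including the empty word) are our own illustrative choices, not notation from Koehn or the slides. The sketch enumerates every alignment explicitly, so it is only meant for toy corpora:

    from itertools import product
    from collections import defaultdict

    NULL = "0"   # the empty word e_0, so that an F-word may be aligned to nothing

    def em_round(corpus, t):
        """One EM round. corpus is a list of (e_sentence, f_sentence) pairs,
        each sentence a list of words; t is a dict with t[(f, e)] = t(f|e)."""
        tc = defaultdict(float)        # tc(f|e), summed over the whole corpus
        total = defaultdict(float)     # sum over f' of tc(f'|e)
        for e_sent, f_sent in corpus:
            e_words = [NULL] + e_sent  # position 0 is the empty word
            # Steps 0b/1b: distribution over alignments <a_1, ..., a_m>,
            # where a_j is the E-position that f_j is aligned to.
            alignments = list(product(range(len(e_words)), repeat=len(f_sent)))
            scores = []
            for a in alignments:
                score = 1.0
                for j, i in enumerate(a):
                    score *= t[(f_sent[j], e_words[i])]
                scores.append(score)
            z = sum(scores)            # normalize so the probabilities add up to 1
            # Steps 1a/2a: add the fractional counts c(f|e; f, e) into tc(f|e).
            for a, score in zip(alignments, scores):
                p_a = score / z
                for j, i in enumerate(a):
                    tc[(f_sent[j], e_words[i])] += p_a
                    total[e_words[i]] += p_a
        # Reestimate: t(f|e) = tc(f|e) / sum over f' of tc(f'|e).
        return {(f, e): count / total[e] for (f, e), count in tc.items()}

Starting from a uniform t and calling em_round repeatedly is the loop in steps 0-3 above; why this naive version is hopeless on realistic data is the topic of 'The fast way' below.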
Properties
What is nice with this algorithm is:
- We can prove that the result gets better (or stays the same) after each round. It never deteriorates.
- The result converges towards a local optimum.
- For IBM model 1 (but not in general) this local optimum is also a global optimum.

The fast way
We have described here the underlying idea of the algorithm. The description above is probably the best for understanding what is going on, but there is a problem when applying it: there are far too many different alignments. We therefore derived a modified algorithm where we do not calculate the probabilities of the actual alignments. Instead we calculate the translation probabilities in step (1a) directly from the translation probabilities from step (0a), and the translation probabilities in step (2a) directly from the translation probabilities in (1a), without actually calculating the intermediate alignment probabilities (step 1b).

Examples
There is a very simple example in Jurafsky and Martin which illustrates the calculation with the original algorithm. You should consult this first. In the example in the lecture, we followed the modified algorithm where we sidestep the actual alignments. Let us now see how the example from the lecture would go with the full algorithm first (similarly to the Jurafsky-Martin example), before we compare it to the example from the lecture with some more details filled in. We number the examples so that the simplest comes first.

Sentence 1:
- e1: dog barked
- f1: hund bjeffet

Sentence 2:
- e2: dog bit dog
- f2: hund bet hund

The theoretically sound, but computationally intractable way

Step 0a – Initialization
Since there are 3 Norwegian words, all t(f|e) are set to 1/3:

   t(hund|dog) = 1/3      t(bet|dog) = 1/3      t(bjeffet|dog) = 1/3
   t(hund|bit) = 1/3      t(bet|bit) = 1/3      t(bjeffet|bit) = 1/3
   t(hund|barked) = 1/3   t(bet|barked) = 1/3   t(bjeffet|barked) = 1/3
   t(hund|0) = 1/3        t(bet|0) = 1/3        t(bjeffet|0) = 1/3

Step 0b – Alignments
We must also include 0 in the E-sentence to indicate that a word in the F-sentence may be aligned to nothing. Each of the 2 words in sentence f1 may come from one of 3 different words in sentence e1. Hence there are 9 different alignments: <0,0>, <0,1>, <0,2>, <1,0>, <1,1>, <1,2>, <2,0>, <2,1>, <2,2>. Since all translation probabilities are equal, each alignment will have the same probability, and since there are 9 different alignments, each of them gets the probability 1/9. Writing a1 for the alignment probability distribution of the first sentence pair, we have a1(<0,0>) = a1(<0,1>) = … = a1(<2,2>) = 1/9.

For sentence 2, there are 3 words in f2. Each of them may be aligned to any of 4 different words in e2 (including 0). Hence there are 4*4*4 = 64 different alignments, ranging from <0,0,0> to <3,3,3>. We could take the easy way out and say that each of them is equally likely, hence a2(<0,0,0>) = a2(<0,0,1>) = … = a2(<3,3,3>) = 1/64. But to prepare our understanding for later rounds, let us see what happens if we follow the recipe. To calculate the score of one particular alignment, we multiply together the involved translation probabilities, e.g. P'(<1,2,0>) = t(hund|dog)*t(bet|bit)*t(hund|0) = 1/27. In this round we get exactly the same result, 1/27, for all the alignments. But that isn't the same as 1/64. Has anything gone wrong here? No. The score 1/27 is not the probability of the alignment. To get the probability we must normalize: first we sum the scores for all the alignments, which yields 64/27; then, to get the probability of each alignment, we divide its score by this sum. Hence the probability of each alignment is (1/27)/(64/27) – a complicated way to write 1/64.
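This normalization is easy to check mechanically. A small brute-force sketch in Python (the variable names are our own) for sentence 2 under the uniform t of round 0:

    from itertools import product

    e2 = ["0", "dog", "bit", "dog"]   # the empty word written as 0, as in these notes
    f2 = ["hund", "bet", "hund"]
    t = {(f, e): 1.0 / 3 for f in set(f2) for e in set(e2)}   # uniform round-0 probabilities

    # Score every alignment <a_1, a_2, a_3>: a_j is the E-position f_j is aligned to.
    scores = {}
    for a in product(range(len(e2)), repeat=len(f2)):
        score = 1.0
        for j, i in enumerate(a):
            score *= t[(f2[j], e2[i])]
        scores[a] = score             # each score is (1/3)^3 = 1/27

    z = sum(scores.values())          # 64 * (1/27) = 64/27
    a2 = {a: s / z for a, s in scores.items()}
    print(len(a2), round(a2[(1, 2, 0)], 6))   # 64 alignments, each with probability 1/64 = 0.015625

With a non-uniform t, the same scoring and normalizing is exactly what step 1b asks for.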
Step 1a – Maximize the translation probabilities
Then the show may start. We first calculate the fractional counts for the word pairs in the lexicon, and we do this sentence pair by sentence pair, starting with sentence 1. To take one example, what is the fractional count of (dog, hund) in sentence 1? We must see which alignments align the two words. There are 3: <1,0>, <1,1>, <1,2>. (A good idea at this point is to draw the alignments while you read.) To get the fractional count we add the probabilities of these alignments, i.e.

   c(hund|dog; f1, e1) = a1(<1,0>) + a1(<1,1>) + a1(<1,2>) = 3*(1/9) = 1/3.

We can repeat this for the pair (barked, hund) and get

   c(hund|barked; f1, e1) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>) = 3*(1/9) = 1/3,

and so on. We see that we get the same for all word pairs in this sentence:

   c(hund|dog) = 1/3       c(bjeffet|dog) = 1/3
   c(hund|barked) = 1/3    c(bjeffet|barked) = 1/3
   c(hund|0) = 1/3         c(bjeffet|0) = 1/3

(There is a typo in the lecture slides and in the first version of these notes, writing t instead of c for these counts. The same for sentence 2.)

Sentence 2 is more exciting. Consider first the pair (bit, bet). These words are aligned by all alignments of the form <x,2,y>, where x and y are any of 0, 1, 2, 3. There are 16 such alignments (we don't bother to write them out), each with probability 1/64. Hence

   c(bet|bit; f2, e2) = 16/64 = 1/4.

Similarly we get c(bet|0; f2, e2) = 16/64 = 1/4. The pair (dog, bet) is aligned by all alignments of the form <x,1,y> and all alignments of the form <x,3,y>, hence

   c(bet|dog; f2, e2) = 2*16/64 = 1/2.

To count the pair (bit, hund), we must consider both alignments of the form <2,x,y> and alignments of the form <x,y,2>. (Observe that <2,x,2> should be counted twice, since two occurrences of 'hund' are aligned to 'bit'.) And to count the pair (dog, hund), we must consider all alignments of the forms <1,x,y>, <3,x,y>, <x,y,1> and <x,y,3>.
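Keeping track of which alignments link which word pair is tedious by hand. Here is a brute-force check in Python (continuing the toy example above, with our own variable names): for every alignment we add its probability once per link, which is why an alignment like <2,x,2> contributes twice to the pair (bit, hund).

    from itertools import product
    from collections import defaultdict

    e2 = ["0", "dog", "bit", "dog"]
    f2 = ["hund", "bet", "hund"]

    # In round 0 every one of the 4^3 = 64 alignments has probability 1/64.
    alignments = list(product(range(len(e2)), repeat=len(f2)))
    p = 1.0 / len(alignments)

    c = defaultdict(float)            # fractional counts c(f|e; f2, e2)
    for a in alignments:
        for j, i in enumerate(a):     # one link per F-position
            c[(f2[j], e2[i])] += p

    print(c[("bet", "bit")], c[("bet", "dog")], c[("hund", "bit")], c[("hund", "dog")])
    # 0.25 0.5 0.5 1.0

The printed numbers match the counts listed next.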
We get the following counts for sentence 2:

   c(hund|dog) = 1        c(bet|dog) = 1/2
   c(hund|bit) = 1/2      c(bet|bit) = 1/4
   c(hund|0) = 1/2        c(bet|0) = 1/4

We get the total counts (tc) by adding the fractional counts for all the sentences in the corpus, resulting in

   tc(hund|dog) = 1 + 1/3 = 4/3    tc(bet|dog) = 1/2     tc(bjeffet|dog) = 1/3      tc(*|dog) = 4/3 + 1/2 + 1/3 = 13/6
   tc(hund|bit) = 1/2              tc(bet|bit) = 1/4     tc(bjeffet|bit) = 0        tc(*|bit) = 3/4
   tc(hund|barked) = 1/3           tc(bet|barked) = 0    tc(bjeffet|barked) = 1/3   tc(*|barked) = 2/3
   tc(hund|0) = 1/2 + 1/3 = 5/6    tc(bet|0) = 1/4       tc(bjeffet|0) = 1/3        tc(*|0) = 17/12

In the last column we have added all the total counts for one E-word, e.g. tc(*|dog) = Σ_f' tc(f'|dog).

We can then finally calculate the new translation probabilities:

   e        f         t(f|e)           exact   decimal
   0        hund      (5/6)/(17/12)    10/17   0.588235
   0        bet       (1/4)/(17/12)    3/17    0.176471
   0        bjeffet   (1/3)/(17/12)    4/17    0.235294
   dog      hund      (4/3)/(13/6)     8/13    0.615385
   dog      bet       (1/2)/(13/6)     3/13    0.230769
   dog      bjeffet   (1/3)/(13/6)     2/13    0.153846
   bit      hund      (1/2)/(3/4)      2/3     0.666667
   bit      bet       (1/4)/(3/4)      1/3     0.333333
   barked   hund      (1/3)/(2/3)      1/2     0.5
   barked   bjeffet   (1/3)/(2/3)      1/2     0.5

Step 1b – Estimate alignment probabilities
It is time to estimate the alignment probabilities again. Remember that this is done sentence pair by sentence pair, starting with sentence 1. There are 9 different alignments to consider. For each of them we calculate an initial unnormalized score, call it P', on the basis of the latest translation probabilities:

                                                                                P'         P = P'/1.514757
   P'(<0,0>) = t(hund|0)*t(bjeffet|0)           = (10/17)*(4/17) =    0.138408   0.091373
   P'(<0,1>) = t(hund|0)*t(bjeffet|dog)         = (10/17)*(2/13) =    0.090498   0.059744
   P'(<0,2>) = t(hund|0)*t(bjeffet|barked)      = (10/17)*(1/2)  =    0.294118   0.194169
   P'(<1,0>) = t(hund|dog)*t(bjeffet|0)         = (8/13)*(4/17)  =    0.144796   0.095590
   P'(<1,1>) = t(hund|dog)*t(bjeffet|dog)       = (8/13)*(2/13)  =    0.094675   0.062502
   P'(<1,2>) = t(hund|dog)*t(bjeffet|barked)    = (8/13)*(1/2)   =    0.307692   0.203130
   P'(<2,0>) = t(hund|barked)*t(bjeffet|0)      = (1/2)*(4/17)   =    0.117647   0.077667
   P'(<2,1>) = t(hund|barked)*t(bjeffet|dog)    = (1/2)*(2/13)   =    0.076923   0.050782
   P'(<2,2>) = t(hund|barked)*t(bjeffet|barked) = (1/2)*(1/2)    =    0.25       0.165043
   Sum of P's                                                         1.514757

We sum the P' scores (the last line) and normalize them in the last column to get the probability distribution over the alignments. We may do the same for sentence 2, but because there are 64 different alignments we refrain from carrying out the details.

Step 2a – Maximize the translation probabilities
We proceed exactly as in step 1a. We first collect the fractional counts sentence pair by sentence pair, starting with sentence 1. For example, we get

   c(hund|barked; f1, e1) = a1(<2,0>) + a1(<2,1>) + a1(<2,2>) = 0.077667 + 0.050782 + 0.165043 = 0.293492

and similarly for the other fractional counts in sentence 1. Since we have not calculated the alignment probabilities for sentence 2, we stop here. Hopefully the idea is clear by now.

The fast lane
Manually we refrain from calculating 64 alignments, but it wouldn't have been a problem for a machine. However, a short sentence of 10 words has roughly 10^10 alignments, and soon the machines must give in too. Let us therefore repeat the calculations from the lecture slides. The point is that we skip the alignments and pass directly from step 0a to step 1a, and then on to step 2a, etc. The key is the formula

   c(f|e; e, f) = t(f|e) / Σ_{i=0..k} t(f|e_i) × Σ_{j=1..m} δ(f, f_j) × Σ_{i=0..k} δ(e, e_i)

where e = e_1 … e_k (with e_0 = 0, the empty word), f = f_1 … f_m, and δ(x, y) is 1 if x = y and 0 otherwise. This lets us calculate fractional counts directly from the (last round of) translation probabilities.
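Before applying the formula to the example by hand, here is a minimal sketch in Python of an EM round built on it (the name fast_em_round and the data layout are our own illustrative choices). It only loops over the word types of each sentence pair, never over alignments:

    from collections import defaultdict

    NULL = "0"   # the empty word at position 0 of every E-sentence

    def fast_em_round(corpus, t):
        """One IBM model 1 EM round without explicit alignments, using
        c(f|e; e, f) = t(f|e) / sum_i t(f|e_i) * #(f in f-sentence) * #(e in e-sentence)."""
        tc = defaultdict(float)       # tc(f|e), summed over the corpus
        total = defaultdict(float)    # sum over f' of tc(f'|e)
        for e_sent, f_sent in corpus:
            e_words = [NULL] + e_sent
            for f in set(f_sent):
                z = sum(t[(f, e)] for e in e_words)   # sum_{i=0..k} t(f|e_i)
                for e in set(e_words):
                    count = t[(f, e)] / z * f_sent.count(f) * e_words.count(e)
                    tc[(f, e)] += count
                    total[e] += count
        return {(f, e): cnt / total[e] for (f, e), cnt in tc.items()}

    # One round on the toy corpus, starting from the uniform t of step 0a:
    corpus = [(["dog", "barked"], ["hund", "bjeffet"]),
              (["dog", "bit", "dog"], ["hund", "bet", "hund"])]
    t = {(f, e): 1.0 / 3
         for f in ["hund", "bet", "bjeffet"]
         for e in [NULL, "dog", "bit", "barked"]}
    t = fast_em_round(corpus, t)
    print(t[("hund", "dog")], t[("hund", NULL)])   # 8/13 = 0.615..., 10/17 = 0.588...

The work per sentence pair is now proportional to the product of the sentence lengths, which is what makes the fast algorithm tractable.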
Step 1a – Maximize the translation probabilities
To understand the formula: f_j refers to the word at position j in sentence f. Thus in sentence 1, if f is 'hund', δ(f, f_j) = 1 for j = 1, while δ(f, f_j) = 0 for j = 2. Similarly, e_i refers to the word in position i in the English string. Hence

   c(hund|barked; e1, f1) = t(hund|barked) / Σ_{i=0..2} t(hund|e_i) × Σ_{j=1..2} δ(hund, f_j) × Σ_{i=0..2} δ(barked, e_i)
                          = (1/3) / Σ_{i=0..2} (1/3) × (δ(hund, hund) + δ(hund, bjeffet)) × (δ(barked, 0) + δ(barked, dog) + δ(barked, barked))
                          = (1/3) / 1 × 1 × 1 = 1/3

and similarly for the other word pairs. We get the same fractional counts for sentence 1 as when we used explicit alignments.

Then sentence 2. To take two examples:

   c(bet|bit; e2, f2) = t(bet|bit) / Σ_{i=0..3} t(bet|e_i) × Σ_{j=1..3} δ(bet, f_j) × Σ_{i=0..3} δ(bit, e_i)
                      = (1/3) / (4*(1/3)) × 1 × 1 = 1/4

   c(hund|dog; e2, f2) = t(hund|dog) / Σ_{i=0..3} t(hund|e_i) × Σ_{j=1..3} δ(hund, f_j) × Σ_{i=0..3} δ(dog, e_i)
                       = (1/3) / (4*(1/3)) × 2 × 2 = 1

Hurray – we get the same fractional counts as with the explicit use of alignments. And we may proceed as we did there, calculating first the total fractional counts, tc, and then the translation probabilities, t.

Step 2a – Maximize the translation probabilities
We can harvest the reward when we come to the next round and want to calculate the fractional counts. Take an example from sentence 1:

   c(hund|barked; e1, f1) = t(hund|barked) / Σ_{i=0..2} t(hund|e_i) × Σ_{j=1..2} δ(hund, f_j) × Σ_{i=0..2} δ(barked, e_i)
                          = 0.5 / (0.588235 + 0.615385 + 0.5) = 0.5 / 1.70362 = 0.2934927

which is close enough to the result we got by taking the long route (given that we use a calculator and round off in each round). The miracle is that this works equally well on sentence 2, for example:

   c(hund|dog; e2, f2) = t(hund|dog) / Σ_{i=0..3} t(hund|e_i) × Σ_{j=1..3} δ(hund, f_j) × Σ_{i=0..3} δ(dog, e_i)
                       = 0.615385 / (0.588235 + 0.615385 + 0.666667 + 0.615385) × 2 × 2 = ?

Summing up
This concludes the examples. Hopefully it is now possible to better see:
• The motivation behind the original approach, where we explicitly calculate alignments.
• That the faster algorithm yields the same results as the original algorithm, at least on the example we calculated explicitly. And even though it may be hard to see step by step that the two algorithms produce the same results in general, this should make the idea plausible.
• That the fast algorithm is computationally tractable.