Road Map: Word-Based SMT Models and Alignment
(Chapter 4: Word-Based Models)

Parallel Corpora and Alignment
• How to prepare training data
• How to align a parallel corpus

Word Alignment

Statistical Machine Translation
• e* = argmax_e p(E=e|F=f)
• Training: Find an appropriate model p(E|F) using data D
• MLE: p(E=e|F=f, D) = count(e,f) / count(f)
• Why can’t we do that?

Let us step through the mathematics of trigram language models (i.e., n-gram language models with n = 3):

p(e1, e2, ..., en) = p(e1) p(e2|e1) ... p(en|e1, e2, ..., en−1)
                   ≈ p(e1) p(e2|e1) ... p(en|en−2, en−1)
We decompose the whole-sentence probability into single-word probabilities, using the chain rule. Then, we make the independence assumption
that only the previous two words matter for predicting a word.
To estimate the probabilities for a trigram language model, we need
to collect statistics for three word sequences from large amounts of text.
In statistical machine translation, we use the English side of the parallel
corpus, but may also include additional text resources.
Chapter 7 has more detail on how language models are built.
4.3.3 Noisy Channel Model

Using a probabilistic model p(E|F), translation becomes

e* = argmax_e p(e|f)    (4.22)

How can we combine a language model and our translation model? Recall that we want to find the best translation e for an input sentence f. We now apply Bayes' rule to include p(e):

argmax_e p(e|f) = argmax_e p(f|e) p(e) / p(f)
                = argmax_e p(f|e) p(e)    (4.23)

In other words, we use Bayes' rule to decompose p(e|f) into
• a translation model p(f|e)
• a (target) language model p(e)
Does this help us?
Note that, mathematically, the translation direction has changed from p(e|f) to p(f|e). This may create a lot of confusion, since the concept of what constitutes the source language differs between the mathematics of the model and the actual application. We try to stay away from this confusion as much as possible by sticking to the notation p(e|f) when formulating a translation model. (Note: p(e|f) is shorthand for p(E=e|F=f).)

Combining a language model and a translation model this way is called the noisy channel model. This method is also widely used in speech recognition.
Statistical Machine Translation
Translation Model: Conditional likelihood over sentence pairs
• p(das haus ist klein|the house is small)
• p(das haus ist klein|the building is tiny)
• p(das haus ist klein|the shell is low)
• ....
Language Model: likelihood of target language sentences
• p(the house is small)
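To make the noisy channel combination concrete, here is a minimal Python sketch that scores whole-sentence candidates by p(f|e)·p(e) in log space. The toy probability tables and the candidate list are illustrative assumptions, not estimates from any real corpus.

```python
import math

# Hypothetical toy tables (illustration only): translation model p(f|e)
# over sentence pairs and language model p(e) over target sentences.
translation_model = {
    ("das haus ist klein", "the house is small"): 0.2,
    ("das haus ist klein", "the building is tiny"): 0.1,
    ("das haus ist klein", "the shell is low"): 0.05,
}
language_model = {
    "the house is small": 0.01,
    "the building is tiny": 0.001,
    "the shell is low": 0.00001,
}

def noisy_channel_best(f, candidates):
    """Return argmax_e p(f|e) * p(e), computed in log space for stability."""
    best_e, best_score = None, float("-inf")
    for e in candidates:
        tm = translation_model.get((f, e), 1e-12)
        lm = language_model.get(e, 1e-12)
        score = math.log(tm) + math.log(lm)
        if score > best_score:
            best_e, best_score = e, score
    return best_e

print(noisy_channel_best("das haus ist klein",
                         ["the house is small", "the building is tiny", "the shell is low"]))
```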
The underlying assumption is that only a limited number of previous words affect the probability of the next word. It is technically wrong, and it is not too hard to come up with counter-examples that demonstrate that a longer history is needed. However, limited data restrict the collection of reliable statistics to short histories.

Typically, we choose the actual number of words in the history based on how much training data we have. More training data allows for longer histories. Most commonly, trigram language models are used. They consider a two-word history to predict the third word. This requires the collection of statistics over sequences of three words, so-called 3-grams (trigrams). Language models may also be estimated over 2-grams (bigrams), single words (unigrams), or any other order of n-grams.
N-Gram Language Models

7.1.2 Estimation

In its simplest form, the estimation of trigram word prediction probabilities p(w3|w1, w2) is straightforward: we count how often the sequence w1, w2 is followed by the word w3 in our training corpus, as opposed to other words. According to maximum likelihood estimation, we compute:

p(w3|w1, w2) = count(w1, w2, w3) / Σ_w count(w1, w2, w)    (7.5)

(plus smoothing to cover unseen events)

For example, the sentence of course john has fun contributes to the counts of the trigrams (of, course, john), (course, john, has), and (john, has, fun).
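A minimal sketch of the maximum likelihood estimate in Equation 7.5, assuming whitespace-tokenized sentences and no smoothing; the toy corpus is made up for illustration.

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """MLE trigram estimates: p(w3 | w1, w2) = count(w1,w2,w3) / count(w1,w2,*)."""
    tri_counts = defaultdict(int)
    hist_counts = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            tri_counts[(w1, w2, w3)] += 1
            hist_counts[(w1, w2)] += 1
    return {trigram: c / hist_counts[trigram[:2]] for trigram, c in tri_counts.items()}

model = train_trigram_mle(["of course john has fun", "of course the house is small"])
print(model[("of", "course", "john")])   # 0.5 on this toy corpus
```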
Statistical Translation Models
Decompose p(f |e) in such a way that its parameters can be
estimated from example data (bilingual parallel corpus = bitext).
Parallel corpus
what is more , the relevant cost
dynamic is completely under control.
im übrigen ist die diesbezügliche
kostenentwicklung völlig unter kontrolle .
sooner or later we will have to be
sufficiently progressive in terms of own
resources as a basis for this fair tax
system .
früher oder später müssen wir die
notwendige progressivität der eigenmittel als
grundlage dieses gerechten steuersystems
zur sprache bringen .
we plan to submit the first accession
partnership in the autumn of this year .
wir planen , die erste beitrittspartnerschaft
im herbst dieses jahres vorzulegen .
it is a question of equality and solidarity
.
hier geht es um gleichberechtigung und
solidarität .
the recommendation for the year 1999
has been formulated at a time of
favourable developments and optimistic
prospects for the european economy .
die empfehlung für das jahr 1999 wurde vor
dem hintergrund günstiger entwicklungen
und einer für den kurs der europäischen
wirtschaft positiven perspektive abgegeben .
that does not , however , detract from
the deep appreciation which we have for
this report .
im übrigen tut das unserer hohen
wertschätzung für den vorliegenden bericht
keinen abbruch .
Figure 7.1 gives some examples of how this estimation works on real data, in this case the European Parliament corpus. We consider three different histories: the green, the red, and the blue. The words that most frequently follow are quite different for the different histories. For instance, the red cross is a frequent trigram in the Europarl corpus, which also frequently mentions the green party, a political organization.

Let us look at one example of the maximum likelihood estimation of word probabilities given a two-word history. There are 225 occurrences ...
Word-Based Translation Models

Generative Model: Source language words are generated by target language words.
Translation: Decode what kind of word sequence has generated the foreign string.

One-to-many translation
• A source word may translate into multiple target words

das(1) Haus(2) ist(3) klitzeklein(4)
the(1) house(2) is(3) very(4) small(5)
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}
Reordering
• Words may be reordered during translation

klein(1) ist(2) das(3) Haus(4)
the(1) house(2) is(3) small(4)
a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}

Statistical Word Alignment Models
• Introduce an alignment function a : j → i
• each observed output word in position j is generated by the input word in position i = a(j)

More formally, we need to compute p(a|e, f), the probability of an alignment given the English and the foreign sentence. Applying the chain rule (recall Section 3.2.3 on page 76) gives us:

p(a|e, f) = p(e, a|f) / p(e|f)

We still need to derive p(e|f), the probability of translating the sentence f into e with any alignment. We sum over all possible ways of generating the string:

p(e|f) = Σ_a p(e, a|f) = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf} p(e, a|f)
Translation Model Parameters
Lexical Translations
• das → the
• haus → house, home, building, household, shell
• ist → is
• klein → small, low
das(1) Haus(2) ist(3) klitzeklein(4)
the(1) house(2) is(3) very(4) small(5)
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}

Alignment model: p(e, a|f)
Context-Independent Models (IBM1)

Collect Statistics

Count translation statistics in a word-aligned bitext: look at a parallel corpus (German text along with English translations). How often is Haus translated into house, building, home, ...?

Translation of Haus    Count
house                  8,000
building               1,600
home                     200
household                150
shell                     50

Multiple translation options?
• some options are more likely than others
• learn translation probabilities from example data
Estimate Translation Probabilities
• Maximum Likelihood Estimation (MLE):

t(e|f) = count(e, f) / count(f)

• for f = Haus:

t(e|f) = 0.8    if e = house,
         0.16   if e = building,
         0.02   if e = home,
         0.015  if e = household,
         0.005  if e = shell.

Special Cases in Word Alignment

Dropping words
• Words may be dropped when translated
  – The German article das is dropped

das(1) Haus(2) ist(3) klein(4)
house(1) is(2) small(3)
a : {1 → 2, 2 → 3, 3 → 4}
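A small sketch of this maximum likelihood estimation from a word-aligned bitext. The data layout (word lists plus an alignment dictionary a : j → i, and the single toy sentence pair) is an assumption for illustration.

```python
from collections import defaultdict

def mle_translation_table(aligned_bitext):
    """MLE estimate t(e|f) = count(e,f) / count(f) from word-aligned sentence pairs.

    aligned_bitext: list of (f_words, e_words, alignment) where alignment maps
    English positions j (1-based) to foreign positions i (1-based), as in a : {j -> i}.
    """
    pair_counts = defaultdict(int)
    f_counts = defaultdict(int)
    for f_words, e_words, alignment in aligned_bitext:
        for j, i in alignment.items():
            f, e = f_words[i - 1], e_words[j - 1]
            pair_counts[(e, f)] += 1
            f_counts[f] += 1
    return {(e, f): c / f_counts[f] for (e, f), c in pair_counts.items()}

corpus = [(["das", "Haus", "ist", "klein"], ["house", "is", "small"],
           {1: 2, 2: 3, 3: 4})]
t = mle_translation_table(corpus)
print(t[("house", "Haus")])   # 1.0 on this single sentence pair
```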
Inserting words: Introduce a special NULL word
• Words may be added during translation
  – The English just does not have an equivalent in German
  – We still need to map it to something: the special NULL token

NULL(0) das(1) Haus(2) ist(3) klein(4)
the(1) house(2) is(3) just(4) small(5)
a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}
IBM Model 1

• Generative model: break up the translation process into smaller steps
  – IBM Model 1 only uses lexical translation
• Translation probability
  – for a foreign sentence f = (f1, ..., flf) of length lf
  – to an English sentence e = (e1, ..., ele) of length le
  – with an alignment of each English word ej to a foreign word fi according to the alignment function a : j → i

p(e, a|f) = ε / (lf + 1)^le  ×  ∏_{j=1..le} t(ej | fa(j))

• only lexical translation parameters t(e|f)
• the parameter ε is a normalization constant
  (ε can be read as the likelihood of selecting the length of e given the length of f)
• equivalently, p(e, a|f) = Z(le, lf) ∏_{j=1..le} t(ej | fa(j)) with normalization Z(le, lf) = ε / (lf + 1)^le
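The IBM Model 1 formula is easy to turn into code. The sketch below assumes a t-table indexed by (English word, foreign word) pairs and an alignment given as a dictionary from English positions to foreign positions (0 = NULL); ε is left as a parameter.

```python
def ibm1_joint_prob(e_words, f_words, alignment, t, epsilon=1.0):
    """IBM Model 1: p(e, a|f) = epsilon / (lf + 1)^le * prod_j t(e_j | f_a(j)).

    alignment maps English positions j (1-based) to foreign positions a(j);
    position 0 denotes the special NULL token.
    """
    lf, le = len(f_words), len(e_words)
    prob = epsilon / (lf + 1) ** le
    for j, e in enumerate(e_words, start=1):
        i = alignment[j]
        f = "NULL" if i == 0 else f_words[i - 1]
        prob *= t.get((e, f), 0.0)
    return prob
```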
IBM Model 1: Example

das              Haus                ist                klein
e       t(e|f)   e          t(e|f)   e        t(e|f)    e       t(e|f)
the     0.7      house      0.8      is       0.8       small   0.4
that    0.15     building   0.16     ’s       0.16      little  0.4
which   0.075    home       0.02     exists   0.02      short   0.1
who     0.05     household  0.015    has      0.015     minor   0.06
this    0.025    shell      0.005    are      0.005     petty   0.04
For das Haus ist klein → the house is small:

p(e, a|f) = ε/4³ × t(the|das) × t(house|Haus) × t(is|ist) × t(small|klein)
          = ε/4³ × 0.7 × 0.8 × 0.8 × 0.4
          = 0.0028 ε

What does this mean for p(e, a|f), and for p(e|f) = Σ_a p(e, a|f)?
• p(the house is small | das Haus ist klein)
• p(the house is small | ist Haus das klein)

As a consequence, according to IBM Model 1 the translation probabilities for the following two alternative translations are the same:

natürlich ist das haus klein → of course the house is small
natürlich ist das haus klein → the course small is of house

4.4 Higher IBM Models

4.4.1 IBM Model 2

IBM Model 2 addresses the issue of alignment with an explicit model for alignment based on the positions of the input and output words. The translation of a foreign input word in position i to an English word in position j is modelled by an alignment probability distribution

a(i|j, le, lf)    (4.24)

Recall that the length of the input sentence f is denoted as lf, the length of the output sentence e as le. We can view translation under IBM Model 2 as a two-step process with a lexical translation step and an alignment step.
IBM Model 2: adding a model of positional alignment. New parameter: the alignment probability distribution a(i|j, le, lf).

natürlich ist das haus klein
  ↓ lexical translation step
of course is the house small
  ↓ alignment step
of course the house is small

The first step is lexical translation as in IBM Model 1, again modelled by the translation probability t(e|f). The second step is the alignment step. For instance, translating ist into is has a lexical translation probability of t(is|ist) and an alignment probability of a(2|5, 6, 5) — the 5th English word is aligned to the 2nd foreign word. The two steps are combined mathematically to form IBM Model 2:

p(e, a|f) = ε ∏_{j=1..le} t(ej | fa(j)) a(a(j) | j, le, lf)    (4.25)

Note that the alignment function maps each English output word j to a foreign input position a(j), and the alignment probability distribution is also set up in this reverse direction.

Fortunately, adding the alignment probability distribution does not make EM training much more complex. Recall that the number of possible word alignments for a sentence pair is exponential in the number of words. However, for IBM Model 1 we were able to reduce the complexity of computing p(e|f) by rearranging sums and products, and the same trick applies to IBM Model 2.
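Extending the Model 1 sketch with the alignment distribution of Equation 4.25 is a one-factor change per word. The table layouts below (t indexed by word pairs, a indexed by (i, j, le, lf)) are assumptions for illustration.

```python
def ibm2_joint_prob(e_words, f_words, alignment, t, a, epsilon=1.0):
    """IBM Model 2: p(e, a|f) = epsilon * prod_j t(e_j | f_a(j)) * a(a(j) | j, le, lf).

    t[(e, f)] is the lexical translation probability, a[(i, j, le, lf)] the
    alignment probability; alignment maps English positions j to foreign positions i,
    with position 0 denoting the NULL token.
    """
    le, lf = len(e_words), len(f_words)
    prob = epsilon
    for j, e in enumerate(e_words, start=1):
        i = alignment[j]
        f = "NULL" if i == 0 else f_words[i - 1]
        prob *= t.get((e, f), 0.0) * a.get((i, j, le, lf), 0.0)
    return prob
```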
Interlude: HMM Alignment Model

Motivation: Words do not move independently of each other — they often move in groups.
→ condition word movements on the previous word

• HMM alignment model: different alignment parameter p(a(j) | a(j−1), le)
• a good replacement for IBM Model 2
• EM algorithm application is harder, requires dynamic programming
• IBM Model 4 is similar, and also conditions on word classes
4.4.2 IBM Model 3

Motivation:
• Some words tend to be aligned to multiple words
• Other words tend to be unaligned (dropped)
→ Add a parameter modeling this property! Fertility: n(φ|f)

Example:
• German “zum” translates to “to the”
• German “ja” (used as a filler) is often dropped

So far, we have not explicitly modeled how many words are generated from each input word. In most cases, a German word translates to one single English word. However, some German words like zum typically translate to two English words, i.e., to the. Others, such as the flavoring particle ja, get dropped.

We now want to model the fertility of input words directly with a probability distribution

n(φ|f)    (4.29)

For each foreign word f, this probability distribution indicates how many (φ = 0, 1, 2, ...) output words it usually translates to. Returning to the examples above, we expect, for instance, that

n(1|haus) ≈ 1
n(2|zum) ≈ 1
n(0|ja) ≈ 1    (4.30)

Fertility deals explicitly with dropping input words by allowing φ = 0. But there is also the issue of adding words. Recall that we introduced the null token to account for words in the output that have no correspondent in the input. For instance, the English word do is often inserted when translating verbal negations into English. In the IBM models these added words are generated by the special null token.

We could model the fertility of the null token the same way as for all other words, by the conditional distribution n(φ|null). However, the number of inserted words clearly depends on the sentence length, so we chose to model null insertion as a special step. After the fertility step, we introduce one null token with probability p1 after each generated word, or no null token with probability p0 = 1 − p1.

With the addition of fertility and null token insertion, the translation process now consists of four steps. To review, each of these four steps is modeled probabilistically:

• Fertility is modeled by the distribution n(φ|f). For instance, duplication of the German word zum has the probability n(2|zum).
• Null insertion is modeled by the probabilities p1 and p0 = 1 − p1. For instance, the probability p1 is factored in for the insertion of null after ich, and the probability p0 is factored in for no such insertion after nicht.
• Lexical translation is handled by the probability distribution t(e|f) as in IBM Model 1. For instance, translating nicht into not is done with probability t(not|nicht).
• Distortion is modeled almost the same way as in IBM Model 2, with a probability distribution d(j|i, le, lf) which predicts the English output word position j based on the foreign input word position i and the respective sentence lengths. For instance, the placement of go as the 4th word of the 7-word English sentence, as the translation of gehe, which was the 2nd word in the 6-word German sentence, has probability d(4|2, 7, 6).
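Fertility statistics are easy to read off a word-aligned corpus: the fertility of a foreign token is the number of English words aligned to it. The sketch below (data layout as in the earlier t(e|f) example, with a : j → i and 0 meaning NULL) estimates n(φ|f) by relative frequency; the toy alignment reuses the ich gehe ja nicht zum haus example.

```python
from collections import defaultdict

def fertility_mle(aligned_bitext):
    """Relative-frequency estimate of the fertility distribution n(phi | f):
    phi = number of English words aligned to each foreign token."""
    counts = defaultdict(lambda: defaultdict(int))    # counts[f][phi]
    for f_words, e_words, alignment in aligned_bitext:
        phi = [0] * len(f_words)
        for j, i in alignment.items():                # a : j -> i, 0 = NULL
            if i > 0:
                phi[i - 1] += 1
        for f, k in zip(f_words, phi):
            counts[f][k] += 1
    return {f: {k: c / sum(d.values()) for k, c in d.items()} for f, d in counts.items()}

corpus = [(["ich", "gehe", "ja", "nicht", "zum", "haus"],
           ["i", "do", "not", "go", "to", "the", "house"],
           {1: 1, 3: 4, 4: 2, 5: 5, 6: 5, 7: 6})]    # "do" aligned to NULL (omitted)
n = fertility_mle(corpus)
print(n["zum"])   # {2: 1.0}  -> zum generates two English words
print(n["ja"])    # {0: 1.0}  -> ja is dropped
```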
IBM Model 4: Relative Distortion

Motivation:
• Absolute positions for distortion feel wrong
• Words do not move independently — some words tend to move and some do not
→ Introduce a relative distortion model
→ Introduce dependence on word classes

• A cept π is the set of all words in e aligned to the same word in f.
• The center ⊙ of a cept π is the ceiling of the average position (in e) of its words.

For Model 4, we introduce a relative distortion model. In this model, the placement of the translation of an input word is typically based on the placement of the translation of the preceding input word. The issue of relative distortion gets a bit convoluted, since we are dealing with placement in the input and the output, aggravated by the fact that words may be added or dropped, or translated one-to-many.

For clarification, let us first introduce some terminology. See Figure 4.9 for an example to illustrate this. Each input word fj that is aligned to at least one output word forms a cept. Figure 4.9 tabulates the five cepts πi for the six German input words — the flavoring particle ja is untranslated and thus does not form a cept.

We define the operator [i] to map the cept with index i back to its corresponding foreign input word position. For instance, in our example, the last cept π5 maps to input word position 6, i.e., [5] = 6. Most cepts contain one English word, except for cept π4, which belongs to the fertility-2 word zum and generates the two English words to and the.

The center of a cept is defined as the ceiling of the average of the word positions of the English words it generates. For instance, the fourth cept π4 (belonging to zum) is aligned to output words in positions 5 and 6. The average of 5 and 6 is 5.5, the ceiling of 5.5 is 6. We use the symbol ⊙i to denote the center of cept i, e.g., ⊙4 = 6.

How does this scheme determine distortion for English words? For each output word, we now define its relative distortion. We distinguish between three cases: (a) words that are generated by the null token, (b) the first word of a cept, and (c) subsequent words in a cept.

(a) Words generated by the null token are uniformly distributed.

(b) For the likelihood of placements of the first word of a cept, we use the probability distribution

d1(j − ⊙[i−1])    (4.37)

Placement is defined as English word position j relative to the center ⊙[i−1] of the preceding cept. Refer to Figure 4.9 for an example: the word not, English word position 3, is generated by cept π3. The center of the preceding cept π2 is ⊙2 = 4. Thus we have a relative distortion of −1, reflecting the forward movement of this word. In the case of translation in sequence, distortion is always +1: house and I are examples for this.

(c) For the placement of subsequent words in a cept, we consider the placement of the previous word in the cept, using the probability distribution

d>1(j − π[i,k−1])    (4.38)

π[i,k] refers to the word position of the kth word in the ith cept. We have one example for this in Figure 4.9: the English word the is the second word in the 4th cept (German word zum); the preceding English word in the cept is to. The position of to is π[4,0] = 5, the position of the is j = 6, thus we factor in d>1(+1).

Figure 4.9: Distortion in IBM Model 4. Foreign words with non-zero fertility form cepts (here 5 cepts), which contain English words ej. The center ⊙i of a cept πi is ceiling(avg(j)). The distortion of each English word is modeled by a probability distribution: (a) uniform for null-generated words such as do, (b) for first words in a cept such as not: based on the distance between the word and the center of the preceding cept, (c) for subsequent words in a cept such as the: distance to the previous word in the cept. The probability distributions d1 and d>1 are learnt from data; they are conditioned using lexical information, see text for details.

Alignment: ich gehe ja nicht zum haus → I do not go to the house (do is generated by NULL, ja is untranslated)

Foreign words and cepts:
cept πi                π1    π2    π3     π4       π5
foreign position [i]   1     2     4      5        6
foreign word f[i]      ich   gehe  nicht  zum      haus
English words {ej}     I     go    not    to, the  house
English positions {j}  1     4     3      5, 6     7
center of cept ⊙i      1     4     3      6        7

English words and distortion:
j                      1       2        3       4       5       6        7
ej                     I       do       not     go      to      the      house
in cept πi,k           π1,0    π0,0     π3,0    π2,0    π4,0    π4,1     π5,0
reference position     0       —        4       1       3       5        6
j − reference          +1      —        −1      +3      +2      +1       +1
distortion             d1(+1)  uniform  d1(−1)  d1(+3)  d1(+2)  d>1(+1)  d1(+1)
We may want
IBM
Model 4: Relative Distortion
126
Chapter 4. Word-Based Models
126
126
center
of IBM
the Models
preceding cept ⇡2Chapter
is 2 =4.4. Word-Based
Thus we have
4.4.
Higher
125relative
Models
distortion of
4. Word-Based
1, reflecting theChapter
forward movement
of this Models
word. In the
• ceptcase
π =ofAlltranslation
words in in
e aligned
the same
f house and I
sequence,todistortion
is word
alwaysin+1:
center of
the
preceding
cept ⇡2 is 2 = 4. Thus we have relative
are
examples
for
this.
⦿ of
a cept πcept
= ceiling
position
in e relative
• center
center
of the
preceding
⇡2 is of2 average
= 4. Thus
we have
distortion of 1, reflecting the Alignment
forward movement of this word. In the
distortion
of
1,
reflecting
the
forward
movement
of
this
word.
In the the
(c) For the placement of subsequent words in a cept, we consider
case ofoftranslation
translationininsequence,
sequence,
distortion
always
+1:house
house
and
case
is is
always
+1:
and
I Idisgehedistortion
ja nicht
zum
haus
placement NULL
of theich
previous
word
in the
cept,
using
the probability
are
examples
for
this.
are examples for this.
tribution
(j ⇡i,k
) cept, we consider the
(4.38)
>1
(c) For
For the
the placement
placementofofsubsequent
subsequentdwords
words
(c)
inina a1cept,
we consider the
I
do
go
not
to
the
house
placement
theprevious
previous
word
the
cept,
using
theprobability
probability
dis⇡i,kofof
refers
to
the word
position
ofcept,
the
kth
word
in
the
ith cept. disWe
have
placement
the
word
ininthe
using
the
one example for this in Figure 4.9: The English word the is the second
tribution
tribution
word in the 4th cept
zum),
word
d>1
⇡word
)cepts the preceding English
(4.38)
Foreign
and
i,k
d(German
(j(jwords
⇡i,k
(4.38)
>1
1 )1
in the cept is to. Position of to is ⇡4,0 = 5, position of the is j = 6,
referstotothe
the word
kth
word
the
ith
cept.
We
have
ceptposition
⇡
⇡the
⇡2word
⇡3inin
⇡4ith
⇡5 We
1 kth
i,krefers
⇡⇡i,k
ofofthe
the
cept.
have
thus we word
factorposition
in id>1 (+1)
foreign
position
[i] 4.9:
2 English
4
5
oneexample
examplefor
for
thisin
inFigure
Figure
The
word
the
second
one
this
4.9: 1The
English
word
thethe
is6is
the
second
word f[i]
ich gehe nicht zum
haus
word
the
4thforeign
cept(German
(German
word
zum),the
thepreceding
preceding
English
word
word
ininthe
4th
cept
English
word
Word
Classes
English
words {ej } word
I zum),
go
not
to,the house
in the
the cept
ceptisisEnglish
to. Position
Position
oftotois1is⇡4,0
⇡4,0
position
j =
in
to.
the
is is
j =
6, 6,
positionsof
{j}
4==5,5,
3position
5,6 ofof
7the
Wewe
may
want
introduce
For instance,
thus
we
factor
into
(+1)
of
cept i richer1 conditioning
4
3 on distortion.
6
7
thus
factor
incenter
dd>1
(+1)
>1
some words tend to get reordered during translation, while other remain in
order. A typical example
this eisj and
adjective-noun
Englishfor
words
distortion inversion when transWord Classes
Classes
Word
lating French to English. Adjectives get moved back, when preceded by a
j
1
2
3
4
5
6
7
For
instance: richer
A↵aires
extérieur becomes
external For
a↵airs.
We
want
conditioning
onondistortion.
We may
maynoun.
wantto
toeintroduce
introduce
conditioning
distortion.
Forinstance,
instance,
I richer
do
not
go
to
the
house
j
be
condition
the⇡2,0
distortion
probability
distribution
some
tend
toto⇡get
reordered
during
while
other
some words
wordsWe
tend
gettempted
reordered
during
translation,
other
remain
in may
cept
⇡1,0 to
⇡0,0
⇡3,0translation,
⇡while
⇡4,1 remain
⇡5,0in in
4,0
i,k
in
Equations
4.37–4.38
on
the
words
e
and
f
:
j
0
4
1
3
6
[i
1]
order.
for
1
order. AA typical
typicaliexample
example
forthis
thisisisadjective-noun
adjective-nouninversion
inversionwhen
whentranstransj
+1
1
+3
+2
+1
i
1
lating
Adjectives
moved
a a
lating French
FrenchtotoEnglish.
English.
Adjectives
get
moved
back,when
when
by
for initial
word
inget
cept:
d (jback,
|f[i dpreceded
, ej ) d by
1]
1]preceded
distortion
d1 (+1)
1
d1 ( 1) d11(+3) d[i
1 (+2)
>1 (+1)
1 (+1)(4.39)
noun.
extérieur
becomes
external
a↵airs.
noun. For
Forinstance:
instance:A↵aires
A↵aires
extérieur
becomes
external
a↵airs.
for additional words: d>1 (j ⇧i,k 1 |ej )
We
totocondition
the
probability
distribution
Wemay
maybe
betempted
tempted
condition
thedistortion
distortion
probability
distribution
Figure
4.9: Distortion
in IBM Model
4. Foreign words
with non-zero
fertility
However,
we
will
find
ourselves
in
a
situation,
where
forofmost
e and
in
Equations
4.37–4.38
on
the
words
e
and
f
:
forms
cepts
(hereon
5 trigger
cepts),
which
contain
English
e
.
The
center
a cept j
jej and
[if[i1]words
j
i
in Equations
4.37–4.38
the words
:
Some
words
reordering:
Conditional
distortion!
1]
f[i 1]⇡,i we
will not be Distortion
able to collect
sufficient
is ceiling(avg(j)).
of each English
wordstatistics
is modeled to
by aestimate
probabilityrealistic
for
iningenerated
cept:
|funiform,
1d(j
[i [i1] ,1]e,j(b)
probability
distributions.
distribution:
(a)word
for null
do:
• initial
initial
words:
for
initial
word
cept: dwords
e)j )for first words
1 (j such[ias[i1]
1] |f
(4.39)
(4.39)
in a•
cept
such
as
not:
based
on
the
distance
between
the
word
and
the center
of the
To
have
both
some
of
the
benefits
of
a
lexicalized
model
and
sufficient
subsequent
words:
for
additional
words:
d>1
(j(j ⇧⇧
)
i,k 1 |ej|e
for
additional
words:
d>1
i,k
1ceptj )such as the: distance
preceding
word
(j
),
(c)
for
subsequent
words
in
a
i
1
statistics, Model 4 introduces word classes. When we group the vocabulary
towe
the previous
word
in the cept.
probability distributions
d and d>1
However,
find
ourselves
ininThe
a asituation,
where
ej eare
and
However,
we will
willthe
find
ourselves
situation,
wherefor
for1most
most
of a learnt
language
into,
say,
50 classes,
we
can condition
the
probability
distrij and
from
data.
They are sufficient
conditioned using
lexical information,
see realistic
text for
ff[i 1] ,, we
will
not
be
able
to
collect
statistics
to
estimate
butions
on these
classes.
The result
is an statistics
almost lexicalized
model,
but with
will
not
be able
to collect
sufficient
to estimate
realistic
details.
[i 1] we
Conditioning
on words creates too many parameters!
probability
distributions.
sufficient
statistics.
probability
distributions.
To have To
both
ofdata
the
benefits
a lexicalized
model and
put
this
formally,
weof
introduce
two functions
)sufficient
and
B(e) that
• some
sparse
problem
in
To have both
somemore
of the
benefits
of training
a lexicalized
modelA(f
and
sufficient
statistics,map
Model
4
introduces
word
classes.
When
we
group
the
vocabulary
words
to
their
word
classes.
This
allows
us
to
reformulate
Equation
• 4induce
wordword
classes
for source
targetthe
language
statistics, Model
introduces
classes.
When and
we group
vocabulary4.39
of a language
into: into, say, 50 classes, we can condition the probability distriof a language into, say, 50 classes, we can condition the probability distributions on these•for
classes.
The
result
is and1almost
lexicalized
model,
with
initialwords:
word
cept:
B(ej ))but
initial
[i lexicalized
1] |A(f[i 1] ),model,
butions on these classes.
The in
result
is an(jalmost
but with
(4.40)
sufficient statistics.
• subsequent
for additionalwords:
words: d>1 (j ⇧i,k 1 |B(ej ))
sufficient statistics.
To put this more formally, we introduce
two functions A(f ) and B(e) that
To put this more formally, we introduce two functions A(f ) and B(e) that
map words to their word classes. This allows us to reformulate Equation 4.39
map
into: words to their word classes. This allows us to reformulate Equation 4.39
into:
for initial word in cept: d1 (j
[i 1] |A(f[i 1] ), B(ej ))
(4.40)
for initial word in cept: d1 (j
[i 1] |A(f[i 1] ), B(ej ))
for additional words: d>1 (j ⇧i,k 1 |B(ej ))
(4.40)
for additional words: d (j ⇧
|B(e ))
IBM Model 4: Word Classes
IBM Model 5

IBM Models 1–4 are deficient:
• some impossible translations have positive probabilities
• multiple output words may be placed in the same position
• probability mass is wasted!

IBM Model 5:
• fixes deficiency by keeping track of vacancies
• details: see the text book

Alignment and Parameter Estimation

• Where do we get the word-aligned bitexts from?

Word alignment:
• Use an alignment model p(e, a|f) to align data

When applying the model to the data, we need to compute the probability of different alignments given a sentence pair in the data. To put it more formally, we need to compute p(a|e, f), the probability of an alignment given the English and the foreign sentence. Applying the chain rule (recall Section 3.2.3 on page 76) gives us:

p(a|e, f) = p(e, a|f) / p(e|f)    (4.9)

We still need to derive p(e|f), the probability of translating the sentence f into e with any alignment:

p(e|f) = Σ_a p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf} p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf} ε/(lf+1)^le ∏_{j=1..le} t(ej | fa(j))
       = ε/(lf+1)^le ∏_{j=1..le} Σ_{i=0..lf} t(ej | fi)    (4.10)

Note the significance of the last step: instead of performing a sum over an exponential number of products ((lf + 1)^le alignments), we reduced the computational complexity to linear in lf and le (since lf ≈ le, computing p(e|f) is roughly quadratic with respect to sentence length).

Let us now put Equations 4.9 and 4.10 together:

p(a|e, f) = p(e, a|f) / p(e|f)
          = [ ε/(lf+1)^le ∏_{j=1..le} t(ej | fa(j)) ] / [ ε/(lf+1)^le ∏_{j=1..le} Σ_{i=0..lf} t(ej | fi) ]
          = ∏_{j=1..le} t(ej | fa(j)) / Σ_{i=0..lf} t(ej | fi)    (4.11)
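The rearrangement in Equation 4.10 is worth seeing in code: the two functions below compute the same p(e|f), one by explicitly enumerating all (lf+1)^le alignments and one with the product-of-sums form. The tiny t-table is a made-up illustration.

```python
import itertools

def p_e_given_f_naive(e_words, f_words, t, epsilon=1.0):
    """Sum over all (lf+1)^le alignments explicitly -- exponential."""
    lf, le = len(f_words), len(e_words)
    fe = ["NULL"] + f_words
    total = 0.0
    for alignment in itertools.product(range(lf + 1), repeat=le):
        prob = 1.0
        for j, i in enumerate(alignment):
            prob *= t.get((e_words[j], fe[i]), 0.0)
        total += prob
    return epsilon / (lf + 1) ** le * total

def p_e_given_f_fast(e_words, f_words, t, epsilon=1.0):
    """Rearranged product of sums (Equation 4.10) -- roughly quadratic."""
    lf, le = len(f_words), len(e_words)
    fe = ["NULL"] + f_words
    prob = epsilon / (lf + 1) ** le
    for e in e_words:
        prob *= sum(t.get((e, f), 0.0) for f in fe)
    return prob

t = {("the", "das"): 0.7, ("house", "Haus"): 0.8, ("is", "ist"): 0.8,
     ("small", "klein"): 0.4, ("the", "NULL"): 0.1}
e, f = ["the", "house", "is", "small"], ["das", "Haus", "ist", "klein"]
assert abs(p_e_given_f_naive(e, f, t) - p_e_given_f_fast(e, f, t)) < 1e-12
```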
Summary: IBM Models

We have presented with IBM Model 1 a model for machine translation. However, at closer inspection, the model has many flaws. The model is very weak in terms of reordering, as well as adding and dropping words. In fact, according to IBM Model 1, the best translation for any input is the empty string (see also Exercise 3 at the end of this chapter).

Five models of increasing complexity were proposed in the original work on statistical machine translation at IBM; higher models include more information. The advances of the five models are:

IBM Model 1   lexical translation
IBM Model 2   adds absolute alignment model
IBM Model 3   adds fertility model
IBM Model 4   relative alignment model
IBM Model 5   fixes deficiency
Alignment models introduce an explicit model for reordering words in a
sentence. More often than not, words that follow each other in one language
have translations that follow each other in the output language. However,
IBM Model 1 treats all possible reorderings as equally likely.
Fertility is the notion that input words produce a specific number of
output words in the output language. Most often, of course, one word in the
input language translates into one single word in the output language. But
some words produce multiple words or get dropped (producing zero words).
A model for the fertility of words addresses this aspect of translation.
Adding additional components increases the complexity of training the
models, but the general principles that we introduced for IBM Model 1 stay
the same: We define the models mathematically and then devise an EM
algorithm for training.
This incremental march towards more complex models does not only serve didactic purposes. All the IBM models are relevant, because EM training starts with the simplest Model 1 for a few iterations, and then proceeds through iterations of the more complex models all the way to IBM Model 5.

Assignment on word-based SMT
• manipulate translation models
• train and manipulate language models
• decode a simple test corpus
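As a minimal illustration of the EM training mentioned above, here is a sketch of IBM Model 1 EM with uniform initialization, no NULL word, and no smoothing, run on a tiny made-up corpus.

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Minimal EM sketch for IBM Model 1. bitext: list of (f_words, e_words) pairs.
    Returns t[(e, f)] estimates."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))           # uniform t(e|f) to start
    for _ in range(iterations):
        count = defaultdict(float)                        # expected counts c(e, f)
        total = defaultdict(float)                        # expected counts c(f)
        for f_words, e_words in bitext:
            for e in e_words:
                norm = sum(t[(e, f)] for f in f_words)    # normalizer over this sentence
                for f in f_words:
                    frac = t[(e, f)] / norm               # expected alignment count
                    count[(e, f)] += frac
                    total[f] += frac
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t

bitext = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"]),
          (["ein", "Buch"], ["a", "book"])]
t = ibm1_em(bitext)
print(round(t[("the", "das")], 2))   # converges towards 1.0
```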
Training
• Estimate the parameters of p(e, a|f) from aligned training data

What is Next?

Parallel Corpora and Alignment
• sentence alignment
• word alignment

Parallel Corpora and NLP

Building and Using Parallel Corpora
Collecting and Pre-Processing
Aligning
• document alignment (often given)
• paragraph alignment (optional)
• sentence alignment
Training Word-Based SMT
• word alignment and parameter estimation
Sentence Alignment
Task:
• align corresponding sentences to each other
• (may be sequences of sentences)
Assumption:
• sentence alignment can be done monotonically (no crossing links)
Challenges:
• non-1:1 alignments, insertions, deletions, incomplete translations
Many different ways to align sentences:
• (s1 → t1) (s2 → t2) (s3 → t3) ...
• (s1 → t1,t2) (s2 → t3) (s3 → 0) ...
• (s1,s2 → t1) (s3 → t2,t3) ...
...
Simple Example (Example 1: Regeringsförklaring)

Try to align the following sentences ...

source language:
1. Fooi Tiadii , hseatenis aoe iscesnaohtmutis emt eis Lsoih , xes ücis aot iisioohicsudiio .
2. Fooi Qitutiadii , eoi eoi Iämgui aotisit Löoohsiodiit eeiooseggui .
3. Xu len toi iis ?
4. Xoi wiscsiouiui toi todi ?
5. Eoi Qsoituis tehuio aot , toi tio eoi Tusegi Huuuit .
6. Bcis güs ximdii Tüoei ?
7. Ximdiit Hicuu ieuuio xos hicsudiio , eett xos tu iuxet wiseoiou ieuuio ?
8. Oioo , xos leoouio eoi Xeisiiou .
9. Eet xes oodiu Huuuit Xisl , tuoeiso Uiagimio ... ueis Iiyisio .
10. Voe aotisi Bagheci citueoe eesoo , güs eoi Iiomaoh easdi Huuu eio Eänuo za geohio .
11. Csaeis Uiunet !

Re-arranging the table ...

target language:
1. Wo wes qmåheei ew io iqoeino tun wes nis iäotzotmöt äo lsoh .
2. Fo qitu tun lun euu eöee nis äo iämguio ew soliu .
3. Wesu lun eio ogsåo ?
4. Win tqsie eio ?
5. Qsätuisoe cisäuueei euu eiu wes Haet cituseggoooh .
6. Gös womlio tzoe ?
7. Oik , wo wottui teoooohio
8. Eiuue wes ooui Haet wisl , aueo ekäwamiot .
9. Fmmis usummeun .
10. Wo wes uwaohoe euu citihse io einuo .
11. Gös Haet gösmåuimti .
12. Csueis Uiunet .
Decoding the Example ...

It was actually German and Swedish:

source language (German):
1. Eine Seuche , grausamer und erbarmungsloser als der Krieg , war über uns hereingebrochen .
2. Eine Pestseuche , die die Hälfte unseres Königreiches dahinraffte .
3. Wo kam sie her ?
4. Wie verbreitete sie sich ?
5. Die Priester sagten uns , sie sei die Strafe Gottes .
6. Aber für welche Sünde ?
7. Welches Gebot hatten wir gebrochen , dass wir so etwas verdient hatten ?
8. Nein , wir kannten die Wahrheit .
9. Das war nicht Gottes Werk , sondern Teufelei ... oder Hexerei .
10. Und unsere Aufgabe bestand darin , für die Heilung durch Gott den Dämon zu fangen .
11. Bruder Thomas !

target language (Swedish):
1. Vi var plågade av en epidemi som var mer hänsynslös än krig .
2. En pest som kom att döda mer än hälften av riket .
3. Vart kom den ifrån ?
4. Vem spred den ?
5. Prästerna berättade att det var Guds bestraffning .
6. För vilken synd ?
7. Nej , vi visste sanningen
8. Detta var inte Guds verk , utan djävulens .
9. Eller trolldom .
10. Vi var tvungna att besegra en demon .
11. För Guds förlåtelse .
12. Broder Thomas .

How did you do it?

Automatic Sentence Alignment: Approaches

Length-based methods: assumption = sentences (and sequences of sentences) that correspond to each other are also similar in length (in characters or words), more than others.
Lexical methods: assumption = corresponding sentences contain more corresponding words; use the distribution of corresponding words in the source and target language texts.
Combined methods: use lexical cues in length-based settings.
Back to Our Example: String Lengths

Each source and target sentence is listed together with its length in characters; corresponding sentences have clearly correlated lengths.

Length Correlation of KDE System Messages

Figure 4.2: Correlation of sentence length differences in 10,000 parallel KDE system messages: English-French (en-fr), English-Finnish (en-fi), English-Russian (en-ru) and English-Chinese (en-zh). Each panel plots sentence length in characters (en) against sentence length in characters of the other language; the correlation coefficients range from r = 0.9095 to r = 0.9778.
Gale & Church: The Main Idea — Compare String Lengths

Using the assumption that the same information is present in both bitext halves (everything is translated), the length signal will be enough in most cases to make local decisions about mapping corresponding segments.

Using these findings, Gale and Church [1991b] define a dynamic programming algorithm that finds sentence mappings according to a generative model with sentence lengths as the only observable features. Their model describes the process of generating characters in the target language from characters in the source language. They assume that the number of characters generated follows a pre-defined distribution, independent of context. In order to estimate this distribution they computed the character ratio of a small trilingual corpus (French, English, German) and found a value that was close to one (1.1 for German/English and 1.06 for French/English). They also plotted the frequencies of length differences in their aligned parallel data in order to check the density distribution.

Finding the Optimal Alignment

Define a cost function cost(a) over an alignment A = (a1, a2, ...) consisting of linked sentence groups ai = ({s1, .., sn}, {t1, .., tm}):
• penalize mismatch in string lengths
• penalize uncommon alignment types
• ...

Task: find the alignment A' with minimal overall cost:

A' = argmin_A Σ_i cost(ai)

Find a cost function based on length differences (penalize alignments with unexpected length differences) and use dynamic programming for efficient computation.
Gale & Church: Alignment Prior

Some alignment types are more common than others! Observed likelihoods of certain alignment types (fixed prior parameters estimated from example corpora):

P(type = 1:1) = 0.89                          (substitution)
P(type = 1:0) = P(type = 0:1) = 0.0099/2      (deletion / insertion)
P(type = 2:1) = P(type = 1:2) = 0.0891/2      (contraction / expansion)
P(type = 2:2) = 0.011                         (merging/swap)

Gale & Church: Cost Function

The length difference distribution is approximately normal. Use fixed parameters (mean = 1 and variance = 6.8) and define the cost according to the given normal distribution. The overall cost function is:

cost(ai) = −log( P(typei) P(δi | typei) )
Gale & Church: Alignment Costs — A Simple Example

Text_src (length_src)    Text_trg (length_trg)
Hej (3)                  Hi and hello (12)
Hallo (5)                Goodbye (7)
Hejdå (5)

Alignment costs of (s1, s2 → t1):
• δ(8, 12) ≈ 0.485
• P(δ | type = 2:1) = P(|X| > 0.485) ≈ 0.63
• P(type = 2:1) = 0.0891 / 2
• cost(a = (s1, s2 → t1)) = −log(0.63 × 0.0891/2) ≈ 3.58

Gale & Church: Alignment Algorithm

• define a distance measure D(m, n) based on the costs of aligning m source to n target sentences
• minimize the overall distance by finding the best alignment using dynamic programming
→ recursive definition:

D(i, j) = min {
    D(i, j−1)    + cost( {},          {tj} ),
    D(i−1, j)    + cost( {si},        {} ),
    D(i−1, j−1)  + cost( {si},        {tj} ),
    D(i−1, j−2)  + cost( {si},        {tj−1, tj} ),
    D(i−2, j−1)  + cost( {si−1, si},  {tj} ),
    D(i−2, j−2)  + cost( {si−1, si},  {tj−1, tj} )
}
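A compact sketch of the procedure described above, using the slide parameters (type priors, mean 1, variance 6.8) and a δ normalization that reproduces δ(8, 12) ≈ 0.485. The empty-pair penalty and the tie-breaking are assumptions, not part of the original algorithm description.

```python
import math

# Assumed parameters from the slides: length ratio c = 1, variance s2 = 6.8,
# and the prior probabilities of the six alignment types.
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099 / 2, (0, 1): 0.0099 / 2,
         (2, 1): 0.0891 / 2, (1, 2): 0.0891 / 2, (2, 2): 0.011}

def norm_tail(x):
    """Two-sided tail probability P(|X| > x) for a standard normal."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(x) / math.sqrt(2))))

def match_cost(src_lens, trg_lens, c=1.0, s2=6.8):
    l1, l2 = sum(src_lens), sum(trg_lens)
    if l1 + l2 == 0:
        return 50.0                                    # arbitrary penalty for empty-empty
    delta = (l2 - c * l1) / math.sqrt(s2 * (l1 + l2) / 2)
    prior = PRIOR[(len(src_lens), len(trg_lens))]
    return -math.log(max(prior * norm_tail(delta), 1e-12))

def align(src_lens, trg_lens):
    """Dynamic programming over the six Gale & Church alignment types."""
    I, J = len(src_lens), len(trg_lens)
    D, back = {(0, 0): 0.0}, {}
    moves = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
    for i in range(I + 1):
        for j in range(J + 1):
            if (i, j) == (0, 0):
                continue
            best = None
            for di, dj in moves:
                if i - di < 0 or j - dj < 0 or (i - di, j - dj) not in D:
                    continue
                c_ = D[(i - di, j - dj)] + match_cost(src_lens[i - di:i], trg_lens[j - dj:j])
                if best is None or c_ < best[0]:
                    best = (c_, (di, dj))
            D[(i, j)], back[(i, j)] = best
    alignment, i, j = [], I, J                         # trace back the best alignment
    while (i, j) != (0, 0):
        di, dj = back[(i, j)]
        alignment.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return list(reversed(alignment))

print(align([3, 5, 5], [12, 7]))   # [([0, 1], [0]), ([2], [1])]: a 2:1 match, then 1:1
```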
Gale & Church Algorithm: The Example Again

Gale & Church's length-based sentence aligner — dynamic programming:
• start with position (0,0) and no costs
• update all cells of the table using the recursive definition
• read out the alignment with the minimal costs

The slide shows the full DP cost table for the toy example, with each cell labelled by the alignment type used (substitution 1:1, insertion 0:1, deletion 1:0, expansion 1:2, contraction 2:1, merging/swap 2:2).

Resulting alignment (one 2-to-1 match and one 1-to-1 match):
Hej, Hallo → Hi and hello
Hejdå → Goodbye
Summary on Parallel Corpora
Essential training data for SMT
• translation modeling
Sentence alignment
• automatic process
• based on length correlation or lexical matching
Existing data resources
• Europarl (http://www.statmt.org/europarl/)
• OPUS (http://opus.lingfil.uu.se/)