Machine Translation

Readings:
- Speech and Language Processing, chapter 21
- Foundations of Statistical Natural Language Processing, chapter 13

An example
- Au sortir de la saison 97/98 et surtout au début de cette saison 98/99…
- With leaving season 97/98 and especially at the beginning of this season 98/99…
Challenges

1. Capture variation and similarities among languages
   - Morphology: number of morphemes per word
     - Isolating languages (Vietnamese, Cantonese): 1 word = 1 morpheme
     - Polysynthetic languages (Siberian Yupik): 1 word = a whole sentence
   - Syntax: order of words in a sentence
     - English vs. Hungarian: the man's(affix) house(head) vs. az ember ház(head)-a(affix)
     - To Yukio; Yukio ne
2. Degree to which morphemes are segmentable
3. Differences in specificity
   - English brother vs. Japanese otooto (younger), oniisan (older)
   - English wall vs. German Wand (inside), Mauer (outside)
   - German Berg vs. English hill, mountain
Conceptual space
- Lexical gap: Japanese has no word for "privacy"; English has no word for "oyakoko" (filial piety)
- S and T convey the same meaning

The three main areas for MT
- source language S → language understanding → language generation → target language T
- source-target mapping info
Language understanding
- Lexical ambiguity: English "book" → Spanish "libro" / "reservar" ⇒ use syntactic context
- Syntactic ambiguity: "I saw the guy on the hill with the telescope"
- "John's car"
- Semantic ambiguity:
  - E: While driving, John swerved and hit a tree
  - S: Mientras que John estaba manejando, se desvió y golpeó un árbol

Methods of machine translation
[Figure: the MT triangle — source s, target t; analysis a, generation g; levels of increasing abstraction from low to high: word-word, syntactic, thematic, interlingual meaning (universal)]
1. Direct approach: word-word mapping
2. Transfer approach: analyze the source, a = a(s); apply a transfer function f, g = f(a(s)); generate the target, t = g(f(a(s)))
3. Interlingua approach: translate via the interlingual meaning representation
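To make the transfer composition concrete, here is a minimal sketch in Python. The analysis, transfer, and generation rules below are invented for illustration; they are not from the slides, they only show how t = g(f(a(s))) chains the three steps.

```python
# A toy illustration (invented rules) of the transfer composition t = g(f(a(s))):
# analyze the source, transfer, then generate.

def a(source_words):
    """Analysis: attach a crude category to each source (English) word."""
    categories = {"John": "N", "loves": "V", "Mary": "N"}
    return [(w, categories.get(w, "?")) for w in source_words]

def f(analysis):
    """Transfer: map source lemmas to target lemmas, keeping the structure."""
    bilingual = {"John": "Jean", "loves": "aime", "Mary": "Marie"}
    return [(bilingual.get(w, w), cat) for (w, cat) in analysis]

def g(transferred):
    """Generation: linearize the target-side structure into a sentence."""
    return " ".join(w for (w, _cat) in transferred)

s = ["John", "loves", "Mary"]
print(g(f(a(s))))   # t = g(f(a(s)))  ->  "Jean aime Marie"
```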
The triangle/transfer diagram
[Figure: transfer shown as a list of transforms between the source-side and target-side analyses]

Interlingua approach: using meaning
- Transfer: one pair of rules per language pair
- Interlingua: objects/events (an ontology)
- Avoid an explicit description of the relations between source words and target words by mapping between concepts
The triangle diagram: types of MT
[Figure: increasing abstraction from low to high — word-word, syntactic, thematic (the transfer levels), up to interlingual meaning (universal); s = source, t = target, a = analysis, g = generation]

Statistical Machine Translation
A Statistical MT System: The Main Idea
- Treat translation as a noisy-channel problem:
  Input (source) → the channel (adds "noise") → "noisy" output (target)
  E: English words... → F: Les mots anglais...
- We are interested in rediscovering E given F.
- The translation model: P(E|F) = P(F|E) P(E) / P(F)
- After the usual simplification (P(F) is fixed):
  argmax_E P(E|F) = argmax_E P(F|E) P(E)
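A minimal sketch of that argmax, assuming we already have a language model P(E) and a translation model P(F|E) and can enumerate candidate English sentences (in practice the search itself is the hard part). The probability values below are invented toy numbers.

```python
import math

# Sketch of the noisy-channel choice E* = argmax_E P(E) * P(F|E).
# The tables below are invented toy values, not real model output.

def log_lm(e_sentence):
    """Toy language model P(E): favors the grammatical word order."""
    toy = {"the program has been implemented": math.log(0.9),
           "program the implemented been has": math.log(0.1)}
    return toy.get(e_sentence, float("-inf"))

def log_tm(f_sentence, e_sentence):
    """Toy translation model P(F|E): both candidates 'explain' F equally well."""
    return math.log(0.5)

def decode(f_sentence, candidates):
    # argmax over candidate English sentences of log P(E) + log P(F|E)
    return max(candidates, key=lambda e: log_lm(e) + log_tm(f_sentence, e))

f = "le programme a été mis en application"
candidates = ["the program has been implemented",
              "program the implemented been has"]
print(decode(f, candidates))   # the LM breaks the tie left by the TM
```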
The Necessities
- Language Model (LM): our a priori expectation of seeing a particular English (E) sentence: P(E)
- Translation Model (TM): the target sentence in French (F) given the source sentence in English: P(F|E)
- Search procedure: given F, find the best E using the LM and TM distributions.
- The usual problem: sparse data!
  - We cannot create a "sentence dictionary" E ↔ F.
  - Typically, we do not see a sentence even twice!

Alignment Idea
- The TM does not care about correct strings of English words.
- Use the "tagging" approach: 1 English word ("tag") ~ 1 French word ("word")
  → not realistic: even the number of words in the two sentences is not equal
  → use "alignment".
- Sentence alignment: find which group of sentences in one language corresponds to which group of sentences in the other language.
Sentence Alignment

Unsegmented parallel text:
  E: The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
  F: El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

Sentences detected:
  E: 1. The old man is happy.  2. He has fished many times.  3. His wife talks to him.  4. The fish are jumping.  5. The sharks await.
  F: 1. El viejo está feliz porque ha pescado muchas veces.  2. Su mujer habla con él.  3. Los tiburones esperan.

Aligned, side by side:
  E 1. The old man is happy. 2. He has fished many times.  ~  F 1. El viejo está feliz porque ha pescado muchas veces.
  E 3. His wife talks to him.                               ~  F 2. Su mujer habla con él.
  E 4. The fish are jumping.                                ~  (no corresponding F sentence)
  E 5. The sharks await.                                    ~  F 3. Los tiburones esperan.

Word alignment - easy

Difficulties:
- Crossing dependencies: the order of sentences is changed in the translation.
Word alignment - harder

Word alignment - hard

Word alignment - encoding
Positions (e0 and f0 are the empty cepts):
  e: e0  And(1)  the(2)  program(3)  has(4)  been(5)  implemented(6)
  f: f0  Le(1)  programme(2)  a(3)  été(4)  mis(5)  en(6)  application(7)

Linear notation (each word is followed by the position(s) it is aligned with on the other side):
  f-side: f0(1) Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6)
  e-side: e0 And(0) the(1) program(2) has(3) been(4) implemented(5,6,7)

Word alignment learning with EM
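One possible way to hold such an alignment in memory, using my own representation rather than any standard file format: store, for each French position j, the English position a_j it is aligned with (0 for the empty cept). The example data is the sentence pair above.

```python
# In-memory encoding of the alignment shown above: a[j] is the English position
# aligned with French word j (1-based); "And" has no French counterpart, which
# the slide's e-side notation writes as And(0), i.e. aligned with f0.

e = ["e0", "And", "the", "program", "has", "been", "implemented"]          # positions 0..6
f = ["f0", "Le", "programme", "a", "été", "mis", "en", "application"]      # positions 0..7

a = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 6, 7: 6}   # a_j for j = 1..7

for j in sorted(a):
    print(f"{f[j]}({a[j]}) -> {e[a[j]]}")
# Le(2) -> the, programme(3) -> program, a(4) -> has, été(5) -> been,
# mis(6), en(6), application(6) -> implemented
```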
Noisy channel again
[Figure: the language model P(e) generates e → the translation model P(f|e) produces f → the decoder finds ê = argmax_e P(e|f)]
Elements of a Translation Model
- Assuming that:
  - individual translations are independent
  - 1 English word maps to n French words
  - 1 French word maps to 0 or 1 English words
- P(f|e) = (1/Z) Σ_{a_1=0..l} … Σ_{a_m=0..l} Π_{j=1..m} P(f_j | e_{a_j})
  - f_j: word j of f
  - a_j: the position in e that f_j is aligned with
  - e_{a_j}: the word in e that f_j is aligned with
  - Z: a normalization constant
  - a_j = 0: word j of the French sentence is aligned with the empty cept (no overt translation)

Example
- P(Jean aime Marie | John loves Mary)
- For the alignment (Jean, John), (aime, loves), (Marie, Mary), we multiply the three corresponding translation probabilities:
  P(Jean|John) × P(aime|loves) × P(Marie|Mary)
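A small sketch of this computation for the example, with invented toy lexical probabilities and Z simply taken as 1: it sums the product Π_j P(f_j|e_{a_j}) over all (l+1)^m alignments, and separately shows the single term for the alignment named above.

```python
from itertools import product

# Sketch of P(f|e) = (1/Z) Σ_{a_1..a_m} Π_j P(f_j | e_{a_j}) for
# f = "Jean aime Marie", e = "John loves Mary".  Toy probabilities, Z = 1.

e = ["NULL", "John", "loves", "Mary"]      # e_0 is the empty cept
f = ["Jean", "aime", "Marie"]

t = {("Jean", "John"): 0.9, ("aime", "loves"): 0.9, ("Marie", "Mary"): 0.9}
def p(fj, ei):
    return t.get((fj, ei), 0.01)           # small floor for unlisted pairs

# Sum over all (l+1)^m alignments a = (a_1, ..., a_m), a_j in {0, ..., l}
total = 0.0
for a in product(range(len(e)), repeat=len(f)):
    term = 1.0
    for j, a_j in enumerate(a):
        term *= p(f[j], e[a_j])
    total += term

# The single alignment (Jean,John), (aime,loves), (Marie,Mary) contributes:
one = p("Jean", "John") * p("aime", "loves") * p("Marie", "Mary")
print(one)     # 0.729
print(total)   # 0.729 plus the much smaller terms of the other 63 alignments
```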
EM algorithm
- E-step:
  - Initialize P(w_f|w_e) at random.
  - Compute the expected number of times we will find w_f in the French sentence given that we have w_e in the English sentence:
    z_{w_f,w_e} = Σ_{(e,f) s.t. w_e ∈ e, w_f ∈ f} P(w_f|w_e)
- M-step:
  - Re-estimate the translation probabilities from these expectations:
    P(w_f|w_e) = z_{w_f,w_e} / Σ_v z_{v,w_e}

Decoder
- ê = argmax_e P(e|f) = argmax_e P(e) P(f|e) / P(f) = argmax_e P(e) P(f|e)
- Problem: the search space is infinite!
- Heuristics:
  - stack search: build the translation incrementally, keeping a stack of partial translation hypotheses
  - use some measure of correspondence, e.g., chambre/house (but this can be misleading if one word often occurs with another, e.g., commune/house, because of "Chambre des Communes" (House of Commons))
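A sketch of the E-step and M-step as written above, run on an invented two-pair corpus. It follows the slide's simplified count (summing the current P(w_f|w_e) over sentence pairs that contain both words); a full IBM-style E-step would additionally normalize over the English words competing for each French word within a sentence.

```python
from collections import defaultdict

# Sketch of the E-step / M-step on a toy corpus.

corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"],  ["le", "livre"]),
]

e_vocab = {we for e_s, _ in corpus for we in e_s}
f_vocab = {wf for _, f_s in corpus for wf in f_s}

# Initialize P(wf|we) uniformly (the slide says "at random").
p = {(wf, we): 1.0 / len(f_vocab) for wf in f_vocab for we in e_vocab}

for _ in range(3):
    # E-step: z(wf,we) = sum over (e,f) pairs with we in e and wf in f of P(wf|we)
    z = defaultdict(float)
    for e_sent, f_sent in corpus:
        for we in e_sent:
            for wf in f_sent:
                z[(wf, we)] += p[(wf, we)]
    # M-step: P(wf|we) = z(wf,we) / sum_v z(v,we)
    norm = defaultdict(float)
    for (wf, we), count in z.items():
        norm[we] += count
    for (wf, we) in p:
        if norm[we] > 0:
            p[(wf, we)] = z[(wf, we)] / norm[we]

print(p[("maison", "house")], p[("le", "house")])   # 0.5 0.0: mass moves to co-occurring words
```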
Evaluation
- Evaluated on the aligned Hansard corpus: 48% of French sentences were translated correctly.
- Two kinds of errors:
  - Incorrect translation:
    - Permettez que je donne un exemple à la Chambre
    - "Let me give an example in the House" (incorrect decoding)
    - (should be: "Let me give the House an example")
  - Ungrammatical translation:
    - Vous avez besoin de toute l'aide disponible
    - "You need all of the benefits available" (ungrammatical decoding)
    - (should be: "You need all the help you can get")
Reasons
- Distortion: penalize alignments in which an English word at the beginning of the sentence is matched with a French word at the end – distortion in the alignment decreases the probability of the alignment.
- Fertility: the mapping between French words and English words (1-to-1, 1-to-2, 1-to-0, …)
  - E.g., fertility(farmers) in the corpus is 2, since it is most often translated as two words: "les agriculteurs"
  - to go → aller
- Independence assumptions: an unfair advantage for short sentences, because fewer probabilities are multiplied
  ⇒ multiply the final likelihood by a constant that increases with the length of the sentence
- Sensitivity to training data: small changes in the model cause large changes in the parameter estimates
  - E.g., P(le|the) varies from 0.610 to 0.497
- Efficiency: sentences longer than 30 words are excluded – the search is exponential

Lack of linguistic knowledge
- No phrases: cannot deal with "to go" ↔ "aller"
- Non-local dependencies: e.g., "is she a mathematician"
- Morphology: morphologically related words are treated as separate symbols
- Sparse data: estimates for rare words are unreliable (and so those words are excluded)
Another Alignment System
- Assumes an available corpus of parallel text (translation E ↔ F)
- Sentence alignment
  - sentence (boundary) detection
  - sentence alignment
- Word alignment
  - tokenization
  - word alignment (with restrictions)

Sentence Boundary Detection
- Sentence breaks:
  - paragraphs (if marked)
  - certain characters: ?, !, ; (...almost surely a break)
  - problem: the period "." — it can be:
    - an end of sentence (... left yesterday. He was heading to...)
    - a decimal point: 3.6 (three point six)
    - a thousands separator: 3.200
    - an abbreviation: cf., e.g., Calif., Mt., Mr.
    - an ellipsis: ...
    - in other languages, an ordinal-number marker (2nd ~ 2.)
    - initials: A. B. Smith
- Approaches: rules and lists; statistical methods, e.g., Maximum Entropy
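A minimal rule-and-list splitter illustrating the period ambiguities just listed. The abbreviation list and the heuristics are illustrative only; a real system would use larger lists or a statistical (e.g., Maximum Entropy) model.

```python
import re

ABBREVIATIONS = {"cf.", "e.g.", "calif.", "mt.", "mr."}   # toy list

def split_sentences(text):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith(("?", "!", ";")):              # almost surely a break
            sentences.append(" ".join(current)); current = []
        elif tok.endswith("."):
            if tok.lower() in ABBREVIATIONS:           # abbreviation: no break
                continue
            if re.fullmatch(r"\d+\.\d+", tok):         # decimal / 3.200-style number
                continue
            if re.fullmatch(r"[A-Z]\.", tok):          # initials: A. B. Smith
                continue
            sentences.append(" ".join(current)); current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Mr. Smith left yesterday. He was heading to Calif. with 3.6 kg of fish."))
# ['Mr. Smith left yesterday.', 'He was heading to Calif. with 3.6 kg of fish.']
```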
Sentence Alignment
- The problem: only sentence boundaries have been detected (E and F are unaligned sequences of sentences).
- Desired output: a segmentation with an equal number of segments, spanning the whole text continuously; the original sentence boundaries are kept.
- Alignments obtained: 2-1, 1-1, 1-1, 2-2, 2-1, 0-1

Alignment Methods
- Several methods (probabilistic and not):
  - character-length based
  - word-length based
  - "cognates" (word identity is used):
    - using an existing dictionary (F: prendre ~ E: make, take)
    - using word "distance" (similarity): names, numbers, borrowed words, Latin-origin words, ...
- Best performing: statistical, word- or character-length based (perhaps with some word information)
The Alignment Task
- First, define the problem probabilistically:
  argmax_A P(A|E,F) = argmax_A P(A,E,F)   (E,F fixed)
- Define a "bead": a short group of E sentences aligned with a short group of F sentences (a 2:2 bead in the figure).
- Approximate: P(A,E,F) ≅ Π_{i=1..n} P(B_i),
  where B_i is a bead and P(B_i) does not depend on the rest of E,F.

Length-based Alignment
- Given P(A,E,F) ≅ Π_{i=1..n} P(B_i), find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data.
- Some definitions:
  - B_i = p:qα_i, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2} describes the type of alignment
  - Pref(i,j): the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j)
Recursive Definition
- Initialize: Pref(0,0) = 0.
- Pref(i,j) = max (
    Pref(i,j-1) P(0:1α_k),
    Pref(i-1,j) P(1:0α_k),
    Pref(i-1,j-1) P(1:1α_k),
    Pref(i-1,j-2) P(1:2α_k),
    Pref(i-2,j-1) P(2:1α_k),
    Pref(i-2,j-2) P(2:2α_k) )
- This is enough for a Viterbi-like search.
- It remains to define P(p:qα_k): k refers to the "next" bead, with segments of p and q sentences, of lengths l_{k,e} and l_{k,f}.

Probability of a Bead
- Use a normal distribution for the length variation:
  P(p:qα_k) = P(δ(l_{k,e}, l_{k,f}, μ, σ²), p:q) ≅ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) P(p:q)
  δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} - μ l_{k,e}) / √(l_{k,e} σ²)
- Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data.
- Words etc. might be used as better clues in the definition of P(p:qα_k).
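A sketch of the Pref(i,j) recursion in log space (so the initialization Pref(0,0) = 0 holds as written above). The bead priors P(p:q), the length parameters μ and σ², and the sentence lengths are illustrative values, not estimates from real data; only the best score is returned, without backtracking.

```python
import math

P_PQ = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01,
        (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}   # toy bead priors P(p:q)
MU, SIGMA2 = 1.0, 6.8                                  # toy length parameters

def log_bead(e_lens, f_lens, i, j, p, q):
    """log P(p:q alpha_k) for the bead ending at E sentence i, F sentence j."""
    le = sum(e_lens[i - p:i])
    lf = sum(f_lens[j - q:j])
    if p == 0 or q == 0:
        length_term = 0.0                        # no length evidence for 0:1 / 1:0
    else:
        delta = (lf - MU * le) / math.sqrt(le * SIGMA2)
        length_term = -0.5 * delta * delta       # log of an unnormalized Gaussian
    return math.log(P_PQ[(p, q)]) + length_term

def best_alignment_score(e_lens, f_lens):
    n, m = len(e_lens), len(f_lens)
    NEG = float("-inf")
    pref = [[NEG] * (m + 1) for _ in range(n + 1)]
    pref[0][0] = 0.0                             # log Pref(0,0) = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for (p, q) in P_PQ:                  # try every bead type ending at (i, j)
                if i >= p and j >= q and pref[i - p][j - q] > NEG:
                    score = pref[i - p][j - q] + log_bead(e_lens, f_lens, i, j, p, q)
                    pref[i][j] = max(pref[i][j], score)
    return pref[n][m]

# Character lengths of the sentences in a toy (E, F) paragraph pair: 2:1 then 1:1.
print(best_alignment_score([21, 28, 25], [52, 26]))
```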
Word Alignment
- Length alone does not help here:
  - words can be swapped, and mutual translations often have vastly different lengths.
- Idea:
  - Assume some (simple) translation model.
  - Find its parameters by considering virtually all alignments.
  - Once we have the parameters, find the best alignment given those parameters.

Word Alignment Algorithm
1. Start with a sentence-aligned corpus. Let (E,F) be a pair of sentences (actually, a bead). Initialize p(f|e) randomly, for f ∈ F, e ∈ E.
2. Compute expected counts over the corpus:
   c(f,e) = Σ_{(E,F); e∈E, f∈F} p(f|e)
   (for every aligned pair (E,F): if e occurs in E and f occurs in F, add p(f|e))
3. Re-estimate: p(f|e) = c(f,e) / c(e),   where c(e) = Σ_f c(f,e)
4. Iterate until the change in p(f|e) is small.
Best Alignment
- Select, for each (E,F):
  A* = argmax_A P(A|F,E) = argmax_A P(F,A|E)/P(F) = argmax_A P(F,A|E)
     = argmax_A (ε / (l+1)^m) Π_{j=1..m} p(f_j|e_{a_j})
     = argmax_A Π_{j=1..m} p(f_j|e_{a_j})
- Use dynamic programming, a Viterbi-like algorithm.
- Recompute p(f|e).
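A small sketch of that argmax with invented toy probabilities. Because the product Π_j p(f_j|e_{a_j}) factorizes over the French positions, each a_j can be chosen independently here; the Viterbi-style dynamic programming mentioned above matters once distortion or fertility terms link the choices.

```python
e = ["NULL", "John", "loves", "Mary"]        # position 0 is the empty cept
f = ["Jean", "aime", "Marie"]

t = {("Jean", "John"): 0.9, ("aime", "loves"): 0.8, ("Marie", "Mary"): 0.9}
def p(fj, ei):
    return t.get((fj, ei), 0.01)             # small floor for unlisted pairs

# For each French word, pick the English position with the highest p(f_j | e_i).
best_a = [max(range(len(e)), key=lambda i, fj=fj: p(fj, e[i])) for fj in f]
print(best_a)                                        # [1, 2, 3]
print([(fj, e[i]) for fj, i in zip(f, best_a)])      # Jean->John, aime->loves, Marie->Mary
```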
Evaluation
- The system works well (it has won some ARPA translation bake-offs).
- Weaknesses:
  - local optimum
  - P(t) is a long-distance trigram model, so the optimal translation of some part of the source sentence is potentially dependent on every preceding word in the translation
  - searching takes exponential time (A*)
  - generation time grows rapidly as sentence length increases
  ⇒ making translation tree-based rather than word-based