
THE MATHEMATICS OF
STATISTICAL MACHINE
TRANSLATION
Sriraman M Tallam
The Problem

• The problem of machine translation is discussed.
• Five statistical models are proposed for the translation process.
• Algorithms for estimating their parameters are described.
• For the learning process, pairs of sentences that are translations of one another are used.
• Previous work shows statistical methods to be useful in achieving linguistically interesting goals.
  • A natural extension: matching up words within pairs of aligned sentences.
• Results show the power of statistical methods in extracting linguistically interesting correlations.
Sriraman Tallam
July 28, 2017
2
Statistical Translation

• Warren Weaver first suggested the use of statistical techniques for machine translation. [Weaver 1955]

  Pr(e|f) = Pr(e) Pr(f|e) / Pr(f)

• Fundamental Equation for Machine Translation:

  ê = argmax_e Pr(e) Pr(f|e)
Statistical Translation
• A translator writing a French sentence, even a native speaker, conceives an English sentence and then mentally translates it.
  • Machine translation's goal is to recover that English sentence.
• The equation summarizes the 3 computational challenges presented by statistical translation:
  • Language Model Probability Estimation - Pr(e)
  • Translation Model Probability Estimation - Pr(f|e)
  • Search Problem - maximizing their product
• Why not reverse the direction of the translation models ?
  • Class Discussion !!
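The three sub-problems fit together as a noisy-channel decoder. A minimal sketch in Python, where the candidate list and the `lm`/`tm` probability tables are made-up stand-ins for a trained language model and translation model:

```python
# Toy noisy-channel decoder: e_hat = argmax_e Pr(e) * Pr(f|e).
# The candidates and probability tables are hypothetical illustrations,
# not trained models.

def decode(f, candidates, lm, tm):
    """Return the English candidate maximizing Pr(e) * Pr(f|e)."""
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

lm = {"what could we have done": 0.02,
      "what we could have done": 0.005}        # language model Pr(e)
tm = {("qu' aurions-nous pu faire", "what could we have done"): 0.3,
      ("qu' aurions-nous pu faire", "what we could have done"): 0.4}  # Pr(f|e)

best = decode("qu' aurions-nous pu faire", list(lm), lm, tm)
print(best)  # here the language model outweighs the translation model
```

Note that the decoder never needs Pr(f): it is constant over candidates, so the argmax ignores it.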
Alignments
What is a translation ?
• A pair of strings that are translations of one another:
  • (Qu' aurions-nous pu faire ? | What could we have done ?)

What is an alignment ?
Alignments

• The mapping in an alignment can range from one-to-one to many-to-many.
• The alignment in the figure is expressed as
  • (Le programme a été mis en application | And the(1) program(2) has(3) been(4) implemented(5,6,7)).
• The alignment below, though acceptable, has a lower probability:
  • (Le programme a été mis en application | And(1,2,3,4,5,6,7) the program has been implemented).
• A(e,f) is the set of alignments of (f|e).
  • If e has length l and f has length m, there are 2^(lm) alignments in all.
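The 2^(lm) count can be checked directly for small sentences: an alignment in this general sense is any subset of the l × m possible word-to-word connections, so enumerating every on/off pattern over the connection grid recovers the formula. A small sketch:

```python
from itertools import product

# An alignment in the general sense is any subset of the l*m possible
# connections between English and French word positions, hence
# 2**(l*m) alignments in all.
def count_alignments(l, m):
    # Enumerate every on/off pattern over the l x m connection grid.
    return sum(1 for _ in product((0, 1), repeat=l * m))

l, m = 2, 3
print(count_alignments(l, m))  # 2**6 = 64
```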
Cepts
What is a cept ?
• A cept expresses the fact that each word is related to a concept; in a figurative sense, a sentence is a web of concepts woven together.
• The cepts in the example are "The", "poor", and "don't have any money".
• There is also the notion of an empty cept.
Translation Models

• Five translation models have been developed.
• Each model is a recipe for computing Pr(f|e), which is called the likelihood of the translation (f,e).
• The likelihood is a function of many parameters (!).
• The idea is to guess values for these parameters and to apply the EM algorithm iteratively.
Translation Models
Models 1 and 2:
• All possible lengths of the French string are equally likely.
• In Model 1, all connections for each French position are equally likely.
• In Model 2, connection probabilities are more realistic.
• These models often lead to unsatisfactory alignments.

Models 3, 4 and 5:
• No assumptions on the length of the French string.
• Models 3 and 4 make more realistic assumptions about the connection probabilities.
• Models 1 - 4 are a stepping stone for the training of Model 5.
• Start with Model 1 for initial estimates and pipe them through Models 2 - 5.
Translation Models
• The likelihood of f | e is the sum, over all elements of A(e,f), of the joint likelihoods:

  Pr(f | e) = Σ_{a ∈ A(e,f)} Pr(f, a | e)

• Then, Pr(f, a | e) is built up in three steps:
  • choose the length of the French string given the English;
  • for each French word position, choose the alignment, given the previous alignments and words;
  • choose the identity of the word at this position given our knowledge of the previous alignments and words.
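The three-step generative story can be sketched as a toy sampler; the length distribution and translation table below are hypothetical, chosen only to make the steps concrete:

```python
import random

# Toy sampler following the three-step generative story for Pr(f, a | e):
# choose m, then each connection a_j, then each word f_j.
# The length range and translation table are hypothetical.

def sample_translation(e, t, rng):
    words = ["NULL"] + e                      # position 0 is the empty cept
    m = rng.randint(len(e) - 1, len(e) + 1)   # step 1: French string length
    f, a = [], []
    for j in range(m):
        a_j = rng.randrange(len(words))       # step 2: pick a connection
        a.append(a_j)
        f.append(rng.choice(t[words[a_j]]))   # step 3: pick the French word
    return f, a

t = {"NULL": ["de"], "the": ["le", "la"], "house": ["maison"]}
rng = random.Random(0)
f, a = sample_translation(["the", "house"], t, rng)
print(f, a)
```

In the real models these choices are governed by the learned distributions rather than the uniform draws used here.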
Model 1
Assumptions

• We assume Pr(m | e) = ε, a constant independent of e and m.
  • All reasonable lengths of the French string are equally likely.
• Also, Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) depends only on l:
  • All connections are equally likely, and for a word there are (l + 1) possible connections, so this quantity equals (l + 1)^{-1}.
• Pr(f_j | a_1^{j}, f_1^{j-1}, m, e) = t(f_j | e_{a_j}) is called the translation probability of f_j given e_{a_j}.
Model 1

• The joint likelihood function for Model 1 is

  Pr(f, a | e) = ε / (l + 1)^m · Π_{j=1}^{m} t(f_j | e_{a_j})

  for j = 1 … m, with each a_j ranging over 0 … l.

• Therefore,

  Pr(f | e) = ε / (l + 1)^m · Σ_{a_1=0}^{l} … Σ_{a_m=0}^{l} Π_{j=1}^{m} t(f_j | e_{a_j})

• subject to the constraints

  Σ_f t(f | e) = 1  for each English word e.
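Because each a_j is chosen independently and uniformly, the sum over all (l + 1)^m alignments factorizes into a product of per-position sums. A sketch verifying this on a toy translation table (the probabilities are made up):

```python
from itertools import product

# Model 1 likelihood: Pr(f|e) = eps/(l+1)^m * sum over alignments of
# prod_j t(f_j | e_{a_j}). Since each a_j is chosen independently, the
# sum factorizes into prod_j sum_i t(f_j | e_i).

def likelihood_bruteforce(f, e, t, eps=1.0):
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    total = 0.0
    for a in product(range(l + 1), repeat=m):   # every alignment
        p = 1.0
        for j, fw in enumerate(f):
            p *= t.get((fw, e[a[j]]), 0.0)
        total += p
    return eps / (l + 1) ** m * total

def likelihood_factored(f, e, t, eps=1.0):
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    p = 1.0
    for fw in f:
        p *= sum(t.get((fw, ew), 0.0) for ew in e)
    return eps / (l + 1) ** m * p

t = {("la", "the"): 0.7, ("la", "NULL"): 0.1,
     ("maison", "house"): 0.8, ("maison", "the"): 0.1}
f, e = ["la", "maison"], ["the", "house"]
print(likelihood_factored(f, e, t))   # 0.8 * 0.9 / 3**2 = 0.08
```

The factorized form is what makes Model 1 training tractable: it avoids the exponential sum over alignments.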
Model 1

• Using the technique of Lagrange multipliers (one multiplier λ_e per constraint) and setting the derivative to zero gives

  t(f | e) = λ_e^{-1} Σ_{a ∈ A(e,f)} Pr(f, a | e) Σ_{j=1}^{m} δ(f, f_j) δ(e, e_{a_j})

• Since t(f | e) appears on both sides, the EM algorithm is applied repeatedly, with
  • parameters θ = t(f | e)
  • observed data X = f, e, l
  • hidden data Y = the set of a_j
• The expected number of times e connects to f is

  c(f | e; f, e) = Σ_a Pr(a | f, e) Σ_{j=1}^{m} δ(f, f_j) δ(e, e_{a_j})
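One EM iteration for Model 1 can be sketched in a few lines: the E-step distributes each French word's count over the English words in proportion to the current t(f | e), and the M-step renormalizes the counts. The two-sentence corpus is a made-up illustration:

```python
from collections import defaultdict

# One EM iteration for Model 1.
# E-step: expected connection counts c(f|e); M-step: renormalize into t(f|e).

def em_step(corpus, t):
    count = defaultdict(float)
    total = defaultdict(float)
    for f_sent, e_sent in corpus:
        e = ["NULL"] + e_sent                    # include the empty cept
        for fw in f_sent:
            norm = sum(t[(fw, ew)] for ew in e)  # Pr of fw under current t
            for ew in e:
                c = t[(fw, ew)] / norm           # expected count for this pair
                count[(fw, ew)] += c
                total[ew] += c
    return defaultdict(float,
                       {k: v / total[k[1]] for k, v in count.items()})

corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
vocab_f = {"la", "maison", "fleur"}
vocab_e = {"NULL", "the", "house", "flower"}
t = defaultdict(float, {(fw, ew): 1 / len(vocab_f)
                        for fw in vocab_f for ew in vocab_e})  # uniform start
for _ in range(10):
    t = em_step(corpus, t)
print(t[("la", "the")])  # grows toward 1 as EM sharpens the counts
```

After a few iterations the counts sharpen: "maison" locks on to "house", "fleur" to "flower", and "la" to "the".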
Model 1 -> Model 2
• Model 1 does not take into account where words appear in either string:
  • all connections are equally probable.
• In Model 2, alignment probabilities a(a_j | j, m, l) are introduced, which satisfy the constraints

  Σ_{i=0}^{l} a(i | j, m, l) = 1  for each j, m, l.
Model 2

• The likelihood function now is

  Pr(f | e) = ε Σ_{a_1=0}^{l} … Σ_{a_m=0}^{l} Π_{j=1}^{m} t(f_j | e_{a_j}) a(a_j | j, m, l)

• and the auxiliary (cost) function adds Lagrange-multiplier terms for both the translation-probability and alignment-probability constraints.
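Model 2's likelihood still factorizes, with a(i | j, m, l) replacing the uniform 1/(l + 1). A sketch with hypothetical tables that favor the diagonal:

```python
# Model 2 likelihood: like Model 1, but each connection a_j = i carries
# its own probability a(i | j, m, l) instead of the uniform 1/(l+1).
# The sum over alignments still factorizes. Tables below are hypothetical.

def model2_likelihood(f, e, t, align, eps=1.0):
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    p = eps
    for j, fw in enumerate(f, start=1):       # French positions are 1-indexed
        p *= sum(align.get((i, j, m, l), 0.0) * t.get((fw, e[i]), 0.0)
                 for i in range(l + 1))
    return p

t = {("la", "the"): 0.9, ("maison", "house"): 0.9}
# a(i | j, m, l): most mass on the diagonal connection
align = {(0, 1, 2, 2): 0.1, (1, 1, 2, 2): 0.8, (2, 1, 2, 2): 0.1,
         (0, 2, 2, 2): 0.1, (1, 2, 2, 2): 0.1, (2, 2, 2, 2): 0.8}
print(model2_likelihood(["la", "maison"], ["the", "house"], t, align))
```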
Fertility and Tablet

• The fertility of an English word is the number of French words it is connected to - φ_i
• Each English word translates to a set of French words called its tablet - T_i
• The collection of tablets is the tableau - T.
• The final French string is a permutation of the words in the tableau - π.
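Fertilities and tablets are easy to read off an alignment. A sketch using the slides' example, where "implemented" covers "mis en application" (the alignment vector a maps each French position to an English position, with 0 reserved for the empty cept):

```python
from collections import defaultdict

# Recover fertilities phi_i and tablets T_i from an alignment a, where
# a[j] is the English position generating French position j (0 = empty cept).

def fertilities_and_tablets(f, e, a):
    e = ["NULL"] + e
    tablets = defaultdict(list)
    for j, fw in enumerate(f):
        tablets[a[j]].append(fw)              # T_i: French words of cept i
    fert = [len(tablets[i]) for i in range(len(e))]   # phi_i per position
    return fert, dict(tablets)

f = ["le", "programme", "a", "ete", "mis", "en", "application"]
e = ["the", "program", "has", "been", "implemented"]
a = [1, 2, 3, 4, 5, 5, 5]        # "implemented" -> "mis en application"
fert, tablets = fertilities_and_tablets(f, e, a)
print(fert)        # [0, 1, 1, 1, 1, 3]  -- "implemented" has fertility 3
print(tablets[5])  # ['mis', 'en', 'application']
```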
Joint Likelihood of a Tableau and
Permutation

• The joint likelihood of a tableau T and permutation π is built up by the chain rule: first the fertility of each English word, then the words in each tablet, then the positions of those words:

  Pr(T, π | e) = Π_{i=1}^{l} Pr(φ_i | φ_1^{i-1}, e) × Π_{i=0}^{l} Π_{k=1}^{φ_i} Pr(τ_{ik} | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^{l}, e) × Π_{i=1}^{l} Π_{k=1}^{φ_i} Pr(π_{ik} | π_{i1}^{k-1}, π_1^{i-1}, τ_0^{l}, φ_0^{l}, e)

• and Pr(f, a | e) is obtained by summing Pr(T, π | e) over all pairs (T, π) that yield the translation (f, a).
Model 3
Assumptions

• The fertility probability of an English word depends only on the word:

  Pr(φ_i | φ_1^{i-1}, e) = n(φ_i | e_i)

• The translation probability is

  Pr(τ_{ik} | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^{l}, e) = t(τ_{ik} | e_i)

• The distortion probability is

  Pr(π_{ik} | π_{i1}^{k-1}, π_1^{i-1}, τ_0^{l}, φ_0^{l}, e) = d(π_{ik} | i, m, l)
Model 3

• The likelihood function for Model 3 is now

  Pr(f | e) = Σ_a C(m - φ_0, φ_0) p_0^{m - 2φ_0} p_1^{φ_0} Π_{i=1}^{l} n(φ_i | e_i) φ_i! Π_{j=1}^{m} t(f_j | e_{a_j}) d(j | a_j, m, l)

  where p_1 = 1 - p_0 governs generation from the empty cept and C(·,·) is the binomial coefficient.
Deficiency of Model 3
• The fertility of word i does not depend on the fertilities of previous words.
  • As a result, the model does not always concentrate its probability on events of interest - it is deficient.
• This deficiency is not a serious problem.
• It might decrease the probability of all well-formed strings by a constant factor.
Model 4
• Model 4 allows phrases in the English string to move and be translated as units in the French string.
  • Model 3 doesn't account for this well, because of its word-by-word movement.
• The distortion probabilities are conditioned on A and B, which are functions (classes) of the English and French words respectively.
• Using this, the model accounts for facts such as an adjective appearing before a noun in English but after it in French. - THIS IS GOOD !
Model 4

• For example, "implemented" produces "mis en application", all occurring together, whereas "not" produces "ne … pas", which occurs with a word in between.
• So, d_{>1}(2 | B(pas)) is relatively large compared to d_{>1}(2 | B(en)).
• Models 3 and 4 are both deficient: words can be placed before the first position or beyond the last position in the French string. Model 5 removes this deficiency.
Model 5

• They define v_j to be the number of vacancies up to and including position j just before forming the words of the ith cept.
• The distortion probabilities are then conditioned on these vacancy counts, so that probability is assigned only to placements into vacant positions.
• Model 5 is powerful but must be used in tandem with the other 4 models.
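The vacancy counts v_j can be sketched directly: scan the French positions and count those not yet occupied by earlier cepts (the occupied set below is hypothetical):

```python
# v_j: number of vacant French positions up to and including j, computed
# just before placing the words of the next cept. A sketch of the
# vacancy bookkeeping that Model 5 conditions its distortions on.

def vacancies(occupied, m):
    """Return v[j] for j = 1..m (1-indexed as in the slides)."""
    v, count = [0], 0
    for j in range(1, m + 1):
        if j not in occupied:
            count += 1
        v.append(count)
    return v

m = 7
occupied = {2, 5}          # positions already filled by earlier cepts
v = vacancies(occupied, m)
print(v[1:])  # [1, 1, 2, 3, 3, 4, 5]
```

Restricting the next word to one of the v_m vacant positions is exactly what prevents the out-of-range placements that make Models 3 and 4 deficient.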
Results
Changing Viterbi Alignments with
Iterations
Key Points from Results

• Words like "nodding" have a large fertility because they don't slip gracefully into French.
• Words like "should" do not have a fertility greater than one, but they translate into many different possible words, so their translation probability is spread more thinly.
• Words like "the" sometimes have zero fertility, since English prefers an article in some places where French does not.