CS 585, Oct 29 2015 - Alignments
From Kevin Knight 1997, AI Magazine
Name: _____________________________________
These are sentence pairs in two made-up languages, Centauri and Arcturan.
Learn the translation dictionary and word alignments.
The translation dictionary is mostly unambiguous.
An initial dictionary is given on the bottom left.
New entries in the translation dictionary:
-------------------------------------------------
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
-------------------------------------------------
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
-------------------------------------------------
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

Sentence pair 3 is much more challenging. So far, we have

[Figure: partial alignment of erok sprok izok hihok ghirok with totat dat arrat vat hilat]

The Centauri word izok would be translated as either totat, arrat, or vat,
yet when you look at izok in sentence pair 6, none of those three words
appear in the Arcturan. Therefore, izok appears to be ambiguous. The word
hihok, however, is fixed in sentence pair 11 as arrat. Both sentence pairs 3
and 12 have izok hihok sitting directly on top of arrat vat; so, in all
possibility, vat seems a reasonable translation for (ambiguous) izok.
Sentence pairs 5, 6, and 9 suggest that quat is its other translation.
Through process of elimination, you connect the words erok and totat,
finishing off the analysis.
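
The elimination argument for hihok and izok can be sketched in code. This is only an illustration under a simplifying assumption: a word with a single translation must have that translation present in every sentence pair that contains the word. The helper name `candidates` is hypothetical, and the pairs are copied from the worksheet.

```python
# Sentence pairs from the worksheet that contain izok or hihok.
PAIRS = {
    3: ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    5: ("wiwok farok izok stok", "totat jjat quat cat"),
    6: ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    7: ("lalok farok ororok lalok sprok izok enemok",
        "wat jjat bichat wat dat vat eneat"),
    9: ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    11: ("lalok nok crrrok hihok yorok zanzanok",
         "wat nnat arrat mat zanzanat"),
    12: ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
}


def candidates(word, pairs):
    """Arcturan words that occur in every pair whose Centauri side has `word`."""
    sets = [
        set(arcturan.split())
        for centauri, arcturan in pairs.values()
        if word in centauri.split()
    ]
    return set.intersection(*sets) if sets else set()


# hihok keeps exactly one candidate; izok keeps none, which is the
# signal that it cannot have a single fixed translation.
print("hihok:", candidates("hihok", PAIRS))
print("izok:", candidates("izok", PAIRS))
```

An empty candidate set does not say what the translations are, only that one translation cannot explain all the pairs; finding vat and quat still takes the pair-by-pair reasoning in the text.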
-------------------------------------------------
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
-------------------------------------------------
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
-------------------------------------------------
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
-------------------------------------------------
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
-------------------------------------------------
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
-------------------------------------------------
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
-------------------------------------------------
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
-------------------------------------------------
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
-------------------------------------------------
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
-------------------------------------------------
Translation dictionary:
ghirok - hilat
ok-drubel - at-drubel
ok-voon - at-voon
ok-yurp - at-yurp
zanzanok - zanzanat

[Figure: completed word alignment for sentence pair 3]
erok - totat
sprok - dat
izok - vat
hihok - arrat
ghirok - hilat

Notice that aligning the sentence pairs helps you to build the translation
dictionary and that building the translation dictionary also helps you decide
on correct alignments. You might call this the decipherment method.
Figure 3 shows the progress so far. With a ballpoint pen and some patience,
you can carry this reasoning to its logical end, leading to the following
translation dictionary:
anok - pippat
brok - lat
clok - bat
crrrok - (none?)
drok - sat
enemok - eneat
erok - totat
farok - jjat
ghirok - hilat
hihok - arrat
izok - vat/quat
jok - krat
kantok - oloat
lalok - wat/iat
mok - gat
nok - nnat
ok-drubel - at-drubel
ok-voon - at-voon
ok-yurp - at-yurp
ororok - bichat
plok - rrat
rarok - forat
sprok - dat
stok - cat
wiwok - totat
yorok - mat
zanzanok - zanzanat
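
One way to sanity-check dictionary entries like these is to align a sentence pair mechanically. The sketch below is an illustration, not Knight's procedure: the `align` helper and its greedy left-to-right matching are assumptions, and `DICTIONARY` holds only the entries needed for sentence pairs 3 and 11.

```python
# Subset of the translation dictionary; a list holds alternative translations.
DICTIONARY = {
    "erok": ["totat"],
    "sprok": ["dat"],
    "izok": ["vat", "quat"],
    "hihok": ["arrat"],
    "ghirok": ["hilat"],
    "lalok": ["wat", "iat"],
    "nok": ["nnat"],
    "crrrok": [],  # listed as "(none?)" in the dictionary
    "yorok": ["mat"],
    "zanzanok": ["zanzanat"],
}


def align(centauri, arcturan, dictionary):
    """Greedily link each Centauri word to an unused Arcturan translation.

    Returns (links, leftovers): links is a list of (centauri, arcturan)
    tuples; leftovers lists Centauri words with no translation available
    in this sentence.
    """
    unused = list(arcturan)
    links, leftovers = [], []
    for word in centauri:
        options = dictionary.get(word, [])
        match = next((t for t in options if t in unused), None)
        if match is None:
            leftovers.append(word)
        else:
            unused.remove(match)
            links.append((word, match))
    return links, leftovers


pair3 = ("erok sprok izok hihok ghirok".split(),
         "totat dat arrat vat hilat".split())
pair11 = ("lalok nok crrrok hihok yorok zanzanok".split(),
          "wat nnat arrat mat zanzanat".split())

links3, left3 = align(pair3[0], pair3[1], DICTIONARY)
links11, left11 = align(pair11[0], pair11[1], DICTIONARY)
print(links3, left3)
print(links11, left11)
```

Greedy matching is enough here because no two Centauri words in these sentences compete for the same Arcturan word; a sentence where they did would need a proper matching step.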
The dictionary shows ambiguous Centauri words (such as izok) and ambiguous
Arcturan words (such as totat). It also contains a curious Centauri word
(crrrok) that has no translation; after the alignment of sentence pair 11,
this word was somehow left over:

lalok   nok    crrrok   hihok   yorok   zanzanok
wat     nnat            arrat   mat     zanzanat

Figure 2. Twelve Pairs of Sentences Written in Imaginary Centauri and Arcturan Languages.

You begin to speculate whether crrrok is some kind of an affix, or crrrok
hihok is a polite form of hihok, but you are suddenly whisked away by an
alien spacecraft and put to work in the Interstellar Translation Bureau,
where you are immediately tasked with translating the
EM for Model 1
Here there are 4 words in both the foreign and English vocabularies. There are 3 sentences in
the training data. Assume no NULLs. Initialize the translation parameters to be uniform:

         das    ein    Buch   Haus
the      0.25   0.25   0.25   0.25
a        0.25   0.25   0.25   0.25
book     0.25   0.25   0.25   0.25
house    0.25   0.25   0.25   0.25

t(f|e): Translation probs. Every row is one t(f|e) prob dist.
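
The uniform starting table can be written down directly. This is a minimal sketch; the nested-dict layout `t[e][f]` for t(f|e) and the helper name `uniform_table` are assumptions made for illustration.

```python
ENGLISH = ["the", "a", "book", "house"]
GERMAN = ["das", "ein", "Buch", "Haus"]


def uniform_table(english, german):
    """Build t[e][f] = t(f|e) with every row uniform over the German words."""
    p = 1.0 / len(german)
    return {e: {f: p for f in german} for e in english}


t = uniform_table(ENGLISH, GERMAN)
for e in ENGLISH:
    print(e, t[e])
```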
1a. E-step: Given t(f|e), calculate posterior alignments over the training data.
Each German word came from one English word in the sentence. Which?
Each cell is p(column word came from row word).

the house / das Haus
         das    Haus
the      .5     .5
house    .5     .5

the book / das Buch
         das    Buch
the      .5     .5
book     .5     .5

a book / ein Buch
         ein    Buch
a        .5     .5
book     .5     .5

p(Buch from "book") = t(Buch | book) / ( t(Buch | book) + t(Buch | a) )
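
The E-step formula can be applied to every link at once. The sketch below assumes the nested-dict layout `t[e][f]` for t(f|e) and a hypothetical helper `e_step`; the three training pairs and the uniform table are taken from the worksheet.

```python
# Training pairs as (English words, German words).
PAIRS = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
ENGLISH = ["the", "a", "book", "house"]
GERMAN = ["das", "ein", "Buch", "Haus"]

# Uniform starting table: t[e][f] stands for t(f|e).
t = {e: {f: 0.25 for f in GERMAN} for e in ENGLISH}


def e_step(pairs, t):
    """Return one dict per sentence pair mapping (f, e) to p(f came from e)."""
    result = []
    for english, german in pairs:
        post = {}
        for f in german:
            # Normalize over the English words in this sentence (no NULL word).
            total = sum(t[e][f] for e in english)
            for e in english:
                post[(f, e)] = t[e][f] / total
        result.append(post)
    return result


posteriors = e_step(PAIRS, t)
print(posteriors[2][("Buch", "book")])
```

Because the table is uniform and every sentence has two English words, each link gets the same posterior in this first pass.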
1b. M-step: Given these posterior alignments,
(1) calculate fractional translation counts, then (2) normalize into a new translation probability table.
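
tcount(f|e): Translation COUNTS
         das    ein    Buch   Haus
the      ____   ____   ____   ____
a        ____   ____   ____   ____
book     ____   ____   ____   ____
house    ____   ____   ____   ____

t(f|e): Translation PROBS
         das    ein    Buch   Haus
the      ____   ____   ____   ____
a        ____   ____   ____   ____
book     ____   ____   ____   ____
house    ____   ____   ____   ____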
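
The two M-step operations, summing fractional counts and then normalizing each row, can be sketched as follows. The helper `m_step` and the data layout are assumptions for illustration; the posteriors are the first-round values, where every link has probability 0.5. Word pairs that never co-occur are simply absent from the result and should be read as zero.

```python
from collections import defaultdict

PAIRS = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]

# First-round posteriors: (German word, English word) -> 0.5 for every link.
posteriors = [
    {(f, e): 0.5 for f in german for e in english}
    for english, german in PAIRS
]


def m_step(posteriors):
    """Return (tcount, t): fractional counts and row-normalized probabilities."""
    tcount = defaultdict(lambda: defaultdict(float))
    for post in posteriors:
        for (f, e), p in post.items():
            tcount[e][f] += p  # (1) accumulate fractional counts
    t = {}
    for e, row in tcount.items():
        total = sum(row.values())
        t[e] = {f: c / total for f, c in row.items()}  # (2) normalize the row
    return tcount, t


tcount, t_new = m_step(posteriors)
for e in t_new:
    print(e, dict(tcount[e]), t_new[e])
```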
2a. E-step
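
the house / das Haus
         das    Haus
the      ____   ____
house    ____   ____

the book / das Buch
         das    Buch
the      ____   ____
book     ____   ____

a book / ein Buch
         ein    Buch
a        ____   ____
book     ____   ____
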
2b. M-step
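
tcount(f|e): Translation COUNTS
         das    ein    Buch   Haus
the      ____   ____   ____   ____
a        ____   ____   ____   ____
book     ____   ____   ____   ____
house    ____   ____   ____   ____

t(f|e): Translation PROBS
         das    ein    Buch   Haus
the      ____   ____   ____   ____
a        ____   ____   ____   ____
book     ____   ____   ____   ____
house    ____   ____   ____   ____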
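
To check hand-computed tables, both rounds can be run as a loop. This is a minimal sketch of EM for Model 1 under the worksheet's assumptions (no NULL word, uniform start); the function name `train` and the dict layout `t[e][f]` for t(f|e) are hypothetical choices.

```python
PAIRS = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]


def train(pairs, iterations):
    """Run `iterations` rounds of E-step plus M-step and return t[e][f]."""
    english_vocab = sorted({e for english, _ in pairs for e in english})
    german_vocab = sorted({f for _, german in pairs for f in german})
    uniform = 1.0 / len(german_vocab)
    t = {e: {f: uniform for f in german_vocab} for e in english_vocab}

    for _ in range(iterations):
        tcount = {e: {f: 0.0 for f in german_vocab} for e in english_vocab}
        # E-step: add each link's posterior to the fractional counts.
        for english, german in pairs:
            for f in german:
                total = sum(t[e][f] for e in english)
                for e in english:
                    tcount[e][f] += t[e][f] / total
        # M-step: normalize each English word's row into a distribution.
        for e in english_vocab:
            row_total = sum(tcount[e].values())
            t[e] = {f: c / row_total for f, c in tcount[e].items()}
    return t


t2 = train(PAIRS, 2)
for e, row in t2.items():
    print(e, {f: round(p, 3) for f, p in row.items()})
```

Raising the iteration count shows how the table keeps sharpening beyond the two rounds done by hand.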