CS 585, Oct 29 2015 - Alignments
From Kevin Knight 1997, AI Magazine

Name: _____________________________________

These are sentence pairs in the (Centauri, Arcturan) made-up languages. Learn the translation dictionary and word alignments. The translation dictionary is mostly nonambiguous. An initial dictionary is given below the sentence pairs.

-------------------------------------------------
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
-------------------------------------------------
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
-------------------------------------------------
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
-------------------------------------------------
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
-------------------------------------------------
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
-------------------------------------------------
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
-------------------------------------------------
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
-------------------------------------------------
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
-------------------------------------------------
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
-------------------------------------------------
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
-------------------------------------------------
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
-------------------------------------------------
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
-------------------------------------------------

Figure 2. Twelve Pairs of Sentences Written in Imaginary Centauri and Arcturan Languages.

Translation dictionary:
ghirok - hilat
ok-drubel - at-drubel
ok-voon - at-voon
ok-yurp - at-yurp
zanzanok - zanzanat

New entries in the translation dictionary:


Sentence pair 3 is much more challenging. So far, we have

Centauri: erok sprok izok hihok ghirok
Arcturan: totat dat arrat vat hilat

The Centauri word izok would be translated as either totat, arrat, or vat, yet when you look at izok in sentence pair 6, none of those three words appear in the Arcturan. Therefore, izok appears to be ambiguous. The word hihok, however, is fixed in sentence pair 11 as arrat. Both sentence pairs 3 and 12 have izok hihok sitting directly on top of arrat vat; so, in all possibility, vat seems a reasonable translation for (ambiguous) izok. Sentence pairs 5, 6, and 9 suggest that quat is its other translation. Through process of elimination, you connect the words erok and totat, finishing off the analysis:

erok - totat
sprok - dat
izok - vat
hihok - arrat
ghirok - hilat

Notice that aligning the sentence pairs helps you to build the translation dictionary and that building the translation dictionary also helps you decide on correct alignments. You might call this the decipherment method. Figure 3 shows the progress so far. With a ballpoint pen and some patience, you can carry this reasoning to its logical end, leading to the following translation dictionary:

anok - pippat
brok - lat
clok - bat
crrrok - (none?)
drok - sat
enemok - eneat
erok - totat
farok - jjat
ghirok - hilat
hihok - arrat
izok - vat/quat
jok - krat
kantok - oloat
lalok - wat/iat
mok - gat
nok - nnat
ok-drubel - at-drubel
ok-voon - at-voon
ok-yurp - at-yurp
ororok - bichat
plok - rrat
rarok - forat
sprok - dat
stok - cat
wiwok - totat
yorok - mat
zanzanok - zanzanat

The dictionary shows ambiguous Centauri words (such as izok) and ambiguous Arcturan words (such as totat). It also contains a curious Centauri word (crrrok) that has no translation; after the alignment of sentence pair 11, this word was somehow left over:

Centauri: lalok nok crrrok hihok yorok zanzanok
Arcturan: wat nnat arrat mat zanzanat

You begin to speculate whether crrrok is some kind of an affix, or crrrok hihok is a polite form of hihok, but you are suddenly whisked away by an alien spacecraft and put to work in the Interstellar Translation Bureau, where you are immediately tasked with translating the ...


EM for Model 1

Here there are 4 words in both the foreign and English vocabularies. There are 3 sentences in the training data. Assume no NULLs. Initialize the translation parameters to be uniform:

t(f|e): Translation probs. Every row is one t(f|e) prob dist.

          das     ein     Buch    Haus
the       0.25    0.25    0.25    0.25
a         0.25    0.25    0.25    0.25
book      0.25    0.25    0.25    0.25
house     0.25    0.25    0.25    0.25

1a. E-step: Given t(f|e), calculate posterior alignments over the training data. Each German word came from one English word in the sentence. Which?

the house / das Haus:   das-the .5   das-house .5   Haus-the .5   Haus-house .5
the book / das Buch:    das-the .5   das-book .5    Buch-the .5   Buch-book .5
a book / ein Buch:      ein-a .5     ein-book .5    Buch-a .5     Buch-book .5

p(Buch from "book") = t(Buch | book) / ( t(Buch | book) + t(Buch | a) )

1b. M-step: Given these posterior alignments, (1) calculate fractional translation counts, then (2) normalize into a new translation probability table.

tcount(f|e): Translation COUNTS

          das     ein     Buch    Haus
the       ____    ____    ____    ____
a         ____    ____    ____    ____
book      ____    ____    ____    ____
house     ____    ____    ____    ____

t(f|e): Translation PROBS

          das     ein     Buch    Haus
the       ____    ____    ____    ____
a         ____    ____    ____    ____
book      ____    ____    ____    ____
house     ____    ____    ____    ____

2a. E-step

the house / das Haus:
the book / das Buch:
a book / ein Buch:

2b. M-step

tcount(f|e): Translation COUNTS

          das     ein     Buch    Haus
the       ____    ____    ____    ____
a         ____    ____    ____    ____
book      ____    ____    ____    ____
house     ____    ____    ____    ____

t(f|e): Translation PROBS

          das     ein     Buch    Haus
the       ____    ____    ____    ____
a         ____    ____    ____    ____
book      ____    ____    ____    ____
house     ____    ____    ____    ____
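The E-step and M-step of the EM for Model 1 exercise can be checked with a short program. The following is a minimal Python sketch, assuming the three training pairs from the worksheet and no NULL word. The names `CORPUS`, `init_uniform`, and `em_step` are hypothetical and chosen only for illustration. The table is stored as `t[e][f]`, meaning t(f|e), so each `t[e]` is one row of the probability table.

```python
from collections import defaultdict

# Training data from the worksheet: (English words, German words) per sentence pair.
CORPUS = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]


def init_uniform(corpus):
    """Return t[e][f] = 1/|F| for every English word e and German word f."""
    e_vocab = sorted({e for es, _ in corpus for e in es})
    f_vocab = sorted({f for _, fs in corpus for f in fs})
    return {e: {f: 1.0 / len(f_vocab) for f in f_vocab} for e in e_vocab}


def em_step(corpus, t):
    """Run one E-step and one M-step; return (new_t, tcount)."""
    tcount = {e: defaultdict(float) for e in t}
    for es, fs in corpus:
        for f in fs:
            # E-step: posterior that f came from each English word e in this
            # sentence, e.g. t(Buch|book) / (t(Buch|book) + t(Buch|a)).
            z = sum(t[e][f] for e in es)
            for e in es:
                # Accumulate the posterior as a fractional translation count.
                tcount[e][f] += t[e][f] / z
    # M-step: normalize each row of counts into a probability distribution.
    new_t = {}
    for e, row in tcount.items():
        total = sum(row.values())
        new_t[e] = {f: row.get(f, 0.0) / total for f in t[e]}
    return new_t, {e: dict(row) for e, row in tcount.items()}


t = init_uniform(CORPUS)        # every entry is 0.25
t, tcount = em_step(CORPUS, t)  # first iteration (steps 1a and 1b)
print(tcount["the"])            # fractional counts for the row "the"
print(t["the"]["das"])          # 0.5 after the first iteration
```

Calling `em_step` again on the returned table gives the second iteration (steps 2a and 2b). The sketch keeps counts and probabilities in plain dictionaries so the intermediate values can be compared directly with the hand-filled tables.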