Tree-based translation
LING 575, Lecture 5
Kristina Toutanova, MSR & UW
April 27, 2010
With materials borrowed from Philipp Koehn, Chris Quirk, David Chiang, Dekai Wu, and Aria Haghighi

Overview
- Motivation
- Synchronous context-free grammars (SCFGs)
  - Example derivations
- ITG grammars
  - Reordering for ITG grammars
  - Applications of bracketing ITG grammars: ITGs for word alignment
- Hierarchical phrase-based translation with Hiero
  - Examples of reordering/translation phenomena
  - Rule extraction
  - Model features
- Decoding for SCFGs and integrating a language model

Motivation for tree-based translation
- Phrases capture contextual translation and local reordering surprisingly well.
- However, this information is brittle: knowing that 本書的作者 translates as "author of the book" tells us nothing about how to translate "author of the pamphlet" or "author of the play".
- The general pattern is that the Chinese phrase "NOUN1 的 NOUN2" becomes "NOUN2 of NOUN1" in English.

Motivation for tree-based translation
- There are general principles that a phrase-based system is not using:
  - Some languages place adjectives before nouns, some after.
  - Some languages place prepositions before nouns, some after.
  - Some languages put PPs before the head, others after.
  - Some languages place relative clauses before the head, others after.
- Discontinuous translations are not handled well by phrase-based systems ("ne ... pas" in French, the German verb split).

Types of tree-based systems
- Formally tree-based, but not using linguistic syntax
  - Can still model the hierarchical nature of language and capture hierarchical reordering.
  - Examples: phrase-based ITGs and Hiero (the focus of this lecture).
- Using linguistic syntax on the source side, the target side, or both
  - Phrase-structure trees, dependency trees.
  - Covered next lecture.

Synchronous context-free grammars
- A generalization of context-free grammars.
- [Example slides from David Chiang's ACL 2006 tutorial: a context-free grammar (example in Japanese) and the corresponding synchronous CFG.]

Rules with probabilities
- Each rule carries the joint probability of the source and target rewrites, given the non-terminal on the left.
- Could also use the conditional probability of target given source, or of source given target.
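To make the linked non-terminals in synchronous rules concrete, here is a minimal sketch in Python (not from the lecture; the rule, symbols, and toy strings are purely illustrative) of how a single synchronous rewrite produces a source string and a target string at the same time.

```python
# A minimal sketch, assuming nothing beyond the SCFG definition in the slides.
# Non-terminal occurrences are written as (symbol, index) pairs, so the same
# index on the source and target sides marks linked (co-indexed) slots.
# The example rule and toy strings below are made up for illustration.

def apply_rule(src_rhs, tgt_rhs, fillers):
    """Expand one synchronous rule.

    src_rhs, tgt_rhs: tuples mixing terminal strings and (nonterm, index) pairs.
    fillers: dict index -> (source string, target string) already derived for
             each linked non-terminal.
    Returns the paired (source, target) strings produced by this rewrite.
    """
    def expand(rhs, side):
        out = []
        for sym in rhs:
            if isinstance(sym, tuple):          # linked non-terminal slot
                out.append(fillers[sym[1]][side])
            else:                               # terminal word
                out.append(sym)
        return " ".join(out)
    return expand(src_rhs, 0), expand(tgt_rhs, 1)

# Toy rule for the pattern discussed above: "NOUN1 de NOUN2" <-> "NOUN2 of NOUN1".
src = (("NP", 1), "de", ("NP", 2))
tgt = (("NP", 2), "of", ("NP", 1))
print(apply_rule(src, tgt, {1: ("ben shu", "the book"), 2: ("zuozhe", "author")}))
# -> ('ben shu de zuozhe', 'author of the book')
```

In a stochastic SCFG, each such rule would additionally carry a probability, and the score of a full derivation would multiply the probabilities of the rules it uses.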
Synchronous CFGs
- [Step-by-step example of a synchronous derivation, from David Chiang's ACL 2006 tutorial slides.]

Inversion Transduction Grammars (ITGs)
- Stochastic Inversion Transduction Grammars [Wu 97]: a restricted form of SCFGs.
- Productions have one of the following forms:
  - A → ⟨B1 C2, B1 C2⟩, abbreviated A → [B C]   (straight order)
  - A → ⟨B1 C2, C2 B1⟩, abbreviated A → <B C>   (inverted order)
  - A → ⟨x, y⟩
  - A → ⟨x, ε⟩
  - A → ⟨ε, y⟩
  - S → ⟨ε, ε⟩
- Rules are at most binary, with either the same or reversed order on the target side.
- The right-hand side contains either only non-terminals or only terminals.
- This is a normal form for ITG grammars.

Bracketing ITG grammars
- Use a minimal number of non-terminal symbols.
- Do not capture linguistic syntax, but can be used to explain word alignment and translation:
  - A → ⟨A1 A2, A1 A2⟩, or A → [A A]
  - A → ⟨A1 A2, A2 A1⟩, or A → <A A>
  - A → ⟨x, y⟩
  - A → ⟨x, ε⟩
  - A → ⟨ε, y⟩
- Can be extended to allow direct generation of one-to-many or many-to-many blocks (Block ITG): A → ⟨x̄, ȳ⟩ for word sequences x̄, ȳ.

Reordering in a bracketing ITG grammar
- Because contiguous sequences are assumed to move hierarchically, the space of possible word alignments between sentence pairs is limited.
- Start with a bracketing ITG grammar and allow any foreign word to translate to any English word or to the empty string:
  - A → ⟨f, e⟩, A → ⟨f, ε⟩, A → ⟨ε, e⟩
- A possible alignment is then one that results from a synchronous parse of the source and target with this grammar.

Example re-ordering with ITG
- Suppose the grammar includes A → ⟨1, 1⟩, A → ⟨2, 2⟩, A → ⟨3, 3⟩, A → ⟨4, 4⟩.
- Can the bracketing ITG generate the sentence pair [1, 2, 3, 4] / [1, 2, 3, 4]? Yes, for example:
  - A1 → [A2 A3], A2 → [A4 A5], A4 → [A6 A7], A6 → ⟨1, 1⟩, A7 → ⟨2, 2⟩, A5 → ⟨3, 3⟩, A3 → ⟨4, 4⟩
- Are there other synchronous parses of this sentence pair? (Yes: any binary bracketing of the monotone order works.)
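Which re-orderings the bracketing ITG allows can be checked mechanically. Below is a minimal sketch (my own illustration, not part of the lecture) of the dynamic program: a source span is an ITG constituent if it is a single word, or if it splits into two ITG constituents whose target images together form one contiguous block, in straight or inverted order.

```python
# A minimal sketch of an ITG-parsability check for permutations; the function
# name and representation are assumptions made for this illustration.
from functools import lru_cache
from itertools import permutations

def itg_parsable(perm):
    """perm[i] = target position of source word i (a permutation of 0..n-1)."""
    n = len(perm)

    @lru_cache(maxsize=None)
    def ok(i, j):
        # Is the source span [i, j) a bracketing-ITG constituent?
        if j - i == 1:
            return True
        span = perm[i:j]
        if max(span) - min(span) != j - i - 1:
            return False                      # target image is not contiguous
        return any(ok(i, k) and ok(k, j) for k in range(i + 1, j))

    return ok(0, n)

# Over all permutations of 4 words, exactly two "inside-out" orders fail,
# matching the count quoted on the following slides.
print(sum(itg_parsable(p) for p in permutations(range(4))))   # 22
```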
Example re-ordering with ITG (continued)
- [A second synchronous parse of the pair [1, 2, 3, 4] / [1, 2, 3, 4].]
- Other re-orderings also have parses; in the tree diagrams, a horizontal bar over a node means the two child non-terminals are swapped.
- But some re-orderings are not allowed: those in which words move "inside-out".
- 22 out of the 24 permutations of 4 words are parsable by the bracketing ITG.
- [Table: number of permutations of each length compared to the number parsable by ITG.]

Applications of ITGs
- ITGs have been applied to word alignment and translation in many previous works.
- One recent interesting example is Haghighi et al.'s 2009 paper on supervised word alignment with block ITGs: Aria Haghighi, John Blitzer, John DeNero, and Dan Klein, "Better word alignments with supervised ITG models".
- [Figure: comparison of oracle alignment error rate (AER) for different alignment spaces — all alignments, 1-to-1 alignments, ITG alignments. From Haghighi et al. 2009.]
- [Figure: Block ITG — adding one-to-many alignments. From Haghighi et al. 2009.]
- [Figure: oracle AER comparison including the block ITG space. From Haghighi et al. 2009.]
- [Figure: alignment performance using the discriminative model. From Haghighi et al. 2009.]

Training for maximum likelihood
- The results so far used MIRA, which only requires finding the best alignment under the model; this is efficient under both 1-to-1 and ITG models.
- Training for maximum likelihood under a log-linear model instead requires summing over all possible alignments.
- This sum is tractable for ITGs (via bitext parsing, discussed below) — one of the big advantages of ITGs.
- [Figure: MIRA versus maximum likelihood training.]

Algorithms for SCFGs
- Translation with synchronous CFGs.
- Bi-text parsing with synchronous CFGs.

Review: CKY parsing for CFGs in CNF
- [Slides from David Chiang's ACL 2006 tutorial.]
- Start with spans of length one and construct the possible constituents.
- Continue with spans of length 2, building constituents from words and already-constructed constituents.
- Proceed to spans of length 3, then 4, and so on.
- The best S constituent covering the whole sentence is the final output.
- Complexity: cubic in sentence length (times a grammar factor).

Translation with an SCFG
- Parse the source sentence with the source side of the grammar, then read off the target side of the derivation.
- [Example from David Chiang's ACL 2006 tutorial.]

Bi-text parsing
- Given a sentence pair, find a synchronous derivation that covers both sentences.
- We consider SCFGs with at most two non-terminals on the right-hand side (rank 2).
- [Worked example from David Chiang's ACL 2006 tutorial.]

Bi-text parsing for grammars with higher rank
- There is no Chomsky Normal Form for synchronous CFGs of rank 4 or higher in the general case.
- With higher-rank grammars we can still translate efficiently: convert the source-side CFG to CNF, parse, flatten the trees back, and translate.
- Not so for bi-text parsing: in general it is exponential in the rank of the grammar and polynomial in sentence length.
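As a concrete reference point for the CKY review above, here is a minimal recognizer sketch (the toy grammar and word strings are made up for illustration). Bi-text parsing generalizes the same chart so that each item covers a source span and a target span simultaneously, which is where the O(n^6) cost for rank-2 grammars comes from.

```python
# A minimal sketch, assuming the grammar is already in Chomsky Normal Form and
# is given as plain Python dicts; symbol names here are illustrative only.

def cky_recognize(words, unary, binary, start="S"):
    """unary:  dict terminal -> set of non-terminals A with A -> terminal
       binary: dict (B, C)   -> set of non-terminals A with A -> B C"""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                       # spans of length 1
        chart[i][i + 1] = set(unary.get(w, ()))
    for length in range(2, n + 1):                      # longer spans, bottom up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                   # split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary.get((B, C), set())
    return start in chart[0][n]

# Tiny made-up grammar: S -> NP VP, VP -> NP V (SOV order) or V NP, plus lexical rules.
unary = {"watashi": {"NP"}, "hako": {"NP"}, "akemasu": {"V"}}
binary = {("NP", "VP"): {"S"}, ("NP", "V"): {"VP"}, ("V", "NP"): {"VP"}}
print(cky_recognize("watashi hako akemasu".split(), unary, binary))   # True
```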
Hierarchical phrase-based translation
David Chiang, ISI / USC

Hierarchical phrase-based translation: overview
- Motivation
- Extracting rules
- Scoring derivations
- Decoding without an LM
- Decoding with an LM

Motivation: review of phrase-based models
- Segment the input into a sequence of phrases.
- Translate each phrase.
- Re-order the phrases, depending on distortion and perhaps on the lexical content of the phrases.

Properties of phrase-based models
- Local re-ordering is captured within phrases for frequently occurring groups of words.
- Global re-ordering is not modeled well.
- Only contiguous translations are learned.

Chinese-English example
- Reference: "Australia is one of the few countries that have diplomatic relations with North Korea."
- The output from a phrase-based system captured some reordering through phrase translation and phrase re-ordering, but did not re-order the relative clause and the noun phrase.

Idea: hierarchical phrases
- ⟨yu X1 you X2, have X2 with X1⟩
- The variables stand for corresponding hierarchical phrases.
- This captures the fact that PP phrases tend to come before the verb in Chinese and after the verb in English.
- The rule serves as both a discontinuous phrase pair and a re-ordering rule.

Other example hierarchical phrases
- ⟨X1 de X2, the X2 that X1⟩: Chinese relative clauses modify NPs on the left, English relative clauses modify NPs on the right.
- ⟨X1 zhiyi, one of X1⟩

A synchronous CFG for the example
- Only one non-terminal X is used, plus the start symbol S:
  - X → ⟨yu X1 you X2, have X2 with X1⟩
  - X → ⟨X1 de X2, the X2 that X1⟩
  - X → ⟨X1 zhiyi, one of X1⟩
  - X → ⟨Aozhou, Australia⟩
  - X → ⟨Beihan, North Korea⟩
  - X → ⟨shi, is⟩
  - X → ⟨bangjiao, diplomatic relations⟩
  - X → ⟨shaoshu guojia, few countries⟩
  - S → ⟨S1 X2, S1 X2⟩   [glue rule]
  - S → ⟨X1, X1⟩

General approach
- Align the parallel training data using word-alignment models (e.g. GIZA++).
- Extract hierarchical phrase pairs; these can be represented as SCFG rules.
- Assign probabilities (scores) to rules: as in log-linear models for phrase-based MT, various features on rules can be combined into rule scores.
- Translate new sentences by parsing with the SCFG grammar, integrating a language model.
- [Example derivation.]

Extracting hierarchical phrases
- Start with contiguous phrase pairs, as in phrasal SMT models (called initial phrase pairs).
- Make rules for these phrase pairs and add them to the rule set extracted from this sentence pair.
- Then, for every rule of the sentence pair and every initial phrase pair contained in it, replace the initial phrase pair by a non-terminal and add the new rule (a sketch of this subtraction step follows below).
- [Another example: a hierarchical phrase versus traditional phrases.]

Constraining the grammar rules
- This method generates too many phrase pairs and leads to spurious ambiguity.
- Constraints are placed on the set of allowable rules for robustness and speed.

Adding glue rules
- For continuity with phrase-based models, add glue rules which can split the source into phrases and translate each:
  - S → ⟨S1 X2, S1 X2⟩
  - S → ⟨X1, X1⟩
- Question: if we only have conventional phrase pairs and these two rules, what system do we have?
- Question: what do we get if we also add the rules X → ⟨X1 X2, X1 X2⟩ and X → ⟨X1 X2, X2 X1⟩?
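Returning to the extraction step sketched above, the following is a minimal illustration (not Chiang's actual implementation; the span indices and the word-aligned toy sentence pair are assumptions made for the example) of subtracting one initial phrase pair from a larger one to form a hierarchical rule.

```python
# A minimal sketch of the rule-subtraction step: given an initial phrase pair
# and a smaller initial phrase pair nested inside it, replace the inner pair by
# a linked non-terminal to obtain a hierarchical rule. Phrase pairs are given as
# (src_start, src_end, tgt_start, tgt_end) spans over a word-aligned sentence
# pair; the example spans below are made up for illustration.

def subtract(src, tgt, outer, inner, index=1):
    """Return the SCFG rule obtained by cutting `inner` out of `outer`."""
    si, sj, ti, tj = outer
    ii, ij, ki, kj = inner
    assert si <= ii and ij <= sj and ti <= ki and kj <= tj, "inner must nest in outer"
    nt = f"X{index}"
    src_rhs = src[si:ii] + [nt] + src[ij:sj]
    tgt_rhs = tgt[ti:ki] + [nt] + tgt[kj:tj]
    return ("X", tuple(src_rhs), tuple(tgt_rhs))

# Toy word-aligned pair (romanized Chinese / English).
src = "yu Beihan you bangjiao".split()
tgt = "have diplomatic relations with North Korea".split()
outer = (0, 4, 0, 6)    # the whole initial phrase pair
inner = (1, 2, 4, 6)    # Beihan <-> North Korea
print(subtract(src, tgt, outer, inner))
# -> ('X', ('yu', 'X1', 'you', 'bangjiao'), ('have', 'diplomatic', 'relations', 'with', 'X1'))
```

Subtracting the remaining aligned sub-phrase (bangjiao <-> diplomatic relations) from the result would yield the two-gap rule X → ⟨yu X1 you X2, have X2 with X1⟩ from the example grammar above.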
Assigning scores to derivations
- A derivation D is a paired parse tree for the source and target sentences.
- As in phrase-based models, we choose the derivation that maximizes the score; the derivation determines a target sentence, which is returned as the translation.
- There are multiple derivations of any given target sentence, but we do not sum over them; as in phrase-based models, we approximate the sum with a max.
- The model combines feature functions on rules with a language model feature:
  P(D) ∝ ∏_i φ_i(D)^{λ_i}
       = P_LM(e)^{λ_LM} · ∏_{i ≠ LM} ∏_{(X → ⟨γ, α⟩) ∈ D} φ_i(X → ⟨γ, α⟩)^{λ_i}
- Except for the language model, this can be represented as the weight of a derivation under a weighted SCFG:
  w(X → ⟨γ, α⟩) = ∏_{i ≠ LM} φ_i(X → ⟨γ, α⟩)^{λ_i}
  w(D) = ∏_{(X → ⟨γ, α⟩) ∈ D} w(X → ⟨γ, α⟩)
  P(D) ∝ P_LM(e)^{λ_LM} × w(D)

Features
- Rule probabilities in two directions: P(γ | α) and P(α | γ).
- Lexical weighting in two directions: P_lex(γ | α) and P_lex(α | γ).
- Exponential of rule count, word count, and glue-rule count.

Estimating feature values and feature weights
- The translation probabilities in the two directions are estimated heuristically from relative frequencies of rule counts, as in phrase-based models:
  - a count of one from each sentence for each initial phrase pair;
  - for each initial phrase pair in a sentence, a fractional (equal) count for each rule obtained from it by subtracting sub-phrases;
  - relative frequency of these counts: P(γ | α) = c(X → ⟨γ, α⟩) / Σ_{γ'} c(X → ⟨γ', α⟩).
- The feature weights λ are set by minimum error rate training, as in phrase-based models: maximize the BLEU score on a development set.

Finding the best translation: decoding
- In the absence of a language model feature, the score is simply
  w(D) = ∏_{(X → ⟨γ, α⟩) ∈ D} w(X → ⟨γ, α⟩),
  which can be represented by a weighted SCFG.
- Parse the source sentence with the source side of the grammar using CKY (cubic time in sentence length), then read off the target sentence assembled from the corresponding target-side derivation.

Finding the best translation including an LM
- Method 1: generate k-best derivations without the LM, then rescore with the LM. May need an extremely large k-best list to reach the highest-scoring derivation once the LM is included.
- Method 2: integrate the LM into the grammar by intersecting the target side of the SCFG with the LM. This causes a very large expansion of the rule set, with time O(n^3 |T|^{4(m−1)}) for an m-gram LM over target vocabulary T.
- Method 3: integrate the LM while parsing with the SCFG, using cube pruning to generate k-best LM-scored translations at each span.

Parsing with Hiero grammars
- A modification of CKY which does not require conversion to Chomsky Normal Form.
- Parsing as weighted deduction (without the LM, using the source-side grammar); the goal is to prove the item [S, 0, n].
- [Pseudo-code for parsing.]
- If two items are equivalent (same span, same non-terminal), they are merged; for k-best generation, keep pointers to all ways of generating the item, together with their weights.

K-best derivation generation
- To generate a k-best list for some item in the chart (e.g. an X spanning positions 5 to 8), we need to consider the top k combinations of the rules used to form the X and the sub-items used by those rules — e.g. the top 4 rules applying at span (5, 8) combined with the target sides of the top 3 derivations of a sub-item spanning (6, 8).
- Naïve method: generate all combinations, sort them, and return the top k.
- Faster method: we do not need to generate all combinations to get the top k.
- K-best combinations of two lists: see the sketch below.
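Here is a minimal sketch of the faster method (my own illustration; the lecture's slides give their own pseudo-code): finding the k best combinations of two score-sorted lists without enumerating all pairs, by lazily expanding a frontier from the best corner of the grid. This is the core operation that cube pruning performs at each chart cell, additionally re-scoring each popped combination with the language model.

```python
# A minimal sketch; the scores below are illustrative log-scores, higher = better.
import heapq

def k_best_combinations(a, b, k):
    """a, b: lists of scores sorted from best (highest) to worst.
       Returns up to k (score, i, j) triples with score = a[i] + b[j]."""
    if not a or not b:
        return []
    seen = {(0, 0)}
    frontier = [(-(a[0] + b[0]), 0, 0)]          # max-heap via negated scores
    out = []
    while frontier and len(out) < k:
        neg, i, j = heapq.heappop(frontier)
        out.append((-neg, i, j))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # grid neighbours of the popped cell
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (-(a[ni] + b[nj]), ni, nj))
    return out

# E.g. combining the scores of 4 rules with the scores of 3 sub-derivations:
print(k_best_combinations([0.0, -1.0, -2.5, -4.0], [0.0, -0.5, -3.0], k=3))
# -> [(0.0, 0, 0), (-0.5, 0, 1), (-1.0, 1, 0)]
```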
Integrating an LM with cube pruning
- [Worked example slides.]
- [Results using different LM integration methods.]
- [Comparison to a phrase-based system.]

Summary
- Introduced stochastic synchronous CFGs.
  - Translation is efficient using CKY-style parsing.
  - Bi-text parsing is O(n^6) for rank-2 grammars, and slower for rank 4 and up.
- Introduced ITGs as a tractable special case.
  - The bracketing ITG, which uses only one non-terminal, is used for word alignment.
  - The block bracketing ITG is used for improved supervised alignment, and also for phrase-based translation.
- An LM can be integrated efficiently using cube pruning.

Summary
- Described hierarchical phrase-based translation.
  - Uses hierarchical rules encoding phrase re-ordering and discontinuous lexical correspondence.
  - The rules include traditional contiguous phrase pairs.
  - Translation without an LM is efficient using SCFG parsing.
  - Outperforms phrase-based models for several language pairs.
  - Hiero is implemented in Moses.

References
- David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 2007.
- David Chiang. An introduction to synchronous grammars. Notes and slides from the ACL 2006 tutorial.
- Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 1997.
- Aria Haghighi, John Blitzer, John DeNero, and Dan Klein. Better word alignments with supervised ITG models. ACL 2009.
- Many other interesting papers use ITGs and extensions to Hiero; some will be added to the course web page.

Next lecture
- Chris Quirk will talk about SMT systems using linguistic syntax.
  - Using syntax on the source side and on the target side.
  - Different types of syntactic analysis.
  - Other types of synchronous grammars.
- The list of readings will be updated.