
Tree-based Translation
LING 575 Lecture 5
Kristina Toutanova
MSR & UW
April 27, 2010
With materials borrowed from Philipp Koehn, Chris Quirk, David Chiang, Dekai Wu, Aria Haghighi
Overview
- Motivation
- Synchronous context-free grammars
  - Example derivations
- ITG grammars
  - Reordering for ITG grammars
  - Applications of bracketing ITG grammars
  - Applications: ITGs for word alignment
- Hierarchical phrase-based translation with Hiero
  - Examples of reordering/translation phenomena
  - Rule extraction
  - Model features
  - Decoding for SCFGs and integrating a LM
Motivation for tree-based translation
- Phrases capture contextual translation and local reordering surprisingly well
- However, this information is brittle:
  - "author of the book" → "本書的作者" tells us nothing about how to translate "author of the pamphlet" or "author of the play"
  - The Chinese phrase "NOUN1 的 NOUN2" becomes "NOUN2 of NOUN1" in English
Motivation for tree-based translation
- There are general principles a phrase-based system is not using:
  - Some languages have adjectives before nouns, some after
  - Some languages place prepositions before nouns, some after
  - Some languages put PPs before the head, others after
  - Some languages place relative clauses before the head, others after
- Discontinuous translations are not handled well by phrase-based systems
  - French negation "ne ... pas", German verb splits
Types of tree-based systems
- Formally tree-based, but not using linguistic syntax
  - Can still model the hierarchical nature of language
  - Can capture hierarchical reordering
  - Examples: phrase-based ITGs and Hiero (the focus of this lecture)
- Can use linguistic syntax on the source side, the target side, or both
  - Phrase structure trees, dependency trees
  - Next lecture
Synchronous context-free grammars
- A generalization of context-free grammars
Slide from David Chiang, ACL 2006 tutorial
Context-free grammars (example in Japanese)
Slide from David Chiang, ACL 2006 tutorial

Synchronous CFGs
Slides from David Chiang, ACL 2006 tutorial
Rules with probabilities
- Joint probability of the source and target re-writes, given the non-terminal on the left
- Could also use the conditional probability of target given source, or of source given target (a small sketch follows)
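As an illustration (not from the slides), a minimal sketch of how such probabilistic rules might be stored, and how a conditional probability could be derived from the joint. The rule strings and numbers are hypothetical:

from collections import defaultdict

# rules[lhs] = list of (source_rhs, target_rhs, probability); for each lhs
# the joint probabilities over all (source, target) pairs sum to one.
rules = defaultdict(list)
rules["NP"].append((("NP1", "de", "NP2"), ("NP2", "of", "NP1"), 0.4))
rules["NP"].append((("NP1", "de", "NP2"), ("NP1", "'s", "NP2"), 0.6))

def joint_prob(lhs, src, tgt):
    """P(src, tgt | lhs): joint probability of both re-writes."""
    return sum(p for s, t, p in rules[lhs] if s == src and t == tgt)

def cond_prob(lhs, src, tgt):
    """P(tgt | src, lhs), obtained by normalizing the joint over targets."""
    z = sum(p for s, t, p in rules[lhs] if s == src)
    return joint_prob(lhs, src, tgt) / z if z else 0.0

print(cond_prob("NP", ("NP1", "de", "NP2"), ("NP2", "of", "NP1")))  # 0.4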
Synchronous CFGs (worked derivation examples; images not reproduced)
Slides from David Chiang, ACL 2006 tutorial
Inversion Transduction Grammars (ITGs)

Stochastic Inversion Transduction Grammars [Wu 97]
- A restricted form of SCFGs
- Productions of the form:
  - A → ⟨B₁ C₂, B₁ C₂⟩, written A → [B C]  (straight order)
  - A → ⟨B₁ C₂, C₂ B₁⟩, written A → ⟨B C⟩  (inverted order)
  - A → ⟨x, y⟩
  - A → ⟨x, ε⟩
  - A → ⟨ε, y⟩
  - S → ⟨ε, ε⟩
- Rules are at most binary, in either the same or reversed order
- Either only non-terminals or only terminals appear on the right-hand side
- This is a normal form for ITG grammars
Bracketing ITG grammars
- A minimal number of non-terminal symbols
- Does not capture linguistic syntax, but can be used to explain word alignment and translation:
  - A → ⟨A₁ A₂, A₁ A₂⟩, written A → [A A]
  - A → ⟨A₁ A₂, A₂ A₁⟩, written A → ⟨A A⟩
  - A → ⟨x, y⟩
  - A → ⟨x, ε⟩
  - A → ⟨ε, y⟩
- Can be extended to allow direct generation of one-to-many or many-to-many blocks (Block ITG), with terminal rules A → ⟨x, y⟩ over word sequences rather than single words (a small sketch of the rule combinators follows)
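As an illustration (hypothetical code, not from the lecture): the two binary bracketing-ITG rules can be viewed as combinators over (source words, target words) pairs, with the inverted rule swapping only the target side:

def straight(left, right):
    """A -> [A A]: concatenate both sides in the same order."""
    return (left[0] + right[0], left[1] + right[1])

def inverted(left, right):
    """A -> <A A>: concatenate the source sides, swap the target sides."""
    return (left[0] + right[0], right[1] + left[1])

def terminal(x=None, y=None):
    """A -> <x, y>, A -> <x, eps>, or A -> <eps, y>."""
    return ([x] if x else [], [y] if y else [])

# Hypothetical example: an inverted node swaps two translated words.
pp = inverted(terminal("yu", "with"), terminal("Beihan", "North Korea"))
print(pp)  # (['yu', 'Beihan'], ['North Korea', 'with'])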
Reordering in bracketing ITG grammar
- Because contiguous sequences are assumed to move hierarchically, the space of possible word alignments between sentence pairs is limited
- Assume we start with a bracketing ITG grammar
- Allow any foreign word to translate to any English word or to the empty string:
  - A → ⟨f, e⟩,  A → ⟨f, ε⟩,  A → ⟨ε, e⟩
- Possible alignment: one that is the result of a synchronous parse of the source and target with the grammar
Example re-ordering with ITG
The grammar includes A → ⟨1,1⟩; A → ⟨2,2⟩; A → ⟨3,3⟩; A → ⟨4,4⟩
Can the bracketing ITG generate these sentence pairs?
[1,2,3,4] ↔ [1,2,3,4]
One derivation:
  A₁ → [A₂ A₃]
  A₂ → [A₄ A₅]
  A₄ → [A₆ A₇]
  A₆ → ⟨1,1⟩
  A₇ → ⟨2,2⟩
  A₅ → ⟨3,3⟩
  A₃ → ⟨4,4⟩
Example re-ordering with ITG
Are there other synchronous parses of this sentence pair?
[1,2,3,4] ↔ [1,2,3,4]

Example re-ordering with ITG
- Other re-orderings, with parses
- A horizontal bar over a node means the two non-terminals are swapped (inverted order)

But some re-orderings are not allowed
- When words move "inside-out": for four words, the permutations (2,4,1,3) and (3,1,4,2)
- 22 out of the 24 permutations of 4 words are parsable by the bracketing ITG (see the sketch below)

Number of permutations compared to ones parsable by ITG
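The table on this slide (image not reproduced) compares n! with the number of ITG-parsable permutations; the counts can be recomputed with a small brute-force check. A permutation is parsable by the bracketing ITG exactly when it can be recursively split into two parts that each cover a contiguous block of target positions (combined in straight or inverted order); such permutations are counted by the large Schröder numbers (1, 2, 6, 22, 90, 394, ...), which grow much more slowly than n!. A minimal sketch in Python:

import math
from itertools import permutations

def contiguous(seq):
    """True iff the values in seq form a gap-free range."""
    return max(seq) - min(seq) + 1 == len(seq)

def itg_parsable(seq):
    """True iff the bracketing ITG can derive this target-side order for a
    monotone source 1..n: split into two parts that each cover a contiguous
    block of target positions (straight or inverted), recursively."""
    if len(seq) == 1:
        return True
    return any(contiguous(seq[:k]) and contiguous(seq[k:])
               and itg_parsable(seq[:k]) and itg_parsable(seq[k:])
               for k in range(1, len(seq)))

for n in range(1, 7):
    ok = sum(itg_parsable(p) for p in permutations(range(1, n + 1)))
    print(f"n={n}: {ok} of {math.factorial(n)} permutations ITG-parsable")
# n=4 prints "22 of 24"; the failures are the inside-out cases
# (2,4,1,3) and (3,1,4,2)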
Application of ITGs
- ITGs have been applied to word alignment and translation in many previous works
- One interesting recent work is Haghighi et al.'s 2009 paper on supervised word alignment with block ITGs
  - Aria Haghighi, John Blitzer, John DeNero, and Dan Klein, "Better Word Alignments with Supervised ITG Models"

Comparison of oracle alignment error rate (AER) for different alignment spaces
From Haghighi et al. 09
- Space of all alignments, space of 1-to-1 alignments, space of ITG alignments
Block ITG: adding one-to-many alignments
From Haghighi et al. 09

Comparison of oracle alignment error rate (AER) for different alignment spaces
From Haghighi et al. 09

Alignment performance using a discriminative model
From Haghighi et al. 09
Training for maximum likelihood
- So far, results were with MIRA
  - Requires only finding the best alignment under the model
  - Efficient under 1-to-1 and ITG models
- If we want to train for maximum likelihood according to a log-linear model
  - This requires summing over all possible alignments
  - The summation is tractable for ITGs (bi-text parsing, discussed shortly)
  - One of the big advantages of ITGs

MIRA versus maximum likelihood training
Algorithms for SCFGs
- Translation with synchronous CFGs
- Bi-text parsing with synchronous CFGs
Review: CKY parsing for CFGs in CNF
Slides from David Chiang, ACL 2006 tutorial
- Start with spans of length one and construct possible constituents
- Continue with spans of length 2 and construct constituents from words and previously constructed constituents
- Then spans of length 3, length 4, and so on
- The best S constituent covering the whole sentence is the final output

Review: complexity of CKY
Slide from David Chiang, ACL 2006 tutorial
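A minimal CKY recognizer sketch (an illustration of the reviewed algorithm; the toy grammar is hypothetical). The three nested loops, over span length, start position, and split point, give the familiar O(n³ · |G|) runtime:

from collections import defaultdict

def cky(words, lexical, binary):
    """lexical: dict word -> set of non-terminals (rules A -> w)
       binary:  dict (B, C) -> set of non-terminals (rules A -> B C)
       Returns the chart: chart[i, j] = non-terminals spanning words[i:j]."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):                 # spans of length 1
        chart[i, i + 1] = set(lexical.get(w, ()))
    for length in range(2, n + 1):                # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):             # split point
                for B in chart[i, k]:
                    for C in chart[k, j]:
                        chart[i, j] |= binary.get((B, C), set())
    return chart

# Hypothetical toy grammar: S -> NP VP, NP -> 'they', VP -> 'run'
lexical = {"they": {"NP"}, "run": {"VP"}}
binary = {("NP", "VP"): {"S"}}
print("S" in cky(["they", "run"], lexical, binary)[0, 2])  # True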
Translation with SCFG
Slide from David Chiang, ACL 2006 tutorial

Translation
Slide from David Chiang, ACL 2006 tutorial

Bi-text parsing
Slides from David Chiang, ACL 2006 tutorial
- We consider SCFGs with at most two symbols on the right-hand side (rank 2)
Bi-text parsing for grammars with higher rank
- There is no CNF for synchronous CFGs of rank 4 or higher in the general case
- With higher-rank grammars we can still translate efficiently: convert the source-side CFG to CNF, parse, flatten the trees back, and translate
- Not so for bi-text parsing:
  - In general, exponential in the rank of the grammar and polynomial in sentence length
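A quick way to see the $O(n^6)$ bound for rank-2 bi-text parsing (a standard argument, sketched here; it is not spelled out on the slides): a chart item is a non-terminal with one span in each language, $[A, i, j; k, l]$, so there are $O(n^4)$ items; completing an item from two sub-items requires choosing a split point on each side, an extra $O(n^2)$ factor, giving $O(n^6)$ in total, versus $O(n^3)$ for monolingual CKY.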
Hierarchical phrase-based translation
David Chiang
ISI, USC

Hierarchical phrase-based translation: overview
- Motivation
- Extracting rules
- Scoring derivations
- Decoding without an LM
- Decoding with an LM
Motivation
- Review of phrase-based models:
  - Segment the input into a sequence of phrases
  - Translate each phrase
  - Re-order phrases depending on distortion and perhaps the lexical content of the phrases
- Properties of phrase-based models:
  - Local re-ordering is captured within phrases for frequently occurring groups of words
  - Global re-ordering is not modeled well
  - Only contiguous translations are learned
Chinese-English example
Australia is one of the few countries that have diplomatic relations with North Korea.
Output from a phrase-based system:
- Captured some reordering through phrase translation and phrase re-ordering
- Did not re-order the relative clause and the noun phrase
Idea: hierarchical phrases

X → ⟨yu X₁ you X₂, have X₂ with X₁⟩
- The variables stand for corresponding hierarchical phrases
- Captures the fact that PP phrases tend to come before the verb in Chinese and after the verb in English
- Serves as both a discontinuous phrase pair and a re-ordering rule
Other example hierarchical phrases

X → ⟨X₁ de X₂, the X₂ that X₁⟩
- Chinese relative clauses modify NPs on the left, and English relative clauses modify NPs on the right

X → ⟨X₁ zhiyi, one of X₁⟩
A synchronous CFG for the example
Only one non-terminal X plus the start symbol S is used:

X → ⟨yu X₁ you X₂, have X₂ with X₁⟩
X → ⟨X₁ de X₂, the X₂ that X₁⟩
X → ⟨X₁ zhiyi, one of X₁⟩
X → ⟨Aozhou, Australia⟩
X → ⟨Beihan, North Korea⟩
X → ⟨shi, is⟩
X → ⟨bangjiao, diplomatic relations⟩
X → ⟨shaoshu guojia, few countries⟩
S → ⟨S₁ X₂, S₁ X₂⟩  [glue rule]
S → ⟨X₁, X₁⟩
General approach
- Align parallel training data using word-alignment models (e.g. GIZA++)
- Extract hierarchical phrase pairs
  - Can be represented as SCFG rules
- Assign probabilities (scores) to rules
  - As in log-linear models for phrase-based MT, various features on rules can be defined to produce rule scores
- Translate new sentences
  - Parsing with an SCFG grammar
  - Integrating a language model
Example derivation
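The slide's derivation image is not reproduced here; written out with the toy grammar above (a sketch, with subscript indices chosen for readability), the derivation proceeds as follows:

⟨S, S⟩
⇒ ⟨S₁ X₂, S₁ X₂⟩  (glue rule)
⇒ ⟨S₃ X₄ X₂, S₃ X₄ X₂⟩  (glue rule)
⇒ ⟨X₅ X₄ X₂, X₅ X₄ X₂⟩  (S → ⟨X₁, X₁⟩)
⇒ ⟨Aozhou X₄ X₂, Australia X₄ X₂⟩
⇒ ⟨Aozhou shi X₂, Australia is X₂⟩
⇒ ⟨Aozhou shi X₆ zhiyi, Australia is one of X₆⟩
⇒ ⟨Aozhou shi X₇ de X₈ zhiyi, Australia is one of the X₈ that X₇⟩
⇒ ⟨Aozhou shi X₇ de shaoshu guojia zhiyi, Australia is one of the few countries that X₇⟩
⇒ ⟨Aozhou shi yu X₉ you X₁₀ de shaoshu guojia zhiyi, Australia is one of the few countries that have X₁₀ with X₉⟩
⇒ ⟨Aozhou shi yu Beihan you X₁₀ de shaoshu guojia zhiyi, Australia is one of the few countries that have X₁₀ with North Korea⟩
⇒ ⟨Aozhou shi yu Beihan you bangjiao de shaoshu guojia zhiyi, Australia is one of the few countries that have diplomatic relations with North Korea⟩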
Extracting hierarchical phrases
- Start with contiguous phrase pairs, as in phrasal SMT models (called initial phrase pairs)
- Make rules for these phrase pairs and add them to the rule set extracted from this sentence pair

Extracting hierarchical phrase pairs
- For every rule of the sentence pair, and every initial phrase pair contained in it, replace the initial phrase pair with a non-terminal and add the new rule (a sketch follows)
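As an illustration of the subtraction step (hypothetical representation and helper names; a real extractor iterates the subtraction and also enforces the constraints discussed below), a minimal sketch:

def extract_rules(src, tgt, initial_pairs):
    """initial_pairs: set of ((i1, i2), (j1, j2)) half-open spans such that
    src[i1:i2] aligns exactly with tgt[j1:j2]. Returns rules as
    (source_side, target_side) tuples whose elements are words or
    ('X', index) links between the two sides. Only a single subtraction
    is shown; Chiang's extractor allows up to two non-terminals."""
    rules = set()
    for (i1, i2), (j1, j2) in initial_pairs:
        # the initial phrase pair itself is a rule with no non-terminals
        rules.add((tuple(src[i1:i2]), tuple(tgt[j1:j2])))
        # subtract every strictly smaller initial pair nested inside it
        for (a1, a2), (b1, b2) in initial_pairs:
            if (i1 <= a1 and a2 <= i2 and j1 <= b1 and b2 <= j2
                    and (a2 - a1) < (i2 - i1)):
                s = tuple(src[i1:a1]) + (("X", 1),) + tuple(src[a2:i2])
                t = tuple(tgt[j1:b1]) + (("X", 1),) + tuple(tgt[b2:j2])
                rules.add((s, t))
    return rules

# Hypothetical example: "Aozhou shi" / "Australia is", where
# ("Aozhou", "Australia") and ("shi", "is") are also initial pairs.
src, tgt = ["Aozhou", "shi"], ["Australia", "is"]
pairs = {((0, 2), (0, 2)), ((0, 1), (0, 1)), ((1, 2), (1, 2))}
for rule in sorted(extract_rules(src, tgt, pairs), key=str):
    print(rule)   # includes (('X',1),'shi') -> (('X',1),'is'), etc.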
Another example
- Hierarchical phrase
- Traditional phrases

Constraining the grammar rules
- This method generates too many phrase pairs and leads to spurious ambiguity
- Place constraints on the set of allowable rules for robustness/speed (e.g., in Chiang's system: limits on initial phrase length, at most two non-terminals per rule, no adjacent non-terminals on the source side)
Adding glue rules
- For continuity with phrase-based models, add glue rules which can split the source into phrases and translate each:
  - S → ⟨S₁ X₂, S₁ X₂⟩
  - S → ⟨X₁, X₁⟩
- Question: if we only have conventional phrase pairs and these two rules, what system do we have?
- Question: what do we get if we also add these rules?
  - X → ⟨X₁ X₂, X₁ X₂⟩
  - X → ⟨X₁ X₂, X₂ X₁⟩
Assigning scores to derivations
- A derivation is a parse tree for the source and target sentences
- As in phrase-based models, we choose the derivation that maximizes the score
  - The derivation corresponds to a target sentence, which is returned as the translation
  - There are multiple derivations of a target sentence, but we do not sum over them; we approximate with max, as in phrase-based models
- Feature functions on rules, plus a language model feature:

$P(D) \propto \prod_i \phi_i(D)^{\lambda_i}$

$P(D) \propto P_{LM}(e)^{\lambda_{LM}} \prod_{i \neq LM} \; \prod_{(X \to \langle \gamma, \alpha \rangle) \in D} \phi_i(X \to \langle \gamma, \alpha \rangle)^{\lambda_i}$
Assigning scores to derivations
- Except for the language model, the score can be represented as a weight assigned by a weighted SCFG:

$w(D) = \prod_{(X \to \langle \gamma, \alpha \rangle) \in D} w(X \to \langle \gamma, \alpha \rangle)$

$w(X \to \langle \gamma, \alpha \rangle) = \prod_{i \neq LM} \phi_i(X \to \langle \gamma, \alpha \rangle)^{\lambda_i}$

$P(D) \propto P_{LM}(e)^{\lambda_{LM}} \times w(D)$
Features
- Rule probabilities in two directions, and lexical weighting in two directions:
  - P(γ|α) and P(α|γ)
  - P_lex(γ|α) and P_lex(α|γ)
- Exponential of the rule count, word count, and glue rule count
Estimating feature values and feature weights
- Estimating translation probabilities in two directions
  - As in phrase-based models, heuristic estimation using relative frequencies of rule counts
  - A count of one from every sentence for each initial phrase pair
  - For each initial phrase pair in a sentence, a fractional (equal) count for each rule obtained by subtracting sub-phrases
  - Relative frequency of these counts (a small sketch follows):

$P(\gamma \mid \alpha) = \dfrac{c(X \to \langle \gamma, \alpha \rangle)}{\sum_{\gamma'} c(X \to \langle \gamma', \alpha \rangle)}$
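A minimal sketch of this relative-frequency estimate (hypothetical rule strings and counts), assuming fractional counts have already been accumulated as described above:

from collections import defaultdict

counts = defaultdict(float)  # counts[(gamma, alpha)] = c(X -> <gamma, alpha>)
counts[("X1 de X2", "the X2 that X1")] += 1.0
counts[("X1 zhi X2", "the X2 that X1")] += 0.5  # hypothetical fractional count

def p_src_given_tgt(gamma, alpha):
    """P(gamma | alpha): normalize c(X -> <gamma, alpha>) over all gamma'."""
    z = sum(c for (g, a), c in counts.items() if a == alpha)
    return counts[(gamma, alpha)] / z if z else 0.0

print(p_src_given_tgt("X1 de X2", "the X2 that X1"))  # 1.0 / 1.5 = 0.666...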

- Estimating the values of the parameters λ
  - Minimum error rate training, as in phrase-based models: maximize the BLEU score on a development set
Finding the best translation: decoding
- In the absence of a language model feature, the score is

$w(D) = \prod_{(X \to \langle \gamma, \alpha \rangle) \in D} w(X \to \langle \gamma, \alpha \rangle)$

- This can be represented by a weighted SCFG
- Parse the source sentence using the source side of the grammar with CKY (cubic time in sentence length)
- Read off the target sentence assembled from the corresponding target derivation (a sketch follows)
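A minimal decoder sketch along these lines (an illustration under simplifying assumptions, not Chiang's implementation): CKY over the source side of a weighted SCFG whose rules mix terminals with up to two X gaps, followed by reading the target off the best derivation. Rules whose source side is a single non-terminal (e.g. the glue rules) are omitted for simplicity, and all rule data in the example is hypothetical:

def decode(src, rules):
    """rules: list of (src_side, tgt_side, weight); a non-terminal gap
    appears as ('X', i), numbering the gaps on the source side."""
    n = len(src)
    best = {}  # (i, j) -> (weight, (src_side, tgt_side), [child spans])

    def match(pat, i, j):
        """Yield (product of child weights, child spans) for each way that
        pattern pat can cover src[i:j]."""
        if not pat:
            if i == j:
                yield 1.0, []
            return
        head, rest = pat[0], pat[1:]
        if isinstance(head, tuple):              # a non-terminal gap
            for k in range(i + 1, j + 1):
                if (i, k) in best:
                    for w, kids in match(rest, k, j):
                        yield best[i, k][0] * w, [(i, k)] + kids
        elif i < j and src[i] == head:           # a terminal word
            yield from match(rest, i + 1, j)

    for length in range(1, n + 1):               # CKY order: short to long
        for i in range(n - length + 1):
            j = i + length
            for s, t, rw in rules:
                for w, kids in match(s, i, j):
                    if rw * w > best.get((i, j), (0.0,))[0]:
                        best[(i, j)] = (rw * w, (s, t), kids)

    def target(i, j):
        """Substitute child translations into the rule's target side."""
        _, (s, t), kids = best[(i, j)]
        out = []
        for sym in t:
            if isinstance(sym, tuple):           # ('X', idx): idx-th source gap
                out.extend(target(*kids[sym[1] - 1]))
            else:
                out.append(sym)
        return out

    return target(0, n)

# Hypothetical toy grammar and run:
rules = [
    (("Beihan",), ("North", "Korea"), 0.9),
    (("bangjiao",), ("diplomatic", "relations"), 0.9),
    (("yu", ("X", 1), "you", ("X", 2)),
     ("have", ("X", 2), "with", ("X", 1)), 0.8),
]
print(decode("yu Beihan you bangjiao".split(), rules))
# ['have', 'diplomatic', 'relations', 'with', 'North', 'Korea']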
Finding the best translation including an LM
- Method 1: generate k-best derivations without the LM, then rescore with the LM
  - May need an extremely large k-best list to reach the highest-scoring +LM derivation
- Method 2: integrate the LM into the grammar by intersecting the target side of the SCFG with the LM: a very large expansion of the rule set, with time $O(n^3 |T|^{4(m-1)})$
- Method 3: integrate the LM while parsing with the SCFG, using cube pruning to generate k-best LM-scored translations at each span
Parsing with Hiero grammars
- A modification of CKY which does not require conversion to Chomsky Normal Form
- Parsing as weighted deduction (without the LM, using the source-side grammar)
- Goal: prove the item [S, 0, n]

Pseudo-code for parsing
- If two items are equivalent (same span, same non-terminal), they are merged
- For k-best generation, keep pointers to all ways of generating each item, plus their weights
K-best derivation generation
- To generate a k-best list for some item in the chart (e.g. an X from 5 to 8), we need to consider the top k combinations of rules used to form the item, plus the sub-items used by those rules
  - E.g. the top 4 rules applying at span 5 to 8, and the target sides of the top 3 derivations from 6 to 8
- Naïve method: generate all combinations, sort them, and return the top k
- Faster method: we don't need to generate all combinations to get the top k (see the sketch below)

K-best combinations of two lists
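A minimal sketch of the faster method for combining two sorted score lists, the core step that cube pruning generalizes: because both lists are sorted, the best combination is at indices (0, 0), and every combination is dominated by a neighbor with one index decremented, so a small frontier heap finds the top k lazily:

import heapq

def k_best_products(a, b, k):
    """a, b: scores sorted in decreasing order (e.g. derivation weights).
    Returns the k largest products a[i]*b[j] without enumerating all
    len(a)*len(b) combinations."""
    if not a or not b:
        return []
    heap = [(-(a[0] * b[0]), 0, 0)]   # max-heap via negated scores
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append(-neg)
        # push the two neighbors: advance one index in either list
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(a[ni] * b[nj]), ni, nj))
    return out

print(k_best_products([0.9, 0.5, 0.1], [0.8, 0.4], k=3))
# ~[0.72, 0.40, 0.36]; only 3 of the 6 combinations were expanded

Cube pruning applies the same idea to the grid of rules times child derivations at each span; with LM scores added, the lists are only approximately sorted, which is what makes the method a pruning heuristic rather than exact.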
Integrating an LM with cube pruning
Results using different LM integration methods
Comparison to a phrase-based system
Summary
- Introduced stochastic SCFGs
  - Translation is efficient using CKY-style parsing
  - Bi-text parsing is $O(n^6)$ for rank-2 grammars, slower for rank 4 and up
- Introduced ITGs as a tractable special case
  - The bracketing ITG, using only one non-terminal, is used for word alignment
  - The block bracketing ITG is used for improved supervised alignment, and also for phrase-based translation
- Can integrate an LM efficiently using cube pruning
Summary
- Described hierarchical phrase-based translation
  - Uses hierarchical rules encoding phrase re-ordering and discontinuous lexical correspondence
  - Rules include traditional contiguous phrase pairs
  - Can translate efficiently without an LM using SCFG parsing
  - Outperforms phrase-based models for several language pairs
- Hiero is implemented in Moses
References
- David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 2007.
- David Chiang. An introduction to synchronous grammars. Notes and slides from the ACL 2006 tutorial.
- Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 1997.
- A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised ITG models. ACL 2009.
- Many other interesting papers use ITGs and extensions to Hiero; some will be added to the web page.
Next lecture
- Chris Quirk will talk about SMT systems using linguistic syntax
- Using syntax on the source, the target, or both
- Different types of syntactic analysis
- Other types of synchronous grammars
- The list of readings will be updated