binarized, a nonterminal is considered new if it is previously unseen in binarizing rules 1 to i − 1. This greedy approach is similar to that of DeNero et al. (2009b). The cost function is thus defined as:

$$n(\langle i, k, j \rangle) = \begin{cases} 1 & \text{if the VT for span } (i, j) \text{ is new} \\ 0 & \text{otherwise} \end{cases} \qquad n_{init}(i) = 0$$

The source side of the first binarized rule “[]1 → JJ NN, propose a JJ NN” contains a very frequent nonterminal sequence “JJ NN”. If one were to parse with the binarized rule, and if the virtual nonterminal []1 has been built, the parser needs to continue following the binarization tree in order to determine whether the original rule would be matched. Furthermore, having two consecutive nonterminals adds to

$$c < c' \Leftrightarrow c_1 < c'_1 \lor (c_1 = c'_1 \land c_2 < c'_2) \lor \ldots$$

If the (min, +) operators on each component cost satisfy the semiring properties, the cost tuple is also a semiring. Next, we describe our cost functions and how we handle target-side terminals.
Tagyoung Chung and Daniel Gildea • University of Rochester
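The lexicographic ordering on cost tuples and the componentwise (min, +) operators can be illustrated with a minimal Python sketch. The names `plus` and `times` are hypothetical, chosen for this example only, and every component is assumed to be encoded as a number. The sketch relies on the fact that Python compares tuples lexicographically.

```python
from typing import Tuple

Cost = Tuple[float, ...]


def times(a: Cost, b: Cost) -> Cost:
    """Semiring 'multiplication' (hypothetical name): componentwise +."""
    return tuple(x + y for x, y in zip(a, b))


def plus(a: Cost, b: Cost) -> Cost:
    """Semiring 'addition' (hypothetical name): lexicographic min.

    Python tuples already compare lexicographically, so min() suffices.
    """
    return min(a, b)


# Three illustrative cost tuples (made-up values).
c1 = (0, 2.5, 1)
c2 = (0, 2.5, 3)
c3 = (1, 0.5, 0)

best = plus(plus(c1, c2), c3)   # c1 wins: ties on the first two components, then 1 < 3
combined = times(c1, c3)        # componentwise sum

# Distributivity of 'times' over 'plus' on this example.
lhs = times(c3, plus(c1, c2))
rhs = plus(times(c3, c1), times(c3, c2))
```

On these values the distributive law holds, which is the property that lets a dynamic program keep only the best tuple per span.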
2.1 Synchronous Binarization as a Cost
We use a binary cost b to indicate whether a binarization tree is a permissible synchronous binarization. Given a hyperedge ⟨i, k, j⟩, we say k is a permissible split of the span (i, j) if and only if the spans (i, k) and (k, j) are both synchronously binarizable and the span (i, j) covers a consecutive sequence of nonterminals on the target side. A span is synchronously binarizable if and only if the span is of length one, or a permissible split of the span exists. The cost b is defined as:

$$b(\langle i, k, j \rangle) = \begin{cases} T & \text{if } k \text{ is a permissible split of } (i, j) \\ F & \text{otherwise} \end{cases} \qquad b_{init}(i) = T$$

Under this configuration, the semiring operators (min, +) defined for the cost b are (∨, ∧). Using b as the first cost function in the cost function tuple guarantees that we will find a tree that is synchronously binarized if one exists.

complexity since the parser needs to test each split point. The following binarization is equally valid but integrates terminals early:

VP → PP [[提出 JJ]1 NN]2, [[propose a JJ]1 NN]2 PP

Here, the first binarized rule “[]1 → 提出 JJ, propose a JJ” anchors on a terminal and enables earlier pruning of the original rule.
We formulate this intuition by asking the question: given a source-side string γ, what binarization tree, on average, builds the smallest number of hyperedges when the rule is applied? This is realized by defining a cost function e which estimates the probability of a hyperedge ⟨i, k, j⟩ being built. We use a simple model: assume each terminal or nonterminal in γ is matched independently with a fixed probability, then a hyperedge ⟨i, k, j⟩ is derived if and only if all symbols in the source span (i, j) are matched. The cost e is thus defined as

2.4 Late Target-Side Terminal Attachment
Once the optimal source-side binarization tree is found, we have a good deal of freedom to attach target-side terminals to adjacent nonterminals, as long as the bracketing of nonterminals is not violated. The following example is taken from Zhang et al. (2006):

ADJP → RB 负责 PP 的 NN, RB responsible for the NN PP

With the source-side binarization fixed, we can produce distinct binarized rules by choosing different ways of attaching target-side terminals:

ADJP → [RB 负责]1 ⟨[PP 的]3 NN⟩2, [RB]1 ⟨resp. for the NN [PP]3⟩2

Terminal-Aware Synchronous Binarization

Binarization as parsing
• Formulated as parsing the source side of a rule
• Chooses a best binarization tree based on cost functions
• Uses CYK-like algorithm
• We define 3 cost functions

Cost function b - synchronous binarization
• Enforces synchronous binarization
• Non-synchronously-binarizable rules are discarded

Cost function n - maximize nonterminal sharing
• Binarization trees that generate more unseen nonterminals are discouraged

The semiring operators for this cost are also (min, +) on real numbers.

Late target-side terminal attachment
• Conventional wisdom is attaching target-side terminals as low as possible in the binarization tree
• But LM discourages long translation hypotheses. Unfairly pruning away translation hypotheses generated by rules with many target-side terminals
• We attach target-side terminals as high as possible in the binarization tree

[Diagram: different target-side terminal attachment for the ADJP rule; node labels ADJP, RB+fuze, PP+de+NN, PP+de, RB, NN, 负责, 的; target side "RB responsible for the NN PP+de"]

[Diagram: different source-side binarizations, A B+C vs. A+B C; rules X → A B+C, A B+C and B+C → B C, C B; panels labeled "non-synchronous binarization" and "synchronous binarization"]
Room for improvement
• Vanilla synchronous binarization (Zhang et al., 2006) always
chooses the rightmost binarizable point when many legal
synchronous binarizations exist, which is common
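To make the "many legal synchronous binarizations" point concrete, the following is a minimal sketch that counts permissible synchronous binarization trees for a rule. It is a simplification under stated assumptions: terminals are ignored, the rule is represented only by the target-side order of its source-side nonterminals, and the function name `count_synchronous_binarizations` is hypothetical. A span counts as binarizable when it has length one, or when it covers a consecutive block of target positions and some split point yields two binarizable halves.

```python
from functools import lru_cache
from typing import Sequence


def count_synchronous_binarizations(target_order: Sequence[int]) -> int:
    """Count binarization trees in which every span is target-contiguous.

    target_order[x] is the target-side position of the x-th source-side
    nonterminal; it must be a non-empty permutation of 0..n-1.
    """
    perm = tuple(target_order)

    def contiguous(i: int, j: int) -> bool:
        # The span covers consecutive target positions iff max - min == length - 1.
        block = perm[i:j]
        return max(block) - min(block) == j - i - 1

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        if j - i == 1:
            return 1
        if not contiguous(i, j):
            return 0
        # Sum over split points; a split contributes only if both halves are binarizable.
        return sum(count(i, k) * count(k, j) for k in range(i + 1, j))

    return count(0, len(perm))


monotonic = count_synchronous_binarizations([0, 1, 2, 3])
inverted_pair = count_synchronous_binarizations([0, 2, 1])
not_binarizable = count_synchronous_binarizations([1, 3, 0, 2])
```

For a monotonic rule every source-side tree is permissible, so the count grows with the Catalan numbers, while a permutation such as (1, 3, 0, 2) admits none.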
pruning, we show that different strategies do have a significant effect in translation quality.
Other works investigating alternative binarization methods mostly focus on the effect of nonterminal sharing. Xiao et al. (2009) also proposed a CYK-like algorithm for synchronous binarization. Apparently the lack of virtual nonterminal sharing in their decoder caused heavy competition between virtual

2.2 Early Source-Side Terminal Matching
When a rule is being applied while parsing a sentence, terminals in the rule have less chance of being matched. We can exploit this fact by taking terminals into account during binarization and placing terminals lower in the binarization tree. Consider the following SCFG rule:

VP → PP 提出 JJ NN, propose a JJ NN PP

The synchronous binarization algorithm of Zhang et al. (2006) binarizes the rule by finding the rightmost binarizable points on the source side:

¹ We follow Wu (1997) and use square brackets for straight rules and pointed brackets for inverted rules. We also mark brackets with indices to represent virtual nonterminals.

Cost function e - early source-side terminal matching
• A simple probabilistic model based on corpus statistics
• Prefers to match infrequent source-side terminals first

[Diagram: different source-side binarizations of the VP rule; node labels VP, PP, Tichu+JJ+NN, JJ+NN, Tichu+JJ, 提出, JJ, NN; panels labeled "vanilla synchronous binarization" and "preferred by cost function e"]

• The conventional method considers only nonterminals:
  • always chooses right-most binarizable point for terminals on the source side
  • attaches target-side terminals as low as possible in the binarization tree

Even with the synchronous binarization constraint, many possible binarizations exist. Analysis of our Chinese-English parallel corpus has shown that the majority of synchronously binarizable rules with arity smaller than 4 are monotonic, i.e., the target-side nonterminal permutation is either strictly increasing or decreasing (See Figure 1). For monotonic rules, any source-side binarization is also a permissible synchronous binarization.
The binarization problem can be formulated as a semiring parsing (Goodman, 1999) problem. We define a cost function that considers different bina-

[Figure 1: Rule Statistics. y-axis: Number of rules (0 to 18000); x-axis: Number of right-hand-side nonterminals (1 to 7); curves: Total, Binarizable, Monotonic]

Algorithm 1 The CYK binarization algorithm.
CYK-BINARIZE(X → ⟨γ, α⟩)
  for i = 0 . . . |γ| − 1 do
    T[i, i + 1] ← c_init(i)
  for s = 2 . . . |γ| do
    for i = 0 . . . |γ| − s do
      j ← i + s
      for k = i + 1 . . . j − 1 do
        t ← T[i, k] + T[k, j] + c(⟨i, k, j⟩)
        T[i, j] ← min(T[i, j], t)
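The CYK binarization algorithm (Algorithm 1) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `cyk_binarize`, `add`, `span_cost`, `target_rank`, and `p_match` are hypothetical. The boolean cost b is encoded as 0 (T) or 1 (F) so that ordinary + and lexicographic min can be used, the check for b is applied per span and accumulated over the tree, and the match probability 0.01 for 提出 is a made-up value.

```python
from typing import Callable, Dict, Tuple

Cost = Tuple[float, ...]


def add(a: Cost, b: Cost) -> Cost:
    """Componentwise + on cost tuples."""
    return tuple(x + y for x, y in zip(a, b))


def cyk_binarize(n: int,
                 c_init: Callable[[int], Cost],
                 c: Callable[[int, int, int], Cost]):
    """Return (best cost, best tree) over all binarizations of n source symbols."""
    T: Dict[Tuple[int, int], Cost] = {}
    back: Dict[Tuple[int, int], int] = {}
    for i in range(n):
        T[i, i + 1] = c_init(i)
    for s in range(2, n + 1):            # span length
        for i in range(n - s + 1):       # span start
            j = i + s
            for k in range(i + 1, j):    # split point
                t = add(add(T[i, k], T[k, j]), c(i, k, j))
                # Tuples compare lexicographically, so "<" is the tuple order.
                if (i, j) not in T or t < T[i, j]:
                    T[i, j] = t
                    back[i, j] = k

    def tree(i: int, j: int):
        if j - i == 1:
            return i
        k = back[i, j]
        return (tree(i, k), tree(k, j))

    return T[0, n], tree(0, n)


# Illustrative rule: VP -> PP 提出 JJ NN, propose a JJ NN PP
source = ["PP", "提出", "JJ", "NN"]
target_rank = {0: 2, 2: 0, 3: 1}    # source index of nonterminal -> target order
p_match = [1.0, 0.01, 1.0, 1.0]     # assumed probabilities; 0.01 is made up


def span_cost(i: int, k: int, j: int) -> Cost:
    """Cost tuple (b, e) for hyperedge <i, k, j>; neither component depends on k here."""
    ranks = [target_rank[x] for x in range(i, j) if x in target_rank]
    contiguous = not ranks or max(ranks) - min(ranks) == len(ranks) - 1
    e = 1.0
    for x in range(i, j):
        e *= p_match[x]
    return (0 if contiguous else 1, e)


best_cost, best_tree = cyk_binarize(len(source), lambda i: (0, 0.0), span_cost)
```

With these assumed values the selected tree groups 提出 with JJ first, i.e. PP [[提出 JJ] NN], and its first cost component is 0, meaning every span in the tree is target-contiguous.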
ADJP → [RB 负责]1 ⟨[PP 的]3 NN⟩2, [RB]1 resp. for the ⟨NN [PP]3⟩2

The first binarization is generated by attaching the target-side terminals as low as possible in a post-

as not to extract unary rules (Chung et al., 2011). The corpus consists of 250K sentence pairs, which is 6.3M words on the English side. A 392-sentence test set was used to evaluate different binarizations. Decoding is performed by a general CYK SCFG decoder developed in-house and a trigram language

terms of BLEU score, decoding speed, or model score when comparing translation results that used grammars that employed nonterminal sharing maximization and ones that did not. In the rest of this section, all the results we discuss use nonterminal sharing maximization as a part of the cost function.
We then compare the effects of early target-side terminal attachment and late attachment. Figure 2 shows model scores of each decoder run with varying bin sizes, and Figure 3 shows BLEU scores for corresponding runs of the experiments. (b,n)-early is conventional synchronous binarization with early target-side terminal attachment and nonterminal sharing maximization, (b,n)-late is the same setting with late target-side terminal attachment. The tuples represent cost functions that are discussed in Section 2. The figures clearly show that late attachment of target-side terminals is better. Although Figure 3 does not show perfect correlation with Figure 2, it exhibits the same trend. The same goes for (b,e,n)-early and (b,e,n)-late.
Finally, we examine the effect of including the

[Figure 2: Model Scores vs. Decoding Time. y-axis: Model Score (log-probability); x-axis: Seconds / Sentence (log scale); curves: (b,n)-early, (b,n)-late, (b,e,n)-early, (b,e,n)-late]

[Figure 3: BLEU Scores vs Decoding Time. y-axis: BLEU; x-axis: Seconds / Sentence (log scale); curves: (b,n)-early, (b,n)-late, (b,e,n)-early, (b,e,n)-late]

Keys: (b, n)-late means we use the cost function b to select the best binarization tree, and ties are broken by the cost function n. Late target-side terminal attachment is also applied.

Synchronous binarization
• Factorizes synchronous context-free grammar rules
• Makes machine translation decoding with SCFG faster
• Facilitates language model integration during decoding

X → A B C, A C B

Results
• Chinese-to-English translation task on dataset of 250K training sentence pairs, and a 329-sentence test set
• Early source-side terminal matching is better than rightmost binarization
• Late target-side terminal attachment is better than early target-side terminal attachment

$$e(\langle i, k, j \rangle) = \prod_{i \le \ell < j} p(\gamma_\ell) \qquad e_{init}(i) = 0$$

For terminals, p(γ_ℓ) can be estimated by counting the source side of the training corpus. For nonterminals, we simply assume p(γ_ℓ) = 1.
With the hyperedge cost e, the cost of a binarization tree t is $\sum_{h \in t} e(h)$, i.e., the expected number of hyperedges to be built when a particular binarization of a rule is applied to unseen data. The operators

² In this definition, k does not appear on the right-hand side of the equation because all edges leading to the same span share the same cost value.
³ Although this cost function is defined as an expectation, it does not form an expectation semiring (Eisner, 2001) because it is defined as an expectation over input strings, instead of an expectation over parse trees.
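As a worked illustration of the tree cost under e, take the rule VP → PP 提出 JJ NN, propose a JJ NN PP from Section 2.2. The numbers are assumptions for illustration only: p(提出) = 0.01 is a made-up value, and p = 1 for every nonterminal as in the text. The first line assumes the rightmost binarization brackets the source side as PP [提出 [JJ NN]]; the second uses the bracketing PP [[提出 JJ] NN] that integrates the terminal early.

```latex
\begin{align*}
\text{rightmost:}\quad & e(\text{JJ NN}) + e(\text{提出 JJ NN}) + e(\text{PP 提出 JJ NN}) = 1 + 0.01 + 0.01 = 1.02 \\
\text{early:}\quad     & e(\text{提出 JJ}) + e(\text{提出 JJ NN}) + e(\text{PP 提出 JJ NN}) = 0.01 + 0.01 + 0.01 = 0.03
\end{align*}
```

Under these assumed probabilities, the tree that anchors its lowest virtual nonterminal on the rare terminal is expected to build far fewer hyperedges, which is the preference the cost e is meant to express.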
Licheng Fang,