Lecture 10: Discriminative Training for MT/
the Brown et al. Word Clustering Algorithm
Michael Collins
April 6, 2011
Discriminative Training for MT

- Our original model:

  f(y) = h(e(y)) + Σ_{k=1}^{L} g(p_k) + Σ_{k=1}^{L−1} η × δ(t(p_k), s(p_{k+1}))

- A discriminative model for translation (Liang et al., 2006):

  f(y; w, α, η) = α × h(e(y)) + Σ_{k=1}^{L} w · φ(p_k) + Σ_{k=1}^{L−1} η × δ(t(p_k), s(p_{k+1}))

- Here α ∈ R, η ∈ R and w ∈ R^d are the parameters of the model

- Crucial idea: φ(p) is a feature-vector representation of a phrase p
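As a concrete reading of the scoring function, here is a minimal Python sketch (not part of the original slides) that evaluates f(y; w, α, η) for a derivation given as a list of phrases. The helpers phi, h, e, and delta are assumptions standing in for the phrase-based machinery of earlier lectures.

```python
def derivation_score(y, w, alpha, eta, phi, h, e, delta):
    """Compute f(y; w, alpha, eta) for a derivation y = [p_1, ..., p_L] (a sequence of phrases).

    phi(p): sparse feature dict for a phrase; e(y): the target-language string produced by y;
    h(.): language-model score of that string; delta(p, q): distortion term for adjacent phrases.
    All of these helpers are assumed to come from the phrase-based model of earlier lectures.
    """
    score = alpha * h(e(y))                                            # alpha * h(e(y))
    for p in y:                                                        # sum_k w . phi(p_k)
        score += sum(w.get(feat, 0.0) * val for feat, val in phi(p).items())
    for p, q in zip(y, y[1:]):                                         # sum_k eta * delta(t(p_k), s(p_{k+1}))
        score += eta * delta(p, q)
    return score
```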
The Learning Set-up

- Our training data consists of (x^{(i)}, e^{(i)}) pairs, for i = 1 ... n, where x^{(i)} is a source-language sentence, and e^{(i)} is a target-language sentence
- We use Y^{(i)} to denote the set of possible derivations for x^{(i)}
- A complication: for a given (x^{(i)}, e^{(i)}) pair, there may be many derivations y ∈ Y^{(i)} such that e(y) = e^{(i)}
A “Bold Updating” Algorithm from Liang et al.

- Initialization: set w = 0, α = 1, η = −1
- For t = 1 ... T, for i = 1 ... n:
  - y* = argmax_{y ∈ Y^{(i)} : e(y) = e^{(i)}} f(y; w, α, η)
  - z* = argmax_{z ∈ Y^{(i)}} f(z; w, α, η)
  - For any phrase p ∈ y*, w = w + φ(p)
  - For any phrase p ∈ z*, w = w − φ(p)
  - Set α = α + h(e(y*)) − h(e(z*))
  - Set η = η + ... − ...
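A minimal Python sketch of this loop, with the decoding steps abstracted away. The callables decode and decode_constrained are assumed to return the two argmax derivations (implementing them requires the phrase-based decoder from earlier lectures); phrases, phi, and h are the accessors used on the slide. The η update is left out, as on the slide.

```python
from collections import defaultdict

def bold_updating(data, decode, decode_constrained, phrases, phi, h, T=5):
    """Perceptron-style 'bold updating' loop.

    data: list of (x, e) pairs.  decode(x, w, alpha, eta) returns the highest-scoring derivation
    z*; decode_constrained(x, e, w, alpha, eta) returns the highest-scoring derivation with
    e(y) = e, or None if there is none.  phrases(y), phi(p), and h(y) give the phrases of a
    derivation, the sparse feature dict of a phrase, and the language-model score h(e(y)).
    All of these are assumptions standing in for machinery not shown on the slides.
    """
    w = defaultdict(float)                 # w = 0
    alpha, eta = 1.0, -1.0                 # alpha = 1, eta = -1
    for t in range(T):
        for x, e in data:
            y_star = decode_constrained(x, e, w, alpha, eta)
            z_star = decode(x, w, alpha, eta)
            if y_star is None:             # no derivation reproduces the reference; skip
                continue
            for p in phrases(y_star):      # w = w + phi(p) for every phrase in y*
                for feat, val in phi(p).items():
                    w[feat] += val
            for p in phrases(z_star):      # w = w - phi(p) for every phrase in z*
                for feat, val in phi(p).items():
                    w[feat] -= val
            alpha += h(y_star) - h(z_star) # alpha = alpha + h(e(y*)) - h(e(z*))
            # the eta update is analogous (left as "..." on the slide)
    return w, alpha, eta
```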
A “Local Updating” Algorithm from Liang et al.

- Initialization: set w = 0, α = 1, η = −1
- For t = 1 ... T, for i = 1 ... n:
  - Define N^i to be the k highest-scoring translations in Y^{(i)} under f(y; w, α, η) (easy to generate N^i using k-best search)
  - y* is the member of N^i that is “closest” to e^{(i)}
  - z* = argmax_{z ∈ Y^{(i)}} f(z; w, α, η)
  - For any phrase p ∈ y*, w = w + φ(p)
  - For any phrase p ∈ z*, w = w − φ(p)
  - Set α = α + h(e(y*)) − h(e(z*))
  - Set η = η + ... − ...
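The only change from the bold-updating sketch is how y* is chosen. A small sketch of that selection step follows; similarity is an assumed closeness measure (for example, sentence-level BLEU of e(y) against the reference), which the slides leave unspecified.

```python
def local_update_pair(kbest, reference, similarity):
    """Select (y*, z*) for the 'local updating' variant.

    kbest: the list N^i of the k highest-scoring derivations, best first; z* is simply its top
    element, and y* is the member of N^i whose output is closest to the reference e^(i).
    similarity(y, reference) is an assumed closeness measure; its choice is left open here.
    """
    z_star = kbest[0]                                           # highest-scoring derivation
    y_star = max(kbest, key=lambda y: similarity(y, reference)) # "closest" member of N^i
    return y_star, z_star
```

The parameter updates then proceed exactly as in the bold-updating sketch above.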
The Brown Clustering Algorithm

- Input: a (large) corpus of words
- Output 1: a partition of words into word clusters
- Output 2 (a generalization of 1): a hierarchical word clustering
Example Clusters (from Brown et al., 1992)

(Peter F. Brown, Vincent J. Della Pietra, et al., “Class-Based n-gram Models of Natural Language”)
Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
people guys folks fellows CEOs chaps doubters commies unfortunates blokes
down backwards ashore sideways southward northward overboard aloft downwards adrift
water gas coal liquid acid sand carbon steam shale iron
great big vast sudden mere sheer gigantic lifelong scant colossal
man woman boy girl lawyer doctor guy farmer teacher citizen
American Indian European Japanese German African Catholic Israeli Italian Arab
pressure temperature permeability density porosity stress velocity viscosity gravity tension
mother wife father son husband brother daughter sister boss uncle
machine device controller processor CPU printer spindle subsystem compiler plotter
John George James Bob Robert Paul William Jim David Mike
anyone someone anybody somebody
feet miles pounds degrees inches barrels tons acres meters bytes
director chief professor commissioner commander treasurer founder superintendent dean custodian
liberal conservative parliamentary royal progressive Tory provisional separatist federalist PQ
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
that tha theat
head body hands eyes voice arm seat eye hair mouth
A Sample Hierarchy (from Miller et al., NAACL 2004)

lawyer            1000001101000
newspaperman      100000110100100
stewardess        100000110100101
toxicologist      10000011010011
slang             1000001101010
babysitter        100000110101100
conspirator       1000001101011010
womanizer         1000001101011011
mailman           10000011010111
salesman          100000110110000
bookkeeper        1000001101100010
troubleshooter    10000011011000110
bouncer           10000011011000111
technician        1000001101100100
janitor           1000001101100101
saleswoman        1000001101100110
...
Nike              1011011100100101011100
Maytag            10110111001001010111010
Generali          10110111001001010111011
Gap               1011011100100101011110
Harley-Davidson   10110111001001010111110
Enfield           101101110010010101111110
genus             101101110010010101111111
Microsoft         10110111001001011000
Ventritex         101101110010010110010
Tractebel         1011011100100101100110
Synopsys          1011011100100101100111
WordPerfect       1011011100100101101000
....
John              101110010000000000
Consuelo          101110010000000001
Jeffrey           101110010000000010
Kenneth           10111001000000001100
Phillip           101110010000000011010
WILLIAM           101110010000000011011
Timothy           10111001000000001110
Terrence          101110010000000011110
Jerald            101110010000000011111
Harold            101110010000000100
Frederic          101110010000000101
Wendell           10111001000000011

Table 1: Sample bit strings
Discriminative Name Tagger
algorithm exactly as described by Collins. However,
we did not implement cross-validation to determine
when to stop training. Instead, we simply iterated for 5
epochs in all cases, regardless of the training set size or
number of features used. Furthermore, we did not
implement features that occurred in no training
instances, as was done in (Sha and Pereira, 2003). We
suspect that these simplifications may have cost several
tenths of a point in performance.
A set of 16 tags was used to tag 8 name classes (the
seven MUC classes plus the additional null class). Two
tags were required per class to account for adjacent
elements of the same type. For example, the string
Betty Mary and Bobby Lou would be tagged as
PERSON-START PERSON-START NULL-START
PERSON-START PERSON-CONTINUE.
Our model uses a total of 19 classes of features. The
first seven of these correspond closely to features used
in a typical HMM name tagger. The remaining twelve
encode cluster membership.
Clusters of various
granularity are specified by prefixes of the bit strings.
Short prefixes specify short paths from the root node
and therefore large clusters. Long prefixes specify long
paths and small clusters. We used 4 different prefix
lengths: 8 bit, 12 bit, 16 bit, and 20 bit. Thus, the
clusters decrease in size by about a factor of 16 at each
level. The complete set of features is given in Table 2.
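To make the prefix idea concrete, here is a small illustrative Python sketch (not from the paper or the slides) of how features such as 8-11 in Table 2 pair a tag with bit-string prefixes of the current word; the analogous features for the previous and next word work the same way. The feature-string format is made up for the example.

```python
def cluster_prefix_features(tag, bit_string, lengths=(8, 12, 16, 20)):
    """Pair a tag with prefixes of a word's Brown-cluster bit string at several granularities
    (shorter prefix = larger cluster).  The string encoding here is illustrative only; the paper
    just specifies which (tag, prefix) pairs are used as features."""
    return [f"{tag}+Pref{n}={bit_string[:n]}" for n in lengths if len(bit_string) >= n]

# e.g., with a bit string from Table 1 (18 bits long, so no 20-bit prefix is produced):
# cluster_prefix_features("PERSON-START", "101110010000000000")
# -> ['PERSON-START+Pref8=10111001',
#     'PERSON-START+Pref12=101110010000',
#     'PERSON-START+Pref16=1011100100000000']
```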
The Formulation

- V is the set of all words seen in the corpus w_1, w_2, ..., w_T
- Say n(w, v) is the number of times that word w precedes v in our corpus, and n(w) is the number of times we see word w
- Say C : V → {1, 2, ..., k} is a partition of the vocabulary into k classes
- The model:

  p(w_1, w_2, ..., w_T) = Π_{i=1}^{T} p(w_i | C(w_i)) p(C(w_i) | C(w_{i−1}))

  (note: C(w_0) is a special start state)

- More conveniently:

  log p(w_1, w_2, ..., w_T) = Σ_{i=1}^{T} log p(w_i | C(w_i)) p(C(w_i) | C(w_{i−1}))
Measuring the Quality of C

- How do we measure the quality of a partition C? (Taken from Percy Liang, MEng thesis, MIT, 2005.)

Define n(c) to be the number of times a word in cluster c appears in the text, and define n(c, c') = Σ_{w ∈ c, w' ∈ c'} n(w, w') analogously. Also, recall n is simply the length of the text.

Quality(C) = (1/n) Σ_{i=1}^{n} log P(C(w_i) | C(w_{i−1})) P(w_i | C(w_i))

           = Σ_{w,w'} (n(w, w')/n) log P(C(w') | C(w)) P(w' | C(w'))

           = Σ_{w,w'} (n(w, w')/n) log [ (n(C(w), C(w')) / n(C(w))) · (n(w') / n(C(w'))) ]

           = Σ_{w,w'} (n(w, w')/n) log [ n(C(w), C(w')) n / (n(C(w)) n(C(w'))) ]  +  Σ_{w,w'} (n(w, w')/n) log (n(w')/n)

           = Σ_{c,c'} (n(c, c')/n) log [ n(c, c') n / (n(c) n(c')) ]  +  Σ_{w'} (n(w')/n) log (n(w')/n)

We use the counts n(·) to define empirical distributions over words, clusters, and pairs of clusters, so that P(w) = n(w)/n, P(c) = n(c)/n, and P(c, c') = n(c, c')/n.
The Final Equation

- Define

  P(w) = n(w)/n,   P(c) = n(c)/n,   P(c, c') = n(c, c')/n

- Then (again from Percy Liang, 2005):

  Quality(C) = Σ_{c,c'} P(c, c') log [ P(c, c') / (P(c) P(c')) ] + Σ_{w} P(w) log P(w)
             = I(C) − H

The first term I(C) is the mutual information between adjacent clusters, and the second term H is the entropy of the word distribution. Note that the quality of C can be computed as a sum of mutual-information weights between clusters minus the constant H, which does not depend on C. This decomposition allows us to make optimizations.
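The decomposition above can be evaluated directly from counts. Below is a minimal Python sketch (not from the lecture) that computes Quality(C) from a tokenized corpus and a word-to-cluster map; it is reused in the algorithm sketches further down.

```python
import math
from collections import Counter

def quality(corpus, assign):
    """Quality(C) = I(C) - H, computed from a list of tokens and a word -> cluster map.

    Follows the final equation above; like the slides, it treats the number of bigrams as n
    (the corpus length) rather than n - 1, which only shifts the value by a vanishing amount.
    """
    n = len(corpus)
    word = Counter(corpus)                                             # n(w)
    clus = Counter(assign[w] for w in corpus)                          # n(c)
    pair = Counter((assign[a], assign[b])                              # n(c, c')
                   for a, b in zip(corpus, corpus[1:]))
    mutual_info = sum(k / n * math.log(k * n / (clus[c1] * clus[c2]))
                      for (c1, c2), k in pair.items())                 # I(C)
    neg_entropy = sum(k / n * math.log(k / n) for k in word.values())  # -H
    return mutual_info + neg_entropy

# toy usage with a hypothetical clustering:
corpus = "the dog barks the cat meows the dog runs".split()
C = {"the": 0, "dog": 1, "cat": 1, "barks": 2, "meows": 2, "runs": 2}
print(quality(corpus, C))
```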
A First Algorithm

- We start with |V| clusters: each word gets its own cluster
- Our aim is to find k final clusters
- We run |V| − k merge steps:
  - At each merge step we pick two clusters c_i and c_j, and merge them into a single cluster
  - We greedily pick the merge such that Quality(C) for the clustering C after the merge step is maximized at each stage
- Cost? Naive = O(|V|^5). An improved algorithm gives O(|V|^3): still too slow for realistic values of |V|
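A minimal sketch of this naive procedure, reusing the quality() helper from the sketch after the previous slide (so it is not self-contained on its own). It is meant to make the merge loop concrete, not to be efficient: recomputing Quality(C) from scratch for every candidate pair is exactly what makes the naive cost prohibitive.

```python
from itertools import combinations

def first_algorithm(corpus, k):
    """Naive greedy merging: start with one cluster per word and perform |V| - k merges, each
    time choosing the pair of clusters whose merge maximizes Quality(C).
    Assumes quality(corpus, assign) from the earlier sketch is in scope."""
    vocab = sorted(set(corpus))
    assign = {w: i for i, w in enumerate(vocab)}             # each word starts in its own cluster
    clusters = {i: {w} for i, w in enumerate(vocab)}
    while len(clusters) > k:
        best = None
        for ci, cj in combinations(clusters, 2):             # consider every pair of clusters
            trial = dict(assign)
            for w in clusters[cj]:
                trial[w] = ci                                # tentatively merge cj into ci
            q = quality(corpus, trial)
            if best is None or q > best[0]:
                best = (q, ci, cj)
        _, ci, cj = best
        for w in clusters[cj]:                               # commit the best-scoring merge
            assign[w] = ci
        clusters[ci] |= clusters.pop(cj)
    return clusters, assign
```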
A Second Algorithm

- The parameter of the approach is m (e.g., m = 1000)
- Take the top m most frequent words, and put each into its own cluster, c_1, c_2, ..., c_m
- For i = (m + 1) ... |V|:
  - Create a new cluster, c_{m+1}, for the i'th most frequent word. We now have m + 1 clusters
  - Choose two clusters from c_1 ... c_{m+1} to be merged: pick the merge that gives a maximum value for Quality(C). We're now back to m clusters
- Carry out (m − 1) final merges, to create a full hierarchy
- Running time: O(|V| m^2 + n), where n is the corpus length
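A sketch of the restricted procedure, again reusing quality() from the earlier sketch. Evaluating quality only on the words clustered so far is a simplification of how real implementations maintain cluster-level counts incrementally, and the final (m − 1) merges that build the full hierarchy are omitted; treat this as a structural outline under those assumptions, not a faithful implementation.

```python
from collections import Counter
from itertools import combinations

def second_algorithm(corpus, m):
    """Restricted version: sweep the vocabulary in decreasing frequency order, keeping at most m
    clusters alive; whenever an (m+1)-th cluster appears, perform the single merge that maximizes
    Quality(C).  Assumes quality(corpus, assign) from the earlier sketch is in scope."""
    vocab = [w for w, _ in Counter(corpus).most_common()]    # words by decreasing frequency
    assign, clusters = {}, {}
    for rank, w in enumerate(vocab):
        assign[w] = rank
        clusters[rank] = {w}
        if len(clusters) <= m:
            continue                                         # the first m words: one cluster each
        seen = [t for t in corpus if t in assign]            # score only already-clustered words
        best = None
        for ci, cj in combinations(clusters, 2):             # m + 1 clusters: try every pair
            trial = dict(assign)
            for v in clusters[cj]:
                trial[v] = ci
            q = quality(seen, trial)
            if best is None or q > best[0]:
                best = (q, ci, cj)
        _, ci, cj = best
        for v in clusters[cj]:                               # commit: back down to m clusters
            assign[v] = ci
        clusters[ci] |= clusters.pop(cj)
    return clusters, assign
```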
Miller et al, NAACL 2004
Name Tagging with Word Clusters and Discriminative Training
Scott Miller, Jethran Guinness, Alex Zamanian
BBN Technologies
10 Moulton Street
Cambridge, MA 02138
[email protected]
Abstract
... technique achieves a 25% reduction in error over the state-of-the-art HMM trained on the same material.

Miller et al, NAACL 2004

1 Introduction

At a recent meeting, we presented name-tagging technology to a potential user. The technology had performed well in formal evaluations, had been applied successfully by several research groups, and required only annotated training examples to configure for new name classes. Nevertheless, it did not meet the user's needs.

To achieve reasonable performance, the HMM-based technology we presented required roughly 150,000 words of annotated examples, and over a million words to achieve peak accuracy. Given a typical annotation rate of 5,000 words per hour, we estimated that setting up a name finder for a new problem would take four person days of annotation work – a period we considered reasonable. However, this user's problems were too dynamic for that much setup time. To be useful, the system would have to be trainable in minutes or hours, not days or weeks.

We left the meeting thinking about ways to reduce ...
Miller et al, NAACL 2004

1.  Tag + PrevTag
2.  Tag + CurWord
3.  Tag + CapAndNumFeatureOfCurWord
4.  ReducedTag + CurWord   //collapse start and continue tags
5.  Tag + PrevWord
6.  Tag + NextWord
7.  Tag + DownCaseCurWord
8.  Tag + Pref8ofCurrWord
9.  Tag + Pref12ofCurrWord
10. Tag + Pref16ofCurrWord
11. Tag + Pref20ofCurrWord
12. Tag + Pref8ofPrevWord
13. Tag + Pref12ofPrevWord
14. Tag + Pref16ofPrevWord
15. Tag + Pref20ofPrevWord
16. Tag + Pref8ofNextWord
17. Tag + Pref12ofNextWord
18. Tag + Pref16ofNextWord
19. Tag + Pref20ofNextWord

Table 2: Feature Set
Miller et al, NAACL 2004

[Figure 2: Impact of Word Clustering. F-Measure (y-axis) vs. Training Size (x-axis, 1,000 to 1,000,000 words) for the HMM and the Discriminative + Clusters systems.]
Miller et al, NAACL 2004

[Figure 3: Impact of Active Learning. F-Measure (y-axis) vs. Training Size (x-axis, 1,000 to 1,000,000 words) for four systems: Discriminative, Discriminative + Active, Discriminative + Clusters, and Discriminative + Clusters + Active.]