Multiword Expression Identification with Tree Substitution Grammars
Spence Green, Marie-Catherine de Marneffe,
John Bauer, and Christopher D. Manning
Stanford University
EMNLP 2011
Main Idea

Use syntactic context to find multiword expressions
Syntactic context → constituency parses
Multiword expressions → idiomatic constructions
Which languages?

Results and analysis for French
- Lexicographic tradition of compiling MWE lists
- Annotated data!
English examples in the talk
Motivating Example: Humans get this
1. He kicked the pail.
2. He kicked the bucket.
- “He died.” (Katz and Postal 1963)
Stanford parser can’t tell the difference

(S (NP He) (VP kicked (NP the pail)))
(S (NP He) (VP kicked (NP the bucket)))
What does the lexicon contain?

Single-word entries?
- kick : <agent, theme>
- die : <theme>
Multi-word entries?
- kick the bucket : <theme>

(S (NP He) (VP kicked (NP the bucket)))
Lexicon-Grammar: He kicked the bucket

(S (NP He) (VP died))
(S (NP He) (VP (MWV kicked the bucket)))
(Gross 1986)
MWEs in Lexicon-Grammar

Classified by global POS
Described by internal POS sequence
Flat structures!

(MWV (VBD kicked) (DT the) (NN bucket))

Of theoretical interest but...
Why do we care (in NLP)?

MWE knowledge improves:
- Dependency parsing (Nivre and Nilsson 2004)
- Constituency parsing (Arun and Keller 2005)
- Sentence generation (Hogan et al. 2007)
- Machine translation (Carpuat and Diab 2010)
- Shallow parsing (Korkontzelos and Manandhar 2010)

Most experiments assume high accuracy identification!
French and the French Treebank

MWEs common in French
- ∼5,000 multiword adverbs
Paris 7 French Treebank
- ∼16,000 trees
- 13% of tokens are MWE

(MWC (P sous) (N prétexte) (C que))  ‘on the grounds that’
French Treebank: MWE types

[Bar chart: %Total MWEs by global POS (N, P, ADV, C, V, D, PRO, CL, ET)]
Lots of nominal compounds
- e.g. N–N numéro deux
MWE Identification Evaluation

Identification is a by-product of parsing
- Corpus: Paris 7 French Treebank (FTB)
- Split: same as (Crabbé and Candito 2008)
- Metrics: Precision and Recall
- Lengths ≤ 40 words
MWE Identification: Parent-Annotated PCFG

[Bar chart of F1: PA-PCFG 32.6]
MWE Identification: n-gram methods

[Bar chart of F1: PA-PCFG 32.6, mwetoolkit 34.7]
Standard approach in 2008 MWE Shared Task, MWE Workshops, etc.
n-gram methods: mwetoolkit

Based on surface statistics
Step 1: Lemmatize and POS tag corpus
Step 2: Compute n-gram statistics:
- Maximum likelihood estimator
- Dice’s coefficient
- Pointwise mutual information
- Student’s t-score
(Ramisch, Villavicencio, and Boitet 2010)
n-gram methods: mwetoolkit

Step 3: Create n-gram feature vectors
Step 4: Train a binary classifier
Exploits statistical idiomaticity of MWEs
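The four steps can be sketched in a few lines of Python. This is an illustrative toy, not mwetoolkit itself: the two association measures follow their textbook definitions, the input is assumed to be an already lemmatized and POS-tagged token stream, and the example sentence is invented.

```python
# Toy sketch of mwetoolkit-style candidate scoring (not the real toolkit).
import math
from collections import Counter

def bigram_features(tokens):
    """Score every adjacent bigram with two association measures."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    feats = {}
    for (w1, w2), c12 in bigrams.items():
        c1, c2 = unigrams[w1], unigrams[w2]
        # Pointwise mutual information: log [ p(w1 w2) / (p(w1) p(w2)) ]
        pmi = math.log((c12 / (n - 1)) / ((c1 / n) * (c2 / n)))
        # Dice's coefficient: 2 c(w1 w2) / (c(w1) + c(w2))
        dice = 2 * c12 / (c1 + c2)
        feats[(w1, w2)] = {"pmi": pmi, "dice": dice}
    return feats

feats = bigram_features("kick the bucket kick the ball".split())
print(feats[("kick", "the")])
```

Step 4 would then feed these vectors (extended with MLE and t-score columns) to any off-the-shelf binary classifier.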
Is statistical idiomaticity sufficient?

French multiword verbs
Tree maintains relationship between MWV parts

(VN (MWV va) (MWADV d’ ailleurs) (MWV bon train))
‘is also well underway’
Recap: French MWE Identification Baselines

[Bar chart of F1: PA-PCFG 32.6, mwetoolkit 34.7]
Let’s build a better grammar
Better PCFGs: Manual grammar splits

Symbol refinement à la (Klein and Manning 2003)
- Has a verbal nucleus (VN)

(COORD-hasVN (C Ou) (ADV bien) (VN doit -il) ...)
‘Otherwise he must’
French MWE Identification: Manual Splits

[Bar chart of F1: PA-PCFG 32.6, mwetoolkit 34.7, Splits 63.1]
MWE features: high frequency POS sequences
Capture more syntactic context?

PCFGs work well!
Larger “rules”: Tree Substitution Grammars (TSG)
Relationship with Data-Oriented Parsing (DOP):
- Same grammar formalism (TSG)
- We include unlexicalized fragments
- Different parameter estimation
Which tree fragments do we select?

Full tree:
(S (NP (N He)) (VP (MWV (V kicked) (D the) (N bucket))))

One segmentation into fragments:
(NP (N He))   (V kicked)   (S NP (VP MWV))   (MWV V (D the) (N bucket))
TSG Grammar Extraction as Tree Selection

(MWV V (D the) (N bucket))
- Describes MWE context
- Allows for inflection: kick, kicked, kicking
Dirichlet process TSG (DP-TSG)

Tree selection as non-parametric clustering
(Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O’Donnell, Tenenbaum, and Goodman 2009)
Labeled Chinese Restaurant process
- Dirichlet process (DP) prior for each non-terminal type c
Supervised case: segment the treebank
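A labeled Chinese Restaurant process of this kind can be sketched as follows. This is a hedged illustration only: the class name and the fragment strings are invented, and the real model’s base distribution comes from the manually-split CFG rather than a hand-supplied function.

```python
# Minimal labeled CRP sketch: one "restaurant" per non-terminal type c,
# as in the DP-TSG. p0(c, e) is the base probability of fragment e under c.
from collections import defaultdict, Counter

class LabeledCRP:
    def __init__(self, alpha, p0):
        self.alpha = alpha                  # concentration parameter alpha_c
        self.p0 = p0                        # base distribution P0(e | c)
        self.counts = defaultdict(Counter)  # counts[c][e]: times e was seated
        self.totals = Counter()             # totals[c]: customers in restaurant c

    def prob(self, c, e):
        """Posterior predictive probability of fragment e for type c:
        (count(c, e) + alpha * P0(e | c)) / (total(c) + alpha)."""
        return ((self.counts[c][e] + self.alpha * self.p0(c, e))
                / (self.totals[c] + self.alpha))

    def observe(self, c, e):
        """Seat one occurrence of fragment e in restaurant c."""
        self.counts[c][e] += 1
        self.totals[c] += 1
```

Observing a fragment raises its predictive probability, which is the rich-get-richer clustering effect the model exploits for tree selection.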
DP-TSG: Learning and Inference

DP base distribution from manually-split CFG
Type-based Gibbs sampler (Liang, Jordan, and Klein 2010)
- Fast convergence: 400 iterations
Derivations of a TSG are a CFG forest
- SCFG decoder: cdec (Dyer et al. 2010)
French MWE Identification: DP-TSG

[Bar chart of F1: PA-PCFG 32.6, mwetoolkit 34.7, Splits 63.1, DP-TSG 71.1]
DP-TSG result is a lower bound
Human-interpretable DP-TSG rules

MWN → coup de N
- coup de pied ‘kick’
- coup de coeur ‘favorite’
- coup de foudre ‘love at first sight’
- coup de main ‘help’
- coup de grâce ‘death blow’

n-gram methods: separate feature vectors
DP-TSG errors: Overgeneration

DP-TSG:    (NP (D Le) (MWN (N marché) (A national)))
Reference: (NP (D Le) (N marché) (AP (A national)))
‘The national market’

MWEs are subtle; reference sometimes inconsistent
Standard Parsing Evaluation

Same setup as MWE identification!
- Corpus: Paris 7 French Treebank (FTB)
- Split: same as (Crabbé and Candito 2008)
- Metrics: Evalb and Leaf Ancestor
- Lengths ≤ 40 words
French Parsing Evaluation: All bracketings

[Bar chart of Evalb F1: PA-PCFG 67.6; Splits and DP-TSG close together at 75.2 and 75.8]
Paper: more results (Stanford, Berkeley, etc.)
Future Directions

Syntactic context for n-gram methods
- Parse the corpus!
- Adapt lexical context measures to syntactic context
DP-TSG
- Better base distribution
Conclusion

Parsers work well for MWE identification
Other languages: combine treebanks with MWE lists
Non-“gold mode” parsing results for French
Code → Google: “Stanford parser”
un grand merci.
thanks a lot.
Questions?
MWE Identification Results

[Bar chart of F1: PA-PCFG 32.6, mwetoolkit 34.7, Splits 63.1, Stanford and Berkeley at 69.6 and 70.1, DP-TSG 71.1]
Dirichlet process TSG

DP prior for each non-terminal type c ∈ V:
  θc | c, αc, P0(·|c) ∼ DP(αc, P0)
  e | θc ∼ θc
Binary variable bs for each non-terminal node in corpus
- Supervised case: segment the treebank (Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O’Donnell, Tenenbaum, and Goodman 2009)
DP-TSG: Base distribution P0

Phrasal rules:
  P0(A+ → B− C+) = pMLE(A → B C) · sB · (1 − sC)
- pMLE is the manually-split grammar!
- sB is the stop probability
Lexical insertion rules:
  P0(C+ → t) = pMLE(C → t) · p(t)
- p(t) is unigram probability of word t
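Both base-distribution rules reduce to one-line products. A minimal sketch, assuming toy dictionaries for pMLE, the stop probabilities s, and the unigram model p(t); in the paper these values come from the manually-split FTB grammar, not from hand-set numbers like these.

```python
# Sketch of the DP-TSG base distribution P0 over fragment expansions.
def p0_phrasal(p_mle, s, lhs, left, right):
    """P0(A+ -> B- C+) = p_MLE(A -> B C) * s_B * (1 - s_C).

    s[B]: probability of stopping at B (making it a substitution site);
    1 - s[C]: probability of continuing to expand C inside the fragment.
    """
    return p_mle[(lhs, left, right)] * s[left] * (1 - s[right])

def p0_lexical(p_mle, p_unigram, lhs, t):
    """P0(C+ -> t) = p_MLE(C -> t) * p(t), with p(t) a unigram model."""
    return p_mle[(lhs, t)] * p_unigram[t]

# Toy values, for illustration only.
p_mle = {("VP", "V", "NP"): 0.6, ("V", "kicked"): 0.2}
s = {"V": 0.5, "NP": 0.25}
print(p0_phrasal(p_mle, s, "VP", "V", "NP"))   # 0.6 * 0.5 * 0.75
print(p0_lexical(p_mle, {"kicked": 0.01}, "V", "kicked"))
```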
Tree substitution grammars

A Probabilistic TSG is a 5-tuple ⟨V, Σ, R, ♦, θ⟩
- c ∈ V are non-terminals
- ♦ ∈ V is a unique start symbol
- t ∈ Σ are terminals
- e ∈ R are elementary trees
- θc,e ∈ θ are parameters for each tree fragment
(elementary tree == tree fragment)
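One hypothetical, direct encoding of the 5-tuple: fragments are stored as bracketed strings purely for convenience in this sketch, and R is left implicit in the keys of θ.

```python
# Illustrative container for a probabilistic TSG <V, Sigma, R, start, theta>.
from dataclasses import dataclass, field

@dataclass
class PTSG:
    nonterminals: set                           # V
    terminals: set                              # Sigma
    start: str                                  # the unique start symbol
    theta: dict = field(default_factory=dict)   # theta[(c, e)]: fragment weight

    def add_fragment(self, root, fragment, prob):
        """Register elementary tree `fragment`, rooted at `root`, with weight prob."""
        self.theta[(root, fragment)] = prob

    def fragments(self, root):
        """All elementary trees rooted at a given non-terminal, with weights."""
        return {e: p for (c, e), p in self.theta.items() if c == root}
```

A derivation then composes fragments by substituting, at each frontier non-terminal, some fragment rooted at that symbol, which is why TSG derivations can be packed into a CFG forest.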