A Dynamic Oracle for Arc-Eager Dependency Parsing

"A Dynamic Oracle for Arc-Eager Dependency Parsing"
Goldberg, Nivre (2012)
"Training Deterministic Parsers with Non-Deterministic Oracles"
Goldberg, Nivre (2013)
Presentation for UM NLP Reading Group
Dan Pressel
Two Papers (Sort of)
• A Dynamic Oracle for Arc-Eager Dependency Parsing
• Really only presenting this one:
• Most of the new ideas come from the first paper; the second builds on, expands, and improves the first
• Arc-Eager is perhaps the most intriguing of the ensemble of parser types presented in both
• Covers the initial approach
• Less concerned with proving Arc Decomposition
• Doesn’t address Arc-Hybrid or Easy-First
• Training Deterministic Parsers with Non-Deterministic Oracles
• Some things notable in this presentation:
• Modifies the dynamic oracle approach
• Exploration criteria change
• Uses best choice from oracle for most decisions early on
• Uses an averaged perceptron for prediction/test, and a standard perceptron for the update rule
Dependency Parsing
Taken from Nivre, Kubler Dependency Parsing Tutorial COLING-ACL, 2006
Projective, Greedy Dependency Parsing
• Dependency graph for a sentence
• 𝑋 = 𝑤1 , … , 𝑤𝑛
• Labeled (or unlabeled!) graph 𝐺 = (𝑉𝑥 , 𝐴) consisting of nodes 𝑉𝑥 = {0, 1, … , 𝑛}
• Each node i is linear position of 𝑤𝑖 (plus root node 0)
• Arcs (i, l, j) from head 𝑤𝑖 to dependent 𝑤𝑗 with relationship l
• Projective
• For every arc (i, l, j) and every k such that min(𝑖, 𝑗) < 𝑘 < max(𝑖, 𝑗), there is a directed path from i to k
• No crossing arcs!!
• Greedy Parsing
• O(n) time parsing: see it, make a decision, move on
• Surprisingly high accuracy
• In the cases where more context is helpful, don’t just go for the CRF!
• K-Arc Eager (Beam Parsing), not discussed here
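The projectivity condition above can be checked directly. This is a minimal sketch, assuming an unlabeled tree stored as a list `heads` where `heads[j]` is the head of node j and node 0 is the root; the function name and the second example tree are illustrative only.

```python
def is_projective(heads):
    """heads[j] is the head of node j; heads[0] is None because node 0 is the root."""
    def dominated_by(i, k):
        # Walk up from k through its heads; True if we pass through i.
        while k is not None:
            if k == i:
                return True
            k = heads[k]
        return False

    for j, i in enumerate(heads):
        if i is None:
            continue
        lo, hi = min(i, j), max(i, j)
        # Every node strictly between head and dependent must descend from the head.
        if not all(dominated_by(i, k) for k in range(lo + 1, hi)):
            return False
    return True

# "He wrote her a letter ." with node 0 as the root
projective_tree = [None, 2, 0, 2, 5, 2, 2]
# A made-up tree where arc 3 -> 1 crosses arc 0 -> 2
crossing_tree = [None, 3, 0, 2, 2]

print(is_projective(projective_tree), is_projective(crossing_tree))
```

The check assumes the input is a well-formed tree (no cycles); it is quadratic in the worst case, which is fine for illustration.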
Non-Projective Parse Example
Taken from Nivre, Kubler Dependency Parsing Tutorial COLING-ACL, 2006
Transition Based Dependency Parsing
• Defines a non-deterministic transition system for mapping sentences to
dependency trees, and performs parsing as a search for the optimal
transition sequence for a given sentence
• Uses an “oracle”
• Predicts an optimal transition sequence for a sentence and its “gold” tree
• An oracle translates a given tree to a static sequence of parser transitions
• Used for training a parser
• Most transition systems exhibit spurious ambiguity
• Map several sequences to the same gold tree
• In ambiguous cases, static oracles define a canonical derivation order
(Greedy) Arc-Eager Dependency Parsing
• Definitions
• Stack – data structure containing “seen” items
• Buffer AKA queue – data structure containing “unseen” items
• LA: Left-arc transition (arcs from buffer to stack, head is on the right)
• Adds arc (b, l, s) to the tree, pops the stack, s cannot be root, s doesn’t already have a head
• Popping prevents s from taking on any dependents
• RA: Right-arc transition (arcs from stack to buffer, head is on the left)
• Adds arc (s, l, b), s is top of stack, b is top of the buffer, pushes b onto the stack
• Once b is on the stack, it must be popped before s can acquire any new relationships
• RE: Reduce transition
• Pops s from the stack without adding an arc; s must already have a head
• SH: Shift transition
• Removes top of buffer b and pushes to stack with no action
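The four transitions can be sketched as small functions over a configuration. This is an illustrative sketch, not the presenter's implementation: the `Config` class and function names are assumptions, and the initial configuration with the root node 0 on the stack is one common convention.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    stack: list                               # node indices, top is stack[-1]
    buffer: list                              # node indices, front is buffer[0]
    arcs: set = field(default_factory=set)    # (head, label, dependent) triples

    def has_head(self, node):
        return any(dep == node for _, _, dep in self.arcs)

def left_arc(c, label):
    s, b = c.stack[-1], c.buffer[0]
    assert s != 0 and not c.has_head(s)       # s is not root and has no head yet
    c.arcs.add((b, label, s))
    c.stack.pop()

def right_arc(c, label):
    s, b = c.stack[-1], c.buffer[0]
    c.arcs.add((s, label, b))
    c.stack.append(c.buffer.pop(0))

def reduce_(c):
    assert c.has_head(c.stack[-1])            # only nodes with a head may be reduced
    c.stack.pop()

def shift(c):
    c.stack.append(c.buffer.pop(0))

# "He wrote her a letter ." : 1=He 2=wrote 3=her 4=a 5=letter 6=.
c = Config(stack=[0], buffer=[1, 2, 3, 4, 5, 6])
shift(c); left_arc(c, "SBJ"); right_arc(c, "PRD"); right_arc(c, "IOBJ")
shift(c); left_arc(c, "DET"); reduce_(c); right_arc(c, "DOBJ")
reduce_(c); right_arc(c, "P")
print(sorted(c.arcs))
```

The usage lines replay one derivation of the example sentence used later in the deck and end with an empty buffer.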
Static Oracle Resolution
Problems with Standard Oracle
• They provide a set of rules under properties X, gold tree Y -> correct
transition Z
• Only correct as functions from gold trees to transition sequences, and not as
functions from configurations to transitions
• Due to spurious ambiguity it is not clear that the canonical transition
is easiest to learn
• If the parser deviates from the gold sequence, it reaches configurations
from which the correct tree is not derivable
• The parser’s classifier is then faced with configurations not seen in training
• Errors increase
Projective Ambiguous Example
What do I do next??
• Two Distinct Transition Sequences
• Remember: RA moved b (“her”) onto the stack
• We can reduce “her” right away, since it has no dependents
• Or we can shift “a”, process “a” and “letter”, then reduce “her” afterward
• Whenever there is an SH-RE ambiguity, the static oracle always picks SH!
SH, LA_SBJ, RA_PRD, RA_IOBJ, SH, LA_DET, RE, RA_DOBJ, RE, RA_p
SH, LA_SBJ, RA_PRD, RA_IOBJ, RE, SH, LA_DET, RA_DOBJ, RE, RA_p
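A static oracle that resolves the SH-RE ambiguity in favor of SH can be sketched as follows. This is a hedged sketch of one common rule ordering (LA if gold, else RA if gold, else RE only when b still needs an arc to something left of s, else SH); the function name and the dict encoding of the gold tree are assumptions, and the gold tree must be projective.

```python
def static_oracle_sequence(gold):
    """gold maps each dependent index to (head index, label); node 0 is the root."""
    def head(x):
        return gold[x][0] if x in gold else None

    stack, buffer, seq = [0], sorted(gold), []
    while buffer:
        s, b = stack[-1], buffer[0]
        if head(s) == b:                       # gold arc (b, l, s)
            seq.append("LA_" + gold[s][1])
            stack.pop()
        elif head(b) == s:                     # gold arc (s, l, b)
            seq.append("RA_" + gold[b][1])
            stack.append(buffer.pop(0))
        elif any(head(b) == k or head(k) == b for k in range(s)):
            seq.append("RE")                   # b still needs an arc to a node left of s
            stack.pop()
        else:
            seq.append("SH")                   # SH whenever RE is not forced
            stack.append(buffer.pop(0))
    return seq

# "He wrote her a letter ." : 1=He 2=wrote 3=her 4=a 5=letter 6=.
gold = {1: (2, "SBJ"), 2: (0, "PRD"), 3: (2, "IOBJ"),
        4: (5, "DET"), 5: (2, "DOBJ"), 6: (2, "p")}
print(", ".join(static_oracle_sequence(gold)))
```

On this sentence the sketch yields the first of the two sequences, the one that shifts “a” before reducing “her”.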
Projective Erroneous State Propagation
I can see how to minimize loss but this static oracle won’t let me!
• Let’s say we miss the RA transition from 2 to
3, and pick SH instead.
• “her” has no rels, so SH again
• Now we can left-arc (“letter”, DET, “a”), pops “a”, b
is still “letter”, s is “her”
• (“wrote”, DOBJ, “letter”) is unreachable, her in
gold arcs, so SH “letter” on, etc.
• Loss of 3 labeled attachments
Loss is 3 using static oracle
• With an LA (“letter”, DET, “her”), we could
have recovered almost all of the arcs with only
that single error
• Labeled attachment loss of 1
Loss is 1 using dynamic oracle
I’ll take it!!
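The loss figures on these slides count attachments that differ from gold. A minimal sketch of that count, assuming trees are stored as dicts from dependent to (head, label); the function name is illustrative, and the example tree is the recovery described above, where “her” is wrongly attached to “letter”.

```python
def attachment_loss(predicted, gold, labeled=True):
    """Number of words whose head (and label, if labeled) differs from gold."""
    loss = 0
    for dep, (g_head, g_label) in gold.items():
        p_head, p_label = predicted.get(dep, (None, None))
        if p_head != g_head or (labeled and p_label != g_label):
            loss += 1
    return loss

# "He wrote her a letter ." : 1=He 2=wrote 3=her 4=a 5=letter 6=.
gold = {1: (2, "SBJ"), 2: (0, "PRD"), 3: (2, "IOBJ"),
        4: (5, "DET"), 5: (2, "DOBJ"), 6: (2, "p")}

# Tree after the single erroneous LA ("letter", DET, "her"); every other arc is gold.
recovered = dict(gold)
recovered[3] = (5, "DET")

print(attachment_loss(recovered, gold))
```

Passing `labeled=False` ignores labels and gives the unlabeled variant of the same count.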
Idea: Make the Oracle Dynamic
• Don’t restrict to a canonical order of transitions: Allow all transitions
that can lead to a tree with minimum loss compared to the gold tree
• No single static canonical transition sequence
• Answers the question: Is transition Z valid in configuration X for producing the
best possible tree Y?
• No longer forces a unique transition sequence in situations where multiple sequences
derive the gold tree
• Well-defined and correct for all configurations, including ones which are not
part of the gold derivation
• Can handle configurations not part of any gold sequence
• Mitigates error propagation
Dynamic Oracle: How Do We Do This Though?
• Want to allow more than one transition sequence for a given tree
• Should define a relation from configurations to transitions, rather than a
function
• Want to make optimal predictions in all configurations
• Not optimal if it commits to a parsing error
• Want to reach the best tree still reachable from our current configuration
• Pick trees that minimize some loss function relative to the gold parse
• A transition is optimal iff at least one such minimum-loss tree is still reachable after taking it
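$$\mathrm{Cost}(t) = C(t; c, G_{gold}) = \min_{G \,:\, t(c) \rightsquigarrow G} \mathrm{Loss}(G, G_{gold}) \;-\; \min_{G \,:\, c \rightsquigarrow G} \mathrm{Loss}(G, G_{gold})$$
Here c ⇝ G means tree G is reachable from configuration c, and t(c) is the configuration after applying t to c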
Dynamic Oracle: Zero Cost Transitions Only Please!
• IOW, whatever move we make next must not cut off every path to the best
parse still achievable from the current configuration
• By definition, one of these moves must have zero cost
• So only allow transitions with zero cost, as follows:
$$o(t; c, G_{gold}) = \begin{cases} \mathbf{true} & \text{if } C(t; c, G_{gold}) = 0 \\ \mathbf{false} & \text{otherwise} \end{cases}$$
Arc-Eager System Costs for Transitions
• Definitions:
• s: top of the stack, b: top of the buffer, l: label for the arc, 𝐴𝑔𝑜𝑙𝑑: arcs of the gold tree
• 𝐶(𝐿𝐴; 𝑐, 𝐺𝑔𝑜𝑙𝑑)
• Adding (b, l, s) and popping s means that s cannot acquire heads or deps in the buffer
• Cost is then the # of arcs in 𝐴𝑔𝑜𝑙𝑑 of the form (k, l’, s) or (s, l’, k) where k is in the buffer
• Zero cost where (b, l, s) is in 𝐴𝑔𝑜𝑙𝑑, but also where b is not the gold head of s, the real head isn’t in the buffer, and there are no gold deps of s in the buffer
• 𝐶(𝑅𝐴; 𝑐, 𝐺𝑔𝑜𝑙𝑑)
• Adding (s, l, b) and pushing b onto the stack means b cannot acquire a head on the stack or in the buffer, nor any deps on the stack
• Cost is the # of arcs in 𝐴𝑔𝑜𝑙𝑑 of the form (k, l’, b) where k is on the stack or in the buffer, or of the form (b, l’, k) where k is on the stack
• Zero cost where (s, l, b) is in 𝐴𝑔𝑜𝑙𝑑, but also where s is not the gold head of b, the real head is neither on the stack nor in the buffer, and there are no gold deps of b on the stack
• 𝐶(𝑅𝐸; 𝑐, 𝐺𝑔𝑜𝑙𝑑)
• Popping s from the stack means that s cannot acquire deps in the buffer
• Cost is the # of arcs in 𝐴𝑔𝑜𝑙𝑑 of the form (s, l’, k) where k is in the buffer
• 𝐶(𝑆𝐻; 𝑐, 𝐺𝑔𝑜𝑙𝑑)
• Pushing b onto the stack means that b will not be able to acquire heads or deps from the stack
• Cost is the # of arcs in 𝐴𝑔𝑜𝑙𝑑 of the form (k, l’, b) or (b, l’, k) where k is on the stack
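The counting rules above translate fairly directly into code. This is an unlabeled sketch under stated assumptions: function and argument names are illustrative, `heads` holds the arcs built so far as dependent -> head, the arc a transition itself adds is never counted as lost, and refinements such as skipping stack items that already have a head are not modeled.

```python
def transition_costs(stack, buffer, heads, gold):
    """Return {transition: cost} for the permissible transitions.

    stack/buffer: node indices (top is stack[-1], front is buffer[0]).
    heads: arcs built so far, dependent -> head.
    gold: gold tree, dependent -> head.
    """
    if not buffer:
        return {}
    if not stack:
        return {"SH": 0}
    s, b = stack[-1], buffer[0]

    def n_deps(head, nodes):          # gold dependents of `head` among `nodes`
        return sum(1 for k in nodes if gold.get(k) == head)

    def head_in(dep, nodes):          # 1 if the gold head of `dep` is among `nodes`
        return int(gold.get(dep) in nodes)

    costs = {}
    if s != 0 and s not in heads:     # LA needs a headless, non-root s
        costs["LA"] = head_in(s, buffer[1:]) + n_deps(s, buffer)
    costs["RA"] = head_in(b, stack[:-1] + buffer[1:]) + n_deps(b, stack)
    if s in heads:                    # RE needs s to have a head already
        costs["RE"] = n_deps(s, buffer)
    costs["SH"] = head_in(b, stack) + n_deps(b, stack)
    return costs

def zero_cost(costs):
    return [t for t, c in costs.items() if c == 0]

gold = {1: 2, 2: 0, 3: 2, 4: 5, 5: 2, 6: 2}   # "He wrote her a letter ."

# Error state from the earlier slide: "her" (3) was shifted without a head.
after_error = transition_costs([0, 2, 3], [5, 6], {1: 2, 2: 0, 4: 5}, gold)
# Gold state right after RA_IOBJ: both RE and SH are zero-cost.
ambiguous = transition_costs([0, 2, 3], [4, 5, 6], {1: 2, 2: 0, 3: 2}, gold)
print(after_error, zero_cost(after_error))
print(ambiguous, zero_cost(ambiguous))
```

In the error state only LA is zero-cost, which matches the recovery shown earlier; in the gold state the oracle returns a set of two transitions rather than a single canonical one.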
Mitigating Error Propagation with Dynamic Oracle
• Want to mitigate error prop by allowing parser to explore larger
portions of configuration space during training
• Learn how to recover best from previous errors
• Dynamic oracles can do this since they only produce a set of optimal
transitions for each possible configuration!
• We can modify the online training algorithm of a static oracle to let it explore
the space
Online Training With Static Oracle
Online Training With Dynamic Oracle in ArcEager Paper
Two options explored for CHOOSE_NEXT()
• Spurious ambiguity
• Spurious ambiguity and non-optimal transitions
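The two CHOOSE_NEXT() options can be sketched as below. The function names, the `rng` argument, and the direction of the comparison with p are assumptions; the comparison is chosen so that p = 0.1 means the parser follows its own prediction about 90% of the time once exploration starts, as the Testing slide describes.

```python
import random

def choose_next_amb(predicted, zero_cost, rng=random):
    # Spurious ambiguity only: accept the prediction if it is optimal,
    # otherwise fall back to a random zero-cost transition.
    return predicted if predicted in zero_cost else rng.choice(zero_cost)

def choose_next_exp(iteration, predicted, zero_cost, k=2, p=0.1, rng=random):
    # Exploration: after the first k iterations, follow the model's own
    # prediction with probability 1 - p, even when it is not optimal.
    if iteration > k and rng.random() > p:
        return predicted
    return choose_next_amb(predicted, zero_cost, rng)

print(choose_next_amb("SH", ["SH", "RE"]))          # prediction is already optimal
print(choose_next_exp(1, "RA", ["LA"]))             # too early to explore
print(choose_next_exp(3, "RA", ["LA"], rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps a training run reproducible.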
Testing
• Averaged perceptron model, 15 iterations
• Features from Zhang and Nivre (2011)
• k = 2, p = 0.1
• Starting from the 3rd iteration, the parser follows its own prediction 90% of the time, even when that transition is non-optimal
• English Model Trained on Sections 2-21 of PTB WSJ converted to
Stanford basic deps
• POS tags from structured perceptron tagger, same corpus
• 4-fold jack-knifing
• CoNLL 2007 Corpora as-is
• Gold tags
Arc-Eager Paper Results for English Corpora
Arc-Eager Paper Results for CoNLL 2007
Results Summary for Arc-Eager Paper
• For English
• Dynamic Ambiguity generally improves both labeled and unlabeled scores by up to
0.5 percentage points absolute
• Except GRPS
• Dynamic Exploration does even better, up to about 1.5 percentage points
• For CoNLL sets
• Results for the Dynamic Ambiguity condition are mixed
• Scores drop for some languages
• Dynamic Exploration makes up for this in all except two languages
• Large UAS and LAS gains in most; the only two exceptions are Hungarian and Turkish
• Possibly can be improved with language specific tuning
• Overall Significant Improvement over static training
In “Training Deterministic Parsers with Non-Deterministic Oracles”
Algorithm has changed slightly
• Looking for best in oracle, and best in prediction
• Early on, follow the oracle, then branch out and explore a little later
• 15 training iterations, left-to-right parsers
• Standard perceptron for update rule
• Averaged perceptron for prediction, test
• k = 1, p = 0.9
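One training step with "best in oracle, best in prediction" might look like the sketch below. Everything here is an assumption for illustration: the `Perceptron` class, the naive averaging scheme, and the string features stand in for the real feature set, and the step only covers the update, not the choice of which transition to follow next.

```python
from collections import defaultdict

class Perceptron:
    def __init__(self):
        self.w = defaultdict(float)       # current weights, used during training
        self.total = defaultdict(float)   # running sum of weights, for averaging
        self.steps = 0

    def score(self, feats, t, averaged=False):
        if averaged:
            return sum(self.total.get((f, t), 0.0) for f in feats) / max(self.steps, 1)
        return sum(self.w.get((f, t), 0.0) for f in feats)

    def best(self, feats, transitions, averaged=False):
        return max(transitions, key=lambda t: self.score(feats, t, averaged))

    def update(self, feats, good, bad):
        # Standard perceptron update: reward the oracle's choice, penalize the prediction.
        for f in feats:
            self.w[(f, good)] += 1.0
            self.w[(f, bad)] -= 1.0

    def tick(self):
        # Naive averaging: add the current weights to the running sum once per step.
        self.steps += 1
        for key, value in self.w.items():
            self.total[key] += value

def train_step(model, feats, legal, zero_cost):
    t_p = model.best(feats, legal)        # best in prediction
    t_o = model.best(feats, zero_cost)    # best in oracle
    if t_p not in zero_cost:
        model.update(feats, good=t_o, bad=t_p)
    model.tick()
    return t_p, t_o

model = Perceptron()
feats = ["s=her", "b=letter"]             # hypothetical feature strings
first = train_step(model, feats, ["SH", "RA", "LA"], ["LA"])
second = train_step(model, feats, ["SH", "RA", "LA"], ["LA"])
print(first, second, model.best(feats, ["SH", "RA", "LA"], averaged=True))
```

At test time the averaged scores (`averaged=True`) would be used for prediction, while updates always apply to the unaveraged weights.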
Training Deterministic Parsers with Non-Deterministic Oracles Results for CoNLL 2007
Some References
• Goldberg, Nivre, "A Dynamic Oracle for Arc-Eager Dependency
Parsing" (2012)
• http://www.aclweb.org/anthology/C/C12/C12-1059.pdf
• Goldberg, Nivre, "Training Deterministic Parsers with Non-Deterministic Oracles" (2013)
• http://www.cs.bgu.ac.il/~yoavg/publications/tacl2013dynamic.pdf
• An implementation in Python of Arc-Eager and Arc-Hybrid with
Exploration (currently unlabeled only)
• https://github.com/dpressel/arcs-py
• K-arc (Beam) Arc-Eager Implementation referencing these papers
• https://github.com/yahoo/YaraParser