Practical Structured Learning for Natural Language Processing
Hal Daumé III ([email protected])
School of Computing, University of Utah
CS Colloquium, Brigham Young University, 14 Sep 2006
Slide 1
Structured Learning for Language
Hal Daumé III ([email protected])
What Is Natural Language Processing?
Processing of language to solve a problem.

Task                     Input             Output
Machine Translation      Arabic sentence   English sentence
Summarization            Long documents    Short documents
Information Extraction   Document          Database
Question Answering       Corpus + query    Answer
Parsing                  Sentence          Syntactic tree
Understanding            Sentence          Logical sentence

All have innate structure, to varying degrees.
Corpus-based NLP: collect example input/output pairs and learn from them.
Slide 2
What Is Machine Learning?
Automatically uncovering structure in data.

Task                       Input                        Output
Classification             Vector                       Bit (or bits)
Regression                 Vector                       Real number
Dimensionality reduction   Vector                       Shorter vector
Clustering                 Vectors                      Partition
Sequence labeling          Vectors                      Labels
Reinforcement learning     State space + observations   Policy
Slide 3
NLP versus Machine Learning

NLP tasks and their natural decomposition:
- Machine Translation: Arabic sentence -> English sentence
- Summarization: Long documents -> Short documents
- Information Extraction: Document -> Database
- Question Answering: Corpus + query -> Answer
- Parsing: Sentence -> Syntactic tree
- Understanding: Sentence -> Logical sentence

Standard machine learning encodings:
- Classification: Vector -> Bit (or bits)
- Regression: Vector -> Real number
- Dimensionality reduction: Vector -> Shorter vector
- Clustering: Vectors -> Partition
- Sequence labeling: Vectors -> Labels
- Reinforcement learning: State space + observations -> Policy

Forcing NLP problems into these encodings is ad hoc. The natural decomposition of NLP tasks is search; this talk develops a search-based, reinforcement-learning-inspired encoding that is provably good.
Slide 4
Entity Detection and Tracking
§1.3 (Example Problem: EDT)
Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

As the National Portrait Gallery planned to reveal that only one of half a dozen claimed portraits of William Shakespeare can now be considered genuine, Prof Hildegard Hammerschmidt-Hummel said she could prove that there were at least four surviving portraits of the playwright.
Why? Summarization, Machine Translation, etc...
Slide 5
Why is EDT Hard?
§1.3 (Example Problem: EDT)
- Long-range dependencies
- Syntax, semantics, and discourse
- World knowledge
- Highly constrained outputs
Shakespeare scholarship, lively at the best of times, saw the fur
flying yesterday after a German academic claimed to have
authenticated not just one but four contemporary images of the
playwright - and suggested, to boot, that he had died of cancer.
Slide 6
How is EDT Typically Attacked?
§5.2 (EDT: Prior Work)

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

Step 1, mention detection, as sequence labeling:
    Shakespeare/PER scholarship/* , ... a/* German/GPE academic/PER claimed/* ...

Step 2, coreference, as binary classification + clustering:
    {Shakespeare, playwright, he}    {German academic}
Learning in NLP Applications
The recurring ingredients: Model, Features, Learning, Search.
Slide 8
Result: Searn (= "Search + Learn")
§3 (Search-based Structured Prediction)
- Computationally efficient
- Strong theoretical guarantees
- State-of-the-art performance
- Broadly applicable
- Easy to implement
Slide 9
Talk Outline
- Searn: Search-based Structured Prediction
- Experimental Results:
  - Sequence Labeling
  - Entity Detection and Tracking
  - Automatic Document Summarization
- Discussion and Future Work
Slide 10
Structured Prediction
§2.2 (Machine Learning: Structured Prediction)

Formulated as a maximization problem:

    y* = argmax_{y in Y(x)} f(x, y; θ)

The argmax is the search, Y(x) is the model (the space of candidate outputs), f supplies the features, and θ is fit by learning.

Search (and learning) is tractable only for very simple Y(x) and f.
Slide 11
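To make the formulation concrete, here is a minimal brute-force sketch of the argmax (not from the talk; the label set, weights, and function names are illustrative). It enumerates Y(x) explicitly, which is exactly why search is tractable only for very simple output spaces:

```python
from itertools import product

# Hypothetical tag set for sequence labeling; purely illustrative.
LABELS = ["D", "N", "V", "R"]

def f(x, y, theta):
    """Score a candidate labeling y of sentence x under a linear model."""
    score = 0.0
    for i, (word, label) in enumerate(zip(x, y)):
        score += theta.get((word, label), 0.0)          # emission feature
        if i > 0:
            score += theta.get((y[i - 1], label), 0.0)  # transition feature
    return score

def predict(x, theta):
    """Brute-force search: enumerate all of Y(x) = LABELS^len(x)."""
    return max(product(LABELS, repeat=len(x)),
               key=lambda y: f(x, list(y), theta))
```

Enumeration costs O(K^T) for K labels and T words; any realistic Y(x) demands smarter search.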
Reduction for Structured Prediction
§3.3 (Search-based Structured Prediction)

Idea: view structured prediction in light of search.

Example: part-of-speech tagging "I sing merrily". The output y is built one word at a time; at each word the search branches over the candidate tags (D, N, P, R, V).

Each step here looks like it could be represented as a weighted multiclass problem. Can we formalize this idea?
Slide 12
Error-Limiting Reductions
§2.3 (Machine Learning: Reductions)

A formal mapping between learning problems [Beygelzimer et al., ICML 2005]:

    Learning Problem "A"      --F-->    Learning Problem "B"
    (what we want to solve)  <--F⁻¹--   (what we know how to solve)

- F maps examples from A to examples from B
- F⁻¹ maps a classifier h for B to a classifier that solves A
- Error limiting: good performance on B implies good performance on A
Slide 13
Our Reduction Goal
§2.3 (Machine Learning: Reductions)

    Structured Prediction
        --"Searn"-->
    Cost-sensitive Classification
        --"Weighted All Pairs" [Beygelzimer et al., ICML 2005]-->
    Importance Weighted Classification
        --"Costing" [Zadrozny, Langford & Abe, ICDM 2003]-->
    Binary Classification

- Error bounds get composed
- New techniques for binary classification lead to new techniques for structured prediction
- Any generalization theory for binary classification immediately applies to structured prediction
Slide 14
Two Easy Reductions
§2.3 (Machine Learning: Reductions)

Costing [Zadrozny, Langford & Abe, ICDM 2003] turns importance-weighted examples into plain binary ones:

    X × {±1} × R≥0  -->  X × {±1},    with  L_IW ≤ L_BC · E[w]

Weighted-All-Pairs [Beygelzimer, Dani, Hayes, Langford and Zadrozny, ICML 2005] turns a K-class cost-sensitive problem into importance-weighted binary problems, one per label pair (1v2, 1v3, ..., 1vK, 2v3, ..., 2vK):

    X × R≥0^K  -->  X × {±1} × R≥0,    with  L_CS ≤ 2 · L_IW
Slide 15
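As a concrete illustration of the Costing reduction, here is a sketch of its rejection-sampling idea (not the authors' code; names are illustrative): each importance-weighted example (x, y, w) is kept with probability w / w_max, producing an unweighted binary dataset for any off-the-shelf learner.

```python
import random

def costing_resample(examples, rng=None):
    """One pass of Costing's rejection sampling (a sketch): keep each
    importance-weighted example (x, y, w) with probability w / w_max,
    yielding an unweighted binary classification dataset."""
    rng = rng or random.Random(0)
    w_max = max(w for _, _, w in examples)
    return [(x, y) for x, y, w in examples if rng.random() < w / w_max]
```

In practice one trains a binary classifier on several such resampled sets and averages; the guarantee above says binary loss L_BC translates into importance-weighted loss at most L_BC · E[w].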
§3.3 (Search-based Structured Prediction)
A First Attempt

Tag "I sing merrily" left to right; at each correct state, train a classifier to choose the next state.

The problem: at test time one mistake puts us in a state we never trained on, and errors compound.

Thm (Kääriäinen): There is a 1st-order binary Markov problem such that a classification error rate of ε implies a Hamming loss of:

    T/2 + 1/2 − (1 − (1 − 2ε)^(T+1)) / (4ε)

which approaches T/2 (half the labels wrong) as ε grows.
Slide 16
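The theorem's closed form can be sanity-checked numerically. The helper below uses the bound as best it can be reconstructed from the slide (treat the exact form as an assumption); its limits behave as expected: near-zero loss as ε → 0, and exactly T/2 at ε = 1/2.

```python
def kaariainen_hamming_loss(eps, T):
    """Hamming-loss lower bound for the naive per-step reduction
    (closed form reconstructed from the slide; an assumption)."""
    return T / 2 + 0.5 - (1 - (1 - 2 * eps) ** (T + 1)) / (4 * eps)
```

For small ε the expression grows like εT²/2, i.e. quadratically in the sequence length, which is what motivates a better reduction.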
§3.4.2 (Searn: Optimal Policy)
Reducing Structured Prediction

Same search picture: build the tag sequence for "I sing merrily" one step at a time.

Key Assumption: an Optimal Policy for the training data.
    Given: input, true output, and current state;
    Return: best successor state.

This assumption is weak! (For sequence labeling under Hamming loss, the optimal policy is simply the true next label.)
Slide 17
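For sequence labeling under Hamming loss the optimal policy fits on one line, which is why the assumption is so mild (a sketch; the state representation is illustrative):

```python
def optimal_policy_hamming(x, true_y, state):
    """Optimal policy for sequence labeling under Hamming loss: the state
    is the prefix of labels chosen so far, and the best successor state
    is reached by emitting the true label at the next position."""
    return true_y[len(state)]
```

For losses like summarization quality no such closed form exists; the talk returns to that case later.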
Idea
§3.4 (Searn: Training)

Start from the Optimal Policy and gradually move to a Learned Policy that depends only on the features, not on the true output.
Slide 18
Iterative Algorithm: Searn
§3.4.3 (Searn: Algorithm)

Set current policy = optimal policy.
Repeat:
- Decode using the current policy:
      ... after a German/GPE academic/PER claimed ...    (L = 0)
- Generate classification problems: at each state, cost every candidate action by the loss of the output it leads to, e.g. for "German":
      GPE -> L = 0     ORG -> L = 0.2     * -> L = 0.25     PER -> L = 0.3
- Learn a new multiclass classifier
- Interpolate: cur = β·cur + (1−β)·new
Slide 19
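The loop above can be sketched in a few lines (a simplified sketch, not the reference implementation: per-action costs are collapsed to 0/1 against the true label, and the policy interpolation is realized as a stochastic mixture; `learn` stands for any multiclass cost-sensitive learner):

```python
import random

def searn_train(examples, optimal_policy, learn, iters=3, beta=0.3,
                rng=random.Random(0)):
    """Skeleton of the Searn loop. Policies map (x, true_y, state) to an
    action; the interpolated policy is kept as a weighted mixture."""
    policies = [optimal_policy]   # mixture starts as the optimal policy
    weights = [1.0]

    def current_policy(x, true_y, state):
        # Sample one component of the interpolated policy.
        return rng.choices(policies, weights=weights)[0](x, true_y, state)

    for _ in range(iters):
        cost_examples = []
        for x, true_y in examples:
            state = []
            for t in range(len(x)):
                # Cost of each action; a full implementation would roll
                # out the current policy to completion and measure loss.
                costs = {a: float(a != true_y[t]) for a in set(true_y)}
                cost_examples.append((x, t, list(state), costs))
                state.append(current_policy(x, true_y, state))
        new_policy = learn(cost_examples)
        # Interpolate: cur = beta*cur + (1-beta)*new
        weights = [w * beta for w in weights] + [1.0 - beta]
        policies.append(new_policy)
    return policies, weights
```

Note how each iteration collects training states from where the *current* policy actually lands, not just from the gold path.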
Theoretical Analysis
§3.5 (Searn: Theoretical Analysis)

Theorem: For conservative β, after 2T³ ln T iterations, the loss of the learned policy h is bounded as follows:

    L(h) ≤ L(h₀) + 2T ln T · ℓ_avg + (1 + ln T)/T · c_max

where L(h₀) is the loss of the optimal policy, ℓ_avg is the average multiclass classification loss, and c_max is the worst-case per-step loss.
Slide 20
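To see what the bound says in practice, the helper below (illustrative, not from the talk) evaluates its right-hand side for concrete values; the regret over the optimal policy scales roughly as T ln T times the average classifier loss, instead of compounding as in the naive approach.

```python
import math

def searn_bound(loss_opt, loss_avg, c_max, T):
    """Right-hand side of the Searn guarantee:
    L(h) <= L(h0) + 2*T*ln(T)*l_avg + (1 + ln(T))/T * c_max."""
    return loss_opt + 2 * T * math.log(T) * loss_avg \
        + (1 + math.log(T)) / T * c_max
```

With a perfect classifier (loss_avg = 0) the bound collapses to the optimal-policy loss plus a vanishing (1 + ln T)/T term.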
Why Does Searn Work?
§3.7 (Searn: Discussion and Conclusions)
- A classifier tells us how to search
- It is impossible to train on the full space
- We want to train only where we'll wind up
- Searn tells us where we'll wind up
Slide 21
Talk Outline
- Searn: Search-based Structured Prediction
- Experimental Results:
  - Sequence Labeling
  - Entity Detection and Tracking
  - Automatic Document Summarization
- Discussion and Future Work
Slide 22
Proof of Concept: Sequence Labeling
§4.4 (Sequence Labeling: Comparison)

[Bar charts; exact values are not fully recoverable from the extraction. Handwriting Recognition (accuracies roughly 65-90): Searn and Searn+data vs. M3N, SVM, LR. Spanish Named Entity Recognition (roughly 93-98): Searn vs. CRF and SVM. Chunking+Tagging (roughly 94-97): Searn vs. CRF, LaSO, and the Incremental Perceptron. Searn is competitive with or better than the alternatives on each task.]
Slide 23
Talk Outline
- Searn: Search-based Structured Prediction
- Experimental Results:
  - Sequence Labeling
  - Entity Detection and Tracking
  - Automatic Document Summarization
- Discussion and Future Work
Slide 24
§5.6.1 (EDT: Joint Detection and Coreference: Search)
Entity Detection and Tracking

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

Search branches on coreference decisions: for the mention "he" in "... he had died of cancer", each branch links it to a different previously established entity (or starts a new one).
Slide 25
EDT Features
§5.[45].3 (EDT: Feature Functions)

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

- Lexical
- Syntactic
- Semantic
- Gazetteer
- Knowledge-based
Slide 26
EDT Results (ACE 2004 data)
§5.6.3 (EDT: Experimental Results)

[Bar chart, scores roughly 72-80: Searn outperforms four comparison systems (Sys1-Sys4); the previous state of the art used extra proprietary data.]
Slide 27
Feature Contributions
§5.6.3 (EDT: Experimental Results)

[Bar chart, scores roughly 68-82: ablations removing each feature class in turn (-Lex, -Syn, -Sem, -Gaz, -KB).]
Slide 28
Talk Outline
- Searn: Search-based Structured Prediction
- Experimental Results:
  - Sequence Labeling
  - Entity Detection and Tracking
  - Automatic Document Summarization
- Discussion and Future Work
Slide 29
New Task: Summarization

User: "Give me information about the Falkland islands war."

A long answer — "Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands." — draws: "That's too much information to read!"

A short answer — "The Falkland islands war, in 1982, was fought between Britain and Argentina." — draws: "That's perfect!"

The standard approach is sentence extraction, but that is often deemed too "coarse" to produce good, very short summaries. We wish to also drop words and phrases => document compression.
Slide 30
Structure of Search
§6.1 (Summarization: Vine-Growth Model)

To make search more tractable, we first run an initial round of sentence extraction at 5x the target length. Then:
- Lay the extracted sentences out sequentially (S1, S2, S3, S4, S5, ..., Sn)
- Generate a dependency parse of each sentence
- Mark each root as a frontier node
- Repeat:
  - Choose a frontier node to add to the summary
  - Add all its children to the frontier
- Finish when we have enough words

Example: "The man ate a big sandwich" grows from the root "ate", through "man" and "sandwich", down to "The", "a", "big".

The Optimal Policy is not analytically available; it is approximated with beam search.
Slide 31
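The frontier mechanics above can be sketched directly (a simplified sketch: a learned policy would choose which frontier node to promote, while here the choice is stubbed as first-in-first-out; names are illustrative):

```python
def vine_growth(roots, children, word_limit):
    """Sketch of vine-growth summary construction: start from the
    dependency roots, repeatedly move a frontier node into the summary
    and expose its children, until the word budget is spent."""
    frontier = list(roots)        # e.g. one parse root per sentence
    summary = []
    while frontier and len(summary) < word_limit:
        node = frontier.pop(0)    # policy choice stubbed as FIFO
        summary.append(node)
        frontier.extend(children.get(node, []))
    return summary
```

Because nodes only enter the summary after their parents, every partial summary stays a connected sub-tree of the dependency parse.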
Example Output (40 word limit)
§6.7 (Summarization: Error Analysis)

Sentence Extraction + Compression (score +13):
"Argentina and Britain announced an agreement, nearly eight years after they fought a 74-day war a populated archipelago off Argentina's coast. Argentina gets out the red carpet, official royal visitor since the end of the Falklands war in 1982."

Vine Growth (score +24):
"Argentina and Britain announced to restore full ties, eight years after they fought a 74-day war over the Falkland islands. Britain invited Argentina's minister Cavallo to London in 1992 in the first official visit since the Falklands war in 1982."

Content units (with weights):
  6  Diplomatic ties restored
  3  Falkland war was in 1982
  5  Major cabinet member visits
  3  Cavallo visited UK
  5  Exchanges were in 1992
  2  War was 74-days long
  3  War between Britain and Argentina
Slide 32
Results
§6.6 (Summarization: Experimental Results)
[Daumé III and Marcu, DUC 2005]

[Bar chart, human evaluation scores roughly 2-4; groupings approximate: Vine Growth (Searn), approximately optimal Vine Growth, optimal Extract @100, and an Extract Baseline @100.]
Slide 33
Talk Outline
- Searn: Search-based Structured Prediction
- Experimental Results:
  - Sequence Labeling
  - Entity Detection and Tracking
  - Automatic Document Summarization
- Discussion and Future Work
Slide 34
What Can Searn Do?
§7 (Conclusions and Future Directions)

Solve structured prediction problems...
- ...efficiently.
- ...with theoretical guarantees.
- ...with weak assumptions on structure.
- ...and outperform the competition.
- ...with little extra code required.
Slide 35
Thanks!
Questions?
Slide 36