Hal Daumé III ([email protected]) Practical Structured Learning for Natural Language Processing Hal Daumé III School of Computing University of Utah [email protected] CS Colloquium, Brigham Young University, 14 Sep 2006 Slide 1 Structured Learning for Language Hal Daumé III ([email protected]) All have innate What Is Natural Language Processing? structure to Processing of language tovarying solve adegrees problem Input Output Machine Translation Arabic sentence English sentence Summarization Long documents Short documents Information Extraction Document Database Question Answering Corpus + query Answer Understanding Sentence Logical sentence Corpus-based NLP: Collect example input/output learn tree Parsing Sentence pairs andSyntactic Slide 2 Structured Learning for Language Hal Daumé III ([email protected]) What is Machine Learning? Automatically uncovering structure in data Input Output Classification Vector Bit (or bits) Regression Vector Real number Dimensionality reduction Vector Shorter vector Clustering Vectors Partition Sequence labeling Vectors Labels Reinforcement learning State space + observations Slide 3 Policy Structured Learning for Language Hal Daumé III ([email protected]) NLP versus Machine Learning Machine Translation Summarization Information Extraction Question Answering Parsing Understanding Arabic sentence Long documents Document Corpus + query Sentence Sentence Natural Decomposition Encoding Classification Regression Dimensionality reduction Clustering Sequence labeling Vector Vector Vector Vectors Vectors Reinforcement learning- State space inspired Slide 4 English sentence Short documents Database Answer Syntactic tree Logical sentence Ad Hoc Search Provably Good Bit (or bits) Real number Shorter vector Partition Labels Policy Structured Learning for Language Hal Daumé III ([email protected]) Entity Detection and Tracking §1.3 (Example Problem: EDT) Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a 
German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer. As the National Portrait Gallery planned to reveal that only one of half a dozen claimed portraits of William Shakespeare can now be considered genuine, Prof Hildegard Hammerschmidt-Hummel said she could prove that there were at least four surviving portraits of the playwright.
Why EDT? It feeds summarization, machine translation, etc.

Slide 6: Why is EDT Hard?
- Long-range dependencies
- Syntax, semantics and discourse
- World knowledge
- Highly constrained outputs
[The Shakespeare passage from Slide 5 is shown again.]

Slide 7: How is EDT Typically Attacked?
[The Shakespeare passage again.] Mention detection is framed as sequence labeling over the tokens:
Shakespeare scholarship , ... a German academic claimed ...
PER * * ... * GPE PER *
§5.2 (EDT: Prior Work)
Coreference then groups the detected mentions into entities, e.g. {playwright, Shakespeare, he} and {German academic}. So the typical pipeline is: mention detection as sequence labeling, then coreference as binary classification + clustering.

Slide 8: Learning in NLP Applications
Four ingredients: Model, Features, Learning, Search.

Slide 9: Result: Searn (= "Search + Learn") §3 (Search-based Structured Prediction)
- Computationally efficient
- Strong theoretical guarantees
- State-of-the-art performance
- Broadly applicable
- Easy to implement

Slide 10: Talk Outline
- Searn: Search-based Structured Prediction
- Experimental Results: Sequence Labeling; Entity Detection and Tracking; Automatic Document Summarization
- Discussion and Future Work

Slide 11: Structured Prediction §2.2 (Machine Learning: Structured Prediction)
Formulated as a maximization problem:
    y = argmax_{y ∈ Y(x)} f(x, y; θ)
The argmax over Y(x) is the search; f encodes the model and features; θ is learned. Search (and learning) is tractable only for very simple Y(x) and f.

Slide 12: Reduction for Structured Prediction §3.3 (Search-based Structured Prediction)
Idea: view structured prediction in light of search.
[Lattice: each word of "I sing merrily" can take any tag in {D, N, P, R, V}; a complete tagging is a path through the lattice.]
Each step here looks like it could be represented as a weighted multiclass problem. Can we formalize this idea?
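The step-by-step view from Slide 12 can be sketched in code: decode a tag sequence greedily, left to right, making one ordinary multiclass decision per word. This is a minimal illustration, not the talk's implementation; the feature template and the hand-written `toy_scorer` standing in for a learned classifier are hypothetical.

```python
# Sketch of "structured prediction as search": one multiclass decision per word.

TAGS = ["D", "N", "P", "R", "V"]  # the candidate tags from the Slide 12 lattice

def features(words, position, tags_so_far):
    """Feature vector for one search step: current word plus previous tag."""
    prev_tag = tags_so_far[-1] if tags_so_far else "<s>"
    return {"word=" + words[position]: 1.0, "prev=" + prev_tag: 1.0}

def greedy_decode(words, step_classifier):
    """Label left to right; each step is a weighted multiclass problem."""
    tags = []
    for i in range(len(words)):
        scores = {t: step_classifier(features(words, i, tags), t) for t in TAGS}
        tags.append(max(scores, key=scores.get))
    return tags

def toy_scorer(feats, tag):
    """Hand-written stand-in for a learned per-step classifier."""
    table = {("word=I", "P"): 1.0, ("word=sing", "V"): 1.0,
             ("word=merrily", "R"): 1.0}
    return sum(v for k, v in feats.items() if (k, tag) in table)

print(greedy_decode(["I", "sing", "merrily"], toy_scorer))  # -> ['P', 'V', 'R']
```

The open question the slide poses is exactly the weakness of this sketch: the classifier is only ever trained and queried along one path, which Searn addresses below.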
Slide 13: Error-Limiting Reductions §2.3 (Machine Learning: Reductions)
A formal mapping between learning problems [Beygelzimer et al., ICML 2005]:
- F maps learning problem "A" (what we want to solve) to learning problem "B" (what we know how to solve); concretely, F maps examples from A to examples from B.
- F⁻¹ maps a classifier h for B to a classifier that solves A.
- Error limiting: good performance on B implies good performance on A.

Slide 14: Our Reduction Goal §2.3 (Machine Learning: Reductions)
Structured Prediction → ("Searn") → Cost-sensitive Classification → ("Weighted All Pairs" [Beygelzimer et al., ICML 2005]) → Importance-Weighted Classification → ("Costing" [Zadrozny, Langford & Abe, ICDM 2003]) → Binary Classification
- Error bounds compose along the chain.
- New techniques for binary classification lead to new techniques for structured prediction.
- Any generalization theory for binary classification immediately applies to structured prediction.

Slide 15: Two Easy Reductions §2.3 (Machine Learning: Reductions)
- Costing [Zadrozny, Langford & Abe, ICDM 2003]: importance-weighted examples in X × {±1} × ℝ≥0 reduce to binary examples in X × {±1}, with bound L_IW ≤ L_BC · E[w].
- Weighted-All-Pairs: cost-sensitive examples in X × ℝ≥0^K reduce to importance-weighted binary problems, one per pair of classes: 1v2, 1v3, …, 1vK, 2v3, …,
2vK, and so on [Beygelzimer, Dani, Hayes, Langford and Zadrozny, ICML 2005], with bound L_CS ≤ 2·L_IW.

Slide 16: A First Attempt §3.3 (Search-based Structured Prediction)
[The "I sing merrily" tagging lattice again.] At each correct state: train a classifier to choose the next state.
Thm (Kääriäinen): There is a 1st-order binary Markov problem such that a classification error of ε implies a Hamming loss of:
    T/2 − (1 − (1 − 2ε)^{T+1}) / (4ε) + 1/2 ≈ T/2

Slide 17: Reducing Structured Prediction §3.4.2 (Searn: Optimal Policy)
Key assumption: an optimal policy for the training data. Given the input, the true output, and a state, it returns the best successor state. This is a weak assumption!

Slide 18: Idea §3.4 (Searn: Training)
Move gradually from the optimal policy to a learned policy (driven by features).

Slide 19: Iterative Algorithm: Searn §3.4.3 (Searn: Algorithm)
Set current policy = optimal policy. Repeat:
- Decode using the current policy (e.g. "... after a German academic claimed ...").
- Generate classification problems from the visited states, with a loss per choice (e.g. "* * GPE PER *": L = 0; "PER * * * *": L = 0.3; "ORG PER * * *": L = 0.2; "* PER * * *": L = 0.25).
- Learn a new multiclass classifier.
- Interpolate: cur ← β·cur + (1 − β)·new.

Slide 20: Theoretical Analysis §3.5 (Searn: Theoretical Analysis)
Theorem: For conservative β, after 2T³ ln T iterations, the loss of the learned policy h is bounded as follows:
    L(h) ≤ L(h₀) + 2T ln T · ℓ_avg + (1 + ln T)/T · c_max
where L(h₀) is the loss of the optimal policy, ℓ_avg is the average multiclass classification loss, and c_max is the worst-case per-step loss.
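The iterative algorithm of Slide 19 can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the talk's implementation: `optimal_policy`, `rollout_loss`, and `train_multiclass` are hypothetical stand-ins for the components the talk assumes, and the interpolation cur ← β·cur + (1 − β)·new is realized as a stochastic mixture of policies.

```python
# Minimal sketch of the Searn training loop; policies map a state (the input
# plus the partial output so far) to the next action.
import random

def interpolate(cur_policy, new_policy, beta):
    """Stochastic mixture realizing cur <- beta*cur + (1-beta)*new."""
    def mixed(state):
        return cur_policy(state) if random.random() < beta else new_policy(state)
    return mixed

def searn(examples, horizon, actions, optimal_policy, rollout_loss,
          train_multiclass, iterations=3, beta=0.5):
    cur = optimal_policy                        # start from the optimal policy
    for _ in range(iterations):
        cost_sensitive = []                     # (state, {action: loss}) pairs
        for x in examples:
            prefix = []                         # decode x with the current policy
            for _ in range(horizon(x)):
                state = (x, tuple(prefix))
                # One cost-sensitive example per visited state: the loss of
                # taking each possible action from here.
                costs = {a: rollout_loss(state, a, cur) for a in actions}
                cost_sensitive.append((state, costs))
                prefix.append(cur(state))       # keep decoding with cur
        new = train_multiclass(cost_sensitive)  # learn a new classifier
        cur = interpolate(cur, new, beta)       # interpolate the policies
    return cur
```

In the talk's setting, `rollout_loss` would come from completing the partial output and measuring the task loss, and `train_multiclass` would be any cost-sensitive multiclass learner obtained through the reduction chain of Slide 14.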
Slide 21: Why Does Searn Work? §3.7 (Searn: Discussion and Conclusions)
- A classifier tells us how to search.
- It is impossible to train on the full space; we want to train only where we'll wind up.
- Searn tells us where we'll wind up.

Slide 22: Talk Outline (repeated; next: Sequence Labeling)

Slide 23: Proof of Concept: Sequence Labeling §4.4 (Sequence Labeling: Comparison)
[Bar charts of accuracy/F-score on three tasks (Handwriting Recognition, Spanish Named Entity Recognition, Chunking + Tagging) comparing Searn and Searn+data against CRF, SVM, LR, M3N, LaSO, and the Incremental Perceptron; Searn is competitive with the best alternative on each task.]

Slide 24: Talk Outline (repeated; next: Entity Detection and Tracking)

Slide 25: Entity Detection and Tracking §5.6.1 (EDT: Joint Detection and Coreference: Search)
[The Shakespeare passage again.] Decoding proceeds mention by mention. For "... he had died of cancer", the search keeps one hypothesis per possible resolution of "he" (shown as repeated copies of the fragment).
Each hypothesis links "he" to a different previously seen entity, or starts a new one.

Slide 26: EDT Features §5.4.3/§5.5.3 (EDT: Feature Functions)
For the Shakespeare passage, features are drawn from five groups:
- Lexical
- Syntactic
- Semantic
- Gazetteer
- Knowledge-based

Slide 27: EDT Results (ACE 2004 data) §5.6.3 (EDT: Experimental Results)
[Bar chart, ACE score roughly 72-80, comparing four prior systems (Sys1-Sys4) with Searn; Searn scores highest, while the previous state of the art relied on extra proprietary data.]

Slide 28: Feature Contributions §5.6.3 (EDT: Experimental Results)
[Bar chart, roughly 68-82: performance when each feature group is removed in turn: -Lex, -Syn, -Sem, -Gaz, -KB.]

Slide 29: Talk Outline (repeated; next: Automatic Document Summarization)

Slide 30: New Task: Summarization
User: "Give me information about the Falkland islands war." Returning full documents ("Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands. ...") is too much information to read; a short summary ("The Falkland islands war, in 1982, was fought between Britain and Argentina.") is perfect.
The standard approach is sentence extraction, but that is often deemed too "coarse" to produce good, very short summaries.
We also wish to drop words and phrases => document compression.

Slide 31: Structure of Search §6.1 (Summarization: Vine-Growth Model)
To make search more tractable, we first run a round of sentence extraction at 5x the target length. Then:
- Lay the extracted sentences out sequentially and generate a dependency parse of each.
- Mark each parse root as a frontier node.
- Repeat: choose a frontier node to add to the summary, then add all its children to the frontier.
- Finish when we have enough words.
[Example dependency tree for "The man ate a big sandwich", with nodes marked as frontier nodes or summary nodes.]
The optimal policy is not analytically available; it is approximated with beam search.

Slide 32: Example Output (40-word limit) §6.7 (Summarization: Error Analysis)
Sentence extraction + compression (score +13): "Argentina and Britain announced an agreement, nearly eight years after they fought a 74-day war a populated archipelago off Argentina's coast. Argentina gets out the red carpet, official royal visitor since the end of the Falklands war in 1982."
Vine growth (score +24): "Argentina and Britain announced to restore full ties, eight years after they fought a 74-day war over the Falkland islands. Britain invited Argentina's minister Cavallo to London in 1992 in the first official visit since the Falklands war in 1982."
Weighted content units: diplomatic ties restored (6); major cabinet member visits (5); exchanges were in 1992 (5); Falkland war was in 1982 (3); Cavallo visited UK (3); war between Britain and Argentina (3); war was 74 days long (2).

Slide 33: Results §6.6 (Summarization: Experimental Results)
[Bar chart, human scores roughly 2-4, comparing Approximate Optimal Vine Growth, Vine Growth (Searn), Optimal Extract @100, and Extract Baseline @100; from Daumé III and Marcu, DUC 2005.]
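The vine-growth procedure of Slide 31 can be sketched as follows. This is a minimal sketch under stated assumptions: the dict-based parse format and the node-scoring function stand in for the talk's learned policy, and a real implementation would also render the chosen words in their original order and run the 5x extraction step first.

```python
# Sketch of vine-growth summarization: repeatedly move the best frontier
# node into the summary and expose its children.

def vine_growth(parses, score, word_limit):
    """Grow a summary over dependency trees until the word limit is reached.

    parses: list of trees, each {"word": str, "children": [subtrees]}
    score:  function(node) -> float, higher = more summary-worthy
    """
    frontier = list(parses)                  # start from each sentence's root
    summary = []
    while frontier and len(summary) < word_limit:
        best = max(frontier, key=score)      # choose a frontier node
        frontier.remove(best)
        summary.append(best["word"])         # move it into the summary
        frontier.extend(best["children"])    # expose its children
    return summary

# Toy parse of "The man ate a big sandwich" (root: "ate").
tree = {"word": "ate", "children": [
    {"word": "man", "children": [{"word": "The", "children": []}]},
    {"word": "sandwich", "children": [
        {"word": "a", "children": []},
        {"word": "big", "children": []}]}]}

# A toy scorer that prefers longer words, with a 4-word budget:
prefer_long = lambda n: len(n["word"])
print(vine_growth([tree], prefer_long, 4))  # -> ['ate', 'sandwich', 'man', 'big']
```

Because a node enters the frontier only after its parent is in the summary, every intermediate summary is a set of connected subtrees, which is what keeps the compressed output grammatical.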
Slide 34: Talk Outline (repeated; finally: Discussion and Future Work)

Slide 35: What Can Searn Do? §7 (Conclusions and Future Directions)
Solve structured prediction problems...
- ...efficiently.
- ...with theoretical guarantees.
- ...with weak assumptions on structure.
- ...and outperform the competition.
- ...with little extra code required.

Slide 36: Thanks! Questions?