Learning with Noise: Relation Extraction with Dynamic Transition Matrix
Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing Zhu, Songfang Huang, Rui Yan and Dongyan Zhao
2017/04/22

About Dataset Noise
- Noise is common in datasets: human annotators make erroneous annotations.
- Noise is especially significant in automatically constructed datasets.
- Relation extraction relies heavily on automatically constructed datasets.

Relation Extraction
- Find the relation between a target subject and object.
- "Melania Trump was born in Novo Mesto." --EXTRACT--> <Melania Trump, born-in, Novo Mesto> --POPULATE--> knowledge base

Distant Supervision
- Automatically constructs (noisy) training data by retrieving sentences from a corpus and aligning them with knowledge-base facts.
- Knowledge base: <Donald Trump, born-in, New York>
- Retrieved and aligned sentences:
  - "Donald Trump was born in New York."   (correctly labeled born-in)
  - "Donald Trump worked in New York."     (wrongly labeled born-in)

Two Paradigms
- Sentence level: every aligned sentence gets the relation label.
  - "Donald Trump was born in New York."  -> born-in (0, 0, 1, ..., 0)
  - "Donald Trump worked in New York."    -> born-in (0, 0, 1, ..., 0)  NOISY
- Bag level: label a bag of sentences under the at-least-one assumption; still suffers from false positives and false negatives.
  - { "Ivanka Trump flew to New York.", "Ivanka Trump lived in New York." } -> born-in (0, 0, 1, ..., 0)  ALSO NOISY

Model the Noise
- How to represent the noise? The true relation is i, but the instance is erroneously labeled as j.
- Transition matrix: T_ij = p(j | i) for i, j = 1, 2, ..., k.
- Predicted Relation Distribution (from the base RE model) x Transition Matrix = Observed Relation Distribution, which is trained to match the noisy label.

Model the Noise: Global Transition Matrix
- Models the general noise pattern shared by all instances.
- Randomly initialize a matrix T' and normalize each row with a softmax:
  T_ij = exp(T'_ij) / Σ_j' exp(T'_ij')

                    BI    PL    NA
  born-in (BI)      0.7   0.1   0.2
  place-lived (PL)  0.5   0.3   0.2
  NA                0.3   0.1   0.6

Model the Noise: Dynamic Transition Matrix
- Models the individual noise pattern of each instance.
- Generated according to the input instance, e.g.:

  "Donald Trump lives in New York."
                    BI    PL    NA
  born-in (BI)      0.6   0.2   0.2
  place-lived (PL)  0.2   0.6   0.2
  NA                0.1   0.2   0.7

  "Donald Trump lives in New York near his parents' old house."
                    BI    PL    NA
  born-in (BI)      0.7   0.2   0.1
  place-lived (PL)  0.5   0.4   0.1
  NA                0.2   0.2   0.6

Dynamic Transition Matrix: One Embedding per Instance
- An instance is a sentence or a sentence bag; its embedding x_n comes from the base RE model.
- A softmax classifier generates the transition matrix T one row at a time, so each row sums to 1 (a code sketch follows the embedding slides below):
  T_ij = exp(w_ij · x_n + b) / Σ_j' exp(w_ij' · x_n + b)

Dynamic Transition Matrix: One Instance Embedding per Relation
- Alternatively, use k instance embeddings x_n,i, one per relation (e.g., Lin et al., ACL 2016), with a softmax classifier for the corresponding row:
  T_ij = exp(w_j · x_n,i + b_i) / Σ_j' exp(w_j' · x_n,i + b_i)

Instance Embedding: Sentence Embedding
- Piecewise CNN (PCNN; Zeng et al., EMNLP 2015).

Instance Embedding: Bag Embedding
- Average aggregation: average the sentence embeddings x_1, x_2, x_3, ... into a single bag embedding x.
- Attention aggregation: attend to each sentence regarding each relation (Lin et al., ACL 2016), giving one bag embedding per relation.
- With attention, the weight of sentence j for relation i uses the relation embedding r_i (sketched below):
  α_ij = exp(r_i · x_j) / Σ_j' exp(r_i · x_j')
  and the bag embedding x is the α-weighted sum of the sentence embeddings.
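To make the attention aggregation concrete, here is a minimal PyTorch sketch assuming dot-product scores between relation and sentence embeddings; the function and variable names are illustrative, not from the authors' code:

```python
import torch
import torch.nn.functional as F


def bag_embedding_by_attention(sent_embs, rel_embs):
    """Aggregates sentence embeddings into one bag embedding per relation.

    sent_embs: (m, d) embeddings x_1..x_m of the m sentences in the bag
    rel_embs:  (k, d) embeddings r_1..r_k of the k relations
    returns:   (k, d) one bag embedding per relation
    """
    # alpha[i, j] = exp(r_i . x_j) / sum_j' exp(r_i . x_j')
    scores = rel_embs @ sent_embs.T        # (k, m)
    alpha = F.softmax(scores, dim=-1)      # attention over the bag's sentences
    return alpha @ sent_embs               # weighted sum per relation


# Usage: a bag of 3 sentences, 5 relations, 64-dim embeddings -> (5, 64).
bag = bag_embedding_by_attention(torch.randn(3, 64), torch.randn(5, 64))
```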
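Likewise, the row-wise softmax from the Dynamic Transition Matrix slides can be sketched as follows; this is a minimal rendering under my own naming assumptions (a per-entry bias instead of the paper's scalar b), not the authors' released implementation:

```python
import torch
import torch.nn.functional as F


class DynamicTransitionMatrix(torch.nn.Module):
    """Generates a per-instance k x k transition matrix T, where
    T[i, j] ~ p(observed label j | true label i); one softmax per row."""

    def __init__(self, embed_dim, num_relations):
        super().__init__()
        # One weight vector w_ij per matrix entry, shape (k, k, d).
        # The slide's formula uses a scalar bias b; per-entry bias is an assumption.
        self.w = torch.nn.Parameter(0.01 * torch.randn(num_relations, num_relations, embed_dim))
        self.b = torch.nn.Parameter(torch.zeros(num_relations, num_relations))

    def forward(self, x):
        # x: (batch, d) instance embeddings from the base RE model.
        # logits[n, i, j] = w_ij . x_n + b_ij
        logits = torch.einsum("ijd,nd->nij", self.w, x) + self.b
        # Softmax over j, so every row of T sums to 1.
        return F.softmax(logits, dim=-1)


# Usage: batch of 2 instances, 3 relations (BI, PL, NA), 64-dim embeddings.
tm = DynamicTransitionMatrix(embed_dim=64, num_relations=3)
T = tm(torch.randn(2, 64))                           # (2, 3, 3)
assert torch.allclose(T.sum(-1), torch.ones(2, 3))   # rows are distributions
```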
Training: Is the Transition Matrix Just a Hidden Layer?
- Predicted Relation Distribution (from the base RE model) x Transition Matrix = Observed Relation Distribution, trained to match the noisy label.
- Trained naively, nothing distinguishes the transition matrix from an ordinary hidden layer, so curriculum-based training is used to make it actually model the noise.

Curriculum Learning Based Training: Trace Regularization
- Each row of the transition matrix sums to 1, so no noise means an identity transition matrix, which has the largest possible trace.
- Impose the expected noise level by trace regularization: the larger trace(T), the less noisy the instance is assumed to be.

Curriculum Learning Based Training: No Prior Knowledge about Data Quality
- Acquire basic classification ability first, then model the noise:
  Loss = Σ_{i=1..N} [ (1 − α) loss_o + α loss_p − β trace(T_i) ]
  where loss_p (prediction loss) is on the predicted relation distribution and loss_o (observation loss) is on the observed distribution (prediction x transition matrix).
- Initialization: α = 1 and a large β, i.e., train the base predictor with the transition matrix pushed toward identity.
- Decrease α and β gradually so the transition matrix takes over the noise. (A code sketch appears at the end.)

Curriculum Learning Based Training: With Prior Knowledge about Data Quality
- Split the training data into subsets with different levels of reliability.
- Example, time RE (birth-date, publication-date, inception-date): sentences with fine-grained time expressions are reliable.
  - Knowledge base: <Alphabet, inception-date, October-2-2015>
  - "Alphabet was founded on October-2-2015."            (reliable)
  - "Alphabet's financial report of 2015 shows that..."  (unreliable)
- Train on reliable data first, and gradually add the unreliable subsets:
  Loss = Σ_{i=1..N} [ loss_o − β trace(T_i) ]
  with β set per subset: large positive for the reliable subset, small positive or negative for the unreliable ones.

Experiments: Time RE (Sentence Level)
- Legend: mix = no prior knowledge; PR = with prior knowledge (different subsets); TM = transition matrix.
- The TM consistently improves sentence-level models.

Experiments: Time RE (Bag Level)
- With both average aggregation and attention aggregation, the TM consistently improves bag-level models.

Experiments: Time RE (Global TM vs. Dynamic TM)
- GTM = global transition matrix; TM = dynamic transition matrix.
- The dynamic TM performs better than the global TM.

Experiments: Entity RE (Bag Level)
- On the dataset of Riedel et al. (2010), with average (avg) and attention (att) aggregation, the TM also works for entity RE.

Conclusion
- Modeling noise benefits RE results.
- Both the dynamic and the global transition matrix can model noise; the dynamic TM is better than the global TM.
- Curriculum learning can train the transition matrix, and can incorporate prior knowledge about data quality.

Q&A
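As a closing illustration, a minimal PyTorch sketch of the curriculum training objective above; the annealing schedule and all names are assumptions for illustration, not the authors' released code:

```python
import torch


def curriculum_loss(pred, T, labels, alpha, beta):
    """Loss = sum_i [(1 - alpha) * loss_o + alpha * loss_p - beta * trace(T_i)].

    pred:   (n, k) predicted relation distribution from the base RE model
    T:      (n, k, k) per-instance dynamic transition matrices
    labels: (n,) observed (noisy) relation labels
    alpha:  starts at 1 (pure prediction loss) and is decreased gradually
    beta:   trace regularizer; large positive values push T toward identity
    """
    # Observed distribution: the prediction passed through the noise model,
    # obs_j = sum_i pred_i * T_ij.
    obs = torch.bmm(pred.unsqueeze(1), T).squeeze(1)   # (n, k)
    idx = torch.arange(len(labels))
    eps = 1e-8
    loss_p = -torch.log(pred[idx, labels] + eps)       # prediction loss
    loss_o = -torch.log(obs[idx, labels] + eps)        # observation loss
    trace = T.diagonal(dim1=-2, dim2=-1).sum(-1)       # trace(T_i), shape (n,)
    return ((1 - alpha) * loss_o + alpha * loss_p - beta * trace).sum()
```

With prior knowledge about data quality, the same function can be applied with α = 0 and a per-subset β: large and positive for the reliable subset, small or negative for unreliable ones.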