
Learning with Noise: Relation Extraction
with Dynamic Transition Matrix
Bingfeng Luo, Yansong Feng, Zheng Wang, Zhanxing Zhu, Songfang
Huang, Rui Yan and Dongyan Zhao
2017/04/22
About Dataset Noise

Noise is common in datasets

Humans can make erroneous annotations

Noise is significant in automatically constructed datasets

Relation Extraction

Heavily relies on automatically constructed datasets
Relation Extraction

Find the relation between a target subject and object

Example: "Melania Trump was born in Novo Mesto." → EXTRACT → <Melania Trump, born-in, Novo Mesto> → POPULATE → Knowledge Base
Distant Supervision

Automatically construct noisy training data

Knowledge Base fact <Donald Trump, born-in, New York> → RETRIEVE & ALIGN with the Corpus →
  "Donald Trump was born in New York."
  "Donald Trump worked in New York."
Two Paradigms

Sentence Level

  Each sentence gets its own label
  "Donald Trump was born in New York."  → born-in (0, 0, 1, ..., 0)
  "Donald Trump worked in New York."    → born-in (0, 0, 1, ..., 0)  ← NOISY

Bag Level

  At-least-one assumption
  Still suffers from false positives and false negatives
  { "Ivanka Trump flew to New York.",
    "Ivanka Trump lived in New York." } → born-in (0, 0, 1, ..., 0)  ← ALSO NOISY
Model the Noise

How to represent the noise?
The true relation is i, but the instance is erroneously labeled as j

Transition Matrix: T_ij = p(j | i),  i, j = 1, 2, ..., k
Model the Noise

Transition Matrix

T_ij = p(j | i): the true relation is i, erroneously labeled as j

Predicted Relation Distribution (from the Base RE Model) × Transition Matrix = Observed Relation Distribution → match the noisy label
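Not on the slide, but as a minimal numeric sketch of this product (the values are illustrative, reusing the global-matrix example shown later):

```python
import numpy as np

# Predicted relation distribution from the base RE model over k = 3 relations (BI, PL, NA).
predicted = np.array([0.7, 0.2, 0.1])

# Transition matrix T with T[i, j] = p(observed label j | true label i); each row sums to 1.
T = np.array([[0.7, 0.1, 0.2],
              [0.5, 0.3, 0.2],
              [0.3, 0.1, 0.6]])

# Observed relation distribution: o_j = sum_i predicted_i * T[i, j];
# this is what gets trained to match the (noisy) distant-supervision label.
observed = predicted @ T
print(observed)
```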
Model the Noise

Global Transition Matrix

Model the general noise pattern

Randomly initialize a matrix T'

T_ij = exp(T'_ij) / Σ_j exp(T'_ij)

                  BI    PL    NA
born-in (BI)      0.7   0.1   0.2
place-lived (PL)  0.5   0.3   0.2
NA                0.3   0.1   0.6
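A minimal sketch of this row-wise softmax over a randomly initialized T' (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def global_transition_matrix(T_prime):
    """Row-wise softmax: T[i, j] = exp(T'[i, j]) / sum_j exp(T'[i, j])."""
    exp_T = np.exp(T_prime - T_prime.max(axis=1, keepdims=True))  # subtract row max for stability
    return exp_T / exp_T.sum(axis=1, keepdims=True)

# T' is a freely learnable parameter shared by all instances; here it is only randomly initialized.
rng = np.random.default_rng(0)
T = global_transition_matrix(rng.normal(size=(3, 3)))  # 3 relations: BI, PL, NA
print(T.sum(axis=1))                                    # every row sums to 1
```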
Model the Noise

Dynamic Transition Matrix

Model the individual noise pattern of each instance

Generated according to the input instance

"Donald Trump lives in New York."
                  BI    PL    NA
born-in (BI)      0.6   0.2   0.2
place-lived (PL)  0.2   0.6   0.2
NA                0.1   0.2   0.7

"Donald Trump lives in New York near his parents' old house."
                  BI    PL    NA
born-in (BI)      0.7   0.2   0.1
place-lived (PL)  0.5   0.4   0.1
NA                0.2   0.2   0.6
Dynamic Transition Matrix

One Embedding per Instance

Instance: sentence or sentence bag

Instance embedding from base RE model

Softmax classifier to generate each row of the transition matrix T

One row at a time

Each row sums to 1
Instance Embedding x_n → softmax for each row → Transition Matrix T

T_ij = exp(w_ij^T x_n + b) / Σ_j exp(w_ij^T x_n + b)
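A minimal PyTorch-style sketch of generating the per-instance transition matrix with one softmax per row (class and parameter names are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTransitionMatrix(nn.Module):
    """Generate a k x k transition matrix T from a single instance embedding x_n."""
    def __init__(self, embed_dim, num_relations):
        super().__init__()
        # One weight vector w_ij per matrix entry, plus a shared bias b.
        self.w = nn.Parameter(torch.randn(num_relations, num_relations, embed_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x_n):
        # logits[i, j] = w_ij^T x_n + b
        logits = torch.einsum('ijd,d->ij', self.w, x_n) + self.b
        # Softmax over j: each row of T is a probability distribution.
        return F.softmax(logits, dim=1)

# Usage with an illustrative 100-d instance embedding and 3 relations.
tm = DynamicTransitionMatrix(embed_dim=100, num_relations=3)
T = tm(torch.randn(100))
print(T.sum(dim=1))  # each row sums to 1
```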
Dynamic Transition Matrix

One Instance Embedding per Relation

R instance embeddings for the R relations (e.g., Lin et al., ACL 2016)

Softmax classifier for the corresponding row

Instance Embeddings regarding each Relation x_{n,i} → softmax for each row → Transition Matrix T

T_ij = exp(w_j^T x_{n,i} + b_i) / Σ_j exp(w_j^T x_{n,i} + b_i)
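Under the same assumptions, a sketch of the per-relation variant, where row i is produced from its own embedding x_{n,i}:

```python
import torch
import torch.nn.functional as F

def transition_matrix_per_relation(x, W, b):
    """x: (k, d) per-relation instance embeddings; W: (d, k) shared weights w_j; b: (k,) per-row biases.

    Row i of T is softmax_j(w_j^T x[i] + b[i])."""
    logits = x @ W + b.unsqueeze(1)  # (k, k): logits[i, j] = w_j^T x_i + b_i
    return F.softmax(logits, dim=1)

k, d = 3, 100
T = transition_matrix_per_relation(torch.randn(k, d), torch.randn(d, k), torch.zeros(k))
print(T.shape, T.sum(dim=1))
```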
Instance Embedding

Sentence Embedding

Piecewise CNN (PCNN, Zeng et al., EMNLP, 2015)
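The slide only names PCNN; purely as a rough, assumed illustration of its piecewise max pooling (positions and dimensions are made up):

```python
import torch

def piecewise_max_pool(conv_features, e1_pos, e2_pos):
    """conv_features: (seq_len, num_filters) convolution output for one sentence.
    The sentence is cut into three segments at the two entity positions and each
    segment is max-pooled separately, following the idea of Zeng et al. (2015)."""
    p1, p2 = sorted((e1_pos, e2_pos))
    segments = [conv_features[:p1 + 1], conv_features[p1 + 1:p2 + 1], conv_features[p2 + 1:]]
    pooled = [seg.max(dim=0).values for seg in segments if seg.numel() > 0]
    return torch.cat(pooled)  # sentence embedding of up to 3 * num_filters dimensions

emb = piecewise_max_pool(torch.randn(12, 230), e1_pos=2, e2_pos=8)
print(emb.shape)
```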
Instance Embedding

Bag Embedding

Average the embeddings of the sentences in the bag

Attention over sentences regarding each relation (Lin et al., ACL 2016)

One bag embedding per relation

Sentence Embeddings x_1, x_2, x_3 → Aggregation → Bag Embedding x
Instance Embedding

Bag Embedding

Attention over sentences regarding each relation

Embedding of relation i: r_i; attention weight of sentence j for relation i: α_ij

α_ij = exp(r_i^T x_j) / Σ_j exp(r_i^T x_j)

Sentence Embeddings x_1, x_2, x_3 → Attention Aggregation (weights α_1, α_2, α_3) → Bag Embedding x
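A minimal sketch of this selective attention aggregation (dimensions are illustrative); average aggregation would simply be sentence_embs.mean(dim=0):

```python
import torch
import torch.nn.functional as F

def bag_embeddings(sentence_embs, relation_embs):
    """sentence_embs: (m, d) embeddings of the m sentences in a bag.
    relation_embs: (k, d) one query embedding r_i per relation.
    Returns (k, d): one attention-weighted bag embedding per relation."""
    scores = relation_embs @ sentence_embs.T  # (k, m): r_i^T x_j
    alpha = F.softmax(scores, dim=1)          # alpha_ij, normalized over the sentences j
    return alpha @ sentence_embs              # weighted sum of sentence embeddings

bags = bag_embeddings(torch.randn(4, 100), torch.randn(3, 100))
print(bags.shape)  # (3, 100): one bag embedding per relation
```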
Training

Is the TM just a hidden layer? There is no direct supervision to guide it

Predicted Relation Distribution (from the Base RE Model) × Transition Matrix = Observed Relation Distribution → match the noisy label
Curriculum Learning Based Training

Trace of Transition Matrix

Each row of the transition matrix sums to 1

No Noise → Identity Transition Matrix → Largest Trace

Impose the expected noise level through trace regularization: the larger trace(T), the closer T is to the identity matrix
Curriculum Learning Based Training

No Prior Knowledge about Data Quality

Acquire basic classification ability first, then model the noise

Loss = Σ_{i=1}^{N} [ (1 − α) loss_o + α loss_p − β trace(T_i) ]

  loss_p: Prediction Loss (on the predicted relation distribution)
  loss_o: Observation Loss (on the observed relation distribution)

Initialization: α = 1, big β

Decrease α and β gradually

Predicted Relation Distribution × Transition Matrix = Observed Relation Distribution
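A minimal sketch of this curriculum loss, assuming cross-entropy for both branches (names and the use of batch means are illustrative choices, not the authors' code):

```python
import torch
import torch.nn.functional as F

def curriculum_loss(pred_dist, T, noisy_labels, alpha, beta):
    """pred_dist: (N, k) predicted relation distributions from the base RE model.
    T: (N, k, k) per-instance transition matrices (rows of each T[i] sum to 1).
    noisy_labels: (N,) labels produced by distant supervision."""
    observed = torch.bmm(pred_dist.unsqueeze(1), T).squeeze(1)          # (N, k): predicted x T
    loss_p = F.nll_loss(torch.log(pred_dist + 1e-12), noisy_labels)     # prediction loss
    loss_o = F.nll_loss(torch.log(observed + 1e-12), noisy_labels)      # observation loss
    trace = T.diagonal(dim1=1, dim2=2).sum(dim=1)                       # trace(T_i) per instance
    return (1 - alpha) * loss_o + alpha * loss_p - beta * trace.mean()

# Start with alpha = 1 (build basic classification ability) and a big beta
# (keep T close to the identity), then decrease both gradually.
```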
Curriculum Learning Based Training

With Prior Knowledge about Data Quality

Subsets with different levels of reliability

Time RE: birth-date, publication-date, inception-date

Fine-grained Time Expression → Reliable Data

Knowledge Base fact <Alphabet, inception-date, October-2-2015> → RETRIEVE & ALIGN with the Corpus →
  "Alphabet was founded on October-2-2015." (fine-grained time expression → reliable)
  "Alphabet's financial report of 2015 shows that..." (coarse time expression → less reliable)
Curriculum Learning Based Training

With Prior Knowledge about Data Quality

Reliable data first, gradually add unreliable ones

Loss = Σ_{i=1}^{N} [ loss_o − β trace(T_i) ]

  loss_o: Observation Loss (on the observed relation distribution)

Reliable Subset → Large Positive β

Unreliable Subset → Negative β or Small Positive β

Predicted Relation Distribution × Transition Matrix = Observed Relation Distribution
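Reusing the hypothetical curriculum_loss sketch above with alpha fixed to 0 (so only the observation loss and the trace term remain), the per-subset prior might look roughly like this; the subset names and beta values are illustrative:

```python
import torch

# Hypothetical trace-regularization strength per reliability subset.
beta_by_subset = {"reliable": 1.0, "less_reliable": 0.1, "unreliable": -0.1}

N, k = 8, 3
for subset, beta in beta_by_subset.items():
    # Dummy tensors standing in for the instances of this subset.
    pred_dist = torch.softmax(torch.randn(N, k), dim=1)
    T = torch.softmax(torch.randn(N, k, k), dim=2)      # rows of each T_i sum to 1
    labels = torch.randint(0, k, (N,))
    loss = curriculum_loss(pred_dist, T, labels, alpha=0.0, beta=beta)
    print(subset, float(loss))
```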
Experiments

Time RE (Sentence Level)

mix: no prior knowledge

PR: with prior knowledge (different subsets)

TM: transition matrix
TM Consistently Improves Sentence-Level Models
Experiments

Time RE (Bag Level)

Average Aggregation

mix: no prior knowledge

PR: with prior knowledge (different subsets)

TM: transition matrix
TM Consistently Improves Bag-Level Models
Experiments

Time RE (Bag Level)

Attention Aggregation

mix: no prior knowledge

PR: with prior knowledge (different subsets)

TM: transition matrix
TM Consistently Improves Bag-Level Models
Experiments

Time RE (Global TM vs. Dynamic TM)

GTM: Global Transition Matrix

TM: Dynamic Transition Matrix
Dynamic TM Is Better Than Global TM
Experiments

Entity RE (Bag Level)

Dataset from Riedel et al., 2010

avg: Average Aggregation

att: Attention Aggregation

TM: Transition Matrix
TM Also Works in Entity RE
Conclusion

Modeling the noise benefits RE results

Dynamic/Global transition matrices can model the noise

Dynamic TM is better than Global TM

Curriculum Learning can train the transition matrix

Curriculum Learning can incorporate prior knowledge about
data quality
Q&A