A Hybrid Approach to Inferring a Consistent Temporal Relation
Set in Natural Language Text
A Dissertation
submitted to the Faculty of the
Graduate School of Arts and Sciences
of Georgetown University
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
in Linguistics
By
Chong Min Lee, M.S.
Washington, DC
July 31, 2013
Copyright © 2013 by Chong Min Lee
All Rights Reserved
A Hybrid Approach to Inferring a Consistent Temporal
Relation Set in Natural Language Text
Chong Min Lee, M.S.
Dissertation Advisor: Graham E. Katz
Abstract
This dissertation investigates the temporal relation identification task. The goal is
to construct consistent temporal relations between temporal entities (e.g., events
and time expressions) in a narrative. Constructing consistent temporal relations is
challenging because the number of candidate relation sets grows exponentially with
the number of pairs of temporal entities. When we use transitive constraints to
construct consistent temporal relations, a performance improvement can be expected
because the application of transitive constraints reduces the number of possible
relation candidates in a narrative.
The primary objective of this study is to develop a temporal relation identification (TRI) system that is composed of three modules: 1) a module that classifies
the temporal relation of a pair of temporal entities, 2) a module that extracts
conflicting classified relations using transitive constraints, and 3) a module that
restores consistent temporal relations from the conflicting relations using transitive constraints. In developing a TRI system, this dissertation examines whether
the application of transitive constraints to such a system can lead to performance
improvement.
The first step in developing the system was to implement a rudimentary temporal
relation classification module. The module labels a pair of temporal entities with
a temporal relation among eleven possible temporal relations. Next, a method for
extracting conflicting relations among classified relations is proposed. The extraction
method is based on heuristics because extracting all conflicting relations is NP-hard.
Finally, two heuristic methods are proposed that restore a consistent temporal
structure from the conflicting relations using transitive constraints. The performance
of the developed system is tested using the TimeBank and AQUAINT temporal corpora.
The results of this work indicate that a performance improvement through the
application of transitive constraints to the TRI task is not guaranteed. Furthermore,
this study empirically shows the limits of the performance improvement achievable
through the application of transitive constraints and identifies the bottlenecks in the TRI task.
Index words:
computational linguistics, natural language processing,
temporal information processing, temporal relation extraction,
temporal relation classification, temporal relation identification
Dedication
In memory of my mother
성숙정
May she rest peacefully in heaven.
Acknowledgments
There are a lot of people who have contributed to this dissertation in many different ways. I am indebted to all of them and would like to take this opportunity to
acknowledge their contributions and express my gratitude to them.
I would like to thank my advisor, Graham Katz. He deserves special credit for his
continuous encouragement and support during my dissertation writing. His comments, criticisms, and suggestions on my research helped me deepen my thoughts
and improve my work. I would like to acknowledge two other committee members,
Paul Portner and Mark Maloof, who also patiently guided me through the dissertation process. They provided insights which challenged my thinking and substantially improved the finished product. Paul’s careful, thoughtful reading of the draft
directed me to be more precise concerning my claims. His comments and guidance
have been invaluable in sharpening the content of my work. I am grateful to Mark
Maloof for his input on machine learning. His machine learning class was a great opportunity for me to gain insights into the field. He also provided helpful suggestions and comments on the machine learning experiments in my dissertation.
I would like to thank Aoife Cahill, who works with me at ETS. I consider myself
extremely fortunate to have had the chance to work with her. She read numerous
versions of this dissertation and pushed me to come up with something better each time.
Without her help, this dissertation would not have been finalized.
I would like to thank everyone in the Linguistics Department for their help from the
day I arrived at Georgetown University until my graduation. The best and worst
moments of my dissertation journey have been shared with them.
I would like to extend my deepest gratitude to my family. They all have been
extremely patient with my longish student life. I owe a great deal to each of them.
My deepest appreciation goes to my parents for allowing me to study in the United
States and giving me constant support from Korea. I regret that I never had the
chance to show my doctoral degree to my mother. My heartfelt thanks go
to my wife, Jisun Moon. She has been there with me the entire way, and made this
long journey so much easier. Her love, support and devotion are more than I could
ever ask, and without her, this would not have been possible. Finally, I am especially grateful to our children, Nathan, Juno, and Irene, for giving me a break
from my brain work every so often, so that I had the opportunity to experience aspects
of life beyond science.
Thank you, Father. Without your help, I could not have finished this dissertation. I am truly glad that I am able to show you my doctoral degree.
Table of Contents

Chapter
1 Introduction . . . 1
2 Background and Related Works . . . 9
    2.1 TimeML and Temporal Corpora . . . 9
    2.2 Previous Studies . . . 16
    2.3 Temporal Constraint Propagation Algorithm . . . 26
3 Temporal Relation Classification between Temporal Entities . . . 31
    3.1 Introduction . . . 31
    3.2 Temporal Relation Classification: Interval Relations . . . 33
    3.3 Reconsidering the Use of Merged Relations . . . 53
    3.4 Temporal Relation Classification Using Boundary Relations . . . 64
4 Detection of Inconsistent Temporal Relations . . . 75
    4.1 Introduction . . . 75
    4.2 NP-hardness of Extracting Inconsistent Sets . . . 80
    4.3 Violation Extraction . . . 86
    4.4 Evaluating the Extraction Method with Artificial Data . . . 97
    4.5 Graph Analysis on Inconsistent Relation Extraction . . . 107
    4.6 Extracting Violations from Classified Relations . . . 123
    4.7 Conclusions . . . 126
5 Restoration of Consistent Temporal Relations with Reasoning . . . 128
    5.1 Introduction . . . 128
    5.2 Consistency Restoration Heuristics . . . 133
    5.3 Identifying and Correcting Errors in TRI . . . 144
    5.4 Addition of Time-Time Information . . . 147
    5.5 Conclusions . . . 158
6 Conclusions . . . 159
    6.1 Future Research . . . 161

Appendix
A Conflicting Annotated Temporal Relations . . . 164
    A.1 TimeBank 1.2 . . . 164
    A.2 ACQUAINT TimeML Corpus . . . 174

Bibliography . . . 183
List of Figures

2.1 Initial and propagated graphs . . . 29
2.2 Inconsistent example . . . 30
3.1 Example of parsed tree . . . 41
3.2 Syntactic tree of a relative clause . . . 43
3.3 Syntactic tree . . . 59
3.4 Conversion of an interval link to four boundary links . . . 66
4.1 Classified relations with an error . . . 76
4.2 Classified relations with an error . . . 81
4.3 Derivation example for Allen's algorithm . . . 88
4.4 Graph with a conflict . . . 91
4.5 Classified relations with one misclassified relation . . . 93
4.6 Graph with two unrelated conflicts . . . 94
4.7 Graph with exclusively extractable conflicts . . . 96
4.8 Extracted violation with an extra link . . . 96
4.9 A graph in misclassified relations unextractable using transitive constraints . . . 98
4.10 Graph with a misclassified relation . . . 101
4.11 Example of line form . . . 107
4.12 Undirected version of a temporal link graph . . . 109
4.13 Complete graph . . . 110
4.14 Three different graph types . . . 112
4.15 Extractable and unextractable errors . . . 115
4.16 TLINK patterns with three links and one unextractable error . . . 121
4.17 TLINK patterns with four links and one unextractable error . . . 122
4.18 TLINK pattern with five links and one unextractable error . . . 123
5.1 Addition of time-time and time-dct links . . . 131
5.2 Illustrative example of error identification task . . . 134
5.3 Grounding relations . . . 136
5.4 Graph after the link additions using score heuristics . . . 136
5.5 Final consistent graph using Score Heuristic . . . 137
5.6 Graph after running TCPA on grounding relations . . . 138
5.7 Graph after adding a link to grounding relations . . . 139
5.8 Graph after the addition of the link from c to b . . . 139
5.9 Graph with more links and an error . . . 141
5.10 Relations without a violation after the violation extraction . . . 143
5.11 Relations after adding links . . . 143
5.12 Extracted conflicting relations . . . 148
5.13 Violation with only correctly classified relations . . . 153
5.14 An extracted violation after adding time-time links . . . 157
6.1 A graph that can be split into two subgraphs . . . 162
A.1 AP900816-0139.tml . . . 164
A.2 APW19980227.0468.tml . . . 165
A.3 CNN19980126.1600.1104.tml . . . 165
A.4 CNN19980227.2130.0067.tml . . . 166
A.5 NYT19980206.0460.tml . . . 166
A.6 NYT19980402.0453.tml . . . 167
A.7 PRI19980303.2000.2550.tml . . . 167
A.8 wsj_0032.tml . . . 168
A.9 wsj_0160.tml . . . 168
A.10 wsj_0169.tml . . . 169
A.11 wsj_0505.tml . . . 169
A.12 wsj_0542.tml . . . 170
A.13 wsj_0660.tml . . . 170
A.14 wsj_0675.tml . . . 170
A.15 wsj_0762.tml . . . 171
A.16 wsj_0778.tml . . . 171
A.17 wsj_0786.tml . . . 172
A.18 wsj_0816.tml . . . 172
A.19 wsj_0927.tml . . . 172
A.20 wsj_1011.tml . . . 173
A.21 XIE19980808.0031.tml . . . 174
A.22 NYT20000601.0442.tml . . . 174
A.23 NYT20000424.0319.tml . . . 175
A.24 NYT20000414.0296.tml . . . 175
A.25 NYT20000329.0359.tml . . . 176
A.26 NYT20000224.0173.tml . . . 176
A.27 NYT20000113.0267.tml . . . 177
A.28 NYT20000106.0007.tml . . . 177
A.29 NYT20000105.0325.tml . . . 177
A.30 NYT19990419.0515.tml . . . 178
A.31 APW20000417.0031.tml . . . 178
A.32 APW20000403.0057.tml . . . 178
A.33 APW20000401.0150.tml . . . 179
A.34 APW20000210.0328.tml . . . 179
A.35 APW20000128.0316.tml . . . 179
A.36 APW20000115.0031.tml . . . 180
A.37 APW20000107.0318.tml . . . 180
A.38 APW199980817.1193.tml . . . 181
A.39 APW19991008.0151.tml . . . 181
A.40 APW19990506.0155.tml . . . 181
A.41 APW19980818.0515.tml . . . 182
A.42 APW19980811.0474.tml . . . 182
List of Tables

2.1 Distributions of temporal relations in TimeBank and AQUAINT . . . 16
2.2 Descriptions of TempEval tasks . . . 20
2.3 Target links and target relations of previous studies . . . 21
2.4 Differences between previous TRI studies . . . 24
2.5 Thirteen relations of Allen (1983) . . . 27
2.6 Mapping between TimeML relations and Allen (1983) relations . . . 27
3.1 Comparison of features . . . 37
3.2 Features per TLINK type . . . 45
3.3 Distribution of TimeML relations after modifying temporal links . . . 47
3.4 Results of event-dct . . . 50
3.5 Results of event-event . . . 51
3.6 Results of event-time . . . 52
3.7 Accuracies of four compared systems . . . 60
3.8 A temporal link decomposition table . . . 66
3.9 Distribution of boundary relations . . . 67
3.10 Results of boundary relation classification for event-dct . . . 70
3.11 Results of boundary relation classification for event-event . . . 71
3.12 Results of boundary relation classification for event-time . . . 72
3.13 Results of temporal relation classification using the boundary relation method . . . 73
4.1 Distribution of documents per the number of misclassified relations . . . 101
4.2 Measures per error ratio range . . . 103
4.3 Statistics of violation extraction . . . 104
4.4 Global measures vs local measures . . . 110
4.5 Overall values of global measures . . . 118
4.6 Comparison of local measures for extracted and unextracted errors . . . 120
4.7 Performance of violation extraction . . . 126
5.1 Influence of additional links in previous studies . . . 132
5.2 Difference between Score Heuristic and Entailment Heuristic . . . 141
5.3 Performance of two restoration methods in misclassified link identification . . . 145
5.4 Performance of two restoration methods in target relation restoration . . . 145
5.5 Statistics of time-time links . . . 152
5.6 Violation extraction results after adding time-time link . . . 154
5.7 Results of restoration methods after adding time-time links . . . 156
5.8 Performance of two restoration methods in misclassified link identification after adding time-time links . . . 157
Chapter 1
Introduction
Recently, there has been increased interest in the area of temporal information
processing (e.g., Barzilay et al., 2002; Boguraev and Ando, 2005; Zhou et al.,
2005; Bramsen et al., 2006; Verhagen et al., 2007, 2010). This is because of greater
demands on computer systems to be able to fully understand narrative text. When
computer systems can infer the temporal flow of events, they can answer questions
based on temporal information such as “when does an event happen?” or “which
event occurred first?” Moreover, the systems can make decisions based on the
inferred information. Because of the advantages of temporal information processing,
it has been applied to a large number of diverse research areas such as medical
systems (Zhou et al., 2005), legal document systems (Schilder and McCulloh, 2005),
and summarization systems (Barzilay et al., 2002). When the automatically constructed
temporal information contains conflicting information, however, it is not possible to
generate the inferred information. In this thesis, I will design a system that resolves
conflicting information when it occurs in automatically generated temporal
information.
Temporal Information Processing (TIP) is the automatic generation of the temporal
structure of a narrative. TIP includes various tasks such as identifying temporal
entities that are events or temporal expressions, calculating the time period of a
temporal entity, choosing pairs of temporal entities to be temporally ordered, classifying a temporal relation of a pair of temporal entities, and so on.
In the following example, “killed”, “wounded”, “last week”, and “today” are two
events and two temporal expressions, respectively.
(1) Last week, Hezbollah fighters killed three Israeli soldiers. In addition, the
fighters wounded 73 soldiers en route to Lebanon today.
It is possible to define a relative temporal order between a pair of temporal entities. For example, the relative temporal order between the two events “killed” and “wounded”
can be defined as “killed” occurs before “wounded”. In this thesis, relative temporal
orders that do not pinpoint specific time information are defined as “qualitative”
temporal information. I will not go into detail on “quantitative” temporal information, which pinpoints the exact time information of an event (e.g., the information
that “killed” happened between 03/01/2012 and 03/02/2012). Identifying these
kinds of relations poses a different set of challenges, which I leave for future work.
Various features are relevant when making a decision about the qualitative temporal relation between a pair of temporal entities. Early studies on automatic qualitative information extraction (Webber, 1988; Kamp and Reyle, 1993; Lascarides
and Asher, 1993; Hitzeman et al., 1995) showed that linguistic features such as
temporal adverbials, tense, aspect, rhetorical relations, pragmatic conventions, and
background knowledge are relevant in order to infer a temporal relation.
The example in (2) shows that the various linguistic features associated with temporal entities are not independent. Rather, they are tightly intertwined when constructing a temporal structure that shows qualitative temporal orderings of temporal entities in a document.
(2) Juno teased Nathan. Suddenly, Juno fell. Nathan pushed Juno.
I will notate temporal relations as directed arrows labeled with the relation type.
This makes explicit the order of the arguments. In (2), teased happens before fell
(teased ==before==> fell in my representation) and fell happens after pushed
(fell ==after==> pushed). There are a number of factors that go into determining
this ordering. Among these, the narrative convention that events mentioned first
occurred first can play a role in determining the temporal order of events. For
example, the before relation between teased and fell is based on the narrative
convention. Other factors, such as the natural causal relations holding between events,
have also been claimed to play a role. The causal relation between the events can derive
the after relation between fell and pushed.
Sometimes these factors conflict with each other, making it difficult to determine
which feature should have priority in deciding a temporal relation when only a pair of
temporal entities is given. When we have only the two clauses, “Juno fell. Nathan
pushed Juno.”, and we are given the task of inferring the qualitative relation
between the two clauses, the narrative relation between the two clauses conflicts
with the natural causal relation. Usually falling happens as a result of pushing,
and so, naively, we might conclude that the push event happened before the fall
event. The narrative convention, however, makes it possible to conclude that the
fall event happened before the push event. This conflict makes it difficult to infer a
relative relation. We can say that pushed happens before fell based on the preference
for causality between events, or we can say that fell happens before pushed based
on the preference for the narrative convention. When factors influencing ordering
conflict with each other, we need a way of resolving this conflict. The resolution will
be based on a set of preferences.
A probabilistic approach to resolving ordering conflicts is an appealing one. The
preferences to be applied by an automatic system are based on observations of data
for which the “correct” solution is known. Previous studies (e.g., Mani et al., 2006;
Chambers et al., 2007; Tatu and Srikanth, 2008) have also adopted a probabilistic
approach to the task of classifying the temporal relation between a pair of temporal
entities. Usually, they classified the relations between two temporal entities using
standard machine learning techniques based on the features of the entities to be
temporally related. For example, Chambers et al. (2007) classified temporal relations between events using features such as the part-of-speech labels of the events
and the tenses of the events.
Temporal relations that hold among a set of entities are logically related to one
another: the relations are dependent on each other and must not violate this logical
relatedness (i.e., they must be consistent). In (2), the qualitative temporal orders of tease, fall, and push are
dependent and consistent. If we do not care about the dependence among temporal
relations within a set of entities when we classify temporal relations between the
pairs of temporal entities in that set, the classified relations could conflict with each
other. But, if we require temporal consistency, and the set of temporal relations
inferred for a set of temporal entities is found to be inconsistent, it is reasonable to
assume that this inconsistency indicates that at least one of the relations inferred is
incorrect.
Let’s suppose that a system inferred the relations among the events described in (2)
and classified the following three temporal relations: teased ==after==> fell,
fell ==after==> pushed, and teased ==before==> pushed. These relations conflict
with one another because they form a circular relationship. We can now use the fact
that there is a conflict to conclude that at least one of these relations must be
incorrectly classified. In this thesis, I will develop a system that identifies this kind
of conflict and resolves the identified conflict.
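To make this kind of conflict concrete, the following Python sketch (an illustration only, not the system developed in later chapters) normalizes classified before/after relations into a directed precedence graph and searches for a cycle; any cycle indicates that at least one relation was misclassified. The event names are the ones from example (2).

# A minimal sketch: detect a transitivity conflict among classified
# before/after relations by looking for a cycle in the induced "before" graph.
from collections import defaultdict

def find_before_cycle(relations):
    """relations: iterable of (source, relation, target) triples."""
    graph = defaultdict(set)
    for src, rel, tgt in relations:
        if rel == "before":
            graph[src].add(tgt)          # src precedes tgt
        elif rel == "after":
            graph[tgt].add(src)          # normalize: tgt precedes src
    visited, on_stack = set(), set()

    def dfs(node, path):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt in on_stack:          # back edge -> circular ordering
                return path + [nxt]
            if nxt not in visited:
                cycle = dfs(nxt, path + [nxt])
                if cycle:
                    return cycle
        on_stack.discard(node)
        return None

    for node in list(graph):
        if node not in visited:
            cycle = dfs(node, [node])
            if cycle:
                return cycle
    return None

# The three classified relations from the example above.
classified = [("teased", "after", "fell"),
              ("fell", "after", "pushed"),
              ("teased", "before", "pushed")]
print(find_before_cycle(classified))   # ['fell', 'teased', 'pushed', 'fell']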
I focus on a Temporal Relation Identification (TRI) task where the goal is to identify the qualitative temporal structure among temporal entities in a narrative.
Therefore, the main focus of this study is how to determine qualitative temporal
structure for a narrative text, reflecting the natural temporal relations among the
temporal entities referred to in the text. The primary objective of this study is to
develop a temporal relation identification system that constructs consistent temporal relations in a narrative. A TRI system includes a number of tasks: identifying
temporal entities in a document, defining temporal links between pairs of temporal
entities, classifying the temporal relation of each temporal link, detecting the existence of a conflict among classified relations, extracting conflicting temporal relations, reconstructing a temporal structure without a conflict, and so on. I assume in
this thesis that temporal links between given temporal entities are already available.
The focus of this thesis is the following set of subtasks of a TRI system:
Task 1 implementation of a rudimentary temporal relation classification system
that uses a machine learning classifier
Task 2 extraction of conflicting relations among classified relations using transitive
constraints such that, if i before k and k before j, then i before j
Task 3 reconstruction of temporal structure without conflicting relations, using
transitive constraints
For the second task, I devise an algorithm to extract conflicting relations among
classified relations. For the third task, I propose two heuristics that reconstruct a
consistent temporal structure from extracted relations. The extraction and reconstruction methods are based on transitive constraints. I carry out a thorough evaluation of my system and show how effective the two heuristics are for the TRI task.
In contrast to most prior studies, which employed only a subset of the annotated
relations in the temporal corpora on which they were based, I use all of the qualitative
temporal relations in the corpora when identifying relations without conflicts. The TRI
task has been considered difficult both because of the need for contextual information
such as causality and because of the interrelatedness of temporal relations. The task
is also difficult for human annotators. Verhagen
et al. (2007) reported inter-annotator agreement, presented as the average of precision and recall, at 0.77 for annotations of temporal relations.1 The low agreement
level indicates annotators’ difficulty in selecting one discrete relation between a
given pair of temporal entities and the difficulty of the TRI task. In order to make
the task simpler, previous studies (Mani et al., 2006; Chambers et al., 2007; Chambers and Jurafsky, 2008; Tatu and Srikanth, 2008; Yoshikawa et al., 2009) adopted
strategies such as merging identical temporal relations and using simplified relations.
In this thesis, I show that the performance of previous studies (Chambers and
Jurafsky, 2008; Tatu and Srikanth, 2008; Yoshikawa et al., 2009; UzZaman and Allen,
2010) cannot be achieved when their methods are applied to real-world applications,
due to flaws in their experimental designs (Chambers and Jurafsky, 2008; Tatu and
Srikanth, 2008) and the incompleteness of the simplified relations they employ
(Yoshikawa et al., 2009; UzZaman and Allen, 2010).
In order to show the current status of the TRI task, I explore the problems of
classifying temporal relations in text, identify the limited scope of improvement
available through incorporating transitive constraints, select links with misclassified
relations, and recover the target relations of the selected links. If we apply transitive
constraints to the TRI task, we can significantly reduce the possible sets of relations
for given links. Given three transitive links such as i ==?==> k, k ==?==> j, and
i ==?==> j, transitive constraints allow only consistent sets of relations for those
links, such as i ==before==> k, k ==before==> j, and i ==before==> j. The reduced
sets of consistent relations can lead to performance improvement because of the
reduced search space. The exploration in this thesis shows that the actual performance
of this task is lower than the performance reported in previous studies and that the
application of transitive constraints does not guarantee performance improvements.
1 Compare to the agreement ratio of over 0.9 for a part-of-speech annotation task (Marcus et al., 1993).
I organize this thesis into six chapters.
Chapter 2 provides the necessary background for this work. I first describe the annotation
language that was used in annotating the temporal corpora; the description
explains what temporal entities and temporal relations are. After the description
of the language, I describe the temporal corpora that were used in this thesis and
the characteristics of the data. I then outline previous studies on the
temporal relation identification task. I also explain, in some detail, the Temporal
Constraint Propagation Algorithm that I later extend and modify and which forms
the basis of my methodology. The description includes the advantages and limitations of the algorithm.
Chapter 3 begins with temporal relation classification of all available qualitative
temporal relations. This chapter describes my classification system. In addition, I
argue for treating all qualitative temporal relations as target relations while discussing the importance of the direction of links in constructing a classification
system. As supporting evidence, I show how an approach using normalized relations
misled previous studies. I examine an alternative method for classifying temporal
relations using boundaries of temporal entities in the classification.
Chapter 4 begins by designing an algorithm that extracts conflicting relations. I
will show the maximally achievable performances with transitive constraints in
an experiment using the designed algorithm and artificially generated misclassified relations. I also apply the algorithm to the classification results from Chapter
3 in order to extract candidates for misclassified relations to be reconstructed in
Chapter 5.
Chapter 5 presents the design and experiments for two restoration methods. The
methods utilize a score-based heuristic and a constraint-based heuristic, respectively. I outline the algorithms of the two methods and present experiments and
their results. As an additional experiment, I examine whether an increase in the
number of temporal links influences the performance of a temporal relation identification system. In the experiment, I generate additional temporal links and add them to
the classified data from Chapter 3. After adding the additional links, I extract all
error candidates and restore them. Finally, I report the results of the experiment.
Chapter 6 concludes the thesis by summarizing the basic findings, and outlining the
possibilities for future research.
Chapter 2
Background and Related Works
2.1 TimeML and Temporal Corpora
In order to build a system for identifying temporal relations, we need a notation
to represent the temporal relations among the events and times described in a document in an explicit way. Previous studies (Kamp and Reyle, 1993; Hwang and
Schubert, 1992) proposed ways of representing temporal information using logical
representation languages such as Discourse Representation Theory and Tense Tree.
For the work presented here, I will use the TimeML (Time Markup Language) notation.
This thesis uses two widely available corpora: TimeBank version 1.2 (Pustejovsky
et al., 2003b), which consists of 183 news articles, and the AQUAINT TimeML
Corpus (www.timeml.org), which contains 73 news articles. Both corpora are annotated
in TimeML (Pustejovsky et al., 2003a), a language designed to represent and annotate
temporal information in narrative text. The language specifies what temporal entities
(e.g., events and temporal expressions) are, which tokens are marked as temporal
entities, and how to represent temporal relations between a pair of temporal entities.
This section provides some detail about TimeML and the temporal corpora to
contextualize this study. For more complete accounts, readers are referred to Pustejovsky et al. (2003a) and Pustejovsky et al. (2003b). Although this study uses the
corpora in TimeML, the proposed methods in this thesis that reconstruct consistent
temporal structure are applicable to other resources as long as transitive temporal
constraints can also be applied to the resources.
TimeML defines three tags for temporal entities: EVENT, MAKEINSTANCE, and
TIMEX3. In addition, the language has the TLINK (TemporalLINK) tag that
annotates a temporal relation between a pair of temporal entities. EVENT and
MAKEINSTANCE are designed to tag events and TIMEX3 is designed to annotate
time expressions. An event is “a cover term for situations that happen or occur ” in
TimeML (Pustejovsky et al., 2003a). TimeML makes a distinction between an event
token that is a word referring to an event and an event realization that is the actual
happening of the referred event. In (3), killed and wounded are event tokens.
(3) TimeML annotation example
<TIMEX3 tid=“t1”>Last week</TIMEX3>, Hezbollah fighters
<EVENT eid=“e1”>killed</EVENT> three Israeli soldiers and seriously
<EVENT eid=“e2”>wounded</EVENT> six others.
<MAKEINSTANCE eventID=“e1” eiid=“ei1”>
<MAKEINSTANCE eventID=“e2” eiid=“ei2”>
<TLINK eventInstanceID=“ei1” relatedToTime=“t1” relType=“IS_INCLUDED”>
<TLINK eventInstanceID=“ei2” relatedToTime=“t1” relType=“IS_INCLUDED”>
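For readers who wish to inspect the corpora programmatically, the following sketch reads these tags with Python's standard library. The file name example.tml is a placeholder, and the relatedTo* attributes are looked up defensively because their names vary with the type of the target entity.

# A rough sketch of reading TimeML annotations with the standard library.
# "example.tml" is a placeholder file name; TimeBank/AQUAINT documents use
# the same tag inventory (EVENT, MAKEINSTANCE, TIMEX3, TLINK).
import xml.etree.ElementTree as ET

root = ET.parse("example.tml").getroot()

events = {e.get("eid"): e.text for e in root.iter("EVENT")}
times = {t.get("tid"): t.text for t in root.iter("TIMEX3")}
# MAKEINSTANCE maps an event instance id (eiid) back to its event token (eventID).
instances = {m.get("eiid"): m.get("eventID") for m in root.iter("MAKEINSTANCE")}

tlinks = []
for link in root.iter("TLINK"):
    anchor = link.get("eventInstanceID") or link.get("timeID")
    target = (link.get("relatedToTime")
              or link.get("relatedToEventInstance")
              or link.get("relatedToEvent"))
    tlinks.append((anchor, link.get("relType"), target))

print(len(events), len(times), len(tlinks))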
An EVENT tag marks an event token. Each event of killing an Israeli soldier is an
event realization. TimeML uses a MAKEINSTANCE tag in order to mark up the
realization of an event token. When more than one event realization corresponds
to an event token and a distinction between the realizations is required, distinct
MAKEINSTANCE tags are made. When a temporal relation between killed and
last week is annotated, it is not necessary to make distinctions between the three
event realizations that are the killings of three Israeli soldiers. All three event realizations occurred during the time expression, last week. If we need to annotate
temporal relations between a pair of kill and wound realizations, however, we need
to make six distinct event realizations of kill and wound. The six distinct event realizations are not made in the example because the relative temporal order between
each pair of six realizations is vague. In this thesis, I use event as an umbrella
term that includes EVENT and MAKEINSTANCE tags. If a distinction between
EVENT and MAKEINSTANCE tags is required, I will use distinct terms.
Various properties of an annotated event are annotated as attributes of EVENT
and MAKEINSTANCE tags as shown in (4).
(4) Attributes of event
stem : word string
event class : reporting, perception, aspectual, i_action, i_state, state,
occurrence
tense : past, present, future, infinitive, present participle, past participle,
none
aspect : progressive, perfective, perfective progressive, none
part-of-speech : adjective, noun, verb, preposition
polarity : POS (positive), NEG (negative)
An EVENT tag has stem and event class as attributes and a MAKEINSTANCE
tag has tense, aspect, part-of-speech, and polarity attributes. The attributes of both
tags can be used as properties of an event. An EVENT tag is required to have one
event class among seven event classes: PERCEPTION (e.g., see, hear ), ASPECTUAL (e.g., begin, continue), I_ACTION - “intentional action” - (e.g., try, prevent), I_STATE - “intentional state” - (e.g., believe, hope), STATE - (e.g., on board,
shortage), REPORTING - (e.g., report, say), and OCCURRENCE - the default
class - (e.g., walk, sell ) (Verhagen et al., 2009). The tense attribute of an event has
a value of either present, past, future, infinitive, present participle, past participle,
or none. The aspect has a value of progressive, perfective, perfective progressive, or
none. The part-of-speech attribute can have a value of adjective, noun, verb, or
preposition.
Temporal expressions are marked with TIMEX3 (Time Expression 3) tags. The
TIMEX3 tag is a successor of TIDES TIMEX2 (Ferro et al., 2001), which was
designed to represent properties of a temporal expression such as today, two days
ago, etc. A TIMEX3 tag has attributes such as real time information of a tagged
expression (value), the type of the expression such as date, time, duration, or set
(type), and the role of the expression such as publication time (functionInDocument), as listed in (5). I am going to use TIMEX instead of TIMEX3 for convenience.
(5) Attributes of TIMEX3
value : date and time
type : date, time, duration, set
functionInDocument : a role of an annotated time expression
TLINK (Temporal Link) is a tag that represents an asymmetric temporal relation
between a pair of temporal entities. A TLINK tag is composed of an anchor that is
a value of eventInstanceID or timeID, a target that is a value of relatedToTime or
relatedToEvent, and a temporal relation that is a value of relType, as shown in (6).
(6) Attributes of TLINK
eventInstanceID or timeID : a unique id
relatedToTime or relatedToEvent : a unique id
relType : before, after, ibefore, iafter, is_included, includes, begins,
begun_by, ends, ended_by, simultaneous, during, during_inv, identity
A TLINK tag allows only one temporal relation which is selected from a set of fourteen relations: before, after, ibefore (immediately before), iafter (immediately after),
includes, is_included, during, during_inv (during inverse), simultaneous, identity,
begins, begun_by, ends, and ended_by. The TimeML relation types are based on
the thirteen relations proposed by Allen in his interval algebra (Allen, 1983). The
relationship between TimeML relations and Allen’s relations is explained more in
Section 2.3. When a pair of temporal entities can have multiple disjunctive relations, the pair is not annotated with a TLINK tag.
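Since a TLINK is asymmetric, swapping the anchor and the target of a link requires inverting its relation. The following sketch records the inverse pairs implied by the relation definitions above; it is an illustrative convenience table, not part of the TimeML specification itself.

# A small sketch of the inverse pairs among the TimeML relations,
# used to rewrite a link when its anchor and target are swapped.
INVERSE = {
    "BEFORE": "AFTER", "AFTER": "BEFORE",
    "IBEFORE": "IAFTER", "IAFTER": "IBEFORE",
    "INCLUDES": "IS_INCLUDED", "IS_INCLUDED": "INCLUDES",
    "BEGINS": "BEGUN_BY", "BEGUN_BY": "BEGINS",
    "ENDS": "ENDED_BY", "ENDED_BY": "ENDS",
    "SIMULTANEOUS": "SIMULTANEOUS", "IDENTITY": "IDENTITY",
    "DURING": "DURING_INV", "DURING_INV": "DURING",
}

def reverse_link(anchor, relation, target):
    """Return the equivalent TLINK with anchor and target swapped."""
    return target, INVERSE[relation.upper()], anchor

print(reverse_link("ei1", "IS_INCLUDED", "t1"))   # ('t1', 'INCLUDES', 'ei1')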
One of the goals of this thesis is to identify a qualitative temporal structure among
temporal entities in a document. The four TimeML relations, during, during_inv,
identity, and simultaneous, are identical from the perspective of a relative temporal
order. Therefore, these relations are merged into one relation in this thesis: simultaneous. According to the TimeML annotation guidelines, during means that an
event persists throughout a time period. The definition focuses on the identical
duration between an event and a time period. Mani et al. (2006) considered during
identical to is_included, and merged during and during_inv into is_included and
includes, respectively. But, the Tarski Toolkit (Verhagen et al., 2005), which was
used in annotating temporal corpora, merged during, during_inv, simultaneous,
and identity into simultaneous. In this thesis, I follow the implementation in the
Tarski Toolkit because this toolkit was used to verify the consistency of the corpora
in the annotation process. The target relations in this thesis are the eleven qualitative relations: before, after, begins, begun_by, ends, ended_by, includes, is_included,
ibefore, iafter, and simultaneous.
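The merging just described can be expressed as a small lookup table. The sketch below follows the Tarski Toolkit convention adopted here (during, during_inv, and identity collapse into simultaneous) and is only an illustration of the mapping.

# A sketch of the relation merging described above: the fourteen TimeML
# relation types are collapsed to the eleven qualitative target relations by
# mapping DURING, DURING_INV, and IDENTITY to SIMULTANEOUS.
MERGE = {
    "DURING": "SIMULTANEOUS",
    "DURING_INV": "SIMULTANEOUS",
    "IDENTITY": "SIMULTANEOUS",
}

TARGET_RELATIONS = {
    "BEFORE", "AFTER", "IBEFORE", "IAFTER", "INCLUDES", "IS_INCLUDED",
    "BEGINS", "BEGUN_BY", "ENDS", "ENDED_BY", "SIMULTANEOUS",
}

def to_target(reltype):
    rel = MERGE.get(reltype.upper(), reltype.upper())
    assert rel in TARGET_RELATIONS, f"unexpected relation type: {reltype}"
    return rel

print(to_target("identity"))   # SIMULTANEOUS
print(to_target("before"))     # BEFORE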
The TLINK tag <TLINK eventInstanceID=“ei1” relatedToTime=“t1” relType=
“IS_INCLUDED”>, from (3), illustrates the components of this type of tag. The
ei1 is the value of eventInstanceID attribute which is an anchor, and the t1 is
the value of the relatedToTime attribute which is a target in the example. This
TLINK tag is interpreted such that the duration of the anchor entity, ei1, is
included in the duration of the target entity, t1.
For convenience in displaying TimeML annotated text, I will use simplified forms of
the tags, as shown in (7).
(7) Brief forms of EVENT, TIMEX, and TLINK
Last week t1 , Hezbollah fighters killed e1 three Israeli soldiers and seriously
wounded e2 six others.
e1 ==is_included==> t1
e2 ==is_included==> t1
I omit all attributes of EVENT and TIMEX tags in this thesis and leave only the
IDs, as illustrated in (7). I use the event ID instead of the event instance ID for
convenience if the use of the event instance ID is not necessary. For simplicity, I use
the notation e1 ==is_included==> t1 instead of the TLINK markup representation.
Temporal links in the temporal corpora can be categorized into five different types
based on what kind of temporal entities the temporal links are composed of: links
between events (event-event), between an event and a time expression (event-time),
between an event and the Document Creation Time (event-dct), between time expressions
(time-time), and between a time expression and the DCT (time-dct). Examples of
the link types are presented in (8). When the two entities in a temporal link are of
different types, such as event-time, either entity can come first in an annotated link,
as in event ==relation==> time and time ==relation==> event.
(8) An example of types of temporal links
11/02/89 DCT
The company has reported event declines event in operating profit in
each of the past three years time , despite steady sales growth event for
the years time .
event-event reported -declines, reported -growth . . .
event-dct reported -11/02/89, declines-11/02/89 . . .
event-time reported -each of the past three years, declines-each of
the past three years . . .
time-time each of the past three years-the years . . .
time-dct each of the past three years-11/02/89 . . .
DCT (Document Creation Time) signifies when a news article was created or published. Most documents have an annotated DCT in the meta-data before a news
article. DCTs of some documents, however, are annotated as an attribute value of a
time expression in its news article. When DCTs are annotated as an attribute of a
time expression in a news article, I categorized the links with the DCT information
as event-time and time-time, not event-dct and time-dct, essentially treating the
DCT as a time entity.
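A sketch of this five-way categorization is given below. It assumes the TimeML ID conventions used in the examples above (event instance IDs beginning with "ei", time IDs with "t") and a known tid for the DCT; both are assumptions of the sketch rather than guarantees of the corpora.

# A sketch of the five-way link-type categorization described above, assuming
# TimeML-style IDs ("ei1" for event instances, "t1" for times) and that the
# tid of the Document Creation Time is known.
def entity_kind(entity_id, dct_id="t0"):
    if entity_id == dct_id:
        return "dct"
    return "time" if entity_id.startswith("t") else "event"

def link_type(anchor_id, target_id, dct_id="t0"):
    kinds = {entity_kind(anchor_id, dct_id), entity_kind(target_id, dct_id)}
    if kinds == {"event"}:
        return "event-event"
    if kinds == {"event", "time"}:
        return "event-time"
    if kinds == {"event", "dct"}:
        return "event-dct"
    if kinds == {"time"}:
        return "time-time"
    return "time-dct"   # covers {"time", "dct"}

print(link_type("ei1", "t1"))        # event-time
print(link_type("ei2", "t0"))        # event-dct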
Table 2.1 shows the distributions of temporal relations that are in TimeBank and
AQUAINT corpora. The table also shows the distributions of fine-grained temporal
link types: a link from an event to a time expression (ET), a link from a time
expression to an event (TE), a link from an event to DCT (ED), a link from
DCT to an event (DE), a link from a time expression to DCT (TD), and a link
from DCT to a time expression (DT). In the experiments reported in this thesis,
I mainly use temporal links that contain an event. Therefore, I use temporal links
between events, between an event and a time expression, and between an event and
Document Creation Time.

Relation       event-  time-  time-DCT    event-time   event-DCT     Total
               event   time   DT    TD    ET     TE    ED      DE
AFTER           1023      2     7    54    20    114    83     261     1564
BEFORE          2300      1    32    49   137    615   182      81     3397
BEGINS            56      0     0    34     1      0    24      11      126
BEGUN_BY          59      0     1     3     2      0    27       0       92
ENDED_BY         114      0     4    21    10      2     1       0      152
ENDS              23      0     1    10     0      0    46       4       84
IAFTER            50      1     0     1     0      1    55       0      108
IBEFORE           67      0     1     1     1      2     3       0       75
INCLUDES         384      0    11   622    96    145     3     256     1517
IS_INCLUDED      484      0     2   110    28    227    36     349     1236
SIMULTANEOUS    1672      2     8    37   105     78  1477       1     3380
Total           6232      6    67   942   400   1184  1937     963    11731
Table 2.1: Distributions of temporal relations in TimeBank and AQUAINT
2.2 Previous Studies
In this section, I discuss some previous work on probabilistic approaches to identifying temporal relations in a narrative. I categorize the approaches into two types
for the convenience of the discussion. The first type, which will be called “temporal
relation classification”, focuses on classifying the temporal relation of a pair of temporal entities using a machine learning classifier, and does not attempt to construct
a consistent temporal structure. The other type, which will be called “constraint-based
temporal relation identification”, constructs temporal relations that do not
conflict using transitive constraints (e.g., if “i before k and k before j, then i before
j”). TRI tasks using probabilistic methods have moved from the first type to the
second type in recent years.
2.2.1 Temporal Relation Classification
In many domains of natural language processing, probabilistic methods have been
shown to lead to good performance (e.g., Charniak, 1996; Manning and Schütze,
1999). This is particularly true of such domains as part-of-speech tagging and
parsing, in which classification is central. These domains have typically been
addressed with machine learning methods – ways of classifying or categorizing
inputs automatically on the basis of an automatic learning algorithm. The applications of the machine learning methods can be roughly categorized into supervised
learning and unsupervised learning. Supervised learning trains machine learning
methods on labeled data, while unsupervised learning uses unlabeled data and finds
patterns in the unlabeled data.
The publication of TimeBank (Pustejovsky et al., 2003b) facilitated a number of
studies using supervised learning techniques applied to the task of temporal relation
classification (Boguraev and Ando, 2005; Mani et al., 2006; Chambers et al., 2007).
Several link types have been examined: links between an event and a time expression, links between a pair of events, and links between an event and Document
Creation Time. The studies first constructed feature vectors of the form <feature1,
feature2, feature3, . . . > using relevant features such as word token, part-of-speech,
tense, etc. The feature vectors were split into training and test data. A machine
learning classifier was trained on the training data in order to map an unseen feature vector in test data to a target temporal relation. The trained classifier classified temporal relations in the test data. The automatically classified relations were
then evaluated by comparing the predictions of the system to the original corpus
annotations, using evaluation measures.
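As a concrete, if toy, illustration of this pipeline, the sketch below builds feature vectors from attribute-style features and trains an off-the-shelf linear classifier. scikit-learn is an assumed dependency, and the feature names, values, and labels are invented for illustration rather than drawn from TimeBank.

# A generic sketch of the pipeline described above (not the classifier built in
# Chapter 3): encode attribute-style features for each temporal link and train
# an off-the-shelf classifier on toy data.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

train_links = [
    ({"tense": "PAST", "aspect": "NONE", "pos": "VERB", "same_sentence": True}, "BEFORE"),
    ({"tense": "PRESENT", "aspect": "PROGRESSIVE", "pos": "VERB", "same_sentence": False}, "IS_INCLUDED"),
    ({"tense": "PAST", "aspect": "PERFECTIVE", "pos": "NOUN", "same_sentence": True}, "AFTER"),
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform([features for features, _ in train_links])
y = [label for _, label in train_links]

classifier = LinearSVC()
classifier.fit(X, y)

test_features = {"tense": "PAST", "aspect": "NONE", "pos": "VERB", "same_sentence": True}
print(classifier.predict(vectorizer.transform([test_features]))[0])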
Boguraev and Ando (2005) were the first to apply machine learning techniques to
temporal relation classification. They were only concerned with the classification
of temporal relations between an event and a time expression where both entities
are in the same sentence. Their target relations were all fourteen temporal relations
of TimeML, including during, during_inv, and identity, and they added not-a-temporal-link
to indicate that no annotated link exists between a pair of temporal entities. Therefore,
their target set consisted of 15 relations in total (14 relations from TimeML plus
not-a-temporal-link). In their evaluation, they generated all
temporal links between an event and a time expression within a sentence, and
they assigned one of the 15 target relations using a classifier called Robust Risk
Minimization (RRM) (Zhang et al., 2002). RRM is a linear classifier that applies
thresholds to the inner products of feature vectors and weight vectors. They achieved
an F-measure of 53.1.
Mani et al. (2006) applied a machine learning classifier to a larger set of temporal
link types: temporal links between an event and a time expression, between a pair
of events, and between an event and DCT. They collected three sets of annotated
temporal links based on the three link types. The collected links in each set were
divided into training and test data. They made two main contributions in the field
of temporal relation classification. The first contribution was to increase the size
of the training data by applying transitive constraints to the temporal links in the
corpus. This means that, for example, given the annotations, i before k and k before
j, in the corpus, they inferred the additional link, i before j. The automatically generated temporal links were added to training data. The second main contribution
was the use of merged relations in order to decrease the number of target relations.
They trained a Support Vector Machine on increased training data with a total of
six merged relations (i.e., before, ibefore, begins, ends, includes, and simultaneous)
using annotated attribute values when constructing their feature vectors. The evaluation of the trained classifier was carried out using test data composed of only
annotated links (i.e., no additional links generated using transitive constraints) on
the six merged relations. Their system achieved accuracies of 62.5% and 76.1% for
temporal links between events and between an event and a time expression, respectively.
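The closure idea behind the first contribution can be sketched as follows. The toy version below is restricted to the before relation and merely illustrates how a training set grows; it is not Mani et al.'s implementation, which applied the full TimeML closure.

# A simplified sketch of closure-based training-data expansion, restricted to
# before: repeatedly add i before j whenever i before k and k before j hold.
def close_before(pairs):
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

annotated = {("e1", "e2"), ("e2", "t1"), ("t1", "e3")}
expanded = close_before(annotated)
# Three inferred links: ('e1', 'e3'), ('e1', 't1'), ('e2', 'e3')
print(sorted(expanded - annotated))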
Since Mani et al. (2006), subsequent studies (Chambers et al., 2007; Verhagen et al.,
2007, 2010; Tatu and Srikanth, 2008) have used merged relations as target relations.
Mani et al. (2006) argued for the usefulness of merged relations. Their study was
the first study that used six merged relations instead of all annotated qualitative
temporal relations. The merged relations were based on the existence of inverse
relations such as before and after and identical relations such as simultaneous and
identity in the TimeML relations. But, crucially, Mani et al. (2006) also applied
the merged relations to the evaluation data. The evaluation using merged relations
had the effect of reducing the search space, and resulted in artificially boosted performance, as I will show in Section 3.3. The following studies (Chambers et al.,
2007; Verhagen et al., 2007, 2010; Tatu and Srikanth, 2008) also used the evaluation process of Mani et al. (2006). As a result, no studies have given an accurate
report of the temporal relation classification task. As far as I know, the best performances of previous studies using the six merged relations were 67.6% accuracy of
Chambers et al. (2007) on event-event and 76.1% accuracy of Mani et al. (2006) on
event-time.
            Task     Description                                                Best Performance
TempEval-1  Task A   to classify temporal relations between time and event      0.65
                     expressions in the same sentence
            Task B   to classify temporal relations between the Document        0.80
                     Creation Time and event expressions
            Task C   to classify temporal relations between the main events     0.55
                     of adjacent sentences
TempEval-2  Task C   identical to Task A of TempEval-1                          0.65
            Task D   identical to Task B of TempEval-1                          0.82
            Task E   identical to Task C of TempEval-1                          0.58
            Task F   to classify the temporal relation between two events       0.66
                     where one event syntactically dominates the other event
Table 2.2: Descriptions of TempEval tasks
The TempEval-1 (Verhagen et al., 2007) and TempEval-2 (Verhagen et al., 2010)
shared tasks were designed in order to evaluate the automatic annotation of temporal information on a document. Four tasks from the six tasks of both TempEvals
were to classify the temporal relation of fine-grained categorized temporal links,
as shown in Table 2.2. Two modifications were made in the TempEval evaluation
tasks: (1) reducing the fourteen TimeML relations to three core relations (before, after,
and overlap1), and (2) extracting the temporal links from TimeBank that are related to
the target tasks. The shared tasks defined two evaluation measures, which they called
“precision” and “recall”. Their “recall” is identical to accuracy: the ratio of correctly
classified relations to the total number of target relations. Teams that participated in
the shared tasks were allowed to leave a temporal link unannotated. Their “precision”
evaluated the ratio of correctly classified relations to the total number of classified
relations; the precision calculation ignored temporal links that teams did not classify.
For three of the four tasks, the best performing system did not achieve more than
0.7 F-measure, despite the fact that there were only three target relations to be identified.
1 This overlap relation is a cover term for all TimeML relations other than before and after.

Study                      Target Links                         Target Relations          Source Document Distinction
Boguraev and Ando (2005)   links between an event and a time    14 TimeML relations +     No
                           expression in a sentence             “not-a-temporal-link”
Mani et al. (2006)         event-time, event-event              6 merged relations        No
Chambers et al. (2007)     event-event                          6 merged relations        No
Verhagen et al. (2007)     TempEval-1                           before, after, overlap    Yes
Chambers and Jurafsky      event-event                          before, after             Yes
(2008)
Yoshikawa et al. (2009)    TempEval-1                           before, after, overlap    Yes
Verhagen et al. (2010)     TempEval-2                           before, after, overlap    Yes
Table 2.3: Target links and target relations of previous studies
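To make the two measures concrete, the sketch below computes them for a toy document. The gold and system annotations are invented, and the code is not the official TempEval scorer.

# A sketch of the two TempEval measures described above. "recall" divides
# correct predictions by all gold links; "precision" ignores links the system
# left unannotated.
def tempeval_scores(gold, system):
    correct = sum(1 for link, rel in system.items() if gold.get(link) == rel)
    recall = correct / len(gold) if gold else 0.0          # identical to accuracy
    precision = correct / len(system) if system else 0.0   # unclassified links ignored
    return precision, recall

gold = {("e1", "t1"): "overlap", ("e1", "e2"): "before", ("e2", "t1"): "after"}
system = {("e1", "t1"): "overlap", ("e1", "e2"): "after"}  # one link left unannotated

precision, recall = tempeval_scores(gold, system)
print(round(precision, 2), round(recall, 2))   # 0.5 0.33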
As previously mentioned, there have been no studies on temporal classification that
have examined the full set of qualitative relations available in TimeML.2 While
Boguraev and Ando (2005) did include the full set of relations (plus an additional
one), their results are not directly comparable to the work presented in this thesis.
They added to their target relations the judgment of whether a link had only one
qualitative relation or not. Moreover, Boguraev and Ando (2005) focused only on
temporal links between an event and a temporal expression within the same sentence.
Other studies limited target relations to six and/or did not consider source documents
when splitting data into training and test sets, as shown in Table 2.3.
2 We consider only the task of classifying the relative temporal order between a pair of temporal entities here.
Although previous studies examined various ideas, such as increasing the training
data using transitive constraints and merging inverse relations, their performance
remains too poor for real-world applications. Three issues may explain why this
task is less suitable for machine
learning methods. The first issue is the fact that there are eleven target relations
but only a relatively small amount of annotated data. The second issue is the
skewed distributions of the relations. The last issue is the fact that temporal links
are dependent on each other. As mentioned by Yang and Wu (2006), when data
with multiple target classes depend on each other and are not evenly distributed, it
is a challenge for machine learning classifiers to achieve good performance. In addition, as I will show in Sections 3.2 and 3.3, previous work saw artificially boosted
performance due to errors made when the evaluations were carried out.
2.2.2 Constraint-Based Temporal Relation Identification
One critical issue that caused problems for the studies discussed in Section 2.2.1 is
the existence of conflicts between the classified relations in a document. When other
applications, such as Question Answering systems, use classified relations generated
by the methods in these studies, the systems cannot use the classified relations
directly as resources; they first need to resolve the inconsistencies. For example,
when a Question Answering system needs to answer a question about a qualitative
relation between a pair of temporal entities that has not been classified, the system
needs to infer the relation of the pair from the already classified relations. When
conflicts exist among the classified relations, however, the system cannot give an
answer without resolving them, because the conflicts can interrupt the reasoning
process. Chambers and Jurafsky (2008), Tatu and Srikanth (2008), and Yoshikawa
et al. (2009) tried to construct a consistent temporal structure in a document in
order to resolve this issue. Their studies applied transitive constraints to the methods
they used to construct consistent temporal structures.
An advantage of the application of transitive constraints is the effect of reducing
the search space. When we apply transitive constraints to transitive links such as
i ==?==> k, k ==?==> j, and i ==?==> j, the number of temporal relation candidates
for the three links is greatly reduced because the constraints reject the many cases
that violate them, such as i ==after==> k, k ==after==> j, and i ==before==> j.
It was assumed that the reduced search space could improve the performance of TRI
systems. These studies reached different conclusions on the usefulness of transitive
constraints for achieving improved performance compared to a classification system
that did not use transitive constraints. The studies were based on different methods
and data, so it is difficult to judge the reasons for the different conclusions.
Chambers and Jurafsky (2008) extracted event-event links with before or after relations and used the links as training and test data. Temporal links with relations
other than before and after were ignored because the target relations of their study
were only before and after. They classified temporal relations of the event-event
links using SVM classifiers, and resolved conflicting relations using a combination of
lpsolve,3 which is an Integer Linear Programming (ILP) implementation, the probability scores from SVM, and transitive constraints. The transitive constraints they
used seemed to be applicable in the reconstruction when target relations were only
3 http://sourceforge.net/projects/lpsolve
Chambers and Jurafsky (2008)
  Data: event-event links with before or after relations in TimeBank and AQUAINT
  Relations: before, after
  Reconstruction method: the application of ILP after the classification
  Performance improvement: 3.6% improvement after adding additional event-event and time-time links

Yoshikawa et al. (2009)
  Data: TempEval1
  Relations: relations of TempEval
  Reconstruction method: Markov Logic Network
  Performance improvement: 2.5% improvement (global formula + local formula + time-time + time-dct) compared with the use of local formula

Tatu and Srikanth (2008)
  Data: TimeBank and AQUAINT
  Relations: six merged relations
  Reconstruction method: two heuristics
  Performance improvement: lowered performance

Table 2.4: Differences between previous TRI studies
before and after. In their first experiment, there was no difference in performance
before and after the application of the ILP method. In their second experiment,
they increased data connectivity by adding two types of additional links: time-time
links in the temporal corpora, and the additional event-event links of Bethard et al.
(2007), which were over 600 manually annotated before and after relations between
events where the first event is a verb and the second event is the head of a clausal
argument to the verb. After the increase in connectivity, the performance increased
by 3.6% compared with the performance of their first classification step. But, this
is not a surprising result because the added information was more accurate. The
additional event-event links were manually annotated, and the time-time links
were either annotated ones or generated through simple calculations using annotated attributes (and therefore not very noisy). It was probable that the reliable
information led to the performance improvement because the reliable information
constrained possible relations of the links to be classified.
Yoshikawa et al. (2009) adopted Markov Logic to construct consistent temporal
relations. They used TempEval-1 (Verhagen et al., 2007) as data, which has different (simpler) annotations than TimeBank and AQUAINT corpora. Their target
temporal links were links between adjacent main events, between an event and a
time expression in the same sentence, and between an event and DCT. The target
relations of their study were before, after, and overlap. Overlap was a cover term for
all remaining TimeML relations. They used two types of formulas: local formulas
for classifying relations and global formulas for constructing consistent results.
Their study also used time-time and time-dct in the application of the global formulas. They reported a 2.5% improvement compared to 0.641 accuracy without
transitivity constraints.
Tatu and Srikanth (2008) used temporal links between events in TimeBank and
AQUAINT corpora as training and test data, and six merged relations as their
target relations. They first classified temporal relations using SVM and Maximum
Entropy (ME). After the classification step, they used a reasoning system to extract
consistent relations from the classified relations using two heuristics based on confidence scores from classifiers. In the reasoning process, other types of annotated temporal links such as between an event and a temporal expression and between a pair
of temporal expressions were added as additional information. Their first heuristic
was based on an iterative process: their first-order logic prover detected a conflict, the relations that contributed to the derivation of the detected conflict were extracted, and the lowest-scored relation among them was modified. After the modification, the inconsistency check was rerun. One problem with this iterative process is that later modifications are influenced by earlier ones, which causes cascading negative effects if an early decision is incorrect. The other heuristic first collected the three highest-scored relations for each link; then the combination of relations that was consistent and had the highest score sum was selected from the collection. If ten links are given, this heuristic greedily searches for a combination of relations among 3^10 possible candidates. Both methods lowered the performance, by 0.5% and 3%, respectively, compared to their classification performance. In other words, some correctly classified relations were modified because they had lower scores than some incorrectly classified ones.
2.3 Temporal Constraint Propagation Algorithm
I use transitive temporal constraints in producing a consistent temporal structure
for a document in this thesis. There are several ways of applying transitive constraints: Chambers and Jurafsky (2008) used integer linear programming, Yoshikawa et al. (2009) used Markov Logic, and Tatu and Srikanth (2008) used a logic prover. The proposed methods for detecting and correcting violations of transitive constraints in
this thesis are based on the Temporal Constraint Propagation Algorithm (TCPA).
This section gives a brief introduction to TCPA as background to the extensions
and modifications I will later propose.
Allen (1983) proposed the Temporal Constraint Propagation Algorithm. The algorithm infers implicit temporal relations from explicit ones, makes explicit relations
more constrained, and detects the existence of a conflict among the explicit temporal relations. In designing the algorithm, he also defined relative temporal relations between a pair of intervals. TimeML was designed based on Allen (1983), and
a temporal entity was assumed to be an interval.
Relation       Symbol  Inverse  Boundary Relations
X before Y     <       >        X− < Y−, X− < Y+, X+ < Y−, X+ < Y+
X equal Y      =       =        X− = Y−, X− < Y+, X+ > Y−, X+ = Y+
X meets Y      m       mi       X− < Y−, X− < Y+, X+ = Y−, X+ < Y+
X overlaps Y   o       oi       X− < Y−, X− < Y+, X+ > Y−, X+ < Y+
X during Y     d       di       X− > Y−, X− < Y+, X+ > Y−, X+ < Y+
X starts Y     s       si       X− = Y−, X− < Y+, X+ > Y−, X+ < Y+
X finishes Y   f       fi       X− > Y−, X− < Y+, X+ > Y−, X+ = Y+

Table 2.5: Thirteen relations of Allen (1983)
Allen Relation         TimeML Relation
before (<)             BEFORE
after (>)              AFTER
meet (m)               IBEFORE
inverse meet (mi)      IAFTER
during (d)             IS_INCLUDED
inverse during (di)    INCLUDES
starts (s)             BEGINS
inverse starts (si)    BEGUN_BY
finishes (f)           ENDS
inverse finishes (fi)  ENDED_BY
equal (=)              SIMULTANEOUS

Table 2.6: Mapping between TimeML relations and Allen (1983) relations
Allen (1983) proposed a method for representing relative temporal relations among
intervals. He defined an interval as a period of time with start and end boundaries.
In addition, the start boundary of an interval should be temporally placed before
its end boundary in his definition of an interval. Based on this definition, he categorized thirteen relative temporal relations between a pair of intervals. Allen started
by defining seven basic relations (before, finish, meets, overlaps, during, starts, and
equal ), and added inverses for all of the basic relations except equal to the relative
temporal relations. The thirteen relations are mutually exclusive and represent all
possible relative temporal orders between two intervals. The relations are computed
based on the calculations using the boundaries of two intervals. Let’s suppose two
intervals, x and y, are given; x− or y − is the start of an interval and x+ or y + is
its end. When the relation from the end of x (x+ ) to the start of y (y − ) is defined
as before (x+ =before=⇒ y−), we can define the relation between x and y as x before y (x =before=⇒ y). Some boundary orders, for example, x+ before y− and x− after y+,
cannot be converted into Allen’s thirteen relations because the relations contradict the definition of an interval: a start boundary of an interval temporally happens before its end boundary. Table 2.5 and Table 2.6 show Allen’s basic relations,
their inverses, pictorial examples, relations of boundaries, and the correspondence
between Allen relations and TimeML relations. One thing to note is that overlap
and inverse overlap of Allen’s thirteen relations are not included in the TimeML
relations. It is not clear why these two Allen relations are excluded.
When intervals and some explicit relations between the intervals are given, Allen’s
Temporal Constraint Propagation Algorithm uses a directed graph to represent the
given information. Nodes of the graph are intervals and its edges are labeled with
disjunctions of the thirteen relations in the graph. The graph is called an interval
constraint network. Let's suppose an interval constraint network with three intervals, e1, e2, and e3, as in Figure 2.1a: the temporal relation from e1 to e2 is before (<), and the relations from e3 to e2 are after (>) or during (d), such that e1 =<=⇒ e2 and e3 =>,d=⇒ e2.

Figure 2.1: Initial and propagated graphs; (a) initial relations, (b) propagated relations. <, >, and d mean the before, after, and during relations.
TCPA iteratively applies transitive constraints to three intervals from the given set
of intervals. When there is no more change in relations through the iterative application, TCPA terminates. Through the iterative calculation of transitivity among
three intervals, Allen’s algorithm computes “the strongest entailed relations between
all the pairs of variables” (Gerevini, 2005). During the iteration, if a link between a
pair of intervals ends up with no relation as a label (i.e., no consistent relation can be found), the network contains a conflict among the explicit relations. As an illustration of
Figure 2.2: An inconsistent example that cannot be detected with TCPA
the straightforward application of the algorithm, Figure 2.1a results in Figure 2.1b
after the application terminates because no further changes occur in the iterative process.
TCPA has two limitations. First, the algorithm only guarantees consistency among triples of nodes, not the global consistency of a network, so it cannot detect the inconsistency in Figure 2.2. TCPA computes every local consistency among three nodes in O(n^3), whereas deciding the global consistency of a graph is an NP-complete problem, as shown by Vilain et al. (1990), and no efficient method for it is known. Therefore, I use Allen's constraint propagation algorithm in this thesis in spite of this limitation. The other limitation is that TCPA cannot extract the conflicting relations themselves. The goals of the algorithm were to make relations more constrained using the specified relations and transitive constraints and to detect the existence of a conflict among the specified relations. Therefore, the algorithm can detect that a conflict exists, but it was not designed to extract the conflicting relations.
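To make the propagation mechanism concrete, the following is a minimal sketch of Allen-style constraint propagation, not the full thirteen-relation algorithm: it uses a toy composition table over only before (<), after (>), and equal (=), stores a set of candidate relations on every ordered pair of nodes, and prunes the sets until a fixed point is reached or some set becomes empty (a detected conflict). The function and variable names are illustrative assumptions, not an existing implementation.

```python
from itertools import permutations

# Toy composition table over {<, >, =} only; Allen's full algorithm uses a
# 13x13 table over all interval relations.
COMPOSE = {
    ("<", "<"): {"<"}, ("<", "="): {"<"}, ("=", "<"): {"<"},
    (">", ">"): {">"}, (">", "="): {">"}, ("=", ">"): {">"},
    ("=", "="): {"="},
    ("<", ">"): {"<", ">", "="},  # composition is unconstrained
    (">", "<"): {"<", ">", "="},
}
ALL = {"<", ">", "="}
INVERSE = {"<": ">", ">": "<", "=": "="}


def propagate(nodes, constraints):
    """constraints maps an ordered pair (i, j) to a set of allowed relations.
    Returns the tightened network, or None when a conflict is detected."""
    net = {}
    for i, j in permutations(nodes, 2):
        rels = set(constraints.get((i, j), ALL))
        rels &= {INVERSE[r] for r in constraints.get((j, i), ALL)}
        net[(i, j)] = rels
    changed = True
    while changed:
        changed = False
        for i, k, j in permutations(nodes, 3):
            # Relations still possible on (i, j) given (i, k) composed with (k, j).
            allowed = set()
            for r1 in net[(i, k)]:
                for r2 in net[(k, j)]:
                    allowed |= COMPOSE[(r1, r2)]
            tightened = net[(i, j)] & allowed
            if not tightened:
                return None  # no consistent relation remains: a conflict
            if tightened != net[(i, j)]:
                net[(i, j)] = tightened
                net[(j, i)] = {INVERSE[r] for r in tightened}
                changed = True
    return net


# Loosely mirrors Figure 2.1, restricted to the toy relation set:
# e1 before e2, and e3 after-or-equal e2.
print(propagate(["e1", "e2", "e3"],
                {("e1", "e2"): {"<"}, ("e3", "e2"): {">", "="}}))
```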
Chapter 3
Temporal Relation Classification between Temporal Entities

3.1 Introduction
This chapter discusses the temporal relation classification task, one of the components of a temporal relation identification system. The goal of the temporal relation
classification task is to label the directed link between a pair of temporal entities
with a qualitative temporal relation.
The example in (1) refers to two events, denoted by the verbs killed and wounded, and one temporal expression, last week.
(1) Last week, Hezbollah fighters killed three Israeli soldiers and seriously
wounded six others.
Given these three entities, we can form three temporal links, such as a link from killed to wounded; each has a corresponding reversed link, such as a link from wounded to killed. Therefore, there are three possible relations among them. This temporal relation classification task can be formulated as a labeling task: given the text in (1), how can we label the following links, from the set of possible relations (before, after, includes, etc.): killed =?=⇒ wounded, killed =?=⇒ last week, and wounded =?=⇒ last week.
The annotated corpora that we use only contain links that can be annotated with
one and only one relative temporal relation. Whenever multiple labels are possible, no annotation is made. The relations of these links are mapped to a set of eleven relations: before, after, begins, begun_by, ends, ended_by, includes, is_included, ibefore, iafter, and simultaneous. In (1), the is_included relation holds in the links from killed to last week (killed =is_included=⇒ last week) and from wounded to last week (wounded =is_included=⇒ last week). It is impossible to select only one qualitative temporal relation between the killed and wounded entities because we do not have enough context to pinpoint a single temporal relation. The killing events could happen before the wounding events, or the wounding events could happen first. It is also possible that the wounding and killing events happened simultaneously. The link between the killed and wounded entities is therefore not annotated, because there is no way to choose just one relation from the many valid alternatives.
Following previous studies, I will make the following methodological assumptions,
as I mentioned in the introduction chapter:
• Temporal entities have already been identified.
• Annotated links only have one relative temporal relation.
Adopting these assumptions makes it possible to focus on the core task, Temporal Relation Classification (TRC), and to evaluate the classification task without interference from the performance of other systems that identify temporal entities and temporal links with only one temporal relation. To summarize, the classification of a relation for a directly linked pair of entities is the target task in this chapter.
In Section 3.2, I will outline the effect of using a larger target set of qualitative relations. As shown in Section 2.2.1, no studies have used all qualitative relations as
target relations since Mani et al. (2006) proposed the use of six merged relations instead of all the qualitative relations available in the corpora. In Section 3.3, I will show why applying the six merged relations to the evaluation data is problematic and why the performance reported in a previous study cannot be expected in a real-world application.
The best performing temporal relation classification systems have achieved between
60% and 70% accuracy (Mani et al., 2006; Chambers et al., 2007). As mentioned
in the introduction chapter, the reported inter-annotator agreement for the annotations of temporal relations was 77% (Verhagen et al., 2007). When we use the
inter-annotator agreement rate as the level of performance that we expect from a
temporal relation classification system, the previous best performances of between
60% and 70% are not high enough for practical applications. Therefore, in Section 3.4, I will suggest an alternative method for making the classification task somewhat easier. The method takes advantage of the fact that a temporal relation link represents a qualitative temporal order between temporal entities, each of which is an interval. A temporal entity has two boundaries: start and end. A temporal relation (i.e., an interval relation) can therefore be represented with four boundary pairs drawn from the two temporal entities. The method uses these four boundary pairs in a machine learning step. Each pair of boundaries can be labeled with one of three boundary relations: before, equal, and after, so the method needs only four classifiers for each temporal link. After classifying the relations of the boundary pairs, I combine the four classified relations into a single temporal relation using a score-based heuristic, as illustrated by the sketch below.
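As a concrete illustration of the boundary-pair representation, the sketch below shows the deterministic lookup from four boundary relations to an interval relation, derived from Table 2.5 and the TimeML mapping in Table 2.6; the score-based combination heuristic itself is described in Section 3.4, and the function name and data structure here are illustrative only.

```python
# Deterministic lookup from the four boundary relations of (x, y) to an
# interval relation; boundary order is (x-/y-, x-/y+, x+/y-, x+/y+).
BOUNDARY_TO_TIMEML = {
    ("<", "<", "<", "<"): "BEFORE",
    ("=", "<", ">", "="): "SIMULTANEOUS",   # Allen equal
    ("<", "<", "=", "<"): "IBEFORE",        # Allen meets
    (">", "<", ">", "<"): "IS_INCLUDED",    # Allen during
    ("=", "<", ">", "<"): "BEGINS",         # Allen starts
    (">", "<", ">", "="): "ENDS",           # Allen finishes
    # Allen overlaps, ("<", "<", ">", "<"), has no TimeML counterpart.
}


def interval_relation(boundary_relations):
    """Return the TimeML relation for a 4-tuple of boundary relations, or
    None when the combination is invalid or has no TimeML counterpart.
    Inverse relations (AFTER, IAFTER, INCLUDES, BEGUN_BY, ENDED_BY) are
    obtained by classifying the pair with x and y swapped."""
    return BOUNDARY_TO_TIMEML.get(tuple(boundary_relations))


print(interval_relation(("<", "<", "<", "<")))  # BEFORE
```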
3.2 Temporal Relation Classification: Interval Relations
The goal of this experiment is to implement three temporal relation classification
systems for three temporal link types (i.e., links between events, between an event
and a time expression, and between an event and DCT) to label a temporal link
as one of eleven target relations. The corpora that I am working with have been
annotated with a set of eleven qualitative relations.1 It is reasonable to examine
the performance of a TRC system on all of the relative temporal relations that are
available.
There are eight possible types of temporal links between the three different types
of temporal entities (i.e., events, temporal expressions, and DCT), depending on
which entity plays the role of the temporal anchor that corresponds to the value of
the eventInstanceID or timeID attribute in an annotated temporal link:
• temporal links between events
• temporal links from an event to a time expression
• temporal links from a time expression to an event
• temporal links from an event to DCT
• temporal links from DCT to an event
• temporal links from a time expression to DCT
• temporal links from DCT to a time expression
• temporal links between time expressions
In this thesis, I only consider the five types where at least one element of the link is
an event. I normalized the five temporal link types into three types:
• temporal links from an event to a time expression (event-time)
• temporal links from an event to DCT (event-dct)
• temporal links between events (event-event)

1 The actual number is fourteen, but these fourteen relations correspond to eleven qualitative relations, as I discussed in Section 2.1.
In the normalization, I modified temporal links based on the criteria outlined in
(9).
(9) 1. An event is placed in the anchor position when a link contains an event and a non-event entity.
2. The event that appears first in the text is placed in the anchor position when a link is composed of two events.
When a temporal link was composed of two different entity types such as between
an event and a time expression, I placed the event in the temporal anchor position in the link. The modification makes it possible to combine two temporal link
types such as event-time and time-event into a single link type: event-time. In addition, this modification makes it easier for the classifier to learn temporal relations
because it has access to more consistent data.
The second modification in (9) is applied to temporal links between a pair of events.
The TimeML annotation guide (Saurí et al., 2006) doesn’t specify which entity
type an annotator should mark as an anchor. Therefore, human annotators made
events either anchors or targets based on their subjective decisions. Their decisions
sometimes make it difficult to find a useful pattern in annotated links, as can be
seen in (10).
(10) . . . the parent asked the unit to respond by Oct. 31. The unit said it can
provide no assurance . . .
said =after=⇒ asked

. . . she laid a wreath next to the crater . . . she handed a folded American flag to U.S. Marines . . .

laid =before=⇒ handed
I fixed which event should be placed at the anchor position of a link. When a link was composed of a pair of events, the order in which they appear in the narrative was used. The link said =after=⇒ asked was converted into asked =before=⇒ said.2 One reason
for adopting this convention is that there is a well known effect of narrative convention in temporal interpretation: the order in the text often corresponds to the
temporal order. In “John pushed Mary. Mary fell.”, for example, the natural interpretation is that the pushing was before the falling. Clearly, learning this narrative
convention is difficult if there is no normalization. Therefore, I have adopted the
convention that the event appearing first in the document is placed in the anchor
position.
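The following is a minimal sketch of the normalization in (9), assuming each link is an (anchor, target, relation) triple and that hypothetical is_event and position lookups are available; the inverse-relation map follows the eleven relations used in this thesis.

```python
# Inverse-relation map over the eleven TimeML relations used in this thesis.
INVERSE = {
    "BEFORE": "AFTER", "AFTER": "BEFORE",
    "IBEFORE": "IAFTER", "IAFTER": "IBEFORE",
    "BEGINS": "BEGUN_BY", "BEGUN_BY": "BEGINS",
    "ENDS": "ENDED_BY", "ENDED_BY": "ENDS",
    "INCLUDES": "IS_INCLUDED", "IS_INCLUDED": "INCLUDES",
    "SIMULTANEOUS": "SIMULTANEOUS",
}


def normalize_link(anchor, target, relation, is_event, position):
    """Place an event (criterion 1) or the first-appearing event (criterion 2)
    in the anchor position, inverting the relation when the link is flipped."""
    swap = False
    if is_event[anchor] != is_event[target]:
        swap = not is_event[anchor]                 # criterion 1: event as anchor
    elif is_event[anchor] and is_event[target]:
        swap = position[anchor] > position[target]  # criterion 2: first event as anchor
    if swap:
        return target, anchor, INVERSE[relation]
    return anchor, target, relation


# Example from (10): "said AFTER asked" becomes "asked BEFORE said",
# because asked appears earlier in the text (positions are hypothetical offsets).
is_event = {"said": True, "asked": True}
position = {"asked": 10, "said": 42}
print(normalize_link("said", "asked", "AFTER", is_event, position))
# ('asked', 'said', 'BEFORE')
```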
In Section 3.2.1, I describe what features are used during the construction of the
feature vectors used in this classification task. Next, in Section 3.2.2, I describe
the setup of an experiment to measure the performance of the classifiers that label
a temporal link as one of eleven target relations. In Section 3.2.3, I present and
discuss the results of this experiment.
3.2.1 Features
In constructing feature vectors for the classifier, I use many of the features used by
Mani et al. (2006) and Chambers et al. (2007) because these studies showed that
the features were useful to some degree. In addition, I also examine
2 The conversion is different from the merge of fourteen relations into six relations using inverse relations in Mani et al. (2006). In this experiment, it is possible that annotated links such as e1 =before=⇒ e2 are converted into e2 =after=⇒ e1 if e2 appears first in the text.
Features taken from previous works (Mani et al., 2006; Chambers et al., 2007): tense, aspect, part-of-speech, modal, event class, stem, bigram of tenses, bigram of aspects, samesent, signal

Features new in this experiment: word existences in timex, plural in timex, samephrase, sameclause, relative, adjacency

Table 3.1: Comparison of features in this experiment and previous works
several novel relevant features that capture the syntactic relatedness between two
temporal entities. The new features capture information about whether both entities are in the same phrase or in the same clause, and whether both entities are
part of a relative clause. In this experiment, I used the existing annotated information of events and temporal expressions in the TimeBank and AQUAINT corpora.
Additionally, I added some features that are extracted using natural language processing systems; for example, the stem of an annotated event token and the existence of two entities in the same phrase. I categorize which features are from the
previous studies and which ones are new in Table 3.1.
The features used for events are tense, aspect, modal, part-of-speech, stem,
and event class, most of which were taken directly from the TimeML annotation
for the corpora. These features have the following values:
Tense present, past, future, infinitive, prespart (present participle), pastpart (past
participle), and none.
Aspect progressive, perfective, perfective_progressive, and none.
Part-of-speech adjective, noun, verb, preposition, and other.
Modal is a modal verb that modifies an event. In the example of “it would tolerate
a rule in its own backyard”, the modal value of the tolerate event is “would”.
Event class (type of event) reporting, perception, aspectual, i_action, i_state,
state, and occurrence.
Stem is the stem of the event-denoting word. I extract stems using the morphological analyzer, morpha.3
To give an example, reported in "the company has reported the declines" is represented as <(TENSE:present), (ASPECT:perfective), (MODAL:none), (POS:verb),
(CLASS: reporting), (STEM:report)>. I also added bigrams of tense and aspect as
features for event-event links. These features were intended to reflect the tense and
aspect sequences of the events in the links. When the tenses of event1 and event2 in event1 =relation=⇒ event2 are present and past, respectively, the bigram of the tenses is present-past.
The features used for TIMEXes are very different from those used for events.
TIMEXes such as “two days ago” are often multi-word expressions, with much
3 http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.tar.gz
of the information about their temporal features encoded in specific words. For
example, “ago” can be very informative about the relationship between “two days
ago” and other temporal entities. Therefore, in an attempt to normalize temporal
expressions, as I extracted TIMEX features, I added features that record whether
specific time-related words are present in a time expression. To my knowledge, this normalization of time expressions is the first such attempt in this field. I manually defined a list of
relevant temporal words and automatically checked for the existence of one of these
words in each time expression. These words were treated as binary features. The
words used as features are:
• ago, coming, current, currently, earlier, early, every, following, future, last,
later, latest, next, now, once, past, previously, recent, recently, soon, that, the,
then, these, this, today, tomorrow, within, yesterday, and yet
Additionally, I added the existence of plural words, seconds, minutes, hours, days,
months, and years, as a feature called plural. The use of the plural words usually
means that a temporal expression is a duration.
For example, I represent the past three years as <(PAST:1), (THE:1), (PLURAL:1)>.
All other word features are set to 0.
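A minimal sketch of these binary TIMEX features, using the word list above; the function name and the dictionary representation are illustrative assumptions, not the actual implementation.

```python
# Sketch of the TIMEX word-existence and plural features described above.
TIME_WORDS = [
    "ago", "coming", "current", "currently", "earlier", "early", "every",
    "following", "future", "last", "later", "latest", "next", "now", "once",
    "past", "previously", "recent", "recently", "soon", "that", "the",
    "then", "these", "this", "today", "tomorrow", "within", "yesterday", "yet",
]
PLURAL_WORDS = {"seconds", "minutes", "hours", "days", "months", "years"}


def timex_features(timex):
    """Return binary word-existence features plus the plural feature."""
    tokens = timex.lower().split()
    feats = {word.upper(): int(word in tokens) for word in TIME_WORDS}
    feats["PLURAL"] = int(any(tok in PLURAL_WORDS for tok in tokens))
    return feats


feats = timex_features("the past three years")
print({k: v for k, v in feats.items() if v})  # {'THE': 1, 'PAST': 1, 'PLURAL': 1}
```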
I added what I call relational features in order to capture the syntactic relatedness
between a pair of temporal entities. In “The company reported declines in the
past three years.”, the syntactic dependence between reported and the past three
years can be captured by a parse tree. The dependency can be key in inferring a
temporal relation between temporal entities. In this thesis, relational features are
those features that are used to infer relations between temporal entities, but are not
derived explicitly from the temporal entities.
I parsed each sentence using the Charniak parser (Charniak and Johnson, 2005)
in order to extract relational features when a temporal link existed between two
temporal entities in the same sentence. I present the parsed tree of “The company
reported declines in the past three years.” in Figure 3.1. Using the parsed tree,
I calculated the shortest path between the temporal entities. First, I found a path
of each temporal entity to the root of the parsed tree. The paths of reported and
the past three years to S were “reported ↑VBN↑VP1 ↑VP↑S” and “the past three
years↑NP↑PP↑VP1 ↑VP↑S” in Figure 3.1. Then, I searched for the shortest shared
ancestor between the two entities in the two paths of the entities to the root of the
parsed tree. The shortest shared ancestor means the shared ancestor that, among
all their shared ancestors, is the closest to both temporal entities. In the two paths
of reported and the past three years, the shared ancestors were VP1 , VP, and S.
Among the shared ancestors, the shortest shared ancestor, which was the closest
to both reported and the past three years, was VP1 . The shortest path between
the temporal entities was the path from one temporal entity to the other temporal entity in a temporal link via the shortest shared ancestor. The shortest path
between reported and the past three years was “reported ↑VBN↑VP1 ↓PP↓NP↓the
past three years”.
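The following is a minimal sketch of this shortest-path computation under the assumption that the parse tree is available as a child-to-parent map; the node names (e.g., NP2) and the tiny hand-built tree are illustrative, not actual parser output.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def shortest_path(a, b, parent):
    """Shortest path between two entities via their shortest shared ancestor."""
    up_a, up_b = path_to_root(a, parent), path_to_root(b, parent)
    ancestors_b = set(up_b)
    shared = next(n for n in up_a if n in ancestors_b)  # shortest shared ancestor
    up = up_a[: up_a.index(shared) + 1]                 # a ... shared ancestor
    down = list(reversed(up_b[: up_b.index(shared)]))   # below the ancestor ... b
    return up + down


# Toy child-to-parent map for "The company has reported declines in the past
# three years."; NP2 stands in for the NP dominating the time expression.
parent = {
    "reported": "VBN", "VBN": "VP1", "VP1": "VP", "VP": "S",
    "the past three years": "NP2", "NP2": "PP", "PP": "VP1",
}
print(shortest_path("reported", "the past three years", parent))
# ['reported', 'VBN', 'VP1', 'PP', 'NP2', 'the past three years']
```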
When the pair of temporal entities of a temporal link meets certain syntactic conditions, the following relational features are given the value 1; otherwise they receive
the value 0: samesent, samephrase, sameclause, and relative.
Samesent checks whether both temporal entities are in the same sentence. When
both entities exist in the same sentence, I mark the feature as 1, otherwise, 0.
Figure 3.1: Syntactic parse tree of "The company has reported declines in the past three years.", parsed with the Charniak parser: (S (NP The company) (VP (AUX has) (VP1 (VBN reported) (NP (NNS declines)) (PP (IN in) (NP the past three years)))))
Samephrase checks whether both entities are in the same phrase. This feature is triggered when both entities are in the same sentence and meet the following conditions:
Condition 1 The shortest shared ancestor in the shortest path between a
pair of entities should be a phrasal constituent.
Condition 2 Among the paths of the entities to the shortest shared ancestor,
only one path should have at most one additional phrase.
Condition 3 The head of the additional phrase is the entity whose path to
the shortest shared ancestor has the additional phrase.
The shortest path between reported and declines in Figure 3.1 is “reported ↑VBN
↑VP1 ↓NP↓NNS↓declines”. The shared ancestor of the words is VP1 . Therefore, the first condition is satisfied. The path from declines to VP1 in the
shortest path has only one phrase tag, NP, and no phrase tag exists in the
path from “reported” to the shared ancestor. Therefore, the second condition is also satisfied. The head of the NP is declines. Therefore, the third
condition is also satisfied, and the samephrase feature for a link between
reported and declines is 1. When a temporal expression is involved in a
link, the phrase of the temporal expression is ignored in the path calculation. The shortest path between reported and the past three years is
“reported ↑VBN↑VP1 ↓PP↓NP↓the past three years”. The NP of the past
three years is ignored in the count of phrases. Therefore, the path from VP1
to the expression has only one additional phrase, PP. But the NP is not the
head of the PP. The violation on the third condition makes the feature 0.
Sameclause checks whether both entities are in the same clause. This feature
is triggered under the condition that at most one clausal node exists in the
shortest path between temporal entities and the clausal node should be the
shared ancestor in the path. If the shortest path has no clause node (S or
SBAR), both entities are considered to be in the same clause. None of the paths between reported, declines, and the past three years in Figure 3.1 contains a clausal node. Therefore, all sameclause features for these pairs are 1.
When a path such as “. . .VBN↑VP↓SBAR↓S. . .” is given, the right side path
of VP contains S, a clause, and the S clause is not a shared ancestor. Therefore, sameclause is 0.
Relative When an event is related to a temporal expression in a relative clause
structure, the relationship needs to be captured. In “declines that were
reported in April ”, the samephrase and sameclause features between declines
and April are 0s because there is an additional clausal node in the shortest
path between declines and April. In order to capture the dependence between
Figure 3.2: Syntactic tree of a relative clause, parsed with the Charniak parser: (NP (NP declines) (SBAR that were reported in April))
the entities, I introduce a relative feature. The Charniak Parser parses the
relative clause as shown in Figure 3.2. I trigger the relative feature under
the following conditions:
Condition 1 The shared ancestor in the shortest path between two entities
is NP.
Condition 2 The left child of the ancestor is NP and the right child is
SBAR.
Condition 3 The left NP has an event as its descendant.
Condition 4 No clausal node exists in the path from the right SBAR to
another entity.
Signal Certain prepositions and conjunctions are used as features when the
words are used as a head word in the shortest syntactic path between
two entities. The words represent a temporal relation between temporal
entities. The shortest path between reported and the past three years,
“reported ↑VBN↑VP1 ↓PP↓NP↓the past three years”, contains a PP. The
head word of the PP is “in” in “in the past three years”, and “in” is contained
in the list of “signal” words. Therefore, in is marked 1. The head words that
were used as features were:
• after, as, at, before, between, by, during, for, in, once, on, over, since,
then, through, throughout, until, when, and while
Adjacency This feature marks whether any additional words appear between the temporal entities. In cases such as "said last Wednesday" or "Friday meeting", there are no words between the event and the temporal expression, and the said and meeting events are more likely to have occurred during last Wednesday and Friday, respectively; this feature captures that tendency. When no additional words exist between an event and a temporal expression, the feature is marked as 1, otherwise 0.
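As a small illustration, here is a sketch of the two simplest relational features under the assumption that each entity is given as a (sentence index, start token, end token) triple; samephrase, sameclause, relative, and signal additionally require the parse-tree paths described above and are not shown.

```python
def samesent(e1, e2):
    """1 if both entities are in the same sentence, else 0."""
    return int(e1[0] == e2[0])


def adjacency(e1, e2):
    """1 if no additional words appear between the two entities, else 0."""
    if e1[0] != e2[0]:
        return 0
    (_, s1, t1), (_, s2, t2) = sorted([e1, e2], key=lambda e: e[1])
    return int(s2 == t1 + 1)


# "said last Wednesday": said = token 3, last Wednesday = tokens 4-5
# (token positions are hypothetical).
said = (0, 3, 3)
last_wednesday = (0, 4, 5)
print(samesent(said, last_wednesday), adjacency(said, last_wednesday))  # 1 1
```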
One goal of this experiment is to see how effective machine learning classifiers are
at classifying TimeML relations. In order to test this, I minimized the intervention of natural language processing programs in this experiment, using them only to extract the stem of an event and the relational features that are not available in the temporal corpora.
Temporal links between a pair of events (event-event) and between an event and a
temporal expression (event-time) had feature vectors that consisted of features of
both entities and relational features. Temporal links between an event and DCT (event-dct) contained only features extracted from the event. The features triggered for each temporal link type are listed in Table 3.2.
The feature vector describing the link from declines to the past three years in “The
company has reported declines in the past three years.” is <(TENSE:none),
(ASPECT:none), (PART-OF-SPEECH:noun), (EVENT CLASS:occurrence),
(STEM:decline), (PLURAL-IN-TIMEX:1), ("past":1), ("in":1), (SAMESENT:1), (SAMECLAUSE:1)>. All other features have the value 0.

Feature                   event-dct  event-event  event-time
tense                     ∨          ∨            ∨
aspect                    ∨          ∨            ∨
part-of-speech            ∨          ∨            ∨
modal                     ∨          ∨            ∨
event class               ∨          ∨            ∨
stem                      ∨          ∨            ∨
bigram of tenses                     ∨
bigram of aspects                    ∨
word existences in timex                          ∨
plural in timex                                   ∨
samesent                             ∨            ∨
samephrase                           ∨            ∨
sameclause                           ∨            ∨
relative                             ∨            ∨
signal                               ∨            ∨
adjacency                            ∨            ∨

Table 3.2: Features per TLINK type in this experiment
As mentioned at the beginning of this discussion of the experiment, I modified temporal links based on the two criteria outlined in (9). This results in more training
data per target relation compared to an approach that constructs two different classifiers, such as classifiers for event =relation=⇒ not-event and not-event =relation=⇒
event. By merging these two classifiers into one (always making the event the
anchor), we reduce the number of classifiers that we need to train, and increase
the training data available for the classifier.
3.2.2 Experiment Design
A data set of 214 documents that does not contain conflicting relations4 was used
in this experiment. I will use the results of this experiment in constructing consistent temporal relation structures in later chapters. Therefore, the documents with
innate inconsistencies were removed from the classification experiment.
This experiment focuses on the performance of temporal relation classification with eleven TimeML relations as target relations, using the following machine learning classifiers: k-Nearest Neighbors (kNN), Naive Bayes (NB), Support Vector Machine (SVM), and Maximum Entropy (ME). These classifiers were selected because they have shown good performance in other NLP tasks. Before training the classifiers, I modified the temporal links in the corpora based on the criteria outlined in (9). I then ran the Temporal Constraint Propagation Algorithm on all documents and generated more links by constraining implicit links using transitivity constraints. The links generated by the algorithm were used only in the training data.
Test data only included the original annotated links in the corpora. The distribution of relations after the modification is given in Table 3.3. In the table, “Original”
indicates manually annotated links, and “Closed” indicates manually annotated
links plus generated links after the application of the Temporal Constraint Propagation Algorithm. Both counts are the number of temporal relations after applying
the criteria outlined in (9) to the temporal links and merging during, during_inv,
identity, and simultaneous relations into simultaneous.
4 I found that 42 of the 256 documents in the TimeBank and AQUAINT corpora have conflicting relations. I did not use the documents with inconsistencies in the main parts of this thesis: the goal of my thesis is to construct a consistent temporal structure for a document, and it is impossible to construct a consistent structure for documents with innate inconsistencies. The inconsistencies found are listed in Appendix A.
Relation        Event-Event          Event-Timex          Event-DCT
                Original  Closed     Original  Closed     Original  Closed
AFTER           735       11083      86        2016       169       259
BEFORE          1239      12445      160       1603       721       1291
BEGINS          35        75         23        36         0         0
BEGUN_BY        38        74         51        58         10        11
ENDS            15        64         65        128        0         0
ENDED_BY        87        132        43        61         6         6
IAFTER          38        138        3         8          1         1
IBEFORE         49        132        2         9          0         0
INCLUDES        246       3987       122       166        417       469
IS_INCLUDED     327       4360       1495      2741       435       467
SIMULTANEOUS    1370      2348       201       321        75        90

Table 3.3: Distribution of TimeML relations after modifying temporal links
When training kNN and SVM, I examined various parameters: 3 and 5 neighbors for kNN, and linear and non-linear (radial basis) kernel functions for SVM. For each SVM kernel, four margin values were examined: 0.01, 0.1, 1, and 10. The default settings were used for NB and ME. This experiment is limited by the lack of a systematic approach for finding the parameters that yield the best performance of a temporal relation classifier; I intend to investigate parameter selection in future research.
In the evaluation, I used 10-fold cross validation. When I split training and test
data, the split occurred at the document level. All temporal links in a single document belonged to training or test data, not both. Tatu and Srikanth (2008) showed
that the performance measured without this source document constraint may be higher than the performance measured with it. When some links in a
document are in training data and the other links from the same document are in
test data, the shared temporal entities between both sets of links give the classifier
an advantage because it has already seen the temporal entities during training.
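A minimal sketch of such a document-level split, assuming each temporal link carries its source document id; the fold construction shown is one simple possibility, not the exact procedure used in the experiments.

```python
import random


def document_folds(links, n_folds=10, seed=0):
    """links: list of (doc_id, link) pairs. Yields (train, test) per fold, with
    every link from one document assigned to the same side of the split."""
    docs = sorted({doc_id for doc_id, _ in links})
    random.Random(seed).shuffle(docs)
    folds = [docs[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held_out = set(held_out)
        train = [link for doc_id, link in links if doc_id not in held_out]
        test = [link for doc_id, link in links if doc_id in held_out]
        yield train, test


links = [("doc1", "tlink-a"), ("doc1", "tlink-b"), ("doc2", "tlink-c")]
for train, test in document_folds(links, n_folds=2):
    print(train, test)
```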
I measured two kinds of performances of a classification system. I used F-measures
in order to show how good a classification system was in classifying a specific
relation, and accuracies in order to show overall performance of the system. An
F-measure is calculated from a combination of precision and recall. I calculated
the F-measure for each individual target relation. In order to calculate precision and recall, we need three counts: the total number of instances of a target class (total-target-class), the number of instances that a system classifies as the target class (classified-target-class), and the number of correct classifications among the classified-target-class (correctly-classified-target-class). For example, for the target class before, total-target-class is the number of temporal links that are labeled as before, classified-target-class is the number of temporal links that a system labels as before, and correctly-classified-target-class is how many classified before relations correspond to manually
annotated before relations. Precision, recall, and F-measure used in this study are
defined as:

Precision = correctly-classified-target-class / classified-target-class   (3.1)

Recall = correctly-classified-target-class / total-target-class   (3.2)

F-measure = 2 × (precision × recall) / (precision + recall)   (3.3)
When the number of links to be classified is defined as link-count and the number
of correctly classified relations is defined as correct-prediction-count, the accuracy
can be defined as in Equation 3.4.
Accuracy = correct-prediction-count / link-count   (3.4)
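A small sketch of these measures, computed from parallel lists of gold and predicted relation labels; the function names are illustrative.

```python
def per_relation_f1(gold, predicted, relation):
    """F-measure for one target relation, following Equations 3.1-3.3."""
    total = sum(1 for g in gold if g == relation)             # total-target-class
    classified = sum(1 for p in predicted if p == relation)   # classified-target-class
    correct = sum(1 for g, p in zip(gold, predicted)
                  if g == p == relation)                      # correctly-classified-target-class
    precision = correct / classified if classified else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(gold, predicted):
    """Overall accuracy, following Equation 3.4."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)  # correct-prediction-count
    return correct / len(gold)                                   # link-count


gold = ["BEFORE", "AFTER", "BEFORE", "INCLUDES"]
pred = ["BEFORE", "BEFORE", "BEFORE", "INCLUDES"]
print(per_relation_f1(gold, pred, "BEFORE"), accuracy(gold, pred))  # 0.8 0.75
```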
This study did not use F-measures in evaluating the overall performance of a
system, although previous studies (Verhagen et al., 2007, 2010) did. The shared
task competitions, TempEval-1 (Verhagen et al., 2007) and TempEval-2 (Verhagen
et al., 2010), used F-measures to evaluate the overall performance of a system.
They defined total-target-class as the number of temporal links to be classified,
and correctly-classified-target-class as the number of temporal links that a system
correctly classified. Total-target-class and correctly-classified-target-class correspond to link-count and correct-prediction-count, respectively, in Equation 3.4. In
the competitions, a team was allowed not to classify temporal links. Therefore,
classified-target-class was defined as the number of temporal links that a system
classified. In this thesis, my system classifies all temporal links given. Therefore,
classified-target-class and total-target-class are identical. Precision and recall are also
identical. In addition, their precision corresponds to accuracy in Equation 3.4.
Therefore, I only use accuracy as an evaluation measure for overall performance of
a classification system.
3.2.3 Results and Discussion
In this section, I report the performance of my temporal relation classification
system using eleven qualitative relations as target relations. I present the models
that show the best performance in Tables 3.4, 3.5, and 3.6. I report accuracy as the overall performance measure. The tables also list per-relation F-measures, showing
the performance of my system on each relation.
I compared the trained classifiers for event-dct, event-event, and event-time with a majority class baseline system. The three link types have three different most frequent relations: before for event-dct, simultaneous for event-event, and is_included for event-time. The baseline systems achieved 0.369, 0.287, and 0.677 accuracies on event-dct, event-event, and event-time, respectively. The overall performance of the baseline systems without the temporal link type distinction was 0.407.
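A minimal sketch of this majority-class baseline, assuming each link is represented as a (link type, relation) pair; the names are illustrative.

```python
from collections import Counter


def majority_baseline(train_links, test_links):
    """For each link type, always predict its most frequent training relation;
    returns the resulting accuracy on the test links."""
    majority = {}
    for link_type in {lt for lt, _ in train_links}:
        rels = [r for lt, r in train_links if lt == link_type]
        majority[link_type] = Counter(rels).most_common(1)[0][0]
    correct = sum(1 for lt, r in test_links if majority.get(lt) == r)
    return correct / len(test_links)


train = [("event-dct", "BEFORE"), ("event-dct", "BEFORE"), ("event-dct", "AFTER"),
         ("event-time", "IS_INCLUDED"), ("event-time", "BEFORE")]
test = [("event-dct", "BEFORE"), ("event-time", "IS_INCLUDED"), ("event-time", "AFTER")]
print(majority_baseline(train, test))  # 0.666...
```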
Relations      Baseline  kNN    NB     ME     SVM
BEFORE                   0.670  0.633  0.670  0.659
AFTER                    0.257  0.304  0.233  0.117
BEGINS                   N/A    N/A    N/A    N/A
BEGUN_BY                 0.000  0.000  0.000  0.000
ENDS                     N/A    N/A    N/A    N/A
ENDED_BY                 0.000  0.000  0.000  0.000
IBEFORE                  N/A    N/A    N/A    N/A
IAFTER                   N/A    N/A    N/A    N/A
INCLUDES                 0.612  0.593  0.566  0.577
IS_INCLUDED              0.512  0.299  0.541  0.511
SIMULTANEOUS             N/A    N/A    N/A    N/A
Accuracy       0.369     0.597  0.522  0.576  0.582

Table 3.4: Results of event-dct
The classifier that showed the best performance on event-dct was kNN using 5
neighbors with an accuracy of 0.597. The performance was 0.228 higher than the
0.369 of the baseline system. As I mentioned before, parameters for classifiers were
chosen arbitrarily in this experiment. I need to examine other parameters such as
k = 7 systematically in the future. The classifiers showed good performance in classifying before and includes, with F-measures around 0.6. The relation is_included followed at around 0.5. The classifiers achieved very low performance on relations such as after, begun_by, and ends.
Relations      Baseline  kNN    NB     ME     SVM
BEFORE                   0.420  0.377  0.462  0.347
AFTER                    0.412  0.283  0.416  0.261
BEGINS                   0.000  0.086  0.000  0.050
BEGUN_BY                 0.000  0.052  0.050  0.000
ENDS                     0.000  0.085  0.067  0.000
ENDED_BY                 0.117  0.176  0.237  0.249
IBEFORE                  0.165  0.188  0.314  0.236
IAFTER                   N/A    N/A    N/A    N/A
INCLUDES                 0.064  0.098  0.133  0.168
IS_INCLUDED              0.148  0.054  0.160  0.103
SIMULTANEOUS             0.392  0.367  0.389  0.507
Accuracy       0.287     0.368  0.292  0.389  0.350

Table 3.5: Results of event-event
The classifier that showed the best performance on event-event was ME with accuracy of 0.389. This performance was 0.102 higher than the 0.287 of the event-event
baseline system. There were no classifiers that achieved over 0.5 F-measure on any
relation. The best F-measure on a relation was 0.462 for before (ME). The majority class was simultaneous, at 0.328 (1370/4179, where 1370 is the number of simultaneous relations and 4179 is the total number of relations) in the original annotated data, as can be seen in Table 3.3. After the data size was increased through the application of the Temporal Constraint Propagation Algorithm, however, before became the majority relation, at 0.357 (12445/34838, where 12445 is the number of before relations and 34838 is the total number of relations). This change seems to make the classifiers perform well on before rather than simultaneous.
The classifier that showed the best performance on event-time was SVM with an
accuracy of 0.673 in Table 3.6. The performance was only 0.004 lower than the
Relations      Baseline  kNN    NB     ME     SVM
BEFORE                   0.332  0.309  0.273  0.092
AFTER                    0.140  0.085  0.140  0.106
BEGINS                   0.000  0.000  0.000  0.000
BEGUN_BY                 0.290  0.225  0.172  0.000
ENDS                     0.128  0.111  0.155  0.102
ENDED_BY                 0.124  0.259  0.050  0.097
IBEFORE                  0.000  0.000  0.000  0.000
IAFTER                   0.000  0.000  0.000  0.000
INCLUDES                 0.054  0.082  0.098  0.146
IS_INCLUDED              0.762  0.666  0.782  0.805
SIMULTANEOUS             0.279  0.304  0.279  0.157
Accuracy       0.677     0.583  0.486  0.613  0.673

Table 3.6: Results of event-time
baseline system. Classifiers achieved the best F-measures on is_included with over
0.65 F-measure. For other relations, classifiers got F-measures below 0.3.
The best performance was achieved in event-time; however, this performance was
still not better than the baseline system. The best improvement over the baseline
system was achieved in event-dct. This experiment indicates that the most challenging task is to identify relations between events.
When I combined the results of the best systems, which were kNN in event-dct,
Maximum Entropy in event-event, and Support Vector Machine in event-time, the
overall accuracy was 0.516. The low accuracy in the classification task indicates
a high probability that conflicting relations exist among the classified relations. I
examine how many classified relations conflict with each other and how to restore a
consistent temporal structure from the classified relations in later chapters.
3.2.4 Final Remarks on This Experiment
This experiment showed what level of performance can be achieved when eleven
relations are used as target relations. My systems for event-dct, event-time, and
event-event achieved 0.597, 0.673, and 0.389 accuracies. Mani et al. (2006), which
used six merged relations as target relations, achieved 0.761 accuracy on event-time.
The performance of my system on event-time, 0.673, was 0.09 lower than their performance. My system performance on event-time, however, was comparable to their
performance considering that my system used five more relations as target relations.
The comparable performance resulted from the high distribution of is_included.
The performance of my system on event-event, 0.389, looked much lower than the
performance of Chambers et al. (2007), 0.676. The experiment reported in the next
section will show that the low performance of my system on event-event is also
comparable to the performance of Chambers et al. (2007).
3.3 Reconsidering the Use of Merged Relations

3.3.1 Motivation
Since Mani et al. (2006), a normalization has been applied by the studies that used
the TimeBank and AQUAINT corpora as data, such as Chambers et al. (2007) and
Tatu and Srikanth (2008), when classifying and evaluating temporal relations. The
normalization merges eleven temporal relations into six by reversing the direction
of some temporal links. For example, e1 =after=⇒ e2 is normalized as e2 =before=⇒ e1. The
normalization is only applied to some links; that is, those that have been annotated
with a temporal relation ∈ {after, iafter, begun_by, ended_by, is_included}. The
directions of the links are reversed and the relations of the links are mapped to the
following set of inverse relations: before, ibefore, begins, ends, and includes.
When we use the normalization in training a classifier, the classifier can only classify a link into a relation among six normalized relations. The lack of the functionality to classify a link into a relation among other relations leads to adopting
additional steps in order to use the classifier in a real world application. The temporal relation of a link to be classified is unknown, i.e., e1 =?=⇒ e2. Therefore, we
cannot assume that given temporal links can only be classified into six relations
in a real world setting. Mani et al. (2006) and Chambers et al. (2007) applied the
normalization to both training and test data. We cannot get the performance of the
classifier in a real world setting when the normalization is applied to the test data.
The indiscriminate application of the merge to the test data reduced the space
of possible answers, making errors on inverse relations impossible and making it
impossible to measure the performance of the classifier in a real world setting. In
order to measure the performance of the systems in a real world setting, we need
to keep eleven relations in the test data and apply the classifier to a link in the test
data twice, reversing the positions of the anchor and target. For example, when
?
anchor ===⇒ target is given to a classifier trained with six merged relations, both
?
?
anchor ===⇒ target and target ===⇒ anchor links should be classified. Each link
is classified into a relation among six merged relations. Additional steps are, then,
required in order to combine two classified relations of two corresponding links into
?
a classified relation of the given link, anchor ===⇒ target. The additional steps are
relation
to convert a classified relation of target ===⇒ anchor into its inverse relation and
relation
to choose a relation between a classified relation of anchor ===⇒ target and the
converted relation. Through the additional steps, we can make a temporal relation
classification system that uses a classifier trained with six merged relations classify
a link into a relation among the eleven target relations. In addition, we can use the
finally selected relations in evaluating the performance of the temporal relation classification system. Although Chambers et al. (2007) reported around 0.60 accuracy
as the best performance of the classification task on temporal links between events,
they achieved that performance using evaluation data with only six target relations.
Therefore, we do not know how the system would really perform on real world data,
where we need to classify all eleven relations on temporal links between events. The
experiment in this section measures this performance by evaluating the Chambers
et al. (2007) system on evaluation data with eleven target relations.
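A minimal sketch of these additional steps; classify stands for a hypothetical classifier trained on the six merged relations that returns a (relation, confidence) pair, and the selection rule shown (keep the higher-scoring candidate) is only one reasonable choice, not necessarily the one used in the replicated experiments.

```python
# Inverses of the six merged relations.
INVERSE = {
    "BEFORE": "AFTER", "IBEFORE": "IAFTER", "BEGINS": "BEGUN_BY",
    "ENDS": "ENDED_BY", "INCLUDES": "IS_INCLUDED", "SIMULTANEOUS": "SIMULTANEOUS",
}


def classify_eleven(anchor, target, classify):
    """Combine two six-relation classifications into one prediction for
    anchor => target over the full relation set."""
    rel_fwd, score_fwd = classify(anchor, target)   # one of six merged relations
    rel_bwd, score_bwd = classify(target, anchor)   # one of six merged relations
    rel_bwd = INVERSE[rel_bwd]                      # convert to the link's direction
    return rel_fwd if score_fwd >= score_bwd else rel_bwd


def toy_classifier(a, b):
    """Purely illustrative: always predicts BEFORE, more confidently when the
    first argument is e1."""
    return "BEFORE", 0.9 if a == "e1" else 0.6


print(classify_eleven("e1", "e2", toy_classifier))  # BEFORE
```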
The application of the normalization to temporal links that are composed of different types of temporal entities may be useless. As I mentioned in Section 2.1, a
temporal relation has two positions, anchor and target, because it is asymmetric.
When a temporal link is composed of two different types of temporal entities, e.g.,
temporal links between an event and a time expression or between an event and
DCT, it is important to consider which entity is placed in which position. The
application of the normalization to a temporal link composed of heterogeneous
temporal entities ignores this crucial indicator.
In temporal corpora, an event can be placed as either an anchor or a target position in a temporal link because the TimeML annotation guide (Saurí et al., 2006)
does not specify which entity type an annotator should mark as an anchor. Therefore, temporal corpora have four types of temporal links: event =relation=⇒ not-event, not-event =relation=⇒ event, event =inverse relation=⇒ not-event, and not-event =inverse relation=⇒ event. The application of the normalization reduces the four types to two: event =relation=⇒ not-event and not-event =relation=⇒ event. After the application of the normalization, each type has just six merged relations.
One of the advantages of doing this normalization is to increase the size of the
training data per target relation for a classifier. The idea of the normalization is
that, by reducing the number of relations to predict, the classifier will have more
data per relation. But the training data is still divided into the two link types mentioned above. I propose an alternative normalization, where the four original link
types are still reduced to two: event =relation=⇒ not-event and event =inverse relation=⇒
not-event. One possible advantage to this approach is that all eleven relations are
preserved, which should make the classification more useful in a real world scenario.
It remains to be seen which approach is better.
Since the studies that used six merged relations in temporal relation classification (Mani et al., 2006; Chambers et al., 2007; Tatu and Srikanth, 2008), there have been no experiments evaluating the performance of a classification system that uses eleven relations. In spite of the problems with the normalization, the application of
six merged relations to the training data is still attractive when the classification
task focuses on temporal links between events. The merge increases the amount of
training data for each target relation. But when the evaluation is carried out on
eleven relations, it is not clear whether the increased data per relation is beneficial
since the classifier trained on six merged relations requires an additional step in
order to get the final classification. As explained before, when a link to be classified
is given to a classifier trained on six merged relations, the classifier should classify the relations of the link and its inverse link, then one of the classified relations
should be selected as the final classification. If the final selection is not perfect, we
can expect performance degradation in this step. Due to the expected performance
degradation of the additional step, the question of whether the merge produces
a better classification system compared with a classification system using eleven
relations has not been answered. The experiment in this section explores this comparison. I developed two classifiers using two sets of data (one with six merged
relations, and the other with all eleven relations). After building the classifiers, I
evaluated them using identical evaluation data with eleven relations.
In this experiment, I replicated the Chambers et al. (2007) system that used six
merged relations first. Next, I compared the performance of the replicated system
with their reported performance, and then I evaluated the replicated system on
data with eleven relations. The evaluation showed how much their reported performance was boosted. Finally, I developed another classification system using eleven
relations and the same features as Chambers et al. (2007). The new classification
system was evaluated on eleven relations. The comparison of the second system
with the third system showed that the merge was not effective.
3.3.2 Experiment Design
I used the TimeBank v.1.2 and AQUAINT corpora in this experiment, identical
to the experiment of Chambers et al. (2007). There are 259 documents in the corpora. There are 6,226 temporal links between events. Among the fourteen TimeML
relations, I merged during, during_inv, identity, and simultaneous into simultaneous. This is slightly different from Chambers et al. (2007) because they treated
during and during_inv as identical to is_included and includes, respectively.
I replicated the temporal relation classification system of Chambers et al. (2007).
The first step of the replication was to validate whether the replicated system
showed a similar performance to that of their system.5 They used six merged relations in both training and test data. In addition, they split training and test data
in such a way that the source document information was lost, and therefore some
links from the same document were split between training and test data. This way
of dividing the data boosts performance, according to Tatu and Srikanth (2008),
because the split results in temporal entities in the document being shared between
5 The merge of during and during_inv into simultaneous makes a different distribution of target relations than in the data of Chambers et al. (2007). Therefore, the replicated system may show differences in performance.
training and test. Despite knowing this flaw, I split the training and test data in the same way in order to replicate their system as closely as possible. This replicated system had two factors artificially boosting its measured performance:
• no distinction on source documents in constructing training and test data
• six merged relations in test data
In this experiment, I removed the factors one by one and showed how much each factor boosted performance.
The last step of this experiment was the comparison of the system trained with the
merged relations with a temporal relation classification system trained with eleven
relations. The goal of the last comparison was to see how much better the approach
using the merged relations is compared to a naive approach using all qualitative
TimeML relations.
3.3.3
Features
I used the features of Chambers et al. (2007) in order to replicate their system. I
extracted features from annotated information whenever it was available. In the
case of features that are not annotated in the corpora, I used NLP tools such as a
sentence splitter and Charniak Parser in order to get the features. For example, the
parsed tree of “The company has reported declines in operating profit in the past
three years” is given in Figure 3.3.
Features for EVENT I extracted tense, aspect, modal, part-of-speech (POS),
polarity, and event class annotations from temporal corpora as features. Polarity
signals whether an event is negated or not. “Positive” means that an event is not
negated. I extracted the lemma and WordNet synset of an EVENT word using
NLP tools.
Figure 3.3: Syntactic tree of "The company has reported declines in operating profit in the past three years," parsed with the Charniak Parser. The verb "reported" (VBN) sits under the phrase labeled VP1, which also dominates the NP "declines".
Information on words near events The part-of-speech labels (POS) of two
words preceding and one word following an event word were used as features. In
the case of reported in Figure 3.3, “has”, “company”, and “declines” are the words.
Their POSs (e.g., “AUX”, “NN”, and “NNS”) are feature values.
Relational features between events Samesent, signal, and dominance were features. Samesent and signal were explained in Section 3.2. Dominance means that
an event syntactically dominates another event in a sentence. An event is said to
dominate another event when the first phrase level of the first event is a shared
ancestor of both events. For example, reported in Figure 3.3 dominates declines
because the first phrase level (VP1 ) of reported is a shared ancestor of reported and
declines. When an event that is the first argument of a temporal link dominates the
other event in the link, the feature value is 1.
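To make the dominance feature concrete, the following sketch (not the implementation used in this work) computes it from a bracketed parse with the nltk Tree class; the helper names are illustrative, and the "first phrase level" is approximated by the grandparent of the event's leaf node.

from nltk import Tree

def leaf_position(tree, word):
    # position of the first leaf whose string matches the event word
    for pos in tree.treepositions('leaves'):
        if tree[pos] == word:
            return pos
    return None

def dominance(tree, event1, event2):
    # 1 if the first phrase level above event1 is an ancestor of event2, else 0
    p1, p2 = leaf_position(tree, event1), leaf_position(tree, event2)
    if p1 is None or p2 is None:
        return 0
    phrase = p1[:-2]  # drop the leaf and its POS node, e.g. VBN -> the enclosing VP
    return int(p2[:len(phrase)] == phrase)

tree = Tree.fromstring(
    "(S (NP (DT The) (NN company)) (VP (AUX has) (VP (VBN reported)"
    " (NP (NNS declines)) (PP (IN in) (NP (NN profit))))))")
print(dominance(tree, "reported", "declines"))  # 1: the inner VP dominates both events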
Data                                                      Accuracy
Baseline                                                  40.70
6-to-6-wo-source-distinction (Replication of Chambers)    61.28
6-to-6-w-source-distinction                               59.63
6-to-11 w Source Distinction (6-to-11)                    43.74
11-to-11 w Source Distinction (11-to-11)                  47.59

Table 3.7: Accuracies of four compared systems
Bigrams Various bigrams were used as features. The bigram features are bigrams of the POS of an event and the POS of the preceding word (e.g., "AUX-VBN" for reported in Figure 3.3) and bigrams of the two events' POSs, tenses, aspects, and classes (e.g., "VBN-NNS", "PRESENT-NONE", "PERFECTIVE-NONE", and "REPORTING-OCCURRENCE" for reported and declines).
Agreement features indicate whether the two events agree in tense, aspect, and event class. When both events have identical values for tense, aspect, or class, the corresponding feature value is 1. For reported and declines in Figure 3.3, the tense, aspect, and class agreements are all 0.
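As a rough illustration of the bigram and agreement features just described, the sketch below builds them from per-event attribute dictionaries; the attribute and feature names are hypothetical, not those of the actual system.

def pair_features(e1, e2, pos_of_word_before_e1):
    # bigram features combining attributes of the two events (cf. AUX-VBN, VBN-NNS, etc.)
    feats = {
        "pos_prev_e1":   f"{pos_of_word_before_e1}-{e1['pos']}",
        "pos_bigram":    f"{e1['pos']}-{e2['pos']}",
        "tense_bigram":  f"{e1['tense']}-{e2['tense']}",
        "aspect_bigram": f"{e1['aspect']}-{e2['aspect']}",
        "class_bigram":  f"{e1['class']}-{e2['class']}",
    }
    # agreement features: 1 when the two events share a value, 0 otherwise
    for attr in ("tense", "aspect", "class"):
        feats[attr + "_agree"] = int(e1[attr] == e2[attr])
    return feats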
3.3.4
Results and Discussion
The goal of the experiment is to determine the true performance of a state-of-the-art
temporal relation classification system. The system of Chambers et al. (2007) had
two performance-boosting factors in constructing evaluation data, as listed above.
I replicated and evaluated Chambers et al. (2007) including both factors. The first
replication (6-to-6-wo-source-distinction) had both boosting factors and showed
whether my replication has similar performance as Chambers’ system. Chambers
et al. (2007) reported 67.57% accuracy. My replication’s accuracy was 6% lower at
61.28%. There were two possible reasons for this. I merged during and during_inv
into simultaneous, but Chambers merged during and during_inv into is_included
and includes. In addition, I did not have access to their SVM parameter tuning
settings.
After establishing a system that roughly corresponds to the original Chambers
et al. (2007) system, I removed one factor: the invalid splitting of the training and
test data due to the ignoring of the source document information. I ensured that
no temporal links in the same document belonged to training and test data simultaneously. I trained and evaluated the replicated system on the new training and
test data. The second system (6-to-6-w-source-distinction) did not have the first
boosting factor. The performance was 59.63%. As reported in Tatu and Srikanth
(2008), the corrected split made it impossible to share information on temporal
links between training and test data and dropped the performance by almost 2%.
It is unlikely that a trained model will see the exact same examples again in real
world data (Domingos, 2012). Therefore, it is important to split data into training
and test data in such a way that data in the two sets do not share any information.
My experiment also supported that the data split method of Mani et al. (2006) and
Chambers et al. (2007) allows the sharing of some information between training
and test data. Therefore, the split of Mani et al. (2006) and Chambers et al. (2007)
should not be adopted in the evaluation of a temporal relation classification system.
The second boosting factor, the application of six merged relations to the evaluation data, is like passing a system a link whose direction is already decided.
But the direction is not known in real world data. After eliminating the second
factor, I evaluated the performance of the replication system in a real world scenario without the two boosting factors. I removed the second factor by generating
the corresponding reversed link to a given link and making the second replicated
system predict two relations of the given link and the reversed link. For example,
when a link e1 ==?==> e2 (whose relation is not yet known) was given, I automatically generated the corresponding reversed link, e2 ==?==> e1, then I classified both links. The classifier that is trained on six merged relations can only examine which relation among the six target relations is appropriate for a given link. The classifier cannot examine the possibility that the given link should be labeled with a relation outside the six target relations. We can examine that possibility by classifying the corresponding reversed link of the given link. When the second trained system classified both links as before, I compared the confidence scores of the two predictions and chose the prediction with the higher score. If the before of e2 ==?==> e1 has a higher score than the before of e1 ==?==> e2, the final classified relation of e1 ==?==> e2 is after. This
system (6-to-11 ) was trained on training data with six merged relations and evaluated on test data with eleven relations. The 6-to-11 system showed almost a 16%
drop in performance compared to 6-to-6-w-source-distinction at 43.74%. As a baseline system, I used the majority relation. The baseline achieved 40.70% accuracy. So
6-to-11 was only 3% higher than the baseline. But the evaluation accurately reflects
the current status of a temporal relation classification system using six merged
relations on temporal links between events and clearly illustrates the boosted performance from the application of six merged relations to the evaluation data.
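The final-selection step just described can be sketched as follows; the classifier interface (predict_with_score), the reversed() helper, and the INVERSE map are illustrative assumptions rather than the actual implementation, and the rule generalizes the before/before example above to any pair of predictions.

INVERSE = {"before": "after", "after": "before", "ibefore": "iafter", "iafter": "ibefore",
           "includes": "is_included", "is_included": "includes", "begins": "begun_by",
           "begun_by": "begins", "ends": "ended_by", "ended_by": "ends",
           "simultaneous": "simultaneous"}

def classify_with_eleven_relations(clf, link):
    # classify the link and its reversed link with the six-relation classifier
    rel_fwd, score_fwd = clf.predict_with_score(link)
    rel_rev, score_rev = clf.predict_with_score(link.reversed())
    # keep the higher-scoring prediction; a winning reversed prediction is inverted
    if score_rev > score_fwd:
        return INVERSE[rel_rev]
    return rel_fwd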
The final exploration on the application of six merged relations was to see if a
system that is trained on six merged relations is more effective than a system
trained on eleven relations. I implemented a comparison system (11-to-11 ) using
Chambers’ features and the links in the training and test data of 6-to-11. I did not
apply the merging to the training or test data. Therefore, the links for 11-to-11
were identical to the original annotated links. The only modification that I applied
was the merge of during, during_inv, and identity into simultaneous. This system
achieved accuracy of 47.59, almost 4% higher than the 43.74 of 6-to-11. This comparison shows that the application of six merged relations is not effective in the
temporal relation classification task using temporal links between events.
I carried out one final piece of analysis and evaluated how many incorrectly classified relations had been misclassified as their inverse relations. The evaluation
showed how much performance can be boosted by merging inverse relations. I collected all misclassified relations where the inverse relation was the correct relation;
for example, where before was misclassified as after. This type of misclassification
accounted for 27% of all misclassifications. If the set of target relations only contains the six merged relations, these errors go undetected. I counted the misclassifications as correct and measured accuracy. The accuracy was 59.11%, similar to
the accuracy of 6-to-6 w Source Distinction. This is further evidence of boosted
performance using six merged relations.
In short, the 11-to-11 system achieves the best performance on the real world task
of classifying eleven target relations. The process of selecting a final classified relation by simply basing the decision on the raw classifier scores in 6-to-11 seems to
lower the overall performance significantly. In order for 6-to-11 to get better performance than 11-to-11, the selection process needs to be improved.
3.3.5
Final Remarks on This Experiment
This experiment showed that a method using six merged relations is not effective
although this has been standard practice since Mani et al. (2006). In order to show
why the performance of the normalized method is not achievable in a real-world
application, I first demonstrated that the evaluation process in previous studies
led to artificially boosted performance. I replicated an existing system that had
reported results using this evaluation methodology and showed how the performance dropped when the correct evaluation was carried out. The replicated system
showed a drop of around 20% in performance when a correct evaluation was carried
out. Next, I compared a classifier trained on six merged relations with a classifier
trained on eleven relations. In the comparison, the classifier trained on eleven relations showed a 4% improvement in performance over the classifier trained on six
merged relations. But the performance of both classifiers was less than 50%. Therefore, much more improvement is required.
3.4
3.4.1
Temporal Relation Classification Using Boundary Relations
Motivations
Although the application of normalization in the style of Mani et al. (2006) could not lead to improved performance, as shown in the prior section, there are
some potential alternative normalizations that might help to reduce the problem of
sparse data in the temporal relation classification domain. As noted in Chapter 2,
TimeML interval relations can be decomposed into boundary relations. I propose
a new approach to temporal relation classification that uses the boundaries of
temporal entities. The method is based on the fact that TimeML relations can
be reduced to three boundary relations: before, equal, and after. It would appear
conceivable that learning these boundary relations would be, in some sense, easier
because a classifier has only three target relations instead of eleven relations. This
method requires four relation classifiers for four pairs of boundaries of a temporal
link and each classifier has only three target boundary relations instead of eleven
TimeML relations. The four classified relations need to be combined in order to
restore the temporal relation of a temporal link.
In this experiment, I first split temporal links that have been annotated with
TimeML relations into four links with boundary relations. Second, I trained classifiers on the boundary links. Third, I collected confidence scores from each classifier
for each of the three boundary relations of a link. Fourth, I collected all twelve confidence scores (three boundary relations for each of the four boundary links). Finally, I recovered a
TimeML relation using the collected relations with confidence scores. In recovering
an interval relation, I applied a heuristic using scores. The score-based heuristic
selected the combination of four boundary relations that has the highest sum and
that can be converted into an interval relation. I experimented with two machine
learning classifiers, kNN and Support Vector Machine (SVM), to produce the
probabilities of boundary relations.
In the following sections, I first describe the resources used in the experiment. In
the descriptions, I explain how a temporal link can be split into four boundary links
and which features are used. I then describe the score-based heuristics. Finally, I
report and discuss the results of the new method.
3.4.2
Resources and Data Preparation
The training and test data used in this experiment were identical to the data for
the experiment described in Section 3.2 in order to compare performance. I decomposed temporal links into four boundary links using the fact that an interval entity
(e1 ) consists of two boundaries: start (e1− ) and end (e1+ ) boundaries.
I divided each temporal link into four links between boundaries, using the four combinations of the start and end boundaries of the two temporal entities in the link.
Figure 3.4: Conversion of an interval link to four boundary links. The interval link e1 ==includes==> e2 is converted into the boundary links e1− ==before==> e2−, e1− ==before==> e2+, e1+ ==after==> e2−, and e1+ ==after==> e2+.
TimeML Link           Start-Start   Start-End   End-Start   End-End
x before y            before        before      before      before
x after y             after         after       after       after
x ibefore y           before        before      equal       before
x iafter y            after         equal       after       after
x begins y            equal         before      after       before
x begun_by y          equal         before      after       after
x ends y              after         before      after       equal
x ended_by y          before        before      after       equal
x includes y          before        before      after       after
x is_included y       after         before      after       before
x simultaneous y      equal         before      after       equal

Table 3.8: A temporal link decomposition table. − and + represent start and end boundaries; each boundary column gives the relation of the corresponding boundary of x to that of y (e.g., Start-End is the relation of x− to y+).
EVENT-EVENT
End pairs      before        equal         after
start-start    1621 (39%)    1443 (35%)    1115 (27%)
start-end      3406 (82%)      38 (1%)      735 (18%)
end-start      1239 (30%)      49 (1%)     2891 (69%)
end-end        1650 (39%)    1472 (35%)    1057 (25%)

EVENT-TIMEX
End pairs      before        equal         after
start-start     327 (15%)     275 (12%)    1649 (73%)
start-end      2162 (96%)       3            86 (4%)
end-start       160 (7%)        2          2089 (93%)
end-end        1680 (75%)     309 (14%)     262 (12%)

EVENT-DCT
End pairs      before        equal         after
start-start    1144 (62%)      85 (5%)      605 (33%)
start-end      1664 (91%)       1           169 (9%)
end-start       721 (39%)       0          1113 (61%)
end-end        1156 (63%)      81 (4%)      597 (33%)

Table 3.9: Distribution of boundary relations
For example, entity1 ==includes==> entity2 was divided into the following four links: entity1− ==before==> entity2− (start-start), entity1− ==before==> entity2+ (start-end), entity1+ ==after==> entity2− (end-start), and entity1+ ==after==> entity2+ (end-end), as shown in Figure 3.4. The decomposition was conducted based on the groups in Table 3.8.
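The decomposition in Table 3.8 can be written directly as a lookup table; the sketch below is only a restatement of the table, with illustrative function and variable names.

DECOMPOSE = {
    # relation: (start-start, start-end, end-start, end-end)
    "before":       ("before", "before", "before", "before"),
    "after":        ("after",  "after",  "after",  "after"),
    "ibefore":      ("before", "before", "equal",  "before"),
    "iafter":       ("after",  "equal",  "after",  "after"),
    "begins":       ("equal",  "before", "after",  "before"),
    "begun_by":     ("equal",  "before", "after",  "after"),
    "ends":         ("after",  "before", "after",  "equal"),
    "ended_by":     ("before", "before", "after",  "equal"),
    "includes":     ("before", "before", "after",  "after"),
    "is_included":  ("after",  "before", "after",  "before"),
    "simultaneous": ("equal",  "before", "after",  "equal"),
}

def decompose(relation):
    # four boundary links for one annotated interval relation
    ss, se, es, ee = DECOMPOSE[relation]
    return {"start-start": ss, "start-end": se, "end-start": es, "end-end": ee}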
The decomposition step generated four groups of boundary links: links between a
pair of starts, between a pair of ends, from a start to an end, and from an end to
a start. I report the distribution of the three boundary relations in each group in
Table 3.9.
During feature construction, I adopted the same features used in the first experiment in Section 3.2. The features have proven useful in classifying TimeML
relations. The features still seem helpful for boundary relation classification. For
example, past and present tenses of two consecutive events are a clue to predicting
that the start and end boundaries of present tense events are probably after the
boundaries of past tense events.
3.4.3
Score-based Restoration Method
In classifying the TimeML relation of a temporal link using boundary links, two
steps are required: classifications of boundary relations on four boundary links
and the restoration of a TimeML relation from four classified boundary relations.
In the classification step, I examined two machine learning classifiers, k-Nearest
Neighbor (kNN) and Support Vector Machine (SVM), and predicted the following
three relations: before, equal, and after on four boundary pairs of a link.
After four boundary relations have been classified, the classified relations need to
be restored into one of the eleven TimeML relations. Because the four classifiers
are independent, however, there is no guarantee that a TimeML relation will be
generated from the four classified boundary relations. When the restoration failed, I
applied a heuristic called Highest Scores in order to restore the target relation.
Highest Scores uses the sums of the confidence scores from each classifier when
deciding a final target relation. Each of the four classifiers generates confidence
scores for the three relations (before, equal, and after). There are 81 (3^4) possible
combinations of the relations. Of the 81 possible combinations, only 11 combinations are mapped to qualitative TimeML relations. From the eleven combinations,
the combination with the highest sum of scores is chosen as the final classified relation. When more than one candidate with the same sum of scores existed, I randomly selected one as the final relation.
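Because the eleven valid combinations are exactly the rows of Table 3.8, the Highest Scores heuristic amounts to scoring those eleven rows and taking the best one. The sketch below (reusing the hypothetical DECOMPOSE table above, and ignoring tie-breaking) is one way to write it, not the thesis implementation; applied to the scores in example (11) below, it returns is_included with a sum of 2.667.

def highest_scores(scores):
    # scores: {"start-start": {"before": p, "equal": p, "after": p}, ...}
    best_rel, best_sum = None, float("-inf")
    for rel, (ss, se, es, ee) in DECOMPOSE.items():
        total = (scores["start-start"][ss] + scores["start-end"][se]
                 + scores["end-start"][es] + scores["end-end"][ee])
        if total > best_sum:
            best_rel, best_sum = rel, total
    return best_rel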
To illustrate by means of an example, let us suppose the machine learning classifiers
for boundary relations generate the predictions and probabilities in (11).
(11) Example of classified boundary relations
     start-start   before: 0.333, equal: 0, after: 0.667
     start-end     before: 0.333, equal: 0, after: 0.667
     end-start     before: 0.333, equal: 0, after: 0.667
     end-end       before: 1,     equal: 0, after: 0
There is no TimeML relation corresponding to the highest-scoring sequence: after for start-start, after for start-end, after for end-start, and before for end-end. The next candidates, with the highest remaining score (2.667), are <after for start-start, before for start-end, after for end-start, before for end-end>, <after for start-start, after for start-end, before for end-start, before for end-end>, and so on. Among these candidates, only <after for start-start, before for start-end, after for end-start, before for end-end> corresponds to one of the eleven relations, namely is_included. Therefore, this combination is selected and the final classified relation is is_included.
3.4.4
Results
I report two results. First, I present the performance of each of the four boundary
relation classifiers. Their performance is the basis for being able to extract a
TimeML relation. I compare the performance of the trained boundary classifiers
with majority class baseline systems in Table 3.9. Second, I examine the results of
the restored TimeML relations after the application of the restoration heuristic. For
the comparison of the results of the restored TimeML relations, I present the best
systems and majority-class baseline systems from Section 3.2.
Classifying the boundary relations of temporal links between an event and DCT
showed over 70% accuracy, as shown in Table 3.10. The baseline systems achieved
0.671, 0.916, 0.580, and 0.665 in start-start, start-end, end-start, and end-end.
event-DCT                 kNN    SVM
start-start  before       0.802  0.811
             equal        0.041  0.025
             after        0.480  0.473
             Accuracy     0.709  0.720
             Baseline     0.671
start-end    before       0.954  0.949
             equal        N/A    N/A
             after        0.189  0.086
             Accuracy     0.914  0.903
             Baseline     0.916
end-start    before       0.679  0.699
             equal        N/A    N/A
             after        0.799  0.832
             Accuracy     0.759  0.788
             Baseline     0.580
end-end      before       0.832  0.789
             equal        0.042  0.033
             after        0.565  0.656
             Accuracy     0.738  0.730
             Baseline     0.665

Table 3.10: Results of boundary relation classifiers for the event-DCT link type. kNN and SVM mean k-Nearest Neighbors and Support Vector Machine.
The majority boundary relations are before, before, after, and before. The performance of this system is better than that of the baseline system except for start-end. The before relation of start-end in event-DCT is very high at 0.91. But this
system using kNN shows almost identical performance to the baseline system of
start-end at 0.914. Next, I analyze how good the classifiers are on each relation
using F-measures. The analysis shows that the classifiers using kNN achieved high
F-measure scores in the relations corresponding to majority classes: 0.802 (before),
0.954 (before), 0.799 (after ), and 0.832 (before) in start-start, start-end, end-start,
and end-end respectively. But this system was very poor at classifying the equal
relation because the relation had very little training data.
event-event               kNN    SVM
start-start  before       0.563  0.288
             equal        0.385  0.328
             after        0.439  0.465
             Accuracy     0.488  0.392
             Baseline     0.508
start-end    before       0.859  0.758
             equal        0.000  0.000
             after        0.373  0.262
             Accuracy     0.768  0.634
             Baseline     0.688
end-start    before       0.384  0.156
             equal        0.186  0.236
             after        0.782  0.824
             Accuracy     0.679  0.712
             Baseline     0.658
end-end      before       0.510  0.255
             equal        0.393  0.327
             after        0.448  0.509
             Accuracy     0.464  0.415
             Baseline     0.505

Table 3.11: Results of boundary classifiers for the event-event link type
Classifying boundary relations of temporal links between a pair of events showed
the lowest performance of all boundary relation classifiers, as presented in
Table 3.11. Baseline systems achieved 0.508, 0.688, 0.658, and 0.505 in start-start, start-end, end-start, and end-end. This system achieved lower performance
than the baseline systems in start-start and end-end. The start-start and end-end
systems achieved accuracies lower than 50%. The analysis on each relation using
F-measures shows that this system had difficulty in classifying every relation in
start-start and end-end. The highest performance was 0.563 (before) and 0.510
(before) in start-start and end-end.
event-time                kNN    SVM
start-start  before       0.242  0.076
             equal        0.235  0.093
             after        0.838  0.849
             Accuracy     0.729  0.742
             Baseline     0.689
start-end    before       0.959  0.920
             equal        N/A    N/A
             after        0.103  0.038
             Accuracy     0.922  0.852
             Baseline     0.739
end-start    before       0.160  0.054
             equal        N/A    N/A
             after        0.949  0.962
             Accuracy     0.904  0.928
             Baseline     0.773
end-end      before       0.848  0.864
             equal        0.294  0.000
             after        0.227  0.000
             Accuracy     0.746  0.761
             Baseline     0.642

Table 3.12: Results of boundary classifiers for the event-time link type
This system achieved the best accuracies when classifying boundary relations on
links between an event and a time expression as shown in Table 3.12. The performance of this system was 0.742 (start-start), 0.922 (start-end), 0.928 (end-start),
and 0.761 (end-end) and was considerably better than the performance of baseline
systems. The baseline systems achieved 0.689 (start-start), 0.739 (start-end), 0.773
(end-start), and 0.642 (end-end). The analysis on each relation shows that the performance of this system was very skewed. This system made better classifications
on after of start-start, before of start-end, after of end-start, and before of end-end.
                               kNN    SVM
Event-DCT    boundary method   0.591  0.549
             interval method   0.597  0.582
             baseline          0.369
Event-Event  boundary method   0.306  0.152
             interval method   0.389  0.350
             baseline          0.287
Event-Timex  boundary method   0.631  0.474
             interval method   0.583  0.673
             baseline          0.677

Table 3.13: Results of the boundary relation method on TimeML relations after combining classified boundary relations
Finally, I analyze how good this proposed temporal relation classification system
was in classifying TimeML relations. I evaluated the accuracy of the restored relations from boundary relations after the application of the heuristic (boundary
method in Table 3.13). For comparison, I provide two systems from Section 3.2.
The first comparison system is the best performing system that was trained on data
with eleven relations (interval method in Table 3.13). The other comparison system
is the baseline system using majority class (baseline in Table 3.13). The comparison
examines whether this complex approach using boundary relations is better than
the simpler approach.
This system using boundary relations achieved 0.591, 0.306, and 0.631 on event-DCT, event-event, and event-time, respectively. The performances of the systems
using boundary relations were slightly lower than the comparison systems using
all qualitative TimeML relations from Section 3.2. The results show that the systems using boundary relations do not achieve significantly better results when compared to the naive approach using eleven relations. Nevertheless, the systems using
boundary relations still show better performances than the baselines in event-DCT
and event-event.
3.4.5
Final Remarks on This Experiment
In this experiment, a new method for the temporal relation classification task was introduced in order to overcome some of the difficulties of that task to date. The method calculated confidence scores for the boundary relations of a temporal link. The relations with confidence scores were combined into a
TimeML relation using the proposed score-based heuristic. Each boundary relation
classifier showed higher performance (over 70% accuracy) than the baseline systems
of event-DCT and event-time links, as shown in Table 3.10 and Table 3.12. When
a TimeML relation was recovered, the new method showed similar performance to
classifiers trained on eleven relations. The similar performance indicates that the
various methods yield essentially the same quality of classification with the kinds of
features that we have considered as part of our machine learning experiments when
we use eleven qualitative TimeML relations as target relations.
Chapter 4
Detection of Inconsistent Temporal Relations
4.1
Introduction
Classified temporal relations in a document can cause violations of transitive constraints because predictions of machine learning classifiers on temporal links are
not perfect. The existence of violations among annotated temporal relations in the
TimeBank and AQUAINT corpora was verified using a modified version of the
Alembic NLP system (Pustejovsky et al., 2003b) when the relations were annotated.
When classified relations have a conflict among them because of a violation of transitive constraints, the conflict indicates the existence of misclassified relations (i.e.,
errors) in the classified relations. When we run TCPA on Figure 4.1, which has a
misclassified relation highlighted in grey, TCPA detects a conflict and stops. When
we try to find and correct the misclassified relation causing the conflict, we need
to find the misclassified relation among all the classified relations. Therefore, all
classified relations are candidates for the misclassified relation.
The correction of misclassified relations can improve the performance of a temporal relation identification system. Moreover, the correction step will still be useful
even if we improve the performance of a temporal relation classification system by
adding more training data or developing new features. If the improved classification
system cannot generate perfectly correct classified relations, we can still apply the
correction step after classifying relations because misclassified relations can cause
Figure 4.1: Classified relations with an error; grey-colored relations are errors. bef, aft, iaf, iin, and sim mean before, after, iafter, is_included, and simultaneous.
violations. In spite of the benefit of the correction task, the number of candidates
for misclassified relations is a factor that makes the task difficult. When we can
find a minimal set that causes the detected conflict, only the relations in the minimal set are candidates for misclassified relations. Reducing the number of candidates
makes the correction task easier than it would be if all links are used as candidates.
In Figure 4.1, the minimal set is c ==before==> b, b ==simultaneous==> a, c ==is_included==> d, and d ==iafter==> a. The first two links, c ==before==> b and b ==simultaneous==> a, lead to c ==before==> a, but the other links, c ==is_included==> d and d ==iafter==> a, lead to c ==iafter==> a. The two derived relations, c ==before==> a and c ==iafter==> a, are in conflict with each other. After
finding the minimal set, the candidates for the misclassified relation are only the
four relations identified, not the entire set of six relations. When we have many
classified relations and find a minimal set causing a conflict, reducing the number
of candidates is very beneficial for finding and correcting misclassified relations. For
example, if we have 100 links with classified relations and extract a minimal set
composed of four relations, the candidates for misclassified relations are reduced
to only four relations. The step to identify relations that cause a conflict greatly
reduces the search space.
In addition to the benefit of reduction of the search space, extracting the maximal number of minimal sets allows us to figure out the minimum number of misclassified relations. If we find two minimal sets that cause conflicts after running
the extraction process and the minimal sets consist of three and four links respectively, the seven relations are candidates for at least two misclassified relations.
This chapter discusses the problem of extracting the maximal number of minimal sets (the Maximum Extraction of Inconsistent Sets problem).
Previous studies (Tatu and Srikanth, 2008; Chambers and Jurafsky, 2008;
Yoshikawa et al., 2009) have tried other methods for resolving conflicts. These
methods are computationally complex. A method of Tatu and Srikanth (2008) was
to find the set of consistent relations that had the largest sum of confidence scores
without identifying conflicting relations. But it is almost impossible to find this
set among all possible candidate sets due to the computational complexity. If we
have ten temporal links and use all eleven qualitative relations as target relations,
1011 candidates exist, and we still need to run TCPA, which has O(n3 ) complexity
on each candidate. Whenever one additional link is added, the number of candidates increases exponentially. Therefore, it seems computationally impossible to
search for a consistent set with the largest sum of confidence scores. Recognizing
this, Tatu and Srikanth (2008) used only the top three relations with the highest
scores among six merged target relations to find a consistent set, and Chambers
and Jurafsky (2008) and Yoshikawa et al. (2009) adopted the Statistical Relation
Learning algorithms, lpsolve,1 and markov thebeast,2 in order to find an optimized
solution instead of searching for all possible candidates.
In this thesis, I devise a stepwise system to restore a consistent temporal structure from classified relations with conflicts. The computational difficulty in reconstructing consistent temporal relations among all possible candidate sets leads
to this stepwise approach. In addition, the technical difficulty in applying Statistical Relation Learning methods to this restoration task that will be discussed
in Section 4.2 is another reason to devise this stepwise approach. The stepwise
approach, first, extracts classified relations that cause conflicts. Second, the reconstruction step resolves the conflicts among extracted relations using classified relations without a conflict. This chapter explains the method for extracting the set
of relations that are in conflict and distinguishing them from those that are nonconflicting (consistent). This step is necessary and important because we can reduce
the number of error candidates and the computational complexity. Identifying these
links in conflicts, which I will call the “extracted links”, is the first step on the road
to correcting the link that is, in some sense, the “cause” of the inconsistency.
In Section 4.2, I provide a proof that the task of extracting all sets of conflicting relations is NP-hard. After the proof, I introduce a related task that has been a popular research topic and for which several solutions have been proposed, and I explain why I do not use those solutions in this thesis and why I instead propose a heuristic-based algorithm for this task.
1 http://sourceforge.net/projects/lpsolve
2 http://code.google.com/p/thebeast/
Section 4.3 begins with the explanation of a newly devised method that extracts
conflicting relations. The extraction method is based on TCPA. I add to TCPA
some required functionalities that are described in Section 4.3.
In Section 4.4, I use the extraction method to empirically explain the failure of previous studies (Chambers and Jurafsky, 2008; Tatu and Srikanth, 2008; Yoshikawa
et al., 2009) to obtain improvements by applying transitive constraints. These
studies assumed that these logical constraints would lead to improvement because
the constraints allow only some sets of temporal relations without violations and
restrict the candidate space. Unfortunately, they achieved only a 4% improvement
after providing additional links with temporal relations. One explanation for the
disappointingly small improvement is the structure of the links in the TimeBank
and AQUAINT corpora. The structure of the links is such that transitive constraints cannot be used to effectively limit the candidate space for the links. The
experiment in Section 4.4 examines this explanation empirically using artificially generated misclassified relations. I apply the extraction method for identifying
inconsistent temporal relations from a set of relations to the documents from the
TimeBank and AQUAINT corpora. After the application of the extraction method,
I assess how many incorrectly classified temporal relations can in principle be ruled
out by transitive constraints. The assessment shows how much improvement we can
expect when we apply the transitive constraints to the current temporal corpora. In
Section 4.5, I analyze the characteristics of the extracted inconsistent temporal relations using graphical measures.
In the second experiment, which is presented in Section 4.6, I apply the extraction
method to the real data from Section 3.2. The artificial errors of the first experiment can miss error patterns that a classifier makes. This experiment tests whether
the results of the first experiment using artificial errors are valid in the real world.
4.2
NP-hardness of Extracting Inconsistent Sets
I show that the problem of extracting all sets of conflicting relations (i.e., Extraction of Inconsistent Sets problem) is an NP-hard problem, which means “at least as
hard as any NP-problem (nondeterministic polynomial time problem)” (Kleinberg
and Tardos, 2005). As I discussed in the introduction of this chapter, the extraction
of all sets of conflicting relations is beneficial in getting a consistent set of temporal
relations. But if the problem is NP-hard, it is impossible to solve it in a computationally efficient way. The proof of the NP-hardness of the problem will justify why
in this thesis I propose a heuristic method in order to solve the problem, which is
to extract all sets of conflicting relations. A new problem can be proved to be NP-hard when a solution to the new problem can be converted into a solution to a known NP-hard problem. The start of the proof is to convert classified temporal relations into a
form of a known NP-hard problem. I use SAT (satisfiability problem), which is
a known NP-hard problem (Kleinberg and Tardos, 2005), in my proof of the NP-hardness of the Extraction of Inconsistent Sets problem. The SAT problem is to
decide if there is a truth value assignment to a given formula in conjunctive normal
form (CNF) that makes the given formula true (i.e., the CNF formula is satisfiable).
A formula in conjunctive normal form is a conjunction of clauses (connected by logical ANDs), where each clause is a disjunction (logical OR) of boolean variables; for example, (p ∨ q) ∧ (r ∨ s ∨ q), where p, q, r, and s are boolean variables and the parenthesized formulas, such as p ∨ q, are clauses. After the conversion of classified relations into a CNF formula, I can prove the NP-hardness of the Extraction of
Inconsistent Sets problem using a reduction that I will explain in detail later.
Pham et al. (2006) showed how to convert a constraint propagation network and
transitive constraints into a formula in CNF.

Figure 4.2: Classified relations with an error; grey-colored relations are errors. bef, aft, iaf, iin, and sim mean before, after, iafter, is_included, and simultaneous.
Because classified temporal relations can also be represented in a constraint propagation network, classified temporal relations can be converted into a formula in CNF. For the conversion to a CNF formula, Pham et al. (2006) defined a boolean variable, x^r_ij, that is true iff the relation r is assigned to a link from i to j (e.g., i ==r==> j). In Figure 4.2, relations on bold links are classified relations and relations on dashed links are implicit relations, which are all possible qualitative relations. Relations on links can be represented using the boolean variable. For example, we can convert b ==before==> a and b ==all==> d into (x^before_ba) and (x^before_bd ∨ x^after_bd ∨ ... ∨ x^simultaneous_bd).
When checking whether a given formula in CNF is true (i.e., whether the formula is satisfiable), only one relation is allowed per link. Without this condition, the conflicting relations in Figure 4.2 would still be satisfiable: the links between c and b and between b and a derive c ==before==> a, and the links between c and d and between d and a derive c ==iafter==> a. As a result, both before and iafter would be assigned to the link from c to a, yet neither relation is consistent with all of the classified links. The condition of one relation per link blocks such cases. The condition is also added using boolean variables when disjunctive relations are allowed for a link, such as c ==all==> a. The link is converted into the form ∧_{u,v ∈ relations} (¬x^u_ij ∨ ¬x^v_ij), derived from ∧_{u,v ∈ relations} (x^u_ij → ¬x^v_ij). Therefore, a link that allows all eleven qualitative relations, such as c ==all==> a, is converted into a conjunction of 55 (= 11 × 10 / 2) clauses, each with a pair of relations, such as (¬x^before_ca ∨ ¬x^after_ca), (¬x^before_ca ∨ ¬x^ibefore_ca), (¬x^after_ca ∨ ¬x^ibefore_ca), etc.
In addition to the representations of a given network and the condition of one relation per link, transitive constraints also need to be added as formulas in CNF. Each transitive constraint, such as ((i ==before==> k) ∧ (k ==after==> j)) → (i ==all==> j), is converted into ¬x^before_ik ∨ ¬x^after_kj ∨ x^before_ij ∨ x^after_ij ∨ ... ∨ x^simultaneous_ij, derived from (x^before_ik ∧ x^after_kj) → (x^before_ij ∨ x^after_ij ∨ ... ∨ x^simultaneous_ij). For a set of three temporal entities, 121 (eleven qualitative relations × eleven qualitative relations) constraint formulas are added. For example, when ten temporal entities are given, we get 120 (= 10 × 9 × 8 / (3 × 2 × 1)) sets of three transitive entities. Therefore, the final number of formulas for transitive constraints will be 120 × 121.
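The counts above can be reproduced with a short sketch; the clause representation here is an illustrative string encoding, not a real SAT encoding used in the thesis.

from itertools import combinations
from math import comb

RELATIONS = ["before", "after", "ibefore", "iafter", "begins", "begun_by",
             "ends", "ended_by", "includes", "is_included", "simultaneous"]

def one_relation_per_link_clauses(i, j):
    # one clause (~x^u_ij | ~x^v_ij) per unordered pair of relations
    return [f"(~x[{i},{j},{u}] | ~x[{i},{j},{v}])" for u, v in combinations(RELATIONS, 2)]

print(len(one_relation_per_link_clauses("c", "a")))  # 55 = 11 * 10 / 2
print(len(RELATIONS) ** 2)                           # 121 constraint formulas per triple
print(comb(10, 3))                                   # 120 triples for ten entities
print(comb(10, 3) * len(RELATIONS) ** 2)             # 14520 = 120 * 121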
I have shown how to convert classified relations into a formula in CNF using the
method of Pham et al. (2006). Now I prove that this problem is NP-hard using a
reduction. The reduction is a way to solve a problem by transforming the problem
into another problem that we can solve. Let’s suppose that we have two problems,
α and β, and that we can solve β. If we can transform α into β, we can solve α
using a solution for β. We can use the reduction to prove that a new problem is NP-hard when an existing NP-hard problem can be transformed into the new problem.
Let’s suppose that an NP-hard problem (α) and a new problem (β) are given, and
that α can be transformed into β. We can use contradiction for the proof that the
new problem is also NP-hard. Let’s suppose that the new problem, β, is easily solvable using an algorithm. We can easily solve the α problem using the algorithm
for β after transforming α into β. This leads to a conclusion that the α problem is
easily solvable. But this conclusion contradicts the fact that the α problem is NP-hard. Therefore, our presupposition that the β problem is easily solvable is wrong,
and we can conclude that the new problem (β) is NP-hard.
I use the SAT problem in the reduction. The SAT problem is known as an NP-hard
problem. We can convert the SAT problem into a problem of Maximum Extraction of Inconsistent Sets (MEIS) by adding clauses in CNF. The added clauses
should have one characteristic: the clauses do not share any variable or predicate with the SAT problem. Then the added clauses do not cause a conflict with clauses
in SAT and the added clauses can only cause a conflict with each other. When a
solution of MEIS extracts all inconsistent sets from the extended clauses and the
extracted sets include only clauses from the added clauses, we can say the original
SAT problem is satisfiable. If the extracted sets include clauses from the original
SAT problem, we can say the SAT problem is not satisfiable. In order to check the
inclusion of clauses from the original SAT problem in the extracted inconsistent
sets, we need a loop with a nested loop, because we need to check whether each clause in the original SAT problem exists in the extracted sets. The procedure that checks the inclusion of clauses from the original SAT problem in the extracted inconsistent sets is O(n^2), due to the nested loop. The extraction of all inconsistent sets should be easily solvable according to the presupposition. In addition, the
procedure that checks the inclusion of the original SAT clauses can be easily conducted through the embedded loop. Through the two easy solutions, we can easily
solve a given SAT problem. But it is known that the SAT problem is an NP-hard
problem. This contradicts the conclusion. Therefore, the MEIS problem cannot be
easily solved.
Theorem 4.2.1 The problem that extracts all inconsistent sets is an NP-hard
problem.
Proof.
Let us suppose that inconsistent set extraction is easily solvable. The SAT problem is a known NP-hard problem. We can convert the SAT problem into a problem of extracting all inconsistent sets by adding clauses that share no variable or predicate with the original problem. Then we can easily solve the converted SAT problem. But the conclusion that the converted SAT problem is easily solvable conflicts with the fact that the SAT problem is NP-hard. This contradiction leads to the conclusion that the problem of extracting all inconsistent sets is NP-hard.
The extraction of all conflicting temporal relations can be said to be a task identical to finding a subset of temporal relations without a conflict, because temporal
relations remaining after the extraction are consistent. A problem that is similar to
this task is the Maximum Satisfiability problem (MAX-SAT) for a formula in CNF.
The MAX-SAT problem is to find an assignment of boolean values to boolean variables that maximizes the number of clauses that become true by the assignment
(i.e., the clauses are satisfied) (Argelich et al., 2008). The MAX-SAT problem is
known as an NP-hard problem, and its popularity is evidenced by the annual
MaxSAT Evaluations.3
In the case of the graph with six links in Figure 4.1, the removal of one among
the three links between c and b, between c and d, and between b and a makes the
network consistent. There are three consistent sets with the maximal number of
clauses, because the removal of any link among these three links makes the five
remaining links consistent. For example, the removal of c ==before==> b makes the remaining five links consistent. Therefore, the task of MAX-SAT can be said to be to find a set among the three sets. Analogous to MAX-SAT, the task of extracting

3 http://www.maxsat.udl.cat/

conflicting temporal relations can be said to be to find a maximum set that is consistent.
But one more condition is required. The maximal set should not include a relation
that causes a conflict. Due to the difference, I cannot use solutions for MAX-SAT in
this thesis. As we can see in the preceding example, making links consistent is not
enough for this study because errors that cause conflicts can exist in the resulting
consistent links. For example, after the removal of c ==before==> b in Figure 4.1, b ==simultaneous==> a, which is a misclassified relation, still exists in the resulting consistent links. The remaining errors can invoke other errors when we restore consistent links using the remaining links. After the removal of c ==before==> b, the remaining links b ==simultaneous==> a, d ==iafter==> a, and c ==is_included==> d derive c ==after==> b, which is a derived error. In order to avoid
this kind of derived error, I develop a method for MEIS in this thesis.
In addition, a technical difficulty that was found in implementations for MAX-SAT leads me to design an algorithm that uses heuristics. The task of finding a
truth assignment that maximizes the sum of weights of satisfied clauses in CNF
is also a necessary step in Statistical Relational Learning tools such as Markov
Logic Network (Niu et al., 2011a). Recently, a Markov Logic Network implementation, TUFFY (Niu et al., 2011a), which can handle large-scale data, was developed.
Before TUFFY, other MLN implementations were able to handle only small data
and even such small-scale data took hours to process (Niu et al., 2011a). As mentioned, the CNF version of a temporal propagation graph has lots of formulae that
are generated using transitive constraints. Therefore, an interesting future study
would be to find a consistent temporal structure using TUFFY. Because of the
technical difficulty of MLN implementations before TUFFY, this study proposes
a heuristic method to extract violations based on the Temporal Constraint Propagation Algorithm (Allen, 1983). The Temporal Constraint Propagation Algorithm
(TCPA) can be used to detect a conflict among classified relations. In order to find
relations that give rise to a detected conflict, I propose an extension to this algorithm.
4.3
Violation Extraction
I describe my violation extraction method in this section. Before the description of
the method, I explain how the Temporal Constraint Propagation Algorithm (Allen,
1983), which my violation extraction method is based on, works. Understanding the
TCPA is foundational to understanding how my extraction method works.
4.3.1
Temporal Constraint Propagation Algorithm
When temporal entities and some explicit relations between the entities are given,
Allen’s Temporal Constraint Propagation Algorithm, shown in Algorithm 1, uses
a directed graph to represent the given information. Temporal entities given are
represented as nodes of the graph, and explicit relations are represented as labels of
the edges of the graph. When two or more relations are possible between a pair of
entities, the label of the edge between the entities is represented as the disjunction
of the relations in the graph. Let us suppose a graph with three entities, such as e1,
e2, and e3 in Figure 4.3a. Temporal relations of links from e1 to e2 and from e3 to
e2 are before (bef), and after (aft) or is_included (iin), such as e1 ==bef==> e2 and e3 ==aft,iin==> e2.
The algorithm is based on three operations: complement, intersection, and composition operations. Complement is a unary operator. The complement operator generates an inverse relation set from a disjunctive relation set. When a link
 1  INPUT: Graph with Entities and Relations;
 2  OUTPUT: Propagated Relations;
 3  Graph = Normalization(Graph);
 4  NodesArray = collect nodes;
 5  QueueOfNodePairs = collect node pairs;
 6  while <i, j> = dequeue QueueOfNodePairs do
 7      foreach k in NodesArray do
 8          if k ≠ i and k ≠ j then
 9              RelationsOfItoJ = get relations from i to j;
10              RelationsOfJtoK = get relations from j to k;
11              RelationsOfItoK = get relations from i to k;
12              NewRelsOfItoK = Composition(RelationsOfItoJ, RelationsOfJtoK);
13              NewRelsOfItoK = Intersection(NewRelsOfItoK, RelationsOfItoK);
14              if NewRelsOfItoK = ∅ then
15                  return SignalOfConflict
16              end
17              if NewRelsOfItoK ≠ RelationsOfItoK then
18                  enqueue(QueueOfNodePairs, <i, k>);
19                  store NewRelsOfItoK as relations from i to k;
20                  NewRelsOfKtoI = Complement(NewRelsOfItoK);
21                  store NewRelsOfKtoI as relations from k to i;
22              end
23              RelationsOfKtoI = get relations from k to i;
24              RelationsOfItoJ = get relations from i to j;
25              RelationsOfKtoJ = get relations from k to j;
26              NewRelsOfKtoJ = Composition(RelationsOfKtoI, RelationsOfItoJ);
27              NewRelsOfKtoJ = Intersection(NewRelsOfKtoJ, RelationsOfKtoJ);
28              if NewRelsOfKtoJ = ∅ then
29                  return SignalOfConflict
30              end
31              if NewRelsOfKtoJ ≠ RelationsOfKtoJ then
32                  enqueue(QueueOfNodePairs, <k, j>);
33                  store NewRelsOfKtoJ as relations from k to j;
34                  NewRelsOfJtoK = Complement(NewRelsOfKtoJ);
35                  store NewRelsOfJtoK as relations from j to k;
36              end
37          end
38      end
39  end
Algorithm 1: Constraint propagation algorithm
Figure 4.3: Derivation example for Allen's algorithm. (a) Initial relations: e1 ==<==> e2 and e3 ==>,d==> e2. (b) Graph after normalization: the reversed links e2 ==>==> e1 and e2 ==<,di==> e3 and the unconstrained links e1 ==all==> e3 and e3 ==all==> e1 are added. (c) Derived relations with a composition operation: the link from e1 to e3 is constrained to < and its reverse to >.
has {after, is_included} as its label, the application of the complement operation to the label results in {before, includes (inc)}. Intersection is a binary operator between two relation sets. When two disjunctive relation sets are given, intersection selects the relations that overlap between the two sets. When the two relation sets given to the intersection operator are {before, is_included} and {after, is_included}, the operator produces {is_included}. Composition between two basic relations is a binary operator. A composition operation between two basic relations calculates the transitive relations of the two relations based on transitivity constraints (e.g., if a is before b and b is before c, then it follows that a is before c). For example, the composition operation of i ==before==> k and k ==is_included==> j derives i ==bef,ove,ibe,iin,beg==> j.4 Allen offered a transitivity table between two relations among thirteen basic relations. The composition operation between two basic relations uses the transitivity table in deriving the transitive relations of the two given basic relations. Composition between two disjunctive relation sets is based on the composition operation between two basic relations. The composition between two disjunctive relation sets first derives the transitive relations of each pair of basic relations from the two sets. The composition operation between two disjunctive sets then generates the union of the derived transitive relations as the final set. For example, when {before, is_included} and {after, is_included} of i ==bef,iin==> k and k ==aft,iin==> j are given, the compositions of pairs of atomic relations such as [before, after], [before, is_included], [is_included, after], and [is_included, is_included] are calculated, the union of the compositions being the final output. In the example of i ==bef,iin==> k and k ==aft,iin==> j, [before, after] yields all thirteen relations. Therefore, the output of the operation is a set with all relations, such as i ==all==> j.
4 ove, ibe,
and beg mean overlap, ibefore, and begins. The overlap in this example refers
to overlap in Allen’s thirteen relations. The overlap relation is not part of the fourteen
TimeML relations, as mentioned before.
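The three operations can be sketched over disjunctive relation sets as below; INVERSE and COMPOSE are small illustrative fragments rather than Allen's full tables, and the fallback to the full relation set for missing table entries is only a placeholder.

INVERSE = {"before": "after", "after": "before",
           "includes": "is_included", "is_included": "includes"}
COMPOSE = {("before", "before"): {"before"},        # fragment of the transitivity table
           ("before", "includes"): {"before"}}
ALL = {"before", "after", "includes", "is_included"}  # stand-in for all thirteen relations

def complement(rels):
    # inverse relation set of a disjunctive relation set
    return {INVERSE[r] for r in rels}

def intersection(rels1, rels2):
    # relations shared by two disjunctive relation sets
    return rels1 & rels2

def composition(rels1, rels2):
    # union of pairwise compositions; unknown pairs fall back to the full set
    out = set()
    for r1 in rels1:
        for r2 in rels2:
            out |= COMPOSE.get((r1, r2), set(ALL))
    return out

print(intersection(composition({"before"}, {"before", "includes"}), ALL))  # {'before'}
print(complement({"before"}))                                              # {'after'}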
TCPA begins with a normalization step that adds inverse relations and all relations
to a graph. When the graph in Figure 4.3a is given to TCPA, TCPA adds inverse
relations of explicit relations to the reversed links after adding reversed links of
the links with explicit relations. When there is no link between a pair of nodes,
TCPA adds a link and its reversed link between the pair, and the links have the
disjunction of all relations as their label. When the graph in Figure 4.3a is given
to TCPA, the reversed links e2 ==aft==> e1 and e2 ==bef,inc==> e3 of e1 ==bef==> e2 and e3 ==aft,iin==> e2 are added to the graph through the normalization step. The links e1 ==all==> e3 and e3 ==all==> e1 are also added because there is no link between e1 and e3. The graph
after the normalization step is the graph in Figure 4.3b.
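A sketch of the normalization step follows, with the graph stored as a nested dictionary {source: {target: relation set}}; this representation and the inverse argument (the complement operation applied to a relation set) are assumptions for illustration, not the thesis implementation.

def normalize(graph, nodes, all_relations, inverse):
    for i in nodes:
        for j in nodes:
            if i == j:
                continue
            has_ij = j in graph.get(i, {})
            has_ji = i in graph.get(j, {})
            if has_ij and not has_ji:
                # add the reversed link carrying the inverse relation set
                graph.setdefault(j, {})[i] = inverse(graph[i][j])
            elif not has_ij and not has_ji:
                # no link in either direction: add both with all relations
                graph.setdefault(i, {})[j] = set(all_relations)
                graph.setdefault(j, {})[i] = set(all_relations)
    return graph

# Figure 4.3a as input: e1 --bef--> e2 and e3 --aft,iin--> e2; after normalization
# the graph also contains e2->e1, e2->e3, and the e1<->e3 links labeled with all relations.
graph = {"e1": {"e2": {"bef"}}, "e3": {"e2": {"aft", "iin"}}}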
After the normalization step, TCPA generates two arrays that have all nodes and
all pairs of nodes. In the outer loop of TCPA, which corresponds to the lines from 6
to 39 in Algorithm 1, TCPA iteratively gets a pair of intervals from the array with
all pairs of nodes. TCPA makes transitive nodes in the inner loop of the algorithm,
which corresponds to the lines from 7 to 38 in Algorithm 1, after iteratively getting
another node that is not in the pair from the array with all nodes.
Relations on the transitive nodes are constrained through composition, intersection, and complement operations, which correspond to the lines from 23 to 27 in Algorithm 1. The composition operation with e1 ==bef==> e2 and e2 ==bef,inc==> e3 derives e1 ==bef==> e3. The intersection of the original e1 ==all==> e3 and the derived e1 ==bef==> e3 makes e1 ==bef==> e3. After getting a constrained label for a link, TCPA also changes the label of its reverse link using the complement operation. The label of the reverse link from e3 to e1, which corresponds to the link from e1 to e3, accordingly becomes e3 ==aft==> e1.
When relations of a pair of nodes are constrained, the node pair is added to the end
of the array of interval pairs in order to propagate the constrained relations into
Figure 4.4: Graph with a conflict.
relations of other node pairs. During the iteration, if a pair of nodes gets an empty
set of relations as their label, TCPA terminates and a conflict exists in the graph.
When the array with pairs of nodes is empty, TCPA also terminates. If each pair of
nodes of a graph has at least one relation as its label at the termination of TCPA,
there is no conflict among the relations that were given to TCPA at the beginning.
TCPA gets the graph in Figure 4.3c from the initial graph in Figure 4.3a. We can
conclude that no conflict exists among the relations in the initial graph.
4.3.2
Violation Extraction Method
Every document in the TimeBank and AQUAINT corpora consists of a set of temporal entities and a set of temporal links between these entities. Each temporal link
has an explicitly marked temporal relation. The set of temporal links with explicitly
annotated temporal relations is typically only a subset of the set of pairs of temporal entities (meaning that many temporal entities stand in relations that aren’t
explicitly marked).
Allen’s Temporal Constraint Propagation Algorithm (TCPA) works by using transitivity constraints to successively reduce the number of possible temporal relations
compatible with the explicit relations in a document. For example, in Figure 4.4,
it is easy to see that c ==after==> a is implied by the explicit links c ==after==> b and b ==simultaneous==> a. An inconsistency is detected when there are no temporal relations between two temporal entities that are compatible with a set of specified links. In Figure 4.4, we see that there are no possible relations compatible with the set of explicit links given. On the one hand, c ==after==> b and b ==simultaneous==> a together entail c ==after==> a; on the other hand, c ==is_included==> d and d ==ibefore==> a together entail c ==before==> a. These
are incompatible. The three links and the “misclassified relation” (in grey) have
given rise to the incompatibility.
 1  INPUT: Explicit Links (EL) with n temporal entities;
 2  OUTPUT: Inconsistent Sets (IS);
 3  ExtractedLinks = None;
 4  while a new extracted link is added to ExtractedLinks do
 5      make an array of sets of three links;
 6      place sets without an extracted link at the front of the array;
 7      while TCPA(the array) finds a conflict do
 8          retrieve links that cause the conflict;
 9          remove additional links from the links;
10          extract conflicting links from the array;
11          remove conflicting links from the array;
12          if the conflicting links not in ExtractedLinks then
13              add the conflicting links to ExtractedLinks;
14          end
15      end
16  end
Algorithm 2: Violation extraction algorithm
The goal of this section is to describe and assess an algorithm for extracting the
minimal set of links that gives rise to this incompatibility (here c ==after==> b, b ==simultaneous==> a, c ==is_included==> d, and d ==ibefore==> a). In order to identify these kinds of
minimal sets of temporal links, I propose the algorithm in Algorithm 2. The basic
idea is that we store both the explicit links that have logical consequences (i.e.,
that reduce the number of relations that might hold between two entities) and a
92
c
bef
iin
b
d
iaf
sim
a
iin
bef
e
Figure 4.5: Classified relations with a misclassified relation
af ter
reference to the implicit link the explicit links have an effect on (e.g., c =====⇒ a
af ter
simultaneous
is derived from c =====⇒ b and b =====⇒ a). Once an inconsistency is found, the
set of all explicit links that contributed to the inconsistency is extracted and then
minimized.
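The bookkeeping idea can be sketched as follows; this is an illustration only, not the implementation used in this dissertation. Every time an implicit label is derived or tightened, the explicit links that supported it are recorded, so that the support sets can be returned as soon as some pair ends up with an empty label set. The `compose` argument stands in for a lookup in the transitivity table, here reduced to three relations, and the example reuses the shape of the four conflicting links in Figure 4.5 with substituted relations:

    def compose_sets(r1, r2, compose):
        out = set()
        for a in r1:
            for b in r2:
                out |= compose(a, b)
        return out

    def derive_with_provenance(explicit, compose):
        # explicit: {(i, j): relation}; compose(r1, r2) -> set of relations.
        # Returns the explicit links behind the first conflict, or None if consistent.
        labels = {pair: {rel} for pair, rel in explicit.items()}
        support = {pair: {pair} for pair in explicit}       # provenance per pair
        changed = True
        while changed:
            changed = False
            for (i, j), r_ij in list(labels.items()):
                for (j2, k), r_jk in list(labels.items()):
                    if j2 != j or k == i:
                        continue
                    derived = compose_sets(r_ij, r_jk, compose)
                    old = labels.get((i, k))
                    new = derived if old is None else old & derived
                    if not new:                              # conflict: collect sources
                        conflict = support[(i, j)] | support[(j, k)]
                        if old is not None:
                            conflict |= support[(i, k)]
                        return conflict
                    if old is None or new != old:
                        labels[(i, k)] = new
                        support[(i, k)] = (support.get((i, k), set())
                                           | support[(i, j)] | support[(j, k)])
                        changed = True
        return None

    TABLE = {("before", "before"): {"before"},
             ("before", "simultaneous"): {"before"},
             ("simultaneous", "before"): {"before"},
             ("simultaneous", "simultaneous"): {"simultaneous"},
             ("after", "after"): {"after"},
             ("after", "simultaneous"): {"after"},
             ("simultaneous", "after"): {"after"},
             ("before", "after"): {"before", "after", "simultaneous"},
             ("after", "before"): {"before", "after", "simultaneous"}}

    explicit = {("c", "b"): "before", ("b", "a"): "simultaneous",
                ("c", "d"): "simultaneous", ("d", "a"): "after"}
    print(derive_with_provenance(explicit, lambda a, b: TABLE[(a, b)]))
    # all four explicit links are returned: they jointly support the clash on (c, a)

Once a conflict's support set has been collected in this way, the minimization described later in this section can strip any extraneous links from it.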
The workings of the algorithm are easiest to see by example. For the case specified in Figure 4.5, the algorithm stores the fact that the implicit link c ==before==> a is created using c ==before==> b and b ==simultaneous==> a. When the TCPA detects a conflict, it extracts all the explicit relations that are marked as related to this implicit link, as these explicit relations are the source of the conflict. Here c ==iafter==> a (from c ==is_included==> d and d ==iafter==> a) contradicts the previously derived c ==before==> a. This extraction process identifies the following four links in Figure 4.5 as the sources of the conflict: c ==is_included==> d, d ==iafter==> a, c ==before==> b, and b ==simultaneous==> a, leaving the remaining "harmless" links, e ==before==> d and e ==is_included==> a. Since these identified links are in conflict, we would suspect that one of the four extracted links is a misclassified relation generated by a TRI system.

Figure 4.5: Classified relations with a misclassified relation
Figure 4.6: Graph with two unrelated conflicts
The example in Figure 4.5 shows that, at the moment a conflict is detected, the links involved in the triggering derivation may themselves contain no misclassified relation. When we manually follow a derivation sequence that causes a conflict and examine the links at the moment of conflict detection, this becomes clear: when c ==iafter==> a is derived from c ==is_included==> d and d ==iafter==> a and a conflict is detected, c ==is_included==> d and d ==iafter==> a do not have a misclassification. Therefore, it is a necessary step to extract all explicit relations that caused the detected conflict at the moment of conflict detection.
Of course, a set of explicit relations might have more than one conflict, and these
conflicts might exist in unrelated parts of the temporal graph. In order to extract
multiple sets of inconsistent links, we need to iterate this process to detect a conflict and to extract the relations that caused the conflict. In Figure 4.6 there are
two sets of conflicting links. To detect these, we run the extraction process in a
loop that corresponds to the lines from 7 to 15 in Algorithm 2. After the detection
of an inconsistency, the links that lead to the inconsistency are removed from the
set of links that was tested. The extraction of conflicting relations from the set of
relations and the subsequent rerun of the extraction process using the remaining
links are repeated until no more inconsistencies are found. Again returning to the
TRI case, we can take the number of extracted sets as a lower bound of the number
of misclassified relations generated by TRI. This means that there are at least two
misclassified relations in Figure 4.6.
This single loop is not enough for extracting multiple conflicts. In the case of Figure 4.7, for example, the extraction of conflicting links from a ==after==> b, b ==simultaneous==> c, a ==before==> d, and d ==before==> c makes it impossible to extract the conflict on the right side of the rectangle, which includes d ==before==> c. In order to extract multiple conflicts that have shared links, I add an additional loop. We extract multiple conflicts with shared links by adding the outer loop, which corresponds to lines 4 to 16 in Algorithm 2. The order in which sets of transitive links are processed is important for extracting multiple conflicts with shared links.
Figure 4.7: Graph with exclusively extractable conflicts

In Figure 4.7, if we failed to detect the conflict among f ==after==> c, e ==simultaneous==> f, e ==before==> d, and d ==before==> c before the detection of the conflict among a ==after==> b, b ==simultaneous==> c, a ==before==> d, and d ==before==> c, we could not extract the conflict on the right side of the rectangle. For example, if {d, e, f} are processed first, and {a, b, c} and {a, c, d} follow, the conflict in the left side of the rectangle will be detected. Therefore, I process a set of temporal links without the extracted temporal links first in order to handle this case. In Figure 4.7, we would process the three links on the right side (f ==after==> c, e ==simultaneous==> f, and e ==before==> d) if we had already extracted the temporal links that make up the left rectangle. I run the loop until there is no change in the extracted links. This modification has a limitation: if d ==before==> c is the only misclassified relation in Figure 4.7, the repetition will extract more links than necessary. Nevertheless, I choose to extract all conflicting links in spite of this issue in order to extract all possible candidates for misclassified relations.
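The ordering step ("place sets without an extracted link at the front of the array") can be sketched as a simple stable sort; the names below are illustrative:

    def order_triples(triples, extracted_links):
        # Sets of three links that contain no previously extracted link come first.
        return sorted(triples, key=lambda t: any(link in extracted_links for link in t))

    triples = [[("a", "b"), ("b", "c"), ("a", "c")],
               [("d", "e"), ("e", "f"), ("d", "f")]]
    print(order_triples(triples, extracted_links={("a", "b")}))
    # the (d, e, f) triple is processed first because it shares no extracted link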
This extraction method does not necessarily result in minimal inconsistent sets. In Figure 4.8, for example, a ==after==> b, b ==after==> c, and a ==ibefore==> c make a conflict without c ==includes==> d, because a ==after==> b and b ==after==> c derive a ==after==> c, not ibefore. If exactly these three links are not processed first, it is possible for an additional link to be extracted incidentally. Suppose, for example, that a ==ibefore==> c and c ==includes==> d first derive a ==before==> d; then a ==after==> b and b ==after==> d, which is derived from b ==after==> c and c ==includes==> d, derive a ==after==> d. This a ==after==> d conflicts with the previously derived a ==before==> d.

Figure 4.8: Extracted violation with an extra link
In order to preserve minimality, I remove additional links such as c ==includes==> d in
Figure 4.8 when the additional link relates to a temporal entity with only one
explicit connection (such as d). Because it is possible that more than one additional link is included in the extracted set, I repeat the removal of additional links
from the extracted links until the set of extracted links has no extraneous links.
The removal of extraneous links from the extracted links is done immediately after
extracting conflicting links in Algorithm 2.
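A sketch of this removal step, assuming the check counts connections within the extracted set only (entity and link names are illustrative):

    from collections import Counter

    def remove_extraneous(links):
        # Repeatedly drop links with an endpoint that appears in only one extracted
        # link, until every endpoint has at least two connections in the set.
        links = set(links)
        while True:
            degree = Counter(entity for link in links for entity in link)
            dangling = {link for link in links
                        if degree[link[0]] == 1 or degree[link[1]] == 1}
            if not dangling:
                return links
            links -= dangling

    # As in Figure 4.8: the conflict lives in the triangle a-b, b-c, a-c; the link
    # to d, which has only one connection, is dropped.
    print(remove_extraneous({("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}))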
I used this proposed method to extract 42 conflicts, given in Appendix A, from
the TimeBank and AQUAINT corpora. The extracted conflicts were manually
examined in order to see if the extracted relations contribute to a conflict and if
the set of extracted relations is the minimal set that causes a conflict. All of the
extracted conflicts were the minimal sets that contribute to a conflict.
4.4 Evaluating the Extraction Method with Artificial Data

4.4.1 Motivations
This section assesses how much improvement in performance we can expect by applying transitive constraints to the temporal relation identification task with the TimeBank and AQUAINT corpora. If the corpora contain many links that form a graphical structure in which the application of transitive constraints is useless, an evaluation using these corpora cannot clearly show the effectiveness of a temporal relation identification system that uses transitive constraints. When annotated temporal relations and temporal links that can be represented as the graph in Figure 4.9 are given, it is impossible to extract a misclassified relation using transitive constraints. In addition, if the corpora used by previous studies contain many such link configurations where the application of transitive constraints is useless, we cannot blame previous studies for achieving only slight performance increases.

Figure 4.9: A graph in which misclassified relations are unextractable using transitive constraints
In this section, I use randomly generated misclassified relations to estimate what can be expected from the TimeML-annotated corpora. An experiment using the randomly generated misclassified relations shows how many (artificial) misclassifications lead to violations of the transitive constraints in the TimeBank and AQUAINT corpora. When a violation of transitive constraints is found, it is possible to extract the relations that cause the violation using the extraction algorithm described in Section 4.3. Relations in the extracted violations (extracted relations) are candidates for misclassified relations that can be detected and corrected using transitive constraints.
Using the number of misclassified relations in the extracted violations, we can calculate the maximum improvement obtainable with transitive constraints. We achieve the maximum improvement when every misclassified relation among the violations is detected and corrected. If a misclassified relation does not cause a violation of transitive constraints, we have no chance to detect and correct it, and thus no performance improvement can be expected from it. The number of misclassified relations in the extracted violations is therefore only an estimate of how much performance improvement we can get using transitive constraints. In
order to establish the number of misclassified relations in extracted violations, I use
the extraction algorithm that I described in Section 4.3.
In the analysis in Section 4.5, I discuss the difference between misclassified relations
that cause violations (extractable misclassifications) and misclassified relations that
do not cause a violation (unextractable misclassifications) using graph measures
that have previously been used in Navigli and Lapata (2007) and Korkontzelos et al.
(2009). As I discussed at the beginning of this chapter, the extraction of all violations is an NP-hard problem. In addition, my extraction algorithm in Section 4.3
is computationally expensive. Therefore, if we can tell which relations are more
likely to cause violations, we can reduce search space and computational expenses
by focusing on these relations. The goal of the analysis on the two sets of misclassified relations is to see if graph measures can be used to distinguish relations that
can cause violations from relations that cannot cause a violation. In addition, the
comparison of the graph measures shows which measure is better at distinguishing
between the types of relations.
Again, I used the 214 documents in the TimeBank and AQUAINT corpora whose annotated temporal relations contain no inconsistencies. All temporal link types in the corpora (event-event, event-dct, event-time, time-time, and time-dct) were included in this experiment.
The number of temporal links in the 214 documents was 8,541. The average and
standard deviation of the number of temporal links per document was 39.9 and
37.4, respectively. Fourteen TimeML relations were merged into eleven relations
as in the experiments presented in Chapter 3; that is, the simultaneous, identity, during, and during_inv relations were merged into simultaneous.
4.4.2 Experiment Design
In order to evaluate the effectiveness of using transitive constraints to aid in the
temporal relation identification task, I carried out a number of experiments using
these constraints to identify conflicts that misclassified relations contribute to.
There are two important parameters determining how effective transitive constraints might be for the identification of misclassified relations: the actual network geometry of the temporal relations to be identified in a particular document (are they tightly linked or independent?) and the nature and number of misclassified relations to be identified (are there few or many? do they give rise to logical inconsistencies?). If the temporal links have a form such as the graph in Figure 4.9, no performance improvement can be expected from transitive constraints. When a misclassified relation does not cause a conflict, as in Figure 4.10, or when misclassified relations are few, we can expect little performance improvement.
To examine the effectiveness of inconsistency detection over a wide range of types
of misclassified relations, I generated random types (and quantities) of misclassified relations by generating from each original document a set of perturbed documents that contains some number of perturbed links. I created the set of perturbed
documents by selecting at random a fraction of the annotated links in each document for perturbation. The annotated links were perturbed by randomly choosing
a relation from among the 10 temporal relations that differ from the original annotation.

Figure 4.10: Graph with a misclassified relation that does not cause a conflict

Misclassification per Doc    File Count
  1                          214
  2                          214
  4                          207
  8                          158
 15                           98
 20                           73
 30                           45
 40                           26
 50                           16
 60                           11
 70                            6
 80                            4
 90                            3
100                            2
110                            1
Total                        1076

Table 4.1: Document count per the number of misclassified relations of a document generated from 214 original documents

The number of perturbations introduced into each file was varied from 1,
2, 4, 8, 15, then 20, 30, 40, etc. up to a maximum of half of the links in a given
document. This procedure resulted in a test set of 1,076 perturbed texts with a
wide variety of misclassified relations (in Table 4.1 the distribution of files per
number of links perturbed is displayed). Some of the corpus documents are so
large that a wide range of perturbations is possible. For example, the large document WSJ910225-0066.tml was used as the basis for 15 perturbed texts. On the
other hand, I can generate only one and two perturbations for the tiny documents
wsj_0006.tml and wsj_0150.tml. There were a total of 10,834 perturbed links
among the 1,076 generated texts.
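The perturbation procedure can be sketched as below; the eleven relation names are assumed to be the merged TimeML set used in Chapter 3, and the entities and links in the example are made up for illustration:

    import random

    RELATIONS = ["before", "after", "ibefore", "iafter", "begins", "begun_by",
                 "ends", "ended_by", "includes", "is_included", "simultaneous"]

    def perturb(gold_links, n_errors, rng=random):
        # Relabel n_errors randomly chosen links with a relation drawn from the
        # 10 relations that differ from the original annotation.
        perturbed = dict(gold_links)
        for pair in rng.sample(list(gold_links), k=n_errors):
            alternatives = [r for r in RELATIONS if r != gold_links[pair]]
            perturbed[pair] = rng.choice(alternatives)
        return perturbed

    gold = {("e1", "e2"): "before", ("e2", "t1"): "includes", ("e1", "dct"): "after"}
    print(perturb(gold, n_errors=1, rng=random.Random(0)))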
I ran my violation extraction algorithm on this set of perturbed texts, measuring
how many inconsistent sets (“violations”) my algorithm extracted per document,
how many temporal links there were in each of these violation sets, and how many
perturbations (or “introduced misclassified relations”) appeared in each of these violation sets. This give us an empirical measure of how many misclassified relations
can possibly be identified on the basis of transitive constraints. This procedure was
repeated five times, with the results averaged over these trials.
4.4.3 Results
I made five different sets of 1,076 documents with randomly generated errors. Each set had a total of 10,834 randomly generated misclassified relations. When I extracted violations from each set using the extraction method, I could extract only 12% of the misclassified relations on average (misclassified relations in extracted violations / 10,834).
My analysis of violation sets such as {a ⇒ b, b ⇒ c, a ⇒ d, d ⇒ c} and {e ⇒ f, h ⇒ f, g ⇒ e, g ⇒ h} in Figure 4.6 shows that each violation set had 3.7 links and 1.7 misclassified relations on average. This means that, on average, around 45% (1.7/3.7) of the links in a violation set carry misclassified relations.
For the analysis per error ratio range, I collected documents into groups by 5% increments of error ratio (number of generated misclassified relations / number of temporal links in a doc). All violations in a group were collected. For example, wsj_0344.tml has 20 annotated links, and I generated four different documents with 1, 2, 4, and 8 misclassified relations. The four documents belonged to the ≤ 0.05 (1/20), ≤ 0.10 (2/20), ≤ 0.20 (4/20), and ≤ 0.40 (8/20) ratios, respectively. For each group, I counted the number of total misclassified relations (total-errors), the number of temporal links in violations (tlinks-violation), and the number of misclassified relations in violations (errors-violation) for each ratio.
Error Ratio   Precision   Recall   F-Measure
≤ 0.05        24.0         8.0     11.9
≤ 0.10        27.2         9.0     13.5
≤ 0.15        32.1        10.0     15.2
≤ 0.20        34.7        11.3     17.0
≤ 0.25        38.8         9.9     15.8
≤ 0.30        42.2        10.8     17.1
≤ 0.35        46.8        12.9     20.2
≤ 0.40        49.2        12.3     19.7
≤ 0.45        52.0        10.8     17.9
≤ 0.50        56.0        14.8     23.4

Table 4.2: Precision, recall, and F-measure per error ratio range
Let's suppose that wsj_0344.tml in ≤ 0.05 has 20 links and only one misclassified relation, and the misclassified relation causes a violation that consists of four links; then total-errors, tlinks-violation, and errors-violation are 20, 4, and 1, respectively. In this experiment, the ≤ 0.20 group had 105 documents, 4,956 links, 870 misclassified relations, 251 tlinks-violation and 85 errors-violation. Using the counts, I calculated precision (errors-violation / tlinks-violation), recall (errors-violation / total-errors), and F-measure. The results are reported in Table 4.2.
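For concreteness, the three measures can be computed from the group counts as in the following sketch (using the ≤ 0.20 counts quoted above; values are in percent):

    def prf(errors_violation, tlinks_violation, total_errors):
        # precision = errors-violation / tlinks-violation,
        # recall = errors-violation / total-errors, F = harmonic mean.
        if errors_violation == 0 or tlinks_violation == 0 or total_errors == 0:
            return 0.0, 0.0, 0.0
        precision = 100.0 * errors_violation / tlinks_violation
        recall = 100.0 * errors_violation / total_errors
        f_measure = 2 * precision * recall / (precision + recall)
        return precision, recall, f_measure

    print(prf(85, 251, 870))   # roughly (33.9, 9.8, 15.2) for this single run

These single-run values differ slightly from the ≤ 0.20 row of Table 4.2, presumably because the table averages the measures over the five trials.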
The highest F-measure, 23.4, occurred in the error ratio range between 0.45 and 0.50 (≤ 0.5), owing to the highest precision and recall (0.560 and 0.148) in the table. It is difficult to tell whether this highest F-measure is really the best one, because ≤ 0.5 has more misclassified relations and those misclassified relations are more likely to be placed on temporal links that are more extractable; a naive reading of the measures is therefore misleading. Precisions between ≤ 0.05 and ≤ 0.50 showed a large difference of 0.32, although the difference of the recalls was only 0.07.

Error Ratio   Average Violation   Average TLink   Average Error
≤ 0.05        2.4                 3.2             1.0
≤ 0.10        3.2                 3.6             1.0
≤ 0.15        3.4                 3.4             1.3
≤ 0.20        4.6                 3.3             1.4
≤ 0.25        4.8                 3.5             1.5
≤ 0.30        4.0                 3.6             1.6
≤ 0.35        6.0                 3.6             1.8
≤ 0.40        6.5                 3.2             1.6
≤ 0.45        5.1                 3.5             1.9
≤ 0.50        7.9                 3.5             1.9

Table 4.3: Averages of extracted violations per document (Average Violation), temporal links per violation (Average TLink), and misclassified relations per violation (Average Error)

The differences show that some temporal links
have a higher chance of being extracted than most other links in the current temporal corpora. We know that we cannot extract misclassified relations on temporal links that form a line, as shown in Figure 4.10; however, we do not know what characteristics the more extractable temporal links have. That analysis is given in Section 4.5. When 10 temporal links have three misclassified relations in the current corpora and all misclassified relations are extracted, the precision difference suggests that one violation with three misclassified relations is more likely than three violations each containing one misclassified relation. If three violations were extracted and each violation were composed of one misclassified relation and several temporal links, precision would stay stable or drop slightly.
Table 4.3 supports the precision and recall patterns. The table shows the average
number of extracted violations, the average number of temporal links in a violation, and the average number of misclassified relations in a violation per error ratio.
We can say that the number of the misclassified relations in ≤ 0.5 is 10 times the
number of the misclassified relations in ≤ 0.05. If we have 100 links, ≤ 0.05 and
≤ 0.5 have 5 (100 × 0.05) and 50 (100 × 0.5) misclassified relations. But, the number
of extracted violations increases only 3 times (2.4 vs 7.9 in average violation). That
is why recall in Table 4.2 shows a small improvement. In addition, each violation
has a similar number of temporal links, but misclassified relations in a violation
continuously increase in Table 4.3 when the error ratio goes up. That is the reason
for the precision change in Table 4.2. The trend may be indirect evidence that certain links in the current corpora are more extractable when the links have a misclassified relation and transitive constraints are used to detect a violation.
The recall can be used to measure how much improvement we can expect when we
compare a TRI system using transitive constraints with a TRI system that does not
use the constraints using the current temporal corpora as evaluation data. An error
ratio range represents the performance of a TRI system without the constraints.
The recall represents the improvement we can expect when we apply the constraints
to a TRI system without the constraints. The 14.8 recall of ≤ 0.5 can be interpreted as less than 8% absolute improvement in accuracy. If we assume that the
ratio of extracted misclassified relations in a real system is similar to the ratio in
this experiment, we can calculate the expected performance improvement. Let’s
suppose that a document with 100 links is given and only 50% accurately classified
relations exist. In the document, 50 misclassified relations exist. Only 7.4 misclassified relations (50 × 0.148) among the 50 are extractable using the constraints. Therefore, even if we restore all extractable misclassified relations to their target relations, we can achieve only a 7.4% improvement in accuracy. If five misclassified relations exist in a document, a recall of 14.8 means that only 0.74 of them are detectable, that is, less than one link. This study using artificially generated misclassified relations shows that a TRI system using transitive constraints will show performance similar to, or at most 7.4% better than, a TRI
system without using the constraints if we use the current temporal corpora in
the comparison. But we cannot guarantee that we will get similar results in a real
system because artificially generated errors may differ in various ways from real
errors. The experiment in Section 4.6 will show how much performance improvement we can expect using transitive constraints.
This study also shows that random selection is not a good method for constructing
test data when we want to compare two TRI systems, one using and one not using transitive constraints. In this experiment, the variation across documents in the number of extractable misclassified relations was large. Of the 214 documents, 95 have no
extracted misclassified relations in the five experiment sets. If most test documents are from these 95 documents, a method to construct a temporal structure
without a violation cannot be evaluated properly on the test data because transitive constraints are useless on the documents. Moreover, there would be no difference between the performances of TRI systems using and not using transitive
constraints. A possible way to evaluate and compare both systems is to put aside
documents that have a certain number of extractable misclassified relations as test
documents.
Another possible solution to see the true effect of transitive constraints would focus
on more densely connected temporal links. The step to remove additional links from
extracted links can be applied to the temporal links in the temporal corpora. When annotated temporal links form a line, transitive constraints are useless in constraining and detecting temporal relations that violate the constraints.

Figure 4.11: Small document with links that form lines

The application of the removal step to the temporal corpora makes it possible to get more densely connected links to which transitive constraints are more applicable. Figure 4.11 has three
links that form lines. A link between f and a, and two links between c and e form
lines. If we can remove the links that form lines, we can get links between a and b,
between b and c, and between a and c that are more densely connected. The use
of this kind of more densely connected links after the removal of lines can let us
evaluate the true effect of transitive constraints.
4.5 Graph Analysis on Inconsistent Relation Extraction
In this section I show what graph structural characteristics are helpful in figuring
out which temporal links are more extractable when the links have misclassified
relations. The experiment with misclassified relations artificially generated in the
preceding section showed an approximate expectation of how many misclassified
relations can be extracted from TimeBank and AQUAINT. The experiment, however, did not show how to characterize documents with extracted misclassified
relations and how to characterize extracted links in extracted violations. Among
214 documents in the experiment, 95 documents did not have any extracted links
using transitive constraints. In this section I show what differences lead to the
extractability of misclassified relations using transitive constraints by comparing
documents with extracted violations to documents without an extracted violation.
In addition, temporal links with misclassified relations can also be categorized into
two groups: links with misclassified relations in extracted links (extracted errors)
and links with misclassified relations in remaining consistent links after the extraction (unextracted errors). After categorizing the links with a misclassified relation
into the two groups, I compare the groups. The comparison shows which links are
more extractable when they have a misclassified relation.
It is important to figure out when we need to apply transitive constraints in order
to construct consistent temporal relations in a document. The application of transitive constraints to the TRI task is computationally expensive. My violation extraction method takes time O(n^5) because the outer loop of my method has an inner loop, and the inner loop runs TCPA, which takes time O(n^3). Statistical Relational Learning methods such as Markov Logic Networks are also computationally expensive and face challenges in efficiency and scalability in dealing with real data (Niu et al., 2011b). Although the TimeBank and AQUAINT corpora consist of small documents, the real data that TRI systems will deal with can contain much larger documents. If we know that applying transitive constraints to a given document does not lead to any effect, we do not need to use computationally expensive methods for that document. When we know which parts of the links in a given document are worth applying computationally expensive methods to, we can apply the methods only to these parts. The analysis in this section will provide a way
to help figure out which documents and which temporal links are worth applying
transitive constraints to.
In both comparisons, I used measures borrowed from graph analysis. Temporal
links in a document can be represented as a graph G = (V, E), with temporal entities as vertices (V) and temporal links as edges (E).

Figure 4.12: Undirected version of a temporal link graph

Recent studies (Navigli
and Lapata, 2007; Korkontzelos et al., 2009) have shown that graph connectivity
measures can be useful in natural language processing tasks. The measures that I
borrowed from Navigli and Lapata (2007) and Korkontzelos et al. (2009) were edge
density, compactness, entropy, clustering coefficient, average clustering coefficient,
normalized density, key player problem, and betweenness.
In the analysis using the graph measures, I treated each temporal link, which is a directed link, as an undirected link, and I ignored the temporal relation on a link. Therefore, the graph with directed links in Figure 4.5 was represented as
Figure 4.12 in this analysis. The use of an undirected version without a relation
instead of a directed version can be justified because a temporal link has a corresponding inverse link. In addition, the connectivities among temporal entities are
more important in this analysis than the consideration of the directionality and
relation of a link.
In characterizing documents with extractable errors and documents without an
extractable error, four global graph measures were examined: edge density, compactness, graph entropy, and average clustering coefficient. The other measures (normalized density, clustering coefficient, key player problem, and betweenness) were used to distinguish unextractable and extractable links with an artificially generated misclassified relation.

Global measures: edge density, compactness, graph entropy, average clustering coefficient
Local measures:  normalized density, clustering coefficient, key player problem, betweenness

Table 4.4: Global measures vs local measures

Figure 4.13: Complete graph
4.5.1 Graph Measures
Temporal links in a document can be represented by a graph, and I use graph measures in analyzing characteristics of the temporal links. I begin with explanations of
the terms that are used in the graph analysis. In this analysis, temporal entities of
the links are vertices of the graph, the links are edges of the graph, and the edges
are considered undirected. The degree of a vertex, deg(v), is the number of edges
from v. In the case of b in Figure 4.12, the vertex has two edges (b — a and b — c).
Therefore, deg(b) is 2.
The vertices that are directly connected to a vertex are called neighbors. The neighbors of b are a and c. When two vertices are neighbors, they are adjacent. If every
vertex has all other vertices as a neighbor in a graph, the graph is a complete graph.
In other words, each vertex has edges to every other vertex. A complete graph version of Figure 4.12 is given in Figure 4.13.
It is possible that various paths exist between two vertices. We can reach e from
b in Figure 4.12 using the following paths: b — c — d — e, b — a — d — e, and
b — a — e. Among possible paths, a path that consists of the smallest number of
edges is called the shortest path and the number of edges in the shortest path is the
shortest distance (d(u, v)). The shortest path between b and e is b — a — e and
the shortest distance (d(b, e)) is 2. When a complete graph is given, every vertex
can reach every other vertex by only one edge. Therefore, the shortest distance of
a vertex to another vertex in a complete graph is always 1. When a graph has no
edges among vertices, the shortest distance is defined as the number of the vertices
(|V |) in the graph in this analysis.
Edge Density (Navigli and Lapata, 2007, p. 1686) measures how many realizable connections are realized. In other words, the density is the ratio of the number of edges in a graph to the number of edges of its complete version, as given in Equation (4.1):

    EdgeDensity(G) = \frac{|E|}{\binom{|V|}{2}}    (4.1)

where |E| is the number of edges in a graph and \binom{|V|}{2} is the number of edges of a complete graph with |V| vertices. When a graph has no edge, its edge density is 0. The edge density of a complete graph is 1. The edge density of the graph in Figure 4.12 is 0.6 (6 / \binom{5}{2}).

Figure 4.14: Three different graph types: (a) Line, (b) Star, (c) Loop

Compactness (Navigli and Lapata, 2007, p. 1686) measures how easy it is for a vertex to reach other vertices. Compactness shows where a graph can be positioned between a completely connected graph and a completely disconnected graph. The sum of all shortest distances of a completely connected graph with |V| vertices is defined as |V| · (|V| − 1), which is MIN. In the case of a completely disconnected graph, the shortest distance from a vertex to another is defined as |V|. Therefore, the sum of all shortest distances of a completely disconnected graph with |V| vertices is defined as |V| · |V| · (|V| − 1), which is MAX. When a graph with |V| vertices is given, the sum of all shortest distances is defined as \sum_{u,v \in V: u \neq v} d(u, v). The compactness of a graph G = (V, E) is formulated as in Equation (4.2):

    Compactness(G) = \frac{MAX - \sum_{u,v \in V: u \neq v} d(u, v)}{MAX - MIN}    (4.2)

One reason to introduce the compactness measure is that the density measure cannot tell the difference between the graphs in Figure 4.14. The three graphs have an identical density value, 0.33 (5 / \binom{6}{2}), but the compactness shows differences: 0.73 for Figure 4.14a, 0.87 for Figure 4.14b, and 0.84 for Figure 4.14c. The compactness of the graph in Figure 4.12 is 0.9.
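A small sketch of both measures, using an edge set for Figure 4.12 that is reconstructed from the worked examples in this section (and is therefore an assumption):

    from collections import deque
    from itertools import combinations

    EDGES = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e"), ("a", "e")}
    NODES = {n for edge in EDGES for n in edge}

    def shortest_distance(u, v, edges, nodes):
        # BFS distance; defined as |V| when v is unreachable from u.
        adj = {n: set() for n in nodes}
        for x, y in edges:
            adj[x].add(y)
            adj[y].add(x)
        seen, frontier = {u}, deque([(u, 0)])
        while frontier:
            node, d = frontier.popleft()
            if node == v:
                return d
            for nb in adj[node] - seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
        return len(nodes)

    def edge_density(edges, nodes):
        return len(edges) / (len(nodes) * (len(nodes) - 1) / 2)

    def compactness(edges, nodes):
        n = len(nodes)
        min_sum, max_sum = n * (n - 1), n * n * (n - 1)
        total = 2 * sum(shortest_distance(u, v, edges, nodes)
                        for u, v in combinations(sorted(nodes), 2))  # ordered pairs
        return (max_sum - total) / (max_sum - min_sum)

    print(edge_density(EDGES, NODES))   # 0.6, as reported for Figure 4.12
    print(compactness(EDGES, NODES))    # 0.9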
Graph Entropy (Navigli and Lapata, 2007, p. 1686) shows how many relatively
important vertices a graph has. Edge density and compactness cannot show the
characteristics of a star-shaped graph in that a vertex has lots of edges connected
to other vertices like the graph in Figure 4.14b. In the case of a star structure,
connections are concentrated on several vertices and other vertices have fewer
connections. In order to identify a graph that has vertices with relatively more connections than other vertices, we can use graph entropy. Document Creation Time
in TimeML can have more connections than other entities. This measure tells us
which documents have more links to DCT, and we can see how many misclassified
relations can be extracted according to the degree of centralized temporal entities. The entropy of a vertex u is defined as en(u) = −p(u) log2 p(u), and the probability of a vertex, p(u), is determined by the degree distribution: p(u) = deg(u) / (2|E|), for u ∈ V. Graph entropy is computed by summing all vertex entropies and normalizing by log2 |V|, as in Equation (4.3):

    Entropy(G) = \frac{\sum_{u \in V} en(u)}{\log_2 |V|}    (4.3)
When the entropy of a graph is high, the vertices of the graph are equally important or have similar numbers of connections. Low entropy can be interpreted
as meaning that some vertices are more important or have more connections.
Entropies of the line graph, the star graph, and the loop graph in Figure 4.14 are
0.98, 0.84, and 1.08. The graph entropy of the graph in Figure 4.12 is 0.987.
Average Clustering Coefficient (Korkontzelos et al., 2009, p. 40) is the average
of clustering coefficients (ACC). The Clustering Coefficient (CC) of a vertex measures interconnectivity between the neighbors of a vertex. Let’s suppose that a
vertex (u) has k neighbors and n edges exist among the neighbors. The clustering
coefficient of u, CC(u), is given in Equation (4.4):

    CC(u) = \frac{2n}{k(k - 1)}    (4.4)

In Figure 4.12, e has a and d as neighbors. The neighbors have one edge (a − d). Therefore, the clustering coefficient of e, CC(e), is 1 (2·1 / (2·(2−1))). In the case of b, no edges exist between a and c. Therefore, the clustering coefficient of b is 0. In the case of d, three neighbors, a, c, and e, exist. There is one edge (a − e) among the neighbors. The clustering coefficient of d is 0.333 (2·1 / (3·(3−1))). The Average Clustering Coefficient (ACC) of a graph, given in Equation (4.5), is the average of the clustering coefficients of all vertices in the graph:

    ACC(G) = \frac{\sum_{u \in V} CC(u)}{|V|}, where G = (V, E)    (4.5)
The ACC of the graph in Figure 4.12 is 0.333.
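Graph entropy and the average clustering coefficient only need vertex degrees and neighbourhoods; a sketch, again under the assumed edge set for Figure 4.12:

    import math
    from itertools import combinations

    EDGES = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e"), ("a", "e")}
    NODES = sorted({n for edge in EDGES for n in edge})
    NEIGHBORS = {n: ({y for x, y in EDGES if x == n} | {x for x, y in EDGES if y == n})
                 for n in NODES}

    def graph_entropy(edges, nodes):
        # Sum of vertex entropies -p(u) log2 p(u), normalized by log2 |V|.
        # Assumes every vertex has at least one edge.
        total = 0.0
        for u in nodes:
            p = len(NEIGHBORS[u]) / (2.0 * len(edges))
            total += -p * math.log2(p)
        return total / math.log2(len(nodes))

    def clustering_coefficient(u):
        nbrs = NEIGHBORS[u]
        if len(nbrs) < 2:
            return 0.0
        closed = sum(1 for x, y in combinations(sorted(nbrs), 2) if y in NEIGHBORS[x])
        return 2.0 * closed / (len(nbrs) * (len(nbrs) - 1))

    def average_clustering_coefficient(nodes):
        return sum(clustering_coefficient(u) for u in nodes) / len(nodes)

    print(round(graph_entropy(EDGES, NODES), 3))             # 0.987, as in the text
    print(round(average_clustering_coefficient(NODES), 3))   # 0.333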
The measures described above (Edge Density, Compactness, Graph Entropy, and Average Clustering Coefficient) are for document-level analysis. I examine whether these measures can indicate the "extractability of misclassified relations" of documents. In addition to the document-level analysis, it would be useful to have a measure
to decide whether a misclassified relation in a temporal link can be included in
extracted violations. The misclassified relation in Figure 4.15 (a) is extractable
using transitive constraints, but the misclassified relation in Figure 4.15 (b) is not extractable. Global measures cannot tell which link is extractable when a link has a misclassified relation. I examine local measures such as Normalized Degree, Key Player Problem, and Betweenness in order to figure out if the local measures are effective in distinguishing extractable links from unextractable links.

Figure 4.15: Extractable and unextractable errors: (a) Extractable Error, (b) Unextractable Error
The original local measures were designed to capture the characteristics of a vertex,
not a link. My analysis, however, needs to capture the characteristics of a temporal
link. In order to capture the link characteristics, I modify the original local measures. The local measure of a temporal link is the average of the corresponding
original local measures of the temporal entities that are elements of the link. For
example, when a temporal link is composed of two entities, e1 and e2, the Normalized Degree of the link is the average of the original Normalized Degrees of e1 and
e2.
Normalized Degree (ND) is the normalized version of deg(v), divided by |V| − 1, such that ND(v) = deg(v) / (|V| − 1). In Figure 4.12, the normalized degree of a, ND(a), is 0.6 (deg(a) / (6 − 1)). For temporal links, I use the average of the Normalized Degrees of the two entities, as given in Equation (4.6):

    ND(edge(u, v)) = \frac{ND(u) + ND(v)}{2} = \frac{deg(u) + deg(v)}{2(|V| - 1)}, for u, v ∈ V    (4.6)
The Normalized Degree of a − d in Figure 4.12 is 0.75.
Key Player Problem (Navigli and Lapata, 2007, p. 1685) shows which vertex is
relatively close to all other vertices. The formula of Key Player Problem (KPP) is:
    KPP(u) = \frac{\sum_{v \in V: v \neq u} \frac{1}{d(v, u)}}{|V| - 1}    (4.7)

where the sum of the inverse shortest distances between u and all other nodes is divided by the number of vertices except u itself. The sum of the inverse shortest distances shows the closeness of a vertex to the other vertices. In a complete graph, the sum of the inverse shortest distances of a vertex is |V| − 1 (1/1 + 1/1 + ...) because each d(u, v) is 1. Therefore, the KPP of a vertex in a complete graph becomes 1. In a completely disconnected graph, the sum becomes (|V| − 1)/|V| (1/|V| + 1/|V| + ...), which is close to 1 because each d(u, v) is |V|. Therefore, the smallest KPP is ((|V| − 1)/|V|) / (|V| − 1), which is close to 1/(|V| − 1), not 0. In the case of a in Figure 4.12, the sum of the inverse shortest distances is 3.5 (1/1 + 1/2 + 1/1 + 1/1), and its KPP is 0.7 (3.5 / (6 − 1)). In the case of e in Figure 4.12, the KPP of e is 0.6 ((1/1 + 1/2 + 1/2 + 1/1) / (6 − 1)).
The KPP of a temporal link is the average of the KPPs of its two temporal entities, as given in Equation (4.8):

    KPP(edge(i, j)) = \frac{KPP(i) + KPP(j)}{2} = \frac{\sum_{v \in V: v \neq i} \frac{1}{d(v, i)} + \sum_{v \in V: v \neq j} \frac{1}{d(v, j)}}{2(|V| - 1)}    (4.8)

The KPP of the link between a and e is 0.65 ((3.5 + 3) / (2(6 − 1))).
Betweenness (Navigli and Lapata, 2007, p. 1685) represents how many times a
vertex is involved in the shortest path between two other vertices. When lots of
shortest paths include a vertex, the inclusion means that the vertex is critical in a
graph. The Betweenness of a vertex is defined in Equation (4.9).
    Betweenness(v) = \frac{\sum_{i,j \in V: i \neq v \neq j} \frac{\sigma_{ij}(v)}{\sigma_{ij}}}{(|V| - 1)(|V| - 2)}    (4.9)

where \sigma_{ij}(v) is the number of shortest paths from vertex i to vertex j that have v on the path, and \sigma_{ij} is the number of shortest paths from i to j. The Betweenness of a in Figure 4.12 can be calculated as follows. The pairs of vertices excluding a are (b,c), (b,d), (b,e), (c,b), (c,d), (c,e), (d,b), (d,c), (d,e), (e,b), (e,c), and (e,d), and their shortest paths are b → c, (b → a → d or b → c → d), b → a → e, c → b, c → d, c → d → e, (d → a → b or d → c → b), d → c, d → e, e → a → b, e → d → c, and e → d. The value of \sigma_{bd} is 2 because (b,d) has two shortest paths, and \sigma_{bd}(a) is 1 because one of the two paths includes a (b → a → d). Therefore, \sigma_{bd}(a)/\sigma_{bd} is 1/2. The sum of \sigma_{ij}(a)/\sigma_{ij} is 3 (0/1 + 1/2 + 1/1 + 0/1 + 0/1 + 0/1 + 1/2 + 0/1 + 0/1 + 1/1 + 0/1 + 0/1); therefore, the Betweenness of a is 0.25 (3 / ((5 − 1)(5 − 2))). The Betweenness of d is also
0.25.
                            Edge Density   Compactness   Entropy   ACC
all docs                    0.073          0.368         0.939     0.054
docs wo extracted errors    0.080          0.285         0.943     0.001
docs w extracted errors     0.067          0.434         0.935     0.096

Table 4.5: Overall values of global measures
When I calculate the betweenness of a temporal link, I average the betweenness of
its two vertices as in Equation (4.10).
    Betweenness(edge(u, v)) = \frac{Betweenness(u) + Betweenness(v)}{2}    (4.10)
The Betweenness of edge(a, d) is 0.25, the average of 0.25 and 0.25.
Clustering Coefficient measures the interconnectivity among nodes that are adjacent to a node, as explained for the Average Clustering Coefficient. The Clustering Coefficient of an edge is the average of the Clustering Coefficients of the vertices that make up the edge, as in Equation (4.11):

    CC(edge(u, v)) = \frac{CC(u) + CC(v)}{2}    (4.11)
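A sketch of the edge-level versions of these local measures, i.e., the average of the two endpoint values, under the same assumed edge set for Figure 4.12 (Betweenness and KPP are averaged over the endpoints in exactly the same way; the sketch normalizes by |V| − 1 for the five assumed vertices):

    EDGES = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e"), ("a", "e")}
    NODES = sorted({n for edge in EDGES for n in edge})
    ADJ = {n: set() for n in NODES}
    for x, y in EDGES:
        ADJ[x].add(y)
        ADJ[y].add(x)

    def nd_vertex(v):
        # Normalized Degree of a vertex.
        return len(ADJ[v]) / (len(NODES) - 1)

    def cc_vertex(v):
        # Clustering Coefficient of a vertex.
        nbrs = sorted(ADJ[v])
        if len(nbrs) < 2:
            return 0.0
        closed = sum(1 for i, x in enumerate(nbrs) for y in nbrs[i + 1:] if y in ADJ[x])
        return 2.0 * closed / (len(nbrs) * (len(nbrs) - 1))

    def edge_average(vertex_measure, u, v):
        # Edge-level local measure: the average of the two endpoint values.
        return (vertex_measure(u) + vertex_measure(v)) / 2

    print(round(edge_average(nd_vertex, "a", "d"), 3))   # 0.75, as reported for a-d
    print(round(edge_average(cc_vertex, "a", "d"), 3))   # 0.333, average of CC(a), CC(d)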
4.5.2 Analysis
I first analyzed the characteristics of documents in TimeBank and AQUAINT corpora using edge density, compactness, graph entropy, and average clustering coefficient. I calculated the measures of all 214 documents (all docs). Second, I split the
documents into documents with extracted errors (docs w extracted errors) and
without extracted errors (docs wo extracted errors), then I calculated the measures of both document groups separately. The calculated measures are given in
Table 4.5.
All docs in Table 4.5 show the overall characteristics of temporal documents. Low
edge density and compactness values imply sparsity of temporal links in most documents. The density of 0.07 can be interpreted as showing that only seven links are
realized from 100 link candidates, and 0.368 compactness shows that a graph structure of temporal links in a document is relatively close to a completely disconnected
graph, of which the compactness is 0. The Average Clustering Coefficient (ACC)
also supports the sparsity. Neighbors of temporal entities in temporal corpora seem
poorly connected to each other, as the low ACC shows. That means there are a
very small number of temporal entities that make triangular connections: an ACC of 0.054 means that only 5 cases are realized among 100 candidates that could make a triangle. In addition, the low ACC value shows that a naive application of transitive constraints that checks for violations among only three temporal links forming a triangle will not be successful in detecting violations, because only 5% of the temporal links make such a triangle. Therefore, the propagation process of TCPA is necessary in violation detection in order to get violations that are caused by chains composed of more than three links. The Entropy of a temporal
document was 0.939 on average. The entropy value shows that all temporal entities
have a similar number of connections to other temporal entities if the entities have
connections. Based on the entropy value, we can deduce that DCT does not have
more connections to temporal entities in a document than any other entities in the
document.
Category            ND      KPP     BT      CC
all links           0.040   0.153   0.021   0.049
unextracted links   0.028   0.084   0.026   0.014
extracted links     0.081   0.258   0.096   0.376

Table 4.6: Comparison of local measures for extracted and unextracted errors

The comparison using docs wo extracted errors and docs w extracted errors indicates that compactness and ACC are relatively better measures than density and entropy for telling in which documents we can extract violations using transitive constraints. The differences of density and entropy are -0.013
(0.067 − 0.08) and -0.008 (0.935 − 0.943); however, compactness and ACC show bigger
differences of 0.149 (0.434 − 0.285) and 0.095 (0.096 − 0.001).
In order to see the difference between extractable and unextractable errors, I collected all links (all links) that had artificial misclassified relations, then categorized
the collected links into links (extracted links) that were included in extracted violations and links (unextracted links) that were never extracted. I calculated modified
Normalized Degree (ND), Key Player Problem (KPP), Betweenness (BT), and
Clustering Coefficient (CC) of the links, as in Table 4.6.
The low values of ND, KPP, and BT support the finding that an average temporal
entity in TimeBank and AQUAINT has few connections to other entities. The
values also indicate that the temporal link is more likely to be extracted under
certain conditions when a temporal link has a misclassified relation. Based on a
difference of 0.06 (0.08 − 0.02) for ND, we can say that a temporal link is more
extractable when its temporal entities have more connections to other temporal
entities. Differences of 0.18 and 0.07 for KPP (0.26 − 0.08) and BT (0.10 − 0.03), respectively, are evidence that a link that is involved in many paths between other entities is also more extractable. In addition, the 0.37 difference for CC (0.38 − 0.01) verifies that the triangular structure of three temporal links is also important for extracting errors.

Figure 4.16: TLINK patterns with three links and one unextractable error: (a) a pattern that allows all relations, (b) a pattern that allows part of the relations
Randomly generated misclassified relations are sometimes not extracted even though they are placed on temporal links where a violation could be detected structurally. In order to analyze these cases, I extracted sets of temporal
links that were extracted as inconsistent relations in one perturbed version of a
document and were not extracted in another version, although a misclassified relation existed among them. I collected violations that consist of three, four, and five
temporal links with only one misclassified relation, and manually analyzed them in
order to find unextractable patterns of temporal relations.
I found 166, 54, and 29 cases of three, four, and five temporal links. Among
the cases of three temporal links, 141 cases showed patterns similar to that of
Figure 4.16a. The three patterns, {a ==before==> b, a ==before==> c}, {a ==after==> b, a ==after==> c}, and {a ==is_included==> b, a ==is_included==> c}, allow all relations on a link between b and c.
When a misclassification occurs at a position where all relations are possible, the
wrong prediction does not cause a conflict. Another very common pattern with 23
cases was Figure 4.16 (b). Each pattern allows only four relations, {before, ibefore,
is_included, begins} or {after, iafter, is_included, ends}, on a link between b and c. Both patterns in Figure 4.16 have only before, after, and is_included on the links from a to b and from a to c. One reason that patterns consisting of these relations were found so frequently is that these relations make up most of the distribution of annotated relations.

Figure 4.17: TLINK patterns with four links and one unextractable error
Among 54 unextracted cases with four links, 42 cases (78%) had the pattern outlined in Figure 4.17. The pattern is almost identical to Figure 4.16 (a) except that
it has b ==simultaneous==> c, which can be considered as a node. For comparison, I collected
57 extracted inconsistent cases that consisted of four links with one misclassified
relation. There was no case with the pattern in the extracted cases.
29 cases were composed of five links with one misclassified relation. All except five
cases can be derived from the patterns in Figure 4.16a and Figure 4.17 as illustrated in Figure 4.18.
Figure 4.18: TLINK pattern with five links and one unextractable error

The analysis of documents and links in the TimeBank and AQUAINT corpora using graph measures showed how sparsely connected temporal entities in the documents are. According to the analysis, temporal links with misclassified relations are more likely to be extracted using transitive constraints when temporal entities
in the links have more connections to other entities and the links are parts of paths
between other entities. Therefore, in order to make the application of transitive constraints to the corpora more effective in extracting misclassified relations, temporal
entities in the corpora should have more connections to other entities using more
temporal links.
4.6 Extracting Violations from Classified Relations
This section shows what we can expect when we apply the extraction method to
real data. In Section 4.4, I showed how effective the proposed inconsistent relation
extraction algorithm was with misclassified relations randomly generated. In addition, the experiment in the previous section showed achievable maximum improvements with the data. But the randomly generated misclassified relations may fail
to show misclassification tendencies of classifiers. Trained classifiers can show a
preference for certain relations; this bias does not exist in the random selection. Therefore, this experiment shows what happens in violation extraction when certain relations occupy a larger share of the classified relations.
For the investigation, I collected the classified relations in the temporal relation
classification experiment that used all qualitative TimeML relations in Section 3.2
based on source documents. I classified temporal relations of event-event, event-time, and event-dct in the experiment. Because we are looking for inconsistencies in a set of links generated by our TRI system and this system only identified event-event, event-time, and event-dct links, I limit the investigation to these links.5 In
this experiment, the relations of the classifiers that achieved the best performance
in each link type are combined. I apply the violation extraction method to the collected data. After the extraction, I compare the extracted violations of the classified
relations to the violations of artificial data.
4.6.1 Experiment Design
The goal of this experiment is to see if there is a difference in the extraction of violations from classified relations compared with the extraction using misclassified
relations randomly generated. In order to carry out the comparison, I first collected
classified relations of the classifiers that showed the best performance in Section 3.2:
kNN in event-dct, Maximum Entropy in event-event, and Support Vector Machine
in event-time. Second, I collected the classified relations per document.
In the classification experiment, 10 test sets existed because 10-fold cross validation
was used. Each set in the test sets had 21 documents. Therefore, 210 documents
were available for this experiment. As shown in the classification experiment, three
link types were used in this experiment: event-event, event-time, and event-dct.
5 In a TRI system, there is no need to identify time-time and time-dct links, as they are typically computed on the basis of the temporal values.
After the application of the violation extraction method to the data, I evaluated
various statistics such as the average number of violations per document, the
average number of temporal links, the average number of misclassified relations
in a violation, precision, recall, and F-measure. After the evaluation, I compared
the statistics of the classified data with the artificial data.
4.6.2 Results and Discussion
I collected classified relations of event-dct, event-event, and event-time per document in the test data. In the collection, I used the classified relations of the classifiers that showed the best performance in each link type: kNN in event-DCT
(0.597), ME in event-event (0.389), and SVM in event-time (0.673). The overall
performance of the classified relations was 0.516 accuracy. Event-event had the most instances among the link types (51%), which pulled the overall accuracy down.
After collecting the classified relations per document, I extracted violations using
the violation extraction algorithm introduced in this chapter. From the test set,
I got 0.32 violations per document on average among 210 documents, and 184
extracted links. That means that I failed to get a violation set from many documents due to the use of sparsely connected links. When a document had violations, each
violation set consisted of 3.1 links and 1.7 errors on average.
I calculated precision (misclassifications in violations / links in violations), recall (misclassifications in violations / all misclassifications), and F-measure of the extracted classified relations (classification), which are identical
to the measures used in Section 4.4. As a comparison, I provided precision, recall,
and F-measure of ≤ 0.50 (random) in Table 4.2 because the overall performance of
classified relations was in the range (≤ 0.50).
                 precision   recall   F-measure
random (≤ 0.5)   0.560       0.148    0.234
classification   0.563       0.021    0.041

Table 4.7: Performance of violation extraction
The precision of random and classification is almost identical, but the recall of
classification is significantly lower (0.12 in absolute value). The drop also causes a
drop in F-measure. The drop of recall is expected because the classification collection does not have time-time and time-dct links, and 0.02 recall means that we can
expect at most 2% improvement after the restoration step if we have 100 misclassified relations. I will show how many more misclassified relations can be extracted
and what improvement we can expect when we add more links to the classified
relations in Section 5.4.
The data in this section will be used in the consistency restoration experiments in
Chapter 5. My restoration methods, presented in Chapter 5, use confidence scores
achieved from classifiers as a criterion for selecting a link with a misclassified relation among the extracted links and replacing the classified relation in the link by
another relation. The data with misclassified relations that were randomly generated in Section 4.4 do not have confidence scores. Therefore, that data cannot be
used in the experiments of Chapter 5.
4.7 Conclusions
In this chapter, I showed how to extract violations using a novel algorithm based
on TCPA. In addition, I examined the appropriateness of the current corpora for evaluating a method that constructs a consistent temporal structure using transitive
constraints. I extracted violations of classified relations in Section 4.6 using the
extraction algorithm. When I designed the violation extraction algorithm, I added
various functionalities: extracting multiple violations, extracting violations with
shared errors, and removing redundant links from extracted violations.
In the experiment on the appropriateness of the current corpora for the evaluation
of a TRI system based on transitive constraints, I artificially generated various
ratios of misclassified relations, then showed how many misclassified relations can
be extracted using transitive constraints. The results of the experiment indicated
that sparsely connected links in the current corpora allow small numbers of violations to be extracted. Therefore, applying transitive constraints to the construction
of a consistent temporal structure will only achieve small performance improvements in the current corpora. The analysis of the results of this experiment using
graph measures showed the graph structural characteristics of documents from
which violations are extractable and the characteristics of misclassified relations
that are extractable.
Finally, I applied my violation extraction method and verified that extracted violations from automatically classified relations also show similar precision. But, due to
missing time-time links in the classified data, the recall drops significantly.
Chapter 5
Restoration of Consistent Temporal Relations with Reasoning
5.1 Introduction
In the preceding chapters, I proposed a method for classifying temporal relations of
event-event, event-time, and event-dct links using eleven qualitative relations and
evaluated this method. I also presented a heuristic for extracting violations that
are collections of conflicting relations from classified relations in a document. Using
my proposed violation extraction algorithm, I extracted conflicting relations and
evaluated the expected performance improvement with transitive constraints. This
method for extracting inconsistent sets of relations significantly reduces the space of
error candidates. In this chapter, I provide an algorithm for identifying the source
of inconsistencies and removing them, restoring a consistent temporal structure to a
document.
The violation extraction method categorizes the temporal links in a document into one of two types: they are either in a violation set, or they are not. We are unable to identify any "errors" that might exist in the links
that are not in the violation set, and so will have to ignore the possibility of the
existence of errors in these links. This ignorance leads to the assumption that the
classified relations of the links are correct. In what follows, I will be concerned with
identifying which of the links in the violation sets are the source of the violation.
The assumption that the relations of the links that are not in the violation set have been correctly classified is based on the observation that a set of relations containing a misclassified relation is unlikely to remain free of conflicts. When three transitive links are given, e.g., a =?⇒ b, b =?⇒ c, and a =?⇒ c, only about 17% of the label assignments (229 of the 11³ = 1,331 possible combinations) are consistent according to the transitivity table in Allen (1983). The more links that are connected, the less likely it is to get a set of relations that contains misclassified relations and does not have any conflicts.
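The 17% figure can be checked mechanically. The sketch below is a minimal illustration rather than the dissertation's code: it counts how many of the 11³ label combinations for the three links survive a composition table. The COMPOSE stub here contains only a few uncontroversial entries, so reproducing the exact count of 229 would require filling in the full transitivity table.

```python
from itertools import product

RELATIONS = ["before", "after", "includes", "is_included", "simultaneous",
             "begins", "begun_by", "ends", "ended_by", "ibefore", "iafter"]

# Partial, illustrative composition table: COMPOSE[(r_ab, r_bc)] is the set of
# relations permitted on a->c.  Missing entries are treated as unconstrained;
# the full table (Allen 1983, restricted to these eleven relations) would be
# needed to reproduce the 229/1,331 figure quoted in the text.
COMPOSE = {
    ("before", "before"): {"before"},
    ("after", "after"): {"after"},
    ("includes", "includes"): {"includes"},
    ("is_included", "is_included"): {"is_included"},
}

def allowed_on_ac(r_ab, r_bc):
    """Relations permitted on a->c given the labels on a->b and b->c."""
    return COMPOSE.get((r_ab, r_bc), set(RELATIONS))

def count_consistent_triples():
    """Count label assignments (r_ab, r_bc, r_ac) accepted by the table."""
    return sum(r_ac in allowed_on_ac(r_ab, r_bc)
               for r_ab, r_bc, r_ac in product(RELATIONS, repeat=3))

if __name__ == "__main__":
    total = len(RELATIONS) ** 3
    print(count_consistent_triples(), "of", total, "triples consistent under this stub")
```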
As I briefly mentioned in Chapter 4, the extraction of violations reduces the computational complexity of problems by making us focus on links in violation sets
when reconstructing a consistent temporal structure. In addition, the extraction
reduces the chance of lowering performance by preventing us from modifying links
that are not in violation sets. As I discussed in this chapter’s opening, when transitive relations with misclassified relations are given, the chance that the misclassified
relations do not create a conflict is only about 17%. Therefore, when we modify relations in consistent sets, it is more probable that the modifications will introduce additional errors. My method avoids such cases by focusing on conflicting relations.
I construct a consistent temporal structure by adding (one by one) the links that
are found in the violation sets to the set that does not contain violations. If the
links in the violation sets are composed of n temporal links, there are n! ways
to add them to the set. When we try a sequence of temporal links from the n!
ways, we need to run TCPA after the addition of a temporal link in the sequence.
Because of the computational complexity of trying n! ways, I examine two heuristics for adding the links in violation sets to the links not in violation sets.
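Each such check relies on TCPA, which was introduced in Chapter 4. As a rough illustration of what one consistency check involves, the sketch below runs a path-consistency-style tightening loop over candidate relation sets. The compose() stub and the data layout are assumptions for illustration, not the dissertation's TCPA implementation, and inverse relations are omitted for brevity.

```python
ALL_RELATIONS = frozenset({"before", "after", "includes", "is_included", "simultaneous",
                           "begins", "begun_by", "ends", "ended_by", "ibefore", "iafter"})

def compose(r1, r2):
    """Stub composition table: a few safe entries, everything else unconstrained."""
    table = {("before", "before"): {"before"}, ("after", "after"): {"after"}}
    return table.get((r1, r2), set(ALL_RELATIONS))

def propagate(nodes, labels):
    """labels maps ordered pairs (i, j) to a set of candidate relations.
    Repeatedly intersect labels[(i, k)] with the composition of labels[(i, j)]
    and labels[(j, k)]; return False as soon as a label set becomes empty."""
    changed = True
    while changed:
        changed = False
        for i in nodes:
            for j in nodes:
                for k in nodes:
                    if len({i, j, k}) < 3:
                        continue
                    if not all(p in labels for p in [(i, j), (j, k), (i, k)]):
                        continue
                    allowed = set()
                    for r1 in labels[(i, j)]:
                        for r2 in labels[(j, k)]:
                            allowed |= compose(r1, r2)
                    tightened = labels[(i, k)] & allowed
                    if not tightened:
                        return False            # inconsistency: nothing remains possible
                    if tightened != labels[(i, k)]:
                        labels[(i, k)] = tightened
                        changed = True
    return True

# Tiny usage example: a before b, b before c, but a after c -> inconsistent.
example = {("a", "b"): {"before"}, ("b", "c"): {"before"}, ("a", "c"): {"after"}}
print(propagate(["a", "b", "c"], example))      # False
```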
Two alternative heuristics are examined in the process of reconstructing a consistent temporal structure. The first heuristic method assumes that relations with higher confidence scores are more likely to be correct, and adds links from the violation sets to the links not in violations in decreasing order of confidence score.
The other heuristic method first adds links with the most constrained relations (i.e.,
the link with the smallest number of relations consistent with the set of relations
that were not identified as containing a conflict). The second heuristic is based on
the assumption that relations that do not contain a conflict are correct. In the first
experiment of this chapter, I evaluate the performances of both methods on the
extracted violations from Section 4.6.
As I discussed in detail in Section 4.4, the temporal links in the corpora are
sparsely connected. In the second experiment of this chapter, I examine the performance of the restoration methods proposed in Section 5.2 after increasing connectivity between temporal entities. When we add temporal links between time
expressions and between a time expression and DCT, the additional links make it
possible to find more conflicts. For example, when five links with classified relations
are given, such as event1 — event2, event1 — dct, event2 — dct, event1 — time1,
and event2 — time2, represented by solid lines in Figure 5.1, and event1 — time1
and event2 — time2 have misclassified relations, the misclassified relations do not
cause a conflict. After adding time2 — dct and time1 — dct, the misclassified relations can cause conflicts.
The additional links may also be additional pieces of evidence pointing to the nonexistence of errors. If there are no misclassified relations among the five links represented by solid lines in Figure 5.1, the triangular connection of event1 — event2,
event1 — dct, and event2 — dct is the only evidence of no errors among the three
links if we do not consider the additional time-time and time-dct links. Moreover,
there is no evidence that event1 — time1 and event2 — time2 have no misclassified relations. The addition of time1 — time2, time1 — dct, and time2 — dct
makes it possible to verify the non-existence of misclassified relations in various ways. In the case of event1 — dct, the triangular connection with event1 — time1 and time1 — dct and the rectangular connection with event1 — event2, time2 — dct, and event2 — time2 also provide pieces of evidence that the classified relation of event1 — dct has been correctly classified. In the second experiment of this chapter, I add time-time and time-dct links to the links from Section 3.2 and examine the influence of the added links when extracting violations and reconstructing consistent temporal relations.

Figure 5.1: Addition of time-time and time-dct links
The increased connectivity from adding additional links does not guarantee performance improvement, as shown by Chambers and Jurafsky (2008), Tatu and
Srikanth (2008), and Yoshikawa et al. (2009). These studies failed to achieve
improved performance on classified relations with transitive constraints when the
constraints were applied to sparsely connected temporal links. When additional
temporal links such as time-time were added, Chambers and Jurafsky (2008) and Yoshikawa et al. (2009), but not Tatu and Srikanth (2008), achieved some improvement, as Table 5.1 shows.
Study                          Data                      Target Relations         Additional Links                  Improved
Chambers and Jurafsky (2008)   event-event with          before and after         event-event from other study      Yes
                               before and after                                   and time-time
Tatu and Srikanth (2008)       event-event               six merged relations     time-time                         No
Yoshikawa et al. (2009)        TempEval-1                before, after, overlap   time-time                         Yes

Table 5.1: Influence of additional links in previous studies
The comparison of Chambers and Jurafsky (2008) and Tatu and Srikanth (2008)
indicates the possibility that the number of target relations can be influential in the
performance of a system constructing a consistent temporal structure. Both studies
used event-event links from TimeBank and AQUAINT corpora as target links, as
shown in Table 5.1. Chambers and Jurafsky (2008), who used only two relations as
target relations, achieved an improvement after adding additional links, but Tatu
and Srikanth (2008), who used 6 relations, failed to get an improvement in spite of
the additional links. The experiment in this chapter uses eleven relations as target
relations. It will show whether the increased connectivity can lead to performance
improvement with eleven target relations.
This chapter shows how effective the heuristic consistency restoration methods
that I propose are in removing conflicts and recovering target relations. This study
examines the performance of the consistency restoration methods in two environments: (1) temporal links from Section 4.6 and (2) temporal links from Section 4.6
+ additional time-time links. In the following sections, I first explain the heuristics.
Second, I evaluate my consistency restoration methods on temporal links from Section 4.6. Finally, after adding additional time-time links, I examine whether there is
a performance difference in the extraction step and the restoration step.
5.2 Consistency Restoration Heuristics
After extracting violations from classified relations in a document, we can categorize the links in the document into two types: temporal links in violations and
temporal links not in violations. The construction of a consistent temporal structure of a document can be defined as the process of resolving the conflicts in the
links in violations. The resolution can be composed of two processes: choosing a
link with a misclassified relation and recovering a correct relation. I carry out the
resolution process by adding the temporal links in violations into the temporal links
not in violations one by one. In the resolution process, I use heuristics to decide
which link has an error and what the correct relation is.
In this section, I explain two heuristic methods that are used to resolve the
extracted conflicts: (1) a heuristic method that adds relations with higher scores
first and (2) a heuristic method that adds more constrained relations first. Since
we assume that the relations of the links not in violations (unextracted relations)
are correct, the general method used is to add links from the extracted set to the
unextracted set one by one.
Let us suppose the relations and the confidence scores in Figure 5.2 have been
extracted by our classifiers. The extraction method in Chapter 4 splits the links
in Figure 5.2 into links in violations (c ==includes==⇒ b, d ==before==⇒ c, and d ==after==⇒ b) and links not in violations (d ==after==⇒ a and b ==before==⇒ a).

Figure 5.2: Illustrative example of the error identification task. Grey-colored relations are errors; boldfaced relations are the target relations.
b =?⇒ a: (before, 0.4), (after, 0.3), (includes, 0.15), . . .
d =?⇒ a: (after, 0.4), (before, 0.3), (is_included, 0.1), . . .
c =?⇒ b: (includes, 0.5), (after, 0.2), (ends, 0.1), . . .
d =?⇒ c: (before, 0.45), (includes, 0.3), (after, 0.15), . . .
d =?⇒ b: (after, 0.6), (includes, 0.2), (iafter, 0.1), . . .
To identify the error in the links in violations, I consider the three conflicting relations as potential candidates. The error should be detected and corrected. I use
relations without a violation, such as the before relation between b and a and the
after relation between d and a in Figure 5.2, as grounding relations that are presumed correct. I add each link from the extracted violations to the grounding relations. The order in which links are added is based on heuristics. Whenever an addition causes a conflict, I consider the relation of the added link to be the misclassified relation, and it is to be replaced by a different relation. When using the score-based heuristic, I select the link having the highest score, i.e., the link between b and d in Figure 5.2. When using the entailment heuristic, I select the link having the smallest number of entailed relations, which is also the link between b and d, because the two grounding links, b ==before==⇒ a and d ==after==⇒ a, allow only the after relation on the link from d to b. If the replacement of the targeted relation by another relation does not give rise to a conflict with the grounding relations, I assume the replaced relation is correct.
The score-based heuristic method (Score Heuristic) for reconstructing a consistent temporal structure, as briefly described in the previous paragraph, gives
priority to a temporal link with a higher scored relation when adding links with
conflicting relations into links without conflicting relations. When an added relation
causes a conflict, it is replaced by a relation with the next higher score in the link.
This process is iterated until the replacement of a relation does not cause a conflict.
In the example in Figure 5.2, the grounding relations are b ==before==⇒ a and d ==after==⇒ a, as shown in Figure 5.3. This method first adds d ==after==⇒ b to the grounding relations because d ==after==⇒ b has the highest score of 0.6, compared with the 0.5 and 0.45 of c ==includes==⇒ b and d ==before==⇒ c. The method then adds c ==includes==⇒ b with the next highest score. The additions do not cause a conflict, as in Figure 5.4. After the addition of d ==before==⇒ c, the execution of TCPA on the graph with the added link detects a conflict because d ==before==⇒ c is not compatible with c ==includes==⇒ b and d ==after==⇒ b, which would only allow after, iafter, is_included, and ends as a relation of d =?⇒ c.
The method selects one of the remaining 10 alternative relations to the targeted
classified relation after the addition of the classified one causes a conflict. In the
example, includes has the next highest score of 0.3. Therefore, the next replacement
for before is includes. But this replacement still causes a conflict with the other
relations. A relation that does not cause a conflict and has the highest score among
the other relations is after. After inserting after, the method has a consistent temporal structure, as shown in Figure 5.5. The algorithm of this method is given in Algorithm 3.
Figure 5.3: Grounding relations
Figure 5.4: Graph after the additions of the first two links using score heuristics
input : relations without a violation and relations within violations with scored relations
output : consistent relations
while size(conflicting relations) > 0 do
    get the highest scored relation among relations within violations;
    add the relation to relations without a violation;
    run Temporal Constraint Propagation Algorithm on relations without a violation;
    if a conflict is found after the addition then
        repeat
            select the next highest scored relation among alternative relations;
            replace the added relation by the next highest scored one;
            run Temporal Constraint Propagation Algorithm on the relations with the replaced one;
        until no conflict is found;
    end
end
Algorithm 3: Score Heuristic Method
Figure 5.5: Final consistent graph using Score Heuristic
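A compact Python rendering of the Score Heuristic might look as follows. It is a sketch under the assumption that links are hashable identifiers, that each violating link carries its classifier candidates sorted by descending score, and that is_consistent() stands in for running TCPA on the accumulated relations; it is not the dissertation's implementation.

```python
def score_heuristic(grounding, violating, is_consistent):
    """grounding: {link: relation} for links outside violations (taken as correct).
    violating: {link: [(relation, score), ...]} sorted by descending score.
    is_consistent: callable on a {link: relation} dict, standing in for TCPA.
    Returns a relation assignment covering both groups of links."""
    accepted = dict(grounding)
    # Add violating links in decreasing order of their top classifier score.
    for link in sorted(violating, key=lambda l: violating[l][0][1], reverse=True):
        for relation, _score in violating[link]:
            accepted[link] = relation
            if is_consistent(accepted):
                break                    # keep the first (highest-scored) relation that fits
        else:
            del accepted[link]           # no candidate fits; leave the link unresolved
    return accepted
```

With a full TCPA in place of is_consistent(), this walks through the same sequence as the worked example above: d ==after==⇒ b, then c ==includes==⇒ b, then the replacement of before on d =?⇒ c until after is reached.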
The entailment method adds links in violations into grounding relations on the
basis of the number of candidate relations in links in violations that are constrained
by grounding relations. When we run the Temporal Constraint Propagation Algorithm on grounding relations, some links in violations are also constrained and
the relations of those links that are compatible with the grounding relations become explicit. The constrained relations of links in violations are the relations propagated from the correct relations. The intuition behind this heuristic is the assumption that the likelihood
of choosing a correct relation is high when the number of constrained relations is
small. If we have two constrained relations, the chance of choosing a correct one is
50%, but the chance drops to 33% with three constrained relations.
This method begins by running TCPA on the grounding relations in Figure 5.3.
The execution of TCPA constrains which relations of temporal links can be added.
This method uses the number of constrained relations as the criterion for selecting
a link to be added to the grounding relations set. In short, the temporal link that is
most constrained by the grounding relations is the first to be added to the relations.
Figure 5.6: Graph after running TCPA on grounding relations. Relations on grey lines are constrained relations after running TCPA; "all" means all eleven target relations.
This method combines information about how constrained the links in violations
are by the grounding relations, e.g., how many potential relations the links might
have. When the link that is most constrained has several constrained relations,
the highest scored relation from its constrained relations is added. When there are
several links that are equally highly constrained, the link with the highest scored
relation is also selected and the highest scored relation is added to the grounding
relations. This method iterates the execution of TCPA and the insertion of links
with the smallest number of constrained relations.
Let us suppose that this method is applied to the example in Figure 5.2. As mentioned in Section 5.2, the execution of the violation extraction algorithm splits
classified relations into the links in violations (c ==includes==⇒ b, d ==before==⇒ c, and d ==after==⇒ b) and the links not in violations (d ==after==⇒ a and b ==before==⇒ a). TCPA constrains relations on the links in violations and results in the graph in Figure 5.6. Since the temporal link from d to b has only one possible relation, after, after running TCPA, we add d ==after==⇒ b to the grounding relations.
Figure 5.7: Graph after adding a link from d to b, labeled as after, to the grounding relations
Figure 5.8: Graph after the addition of the link from c to b
Adding this relation to the grounding relations does not further constrain the relations in the links in violations, c =?⇒ b and d =?⇒ c. Both links have all eleven relations as their relation candidates. In this circumstance, we choose the link with the highest scored relation and add the relation to the grounding relations. Since c ==includes==⇒ b has the highest score, 0.5, we add c ==includes==⇒ b.
input : relations without a violation and relations within violations with scored relations
output : consistent relations
while size(relations within violations) > 0 do
    run Temporal Constraint Propagation Algorithm on relations without a violation;
    get the link with the most constrained relations among links within violations;
    if the classified relation is in the constrained relations then
        add the relation to relations without a violation;
    else
        select the highest scored relation among constrained relations;
        add the selected relation to relations without a violation;
    end
end
Algorithm 4: Entailment Heuristic Method
The addition of c ==includes==⇒ b does constrain the possible relations on d =?⇒ c to
after, iafter, is_included, and ends. From the constrained relations, the highest
scored relation that is consistent with the other relations is after. Therefore, the
method adds the relation, and reconstructs the same consistent temporal structure
as the Score Heuristic. This algorithm is summarized formally in Algorithm 4.
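Algorithm 4 can be sketched in the same style. Here propagate_allowed() is an assumed helper that plays the role of TCPA: given the currently accepted relations, it returns, for every remaining violating link, the set of relations still compatible with them. The data layout mirrors the previous sketch and is illustrative only.

```python
def entailment_heuristic(grounding, violating, propagate_allowed):
    """grounding: {link: relation} taken as correct.
    violating: {link: [(relation, score), ...]} sorted by descending score.
    propagate_allowed: callable(accepted, remaining_links) -> {link: set(relations)},
    a stand-in for running TCPA and reading off the constrained relations."""
    accepted = dict(grounding)
    remaining = dict(violating)
    while remaining:
        allowed = propagate_allowed(accepted, set(remaining))
        # Most constrained link first; break ties by the highest classifier score.
        link = min(remaining, key=lambda l: (len(allowed[l]), -remaining[l][0][1]))
        candidates = allowed[link] or {remaining[link][0][0]}   # guard against an empty set
        classified = remaining[link][0][0]
        if classified in candidates:
            accepted[link] = classified      # keep the classified relation when it is entailed
        else:
            # otherwise take the highest-scored candidate among the constrained relations
            accepted[link] = next((r for r, _ in remaining[link] if r in candidates),
                                  next(iter(candidates)))
        del remaining[link]
    return accepted
```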
One difference between the Entailment Heuristic Method and the Score Heuristic
Method is how to choose a link to be added among links in violations. The Entailment Heuristic chooses the link with the most constrained relations from the links
in violations to add into grounding relations, and the Score Heuristic chooses the
link with the highest confidence score. In addition, when examining which temporal
relation should be added for the selected link, the Entailment Heuristic searches only constrained relations, but the Score Heuristic tries all relations. Table 5.2 summarizes the differences between these two methods.
                        Link Selection                               Relation Candidates
Score Heuristic         Link with the highest score                  All relations of the selected link
Entailment Heuristic    Link with the most constrained relations     Constrained relations of the selected link

Table 5.2: Differences between the Score Heuristic and the Entailment Heuristic
Figure 5.9: Graph with more links and an error. Only confidence scores of relations within violations are given for explanation.
c =?⇒ b: (includes, 0.45), (after, 0.2), (ends, 0.1), . . .
d =?⇒ c: (before, 0.6), (includes, 0.3), (after, 0.15), . . .
d =?⇒ b: (after, 0.5), (includes, 0.2), (iafter, 0.1), . . .
The characteristics of Entailment Heuristics in choosing a link to add make the
restoration of a consistent temporal structure less influenced by score information
when more connected links are given because the constrained relations are mainly
used. In addition, this method reduces the number of possible candidates.
To illustrate more completely the advantages of the Entailment Heuristic, consider an
example with more connected links, as shown in Figure 5.9. The extraction of violations from the temporal links in Figure 5.9 leaves the relations without a violation,
as in Figure 5.10. The running of TCPA reduces the number of possible relations
on d =?⇒ b, c =?⇒ b, and d =?⇒ c, as shown by the dotted lines in Figure 5.10. The link d =?⇒ b is constrained to the single relation, after. This link is, therefore, the first one added to the grounding relations. The addition of c ==includes==⇒ b follows because includes has the highest score among the constrained relations (after, begun_by, iafter, and includes) on the link from c to b. These two additions reduce the possible relations on d =?⇒ c to after, iafter, is_included, and ends, as shown in Figure 5.11. Finally, an entirely consistent structure is generated by the addition of d ==after==⇒ c because after has the highest score among the constrained relations.
In the Score Heuristic, the order of the links added is different from the order in the
Entailment Heuristic. Using the Score Heuristic, d ==before==⇒ c is added first, then d ==after==⇒ b and c ==after==⇒ b follow. In this case, the additions lead to the insertion
of two errors that are different from the target relations in the links from d to c and
from c to b.
Figure 5.10: Relations without a violation after the violation extraction. Solid lines are links without a violation and dotted lines are links with inferred relations.
Figure 5.11: Relations after adding after between d and b and includes between c and b
5.3 Identifying and Correcting Errors in TRI
In Section 4.6, I described an algorithm for extracting violations. In the previous
section, I presented two heuristic methods for identifying links with misclassified relations and constructing a consistent temporal structure. In this section, I evaluate these methods using the relations extracted in Section 4.6.
The restoration can be decomposed into two sequential steps, as shown in (12):
(12)
• Misclassified link identification: to identify the temporal links that have
a misclassified relation from the set of links that are extracted
• Target relation restoration: to restore the target relation of the link
that is selected as the misclassified link
I evaluate the two heuristics on both tasks. As I already reported in Section 4.6.2,
my violation extraction method extracted only 2% of all misclassified relations because of the loosely connected link structure of
the target temporal links. Therefore, the results in this section show how good my
restoration methods are at restoring the target relations among this 2% of cases.
I had 210 documents in my test data. There were 184 links extracted in total. In
these, 94 links had actual misclassified relations. When I ran the two methods on
the test data, the Score Heuristics and Entailment Heuristics changed relations of
52 and 51 links, respectively, among the 184 extracted links. Among the 52 and 51
links with modified relations, the numbers of links that really had misclassified relations were 29 and 26 with the Score Heuristic and the Entailment Heuristic, respectively. I present precision, recall, and F-measure values, calculated using the formulas in (5.1) and (5.2), in Table 5.3.
                        precision   recall   F-measure
Score Heuristic         0.558       0.322    0.408
Entailment Heuristic    0.510       0.289    0.369

Table 5.3: Performance of two restoration methods in misclassified link identification
                        accuracy
Score Heuristic         0.212
Entailment Heuristic    0.176

Table 5.4: Performance of two restoration methods in target relation restoration
precision = (correctly modified links) / (modified links)                                        (5.1)

recall = (correctly modified links) / (actual links with misclassified relations in violations)  (5.2)
Both methods achieved around 0.5 precision and around 0.3 recall. The Score
Heuristic achieved 0.04 higher measures in precision and recall. The precision values
mean that about half of the links identified as “errors” by the algorithm are “not
errors”. The wrongly selected links will lower overall performance if the restoration
method fails to get more correctly restored relations than wrongly selected links.
To evaluate the effectiveness of the two methods in restoring the target relations of
links that were in extracted violations and had misclassified relations, I calculated
the accuracy measure in Equation 5.3, which shows the proportion of correctly
restored relations in the identified links with misclassified relations.
accuracy = |correctly restored relations| / |correctly identified links|                         (5.3)
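For concreteness, the three measures in (5.1)-(5.3) can be computed as below; the counts in the usage lines are placeholders rather than the exact counts behind Tables 5.3 and 5.4.

```python
def identification_metrics(correctly_modified, modified, errors_in_violations):
    """Precision, recall, and F-measure for misclassified link identification,
    following (5.1) and (5.2)."""
    precision = correctly_modified / modified
    recall = correctly_modified / errors_in_violations
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

def restoration_accuracy(correctly_restored, correctly_identified):
    """Accuracy of target relation restoration, following (5.3)."""
    return correctly_restored / correctly_identified

# Placeholder counts, for illustration only:
print(identification_metrics(30, 60, 100))   # -> (0.5, 0.3, 0.375)
print(restoration_accuracy(6, 30))           # -> 0.2
```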
This measure also shows whether we will finally get performance degradation in
the target relation restoration step. If we identify 10 links as having misclassified
relations and 5 of the 10 links actually have correctly classified relations, these 5
links will lower the overall performance because the target relation restoration
step will change the correct relation to a wrong relation. When we restore the misclassified relations of the other five links, this cannot cause performance to drop
since the relations have already been misclassified. In short, we need the number
of correctly restored relations to be equal to or higher than the number of wrongly
identified links in order to get performance identical to or higher than the starting
performance. I got around 20% accurately restored relations among correctly identified links, as shown in Table 5.4. The ratio is, however, too low to maintain the
classification performance because the total number of all correctly identified links
is similar to the total number of wrongly identified links according to the precisions
of the misclassified link identification.
The low recall in the link identification and the low accuracy in the relation restoration meant that, after the application of the restoration step, the overall accuracies (0.515 for both the Score Heuristic and the Entailment Heuristic) were slightly lower than the accuracy of the classification step (0.516).
Chambers and Jurafsky (2008) and Tatu and Srikanth (2008) reported no improvement or slightly degraded performance before adding more links. I observed the
same results in this experiment. This experiment shows how difficult it is to get
performance improvement using transitive constraints when the given temporal
links are sparsely connected. The extracted violations contained only 2% of the
total misclassified relations in this experiment. In addition, there are two more challenging steps: misclassified link identification and target relation restoration. My
heuristic methods also achieved low performance of approximately 0.4 F-measure in
the misclassified link identification and 0.2 accuracy in the target relation restoration. Those low performances led to a performance almost identical to that in the
relation classification step instead of achieving an improvement.
5.4 Addition of Time-Time Information
Taking an idea from Chambers and Jurafsky (2008) and Tatu and Srikanth (2008), I
experiment with the addition of the information to be gleaned from temporal links
relating times to one another (time-time links) in order to increase the connectivity
of temporal links. I add temporal links between time expressions and between a
time expression and DCT to the set of automatically classified link relations. This
may improve the performance of my algorithm because the increased connectivity reduces the number of consistent combinations of relations and may therefore be beneficial in resolving conflicts. Let's suppose
we have the conflicting relations in Figure 5.12. It is not clear which one is the error
that causes the conflict only by looking at the extracted relations. Therefore, all
four links are considered error candidates. When a link from t1 to t2 is added on
the basis of the temporal values of the temporal expressions, however, the number of possible relations for the other links is reduced.

Figure 5.12: Extracted conflicting relations

For example, if after is added
to the link (t1 ==after==⇒ t2), then the number of error candidates is reduced to two: e2 ==is_included==⇒ t1 and e2 ==before==⇒ t2. This results in a conflict, indicating that there
might be an error between the two links.
I add links relating temporal expressions (both timex3 and Document Creation
Time) on the basis of the temporal information they contain (we know that “July
4, 1965” refers to a time that is before the time referred to by “July 4, 1995”). I give
the details of this process in the next section. I analyze the performance in each
step and report the final results.
5.4.1 The Increase of Time-Time Links
As described in Chambers and Jurafsky (2008), we can increase time-time links
with a closure process using TCPA, simple calculations using annotated time
stamps in TIMEX tags, and some words in time expressions such as morning, afternoon, etc. In the case of the closure process, I run TCPA on documents in the data,
extract time-time links with one relation, and add the time-time links to the documents.
Simple calculations using time-stamps also generate time-time links with a relation.
In (13), temporal expressions such as Nov. 16 and March 1, 2011 are tagged and
time-stamped as “1988-11-16” and “2011-3-1”. Based on the annotated time stamps,
we can infer that Nov. 16 is before March 1, 2011. All documents in TimeBank and
AQUAINT corpora have 1,655 annotated timex3 tags and 279 annotated temporal
links between time expressions and DCT. TimeBank and AQUAINT corpora have
over 1,300,000 unannotated temporal relations (1655 × 1654 / 2 − 279) between time expressions and between a time expression and DCT. Some temporal relations between time
expressions can be achieved using annotated time stamps.
(13) Automatic Data Processing Inc. plans to redeem on Nov. 16 its $150 million of 6.5% convertible subordinated debentures due March 1, 2011.
In order to carry out this calculation, I convert all temporal value expressions to
intervals, with specified start and end boundaries. When a time stamp has a form
with specific information such as “1988-11-16”, I convert the time stamp into an
interval with start and end time points such as "1988-11-16T00:00:00" and "1988-11-16T23:59:59". I also convert a time expression such as "the morning" into a duration when its tag has a time stamp such as "1988-11-16TMO". In the conversion
of “morning”, “afternoon”, “evening”, and “night” time expressions I assume that no
overlapping time duration exists in these expressions in a document. Based on this
assumption, “morning”, “afternoon”, “evening”, and “night” are converted into durations from 6am to 12pm, from 12pm to 6pm, from 6pm to 9pm, and from 9pm to
12am.
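The conversion rules just described can be sketched as a small function. The function name and exact coverage are my own choices for illustration; the part-of-day boundaries follow the assumption stated above, and the quarter handling anticipates the calendar-quarter assumption discussed in Section 5.4.2.

```python
from datetime import datetime, timedelta

# Part-of-day boundaries assumed in the text: morning 6am-12pm, afternoon 12pm-6pm,
# evening 6pm-9pm, night 9pm-12am.
PART_OF_DAY = {"MO": ("06:00:00", "12:00:00"), "AF": ("12:00:00", "18:00:00"),
               "EV": ("18:00:00", "21:00:00"), "NI": ("21:00:00", "23:59:59")}

def timex_to_interval(value):
    """Convert a TIMEX3 value string into a (start, end) pair of datetimes.
    Handles full dates, part-of-day suffixes, years, months, and quarters;
    returns None for values this sketch does not cover (e.g., PAST_REF)."""
    if len(value) == 4 and value.isdigit():                         # "1989"
        return (datetime(int(value), 1, 1), datetime(int(value), 12, 31, 23, 59, 59))
    if len(value) == 7 and value[4] == "-" and value[5:7].isdigit():  # "1989-09"
        year, month = int(value[:4]), int(value[5:7])
        next_month = datetime(year + month // 12, month % 12 + 1, 1)
        return (datetime(year, month, 1), next_month - timedelta(seconds=1))
    if "-Q" in value:                                               # "1989-Q4"
        year, quarter = int(value[:4]), int(value[-1])
        start_month = 3 * (quarter - 1) + 1
        start = datetime(year, start_month, 1)
        end = (datetime(year + 1, 1, 1) if quarter == 4
               else datetime(year, start_month + 3, 1)) - timedelta(seconds=1)
        return (start, end)
    if "T" in value and value.split("T")[1] in PART_OF_DAY:         # "1988-11-16TMO"
        day, part = value.split("T")
        lo, hi = PART_OF_DAY[part]
        return (datetime.fromisoformat(f"{day}T{lo}"), datetime.fromisoformat(f"{day}T{hi}"))
    try:                                                            # "1988-11-16"
        day = datetime.strptime(value, "%Y-%m-%d")
        return (day, day + timedelta(hours=23, minutes=59, seconds=59))
    except ValueError:
        return None

print(timex_to_interval("1988-11-16"))
print(timex_to_interval("1989-Q4"))
```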
When a time stamp is annotated with an underspecified duration or period (such
as “three days” or “an hour”), and a specific interval can be inferred on the basis
of other annotated information, I convert the unspecified duration into a duration
with specific time information using the specific time information of a time expression linked to the time expression with the unspecified duration. In (14), “Three
days” and “Friday” are tagged as time expressions.
(14) Three days t1 later, as of Friday t2 ’s close, the Nasdaq Composite was down
6%, compared with 5.9% for the industrial average, 5.7% for the S&P 500
and 5.8% for the Big Board Composite.
The specific time information of "Three days" is underspecified as "P3D", meaning a period of three days. But its specific time information can be calculated
through anchored time information. The tag of “Three days” refers to “Friday” as
the end point of the three days in its annotation, and “Friday” has the annotated
specific time, “1989-10-27”. The “Friday” can be used in calculating the specific time
information of “Three days”. The specific time duration of “Three days” is from
“1989-10-25T00:00:00” to “1989-10-27T23:59:59”.
When specific time information cannot be inferred for a time expression, I ignore
the time expression in calculating additional time-time links. But when the tag of
a time expression has one of the time-stamps: “PAST_REF”, “PRESENT_REF”,
or “FUTURE_REF”, I calculate temporal relations between time expressions with
the time-stamps. For example, a time expression with “PAST_REF” is before time
expressions with “PRESENT_REF” and “FUTURE_REF”. In addition, a time
expression with “PRESENT_REF” is considered to include DCT. In Example
(15), current is marked as “PRESENT_REF” and the time expression includes
the DCT.
(15) Wall Street’s plunge helped spark the current weakness in London.
After getting specific time information for all time expressions, temporal relations
between the time expressions are calculated. When the calculated temporal relations conflict with annotated temporal relations and closed temporal relations
between time expressions, the annotated and closed relations are used. In (16),
two annotated relations, e5 ==simultaneous==⇒ t40 and t41 ==ended_by==⇒ e5, make it possible to derive t41 ==ended_by==⇒ t40, which is a closed time-time link.
(16) Conflicting time-time relations
For the year t41(1989) ended e5 Sept. 30 t40(1989-09-30), Ralston earned e6 $422.5 million, or $6.44 a share, up 8.9% from $387.8 million, or $5.63 a share. . . . the full year t44(1989) . . . Ralston attributed its fourth-quarter t47(1989-Q4) . . .

Annotated TLINKs:
<TLINK relType="ENDED_BY" timeID="t41" relatedToEventInstance="e5"/>
<TLINK relType="SIMULTANEOUS" eventInstanceID="e5" relatedToTime="t40"/>
The simple calculation using time stamps results in a different relation. The temporal expressions, t41 and t40, are time-stamped as “1989” and “1989-09-30”. The
naive calculation generates t41 ==includes==⇒ t40, which is different from the closed link, t41 ==ended_by==⇒ t40. The difference happens because the year in the example specifies Ralston's fiscal year, which ends on Sept. 30th. Because my method for inferring time-time links is heuristic, it does not consider contextual information. Due to this issue with the naive calculation, I give priority to annotated and closed relations when adding time-time links.
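Once every expression has a concrete interval, a pairwise relation can be read off the endpoints. The sketch below is a simplified, assumed mapping (boundary-touching and begin/end cases are folded into coarser labels), not the dissertation's exact procedure; it reproduces the includes label that the naive calculation assigns in (16).

```python
def interval_relation(a, b):
    """Return a temporal relation label for intervals a and b, each a (start, end)
    pair of comparable values.  A simplified mapping onto the relation names used
    in this dissertation; begun_by/ended_by and meeting cases are folded into
    includes/before/after here."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_start == b_start and a_end == b_end:
        return "simultaneous"
    if a_start <= b_start and b_end <= a_end:
        return "includes"
    if b_start <= a_start and a_end <= b_end:
        return "is_included"
    return "overlap"        # partial overlaps have no single label in this sketch

# The naive calculation from (16): t41 = 1989, t40 = 1989-09-30
year_1989 = ((1989, 1, 1), (1989, 12, 31))
sept_30 = ((1989, 9, 30), (1989, 9, 30))
print(interval_relation(year_1989, sept_30))   # -> "includes", unlike the closed ended_by link
```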
                         count
annotated time-time      156
closed time-time         260
calculated time-time     2,356
added time-time          2,506
overlapping time-time    266

Table 5.5: Counts of annotated, closed, calculated, and added time-time links
5.4.2 Results and Discussion
I added originally annotated, closed, and calculated time-time links to the test data
with automatically classified relations. After the addition, I extracted violations
and restored consistent temporal structure using the violation extraction method
in Section 4.3 and two restoration methods in Section 5.2. Finally, I compared the
performances of the violation extraction and restoration methods on the data to
which time-time links had been added with the performances on only classified
relations.
I report the numbers of originally annotated, closed, and calculated time-time links
in Table 5.5. There were only 156 annotated time-time links. After the application of the closure process, an additional 260 time-time links were generated. The calculation with time stamps generated more than five times as many time-time links as the annotated and closed links combined. From the 2,356 calculated time-time links, 266 links overlapped with annotated and closed links, and 28 calculated links had conflicting relations with annotated and closed relations. Because of this, the number of time-time links added in the end was 2,506 (156 + 260 + 2,356 − 266).
Figure 5.13: An example of a violation with only correctly classified relations
Because of the heuristic nature of the time-time calculations, it was possible that
my methods applied to the entire set of inferred links in a text might give rise
to violations containing time-time links, and these might be wrong. In the example in Figure 5.13, there are no misclassified relations, and t40 ==before==⇒ t42 and t41 ==ended_by==⇒ t42 are generated through the calculation using time stamps.
The violations can occur because of my naive use of time stamps in calculating
time-time links, as shown in (16). In my simple calculation algorithm using time
stamps, I handled “YYYY-Q1”, “YYYY-Q2”, “YYYY-Q3”, and “YYYY-Q4” as
periods between YYYY-01-01 and YYYY-03-31, between YYYY-04-01 and YYYY-06-30, between YYYY-07-01 and YYYY-09-30, and between YYYY-10-01 and
YYYY-12-31. But I found that the interpretation of quarters can be different
depending on context.
In other cases, annotation errors caused extracted violations composed of only correct relations or only time-time links. In (17), the annotated temporal expression the end of 1990 has the wrong time-stamp value, 1990, instead of the correct value of 1990-12-31.
                    violations per doc   precision   recall   F-measure
classification      0.32                 0.563       0.021    0.041
adding time-time    0.64                 0.475       0.048    0.087

Table 5.6: Violation extraction results using classified relations and classified relations plus time-time links
(17) An example of annotation errors
The company has said it wants to boost non-GM revenue to at least 50% of
its total business by the end of 1990 value=1990 .
When these kinds of time-time links are added because the links do not overlap
with annotated and closed time-time links, the additional links can also cause conflicts with correctly annotated temporal links. In the experiment, after adding time-time links, I noticed that these kinds of annotation errors generated 15
incorrect time-time links in my simple calculations because the 15 links caused violations that consisted of only correctly classified relations and time-time relations.
These violations were not the violations I looked for. Therefore, I excluded the violations that were composed of only time-time links or only target relations when I
evaluated the violation extraction task using precision and recall.
Table 5.6 shows the performance of the violation extraction method on the data
with additional time-time links. After the addition of time-time links, the average
number of extracted violations per document doubled from 0.32 to 0.64. When a
document has violations, a violation consists of 3.81 links and 1.91 misclassified
relations on average. When I used only the links of classification data from Section 3.2, I got 3.1 links and 1.72 errors on average. As we can see in Table 5.6, the
precision of violation extraction on the data with time-time links was 0.09 lower
than the precision of violation extraction on the classified relations. The lower precision on the data with time-time links can be explained by the differences in the
numbers of links and errors per violation. The number of links that consist of a
violation increases by 0.7, but misclassified relations in a violation increase by 0.2.
That means that each violation set is composed of fewer misclassified relations
although the number of links in an extracted violation increases. Moreover, violations that are only extracted after the addition of time-time links have time-time
links as members. In the calculation of the precision, the time-time links increased
the count of the denominator. The precision was defined as (correctly identified links) / (identified links). The
time-time links were included in the count of identified links, but not included in
the count of correctly identified links because the time-time links do not have a misclassified relation. Therefore, the precision of violation extraction is lowered. But its
recall is higher due to the increase of extracted violations.
The extracted violations were reconstructed using two restoration methods. In the
restoration step, all time-time links moved to the links not in violations (grounding
relations). When a violation was composed of only correctly classified relations
and time-time relations, the violation also moved into the grounding relations. But
moving the violations caused a conflict in the restoration. In order to resolve the
conflict, I allowed two disjunctive relations as a label of a link. For example, moving
the violation in Figure 5.13 to the group of links without a conflict causes a conflict
in the group. In order to prevent the move from causing the conflict, I manually
modified the label of the link from t40 to t42 into t40 ==before,ends==⇒ t42.
                        accuracy
classification          0.516
Score Heuristic         0.511 (0.515)
Entailment Heuristic    0.511 (0.515)

Table 5.7: Results of restoration methods after adding time-time links (accuracies without the additional time-time links in parentheses)
As we can see in Table 5.7, both methods showed identical performance of 0.511
accuracy after the restoration step. The accuracies are 0.004 lower than accuracies (0.515) that I got when I applied the restoration to the documents without
the additional time-time links. The analysis of the performance of the misclassified
link identification task and the target relation restoration task illustrates where
the lower performance comes from. Compared to the precision and recall of the
misclassification link identification task using the data before adding time-time
links, the precision and recall of the misclassification link identification task using
the data after adding time-time links achieved lower precision and recall, as shown
in Table 5.8. Both restoration methods also had lower performance in the target
relation restoration step when the data with time-time links are used. The lower
performances in the two steps led to the lower overall accuracy. The lower performances can be explained by (1) the lack of useful information in the restoration
process and (2) the limitations of my restoration methods.
When we calculate the ratio of errors to links in a violation (errors in violation / links in violation), the ratios are 0.55 (1.72/3.1) before the addition of time-time links and 0.50 (1.91/3.81) after the addition of time-time links. The ratios mean that half of the links in a violation have misclassified relations. For example, when a violation is composed of four links, at least two links have misclassified relations. But my restoration methods usually end up modifying only one relation.
                        Misclassified Link Identification                   Target Relation Restoration
                        precision       recall          F-measure           accuracy
Score Heuristic         0.492 (0.558)   0.293 (0.322)   0.367 (0.408)       0.123 (0.212)
Entailment Heuristic    0.462 (0.510)   0.268 (0.289)   0.340 (0.369)       0.118 (0.176)

Table 5.8: Performance of two restoration methods in misclassified link identification and target relation restoration after adding time-time links (values without time-time links in parentheses)
Figure 5.14: An example of an extracted violation after adding time-time links
Figure 5.14 has two misclassified relations,
e1 ==is_included==⇒ t1 and e1 ==is_included==⇒ t2. The entity e1 has no other connections except the
links to t1 and t2. The misclassified relations are extracted after adding a time-time
link from t1 to t2. Because there is no other information except the triangular links,
the addition of the link from t1 to t2 did not provide enough information to restore
the target relations of the two links.
These kinds of links illustrated in Figure 5.14 also influence the target relation
restoration task. Only the confidence scores from classifiers are applicable to the
restoration methods in the example. My restoration method ends up changing one
relation in a violation and finding a consistent structure. This is a limitation of my
method. When we consider violations that have more than one error, the modification of an error usually leads to no improvement.
The experiment using additional time-time links achieved overall lower performance. The increased connectivity contributed to the extraction of more violations, but it did not help in the misclassified link
identification task and the target relation restoration task. The low performances
of the restoration methods in the tasks illustrate that my methods fail to correct
errors although more errors are included in violations when we use the data with
time-time links. It remains for future work to investigate how to improve the performance of the two tasks.
5.5 Conclusions
This chapter presented two methods that restore a document with conflicting relations to a consistent temporal structure. The methods give priority to confidence
scores from a classifier or inferred relations. In the first experiment, I applied the
methods to classified relations, and failed to improve performance. In the second
experiment, I examined whether I could get better performance by adding additional time-time links in order to increase connections between classified links.
The added links made it possible to extract more misclassified relations. But the
increased connectivity did not help to restore target relations.
Chapter 6
Conclusions
The popularity of applying machine learning classifiers to NLP applications has
increased because the performance of the NLP applications is enhanced through the
use of the classifiers. Popular classifiers such as SVM, Naive Bayes, and Maximum
Entropy consider each instance to be classified as independent. But some NLP
fields have instances that are dependent upon each other. Therefore, the dependencies should be taken into account when such classifiers are applied. The Temporal Relation Identification task is also an NLP field where instances are dependent. An instance is defined as a temporal link between a pair of temporal entities,
and classifiers classify the temporal relation of a temporal link. The classified relations are related to each other. This thesis examined how to find a globally consistent solution to the classified relations.
The main question addressed in this thesis was whether the application of logical constraints to the TRI task can actually result in performance improvement. This thesis showed that no performance improvement was achieved and investigated the reasons for this. For the analysis, this dissertation adopted a stepwise
approach. First, I constructed naive classifiers using three popular machine learning
classifiers in order to get classified relations. Second, in order to resolve conflicts
in the set of classified relations, I used transitive logical constraints. The logical
constraints were used to extract classified relations that violated the logical constraints and to resolve conflicts in the extracted relations. The process to resolve
conflicting relations can be decomposed into two sequential steps: (1) identifying
temporal links with a misclassified relation and (2) restoring its original target relation. Although previous studies of this task have never explicitly mentioned these
subtasks, they would have implicitly contained them. Through the analysis of each
of the steps, this thesis investigated issues that arose at each step.
This thesis made the following contributions to research into the Temporal Relation
Identification Task:
• Examining temporal relation classification systems using all qualitative relations available in the TimeBank and AQUAINT corpora as target relations.
(Section 3.2)
• Empirically showing that the use of six merged relations is not effective compared to the use of eleven qualitative relations when classifying a temporal
relation between events. (Section 3.3)
• Proposing a temporal relation classification system using boundary relations.
(Section 3.4)
• Proposing a new method that extracts temporal relations that violate transitive constraints. (Section 4.3)
• Demonstrating an empirical analysis of the maximum performance improvement expectation of a Temporal Relation Identification system compared to a
Temporal Relation Classification system. (Section 4.4)
• Introducing and comparing two heuristic methods that reconstruct a consistent temporal structure. (Section 5.2)
• Examining the effect of increased connectivity on constructing consistent
temporal structure. (Section 5.4)
6.1 Future Research
The approaches of this thesis that reconstruct consistent temporal relations still
leave various issues to be explored. Future research should (1) consider alternative
methods for restoring consistent relations from extracted violations, (2) further
utilize the application of temporal constraints in handling larger sizes of documents,
and (3) explore how to construct temporal links in a way that makes it possible to
maximally utilize temporal constraints.
Unfortunately, neither of the heuristic methods that this thesis proposed for the restoration of a consistent structure was effective. Alternative methods could of
course be investigated in the future. The first option is a greedy search approach
that examines all possible combinations of a violation and selects a consistent one
with the highest sum of confidence scores. The extracted violations were composed
of less than 4 temporal links on average (Section 5.4.2). Therefore, it should be
feasible to scrutinize all possible combinations. Another possible solution is to use
a complex machine learning classifier such as Markov Logic Network to restore
conflicting relations in a violation instead of the greedy search. If we can find an
alternative method that shows good performance, this approach may be competitive.
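The first alternative mentioned above (examining all candidate combinations of a violation and keeping the consistent one with the highest score sum) could look roughly like the sketch below; is_consistent again stands in for TCPA, and the data layout is assumed for illustration.

```python
from itertools import product

def best_consistent_assignment(violation, is_consistent):
    """violation: {link: [(relation, score), ...]} for the links in one violation.
    Exhaustively try every combination of candidate relations and return the
    consistent assignment with the highest total confidence score, or None."""
    links = list(violation)
    best, best_score = None, float("-inf")
    for combo in product(*(violation[link] for link in links)):
        assignment = {link: rel for link, (rel, _) in zip(links, combo)}
        score = sum(s for _, s in combo)
        if score > best_score and is_consistent(assignment):
            best, best_score = assignment, score
    return best
```

Because extracted violations average fewer than four links, the number of combinations stays small (11³ = 1,331 for a three-link violation with eleven candidates per link), so this search should remain feasible.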
The sizes of documents in the current temporal corpora are small because the documents are news articles. Therefore, the application of a method using transitive
constraints was acceptable, although the method has high computational complexity. When a method needs to handle a large amount of data, the use of this
Figure 6.1: A graph that can be split into two subgraphs
method would be problematic due to the complexity. For example, we can use the
approach in this thesis to construct a consistent temporal structure of several news
articles associated with an event. The size of the whole dataset, then, will be an
issue if we want to handle it as one input. One option is to remove temporal links
where the application of transitive constraints is ineffective, such as the extra links
that I identified in Section 4.3. Another option is to split the complete temporal
link graph into subgraphs that are independent. In Figure 6.1, we can split the
graph into two triangular subgraphs. What links are removable and how we can
split a graph into subgraphs are research questions that remain to be explored.
The experiments in this thesis were conducted under the ideal condition that all temporal links to be classified are given. But in the real world, the selection of temporal
links to be classified is also a task to be performed. We can select temporal links
in such a way that we can maximally utilize temporal constraints on the selected
temporal links. The selected temporal links will have many fewer possible combinations of consistent temporal relation candidates. One possible approach is to make
complete connections among temporal entities. But this approach is not practical
because it is not reasonable to classify a temporal relation between temporal entities at the beginning and the end of a document. Therefore, it will be useful to
investigate how to construct temporal link structures in a reasonable way.
Appendix A
Conflicting Annotated Temporal Relations
This appendix lists annotated temporal links that my violation extraction method
found.
A.1 TimeBank 1.2
Figure A.1: AP900816-0139.tml
Figure A.2: APW19980227.0468.tml
Figure A.3: CNN19980126.1600.1104.tml
Figure A.4: CNN19980227.2130.0067.tml
Figure A.5: NYT19980206.0460.tml
Figure A.6: NYT19980402.0453.tml
Figure A.7: PRI19980303.2000.2550.tml
Figure A.8: wsj_0032.tml
Figure A.9: wsj_0160.tml
Figure A.10: wsj_0169.tml
Figure A.11: wsj_0505.tml
Figure A.12: wsj_0542.tml
Figure A.13: wsj_0660.tml
Figure A.14: wsj_0675.tml
Figure A.15: wsj_0762.tml
Figure A.16: wsj_0778.tml
Figure A.17: wsj_0786.tml
Figure A.18: wsj_0816.tml
Figure A.19: wsj_0927.tml
Figure A.20: wsj_1011.tml
A.2 ACQUAINT TimeML Corpus
Figure A.21: XIE19980808.0031.tml
Figure A.22: NYT20000601.0442.tml
Figure A.23: NYT20000424.0319.tml
Figure A.24: NYT20000414.0296.tml
Figure A.25: NYT20000329.0359.tml
Figure A.26: NYT20000224.0173.tml
Figure A.27: NYT20000113.0267.tml
Figure A.28: NYT20000106.0007.tml
Figure A.29: NYT20000105.0325.tml
Figure A.30: NYT19990419.0515.tml
Figure A.31: APW20000417.0031.tml
Figure A.32: APW20000403.0057.tml
Figure A.33: APW20000401.0150.tml
Figure A.34: APW20000210.0328.tml
Figure A.35: APW20000128.0316.tml
Figure A.36: APW20000115.0031.tml
Figure A.37: APW20000107.0318.tml
Figure A.38: APW199980817.1193.tml
Figure A.39: APW19991008.0151.tml
Figure A.40: APW19990506.0155.tml
Figure A.41: APW19980818.0515.tml
Figure A.42: APW19980811.0474.tml
Bibliography
Allen, James. 1983. Maintaining knowledge about temporal intervals. Communications of the Association for Computing Machinery 26(11):832–843.
Argelich, Josep, Chu Min Li, Felip Manyà, and Jordi Planes. 2008. The first
and second max-sat evaluations. Journal on Satisfiability, Boolean Modeling and
Computation 4(2-4):251–278.
Barzilay, Regina, Noemie Elhadad, and Kathleen R. Mckeown. 2002. Inferring
strategies for sentence ordering in multidocument news summarization. Journal
of Artificial Intelligence Research 17:35–55.
Bethard, Steven, James H. Martin, and Sara Klingenstein. 2007. Timelines
from text: Identification of syntactic temporal relations. In Proceedings of the International Conference on Semantic Computing. pages 11–18.
Boguraev, Branimir and Rie Kubota Ando. 2005. TimeML-compliant text
analysis for temporal reasoning. In Proceedings of the 2005 International Joint
Conference on Artificial Intelligence. pages 997–1003.
Bramsen, Philip, Pawan Deshpande, Yoong Keok Lee, and Regina Barzilay.
2006. Inducing temporal graphs. In Proceedings of the 2006 Conference on
Empirical Methods on Natural Language Processing. pages 189–198.
Chambers, Nathanael and Dan Jurafsky. 2008. Jointly combining implicit
constraints improves temporal ordering. In EMNLP ’08: Proceedings of the
Conference on Empirical Methods in Natural Language Processing. Association
for Computational Linguistics, Morristown, NJ, pages 698–706.
Chambers, Nathanael, Shan Wang, and Dan Jurafsky. 2007. Classifying temporal relations between events. In Proceedings of 45th Annual Meeting of the
Association for Computational Linguistics. pages 173–176.
Charniak, Eugene. 1996. Statistical Language Learning. MIT Press, Cambridge,
MA.
Charniak, Eugene and Mark Johnson. 2005. Coarse-to-fine n-best parsing and
maxent discriminative reranking. In Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA,
pages 173–180.
Domingos, Pedro. 2012. A few useful things to know about machine learning.
Communications of the ACM 55(10):78–87.
Ferro, Lisa, Inderjeet Mani, Beth Sundheim, and George Wilson. 2001. TIDES
temporal annotation guidelines - version 1.0.2. Technical report, The MITRE
Corporation, McLean, VA.
Gerevini, Alfonso. 2005. Processing qualitative temporal constraints. In Dov
Gabbay, Lluis Vila, and Michael Fisher, editors, Handbook of Temporal Reasoning in Artificial Intelligence, Elsevier Science, pages 247–276.
Hitzeman, Janet, Marc Moens, and Claire Grover. 1995. Algorithms for
analysing the temporal structure of discourse. In Proceedings of the Annual
Meeting of the European Chapter of the Association for Computational Linguistics. Dublin, Ireland, pages 253–260.
Hwang, Chung Hee and Lenhart K. Schubert. 1992. Tense trees as the "fine
structure" of discourse. In In Working Notes of the AAAI Fall Symposium
on Discourse Structure in Natural Language Understanding and Generation,
Asilomar . pages 232–240.
Kamp, Hans and Uwe Reyle. 1993. From Discourse to Logic: Introduction to
Model Theoretic Semantics of Natural Language. Kluwer Academic, Boston.
Kleinberg, Jon and Eva Tardos. 2005. Algorithm Design. Addison-Wesley
Longman Publishing Co., Inc., Boston, MA, USA.
Korkontzelos, Ioannis, Ioannis Klapaftis, and Suresh Manandhar. 2009. Graph
connectivity measures for unsupervised parameter tuning of graph-based sense
induction systems. In UMSLLS ’09: Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics. Association for
Computational Linguistics, Morristown, NJ, pages 36–44.
Lascarides, Alex and Nicholas Asher. 1993. Temporal interpretation, discourse
relations, and commonsense entailment. Linguistics and Philosophy 16:437–493.
Mani, Inderjeet, Marc Verhagen, Ben Wellner, Chong Min Lee, and James
Pustejovsky. 2006. Machine learning of temporal relations. In Proceedings of the
21st International Conference on Computational Linguistics and 44th Annual
Meeting of the Association for Computational Linguistics.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical
Natural Language Processing. The MIT Press.
Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993.
Building a large annotated corpus of English: The Penn Treebank. Computational
Linguistics 19:313–330.
Navigli, Roberto and Mirella Lapata. 2007. Graph connectivity measures for
unsupervised word sense disambiguation. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI). Hyderabad, India,
pages 1683–1688.
Niu, Feng, Christopher Ré, AnHai Doan, and Jude Shavlik. 2011a. Tuffy: Scaling
up statistical inference in Markov logic networks using an RDBMS. Proceedings
of the VLDB Endowment 4(6):373–384.
Niu, Feng, Ce Zhang, Christopher Ré, and Jude W. Shavlik. 2011b. Felix:
Scaling inference for Markov logic with an operator-based approach. ArXiv
e-prints abs/1108.0294.
Pham, Duc Nghia, John Thornton, and Abdul Sattar. 2006. Towards an efficient
SAT encoding for temporal reasoning. In Proceedings of the 12th International
Conference on Principles and Practice of Constraint Programming. Springer-Verlag, Berlin, Heidelberg, pages 421–436.
Pustejovsky, James, José Castano, Robert Ingria, Roser Saurí, Robert
Gaizauskas, and Andrea Setzer. 2003a. TimeML: robust specification of event
and temporal expressions in text. In IWCS-5 Fifth International Workshop on
Computational Semantics.
Pustejovsky, James, Patrick Hanks, Roser Saurí, Andrew See, David Day, Lisa
Ferro, Robert Gaizauskas, Marcia Lazo, Andrea Setzer, and Beth Sundheim.
2003b. The TimeBank corpus. In Proceedings of Corpus Linguistics 2003.
Lancaster, UK, pages 647–656.
Saurí, Roser, Jessica Littman, Robert Gaizauskas, Andrea Setzer, and James
Pustejovsky. 2006. TimeML annotation guidelines, version 1.2.1.
Schilder, Frank and Andrew McCulloh. 2005. Temporal information extraction from legal documents. In Graham Katz, James Pustejovsky, and Frank
Schilder, editors, Proceedings of the 2005 International Conference on Annotating, Extracting and Reasoning about Time and Events.
Tatu, Marta and Munirathnam Srikanth. 2008. Experiments with reasoning
for temporal relations between events. In COLING ’08: Proceedings of the
22nd International Conference on Computational Linguistics. Association for
Computational Linguistics, Morristown, NJ, pages 857–864.
UzZaman, Naushad and James F. Allen. 2010. TRIPS and TRIOS system for
TempEval-2: Extracting temporal information from text. In Proceedings of the
5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Morristown, NJ, pages 276–283.
Verhagen, Marc, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham
Katz, and James Pustejovsky. 2007. SemEval-2007 task 15: TempEval temporal
relation identification. In Proceedings of the 4th International Workshop on
Semantic Evaluations (SemEval-2007). Prague, pages 75–80.
Verhagen, Marc, Robert Gaizauskas, Frank Schilder, Mark Hepple, Jessica
Moszkowicz, and James Pustejovsky. 2009. The TempEval challenge: identifying
temporal relations in text. Language Resources and Evaluation 43(2):161–179.
Verhagen, Marc, Inderjeet Mani, Roser Saurí, Robert Knippen, Seok Bae Jang,
Jessica Littman, Anna Rumshisky, John Phillips, and James Pustejovsky. 2005.
Automating temporal annotation with TARSQI. In ACL '05: Proceedings of the
ACL 2005 Interactive Poster and Demonstration Sessions. Association for
Computational Linguistics, Morristown, NJ, pages 81–84.
Verhagen, Marc, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010.
SemEval-2010 task 13: TempEval-2. In Proceedings of the 5th International
Workshop on Semantic Evaluation. Association for Computational Linguistics,
Stroudsburg, PA, pages 57–62.
Vilain, Marc, Henry Kautz, and Peter van Beek. 1990. Constraint propagation
algorithms for temporal reasoning: A revised report. In Readings in Qualitative
Reasoning about Physical Systems, Morgan Kaufmann, San Mateo, California, pages 373–381.
Webber, Bonnie. 1988. Tense as discourse anaphor. Computational Linguistics
14(2):61–73.
Yang, Qiang and Xindong Wu. 2006. 10 challenging problems in data mining
research. International Journal of Information Technology & Decision Making
5(4):597–604.
Yoshikawa, Katsumasa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. 2009. Jointly identifying temporal relations with Markov logic. In
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL
and the 4th International Joint Conference on Natural Language Processing
of the AFNLP. Association for Computational Linguistics, Suntec, Singapore,
pages 405–413.
Zhang, Tong, Fred Damerau, and David Johnson. 2002. Text chunking based on
a generalization of Winnow. Journal of Machine Learning Research 2:615–637.
Zhou, Li, Carol Friedman, Simon Parsons, and George Hripcsak. 2005. System
architecture for temporal information extraction, representation, and reasoning
in clinical narrative reports. In AMIA Annual Symposium Proceedings. pages
869–873.