
Applications of Lexical Cohesion
Analysis in the Topic Detection and
Tracking Domain
Nicola Stokes B.Sc.
A thesis submitted for the degree of
Doctor of Philosophy in Computer Science
Supervisor: Dr Joseph Carthy
Department of Computer Science
Faculty of Science
National University of Ireland, Dublin
Nominator: Prof. Mark Keane
April 2004
Abstract
This thesis investigates the appropriateness of using lexical cohesion analysis to
improve the performance of Information Retrieval (IR) and Natural Language
Processing (NLP) applications that deal with documents in the news domain. More
specifically, lexical cohesion is a property of text that is responsible for the
presence of semantically related vocabulary in written and spoken discourse. One
method of uncovering these relationships between words is to use a linguistic
technique called lexical chaining, where a lexical chain is a cluster of related words,
e.g. {government, regime, administration, officials}.
At their core, traditional approaches to IR and NLP tasks tend to treat a
document as a ‘bag-of-words’, where document content is represented in terms of
word stem frequency counts. However, in these implementations no account is
taken of more complex semantic word associations such as the thesaural
relationships synonymy (e.g. home, abode), specialisation/generalisation (e.g. cake,
dessert) and part/whole (e.g. musician, orchestra). In this thesis we present a novel
news-oriented chaining algorithm, LexNews, which provides a means of exploring
the lexical cohesive structure of a news story. Unlike other chaining approaches that
only explore standard thesaural relationships, the LexNews algorithm also examines
domain-specific statistical word associations and proper noun phrase repetition.
We also report on the performance of some challenging, real-world applications of lexical cohesion analysis relative to ‘bag-of-words’ approaches to the same problems. In particular, we attempt to enhance New Event
Detection and News Story Segmentation performance: two tasks currently being
investigated by the Topic Detection and Tracking (TDT) initiative, a research
programme dedicated to the intelligent organisation of broadcast news and
newswire data streams. Our results for the New Event Detection task are mixed, and a consistent improvement in performance was not achieved. In contrast, our News Story Segmentation results are very positive. In addition, we explore
the effect of lexical cohesion analysis on News Story Gisting (i.e. a type of
summarisation that generates a news story title or headline), which although not
defined as an official TDT task, is still an important component of any real-world
TDT system. Our experiments show that News Story Gisting performance improves
when a lexical chaining approach to this task is adopted.
Acknowledgements
First and foremost I would like to thank my thesis supervisor Joe Carthy for having
provided unfailing support, feedback and ideas over the course of my research.
Without his belief in me none of this would have been possible.
I would also like to thank John Dunnion for introducing me to this topic in the
final year of my degree, and for his programming expertise and English grammar
lessons! Thank you also to my proofreaders, Fergus, Will, Eamonn, and Ray, and to
Gerry Dunnion for filling in the ‘vast chasms’ in my technical expertise.
I would also like to acknowledge the Eurokom gang for happy memories and much devilment, especially my roommates Paula and Maf, my coffee buddies Doireann, Aibhin, Colm and Dave, and my long-standing partner-in-crime Ray ‘Boldie’ Rafter.
During the course of my research, two wonderful collaborations emerged.
Firstly, in the Autumn/Winter semester of 2001 I spent an enlightening few months
at the Center for Intelligent Information Retrieval, University of Massachusetts.
Many thanks to James Allan and Victor Lavrenko for their guidance and support
during my stay. Secondly, from 2002 to the end of 2003 I worked on the Físchlár
News Stories System with the Department of Engineering and the Centre for Digital
Video Retrieval at Dublin City University. Thanks to Alan Smeaton and ‘the lads’
for making this a very pleasant experience.
Outside of my ‘college cocoon’ there has been a wealth of support from friends
and family, including Gra, Dee, Yvette, Sinead, Lisa; Paula and Ray (again); my
parents, landlords, and financial benefactors Joan and Brian[1]; my little brother Cu;
my nana Halligan; my wonderful boyfriend Gar and his parents Betty and George.
In particular, apologies and special recognition must go to Gar for (un)successfully
feigning interest in the merits of lexical chaining over the past 3 years, and for his
love, patience and support. Is iomaí cor sa tsaol.
[1] Financial support from Enterprise Ireland is also gratefully acknowledged.
Table of Contents

Abstract
Acknowledgements
Chapter 1  Introduction
    1.1  Thesis Goals
    1.2  Thesis Outline
Chapter 2  Lexical Cohesion Analysis through Lexical Chaining
    2.1  Cohesion, Coherence and Discourse Analysis
    2.2  The Five Types of Cohesion
    2.3  Lexical Cohesion
    2.4  Semantic Networks, Thesauri and Lexical Cohesion
        2.4.1  Longmans Dictionary of Contemporary English
        2.4.2  Roget’s Thesaurus
        2.4.3  WordNet
        2.4.4  The Pros and Cons
    2.5  Lexical Chaining: Techniques and Applications
        2.5.1  Morris and Hirst: The Origins of Lexical Chain Creation
        2.5.2  Lexical Chaining on Japanese Text
        2.5.3  Roget’s Thesaurus-based Chaining Algorithms
        2.5.4  Greedy WordNet-based Chaining Algorithms
        2.5.5  Non-Greedy WordNet-based Chaining Algorithms
    2.6  Discussion
Chapter 3  LexNews: Lexical Chaining for News Analysis
    3.1  Basic Lexical Chaining Algorithm
    3.2  Enhanced LexNews Algorithm
        3.2.1  Generating Statistical Word Associations
        3.2.2  Candidate Term Selection: The Tokeniser
        3.2.3  The Lexical Chainer
    3.3  Parameter Estimation based on Disambiguation Accuracy
    3.4  Statistics on LexNews Chains
    3.5  News Topic Identification and Lexical Chaining
    3.6  Discussion
Chapter 4  TDT New Event Detection
    4.1  Information Retrieval
        4.1.1  Vector Space Model
        4.1.2  IR Evaluation
        4.1.3  Information Filtering
    4.2  Topic Detection and Tracking
        4.2.1  Distinguishing between TDT Events and TREC Topics
        4.2.2  The TDT Tasks
        4.2.3  TDT Progress To Date
    4.3  New Event Detection Approaches
        4.3.1  UMass Approach
        4.3.2  CMU Approach
        4.3.3  Dragon Systems Approach
        4.3.4  Topic-based Novelty Detection Workshop Results
        4.3.5  Other notable NED Approaches
    4.4  Discussion
Chapter 5  Lexical Chain-based New Event Detection
    5.1  Sense Disambiguation and IR
        5.1.1  Two IR applications of Word Sense Disambiguation
        5.1.2  Further Analysis of Disambiguation for IR
    5.2  Lexical Chaining as a Feature Selection Method
    5.3  LexDetect: Lexical Chain-based Event Detection
        5.3.1  The ‘Simplistic’ Tokeniser
        5.3.2  The Composite Document Representation Strategy
        5.3.3  The New Event Detector
    5.4  The TDT Evaluation Methodology
        5.4.1  TDT Corpora
        5.4.2  Evaluation Metrics
    5.5  TDT1 Pilot Study Experiments
        5.5.1  System Descriptions
        5.5.2  New Event Detection Results
        5.5.3  Related New Event Detection Experiments at UCD
    5.6  TDT2 Experiments
        5.6.1  System Descriptions
        5.6.2  New Event Detection Results
    5.7  Discussion
Chapter 6  News Story Segmentation
    6.1  Segmentation Granularity
        6.1.1  Discourse Structure and Text Segmentation
        6.1.2  Fine-grained Text Segmentation
        6.1.3  Coarse-grained Text Segmentation
    6.2  Sub-topic/News Story Segmentation Approaches
        6.2.1  Information Extraction Approaches
        6.2.2  Lexical Cohesion Approaches
        6.2.3  Multi-Source Statistical Modelling Approaches
    6.3  Discussion
Chapter 7  Lexical Chain-based News Story Segmentation
    7.1  SeLeCT: Segmentation using Lexical Chaining
        7.1.1  The Boundary Detector
    7.2  Evaluation Methodology
        7.2.1  News Segmentation Test Collections
        7.2.2  Evaluation Metrics
    7.3  News Story Segmentation Results
        7.3.1  CNN Broadcast News Segmentation
        7.3.2  Reuters Newswire Segmentation
        7.3.3  The Error Reduction Filter and Segmentation Performance
        7.3.4  Word Associations and Segmentation Performance
    7.4  Written versus Spoken News Story Segmentation
        7.4.1  Lexical Density
        7.4.2  Reference and Conjunction in Spoken Text
        7.4.3  Refining SeLeCT Boundary Detection
    7.5  Discussion
Chapter 8  News Story Gisting
    8.1  Related Work
    8.2  The LexGister System
    8.3  Experimental Methodology
    8.4  Gisting Results
    8.5  Discussion
Chapter 9  Future Work and Conclusions
    9.1  Further Lexical Chaining Enhancements
    9.2  Multi-document Summarisation
    9.3  Thesis Contributions
    9.4  Thesis Conclusions
Appendix A  The LexNews Algorithm
    A.1  Basic Lexical Chaining Algorithm
    A.2  Lexical Chaining Stopword List
Appendix B  LexNews Lexical Chaining Example
    B.1  News Story Text Version
    B.2  Part-of-Speech Tagged Text
    B.3  Candidate Terms
    B.4  Weighted Lexical Chains
Appendix C  Segmentation Metrics: WindowDiff and Pk
Appendix D  Sample News Documents from Evaluation Corpora
    D.1  TDT1 Broadcast News Transcript
    D.2  TDT2 Broadcast News Transcript
    D.3  TDT Newswire Article
    D.4  RTÉ Closed Caption Material
References
Table of Figures

1.1   News story extract illustrating lexical cohesion in text.
2.1   Sample category taken from Roget’s thesaurus.
2.2   Number of synsets for each part-of-speech in WordNet.
2.3   A generic lexical chaining algorithm.
3.1   Example of a spurious relationship between two nouns in WordNet by not following St-Onge and Hirst’s rules.
3.2   Diagram illustrating the process of pushing chains onto the chain stack.
3.3   LexNews system architecture.
3.4   Examples of statistical word associations generated from the TDT1 corpus.
3.5   Example of noun phrase repetition in a news story.
3.6   Graph showing relationship between disambiguation error and number of senses or different contexts that a noun may be used in.
3.7   Graph showing the dominance of extra strong and medium-strength relationships during lexical chain generation.
3.8   Graph showing a breakdown of all relationship occurrences in the chaining process.
3.9   Sample broadcast news story on the Veronica Guerin movie.
3.10  WordNet noun phrase chains for sample news story in Figure 3.9.
3.11  Non-WordNet proper noun phrase chains for sample news story in Figure 3.9.
4.1   IR metrics precision and recall.
4.2   Typical Information Filtering System.
4.3   TDT system architecture.
5.1   System architecture of the LexDetect system.
5.2   The effect on TDT1 NED performance when a combined document representation is used.
5.3   Example of cross chain comparison strategy.
5.4   DET graph showing performance of the SYN system using two alternative lexical chain-based NED architectures.
5.5   DET graph showing performance of two alternative lexical chain-based NED architectures.
5.6   DET graph showing performance of the LexDetect and CHAIN systems (using the basic LexNews chaining algorithm), and the UMass system for the TDT2 New Event Detection task.
5.7   DET graph showing performance of the LexDetect and CHAIN systems (using the enhanced LexNews chaining algorithm), and the UMass system for the TDT2 New Event Detection task.
6.1   Example of fine-grained segments detected by Passonneau and Litman’s segmentation technique.
6.2   Extract taken from CNN transcript which illustrates the role of domain independent cue phrases in providing cohesion to text.
6.3   A timeline diagram of a news programme and some domain cues.
6.4   Graph representing the similarity of neighbouring blocks determined by the TextTiling algorithm for each possible boundary or block gap in the text.
6.5   Extract of CNN report illustrating the role of lexical cohesion in determining related pieces of text.
7.1   SeLeCT news story segmentation system architecture.
7.2   Sample lexical chains generated from concatenated news stories.
7.3   Chain span schema with boundary point detected at end of sentence 1.
7.4   Diagram showing characteristics of chain-based segmentation.
7.5   Diagram illustrating allowable margin of error.
7.6   Accuracy of segmentation algorithms on CNN test set.
7.7   Graph illustrating effects on F1 measure as margin of allowable error is increased for CNN segmentation results.
7.8   Accuracy of segmentation algorithms on Reuters test set.
7.9   Graph illustrating effects on F1 measure as margin of allowable error is increased for Reuters segmentation results.
7.10  Graph illustrating the effect of the error reduction filter on SeLeCT’s F1 measure for the CNN collection as the margin of allowable error increases.
7.11  Graph illustrating the effect of the error reduction filter on SeLeCT’s recall and precision for the CNN collection as the margin of allowable error increases.
7.12  Graph showing effect of word relationships on segmentation accuracy.
7.13  Example of the effect of weak semantic relationships on the segmentation process.
7.14  CNN transcript of movie review with speaker identification information.
7.15  Diagram illustrating how cohesion information can help SeLeCT’s boundary detector resolve clusters of possible story boundaries.
8.1   Recall, Precision and F1 values measuring gisting performance for 5 distinct extractive gisting systems and a set of human extractive gists.
A.1   Chaining example illustrating the need for multiple searches.
C.1   Diagram showing system segmentation results and the correct boundaries defined in the reference segmentation.
Table of Tables

2.1  Semantic relationships between nouns in WordNet.
3.1  Contingency table of frequency counts calculated for each bigram in the collection.
3.2  Disambiguation accuracy results taken from Galley and McKeown.
3.3  Results of Galley and McKeown’s evaluation strategy using the ‘accuracy’ metric and default sense assignments, compared with the recall, precision and F1 values when all disambiguated nouns are not assigned default senses.
3.4  Comparing the effect of different parameters on the disambiguation performance of the LexNews algorithm.
3.5  Chain statistics for chains generated on a subset of the SemCor collection.
3.6  Weights assigned to lexical cohesive relationships between chain terms.
5.1  Values used to calculate TDT system performance.
5.2  Miss and False Alarm rates of NED systems for optimal value of the Reduction Coefficient R on the TDT1 corpus.
5.3  Breakdown of TDT2 results into broadcast and newswire system performance.
5.4  Breakdown of document lengths in the TDT1 and TDT2 corpora.
7.1  Precision and Recall values from segmentation on concatenated CNN news stories.
7.2  Precision and Recall values from segmentation on concatenated Reuters news stories.
7.3  Results of SeLeCT segmentation experiments when verbs are added into the chaining process.
7.4  Results of C99 and TextTiling segmentation experiments when nominalised verbs are added into the segmentation process.
7.5  Improvements in system performance as a result of system modifications discussed in Sections 7.4.1 and 7.4.3.
7.6  Paired Samples T-Test on initial results from Tables 6.1 and 6.2.
7.7  Paired Samples T-Test p-values on refined results taken from Table 6.4.
C.1  Error calculations for each metric for each window shift in Figure C.1.
Chapter 1
Introduction
Humans instinctively know when sentences in a text are related. Cohesion and
coherence are two properties of text that help us to make this judgement, where
coherence refers to the fact that a text makes sense, and cohesion to the fact that
there are elements in the text that are grammatically related (e.g. ‘John’ referred to
as ‘he’) or semantically related (e.g. ‘BMW’ referred to as a ‘car’). Of these two
properties cohesion is the easier to compute because it is a surface relationship, whereas coherence requires a deeper textual understanding.
The main suspect in the shooting dead of a 58-year-old woman with a hunting
rifle was today jailed for life following the return of a unanimous verdict by a
Dublin jury. The accused originally denied the charge, but pleaded guilty to
second-degree murder when a witness came forward to testify that the murder
weapon belonged to the defendant.
Figure 1.1: News story extract illustrating lexical cohesion in text.
For example, consider the news story extract in Figure 1.1. In this text we can
see that although these sentences have no vocabulary in common (apart from
stopwords, e.g. a, the, to) they are unequivocally related due to the presence of
lexical cohesive relationships (the commonest form of cohesion found in text) between their words. In particular, looking only at the nouns in this text we find
the following clusters of related noun phrases: {suspect, accused, defendant},
{murder weapon, hunting rifle}. These clusters are examples of lexical chains
generated from the text using thesaural-based relationships where, according to the
WordNet thesaurus (Miller et al., 1990), ‘suspect’ is a synonym of ‘defendant’, the
‘accused’ is a specialisation of both ‘suspect’ and ‘defendant’, and ‘hunting rifle’ is
a specialisation of a ‘murder weapon’.
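As an illustration of how such thesaural relationships can be checked automatically, the following minimal Python sketch queries WordNet through the NLTK interface. It is not the chaining algorithm developed in this thesis; the word pairs are illustrative only, and the relationships actually returned depend on the WordNet version installed.

    # Minimal sketch (not the LexNews algorithm): label the WordNet relationship
    # between two nouns as synonymy, specialisation or generalisation.
    from nltk.corpus import wordnet as wn

    def thesaural_relation(word1, word2):
        for s1 in wn.synsets(word1, pos=wn.NOUN):
            for s2 in wn.synsets(word2, pos=wn.NOUN):
                if s1 == s2:
                    return 'synonymy'                        # words share a synset
                if s2 in s1.closure(lambda s: s.hypernyms()):
                    return 'specialisation'                  # word1 is a kind of word2
                if s1 in s2.closure(lambda s: s.hypernyms()):
                    return 'generalisation'                  # word2 is a kind of word1
        return 'no thesaural link found'

    print(thesaural_relation('suspect', 'defendant'))
    print(thesaural_relation('rifle', 'weapon'))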
In the course of this thesis, we analyse lexical cohesion in text using this
technique of lexical chaining. However, unlike previous approaches we also
examine lexical cohesive relationships between words that cannot be defined in thesaural terms, but that are considered ‘intuitively’ related due to their regular co-occurrence in text. There are many examples of these co-occurrence relationships in the text extract in Figure 1.1. More specifically, these words are related through their frequency of use in similar news story contexts relating to criminal law, e.g. {life, verdict, jury, guilty, witness, charge, murder}.
However, the main focus of this thesis is the use of lexical cohesion analysis in
challenging Natural Language Processing (NLP) and Information Retrieval (IR)
tasks, with the intention of improving performance over standard techniques. One
of the most common approaches to text analysis is based on an examination of word
frequency occurrences in text, where the intuition is that high frequency words
represent the essence of a particular discourse. For example, word frequency
information has been used to build extractive summaries, where sentences that
contain many high frequency words are included in the resultant summary. Word
frequency information also forms the basis of most approaches to IR, where for
example in an ad hoc retrieval situation documents are ranked in order of relevance
to a query based on the frequency of occurrence of the query words in each
document. However, in these ‘bag-of-words’ techniques frequency counts are
calculated with respect to exact syntactic repetition, while other forms of repetition
such as synonymy, specialisation/generalisation and part/whole relationships are
ignored. Hence, a ‘bag-of-words’ analysis of the previously discussed news extract
would have found little similarity between the two sentences except for stopwords.
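To make the contrast concrete, the toy sketch below ranks documents by nothing more than raw query-term frequency, the kind of ‘bag-of-words’ scoring described above. The corpus, query and scoring function are invented purely for illustration and do not correspond to any system evaluated in this thesis.

    # Toy 'bag-of-words' ranking: a document's score is simply the number of
    # times the query terms occur in it; no thesaural relationships are used.
    from collections import Counter

    def tokenise(text):
        return [w.lower().strip('.,!?') for w in text.split()]

    def score(query, document):
        counts = Counter(tokenise(document))
        return sum(counts[term] for term in tokenise(query))

    docs = ["The jury returned a unanimous murder verdict.",
            "The defendant pleaded guilty to second-degree murder."]
    query = "murder verdict jury"
    for doc in sorted(docs, key=lambda d: score(query, d), reverse=True):
        print(score(query, doc), doc)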
Thus it appears that there are many tasks that could benefit from the additional
textual knowledge provided by a lexical cohesion analysis of text. In this thesis we
test this hypothesis in the relatively new IR research area of Topic Detection and
Tracking (TDT). The TDT initiative (Allan et al., 1998a; Allan, 2002a) is
concerned with the organisation of streams of broadcast news and newswire data
into a collection of structured information that satisfies a set of user needs, in
particular:
News Story Segmentation: The segmentation of broadcast news programmes
into distinct news stories.
Event Tracking: The tracking of a known event (given a set of related news
stories) as documents arrive on the input stream.
Cluster Detection: The clustering of similar news stories into distinct non-overlapping groups.
New Event Detection: The detection of breaking news stories as they arrive on
the news stream.
The novelty of these tasks in comparison to previous IR research is that they are
required to operate on real-time news streams from a variety of media sources
(radio, television and newswire) rather than on a static, retrospective newspaper
collection. This requirement makes these filtering-based tasks (excluding Cluster
Detection) more difficult than standard query-based retrieval and text classification
tasks as relevancy decisions must be made based on only those documents seen so
far on the input stream and without any knowledge of subsequent postings. Also the
TDT evaluation defines a finer-grained notion of ‘aboutness’ than is found in
standard IR evaluations. For example, a TDT system is not only required to find all
documents on a topic like ‘the OJ Simpson trial’, but also the system must be able
to distinguish between the different events making up this topic, e.g. ‘OJ Simpson’s
arrest’ and ‘the DNA evidence presented in the trial’. What makes this an
interesting application domain for lexical cohesion analysis is that TDT participants
have found that standard IR approaches to these tasks have reached a performance plateau, and that new techniques are required in order to effectively tackle these
complex problems.
1.1
Thesis Goals
This thesis addresses five primary research goals:
To develop a novel lexical chaining method that provides a full analysis of lexical cohesive relationships in text by considering not only thesaural-based links between words, as previous chaining approaches have done, but also domain-specific statistical associations between these words that are not defined in the WordNet taxonomy.
To establish which IR and NLP applications most benefit from the lexical
cohesion analysis provided by lexical chains. In particular, to determine whether
New Event Detection, Story Segmentation and News Story Gisting performance
can be improved by considering a richer semantic view of a text than that which
is provided by a ‘bag-of-words’-based approach to these problems.
To determine the performance of these applications in a previously unexplored
domain for lexical cohesion analysis, i.e. TDT broadcast news.
To use large-scale evaluation methodologies for ascertaining application
performance. This goal is prompted by the observation that previous research
efforts involving applications of lexical chaining have involved small-scale
evaluations which provide little conclusive evidence of system effectiveness.
To comment on the extent to which an NLP technique like lexical chaining is
affected by ‘noisy’ data sources such as speech transcripts and closed caption
material taken from broadcast news reports.
A set of secondary goals arising from the above are also addressed in the thesis:
To establish the sense disambiguation accuracy of our LexNews lexical
chaining algorithm, as this facet of the algorithm has implications for the
performance of the IR and NLP applications explored in this thesis.
To propose a novel method of integrating lexical cohesion information into an
IR model, and to investigate how this technique performs with respect to the
standard conceptual indexing strategy put forward by many lexical chaining
approaches to IR problems, where words are replaced with WordNet synsets,
e.g. concepts such as {airplane, aeroplane, plane}. Our chosen task for this
investigation is TDT New Event Detection.
To discover whether a lexical chain-based segmentation strategy that was
previously proposed for sub-topic segmentation analysis is a reliable method for
determining the boundaries between adjacent news stories in a broadcast news
programme.
To implement and evaluate a lexical chain-based News Story Gisting system
that verifies that text summarisation tasks are appropriate vehicles for lexical
cohesion analysis.
1.2
Thesis Outline
This thesis is organised into four parts: the first part is dedicated to lexical chaining
(Chapters 2 and 3), the second to New Event Detection (Chapters 4 and 5), the third
to News Story Segmentation (Chapters 6 and 7), and the fourth to our initial
experiments concerning News Story Gisting (Chapter 8).
Chapter 2 is concerned with the notion of lexical cohesion as a property of text,
and how this textual characteristic can be analysed using a word clustering method
called lexical chaining. Since many types of lexical cohesion can be discovered by
examining thesaural relationships between words in a text, the merits and
weaknesses of three knowledge sources capable of providing these relationships are
discussed. A detailed overview of contemporary approaches to lexical chain generation and its applications is then given in the remainder of the chapter.
Chapter 3 presents our lexical chaining algorithm, LexNews. We establish the
values of a number of important parameters in the algorithm using the SemCor
corpus, i.e. a collection of documents manually tagged with WordNet synsets. This
collection also facilitates the comparison of the performance of our disambiguation
algorithm with another approach to chain generation developed at Columbia
University. The chapter ends with a motivating example of how lexical chains
capture the topicality of a news document.
Chapter 4 presents an overview of common approaches used to address
information retrieval problems, since these techniques form the basis of many of the
approaches taken in TDT implementations. The research objectives of the TDT
initiative are then introduced, while the remainder of the chapter focuses on New
Event Detection, TDT participant approaches to the problem, and a number of
important conclusions drawn from TDT workshops.
Chapter 5 describes the LexDetect system, our lexical chain-based approach to
the New Event Detection task. The lexical cohesion information provided by the
LexNews algorithm is used as a means of representing the essence of a news story.
This linguistically-motivated document representation is then integrated into a
traditional vector space modelling (VSM) IR approach. The experiments described
in this chapter are split into two parts: those performed on the TDT1 corpus using
our own implementation of the VSM, and those performed on the TDT2 corpus
using the UMass New Event Detection (VSM-based) system where both systems
incorporate a lexical chain representation of a news story in their implementations.
Chapter 6 provides some background on text segmentation methods, another
application of our LexNews algorithm. This chapter examines segmentation
approaches with respect to the granularity of the text segments that they produce.
Coarse-grained approaches, such as News Story Segmentation techniques, are
explored in detail as a prelude to the work described in the following chapter.
Chapter 7 describes the SeLeCT system, our lexical chain-based approach to
News Story Segmentation. The performance of the SeLeCT system is evaluated
with respect to two other well-known lexical cohesion approaches to segmentation:
the C99 and TextTiling algorithms. This chapter also investigates the effect of
different news media (i.e. spoken broadcast news versus written newswire) on
segmentation performance.
Chapter 8 discusses the results of our initial experiments on the application of
lexical cohesion analysis to News Story Gisting. The evaluation of the LexGister
system is two-fold: firstly, the results of an automatic evaluation based on recall and
precision values are reported, and secondly the results of a manual evaluation
involving a group of human judges are discussed.
Finally, in Chapter 9, our plans for future work are presented followed by a
summary of the research contributions and conclusions arising from this work.
Chapter 2
Lexical Cohesion Analysis through Lexical
Chaining
In this chapter we introduce the fundamental linguistic concepts necessary for
understanding one of the main focuses of this thesis:
The identification of lexical cohesion in text using a linguistic technique called
lexical chaining that discovers naturally occurring clusters of semantically related
words in text, e.g. {jailbreak, getaway, escape} and {shotgun, firearm, weapon}.
In the following sections we establish where lexical cohesion fits into the general
framework of textual properties, and how lexical cohesion in text can be
represented using lexical chains. Since lexical cohesion is realised in text through
the use of related vocabulary, knowledge sources such as thesauri and dictionaries
have been used as a means of identifying lexical cohesive ties between words.
Hence, we review the pros and cons of three lexical knowledge sources: Longmans
Dictionary of Contemporary English, Roget’s thesaurus and the WordNet
taxonomy. This is followed by an in-depth review of the different approaches to
lexical chain generation discussed in the literature and the performance of various
NLP and IR applications of this technique.
2.1
Cohesion, Coherence and Discourse Analysis
When reading any text it is obvious that it is not merely made up of a set of
unrelated sentences, but that these sentences are in fact connected to each other
through the use of two linguistic phenomena, namely cohesion and coherence. As
Morris and Hirst (1991) point out, cohesion relates to the fact that the elements of a
text (e.g. clauses) ‘tend to hang together’; while coherence refers to the fact that
‘there is sense (or intelligibility) in a text’.
Observing the interaction between textual units in terms of these properties is
one way of analysing the discourse structure of a text. Most theories of discourse
result in a hierarchical tree-like structure that reflects the relationships between
sentences or clauses in a text. These relationships may, for example, highlight
sentences in a text that elaborate, reiterate or contradict a certain theme. Meaningful
discourse analysis like this requires a true understanding of textual coherence which
in turn often involves looking beyond the context of the text, and drawing from
real-world knowledge of events and the relationships between them.
Hasan, in her paper on ‘Coherence and Cohesive Harmony’ (1984),
hypothesises that the coherence of a text can be indirectly measured by analysing
the degree of interaction between cohesive chains in a text. Analysing cohesive
relationships in this manner is a more manageable and less computationally
expensive solution to discourse analysis than coherence analysis. For example,
Morris and Hirst (1991) note that, unlike research into cohesion, there has been no
widespread agreement on the classification of different types of coherence
relationships[2]. Furthermore, they note that even humans find it more difficult to
identify and agree on textual coherence because, although identifying cohesion and
coherence are subjective tasks, coherence requires a definite ‘interpretation of
meaning’, while cohesion requires only an understanding that terms are about ‘the
same thing’.
[2] Some coherence relationships that have been identified between sentences or clauses are
elaboration, cause, support, exemplification, contrast and result. In recent work by Harabagiu
(1999), attempts have been made to map specific patterns of lexical cohesion coupled with the
occurrence of certain discourse markers directly to these coherence relationships, e.g. contrast in text
is indicated by the existence of both the discourse marker ‘although’ and an antonymy relationship
(e.g. dead-alive or happy-sad).
To get a better idea of the difference between the two, consider the following
example:
After a night of heavy drinking the party fizzled out at around 6am. They then ate
breakfast while watching the sunrise.
These sentences are only weakly cohesive. Nevertheless, a deeper understanding of
the concept ‘morning’ makes the existence of a coherence relationship between the
two sentences highly plausible. However, in the more usual case where an area of
text shares a set of cohesively related terms, Morris and Hirst hypothesise that
cohesion is a useful indicator of coherence in text especially since the identification
of coherence itself is not computationally feasible at present. Stairmand (1996)
further justifies this hypothesis by emphasising that although cohesion fails to
account for grammatical structure (i.e. readability) in the way that coherence does,
cohesion can still account for the organisation of meaning in a text, and so, by
implication, its presence corresponds to some form of structure in that text.
2.2
The Five Types of Cohesion
As stated in the previous section, cohesion refers to the way in which textual units
interact in a discourse. Halliday and Hasan (1976) classify cohesion into five (not
always distinct) classes:
Conjunction is the only class which explicitly shows the relationship between
two sentences, ‘I have a cat and his name is Felix’.
Reference and lexical cohesion, on the other hand, indicate sentence
relationships in terms of two semantically equivalent or related words.
o In the case of reference, pronouns are the most likely means of conveying
referential meaning. For example, consider the following sentences: ‘“Get
inside now!” shouted the teacher. When nobody moved, he was furious’. In
order for the reader to understand that ‘the teacher’ is being referred to by
the pronoun ‘he’ in the second sentence, they must refer back to the first
sentence.
o Lexical cohesion arises from the selection of vocabulary items and the
semantic relationships between them. For example, ‘I parked outside the
library, and then went inside the building to return my books’, where
cohesion is represented by the semantic relationships between the lexical
items ‘library’, ‘building’ and ‘books’.
Substitution and Ellipsis are grammatical relationships, as opposed to
relationships based on word meaning or semantic connection.
o In the case of nominal substitution, a noun phrase such as ‘a vanilla ice-cream cone’ can be replaced by the indefinite article ‘one’ as shown in the
following example, ‘As soon as John was given a vanilla ice-cream cone,
Mary wanted one too’.
o Ellipsis is closely related to substitution as it is often described as the
special case of ‘zero substitution’, where a phrase such as ‘in my exams’ is
left out as it is implied by the preceding sentence which contains the phrase
‘in your exams’. For example, ‘Did you get a first in your exams? No, I only
got a third’.
For automatic identification of these relationships, lexical cohesion is the easiest to
resolve since less implicit information is needed to discover these types of
relationship between words in a text. In the sample sentence used to define lexical
cohesion we identified a generalisation relationship between ‘library’ and ‘building’
and a has-part relationship between ‘library’ and ‘books’. These are two of the five types of lexical cohesive relationship explored in the following section.
2.3
Lexical Cohesion
Lexical Cohesion ‘is the cohesion that arises from semantic relationships between
words’ (Morris, Hirst, 1991). Halliday and Hasan (1976) define five types of
lexical cohesive ties that commonly occur in text. Here are a number of examples
taken from a collection of CNN news story transcripts, since the news story domain
is the focus of our analysis in this thesis:
Repetition (or Reiteration) – Occurs when a word form is repeated in a later section of the text. ‘In Gaza, though, whether the Middle East's old violent cycles continue or not, nothing will ever look quite the same once Yasir Arafat come to town. We expect him here in the Gaza Strip in about an hour and a half, crossing over from Egypt’.
Repetition through synonymy – Occurs when words share the same meaning, but have two unique syntactical forms[3]. ‘Four years ago, it passed a domestic violence act allowing police, not just the victims, to press charges if they believe a domestic beating took place. In the past, officers were frustrated, because they'd arrive on the scene of a domestic fight, there'd be a clearly battered victim and yet, frequently, there'd be no one to file charges.’
Word association through specialisation/generalisation – Occurs when a specialised/generalised form of an earlier word is used. ‘They've put a possible murder weapon in O.J. Simpson's hands; that's something that no one knew before. And it shows that he bought that knife more than a month or two ahead of time and you might, therefore, start the theory of premeditation and deliberation.’
Word association through part-whole/whole-part relationships – Occurs
when a part-whole/whole-part relationship exists between two words, e.g.
‘committee’ is made up of smaller parts called ‘members’. ‘The Senate Finance
Committee has just convened. Members had been meeting behind closed doors
throughout the morning and early afternoon.’
Word association through collocation - These types of relationships occur
when the nature of the association between two words cannot be defined in
terms of the above relationship types. These relationships are most commonly
found by analysing word co-occurrence statistics, e.g. ‘Osama bin Laden’ and
‘the World Trade Centre’. Halliday and Hasan also classify antonymy in this
category of word relationship. Antonyms are words that are exact semantic
opposites or complementaries, e.g. male-female, boy-girl, adult-child.
All of these relationships, except statistical word co-occurrences, are types of
lexicographical relationships that can be extracted from a domain-independent
thesaurus. Statistical associations between words, on the other hand, are generated
from domain-specific corpora that reflect the most commonly used senses of words
in domains such as American Broadcast News or Inorganic Chemistry. The role of the thesaurus in identifying lexical cohesive structure is discussed in the following section, while a description of how to generate these co-occurrences is given in Section 3.2.1.

[3] More formally, synonymy refers to the relationship between semantically equivalent words which are interchangeable in all textual contexts. In reality, true synonyms are rare; near-synonyms or plesionyms are the most common form of synonymy in text. Halliday and Hasan’s definition of synonymy also includes these near-synonyms. Hirst (1995) states that true synonymy is mostly limited to technical terms like groundhog/woodchuck. He provides the following example of near-synonymy: lie/misrepresentation, where a lie is a deliberate attempt to deceive, while the use of misrepresentation tends to imply an untruth told merely out of ignorance. So depending on the context, these terms may be intersubstitutable.
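As a rough illustration of the kind of corpus-derived association just mentioned, the sketch below counts document-level co-occurrences in a toy corpus and scores word pairs with pointwise mutual information. This is a generic example only; the actual procedure used to build the LexNews association lists is the one described in Section 3.2.1, and the toy documents are invented.

    # Generic co-occurrence sketch: score word pairs by pointwise mutual
    # information over document-level co-occurrence (illustration only).
    import math
    from collections import Counter
    from itertools import combinations

    docs = [["verdict", "jury", "murder", "charge"],
            ["jury", "witness", "charge", "guilty"],
            ["match", "goal", "referee", "penalty"]]

    n_docs = len(docs)
    word_freq = Counter(w for d in docs for w in set(d))
    pair_freq = Counter(frozenset(p) for d in docs
                        for p in combinations(sorted(set(d)), 2))

    def pmi(w1, w2):
        p_xy = pair_freq[frozenset((w1, w2))] / n_docs
        p_x, p_y = word_freq[w1] / n_docs, word_freq[w2] / n_docs
        return math.log2(p_xy / (p_x * p_y)) if p_xy else float('-inf')

    print(round(pmi("jury", "charge"), 2))    # co-occurring legal terms
    print(pmi("jury", "goal"))                # never co-occur: -inf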
We stressed in the previous section that one of the advantages of analysing cohesion is that it is a surface relationship, and thus, unlike coherence, it allows discourse structure to be modelled relatively easily by observing cohesive ties. With regard to lexical cohesion (one of the five identified categories of cohesion), Hasan explains that it is the most prolific form of cohesion in text and, as already stated, the least computationally expensive means of identifying cohesive ties.
As Hasan and many other researchers have observed, nouns usually convey
most of the information in a written text, and so identifying lexicosemantic
connections between nouns is an adequate means of determining cohesive ties
between textual units. Although verbs also make an undeniable contribution to the
grammatical and semantic content of a text (Klavans, Min-Yen, 1998), they are
more difficult to deal with than nouns for the following reasons:
1. Verbs are more polysemous than nouns. Fellbaum (1998) states that nouns in
the Collins English Dictionary have an average of 1.71 senses whereas verbs
have an average of 2.11 senses. For example, WordNet defines 3 senses for the
noun form of the word ‘close’ and 14 senses for its verb form. Fellbaum
suggests the reason for this is that there are fewer verbs than nouns in the
English lexicon (in spite of the fact that all sentences need a verb), and so to
compensate for this, verb meanings tend to be more flexible. More specifically,
the meaning of a verb (especially ambiguous verbs) tends to be dictated by the
noun accompanying it in a clause. For example, consider the meaning of the
verb ‘have’ in the following contexts:
She had a baby. => She gave birth.
He had an egg for breakfast. => He ate an egg for breakfast.
2. Fewer verbs are truly synonymous. This depends on how rigid a definition of
synonymy is used, but, in general, this means the presence of lexical cohesion in
text through synonymous verbs is limited.
3. Not all verb categories suit being cast into a taxonomic framework.
Fellbaum states that in the design of WordNet it was possible to associate a
large majority of verb categories using an entailment[4] relationship between verb-pairs, which is not one of the cohesive relationships described by Hasan. Also, unlike the noun hierarchy, a verb sense may belong to one type (or base concept), but have as its superordinate another verb sense of a different type
(Kilgarriff, Yallop, 2000).
As we will see in the next section, taxonomic frameworks are an essential resource
in the identification of lexical cohesive ties. From this point on in our discussion all
lexical cohesive relationships will refer to relationships between nouns.
2.4
Semantic Networks, Thesauri and Lexical Cohesion
In this section we illustrate how semantic networks generated from machine-readable thesauri and dictionaries have been used to identify cohesive links between
noun pairs. We will focus much of this discussion on the WordNet thesaurus as it
has become a standard resource among researchers in the NLP, AI and IR fields. It
is also the knowledge source used by our lexical chaining algorithm, LexNews,
described in Chapter 3.
2.4.1 Longmans Dictionary of Contemporary English
The Longmans Dictionary of Contemporary English (LDOCE) was the first
available machine-readable dictionary. Its popularity as a lexical resource can also
be attributed to the simplicity of its design, since it was created with non-native
English speakers in mind. More specifically, the dictionary was written so that all
gloss definitions were described with respect to a controlled vocabulary of 2,851
words referred to as the Longmans Defining Vocabulary (LDV).
In their paper, which looks at calculating the similarity between words based on
spreading activation in a dictionary, Kozima and Furugori (1993a) took advantage
of this design feature, and generated a semantic network from it using the gloss
entries of a subset of the words in the dictionary. This subset of the dictionary is
called Glossème and consists of all words included in the LDV. They then created
their semantic network Paradigme by connecting all LDV words that share gloss
terms resulting in a network of 2,851 nodes connected by 295,914 links.
[4] A verb X entails Y if X cannot be done unless Y is, or has been, done. ‘Snoring’ entails ‘sleeping’
as a person cannot snore unless he/she is sleeping.
Kozima (1993b) defines lexical cohesion in terms of the semantic similarity
between two words, where the similarity between two words w and w’ is measured
in the following way: “Produce an activation pattern by activating a node for w and
then observe the activity of the second node w’ in the activation pattern”. A
similarity score or strength of association between w and w’ can then be calculated
based on the significance of w and the activity of the node representing w’ in the
activity pattern for w in the network. This measurement results in a score ranging
from 0 to 1, where 1 indicates a significant lexical cohesive tie and 0 no relationship
between the terms. Kozima (1993b) also defines a method of segmenting text by
generating a Lexical Cohesion Profile, a means of representing the cohesiveness of
a text in terms of the cohesiveness of windows of word sequences in the text.
Kozima’s work will be returned to again in the discussion on text segmentation in
Chapter 6.
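The following toy sketch conveys the spreading-activation idea behind this similarity measure: activate the node for w, let activity spread through the network for a few steps, and read off the activity of w'. The miniature graph, decay rate and number of iterations are invented for illustration; Paradigme itself has 2,851 nodes and 295,914 links and uses a more elaborate significance weighting.

    # Toy spreading-activation similarity over a hand-made semantic network.
    graph = {
        'library':  ['book', 'building'],
        'book':     ['library', 'page'],
        'building': ['library', 'house'],
        'house':    ['building'],
        'page':     ['book'],
    }

    def similarity(w, w_prime, decay=0.5, steps=5):
        activity = {node: 0.0 for node in graph}
        activity[w] = 1.0                       # activate the node for w
        for _ in range(steps):
            incoming = {node: 0.0 for node in graph}
            for node, value in activity.items():
                share = decay * value / len(graph[node])
                for neighbour in graph[node]:   # spread a fraction to neighbours
                    incoming[neighbour] += share
            activity = {n: (1 - decay) * activity[n] + incoming[n] for n in graph}
        return activity[w_prime]                # higher activity = more related

    print(round(similarity('library', 'book'), 3))   # directly linked
    print(round(similarity('library', 'page'), 3))   # related only via 'book'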
2.4.2 Roget’s Thesaurus
Roget’s Thesaurus is one of a category of thesauri (like the Macquarie Thesaurus)
that were custom built as aids to writers who wish to replace a particular word or
phrase with a synonymous or near-synonymous alternative. Unlike a dictionary,
they contain no gloss definitions; instead, they provide the user with a list of
possible replacements for a word and leave it up to the user to decide which sense is
appropriate. The structure of the thesaurus provides two ways of accessing a word:
1. By searching for it in a list of over 1,042 pre-defined categories, e.g. Killing,
Organisation, Amusement, Physical Pain, Zoology, Mankind, etc.
2. By searching for the word in an alphabetical index that lists all the different
categories in which the word occurs, i.e. analogous to defining sense
distinctions.
Lexical cohesive relationships between words can also be determined using a
resource of this nature, since words that co-exist in the same category are
semantically related. There is a hierarchical structure above (classes, sub-classes
and sub-sub-classes) and below (sub-categories) these categories in the thesaurus,
and it is this structure that facilitates the inferring of a small range of semantic
strengths between words; however, unlike LDOCE, ‘no numerical value for
semantic distance can be obtained’ (Budanitsky, 1999). In Section 2.5.1, we will
explain how Morris and Hirst (1991) used Roget’s thesaurus to find cohesive ties
between words in order to build lexical chains. Figure 2.1 shows an extract from the
index entry for the noun ‘ball’ in Roget’s thesaurus. Each of the categories defined
after the word ‘ball’ represent its different senses in various contexts. For example,
the ‘dance’ sense of ‘ball’ is found in the ‘Sociality’ category, which contains
related words such as ‘party’, ‘entertainment’ and ‘reception’.
Ball: #249 Rotundity; #284[Motion given to an object situated in front.]
Propulsion. ……. #892 Sociality;
#892. Sociality
party, entertainment, reception, at home, soiree; evening party, morning
party, afternoon party, bridge party, garden party, surprise party; kettle,
kettle drum; housewarming; ball, festival; smoker, smoker-party; sociable
[U.S.], stag party, hen party; tea-party; #840 Amusement
Figure 2.1: Sample category from Roget’s thesaurus.
2.4.3 WordNet
WordNet (Miller et al., 1990; Fellbaum, 1998a) is an online lexical database whose
design is inspired by current psycholinguistic theories of human lexical memory.
WordNet is divided into 4 distinct word categories: nouns, verbs, adverbs and
adjectives. The most important relationship between words in WordNet is
synonymy. The WordNet definition of synonymy also includes near-synonymy.
Hence, WordNet synonyms are only interchangeable in certain contexts (Miller,
1998). A unique label called a synset number identifies each synonymous set of
words (a synset) in WordNet. Each node or synset in the hierarchy represents a
single lexical concept and is linked to other nodes in the semantic network by a
number of relationships. Different relationships are defined between synsets
depending on which semantic hierarchy they belong to. For example, most verbs are organised around entailment, synonymy and a type of verb hyponymy called troponymy (Fellbaum, 1998b), while adjectives and adverbs are organised around antonymy (opposites such as big-small and beautifully-horribly) and synonymy.
Nouns, on the other hand, are predominantly related through synonymy and
hyponymy/hypernymy. In addition, 9 other lexicographical relationships are also
defined between nodes in the noun hierarchy. Table 2.1 defines each of these
relationships, where 80% of these links are attributed to the hypernymy/hyponymy
relationships (Budanitsky, 1999).
WordNet Noun Relationship      Example
Hyponymy (KIND_OF)             Specialisation: apple is a hyponym of fruit since apple is a kind of fruit.
Hypernymy (IS_A)               Generalisation: celebration is a hypernym of birthday since birthday is a type of celebration.
Holonymy (HAS_PART)            HAS_PART_COMPONENT: tree is a holonym of branch.
                               HAS_PART_MEMBER: church is a holonym of parishioners.
                               IS_MADE_FROM_OBJECT: tyre is a holonym of rubber.
Meronymy (PART_OF)             OBJECT_IS_PART_OF: leg is a meronym of table.
                               OBJECT_IS_A_MEMBER_OF: sheep is a meronym of flock.
                               OBJECT_MAKES_UP: air is a meronym of atmosphere.
Antonymy (OPPOSITE_OF)         Girl is an antonym of boy.

Table 2.1: Semantic relationships between nouns in WordNet.
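For readers who want to inspect these relationships directly, the short sketch below queries them through the NLTK interface to WordNet. The particular synsets, members and antonyms returned depend on the WordNet version installed, so the comments indicate what one would typically expect rather than guaranteed output.

    # Querying the noun relationships of Table 2.1 via NLTK's WordNet interface.
    from nltk.corpus import wordnet as wn

    tree = wn.synset('tree.n.01')          # the plant sense of 'tree'
    print(tree.hypernyms())                # generalisations (IS_A)
    print(tree.hyponyms()[:3])             # specialisations (KIND_OF)
    print(tree.part_meronyms())            # parts of a tree (HAS_PART)

    # member holonymy, e.g. a sheep as a member of a flock
    for synset in wn.synsets('sheep', pos=wn.NOUN):
        if synset.member_holonyms():
            print(synset.name(), '->', synset.member_holonyms())

    # antonymy (OPPOSITE_OF) is recorded on lemmas rather than synsets
    for lemma in wn.lemmas('boy', pos=wn.NOUN):
        if lemma.antonyms():
            print(lemma.name(), '<->', [a.name() for a in lemma.antonyms()])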
Figure 2.2 illustrates how each WordNet word category has expanded in size
from version 1.5 to version 1.7.1. It is evident from this graph that the noun part of
the semantic network is by far the largest. It is also the noun word category that
exhibits the most connectivity between its elements, making it an ideal resource for
discovering cohesive relationships between nouns in a text.
The noun hierarchy is organised around a set of unique beginners, which are simply synsets that do not have any hypernyms; these include ‘entity’, ‘state’, ‘phenomenon’ and ‘abstraction’. Words are then connected vertically via hypernymy and hyponymy relationships, and horizontally via meronymy and holonymy relationships. Unfortunately, there is little interconnectivity between the noun, verb, adverb and adjective files in the WordNet taxonomy: the verb file has no relations with any of the other files, the adverb file has only unidirectional relations with the adjective file, and there are only a limited number of ‘pertains to’ relationships linking adjectives to nouns.
[Figure omitted: bar chart of synset counts for the noun, adjective, verb and adverb categories of WordNet.]
Figure 2.2: Number of synsets for each part-of-speech in WordNet (versions 1.5, 1.6 and 1.7.1).
2.4.4 The Pros and Cons
We will now explore some of the advantages and disadvantages of using the
knowledge sources outlined in the preceding subsections to identify lexical
cohesive links between words in a text.
The main reason why WordNet has become the most commonly used knowledge resource in NLP circles is that it is both freely available and part of an ongoing research initiative involving Princeton University and other research communities. In contrast, only the original Roget’s Thesaurus (1911 edition) has been made available in machine-readable format (by Project Gutenberg in 1991),
while the most up to date version, Roget’s International thesaurus, is still
unavailable due to copyright restrictions. As we will see in Section 2.5.3, Roget’s
1911 edition has been successfully used to find relationships between words, even
though it has a number of limitations: a lack of up-to-date words and phrases; it
contains a number of obsolete words; and it has no word index, so that looking for
the various uses of a word is a laborious task that requires searching in each of the
various categories of the thesaurus until the word is found. WordNet, on the other
hand, is a lexical database which contains an index of words separated into different
parts of speech which are explicitly linked to related words in the taxonomy. The
LDOCE is also readily available upon request for academic research purposes.
However, unlike WordNet, the LDOCE comes in a dictionary format that must be
transformed into a semantic network, as was described in Section 2.4.1, before it
can be used to establish lexicographical relationships between words.
Another advantage that WordNet has over other online thesauri and dictionaries
is that it is in a continuous state of transition where improvements are made based
on the findings and suggestions of the research community that use it. It is currently
in its 8th edition, which is particularly important to applications like ours that work
on documents from the news story domain since new vocabulary, word meanings
and world events are constantly being added to the general English lexicon.
Consider the recent media obsession with the use of military jargon in Iraqi War
reports. For example, consider phrases such as ‘blue on blue’ or events such as
‘Operation Iraqi Freedom’. Obviously, in any NLP application it is preferable to
have a lexical resource that covers these terms and explicitly relates them to other
words in the taxonomy. The LDOCE is also continually expanded with novel
vocabulary and word usages; however, it is being developed as a commercial
product rather than as a research resource.
Another important knowledge source that is also available in the public domain,
that has not been discussed so far, is the Macquarie Thesaurus. This thesaurus has
been mapped onto a WordNet-like structure so it can be used for language
engineering problems. It is an impressive body of work that covers general English
terms as well as Australian English, Aboriginal English and elements of English
spoken in South-East Asia. Like WordNet, it is available for academic purposes (for
a small annual fee). Although this resource has not been used for the generation of
lexical chains, it has been used in Question Answering at TREC and SENSEVAL
tasks[5].
So far we have looked at the advantages of using the WordNet semantic
network. Nevertheless, there are also a number of well-documented problems with
the taxonomy that can have a significant effect on the performance of NLP
applications that avail of it. One of the most striking differences between estimating
semantic distances with WordNet and the other dictionaries/thesauri mentioned so
far is that relationships between words are explicitly defined in terms of a set of
semantic ties, which in turn leads to certain advantages and disadvantages. On the
one hand, WordNet is missing a lot of explicit links between intuitively related
words. Fellbaum (1998) refers to such obvious omissions in WordNet as the ‘tennis
problem’ where nouns such as ‘nets’ and ‘rackets’ and ‘umpires’ are all present in
the taxonomy, but WordNet provides no links between these related ‘tennis’
concepts. A thesaurus like Roget’s or a semantic network like Kozima’s LDOCE
has a much richer set of relationships between words due to the organisation of
terms into categories such as the ‘sociality’ category shown in Figure 2.1. However,
it has also been observed that ‘the price paid for this richness is a somewhat
unwieldy tool with ambiguous links’ (Kilgarriff, 2000). This unwieldiness makes
any measurement of semantic similarity into a near binary decision, i.e. we decide
whether two words are related but we can’t quantify how strong this relationship is.
In contrast, since the length of paths between related nodes in a taxonomy, such as
WordNet, can be measured in terms of edges, one might think that semantic
relatedness can be measured in terms of semantic distance with more accuracy.
However, this is not the case, as such a measure is based on the following two
assumptions outlined by Mc Hale (1998) that do not hold true for any known
thesaurus, dictionary or semantic network:
1. Every edge in the taxonomy is of equal length.
2. All branches in the taxonomy are equally dense.
Mc Hale suggests that edge length gets shorter as the depth in the hierarchy
increases. In the case of WordNet it is well known that certain categories such as
those relating to plants and animals are more developed than others. To address
these discrepancies many researchers have tried to adapt the basic edge counting measure with taxonometric information such as depth and density, and measures of statistical word association such as mutual information. For a more detailed examination of these metrics we refer the reader to two excellent sources: Mc Hale (1998) and Budanitsky (1999).

[5] TREC stands for Text Retrieval Conference (see Section 4.1.2) and SENSEVAL is a workshop dedicated to the development of sense disambiguation systems (see Section 5.1.2).
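As a concrete point of reference for this discussion, the sketch below computes a plain edge-count distance and the corresponding path similarity with NLTK's WordNet interface, alongside the Leacock-Chodorow measure, which is one example of scaling path length by taxonomy depth. These are standard NLTK functions, shown only to illustrate the ideas above and not as measures used elsewhere in this thesis.

    # Plain edge counting versus a depth-scaled variant, via NLTK's WordNet API.
    from nltk.corpus import wordnet as wn

    cat, dog, car = (wn.synset(s) for s in ('cat.n.01', 'dog.n.01', 'car.n.01'))

    print(cat.shortest_path_distance(dog))   # raw number of edges between synsets
    print(cat.path_similarity(dog))          # 1 / (edge count + 1)
    print(cat.path_similarity(car))          # weakly related concepts score lower
    print(cat.lch_similarity(dog))           # Leacock-Chodorow: depth-scaled path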
Another complaint commonly encountered when using WordNet is that its level
of sense granularity is too fine. Polysemy is represented in WordNet as a list of
different synset numbers for a particular syntactic form, while in Roget’s thesaurus
polysemy is captured by assigning the same syntactic form, such as the word
‘bank’, to a number of different heads or categories. Wilks (1998) claims that this
contributes to ‘the fine level of sense distinction present in WordNet’ and to the fact
that ‘it lacks much of the abstract classification of Roget’s’. The aim of the
lexicographer when annotating dictionary entries such as those of WordNet or the
LDOCE is to define and distinguish between all the distinct meanings of a word.
However, when the lexicographer is assigning words to categories in a thesaurus,
senses of words that are very similar will tend to be placed (only once, to avoid
repetition) in the same category and so ‘many dictionary sense distinctions will get
lost in the thesaurus’ (Kilgarriff, 2000).
Despite the fact that WordNet is an imperfect lexical resource the general
consensus in the NLP and CL (Computational Linguistics) communities is that it is
still a very valuable one. As a testimony to its indispensability, a number of large-scale research initiatives have germinated from the WordNet project, including the
Euro WordNet project (Vossen, 1998); the Annual WordNet Conference; the
SENSEVAL-2 task (sense disambiguation evaluated with respect to WordNet
synsets) (SENSEVAL-2, 2001); the numerous attempts to combine WordNet with
other lexical resources; the recent release in March 2003 of the eXtended WordNet
(implicitly links words through their glosses) (Mihalcea, Moldovan, 2001); and the
recent release of WordNet version 2.0.
2.5 Lexical Chaining: Techniques and Applications
So far in this chapter we have established what lexical cohesion is, and how its
various relationships are manifested in coherent text. We have also examined the
merits of different knowledge sources for identifying cohesive ties between nouns
in a text. In the remainder of this chapter we focus on lexical chaining as a method
of representing the lexical cohesive structure of a text. Lexical chains are in essence
sequences of semantically related words, where lexical cohesive relationships
between words are established using an auxiliary knowledge source such as a
dictionary or a thesaurus. Lexical chains have many practical applications in IR,
NLP and CL research such as the following:
Discourse Analysis (Hirst, Morris, 1991)
Text Segmentation (Okumura, Honda, 1994; Min-Yen, Klavans, McKeown,
1998; Mochizuki et al., 2000; Stokes, Carthy, Smeaton, 2002; Stokes, 2003,
2004a)
Word Sense Disambiguation (Okumura, Honda, 1994; Stairmand, 1997;
Galley, McKeown, 2003)
Query-based Retrieval (Stairmand, 1996)
A Term Weighting Scheme (Bo-Yeong, 2003)
Multimedia Indexing (Kazman et al., 1996; Al-Halimi, Kazman, 1998)
Hypertext Construction (Green, 1997a; 1997b)
Text Summarisation (Barzilay, Elhadad, 1997; Silber, McCoy, 2000; Brunn,
Chali, Pinchak, 2001; Bo-Yeong, 2002; Alemany, Fuentes, 2003; Stokes et al.,
2004)
Malapropisms Detection in Text (St-Onge, 1995; Hirst, St-Onge, 1998)
Web Document Classification (Ellman, 2000)
Topic Detection and Tracking (Stokes et al., 2000a, 2000b, 2000c; Stokes,
Carthy, 2001a, 2001b, 2001c; Carthy, Smeaton, 2000; Carthy, Sherwood-Smith,
2002; Carthy, 2002)
Question Answering (Moldovan, Novischi, 2002)
In 1991 Morris and Hirst published their seminal paper on lexical chains with the
purpose of illustrating how these chains could be used to explore the discourse
structure of a text. At the time of writing their paper no machine-readable thesaurus
was available so they manually generated chains using Roget’s Thesaurus. Since
then lexical chaining has developed from an idea on paper to a fully automated
process that captures not only cohesive relationships, but also discourse properties
such as thematic focus. In the following subsections, we review some of the
principal chaining approaches proposed in the literature, and how lexical chains
have been used to solve some of the complex research problems listed above.
2.5.1 Morris and Hirst: The Origins of Lexical Chain Creation
Morris and Hirst (1991) used lexical chains to determine the intentional structure of
a discourse using Grosz and Sidner's discourse theory (1986). In this theory, Grosz
and Sidner propose a model of discourse based on three interacting textual
elements: linguistic structure (segments indicate changes in topic), intentional
structure (how segments are related to each other), and attentional structure (shifts
of attention in text that are based on linguistic and intentional structure). Obviously,
any attempt to automate this process will require a method of identifying linguistic
segments in a text. Morris and Hirst believed these discourse segments could be
captured using lexical chaining, where each segment is represented by the span of a
lexical chain in the text.
Morris and Hirst manually generated lexical chains using Roget's Thesaurus, in which each word has an index entry that lists synonyms and near-synonyms for each of its coarse-grained senses, followed by a list of category numbers related to these senses. A category in this context consists of a list
of related words and pointers to related categories. They used the following rules to
glean semantic associations from the thesaurus during the chain generation process,
where two words are related if any of the following relationship rules apply:
1. They have a common category in their index entries.
2. One word has a category in its index entry that contains a pointer to a category
of the other word.
3. A word is either a label in the other word’s index entry or it is listed in a
category of the other word.
4. Both words have categories in their index entries that are members of the same
class/group.
5. Both words have categories in their index entries that point to a common
category.
Morris and Hirst also introduced a general algorithm for generating chains, shown
in Figure 2.3, on which most other chaining implementations are based.
A General Lexical Chaining Algorithm
1. Choose a set of candidate terms for chaining, t1 … tn. These terms are usually open-class nouns, i.e. highly informative words as opposed to stopwords.
2. Initialise: The first candidate term in the text, t1, becomes the head of the first chain, c1.
3. for each remaining term ti do
4.    for each chain cm do
5.       Find the chain that is most strongly related to ti with respect to the following chaining constraints:
         a. Chain Salience
         b. Thesaural Relationships
         c. Transitivity
         d. Allowable Word Distance
6.       If the relationship between a chain and ti adheres to these constraints then ti becomes a member of cm, otherwise ti becomes the head of a new chain.
7.    end for
8. end for
Figure 2.3: A generic lexical chaining algorithm
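The following Python sketch is a minimal rendering of the generic greedy loop in Figure 2.3. The relatedness function, the distance limit and the strength threshold stand in for whichever knowledge source and constraint parameters a particular chainer uses; they are illustrative assumptions rather than part of any published implementation, and transitivity is not modelled.

```python
from typing import Callable, List, Tuple

Term = Tuple[str, int]  # (term, sentence number)

def build_chains(terms: List[Term],
                 related: Callable[[str, str], float],  # 0.0 means unrelated
                 max_distance: int = 7,                  # allowable word distance
                 min_strength: float = 0.5) -> List[List[Term]]:
    chains: List[List[Term]] = []
    for term, sent in terms:                              # statements 3-8
        best_chain, best_score = None, 0.0
        # Most recently updated chains are examined first (chain salience).
        for chain in reversed(chains):
            last_term, last_sent = chain[-1]
            if sent - last_sent > max_distance:           # allowable word distance
                continue
            score = related(term, last_term)              # thesaural relationship
            if score > best_score:
                best_chain, best_score = chain, score
        if best_chain is not None and best_score >= min_strength:
            best_chain.append((term, sent))               # join the strongest chain
        else:
            chains.append([(term, sent)])                 # head of a new chain
    return chains
```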
The constraints listed in statement 5 of the algorithm are critical in controlling
the scope, size and in many cases the validity of the relationships within a chain. If
these constraints are not adhered to or if suitable parameters are not chosen for each
of them, then the occurrence of spurious chains (chains that contain weakly related
or incorrect terms) will be greatly increased. We now look at each of these
constraints in turn:
Chain Salience: This constraint expresses a preference for adding words to the most recently updated chain. This intuitive rule appeals to the notion that
terms are best disambiguated with respect to active chains, i.e. active themes or
speaker intentions in the text.
Thesaural Relationships: Regardless of the knowledge source used to deduce
semantic similarity between terms, a set of appropriate knowledge source
relationships must be decided on. Morris and Hirst state that their relationship
rules 1 and 2, defined above, based on Roget’s thesaural structure, account for
nearly 90% of relationships between chain words. On the other hand, in
WordNet-based chaining the specialisation/generalisation hierarchy of the noun
taxonomy is responsible for the majority of associations found between nouns.
Transitivity: Another factor to consider when searching for relationships
between words is transitivity. In particular, although weaker transitive
relationships (such as a is related to c because a is related to b and b is related to
c) increase the coverage of possible word relationships in the taxonomy, they
also increase the likelihood of spurious chains, as they tend to be even more
context specific than strong relationships such as synonymy. For example,
consider the following tentative relationship found in WordNet: ‘foundation
stone’ is indirectly related to ‘porch’ since ‘house’ is directly related to both
‘foundation stone’ and ‘porch’. Deciding whether these transitive relationships
are useful is a difficult decision as one must also consider the loss of possible
valuable relationships if they are ignored, e.g. ‘cheese’ is indirectly related to
‘perishable’ since ‘dairy product’ is directly related to both words according to
WordNet.
Allowable Word Distance: This constraint works on an assumption similar to Chain Salience: relationships between words are best disambiguated with respect to the words that lie nearest to them in the text. The general rule is
that relationships between words that are situated far apart in a text are only
permitted if they exhibit a very strong semantic relationship such as repetition
or synonymy.
As well as these constraints Morris and Hirst also refer to the notion of chain
returns, a final chaining step where related chains are merged after the initial chain
formation process is complete. Chain returns are chains that share candidate terms with a chain created earlier in the text. In their own words, a chain return occurs when a
theme in a text represented by a chain ‘has clearly stopped’, and is then returned to
by the speaker later on in the text, thus creating the second occurrence of the chain,
i.e. a chain return. If such chain returns are linked back to the original chain and
concatenated with it then this resulting chain will represent a single theme or
‘structural text entity’ in the discourse. In many subsequent lexical chaining
implementations, including our own, there are no occurrences of chain returns, i.e.
no word belongs to more than one chain. This is made possible by relaxing the
allowable word distance parameters for strong relationships (to span the entire text),
so as to ensure that all repetitions of a particular word will occur in the same chain.
However, since Morris and Hirst’s aim was to model intentional shifts in the text,
shorter word distance constraints are necessary.
Morris and Hirst conclude that lexical chains are good indicators of text
structure with respect to Grosz and Sidner’s structural analysis method.
Nevertheless, they admit that further research is needed in order to uncover the true
impact of ‘unfortunate chain linkages’, the stability of parameters and constraints
across different textual styles, and how a static knowledge source such as Roget’s
might impact on the practical limitations of their work.
2.5.2 Lexical Chaining on Japanese Text
The first fully automated version of Morris and Hirst's algorithm was implemented by Okumura and Honda (1994) for use on Japanese text; it uses Bunrui-goihyo, a thesaurus similar in design to Roget's. However, unlike Morris and
Hirst, they only avail of a subset of all possible relationships provided by the
thesaurus. More specifically, for two words to be related they must either be
repetitions or share the same category number.
Okumura and Honda’s algorithm begins by choosing a set of candidate terms for
chaining. In this case all nouns, adjectives and verbs are chosen. Unlike the generic
algorithm detailed in Figure 2.3 words are not immediately added to chains. Instead,
an attempt is made to disambiguate term ti with respect to the other words that
occur in the current sentence. Any terms that remain ambiguous after this step will
be disambiguated when added to the most appropriate chain in the next step. Again,
as in the case of the generic algorithm, a term is added to the chain that is most
salient and strongly related to a particular sense of ti. If a satisfactory relationship
between a chain cm and ti is found then ti becomes a member of cm, otherwise ti
becomes the head of a new chain.
Since one of the side effects of the lexical chaining process is sense resolution,
Okumura and Honda decided to test how well their chaining algorithm could
disambiguate words in a small collection of 5 Japanese texts. They found that
chain-based disambiguation obtained an average accuracy of 63.4%. According to
the authors this is ‘a tolerable level of performance without any training’.
Disambiguation errors were attributed to two sources, morphological analysis errors
and errors due to word senses being ‘dragged into the wrong contexts’.
Okumura and Honda also experimented with lexical chains as a means of
segmenting text into semantic units. They observed that distinct segments tend to
use related words. Therefore, an area of the text that exhibits a high number of
chain-end and chain-begin points indicates a transition between old and new themes
in the text. This lexical chaining application will be returned to again in Chapters 6
and 7, where a more detailed look at the authors’ approach to this task is given. The
performance of their segmentation method was evaluated on a collection of only 5
texts and appeared to be poor with an average recall and precision of 0.52 and 0.25
respectively6. However, they noted that this work was in a preliminary stage, and
that with a number of refinements, such as taking account of discourse markers
(clue words) and including a measure of chain importance, they could improve
these results.
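A minimal sketch of this boundary cue is given below: chains are reduced to their (start sentence, end sentence) spans, and each inter-sentence gap is scored by the number of chains that end just before it or begin just after it. The window parameter and the scoring scheme are illustrative assumptions, not Okumura and Honda's exact method.

```python
def boundary_scores(chain_spans, num_sentences, window=1):
    """chain_spans: list of (start_sentence, end_sentence) pairs, one per chain.
    Returns one score per gap between sentence i and sentence i+1; peaks in
    this score suggest a transition between old and new themes."""
    scores = [0] * (num_sentences - 1)
    for gap in range(num_sentences - 1):
        for start, end in chain_spans:
            if gap - window < end <= gap:       # a chain ends just before the gap
                scores[gap] += 1
            if gap < start <= gap + window:     # a chain begins just after the gap
                scores[gap] += 1
    return scores

# Gaps with the highest scores are proposed as segment boundaries.
print(boundary_scores([(0, 3), (1, 3), (4, 8), (5, 8)], num_sentences=9))
```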
More recently in Mochizuki et al. (2000), Okumura and Honda’s algorithm was
used to segment Japanese documents into sub-topic segments. These segments were
then used in passage-level retrieval experiments, where relevant passages or
segments of long documents are returned in response to user queries instead of an
entire set of semi-relevant documents. Mochizuki et al.'s results showed that, by combining keyword retrieval with lexical chain-derived segments in a passage-level retrieval setting, their system could outperform methods that used either technique on its own. Using lexical chains as a means of segmenting text
will be returned to again in Chapters 6 and 7.
2.5.3 Roget’s Thesaurus-based Chaining Algorithms
In their original chaining algorithm, Morris and Hirst (1991) manually identified
lexical cohesive relationships between chain words using a version of Roget’s
Thesaurus. Subsequent attempts to automate the chaining process have
predominantly focussed on finding lexical cohesive relationships using WordNet.
St-Onge (1995) and Stairmand (1996) both cite the following convincing arguments as to why they chose WordNet over Roget's (a number of these points have already been discussed in more detail in Section 2.4.4):
6 In this context, recall refers to the number of correctly identified boundaries divided by the total number of boundaries in the text, while precision is the number of correctly identified boundaries divided by the total number of boundaries returned by the system.
The 1911 version of Roget's has no index (only categories are listed, so an index must be generated automatically), and although the 1993 version includes an index it is not freely available.
Roget’s lacks new words and so is not well suited to processing contemporary
texts.
Defining the strength of a cohesive link using Roget’s is more difficult as most
relationships between words are found by their co-existence in thesaural
categories with no accompanying explanation of how related they are. On the
other hand, relationships between terms in WordNet are defined in a more
principled and explicit manner.
Concepts in WordNet are organised around psycholinguistic theories of human
lexical memory structures.
WordNet versions are accompanied by a library of functions providing access to
its database.
However, in spite of these inadequacies, there have been two recent attempts at
automating chain creation using Roget’s Thesaurus and Morris and Hirst’s original
cohesive link definitions (listed in Section 2.5.1).
Ellman (2000) details a system called Hesperus that generates lexical chains and
calculates document similarity using these chains. Ellman acknowledges the
difficulties outlined by Stairmand (1996) and St-Onge (1995), but also points out
that Roget’s, unlike WordNet, contains intuitive associations between words, which
are an important part of the chaining process. He also notes that Roget’s has a more
balanced structure than WordNet, and so might prove more efficient for
establishing cohesive ties. He also suggests that the difficulty in determining the
strength of relationships using Roget’s might not be such a disadvantage since
determining semantic proximity in WordNet has also proven to be a very difficult
problem.
Ellman’s Hesperus system is designed to enhance search engine retrieval
results. The lexical chaining element of his research involves using cohesive chains
to build document representations based around Roget’s categories, called Generic
Document Profiles (GDPs). Two documents are then considered very similar if their GDPs have a number of Roget's categories in common. Ellman evaluates the
Hesperus document similarity strategy by comparing its ranking of texts to that of
human judges who were asked to rank a random set of texts in order of similarity to
a set of texts taken from Encarta on diverse topics ranging from artificial
intelligence to socialism. The results of his experiment were mixed: rankings for some topics were statistically significant when compared to the gold-standard human rankings, while results for other topics were disappointing. Ellman also found that document representations composed of fine-grained Roget's categories, with no explicit sense disambiguation, worked best in most cases.
Jarmasz and Szpakowicz (2003) also generated lexical chains using Roget’s.
However, they used a machine-readable 1987 version of Penguin’s Roget’s
Thesaurus of English Words and Phrases. Their research group at the School of
Information Technology and Engineering, University of Ottawa has recently been
granted permission to work with this version of the thesaurus. According to the
authors, initial experiments have shown that, as expected, Roget’s provides a very
broad range of associations between words using Morris and Hirst’s relationship
rules. However, they conclude that these ‘thesaural relations are too broad to build
well-focussed chains or too computationally expensive to be of interest'. Ellman came to a similar conclusion, finding that only a subset of Morris and Hirst's relationship types was useful (see Section 2.5.1 for a list of relationship types).
2.5.4 Greedy WordNet-based Chaining Algorithms
The generic algorithm shown in Figure 2.3 is an example of a greedy chaining
algorithm where each addition of a term ti to a chain cm is based only on those
words that occur before it in the text. A non-greedy approach, on the other hand,
postpones assigning a term to a chain until it has seen all possible combinations of
chains that could be generated from the text. One might expect that there is only
one possible set of lexical chains that could be generated for a specific text, i.e. the
correct set. However, in reality, terms have multiple senses and could be added into
a variety of different chains. For example consider the following piece of text:
Among the many politicians who sought a seat on the space shuttle, John Glenn was
the obvious, perfect choice.
There are 10 distinct senses of ‘space’ defined in the WordNet thesaurus, ranging
from the topological sense of space to the ‘blank space on a form’ sense. It is
obvious when reading the above sentence that the correct sense of ‘space’ used in
this context is WordNet’s sense number 5, i.e. any region of space outside the
earth’s atmosphere. However, a greedy algorithm would have chosen WordNet’s
sense 6 defined as ‘an area reserved for a purpose’. More specifically, the algorithm
was forced to make this disambiguation error as sense 1 of ‘seat’ is a type of ‘space’
and so the word ‘space’ will be added to a chain containing the word ‘seat’. This is
an example that strengthens the argument against a greedy chaining approach: the algorithm could have made a more informed decision, and chosen the correct sense of 'space', had it postponed disambiguating the word at such an early stage in
the text and waited until it had seen additional vocabulary such as ‘shuttle’, ‘John
Glenn’, ‘NASA’, ‘space station’, and ‘MIR’. On the other hand, non-greedy
chaining approaches are still prone to disambiguation errors and are less efficient.
Further discussion of these points is left to Section 2.5.5, while in the following
section we will look at some greedy chaining algorithms proposed in the literature
and how they have been applied to various IR and NLP tasks.
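For readers who wish to inspect sense inventories of this kind themselves, the NLTK interface to WordNet makes it easy to list the noun senses of a word such as 'space'; note that the number and ordering of senses varies between WordNet versions, so the counts and sense numbers reported above refer to the version used in this thesis.

```python
from nltk.corpus import wordnet as wn

for i, synset in enumerate(wn.synsets('space', pos=wn.NOUN), start=1):
    print(i, synset.name(), '-', synset.definition())
```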
St-Onge and Hirst
St-Onge and Hirst’s algorithm (St-Onge, 1995; Hirst and St-Onge, 1998) was the
first published work to use WordNet as a tool for building lexical chains. Their
intention was to use lexical chains as a means of detecting malapropisms in text. St-Onge (1995) defines a malapropism as 'the confounding of one word with another word of similar sound and/or similar spelling that has a quite different meaning'. St-Onge provides the following example, 'an ingenuous machine for peeling oranges',
where ‘ingenuous’ is confused with ‘ingenious’. A traditional spell checker would
not have picked up this type of error because its purpose is to identify incorrect
spellings and, in some cases, grammatical mistakes.
St-Onge and Hirst’s biggest contribution to the study of lexical chains is the
mapping of WordNet relations and paths (transitive relationships) to Morris and
Hirst’s word relationship types. St-Onge defines three categories of WordNet-based
relationships in a text:
Extra Strong Relations include all repetition-based relationships, e.g. ‘men’
and ‘man’.
Strong Relations include all synonyms (bike, bicycle), holonyms/meronyms
(arm, biceps), antonyms (night, day), and hypernyms/hyponyms (aircraft,
helicopter).
Medium-Strength Relations include all relationships with allowable paths in
WordNet (with a maximum path length 5).
By defining what constitutes an allowable path between words in the taxonomy, St-Onge and Hirst aimed to limit spurious and tentative links between words in chains.
Since St-Onge and Hirst’s algorithm forms an integral part of our own approach to
lexical chain generation, we leave an in-depth discussion of the exact details of their
algorithm to Chapter 3, which documents our enhanced version of their algorithm.
Many chaining algorithms that followed their work are also based on their
approach. In particular, the systems of both Green (1997b) and Kozima and Ito
(1997), described below, use St-Onge’s algorithm, LexC, to enhance hypertext
generation and to improve multimedia indexing, respectively.
St-Onge and Hirst base their malapropism detector on the following hypothesis:
words that do not form lexical chains with other words in a text are potential
malapropisms, as they appear to be semantically dissimilar to the general context of
the text. Once these potential malapropisms have been detected St-Onge’s
algorithm then tries to find slight spelling variations of these words that fit into the
overall semantics of the document. Hence, if one of these spelling alternatives
forms a relationship with one of the lexical chains, then St-Onge’s algorithm
concludes that the original word was incorrectly used, and that the variation of the
word was the intended use. An evaluation of this system on 500 Wall Street Journal articles that were deliberately corrupted with roughly one malapropism every 200 words (1,409 in total) yielded a precision of 12.5% and a recall of 28.2%7. In further
experiments by Budanitsky (1999), it was shown that malapropism detection could
be improved by using a simpler approach that analysed the semantic
distance between all terms in the text, rather than one based on lexical chains.
7 In this context recall is defined as the percentage of correctly identified malapropisms as a portion of the total number of malapropisms and precision is defined as the percentage of correctly identified malapropisms as a portion of the total number of malapropisms detected by the system (Budanitsky, 1999).
Green and Hirst
Green’s system (1997b) for the construction of hypertext between newspaper
articles creates links between related paragraphs within a document (intra-article
links) and across documents (inter-article links). Inter-article links are generated
based on the cosine similarity of their WordNet synset vectors, where each synset
vector is made up of all lexical chain members generated for that document. Green
states that these synset vectors ‘can be seen as a conceptual or semantic
representation of an article, as opposed to the traditional IR method of representing
a document by the words it contains’. A different strategy is used for generating
intra-article hypertext links, where the similarity between each pair of paragraphs is
calculated based on two factors: the number of words they contain that occur in the
same lexical chain and the importance of that chain to the overall topic of the text.
Chain importance is weighted with respect to the relative length of the chain in the
text, i.e. the number of elements in the chain divided by the total number of content
words in the document.
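A minimal sketch of this inter-article similarity computation is shown below. Documents are reduced to frequency vectors over the WordNet synsets occurring in their lexical chains and compared with the cosine measure; the synset identifiers in the toy example are assumptions used for illustration only.

```python
import math
from collections import Counter

def synset_vector(chains):
    """chains: a document's lexical chains, each a list of synset identifiers."""
    return Counter(syn for chain in chains for syn in chain)

def cosine(v1, v2):
    dot = sum(v1[s] * v2[s] for s in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

doc_a = synset_vector([['bank.n.01', 'loan.n.01'], ['money.n.01']])
doc_b = synset_vector([['bank.n.01', 'money.n.01', 'interest.n.04']])
print(cosine(doc_a, doc_b))   # an inter-article link if this exceeds some threshold
```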
Green evaluated the usefulness of his hypertext links by examining how well
subjects answered a set of questions by following links within and between
documents in a TREC corpus of newspaper articles. Green found that users
experienced no significant advantage when answering questions using lexical chain-based links over links generated using a simple vector space model of document
similarity. Green suggests that a more appropriate evaluation would involve directly
analysing and scoring the validity of the intra and inter-article links generated by
each approach rather than indirectly evaluating links with respect to a task.
Kazman et al.
Kazman et al. (1995, 1997) used lexical chains as a means of creating a
‘meaningful’ index of the recorded words spoken during a meeting or
videoconference. By ‘meaningful’ Kazman et al., like Green, refer to
topic/concept/theme-based indexing rather than a traditional keyword-based
indexing approach. This indexing strategy is just one component of a larger
multimedia indexing system called Jabber, which provides users with a unified
browsing/searching interface. As well as concept indexing, there are three further
indexing utilities: indexing by human-interaction patterns, indexing by human-
prepared meeting agendas, and indexing by a participant’s use of a shared
application. Initially, Jabber captured topics discussed in a meeting by generating
lexical trees. These trees are essentially lexical chains that have been reorganised
into a hierarchical structure where the most general word in the chain is placed at
the root of the tree which represents the general concept of the chain or tree. Later
lexical trees were replaced by concept clusters that only use relationships in the is-a
(hypernym/hyponym) hierarchy to generate chains because Kazman found that it is
easier to characterise or label the cluster with the lowest common hypernym.
According to Kazman et al. (1996), lexical trees compared favourably with human
generated trees created from a journal article of 1,800 words where more than 75%
of subject-assigned words that characterised the theme of their cluster were found in
the characteristic tree (i.e. the strongest automatically generated lexical tree). They
go onto conclude that the results of their experiment are ‘a strong indication of the
usefulness of the lexical trees in information indexing and retrieval’ (Kazman et al.,
1996).
Brunn et al.
Brunn et al. (2001, 2002) at the University of Lethbridge use lexical chains to
extract from a text the relevant sentences that should be included in a summary. They
use a greedy chaining algorithm that is similar in nature to St-Onge’s except that
they only consider relationships between words that have a path length no longer
than two edges in WordNet. Their algorithm also requires that the relationship
between words in a chain is pairwise mutual. This means that every word in a chain
must be related to every other word in the chain. In contrast, St-Onge only requires
one chain word to be related to the target word during chain generation.
The most significant difference between their technique and the other
approaches, discussed so far in this section, is their preprocessing step for choosing
candidate nouns for chaining. In most cases, preprocessing involves part-of-speech
tagging in order to identify nouns, proper nouns and noun phrases. This is often
followed by a morphological analysis that transforms nouns into their singular
form. A stoplist of ‘noisy’ nouns is usually used at this stage to eliminate words that
cause spurious chains to be formed and/or contribute very little to the subject of the
text. However, Brunn et al. (2001) suggest a more attractive alternative to static
‘noisy’ noun removal using a stopword list, where nouns are dynamically identified
as ‘information poor’ by applying the following hypothesis: ‘nouns contained
within subordinate clauses are less useful for topic detection than those contained
within main clauses’. The problem then becomes how to identify subordinate
clauses in sentences, which is no easy task according to the authors. Hence, they use the
following heuristic on parsed text: if a noun is either ‘the first noun phrase or the
noun phrase included in the first verb phrase taken from the first sub-sentence of
each sentence’, then it is a candidate term and will take part in chaining; otherwise,
the noun belongs to a subordinate clause and is eliminated from the chaining
process. Although this noun filtering heuristic is appealing, the authors did not
compare the performance of their lexical chain-based summariser using this preprocessing component with its performance using the traditional method of filtering 'troublesome' nouns from the chaining process with a stopword list. Hence, it is
unclear if their technique actually improves chain performance. A more detailed list
of heuristics for identifying subordinate clauses can be found in Brunn et al. (2002),
and information on multi-document summarisation using their chaining technique
can be found in Chali et al. (2003).
2.5.5 Non-Greedy WordNet-based Chaining Algorithms
As stated previously, a non-greedy approach to lexical chaining postpones resolving
ambiguous words in a text until it has analysed the entire context of the document.
Barzilay and Elhadad (1997) were the first to discuss the advantages of a non-greedy chaining approach. They argued that disambiguating a term after all possible
links between it and the other candidate terms in the text have been considered was
the only way to ensure that the optimal set of lexical chains for that text would be
generated. In other words, the relationships between the terms in each chain will
only be valid if they conform to the intended interpretation of the terms when they
are used in the text. For example, chaining ‘jaguar’ with ‘animal’ is only valid if
‘jaguar’ is being referred to in the text as a type of cat and not a type of car.
However, with this potential improvement in lexical chain quality comes an
exponential increase in the runtime of the basic chaining algorithm, since all
possible chaining scenarios must be considered (Silber, McCoy, 2002). This has
led to a number of recent initiatives to develop a linear-time non-greedy algorithm
that attempts to improve chaining accuracy without over-burdening CPU and
memory resources. In Chapter 3, we look more closely at the assumption that non-greedy chaining produces better chains. In particular, we show that this extra
computational effort does not necessarily result in improved disambiguation
accuracy during chaining.
Two categories of non-greedy chaining approaches have been proposed in the
literature: those that attempt to create all possible chains and then choose the best of
these chains (Stairmand, 1996; Barzilay, 1997; Silber, McCoy, 2002); and those that
disambiguate terms before noun clustering begins resulting in a single set of chains
(Bo-Yeong, 2003; Galley and McKeown, 2003). We now examine approaches from
both of these categories in detail.
Stairmand and Black
Stairmand (1996), like St-Onge (1995), was one of the first to attempt the automatic
realisation of lexical chains using WordNet. Stairmand’s thesis (1996) examines
how lexical cohesion information about a text can be used to improve traditional
keyword-based approaches to IR problems. More specifically, Stairmand used
lexical chains for segmenting text (similar to Okumura and Honda’s technique,
described in Section 2.5.2), for disambiguating word senses (Stairmand, 1997), and for improving ad hoc retrieval. However, his main focus was his experimental
IR system COATER (Context-Activated Text Retrieval) (Stairmand, 1996;
Stairmand, Black, 1997; Stairmand, 1997). COATER’s task was to take a set of
disambiguated TREC queries, consisting of WordNet senses, and subsequently
determine the relevance of a document to a query by observing the level of
activation of each query word's concepts in a document representation
consisting of lexical clusters of WordNet synsets.
In line with other lexical chaining techniques, Stairmand’s chaining algorithm
QUESCOT (Quantification and Encapsulation of Semantic Content), first chooses a
set of candidate terms (nouns) for chaining. The second phase of his chaining
procedure, the non-greedy aspect of his algorithm, then establishes all the possible
term senses in the text that are related to each other through direct and indirect links
found in WordNet. In this context, direct links are repetitions, synonyms,
hypernyms, holonyms, meronyms and hyponyms; while indirect links are hypernym
paths in WordNet where no maximum path length is defined. After establishing all
possible links between each sense of each word in the text, the next step involves
generating a set of lexical clusters. Stairmand’s lexical clusters are not mutually
exclusive, so the same word can occur in different chains. The next step is to merge
any lexical clusters that share the same sense of a word. This allows for transitive
relationships to be found between word senses that are not explicitly defined in the
taxonomy. Once merging is completed, all lexical clusters are broken up into lexical
chains by splitting clusters at points in the cluster where the allowable distance
threshold of 80 words between adjacent terms is exceeded. Any chain resulting
from this split must contain more than 3 words. Consequently, there will be some words in the
cluster that are not included in the resultant chains. These chains are then weighted
with respect to the portion of text spanned by the chain (their span) and their density
(the number of terms divided by the span of the chain). The overall span of a cluster
then becomes the average span of each of its chains, while its overall density
becomes the average density of each of its chains.
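The span and density calculations described above can be sketched as follows; chain members are represented simply by their word offsets in the text, and the way span and density are combined into a single cluster score is an assumption, since COATER's exact normalisation is not reproduced here.

```python
def chain_span(positions):
    """positions: word offsets of the chain's members in the text."""
    return max(positions) - min(positions) + 1

def chain_density(positions):
    return len(positions) / chain_span(positions)

def cluster_strength(chains):
    """chains: the position lists of the chains making up one lexical cluster."""
    avg_span = sum(chain_span(c) for c in chains) / len(chains)
    avg_density = sum(chain_density(c) for c in chains) / len(chains)
    return avg_span * avg_density      # combination chosen purely for illustration

print(cluster_strength([[3, 10, 25], [40, 42, 60, 81]]))
```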
Stairmand’s main evaluation of his lexical chains centres on ad hoc TREC
retrieval. Document representations in COATER consist of a set of synsets derived
from the chain clusters, where each synset is weighted with respect to the strength
of its cluster, i.e. the normalised product of its span and density. In this way,
concepts in a text are weighted in terms of their overall contribution to the message
of the text, where concepts that are members of low scoring clusters are assigned a
low score in the document representation. COATER was evaluated with respect to
the SMART information retrieval system on a set of 50 queries taken from a TREC
evaluation corpus. Stairmand found that COATER ranked relevant documents more
precisely than SMART if the query’s terms refer to the main theme or an important
sub-theme of the ranked documents. However, the SMART system was more adept
at distinguishing between documents that were mildly relevant or not relevant at all
to the query. Stairmand believes that this result is encouraging and shows that
‘textual context can improve precision performance’. However, he also admits that
the recall performance of COATER is relatively poor (due to gaps in WordNet
coverage, especially in the case of proper noun coverage), which he concludes
‘prohibits the use of COATER in a real world IR scenario’ (Stairmand, 1997).
Barzilay and Elhadad
Barzilay and Elhadad (1997) were the first to coin the phrase ‘a non-greedy or
dynamic solution to lexical chain generation’. They proposed that the most
appropriate sense of a word could only be chosen after examining all possible
lexical chain combinations that could be generated from a text.
Their dynamic algorithm begins by extracting nouns and noun compounds.
Barzilay reduces both non-WordNet and WordNet noun compounds to their head
noun e.g. ‘elementary_school’ becomes ‘school’. As each target word arrives, a
record of all possible chain interpretations is kept and the correct sense of the word
is decided only after all chain combinations have been completed. As stated at the
beginning of this section, one of the main problems associated with non-greedy
algorithms is that they exhibit an exponential runtime. To reduce this algorithmic
complexity, Barzilay’s dynamic algorithm continually assigns each chain
interpretation a score determined by the number and weight of the relations between
chain members. When the number of active chain interpretations for a particular
word sense exceeds a certain threshold (i.e. 10 chains), weaker interpretations with
lower scores are removed from the remainder of the chaining process. Further
reductions in the runtime of the algorithm are also achieved by Barzilay’s
stipulation that relationships between words are only permitted if words occur in the
same text segment. Hearst’s TextTiling algorithm was used to segment the
document into sub-topics or text segments8. Consequently, chain merging is a
necessary component of the algorithm as same sense words often occur, due to this
stipulation, in different chains.
Once all chains have been generated only the strongest chains are retained.
Barzilay provides a more rigorous justification of her chain weighting scheme than
Stairmand does. In particular, she uses a human evaluation to determine what chain
characteristics are indicative of strong chains (representing pertinent topics) in a
text, i.e. chain length, chain word distribution in the text, chain span, chain density,
graph topology (of chain word relationships in WordNet) and the number of word
repetitions in the chain. Barzilay found that the best predictors of chain importance
8 In further experiments Barzilay showed that segmentation does not improve chain disambiguation accuracy (Barzilay, 1997). However, Silber and McCoy (2000; 2002), described next, use segmentation to 'reduce the complexity of their algorithm', which is a more efficient version of Barzilay and Elhadad's.
or strength were: the chain length (the number of words in the chain plus
repetitions) and the homogeneity index (one minus the number of distinct
occurrences of words in the chain divided by its length). A single measure of chain
strength is calculated by combining chain length with the homogeneity index. Thus, for a chain to be retained, its strength must exceed the average chain strength score plus twice the standard deviation of the scores.
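The following sketch illustrates the selection criterion just described. Chains are represented as lists of their member words (including repetitions); the product used to combine length and homogeneity is an assumption made for illustration, since the text above only states that the two quantities are combined.

```python
from statistics import mean, stdev

def chain_strength(chain_words):
    length = len(chain_words)                          # members plus repetitions
    homogeneity = 1 - len(set(chain_words)) / length   # homogeneity index
    return length * homogeneity                        # illustrative combination

def strong_chains(all_chains):
    """Keeps chains scoring above the mean plus twice the standard deviation.
    Assumes at least two chains so that the standard deviation is defined."""
    scores = [chain_strength(c) for c in all_chains]
    threshold = mean(scores) + 2 * stdev(scores)
    return [c for c, s in zip(all_chains, scores) if s > threshold]
```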
The main focus of Barzilay’s thesis is to investigate if a useful summary can be
built from a lexical chain representation of the original text. Since chains represent
pertinent themes in a document, significant sentences that should be included in a
summary can therefore be identified by examining the distribution of chains
throughout the text. More specifically, Barzilay found that the following heuristic
worked well: for each chain choose a representative word (i.e. the term with the
highest frequency of occurrence), then extract the first sentence in the text that
contains a representative word for each of the chains (i.e. extract one sentence for
each chain).
A data set of 40 TREC news articles containing roughly 30 sentences each was
chosen for the evaluation. Human subjects were asked to produce summaries of
lengths equal to 10% and 20% of the original document, respectively. Barzilay
then compared the similarity of these manually constructed summaries to those
generated by her lexical chain-based system, the Microsoft Word 1997 Summariser,
and Marcu’s summariser based on discourse structure analysis (Marcu, 1997).
Results from these experiments show that Barzilay's lexical chain-based summaries are closer to human-generated summaries than those of either of the other two systems. These results also suggest that lexical chains are a strong intermediate representation of a document, and hence should perform well in other applications that would benefit from a more meaningful representation of a document than a 'bag of words' representation.
Silber and McCoy
An important extension of Barzilay and Elhadad’s work has been Silber and
McCoy’s (2000, 2002) linear time version of their lexical chaining algorithm. They
make two modifications to Barzilay’s algorithm in order to reduce its runtime.
The first modification relates to the WordNet searching strategy used to
determine word relationships in the taxonomy. In her original implementation
Barzilay uses the source code accompanying WordNet to access the database,
resulting in a binary search of the input files. Silber and McCoy note that chaining
efficiency could be significantly increased by re-indexing the noun database by line
number rather than file position, and saving this file in a binary indexed format.
Consequently, this also meant writing their own source code for accessing and
taking advantage of this new arrangement of the taxonomy.
Their second modification to Barzilay’s algorithm related to the way in which
‘chain interpretations’ are stored, where Barzilay’s original implementation
explicitly stores all interpretations (except for those with low scores), resulting in a
large runtime storage overhead. To address this, Silber and McCoy’s
implementation creates ‘a structure that implicitly stores all chain interpretations
without actually creating them, thus keeping both the space and time usage of the
program linear’. Once all chain interpretations, or meta-chains as Silber and McCoy
refer to them, are created, their algorithm must then decide which meta-chains are
members of the optimal set of lexical chains for that document. To decide this, their
algorithm makes a second pass through the data, taking each noun in the text and
deciding which meta-chain it contributes the most to. The strength of a noun’s
contribution to a chain depends on two factors: how close the word is in the text to
the word in the chain to which it is related, and how strong the relationship between
the two words is. For example, if a noun is linked by a hypernym relationship to a
chain word that is one sentence away then it gets assigned a score of 1; if the words are 3 sentences apart the score is lowered to 0.5. Silber and McCoy define an
empirically-based scoring system for each of the WordNet relationships found
between terms during chaining. The subsequent steps of their algorithm proceed in
a similar fashion to Barzilay’s, where only chains that exceed a threshold (twice the
standard deviation of the mean of the chain scores plus the mean chain score) are
selected for the final stage of the summarisation process.
Although Silber and McCoy evaluate their chaining approach with respect to a
summarisation task, they do not compare the results of their algorithm to Barzilay
and Elhadad's. However, in theory their algorithm should perform better, since it does not need the pruning cycle that Barzilay employed to improve the runtime of her algorithm by periodically eliminating low-scoring chain interpretations. Instead they focus on the fact that their algorithm can
complete the chaining of a document in 4 seconds that would have taken Barzilay's implementation 300 seconds. Due to these improvements in time and space complexity, Silber and McCoy's approach does not impose an upper limit on document size.
Bo-Yeong
Unlike the previous three chaining algorithms, Bo-Yeong’s technique (Bo-Yeong,
2002, 2003) belongs to the second category of non-greedy algorithm that
disambiguates words before chain formation proceeds. Her idea is similar to
Okumura and Honda’s approach in that an attempt is made at disambiguating nouns
within a certain local context, or, as Bo-Yeong calls it, a semantic window. The
larger this window size the more nouns are examined, i.e. for a given noun, if the
window size is n then 2n nouns will be involved in its disambiguation. The most
likely sense of a noun is the sense that links most frequently with the terms in the
semantic window, where each sense is assigned a score depending on the strength
of this link (Bo-Yeong, 2002). Once as many terms as possible are disambiguated in
this way, the chaining algorithm proceeds in a similar manner to the generic
algorithm in Figure 2.3.
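A minimal sketch of this semantic-window step is given below. The functions that enumerate the senses of a noun and score the link between a sense and a context word stand in for whatever WordNet-based machinery is actually used; all names and the default window size are illustrative assumptions.

```python
def disambiguate(nouns, index, senses_of, link_strength, window=3):
    """nouns: the noun tokens of the text; index: position of the target noun.
    Returns the sense that links most strongly to the 2n nouns around it."""
    lo, hi = max(0, index - window), min(len(nouns), index + window + 1)
    context = nouns[lo:index] + nouns[index + 1:hi]
    best_sense, best_score = None, 0.0
    for sense in senses_of(nouns[index]):
        score = sum(link_strength(sense, other) for other in context)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```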
Bo-Yeong ran a similar experiment to Barzilay’s, and also found that her
chaining method produced summaries that were closer in content to human
generated summaries than Microsoft Office 2000 summaries. Her chaining method
was also evaluated as a keyword extraction technique, which, when compared to the
KEA extraction system (Witten et al., 1998), extracted more nouns that were
deemed topically important by human judges. However, this evaluation ignored
other parts of speech such as verbs and adjectives, which may have significantly
impacted KEA’s performance.
Galley and McKeown
Like Bo-Yeong, Galley and McKeown (2003) devised a non-greedy chaining
method that disambiguates nouns prior to the processing of lexical chains. Their
method first builds a graph, called a disambiguation graph, representing all possible
links between word senses in the document, where each node corresponds to a
distinct sense of a word. Like most other techniques these relationships between
words are weighted with respect to two factors: the strength of the semantic
relationship between them, and their proximity in the text. Galley’s weighting
scheme is nearly identical to Silber and McCoy’s (2002). However, their
relationship-weight assignments are in general lower for terms that are further apart
in the text. Once the graph is complete, nouns are disambiguated by summing the
weights of all the edges or paths emanating from each sense of a word to other
words in the text. The word sense with the highest score is considered the most
probable sense and all remaining redundant senses are removed from the graph.
Once each word has been fully disambiguated lexical chains are generated from the
remaining disambiguation graph.
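The disambiguation-graph idea can be sketched as follows: every sense of every noun occurrence becomes a node, weighted edges connect related senses of different occurrences, and each occurrence keeps the sense whose incident edge weights sum highest. The edge-weight function, which should decrease with textual distance and with weaker WordNet relations, is left abstract here; this is an illustration of the general scheme rather than Galley and McKeown's exact implementation.

```python
from collections import defaultdict

def disambiguate_graph(occurrences, senses_of, edge_weight):
    """occurrences: list of (noun, position) pairs in text order.
    edge_weight(sense_a, sense_b, distance) returns 0.0 for unrelated senses."""
    incident = defaultdict(float)      # (occurrence index, sense) -> summed weight
    for i, (w1, p1) in enumerate(occurrences):
        for j, (w2, p2) in enumerate(occurrences):
            if i == j:
                continue
            for s1 in senses_of(w1):
                for s2 in senses_of(w2):
                    incident[(i, s1)] += edge_weight(s1, s2, abs(p1 - p2))
    # Keep, for each occurrence, the sense with the highest summed edge weight;
    # lexical chains are then built over these fixed senses.
    return {i: max(senses_of(w), key=lambda s: incident[(i, s)])
            for i, (w, _) in enumerate(occurrences)}
```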
Another interesting aspect of Galley and McKeown’s paper is that they
evaluated the performance of their lexical chaining algorithm with respect to the
disambiguation accuracy of the nouns in the resultant lexical chains. Their
evaluation showed that, in this respect, their algorithm was more accurate than
Barzilay and Elhadad’s and Silber and McCoy’s. A more in-depth discussion on the
format of this experiment is left until Section 3.3 where we report the findings of a
similar experiment undertaken in the course of our research.
2.6 Discussion
In this chapter, we have explored the notion of lexical cohesion and how it relates to
the discourse structure of a text. Our definition of lexical cohesion, taken from
Halliday and Hasan (1984), encompasses five different types of lexical cohesion:
repetition, synonymy, generalisation/specialisation, whole-part/part-whole and
collocation or statistical word association. A number of lexicosemantic knowledge
sources have been used by researchers to capture these lexical cohesive
relationships between words in text. We looked at the advantages and disadvantages
of three such resources: the Longman Dictionary of Contemporary English, the
WordNet online thesaurus and Roget’s Thesaurus. In the course of our research we
rely on the WordNet thesaurus for establishing associations between words in text
due to its popularity in NLP circles. However, although it covers four of the five
forms of lexical cohesion, WordNet is still an inadequate resource for capturing
statistically significant or 'intuitive' word associations. Most researchers try to capture these relationships indirectly by following path lengths in WordNet.
However, it has been shown that this is an unreliable means of establishing
semantic relationships between words because the existence of a path between two
words in the taxonomy (especially a long path) does not necessarily correspond to a
strong semantic relationship between these words.
In the following chapter, we give details of how statistical word associations can
be incorporated into a lexical chaining framework. This framework is based on St-Onge and Hirst's lexical chaining algorithm, described in Section 2.5.4, which like
the majority of approaches is WordNet-based. We make two modifications to their
original algorithm which complement it and help to improve the generation of
lexical chains in a news environment. Our algorithm, LexNews, falls into the
greedy category of chaining algorithms. As stated in Section 2.5, various lexical
chaining researchers, such as Barzilay and Elhadad, have stressed the need for non-greedy approaches to chaining. The assumption is that by postponing the
disambiguation of a noun until all possible lexical chains (or senses of the noun)
have been considered, disambiguation accuracy will improve and the occurrence of
spurious chains will be reduced. However, in Section 3.3 we show that a non-greedy approach to chaining does not necessarily improve disambiguation accuracy,
and that St-Onge and Hirst’s greedy algorithm is as effective as a non-greedy
approach.
Chapter 3
LexNews: Lexical Chaining for News Analysis
Most IR and NLP applications such as question answering, machine translation, text
summarisation, information retrieval and information filtering, to name but a few,
are developed and tested on collections of news documents. News text is a popular
area for such research for two reasons:
1. There is a genuine demand for intelligent and automatic tools for managing
ever-expanding repositories of daily news.
2. Large volumes of news stories are freely available on the Internet, on television
and radio broadcasts and in newswire and print formats. Hence, it is relatively
easy to gather a large collection of related news documents for experimental
purposes, as opposed to collecting documents of a more sensitive and
confidential nature which would lead to access restrictions and other additional
overheads.
As already stated, the focus of the work in this thesis is based around the tasks
defined by the Topic Detection and Tracking (TDT) initiative which, unlike the
majority of other news related investigations, looks at broadcast news
(automatically recognised speech transcripts and closed-caption transcripts from
radio and television broadcasts) as well as documents from standard newswire
collections. With this in mind, we have designed and developed a lexical chaining
method called LexNews that is specifically suited to building lexical chains for
documents in the news domain. Unlike previous chaining approaches, LexNews
integrates domain dependent statistical word associations into the chaining process.
As already stated in Section 2.3, statistical word associations represent an additional
type of lexical cohesive relationship that is not found in WordNet. LexNews also
recognises the importance of analysing the lexical cohesion resulting from the use
of named entities, such as people and organisations, which are ignored by standard
chaining techniques since such proper noun phrases are also absent from the
WordNet taxonomy.
We begin this chapter with a discussion of the basic chaining algorithm used in
our LexNews implementation. This algorithm was devised by St-Onge and Hirst
and was categorised in the previous chapter as a ‘greedy WordNet-based chaining
approach’. In Section 3.2, the LexNews algorithm is described in terms of the
enhancements made to this basic chaining algorithm. This is followed by a more in-depth analysis of the lexical chains generated by the LexNews algorithm, which
includes an evaluation of the quality of the chains with respect to disambiguation
accuracy, details of chain statistics, and finally a discussion on how lexical chains
can be used as an intermediary natural language representation of news story topics.
3.1 Basic Lexical Chaining Algorithm
In Section 2.5.4, we described a lexical chaining algorithm by St-Onge and Hirst
that establishes relationships between nouns in a text using the WordNet taxonomy.
This algorithm is an important focal point of the work described in this thesis as it
forms the basis of our own lexical chaining approach, LexNews. As already stated,
LexNews has been adapted in a number of ways, which gives it an advantage over
other chaining algorithms in a news domain. In this section, we will review some of
the main aspects of St-Onge’s chainer, and look more closely at how the algorithm
traverses WordNet’s noun taxonomy seeking out valid relationships between noun
phrases along its pathways.
We begin our dissection of St-Onge and Hirst’s approach with their definition of
the three possible link directions between words in the WordNet thesaurus:
Horizontal Link: Includes antonym relationships, i.e. nouns that are opposites
such as ‘life’ and ‘death’.
Upwards Link: Includes semantically more general relationships such as
meronyms (‘building’ is more general than ‘office_building’) and hypernyms
(‘flower’ is more general than ‘rose’).
Downwards Link: Includes semantically more specific relationships such as
hyponyms (‘fork’ is more specific than ‘cutlery’) and holonyms (‘school_year’
is more specific than ‘year’).
Coupled with these link direction definitions are three categories of relationship
type based on repetition relationships and lexicographic WordNet relationships:
Extra Strong Relations: These relationships include word repetitions, e.g.
mouse/mice.
Strong Relations: These relationships are split into the following subtypes
o Two words are related if they have the same synset number in
WordNet, e.g. telephone/phone
o Two synsets are related if they are connected by a horizontal link
o Two synsets are related if an upward link or downward link exists
between them
o Two words are related if one word is a compound noun that contains
the other, e.g. orange_tree, tree
Medium-strength Relations: Two synsets are related if an allowable path of
length greater than 1, but no more than 5 exists between them in WordNet,
where an allowable path is defined by the following two rules (a short code sketch of these rules is given after this list)
o No other direction must precede an upward link
o No more than one change of direction is allowed except when a
horizontal link is used to make the transition from an upward to a
downward link
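The two path rules can be sketched as a simple check over the sequence of link directions in a candidate path, as below. How exactly the horizontal-link exception to the second rule should be counted is interpreted here as an assumption; the sketch is illustrative rather than a reconstruction of St-Onge's code.

```python
def is_allowable_path(directions):
    """directions: the path's links in order, each 'up', 'down' or 'horizontal'."""
    if not 1 < len(directions) <= 5:                  # medium-strength: length 2 to 5
        return False
    changes = 0
    for prev, curr in zip(directions, directions[1:]):
        if curr == 'up' and prev != 'up':             # rule 1: nothing else precedes an upward link
            return False
        if curr != prev and not (prev == 'up' and curr == 'horizontal'):
            changes += 1                              # rule 2, with the horizontal exception
    return changes <= 1

print(is_allowable_path(['up', 'up', 'down']))        # True: one change of direction
print(is_allowable_path(['down', 'up']))              # False: violates rule 1
```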
As explained in Section 2.4.4, the use of WordNet to measure the semantic distance
between words is based on the premise that semantic distance is directly related to
semantic proximity (or the number of edges) between terms in the taxonomy. As
stated before, this basic assumption is incorrect as it presumes that every edge and
all branches (sub-hierarchies) in the taxonomy are of equal length. Unfortunately,
this is not the case and a strategy to compensate for these inadequacies is needed.
Hence, St-Onge and Hirst define link directions and link categories in order to limit
the number of semantically proximate but unrelated terms that can be found by
following illogical paths in the taxonomy. The two rules that define an allowable
path in a medium-strength relationship are based, according to Budanitsky (1999),
on ‘psycholinguistic theories concerning the interplay of generalisation,
specialisation, and coordination9’. However, only intuitive reasons for these rules
are given in St-Onge’s thesis.
9 Coordination is a structural relationship that exists between words due to the use of the conjunctions 'and' and 'or'. An example of a coordinate relationship between terms is a string of words such as 'all green vegetables are good for you especially spinach, cabbage and broccoli'. This is an obvious list of related words, but less obvious context-dependent relationships may be identified by words or even clauses being listed together in this way, e.g. 'Tom, Dick and Harry' are also associated through a coordination relationship. This type of relationship is captured in WordNet by the inclusion of noun phrases like 'skull_and_crossbones' and 'seek_and_destroy_mission'. It is less specific than specialisation or generalisation associations.
With regard to the first rule, which states that no other link direction may precede an upward link, St-Onge explains that once the context has been narrowed down by an antonym or a
specific relationship, ‘enlarging the context by following a general link doesn’t
make much sense’. With regard to the second rule, St-Onge explains that changes of
direction constitute large semantic steps, and therefore must be limited except in the
case of horizontal links that represent small semantic steps. However, the only
horizontal link possible between two nouns is an antonym relationship which means
that this exception to the second rule is a rare occurrence. Hence, most medium-strength relationships will consist of paths of generalisations (upward links) or paths
of specialisations (downward links). The example shown in Figure 3.1, of an
inaccurate link between the words ‘handbag’ and ‘airbag’, helps to illustrate the
need for these rules.
[Figure: diagram of a path in WordNet connecting ‘handbag’ and ‘airbag’ through the intermediate concepts ‘clasp’, ‘fastener’ and ‘restraint’, mixing generalisation and specialisation links.]
Figure 3.1: Example of a spurious relationship between two nouns in WordNet by not following St-Onge and Hirst’s rules.
As well as defining relationship categories, St-Onge and Hirst also rank them in
order of strength. Thus when their chaining algorithm is searching for a relationship
between a target word and a word in a chain it seeks an extra strong relationship
first, then a strong relationship, and finally a medium-strength relationship, if the
preceding relationships were not found. Unlike the relationships that make up the
extra strong and strong link categories, medium-strength relationships differ in
strength with respect to two factors: the length of the path and the number of
direction changes in the path. Hence, even if the algorithm comes across a medium-strength relationship it must continue searching for all remaining medium-strength
relationships between the target word and all currently created chains, in order to
ensure that it has found the strongest possible medium-strength link. St-Onge
defines a formula to weight these medium-strength links:

Link Strength = C − path length − k × (number of direction changes)        (3.1)

where C and k are constants. According to this equation, medium-strength relationships with short path lengths and a low number of direction changes are assigned higher weights.
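A small sketch of how Equation 3.1 might be used to rank candidate medium-strength links; the constant values C = 8 and k = 1 are illustrative assumptions only, since the parameter settings are not restated here.

    def medium_link_strength(path_length, direction_changes, C=8, k=1):
        """Weight of a medium-strength link (Equation 3.1); C and k are constants."""
        return C - path_length - k * direction_changes


    # Candidate links to existing chain words, as (word, path length, changes).
    candidates = [('murder', 3, 1), ('crime', 2, 0), ('official', 5, 2)]

    # The chainer keeps the candidate with the strongest medium-strength weight.
    best = max(candidates, key=lambda c: medium_link_strength(c[1], c[2]))
    print(best)   # ('crime', 2, 0), with a strength of 6 under these constants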
St-Onge and Hirst also impose further restrictions on the allowable word-chain
distance between related terms, where the word-chain distance is defined as the
difference between the sentence number of the current word being added to the
chain and the sentence number of the chain word with the highest (or closest)
sentence number to the current word. More specifically, St-Onge limits the search
scope to a distance of 7 sentences for strong relationships and 3 sentences for
medium-strength links. No distance restriction is defined for extra strong
(repetition) relationships as St-Onge’s algorithm concurs with the ‘one sense per
discourse’ assumption (Gale et al., 1992).
St-Onge and Hirst’s chainer is categorised in Chapter 2 as a greedy lexical
chaining approach. This implies that the algorithm adds a word to a chain by
considering only the context to the left of that word up to that point in the text, and
that no information regarding the context to the right of the word is considered in
the chaining process. However, this is not entirely true as St-Onge’s implementation
attempts to chain words in a sentence first (the word’s immediate context) before
committing the algorithm to choosing a particular weak sense for a word or making
it the seed of a new chain. This process is implemented using a queue data structure
where each word in a sentence n is added to the queue, and extra strong
relationships are sought between these sentence words and all currently created
chains. The search though the chain stack for a relationship with the current queue
member halts soon as an extra strong relationship is found, whereupon the
candidate term is removed from the queue and added to the related chain. Strong
relationships are then sought between all remaining members of the sentence word
queue and each lexical chain. Again, any related words are added to their respective
chains and removed from the queue as soon as a strong relationship is found. This
process is repeated for medium-strength relationships. However in this case, as
already stated, all medium-strength connections within the search scope must be
sought, since the weight of a medium-strength relationship can vary (unlike extra
strong or strong links).
Once all medium-strength relationships are found and weighted using Equation
3.1, the current queue word is deleted from the queue and added to the chain with
the strongest medium-strength weight. At this point in the algorithm, if there are
still unchained words in the sentence queue, a new chain for each unchained queue
member is created. Use of this queue structure is similar to a windowing approach
where words are disambiguated with respect to their left- and right-hand contexts.
The size of these left- and right-hand contexts depends on the position of the word
in the sentence, and the sentence’s begin and end points in the text.
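The queue-based processing of a single sentence can be summarised in the following sketch; the three find_* callables stand in for the WordNet and repetition look-ups described above, and none of the names belong to a published implementation.

    def chain_sentence(sentence_nouns, chains,
                       find_extra_strong, find_strong, find_medium_links):
        """Sketch of the sentence-queue step: `chains` is the current list of
        chains (each a list of words); find_extra_strong and find_strong return
        a related chain or None, and find_medium_links returns a list of
        (chain, weight) pairs within the allowable search scope."""
        queue = list(sentence_nouns)

        # Pass 1: extra strong (repetition) relationships.
        for word in list(queue):
            chain = find_extra_strong(word, chains)
            if chain is not None:
                chain.append(word)
                queue.remove(word)

        # Pass 2: strong relationships (synonymy, antonymy, compounds, etc.).
        for word in list(queue):
            chain = find_strong(word, chains)
            if chain is not None:
                chain.append(word)
                queue.remove(word)

        # Pass 3: medium-strength relationships -- all candidates are collected
        # so that the strongest weighted link (Equation 3.1) can be chosen.
        for word in list(queue):
            links = find_medium_links(word, chains)
            if links:
                chain, _ = max(links, key=lambda cw: cw[1])
                chain.append(word)
                queue.remove(word)

        # Any noun left in the queue becomes the seed of a new chain.
        for word in queue:
            chains.insert(0, [word])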
[Figure: three successive states of the chain stack as the words ‘kid’ and ‘school’ are processed; the stack initially holds (from head to bottom) the chains {foster_home, home}, {guardian, ward, deputy, official}, {child} and {fund, money, assistance, welfare}; ‘kid’ joins the {child} chain, which is then moved to the head of the stack, and ‘school’ subsequently seeds a new chain that is pushed onto the head.]
Figure 3.2: Diagram illustrating the process of pushing chains onto the chain stack.
One final important feature of St-Onge’s algorithm is chain salience. This facet
of the algorithm ensures that words are not only added to the most strongly related
chain, but also the most recently updated chain. This idea is based on the notion that
chains that are currently active in the text will be the most appropriate context in
which to disambiguate the current noun. This feature of their algorithm is
implemented using the ‘chain stack’ data type illustrated in Figure 3.2. In this
example, the next word in the sentence queue to be added to the chain stack is the
word ‘kid’ which is added to the chain at position 1 in the stack. Since ‘kid’ and
‘child’ are synonyms (a type of strong relationship), the search for a related term is
complete, and ‘kid’ is added to this chain which is in turn moved to the head of the
chain stack. The next word in the current sentence queue is ‘school’; however, no
lexicographical relationship is found between this word and any of the chains in the
chain stack so ‘school’ becomes the seed of a new chain. This chain is then pushed
onto the head of the chain stack.
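The chain salience behaviour can be pictured with a simple move-to-front stack; the data structure below is our own illustration of the idea, loosely mirroring Figure 3.2.

    class ChainStack:
        """Move-to-front stack of lexical chains used to model chain salience."""

        def __init__(self, chains=None):
            self.chains = chains or []          # index 0 is the head of the stack

        def add_word(self, word, related_chain=None):
            if related_chain is None:           # no relationship found: new chain
                self.chains.insert(0, [word])
            else:                               # add to the chain and promote it
                related_chain.append(word)
                self.chains.remove(related_chain)
                self.chains.insert(0, related_chain)


    stack = ChainStack([['foster_home', 'home'], ['guardian', 'ward'], ['child']])
    stack.add_word('kid', related_chain=stack.chains[2])   # synonym of 'child'
    stack.add_word('school')                               # unrelated: seeds a chain
    print(stack.chains)
    # [['school'], ['child', 'kid'], ['foster_home', 'home'], ['guardian', 'ward']]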
3.2
Enhanced LexNews Algorithm
In the previous section we reviewed St-Onge and Hirst’s approach to chain
generation. In this section we give details of our lexical chaining system called
LexNews which is an enhanced version of their algorithm. LexNews consists of two
components: a ‘Tokeniser’ for text preprocessing and selection of candidate terms
for chaining, and a ‘Chainer’ which clusters related candidate terms into groups of
semantically associated terms. Our novel approach to chaining differs from
previous chaining attempts (see Chapter 2) in two respects:
It incorporates genre specific information into the chaining process in the form
of statistical word associations.
It acknowledges the importance of considering proper nouns in the chaining
process when dealing with text in a news domain, and builds a distinct set of
chains representing the repetition relationships between these parts of speech in
the text.
The motivation for including statistical word associations in the chaining procedure
is discussed in the following section and our method of generating these co-occurrence relationships is also described. This is followed by a description of our
tokeniser based on the identification of technical terms in text proposed by Justeson
and Katz (1995) and our enhanced lexical chainer. Figure 3.3 gives an overview of
the LexNews architecture.
[Figure: input news articles pass through the Tokeniser, comprising a sentence boundary identifier, part-of-speech tagger and noun phrase parser; WordNet and the statistical word associations also feed into the process, and the candidate terms produced by the Tokeniser are passed to the Lexical Chainer, which outputs the lexical chains generated for each news article.]
Figure 3.3: LexNews system architecture.
3.2.1 Generating Statistical Word Associations
As previously stated, existing approaches to generating lexical chains rely on the
existence of either repetition or lexicographical relationships between nouns in a
text. However, there are a number of important reasons for also considering co-occurrence relationships in the chaining process. These reasons relate to the
following missing elements in the WordNet taxonomy:
Missing Noun Relationships: The fact that all relationships between nouns in
WordNet are defined in terms of synonymy, specialisation/generalisation,
antonymy and part/whole associations means that a lot of intuitive cohesive
relationships, which cannot be defined in these terms, are ignored. This
characteristic of the taxonomy was referred to in Section 2.4.4 as ‘the tennis
problem’ (Fellbaum, 1998b), where establishing links between topically related
words such as ‘tennis’, ‘ball’ and ‘net’ is often impossible.
Missing Nouns and Noun Senses: WordNet’s coverage of nouns is continually
improving with the release of each new version. However, a lot of British and
Hiberno-English phrases are still missing, including the ‘sweater’ sense of
‘jumper’, the ‘bacon’ sense of ‘rasher’ and the ‘drugstore’ sense of ‘chemist’.
On occasion there are also noun phrase omissions, for example ‘citizen band
radio’ and its abbreviation ‘CBR’. However, the omission of examples such as
these usually only becomes obvious or critical when working in a very specific
domain.
Missing Compound Noun Phrases: Many compound nouns are also absent
from the noun taxonomy. For example, there are a number of important news-related compound noun phrases such as ‘suicide bombing’ or ‘peace process’
that are not listed. Quite often these compound nouns are ‘media phrases’, that
in time will find their way into future versions of the WordNet thesaurus.
The generation of co-occurrence relationships between nouns from the news
story domain is one method of addressing these inadequacies. In fact, augmenting
WordNet with these types of statistical relationships has been, and still remains, a
‘hot-topic’ in computational linguistic research. Many researchers believe that this
is the most appropriate means of both improving the connectivity of a general
ontology, and of adapting it to better suit a particular domain specific application
(Resnik, 1999; Agirre et al., 2000; Mihalcea, Moldovan, 2001; Stevenson, 2002).
In our case, rather than attempting the complex task of fitting word associations
into the taxonomy (i.e. disambiguating word associations with respect to synsets),
and the subsequent re-organisation of the taxonomy, we view our co-occurrence
data instead as an auxiliary knowledge source that our algorithm can ‘fall back on’
when a relationship cannot be found in WordNet. We will now look at how these
co-occurrences were generated using a log-likelihood association metric on the
TDT1 broadcast news corpus (Allan et al., 1998).
Tokenisation of the TDT1 corpus is the first step in generating bigram statistics
for token pairs, where tokens in this context are simply all WordNet nouns and
compound nouns (excluding proper nouns) identified by the JTAG tagger (Xu,
Broglio, Croft, 1994). These nouns are transformed, if necessary, into their singular
form using handcrafted inflectional morphological rules. With regard to the
identification of proper noun phrase relationships, we felt this process would
complicate the estimation of frequencies as some form of normalisation would be
needed in order to tackle the disambiguation and mapping of phrases such as
‘Hillary Clinton’ and ‘Senator Clinton’ to a single concept, while correctly identifying
that ‘Bill Clinton’ is an entirely different entity. However, identifying these types of
relationships would be a definite advantage during the chaining process, and a
possible avenue for future research.
The next step is the collecting and counting of bigram frequencies in a window
size of four nouns, where all combinations of noun bigrams within this window are
extracted and their frequency counts are updated. We emphasise the phrase
‘window size of four nouns’ as most co-occurrence statistics are calculated based
on the entire vocabulary of the corpus. We are (as previously stated) only interested
in relationships between nouns. Therefore, we simplify this process by providing
the bigram identifier with only nouns in this active window, where the window is
limited by sentence and document boundaries. In addition, the ‘left-right’ ordering
of the nouns in each bigram in the window is preserved with respect to their original
ordering in the text.
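A minimal sketch of this counting step under the assumptions just described (a window of four nouns, left-to-right ordering preserved, windows never crossing sentence boundaries); the function and variable names are our own.

    from collections import Counter

    def count_noun_bigrams(sentences, window=4):
        """Count left-to-right noun bigrams whose members occur within a
        window of `window` nouns of one another; `sentences` is a list of
        sentences, each given as a list of singularised nouns."""
        counts = Counter()
        for nouns in sentences:
            for i, left in enumerate(nouns):
                for right in nouns[i + 1:i + window]:
                    counts[(left, right)] += 1
        return counts


    sentences = [['airport', 'flight', 'passenger', 'runway', 'delay']]
    counts = count_noun_bigrams(sentences)
    print(counts[('airport', 'passenger')])   # 1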
Once the relevant statistical counts are collected the algorithm then uses a
statistical association metric to determine which bigrams occur in the TDT1 corpus
more often than would be expected by chance. Dunning (1994) highlights a number
of common statistical measures of co-occurrence and their inadequacies. In
particular, Dunning states that it is incorrect to assume that all words are either
normally or approximately normally distributed in a large corpus. For example,
‘simple word counts made on a moderate sized corpus show that words which have
a frequency of less than one in 50,000 words make up about 20% - 30% of typical
English newswire reports’. Hence, these low frequency words are too rare to expect
standard statistical techniques (that rely heavily on the assumption of normality) to
work. In particular, Dunning criticises the use of measures like Pearson's χ2 test and z-score tests since they tend to over-estimate the significance of the occurrence of rare events. Instead, he suggests that likelihood ratio tests that ‘do not depend so critically on assumptions of normality’ are more suitable for textual analysis. He also stresses that a likelihood ratio test such as the G2 statistic is easier to interpret than a t-test, z-test or χ2 test as it does not have to be looked up in a table. Instead, it can be directly interpreted as follows:
G2 measures the deviation between the expected value of the frequency of the
bigram AB and the observed value of its occurrence in the corpus.
We use the following log-likelihood formula taken from Pedersen (1996) to
measure the independence of the words in a bigram, where Table 3.1 is a
contingency table showing the counts needed to estimate G2.
Bigram = AB        A        ¬A       Total
B                  n11      n12      n1+
¬B                 n21      n22      n2+
Total              n+1      n+2      n++
Table 3.1: Contingency table of frequency counts calculated for each bigram in the collection.
n11 is the frequency of the bigram AB in the collection, n1+ is the number of bigrams with the
word B in the right position, n+1 is the number of bigrams with A in the left position, and n++ is
the total number of bigrams in the corpus.
The maximum likelihood estimate is then calculated as follows:

mij = (ni+ × n+j) / n++        (3.2)

and the log-likelihood ratio is defined as:

G2 = 2 Σi,j nij log(nij / mij)        (3.3)
So after filtering out less significant bigrams (i.e. removing all bigrams with G2 <
18.9) from a corpus of 1,565,988 nouns, we collected 25,032 significant bigrams or
collocates, which amounted to 3,566 nouns that had an average of 7 collocates each.
This is admittedly a relatively small number of co-occurrence relationships;
however, our intention here is to capture only the most strongly related and domain
specific on-topic noun relationships in the corpus, as most other relationships can be
found using the WordNet taxonomy.
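For illustration, the G2 statistic of Equations 3.2 and 3.3 can be computed directly from the four cell counts of Table 3.1; the counts below are invented purely to show the calculation, and 18.9 is the cut-off quoted above.

    import math

    def log_likelihood_ratio(n11, n12, n21, n22):
        """G2 for a 2x2 bigram contingency table: n11 = count of the bigram AB,
        n12 = B preceded by some other word, n21 = A followed by some other
        word, n22 = bigrams containing neither pattern."""
        n1p, n2p = n11 + n12, n21 + n22          # row totals
        np1, np2 = n11 + n21, n12 + n22          # column totals
        npp = n1p + n2p                          # total number of bigrams

        observed = [n11, n12, n21, n22]
        expected = [n1p * np1 / npp, n1p * np2 / npp,
                    n2p * np1 / npp, n2p * np2 / npp]
        return 2 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0)


    # Invented counts: AB occurs 30 times in 100,000 bigrams, A occurs first in
    # 200 bigrams and B occurs second in 150 bigrams.
    g2 = log_likelihood_ratio(n11=30, n12=120, n21=170, n22=99680)
    print(g2 > 18.9)   # True: this bigram would be retained as a collocate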
AIDS: virus 0.993134 HIV 0.950758 patient 0.897451 research 0.806503 disease
0.8009 infection 0.788194 vaccine 0.753551 activist 0.662563 epidemic 0.64386
drug 0.635456 researcher 0.625237 health 0.528567 cure 0.41355 counselor
0.408026 testing 0.16426 treatment 0.0705887 cancer 0.0479009
Airport: flight 0.978022 runway 0.926649 passenger 0.845762 plane 0.813131
hijacker 0.779987 arrival 0.771899 convoy 0.628275 port 0.59517 harbor 0.528054
airline 0.506392 airlift 0.503038 delay 0.482481 facility 0.408617 perimeter
0.404356 departure 0.309304 cargo 0.214528 pilot 0.158775
Glove: estate 0.995581 scene 0.973169 blood 0.966698 property 0.955137 murder
0.935448 walkway 0.920652 home 0.874014 crime 0.870699 defense 0.859927
driveway 0.834991 match 0.820273 evidence 0.817314 guesthouse 0.773595 racist
0.702691 knit 0.697956 house 0.682884 detective 0.675308 new_yorker 0.667298
trail 0.630485 hair 0.617661 grounds 0.582741 article 0.582623 left_hand 0.549755
bronco 0.540522 mansion 0.46512 prosecutor 0.428464 plastic 0.338857 bag
0.302399 body 0.0952888 football 0.0610795 prosecution 0.0288431
Plutonium: reactor 0.944839 nuclear_weapon 0.927557 ingredient 0.890152
uranium 0.84229 bomb 0.840949 material 0.674085 Munich 0.605903 waste
0.5494 fuel 0.360125 Korea 0.339765 expert 0.202612 weapon 0.202573 site
0.0967882 facility 0.0877131
Figure 3.4: Examples of statistical word associations generated from the TDT1 corpus.
Figure 3.4 contains examples of statistically derived word associations (in order
of normalised strength) for four nouns in the TDT1 collection. The majority of
these relationships could not be found using the WordNet thesaurus. For example,
the compound noun phrases ‘AIDS activist’, ‘AIDS epidemic’ and ‘Plutonium
reactor’, and the word associations ‘airport-passenger’, ‘airport-plane’ and ‘AIDS-HIV’ cannot be associated using the taxonomy and the basic chaining algorithm
described in Section 3.1. Although the relationship between AIDS and HIV is
represented in WordNet, a path length of 16 edges must be traversed in order to
establish a link between these words, and our chaining algorithm only looks at
relationships with a maximum path length of four edges.
Although Figure 3.4 contains a number of motivating word relationships, there
are also examples of weakly related word associations that have surprisingly high
strength of association scores. In particular, we noticed a number of strange
relationships with the noun ‘glove’, e.g. ‘glove-estate’, ‘glove-scene’, ‘glove-murder’, and ‘glove-blood’. These co-occurrences are classic examples of how
corpus statistics can be skewed when generated from a document collection
containing a large number of stories on a particular topic. In this case the topic is
the OJ Simpson trial, where a blood-soaked glove found near the crime scene was
discussed at length in the trial by the prosecution, and consequently in the TDT1
collection. One method of tackling this problem is to remove documents from some
of the larger topics in the collection when calculating associations. However, we
felt that a number of important word relationships pertaining to the ‘judicial system’
may have been lost, so all 16,856 TDT1 news stories were considered. We also
found that the occurrence of weak links like these in the chaining process is
infrequent as the likelihood of finding, for example, the nouns ‘glove’ and ‘blood’
in the same news story in a news corpus which does not cover the OJ Simpson case
is low, thus ensuring the integrity of relationships between lexical chain members.
However, incorporating statistical word associations into the chaining process
does pose one significant problem when generating WordNet-based lexical chains.
More specifically, statistical word associations fail to consider instances of
polysemy where the sense of a word defined in a chain may not be related to the
intended sense of the statistically associated word. For example, ‘gun’ is
statistically related to the word ‘magazine’, and so they should in theory be added to
the same lexical chain. However, in the context of the lexical chain {book,
magazine, new edition, author}, it is evident that the ‘publication’ sense of the word
‘magazine’ is intended rather than the ‘ammunitions’ sense, and so the noun ‘gun’
should not be added to this chain. Unfortunately, errors such as these will result in
the generation of spurious chains. For the remainder of this section, we will
examine the LexNews chain generation process.
3.2.2 Candidate Term Selection: The Tokeniser
The objective of the chain formation process is to build a set of lexical chains that
capture the cohesive structure of the input stream. However, before work can begin
on lexical chain identification, all sentence boundaries in each sample text must be
identified. We define a sentence boundary in this context as any word delimited by
any of the characters in the following pattern: [ ! | . | .” | .’ | ? ]+. Exceptions to this
rule include abbreviations that contain full stops such as social titles (e.g. ‘Prof.’,
‘Sgt.’, ‘Rev.’ and ‘Gen.’), qualifications (e.g. ‘Ph.D.’ ‘M.A.’ and ‘M.D.’), first and
middle names reduced to initials (e.g. ‘W.B. White’), and abbreviated place names
(e.g. ‘Mass.’ and ‘U.K.’).
Once all sentences have been identified in this way, the text is then tagged using
the JTAG part-of-speech tagger (Xu, Broglio, Croft, 1994). All tagged nouns in the
text are then identified and morphologically analysed: all plurals are transformed
into their singular state, adjectives pertaining to nouns are nominalized and all
sequences of words that match grammatical structures of compound noun phrases
are extracted, e.g. WordNet compounds such as ‘red wine’, ‘act of god’ or ‘arms
deal’. By considering such noun phrases as a single unit, we can greatly reduce the
level of ambiguity in a text. This will significantly reduce lexical chaining errors
caused by phrases such as ‘man of the cloth’ (priest) which do not reflect the
meaning of their individual parts. To identify these sequences of proper noun/noun
phrases our algorithm uses a series of regular expressions. Noun phrases that match these patterns are often referred to as technical terms, an idea first
proposed by Justeson and Katz (Justeson, Katz, 1995). Another advantage of
scanning part-of-speech tagged news stories for these technical terms is that
important non-WordNet proper noun phrases, such as White House aid, PLO leader
Yasir Arafat and Iraqi leader Saddam Hussein, are also discovered. In general,
news story proper noun phrases will not be present in WordNet, since keeping an
up-to-date repository of such words is a substantial and unending problem.
However, non-WordNet proper nouns are still useful to the chaining process since
they provide a further means of capturing patterns of lexical cohesion though
repetition in the text. For example, consider the following news story extract in
Figure 3.5.
Iraqi President Saddam Hussein has for the past two decades the dubious
distinction of being the most notorious enemy of the Western world.
Saddam was born in a village just outside Takrit in April 1937. In his teenage
years, he immersed himself in the anti-British and anti-Western atmosphere of the
day. At college in Baghdad he joined the Baath party.
Figure 3.5: Example of noun phrase repetition in a news story.
There are 2 distinct technical term references to Saddam Hussein in Figure 3.5:
Iraqi President Saddam Hussein and Saddam. As is evident in this passage the main
problem with retaining words in their compound proper noun format is that they are
less likely to have exact syntactic repetitions elsewhere in the text. Hence we
introduce into our lexical chaining algorithm a fuzzy string matcher that looks first for a full syntactic match (Saddam_Hussein matching Saddam_Hussein) and then a partial syntactic match (Iraqi_President_Saddam_Hussein matching Saddam).
Approximate string matching spans a very large and varied area of research,
which includes computational biology, IR and signal processing, to name but a few
(Navarro, 2001). The most common algorithm for determining gradations in string
similarity is one based on calculating the edit distance between two words, which
calculates the minimum possible cost of transforming the two words so that they
match exactly. This process may involve a number of insertions, deletions or
replacements of letters in the respective strings, where the edit distance is the sum
of the costs attached to each transformation. So for example, if each change has a
cost of 1 then the edit distance between tender and tenure is 3 since this
transformation involves three replacements: d to u, e to r, r to e.
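A standard dynamic-programming sketch of the edit distance calculation just described; it simply reproduces the tender/tenure example and is not part of the LexNews matcher.

    def edit_distance(a, b):
        """Minimum number of insertions, deletions and replacements (unit costs)."""
        dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
              for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # replacement or match
        return dp[-1][-1]


    print(edit_distance('tender', 'tenure'))   # 3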
Another popular alternative to the edit distance measure is to calculate the
longest common subsequence (LCS), which measures the longest (order-dependent)
sub-string common to both words. Since we are looking at string matching as a
means of approximating semantic similarity between words, any matching function
that we use must have strict matching constraints in order to ensure that tentative
links between compound proper noun phrases are ignored. Hence, our LCS distance
measure imposes the following limitations:
Let S ∈ Σ* (where Σ is a finite alphabet) be a string of length |S|.
Let P ∈ Σ* be a string or pattern of length |P| ≤ |S|.
Let d(S, P) be a distance metric such that P ⊂ S, where d(S, P) is the length of the longest sub-string in P that matches a sub-string in S. However, a match between P and S is only found if the following conditions hold:
1. |P| > 3 and d(S, P) > 3
2. If P ⊄ S and |P| > 6 then find d(S, P′) such that P′ ⊂ P, where P′ is P with the last three letters deleted. In line with condition 1, the following inequalities must hold true: |P′| > 3 and d(S, P′) > 3
3. Pattern P or P′ must occur at text position 0 to |P| or |P′| respectively, in string S.
Condition 1 ensures that words like Prince, Prime_minister and Prisoner do not
match. Immediately one might think that there are prefixes longer than 3 that could
cause the same problem like fore-, ante-, trans-, semi-; however, our compound
noun phrases mainly consist of proper nouns which, unlike regular nouns, in
general do not take prefixes such as these. Condition 3 ensures that a string like Stan in Stan_Laurel does not match within a word like Pakistan or Afghanistan, or Tina in Tina_Turner with Argentina. Longer proper nouns often have slight
variations due to pluralisation or the genitive case, so condition 2 ensures that the
phrase Amateur_Footballer’s_Organization matches the proper noun phrase
Football_Association_Ireland.
These heuristics for capturing variations in compound noun phrases are not foolproof. For example, in the case of a story on “citizen band radio” the matching function was unable to associate CBR with CBers (people who use CBRs) due to Condition 1. Also, the algorithm is unable to find the link between an acronym such
as CBR and the phrase citizen band radio, which would need to be resolved using
an entity normalisation technique commonly used in Information Extraction
applications. However, this fuzzy matching technique is sufficient for our purposes.
In summary then, the Tokeniser produces tokenised text consisting of noun and
proper noun phrases, including information on their location in the text, i.e. word
number and sentence number. This information is then given as input to the next
step in the LexNews algorithm, the Lexical Chainer. During the course of our work,
both the Chainer and the Tokeniser used version 1.6 of WordNet. The fuzzy
matching algorithm for proper noun phrases described in this section is used during
the chain creation step, which is described in more detail in the next section.
3.2.3 The Lexical Chainer
In Section 3.1, we described St-Onge and Hirst’s lexical chaining approach, which,
as already stated, forms the basis of our own lexical chaining algorithm. The aim of
our chainer is to find relationships between tokens (nouns, proper nouns, and
compound nouns) in the data set using the WordNet thesaurus and a set of statistical
word associations. Our algorithm follows all of the following chaining constraints
implemented by St-Onge and discussed in Section 3.1:
Word relationship types and strengths (extra strong, strong and medium-strength).
Maximum allowable word-chain distances between the current word and a
chain word (applies to strong and medium strength relationships).
Chain salience (implemented using the chain stack data structure).
Sentence-based disambiguation (implemented using the sentence queue data
structure).
Rules defining admissible paths between related words in the WordNet
taxonomy.
All of these constraints help to eliminate spurious or weakly cohesive chains. An
example of a spurious link between words would be associating ‘gas’ with ‘air’
(hypernymy) when ‘gas’ refers to ‘petroleum’ (synonymy). This example illustrates
the necessity for seeking out semantic word relationships based on the ordering set
out by St-Onge, where extra strong relationships precede strong relationships, and
strong relationships are followed by a medium-strength relationship search in the
taxonomy. In the ‘gas-petroleum-air’ case both relationships are strong; however,
synonymy precedes hypernymy in the search for a strong word connection.
This point leads us on to the question of where statistical word association
should fit into this searching strategy. Let us first define what a statistical word
association is in this context:
A statistical word association exists between a word and a chain word if the
log-likelihood association metric indicates that the co-occurrence of these words
in the TDT1 news corpus is greater than chance.
In Section 2.3, we stated that this type of word connection occurs when there is an
intuitive link between words, but that the nature of the association cannot be
defined in terms of repetition, synonymy, antonymy, specialisation/generalisation or
part/whole relationships. However, when determining these relationships using a
statistical measure of association some lexicographical relationships defined in
WordNet will also be found. For example, in Figure 3.4 the occurrence of ‘AIDS’
and ‘disease’ is statistically significant and this relationship is also captured in the
WordNet taxonomy where ‘AIDS’ is a type of (hyponym) ‘infectious disease’ and
‘infectious disease’ is a type (hyponym) of ‘disease’. Since these statistical word
associations are not mapped to any synset numbers in WordNet they do not provide
the algorithm with an explicit means of disambiguating a related word when it is
added to a chain. Hence, our algorithm puts statistical word associations last in its
relationship search (i.e. after medium-strength relationships). However, statistical
word associations for the most part are strong evidence of a connection between
words so we define the maximum allowable word-chain distance for this
relationship as 5 sentences (the same distance constraint imposed on medium-strength relationships).
Up to this point in our description of the LexNews algorithm we have reported
on the generation of one type of lexical chain consisting of WordNet nouns.
However, an integral part of our algorithm is the inclusion of proper noun phrases
in the chaining process (e.g. Chairman Bill Gates, Economist Alan Greenspan).
Other chaining algorithms ignore these phrases as their coverage in WordNet is
either sparse or non-existent. The chaining procedure for proper noun chains is
simpler than for their noun-only counterparts, since the algorithm is not concerned
with either statistical or lexicographical relationships. Instead it uses the repetition-based fuzzy matching function described in Section 3.2.2 to find associations
between compound proper noun phrases in the text. As was the case for noun
chaining, word associations are searched for in order of strength. The following is a
list of fuzzy matches in order of strength, starting with the strongest match first:
1. Exact Full Phrase Match: Helmut_Kohl matches Helmut_Kohl.
2. Partial-Phrase Exact-Word Match: Hubble_Telescope matches Space_Telescope_Science_Institute.
3. Partial-Phrase Partial-Word Match: National_Caver’s_Association matches Irish_Cave_Rescue_Organisation.
As previously stated, during noun-only chaining, allowable chain-word distances
are longer for stronger relationships than weaker associations. A distance parameter
is also enforced during proper noun chaining; in this case, however, all types of phrase match are classified as repetition relationships, and so are assigned an unlimited word-chain distance (confined only by the length of the document). Optimal
values for these distance parameters are discussed in Section 3.3. The result of this
part of the chaining process is a distinct set of proper noun lexical chains that,
coupled with the noun-based chains, form a representation of the lexical cohesive
structure of a news story document. Appendix A contains a more formal description
of the chaining part of our LexNews algorithm. Appendix A also includes a
stopword list of ‘problematic’ WordNet nouns that are eliminated from the chaining
process since they are often the root cause of spurious chains, i.e. chains containing
incorrectly disambiguated or weakly cohesive chain members. Stopword lists are
also widely used in other lexical chaining implementations (St-Onge, 1995;
Stairmand, 1997; Green, 1997b). In our case, most of these offending nouns were
found by automatically looking for concepts in the topology that subordinated a
higher than average number of nouns in the taxonomy, e.g. the concept ‘entity’.
Hence, for the most part these nouns or concepts tend to lie in the top-level of the
taxonomy and were considered too general an indication of semantic similarity
between nouns. A number of manually identified nouns were also added to this list.
The remainder of this chapter will focus on performance and parameter
estimation issues arising from the generation of the chains as well as a concrete
example of how chains capture cohesion in a text.
3.3
Parameter Estimation based on Disambiguation
Accuracy
In Section 2.5.1, the importance of choosing correct chaining parameters in order to
reduce the number of spurious or incorrect chains being generated was mentioned.
No formal method of parameter estimation has yet been developed for lexical
chaining analysis. Usually parameter estimation is the process of maximising the
performance of a system on an initial training collection, and then applying these
parameters to a test collection from which the system’s performance is determined.
The hypothesis is that parameters that worked well on the training set should work
well on the test collection (assuming both are drawn from the same data sample). The key to
finding optimal parameters is finding optimal performance. However, lexical chain
performance or quality cannot be evaluated directly. Instead lexical chains can only
be evaluated with respect to a task-oriented evaluation strategy. However, if the
lexical chains perform poorly on a specific task, for example, as an indexing
strategy for an IR engine, then this may be due to the unsuitability of the application
rather than a reflection on the quality of the chains. Hence, a task-oriented
evaluation of lexical chains must be based on the performance of a fundamental
operation involved in the lexical chaining algorithm. As already stated, a side-effect
of lexical chain creation is noun disambiguation. Consequently, by measuring the
disambiguation accuracy of our lexical chainer we can indirectly establish the
chaining performance of our algorithm, since a disambiguation error implies that a
word has been incorrectly added to a chain.
We use the Semantic Concordance corpus (SemCor version 1.6) to evaluate
lexical chain disambiguation accuracy. SemCor is a collection of documents on a
variety of topics taken from the Brown Corpus that have been manually annotated
with synset numbers from WordNet (Miller et al., 1993). Using SemCor and the IR
metrics recall, precision and F1 we can indirectly measure the quality of a set of
lexical chains generated from the original text of the SemCor corpus. In this context
these IR metrics are defined as follows:
Recall is the number of correctly disambiguated nouns returned by the
disambiguator divided by the total number of nouns in our SemCor test set.
Precision is the number of correctly disambiguated nouns returned by the
disambiguator divided by the total number of SemCor nouns disambiguated by
the system. In general, there will be instances where the disambiguator will not
be able to decide on the correct sense of a noun, thus producing different
denominators in the recall and precision formulae.
F1 is the harmonic mean of the recall and precision values for a system that
represents a single overall measure of system performance, or in this case
disambiguation effectiveness.
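For concreteness, the three scores can be computed from raw disambiguation counts as in the minimal sketch below; the counts passed in the example are invented.

    def disambiguation_scores(correct, incorrect, not_disambiguated):
        """Recall, precision and F1 for a chainer's noun disambiguation output."""
        total_nouns = correct + incorrect + not_disambiguated
        attempted = correct + incorrect

        recall = correct / total_nouns
        precision = correct / attempted if attempted else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return recall, precision, f1


    print(disambiguation_scores(correct=570, incorrect=390, not_disambiguated=40))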
Initially the main purpose of this disambiguation-based evaluation of lexical
chaining quality was to establish whether the original chaining parameters
suggested by St-Onge (1995) were optimal. In his thesis St-Onge suggested, based
on a manual observation of his results, allowable word-chain distances for strong
and medium-strength relationships and a set of rules for following paths in
WordNet, so as to minimise the occurrence of spurious chains. At the outset of our
research we verified these parameters by observing their effect on disambiguation
accuracy.
More recently we ran further experiments on a subset of documents taken from
the SemCor corpus. This second set of experiments was driven by Galley and
McKeown’s (2003) recent publication. In this paper, they investigate the
effectiveness of their non-greedy chaining algorithm with respect to two other non-greedy approaches by Barzilay (1997) and Silber and McCoy (2003) (for
descriptions of these algorithms see Section 2.5.5). Galley and McKeown ran these
experiments on a (74 document) subset of the SemCor corpus because Barzilay’s
algorithm, due to its high run-time and space requirements, could only run on
shorter length documents10.
10 Established during personal communications with the authors.
Galley and McKeown define a single disambiguation accuracy measure:
Accuracy is calculated as the number of correctly disambiguated nouns divided
by the total number of nouns in the SemCor corpus.
This definition is identical to the definition of system recall. This implies that
precision values are also needed in order to gain a true picture of system
effectiveness. However, in Galley and McKeown’s experiment, recall and precision
values are equal since each system is required to return a default sense for each
word that it fails to disambiguate. Consequently, the denominators in both metrics
are equivalent. This point should be clearer following a more formal definition of
these metrics:
Precision = c / (c + i)
Recall = c / (c + i + d)
where c is the number of nouns disambiguated correctly by the system, i is the
number of nouns disambiguated incorrectly by the system and d is the number of
nouns not disambiguated by the system. So in the case of the systems involved in
Galley and McKeown’s experiments, all systems disambiguated all nouns in the
collection, i.e. d = 0. Hence, recall is equivalent to precision. Table 3.2 shows the
results of this experiment published in (Galley, McKeown, 2003) as well as the
results of the same experiment using our lexical chaining algorithm.
Algorithm                  Accuracy
Barzilay and Elhadad       56.56%
Silber and McCoy           54.48%
Galley and McKeown         62.09%
LexNewsbasic               60.39%
Table 3.2: Disambiguation Accuracy results taken from Galley and McKeown (2003).
Using this evaluation strategy Galley and McKeown’s system outperforms the
two non-greedy algorithms and our semi-greedy approach. However, there is a flaw
in this evaluation methodology due to the default assignment of sense 1 to all non-disambiguated nouns returned by each system. In WordNet all senses of a particular
word are ordered with respect to their frequency of use in the English language. In
order to generate these frequency counts WordNet designers created the SemCor
corpus, and counted all the occurrences of specific senses in an attempt to estimate
their frequency of use in the English language. Consequently, if an algorithm were
to choose the first sense of all nouns in the SemCor corpus then it would achieve an
accuracy, recall, precision and F1 value of 76.32%. Hence, any system in Galley
and McKeown’s experiment that returns a relatively low number of disambiguated
terms will significantly outperform a system that returns a higher number of
disambiguated terms since all non-disambiguated nouns will have a 76.32% chance
of being correct. This gives low recall systems an advantage in this experimental set
up.
In order to address these problems we ran additional experiments, with the help
of the authors, using an evaluation methodology that required that all systems did
not use a default sense assignment when they failed to make a decision on the sense
of a word. The results of this experiment are shown in Table 3.3, where coverage is
defined as the percentage of nouns that were disambiguated (either correctly or
incorrectly) by the disambiguator. As expected, Galley and McKeown's accuracy measure favours systems with lower recall values. Using F1 values to evaluate chain
disambiguation accuracy we see that our algorithm marginally outperforms Galley
and McKeown’s technique.
Algorithm              Accuracy    Precision    Recall    F1        Coverage
LexNewsbasic           60.39%      59.45%       56.92%    58.20%    94.77%
Galley and McKeown     62.02%      59.64%       56.00%    57.65%    93.89%
Table 3.3: Results of Galley and McKeown’s evaluation strategy using the ‘accuracy’ metric
and default sense assignments, compared with the recall, precision and F1 values when non-disambiguated nouns are not assigned default senses.
Algorithm                                                        Accuracy    Precision    Recall    F1        Coverage
LexNewsbasic A                                                   60.39%      59.45%       56.92%    58.20%    94.77%
LexNewsbasic B (Synonymy Relations only)                         71.20%      71.61%       30.12%    42.41%    42.06%
LexNewsbasic C (Strong Relations only)                           66.70%      64.13%       46.41%    53.85%    72.37%
LexNewsbasic D (All relations and St-Onge Path Restrictions)     60.54%      59.55%       56.44%    57.95%    94.77%
Table 3.4: Comparing the effect of different parameters on the disambiguation performance of
the LexNews algorithm.
The results in Table 3.4 also verify the bias toward low recall systems inherent in
Galley and McKeown’s accuracy measure. However, the purpose of the results in
Table 3.4 is to corroborate a number of parameter settings and design issues in the
LexNewsbasic algorithm, i.e. St-Onge and Hirst’s original chaining procedure
without the use of fuzzy proper noun matching and statistical word associations.
1. Strong and medium relations: LexNewsbasic A as described in Section 3.1
searches for extra-strong, strong and medium-strength relationships during
chain generation. In order to verify the use of these different relationships we
ran LexNewsbasic B and C, where B only looks at synonym relationships and C
looks at all strong relationships. In both cases we observe a decrease in
coverage where precision dramatically increases at the expense of recall. A
comparison of the F1 measures for LexNewsbasic A, and the B and C versions
verifies the use of medium-strength as well as strong relationships in the
chaining process.
2. St-Onge path restrictions: In Section 3.1 we described a number of path
restrictions that St-Onge suggested should be imposed on WordNet paths in the
taxonomy that exceeded length 1 in order to avoid finding spurious links
between nouns. In Table 3.4 from the results of LexNewsbasic D, we see that
restricting paths using St-Onge’s rules marginally improves precision
performance. However, a comparison of F1 measures suggests that allowing all paths (LexNewsbasic A) slightly improves coverage with a slight decrement in precision. So it seems that St-Onge's rules have little or no effect on chaining accuracy. The most likely reason for this outcome is that these rules were written to take care of spurious relationships with path lengths greater than the 4-edge limit that we use in our implementation. We hypothesise that these
rules are more critical for chaining algorithms that use a broader search scope
than this when searching for related nouns in WordNet.
3. Allowable distances between related terms: St-Onge and Hirst originally
suggested an allowable distance of 7 sentences (roughly a 130 word distance)
for strong chain word relationships and a distance of 3 sentences (roughly a 60
word distance) for medium-strength relationships. Our experiments confirm
these allowable distance values; in fact, even small variations in these distances resulted in a slight deterioration. For example, a distance of 50 words for medium-strength associations and 120 words for strong relationships resulted in an F1 of 58.11, while distances of 60 and 140 words respectively produced an
F1 value of 58.08. Although a wide range of other parameter values were also
experimented with, no improvement could be gained over the original 60 – 130
estimate proposed by St-Onge. This result is quite surprising since, as already
stated, St-Onge and Hirst chose these parameter values based on a manual
observation of their results.
Figure 3.6 examines the results of the LexNewsbasic A system in more detail. This
graph plots four lines representing:
The percentage of polysemous nouns in SemCor with n senses that were not
disambiguated by the LexNewsbasic system, i.e. % Not Disambiguated.
The percentage of polysemous nouns in SemCor with n senses that were
incorrectly disambiguated by the LexNewsbasic system, i.e. % Incorrectly
Disambiguated.
The percentage of polysemous nouns in SemCor with n senses that either failed
to be disambiguated by the LexNewsbasic system, or were incorrectly
disambiguated. More specifically, the sum of the previous two errors, i.e. % of
Noun Errors.
The percentage of polysemous nouns with n senses in the SemCor corpus, i.e. %
of Nouns in Collection.
[Figure: line graph plotting, against the number of senses per noun (2-10), the % Not Disambiguated, % Incorrectly Disambiguated, % of Noun Errors and % of Nouns in Collection, with the y-axis (% of nouns) ranging from 0 to 18.]
Figure 3.6: Graph showing relationship between disambiguation error and number of senses or different contexts that a noun may be used in.
So for example, taking all nouns with two senses in the SemCor corpus we find
that these nouns represent: 16.08% of all nouns in the collection, 9% of all errors
made by LexNewsbasic, where 6.67% of these errors were due to incorrectly
disambiguated nouns, and 2.33% to failing to disambiguate these nouns. Due to the
system's large coverage (94.77% of nouns are disambiguated), errors attributed to
the failure of the system to disambiguate a noun are low, and errors caused by
incorrect disambiguation by the system are high. We can see that the system has
most trouble with words that have 5 or more senses and that lower sense words are
more easily handled by the system. An interesting additional experiment would be
to compare these disambiguation-error trends with a non-greedy approach to
chaining, in order to determine if the same categories of polysemous nouns are
responsible for the majority of disambiguation errors produced by the algorithm.
As a result of the findings described in this section of the thesis all further
experiments involving our lexical chain algorithm, LexNews, will use St-Onge’s
suggested parameters. In spite of the slight degradation in performance when St-Onge path restrictions were incorporated into the search for medium-strength
relationships, we still included these rules in all further applications of the chains so
as to ensure that the resultant chains were as accurate as possible. This decision was
motivated by the observation that some spurious chains and incorrect chain
additions could be avoided when these rules were applied.
3.4
Statistics on LexNews Chains
In this section, we examine a number of characteristics of lexical chains, including
the types of relationship that dominate chain creation, and the size and span of
chains in a document. As explained throughout this chapter our chaining algorithm
uses three categories of word relationship strengths when clustering related noun
word senses in a text. Figure 3.7 shows the percentage of each relationship category
that participates in the chaining process, where extra strong relationships
(repetition) account for 55.83% of relationships, strong relationships (synonymy
and other WordNet associations) account for 13.03%, and medium-strength
relationships (path lengths greater than 1 in WordNet) account for 31.14%.
Figure 3.8 presents a breakdown of the results in Figure 3.7, where we see that
hyponymy and hypernymy (specialisation/generalisation) are the most dominant
strong relationships, followed by synonymy, and last of all by meronymy and
holonymy (part/whole). The sparsest relationship type in the WordNet taxonomy is
antonymy, accounting for a mere 0.01% of word relationships.
[Figure: bar chart of the percentage of chain word relationships by St-Onge relationship type: Extra Strong, Strong and Medium (y-axis 0-60%).]
Figure 3.7: Graph showing the dominance of extra strong and medium-strength relationships during lexical chain generation.
[Figure: bar chart of the percentage of chain word relationships broken down by relationship type: REP, SYN, HYPO, HYPR, MER, HOL and PATH > 1 (y-axis 0-60%).]
Figure 3.8: Graph showing a breakdown of all relationship occurrences in the chaining
process. REP = repetition or extra strong relationships, SYN = synonymy, HYPO =
hyponymy, HYPR = Hypernymy, MER = meronymy, HOL = holonymy, and PATH > 1 = all
medium-strength relationships.
Table 3.5 presents the average number of nouns per chain, chains per document
and the length/span of a chain in a document in the SemCor collection, where the
average document length is 2022.12 words (standard deviation 11.31). On average
442.4 of these words are nouns (standard deviation 50.2). However, in all cases for
each of these statistics the standard deviation is high due to the fact that a document
will contain a number of long chains followed by a series of short, often
unimportant chains. This implies that a strong chain spans a large section of a topic,
thus capturing a central theme in the discourse. However, the strength of the
relationships between the words in the chain is also an important factor. In the next
section we look in more detail at this idea of chain strength or importance and how
it can be measured. The next section also examines the type of chains that are
generated by the LexNews algorithm (this time also incorporating non-WordNet
proper nouns and statistical word associations), and how these chains reflect themes
in a document.
Chain Statistic                              Average    Standard Deviation
Nouns per Chain (including repetitions)      11.54      20.17
Chains per Document                          29.80      8.49
Word Span of Chains in Text                  187.01     162.08
Sentence Span of Chains in Text              41.16      36.76
Table 3.5: Chain statistics for chains generated on the subset of the SemCor collection referred to in Section 3.4.
3.5
News Topic Identification and Lexical Chaining
Figure 3.9 shows a broadcast news document discussing the premiere of the film
‘Veronica Guerin’. Appendix B shows the part-of-speech output for this text and
the candidate terms selected by the Tokeniser discussed in Section 3.2.2. As
mentioned in Section 3.2.3 our chaining algorithm, LexNews, produces two distinct
sets of lexical chains for a news text: WordNet noun phrase chains and non-WordNet proper noun phrase chains, which are shown in Figures 3.10 and 3.11
respectively.
In both these figures, chains are ordered with respect to their final position in the
chain stack. Chain numbers represent the order in which the chains were created by
the LexNews algorithm, where all chains containing only one candidate term (with
a frequency of one) are filtered out at the end of the chaining process. Chain words
in each chain are displayed according to the order of their addition to the chain,
where the word tagged as ‘SEED’ is the first word to be added to the chain. Chain
spans are represented in terms of words and sentences, where a chain span
represents the portion of text covered by a lexical chain, i.e. in the case of the
sentence span, this represents the maximum and minimum sentence number of the
chain’s members. Also, each chain word has a frequency and weight assigned to it.
This weight represents the strength of the relationship between the term and the
most strongly related chain member that was added to the chain based on this
relationship. These relationship weights, shown in Table 3.6, were chosen after a
manual analysis of different weighting schemes by the author on news documents.
Relationship Type                                             Weight
Repetition                                                    1.0
Synonymy                                                      0.9
Antonymy, Hyponymy, Meronymy, Holonymy, and Hypernymy         0.7
Path lengths greater than 1 in WordNet                        0.4
Statistical Word Associations                                 0.4
Table 3.6: Weights assigned to lexical cohesive relationships between chain terms.
Consider, for example, the word ‘murder’ in Chain 2 in Figure 3.10,
[murder (investigation) Freq 2 WGT 0.7 STRONG]
The seed of this chain is ‘investigation’, and ‘murder’ was added to this chain based
on a statistical relationship with the noun ‘investigation’ (unfortunately, the compound
noun ‘murder_investigation’ is not listed in the WordNet noun database). However,
the word ‘murder’ is assigned a score of 0.7 (not 0.4) since it is responsible for the
addition of the word ‘killing’ to the chain, where a strong relationship (or hyponym
relationship to be exact) exists between ‘killing’ and ‘murder’ in the taxonomy.
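A small sketch of this weight-assignment rule, under the assumption that each chain member keeps the score of the strongest relationship it participates in; the weights follow Table 3.6 and the data structures are our own.

    # Weights from Table 3.6; 'strong' here covers antonymy, hyponymy,
    # meronymy, holonymy and hypernymy.
    RELATION_WEIGHTS = {'repetition': 1.0, 'synonymy': 0.9, 'strong': 0.7,
                        'medium': 0.4, 'statistical': 0.4}

    def add_to_chain(chain, new_word, related_word, relation):
        """Add `new_word` to `chain` (a word -> weight map), upgrading the weight
        of the word it links to if this relationship is stronger than the one
        that word already holds."""
        weight = RELATION_WEIGHTS[relation]
        chain[new_word] = max(weight, chain.get(new_word, 0.0))
        chain[related_word] = max(weight, chain.get(related_word, 0.0))


    chain = {'investigation': 0.0}                          # seed of the chain
    add_to_chain(chain, 'murder', 'investigation', 'statistical')
    add_to_chain(chain, 'killing', 'murder', 'strong')
    print(chain)   # {'investigation': 0.4, 'murder': 0.7, 'killing': 0.7}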
Chain characteristics such as chain span, chain density and chain word relationship
strengths are all useful indicators of the importance of a chain to a text, where high
scoring chains represent dominant themes in the discourse. In Section 8.2, we use
these chain scores as a means of assigning lexical cohesion-based weights to terms
in a broadcast news story in order to generate an extractive gist (a short summary no
greater than a sentence) that captures the essence of that story. A more detailed
description of this linguistically-motivated weighting strategy is left until Section
8.2, while in the remainder of this section we will concentrate on analysing the
words that make up the chains.
Looking first at the WordNet noun phrase chains in Figure 3.10, the four most
important chains in this list cover the following themes: ‘the film’ (chain 4), ‘the
location of the movie premiere’ (chain 3), and ‘the film topic: the murder of
Veronica Guerin’ (chain 2, chain 1). Chains 10, 14, 7, and 9 all represent lower
ranked chains, while chains 7 and 9 should be amalgamated with chain 4 as they are
related to ‘the film’ theme, and chain 10 should be merged with chain 3 of the
proper noun chain set because ‘Veronica Guerin’ is a specialisation of a ‘journalist’.
However, neither of our auxiliary knowledge sources can relate the members of
these two chains. Finally, consider chain 14:
CHAIN 14; No. Words 3; Word Span 151-243; Sent Span 11-17;
[energy (SEED) Freq 1 WGT 0.4 MEDIUM]
[nature (energy) Freq 1 WGT 0.4 MEDIUM]
[pressure (energy) Freq 1 WGT 0.4 MEDIUM],
This is a perfect example of a weak spurious chain. Although the noun ‘energy’ is
being used in the ‘vigour’ sense, the noun ‘pressure’ in the ‘personal stress’ sense,
and the noun ‘nature’ in the ‘type of something’ sense, the algorithm has incorrectly
linked these nouns through their alternative ‘scientific’ senses. Unfortunately, these
types of chain are common occurrences in lexical chaining output, due to the
ambiguous nature of these nouns in the text. However, the values of chain
characteristics, such as the relatively low span of this chain in the text and the weak
strength of the relationships between its chain members, are good indications that
this chain represents a minor theme in the general discourse of the news story.
Looking next at the chains in Figure 3.11 we see that our fuzzy matching
component has identified the relationship between the phrases ‘Veronica’ and
‘Veronica_Guerin’, and between ‘Cate_Blanchett’ and ‘Blanchett’. Capturing the
relationships between proper noun phrases such as these is an important part of our
chaining algorithm as it helps to reduce the generation of spurious chains by
removing ambiguity from the text. In particular, we have noticed that a large
number of first names and surnames are documented in WordNet as noun phrases
that bear no resemblance to their intended use in the news story. For example, the
name ‘Veronica’ is indexed as a type of plant in the WordNet taxonomy, and the
noun ‘Savoy’ in the phrase ‘Savoy_cinema’ is described as a type of cabbage.
Consequently, under the St-Onge chaining regime or even a non-greedy strategy,
‘savoy’ and ‘veronica’ would be deemed related and added to the same chain as
they are both types of plant. This is a prime example of the need for the tokenisation
component of the LexNews algorithm, which identifies compound proper noun and
noun phrases such as these, thus helping to reduce the ambiguity of the candidate
terms chosen for the lexical chaining step.
As Gardai launch an investigation into gangland murders in Dublin and
Limerick a film opened in Dublin tonight which recalls the killing of another
victim of organised crime in 1996.
The world premiere of the Veronica Guerin movie took place in Dublin's Savoy
Cinema, with Cate Blanchett in the title role.
The film charts the events leading up to the murder of the Irish journalist.
Crowds gathered outside the Savoy Cinema as some of Ireland's biggest names
gathered for the premiere of Veronica Guerin, the movie.
It recounts the journalists attempts to exposed Dublin drug gangs.
But for many the premiere was mixed with sadness.
“It's odd. It can't be celebratory because of the subject matter.”
Actress Cate Blanchett takes on the title role in the movie.
It was a part she says she felt honoured to play.
“I got this complete picture of this person full of life and energy. And so that's when it became clear the true nature of the tragedy of the loss of this extraordinary human being, and great journalist.”
Apart from Blanchett every other part is played by Irish actors.
Her murderer was later jailed for 28 years for drug trafficking.
The film-makers say it's a story of personal courage, but for the director, there was only one person's approval that mattered.
“A couple of months ago I brought the film to show to her mother. It was the most pressure I've ever felt.”
But he needn't have worried.
“I see it as a tribute to Veronica, a worldwide tribute.”
Figure 3.9: Sample broadcast news story on the Veronica Guerin movie.
WordNet Noun Phrase Chains
CHAIN 4; No. Words 14; Word Span 11-264; Sent Span 1-19;
[film (SEED) Freq 3 WGT 0.9 STRONG]
[movie (film) Freq 3 WGT 0.9 STRONG]
[premiere (film) Freq 3 WGT 0.4 MEDIUM]
[subject_matter (film) Freq 1 WGT 0.7 STRONG]
[actress (movie) Freq 1 WGT 0.7 STRONG]
[picture (film) Freq 1 WGT 0.9 STRONG]
[actor (actress) Freq 1 WGT 0.7 STRONG]
[film_maker (film) Freq 1 WGT 0.4 MEDIUM]
[approval (subject_matter) Freq 1 WGT 0.7 STRONG]
[story (subject_matter) Freq 1 WGT 0.4 MEDIUM]
[director (actor) Freq 1 WGT 0.4 STATISTICAL]
[tribute (approval) Freq 2 WGT 0.7 STRONG]
CHAIN 14; No. Words 3; Word Span 151-243; Sent Span 11-17;
[energy (SEED) Freq 1 WGT 0.4 MEDIUM]
[nature (energy) Freq 1 WGT 0.4 MEDIUM]
[pressure (energy) Freq 1 WGT 0.4 MEDIUM]
CHAIN 1; No. Words 6; Word Span 4-198; Sent Span 1-14;
[gangland (SEED) Freq 1 WGT 0.4 MEDIUM]
[world (gangland) Freq 1 WGT 0.4 MEDIUM]
[crowd (gangland) Freq 1 WGT 0.4 MEDIUM]
[gang (crowd) Freq 1 WGT 0.9 STRONG]
[drug (gang) Freq 2 WGT 0.4 STATISTICAL]
CHAIN 3; No. Words 7; Word Span 7-187; Sent Span 1-13;
[Dublin (SEED) Freq 4 WGT 0.7 STRONG]
[Ireland (Dublin) Freq 3 WGT 0.7 STRONG]
CHAIN 2; No. Words 9; Word Span 3-168; Sent Span 1-12;
[investigation (SEED) Freq 1 WGT 0.4 STATISTICAL]
[murder (investigation) Freq 2 WGT 0.7 STRONG]
[killing (murder) Freq 1 WGT 0.7 STRONG]
[victim (killing) Freq 1 WGT 0.4 STATISTICAL]
[crime (victim) Freq 1 WGT 0.4 STATISTICAL]
[life (murder) Freq 1 WGT 0.4 MEDIUM]
[loss (life) Freq 1 WGT 0.4 STATISTICAL]
[murderer (victim) Freq 1 WGT 0.4 MEDIUM]
CHAIN 10; No. Words 3; Word Span 63-177; Sent Span 3-12;
[journalist (SEED) Freq 3 WGT 0]
CHAIN 7; No. Words 2; Word Span 48-123; Sent Span 2-9;
[title_role (SEED) Freq 2 WGT 0]
CHAIN 9; No. Words 2; Word Span 54-112; Sent Span 3-8;
[event (SEED) Freq 1 WGT 0.4 MEDIUM]
[celebration (event) Freq 1 WGT 0.4 MEDIUM]
Figure 3.10: WordNet noun phrase chains for sample news story in Figure 3.9.
Non-WordNet Proper Noun Phrase Chains
CHAIN 3; No. Words 3; Word Span 33-260; Sent Span 2-19;
[Veronica_Guerin (SEED) Freq 2 WGT 0.8]
[Veronica (Veronica_Guerin) Freq 1 WGT 0.8]
CHAIN 5; No. Words 3; Word Span 44-180; Sent Span 2-13;
[Cate_Blanchett (SEED) Freq 2 WGT 0.8]
[Blanchett (Cate_Blanchett) Freq 1 WGT 0.8]
CHAIN 4; No. Words 2; Word Span 40-68; Sent Span 2-4;
[Dublin’s_Savoy_cinema (SEED) Freq 1 WGT 0.8]
[Savoy_cinema (Dublin’s_Savoy_cinema) Freq 1 WGT 0.8]
Figure 3.11: Non-WordNet proper noun phrase chains for sample news story in Figure 3.9.
3.6 Discussion
This chapter has primarily focused on our news-oriented lexical chaining algorithm,
LexNews. LexNews addresses a number of inadequacies in previous chaining
approaches that generate chains for documents in the news story domain. More
specifically, our algorithm incorporates domain knowledge and named entities such
as ‘people’ and ‘organisations’ into the chaining process in order to capture
important lexical cohesive relationships in news stories that have been ignored by
previous chaining techniques. LexNews consists of two principal components: a
tokeniser that selects candidate words for chaining, and a lexical chainer that creates
two distinct sets of lexical chains, i.e. non-WordNet proper noun chains and
WordNet noun chains.
This chaining algorithm is based on St-Onge and Hirst’s greedy WordNet-based
approach. In our review of chaining approaches in Chapter 2, we differentiated
between greedy and non-greedy chaining techniques where greedy methods have
been largely dismissed based on the assumption that delaying noun disambiguation
until all possible lexical chains have been generated greatly improves chaining
accuracy (Barzilay, 1997). However, our experiments in Section 3.3 show that noun
disambiguation accuracy for both greedy and non-greedy approaches remains stable
at an F1 value of around 58%. This is an interesting outcome as it shows that St-Onge and Hirst’s semi-greedy chaining approach, which disambiguates words in
their local context (i.e. the sentence in which they occur), works as well as an
algorithm that considers the entire document before assigning a sense to a noun.
This counter-intuitive result may be explained by Voorhees’ (1998) observation that
although WordNet sense definitions are satisfactory as reference units for
disambiguation, the lexicographical relationships defined between these senses are
not sufficient to achieve full and accurate disambiguation. Voorhees suggests that
syntagmatic information (such as corpus statistics) is needed, in addition to
lexicographical relationships, in order to achieve improvements in disambiguation
accuracy. This statement has been somewhat confirmed by the results of the
SENSEVAL-2 workshop (SENSEVAL-2, 2001), which found that, in general,
supervised disambiguation algorithms were the most accurate WordNet-based
disambiguators.
The SemCor experiment, described in Section 3.3, also provides us with a means
of parameter tuning our chaining algorithm. In particular, we looked at the effect of
various word relationship types and distance constraints between related words, on
noun disambiguation accuracy. A closer analysis of these chains showed that
repetition relationships between words were responsible for 55.83% of chain word
additions, followed by medium-strength relationships with 31.14% (i.e. path lengths
greater than 1 in WordNet), and strong relationships which only accounted for
13.03% of all relationships.
In the final part of this chapter, we examined a single news story in order to
illustrate how lexical chains can be used as an intermediary natural language
representation of news story topics. It is this representation of a document’s content,
in terms of its lexical cohesive structure, that is used in the TDT applications of our
lexical chaining technique described in the remainder of this thesis. More
specifically, Chapters 4 and 5 look at our attempts to improve New Event Detection
performance using lexical cohesion analysis; while Chapters 6 and 7 examine the
use of lexical chains as a means of segmenting a broadcast news stream into its
constituent news stories. The thesis ends with a discussion of some preliminary
results relating to on-going work on News Story Gisting using LexNews chains.
Chapter 4
TDT New Event Detection
This is the first of two chapters on New Event Detection (NED): a task that deals
with the automatic detection of breaking news stories as they arrive on an incoming
broadcast news and newswire data stream. The aim of this chapter is to provide
some background on this task and on the techniques that are commonly used to
accomplish it.
In Section 4.1, we discuss the most commonly used Information Retrieval
model, the vector space model. The popularity of this model is not only due to its
effectiveness as an IR model, but also to the simplicity of its implementation. This
retrieval model is the underlying technique used in the implementation of our
approach and most other approaches to the NED task. NED is one of five tasks
defined by the Topic Detection and Tracking (TDT) initiative. In Section 4.2, we
explore the notion of a topic as defined by the TDT community, which forms the
basis of all TDT task definitions. This discussion is followed by an overview of
previous NED approaches in Section 4.3.
In contrast to these approaches, our NED technique augments the basic vector
space model with some linguistic knowledge derived from a set of lexical chains
capturing the cohesive structure of each news story. A detailed description of our
approach can be found in Chapter 5, which also discusses the performance of our
technique with respect to the TDT1 and TDT2 evaluation methodologies.
4.1 Information Retrieval
This section contains a brief overview of some Information Retrieval terminology
that is essential for an understanding of our New Event Detection system, its
evaluation described in Chapter 5, and the various approaches to New Event
Detection examined in the remainder of this chapter. For a comprehensive
introduction to general topics in Information Retrieval, we refer the reader to the
following core textbooks on the subject (Van Rijsbergen 1979; Salton and McGill,
1983; Baeza-Yates, Ribeiro-Neto 1999).
Information Retrieval (IR) research deals with the representation, storage,
accessibility and organisation of information items (Baeza-Yates, Ribeiro-Neto
1999). Hence, an IR system is responsible for processing and responding to a user
query by presenting the user with a ranked list of the most relevant documents that
relate to that query, in a large collection of documents (e.g. the World Wide Web).
Most contemporary IR systems represent documents as a set of index terms or
keywords, where the relevance of a document to a query is calculated using a
matching function that determines how frequently the query terms occur in the set
of index terms representing the document. Hence, if a set of query terms occur
frequently in a document a high relevance score will be assigned to the document.
There are a variety of techniques or IR models that are based on this intuitive
representation of document content, some of which include:
The Boolean Model: This model specifies a document only in terms of the
words it contains and disregards how frequently they occur. Hence, the weight
of a query term in a document is 1 if it is present, and 0 if it is absent.
Consequently, a document is either relevant or non-relevant and no ranking of
documents is possible. A Boolean query is expressed in terms of the Boolean
operators: and, or and not. Although the Boolean model is attractive due to its
neat formalism, it is not commonly used by IR systems due to its ineffectiveness
at ranking documents and the difficulty that users have with formulating
complex queries.
The Vector Space Model: This model is one of the most popular approaches
used by researchers in the IR community. Unlike the Boolean model, the vector
space model employs a term weighting scheme which facilitates document
ranking.
In this model, documents and queries are represented as vectors in n-dimensional space, where the intuition is that documents and queries that are
similar will lie closer together in the vector space than dissimilar documents.
This model is discussed in more detail in the next section.
The Probabilistic Model: This model, like the vector space model, is capable of
ranking documents with respect to their relevance to a query. More specifically,
the probabilistic model ranks documents by their probability of relevance given
the query. Also, index term weights are all binary variables as in the Boolean
model. The similarity of the document to the query is defined using the
following odds ratio:
$$\mathrm{sim}(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} \qquad (4.1)$$
where P(R | dj) is the probability that document dj belongs to the set R of documents
relevant to the query, and P(R̄ | dj) is the probability that the document is a member
of the non-relevant set R̄. Assuming independence of the terms in a document (a
strong assumption, since the occurrence of a word is in some way related to the
occurrence of other words in the text), and using Bayes’ rule, documents can be
ranked in terms of the following equation:
$$\mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) \qquad (4.2)$$
where P(ki | R) is the probability that the index term ki is present in a document
randomly selected from the set of relevant documents R, P(ki | R̄) is the probability
that the index term ki is present in a document randomly selected from the
non-relevant document set R̄, and wi,q and wi,j are the weights of the term ki in the
query and the document respectively. The problem then becomes how to estimate
P(ki | R) and P(ki | R̄). To begin with, assumptions are made regarding the values of
these probabilities, i.e. P(ki | R) is constant for all index terms ki (usually 0.5), and
P(ki | R̄) is approximated using the distribution of index terms among all the
documents in the collection. After the initial ranking, subsequent rankings are made,
and the values of these probabilities are refined as more and more information is
known about the distribution of terms in the relevant and non-relevant portions of
the collection.
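As an illustration of this initial ranking step, the short sketch below applies the binary independence assumptions just described, i.e. P(ki | R) held constant at 0.5 and P(ki | R̄) approximated from the collection-wide document frequencies. The function name, the 0.5 smoothing added to the document frequency and the toy figures are assumptions made for this example only.

import math

def bim_initial_score(query_terms, doc_terms, doc_freq, N):
    # Initial-pass probabilistic ranking: P(k|R) = 0.5 (assumed constant),
    # P(k|non-R) approximated by the term's document frequency over the collection.
    score = 0.0
    for t in query_terms:
        if t not in doc_terms:
            continue
        p_rel = 0.5
        p_nonrel = (doc_freq.get(t, 0) + 0.5) / (N + 1)   # lightly smoothed n_i / N
        score += math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
    return score

doc_freq = {"earthquake": 120, "kobe": 4}                 # toy collection statistics
print(bim_initial_score({"kobe", "earthquake"}, {"kobe", "earthquake", "japan"}, doc_freq, N=10000))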
The Language Model: A language model can be defined as a probabilistic
model for generating natural language text. A language modelling approach to
query-based retrieval assigns a score to the query for each document
representing the probability that the query was generated from that document, in
contrast to the probabilistic model which estimates the probability of relevance
of the document to the query (Ponte, Croft, 1998). Ponte and Croft acknowledge
that these two measures are correlated but distinct. Equation 4.3 represents a
more formal definition of a language model, in this case a unigram11 language
model:
$$P(q_1, q_2, \ldots, q_n \mid d) = \prod_{i=1}^{n} P(q_i \mid d) \qquad (4.3)$$
The most natural method of estimating P(qi | d), the probability of observing
query word qi in document d, is to use the maximum likelihood estimate of
observing qi in the document:
$$P(q_i \mid d) = \frac{freq(q_i, d)}{length(d)} \qquad (4.4)$$
While this estimate may be unbiased it suffers from one fatal problem: if the
document contains no instances of one particular query word then P(qi | d) = 0,
which implies that P(q1, q2, …, qn | d) is also zero. However, we cannot assume
that, because qi fails to appear in this document, it could never occur in another
document on the same topic. Hence, we come to a core area of research in the
language modelling community called discounting methods, which address this
zero frequency or data sparseness problem. All these discounting methods work
by decreasing the probability of previously seen events, so that there is a little
bit of probability mass left over for previously unseen events, in this case query
terms, while still preserving the requirement that the total sum of the probability
masses is 1. This process of discounting is often referred to as smoothing, since
a probability distribution with no zeros is smoother than one with zeros.
11
An n-gram model attempts to model sequences of words in a text, i.e. which words tend to follow
other words in a text. If n is 1, as in the case of the unigram model, then probabilities are calculated
based on single words alone, with no information about the preceding words. In contrast, the bigram
model uses the previous word in the text to predict the next word, and the tri-gram model uses the
previous two words, etc.
Research has shown that retrieval effectiveness is sensitive to smoothing
parameters, and that unleashing the true potential of language modelling
depends greatly on the understanding and selection of these parameters, thus
providing the motivation for more research in this area (Chen and Goodman,
1996).
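As a small illustration of the discounting idea, the sketch below scores a query against a document using a unigram model with Jelinek-Mercer (linear interpolation) smoothing, one commonly used smoothing method. The choice of method, the lambda value and the toy counts are assumptions made for this example, not a description of any particular TDT system.

import math

def query_log_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # log P(q|d) under a unigram model with Jelinek-Mercer smoothing:
    # P(qi|d) = lam * freq(qi,d)/|d| + (1 - lam) * freq(qi,C)/|C|, which is never zero.
    score = 0.0
    for qi in query:
        p_doc = doc_tf.get(qi, 0) / doc_len
        p_coll = coll_tf.get(qi, 1) / coll_len            # background (collection) model
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

doc_tf = {"guerin": 3, "film": 2, "premiere": 1}          # toy counts
coll_tf = {"guerin": 40, "film": 5000, "premiere": 900, "murder": 1200}
print(query_log_likelihood(["guerin", "murder"], doc_tf, 120, coll_tf, 2000000))

Note that the query word ‘murder’ does not occur in the document, yet the query still receives a non-zero probability because of the background collection estimate.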
We will now look at the vector space model in more detail and some important
preprocessing steps that are commonly used by IR systems.
4.1.1 Vector Space Model
The vector space model (VSM), as already stated, ranks a document with respect to
its similarity to a given query. According to the VSM this similarity can be
estimated by calculating the cosine of the angle between the document vector and a
query vector. More formally then:
$$\mathrm{sim}(d_j, q) = \frac{d_j \bullet q}{|d_j| \times |q|} \qquad (4.5)$$
$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} (w_{i,j})^2} \times \sqrt{\sum_{i=1}^{t} (w_{i,q})^2}} \qquad (4.6)$$
where | dj | and | q | are the norms of the document and query vectors. Both the
document and query vectors are weighted, that is, an index term in the document
(wi,j) and in the query (wi,q), represented by the same position i in their respective
vectors, will be assigned a weight. The value of this weight will depend on the
weighting scheme used, of which there are many variations in IR research. The
simplest possible weight for a term is the frequency of the term in the document, or
the normalised frequency when it is divided by the length of the document, tf.
However, an even better measure is one that also considers the inverse document
frequency or idf value, which represents the frequency of occurrence of a term in
the entire collection. Incorporating an idf count into a term weighting scheme is
useful because although a term may appear to be important in a document due to its
high frequency, this term may not be very useful at distinguishing this document
from others in the collection, since it also occurs frequently in other documents in
the corpus. Therefore, a good index term can be defined as one that has a low idf
count, but a high tf count. The most common weighting scheme combining both
these measures is defined as follows (Baeza-Yates, Ribeiro-Neto, 1999):
$$tf.idf = w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \qquad (4.7)$$
where fi,j is the normalised tf value (i.e. the frequency of term ki in document dj
divided by the maximum word frequency in dj), and the log expression is the idf
value with N the total number of documents in the collection and ni the number of
documents in which the term appears. In the VSM two important preprocessing
steps are generally undertaken before term weighting occurs:
Stopword Removal is the process of identifying and eliminating frequently
occurring words (often closed-class words) that add very little information to a
document representation. Before processing begins a list is usually compiled
that contains such words as auxiliary and modal verbs (‘be’, ‘have’, ‘should’),
determiners (‘the’), conjunctions (‘because’), and vague concepts (‘nobody’,
‘anyone’). Since the tf.idf scheme would have assigned these types of words low
scores anyway, one may ask whether there is any need to remove them at all.
The main advantage of stopword removal
is that it improves the speed of execution of subsequent processing steps (e.g.
document-query similarity calculations) because the number of vocabulary
items in the collection is reduced, and hence all vector lengths are shortened.
Stemming is the process of reducing terms to their root form. A stemmer uses
morphological and derivational transformation rules to accomplish this. So for
example, noun plurals such as ‘chocolates’ are transformed into ‘chocolate’, and
derivational endings such as ‘ing’, ‘es’ ‘s’ and ‘ed’ are removed from verbs.
However, some words do not conform to these transformation rules, so many
stemmers use a table of exceptions to identify and correctly reduce these words
to their correct root, e.g. ‘children’ to ‘child’, and ‘ate’ to ‘eat’ and not ‘at’. The
major benefit of stemming is that it increases the accuracy of the term weighting
process, thus improving the recall of the system, i.e. more documents will be
retrieved in response to a query. However, since words are reduced to a root
form the original semantics of the word may be lost. For example, consider the
case of the words ‘(chemical) plant’ and ‘(tobacco) plantation’ which will both
be represented as ‘plant’ in a term index after stemming, even though one
occurrence refers to a ‘factory’ and the other an ‘area of foliage’. In spite of this,
however, researchers have found that a marginal improvement in retrieval
performance is possible using a stemming algorithm on index terms (Frakes,
Baeza-Yates, 1992).
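Pulling the pieces of this section together, the following sketch applies a simplified stopword filter, weights terms with the normalised tf.idf scheme of Equation 4.7, and compares a document with a query using the cosine measure of Equation 4.6. The stopword list is a tiny placeholder, stemming is omitted for brevity, and the document frequencies are invented for illustration.

import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to"}         # tiny illustrative list

def tokens(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf_idf_vector(doc_tokens, df, N):
    # Normalised tf (frequency / maximum frequency in the document) times idf = log(N / n_i).
    counts = Counter(doc_tokens)
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log(N / df.get(t, 1)) for t, f in counts.items()}

def cosine(v1, v2):
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

df, N = {"premiere": 30, "film": 800, "guerin": 5}, 10000 # toy document frequencies
d = tf_idf_vector(tokens("the premiere of the Veronica Guerin film opened in Dublin"), df, N)
q = tf_idf_vector(tokens("Guerin film premiere"), df, N)
print(round(cosine(d, q), 3))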
4.1.2 IR Evaluation
In IR research, two aspects of system performance are explored when determining
the effectiveness of a given retrieval system. Performance evaluation looks at the
real-world practicality of the system by examining the trade-off between the time
and space complexity of the system. Such issues were, at one time, of great concern
to the IR community. However, with the advent of high-speed processors and the
relative inexpensiveness of memory, this facet of system performance is not of as
much interest in current evaluation methodologies.
System effectiveness (or retrieval performance) on the other hand, is what drives
most IR research. This type of evaluation looks at how well a system can match a
set of ground-truth retrieval results (i.e. what a human expert believes the system
should return in response to a query given a particular document collection).
Creating such a test collection and employing experts to generate these relevancy
judgements is a very time-consuming and expensive endeavour. However, it is a
small price to pay for an evaluation procedure that can stand up to scientific
scrutiny, allow different systems to be compared on an even platform, and
consequently ensure that real and transparent progress is being made in a particular
area of IR research. The most influential large-scale12 evaluation of IR strategies
that follows this line of thinking is the TREC (Text REtrieval Conference)
initiative, which began in the early 1990s and is now in its 11th year. This annual
conference holds separate forums on different English text retrieval tasks and on
other diverse areas such as multi-lingual and digital video retrieval. However, in
this thesis we focus on another well-known large-scale evaluation initiative called
the Topic Detection and Tracking (TDT) initiative. A more detailed discussion of
the TDT evaluation methodology is given in Section 4.2. For the remainder of this
sub-section we will focus on a more general description of IR evaluation.
Given a large test collection (~10 Gigabytes), and a ground-truth or gold-standard set of judgements for a given task, the first step in an IR evaluation is to
measure the degree of overlap between the list of relevant documents the system
12
For a discussion of other important IR test collections, e.g. ISI and CACM refer to (Baeza-Yates,
Ribeiro-Neto 1999).
returns in response to a query and the gold-standard output for that query produced
by a set of human judges. There are two standard metrics for measuring this degree
of overlap: recall and precision. Section 3.3 defines these metrics with respect to
disambiguation accuracy; however, these metrics are more commonly defined in the
context of an IR evaluation. Figure 4.1 taken from (Baeza-Yates, Ribeiro-Neto
1999) represents a set theory approach to defining these metrics.
[Figure: a set diagram showing, within the document collection, the set of relevant documents A, the set of retrieved documents R, and their intersection S, i.e. the relevant documents that were actually retrieved.]
Figure 4.1: IR metrics precision and recall.
Recall is the number of relevant documents returned by the system divided by
the total number of relevant documents in the corpus.
$$\mathrm{Recall} = \frac{|S|}{|A|} \qquad (4.8)$$
Precision is the number of relevant documents returned by the system divided
by the total number of documents retrieved by the system.
$$\mathrm{Precision} = \frac{|S|}{|R|} \qquad (4.9)$$
F measure is an attempt to combine recall and precision into a single score. It is
calculated by finding the harmonic mean of the two numbers, p precision and r
recall.
$$F_1 = \frac{2pr}{p + r} \qquad (4.10)$$
Typically in IR experiments one finds that a trade-off exists between precision and
recall, i.e. as the precision of the system improves the recall deteriorates and vice
versa. Hence, reports of systems achieving 100% precision and recall are very rare.
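For completeness, Equations 4.8 to 4.10 amount to only a few lines of code once the retrieved and relevant document sets are known; the set names below follow the definitions given above (S the relevant documents retrieved, A all relevant documents, R all retrieved documents).

def precision_recall_f1(retrieved, relevant):
    # S = retrieved ∩ relevant, A = relevant set, R = retrieved set
    s = len(retrieved & relevant)
    recall = s / len(relevant) if relevant else 0.0
    precision = s / len(retrieved) if retrieved else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(retrieved={1, 2, 5, 9}, relevant={1, 5, 7, 8, 9}))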
Although recall, precision and F1 are the most commonly used metrics for
evaluating system performance, the TDT evaluation and the work described in this
thesis measures performance using two alternative system error metrics: misses and
false alarms. The decision by the TDT community to choose a signal detection
model of evaluation is based on the fact that most information filtering (see Section
4.1.3) tasks can be viewed as a detection process. For example, consider a security
alarm (that detects intruders) or a retina-scanning device (that detects unauthorised
users), where the goal of the system is to minimise both the number of false alarms
and the number of misses. However, some errors are more critical than others
depending on the fault-tolerance bias of the system. For example, consider a smoke
alarm that is highly sensitive and produces many false alarms; the annoyance
caused by the smoke alarm falsely detecting a fire is less critical than the loss of
human life if the system were to miss the occurrence of a real fire. This tolerance of
false alarms over misses can be integrated into a cost function that combines misses
and false alarms (like the F1 measure) and penalises the system more heavily when
it commits a critical error (in this case a miss). Like recall and precision, a trade-off
also exists between misses and false alarms, where a reduction in system misses
will often lead to an increase in false alarms. A fuller explanation of these error
metrics in terms of the TDT evaluation methodology can be found in Section 5.4.2.
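The sketch below shows the general shape of such a cost function: a weighted combination of the miss and false-alarm rates in which a miss can be penalised more heavily than a false alarm. The constants used here are purely illustrative; the actual TDT cost function and its parameters are described in Section 5.4.2.

def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    # Weighted miss/false-alarm combination: a higher c_miss penalises missing a
    # real event more heavily than raising a spurious alarm. Constants are
    # illustrative only, not the official TDT evaluation parameters.
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)

# a system that misses 20% of target events and false-alarms on 5% of non-targets
print(detection_cost(p_miss=0.20, p_fa=0.05))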
4.1.3 Information Filtering
Information Filtering tasks are often referred to as IR sub-tasks. This is mainly
because IR strategies like the models described in Section 4.1 have also been
successfully applied to information filtering problems. Although an IR system
compares the similarity of each document in the collection to a query, the same
similarity measure can be applied to an information filtering task, where the
query/document comparison process is replaced with a series of document/document
similarity comparisons. However, in spite of their resemblance
in this regard there are still some striking differences between ad hoc/query-based
retrieval and information filtering.
Firstly, as just stated, a filtering system does not deal with an explicit query like
an IR system does. A ‘filtering query’ has been described as a long-term query that
anticipates a future user’s need by organising a dynamic document collection into a
structure representative of its content. In this way filtering is often seen as a
classification task where a document is either relevant or not relevant. An example
of a filtering task would be a clustering task that partitions a collection of
documents into a set of clusters, where each cluster contains a series of documents
discussing a particular topic.
Secondly, although filtering tasks can handle static document collections, unlike
IR systems, they may also have to deal with dynamic collections of documents, e.g.
streams of newswire documents arriving every few minutes. This collection
characteristic means that a filtering system may have to make an online decision
regarding the relevance or non-relevance of a document. In contrast, an IR system
has the entire collection at its disposal and so has the luxury of a ‘retrospective’
decision making process. TDT tasks such as New Event Detection and Topic
Tracking are examples of such filtering tasks.
So far in our discussion of filtering, we have assumed that no interaction occurs
between a user and the filtering system; however, this is not strictly the case. Figure
4.2 shows Arampatzis’s (2001) model of document filtering, which clearly
illustrates that interaction between the user and the collection, and the user and the
filtering or selection process is possible.
[Figure: information sources (WWW, newswire) feed a document collection; a selection (filtering) step operates on that collection; selected documents are displayed to the user, whose feedback can flow back into the selection process.]
Figure 4.2: Typical Information Filtering System.
In the case of the collection, the user may have control over what sources of
information are to be filtered, e.g. only European newswire, while the filtering
process may be aided or controlled by personal information regarding the user’s
interests, level of expertise or age. However, any subsequent information displayed
to the user is not ranked, and it is basically up to the user to browse through this
subset of information to find what interests them. If, on the other hand, the filtered
information is ranked in order of relevance to the user’s preferences, then this special
case of filtering is called routing.
4.2 Topic Detection and Tracking
‘TDT is a body of research and an evaluation paradigm that addresses the event-based organisation of broadcast news’ (Allan, 2002b). More specifically, TDT
research is concerned with the detection of new events in a stream of news stories
taken from multiple sources, and the tracking of these known events. The initiative
hopes to provide an alternative to traditional query-based retrieval, by providing the
user with a set of documents stemming from a generic question like: “What has
happened in the news today/ this week/ in the past month?” In this way, TDT tasks
are seen as classification or filtering tasks as they deal with static queries which
present the user with an organised event structure and then allow the user to decide
what is of interest to them. Although TDT tasks focus on the organisation of large
volumes of news stories, the techniques and methods being developed can be used
in a variety of other scenarios that match this type of static information need. A few
examples of other possible applications include stock market analysis, email
alerting, junk mail filtering, and incident and accident analysis.
The TDT initiative began in 1997 and is still an active area of research after
having completed one pilot study and six ‘open and competitive evaluations’ on
four distinct test collections (Allan, 2002b). TDT was originally funded and
supported by DARPA (Defence Advanced Research Projects Agency), but is now
under the control of the TIDES (Translingual Information Detection, Extraction and
Summarization) program. Most years the evaluation attracts roughly 11
participants, which include its founding members University of Massachusetts,
Carnegie Mellon University, and Dragon Systems, and other important participants
such as IBM Watson and University of Maryland. In the TDT pilot study and TDT
1998 evaluation, only English language news sources were provided. However,
TDT is now a multilingual forum that also focuses on Chinese (TDT2 corpus),
Arabic, Spanish, Korean, and Farsi13 news stories (soon to be released in the TDT4
corpus). Another important feature of the TDT collections is that they contain not
only newswire (text), but also audio transcriptions of radio and television news
broadcasts. The contribution of the audio sources is an important motivation for the
TDT researchers, since previous large-scale filtering and classification work has
focussed on organising ‘clean’ newspaper sources. Therefore, TDT systems must be
robust enough to be able to deal with error-prone manually and automatically
transcribed and translated broadcasts. Another important feature of the TDT
paradigm, described in the next section, relates to its definition of news topics and
events, and how this has affected the creation of the TDT corpora by LDC
(Linguistic Data Consortium) annotators and the evaluation methodology set out by
the TDT community.
4.2.1 Distinguishing between TDT Events and TREC Topics
During the 1997-1998 pilot study evaluation, the TDT participants at the time
settled on the following definition of an event: ‘something that happens at some
point in time’ (Allan et al. 1998a). Later it was admitted that this definition was
somewhat vague; however, it was still considered important as the realisation of this
definition, and the discussion around its formulation, marks the first large scale
attempt in IR research to move away from a broad notion of ‘aboutness’ to a finer-grained definition of how news topics grow and expand on a day-to-day basis. As
Allan (2002b) explains, much of the TREC filtering and retrieval work done before
the TDT initiative centred around the classification and retrieval of documents that
discuss broad subject areas such as stories on ‘earthquakes’. TDT topics, on the
other hand, differentiate between different instances of ‘earthquakes’ in that general
topic. For example:
o 17th of January 1995, Kobe earthquake
o 30th of May 1998, northern Afghanistan earthquake
o 17th of August 2000, northwest Turkey earthquake
o 25th of March 2002, northeast Afghanistan earthquake
13
Farsi is the most widely spoken Persian language, with over 30 million speakers, including
50% of Iranians and 25% of Afghans in their respective countries (Source:
http://www.farsinet.com/farsi/, July 2003).
The TDT definitions of a topic and an event that are still used in current evaluations
were agreed upon during the second TDT evaluation in 1998, and are defined as
follows (Papka, 1999):
A topic is a seminal event or activity along with all directly related events and
activities.
An event is something that happens in a specific time and place. (Specific
elections, accidents, crimes and natural disasters are examples of events.)
An activity is a connected set of actions that have a common focus or purpose.
(Specific campaigns, investigations, and disaster relief efforts are examples of
activities).
Two important areas of TDT research have emerged from these definitions. Firstly,
notice the emphasis on ‘time’. The temporal nature of news topics is a common
phenomenon in news streams and incorporating this into TDT technology presented
an exciting challenge for TDT researchers. Yang et al. (1998) outlined three
important observations on this characteristic of news which helped to focus the
development of their TDT systems:
News stories discussing the same event tend to be temporally proximate, i.e.
occur in news bursts.
A time gap between bursts of topically similar stories is often an indication of
different TDT topics.
New events are characterised by large shifts in vocabulary in the data stream,
especially where proper nouns are concerned.
Secondly, the event, topic and activity definitions contain a notion of event
evolution. To illustrate this, consider a seminal event such as a storm warning that
triggers the start of a topic on the devastation caused by a ferocious tropical
hurricane. As the story progresses over the course of the topic, different events or
activities will arise that are not clearly related to other events, but are associated by
the seminal event that triggered them. So some events that might originate from the
hurricane event include: the resulting rescue attempt; the rebuilding of damaged
houses and fundraising activities; the subsequent rise in home insurance in the area.
When an event evolves in this way, clearly, there will be a gradual shift in
vocabulary and focus as the story develops. Allan (2002b) points out that this is
another difference between subject-based topics and event-based topics, where the
relevance of a news story to a topic is dependent on time, while the relevance of a
news story to a general subject is time independent.
The general effect of an event-based definition of a topic on corpus creation is
that a large amount of annotation time is spent on ensuring that annotators
understand what seminal events are, and what constitutes a set of related events for
a particular topic. In the development of the TDT2 corpus a set of ‘rules of
interpretation’ were formulated to help annotators with this task. Once annotators
have assigned labels to documents relating to specific topics, a number of quality
assurance tests are performed in order to ensure that different annotators agree on
these labels. Cieri et al. (2002) at the LDC note that when they measured human
annotation consistency using the kappa statistic they found that ‘kappa scores on
TDT2 were routinely in the range of 0.59 to 0.89 … scores for TDT3 ranged from
0.72 to 0.86’, where 0.6 indicates marginal consistency and 0.7 measures good
consistency. These scores show that annotating a corpus using an event-based topic
definition is a challenging task. More recently, the TDT organisers have discovered
‘an unrecognised (but always present) problem with topic annotation’. The LDC
defined a topic by first selecting a random story from the corpus, then identifying
the seminal event that triggered the story, and finally building a topic for the
seminal event by finding all related documents using the aforementioned rules of
interpretation. The problem with this procedure is that ‘if the LDC were to
randomly sample a story from that topic and then re-apply the process, it might not
get the original topic back. The issue is in how the seminal event is chosen from
the sample story, and that depends on which story is selected.’14 Currently, the
organisers are planning to determine the impact of this finding on their evaluation
methodology by comparing cluster detection results (see 5.1.2 for definition of this
task) from different sites on the new TDT4 corpus.
4.2.2 The TDT Tasks
The goal of a TDT system is to monitor a stream of broadcast news stories, and to
find the relationships between these stories based on the real-world proceedings or
events that they describe. Five technical tasks have been outlined within the TDT
study (Allan, 2002b):
14
Source: http://ciir.cs.UMass.edu/research/tdt2003/guidelines.html last accessed 16th of July, 2003.
Segmentation is the task of breaking a broadcast news stream into its
constituent news stories. This task opens up a whole new area of discussion that
has not been explored so far in the thesis. The necessity of this task relates to the
added difficulty of working with broadcast radio and television transmissions.
Since unlike written sources of news, which contain title, paragraph and story
boundary information, a broadcast news transcript or closed caption material
will not contain any mark-up indicating where stories begin and end in the data
stream. This has prompted an entirely new avenue of research (discussed at
length in Chapters 6 and 7) that requires systems to automate the story
segmentation process. As a consequence of this, TDT systems must incorporate
more robust filtering technologies that can tackle noisy input due to
segmentation errors (e.g. a missed story) and additional errors contained in ASR
(Automatic Speech Recognition) system output such as a lack of capitalisation,
and errors due to pronunciation similarity between different word forms, e.g.
‘ice cream’ and ‘I scream’. Much of the LDC’s work in creating the TDT
corpora was taken up with adding boundary information to automatic ASR output
and closed-caption transcripts. Segments in TDT text must also be classified as
one of the following: a news story, a miscellaneous news item such as reporter
chit-chat or commercials, or untranscribed text containing incomplete stories
where there is not enough information present in the text to identify its topic
(Cieri et al., 2002). These human identified topic boundaries are then used to
evaluate the performance of TDT segmentation systems. The TDT community
has also investigated the impact of automatic segmentation errors on other TDT
tasks, where it has found that ‘segmentation has little-effect on tracking tasks,
but does dramatically effect the impact of various detection tasks’ (Allan,
2002b).
Detection is the task of identifying similar (on-topic) and dissimilar (off-topic)
news stories in the news stream. Detection can be further subdivided into new
event detection, cluster detection, and link detection tasks.
o (Online) New Event Detection (NED) is the task of recognising
seminal events as they arrive on the data stream. In TDT 1999 – 2002
this task was referred to as First Story Detection; however, in its
current incarnation the TDT community has reverted to calling it
by its original task name used in the 1997 pilot study. In all evaluations,
the task definition remains the same: to find the document that is first to
discuss a breaking news story for each event in the collection. This is an
online filtering task so the system can only make this decision (first
story or not a first story) for the current document by considering only
those documents that it has seen so far on the input stream.
o Cluster Detection has been referred to as either Event Detection or
Retrospective Event Detection in previous TDT evaluations. The task
definition for an event detection system is: to retrospectively divide the
data stream into clusters of related events by considering all the
documents in the TDT collection rather than just those that occur before
the current document in the input stream, as in the case of online new
event detection. This task has proved to be considerably more popular
than its new event detection counterpart due to the similarity of this
technology with previous research efforts such as clustering-based
TREC tasks.
o Story Link Detection is the task of classifying a pair of news stories as
on-topic (they belong to the same topic) or off-topic (they belong to
different topics). The TDT initiative has emphasised the importance of
this task as it is a ‘core technology for all other tasks’ (Allan, 2002b).
This claim is easily understood since all IR and filtering systems are
concerned with the determination of document similarity. It is hoped that
by refining this aspect of TDT research a break-through in other
tasks may be possible.
Tracking is the task of finding all subsequent stories in the news stream
pertaining to a certain known event represented by the first n sample stories on
that event. It is analogous to the TREC information filtering task. In essence, the
tracking problem involves classifying each successive story on the input stream
as either on-topic (it describes the target event) or off-topic (it is not related to
the target event). Quantifying an optimal value for n is a major part of tracking
research where it is of paramount importance to a real-time system that tracking
can begin as soon as possible (i.e. small n) after the seminal event has been
identified.
Figure 4.3 shows a possible TDT architecture that integrates the tasks defined
above. The system inputs are a broadcast news and newswire stream. An ASR
system converts a television or radio broadcast speech signal to a text transcript
which is then segmented into its constituent news stories.
[Figure: newswire and broadcast news stories enter a broadcast news preprocessing stage (an Automatic Speech Recognition system followed by a story segmentation system); the resulting news stream feeds a Retrospective Event Detection system (returning event clusters for a specific time span), a New Event Detection system (alerting the user to breaking news stories), and an Event Tracking system (returning news stories related to a chosen event cluster).]
Figure 4.3: TDT system architecture.
The data stream is then fed to each of the TDT components. Given a specific time
span (e.g. summer 2003), the retrospective detection component will provide the
user with a list of news topic clusters in that time frame. The user can then specify a
news event of interest that can then be tracked in the remainder of the data stream
using the event tracking component. The final component, the new event detector,
alerts the user to all breaking news stories as they arrive on the data stream. An ad
hoc retrieval component could also be included in this system architecture, where
given a user query (e.g. ‘Kofi Annan’) the system will return a ranked list of all
relevant event clusters from the set of event clusters detected by the retrospective
event detection step.
4.2.3 TDT Progress To Date
Since the beginning of the TDT initiative a significant amount of progress has been
made in the development of systems based on the tasks defined in the previous
section. Allan (2002b) states that tracking technology is now at an acceptable level
of accuracy for integration into a real-time system. However, all of the detection
tasks have experienced less exciting levels of progress. In particular, this is true of
the NED task. In a paper by Allan et al. (2000b), it was shown that NED, or First
Story Detection (FSD) as they refer to it, is a special instance of event tracking, and
that current TDT tracking approaches used to solve the FSD problem are ‘unlikely
to succeed’. This logic follows from the observation that during the detection of
first stories multiple tracking of documents is actually taking place. More
specifically, each first story identified is a potential event that must be tracked in
order to discover further ‘first’ stories that digress from all previously identified
events. Allan et al. (2000) show that TDT filtering results are comparable with
TREC results on a similar filtering task. Hence, they conclude that a huge effort is
needed to get current FSD effectiveness to the level of current tracking
effectiveness. In fact, they believe that this improvement in FSD performance will
require a 20-fold improvement in current tracking effectiveness.
It is generally agreed that the first phase of TDT research has largely been taken
up with investigating how well traditional IR filtering solutions would perform in a
TDT evaluation environment. Allan (2002b) concludes that parameter tweaking of
existing techniques has got TDT research this far, but that if further improvements
are to be made, future TDT investigations must focus more on modelling the
essential entities involved in the event definition (time, location and people) and
how these entities relate to each other with respect to event evolution. In the next
section we examine some of the principal approaches to NED used by the TDT
participants. In Chapter 5 we document our novel lexical chaining approach to this
task.
4.3 New Event Detection Approaches
Most approaches to TDT tracking and detection tasks have used elements of the
traditional IR models described in Section 4.1. This section describes the techniques
of four different TDT participants who have submitted results for the NED/FSD
task. Typically these techniques differ in their implementation of the following
three NED system components:
Feature Extraction and Weighting: This component reduces documents to a
set of index terms or features, and then weights these features with respect to
their discriminating power as an element of the resultant document classifier (or
document representation).
Similarity Function: A similarity function is used to determine the strength of
association between document representations or classifiers. This component,
coupled with a similarity threshold, is used to determine if two classifiers are on-topic (an old event has been detected) or off-topic (a first story has been
detected).
Detection Algorithm: All NED algorithms are based around some sort of
cluster algorithm. A document clustering algorithm groups documents into sets
or clusters that contain a high overlap of highly weighted features. Unlike text
categorisation, another type of text classification task, document clustering is an
unsupervised task with no a priori knowledge about the types of categories (in
our case events) present in the collection. Another added difficulty associated
specifically with NED clustering algorithms is that retrospective clustering is
prohibited. That means that documents must be processed sequentially, and that
the classification of a document as a new or old event must be based only on the
documents that have occurred before this point on the data stream. This implies
implementing the detection algorithm using a single-pass clustering algorithm
(van Rijsbergen, 1979), which uses the similarity function and a feature
weighting strategy in its decision to assign a document to a cluster. The
technical details of this process will be described in more detail with respect to
each of the four NED techniques covered in this section.
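A skeletal version of such a single-pass detection loop is sketched below. The similarity function and threshold are deliberately left abstract, since each system described in this section supplies its own weighting scheme and thresholding strategy; the single-link cluster comparison used here is just one of the possible options.

def single_pass_ned(stream, similarity, threshold):
    # Online single-pass clustering: each incoming story is compared only against
    # clusters built from previously seen stories (no retrospection is allowed).
    clusters = []         # each cluster is a list of document representations
    first_stories = []    # indices of stories flagged as new events
    for i, doc in enumerate(stream):
        best_sim, best_cluster = 0.0, None
        for cluster in clusters:
            sim = max(similarity(doc, member) for member in cluster)   # single-link
            if sim > best_sim:
                best_sim, best_cluster = sim, cluster
        if best_cluster is not None and best_sim >= threshold:
            best_cluster.append(doc)          # old event: join the closest cluster
        else:
            clusters.append([doc])            # new event: seed a new cluster
            first_stories.append(i)
    return first_stories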
Another important contribution to NED research, which will also be looked at,
is the set of conclusions drawn from the Workshop on Topic-based Novelty Detection
held at Johns Hopkins University in Summer 1999 (Allan et al., 1999). Although
TDT is now a multi-lingual evaluation environment, we will only focus on aspects
of TDT research that deal with English multi-source news stories as this is the main
focus of our research. For more information on multi-lingual TDT, we refer the
reader to various site contributions found in (Allan, 2002a).
4.3.1 UMass Approach
The initial UMass NED system that took part in the pilot study, and the TDT 1998
and 1999 evaluations is based on work by Papka (1999) and Allan et al. (1998a,
1998b, 1998c, 1998d, 1999, 2000a, 2000b). In their NED vector space model
implementation, they use a single-pass clustering algorithm, and avail of the
InQuery system framework (Callan et al., 1992) for representing documents, and
measuring similarity between an incoming document and a document in a cluster.
In the UMass system, an initial classifier for the current document on the input
stream is created using the n most frequently occurring stopped and stemmed terms.
The InQuery weighting function, a variation on the tf.idf measure, is then used to
assign ‘belief values’ to each of the terms in a document using the following
formulae taken from (Papka, 1999):
$$d_{j,k} = 0.4 + 0.6 \times tf_k \times idf_k \qquad (4.11)$$
$$tf_k = \frac{t_k}{t_k + 0.5 + 1.5 \times \frac{dl_j}{avg\_dl}} \qquad (4.12)$$
$$idf_k = \frac{\log\left(\frac{C + 0.5}{cf_k}\right)}{\log(C + 1)} \qquad (4.13)$$
where dj,k is a document feature in document dj, tk is its frequency in that document,
dlj is the document’s length and avg_dl is the average document length in the test
collection. As already stated, the TDT task requirement for NED states that
classification decisions are to be made online rather than retrospectively. This not
only affects the type of clustering algorithm the system can use, but also the
calculation of the idf part of any tf.idf measure. More specifically, the problem
arises from the fact that it is difficult to generate meaningful idf measures from the
small amount of previously seen documents that are permitted for this calculation
during the detection process. UMass use an auxiliary corpus of TREC documents
(all on general news topics) to estimate idf values for each term in the test
collection. Therefore, in the above idfk formula C is the number of documents in the
auxiliary collection, cfk is either the number of documents containing the feature tk
in this auxiliary collection, or the default value of 1 if the term is not present in the
auxiliary collection. A document classifier is represented as a weighted vector of its
features. The similarity between any two classifiers is calculated using the InQuery
scoring metric #WSUM:
$$\mathrm{sim}(q_i, d_j) = \frac{\sum_{k=1}^{N} q_{i,k} \cdot d_{j,k}}{\sum_{k=1}^{N} q_{i,k}} \qquad (4.14)$$
where qi,k is the relative weight of a feature in an existing document classifier, and
dj,k is the weight of a feature in the incoming document’s classifier, calculated using
Equation 4.11. The threshold used to determine if these two classifiers are on-topic
is dynamically determined using the following formula:
$$\mathrm{threshold}(q_i, d_j) = 0.4 + \alpha \times (\mathrm{sim}(q_i, d_i) - 0.4) + \beta \times (date_j - date_i) \qquad (4.15)$$
where α = 0.2 and β = 0.0005 control the effect of sim(qi, di) and the time
parameter datej − datei, and 0.4 is an InQuery constant. This time parameter is used
as a means of modelling the temporal nature of news in the detection task, and as a
means of controlling the similarity of the classifiers so that documents that are far
apart on the input stream seem less similar than they actually are. This is based on
the following observation that ‘an event is less and less likely to be reported as time
passes [because].. it slowly becomes news that is no longer worth reporting’ (Allan
et al. 1998c). Another important element of Equation 4.15 is the calculation of
sim(qi, di), i.e. the similarity value between the (existing) classifier and the
document from which it was originally formulated. In Papka’s NED
implementation a document classifier is continually reformulated as new documents
arrive on the input stream. This classifier reformulation step is based on the notion
that as a news topic grows a variation in vocabulary will also occur, and so in order
to ensure that discriminating features in a cluster are weighted correctly, they must
be re-weighted each time a new document is added to that cluster. More
specifically, classifiers are re-weighted with respect to how often their features
occur in the relevant document set (the cluster in which the classifier resides) tfrel
and in the non-relevant document set tfnonrel (all other document classifiers):
$$q_{i,k} = c_1 \times tf_{rel} - c_2 \times tf_{nonrel} \qquad (4.17)$$
where c1 and c2 are equal to 0.5.
Using the above weighting and thresholding strategies, and a single-pass
clustering algorithm, an incoming document is first compared to each previously
seen classifier in each cluster. In each case the system assigns a decision score to
each document comparison using the following formula:
$$\mathrm{decision}(q_i, d_j) = \mathrm{sim}(q_i, d_j) - \mathrm{threshold}(q_i, d_j) \qquad (4.16)$$
where the occurrence of a positive value indicates that an old event has been found,
while a negative value indicates that the incoming document discusses a seminal
event or first story. Papka’s clustering algorithm employs an online single-link
strategy when comparing a cluster to a target document. More specifically rather
than maintaining and updating a cluster centroid classifier, this comparison strategy
maintains a set of individual classifiers representing all the documents that have
been added to a particular cluster during the clustering process. The single-link
comparison strategy takes ‘the maximum positive decision score for the classifiers
contained in a cluster’ as the similarity value between a cluster and an incoming
document (Papka, 1999). Once an incoming document classifier has been added to
an existing cluster (i.e. an old event has been detected), or forms the seed of a new
cluster (i.e. a new event has been detected) then the next document in the input
stream is read in and the detection process begins again.
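The thresholding and decision rule of Equations 4.15 and 4.16 can be re-expressed in a few lines of code, using the α, β and 0.4 values quoted above; the similarity scores themselves are assumed to come from the InQuery-style weighting described earlier and are simply passed in as numbers in this sketch.

def dynamic_threshold(sim_qq, date_diff, alpha=0.2, beta=0.0005):
    # Equation 4.15: the threshold grows with the classifier's self-similarity and
    # with the time gap between the two stories (0.4 is the InQuery constant).
    return 0.4 + alpha * (sim_qq - 0.4) + beta * date_diff

def decision(sim_qd, sim_qq, date_diff):
    # Equation 4.16: positive -> old event detected, negative -> potential first story.
    return sim_qd - dynamic_threshold(sim_qq, date_diff)

print(decision(sim_qd=0.55, sim_qq=0.8, date_diff=30))    # > 0: old event
print(decision(sim_qd=0.42, sim_qq=0.8, date_diff=200))   # < 0: first story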
4.3.2 CMU Approach
The Carnegie Mellon University approach to NED uses a vector space model to
represent documents and clusters during first story detection (Yang et al., 1998;
2002). However, unlike the original UMass NED approach (Papka, 1999), they use
a variation on the single-pass clustering algorithm that compares incoming
documents to other recently occurring documents on the input stream rather than to
documents in existing clusters. Like UMass, the CMU researchers also addressed
the difficulties of calculating idf statistics online. To tackle this problem they
devised an incremental vector space model (Incr.VSM) which calculates the idf
statistic in two ways:
Retrospective idf statistics can be generated from a same domain corpus in
order to approximate the real idf values in the current corpus.
Incremental idf statistics generated from the current corpus are captured by
recomputing the statistics as new information arrives on the data stream. In
other words, the feature frequency counts in the documents seen so far on the
input stream are used to update and augment the retrospective idf values
obtained a priori.
The incremental idf measure is defined as follows in (Yang et al., 1998):
$$idf(t, p) = \log_2\!\left(\frac{N(p)}{n(t, p)}\right) \qquad (4.18)$$
where p is the current time, t is the term, N ( p ) is the number of accumulated
documents up to the current point (including the retrospective corpus if used), and
n ( t , p ) is the number of documents which contain term t up to the current point on the
input stream. Terms are then weighted using the following version of the tf.idf
measure also taken from (Yang et al., 1998):
w(t, d) = ((1 + log2 tf(t, d)) × idf(t, p)) / ||d||
(4.19)
where the denominator || d || is the 2-norm of vector d , i.e. the square root of the
sum of the squares of all the elements in that vector.
This equation and all
document preprocessing (stopword removal and stemming) are provided by the
SMART 11.0 system developed at Cornell University (Salton, 1989).
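A minimal sketch of the incremental statistics in Equations 4.18 and 4.19 is given below. It assumes simple in-memory counters (optionally seeded from a retrospective corpus) and ignores the SMART-specific preprocessing, so the class and function names (IncrementalIdf, tf_idf_vector) are ours rather than CMU's.

```python
import math
from collections import Counter

class IncrementalIdf:
    """Incremental idf statistics in the spirit of Equation 4.18."""

    def __init__(self, retro_doc_count=0, retro_doc_freq=None):
        # Optionally seed N(p) and n(t, p) with retrospective corpus counts.
        self.n_docs = retro_doc_count
        self.doc_freq = Counter(retro_doc_freq or {})

    def add_document(self, terms):
        # Update the counts as a new story arrives on the input stream.
        self.n_docs += 1
        self.doc_freq.update(set(terms))

    def idf(self, term):
        # Assumes the term has been seen at least once (call add_document first).
        return math.log2(self.n_docs / self.doc_freq[term])


def tf_idf_vector(doc_terms, idf_model):
    """Length-normalised tf.idf weights for one document (Equation 4.19)."""
    tf = Counter(doc_terms)
    raw = {t: (1 + math.log2(f)) * idf_model.idf(t) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # the 2-norm ||d||
    return {t: w / norm for t, w in raw.items()} if norm else {}
```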
CMU’s detection algorithm, as already stated, compares incoming documents to
other previously seen documents rather than cluster representations of topics.
Furthermore, their approach calculates document-document similarity using the
standard cosine measure; however, using document-document comparison is more
computationally expensive than comparing documents to clusters. Hence, they
implement a time-window component which improves their algorithm’s efficiency
by limiting the number of ‘target document to existing document’ comparisons to
those that exist within a window of m previously seen stories up to the current point
on the input stream. Using a window size of 2000 documents (about 1.5 months of
news time), Yang et al. (1998) found that this windowing technique actually
improved performance rather than compromised it. This result is in part due to the
temporal nature of news where documents become less related in the input stream
as time goes on, and so it therefore becomes unnecessary to compare all old
documents to the target document. To further model this idea of news temporality
the CMU team also incorporate a decay function into the similarity measure as
follows:
score(x) = 1 − max{(i/m) ∗ sim(x, di) : di ∈ window}
(4.20)
where x is the current document, di is the i-th document in the window, and i = 1, 2, …, m. The value i/m is the decay factor which ensures that documents
that exist far apart on the input stream will be assigned a reduced similarity score.
With regard to feature extraction, all features (excluding stopwords) are retained for
each document representation, and the optimal similarity threshold was found to be
0.16. This means that an incoming document must not exceed the 0.16 similarity
limit with another document in the time-window in order to be classified as a ‘first
story’.
This CMU system took part in the TDT-pilot study, TDT 1998 and TDT 1999
evaluation initiatives. A lot of CMU's work has also focused on using multiple-classifier and multiple-method approaches (the BORG approach) to improve
tracking and cluster detection performance; however, these techniques were not
suitable for the NED task just described. More information on this work and their
multi-lingual TDT findings can be found in (Yang, et al., 2002).
4.3.3 Dragon Systems Approach
Dragon systems use a language modelling approach to NED (Allan et al., 1998a;
Yamron et al., 2002). This means that documents are represented using n-gram
frequencies (in this case unigrams). Like CMU, Dragon use an auxiliary corpus to
generate word statistics, however, their approach does not use this information to
collect idf values to weight terms. Instead discriminator topic models are built from
an available same domain corpus using an iterative k-means clustering algorithm.
The Dragon approach then uses a single-pass clustering algorithm to determine
which stories discuss seminal events, where a new event is detected if it is closer to
a discriminator topic model than an existing story cluster. Dragon use a variation
on the Kullback-Leibler divergence metric to measure the distance15 between the tf
distributions of a document and a cluster which is defined more formally as:
d = Σn (sn / S) log((un / U) / (c′n / C)) + decay term
(4.21)
In this equation c ′n is the smoothed cluster count16 for word wn, and un /U, sn / S and
c ′n / C are the relative frequencies of wn in the background unigram model (or
discriminator model), the story unigram model and the cluster unigram model
respectively. The decay term is used to make clusters containing old news stories
have greater d, i.e. appear less similar and is defined in (Allan et al., 1998) as ‘the
product of a decay parameter and the difference between the number of the story
representing the distribution sn and the number midway between the first and last
story in the cluster’. Dragon systems only participated in the pilot study evaluation
for NED, preferring to concentrate like CMU on the segmentation, tracking and
cluster detection tasks for further TDT evaluations. In later tracking and detection
approaches Dragon combine two complementary statistical models, a beta-binomial model and a unigram model, to improve the performance of these tasks.
However, no attempt was made to observe the effects of this combination technique
on the NED performance. Yamron et al. (2002) give additional details on these
tracking and retrospective detection approaches.
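Under the assumption that the smoothing and decay details are supplied separately, the distance of Equation 4.21 can be sketched as below. The dictionaries of relative frequencies and the decay_term argument stand in for Dragon's unigram models and their decay parameterisation, and the names are illustrative only.

```python
import math

def topic_distance(story_rel_freq, cluster_rel_freq, background_rel_freq,
                   decay_term=0.0):
    """Variation on KL divergence used by Dragon (Equation 4.21).

    Each argument maps a word to its relative frequency in the story,
    (smoothed) cluster and background unigram models respectively; the
    smoothing is assumed to guarantee non-zero cluster and background
    probabilities for every word in the story.
    """
    d = 0.0
    for word, s in story_rel_freq.items():
        u = background_rel_freq[word]   # u_n / U
        c = cluster_rel_freq[word]      # smoothed c'_n / C
        d += s * math.log(u / c)
    return d + decay_term

# A story is treated as a new event when its distance to every existing
# cluster exceeds its distance to the discriminator (background) models.
```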
4.3.4 Topic-based Novelty Detection Workshop Results
The Topic-based Novelty Detection Workshop took place in Summer 1999 at the
Johns Hopkins University Center for Language and Speech Processing (CLSP)
(Allan et al., 1999). Its aim was to investigate two novelty detection-based tasks:
New Event Detection which looks at measuring novelty at a document level, and
New Information Detection which looks at detecting novelty at a sentence level.
The findings in this workshop resulted in a number of publications and a new
avenue of research for the CIIR at UMass on temporal news summarisation using
New Information Detection techniques (Allan et al., 2001). One of the major
conclusions of the workshop, already mentioned in Section 4.2.3, was that further
improvements in NED performance using current IR technologies had reached an upper bound in expected effectiveness (Allan et al., 2000b).
15. Kullback-Leibler divergence (or relative entropy) is an information-theoretic measure of how different two probability distributions (over the same event space) are. This metric is commonly used as a distance measure, although it is asymmetrical.
16. Smoothing and its importance to language modelling-based retrieval is discussed briefly in Section 4.1.
The Novelty Detection Workshop also gave the UMass team an opportunity to
review elements of their initial approach to NED described in Section 4.3.1. This
work also led to the implementation of a 'new-look' TDT system for the TDT
2000 evaluation. Details on the clustering, similarity and weighting strategies
supported by the UMass system can be found in Section 5.6.1. They found that
optimal NED performance could be achieved using a VSM, KNN clustering
strategy, the cosine similarity metric, the basic tf.idf metric and a feature vector
containing all terms in the document, i.e. full dimensionality (Allan et al., 2000c,
2002c). Stopping and stemming also appeared to be useful preprocessing steps as
they reduced the number of tokens in the detection process without hurting the
effectiveness of the system (Allan, et al. 1999).
Research into the effect of named-entity recognition on NED performance was a
major focus of the Detection Workshop. Using the BBN named-entity detector a
variety of attempts were made to integrate this information into their vector space
model. The basic intuition of this work was that named entities such as people or
companies play a significant role in the description of a topic as they help to
distinguish one topic from another. The simplest method of including named
entities in a VSM is to assign these phrases higher weights in a term index.
However, this was shown to have little effect on system performance as these
phrases tend to have high tf and low idf values, and so are assigned higher weights
anyway.
Other integration approaches worked on the notion that a first story will have a
greater percentage of previously unseen named entities since it describes an
entirely new event. However, this assumption was found to be false in a lot of cases
because there are many high profile named entities that are mentioned continuously
in different topics such as the phrase ‘President Clinton’ which turns up in topics as
diverse as his State of the Union address, his affair with Monica Lewinsky and his
attempts at brokering peace in Northern Ireland. Named entities of this nature are
partially responsible for missed events. It was also found that there were a number
of topics that did not contain a significant number of original named entities in their
first stories, particularly if details of the incident were ‘shady’ at that point in the
story, and so the names of the individuals involved were not known. This ‘first
story’ characteristic was also responsible for missed new events. In addition, this
characteristic was also the cause of many false alarms, where new named entities
were introduced as a topic developed, thus leading the system to believe that a new
event had been found.
The most promising technique involving named entities explored by the
workshop used a two phase nearest neighbour approach that first found the n closest
documents to the target document based on the similarity of their feature vectors
disregarding all their named entities. This initial search was a comparison based
purely on ‘content words’, and the second phase then involved counting the number
of novel named entities in the target document that were not present in any of its
nearest neighbours. Any document that contained a high percentage of new entities
relative to its number of old entities was classified as a 'first story'. However, the
performance of this technique still did not exceed that of their baseline (tf.idf, vector
space model) NED strategy.
4.3.5 Other notable NED Approaches
In 1999 and 2000 only 2 groups, UMass and National University of Taiwan (NUT),
participated in the NED track at the TDT workshops. In both cases the UMass
system outperformed NUT’s contribution. The NUT approach to the TDT problem
is still an interesting one as they attempt to extend the simple VSM model with
some linguistic knowledge present in the text. In particular, they use a part-of-speech tagger to identify noun phrases and verbs in the text since according to Chen
and Ku (2002) these parts-of-speech are the main contributors to the event
description in a news story. They also use a centroid-based clustering algorithm to
detect first stories. However, when a new document is added to a cluster and the
centroid is updated with its terms, their algorithm assigns time labels to all terms in
the centroid, and deletes older candidate terms that have not appeared for a while in
the input stream. This ensures that the important terms and the latest terms are
retained in each topic centroid. Chen and Ku also acknowledge that short
documents in the input stream tend to have very low similarities. Hence, they adopt
a form of query expansion to help identify the occurrence of synonymous terms in
the story and centroid representations. For Chinese documents their algorithm uses
a Chinese thesaurus, while for English documents it uses the WordNet thesaurus
where a term is expanded using synonyms, and the weight of an expanded term is
half that of the original term. Chen and Ku use a standard tf.idf weighting scheme to
weight terms in each document and cluster vector. As already stated, none of these
augmentations to the basic VSM model could improve upon the performance of the
UMass system. However, some improvements have been shown in the link
detection task, details of which can be found in (Chen, Chen, 2002).
In 2001, IBM’s NED system17 outperformed attempts by CMU, UMass and
University of Iowa (UIowa). IBM’s NED system like many others uses an
unsupervised single-pass clustering algorithm with a document/centroid comparison
strategy and the Okapi weighting function. After a document has been assigned
part-of-speech tags, morphological analysis is performed and all noun bigrams and
the remainder of the terms are used to represent the story. IBM’s method differs,
however, in the way in which a document/cluster similarity is calculated. They use
a combined approach where the similarity between a document and a cluster is a
weighted linear combination of the ‘traditional’ similarity of their vectors and a
similarity measure based on the novelty of the terms in the centroid. This term
novelty weighting scheme aims to capture the temporal nature of news by
decreasing the weight of terms that occurred much earlier in the news stream.
At the same TDT workshop, the UIowa system performed worst of all four
submitted NED system results. This system was based on a named-entity
recognition approach where noun phrases are identified and sorted in the following
entity categories: persons, organisations, locations and events (all other parts-of-speech are ignored). Each document/cluster similarity is calculated as the weighted
sum of similarities of the vectors for each of these entity types. Eichmann and
Srinivasan (2002) comment only on the effect of this approach on tracking
performance, but their observation that sparse or empty entity vectors had a
detrimental effect on similarity calculations also holds true for their NED system.
17. Since no formal publication of IBM's NED results exists, we were only able to gather details on
their technique from a presentation given at the TDT 2001 workshop which can be found at
http://www.nist.gov/speech/tests/tdt/tdt2001/paperpres.htm (as of March, 2004). NED system
descriptions are also sketchy for the UIowa and NUT systems. However, personal correspondence with the researchers at NUT has confirmed that their system is based on the techniques described in
(Chen, Ku, 2002).
4.4 Discussion
The aim of this chapter was to provide some background on the New Event
Detection task as a general Information Retrieval problem, and as a TDT task. This
chapter also provided background on the TDT initiative, and a detailed review of
the previous solutions to New Event Detection proposed by various research groups
participating in this project. An important conclusion drawn by the TDT
community, and from this chapter, is that although much progress has been made in
the area of automatic organisation of news streams into a manageable information
source for newsreaders, there is much room for improvement. In particular, New
Event Detection has gained a reputation as the most challenging of the information
filtering tasks defined by the TDT researchers (Allan et al., 2000b).
Consequently, we have focused our energies on improving the performance of
this filtering task over a typical approach that uses a general baseline vector space
model. The next chapter details the results of the evaluation of our lexical chain-based approach to NED on the TDT pilot study and TDT2 evaluation corpora. In
particular, we describe a novel document representation strategy which attempts to
improve upon the traditional view of a document as a mere set of term frequencies,
and instead aims to capture the essence of an incoming story with respect to its
lexical cohesive structure.
Chapter 5
Lexical Chain-based New Event Detection
In the previous chapter, we looked in detail at the Topic Detection and Tracking
initiative. In particular, we focussed on the New Event Detection (NED) task and
how various participants had addressed this problem using traditional IR methods.
As previously stated, NED is a classification task that identifies all breaking news
stories discussed in a news stream. In Chapter 2, we reviewed work by Morris and
Hirst (1991) which concluded that lexical chains can be used to identify prominent
sub-topics and themes in texts that correspond well with the discourse units
described in Grosz and Sidner’s Theory of Discourse Structure (1986). Based on
this observation, we investigate whether lexical chains can be used as a means of
differentiating between informative and unimportant terms in a text.
In this chapter, we attempt to improve NED performance by using chain words
in conjunction with a standard keyword indexing strategy to represent document
content. The first set of experiments, described in Section 5.5, represent a
preliminary investigation into the suitability of our hybrid model of textual content
for identifying new events in the TDT1-pilot study corpus; while the experiments
described in Section 5.6 attempt to replicate these results on the TDT2 corpus by
integrating our linguistic indexing strategy with the UMass NED system.
Before we report on the results of these TDT experiments, we first look at the
relationship between disambiguation accuracy and previous attempts at integrating
lexical chains into an IR model, and how these systems influenced the design of our
NED system, LexDetect. In Section 5.2, we also examine how lexical cohesive
relationships in text can be used to identify pertinent themes in news stories, and
how this differs from a traditional word frequency-based approach. We conclude
this chapter with a review of other lexical chain-based NED research also
conducted at University College Dublin.
5.1 Sense Disambiguation and IR
In Chapter 2, we explored various methods of creating lexical chains using different
lexical knowledge sources. The resultant chains were then used to address a variety
of NLP and IR problems. In particular research efforts by Stairmand, Green and
Ellman looked at using information derived from the generation of the chains to
improve hypertext generation (Green, 1997a; 1997b) and query-based retrieval
(Stairmand, 1996; Ellman, 2000). In the case of Ellman, he attempted to model
document content in terms of Roget’s categories, while Stairmand and Green
represented documents in terms of concepts indexed using WordNet synsets.
Assigning WordNet synsets or Roget’s categories to words in a text requires a
method of sense disambiguation which, as we saw in Chapter 3, is one of the side
effects of lexical chain generation, since noun phrase clustering based on semantic
similarity requires a decision on which context a word is being used in. Using
disambiguated words to index documents has been the focus of much interest in the
IR community as it was hoped that a deeper understanding of document content
might improve retrieval performance. There are two linguistic phenomena that
motivate this intuition. They are defined as follows with respect to their effect on
retrieval performance in terms of recall and precision:
Synonymy is the phenomenon that occurs when two distinct syntactic phrases
are used which share the same meaning (e.g. ‘domestic animal’ and ‘pet’). Since
the core operation of any traditional IR model is the measurement of similarity
in terms of syntactic word matching, synonymy will cause documents and
queries to appear less similar than they actually are, thus reducing the recall of
the IR system18.
Polysemy is the phenomenon that occurs when a word has more than one word
sense depending on the context in which it is used (the 'financial' and 'river' senses of 'bank' being a typical example). Polysemy has the effect of
causing documents and queries to appear more similar than they actually are,
thus reducing the precision of the IR system.
However, despite the fact that addressing synonymy and polysemy to improve IR
performance makes sense, disambiguation-based IR experiments have by and large
been unsuccessful. This is also true for Stairmand's, Ellman's and Green's attempts at replacing keyword indexing with WordNet synsets and Roget's categories. For the remainder of this section, we will briefly touch on some of the reasons stated in the literature for these disappointing results. In Section 5.2, we discuss how these results prompted us to use lexical chaining as a feature selection method rather than as a disambiguating and indexing strategy.
18. Furnas et al. (1987) refer to this phenomenon as the vocabulary problem, which states that 'people tend to use a surprisingly great variety of words to describe the same thing'.
5.1.1 Two IR applications of Word Sense Disambiguation
Throughout the nineties, interest in sense disambiguation for IR escalated due to the
release of online dictionaries and thesauri like Roget's Thesaurus, the Longman
Dictionary of Contemporary English (LDOCE) and WordNet (see Section 2.4 for
details), which made it possible to automate the sense disambiguation process using
the sense definitions defined in these lexical resources. According to Sanderson
(2000), the first large scale IR experiments investigating the usefulness of
disambiguation were carried out by Voorhees (1993; 1994; 1998) and Sussna
(1993) using WordNet, and Wallis (1993) using the LDOCE. Since we are
primarily interested in WordNet-based indexing approaches, we will focus on the
conclusions drawn by Voorhees which concur with the results of subsequent
experiments by Sussna (1993), and Richardson and Smeaton (1995).
Voorhees (1998) looked at two applications of word sense disambiguation for
query-based retrieval: conceptual indexing and query expansion. Since a WordNet
synset represents a single concept or set of synonymous words, building a vector of
synsets as a representation of a document in the vector space model (VSM) is
referred to as conceptual indexing. Like most WordNet-based sense resolution
approaches, Voorhees’s disambiguator assigns a synset number to a word if that
sense is the most active node in the WordNet network in the context of a specific
document, i.e. the word sense that is most related to other words in the document.
Once the conceptual index for the document collection has been built for each
query, Voorhees’s system disambiguates each incoming query resulting in a synset
query vector. The traditional VSM is then used to retrieve and rank documents
relevant to this query. Voorhees ran experiments on five popular IR collections19:
CASM, CISI, CRAN, MED, TIME. However, in each case she found that the
effectiveness of the sense-based vectors was worse than the traditional stem-based vectors. She suggests two main causes for this degradation in retrieval performance: disambiguation errors and the inability of the disambiguator to resolve word senses in short queries due to a lack of context.
19. Descriptions and statistics on these test collections can be found in Chapter 3 of the IR textbook by Baeza-Yates et al. (1999).
In a second set of experiments, Voorhees uses WordNet as a source of words for
expanding queries, in order to widen the breadth of the search for relevant documents
in a TREC collection. More specifically, she looked at the effect of WordNet-based
query expansion on retrieval performance when query terms were manually
disambiguated and then automatically expanded with related terms from the
WordNet taxonomy during the retrieval process. For example, Voorhees adds the
following words to the query containing the word furniture: table, dining, board,
refectory. These words are all specialisations of the word furniture. Unfortunately,
this query expansion technique did not outperform the traditional VSM approach to
query-based retrieval. However, when words were manually disambiguated and
manually expanded (not all related WordNet terms were chosen) then a significant
improvement in retrieval performance was observed. Only certain lexicographically
related words are useful in the expansion process, because a hypernym path
between words in the WordNet taxonomy does not always indicate a useful query
expansion term. The reason for this relates to the fact that not every edge in the
WordNet taxonomy is of equal length and not all branches in the taxonomy are
equally dense (see Section 2.4.4 for further discussion). Hence, Voorhees’s query
expansion experiments indicate that semantic distance in WordNet cannot be used
to approximate semantic relatedness with sufficient accuracy for use in this
application.
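As an illustration of the kind of automatic expansion Voorhees evaluated, the short sketch below pulls synonyms and direct hyponyms (specialisations) of a manually chosen sense out of WordNet using the NLTK interface. The choice of relations and the sense index are our own simplifications, not a reconstruction of her exact procedure.

```python
from nltk.corpus import wordnet as wn

def expand_term(term, sense_index=0):
    """Expand a query term with synonyms and direct specialisations."""
    synsets = wn.synsets(term, pos=wn.NOUN)
    if not synsets:
        return []
    sense = synsets[sense_index]          # manually disambiguated sense
    related = set(sense.lemma_names())    # synonyms from the chosen synset
    for hyponym in sense.hyponyms():      # specialisations of the sense
        related.update(hyponym.lemma_names())
    related.discard(term)
    return sorted(w.replace('_', ' ') for w in related)

# For example, expand_term('furniture') returns specialisations such as
# 'table', which would then be added to the query vector.
```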
5.1.2 Further Analysis of Disambiguation for IR
A major weakness of Voorhees’s experiments was that no evaluation of
disambiguation performance was conducted. Hence, it is impossible to know to
what extent disambiguation error reduces IR performance. However, measuring
disambiguation accuracy is a very time consuming process as a gold standard set of
documents must be manually assigned senses before such an evaluation can take
place. This prompted Sanderson (1994; 1997; 2000) to investigate the impact of
disambiguation errors on IR effectiveness using a technique that artificially adds
ambiguity to a test collection. This technique, which was first proposed by
Yarowsky (1993), is based on the addition of pseudo-words to a test collection. A
pseudo-word is an artificially created ambiguous word, generated by randomly
selecting a sequence of n words and concatenating them together. An example of a
size 2 pseudo-word would be 'cat/spade' where every instance of 'cat' and 'spade' in
the corpus would be replaced by this pseudo-word. Sanderson showed, using this
technique, that adding ambiguity to queries and collections has little effect on IR
performance compared to the effect of adding disambiguation errors to the
collection (e.g. replacing ‘cat/spade’ with ‘cat’ in a particular document where this
instance of the pseudo-word should have been disambiguated as ‘spade’).
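A pseudo-word corpus of this kind can be generated with a few lines of code; the sketch below pairs up randomly chosen vocabulary items (size 2 pseudo-words) and rewrites a tokenised corpus accordingly. It is only meant to illustrate the idea, and the function names are ours, not a reconstruction of Sanderson's experimental setup.

```python
import random

def build_pseudo_words(vocabulary, size=2, seed=0):
    """Randomly group vocabulary items into pseudo-words, e.g. 'cat/spade'."""
    rng = random.Random(seed)
    words = list(vocabulary)
    rng.shuffle(words)
    mapping = {}
    for i in range(0, len(words) - size + 1, size):
        pseudo = '/'.join(words[i:i + size])
        for w in words[i:i + size]:
            mapping[w] = pseudo
    return mapping

def ambiguate(tokens, mapping):
    """Replace every occurrence of a member word with its pseudo-word."""
    return [mapping.get(t, t) for t in tokens]
```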
Consequently, Sanderson concluded that only low levels of disambiguation error
(less than 10%) would result in improvements over a basic word stem-based IR
model. This result is in agreement with earlier work by Krovetz and Croft (1992)
who undertook a manual investigation of thousands of query/document word sense
matches after retrieval, and concluded that sense ambiguity (caused by polysemy)
did not downgrade retrieval performance as much as was originally expected. They
pinpointed two reasons for this, as highlighted by Sanderson (2000):
The query word collocation effect where query words implicitly disambiguate
each other by the fact that in a ranked list the highest ranking documents will
have occurrences of all or most of the query words. Therefore, one can presume
that these documents are using these words in the context intended by the query.
Hence, the effect of polysemy on retrieval performance is less than expected.
75% of words in a corpus are either unambiguous or have skewed sense
distributions and so are used in the majority sense in most queries. A term is
said to exhibit a skewed sense distribution if one of its senses is used more
frequently in a particular domain. For example, it would be unusual to find the
‘friendship’ or ‘chemical’ sense of ‘bonds’ used in the Financial Times
newspaper. Again this contributes to the fact that the effect of polysemy on
retrieval performance is less than expected.
Furthermore, in his thesis Buitelaar (1998) states that only 5% of word stems in
WordNet are truly unrelated. This means that stemming words and conflating
particular instances to a common stem, as is carried out in most traditional IR
models, is not as harmful as researchers might have expected since 95% of words
originate from a core related sense. Hence, stemming ‘computation’ and ‘computer’
to ‘comput’ is actually good for retrieval in contrast to say using WordNet-based
conceptual indexing where ‘computer’ and ‘computation’ will be assigned two
distinct synset numbers, hence contributing to the dissimilarity between a document
and a query.
All of these points in some way account for why resolving polysemy has not
proved to be an effective means of improving IR performance. However, there is
evidence to suggest that improvements are possible by resolving synonymy.
Gonzalo et al. (1998; 1999) report on the results of a series of experiments on the
SemCor collection. SemCor (Miller et al., 1993) is a publicly available subset of the
Brown Corpus that has been hand-tagged with WordNet synsets. Gonzalo et al.
adapted SemCor by splitting up large documents into coherent self-contained
fragments and writing synset tagged summaries of each fragment. These summaries
were then used as queries in an IR-style experiment. More specifically, for each
query the system was only required to retrieve one known item, in this case the
document from which the query/summary was generated. Gonzalo et al. found that
synset indexing ranked the correct document in first place 62% of the time
compared with 53.2% for word sense indexing, and 48% for basic word indexing.
The first two results show that resolving synonym relationships between words is
responsible for a much greater improvement over the traditional keyword indexing
performance than resolving and differentiating between polysemous words in the
collection, i.e. word sense indexing versus basic word indexing.
Gonzalo et al. also suggest that Sanderson’s original estimate of 90%
disambiguation accuracy as the minimum cut-off point for observing any
improvement in retrieval performance is too high, and that this cut-off point is
nearer 60%. They believe that the reason for the difference between these two
accuracy estimates is due to the fact that pseudo-words do not always behave like
real ambiguous words in text, and that their experiment is much closer to how real
ambiguity works. The results of Stokoe et al.’s (2003) (Web) TREC retrieval
experiments provide some evidence to support Gonzalo et al.’s claim, where their
high-recall disambiguator performed with an accuracy of only 62.1%, but still
managed to outperform the traditional VSM by 1.76% with regard to average recall.
However, this experiment differed from previous WordNet-based concept indexing
experiments where an initial ranked list of documents is retrieved for a given query
using the VSM, and then this ranking is refined with respect to the similarity of the
disambiguated query to each disambiguated document in the ranked list. Stokoe et
al. also state that this improvement may be limited to certain types of retrieval, and
may only be useful to ad hoc retrieval systems that deal with very short queries (one
to two words) where the query collocation effect tends not to apply. However,
regardless of where this cut-off point lies, current state-of-the-art WordNet-based disambiguation has only reached a maximum of 69% accuracy20, which is on the lower end of the scale of acceptable disambiguation accuracy for IR applications. Consequently, it will be some time before traditional approaches will be significantly outperformed by conceptual indexing approaches to IR.
20. This result was taken from the Senseval-2 evaluation – a WordNet-based disambiguation workshop. For more information see http://www.sle.sharp.co.uk/senseval2/ (February, 2004).
5.2 Lexical Chaining as a Feature Selection Method
In the previous section we looked at reasons why integrating sense disambiguation
strategies into IR system indexing has not been as successful as researchers had
anticipated. This discussion also gives us an insight into why in the past lexical
chain-based IR tasks have not been as successful as expected. For example, Green
(1997b) found that users experienced no significant advantage when answering
questions using lexical chain-based hypertext links over links generated by a simple
vector space model of document similarity (see Section 2.5.4). Both Stairmand
(1996) and Ellman (2000) used lexical chains as a means of improving query-based
retrieval. In both cases, even though Ellman used Roget’s categories rather than
WordNet synsets, they reported mixed results where improvements were observed
in certain cases but not in others (see Sections 2.5.5 and 2.5.3). Kazman et al.
(1995, 1996) also looked at chain-based dialogue indexing but no formal IR-based
evaluation was performed (see Section 2.5.4). These researchers, apart from
Ellman, were also elusive on the effect of the disambiguation accuracy of their
algorithms on the performance of their particular lexical chaining application.
For example:
Green (1997b) partially evaluated his intra-document hypertext linking strategy
by clustering documents from six topics (from the 50 available) in a TREC
collection. His clustering technique was based on a conceptual indexing strategy using WordNet synsets derived from the lexical chains generated for each
document as features. However, Green’s evaluation only comments on the fact
that ‘the similarity function for synset weight vectors works as expected, that is,
higher thresholds result in less connections, i.e. additions to clusters’. More
specifically, no comparison with a clustering strategy that uses traditional
keyword weighted vectors was performed.
Stairmand’s (1996) evaluation on the other hand is more comprehensive, since
he compares his conceptual indexing technique using chain synsets with a
traditional VSM of query-based retrieval. However, his experiments would still
be considered small on an IR scale, as he only evaluates system performance on
12 carefully chosen queries. Stairmand only selected queries that contained
nouns in the WordNet index so as to ensure that degradations in system
performance could be attributed to the indexing strategy rather than limitations
in WordNet’s coverage. Stairmand compared his system with the SMART
retrieval system and found that his system exhibited a higher rate of precision.
However, the system's low recall levels and limited ability to deal with all types
of queries made it unsuitable as a real replacement for a traditional VSM.
Stairmand suggests that ‘a hybrid approach is required to scale up to real-world
IR scenarios’. Like Green, Stairmand did not directly evaluate the
disambiguation accuracy of his algorithm.
In Chapter 3, details were given of the disambiguation accuracy of three lexical
chaining algorithms on the SemCor corpus. It was found that both greedy and non-greedy lexical chaining approaches can only hope to attain recall and precision
values, representing disambiguation accuracy, that lie between 55% and 60%. This
means that these algorithms are capable of disambiguating just over half of the
nouns in the SemCor collection correctly.
Both Sanderson’s and Gonzalo’s
analysis of IR system tolerance to disambiguation errors suggests that between 55%
and 60% accuracy is on the low side of this tolerance level. Also it must be
remembered that their estimates were based on full-text disambiguation, in contrast
to lexical chain-based disambiguation which only looks at one part-of-speech, i.e.
nouns. This will undoubtedly have caused a further degradation in performance as
valuable content information would have been missing from the document
representations, in particular parts-of-speech such as verbs and adjectives.
As a result of these inadequacies in previous chaining attempts, we proposed a
more suitable method of incorporating lexical chaining into an IR model based on
the following hypotheses:
1. Lexical chaining can be viewed as a method of feature selection, where our
feature selection hypothesis states that nouns in the text that form clusters of
cohesive words are considered to be pertinent in describing the overall topic of
the text.
2. A document representation strategy consisting of chain terms is more
appropriate than one based on chain synsets. This hypothesis is put forward
based on the fact that our lexical chaining algorithm achieves a relatively low
level of disambiguation accuracy (see Chapter 3) with respect to Gonzalo’s
suggested IR performance breakeven point of 55%-60%.
3. A data fusion document representation strategy that combines a lexical chain
term representation with a free-text representation of text will perform better
than a document representation based solely on lexical chain words.
As mentioned previously the focus of this chapter is the use of lexical chains as a
means of improving New Event Detection (NED) in the TDT domain. So in this
case we will be testing these hypotheses on a text classification task rather than an
ad hoc retrieval task. However, the techniques discussed in Section 5.3 should also
be applicable to any VSM-based system. Our evaluation, and the other work on
TDT at University College Dublin (Hatch, 2000; Carthy, 2002), represents the first
large-scale evaluation of lexical chaining as an indexing strategy in a text
classification or IR task.
According to Yang and Pedersen (1997), automatic feature selection methods
‘include the removal of non-informative terms according to corpus statistics and the
construction of new features which combine lower level features (i.e. terms) into
higher level orthogonal dimensions’21. In contrast to this definition our feature
selection method is based solely on a linguistic analysis of a text rather than a
statistical one. In addition, most of the interest in feature selection research has
grown out of a need for smaller feature spaces when using computationally
intensive machine learning-based text classification techniques like neural networks
and Bayes’ belief networks. In the context of this thesis, we are using lexical chains
as a means of augmenting a basic VSM with additional information regarding the
main themes of a news story. Hence, a lexical chain representation is not meant to
replace a free-text representation but to improve it, so we combine two distinct
document representations using an extended vector space model proposed by Fox
(1983). In this model an extended vector is actually a collection of sub-vectors
where the overall similarity between two extended vectors is the weighted sum of
the similarity of their corresponding sub-vectors. A more detailed explanation of
this composite document representation is discussed in the following section.
21. For more information on statistical feature selection we refer the reader to Yang and Pedersen (1997), who compare and contrast a number of statistical-based techniques like mutual information, document frequency, information gain, and the X2-test. Latent Semantic Indexing (Deerwester et al., 1990), Support Vector Machines (Joachims, 2002) and more recently statistical word clustering are also popular means of extracting features from text. For more details on statistical word clustering see Baker and McCallum (1998), Slonim (2002) and Dhillon et al. (2003).
5.3 LexDetect: Lexical Chain-based Event Detection
In the following sub-sections, we will describe how we have integrated our
approach to lexical chain-based New Event Detection (NED) with a traditional
vector space model approach. Figure 5.1 gives an overview of the system
architecture of the LexDetect system.
[Figure 5.1 shows the LexDetect pipeline: the news stream (newswire, radio and television sources) is passed to the LexNews Tokeniser and Lexical Chainer, which produce a free text document representation and a lexical chain word document representation; both representations are input to the New Event Detector, which outputs the breaking news stories.]
Figure 5.1: System architecture of the LexDetect system.
We evaluate this hybrid system with respect to a traditional keyword-based NED
system using the TDT1 pilot study evaluation in Section 5.5 and with respect to the
UMass NED system on the TDT2 collection in Section 5.6.
5.3.1 The ‘Simplistic’ Tokeniser
In Section 3.2.2, we described a complex lexical chaining candidate selection step
(the tokeniser) that used a part-of-speech tagger and a parser to find useful proper
noun phrases and noun compounds, and a set of morphological rules that changed
adjectives to nouns. Our initial TDT1 pilot study experiments undertaken at an
earlier stage in our research used a much simpler tokenisation process that did not
avail of the part-of-speech tagging and proper noun/noun phrase identification
steps. Instead, like many other lexical chaining approaches (St-Onge, 1995;
Stairmand, 1997), a term was considered a candidate term for chaining if it was
listed in the WordNet noun database. However, we found this selection process to
be unsatisfactory in many cases as it led to additional ambiguity in the chaining
process which was responsible for spurious lexical chains, e.g. the verb ‘to drive’
was incorrectly identified as the noun ‘a drive’ which has 12 defined senses in
WordNet. The ‘simplistic’ version of the Tokeniser used in the TDT1 experiments
described in this chapter, changed all terms in a text that occurred in the WordNet
noun database to their singular form (if necessary). However, adjectives pertaining
to nouns, noun compounds longer than two words, and proper noun phrases were not
identified, and so did not take part in the chaining process.
5.3.2 The Composite Document Representation Strategy
As explained in Section 5.1, we use lexical chains as a means of filtering noisy
terms from a document representation, where only those terms that are cohesively
linked with many other terms in the text are retained, since we hypothesise that they
capture the essence of the news story. However, as already stated we consider this
chain word representation as partial evidence in a composite document
representation that also includes a free text representation. In practical terms this
means that determining the similarity between two documents involves calculating
the cosine similarity between their respective chain word vectors and free text
vectors (where both sub-vectors weight their tokens with respect to their frequency
within the document), and then combining these two scores into a single measure of
similarity. This process of combining evidence is also referred to in the literature as
data fusion.
In Croft’s (2000) review of data fusion techniques used in IR, he states that
combining different text representations or search strategies has become a standard
technique for improving IR effectiveness. In the case of combining search systems
the class of ‘meta-search’ engines such as MetaCrawler have been very successful.
Similar improvements have been seen when multiple representations of document
content are used within a single IR search strategy (McGill et al. 1979; Katzer,
1982; Fox, 1983; Fox et al., 1988). As Croft explains, there are many different
classes of representation that have been used in these experiments, such as single
words from the text of a document (used in the vector space model), representations
based on controlled index terms (a list of key words composed by an indexer to
describe a set of documents), citations (references to other texts within a document),
passages (where documents are seen as a set of self contained parts rather than a
monolithic block of text), phrases and proper nouns (documents are described in
terms of people, companies, locations and phrases such as ‘budget deficit’), and
multimedia (where documents are seen as complex multimedia objects represented
by references to other media such as sound bites and video images). In general,
researchers have found that when combining these different text representations the
best results are obtained when a free text representation (i.e. traditional
representation containing all parts of speech) is used as stronger evidence than any
other class of representation. In practice this means that higher weights are given to
free text representations, with alternative representations seen as additional rather
than conclusive evidence of similarity.
5.3.3 The New Event Detector
As stated in Chapter 4, New Event Detection or First Story Detection is in essence a
classification problem where documents arriving in chronological order on the input
stream are tagged with a ‘YES’ flag if they discuss a previously unseen news event,
or a ‘NO’ flag when they discuss an old news topic. However, unlike detection in a
retrospective environment a story must be identified as novel before subsequent
stories can be considered. A single-pass clustering algorithm bases its clustering
methodology on the same assumption and has been used successfully by UMass,
CMU and DRAGON systems to solve the problem of NED.
In general, this type of clustering algorithm takes as input a set S of objects, and outputs a partition of S into non-overlapping subsets S1, S2, S3, …, Sn, where n is a
positive integer. In our implementation of a single-pass algorithm no limit is
imposed on n (the number of clusters). Instead, this number is indirectly controlled
by a thresholding methodology which determines the minimum similarity between
a document and a cluster that will result in the addition of that document to the
cluster. Determining the similarity between an incoming document and a cluster
and controlling which clusters are compared to that document is managed by a
cluster comparison strategy and a thresholding strategy. The following explanation
encapsulates how these strategies are integrated into the single-pass clustering
algorithm:
1. Convert the current document on the input stream into a lexical chain word vector and a 'free text' vector.
2. The first document on the input stream will become the first cluster and its chain word vector and 'free text' vector will form two distinct cluster centroids.
3. All subsequent incoming documents are compared with all previously created clusters up to the current point in time. A comparison strategy is used here to determine the extent of the similarity between an incoming document and the existing cluster centroid vectors.
4. When the most similar cluster to the current document is found, the thresholding strategy is used to discover if this similarity measure is high enough to warrant the addition of that document to the cluster, i.e. the event has been previously detected so the current document is classified as 'an old event'. If this document does not satisfy the minimum similarity condition for the cluster determined by the thresholding methodology, then that document is classified as discussing a new, previously unseen, event, i.e. a first story. This document will then form the seed of a new cluster representing this new event.
5. The clustering process will continue until all documents in the input stream have been classified.
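The following skeleton summarises steps 1 to 5 in Python. The vectorisation, comparison, thresholding and merging functions are placeholders for the strategies described in the remainder of this section, the Time_Window restriction is omitted for brevity, and all names are illustrative rather than taken from the LexDetect code.

```python
def single_pass_detect(stream, vectorise, most_similar_cluster,
                       passes_threshold, merge_into):
    """Single-pass NED skeleton: returns (document, is_new_event) decisions."""
    clusters, decisions = [], []
    for doc in stream:
        chain_vec, text_vec = vectorise(doc)                    # step 1
        if not clusters:                                        # step 2
            clusters.append({'chain': chain_vec, 'text': text_vec})
            decisions.append((doc, True))
            continue
        best, score = most_similar_cluster(clusters, chain_vec, text_vec)  # step 3
        if passes_threshold(best, chain_vec, text_vec, score):  # step 4: old event
            merge_into(best, chain_vec, text_vec)
            decisions.append((doc, False))
        else:                                                   # step 4: first story
            clusters.append({'chain': chain_vec, 'text': text_vec})
            decisions.append((doc, True))
    return decisions                                            # step 5
```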
Cluster Comparison Strategy
There are two facets to the cluster comparison strategy: a similarity measure and a
Time_Window. In some clustering implementations the addition of a document to a
cluster involves maintaining the document's original representation for subsequent
comparisons. So for example, in the single link cluster comparison strategy the
similarity between the incoming document and the cluster is taken as the maximum
similarity score between the incoming document and the document representations
in the cluster. However in the LexDetect implementation, to improve the overall
efficiency of the algorithm a cluster representative or centroid is used which is an
average representation of all the documents in the cluster. This process involves
merging (updating) the centroid representation every time a new member is added
to the cluster. However, before this merging can occur the most similar cluster to
the current document must be found. In accordance with the VSM, each
document/cluster representation is characterised in the detection process as a vector
of length t, which can be expressed as a unique point in t-dimensional space. The
importance of this is that document/cluster vectors that lie close together in this t-space will contain many of the same terms. This closeness or similarity is calculated
using the cosine similarity measure (Section 4.1.1, Equations 4.5 and 4.6).
So far we have described a cluster comparison strategy based on a traditional
VSM approach using keyword-based document classifiers. The data fusion element
of our research, as described in Section 5.3.2, involves the use of two distinct
representations of document content to identify first stories in a single cluster run.
In our alternative IR model, we use sub-vectors to describe our two document
representations, where the overall similarity between a document/cluster pair is
computed as the linear combination of the similarities for each sub-vector. So the
similarity function for our LexDetect system when comparing document D to
cluster C is for free-text vectors Cword and Dword, and chain word vectors Cchain word
and Dchain word:
Sim(C, D) = (Kword ∗ Sim(Cword, Dword)) + (Kchain word ∗ Sim(Cchain word, Dchain word))
(5.1)
where Kword and Kchain word are coefficients that influence the weight of evidence
each document representation contributes to the similarity measure. As in the case
of the traditional NED system, vector similarity is determined using the cosine
similarity function.
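A sketch of the combined similarity in Equation 5.1 is shown below, using sparse term-weight dictionaries for the sub-vectors. The coefficient values given as defaults are purely illustrative and are not the tuned settings of the LexDetect system.

```python
def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sum(w * w for w in u.values()) ** 0.5
    norm_v = sum(w * w for w in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_similarity(cluster, doc, k_word=0.7, k_chain=0.3):
    """Equation 5.1: weighted linear combination of sub-vector similarities.

    `cluster` and `doc` each hold a 'text' (free text) and a 'chain'
    (lexical chain word) sub-vector; the K coefficients shown here are
    illustrative defaults only.
    """
    return (k_word * cosine(cluster['text'], doc['text']) +
            k_chain * cosine(cluster['chain'], doc['chain']))
```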
The final component of the comparison strategy is a Time_Window, which aims
to exploit the temporal nature of broadcast news. In general when a significant news
story breaks many stories discussing the same event will occur over a certain time
span. This means that stories closer together on the input stream are more likely to
discuss related topics than stories further apart on the stream. Hence, we impose a
Time_Window within which documents can be clustered, by only allowing the n
most recently updated clusters to be compared with the current document.
Thresholding Strategy
Working in tandem with the cluster comparison strategy is a thresholding
methodology, which influences the decision for generating a new cluster. When the
system has established the most similar cluster to a particular document, that
document may only become a member of that cluster if it exceeds the cluster
similarity threshold (CST). The CST is calculated by finding the similarity of the
updated cluster centroid (after the document representation has been merged with
it), and the newest document member of the cluster:
CST (Cupdate, D ) = Sim(Cupdate, D ) ∗ R
(5.2)
At this point, this new similarity threshold is too high, and is reduced by
multiplying in a reduction coefficient, R. This reduction coefficient plays an
important role in the resulting cluster formation, controlling the size of these
clusters and consequently the classification of new and old events. R is one of three
system parameters (the other two being Dimensionality (the length of the document
classifiers) and the Time_Window parameter) that have varying effects on the
detection process. In particular, increasing R will decrease system misses as it
makes it easier for documents to be classified as new events, increasing the
Dimensionality will reduce precision but increase recall, and decreasing the
Time_Window parameter will increase the efficiency of the system. If a sensible
Time_Window value is used then this should help decrease the number of system
false alarms by eliminating irrelevant (old news) from the cluster/document
comparisons step of the detection algorithm.
The LexDetect implementation of the thresholding strategy deviates slightly
from the traditional NED system described above. In particular, when a document
representation is merged with a cluster representation, two separate merging
processes are required, where vector Dword is merged with Cword and Cchain word is merged with Dchain word, which results in the following updated cluster vectors (Cword)update and (Cchain word)update. Equation 5.3 is then used to calculate an overall updated cluster similarity threshold by multiplying the similarity value obtained by Equation 5.2 with the reduction factor R:
CST(Cupdate, D) = ((Kword ∗ Sim((Cword)update, Dword)) + (Kchain word ∗ Sim((Cchain word)update, Dchain word))) ∗ R
(5.3)
The cluster comparison and thresholding strategies are the only difference
between the implementations of the traditional NED and our lexical chain-based
NED system LexDetect.
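The thresholding step of Equations 5.2 and 5.3 can be sketched as follows, assuming a simple additive merge of term weights and a cosine function such as the one in the previous sketch. The merge rule, the default R value and the function names are illustrative assumptions rather than the exact LexDetect implementation.

```python
def merge(centroid, doc_vec):
    """Fold a document's term weights into a cluster centroid vector."""
    updated = dict(centroid)
    for term, weight in doc_vec.items():
        updated[term] = updated.get(term, 0.0) + weight
    return updated

def cluster_similarity_threshold(cluster, doc, cosine, k_word, k_chain, r=0.9):
    """Equation 5.3: the CST computed from the updated sub-vector centroids
    and the newest cluster member, scaled by the reduction coefficient R."""
    word_update = merge(cluster['text'], doc['text'])
    chain_update = merge(cluster['chain'], doc['chain'])
    sim = (k_word * cosine(word_update, doc['text']) +
           k_chain * cosine(chain_update, doc['chain']))
    return sim * r
```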
5.4 The TDT Evaluation Methodology
Currently the TDT community are about to embark on their 8th evaluation to date.
Since the beginning of the TDT initiative in 1997 four distinct corpora have been
created (the latest TDT-4 was released for the TDT 2003 evaluation). Initially TDT
research focussed on mono-lingual English language newswire and broadcast news
stories; however, since the advent of the TDT2 corpus multi-lingual data has been
made available, and most of the TDT participants have built systems that can filter
both types of news data. In this thesis we focus only on English news sources,
because unlike most other TDT approaches our technique relies on an additional
knowledge source (i.e. WordNet) for text understanding, in contrast to the mainly
statistical approaches of other TDT participants. In what follows, we describe the
TDT1 pilot study corpus, the TDT2 English corpus, and the evaluation
methodologies used in the pilot study and subsequent TDT workshops.
5.4.1 TDT Corpora
New Event Detection is concerned with detecting the occurrence of a new event
such as a plane crash, a murder, a jury trial result or a political scandal in a stream
of news stories from multiple sources. To assist in the research of this task the TDT
initiative has developed a number of event-based corpora two of which have been
used during the course of our work.
The TDT1 pilot study corpus is comprised of 15,863 news stories spanning
from the 1st of July 1994 through to the 30th of June 1995. These stories were
randomly chosen from Reuters news articles and CNN news transcripts from this
period, and were assigned an ordering that represents the order in which they were
published or broadcast. This corpus is accompanied by a file of relevance
judgements created for a set of 25 events, some of which include ‘the Kobe
earthquake’, ‘DNA evidence in the OJ Simpson trial’ and ‘the arrest of Carlos the
Jackal’. These recognised events are only a subset of the total number of distinct
events in the corpus and were chosen for their ‘interestingness’, their uniqueness
and the fact that there were an acceptable number of stories on each of these events
in the corpus. In total 1132 stories were judged relevant, 250 stories were judged to
contain brief mentions, and 10 stories overlapped between the set of relevant and
the set of brief mentions. However, these 'brief mentions' and overlaps are removed from the evaluation process, so classification is measured on relevant and non-relevant stories only.
The TDT2 corpus consists of 64,000 stories spanning the first six months of
1998 taken from six different news sources:
TV Broadcast News Transcripts: Cable News Network (CNN) Headline
News, American Broadcasting Company (ABC) World News Tonight,
Radio Broadcast News Transcripts: Public Radio International (PRI) The
World, Voice of America (VOA) English news programs,
Newswire: Associated Press Worldstream (APW) News Service, New York
Times (NYT) News Services.
For the TDT 1998 evaluation the corpus was split into training, development and
evaluation test sets. Both training and development corpora are always provided for
any initial dry-run experiments conducted by the participants. The evaluation test
set on the other hand is used for the final ‘blind’ evaluation and is only sent to the
participants a few weeks in advance of the TDT workshop. Since its release in 1998
the TDT2 corpus has existed in three distinct versions:
Version 1 was used in the TDT2 evaluation and was annotated against 100
target topics (only 96 first stories could be identified for these 100 topics).
Version 2 was augmented with three Mandarin news sources which were
annotated against 20 target topics from the original 100 topics identified in
version 1. This version was released in June 1999, and used as development and
training data in the TDT 1999 evaluation, while the TDT3 corpus provided the
evaluation test data.
Version 3.2 is the current version of the corpus on offer from the LDC which
was released on the 6th of December 1999. It consists of the same Mandarin and
English news sources as version 2 (a number of bugs were also fixed). This
version was annotated against an additional 97 topics that in total provides 193
first stories on which NED performance is based. These new topics are,
however, only partially annotated (not all stories belonging to the event have
been added to the relevance files), and were initially created to facilitate NED
research at the Johns Hopkins University Novelty Detection Workshop (Allan et
al., 1999) (see Section 4.3.4). This version was used as development and
training data in the TDT 2000 evaluation runs with the TDT3 corpus being used
as the evaluation test set. All broadcast news in the TDT2 collection is available
in audio or transcribed format. These transcripts were generated by the Dragon and BBN automatic speech recognisers, and boundaries between adjacent stories in the audio streams were determined by the LDC annotators. Some
manual text transcriptions are also available which include closed caption
material taken from television news streams and some Federal Document
Clearing House (FDCH) formats.
In the experiments that follow, we use the TDT1 pilot study corpus and version 3.2
of the English TDT2 corpus. Unfortunately, since the TDT 1998 evaluation did not
include the NED task and all subsequent evaluations used TDT2 as a training and
development resource, we are unable to directly compare our system results in Section 5.6 with those of other TDT participants. However, during a visit to the Center for
Intelligent Information Retrieval at UMass in 2001 a number of experiments were
conducted and evaluated with respect to the UMass NED system. The UMass
system was the best performing system at the 1999 TDT evaluation and was
marginally outperformed by the IBM NED system in the 2000 evaluation. Before
we report on these results we will first look at the evaluation metrics used to
determine system performance by the TDT community.
5.4.2 Evaluation Metrics
As stated in Section 4.1.2, IR performance is generally measured in terms of three metrics: recall, precision and the F1 measure. However, in the TDT evaluation methodology two system error probabilities (the miss and false alarm probabilities) are used to assess the effectiveness of the classification task. Misses occur when the system
fails to detect the first story discussing a new event, and false alarms occur when a
document discussing a previously detected event is classified as a new event. These
definitions are now described more formally with respect to Table 5.1. For
completeness, definitions of the traditional IR evaluation metrics are also included.
                                    # Retrieved by the system    # Not Retrieved by the system
# Relevant Stories in Corpus                     A                             C
# Non-relevant Stories in Corpus                 B                             D
Table 5.1: Values used to calculate TDT system performance. A, B, C and D are document counts.
recall = r = A / (A + C)                        if A + C > 0, otherwise undefined
precision = p = A / (A + B)                     if A + B > 0, otherwise undefined
Pmiss = 1 − recall = C / (A + C)                if A + C > 0, otherwise undefined
Pfa = B / (B + D)                               if B + D > 0, otherwise undefined
F1 = 2pr / (p + r) = 2A / (2A + B + C)          if 2A + B + C > 0, otherwise undefined
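To make these definitions concrete, the following sketch computes the five metrics directly from the Table 5.1 counts (Python is used purely for illustration; the function name is not part of the TDT evaluation software):

```python
def tdt_metrics(a, b, c, d):
    """Compute recall, precision, Pmiss, Pfa and F1 from the Table 5.1 counts.

    a: relevant stories retrieved        b: non-relevant stories retrieved
    c: relevant stories not retrieved    d: non-relevant stories not retrieved
    Undefined values are returned as None.
    """
    recall = a / (a + c) if (a + c) > 0 else None
    precision = a / (a + b) if (a + b) > 0 else None
    p_miss = (1 - recall) if recall is not None else None
    p_fa = b / (b + d) if (b + d) > 0 else None
    f1 = (2 * a) / (2 * a + b + c) if (2 * a + b + c) > 0 else None
    return recall, precision, p_miss, p_fa, f1
```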
Since the TDT1 evaluation (Allan et al., 1998a) was only based on a relevance
file of 25 identified first stories, an evaluation methodology was developed which
expanded the number of trials and effectively increased the number of decisions
that the system could be judged on. This was achieved by calculating miss and false alarm rates based on 11 system passes through the input data, where the goal of the
first pass is to detect the first story to discuss one of the 25 events on the input
stream, and the goal of the second pass after all first stories have been removed is to
identify all the ‘second stories’. This process is then iterated until the 10th document
on the event has been skipped. If an event has fewer than the required number of documents to participate in a given iteration, then it is ignored. Final
performance metrics are obtained by calculating the macro average of the respective
miss and false alarm rates for each of the 11 passes.
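The final macro-averaging step of this multi-pass scheme can be sketched as follows; the per-pass contingency counts are assumed to have already been produced by scoring the system's decisions on each pass, so this is only an outline of the averaging and not of the full evaluation software.

```python
def macro_average_rates(pass_counts):
    """Macro-average miss and false alarm rates over the 11 evaluation passes.

    pass_counts: list of (A, B, C, D) tuples, one per pass, following the
    Table 5.1 conventions (the 'relevant' stories on pass k are the k-th
    stories of each event still participating in that pass).
    """
    miss_rates, fa_rates = [], []
    for a, b, c, d in pass_counts:
        if a + c > 0:
            miss_rates.append(c / (a + c))
        if b + d > 0:
            fa_rates.append(b / (b + d))
    return sum(miss_rates) / len(miss_rates), sum(fa_rates) / len(fa_rates)
```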
As well as requiring that each NED system tag each document with a declaration
that it either discusses a new event (a 'YES' tag) or discusses an old event (a 'NO'
tag), the TDT evaluation also requires that the system produce a confidence score to
accompany these tags, i.e. a score that indicates how sure the system was about its
declaration. Often this score is based on the maximum similarity between the target
document and its most similar document in the corpus. An example of the UMass confidence score, referred to in Equation 4.16 as the decision score decision(qi, dj), was given in Section 4.3.1. The confidence scores returned by the NED system are
then used to generate a Detection Error Tradeoff (DET) graph which represents
the trade-off between miss and false alarm rates for a system. The TDT evaluation
software22 constructs a DET graph from the confidence score space by calculating
the Pmiss and Pfa for a large range of decision thresholds, i.e. how well the system performs if only documents with confidence scores exceeding X are tagged as 'first stories', where X is incremented in small steps from 0 to 1. During this 'threshold
sweep’ average Pmiss and Pfa values are computed across topics. Once this process is
completed a topic weighted DET curve can be generated by plotting each point
(Pmiss, Pfa) for each threshold. These points are plotted on a Gaussian scale rather
than a linear one as this helps to ‘expand the high performance region’ of the graph
making it easier to differentiate between similarly performing systems, where the
curve closest to the origin represents the best performing system.
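The threshold sweep that underlies a DET curve can be sketched as follows; this is a simplified, single-topic version (the official software averages Pmiss and Pfa across topics before plotting), and the input lists are hypothetical.

```python
def det_points(confidence_scores, is_first_story, num_steps=100):
    """Sweep a decision threshold X over [0, 1] and return (Pmiss, Pfa) pairs.

    confidence_scores: one score per story (higher = more likely a new event).
    is_first_story:    parallel list of booleans from the relevance judgements.
    Assumes both first stories and non-first stories are present.
    """
    targets = sum(is_first_story)
    non_targets = len(is_first_story) - targets
    points = []
    for step in range(num_steps + 1):
        threshold = step / num_steps
        tagged = [score >= threshold for score in confidence_scores]
        misses = sum(1 for t, first in zip(tagged, is_first_story) if first and not t)
        false_alarms = sum(1 for t, first in zip(tagged, is_first_story) if t and not first)
        points.append((misses / targets, false_alarms / non_targets))
    return points
```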
A welcome enhancement to the TDT pilot study evaluation was the formulation
of a cost function for the TDT2 evaluation. The purpose of the cost function was to
provide participants with a single measure that could define TDT performance in
terms of miss and false alarm probabilities. Like the F1 measure, used to combine
recall and precision, the TDT cost function does not perfectly characterise detection
effectiveness. However, it has been shown to be useful for parameter tuning
purposes. The general form of the TDT cost function is:
CDet = Cfa ∗ Pfa ∗ (1 − Pevent ) + Cmiss ∗ Pmiss ∗ Pevent
(5.4)
For the TDT2 evaluation, cost was defined with constants Pevent = 0.02 and Cfa = Cmiss = 1.0, where Pevent is the a priori probability of finding a target (in this case a first story). Fiscus and Doddington (2002) point out that although this measure
(Equation 5.4) is useful, it is difficult to determine what exactly constitutes a well-performing system with respect to a specific task. To address this, they suggest a
normalised CDet which is calculated by dividing CDet by the ‘minimum expected cost
achieved by either answering YES to all decisions or answering NO to all
decisions’ (Fiscus, Doddington, 2002), i.e.
(CDet ) Norm = CDet / MIN (Cmiss ∗ Pevent , Cfa ∗ (1 − Pevent ))
(5.5)
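A small worked example of Equations 5.4 and 5.5, using the TDT2 constants quoted above (the numerical run below is invented for illustration):

```python
C_MISS, C_FA, P_EVENT = 1.0, 1.0, 0.02   # TDT2 cost function constants

def detection_cost(p_miss, p_fa):
    """Equation 5.4: the unnormalised detection cost C_Det."""
    return C_FA * p_fa * (1 - P_EVENT) + C_MISS * p_miss * P_EVENT

def normalised_cost(p_miss, p_fa):
    """Equation 5.5: C_Det divided by the cheaper of the trivial all-YES / all-NO systems."""
    return detection_cost(p_miss, p_fa) / min(C_MISS * P_EVENT, C_FA * (1 - P_EVENT))

# For example, a run with Pmiss = 0.3 and Pfa = 0.01 gives
# C_Det = 0.01 * 0.98 + 0.3 * 0.02 = 0.0158 and (C_Det)Norm = 0.0158 / 0.02 = 0.79.
```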
The TDT evaluation software also calculates the value of a topic weighted minimum normalised cost function for the system's DET curve, which represents the point on the curve where the optimal Pmiss and Pfa values are achieved.

22 Our experiments on the TDT2 collection used the TDT3eval_v2.1 evaluation software available at http://www.nist.gov/
5.5 TDT1 Pilot Study Experiments
A number of experiments were conducted on the TDT1 collection in order to
explore the effect on NED performance when lexical chains are used in conjunction
with a free text representation to represent a document. This involved comparing
the effectiveness of a number of different chain-based approaches to NED with a
traditional VSM approach to the problem. Details of these systems are described in
the following section. The results described in this section were published in
(Stokes et al., 2001a; 2001b; 2001c).
5.5.1 System Descriptions
Four distinct detection systems, TRAD, CHAIN, SYN and LexDetect, took part in
the following set of experiments. The main difference between these systems is that
TRAD, SYN and CHAIN use a single text representation of a document, while
LexDetect uses two distinct representations of document content.
The TRAD system, our benchmark system in these experiments, is a basic NED
system that expresses document content in terms of a free text representation and
computes detection on the syntactic similarity between documents and clusters
within the vector space model framework described in Section 4.1. Classification of
a new event occurs in a similar manner to that described in Section 5.3.3. Three
TRAD schemes23 are used, TRAD_30, TRAD_50, and TRAD_80, which differ only in the length of their document representations (i.e. varying the Dimensionality
parameter by selecting the n most frequently occurring terms). Fixing the
Dimensionality parameter is a common strategy in IR and filtering systems based
on the VSM, since the cosine measure can become distorted when calculating the
similarity between vectors of uneven length. Hence, we experimented with a
number of different dimensionalities ranging from 30 to full dimensionality (all the
words in the text) in order to determine the optimal value for the TRAD system on
the TDT1 corpus. We found that dimensionality 50 produced optimal TRAD
performance which corresponded with other NED results reported by (Allan et al.,
1998d). Another important parameter of the TRAD system is the Time_Window
parameter which exploits the news stream characteristic that stories closer together
on the input stream are more likely to discuss related topics than stories further
apart on the stream. Thus the Time_Window parameter ensures that only the t most
recently updated (or active) clusters are compared to the current document on the
input stream. A time window of 30 clusters was chosen as a suitable value for t and
is employed in the TRAD, CHAIN, SYN and LexDetect systems experiments
described below.
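As a rough illustration of how the Dimensionality and Time_Window parameters restrict the comparison, consider the sketch below. The data structures and helper names are hypothetical; term weighting and the cosine comparison itself are those of the vector space framework in Section 4.1.

```python
from collections import Counter

def build_vector(tokens, dimensionality=50):
    """Represent a document by its n most frequently occurring terms (term -> frequency)."""
    return dict(Counter(tokens).most_common(dimensionality))

def candidate_clusters(clusters, time_window=30):
    """Return only the t most recently updated (active) clusters for comparison."""
    recent_first = sorted(clusters, key=lambda c: c["last_updated"], reverse=True)
    return recent_first[:time_window]
```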
The design of our second system LexDetect has been described in detail in
Section 5.3. Unlike TRAD the dimensionality of LexDetect (80 words) remains
static throughout these experiments. Using our basic lexical chaining method, just under 72% of documents contained 30 or more chain words. We
therefore normalised the length of chain word representations by imposing a
dimensionality value of 30 on all LexDetect schemes. In theory, it is possible to
vary the length of the free text representation in our combined representation;
however, in these experiments all schemes contain free text representations of
length 50, with a combined document representation length of 80. The final system
parameters to be varied in these experiments are the weighting coefficients Kword
and Kchain word used in Equations 5.1 and 5.2 to control the importance of the
similarity evidence derived from the chain word and free text sub-vectors.
The design of our third and fourth systems, CHAIN and SYN, is similar to
TRAD in that they use a singular document representation during NED. However,
both these systems incorporate chain words into the document representation
strategy during the detection process. In the case of CHAIN a syntactic chain word
representation is used, in contrast to SYN which uses the WordNet synset numbers
to represent a document. The use of synsets instead of terms as a document
representation has been discussed in Sections 5.1 and 5.2, and is referred to in the
literature as conceptual indexing. By including this representation in our experiment
we can explore the effect of disambiguation performance on our NED task, the
results of which are discussed in more detail in the next section. As in the case of
LexDetect, the dimensionality of the chain representations used in SYN and CHAIN is also limited to 30 features.
5.5.2 New Event Detection Results
The objective of this experiment is to determine if LexDetect’s combined
representation approach can exceed the NED performance of our single
representation systems TRAD, SYN and CHAIN. Figure 5.2 is a Detection Error
Tradeoff (DET) graph showing the impact of our combined representation on
detection performance. As explained in Section 5.4.2, a DET graph illustrates the
trade-off between misses and false alarms, where points closer to the origin indicate
better overall performance. The points on a DET graph are plotted from the false
alarm and miss rates of each of the four systems using a range of reduction
coefficients R (from 0.1 – 0.9) for each of their 11 iterations. The average miss and
false alarm values of these iterations are then plotted on the DET graph.
As can be seen, the curve with the closest point to the origin belongs to the LexDetect system. The error bars (at 5% statistical significance) lead us to conclude that a composite document representation using chain words and free text words marginally outperforms a system containing either one of these representations alone, i.e. CHAIN or TRAD. This result is in agreement with two of the three hypotheses
set out in Section 5.2, namely that lexical chaining works well as a feature selection
method and that a data fusion experiment involving a combination of chain word
and free text representations outperforms a traditional keyword-based approach.
The third hypothesis, that chain words are better index terms than chain synsets,
also holds as the SYN system performs significantly worse than the CHAIN system.
This is also an important result as it clearly shows that there is no advantage in
using WordNet synsets in a conceptual indexing strategy for an NED classification
task. Since we have now established that SYN is an inferior representation, no
further experiments with this system are performed.
[Figure 5.2 plot: DET curves (% Misses on the x-axis, % False Alarms on the y-axis) for the LexDetect, LexDetect (No Weighting), TRAD, CHAIN and SYN systems.]
Figure 5.2: The effect on TDT1 NED performance when a combined document representation
is used.
System                      Optimal R    % Misses    % False Alarms
LexDetect                   0.30         19.0        30.66
LexDetect (No Weighting)    0.30         19.0        33.67
TRAD_50                     0.40         28.0        30.86
TRAD_80                     0.40         30.0        31.19
CHAIN                       0.15         43.0        27.44
SYN                         0.15         40.0        38.1
Table 5.2: Miss and False Alarm Rates of NED systems for the optimal value of the Reduction Coefficient R on the TDT1 corpus.
The graph in Figure 5.2 also makes reference to a version of the LexDetect system called LexDetect (No Weighting). In this version of the algorithm the weight coefficients defined in Equation 5.2 are assigned equal weight, i.e. Kword = Kchain word = 1. A range of other values, incremented in steps of 0.1 over 0.1 ≤ K ≤ 0.9 for both coefficients, was also tried; the settings Kchain word = 1 and Kword = 0.5, used in the LexDetect schema, were found to be optimal. This is an
interesting result, as similar experiments that used composite document representations to improve ranked retrieval performance only achieved optimal effectiveness when they allowed free text evidence to bias the retrieval process (McGill et al., 1979; Fox, 1983; Fox et al., 1988; Katzer et al., 1982). Table 5.2 summarises the optimal miss and false alarm rates achieved by each of the systems in
Figure 5.2.
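For completeness, one plausible reading of the weighted combination controlled by Kword and Kchain word is sketched below; the exact formulation is given by Equations 5.1 and 5.2 earlier in the chapter, and the normalisation used here is illustrative only.

```python
def combined_similarity(sim_free_text, sim_chain_words, k_word=0.5, k_chain_word=1.0):
    """Weighted combination of the free text and chain word sub-vector similarities.

    The optimal TDT1 setting reported above is k_chain_word = 1 and k_word = 0.5;
    LexDetect (No Weighting) corresponds to k_word = k_chain_word = 1.
    """
    return (k_word * sim_free_text + k_chain_word * sim_chain_words) / (k_word + k_chain_word)
```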
5.5.3 Related New Event Detection Experiments at UCD
The initial phase of the NED research (1999-2000) described in this thesis was carried out in conjunction with Hatch (2000). Separate implementations of the NED
system and the lexical chaining algorithm described in Section 5.3 were used to
pursue two different avenues of NED research24:
Using lexical chain words as distinct features in a VSM, i.e. the work in this
thesis.
Using lexical chains as features in a VSM (Hatch, 2000).
More specifically, Hatch’s work looked at determining document similarity at a
chain rather than a chain word level, where a document is a set of chain word
vectors (one vector for each chain) and the pairwise comparison of these chain
vectors is used to calculate the similarity between two document representations.
Figure 5.3 illustrates the pairwise comparison of chains in documents A and B,
where each chain is represented as a weighted vector of its chain words, and the
similarity value between chains can be measured using the cosine similarity metric.
Once these pairwise chain comparisons have been calculated only the maximum
similarity value for each document chain vector (with a cluster chain vector) is
retained and used to determine the overall similarity between a target document and
a document cluster, defined more formally as follows in Hatch (2000):
simmax(dcj, cluster) = max{ sim(dcj, cck) : 1 ≤ k ≤ m }        (5.8)

where simmax(dcj, cluster) is the maximum similarity between a document chain dcj and each of the m cluster chains cck, and

sim(document, cluster) = (1/n) Σ(j=1..n) simmax(dcj, cluster)        (5.9)

where sim(document, cluster) is the average maximum similarity assigned to each of the n document chains dcj in the previous step.

24 The C and Perl programming languages in a Unix environment were used to implement the work covered in this thesis, while Hatch's implementations were developed and run on a Java/Unix platform.
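A minimal sketch of Equations 5.8 and 5.9, assuming each chain is represented as a dictionary mapping chain words to weights (this is not Hatch's Java implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse chain word vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def chain_level_similarity(doc_chains, cluster_chains):
    """Equation 5.9: average, over document chains, of each chain's best cluster-chain match."""
    best_matches = [max(cosine(dc, cc) for cc in cluster_chains) for dc in doc_chains]  # Eq. 5.8
    return sum(best_matches) / len(best_matches)
```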
Figure 5.3: Example of cross chain comparison strategy. Red arrows indicate an overlap
between chains.
An interesting question arises from considering whole chains in the document/cluster similarity metric: once a document is to be added to a cluster, how should its chain representation be merged with the cluster centroid representation? Hatch identifies two possibilities and comments on how these might
affect recall and precision values:
Taking the UNION of the document and cluster chains as the centroid
representation will improve recall.
Taking the INTERSECTION of these two chain sets will improve precision.
Hatch chose the union of the two chain representations as a method of merging new
documents with the centroid representations, as it ‘broadens the event definition and
captures the evolution of the event'. The validity of this statement is obvious when one considers the following example, where the union of two chains clearly increases the number of terms in the centroid chain representation: if

C1 = {a, c, d}    C2 = {a, e}        (5.10)

then C1 ∪ C2 = {a, c, d, e} and C1 ∩ C2 = {a}.
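Treating chains as simple sets of terms, the two merging strategies can be written as follows (set-based pseudocode rather than the actual implementation):

```python
def merge_union(centroid_chain, document_chain):
    """Recall-oriented merge: {'a', 'c', 'd'} | {'a', 'e'} == {'a', 'c', 'd', 'e'}."""
    return centroid_chain | document_chain

def merge_intersection(centroid_chain, document_chain):
    """Precision-oriented merge: {'a', 'c', 'd'} & {'a', 'e'} == {'a'}."""
    return centroid_chain & document_chain
```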
Hatch also evaluated this lexical chain-based NED prototype using the TDT1
corpus and evaluation methodology. Figure 5.4 is a DET graph comparing the
performance of both SYN systems (WordNet concept indexing): one based on
comparison of chain terms between document and cluster representations (SYN
(Stokes, 2001a)), the other based on a comparison of lexical chains (SYN (Hatch,
2000)). It is evident from this graph that the former technique works best for
comparing chain representations in a simple VSM. Hence, as confirmed by the error
bars (at 5% statistical significance), we can conclude that no gain in performance
occurs when Hatch’s computationally more expensive prototype is used.
Consequently, in subsequent experiments involving the TDT2 corpus and
evaluation, described in Section 5.6, only the approach documented earlier in this
chapter is considered.
[Figure 5.4 plot: DET curves (% Misses vs. % False Alarms) for SYN (Stokes, 2001) and SYN (Hatch, 2000).]
Figure 5.4: DET Graph showing performance of the SYN system using two alternative lexical chain-based NED architectures.
Hatch also performed a number of data fusion experiments using a similar
LexDetect architecture to that described in Section 5.3.2 (Equations (5.2) and (5.3))
where the weighted sum of the term and chain representations was used to calculate
the similarity between a document and a cluster. However, Hatch only
experimented with a conceptual indexing based chain representation in her data
fusion experiment. Figure 5.5 shows a DET graph of the NED results for both the
chain-based VSM (Hatch, 2000), which is a combination of SYN and TRAD, and
the chain word-based VSM (Stokes, 2001a), which is a combination of CHAIN and
TRAD. It is evident from this graph that Stokes's LexDetect schema marginally outperforms Hatch's, which is in agreement with the SYN results in Figure 5.4. This result is also statistically significant at the 5% level for lower miss and false alarm rates, as indicated by the error bars on the LexDetect (Stokes, 2001a) curve.
[Figure 5.5 plot: DET curves (% Misses vs. % False Alarms) for TRAD, LexDetect (Stokes, 2001), LexDetect (Hatch, 2000), SYN (Stokes, 2001) and SYN (Hatch, 2000).]
Figure 5.5: DET Graph showing performance of two alternative lexical chain-based NED
architectures.
5.6 TDT2 Experiments
In this section, we repeat our NED experiments described in the previous section.
However, in this case the underlying retrieval model, used by the LexDetect system
to detect new events, is the UMass approach to NED. These experiments also differ
from those described in Section 5.5, since the TDT2 corpus and evaluation
methodology are used to evaluate system accuracy. In particular, the TDT2 cost function, (CDet)Norm (not available in the TDT1 evaluation), provides a single measure of system performance based on a combination of the miss and false alarm probabilities calculated for a given system run. In addition, the TDT2 corpus is over three times the size of the TDT1 corpus and contains 96 identified new events compared with only 25 events in the TDT pilot study evaluation, and hence supports a more comprehensive evaluation. The TDT2 corpus also provides a more realistic evaluation environment where NED systems are required to process error-prone ASR text, with limited capitalisation, some spelling errors and segmentation errors. The effect of this facet of the evaluation on NED performance will be
discussed in due course.
5.6.1 System Descriptions
In Chapter 4, we reviewed the preliminary UMass system proposed by Papka
(1999). His NED implementation used a single-pass clustering algorithm based on
the vector space model using the cosine similarity metric and an InQuery term
weighting scheme (see Section 4.3.1). This system participated in the pilot study
evaluation in 1997, and the TDT 1998 and TDT 1999 evaluation workshops.
However, the 1999 workshop at the Johns Hopkins University prompted a re-design
of Papka’s system. The participants found that optimal NED performance could be
achieved using a vector space model, k-NN clustering strategy, the cosine similarity
metric, an InQuery tf.idf metric and a feature vector containing all terms in the
document, i.e. full dimensionality (Allan et al., 2000c). Stopping and stemming
words was also found to have a positive effect on performance. Their system also
supports a number of language modelling approaches to the TDT tasks; however,
optimal performance was achieved using a vector space modelling approach. This
system participated in the TDT 2000 evaluation and was found to be the best
performing NED system in this evaluation run. The basic UMass system is made
up of a number of command-line switches which provide a flexible means of testing
various combinations of term weighting, clustering, and similarity strategies. Here
is a list of the strategies supported by the UMass system taken from (Allan et al.,
2000c):
Topic Models:
o k-NN clustering
o Agglomerative centroid based clustering
Similarity Functions:
o InQuery weighted sum (Equation 4.14)
o Vector cosine similarity (Equations 4.5, 4.6)
o Language modeling approach (Equation 4.3, 4.4)
o Kullback-Leibler Divergence (or relative entropy): This is an information-theoretic measure which calculates the divergence between two distributions, in this case the document distribution D and a topic model M. More formally, KL(D, M) = −Σi di log(mi / di), where di and mi are the relative frequencies of word i in D and M respectively (both smoothed appropriately). A short sketch of this measure is given after this list.
Term Weighting Schemes:
o Basic tf weighting
o Basic tf.idf weighting
o InQuery tf.idf weighting (Equations 4.11, 4.12, 4.13)
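The Kullback-Leibler measure listed above can be sketched as follows; the smoothing shown here is a simple floor value standing in for the UMass system's actual smoothing scheme.

```python
import math

def kl_divergence(doc_freqs, model_freqs, floor=1e-6):
    """KL(D, M) = -sum_i d_i * log(m_i / d_i), summed over the document vocabulary.

    doc_freqs and model_freqs map words to relative frequencies; `floor`
    keeps m_i non-zero and stands in for proper smoothing.
    """
    divergence = 0.0
    for word, d_i in doc_freqs.items():
        if d_i > 0.0:
            m_i = model_freqs.get(word, 0.0) + floor
            divergence -= d_i * math.log(m_i / d_i)
    return divergence
```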
As described in Section 5.3.3, the LexDetect system combines the similarity of two
distinct sub-vectors: one a free text document representation and the other a lexical
chain based-document representation. For the experiments described in this section
we combined these forms of evidence within the UMass NED framework, with the
hope of outperforming the basic UMass system as was achieved in our preliminary
TDT1 exploration using our own NED retrieval model.
5.6.2 New Event Detection Results
As explained in Section 5.3, the basic LexNews lexical chaining algorithm (Section
3.1) was used to generate and facilitate the feature selection method discussed in
Section 5.2.
Figures 5.6 and 5.7 contrast the performance of both the basic
LexNews and the enhanced LexNews chaining algorithm. As described in Chapter
3, both these algorithms look at repetition-based and lexicographical relationships
between words. However, the enhanced version also looks at compound nouns,
proper nouns and statistical word associations. In both these graphs the LexDetect
systems are marginally outperformed by the basic UMass system. Unfortunately,
there is no significant difference between the performance of the LexDetect system
using the basic chaining algorithm and the system using the enhanced algorithm.
Figure 5.6: DET graph showing performance of the LexDetect and CHAIN systems (using the
basic LexNews chaining algorithm), and the UMass system for the TDT2 New Event Detection
task.
Figure 5.7: DET graph showing performance of the LexDetect and CHAIN system (using the
enhanced LexNews chaining algorithm), and the UMass system for the TDT2 New Event
Detection task.
However, the CHAIN system in Figure 5.7 (minimum TDT cost 0.7869) does
exhibit a marked improvement over the CHAIN system in Figure 5.6 (minimum
TDT cost 0.9174) implying the features detected by the enhanced LexNews
chaining algorithm better encapsulate the essence of the news story25. This outcome
was largely expected since the exclusion of proper nouns in the original chain word
representation would have greatly affected its performance as a document classifier.
Also, the inclusion of statistically associated words is bound to bolster the
dimensionality of the resultant chain word document representation.
Table 5.3 shows a breakdown of the TDT2 results in terms of the broadcast news
(ABC, CNN, PRI, VOA) and newswire (NYT, APW) NED performance of each
system in Figures 5.6 and 5.7. As previously mentioned, the broadcast news portion
of the TDT2 corpus is a ‘noisier’ information source than its newswire equivalent
due to the presence of segmentation errors, spelling errors and most notably (for an
NLP system) a lack of capitalisation. The effectiveness of a number of preprocessing steps in the LexNews chaining algorithm, in particular the part-of-speech tagging and noun phrase identification steps, is greatly compromised, which has a knock-on effect on chaining accuracy. For example, consider the following
spelling mistake made by an ASR system:
‘earnest gold wrote the score for numerous productions including the film the
exorcist’
In this sentence the proper name ‘Ernest’ has been incorrectly identified as the
adjective ‘earnest’ by the ASR system. Due to a lack of capitalisation the phrase
‘earnest gold’ will be incorrectly identified as an adjective-noun phrase by the part-of-speech tagger and then incorrectly interpreted by the lexical chainer as a
reference to a ‘sincere precious metal’. This compounding of errors during the chain
formation process is evident when one compares the degradation in performance
between LexDetect and CHAIN systems on the newswire and broadcast news
portions of the TDT2 corpus. In the case of the basic UMass system, performance
degradation is 8.74% compared with 20.61% for the CHAIN (enhanced) system.
25
Unlike the TDT1 experiments full dimensionality was used in both the free text and lexical chain
word document representations, since this resulted in optimal NED performance on the TDT2
corpus.
More moderate degradations were experienced by the other systems, the most
important conclusion being that in each case the UMass system outperforms all
other systems on both the broadcast and newswire portions of the corpus and on the
entire TDT corpus. Hence, unlike the TDT1 experiments the combined document
representation in the LexDetect system could not outperform the baseline NED
system, in this case the UMass system.
Another difference between the TDT2 and TDT1 corpora that greatly affected
the performance of the LexDetect system in these experiments was the occurrence
of very short broadcast news stories. From Table 5.4, we can see that 15.5% of documents in the TDT2 collection have fewer than 50 words (roughly two sentences)
which means that in many instances the chainer fails to select any defining features
of the document resulting in a very low dimensionality document representation.
More specifically, the LexNews chaining algorithm has great difficulty analysing
the lexical cohesive structure of very short documents, as they tend to lack even
weakly cohesive ties like statistical word associations between words. By comparison, only 0.9% of the TDT1 collection falls into the ‘very short’ document category and only 3.5% into the ‘short’ category, compared to 19.8% in the TDT2 collection.
An interesting extension to this work would be to investigate the impact of
document length on the effectiveness of both the LexNews chaining algorithm and
on TDT performance in general.
System                  Broadcast    Newswire    % ∆       All News
UMass                   0.6508       0.5634      -8.74     0.6302
CHAIN (enhanced)        0.8569       0.6508      -20.61    0.7869
CHAIN (basic)           0.9546       0.8558      -9.88     0.9174
LexDetect (enhanced)    0.6688       0.5981      -7.07     0.6444
LexDetect (basic)       0.6918       0.6845      -0.73     0.6498
Table 5.3: Breakdown of TDT2 results into broadcast and newswire system performance. LexDetect (basic) and CHAIN (basic) results were shown in Figure 5.6, and the enhanced versions of these algorithms were shown in Figure 5.7. % ∆ is the percentage degradation in NED performance when broadcast news performance is compared with newswire performance.
Document Type    No. of Words    TDT1 % of Corpus    TDT2 % of Corpus
Very Short       <= 50           0.9                 15.5
Short            51 – 100        3.5                 19.8
Short-Medium     101 – 250       17.1                20.1
Medium-Long      251 – 500       41.5                18.6
Long             501 – 1000      30.4                17.3
Very Long        > 1000          6.6                 8.7
Table 5.4: Breakdown of document lengths in the TDT1 and TDT2 corpora.
5.7 Discussion
In this chapter we introduced our New Event Detection system LexDetect which
uses a hybrid model of document content where document to cluster similarity is
based on the overlap between the free text and chain word representations of the
document/cluster pair. Inadequacies of previous chaining attempts were discussed
in Section 5.2 and a novel approach to integrating lexical chains into a vector space
IR model was described in Section 5.3. The architecture of the LexDetect system is
unique in three respects compared to previous research efforts by Stairmand, Green,
Kazman et al. and Ellman. In what follows, we justify these design decisions based
on the results of our TDT1 pilot study evaluation described in Section 5.4.
Firstly, unlike previous chain-based indexing strategies the LexDetect system
uses the syntactic form of chain words as features rather than WordNet synsets
identified during the chaining process. These two indexing strategies were
implemented as the CHAIN and SYN systems, where the results of our
experiments showed that the CHAIN system using chain words significantly outperforms the SYN system using WordNet synsets.
Secondly, the LexDetect system uses lexical chains as a feature selection
method where terms that form strong cohesive relationships with other terms in
the text are used to represent crucial threads of information in a news story.
Thirdly, we recognise that lexical chain words are only partial evidence of
document similarity, so we combined this evidence with a traditional keyword-based representation of a document when seeking first stories in a news stream.
Justification of these last two design decisions was presented in the evaluation of the TRAD, CHAIN and LexDetect systems, where LexDetect, a combination of the TRAD and CHAIN systems, outperforms either one. These
initial experiments indicated that lexical chains provide some useful additional
information regarding the topic of a document that can be used to enhance a
traditional ‘bag-of-words’ approach to the problem.
Another important justification of LexDetect’s design was discussed in Section
5.5.3, where we reviewed work by Hatch (2000) who also looked at improving
NED using lexical chain information. In her work, Hatch proposed an alternative
lexical chain document representation scheme. In this implementation lexical chains
rather than chain words are used as features in the detection process. However, a
comparison with the implementation of the LexDetect system described in this
thesis showed that Hatch’s more computationally expensive implementation is
marginally outperformed by our chain word-based approach.
Since the work at University College Dublin is the first large scale attempt at
ascertaining the suitability of lexical chaining as a solution to a real-world IR
problem like new event detection, the results discussed in Section 5.5 were a
preliminary investigation into this question. The results discussed in Section 5.6
provided a more thorough evaluation of our hypothesis, where lexical chains were
integrated into the UMass NED system architecture. However, when the
experiments were repeated on the TDT2 corpus, similar improvements in system performance were not observed. We identified three reasons for this disparity
between the pilot study and the TDT2 evaluation results:
1. The UMass system is a superior baseline system to the one described in Section
5.3, due to its continual refinement as a result of its yearly participation in the
TDT workshop. Hence, it was difficult to improve upon its baseline TDT2 NED
performance.
2. The effectiveness of a number of key preprocessing steps in the LexNews
chaining algorithm was greatly reduced due to their dependence, like many
other NLP techniques, on capitalised text. This characteristic of the ASR
broadcast news transcripts affected both the accuracy of the resultant lexical
chains and their ability to extract pertinent features in the news stories.
3. Since 35.3% of documents in the TDT2 corpus consist of fewer than 100 words,
this made it difficult for the LexNews chaining algorithm to explore the lexical
cohesive structure of these texts due to their brevity, which in turn reduced the
algorithm’s effectiveness as a feature selection method.
These final two points are primarily responsible for LexDetect’s and the CHAIN
system’s inconsistent performance on the broadcast and newswire portions of the
TDT2 corpus, where NED performance dropped by 7.07% and 20.61% respectively
for broadcast news document classification. In Section 4.3.5, we referred briefly to
the work of Eichmann and Srinivasan (2002) who also experimented with a
document/cluster similarity measure based on the weighted sum of a number of
sub-vectors for each of the following named entities: persons, organisations, places
and events. From their participation in the TDT2000 workshop, Eichmann and
Srinivasan concluded that tracking performance was severely downgraded when
sparse or empty entity vectors occurred in similarity calculations. This is analogous
to our observation that LexDetect performance is poorer on the TDT2 corpus due to the presence of short documents with low cohesion, which result in sparse or empty chain word vectors.
Furthermore, these results are also in agreement with similar lexical chain-based
Event Tracking research also conducted at University College Dublin. Carthy
(2002) used a similar lexical chaining implementation and document cluster
comparison strategy to Hatch (see Section 5.5.3) in his LexTrack system, where
document-cluster similarity is determined based on the cross-comparison of chains
between a centroid and a document representation. However, Carthy calculates
the Overlap Coefficient (van Rijsbergen, 1979) between two chains rather than their
cosine similarity as Hatch did. More specifically the similarity between two chains
c1 and c2 is calculated as follows:
scorei = |c1 ∩ c2| / min(|c1|, |c2|)        (5.6)
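A minimal sketch of Equation 5.6, with chains again treated as sets of terms:

```python
def overlap_coefficient(chain_1, chain_2):
    """Equation 5.6: |c1 ∩ c2| / min(|c1|, |c2|).

    A short chain whose elements are all subsumed by a longer chain scores 1.0.
    """
    if not chain_1 or not chain_2:
        return 0.0
    return len(chain_1 & chain_2) / min(len(chain_1), len(chain_2))
```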
Using this measure ensures that if the elements of a short chain are all subsumed in
a longer chain a high matching value is still achieved. Once all the pair-wise
comparisons between the document and event descriptor chains have been
calculated then the overall similarity between the document and cluster is the sum
of all these comparisons. Hence, the target document is on-topic (i.e. sufficiently
similar to the event being tracked) if it exceeds a certain threshold. Like the
LexDetect system, the LexTrack system uses a composite document representation
consisting of keyword and lexical chain representations. However, overall the
LexTrack system could not outperform the three baseline systems that Carthy
developed for comparison, and he concluded that considering the temporal nature of
news stories26 during the event tracking process had a more positive impact on
tracking performance than the inclusion of lexical cohesion information.
26
Carthy extended a baseline tracking system with a time penalty factor that gradually increased the
similarity threshold as the distance between the target story and the event descriptor on the data
stream increased. This similarity threshold is the similarity value (between a document and a cluster)
that must be exceeded in order for the target document to be deemed on-topic.
Chapter 6
News Story Segmentation
Text segmentation can be defined as the automatic identification of boundaries
between distinct textual units (segments) in a textual document. The aim of early
segmentation research was to model the discourse structure of a text. Consequently,
segmentation research has focussed on the detection of fine-grained topic shifts at a
clausal, sentence or passage level (Hearst, 1997). More recently, with the
introduction of the TDT initiative (Allan et al., 1998a), segmentation research has
concentrated on the detection of coarse-grained topic shifts, in particular, the
identification of story boundaries in news feeds.
The aim of this chapter is to set the scene for Chapter 7 which explores another
TDT application of our lexical chaining algorithm: news story segmentation.
Segmentation literature spans nearly four decades of published research. In Section
6.1, we review some of these segmentation approaches which we have categorised
into two levels of segmentation granularity: fine-grained segmentation and coarse-grained segmentation. Since we are primarily interested in techniques that deal with
news story segmentation, Section 6.2 covers three of the most common approaches
to sub-topic or topic segmentation including methods based on: information
extraction, lexical cohesion analysis, and multi-source and statistical modelling
approaches.
6.1 Segmentation Granularity
In general the objective of a text segmentation algorithm is to divide a piece of text
into a distinct set of segments. Segmentation literature abounds with definitions of
what unit of text a segment should represent. These definitions have varied in form
and size from a shift in speaker focus (a span of speaker utterances) (Passonneau,
Litman, 1997) to a distinct topical unit like a news story (a set of multiple
paragraphs) (Allan et al., 1998a). The realisation of these diverse segment types
requires different levels of textual analysis or segmentation granularity, which must
be reflected in the design of the particular segmentation algorithm. In this section
we look at the relationship between text segmentation, and linear and hierarchical
discourse analysis. This is followed by some examples of fine and coarse-grained
segmentation, and a description of how segmentation analysis can be applied to a
variety of IR and NLP tasks.
6.1.1 Discourse Structure and Text Segmentation
Early text segmentation work stemmed from the desire to model the discourse
structure of a text. Discourse analysis, as explained in Section 2.1, examines the
interdependencies between utterances (words, phrases, clauses) and how these
dependencies contribute to the overall coherence of a text. Segmentation is one
method of exploring discourse structure. Many prominent theories of discourse state
that relationships between utterances are hierarchically structured (Grosz and Sidner, 1986; Mann and Thompson, 1987), where a hierarchical structure, according to Grosz and Sidner (1986), is a tree-like formation that represents multi-utterance segments and the relationships between them. In spite of this, the majority of
approaches, including our own, partition text in a linear fashion. In fact, even in
cases where researchers have based their segmentation approach on a hierarchical
theory of discourse, they still resort to evaluating their technique with respect to a
linear structure (Passonneau, Litman 1997; Yaari 1997). For example, Passonneau
and Litman (1997) explained that in order to have a comprehensive evaluation
methodology, they had to use a linear strategy since asking human judges to
consider hierarchical relationships between segments is an inordinately larger task
than asking them to segment text linearly.
6.1.2 Fine-grained Text Segmentation
Not only do segmentation techniques differ with regard to the type of discourse
analysis (linear or hierarchical) that they perform, but they also differ with respect
to the level of segmentation granularity they produce. Passonneau and Litman’s
(1997) work represents an example of fine-grained segmentation. Their technique
identifies discourse structure in speech transcripts by extracting linguistic features
such as referential noun phrases, cue phrases and speaker pauses from a training set.
The C4.5 algorithm (Quinlan, 1993) is then used to induce a suitable decision tree
that combines the segmentation evidence derived from these features. The decision
tree is then used to fragment dialogue into segments or spans of utterances, which
according to Passonneau and Litman form coherent units in the dialogue.
Segment Description: Speaker recommends movie.
Transcript: Well it’s a really great movie, really beautiful scenery. You should see it, I recommend it, I really do.

Segment Description: Introduces main character and where the movie takes place.
Transcript: The first part of it just sets up how Kevin Kostner’s [sic] character goes out West. He’s a soldier in the Civil War, and he was a hero so he can be posted where he wants, and he asks to go to the frontier, out to North Dakota ’cause he’s kind of romantic about the West.

Segment Description: Describes the fort and the countryside.
Transcript: Well, he gets this army post and it’s abandoned. He’s all alone in the wilderness, with lots of supplies but no people. It’s beautiful country, really wide open, hardly any trees, golden grass waving in the breeze, all like that.

Figure 6.1: Example of fine-grained segments detected by Passonneau and Litman’s segmentation technique. Note that the noun phrases ‘frontier’ and ‘North Dakota’ refer back to ‘the West’, and ‘wilderness’ refers to ‘country’ in the third segment.
Figure 6.1 is an extract of a speech transcript taken from Passonneau and
Litman’s evaluation corpus (1997), which illustrates the fine-grained nature of their
segments. Cue phrases (e.g. well, uh, finally, because, also) are highlighted in the
transcript in bold and referential noun phrases (including pronouns) are highlighted
in italics. This mark-up was hand-coded by the authors in order to provide the C4.5
machine learning algorithm with labelled training examples. Passonneau and
Litman define a segment as a unit of text that represents a speaker intention, i.e. a
specific idea or point that the speaker is trying to articulate to the listener.
One NLP task that can benefit from fine-grained segments like these is anaphor
resolution, i.e. the identification and resolution of referential relationships between
pronouns and noun phrases in a text. An experiment by Reynar (1998) showed that
by restricting the number of candidate antecedents (possible referents) to those
words that exist in the same segment as the pronoun, the efficiency of the resolution
algorithm can be greatly improved without compromising its effectiveness.
Passonneau and Litman (1993) also present a motivating example of how segmental
structure is essential for pronoun resolution. However, in this case no formal
evaluation of the effectiveness of the claim was ever reported. Other NLP
techniques that have benefited from fine-grained segmentation analysis include
speaker turn identification and dialogue generation.
6.1.3 Coarse-grained Text Segmentation
Coarser-grained segmentation breaks text into multi-sentence or multi-paragraph
sized chunks. These types of segments have been used to improve IR,
summarisation and text displaying tasks. In the case of IR applications, Hearst and
Plaunt (1993), Reynar (1998), and Mochizuki et al. (2000) have tried to determine
the usefulness of linguistically motivated segments in passage-level retrieval. In the
early nineties, there was a surge in research relating to passage-level retrieval: the
retrieval of smaller units of text rather than full documents in response to user
queries (Hearst and Plaunt, 1993; Salton et al., 1993; Callan, 1994; Moffat et al.,
1994; Mittendorf, Schauble, 1994; Wilkinson, 1994, Salton et al., 1996). This work
was motivated by the idea that long documents (e.g. expository texts) contain a
number of heterogeneous sub-topics that make their word frequency statistics
unrepresentative of any particular sub-topic in the document. Consequently, a long
document that contains a passage relevant to a particular user query will quite likely
not be retrieved, since the passage is hidden in a myriad of other textual information
included in the document.
The units used to represent blocks of text in passage-level retrieval have varied
in size from sentences and paragraphs to fixed windows of text and sub-topic
segments. Hearst and Plaunt (1993) represent passages as sub-topic segments which
consist of multi-paragraph blocks of text. They found that retrieval based on sub-topics improved retrieval performance. However, comparable performance gains
were also achieved when arbitrarily chosen fixed size segments were used as
passages. Reynar (1998) reported similar results which showed that his topic
segments were slightly outperformed by Kaszkiel and Zobel’s (1997; 2001)
overlapping passages. Kaszkiel and Zobel’s technique divides documents into fixed
length passages beginning at every word in the document. However, as Reynar
points out, although this method is effective, the size of the index greatly increases
with even a small increase in collection size, thus magnifying the space and time
requirements of the algorithm.
Mochizuki et al. (2000) reported some confusing results using lexical chain-based segments. In contrast to all other passage retrieval work, they found that
optimal retrieval performance occurred when the original documents rather than
fixed block passages were retrieved in response to a set of queries on a Japanese
text collection. In this experiment Mochizuki et al. compared a number of segmentation techniques: fixed-length segments, paragraph-based segments, and lexical chain-based
segments. However, their results did show that a method that combined the
passage-level retrieval results of keyword retrieval with lexical chain-derived
segment retrieval could outperform a method that used either one of these
techniques.
Hearst (1997) suggests text display as a more motivating example of how
segmentation information can be used to help users ‘hone in’ on the relevant
passages in a document without actually having to read the document in its entirety.
Her system TileBars offers users a means of viewing the distribution of their query
terms in each passage (or segment) that was deemed relevant to their specific query.
The interface also provides users with links leading directly to the positions in the
document that are most relevant to their query. Other uses of Hearst’s segmentation
algorithm TextTiling include restricting the context surrounding words in order to
improve the generation of lexical chains (Barzilay, 1997)27 and the gathering of co-occurrence statistics for a thesaurus (Mandala et al., 1999). Sub-topic segments
have also been used to improve text summarisation tasks (Mittal et al., 1999).
27 However, Barzilay’s work also showed that the effectiveness of lexical chain disambiguation actually improved when documents were divided into paragraphs rather than Hearst’s sub-topic segments.

In recent times coarse-grained segmentation has found what could be described as its niche IR application. As heterogeneous multimedia data, such as television and radio broadcast news streams, becomes more readily available, a new challenge
for IR research also arrives, where systems that were traditionally used for
processing demarcated text (containing title, section, paragraph, and story boundary
information) must now work on un-segmented streams of error-prone ASR
transcripts. Since the TDT initiative started in 1997, news story segmentation has
become the main focus of segmentation research (Allan et al., 1998a; Stolcke et al.,
1999; van Mulbregt et al., 1999; Beeferman et al., 1999; Eichmann et al., 1999;
Eichmann, Srinivasan, 2002; Dharanipragada et al., 1999, 2002; Greiff et al., 2000;
Mani et al., 1997; Stokes et al., 2002, 2004; Stokes, 2003; Yamron et al., 2002). In
this form of coarse-grained segmentation, segments are defined as coherent units of
text that pertain to distinct news stories in a news stream.
In Section 4.2.2, we defined the TDT tracking and detection tasks. One of the
prerequisites of these systems is good structural organisation of the incoming data
stream, where news story boundaries must be correctly identified in order to
maximise system performance. Developing robust segmentation strategies is also
important since manual segmentation of news transcripts is a very time consuming
process. Cieri et al. (2002) state that manual segmentation of the TDT collections
represented the largest portion of LDC annotation effort (compared with the time
spent on topic annotation and transcription). In the TDT pilot study, Allan et al.
(1998a) concluded that segmentation error rates between 10% and 20% were
adequate for TDT applications. However, this conclusion was only drawn from
event tracking experiments. In subsequent TDT research, Allan conceded that
although segmentation errors have little effect on the tracking task these errors do
have a more dramatic impact on the various detection tasks, i.e. New Event
Detection, Link Detection and Retrospective Detection (Allan, 2002b).
For the remainder of this chapter we will look at a variety of approaches that
have been used to tackle the problem of coarse-grained segmentation, in particular,
sub-topic identification and news story segmentation.
6.2 Sub-topic/News Story Segmentation Approaches
According to Manning (1998), text segmentation techniques can be roughly
separated into two distinct approaches, those that rely on lexical cohesion and those
that rely on statistical Information Extraction (IE) techniques such as cue phrase
extraction. In this section, we also look at two newer story segmentation methods:
one based on a Hidden Markov Modeling approach and the other on a combined
approach that uses multiple sources of segmentation evidence. However, we do not
limit this literature review to coverage of news story segmentation systems, since
many sub-topic approaches have also been successfully adapted to tackle story
segmentation. One of the key elements in the following section is our description of
lexical cohesion-based approaches (Section 6.2.2), since our own segmentation approach falls into this category. In addition, in Section 7.3, our system is
evaluated with respect to two other notable lexical cohesion approaches: the C99
(Choi, 2000) and TextTiling (Hearst, 1997) algorithms.
6.2.1 Information Extraction Approaches
IE approaches to segmentation are based on the existence of cue phrases or words
that contribute little to the overall message of a text, but are still important as they
help to indicate thematic shifts of focus in a text. There are two types of cues
identified in the literature: domain independent cues and domain specific cues
(Reynar 1998).
Domain Independent Cues
Domain independent cues are generally cue phrases that are applicable to many
genres, which include certain conjunctions, adverbs, and pronouns. Depending on
the level of segmentation granularity required, it is either the presence or absence of
these cues at potential boundary points that indicates a coarse or fine-grained topic
shift. For example, in Figure 6.1 taken from Passonneau and Litman (1993), the cue
word well is a good indicator that the intentional focus of the speaker has changed.
Other examples of this type of cue include also, therefore, yes, so, basically, finally,
and actually (further discussion can be found in Hirschberg and Litman (1993)).
Passonneau and Litman also use the occurrence of pronouns such as it, that, and he
to determine segment boundaries. However, in contrast to adverbial/conjunctive
cues, pronoun usage can imply either the presence or the absence of a new segment
depending on a number of different factors: the location of the clause that the
pronoun occurs in, and whether or not the pronoun provides a referential link to
another word in the current segment or to a proper noun in the immediately preceding clause.
Domain independent cues are also valuable indicators of cohesion in text when
coarse-grained topic shifts such as news story boundaries are required. More
specifically, if cue words such as conjunctions or pronouns can be used to indicate
sub-topic shifts, then this is also evidence of the continuation of the current coarse-grained topic. Figure 6.2 is an extract taken from a CNN broadcast illustrating this
point. The cue words and, so, finally and again link in each case the sentence in
which they occur to the previous sentence, thus giving the text a continuous rather
than a disjoint quality. Similarly the referential pronoun in Sentence 6 links it to
Sentence 5. Identifying these cues helps segmentation performance as it reduces the
number of possible segment boundary points between sentences 1 and 7. This type
of segmentation evidence is used to enhance our segmentation algorithm resulting
in a notable improvement in system performance (Section 7.4.2).
1. The French forces appeared reluctant to help.
2. So the Rwandan soldier jumped out of the jeep and into the second one.
3. The scene was repeated.
4. Again there was a struggle, with the French providing no help.
5. Finally, the Rwandan soldier realized the French troops would not intervene.
6. He jumped off the jeep and started running.
7. And an RPF soldier shouted “maliza yeye” (finish him).
Figure 6.2: Extract taken from CNN transcript which illustrates the role of domain
independent cue phrases in providing cohesion to text.
So far we have illustrated how effective domain independent cues are in the
segmentation process. However, although they can in general be applied across
genres, the list of cues defined for each test set must be fine-tuned a little in order to
either eliminate misleading cues or include missing ones. Of course one method of
achieving this is to manually hand code these lists. However, there is sufficient
evidence to suggest that the extraction and weighting of these cues is better left to a
machine learning IE technique, e.g. multiple regression analysis (Mochizuki et al.
1998), the C4.5 algorithm (Passonneau, Litman, 1997), exponential language
modeling (Beeferman et al., 1999) and maximum entropy modeling (Reynar, 1998).
These techniques have also been used to identify domain specific cues in text,
described in more detail in the following section.
Domain Specific Cues
For domain specific cues to work some explicit structure must be present in the text.
For example, Manning’s segmenter (1998) was required to identify boundaries
between real estate classified advertisements which, in general, contain the same
types of cue information, for example house price, location, acreage, and number
of bedrooms. Similarly, in news transcripts an inherent structure exists: an
introduction, followed by a series of news stories interspersed with commercial
breaks, and finally a summation of the main news stories covered. Some researchers
involved in the TDT initiative (Reynar, 1998; Beeferman et al., 1999;
Dharanipragada et al., 1999) have put this structure to good use by extracting cue
phrases in news transcripts that are reliable indicators of topic shifts in the dialogue
such as ‘Good Morning’, ‘stay with us’, ‘welcome back’ or ‘reporting from
PLACE’. Reynar (1998), who identified these phrases by hand from the HUB-4
broadcast news transcripts, divides these domain cues into a number of different
categories: ‘Greeting’, ‘Introductory’, ‘Pointer’, ‘Return from commercial’, and
‘Sign-off’ cues. Figure 6.3 illustrates how these domain cues reflect news
programme structure.
[Figure 6.3 content: a 0–30 minute timeline of a news programme (News Summary, News Stories, Commercial Break) annotated with example broadcast news cue phrases for the Greeting, Commercial, Introductory and Sign-off categories, e.g. ‘welcome’, ‘good evening’, ‘top stories this hour’, ‘we’ll be right back’, ‘welcome back’, ‘and we’re back’, ‘this just in’, ‘let’s begin’, ‘and finally’, ‘i’m PERSON’, ‘reporting from PLACE’, ‘live from PLACE’.]
Figure 6.3: A timeline diagram of a news programme and some domain cues.
One of the main problems, however, with these domain cues is that they are
genre specific conventions used only in news transcripts. Furthermore, often these
cues are news programme specific as well. For example, in European news
broadcasts, in contrast to their American counterparts, news programmes are never
‘brought to you by a PRODUCT NAME’. Newscaster styles also change across
news stations because some presenters favour certain catch phrases more than
others. Consequently, new lists of cues must be generated either manually or
automatically for each news sample. Hence, segmenters that rely heavily on these
types of cues tend to be highly sensitive to small changes in news programme
structure which can have a detrimental effect on segmentation performance. A more
measured approach to segmentation would be to use cue phrase information as
secondary evidence of a topic shift, and consider a domain independent technique
like lexical cohesion analysis as primary evidence of the existence of a story
boundary. In the following section we examine how lexical cohesion, as a textual
characteristic, can be successfully used to segment text into distinct topical units.
6.2.2 Lexical Cohesion Approaches
Research has shown that lexical cohesion is a useful device for detecting sub-topic
and topic shifts in texts. The central hypothesis of segmenters based on lexical
cohesion analysis is that portions of text that contain high numbers of semantically
related words (cohesively strong links) generally constitute a single unit or segment.
Consequently, areas of the text that exhibit very low levels of cohesion are said to
be representative of a topic/sub-topic shift in the discourse. Most approaches to
segmentation using lexical cohesion only examine patterns of lexical repetition in
the text and ignore the four other types of lexical cohesion (as discussed in Section
2.3). There are two methods of capturing these other forms of lexical cohesion; one
is to examine semantic association between words using a thesaurus, while the other
method finds associations based on co-occurrence statistics generated from an
auxiliary (same-domain) corpus.
Lexical Repetition
Segmenters that examine lexical repetition work on the notion that the repetition of
lexical items occurs more frequently in areas of text that are about the same topic.
In the case of a news programme, sharp bursts in proper noun and noun phrase
repetition will often mark the beginning of the next news report. In general, lexical
cohesion-based approaches to segmentation analyse these repetition bursts, where
graphical representations of the peaks and troughs in similarity between textual
units are used to determine segment boundaries. We will now look in detail at two
such repetition-based systems, since they participate in our evaluation methodology
described in Section 7.2.
The first of these segmentation systems, called TextTiling, was developed by
Hearst (Hearst, 1997). Hearst’s algorithm begins by artificially fragmenting text
into fixed blocks of pseudo-sentences (also of fixed length). The algorithm uses the
cosine similarity metric to measure cohesive strength between adjacent blocks in
the text, where words are weighted with respect to their frequency within the block.
Depth scores are then calculated for each block gap (segment boundary) based on
the similarity between a block and its neighbouring blocks in the text as follows:
1. Find the similarity at gap n, i.e. similarity between block n and block n+1.
2. Find the similarity between n and every block to the left of it until the
similarity decreases. Record the difference between the similarity at gap n
and the highest encountered similarity.
3. Repeat this procedure for block n+1 comparing it to every block on its right.
4. The depth score for this gap is the sum of the two differences calculated in
steps 2 and 3.
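To make this procedure concrete, the following Python sketch computes depth scores from a list of pre-computed block similarities. It is an illustrative reading of the four steps above, not Hearst's reference implementation, and the cut-off applied to the resulting scores (a function of their mean and standard deviation) is left to the caller.

```python
def depth_scores(similarities):
    """Depth score for each block gap, where similarities[n] holds the cosine
    similarity between block n and block n + 1."""
    scores = []
    for n, sim in enumerate(similarities):
        # Climb left while similarity keeps rising and record the highest value.
        left_peak, i = sim, n - 1
        while i >= 0 and similarities[i] >= left_peak:
            left_peak, i = similarities[i], i - 1
        # Do the same to the right of the gap.
        right_peak, j = sim, n + 1
        while j < len(similarities) and similarities[j] >= right_peak:
            right_peak, j = similarities[j], j + 1
        # The depth is the sum of the drops from both peaks down to this 'valley'.
        scores.append((left_peak - sim) + (right_peak - sim))
    return scores
```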
Figure 6.4 is a graphical representation of the similarity scores calculated for each
block boundary point. A depth score, as Reynar (1998) points out, is basically the
sum of the differences between the top of the ‘peak’ immediately to the left and
right of a ‘valley’. High values of these depth scores indicate topic boundary points
as they represent areas in the text that exhibit major drops in similarity. The jagged
horizontal line in Figure 6.4 represents the cut-off point above which all depth
scores are definite segment boundaries. This cut-off is a function of the average and
standard deviations of the depth scores for the text under analysis. Vertical lines are
the multi-paragraph boundaries chosen by the TextTiling algorithm, which are
slightly offset from the block gap numbers on the x-axis since block positions are
mapped back on to the original paragraphs in which they occurred in the text. Also
note that Hearst prevents the occurrence of very close adjacent boundaries, by
checking that there are at least 3 pseudo-sentences between boundaries. Hearst’s
technique is similar to work done by Youman (1991) on the generation of
Vocabulary Management Profiles (VMP). A VMP is effectively a plot of the
number of first-time uses of words in a fixed window as a function of word position
within the text. Similarly peaks and valleys in these plots are indications of
vocabulary shifts. However, Nomoto and Nitta (1994), who implemented and tested
Youman’s technique, concluded that it failed to consistently detect patterns of
vocabulary shift in text. Hearst (1997) implemented an improved version of the
algorithm, renaming it the Vocabulary Introduction Method. However, she showed
that this method could not outperform her TextTiling algorithm.
[Figure: line plot of neighbouring block similarity scores (y-axis, 0 to 14) against block gap numbers (x-axis, 1 to 41).]
Figure 6.4: Graph representing the similarity of neighbouring blocks determined by the
TextTiling algorithm for each possible boundary or block gap in the text.
The second system to take part in our evaluation is Choi’s segmenter C99 (Choi,
2000). This is a three-step algorithm that uses image-processing techniques to
interpret a graphical representation of the pair-wise similarity of each sentence in
the text as follows:
1. Generate a sentence pair similarity matrix using the cosine similarity
measure.
2. Replace each value in the similarity matrix Mi,j by its rank Ri,j, where Ri,j is
the proportion of neighbouring elements that have lower similarity values
than Mi,j. Choi explains that the purpose of this step is to limit the effect of
the sensitivity of the cosine metric when short text units are being
compared, i.e. the occurrence of a common word between two short
sentences in the text could cause a disproportionate increase in relative
similarity. By replacing each Mi,j by its Ri,j the similarity values between
individual units become irrelevant and only their relative ranking with
respect to their neighbours is considered during further processing.
3. Use a divisive clustering algorithm to determine the final topic boundaries.
This algorithm iteratively sub-divides the original document (one large
segment) into smaller segments that maximise the inter-sentence similarity
values within each segment until a sharp drop in similarity occurs
indicating that an optimal set of segments has been found.
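The ranking step (step 2) can be sketched as follows; the 11 x 11 mask size is an assumption made here for illustration, and the code is a sketch of the idea rather than Choi's own implementation.

```python
import numpy as np

def rank_matrix(sim, mask_size=11):
    """Replace each similarity value by the proportion of its neighbours, within a
    mask_size x mask_size region, that have a lower similarity value."""
    n = sim.shape[0]
    half = mask_size // 2
    ranked = np.zeros_like(sim, dtype=float)
    for i in range(n):
        for j in range(n):
            region = sim[max(0, i - half):min(n, i + half + 1),
                         max(0, j - half):min(n, j + half + 1)]
            neighbours = region.size - 1  # exclude the element itself
            ranked[i, j] = (region < sim[i, j]).sum() / neighbours if neighbours else 0.0
    return ranked
```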
Choi’s algorithm is based on another divisive clustering algorithm developed by
Reynar (1994). The main difference between these techniques is that Choi's
algorithm uses the cosine similarity metric and a ranking strategy, while Reynar’s
algorithm examines the repetition of words with respect to their position in the text
(rather than sentence similarities) illustrated using a dotplot diagram. Creating the
dotplot involves plotting points on a graph which correspond to word repetitions in
the document. As more and more word occurrences are added to the dotplot, square
regions (representing text areas containing a high number of word repetitions) begin
to emerge along the diagonal axis of the graph. A maximisation algorithm is then
used to maximise the density of these square regions, where sub-topic shifts are
located by determining the points at which the outside densities are minimised. This
segmentation strategy is similar to Hearst’s TextTiling algorithm in that both
methods determine boundaries based on a comparison of neighbouring blocks.
However, Reynar points out that his approach involves a global rather than a local
comparison strategy since each region is compared with all other regions.
Coupled with these divisive clustering (top-down) methods are a number of
lexical repetition approaches that detect boundaries using agglomerative clustering
techniques (bottom-up) (Yaari, 1997; Eichmann et al., 1999). However, regardless
of the clustering algorithm used, most repetition-based segmenters calculate
similarity between units in a text using the cosine similarity measure. The few
exceptions to this trend are Reynar (1994), Youman (1991) and Richmond et al.
(1997). In particular, Richmond et al. weight word significance using Katz’s (1996)
notion of the ‘burstiness’ of content words in text. Burstiness, according to Katz, is
an observable characteristic of important topic words, where multiple occurrences
of topic words tend to occur in close proximity to each other in a text. Richmond et
al. define a significance weight for each word in a text where words which observe
a ‘bursty’ distribution will be weighted higher than other words. The similarity
between textual units is then calculated with respect to the significance of words
using the following overlap metric:
\[ \mathrm{Correspondence} = \frac{1}{2}\left( \frac{|A'| - |A''|}{|A|} + \frac{|B'| - |B''|}{|B|} \right) \tag{6.1} \]
where |A| is the sum of all the significance weights assigned to each of the words
in textual unit A, |A'| is the sum of the weights of the words that A has in common
with B, and |A''| is the sum of the weights of the words unique to A with respect to
B. Similar definitions apply to |B|, |B'| and |B''|. Richmond et al. claim that
incorporating this measure of word significance into the segmentation process leads
to improved accuracy over Hearst’s TextTiling algorithm without sacrificing
language independence.
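A minimal sketch of the correspondence measure in Equation 6.1, assuming that each textual unit is supplied as a dictionary mapping its words to significance weights (the burstiness-based weighting scheme itself is not reproduced here):

```python
def correspondence(weights_a, weights_b):
    """Correspondence between textual units A and B (Equation 6.1), where each
    argument maps the words of a unit to their significance weights."""
    total_a, total_b = sum(weights_a.values()), sum(weights_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    shared = set(weights_a) & set(weights_b)
    a_shared = sum(weights_a[w] for w in shared)                        # |A'|
    a_unique = sum(v for w, v in weights_a.items() if w not in shared)  # |A''|
    b_shared = sum(weights_b[w] for w in shared)                        # |B'|
    b_unique = sum(v for w, v in weights_b.items() if w not in shared)  # |B''|
    return ((a_shared - a_unique) / total_a + (b_shared - b_unique) / total_b) / 2
```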
One of the main disadvantages of a lexical cohesion-based approach to
segmentation that only looks at repetition relationships between words is that
synonyms and other semantically related words are not considered, making some
areas of text appear less cohesive than they actually are. Figure 6.5 illustrates this
point using two sentences taken from a report on the SARS epidemic in Toronto. To
a repetition-based segmenter these sentences will appear unrelated, since they have
no terms in common. However, looking a little closer we see that warning is a
synonym of caution and Toronto is a city in Canada (hyponym relationship). In
sub-topic or intention-based segmentation (see Section 6.1.2), the segmenter would
be correct in classifying these two sentences as separate segments. However, if the
required level of granularity is the identification of distinct news stories then
considering the synonym and hyponym relationships between the two sentences is
essential. In the following two sub-sections, we examine in more detail how these
types of relationships have been identified using statistical word associations and
thesaural relations.
The World Health Organization today issued a SARS-related travel caution for
Toronto, saying that they believe the virus had not been effectively contained
there yet. This latest warning is expected to hurt an already struggling economy in
Canada's largest city, which accounts for about 20 percent of national gross
domestic product.
Figure 6.5: Extract of CNN report illustrating the role of lexical cohesion in determining
related pieces of text.
Statistical Word Association
In Section 2.3, we described statistical word associations as ‘intuitive’ word
relationships that are not represented in a standard thesaurus, since they cannot be
defined in terms of generalisation, specialisation or part-whole relationships.
However, as their name suggests, these types of lexical cohesive word relationships
can be automatically identified by gathering word co-occurrence statistics from a
large domain specific corpus. Since lexicographically related words (like the
relationship between ‘vehicle’ and ‘car’) are also commonly found in similar
contexts, it can be said that statistical associations implicitly consider these types
of lexical cohesive relationships as well.
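As a simple, hedged illustration of how such associations might be gathered (a generic windowed co-occurrence count, not the specific procedure used by any of the systems discussed in this section), consider:

```python
from collections import Counter

def cooccurrence_counts(sentences, window=10):
    """Count how often each pair of words co-occurs within `window` tokens of one
    another; `sentences` is an iterable of token lists from the auxiliary corpus."""
    counts = Counter()
    for tokens in sentences:
        for i, w1 in enumerate(tokens):
            for w2 in tokens[i + 1:i + 1 + window]:
                if w1 != w2:
                    counts[tuple(sorted((w1, w2)))] += 1
    return counts
```

In practice the raw counts would normally be converted into an association score, for example mutual information, before being used as evidence of a lexical cohesive relationship.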
Ponte and Croft’s segmenter (Ponte, Croft, 1998) uses a word co-occurrence
technique called Local Context Analysis (LCA) to determine the similarity between
adjacent sentences. These similarities are then used in the boundary detection phase
to find segments using a dynamic programming technique. LCA works by
expanding the context surrounding each sentence by finding other words and
phrases that occur frequently with these sentence words in an auxiliary corpus. LCA
was originally created as a query expansion method (Xu and Croft 1996), where
related words are added to a query in order to improve retrieval performance. The
advantage of this technique over other statistical approaches is that it calculates
sentence similarity based on the co-occurrence of multiple words, thus ensuring that
two terms in different sentences are related based on their intended sense usage in
the context of the sentence. The authors show that segmentation based on LCA is
particularly suited to texts containing extremely short segments (similar to the
example in Figure 6.5) which share very few terms due to their brevity. Their
evaluation compared the LCA-based segmenter to a similar segmenter that
examines word frequencies only, and found that the LCA method outperformed the
other. Further improvements were possible when co-occurrence statistics were
generated from a more recent auxiliary corpus that covered more up-to-date topics,
and hence shared more vocabulary with the evaluation corpus.
Another technique closely related to Ponte and Croft’s word expansion
technique is Kaufmann’s VecTiling system (Kaufmann, 2000), which augments the
basic TextTiling algorithm with a more sophisticated approach to determining block
similarity. However, instead of LCA, VecTile uses Schutze’s WordSpace model
(Schutze, 1997; 1998) to replace words by vectors containing information about the
types of contexts that they are most commonly found in. The WordSpace technique
calculates a co-occurrence vector containing the co-occurrence frequencies of 1000
content words with each term in a 20,500-word dictionary. Since these matrices tend
to be quite sparse (a number of co-occurrences will be zero), Kaufmann reduced the
matrix from 20,500 × 1000 to 20,500 × 100 using Singular Value Decomposition
(Golub and van Loan, 1989): an operation, Kaufmann explains, that relocates the
vectors into a lower-dimensional space to summarise the most important (i.e.
defining) parts whilst simultaneously filtering out any noise. So when finding the
similarity between two units of text, the VecTile algorithm produces one vector for
each unit by adding together the vectors of each of the words in the unit. This
calculation maps the unit and its contents onto a single position and direction in the
100 dimensional space. The remainder of the VecTile algorithm stays true to
Hearst’s original TextTiling algorithm, where the cosine similarity metric
determines similarity between vectors representing blocks of text which are in turn
used to find boundary points between sub-topic units in the text. Kaufmann used
Pearson’s correlation coefficient to evaluate the results of the two algorithms and
found that the VecTile algorithm outperformed its TextTiling counterpart. Two
approaches similar to this technique are Choi (2001) and Slaney and Ponceleon
(2001), who both used Latent Semantic Analysis (LSA) to determine the similarity
between textual units. LSA uses a truncated form of singular value decomposition
(Deerwester et al., 1990). Slaney and Ponceleon found that their technique was an
effective method of segmenting a broadcast news programme; however, no formal
evaluation was conducted. Choi’s paper (2001) on the other hand showed that an
LSA approach could outperform his C99 algorithm.
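The dimensionality reduction and block comparison steps described above can be sketched with a generic truncated SVD; the matrix shapes follow the figures quoted for WordSpace, but this is an illustrative sketch rather than Kaufmann's or Schutze's actual implementation.

```python
import numpy as np

def reduce_word_vectors(cooc_matrix, dims=100):
    """Project a (vocabulary x context-word) co-occurrence matrix onto `dims`
    dimensions with a truncated singular value decomposition."""
    u, s, _vt = np.linalg.svd(cooc_matrix, full_matrices=False)
    return u[:, :dims] * s[:dims]  # one dense vector per vocabulary word

def block_similarity(word_vectors, block_a, block_b):
    """Cosine similarity between two blocks of text, each given as a list of
    vocabulary indices and represented by the sum of its word vectors."""
    va = word_vectors[block_a].sum(axis=0)
    vb = word_vectors[block_b].sum(axis=0)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```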
Thesaural-based Word Association
There are relatively few segmentation techniques that consider thesaural-based
word association in the boundary detection phase, and most of those that do are
lexical chain-based approaches (Okumura and Honda, 1993; Stairmand, 1997; Min-Yen et al., 1998; Mochizuki et al. 1998, 2000). In the majority of cases thesaural word
relationships, like statistical co-occurrence information, are combined with
lexical repetition information rather than used as conclusive evidence of a topic
shift. There are both advantages and disadvantages to analysing lexicographical
rather than statistical relationships between words. Lexicographical relationships
are hand-coded, domain independent relationships so they cover a much wider
range of word relationships than lists of statistically derived relationships. For
example, a very obvious holonym relationship exists between ‘fish’ and ‘fin’ in a
story on the ‘Asian shark-fin trade’. However, finding a significant number of co-occurrences of these words in a news corpus in order to establish a statistical
relationship between them is difficult as these terms occur infrequently compared to
terms such as ‘violence’ or ‘death’ in this domain. On the other hand, there are
many domain specific relationships between words that are not captured by
thesaural relationships including many compound nouns such as ‘cocaine addiction’
or ‘dirty bomb’ and related nouns such as ‘corruption’ and ‘money’ or ‘NASA’ and
‘Mars’.
One example of published work that uses thesaural word relationships outside
the realms of lexical chain-based segmentation is a technique by Jobbins and Evett
(1998). Their algorithm is similar to TextTiling in that it looks for areas of low
cohesion in text by comparing fixed-size windows of neighbouring text. However,
they tested a number of combinations of semantic similarity using lexical repetition,
collocations, and thesaural relationships from Roget’s Thesaurus. On a very small
test set of 42 pairs of concatenated topical articles, they found that a combination of
word repetitions and collocation information performed best. In a similar
experiment where sub-topic shifts were identified by a number of judges, they
found that a combination of word repetition and thesaural relationships worked
best.
Lexical chain-based approaches to text segmentation determine segment
boundaries by analysing repetition as well as other forms of lexical cohesion
derived from a thesaurus such as WordNet. Like other lexical cohesion approaches
to segmentation, a boundary is defined as an area of low similarity between
neighbouring blocks in a text. More specifically, in the case of lexical chaining
based-segmentation, a boundary is defined as a point in the text where a number of
lexical chains end and a number of new chains begin. This corresponds with the
idea that each chain represents a sub-topical element of the text, and if a number of
these chains terminate at the same point in the text where a number of new sub-topics or chains begin, then we can say that we have detected a topic shift. This
boundary detection process is described in more detail in Section 7.1 which
documents our own approach to the problem.
There have been three previous attempts to tackle text segmentation using lexical
chains. The first by Okumura and Honda (1994) involved an evaluation based on
five Japanese texts, the second by Stairmand (1997) used twelve general interest
magazine articles and the third by Min-Yen et al. (1998) used fifteen Wall Street
Journal and five Economist articles. The evaluation methodology described in
Section 7.2 uses a substantially larger data set consisting of CNN broadcast news
and Reuters newswire documents. Broadcast news story segmentation represents a
previously unexplored evaluation domain for lexical chain-based segmentation, as
Okumura and Honda’s approach looked to identify sub-topics in newspaper articles,
while Stairmand’s and Min-Yen et al.’s experiments centred around the identification
of concatenated newspaper articles. In each of these experiments lexical chains
were found to be an adequate means of segmenting text into (sub)-topical units.
6.2.3 Multi-Source Statistical Modelling Approaches
So far in this chapter, we have looked at different pieces of textual evidence that
can be used to break text up into coherent segments. Although some segmenters use
either IE or lexical cohesion techniques, many researchers have found that a
combination of evidence works best. For example, in Reynar’s (1998) approach to
news story segmentation he showed that significant gains could be achieved by
combining cue information with other feature information such as named entities,
character n-grams (sequences of word forms of length n), and lexical cohesion
analysis. On the other hand, Eichmann et al. (1999) combined a tf.idf measure of
similarity with pausal information in the news audio stream (i.e. speaker pause
duration between textual units where a longer pause is evidence of a story
boundary). Combination approaches such as these work by learning the best
indicators of segment boundaries from an annotated corpus and then combining
these diverse sources of evidence in a theoretically sound framework, such as a
feature-based language modeling approach (Beeferman et al., 1999), a cue based
maximum entropy model (Reynar, 1998) or a decision tree-based probabilistic
model (Dharanipragada et al., 1999).
Another important statistical approach that has been successfully used for news
story segmentation is Hidden Markov Modelling (HMM): a method more
commonly used in speech recognition applications (Yamron et al., 1998; van
Mulbregt et al., 1999; Blei and Moreno 2001; Greiff et al., 2000). Each state in the
HMM is representative of a topic, so that given a word sequence the HMM assigns
each word a topic, thus producing the maximum-probability topic sequence.
Finding topic boundaries is then equivalent to finding topic transitions; in other
words, finding where adjacent word topic-labels differ. One of the main
disadvantages of building statistical models to solve segmentation is that they have
to be trained and fine-tuned on domain specific data. For example, in experiments
by van Mulbregt et al. (1998) on the TDT2 collection 48,000 stories (15 million
words) were used to train their HMM. Similarly, Beeferman et al. (1999) trained
their language modelling approach on a 2 million word subset of the TDT1
broadcast news collection. However, Utiyama and Isahara (2001) have proposed a
domain-independent statistical model for text segmentation, where no training data
is needed since word statistics are estimated from the given text. Although their
approach was not compared to a trained version of their model, their results did
compare favourably with Choi’s C99 (2000) word repetition-based approach.
6.3 Discussion
In this chapter we have examined a number of text segmentation approaches that
have been used for a variety of purposes depending on the granularity of the
segments required. For example, applications of fine-grained text segments include
intention-based discourse analysis, anaphoric resolution and language generation.
Coarse-grained text segments have been used in systems performing tasks such as
passage-level retrieval, text summarisation, and news story segmentation. In the
context of this thesis the goal of a news story segmentation system is to
automatically detect the boundaries between news stories in a television news
broadcast. Techniques that combine lexical cohesion analysis with domain specific
cues have proven to be an effective means of segmenting broadcast news streams.
In Chapter 7, we compare the performance of a number of lexical cohesion
based-approaches to this problem. Two of these techniques, TextTiling (Hearst,
1994; 1997) and C99 (Choi, 2000), analyse news transcripts by examining patterns
of lexical repetition in the text. Our approach, the SeLeCT system, also analyses
other forms of cohesion, namely statistical and lexicographical word associations
using the LexNews chaining algorithm presented in Chapter 3. Hence, one of the
goals of the experiments described in the next chapter is to determine if an analysis
of these additional lexical cohesion relationships can enhance segmentation
performance.
Chapter 7
Lexical Chain-based News Story Segmentation
In Chapter 6, we explained that un-segmented streams of broadcast news
present a challenging real-world application for text segmentation approaches, since
the success of other tasks such as Topic Tracking or New Event Detection depends
heavily on the correct identification of boundaries between news stories. In this
chapter we evaluate the performance of our segmentation system SeLeCT with
respect to two well-known lexical cohesion-based segmenters: TextTiling and C99.
Using the Pk and WindowDiff evaluation metrics we show that SeLeCT outperforms
both systems on spoken news transcripts (CNN), while the C99 algorithm performs
best on the written newswire collection (Reuters). We also examine the differences
between spoken and written news styles and how these differences affect
segmentation accuracy. The work described in this chapter was published in (Stokes
et al., 2002; 2004a) and (Stokes, 2003).
7.1 SeLeCT: Segmentation using Lexical Chaining
In this section we present our topic segmenter, SeLeCT (Segmentation using
Lexical Chaining on Text). This system takes a concatenated stream of text and
returns segments consisting of single news reports. The system consists of two
components: the LexNews component, made up of a tokeniser for text preprocessing
and the lexical chainer, and the boundary detector component that uses these chains
to determine news story boundaries. The LexNews component was described in detail
in Section 3.2. Figure 7.1 illustrates the general architecture of the system, where a
broadcast news programme is input and a segmented stream of news stories is
output.
[Figure: a broadcast news programme is input to the LexNews component (Tokeniser followed by Lexical Chainer); the resulting lexical chains are passed to the Boundary Detector (Boundary Strength Scorer followed by Error Reduction Filter), which outputs the news story segments.]
Figure 7.1: SeLeCT news story segmentation system architecture.
Unlike other lexical chain-based approaches to segmentation, the SeLeCT
system uses a broader notion of lexical cohesion to analyse the text for topic shifts.
More specifically, the LexNews chaining component examines repetition,
synonymy, antonymy, generalisation/specialisation relationships, part-whole/whole-part relationships (provided by WordNet), and statistical word associations. The
tokenisation process gathers candidate terms (proper noun and noun phrases) for the
chain generation process. In Section 7.3, we examine the effect of these LexNews
enhancements with respect to segmentation performance using the evaluation
methodology described in Section 7.2, but first we will look at how chains are used
to detect news story boundaries in the SeLeCT system architecture.
7.1.1 The Boundary Detector
As already stated, the LexNews chaining algorithm is used to generate lexical
chains; however, due to the temporal nature of news streams, stories related to
important breaking-news topics will tend to occur in close proximity in time. If
unlimited distance were allowed between word repetitions then some chains would
span the entire text if two stories discussing the same topic were situated at the
beginning and end of a news programme. Consequently, we impose a maximum
distance of m words between candidate terms that exhibit an extra strong
relationship (i.e. a repetition-based relationship) in the chaining process. However,
the distance restrictions set out in Section 3.3 (i.e. a maximum of 130 words for
strong relationships and 60 words for medium-strength relationships) are still
adhered to in the experiments described in this chapter.
Once lexical chains have been generated, the final step in the segmentation
process is to partition the text into its individual news stories based on the patterns
of lexical cohesion identified by the chains. Our boundary detection algorithm is a
variation on one devised by Okumura and Honda (1994), and is based on the
following observation paraphrased from Morris and Hirst’s seminal paper on lexical
chaining (1991):
Since lexical chain spans (i.e. start and end points) represent semantically related
units in a text, a high concentration of chain-begin and end points between two
adjacent textual units is a good indication of a boundary point between two distinct
news stories.
We define the boundary strength, w(n, n+1), between each pair of adjacent textual
units in our test set as the sum of the number of lexical chains whose span ends at
paragraph n and the number of chains whose span begins at paragraph n+1. When all boundary strengths between
adjacent paragraphs have been calculated, we then take the mean of all the non-zero
cohesive strength scores. This mean value plus a constant x then acts as the minimum
allowable boundary strength score (or threshold) that must be exceeded if the end of
textual unit n is to be classified as the boundary point between two news stories28. To
illustrate how boundary strengths based on lexical cohesion are calculated, consider
the following piece of text in Figure 7.2 containing one topic shift (all nouns are
highlighted), accompanied by the lexical chains derived from this text fragment
where the chain format is:
{word1(frequency)….wordn(frequency) | Sentence no. chain start, Sentence no. chain end}
CHAINS
{hearing(1), testimony(1) | 1, 1}
{tomorrow(1), night(1), holiday(1), weekend(1), time(1) | 1, 3}
{O.J. Simpson(2) | 1, 1}
{airport(2) | 1, 1}
{president(1), organisation(1) | 2, 2}
{checkpoints(2) | 2, 3}
{murders(1), fatalities(1) | 1, 3}

TEXT
[1] Coming up tomorrow when the hearing resumes, we hear testimony from the limousine driver that brought O.J. Simpson to the airport- who brought O.J. Simpson to the airport June 12th, the night of the murders. [2] The president of Mothers Against Drunk Driving discusses her organisation's support of sobriety checkpoints over the holiday weekend. [3] She hopes checkpoints will be used all the time to limit the number of fatalities on the road.

Figure 7.2: Sample lexical chains generated from concatenated news stories.
28. Optimal values for the constant x (used in the calculation of the boundary strength threshold) were found to be x = 1 in the case of the Reuters collection and x = 2 for the CNN collection. The results of these experiments are discussed in Section 7.3.
[Figure: chain span schema across sentences 1 to 3 for the chains {hearing, testimony | 1, 1}, {tomorrow, night, holiday, weekend, time | 1, 3}, {O.J. Simpson | 1, 1}, {airport | 1, 1}, {president, organisation | 2, 2}, {checkpoints | 2, 3} and {murders, fatalities | 1, 3}, with the boundary point marked at the end of sentence 1.]
Figure 7.3: Chain span schema with boundary point detected at end of sentence 1. w(n, n+1)
values for each of these points are w(1, 2) = (3+2) = 5 and w(2, 3) = (1+0) = 1.
Figure 7.3 illustrates the start and end points of the above chain spans and the
position of the two distinct boundary points in the text between sentences 1 and 2
and sentences 2 and 3. No boundary strength score is calculated for the boundary at
the end of sentence 3 since this is the last sentence in the text, and so by default
must be the end of a particular story. As previously stated, a boundary strength
score is the sum of the number of chain-end points and the number of chain-begin
points at a particular boundary. Therefore, w(1,2) has a higher score than w(2,3) (5
versus 1 respectively) and the algorithm correctly labels the boundary between
sentence 1 and 2 as the end of the O.J. Simpson story in the broadcast since this
score exceeds the boundary strength threshold for this piece of text, i.e.
w(1,2) > (((5 + 1) /2) + 1) where the threshold is the mean of these two scores plus
the constant x = 1.
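A minimal sketch of this boundary-strength calculation, assuming each lexical chain is reduced to the pair of textual-unit numbers at which its span starts and ends:

```python
def boundary_strengths(chain_spans, num_units):
    """Boundary strength w(n, n+1) for every gap between adjacent textual units,
    with units numbered from 1: the number of chains whose span ends at unit n
    plus the number of chains whose span begins at unit n + 1."""
    scores = {}
    for n in range(1, num_units):
        ends = sum(1 for start, end in chain_spans if end == n)
        begins = sum(1 for start, end in chain_spans if start == n + 1)
        scores[(n, n + 1)] = ends + begins
    return scores

def boundary_threshold(scores, x=1):
    """Minimum allowable boundary strength: mean of the non-zero scores plus x."""
    non_zero = [s for s in scores if s > 0]
    return sum(non_zero) / len(non_zero) + x if non_zero else float("inf")

# Chain spans from Figure 7.3 as (start sentence, end sentence) pairs:
spans = [(1, 1), (1, 3), (1, 1), (1, 1), (2, 2), (2, 3), (1, 3)]
scores = boundary_strengths(spans, num_units=3)   # {(1, 2): 5, (2, 3): 1}
threshold = boundary_threshold(scores.values())   # ((5 + 1) / 2) + 1 = 4.0
```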
This is a very simple example of segmentation using lexical chains; in reality,
however, news stories tend to be much longer than this (around 500 words) and news
broadcasts consist of many more than two stories concatenated together.
Consequently, the segmentation decision process gets increasingly more difficult,
for the following reasons:
1. Boundary scores tend to increase within a narrow region of text surrounding a
true boundary position rather than directly at that point in the text. This is a
natural consequence of the fact that not all sub-topics in a story will end in the
last line or paragraph of the story. Hence, lexical chain spans will tend to end
and begin in the general vicinity of a topic shift. This results in a cluster of high
scoring adjacent boundary points labelled in Figure 7.4 as regions A and C.
2. In news broadcasts it is common to follow a story with a related news report, so
quite often this means that lexical chain spans will stretch across story
boundaries since they share some common or related vocabulary. For example,
in Figure 7.3 we saw that the relationship between the words ‘murders’
(sentence 1) and ‘fatalities’ (sentence 3) was captured by a specific lexical chain
that spanned across the ‘O.J. Simpson’ and ‘Drink Driving’ story lines. This
results in a solitary boundary point that is very close to a cluster of adjacent
boundary points. This is labelled in Figure 7.4 as region B.
[Figure: boundary strength scores for a sequence of adjacent textual units (non-zero scores of 6, 4, 5, 5, 5 and 3 among zeros), with three regions of interest marked A, B and C.]
Figure 7.4: Diagram showing characteristics of chain-based segmentation. All numbers greater
than zero are possible boundary positions, while zero scores represent no story boundary point
between these two textual units. Only the boundaries ringed in red are retained after the
results are run through the error-reduction filter.
Both of these characteristics of chain-based segmentation add noise to the detection
process. However, their effect on segmentation performance can be lessened by
using an error reduction filter. Our error reduction filter, the final element of the
boundary detection process, examines all boundary detection scores that exceed the
required threshold and searches for system detected boundary points that are
separated by less than d number of textual units from a higher scoring boundary,
where d is too small to be a ‘reasonable’ story length. This filter has the effect of
smoothing out local maxima in the boundary score distribution, thus increasing
segmentation precision. This means that for regions A and C, which represent
clusters of adjacent boundary points, only the boundary with the highest score in the
cluster is retained as the true story boundary. Therefore, when d = 5, the boundary
which scores 6 is retained in region A while in region C both points have the same
score so in this case we consider the last point in region C to be the correct boundary
position. Finally, the story boundary in region B, a solitary boundary point, is also
eliminated because it is situated too close to the boundary points in region C and it has
a lower score than either of those boundaries.
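One simple greedy realisation of this filter is sketched below; it assumes the thresholded boundaries are supplied as (position, score) pairs and, as in region C above, resolves ties in favour of the later position.

```python
def error_reduction_filter(candidates, d=5):
    """Keep only the strongest of any boundaries lying within d textual units of
    each other; `candidates` are (position, score) pairs that already exceed the
    boundary strength threshold."""
    retained = []
    for pos, score in sorted(candidates):
        if retained and pos - retained[-1][0] < d:
            # Too close to the previously retained boundary: keep the stronger one
            # (ties go to the later position).
            if score >= retained[-1][1]:
                retained[-1] = (pos, score)
        else:
            retained.append((pos, score))
    return [pos for pos, _ in retained]
```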
At the beginning of this section we defined the boundary strength w(n, n+1)
between each pair of adjacent textual units in our test set as the sum of the number
of lexical chains whose span ends at paragraph n and the number of chains that
begin their span at paragraph n+1. Our decision to choose the summation of chain-begin and end points over the product of these two numbers is based on the
observation that a multiplication-driven score would eliminate all potential high-scoring boundary points that have either zero chain-ends or zero chain-begins. For
example, if a boundary point has five chain-ends and no chain-begins then its
strength is zero and so is eliminated from the set of potential boundaries given to
the error filter. However, high-scoring end or begin scores are still good indicators
of a topic shift, so we chose a summation driven boundary strength score.
Other scoring approaches were also considered such as the weighted sum of the
number of chain-end and begin points. However, these combinations reduced
system performance, leading us to conclude that the position of chain-end points as
evidence of topic shifts in a text is just as important as the position of the chain-begin points in a text.
7.2 Evaluation Methodology
In this section we give details of the evaluation metrics used to determine news
story segmentation performance in Sections 7.3 and 7.4. In order to determine the
effect of different language modes on segmentation performance, these experiments
were run on two test collections: one containing CNN broadcast news transcripts
and another containing Reuters newswire articles. Both types of news document
were taken from the TDT1 pilot study corpus (Allan et al., 1998a), details of which
can be found in Section 5.4.1.
7.2.1 News Segmentation Test Collections
For most test collections used as input to segmentation algorithms, a lot of time and
effort is spent gathering human annotations, i.e. human-judged sub-topic shifts. The
difficulty with these annotations lies in determining their reliability since human
judges are notoriously inconsistent in their agreement on the beginning and end
points of these fine-grained boundaries (Passonneau, Litman, 1993). A different
approach to segmentation evaluation is available to us due to the nature of the
segments that we wish to detect. By concatenating distinct stories from a specific
news source and using this as our test set, we eliminate subjectivity from our
boundary judgments. Therefore a boundary can now be explicitly defined as the
joining point between two news stories; in contrast with other test collections, there
is no need for a set of judges to make any subjective decisions on what constitutes a
segment in the collection. In Sections 7.3 and 7.4 we report segmentation results
gathered from two test collections each consisting of 1000 news stories randomly
selected from the TDT1 corpus.
The first test set contains 1000 news stories extracted from CNN news
programme transcripts. These stories were reorganised into 40 files each containing
25 stories. This procedure was repeated on the Reuters test set, which also consists
of 1000 written articles. Consequently, all experimental results in Sections 7.3 and
7.4 are averaged scores calculated over each of the 40 samples. In a previously
reported set of segmentation results (Stokes, Carthy, Smeaton 2002), experiments
involving SeLeCT and TextTiling were run on a single file of 1000 CNN stories.
However, splitting the corpus was necessary for the experiments reported here and
in (Stokes, 2003) and (Stokes et al., 2004a), because Choi’s C99 program was
implemented to handle only small amounts of input data. These earlier results
(Stokes et al., 2002) will also be discussed in the following section.
7.2.2 Evaluation Metrics
There has been much debate in the segmentation literature regarding appropriate
evaluation metrics for estimating segmentation accuracy. Earlier experiments
favoured an IR style evaluation that measures performance in terms of recall and
precision, which we define as follows:
Recall: The number of correctly detected story boundaries divided by the
number of actual news story boundaries in the test set.
Precision: The number of correctly detected story boundaries divided by the
total number of boundaries returned by the system.
However, unlike retrieval tasks where documents are classified as either relevant or
non-relevant, the notion of segmentation accuracy is a fuzzier concept. For
example, if a system suggests a boundary point that is one sentence away from the
true story-end point it is unfair to penalise this system as heavily as a system that
has missed the same boundary by 10 sentences. In other words, recall, precision and
their harmonic mean, the F1 measure, all fail to take into account near-boundary
misses. Consequently, these metrics are insufficiently sensitive when trying to find
system parameters that yield optimal system performance (Beeferman et al., 1999).
Other researchers (Reynar, 1998; Ponte, Croft 1998; Stokes et al. 2002) have tried
to remedy this problem by measuring recall and precision values at varying margins
of error. More specifically, a system boundary is considered correct if it exists
within a certain window of allowable error. So a margin of error of +/- n means that
if the system identifies a boundary n paragraphs before or n paragraphs after the
correct boundary point then this end point is still counted as correct. This evaluation
strategy is illustrated in Figure 7.5, where vertical lines represent boundaries and a
red line represents the correct boundary while grey lines represent the range of
system boundaries surrounding this point that are considered correct within a
margin of error of +/-3 sentences.
[Figure: a row of vertical lines representing boundary positions; the central red line is the correct boundary, the grey lines within +/-3 sentences of it fall inside the allowable margin of error, and the black lines on either side fall outside that range.]
Figure 7.5: Diagram illustrating allowable margin of error, where the red vertical line
represents the correct boundary between 2 stories, the grey lines represent boundaries that lie
within an allowable margin of error and so would still be considered correct if a segmentation
system returned them, and finally black lines are incorrect boundary positions.
The only stipulation when calculating recall and precision with respect to this
margin of error is that each boundary may only be counted once as a correct
boundary. Double counting becomes an issue when the value of n is high and, if left unchecked, has the effect of
exaggerating improvements in system performance as n increases. This is the first
of four metrics used in our evaluation which we define more formally as follows:
\[ f_{error} = \begin{cases} 1 & \text{if } |ref - hyp| \le n \\ 0 & \text{otherwise} \end{cases} \tag{7.1} \]
ferror is an error function where n is the allowable distance in units between the
actual boundary ref and the system or hypothesised boundary hyp.
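A hedged sketch of this margin-of-error matching, in which each reference boundary may be matched at most once:

```python
def recall_precision(ref_bounds, hyp_bounds, n=0):
    """Recall and precision with an allowable margin of error of +/- n units.
    Each reference boundary may be counted as correct at most once."""
    unmatched = list(ref_bounds)
    correct = 0
    for hyp in hyp_bounds:
        # Find an as-yet-unmatched reference boundary within n units of hyp.
        match = next((ref for ref in unmatched if abs(ref - hyp) <= n), None)
        if match is not None:
            unmatched.remove(match)
            correct += 1
    recall = correct / len(ref_bounds) if ref_bounds else 0.0
    precision = correct / len(hyp_bounds) if hyp_bounds else 0.0
    return recall, precision
```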
Since the arrival of the TDT initiative, Beeferman et al.’s metric, which tried to
address the inadequacies of recall and precision, has become the standard for
segmentation evaluations. They proposed a probabilistic evaluation metric Pk that
aims to incorporate gradations of segmentation accuracy in terms of false positives
(falsely detected segments), false negatives (missed segments) and near-misses
(very close but not exact boundaries). More specifically, Pk is defined as ‘the
probability that a randomly chosen pair of words a distance k words apart is
inconsistently classified’ (Beeferman et al., 1999):
\[ P_k(ref, hyp) = \sum_{1 \le i \le j \le n} D(i,j)\left(\delta_{ref}(i,j) \oplus \delta_{hyp}(i,j)\right) \tag{7.2} \]
where δref(i, j) and δhyp(i, j) are binary-valued functions which are 1 when
sentences i and j are in the same topic segment. The symbol ⊕ represents the
XNOR function29, which is 1 when its arguments are equal and 0 otherwise.
The function D is a distance probability distribution which is estimated based on the
average segment size (i.e. story length) in the collection.
However, in a recent publication Pevzner and Hearst (2002) highlight several
faults with the Pk metric. Most notably, they criticise Pk firstly for its inability to
deal with different types of error in an even-handed manner and secondly they
criticise its over-sensitivity to large variances of segment size in the test set. In the
latter case, Pk becomes more lenient as the variance increases and in the former it
unfairly penalises false negatives more than false positives while over-penalising
near-misses. The authors show through empirical evidence and different
segmentation scenarios that their proposed alternative metric called WindowDiff
alleviates these problems and provides a fairer and more accurate measure of
segmentation performance. WindowDiff, like Pk, works by moving a window of
29. XNOR (exclusive NOR) is (A ∧ B) ∨ (¬A ∧ ¬B).
fixed size across the test set and penalising the algorithm whenever a missed or
erroneous boundary occurs. However, unlike Pk it calculates this error by counting
‘how many discrepancies occur between the reference and the system results’ rather
than ‘determining how often two units of text are incorrectly labelled as being in
different segments’ (Pevzner, Hearst 2002). WindowDiff is defined more formally
as follows:
\[ WindowDiff(ref, hyp) = \frac{1}{N-k} \sum_{i=1}^{N-k} \left( \left| b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k}) \right| > 0 \right) \tag{7.3} \]
where k is the size of the window (based on the average segment size in the text),
b(i, j) represents the number of boundaries between positions i and j in the text and
N represents the number of textual units (e.g. sentences) in the text.
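Both metrics can be sketched compactly. The Pk function below uses the common sliding-window formulation (segment labels per textual unit, with k normally set to half the average reference segment length) rather than a literal implementation of the distance distribution D in Equation 7.2, while the WindowDiff function follows Equation 7.3 directly.

```python
def pk(ref_labels, hyp_labels, k):
    """Sliding-window Pk: proportion of unit pairs a distance k apart that the
    reference and hypothesised segmentations classify inconsistently."""
    n = len(ref_labels)
    errors = sum(
        (ref_labels[i] == ref_labels[i + k]) != (hyp_labels[i] == hyp_labels[i + k])
        for i in range(n - k)
    )
    return errors / (n - k)

def window_diff(ref_bounds, hyp_bounds, n_units, k):
    """WindowDiff (Equation 7.3): proportion of windows of size k in which the
    reference and hypothesis contain different numbers of boundaries. A boundary
    at position i is taken to separate unit i from unit i + 1."""
    def b(bounds, i, j):
        return sum(1 for p in bounds if i <= p < j)
    errors = sum(
        b(ref_bounds, i, i + k) != b(hyp_bounds, i, i + k)
        for i in range(n_units - k)
    )
    return errors / (n_units - k)
```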
The fourth and final metric to be used in our evaluation also comes from
Pevzner and Hearst. It is referred to as the Pk′ metric and is an exact implementation
of Pk except that it doubles the false positive penalty to compensate for the over-penalisation of false negatives. However, as the authors explain, this metric is still
inferior to WindowDiff as it solves only one of many identified problems with Pk.
We include this metric in our evaluation as it helps to shed some light on the style
of segmentation returned by the segmenter being evaluated. By segmentation style
we mean the types of mistakes that a segmenter is prone to making in terms of nearmisses, false positives, and false negatives. Appendix C contains a more detailed
explanation of the differences between the WindowDiff and Pk metrics.
7.3 News Story Segmentation Results
In this section we present performance results for each segmenter on both the CNN
and Reuters test sets, with respect to the aforementioned evaluation metrics. As
explained in Section 7.1, we determine the effectiveness of our SeLeCT system
with respect to two other lexical cohesion-based approaches to segmentation,
namely the TextTiling (Hearst, 1997) and C99 (Choi, 2000) algorithms30. A
detailed description of these algorithms was given in Section 6.2.2. Like the
SeLeCT system, they are both lexical cohesion-based segmenters; however they
only examine one form of lexical cohesion, namely lexical repetition. These types
30. We use Choi’s Java implementations of TextTiling and C99, available for free download at www.cs.man.ac.uk/~choif.
of systems work on the notion that areas of text where lexical repetition is at a
minimum represent transitions between topics in the news stream. The SeLeCT
system augments this hypothesis with an extended notion of relatedness so that
topic transitions are represented as areas of text where there are low numbers of
repetition, lexicographical and statistical relationships between tokens, in our case
noun and proper noun phrases.
In addition to the C99 and TextTiling performance results, we also include
results from a random segmenter that returns 25 random boundary positions for
each of the 40 files in both test sets. These results were averaged over 50 random
trials and represent a lower bound on segmentation performance. Furthermore, all
results in this section are calculated using paragraphs as the basic unit of text. Since
both our test sets are in SGML format, each of the segmentation systems makes
boundary decisions based on SGML paragraph boundaries31, where the beginning of
a paragraph is indicated by a speaker change tag in a CNN transcript or a paragraph
tag in the case of a Reuters newswire story.
7.3.1 CNN Broadcast News Segmentation
The graph shown in Figure 7.6 summarises the results of each segmentation system
on the CNN data set, evaluated with respect to the four metrics. All values for these
metrics range from 0 to 1 inclusively. However, F1 results are expressed as 1-F1
since a score of 0, in line with the other metrics, will then represent the highest
measure of system performance. Consequently, the system with the lowest score in
each metric is the best performing algorithm. From Figure 7.6, a visualisation of the
results in Table 7.1, we can see that the accuracy of our SeLeCT segmentation
algorithm is greater than the accuracy of either C99, TextTiling or the Random
segmenter for all four evaluation metrics.
Although many combinations of lexical cohesive relationships were
experimented with, optimal performance of the SeLeCT system was achieved when
only patterns of proper noun and noun repetition were examined during the
boundary detection stage. For the remainder of this subsection we will comment on
31. In (Choi, 2000) boundaries are hypothesised using sentences as the basic unit of text. However, both C99 and TextTiling can take advantage of paragraph information when the input is formatted so that carriage returns indicate breaks between paragraphs.
the segmentation style of each of the algorithms and some interesting characteristics
of each of the evaluation metrics when determining segmentation accuracy.
The 1-F1 value for TextTiling gives us a prime example of how traditional IR
metrics, precision and recall, fail as informative measures of segmentation
performance. In their all-or-nothing approach to measuring segmentation
performance, TextTiling rates as the worst performing system with the highest overall
1-F1 score. A breakdown of this score shows that TextTiling’s recall and precision
values are very low, 27.2% and 22.8% respectively. However, these values take no
account of the fact that TextTiling is producing near-misses rather than ‘pure’ false
negatives, i.e. ‘just’ missing boundaries rather than failing to detect them at all. To
verify this we can observe from Figure 7.7 that recall and precision percentages
significantly improve as the margin of error is incremented in units of +/-1
paragraph. In the case of TextTiling, this graph strongly indicates that the system is
more prone to near-misses than false negatives, as recall and precision values
increase to 68.2 and 53.9 respectively at +/-1 paragraphs.
[Figure: bar chart of 1-F1, Pk, Pk′ and WindowDiff scores (0 to 1) for the SeLeCT, TextTiling, C99 and Random segmenters on the CNN test set.]
Figure 7.6: Accuracy of segmentation algorithms on CNN test set.
System     | Recall % | Precision % | 1 - F1 | Pk    | Pk′   | WindowDiff
SeLeCT     | 53.4     | 55.8        | 0.446  | 0.25  | 0.365 | 0.253
TextTiling | 27.9     | 22.4        | 0.752  | 0.259 | 0.425 | 0.299
C99        | 64.1     | 44.0        | 0.475  | 0.294 | 0.524 | 0.351
Random     | 7.5      | 7.5         | 0.925  | 0.421 | 0.604 | 0.480

Table 7.1: Precision and Recall values from segmentation on concatenated CNN news stories.
[Figure: line plot of the F1 measure (y-axis, 0 to 1) for C99, TextTiling, SeLeCT and Random as the margin of error is increased from 0 to +/-9 paragraphs (x-axis).]
Figure 7.7: Graph illustrating effects on F1 measure as margin of allowable error is
increased for CNN segmentation results.
Section 7.2.2 explained how the WindowDiff metric corrects Pk’s over-penalisation of false negatives and near-misses. With this in mind we would expect
TextTiling to perform better under WindowDiff than Pk. However, it is evident that
TextTiling also suffers considerably from false positive errors, as the difference
between its Pk and Pk′ (which doubles the false positive penalty) scores in Table 7.1 is
relatively large. Therefore we observe, from an analysis of all four evaluation
metrics, that the TextTiling segmentation style is a combination of false positives
(over-segmentation) and near-misses rather than false negatives (under-segmentation). Similarly, if we look at the values for the other two systems we see that
C99 is also more prone to false positives than false negatives, and SeLeCT shows
no particular bias towards producing false positives since its Pk and Pk′ remain
relatively stable.
Another interesting observation from these results is that although C99 has a
much lower 1-F1 measure than TextTiling in Table 7.1, both Pk and WindowDiff
rank it as the worst performing system. Taking a closer look at the results explains
why this is the case. C99 returns nearly 3 times more ‘true’ false positives than
TextTiling, since more of TextTiling’s false positives are in fact near-misses. This
again is not reflected in the recall and precision values. However, Figure 7.7
somewhat illustrates this point by the fact that C99’s performance shows the least
improvement as the margin of error increases. Overall we observe that in spite of
the fact that WindowDiff penalises each system more than Pk does, the overall
ranking of the systems with respect to these two measures is the same. However, in
the case of C99 and TextTiling, WindowDiff distinguishes between their levels of
accuracy with more certainty than Pk does.
7.3.2 Reuters Newswire Segmentation
Table 7.2 and Figure 7.8 summarise the performance of each system on our Reuters
newswire test collection. In this experiment we observe that the C99 algorithm
outperforms the SeLeCT, TextTiling and Random segmenters with respect to all four
evaluation metrics. Optimal performance for the SeLeCT system was once again
achieved by analysing only patterns of proper noun and noun phrase repetition.
Overall the results show an improvement in performance for each of the systems
when segmenting concatenated Reuters news stories rather than CNN transcripts.
The difference between WindowDiff scores (improvement in performance) for C99
and SeLeCT is, however, less than was observed for experiments on CNN
transcripts, i.e. 0.12 versus 0.059 respectively. In Figure 7.9, we notice that
TextTiling performance improves dramatically as the margin of error is
incremented from 0 to +/-1 paragraph, which is reflected in its WindowDiff and Pk
scores, ranking it a close third to the SeLeCT system. We see in Table 7.2, as in
Table 7.1, that although WindowDiff penalises systems more heavily than Pk, the
ranking of system accuracy remains the same. Pevzner and Hearst also comment on
Pk’s sensitivity to variation in segment size in the test set. In our experiment CNN
stories vary in length more than Reuters articles do. Consequently, we observe a
smaller deviation between WindowDiff and Pk scores on the Reuters collection in
comparison to the CNN collection.
In conclusion then, both WindowDiff and Pk attempt to represent each type of
segmentation error in a single value of system accuracy. Combining different error
information in a unified manner is a difficult problem that has drawn as much
attention from the IR community with the formulation of combination metrics such
as the F1 and E-measures (van Rijsbergen, 1979), as it has from segmentation
researchers. The main reason that recall and precision measures are useful in
segmentation evaluation is that they reflect how a user who expects 100% accuracy
might perceive the segmentation results. However, judging system performance
rather than user satisfaction is what metrics like WindowDiff and Pk are good at and
so they also play an important role in measuring system effectiveness.
[Figure: bar chart of 1-F1, Pk, Pk′ and WindowDiff scores (0 to 1) for the C99, SeLeCT, TextTiling and Random segmenters on the Reuters test set.]
Figure 7.8: Accuracy of segmentation algorithms on Reuters test set.
System     | Recall % | Precision % | 1 - F1 | Pk    | Pk′   | WindowDiff
C99        | 70.0     | 74.9        | 0.276  | 0.128 | 0.189 | 0.148
SeLeCT     | 60.6     | 79.1        | 0.314  | 0.191 | 0.246 | 0.207
TextTiling | 32.1     | 41.0        | 0.640  | 0.221 | 0.291 | 0.244
Random     | 9.3      | 9.3         | 0.907  | 0.490 | 0.731 | 0.514

Table 7.2: Precision and Recall values from segmentation on concatenated Reuters news stories.
[Figure: line plot of the F1 measure (y-axis, 0 to 1) for C99, TextTiling, SeLeCT and Random as the margin of error is increased from 0 to +/-9 paragraphs (x-axis).]
Figure 7.9: Graph illustrating effects on F1 measure as margin of allowable error is increased
for Reuters segmentation results.
7.3.3 The Error Reduction Filter and Segmentation Performance
In Section 7.1.1, we described the error reduction filter used to improve the
performance of the SeLeCT system in the boundary detection phase. This filter works
by seeking out clusters of adjacent high scoring boundaries that are separated by
less than d textual units (in our case paragraphs), and then deciding which one of
these boundaries is the correct one using the heuristics discussed in Section 7.1.1.
During the course of our segmentation experiments on the CNN and Reuters news
corpora, we found that the optimal value for d is 7 sentences for both collections.
The problem with a parameter of this nature is that the value of d could be accused
of over-fitting the data set in question. More specifically, if d was the average
distance between stories in the data set, then it could be argued that this element of
the boundary detection process was more important than the information provided
by the lexical chains. However, this is not the case for two reasons. Firstly, roughly
85% of CNN and Reuters news articles are longer than 7 sentences (see Table 5.4
for a breakdown of document lengths in the TDT1 corpus). Secondly, the error filter
is only responsible for a modest increase in SeLeCT’s performance.
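To make the behaviour of the filter concrete, a minimal sketch of its clustering step is given below, assuming that candidate boundaries are held as (position, score) pairs ordered by position; for simplicity the highest scoring boundary in each cluster is kept, rather than reproducing all of the heuristics of Section 7.1.1.

```python
def reduce_boundary_clusters(boundaries, d=7):
    """Collapse clusters of candidate boundaries that lie within d textual
    units of one another, keeping one boundary per cluster.

    boundaries: list of (position, score) pairs sorted by position.
    """
    filtered, cluster = [], []
    for pos, score in boundaries:
        if cluster and pos - cluster[-1][0] >= d:
            # the current cluster is finished: keep its best scoring boundary
            filtered.append(max(cluster, key=lambda b: b[1]))
            cluster = []
        cluster.append((pos, score))
    if cluster:
        filtered.append(max(cluster, key=lambda b: b[1]))
    return filtered
```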
Figure 7.10 illustrates the degree to which the error filter improves SeLeCT
performance on the CNN data set. In particular, we can see from this graph that the
F1 scores at a margin of error of +/-0 units are similar (0.52 without the filter versus
0.55 with the filter). However, beyond this point the F1 score of the SeLeCT schema
without the filter tends to plateau as the margin of error increases. In contrast, the
F1 score steadily increases when the filter is employed, meaning that the filter improves
performance by hypothesising more near-misses than the other schema. In Figure
7.11, we can visualise this segmentation characteristic in terms of recall and
precision. From this graph we can see that the ‘without-filter’ schema is capable of
achieving higher recall values at the expense of precision, whereas the ‘with-filter’
schema can achieve much greater precision without causing a significant
deterioration in recall.
[Line graph plotting SeLeCT’s F1 measure, with and without the error reduction filter, against a margin of allowable error of +/-n paragraphs (n = 1 to 10).]
Figure 7.10: Graph illustrating the effect of the error reduction filter on SeLeCT’s F1 measure for the CNN collection as the margin of allowable error increases.
[Scatter plot of recall against precision for SeLeCT with and without the error reduction filter.]
Figure 7.11: Graph illustrating the effect of the error reduction filter on SeLeCT’s recall and precision for the CNN collection as the margin of allowable error increases.
7.3.4 Word Associations and Segmentation Performance
As stated in Sections 7.3.1 and 7.3.2, optimal SeLeCT performance was achieved
on the CNN and Reuters test sets when only repetition relationships were used to
determine topic shifts during the boundary detection phase32. Figure 7.12 illustrates
how segmentation performance deteriorates with the use of additional lexical
cohesive relationships. The graph also shows that this trend is consistent across
spoken and written forms of news stories.
[Line graph plotting WindowDiff scores for the Reuters and CNN collections against the combination of relationship types used: R; R,S; R,S,C; R,C; R,S,S/G,P/W; R,S,S/G,P/W,C.]
Figure 7.12: Graph showing effect of word relationships on segmentation accuracy. R=Repetition, S=Synonymy, C=Co-occurrences or statistical word associations, S/G=Specialisation/Generalisation, P/W=Part/Whole.
Although this is a disappointing and somewhat counterintuitive result, a closer
examination of the effect of using weaker semantic relationships to segment text
into topics reveals why these relationships, identified by the lexical chains, are
inappropriate in a text segmentation application. Firstly, cohesion and
32 Similar SeLeCT performance is reported in (Stokes, Carthy, Smeaton 2002). However, this preliminary system did not include ‘fuzzy’ syntactic matching of noun phrases (Section 3.2). In that paper optimal performance was achieved when story boundaries were determined by examining patterns of repetition and WordNet-based relationships in the text. Furthermore, although the same CNN stories are involved in both papers, the evaluation in (Stokes, Carthy, Smeaton 2002) was run on a single file of 1000 CNN stories rather than on 40 files each containing 25 stories. Splitting the corpus was necessary for the experiments reported here and in (Stokes, 2003) because Choi’s C99 program was implemented to handle only small data sets.
coherence are independent which means that cohesion can exist in sentences that
are not related coherently (Morris, Hirst, 1991).
[1] Dwi Sumadji, who was released yesterday after a judge decided there was
insufficient evidence against him, said he was willing to testify in any hearings on the
case relating to his imprisonment. [2] Dr Ian Wilmut, the head of the Roslin Institute in
Edinburgh, released results earlier today proving that Dolly the sheep and her donor’s
DNA were identical, using the same DNA typing technique that is now accepted as
standard in most courts.
Repetition-based Chains:
{DNA, DNA typing technique}
Repetition + WordNet-based Chains:
{DNA, DNA typing technique}
{judge, case}
Repetition + WordNet + Statistical Word Association-based Chains:
{judge, evidence, hearing, case, DNA, DNA typing technique, court}
Figure 7.13: Example of the effect of weak semantic relationships on the segmentation
process.
For example, consider the text extract in Figure 7.13, which contains two
unrelated sentences, one taken from a story on ‘Dwi Sumadji’s release from prison’
and the other on ‘Dolly the sheep’. Below these sentences are three sets of lexical
chains generated from this piece of text using a variety of word relationships:
repetitions only, repetitions and WordNet relationships, and repetitions, WordNet
relationships and collocations. Looking only for repetition relationships between
these sentences we see that they have no nouns in common. Looking for WordNet
relationships between these sentences we find that although ‘judge’ and ‘case’ can
be related by following paths in the taxonomy, the related word ‘court’ in the second
sentence is not identified. However, when statistical word associations are included
in the chaining process we find that ‘evidence’ in the first sentence is related to
‘DNA’ in the second, and ‘court’, ‘hearing’ and ‘case’ are also added into the same
chain. It is evident from this example that weaker semantic relationships can add
noise to the detection process by blurring topic shifts between news stories.
Although this example only highlights the negative effect of collocations on
segmentation accuracy there are also many instances where WordNet relationships
are responsible for finding similar weak associations between words.
The second reason why weaker lexical cohesive relationships identified by the
chains fail to improve segmentation accuracy is that word interpretations must
occur in context. More specifically, when WordNet or statistical word relationships
are used to analyse cohesion in concatenated news stories, spurious chains are more
likely to be generated in a text that is disjoint and incoherent. For example, consider
the following two seemingly unrelated sentences:
‘There are approximately 336 dimples on a golf ball.’
‘They finally had all the wrinkles in the plan pretty much ironed out’.
In this example, the lexical chaining algorithm incorrectly associates the words
‘dimples’ and ‘wrinkles’ by a specialisation relationship with the concept –
‘depression, impression or imprint’ in WordNet. However, it is obvious that
‘wrinkle’ in the second sentence is being used in the ‘minor difficulty’ sense of the
word. Therefore, we can say that if our chaining algorithm finds a lexicographical
or statistical association in a fixed context (i.e. a single news story) then we can
assume that this relationship is reliable, otherwise we cannot.
7.4
Written versus Spoken News Story Segmentation
It is evident from the results of our segmentation experiments on the CNN and
Reuters test collections that system performance is dependent on the type of news
source being segmented, i.e. spoken texts are more difficult to segment. This
disagreement between result sets is a largely unsurprising outcome as it is well
documented by the linguistic community that written and spoken language modes
differ greatly in the way in which they convey information. At first glance, it is
obvious that written texts tend to use more formal and verbose language than their
spoken equivalents. However, although CNN transcripts share certain spoken text
characteristics, they lie somewhere nearer written documents on a spectrum of
linguistic forms of expression, since they contain a mixture of speech styles ranging
from formal prepared speeches from anchor people, politicians, and correspondents,
to informal interviews/comments from ordinary members of the public.
Furthermore, spoken language is also characterised by false starts, hesitations,
backtrackings, and interjections; however, information regarding prosodic features and
these characteristics is not represented in CNN transcripts. In this section we look
at some grammatical differences between spoken and written text that are actually
evident in CNN transcripts. In particular, we look at the effect that these differences
have on part-of-speech distributions, and how these impact segmentation
performance.
7.4.1 Lexical Density
One method of measuring the grammatical intricacy of speech compared to written
text, is to calculate the lexical density of the language being used. The simplest
measure of lexical density, as defined by Halliday (1985), is ‘the number of
lexical items (content words) as a portion of the number of running words
(grammatical words)’. Halliday states that written texts are more lexically dense
while spoken texts are more lexically sparse. In accordance with this we observe,
based on part-of-speech tag information, that the proportion of lexical items in the
CNN test set is 8.58 percentage points lower than in the Reuters news collection.33 Halliday explains that this
difference in lexical density between the two modes of expression can be attributed
to the following observation:
Written language represents phenomena as products, while spoken language
represents phenomena as processes.
In real terms this means that written text tends to convey most of its meaning
through nouns (NN) and adjectives (ADJ), while spoken text conveys it through
adverbs (ADV) and verbs (VB). To illustrate this point consider the following
written and spoken paraphrase of the same information:
33 Lexical items include all nouns, adjectives and verbs, except for function verbs such as modals and auxiliary verbs; these verbs instead form part of the grammatical item lexicon along with all remaining parts of speech. Our CNN and Reuters data sets consist of 43.68% and 52.26% lexical items respectively.
Written: Improvements/NN in American zoos have resulted in better living/ADJ
conditions for their animal residents/NN.
Spoken: Since/RB American zoos have been improved/VB the animals
residing/VB in them are now/RB living/VB in better conditions.
Although this example is a little contrived, it shows that in spite of changes to
the grammar, by and large the vocabulary has remained the same. More
specifically, these paraphrases illustrate how the products in the written version
(improvements, residents, and living) are conveyed as processes in spoken language
through the use of verbs. The spoken variant also contains more function words, in
particular two adverbs ‘now’ and ‘since’, where adverbs are a grammatical
necessity that provides cohesion to text when processes are being described in verb
clauses. So looking at the ratio of function words in the written and spoken forms
we find that for every one function word in the written text there are 1.8 function
words in the spoken form, i.e. 1: 1.8. On the other hand the ratio of content words is
almost one for one.
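The lexical density statistic itself is simple to compute from part-of-speech tagged text. The sketch below assumes Penn Treebank style tags and a small illustrative list of auxiliaries standing in for our full function verb list.

```python
# Tag prefixes counted as lexical (content) items: nouns, adjectives and verbs.
# Modals carry the tag MD and so never match these prefixes.
LEXICAL_PREFIXES = ("NN", "JJ", "VB")
# Illustrative auxiliary list; these verbs count as grammatical items.
AUXILIARIES = {"be", "is", "are", "was", "were", "been", "being",
               "have", "has", "had", "do", "does", "did"}

def lexical_density(tagged_tokens):
    """Proportion of lexical items among all running words (Halliday, 1985)."""
    if not tagged_tokens:
        return 0.0
    lexical = sum(
        1 for word, tag in tagged_tokens
        if tag.startswith(LEXICAL_PREFIXES) and word.lower() not in AUXILIARIES
    )
    return lexical / len(tagged_tokens)
```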
Lexical Density and SeLeCT
As explained in Section 3.2, the LexNews chaining algorithm, used by the SeLeCT
segmenter, only looks at cohesive relationships between nouns, proper nouns and
nominalised adjectives in a text. This accounts partly for SeLeCT’s lower
performance on the CNN test set, since the extra information conveyed through
verbs in spoken texts is ignored by the lexical chainer. The simplest solution to this
problem is to repeat the SeLeCT experiments, this time including all verbs (except
function verbs such as modals) in the chaining process. The best method of dealing
with morphological variations between same-stem verbs is to reduce all verbs to
their root form during tokenisation. This ensures that irregular verbs such as ‘to
ring’ (ring – rang - rung) or ‘to swim’ (swim – swam – swum) will appear
syntactically equivalent during the chaining process. So, apart from these irregular
verbs (for which WordNet lists the exceptions), all verbs are reduced to their root form by
the tokeniser using standard inflection-derived rules. In Section 2.3, we briefly
discussed the difficulty of finding semantic relationships between verbs using the
WordNet taxonomy; consequently, we only examine repetition relationships
between these parts of speech.
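For illustration, the same root-form reduction can be obtained with WordNet’s morphological processor, for example through NLTK (an assumption for this sketch; our own tokeniser applies the inflection rules directly, together with WordNet’s exception lists).

```python
from nltk.stem import WordNetLemmatizer

lemmatise = WordNetLemmatizer().lemmatize

# WordNet's morphy applies standard inflectional rules plus its exception
# lists, so irregular verb forms reduce to the same root (e.g. rang -> ring,
# swum -> swim), while regular forms such as reported -> report follow rules.
for verb in ("rang", "rung", "swam", "swum", "reported", "trying"):
    print(verb, "->", lemmatise(verb, pos="v"))
```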
Table 7.3 shows the negative effect on segmentation performance when verb
stems (excluding function verbs) are included in the chaining process. More
specifically, SeLeCT’s performance deteriorates by 3.1% on the CNN collection
and 4.3% on the Reuters collection. From the results of this experiment we observe
that the standard set of textual function verbs is not sufficient for speech-text
processing tasks, and that such lists should be extended to include other common
‘low information’ verbs. These verbs are not necessarily characterised by large
frequency counts in the spoken news collection, as domain-specific verbs such as
‘to report’ or ‘to comment’ are. Instead, they tend to have no equivalent nominal
form, for example the verbs ‘to let’, ‘to hear’, ‘to look’ or ‘to try’. With this in
mind, we repeated this experiment, this time including only nominalised verbs
together with the usual proper noun phrases, noun phrases and nominalised adjectives in the
chaining process. As expected, these experimental results, presented in the last row
of Table 7.3, show a 1.2% decrease in system error on the CNN collection over the
initial SeLeCT system. A similar decrease in error on the Reuters test collection
was not observed since written text conveys most of its meaning through the use of
nouns, so verbs can be ignored with little or no effect on segmentation performance
in the context of this experiment.
System                                CNN WindowDiff        ∆ Error   Reuters WindowDiff    ∆ Error
                                      Before     After                Before     After
SeLeCT (stopped and stemmed verbs)    0.253      0.284      + 3.1%    0.207      0.250      + 4.3%
SeLeCT (nominalised verbs)            0.253      0.241      − 1.2%    0.207      0.209      + 0.2%
Table 7.3: Results of SeLeCT segmentation experiments when verbs are added into the chaining process.
Lexical Density, C99 and TextTiling
Comparing Tables 7.1 and 7.2, as in the case of the SeLeCT results, we also notice
a decrease in C99 and TextTiling segmentation performance on the CNN collection
compared with the Reuters collection results. For the SeLeCT system, we
concluded that this performance difference was caused by the loss of valuable topic
information when ‘nominalisable’ verbs are excluded from the chain creation
phrase. However, since C99 and TextTiling use all parts of speech in their analysis
of the text, the replacement of products (nouns) with processes (verbs) is not the
reason for a similar deterioration in their performance. More specifically, both C99
and TextTiling rely on stopword lists to identify spurious inter-segment links
between function words that by their nature do not indicate common topicality. For
the purpose of their original implementation their stopword lists contained mostly
pronouns, determiners, adverbs, and function verbs such as auxiliary and modal
verbs. However, we observed from the SeLeCT results in Table 7.3 that the
standard set of textual function verbs is not sufficient for speech-text processing tasks.
In order to observe if a similar improvement in results could be achieved, we re-ran
C99 and TextTiling experiments on the Reuters and CNN collections, using only
nouns, adjectives, nominalised verbs (provided by NOMLEX (Meyers et al.,
1998)), and nominalised adjectives as input. Alternatively, we could have provided
these systems with an extended stopword list that included general stopwords and
‘low information’ verbs; however, it is more desirable and effective in this case to
limit the input of the system to content words only.
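A sketch of this input filtering step is shown below; the NOMLEX lookup is represented here as a hypothetical dictionary mapping verb lemmas to nominal forms, since the real lexicon is considerably larger.

```python
# Hypothetical fragment of a NOMLEX-style lookup: verb lemma -> nominal form.
NOMINALISATIONS = {"improve": "improvement", "decide": "decision",
                   "segment": "segmentation", "report": "report"}

def content_word_input(tagged_lemmas):
    """Restrict the C99/TextTiling input to content words: nouns, adjectives
    and nominalisable verbs (replaced by their nominal form).

    tagged_lemmas: list of (lemma, Penn Treebank tag) pairs.
    """
    kept = []
    for lemma, tag in tagged_lemmas:
        if tag.startswith(("NN", "JJ")):
            kept.append(lemma)
        elif tag.startswith("VB") and lemma.lower() in NOMINALISATIONS:
            kept.append(NOMINALISATIONS[lemma.lower()])
    return kept
```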
Our results in Table 7.4 show that there is a decrease in the WindowDiff error
for the C99 system on both the CNN collection (an 8.3% reduction in error) and the
Reuters collection (a 2.7% reduction in error). Similarly, we observe an
improvement in the WindowDiff based performance of the TextTiling system on the
CNN data set (a 2.5% reduction in error). However, we observe a marginal fall in
performance on the Reuters data set (a 0.3% increase in error). These results again
illustrate the increased dominance of verbs in spoken text and the importance of
function verb removal by our verb nominalisation process for CNN segmentation
performance.
System                            CNN WindowDiff        ∆ Error   Reuters WindowDiff    ∆ Error
                                  Before     After                Before     After
C99 (nominalised verbs)           0.351      0.268      − 8.3%    0.148      0.121      − 2.7%
TextTiling (nominalised verbs)    0.299      0.274      − 2.5%    0.244      0.247      + 0.3%
Table 7.4: Results of C99 and TextTiling segmentation experiments when nominalised verbs are added into the segmentation process.
7.4.2 Reference and Conjunction in Spoken Text
‘A picture paints a thousand words’ and since news programme transcripts are
accompanied by visual and audio cues in the news stream, there will always be a
loss in communicative value when transcripts are interpreted independently. It is
well known that conversational speech is accompanied by prosodic and
paralinguistic contributions, facial expressions, gestures, intonation etc., which are
rarely conveyed in spoken transcripts. However, there are also explicit (exophoric)
references in the transcript to events occurring outside the lexical system itself.
These exophoric references in CNN transcripts relate specifically to audio
references such as speaker change, musical interludes, background noise; and visual
references such as event, location and people shots in the video stream. All of these
exophoric cues help to give context to the dialogue in a news report and commonly
contain information which is not always repeated in the accompanying transcript. In
particular, this ‘spoken’ news story characteristic is repeatedly seen in human-interest stories and entertainment reports, which are less structured than other news
reports in the collection. Although some of these exophoric cues are explicitly
tagged in TDT news transcripts (e.g. speaker change), TDT segmenters have only
made use of this information to identify boundaries between sentences. For
example, Figure 7.14 contains an extract from a CNN story on the movie “The
Shadow”. One of the first things one notices is that much of the text produced by
the film clip is largely irrelevant. In the context of document indexing the addition
of this text would by and large have little effect on the term frequencies and idf
scores. However, in story segmentation these types of interludes often appear to
automatic segmenters as areas of dissimilarity in the text which can in turn lead to
incorrect story boundary assignments. Consequently, identifying and ignoring these
text units during boundary detection may improve segmentation performance and
be a fruitful area for future speech-based news story segmentation research.
However, with respect to the deterioration in segmentation performance on the
CNN test collection in our experiments, we believe that this property of transcribed
news is a contributory factor.
[Film clip from ‘The Shadow’]
LONE, stars as Shiwan Khan: How did you know what was happening to me?
How did you know who I am?
ALEC BALDWIN, stars as the Shadow: The Shadow knows.
CHARLIE COATS, News Correspondent: Of course he knows. That's the whole idea.
The Shadow is the super hero who has some loosely-defined mind-reading powers and a
dark side - something about evil that lurks in the hearts of men. But he's a good guy and a
snappy dresser.
[Film clip from ‘The Shadow’]
LONE: That is a lovely tie, by the way. May I ask where you acquire it?
BALDWIN: Brooks Brothers.
LONE: Is that midtown?
BALDWIN: Forty-fifth and Madison. You are a barbarian.
LONE: Thank you.
Figure 7.14: CNN transcript of movie review with speaker identification information.
In addition to the occurrence of exophoric references, speech transcripts also
contain examples of endophoric (anaphora and cataphora) reference. Solving
endophoric reference has long been recognised as a very difficult problem, which
requires pragmatic, semantic and syntactic knowledge in order to be solved.
However, there are simple heuristics commonly employed by text segmentation
algorithms that we use to take advantage of the increased presence of this form of
reference in spoken text. One such heuristic is based on the observation that when
common referents such as personal and possessive pronouns, and possessive
determiners appear at the beginning of a sentence, this indicates that these referents
are linked in some way to the previous textual unit (in our case the previous
paragraph). The resolution of these references is not of interest to our algorithm but
the fact that two textual units are linked in this way gives the boundary detection
process an added advantage when determining story segments in the text. In
addition, an analysis of conjunction (another form of textual cohesion) can also be
used to provide the detection process with useful evidence of related paragraphs,
since paragraphs that begin with conjunctions (because, and, or, however,
nevertheless), and conjunctive phrases (in the mean time, in addition, on the other
hand) are particularly useful in identifying cohesive links between units in
conversational/interview sequences in the transcript.
7.4.3 Refining SeLeCT Boundary Detection
In Section 7.1.1, we described in detail how the boundary detection phase uses
lexical chaining information to determine story segments in a text. One approach to
integrating referential and conjunctive information with the lexical cohesion
analysis provided by the chains is to remove all paragraphs from the system output
that contain a reference or conjunctive relationship with the paragraph immediately
following it in the text. The problem with this approach is that Pk and WindowDiff
errors will increase if ‘incorrect’ segment end points are removed that represented
near system misses rather than ‘pure’ false positives. Hence, we take a more
measured approach to integration that uses conjunctive and referential evidence in
the final filtering step of the detection phase, to eliminate boundaries in boundary
clusters (Section 7.1.1) that cannot be story end points in the news stream. Figure
7.15 illustrates how this technique can be used to refine the filtering step.
Originally, the boundary with score six in region A would have been chosen as the
correct boundary point. However, since a conjunctive phrase links the adjacent
paragraphs at this boundary position in the text, the boundary which scores five is
deemed the correct boundary point by the algorithm.
[Diagram: a sequence of candidate boundary scores along the text (e.g. 6, 5, 4, 0, 5, 5, 3, 0, …) grouped into three clusters, labelled regions A, B and C.]
Figure 7.15: Diagram illustrating how cohesion information can help SeLeCT’s boundary detector resolve clusters of possible story boundaries.
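The following sketch shows how such referential and conjunctive cues might be folded into the cluster-resolution step illustrated in Figure 7.15. The cue lists and the paragraph representation are illustrative assumptions rather than the exact resources used by SeLeCT.

```python
PRONOUN_CUES = {"he", "she", "it", "they", "his", "her", "its", "their", "this"}
CONJUNCTIVE_CUES = ("because", "and", "or", "however", "nevertheless",
                    "in the mean time", "in addition", "on the other hand")

def linked_to_previous(paragraph):
    """True if a paragraph opens with a referential or conjunctive cue,
    i.e. it is cohesively tied to the preceding paragraph."""
    opening = paragraph.strip().lower()
    first_word = opening.split()[0] if opening else ""
    return first_word in PRONOUN_CUES or opening.startswith(CONJUNCTIVE_CUES)

def resolve_cluster(cluster, paragraphs):
    """Choose the best boundary in a cluster of (position, score) candidates,
    discarding positions whose following paragraph is tied to the one before
    it (cf. the score-6 boundary in region A of Figure 7.15)."""
    admissible = [(pos, score) for pos, score in cluster
                  if pos + 1 >= len(paragraphs)
                  or not linked_to_previous(paragraphs[pos + 1])]
    return max(admissible or cluster, key=lambda b: b[1])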
Using this technique and the verb nominalisation process described in Section 7.4.1
on both news media collections, we observed an improvement in SeLeCT system
performance on the CNN data set (a decrease in error from 0.241 to 0.225), but no
such improvement on the Reuters collection. Again the ineffectiveness of this
technique on the Reuters result can be attributed to differences between the two
modes of language expression, where conjunctive and referential relationships
resolve 51.66% of the total possible set of boundary points between stories in the
CNN collection and only 22.04% in the Reuters collection. In addition, these
references in the Reuters articles mostly occur between sentences in a paragraph
rather than between paragraphs in the text, thus providing no additional cohesive
information.
A summary of the improved results discussed in this section and in Section 7.4.1
is shown in Table 7.534. This table is followed by two tables (Tables 7.6 and 7.7)
containing the results of a paired samples t-test for each pair of systems in our
evaluation. In these tables, the symbol * indicates that the difference in WindowDiff
scores between the two systems is statistically significant at the 95% level, while
** indicates statistical significance at the 99.9% level. These results are based on a two-
34 The C99, TextTiling and SeLeCT implementations yield optimal results using the following parameters. On both the CNN and Reuters data sets C99 was run with an 11 x 11 ranking mask as suggested in (Choi, 2000), while TextTiling runs best with a window size of 300 and a step size of 20 on the CNN test set, and a window size of 300 and a step size of 30 on the Reuters test set. For a more detailed explanation of these parameters see (Choi, 2000) and the README file which accompanies Choi’s version of TextTiling, which can be downloaded at http://www.cs.man.ac.uk/~mary/choif/frame.html (checked March 2004). SeLeCT yielded optimal performance on the CNN test set using x = 2, distance = 7, and a maximum extra strong or repetition-based relationship distance of 750 words, while on the Reuters test set the settings x = 1, distance = 7 and a maximum extra strong relationship distance of 400 words (referred to as parameter m in Section 7.1.1) were used.
sided t-test of the null hypothesis of equal means, where all tests are performed on
39 degrees of freedom, i.e. sample size of 40.
Table 7.6 shows that, on the CNN collection, our original SeLeCT system is more
accurate than both the original TextTiling and C99 systems. However, Table 7.7 shows that after
refinements were made to the systems (see Section 7.4.1), only the difference in
means between the SeLeCT and TextTiling systems is deemed to be statistically
significant (to a level of 95% confidence). Hence, SeLeCT, C99 and TextTiling
performance on the CNN collection is broadly equivalent. Further experimentation on a
larger CNN test collection might help to distinguish between the performance of
these systems.
With regard to the Reuters results, the system refinements discussed in Section
7.4.1 were shown to have made no impact on segmentation performance. Hence
Table 7.6 contains the original system results, where a paired samples t-test shows
that all differences are statistically significant to a level of 99.9% confidence.
Consequently, C99 performs best, followed by SeLeCT and then TextTiling on the
Reuters collection.
System       CNN WindowDiff        ∆ Error   Reuters WindowDiff    ∆ Error
             Before     After                Before     After
SeLeCT       0.253      0.225      − 2.8%    0.207      0.209      + 0.2%
C99          0.351      0.268      − 8.3%    0.148      0.121      − 2.7%
TextTiling   0.299      0.274      − 2.5%    0.244      0.247      + 0.3%
Table 7.5: Improvements in system performance as a result of system modifications discussed in Sections 7.4.1 and 7.4.3.
Initial Results:          CNN                         Reuters
Paired Samples T-Test     t-statistic   p-value       t-statistic   p-value
SeLeCT – C99              -6.802        0.00**        7.406         0.00**
SeLeCT – TextTiling       -6.464        0.00**        -4.911        0.00**
C99 – TextTiling          3.051         0.04*         -11.916       0.00**
Table 7.6: Paired Samples T-Test on initial results from Tables 7.1 and 7.2. All results marked with * are statistically significant to 95% confidence and those marked with ** are statistically significant to 99.9% confidence.
Refined Results:          CNN
Paired Samples T-Test     t-statistic   p-value
SeLeCT – C99              -1.537        0.13
SeLeCT – TextTiling       -2.99         0.05*
C99 – TextTiling          -0.66         0.51
Table 7.7: Paired Samples T-Test p-values on the refined CNN results from Table 7.5. All results marked with a * are statistically significant to 95% confidence.
7.5
Discussion
In this chapter we described our lexical chain-based approach to news story
segmentation, the SeLeCT system. This system uses the LexNews chaining
algorithm to build a set of lexical chains from a concatenated set of news stories.
The start and end points of these chains in the text are then used to discern where
topic transitions or story boundaries occur in the text. On a CNN news story
collection of spoken news transcripts using the WindowDiff metric as a means of
determining segmentation accuracy, we found that the SeLeCT system
outperformed the C99 and TextTiling algorithms. However, on a similar Reuters
news collection of concatenated written newswire documents the best performing
system was the C99 algorithm followed by the SeLeCT system and then the
TextTiling algorithm. In both experiments the SeLeCT system performed best when
only patterns of repetition were analysed. The lack of success of the weaker
semantic relationships (WordNet relationships and statistical word associations) in
determining boundaries between news stories was attributed to the fact that a text in
this application is a non-coherent collection of paragraphs and lexical chains can
only be reliably built from a coherent text. Otherwise spurious chains are created
which add noise to the boundary detection step. Also, we noted that cohesion and
coherence are independent textual properties so even unrelated sentences can have
correctly identified cohesive ties between them, making text segmentation an
unsuitable application for a full analysis of lexical cohesive patterns in text.
The deterioration in segmentation accuracy of all three systems on the spoken
news collection was explained in terms of the propensity of written text to express
phenomena as products (i.e. nouns) in contrast to speech where phenomena are
more commonly expressed as processes (i.e. verbs). With respect to the SeLeCT
system, we observed that by including nominalised verbs in the segmentation
process the accuracy of the algorithm improved. Further improvements were also
observed when all non-nominalisable verbs were eliminated from the TextTiling
and C99 input. These verbs tend to be less ‘informative’ and more commonly
occurring, therefore they are considered as additional noise in the segmentation
process. The final refinement to the SeLeCT boundary detection step involved the
inclusion of reference and conjunction information, which helped to improve
segmentation performance. On the other hand, none of these refinements improved
the performance of any of the systems on the Reuters test collection. Also the
improvement in performance of the TextTiling and C99 systems on the CNN
collection resulted in the SeLeCT system only marginally outperforming these
systems in the end.
In this chapter we described the results of our own segmentation evaluation
methodology. However, as Section 4.2.2 stated, News Story Segmentation is an
official TDT task. As a result, the official TDT1 pilot study evaluation provides a
means of determining segmentation accuracy on the TDT1 corpus. Like our
evaluation they use an error metric (an earlier version of the Pk metric) to directly
evaluate the ability of each system to determine boundaries between stories.
However, the TDT1 segmentation evaluation is also based on an indirect
measurement of segmentation quality with respect to the effect on event tracking
performance of automatically segmented news stories. An interesting future
experiment would be to re-evaluate the SeLeCT system in this more comprehensive
evaluation methodology. However, this evaluation could not involve Choi’s
implementation of the C99 algorithm because it only supports segmentation of short
documents, and the TDT1 evaluation requires the segmentation of three streams:
the Reuters news stream, the CNN news stream and the entire TDT1 collection. As
already stated, in order to include the C99 system in our own segmentation
evaluation we had to calculate performance as the average performance of each
system on 40 files each containing 25 news stories.
Another advantage of using our own evaluation format was that it allowed us to
determine to what extent repetition-based segmenters are useful in News Story
Segmentation. In comparison, the systems that have been involved in the official
TDT segmentation task have been primarily focussed on domain-specific
techniques, like those described in Section 6.2.3, that are trained on news data, and
are sensitive to the occurrence of domain-specific cues in the news text, e.g. ‘news
just in’. By comparing our method with other domain independent repetition-based
segmenters, we were able to directly establish how well lexical chains perform with
respect to these approaches, and to explore the effect of broader lexical cohesive
relations (i.e. WordNet and statistical relationships) on the segmentation process.
Another interesting area for investigation would be the combination of SeLeCT’s
segmentation evidence with domain-specific information using one of the multi-source statistical models described in Section 6.2.3. In addition, it is not clear from
our experiments how well the SeLeCT system would perform on error-prone ASR
news transcripts. This would also be an interesting area for further research, since
ASR transcripts were shown to have significantly affected the performance of our
lexical chain-based New Event Detection system, LexDetect, in Chapter 5.
Chapter 8
News Story Gisting
In this chapter we discuss some promising initial results obtained from our final
application of lexical cohesion analysis: News Story Gisting. A gist is a very short
summary, ranging in length from a single phrase to a sentence, that captures the
essence of a piece of text in much the same way as a title or section heading in a
document helps to convey the text’s central message to a reader. Like News Story
Segmentation and New Event Detection, News Story Gisting is a prerequisite for
the successful organisation and presentation of news streams to users. News Story
Gisting also represents another interesting and novel application of the LexNews
algorithm in the broadcast news domain.
In this chapter we describe the results of some on-going collaborative work with
the Dublin City University (DCU) Centre for Digital Video Processing. More
specifically, this part of our research focuses on the creation of news story gists for
streams of news programmes used in the DCU Físchlár-News-Stories system
(Smeaton et al., 2003): a multi-media system that allows users to search, browse
and play individual news stories from Irish television news programmes. In its
current incarnation the Físchlár-News-Stories system segments video news streams
using audio and visual analysis techniques.
Like all real-world applications, these techniques will at times place erroneous
story boundaries in the resultant segmented video stream. In addition, since the
closed caption material accompanying the video is generated live during the
broadcast, a time lag exists between the discussion of an item of news in the audio
stream and its appearance in the teletext on the video stream. Consequently,
segmentation errors will be present in the closed caption stream, where for example
the end of one story might be merged with the beginning of the next story. Previous
work in this area undertaken at the DUC summarisation workshops35 and by other
research groups has predominantly focussed on generating gists from clean data
35 Document Understanding Conferences (DUC): www-nlpir.nist.gov/projects/duc/intro.html
sources such as newswire (Witbrock, Mittal, 1999) thus avoiding the real issue of
developing techniques that can deal with the erroneous data that underlies this
problem. The work described in this chapter was published in (Stokes et al., 2004b).
8.1
Related Work
Automatic text summarisation is the task of generating a more concise version of a
source text while trying to retain the essence of its original information content.
Summaries range in sophistication from simple extractions to more complex
abstractions. In the case of extractions, a text summariser simply returns the set of
sentences (verbatim) that it believes represents the central theme of a document.
Abstractions on the other hand involve deep-level textual analysis and subsequent
paraphrasing of an extract into a more coherent whole.
Our approach to gisting is extractive. More specifically, the LexGister system
determines a representative sentence for a text based on the strength of the lexical
cohesive relationships between that sentence and the rest of the text. In our
experimental methodology, Section 8.3, we determine the performance of the
LexGister system with respect to a random extractor, a lead sentence extractor and
a tf.idf approach to the problem. Other notable extractive gisting approaches
discussed in the literature include Kraaij et al.’s (2002) probabilistic approach,
Alfonseca et al.’s (2003) genetic algorithmic approach, and Copeck et al.’s (2003)
approach based on the occurrence of features that denote appropriate summary
sentences. These lexical, syntactic and semantic features include the occurrence of
discourse cues, the position of the sentence in the text, and the occurrence of
content phrases and proper nouns. Biasing the extraction process with additional
textual information such as these features is a standard approach to headline
generation that has proved to be highly effective in most cases (Kraaij, 2002;
Alfonseca et al., 2003; Copeck et al., 2003; Zhou, Hovy, 2003).
An alternative to extractive gisting approaches is to view the title generation
process as being analogous to statistical machine translation. Witbrock and Mittal’s
(1999) paper on ‘ultra-summarisation’ was one of the first attempts to generate
headlines based on statistical learning methods that make use of large amounts of
training data. More specifically, during title generation a news story is ‘translated’
into a more concise version using the Noisy Channel model. The Viterbi algorithm
is then used to search for the most likely sequence of tokens in the text that would
make a readable and informative headline. This is the approach adopted by Banko
et al. (2000), Jin and Hauptmann (2001), Berger and Mittal (2000) and most
recently by Zajic and Dorr (2002). In the following section we describe our lexical
chain-based gisting approach.
8.2
The LexGister System
Like the LexDetect and SeLeCT systems our news gister, the LexGister system,
uses the same candidate selection and LexNews algorithm for generating a news
story summary. Once lexical chains have been generated the next step is to identify
the most important or highest scoring proper noun and noun chains for a story. This
step is necessary as it helps to identify the central themes in the text by discarding
cohesively weak chains. The overall cohesive strength of a chain is measured with
respect to the strength of the relationships between the words in the chain. Table 3.6
in Section 3.5 showed the strength of the scores assigned to each cohesive
relationship type participating in the chaining process, i.e. repetition = 1; synonymy
= 0.9; antonymy, hyponymy, meronymy, holonymy, and hypernymy = 0.7; path
lengths greater than 1 in WordNet = 0.4; statistical word associations = 0.4. The
chain weight, score(chain), then becomes the sum of these relationship scores,
which is defined more formally as follows:
score(chain) = \sum_{i=1}^{n} \big( reps(i) + rel(i, j) \big)        (8.1)
where i is the current chain word in a chain of length n, reps(i) is the number of
repetitions of term i in the chain and rel(i,j) is the strength of the relationship
between term i and the term j where j was deemed related to i during the chaining
process. For example, the chain {hospital, infirmary, hospital, hospital} would be
assigned a score of ( (reps(hospital) + rel(hospital, infirmary)) + (reps(infirmary) +
rel(infirmary, hospital)) ) = 5.8, since ‘infirmary’ and ‘hospital’ are synonyms.
Chain scores are not normalised, in order to preserve the importance of the length of
the chain in the score(chain) calculation. Once all chains have been assigned a
score, the highest scoring proper noun chain and noun chain are retained for the
next step in the extraction process. If the highest score is shared by more than one
chain (of either chain type) then these chains are also retained.
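A short sketch of Equation 8.1 is given below, assuming each chain is held as a list of (term, repetition count, linking relation) triples; the relationship weights are those listed above, and the {hospital, infirmary, hospital, hospital} example reproduces the score of 5.8.

```python
# Relationship strengths from Table 3.6 (Section 3.5).
REL_SCORES = {"repetition": 1.0, "synonymy": 0.9, "antonymy": 0.7,
              "hyponymy": 0.7, "hypernymy": 0.7, "meronymy": 0.7,
              "holonymy": 0.7, "wordnet_path": 0.4, "statistical": 0.4}

def score_chain(chain):
    """Equation 8.1: for each distinct chain member i, add its repetition
    count to the strength of the relation that linked it into the chain."""
    return sum(reps + REL_SCORES[relation] for _term, reps, relation in chain)

hospital_chain = [("hospital", 3, "synonymy"), ("infirmary", 1, "synonymy")]
assert abs(score_chain(hospital_chain) - 5.8) < 1e-9
```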
Once the key noun and proper noun phrases have been identified, the next step is
to score each sentence in the text based on the number of key chain words it
contains as follows:
score(sentence) = \sum_{i=1}^{n} score(chain)_i        (8.2)
where score(chain)i is zero if word i in the current sentence of length n does not
occur in one of the key chains, otherwise score(chain)i is the score assigned to the
chain where i occurred.
Once all sentences have been scored and ranked, the highest ranking sentence is
then extracted and used as the gist for the news article36. This final step in the
extraction process is based on the hypothesis that the key sentence in the text will
contain the most key chain words. This is analogous to saying that the key sentence
will be the sentence that is most cohesively strong with respect to the rest of the
text. If it happens that more than one sentence has been assigned the maximum
sentence score then the sentence nearest the start of the story is chosen, since lead
sentences in a news story tend to be better summaries of its content.
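Equation 8.2 and the extraction step can likewise be sketched as follows, assuming the key chains are held as (set of chain words, chain score) pairs and that a word is counted against at most one key chain.

```python
def score_sentence(tokens, key_chains):
    """Equation 8.2: sum the scores of the key chains in which each sentence
    word occurs (words outside the key chains contribute zero)."""
    total = 0.0
    for word in tokens:
        for chain_words, chain_score in key_chains:
            if word in chain_words:
                total += chain_score
                break
    return total

def extract_gist(sentences, key_chains):
    """Return the highest scoring sentence; ties are broken in favour of the
    sentence nearest the start of the story."""
    scores = [score_sentence(s.split(), key_chains) for s in sentences]
    best = max(scores)
    return sentences[scores.index(best)]   # index() returns the earliest tie
```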
Another consideration in the extraction phase is the occurrence of dangling
anaphors in the extracted sentence, e.g. references to pronouns such as ‘he’ or ‘it’
that cannot be resolved within the context of the sentence. In order to address this
problem, we use a common heuristic that states that if the gist begins with a
pronoun then the previous sentence in the text is chosen as the gist. However, when
we tested the effect of this heuristic on the performance of our algorithm we found
that the improvement was insignificant. We have since established that this is the
case because the extraction process is biased towards choosing sentences with
important proper nouns due to the inclusion of proper noun chains in the gisting
process. The effect of this is an overall reduction in the occurrence of dangling
anaphors in the resultant gist.
8.3
Experimental Methodology
Our evaluation methodology establishes gisting performance using an automatic
evaluation based on the same framework proposed by Witbrock and Mittal (1999),
36 At this point in the algorithm it would also be possible to generate longer-style summaries by selecting the top n ranked sentences.
where recall, precision and the F1 measure are used to determine the similarity
between a gold standard or reference title and a system generated title. In the
context of this experiment these IR evaluation metrics are defined as follows:
Recall is the number of words that the reference and system titles have in
common divided by the number of words in the reference title.
Precision is the number of words that the reference and system titles have in
common divided by the number of words in the system title.
F1 measure is the harmonic mean of the recall and precision metrics (defined in
Section 4.1.2, Equation 4.10).
In order to determine how well our lexical chain-based gister performs, the
automatic part of our evaluation compares the recall, precision and F1 metrics of
four baseline extractive gisting systems with the LexGister system. A brief
description of the techniques employed in each of these systems is provided below:
A baseline lexical chaining extraction approach (LexGister(b)) that works in
the same manner as the LexGister system except that it uses a basic version of the
LexNews chaining algorithm, i.e. it ignores statistical associations between
words in the news story and proper nouns that do not occur in the WordNet
thesaurus.
A tf.idf-based approach (TFIDF) that ranks sentences in the news story with
respect to the sum of their tf.idf weights for each word in a sentence. The idf
statistics were generated from the TDT1 corpus.
A lead sentence-based approach (LEAD) that in each case chooses the first
sentence in the news story as its gist. In theory this simple method should
perform well due to the pyramidal nature of news stories, i.e. the most important
information occurs at the start of the text followed by more detailed and less
crucial information. In practice, however, due to the presence of segmentation
errors in our data set, it will be shown in Section 8.4 that a more sophisticated
approach is needed.
A random approach (RANDOM) that randomly selects a sentence as an
appropriate gist for each news story. This approach represents a lower bound on
gisting performance for our data set.
In Chapters 5 and 7 our New Event Detection and News Story Segmentation
evaluations focussed on newswire and broadcast news transcripts taken from the
TDT1 and TDT2 corpora. However, for our gisting evaluation we collected a
corpus of 246 error-prone closed caption news stories captured from RTÉ Irish
broadcast news programmes. We manually annotated this corpus with a set of gold
standard human generated titles taken from the www.rte.ie/news website. However,
there is a marked difference between what is meant by ‘error-prone closed caption
material’ and ‘error-prone ASR broadcast news transcripts’, where TDT ASR
transcripts are primarily affected by limited capitalisation, and some segmentation
and spelling errors. On the other hand, the RTÉ closed caption material is
capitalised, but suffers from breaks in transmission (in some cases missing
words/sentences) and story segmentation errors are more prevalent than in the TDT
transcripts due to an LDC ‘clean-up’ attempt on the TDT2 corpus37. The extrinsic
evaluation results discussed in the following section are generated from all 246
stories.
8.4
Gisting Results
As previously explained recall, precision and F1 measures are calculated based on a
comparison of the 246 generated news titles against a set of reference titles taken
from the RTÉ news website. However, before the overlap between a system and
reference headline for a news story is calculated both titles are stopped and
stemmed using the standard InQuery stopword list (Callan et al., 1992) and the
Porter stemming algorithm (Porter, 1997). The decision to stop reference and
system titles before comparing them is based on the observation that some title
words are more important than others. For example if the reference title is
‘Government still planning to introduce the proposed anti-smoking law’ and the
system title is ‘The Vintners Association are still looking to secure a compromise’
then they share the words ‘the’, ‘still’, and ‘to’, and the system title will have
successfully identified 3 out of the 9 words in the reference title, resulting in
misleadingly high recall (0.33) and precision (0.3) values. Another problem with
automatically comparing reference and system titles is that there may be instances
of morphological variants in each title, such as ‘introducing’ and ‘introduction’, that
37 Appendix D contains sample documents from each of the news collections used in the thesis: TDT newswire, TDT1 broadcast transcripts, TDT2 broadcast transcripts, and RTÉ closed caption material.
without the use of stemming will make titles appear less similar than they actually
are.
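The comparison itself can be sketched as follows, treating each title as a set of stopped and stemmed words; the Porter stemmer is taken from NLTK here and the stopword list is a small illustrative stand-in for the InQuery list.

```python
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "and", "are", "still"}
stem = PorterStemmer().stem

def normalise(title):
    return {stem(word) for word in title.lower().split()
            if word not in STOPWORDS}

def title_overlap(reference_title, system_title):
    """Recall, precision and F1 over stopped and stemmed title words."""
    ref, sys_words = normalise(reference_title), normalise(system_title)
    common = len(ref & sys_words)
    recall = common / len(ref) if ref else 0.0
    precision = common / len(sys_words) if sys_words else 0.0
    f1 = 2 * recall * precision / (recall + precision) if common else 0.0
    return recall, precision, f1
```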
[Bar chart comparing the recall, precision and F1 scores of the Human, LexGister, LexGister(b), TFIDF, LEAD and RANDOM extractive gisters.]
Figure 8.1: Recall, Precision and F1 values measuring gisting performance for 5 distinct extractive gisting systems and a set of human extractive gists.
Figure 8.1 shows the automatic evaluation results, using the stopping and stemming
method, for the LexGister system and each of the four baseline extractive gisting methods described in Section 8.3.
For this experiment we also asked a human judge to extract the sentence that best
represented the essence of each story in the test set. Hence, the F1 value 0.25
achieved by these human extracted gists represents an upper bound on gisting
performance. As expected our lower bound on performance, the RANDOM system,
is the worst performing system with an F1 measure of 0.07. The LEAD sentence
system also performs poorly (F1 0.08), which helps to illustrate that a system that
simply chooses the first sentence in this instance is not an adequate solution to the
problem. A closer inspection of the collection shows that 69% of stories have
segmentation errors which accounts for the low performance of the LEAD and
RANDOM gisters. On the other hand, the LexGister outperforms all other systems
with an F1 value of 0.20. A breakdown of this value shows a recall of 0.42, which
means that on average 42% of words in a reference title are captured in the
corresponding system gist generated for a news story. In contrast, the precision
value for the LexGister is much lower where only 13% of words in a gist are
reference title words. The precision values for the other systems show that this is a
characteristic of extractive gisters since extracted sentences are on average two
thirds longer than reference titles. This point is illustrated in the following example
where the recall is 100% but the precision is 50% (in both cases stopwords are
ignored).
Gist: “The world premier of the Veronica Guerin movie took place in Dublin's Savoy Cinema, with Cate Blanchett in the title role.”
Reference Title: “Premier of Veronica Guerin movie takes place in Dublin”.
This example also shows that some form of sentence compression is needed if the
LexGister were required to produce titles as opposed to gists, which would in turn
help to increase the precision of the system. However, the higher recall of the
LexGister system verifies that lexical cohesion analysis is better at capturing the
focus of a news story than a statistical-based approach using a tf.idf weighting
scheme.
Another important result from this experiment is the justification of our
enhanced LexNews algorithm (which incorporated statistical word associations and
non-WordNet proper nouns in the chaining process). Figure 8.1 illustrates how the
LexGister system (F1 0.20) outperforms the baseline version, LexGister(b) (F1
0.17). Although our data set for this part of the experiment may be considered small
in IR terms, a two-sided t-test of the null hypothesis of equal means shows that all
system results are statistically significant at the 1% level, except for the difference
between the RANDOM and LEAD results, and the TFIDF and LexGister(b) results
which are not significant.
One of the main criticisms of an automatic evaluation experiment, such as the
one just described, is that it ignores important summary attributes such as
readability and grammatical correctness. It also fails to recognise cases where
synonymous or semantically similar words are used in a system and reference title
for a news story, e.g. ‘killed’ and ‘murdered’, or ‘9-11’ and ‘September 11’. This is a
side effect of our experimental methodology where the set of gold standard human
generated titles contain many instances of words that do not occur in the original
text of the news story. Examples like these account for a reduction in gisting
performance, and illustrate the importance of an intrinsic or user-oriented
evaluation when determining the ‘true’ quality of a gist.
Consequently, we conducted an evaluation experiment, based on one proposed
by Jin and Hauptmann (1999), involving human judges that addresses these
concerns. However, due to the overhead of relying on human judges to rate gists for
all of these news stories we randomly selected 100 LexGister gists for the manual
part of our evaluation. We then asked six judges to rate LexGister’s titles using five
different quality categories ranging from 5 to 1 where ‘very good = 5’, ‘good = 4’,
‘ok = 3’, ‘bad = 2’ and ‘very bad = 1’. Judges were asked to read the closed caption
text for a story, and then rate the LexGister headline based on its ability to capture
the focus of the news story. The average score for all judges over each of the 100
randomly selected titles was an average score of 3.56 (i.e. gists were ‘ok’ to ‘good’)
with a standard deviation of 0.32 indicating strong agreement among the judges.
Since judges were asked to rate gist quality based on readability and content
there were a number of situations where the gist may have captured the crux of the
story but its rating was low due to problems with its fluency or readability. These
problems are a side effect of dealing with error-prone closed caption data that
contains both segmentation errors and breaks in transmission. To estimate the
impact of this problem on the rating of the titles we also asked judges to indicate if
they believed that the headline encapsulated the essence of the story disregarding
grammatical errors. This score was a binary decision (1 or 0), where the average
judgement was that 81.33% of titles captured the central message of the story with a
standard deviation of 10.52%. This ‘story essence’ score suggests that LexGister
headlines are in fact better than the results of the automatic evaluation suggest,
since the problems resulting from the use of semantically equivalent yet
syntactically different words in the system and reference titles (e.g. Jerusalem,
Israel) do not apply in this case. However, reducing the number of grammatical
errors in the gists is still a problem as 36% of headlines contain these sorts of errors
due to ‘noisy’ closed caption data. An example of such an error is illustrated below
where the text in italics at the beginning of the sentence has been incorrectly
concatenated to the gist due to a transmission error:
“on tax rates relating from Tens of thousands of commuters travelled free of
charge on trains today.”
It is hoped that the sentence compression strategy, briefly discussed in Section
8.5, will be able to remove unwanted elements of text like this from the gists. One
final comment on the quality of the gists relates to the occurrence of ambiguous
expressions, which occurred in 23% of system generated headlines. For example,
consider the following gist which leaves the identity of ‘the mountain’ to the
reader’s imagination:
“A 34-year-old South African hotel worker collapsed and died while coming down
the mountain”.
To solve this problem a ‘post-gisting’ component would have to be developed that
could replace a named entity with the longest sub-string that co-refers to it in the
text [22], thus solving the ambiguous location of ‘the mountain’.
8.5
Discussion
In this chapter we briefly discussed the area of text summarisation, in particular
News Story Gisting in the broadcast news domain. In Section 8.2, we presented our
novel lexical chaining-based approach to news story gisting, the LexGister system,
and in Section 8.4 we explored the robustness of this technique with respect to
‘noisy’ closed caption material from news programmes. The results of our intrinsic
and extrinsic evaluation methodologies indicate that this technique is a more
effective means of generating a compact and readable headline from a news story
text than a ‘bag-of-words’ technique. Another important outcome of our gisting
experiment was the notable improvement in performance when the enhanced
LexNews algorithm was used, indicating that there are benefits from generating a
more comprehensive representation of the lexical cohesive structure of a text when
generating an extractive summary.
The next stage in our research is to explore current trends in title generation that
use linguistically motivated heuristics to reduce a gist to a skeletal form that is
grammatically and semantically correct (Fuentes et al., 2003; Dorr, Zaijc, 2003;
McKeown et al., 2002; Daume et al., 2002). We have already begun working on a
technique that draws on parse tree information for distinguishing important clauses
in sentences using the original lexical chains generated for the news story to weight
each clause. This will allow the LexGister to home in on the grammatical unit of the
sentence that is most cohesive with the rest of the news story, resulting in a more
compact news story title.
Re-evaluating the performance of the LexGister using the new ROUGE
evaluation metric is also a future goal of our research. ROUGE is a recall oriented
metric that calculates the n-gram overlap between a set of reference (i.e. human
generated) summaries and a single system generated summary. Experiments have
shown that this metric corresponds well with human summary quality judgements
(Lin, Hovy, 2003), which represents an exciting development for summarisation
research because of the large effort involved in manually determining summary
quality. The large scale usability of ROUGE is currently being investigated in the
context of DUC (Document Understanding Conference) 2004: a research initiative
similar to TDT that invites participants to generate summaries38 for a previously
unseen corpus of documents, which are evaluated and then discussed at the annual
DUC workshop.
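As an illustration of the underlying calculation, a minimal ROUGE-N recall against a single reference summary might look as follows; the official ROUGE package additionally supports multiple references, stemming, stopword removal and other options.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, system, n=1):
    """Fraction of reference n-grams also found in the system summary,
    using clipped counts as in ROUGE-N."""
    ref = ngram_counts(reference.lower().split(), n)
    sys_ = ngram_counts(system.lower().split(), n)
    overlap = sum(min(count, sys_[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```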
38 The DUC 2004 initiative has defined five tasks: (task 1) very short single-document English summary generation (~10 words); (task 2) short English multi-document summary generation (~100 words); (tasks 3 and 4) generate English summaries from manually and automatically translated Arabic news documents, replicating tasks 1 and 2; (task 5) generate answers to ‘Who is X’ style questions where X is a person or group of people.
Chapter 9
Future Work and Conclusions
In this chapter we discuss some future directions for this work, including some
suggestions for improving our LexNews algorithm and a description of how multi-document
summarisation can benefit from lexical cohesion analysis. The remainder
of the chapter highlights the research contributions and conclusions of this thesis.
9.1 Further Lexical Chaining Enhancements
In Chapter 3 we proposed our own novel approach to lexical chaining based on an
algorithm proposed by Hirst and St-Onge (1998). One of the fundamental
operations of any WordNet-based chaining algorithm is the estimation of semantic
relatedness between nouns with respect to their semantic distance in the WordNet
taxonomy. As explained in Section 3.1, St-Onge and Hirst’s measure of semantic
association is calculated with respect to the number of edges between two nouns in
the taxonomy and the number of direction changes on this path (i.e. semantically
opposite relations), where a number of ‘rules-of-thumb’ are used to ensure that
spurious links are minimised. However, as explained in Section 2.4.4, an edge-counting
measure like St-Onge’s assumes that all edges in the taxonomy are of
equal length and that all branches are equally dense. These assumptions
are false in the case of WordNet, and so edge counting is at best a very rough
estimate of semantic relatedness.
We have found that this measure of association used during the generation of
lexical chains is, on occasion, less than satisfactory. Consequently, in the next phase
of our research we intend to experiment with a number of different measures of
semantic distance in WordNet, such as those recently compared and contrasted by
Budanitsky and Hirst with respect to their effect on the performance of
malapropism correction and detection (Budanitsky, 1999; Budanitsky and Hirst,
2001). Their overall finding was that the weakest measure of semantic distance was
the aforementioned St-Onge and Hirst approach, which Budanitsky describes as
‘being far too promiscuous in its judgement of relatedness’. In addition, Budanitsky
found that Jiang and Conrath’s information theoretic-based measure yielded the
best malapropism detection results (Jiang, Conrath, 1997). This approach attempts
to improve a basic edge counting metric by verifying the correctness of a WordNet
association with respect to a set of corpus statistics, i.e. it considers both the number
of edges between the two nodes, and the conditional probability of finding an
instance of a child node given the occurrence of a parent node. However, like the
other approaches that Budanitsky examined, the Jiang-Conrath measure only looks
at specialisation/generalisation relationships between nouns. Hence, Budanitsky
concedes that St-Onge’s measure might work better if a more constrained version
were used, e.g. if paths that traverse from the specialisation/generalisation
taxonomy to the whole/part taxonomy were ignored.
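For reference, the Jiang–Conrath measure is normally stated in information-theoretic terms (the standard formulation from the literature, not an equation reproduced from this thesis):

    \mathrm{dist}_{JC}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}\bigl(\mathrm{lso}(c_1, c_2)\bigr),
    \qquad \mathrm{IC}(c) = -\log P(c)

where P(c) is the probability of encountering an instance of concept c in the training corpus and lso(c1, c2) is the lowest super-ordinate (most specific common ancestor) of the two concepts in the taxonomy; smaller distances indicate stronger relatedness.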
Another interesting avenue for future research arises from the recent release of
two expanded versions of the WordNet thesaurus. Firstly, the latest official release
of the taxonomy, WordNet 2.0, by the Cognitive Science Laboratory at Princeton
University is the first attempt at improving the connectivity of the taxonomy with
respect to three areas:
Topical Clustering: WordNet 2.0 has organised related nouns into topical
categories such as ‘criminal law’ or ‘the military’, although much of the work in
this area has focussed on vocabulary related to terrorism.
Derivational Morphology: Currently, WordNet 2.0 links derivationally related
and semantically related noun/verb pairs such as ‘summarise/summary’ and
‘examine/examination’, which results in 42,000 new connections between these
two syntactic categories. However, links between other parts-of-speech such as
adverbs-adjectives (quickly-quick) are planned for future releases.
Gloss Term Disambiguation: In future releases, gloss definitions will be
tagged with synset numbers. This will provide a broader context for each sense
of a particular word form and will also help to increase the connectivity between
different syntactic categories in the taxonomy.
In addition to this research effort at Princeton, the University of Texas at Dallas
has released an enhanced version of WordNet called eXtended WordNet. The
objectives of both these research projects are closely related since they both aim to
increase the connectivity of the semantic network by exploiting information
contained in the gloss for each word sense. In this regard, the research effort at the
University of Texas leads the way, since they have already developed and
implemented automatic methods that syntactically parse and semantically tag each
gloss word (noun, verb, adjective and adverb) with a synset number39.
In the course of our research we also investigated a method of increasing noun
connectivity by incorporating statistical word associations into the LexNews
chaining process. However, in Section 3.2.1 we highlighted a number of
inadequacies with this approach:
Words linked with respect to this form of lexical cohesion are not mapped to
WordNet synsets. Hence, statistical word associations in the chaining process
fail to consider instances of polysemy. For example, the noun ‘noise’ would be
added to the following chain {racket, sports implement, hockey stick} through a
statistical relationship with the word ‘racket’ in spite of the fact that ‘racket’ is
being used in the ‘sport’ sense in this context.
Biases in the training corpus can create less than intuitive associations, e.g. in
the TDT1 corpus there is a strong statistical association between ‘glove’ and
‘blood’ due to the large portion of documents discussing the OJ Simpson case.
In some instances statistical associations also capture thesaural relationships
between words, because (not surprisingly) these types of word pairs also tend to
co-occur frequently in text. This leads to additional and unnecessary word
comparisons during the search for statistical associations between related
candidate terms in the chaining process since thesaural relationship searches
(i.e. Extra Strong, Strong and Medium Strength relationship searches) precede
the statistical association search.
In Section 3.2.1, we explained that in order to provide a full lexical cohesive
analysis of a text, as defined by Halliday and Hasan (1976), statistically derived
relationships would also have to be considered in the chaining process. However,
using eXtended WordNet in the chaining process would provide a ‘cleaner’ method
of considering these associations than our method of integration using corpus
statistics, since the chaining algorithm would not be affected by the problems
outlined above. Consequently, this represents a viable and promising direction for
future lexical chaining research. However, the inclusion of statistically derived
associations will more than likely still play a role in the chaining process for two
reasons. Firstly, word associations established through gloss definitions will have a
binary score (i.e. the word either exists in the gloss of another word or it does not).
Secondly, there will be many instances of loosely related associations based on
these gloss definitions. Hence, a more sensitive measure of relatedness will be
needed, which could be measured using corpus statistics gathered from an
appropriate domain. In the following section we propose another lexical chaining
application that represents a promising future direction for our research.
39 WordNet 2.0 is available for download from http://www.cogsci.princeton.edu/~wn/ and the latest
version of eXtended WordNet can be downloaded from http://xwn.hlt.utdallas.edu/index.html (last
checked March 2004).
9.2 Multi-document Summarisation
Many of the lexical chaining approaches reviewed in Chapter 2 were applied to text
summarisation tasks, in particular single document summarisation (Barzilay,
Elhadad, 1997; Silber, McCoy, 2000; Brunn, Chali, Pinchak, 2001; Bo-Yeong,
2002; Alemany, Fuentes, 2003). Like many of these researchers we found that
lexical chain-based gisting performance could outperform a standard ‘bag-of-words’
technique (Section 8.4), thus reinforcing the consensus that summarisation is
an area that can benefit greatly from lexical cohesion information.
Another important outcome of our gisting experiment was the notable
improvement in performance when the enhanced LexNews algorithm was used.
This indicates that there are benefits from generating a more comprehensive
representation of the lexical cohesive structure of a text when generating an
extractive summary. Hence, another promising area for future work would be the
application of this technique to other summarisation tasks. In particular, there is a
lot of scope for experimentation in the area of multi-document summarisation
where the task is to provide an overview of content, given a cluster of related
articles. Multi-document summarisation has proven to be a more challenging area
of summarisation research for a number of reasons, including:
The need for summary adaptation with respect to the strength of association
between the documents in the cluster. For example, if the documents are
strongly related then the summary should be based on their commonality.
However, if the documents are loosely related (i.e. contain a lot of non-overlapping
novel information) it is more difficult to decipher which novel
elements should be included.
Fluency also plays a more dominant role in multi-document summarisation,
since, unlike in single-document summaries, sentences cannot be ordered with
respect to their original position in their source document. Fluency is also a
major concern within a summary sentence if the summarisation technique
attempts to fuse different pieces of information scattered across related
sentences into a single sentence40.
40 A more detailed discussion on information fusion, fluency and sentence ordering for
multi-document summarisation can be found in (Barzilay, 2003).
With regard to summary adaptation, our composite document representation
described in Chapter 5 could be a promising approach to this problem. While
determining similarity in the New Event Detection task involves on-topic and off-topic
(or unrelated) document comparisons, similarity in the multi-document
summarisation sense only requires a measurement of how similar related documents
are to each other, which might prove to be a more appropriate application for a
measure of similarity based on lexical cohesion analysis. With regard to fluency,
lexical chains could also play an important role in enforcing thematic continuity
when ordering sentences in a multi-document scenario, since one of the
characteristics of fluent, coherent text is the presence of lexical cohesion. To date
there have been very few attempts to develop a lexical chain-based multi-document
summarisation strategy (Chali et al., 2003). Hence, this represents an exciting new
application for lexical cohesion analysis.
9.3 Thesis Contributions
In Sections 9.1 and 9.2 we highlighted areas of our lexical chaining research that
would benefit from future work. In this section we summarise the main
contributions of the work presented in this thesis, which are organised under the
following headings: Lexical Chaining, New Event Detection, News Story
Segmentation and News Story Gisting.
Lexical Chaining
The design and development of a novel news-oriented lexical chaining
algorithm, the LexNews system, that considers, in addition to the standard
lexical cohesive relationships found in WordNet, news-specific statistical word
associations and an extended definition of lexical repetition based on the fuzzy
matching of noun phrases referring to named entities such as people, places and
organisations.
An exploration of greedy and non-greedy approaches to lexical chaining,
evaluated with respect to disambiguation accuracy on a portion of the SemCor
corpus.
An investigation into the suitability of lexical cohesion analysis to Topic
Detection and Tracking tasks.
New Event Detection
The design, implementation and evaluation of a novel method of integrating
lexical cohesion analysis into an IR model used to detect breaking news stories
on a news stream, motivated by the TDT initiative’s request for techniques that
went beyond the traditional ‘bag of words’ approach.
An investigation into the extent to which our lexical chain-based approach, the
LexDetect system, is affected by ‘noisy’ broadcast news data.
News Story Segmentation
The design, implementation and evaluation of our lexical chain-based approach
to text segmentation, the SeLeCT system, in a novel application of text
segmentation.
An examination of segmentation performance in terms of a variety of evaluation
metrics.
An investigation into the effects of lexical cohesion relationships on
segmentation accuracy.
An exploration of the effects of written and spoken news sources on
segmentation performance.
News Story Gisting
The design, implementation and evaluation of our lexical chain-based News
Gister, the LexGister system.
An analysis of LexGister performance on error-prone closed caption material
using intrinsic and extrinsic evaluations.
9.4 Thesis Conclusions
In this, the final section of the thesis, we clarify and summarise the principal
findings of our work. These conclusions are split into two main areas: those
pertaining to our lexical chaining algorithm LexNews, and those relating to the
success of its application to the TDT tasks explored in the thesis.
Greedy and Non-Greedy Lexical Chaining: Despite the fact that greedy
approaches to lexical chain generation have been largely disregarded of late by
researchers, the results of the evaluation described in Chapter 3 indicate that no
significant gain in performance is achieved when a non-greedy, more
computationally expensive approach is used. Lexical chaining performance, in this
instance, was measured with respect to WordNet-based disambiguation accuracy on
the SemCor corpus. This result validated the use of our semi-greedy chaining
approach, LexNews, to address the various IR and NLP tasks examined in this
thesis.
News Story Document Representation using Lexical Chains: Our composite
document representation strategy, presented in Chapter 5, attempted to amalgamate
information regarding the lexical cohesive structure of a news story with a
traditional ‘bag-of-words’ representation of its content. Our experiments showed
that:
- A lexical chain representation of a document is best used as additional rather
than conclusive evidence of news story similarity in the New Event Detection
task (Section 5.5.2).
- Considering chain words as features in a vector space model marginally
outperforms a New Event Detection system that considers the chains as features
in a representation of the document (Section 5.5.3).
- Using word stems instead of WordNet concepts (i.e. synset numbers) improved
the performance of the chain word document representation with respect to the
New Event Detection task (Section 5.5.2).
- Results from experiments on the TDT2 corpus showed that New Event
Detection (NED) performance deteriorated when broadcast rather than
newswire stories were processed. This was evident from both the performance
of the UMass NED system and our lexical chain-based NED system, LexDetect.
Two factors were identified as being responsible for this decline in LexDetect’s
performance: a lack of capitalisation which added errors to lexical chaining
preprocessing steps (e.g. noun phrase identification based on part-of-speech tag
information), and the presence of very short news reports (35.3% of documents
in the TDT2 corpus consist of less than 100 words) making it difficult for our
lexical chaining algorithm, LexNews, to perform a lexical cohesive analysis of
these short texts (Section 5.6.2).
- Although initial experiments on the TDT1 corpus indicated that the LexDetect
system could outperform a simple ‘bag-of-words’ approach to the NED
problem, these results could not be replicated on the ‘noisier’ TDT2 corpus
when compared with the performance of the UMass NED system.
News Story Segmentation using Lexical Chains: In Chapter 7, our News Story
Segmentation system, SeLeCT, was evaluated against two well-known
lexical cohesion-based approaches to segmentation, the C99 and TextTiling
algorithms, on a collection of CNN broadcast news stories and Reuters news
articles. Our experiments showed that:
- The SeLeCT algorithm outperformed the C99 and TextTiling algorithms on the
CNN news collection. However, on the Reuters newswire collection the C99
algorithm performed best, followed by the SeLeCT system and finally the
TextTiling algorithm (Section 7.3).
- Only pure lexical repetition proved to be a useful lexical cohesive relationship
for detecting boundaries between adjacent stories (Section 7.3.4).
- Like the New Event Detection results, significant degradations in performance
were observed on the CNN (spoken) collection compared with the
Reuters (written) news collection, even though in this case the CNN broadcast
news documents were manually transcribed and so were correctly punctuated
and capitalised. This led us to investigate the differences between written and
spoken language modes, and how they convey information in a news story.
Using Halliday’s observation that ‘written language represents phenomena as
products (nouns), while spoken language represents phenomena as processes
(verbs)’, we adapted our SeLeCT system and showed that gains in broadcast
(spoken) News Story Segmentation performance could be achieved by including
nominalisable verbs in the chaining process. In addition, the C99 and TextTiling
performance improved when all non-nominalisable verbs were eliminated from
their input, since these verbs tend to be less ‘informative’ and more commonly
occurring, hence adding additional noise to the segmentation process (Section
7.4).
News Story Gisting using Lexical Chains: Our work on News Story Gisting,
described in Section 8.2, provides further evidence that lexical cohesion analysis
can make a positive contribution to text summarisation tasks. Another interesting
conclusion of this initial investigation is that lexical chaining is robust enough to be
able to deal with broadcast news segmentation errors in closed caption material,
leading us to conclude that although ASR transcripts pose a significant challenge
for any NLP-based approach, there is still use for these techniques in the broadcast
news environment on ‘cleaner’ data samples like closed caption material.
The performance of the LexNews chaining algorithm: In Chapter 3 we
introduced the enhanced LexNews algorithm, which focuses on the exploration of
lexical cohesive relationships in a broadcast news environment. Throughout the
thesis the impact of these enhancements on the various TDT applications was
considered. More specifically:
- In Section 5.6, a version of our NED system LexDetect, using the basic
LexNews algorithm, was outperformed by a version using the enhanced
LexNews algorithm.
- However, as explained in Section 7.3.4, News Story Segmentation performance
deteriorates when additional lexical cohesive relationships outside of exact
repetition are explored, prompting the conclusion that the scope of the lexical
cohesion analysis provided by the basic and enhanced versions of the LexNews
algorithm is too broad.
- On the other hand, the News Story Gisting application of our chaining algorithm
produced some definite evidence to support the inclusion of statistical word
associations and extended noun phrase matching as presented in this thesis, in
future implementations of lexical chain-based summarisation. However, it
remains to be seen whether the additional lexical cohesive links provided by the
recently released WordNet 2.0 and eXtended WordNet taxonomies can
lead to further gisting improvements.
Appendix A
The LexNews Algorithm
The purpose of this appendix is to provide the reader with a more formal
description of the LexNews algorithm described in Section 3.2.3 of the thesis. The
LexNews chaining approach was designed and developed with the intention of
building lexical chains for broadcast news and newswire text. The underlying
algorithm is based on a technique by Hirst and St-Onge (1998), and Section A.1 is
based on St-Onge’s formal description (1995) of his own algorithm. Section A.2
contains a list of stopwords, more specifically WordNet nouns that are excluded
from the chaining process due to their tendency to cause spurious chains, i.e. chains
containing incorrectly disambiguated or weakly cohesive chain members. This list
of ‘problematic’ nouns is a combination of manually (domain-specific) and
automatically identified words that tend to subordinate a higher than average
number of nouns in the WordNet taxonomy, and consequently are responsible for
the creation of weakly cohesive lexical chains.
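The thesis does not give the automatic identification procedure in full detail, but the underlying idea, flagging nouns that subordinate an unusually large number of synsets, can be sketched as follows (an illustrative Python sketch assuming the NLTK WordNet interface, which is not the tooling used in this work):

    # Sketch: count how many synsets are subordinate to any noun sense of a
    # word; nouns with a far higher than average count are stopword candidates.
    from nltk.corpus import wordnet as wn

    def hyponym_count(noun):
        total = 0
        for synset in wn.synsets(noun, pos=wn.NOUN):
            # transitive closure over the hyponym relation
            total += len(set(synset.closure(lambda s: s.hyponyms())))
        return total

    # e.g. hyponym_count('entity') is orders of magnitude larger than
    # hyponym_count('journalist'), so 'entity' appears in the list in Section A.2.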
A.1 Basic Lexical Chaining Algorithm
LexNews Chaining algorithm
XS_R = Extra Strong Relation
FXS_R = Fuzzy (proper noun) Extra Strong Relation
S_R = Strong Relation
MS_R = Medium-Strength Relation
SW_R = Statistical Word Relation
1.    if (new_word_sentence_number = current_sentence_number) then
2.        if ((current_word.type == `NN`) and (current_word.type != stopword)) then
3.            NN_queue.push(new_word);
4.        end if
5.        else /* current_word.type == `PN` */ then
6.            PN_queue.push(new_word);
7.        end if
8.    end if
9.    else
10.       current_sentence_number = new_word.sentence_number;
11.       /* Begin NN chaining */
12.       for current_word from NN_queue.first to NN_queue.last do
13.           if (NN_chain_stack.try_to_chain(current_word, XS_R)) then
14.               NN_queue.remove(current_word);
15.           end if
16.       end do
17.       for current_word from NN_queue.first to NN_queue.last do
18.           if (NN_chain_stack.try_to_chain(current_word, XS_R) or
19.               NN_chain_stack.try_to_chain(current_word, S_R)) then
20.               NN_queue.remove(current_word);
21.           end if
22.       end do
23.       for current_word from NN_queue.first to NN_queue.last do
24.           if (NN_chain_stack.try_to_chain(current_word, XS_R) or
25.               NN_chain_stack.try_to_chain(current_word, S_R) or
26.               NN_chain_stack.try_to_chain(current_word, MS_R)) then
27.               NN_queue.remove(current_word);
28.           end if
29.       end do
30.       for current_word from NN_queue.first to NN_queue.last do
31.           if (NN_chain_stack.try_to_chain(current_word, XS_R) or
32.               NN_chain_stack.try_to_chain(current_word, S_R) or
33.               NN_chain_stack.try_to_chain(current_word, MS_R) or
34.               NN_chain_stack.try_to_chain(current_word, SW_R)) then
35.               NN_queue.remove(current_word);
36.           end if
37.       end do
38.       for current_word from NN_queue.first to NN_queue.last do
39.           NN_chain_stack.create_chain(current_word);
40.           NN_queue.remove(current_word);
41.       end do
42.       /* Begin PN chaining */
43.       for current_word from PN_queue.first to PN_queue.last do
44.           if (PN_chain_stack.try_to_chain(current_word, FXS_R)) then
45.               PN_queue.remove(current_word);
46.           end if
47.       end do
48.       for current_word from PN_queue.first to PN_queue.last do
49.           PN_chain_stack.create_chain(current_word);
50.           PN_queue.remove(current_word);
51.       end do
52.   end if
As explained in Section 3.2 the tokeniser identifies candidate terms that should
be included in the chaining process. There are two types of candidate terms:
WordNet noun phrases (NN) and non-WordNet proper noun phrases (PN). During
the chaining process separate sets of non-overlapping NN and PN chains are
generated. These two sets of chains are non-overlapping because the algorithm has
no means of associating non-WordNet proper nouns with WordNet nouns as they
are not linked in the noun database.
The algorithm begins by reading in candidate terms as they occur in the original
source text. The algorithm then pushes each term from the current sentence onto
either the NN_sentence_queue or the PN_sentence_queue depending on the type
assigned to the phrase by the tokeniser. In the case of the WordNet noun phrases,
only non-stopword terms are pushed on the NN_sentence_queue (lines 1-8).
Once all words for a particular sentence have been read in, the chain formation
process can begin. The algorithm begins by generating NN chains. So for each word
in the NN_sentence_queue, an extra strong relationship is sought in the
NN_chain_stack, which stores all WordNet noun phrase chains in the order in
which they were last updated. If a match is found then the candidate term in the
sentence queue is added to the related chain, the chain is then moved to the head of
the chain stack and the noun phrase is removed from the NN_sentence_queue.
Otherwise, if a match is not found, then the current phrase remains in the sentence
queue and awaits further processing (lines 12-16).
The algorithm then iterates though the sentence queue again, this time searching
for strong relationships between queue words and chain words. Again, if a sentence
word is related to a chain word then it is added to the chain, the chain becomes the
head of the chain stack, and the word is removed from the sentence queue (lines
17-22). This process is repeated for the medium-strength (lines 23-29) and statistical
association (lines 30-37) searches. However, in each of these loops the algorithm
checks again for the possibility of the relationships that preceded it. The reason for
these additional searches is that if, for example, a word is added to a chain based on
a medium-strength relationship (lines 23-29) this might create the possibility of a
strong relationship between the recently added word and a member of the sentence
queue that didn’t exist during the first strong relationship search (lines 17-22).
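The effect of these cumulative passes can be condensed into a short sketch (illustrative Python only, not the LexNews source; the try_to_chain callback, which also moves an updated chain to the head of the stack, is assumed to be supplied elsewhere):

    # Condensed view of lines 12-41: each pass re-tries every stronger relation
    # type before admitting the next, weaker one, and any word left unchained
    # after the final pass seeds a new chain.
    RELATION_ORDER = ['XS_R', 'S_R', 'MS_R', 'SW_R']

    def chain_sentence(nn_queue, nn_chain_stack, try_to_chain):
        for last in range(len(RELATION_ORDER)):
            allowed = RELATION_ORDER[:last + 1]       # e.g. pass 2 allows XS_R or S_R
            for word in list(nn_queue):               # copy, since we remove as we go
                if any(try_to_chain(nn_chain_stack, word, rel) for rel in allowed):
                    nn_queue.remove(word)
        for word in list(nn_queue):                   # lines 38-41: leftovers seed new chains
            nn_chain_stack.insert(0, [word])
            nn_queue.remove(word)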
Figure A.1 helps to illustrate this point, where during the first strong search
iteration (lines 17-22) a relationship is sought between ‘resident’ in the sentence
queue and ‘town’ in the chain stack. However, the chaining algorithm could not
establish a relationship since their path length in WordNet exceeds 4 edges. In fact
these terms can only be related through a transitive relationship: once a link between
‘state’ and ‘town’ has been established at line 19, and a second strong relationship
search has been completed at line 25, the algorithm finds that a
relationship between ‘state’ and ‘resident’ exists.
[Figure A.1 depicts the sentence queue {resident, state, Belfast} alongside a chain stack containing the chains {town, city_limits} and {dissident, radical}.]
Figure A.1: Chaining example illustrating the need for multiple searches.
When all searches have been completed and there are no remaining
relationships between the members of the noun phrase sentence queue and the
members of the chain stack, then each remaining candidate term in the sentence
queue becomes the head of a new chain in the NN_chain_stack (lines 38-41). This
point in the algorithm marks the end of the noun phrase chaining process and the
beginning of the chaining of non-WordNet proper noun (PN) phrases. Proper noun
chaining is similar to noun chaining in that each phrase in the PN_sentence_queue
is compared to each member of each chain stored in the PN_chain_stack (lines
42-47). However, this comparison process does not require any WordNet lookup.
Instead, the fuzzy matching function, described in Section 3.2.3, is used to seek out
links between proper noun phrases. Like the medium-strength relationship search,
not all relationships of this type are assigned the same strength, so all
relationships are sought between the sentence queue phrase and each chain. The
phrase is then added to the chain with which it holds the strongest relationship.
Once all possible proper noun phrase additions to the PN_chain_stack have been
made, the algorithm then makes each remaining sentence queue proper noun phrase
the head of a new proper noun chain (lines 48-51).
This process is repeated until each sentence in the source text has been
processed. Once all lexical chains have been generated, only proper noun and noun
phrase chains that have more than one member take part in any further processing,
i.e. New Event Detection, News Story Segmentation or News Story Gisting.
A.2 Lexical Chaining Stopword List
This stopword list is available on request from the author.
abstraction
act
action
activity
afternoon
amount
artefact
artifact
attribute
being
bit
blank
cause
content
course
day
daybreak
dimension
distance
edge
effect
end
entity
existence
extent
form
front
function
group
grouping
hour
human_action
human_activity
instance
instrumentality
instrumentation
kind
large
length
level
life_form
light
line
little
living_thing
location
look
lot
lots
manner
matter
mean
minute
morning
mortal
natural
night
noon
nothing
now
number
object
old
order
organism
part
past
people
person
phenomenon
physical_object
piece
point
portion
position
process
quality
region
relation
right
second
series
side
size
somebody
someone
something
soul
standard
state
status
stuff
subject
thing
time
try
type
unit
use
way
while
year
Appendix B
LexNews Lexical Chaining Example
The following is a sample news story (‘clean’ closed-caption material) taken from
an Irish broadcast news programme. In Section 3.5 we explored the generation of
lexical chains using the LexNews algorithm on this piece of text. In the following
sections we provide the original version of the text (Section B.1), a tagged version
(Section B.2) used as input to the Tokeniser, and the lexical chains generated from
this set of candidate terms using the enhanced version of the LexNews algorithm
(Section B.3).
B.1 News Story Text Version
All noun phrases and adjectives pertaining to nouns are marked in bold in the
following piece of text.
As Gardai launch an investigation into gangland murders in Dublin and
Limerick a film opened in Dublin tonight which recalls the killing of another
victim of organised crime in 1996.
The world premiere of the Veronica Guerin movie took place in the Dublin's
Savoy Cinema, with Cate Blanchett in the title role.
The film charts the events leading up to the murder of the Irish journalist.
Crowds gathered outside the Savoy Cinema as some of Ireland's biggest names
gathered for the premiere of Veronica Guerin, the movie.
It recounts the journalists attempts to exposed Dublin drug gangs.
But for many the premiere was mixed with sadness.
“It's odd. It can't be celebratory because of the subject matter.”
Actress Cate Blanchett takes on the title role in the movie.
It was a part she says she felt honoured to play.
“I got this complete picture of this person full of life and energy. And so that's
when it became clear the true nature of the tragedy of the loss of this
extraordinary human being, and great journalist.”
Apart from Blanchett every other part is played by Irish actors.
Her murderer was later jailed for 28 years for drug trafficking.
The film-makers say it's a story of personal courage, but for the director, there
was only one person's approval that mattered.
“A couple of months ago I brought the film to show to her mother. It was the
most pressure I've ever felt.”
But he needn't have worried.
“I see it as a tribute to Veronica, a worldwide tribute.”
B.2 Part-of-Speech Tagged Text
This tagged text was generated using the JTAG tagger (Xu, Broglio, Croft, 1994).
***000001 284
As/CS Gardai/NP launch/VB an/AT investigation/NN into/TOIN
gangland/NN murders/NNS in/IN Dublin/NP and/CC Limerick/NP a/AT
film/NN opened/VBD in/IN Dublin/NP tonight/NN which/WDT recalls/VBZ
the/AT killing/NN of/IN another/DT victim/NN of/IN organised/VBD
crime/NN in/IN 1996/CD ./. The/AT world/NN premiere/NN of/IN the/AT
Veronica/NP Guerin/NP movie/NN took/VBD place/NN in/IN Dublin/NP
's/$ Savoy/NP Cinema/NP ,/, with/IN Cate/NP Blanchett/NP in/IN
the/AT title/NN role/NN ./. The/AT film/NN charts/VBZ the/AT
events/NNS leading/VBG up/RP to/TOIN the/AT murder/NN of/IN the/AT
Irish/JJ journalist/NN ./. Crowds/NNS gathered/VBN outside/IN
the/AT Savoy/NP Cinema/NP as/CS some/DTI of/IN Ireland/NP 's/$
biggest/JJ names/NNS gathered/VBN for/IN the/AT premiere/NN of/IN
Veronica/NP Guerin/NP ,/, the/AT movie/NN ./. It/PPS recounts/VBZ
the/AT journalists/NNS attempts/NNS to/TOIN exposed/VBN Dublin/NP
drug/NN gangs/NNS ./. But/CC for/IN many/AP the/AT premiere/NN
was/BEDZ mixed/VBN with/IN sadness/NN ./. It/PPS 's/BEZ odd/JJ ./.
It/PPS can't/MD be/BE celebratory/JJ because/CS of/IN the/AT
subject/JJ matter/NN ./. Actress/NN Cate/NP Blanchett/NP takes/VBZ
on/IN the/AT title/NN role/NN in/IN the/AT movie/NN ./. It/PPS
was/BEDZ a/AT part/NN she/PPS says/VBZ she/PPS felt/VBD
honoured/VBN to/TO play/VB ./. I/PPSS got/VBD this/DT complete/JJ
picture/NN of/IN this/DT person/NN full/JJ of/IN life/NN and/CC
energy/NN ./. And/CC so/CS that/DT 's/BEZ when/WRB it/PPS
became/VBD clear/RB the/AT true/JJ nature/NN of/IN the/AT
tragedy/NN of/IN the/AT loss/NN of/IN this/DT extraordinary/JJ
human/JJ being/NN ,/, and/CC great/JJ journalist/NN ./. Apart/RB
from/IN Blanchett/NP every/AT other/AP part/NN is/BEZ played/VBN
by/IN Irish/JJ actors/NNS ./. Her/PP$ murderer/NN was/BEDZ later/RB
jailed/VBN for/IN 28/CD years/NNS for/IN drug/NN trafficking/NN ./.
The/AT film-makers/NNS say/VB it/PPS 's/BEZ a/AT story/NN of/IN
personal/JJ courage/NN ,/, but/CC for/IN the/AT director/NN ,/,
there/EX was/BEDZ only/RB one/CD person/NN 's/$ approval/NN
that/WPS mattered/VBD ./. A/AT couple/NN of/IN months/NNS ago/RB
I/PPSS brought/VBD the/AT film/NN to/TO show/VB to/TOIN her/PP$
mother/NN ./. It/PPS was/BEDZ the/AT most/AP pressure/NN I/PPSS
've/HV ever/RB felt/VBD ./. But/CC he/PPS needn't/NP have/HV
worried/VBN ./. I/PPSS see/VB it/PPO as/CS a/AT tribute/NN to/TOIN
Veronica/NP ,/, a/AT worldwide/JJ tribute/NN ./.
B.3 Candidate Terms
Below is a list of candidate terms (proper noun and noun phrases) identified by the
tokeniser for chaining. All terms highlighted in bold are stopwords (as defined by
Section A.2), and do not take part in the chaining process. The candidate term
information is in the following format where nn refers to a WordNet noun and pn a
non-WordNet proper noun:
Document identifier; Word Number; Sentence Number; Term Tag; Term
1  2    1  pn  gardai
1  3    1  nn  launch
1  5    1  nn  investigation
1  7    1  nn  gangland
1  8    1  nn  murder
1  10   1  nn  dublin
1  12   1  pn  limerick
1  14   1  nn  film
1  17   1  nn  dublin
1  18   1  nn  tonight
1  22   1  nn  killing
1  25   1  nn  victim
1  28   1  nn  crime
1  32   2  nn  world
1  33   2  nn  premiere
1  36   2  pn  veronica_guerin
1  38   2  nn  movie
1  43   2  pn  dublin’s_savoy_cinema
1  47   2  pn  cate_blanchett
1  51   2  nn  title_role
1  54   3  nn  film
1  57   3  nn  event
1  62   3  nn  murder
1  65   3  nn  ireland
1  66   3  nn  journalist
1  67   4  nn  crowd
1  71   4  pn  savoy_cinema
1  76   4  nn  ireland
1  79   4  nn  names
1  83   4  nn  premiere
1  85   4  pn  veronica_guerin
1  89   4  nn  movie
1  93   5  nn  journalist
1  94   5  nn  attempt
1  96   5  nn  exposition
1  97   5  nn  dublin
1  98   5  nn  drug
1  99   5  nn  gang
1  104  6  nn  premiere
1  108  6  nn  sadness
1  115  8  nn  celebration
1  119  8  nn  subject_matter
1  121  9  nn  actress
1  122  9  pn  cate_blanchett
1  127  9  nn  title_role
1  131  9  nn  movie
1  135  10 nn  part
1  142  10 nn  player
1  147  11 nn  picture
1  150  11 nn  person
1  153  11 nn  life
1  155  11 nn  energy
1  166  12 nn  nature
1  169  12 nn  tragedy
1  172  12 nn  loss
1  176  12 nn  human_being
1  181  12 nn  journalist
1  184  13 pn  blanchett
1  187  13 nn  part
1  189  13 nn  player
1  191  13 nn  ireland
1  192  13 nn  actor
1  194  14 nn  murderer
1  200  14 nn  years
1  202  14 nn  drug
1  205  15 nn  film_maker
1  210  15 nn  story
1  213  15 nn  courage
1  218  15 nn  director
1  224  15 nn  person
1  226  15 nn  approval
1  230  16 nn  couple
1  232  16 nn  month
1  237  16 nn  film
1  242  16 nn  mother
1  247  17 nn  pressure
1  262  19 nn  tribute
1  264  19 pn  veronica
1  268  19 nn  tribute
B.4 Weighted Lexical Chains
The mark-up and weighting scheme adopted in the following sets of lexical chains
is explained in Section 3.5.
WordNet Noun Phrase Chains
CHAIN 4; No. Words 14; Word Span 11-264; Sent Span 1-19;
[film (SEED) Freq 3 WGT 0.9 STRONG]
[movie (film) Freq 3 WGT 0.9 STRONG]
[premiere (film) Freq 3 WGT 0.4 MEDIUM]
[subject_matter (film) Freq 1 WGT 0.7 STRONG]
[actress (movie) Freq 1 WGT 0.7 STRONG]
[picture (film) Freq 1 WGT 0.9 STRONG]
[actor (actress) Freq 1 WGT 0.7 STRONG]
[film_maker (film) Freq 1 WGT 0.4 MEDIUM]
[approval (subject_matter) Freq 1 WGT 0.7 STRONG]
[story (subject_matter) Freq 1 WGT 0.4 MEDIUM]
[director (actor) Freq 1 WGT 0.4 STATISTICAL]
[tribute (approval) Freq 2 WGT 0.7 STRONG]
CHAIN 14; No. Words 3; Word Span 151-243; Sent Span 11-17;
[energy (SEED) Freq 1 WGT 0.4 MEDIUM]
[nature (energy) Freq 1 WGT 0.4 MEDIUM]
[pressure (energy) Freq 1 WGT 0.4 MEDIUM]
CHAIN 1; No. Words 6; Word Span 4-198; Sent Span 1-14;
[gangland (SEED) Freq 1 WGT 0.4 MEDIUM]
[world (gangland) Freq 1 WGT 0.4 MEDIUM]
[crowd (gangland) Freq 1 WGT 0.4 MEDIUM]
[gang (crowd) Freq 1 WGT 0.9 STRONG]
[drug (gang) Freq 2 WGT 0.4 STATISTICAL]
CHAIN 3; No. Words 7; Word Span 7-187; Sent Span 1-13;
[Dublin (SEED) Freq 3 WGT 0.7 STRONG]
[Ireland (Dublin) Freq 3 WGT 0.7 STRONG]
CHAIN 2; No. Words 9; Word Span 3-168; Sent Span 1-12;
[investigation (SEED) Freq 1 WGT 0.4 STATISTICAL]
[murder (investigation) Freq 2 WGT 0.7 STRONG]
[killing (murder) Freq 1 WGT 0.7 STRONG]
[victim (killing) Freq 1 WGT 0.4 STATISTICAL]
[crime (victim) Freq 1 WGT 0.4 STATISTICAL]
[life (murder) Freq 1 WGT 0.4 MEDIUM]
[loss (life) Freq 1 WGT 0.4 STATISTICAL]
[murderer (victim) Freq 1 WGT 0.4 MEDIUM]
CHAIN 10; No. Words 3; Word Span 63-177; Sent Span 3-12;
[journalist (SEED) Freq 3 WGT 0]
CHAIN 7; No. Words 2; Word Span 48-123; Sent Span 2-9;
[title_role (SEED) Freq 2 WGT 0]
CHAIN 9; No. Words 2; Word Span 54-112; Sent Span 3-8;
[event (SEED) Freq 1 WGT 0.4 MEDIUM]
[celebration (event) Freq 1 WGT 0.4 MEDIUM]
Non-WordNet Proper Noun Phrase Chains
CHAIN 3; No. Words 3; Word Span 33-260; Sent Span 2-19;
[Veronica_Guerin (SEED) Freq 2 WGT 0.8]
[Veronica (Veronica_Guerin) Freq 1 WGT 0.8]
CHAIN 5; No. Words 3; Word Span 44-180; Sent Span 2-13;
[Cate_Blanchett (SEED) Freq 2 WGT 0.8]
[Blanchett (Cate_Blanchett) Freq 1 WGT 0.8]
CHAIN 4; No. Words 2; Word Span 40-68; Sent Span 2-4;
[Dublin’s_Savoy_cinema (SEED) Freq 1 WGT 0.8]
[Savoy_cinema (Dublin’s_Savoy_cinema) Freq 1 WGT 0.8]
Appendix C
Segmentation Metrics: WindowDiff and Pk
In this appendix we provide a more detailed explanation of the difference between
the WindowDiff and Pk metrics which were used to evaluate segmentation
performance in Section 7.2.2 (Equations 7.2 and 7.3). In that section, we referred
briefly to Pevzner and Hearst’s (2002) paper on the shortcomings of Beeferman
et al.’s Pk metric (1999), and their proposed alternative, the WindowDiff metric,
which they state is a more intuitive and accurate means of determining
segmentation performance.
In this paper, Pevzner and Hearst informally define these two error metrics as
follows:
Pk uses a sliding window method for calculating error, where if the two ends of
the window are in different segments in the reference segmentation and in the
same segment in the system’s segmentation (or vice versa), then an error has
been detected and the error counter is incremented by 1.
WindowDiff also uses a sliding window; however, this metric compares the
number of boundaries in the window in the reference segmentation (r) with the
number of boundaries in the system’s segmentation (s) and if the number of
boundaries is not equal, then errors have been detected and the error counter is
incremented by the absolute difference between these two numbers, i.e. |r – s|.
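These informal definitions translate almost directly into code. The sketch below (illustrative Python, following the descriptions given here rather than any particular published implementation) represents a segmentation as a 0/1 list, where a 1 at position i means a boundary immediately follows text unit i, and normalises each error count by the number of window positions for window size k:

    def pk(reference, system, k):
        # Error whenever the two window ends fall in the same segment in one
        # segmentation but in different segments in the other.
        n, errors = len(reference), 0
        for i in range(n - k):
            ref_separated = sum(reference[i:i + k]) > 0
            sys_separated = sum(system[i:i + k]) > 0
            if ref_separated != sys_separated:
                errors += 1
        return errors / (n - k)

    def window_diff(reference, system, k):
        # Error counter incremented by |r - s|, the difference between the number
        # of boundaries in the reference and system windows.
        n, errors = len(reference), 0
        for i in range(n - k):
            r = sum(reference[i:i + k])
            s = sum(system[i:i + k])
            errors += abs(r - s)
        return errors / (n - k)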
In both these metrics the window size is half the average size of the segments in the
reference segmentation. Figure C.1 shows the window incrementing in units of 1,
where each block numbered 0-20 represents a unit of text, at the beginning and end
of which exists a possible boundary point. We also notice that the system has made
two segmentation errors: it has placed a false boundary point between blocks 7 and
8 (a false positive), and it has missed a boundary between blocks 12 and 13 (a false
negative). Table C.1 shows how the Pk and WindowDiff metrics calculate errors for
each of the shifting windows in Figure C.1 numbered 1 to 10. The error scores for
the Pk metric for each window show that although it detects the false negative error,
it fails to identify the false positive error, because in each window from 1 to 5 the
start and end of the window lie in different segments in the reference and system
segmentations. In comparison, the WindowDiff metric correctly identifies both
errors, because its error metric is based on the difference between the number of
boundaries in each window in the system and reference segmentations. This
example illustrates one of Pk’s flaws outlined by Pevzner and Hearst, i.e. the Pk
metric has the potential to penalise false negatives more than false positives. It also
helps to illustrate how the Pk and WindowDiff metrics are calculated. For an
explanation of other Pk flaws and an empirical justification of the WindowDiff
metric see (Pevzner, Hearst, 2002).
[Figure C.1 depicts text units 0 to 20 with the reference (REF) and system (SYS) boundary placements, and the ten sliding window positions numbered 1 to 10.]
Figure C.1: Diagram showing system segmentation results and the correct boundaries defined
in the reference segmentation. Blocks in this diagram represent units of text.
Window Iteration    Pk    WindowDiff
1                   0     1
2                   0     1
3                   0     1
4                   0     1
5                   0     0
6                   1     1
7                   1     1
8                   1     1
9                   1     1
10                  0     0
Table C.1: Error calculations for each metric for each window shift in Figure C.1.
Appendix D
Sample News Documents from Evaluation
Corpora
In this appendix we provide sample text from each of the ‘clean’ and ‘noisy’ news
sources used in the experiments described in this thesis:
New Event Detection: TDT newswire articles (clean), TDT1 broadcast
transcripts (clean), TDT2 broadcast transcripts (noisy).
News Story Segmentation: TDT newswire articles (clean), TDT1 broadcast
transcripts (clean).
News Story Gisting: RTÉ closed caption material (noisy).
As stated in Chapter 8, TDT ASR transcripts are affected primarily by limited
capitalisation, and some segmentation and spelling errors. The RTÉ closed caption
material, on the other hand, is capitalised, but suffers from breaks in transmission
(missing words/sentences). In addition, story segmentation errors are more
prevalent in this data source than in TDT transcripts, due to the manual ‘clean-up’
conducted on these transcripts by the LDC before they were released.
D.1 TDT1 Broadcast News Transcript
<DOC>
<DOCID> CNN786-5.940701 </DOCID>
<TDTID> TDT000010 </TDTID>
<SOURCE> CNN Daybreak </SOURCE>
<DATE> 07/01/94 </DATE>
<TITLE> Arafat Returns to Gaza Without Promised Money </TITLE>
<SUBJECT> Live Report </SUBJECT>
<SUBJECT> News </SUBJECT>
<SUBJECT> International </SUBJECT>
<TOPIC> Arafat, Yasser </TOPIC>
<TOPIC> Middle East--Politics and government </TOPIC>
<TOPIC> Palestinian Arabs </TOPIC>
<TOPIC> Palestinian self-rule areas </TOPIC>
<SUMMARY> Yasir Arafat is expected to return in about an hour-and-a-half for a weekend stay in the Gaza Strip. Spirits and security
are both high, as Arafat's many supporters and enemies try to make
their points. </SUMMARY>
<TEXT>
<SP> BOB CAIN, Anchor </SP> <P>
As we reported to you earlier, Palestine Liberation Organization
leder Yasir Arafat is returning to Palestine today for the first
time in 27 years - that we know of. He will be crossing the border
from Egypt into Gaza shortly. CNN Correspondent Bill Delaney is in
Gaza City with the latest developments. Bill? </P>
<SP> BILL DELANEY, Correspondent </SP> <P>
Bob, the relative nonchalance with which many Gazans greeted the
news of PLO Chairman Yasir Arafat's arrival here is finally giving
way to excitement, expectation and there is a lot of security out
on the streets. The hotel behind me is covered with security on the
roof - has been for the last 24 hours or so. Yasir Arafat's
entourage is expected to stay there and everywhere else in Gaza,
there's evidence of how concerned Palestinians are about keeping
their living legend alive. </P>
<P> By the truckload, Yasir Arafat's loyal legion, the Palestine
Liberation Army Brigade, fanned out in streets still unfamiliar to
many who've only themselves so recently returned, awaiting the man
these soldiers see as above all other - the keeper of the flame.
</P>
<SP> 1st RESIDENT </SP> <P>
We are here, all of us, to protect Yasir Arafat and to say to him
`hello.' </P>
<SP> DELANEY </SP> <P>
A brigadier general said he was not worried about security because
everyone loves Arafat. An exaggeration evidenced by the dragnet of
security everywhere in Gaza - at the hotel where Arafat's entourage
will stay and in the plaza where Arafat's expected to address tens
of thousands as crowds slowly gathered in the Mediterranean heat,
Arafat's long journey home to Gaza, where his mother's family once
lived, changes everything forever for Palestinians. </P>
<SP> 2nd RESIDENT </SP>
<P>
All the people very happy. </P>
<SP> 3rd RESIDENT </SP> <P>
My father's happy today. </P>
<SP> DELANEY </SP> <P>
Dissenters were heard from - after an attack Thursday on Israeli
soldiers in Gaza, an Islamic group claimed responsibility for an
attack on Jewish settlers on the West Bank. In Gaza, though,
whether the Middle East's old violent cycles continue or not,
nothing will ever look quite the same once Yasir Arafat's come to
town. For young children, the 27-year Israeli occupation won't be
much of a memory - Yasir Arafat's arrival surely will be. This boy
said `We can't live without him. He is our leader. He is our love.'
Still, Yasir Arafat returns as he vowed repeatedly he would not,
still only barely solvent, with relatively few of the hundreds of
millions of dollars he's been pledged actually in his pocket. </P>
<P> Arafat's latest gamble seems to be to so irrevocable stake
his claim here that everything else he so desperately needs will
follow. We expect him here in the Gaza Strip in about an hour and a
half, crossing over from Egypt. Bob? </P>
<SP> CAIN </SP> <P>
Bill, Arafat apparently poses a security problem not only to the
Palestinians, but to the Israelis as well. Do you know how long
he's going to be there in Gaza City? </P>
<SP> DELANEY </SP> <P>
We expect him to spend the weekend and leave Monday. Now, those
plans are, of course, always subject to change, but that's what we
understand at the moment. </P>
<SP> CAIN </SP> <P>
Bill Delaney, in Gaza City. </P>
<COPYRIGHT>
The preceding text has been professionally transcribed. However,
although the text has been checked for errors, in order to meet
rigid distribution and transmission deadlines, it may not have been
proofread against tape.
(c) Copyright 1994 Cable News Network. All rights reserved.
</COPYRIGHT>
</TEXT>
</DOC>
D.2 TDT2 Broadcast News Transcript
<DOC>
<DOCNO> CNN19980106.1600.0984 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 01/06/1998 16:16:24.99 </DATE_TIME>
<BODY>
<TEXT>
shares of apple computer were among those in the plus column today.
apple says cost-cutting and strong demand for its new computers
helped return to company to profitability at the end of last year.
that good news helped the company stock soar nearly 20%.
apple closed up just over $3, at $19 a share.
the acting chief of the bruised computer maker credits his workers.
<TURN>
every group at apple has been burning the midnight oil over the
last six months.
the product groups hardware and software have been doing great.
our sales and marketing groups all around the world are
manufacturing, distribution , and we're starting to really see the
results.
<TURN>
apple also says its eyeing the sub $1,000 personal computer market.
other major computer makers including hewlett packard and compaq
are already churning out cheap pcs they've become one of the
hottest products in the computer industry.
</TEXT>
</BODY>
<END_TIME> 01/06/1998 16:17:20.00 </END_TIME>
</DOC>
D.3 TDT Newswire Article
<DOC>
<DOCNO> NYT19980109.0937 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 01/09/1998 21:13:00 </DATE_TIME>
<HEADER>
A4332 &Cx1f; taf-z
u i &Cx13; &Cx11; BC-IRAN-U.S.-POLICY-270&AMP;
01-09 0830
</HEADER>
<BODY>
<SLUG> BC-IRAN-U.S.-POLICY-270&AMP;ADD-NYT </SLUG>
<HEADLINE>
U.S. OFFICIALS WARMING UP TO INFORMAL LINKS WITH IRAN
</HEADLINE>
(Eds., see related IRAN-U.S) &QL;
(bl) &QL;
By STEVEN ERLANGER &LR; &QL;
&UR; c.1998 N.Y. Times News Service &QC; &LR;
&QL;
<TEXT>
WASHINGTON _ After a few days of reflection on Iranian President
Mohammed Khatami's address to the American people, senior U.S.
officials are changing their tone and embracing the idea of
cultural exchanges that fall short of a formal, government-to-government dialogue.
But that formal dialogue is vital to any real improvement in
relations with Iran, the officials repeated, adding that atmosphere
also matters.
Noting that the interview with Ayatollah Khatami on Wednesday
was also broadcast in Iran and received a mixed reception, U.S.
officials say they have a fuller understanding both of the courage
of his address and what is possible within the divided politics of
theocratic Iran.
``When the president of Iran, a country with whom we've had a
very bad relationship for a long time, gets on CNN and addresses
the American people and starts praising our values and our
civilization and talks about a dialogue, then it behooves us to
respond,'' a senior U.S. official said.
``When he says he regrets the hostage-taking and talks about
America as a great civilization and these things get criticized in
Iran,'' the official continued, ``it is an indication to us that
he's interested in breaking down this distrust and finding a way to
engage with us.''
All that ``is important on a rhetorical level,'' the official
said/ But he cautioned that ``we have some real problems with
Iranian behavior'' that can only be resolved in ``authorized,
government-to-government talks'' of the kind Washington has been
seeking _ publicly and privately, through various diplomatic
channels _ for many months.
<ANNOTATION>
(STORY CAN END HERE. OPTIONAL MATERIAL FOLLOWS)
</ANNOTATION>
U.S. diplomatic overtures for new talks on the substantive
problems of the relationship were passed to Iranian officials in
Tehran by Saudi intermediaries in June and early July, The Los
Angeles Times reported in July, before Khatami took office in
August.
Another overture, sometime after Khatami's inauguration, was
made in a letter delivered by the Swiss, who represent U.S.
interests in Tehran, where there is no U.S. diplomatic
representation, The Washington Post reported.
But these overtures _ and less formal efforts made through
Washington-based research groups _ produced little at the time,
officials said.
``A real improvement in Iran's behavior and relations with the
United States will depend more on domestic political change in
Tehran than anything we do or say,'' a senior official said. ``And
what we do or say will have an exaggerated impact over there. There
is a real risk in saying too much and doing in the guy who's trying
to make things better.''
While wanting to be receptive to the overture from Khatami, U.S.
officials do not want to be ``bounced,'' one said, into aimless
talks that harm U.S. efforts to isolate Iran and produce no
discernible change in Tehran's behavior.
So State Department spokesman James P. Rubin says the United
States will ``take a serious, hard look'' at Khatami's vague
proposal for a more formalized expansion of cultural and
educational exchanges.
But limited informal exchanges already exist, Rubin said, and
what matters to Washington remains now what it was last week: a
halt in Iranian support for terrorism; a halt to Iran's pursuit of
weapons of mass destruction and ballistic missiles to deliver them,
and a halt in Iran's active support for radicals opposed to the
Middle East peace effort.
The U.S. response to Khatami is ``designed to make clear to him
that we listened and we heard, both the good things, the things we
appreciate, and the things we do not appreciate,'' a senior
official said.
It is also designed not to cause any inadvertent damage to
Khatami's standing in Iran _ but without appearing to take sides in
the struggle between conservative adherents of Iran's spiritual
leader, Ayatollah Ali Khamenei, and those who look to Khatami to
soften Iran's religious fervor and encourage the trend toward the
more moderate brand of Islam he appears to represent.
The United States remains a metaphor for the more fundamental
battle inside Iran, just as it was during the 1979 revolution
against Shah Mohammed Riza Pahlevi that brought the ayatollahs to
power.
The administration applauded Khatami's call for relations built
on ``mutual respect'' and his suggestion that terrorist violence
aimed at Israeli citizens is useless and counterproductive. His
comments about the United States, an official said, ``were a breath
of fresh air, quite contrary to the paranoid, vitriolic view of
Western values and culture put around in Iran for many years.''
Among the less attractive comments, to American ears, was
Khatami's description of Israel as ``a racist, terrorist regime.''
</TEXT>
</BODY>
<TRAILER>
NYT-01-09-98 2113EST
</TRAILER>
</DOC>
D.4 RTÉ Closed Caption Material
<DOC id="story16942_3">
<TEXT>
the result of which has caused fierce disagreement among
Palestinian militants.
In the North, a five-year-old boy suffered a serious eye injury
when the taxi he was in came under attack by youths throwing
stones.
The incident was one of a series of minor skirmishes in Belfast,
after what has been the quietest 12th of July weekend for many
years. But Jeffrey Donaldson, the Ulster Unionist MP, confirmed he
will increase efforts to overturn the policies of David Trimble.
One week after a calm Drumcree, a quiet 12th.
In North Belfast, after a day's marching and in some cases a day's
drinking, Orangemen paraded past nationalists with no serious
confrontations.
it has been the calmest July since the Troubles began.
But for both communities, there are different realities still.
They will have to be grasped Last night in North Belfast, the
prominent Republicans helping to keep the peace included Bobby
Storey, the man at the centre of unproven Unionist allegations
about the break-in at Castlereagh pol??
4?tion.
At times it was an extremely tense situation.
I saw Sinn Fein's Gerry Kelly intervene to stop a number of youths
from getting involved in confrontations with the police.
Policing remains a crucial issue for Sinn Fein. Until Nationalist
communities support the new policing structures, security issues
will be contentious.
On the Unionist side, Jeffrey Donaldson may now believe But in his
campaign to oust David Trimble, he is looking for support from the
likes of Sir Reg Empey, who supports the Good Friday Agreement.
Donaldson's strategy seems to be - sort out our own camp and then
try to negotiate with Nationalists.
I say that I and the people I represent can bring a concensus on
the Unionist side to this process, provided our concerns are
addressed. Normally in the North, politicians take summer holidays
and the streets become tense.
The current situation is different –
</TEXT>
</DOC>
References
(Agirre et al., 2000) E. Agirre, O. Ansa, E. Hovy, D. Martinez. Enriching very large
ontologies using the WWW. In the Proceedings of the Workshop on Ontology
Learning, 14th European Conference on Artificial Intelligence (ECAI-00), 2000.
(Alemany, Fuentes, 2003) L. Alemany, M. Fuentes. Integrating cohesion and
coherence for automatic summarization. In the Proceedings of the 11th Meeting of
the European Chapter of the Association for Computational Linguistics (EACL-03),
2003.
(Alfonseca et al., 2003) E. Alfonseca, P. Rodriguez. Description of the UAM system
for generating very short summaries at DUC 2003. In the Proceedings of the
HLT/NAACL Workshop on Automatic Summarization/Document Understanding
Conference (DUC 2003), 2003.
(Al-Halimi, Kazman, 1998) R. Al-Halimi, R. Kazman, Temporal Indexing through
Lexical Chaining. In WordNet: an Electronic Lexical Database. Chapter 14, pp.
333-52, C. Fellbaum (editor), The MIT Press, Cambridge, M.A., 1998.
(Allan et al., 1998a) J. Allan, J. Carbonell, G. Doddington, J. Yamron, Y. Yang.
Topic Detection and Tracking Pilot Study Final Report. In the Proceedings of the
DARPA Broadcasting News Transcript and Understanding Workshop 1998, pp.
194-218, 1998.
(Allan et al., 1998b) J. Allan, V. Lavrenko, R. Papka. Event Tracking. Computer
Science Department, University of Massachusetts, Amherst, CIIR Technical Report
IR-128, 1998.
(Allan et al., 1998c) J. Allan, R. Papka, V. Lavrenko. On-line New Event Detection
and Tracking. In the Proceedings of the 21st Annual ACM SIGIR Conference of
Research and Development in Information Retrieval (SIGIR-98), pp. 37-45, 1998.
(Allan et al., 1998d) J. Allan, R. Papka, V. Lavrenko, On-line New Event Detection
using Single Pass Clustering. University of Massachusetts, Amherst, Technical
Report 98-21, 1998.
(Allan et al., 1999) J. Allan, H. Jin, M. Rajman, C. Wayne, D. Gildea, V. Lavrenko,
R. Hoberman, D. Caputo. Topic-Based Novelty Detection. Summer Workshop Final
Report, Center for Language and Speech Processing, Johns Hopkins University,
1999.
(Allan et al., 2000a) J. Allan, V. Lavrenko, D. Malin, R. Swan. Detections, Bounds,
and Timelines: UMass and TDT-3. In the Proceedings of Topic Detection and
Tracking Workshop (TDT-3), 2000.
(Allan et al., 2000b) J. Allan, V. Lavrenko, H. Jin. First Story Detection in TDT Is
Hard. In the Proceedings of the 9th International Conference on Information and
Knowledge Management (CIKM-00), 2000.
(Allan et al., 2000c) J. Allan, V. Lavrenko, D. Frey, V. Khandelwal. UMass at TDT
2000. In the Proceedings of Topic Detection and Tracking Workshop (TDT-2000),
Gaithersburg, MD, 2000.
(Allan et al., 2001) J. Allan, V. Khandelwal, R. Gupta. Temporal Summaries of
News Topics. In the Proceedings of the 24th Annual ACM SIGIR Conference of
Research and Development in Information Retrieval (SIGIR-01), pp. 10-18, 2001.
(Allan, 2002a) J. Allan (editor). Topic Detection and Tracking: Event-based
Information Organization. Kluwer Academic Publishers, 2002.
(Allan, 2002b) J. Allan. Introduction to Topic Detection and Tracking. Chapter 1,
In Topic Detection and Tracking: Event-based Information Organization. J. Allan
(editor), Kluwer Academic Publishers, 2002.
(Allan et al., 2002c) J. Allan, V. Lavrenko, R. Swan. Explorations within Topic
Detection and Tracking. Chapter 10, In Topic Detection and Tracking: Event-based
Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002.
(Arampatzis, 2001) A. Arampatzis. Adaptive and Temporally-dependent Document
Filtering. Ph.D. thesis, University of Nijmegen, The Netherlands, 2001.
(Baker, McCallum, 1998) L. D. Baker, A. K. McCallum. Distributional clustering
of words for text classification. In the Proceedings of the 21st Annual ACM SIGIR
Conference of Research and Development in Information Retrieval (SIGIR-98), pp.
96-103, 1998.
(Baeza-Yates, Ribero-Neto, 1999) R. Baeza-Yates, B. Ribeiro-Neto. Modern
Information Retrieval. ACM Press, Addison-Wesley, 1999.
(Banko et al., 2000) M. Banko, V. Mittal, M. Witbrock. Generating Headline-Style
Summaries. In the Proceedings of the Association for Computational Linguistics
(ACL-00), 2000.
(Barzilay, Elhadad, 1997) R. Barzilay, M. Elhadad. Using Lexical Chains for Text
Summarization. In the Proceedings of the Association for Computational
Linguistics and the European Chapter of the Association for Computational
Linguistics (ACL-97/EACL-97) Workshop on Intelligent Scalable Text
Summarization, pp. 10-17, 1997.
(Barzilay, 1997) R. Barzilay. Lexical chains for summarisation. Master’s Thesis,
Ben-Gurion University, Beer-Sheva, Israel, 1997.
(Barzilay, 2003) R. Barzilay. Information Fusion for Multidocument
Summarization: Paraphrasing and Generation. PhD Thesis, Columbia University,
2003.
(Beeferman et al., 1999) D. Beeferman, A. Berger, J. Lafferty. Statistical Models
for Text Segmentation. In Machine Learning, Vol. 34, pp. 1-34, 1999.
(Berger, Mittal, 2000) A. Berger, V. Mittal. OCELOT: a system for summarizing
Web pages. In the Proceedings of the 23rd Annual ACM SIGIR Conference of Research
and Development in Information Retrieval, (SIGIR-00), pp.144-151, 2000.
(Blei, Moreno, 2001) D. M. Blei, P. J. Moreno. Topic segmentation with an aspect
hidden Markov model. In the Proceedings of the 24th Annual ACM SIGIR
Conference of Research and Development in Information Retrieval, (SIGIR-01),
pp. 343-348, 2001.
(Bo-Yeong, 2002) Bo-Yeong Kang. Text Summarization through Important Noun
Detection Using Lexical Chains. M.S. Thesis, Kyungpook National University,
2002.
(Bo-Yeong, 2003) Bo-Yeong Kang. A novel approach to semantic indexing based
on concept. In the Proceedings of the Association for Computational Linguistics
Student Session (ACL-03), 2003.
(Brunn, Chali, Pinchak, 2001) M. Brunn, Y. Chali, C.J. Pinchak. Text
Summarization Using Lexical Chains. In the Proceedings of the Document
Understanding Conference (DUC-2001), pp. 135 - 140, 2001.
(Brunn, Chali, Dufour, 2002) M. Brunn, Y. Chali, D. Dufour. The University of
Lethbridge Text Summarizer at DUC 2002. In the Proceedings of the Document
Understanding Conference (DUC-2002), pp. 39-44, 2002.
(Budanitsky, 1999) A. Budanitsky. Lexical Semantic Relatedness and its
Application in Natural Language Processing. PhD Thesis, Technical Report CSRG-390, Computer Systems Research Group, University of Toronto, August 1999.
(Budanitsky, Hirst, 2001) A. Budanitsky, G. Hirst. Semantic Distance in WordNet:
An experimental, application oriented-evaluation of five measures. In the
Proceedings of the Workshop on WordNet and Other Lexical Resources, in the
North American Chapter of the Association for Computational Linguistics
(NAACL-2001), Pittsburgh, PA, June 2001.
(Buitelaar, 1998) P. Buitelaar. CORELEX: Systematic Polysemy and
Underspecification. Ph.D. thesis, Brandeis University, 1998.
(Callan et al., 1992) J. P. Callan, W. B. Croft, S. M. Harding. The INQUERY
Retrieval System. In the Proceedings of the 3rd International Conference on
Database and Expert System Applications, pp. 78-83, 1992.
(Callan, 1994) J. P. Callan. Passage level evidence in document retrieval. In the
Proceedings of the 17th Annual ACM SIGIR Conference of Research and
Development in Information Retrieval, (SIGIR-94), pp. 302-310, 1994.
(Carthy, Smeaton, 2000) J. Carthy, A. F. Smeaton. The Design of a Topic Tracking
System. In the Proceedings of the 22nd Annual Colloquium on IR Research (BCS-IRSG-00), 2000.
(Carthy, Sherwood-Smith, 2002) J. Carthy, M. Sherwood-Smith. Lexical Chains for
Topic Tracking. In the Proceedings of the IEEE International Conference on
Systems Management and Cybernetics, 2002.
(Carthy, 2002) J. Carthy. Lexical Chains for Topic Tracking. PhD thesis,
Department of Computer Science, University College Dublin, 2002.
(Cieri et al., 2002) C. Cieri, S. Strassel, D. Graff, N. Martey, K. Rennert, M.
Liberman. Corpora for Topic Detection and Tracking. Chapter 10, In Topic
Detection and Tracking: Event-based Information Organization. J. Allan (editor),
Kluwer Academic Publishers, 2002.
(Chali et al., 2003) Y. Chali, M. Kolla, N. Singh, Z. Zhang. The University of
Lethbridge Text Summarizer at DUC 2003. In Proceedings of the Document
Understanding Conference (DUC-2003), pp. 148-152, 2003.
(Chen, Chen, 2002) Y. Chen, H. Chen. NLP and IR Approaches to Monolingual
and Multilingual Link Detection. In the Proceedings of the 19th International
Conference on Computational Linguistics (COLING-02), pp. 176-182, 2002.
(Chen, Ku, 2002) H. Chen, L. Ku. An NLP and IR approach to topic detection.
Chapter 12, In Topic Detection and Tracking: Event-based Information
Organization. J. Allan (editor), Kluwer Academic Publishers, 2002.
(Choi, 2000) F. Y. Y. Choi. Advances in domain independent linear text
segmentation. In the Proceedings of the 1st Meeting of the North American Chapter
of the Association for Computational Linguistics (NAACL-00), pp. 26-33, 2000.
(Choi, 2001) F. Y. Y. Choi, P. Wiemer-Hastings, J. Moore. Latent semantic
analysis for text segmentation. In the Proceedings of the 6th Conference on
Empirical Methods in Natural Language Processing (EMNLP-01), pp. 109-117,
2001.
(Copeck et al., 2003) T. Copeck, S. Szpakowicz. Picking phrases, picking
sentences. In the Proceedings of the HLT/NAACL workshop on Automatic
Summarization/Document Understanding Conference (DUC 2003), 2003.
(Croft, 2000) W. B. Croft. Combining approaches to information retrieval. In
Chapter 1, Advances in Information Retrieval, W. B. Croft (editor), pp. 1-36.
Kluwer Academic Publishers, 2000.
(Daume et al., 2002) H. Daume, D. Echihabi, D. Marcu, D. S. Munteanu, R.
Soricut. GLEANS: A generator of logical extracts and abstracts for nice summaries.
In the Proceedings of the ACL Workshop on Automatic Summarization/Document
Understanding Conference (DUC 2002), 2002.
(Deerwester et al., 1990) S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R.
Harshman. Indexing by latent semantic analysis. Journal of the American Society
for Information Science, Vol. 41, pp. 391-407, 1990.
(Dharanipragada et al., 1999) S. Dharanipragada, M. Franz, J. S. McCarley, S.
Roukos, T. Ward. Story Segmentation and Topic Detection for Recognised Speech.
In the Proceedings of Eurospeech, 1999.
(Dharanipragada et al., 2002) S. Dharanipragada, M. Franz, J. S. McCarley, T.
Ward, W. -J. Zhu. Segmentation and Detection at IBM. Chapter 7, In Topic
Detection and Tracking: Event-based Information Organization. J. Allan (editor),
Kluwer Academic Publishers, 2002.
(Dhillon et al., 2003) I. Dhillon, S. Mallela, R. Kumar. A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification. Journal of
Machine Learning Research, Special Issue on Variable and Feature Selection, Vol.
3, pp. 1265-1287, 2003.
(Dorr, Zajic, 2003) B. Dorr, D. Zajic. Hedge Trimmer: A parse-and-trim approach
to headline generation. In the Proceedings of the HLT/NAACL Workshop on
Automatic Summarization/Document Understanding Conference (DUC 2003), 2003.
(Dunning, 1994) T. Dunning. Accurate Methods for the Statistics of Surprise and
Coincidence. Computational Linguistics, Vol. 19, No. 1, pp. 61-74, 1994.
(Eichmann et al., 1999) D. Eichmann, M. Ruiz, P. Srinivasan, N. Street, C. Culy, F.
Menczer. A cluster-based approach to tracking, detection and segmentation of
broadcast news. In the Proceedings of the DARPA Broadcast News Workshop,
1999.
(Eichmann, Srinivasan, 2002) D. Eichmann, P. Srinivasan. A cluster-based
approach to broadcast news. Chapter 8, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers,
2002.
(Ellman, 2000) J. Ellman. Using Roget’s Thesaurus to Determine the Similarity of
Texts. PhD thesis, Department of Computer Science, University of Sunderland,
2000.
(Fellbaum, 1998a) C. Fellbaum (editor). In WordNet: An Electronic Lexical
Database and some of its Applications. MIT Press, Cambridge, MA, 1998.
(Fellbaum, 1998b) C. Fellbaum. A semantic network of English verbs. Chapter 3, In
WordNet: An Electronic Lexical Database and some of its Applications. C.
Fellbaum (editor), MIT Press, Cambridge, MA, 1998.
(Fiscus, Doddington, 2002) J. Fiscus, G. Doddington. Topic Detection and
Tracking Overview. Chapter 2, In Topic Detection and Tracking: Event-based
Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002.
(Fox, 1983) E. Fox.
Extending the Boolean and Vector Space Models of
Information Retrieval with P-norm Queries and Multiple Concept Types. PhD
thesis, Cornell University, 1983.
(Fox et al., 1988) E. Fox, G. Nunn, W. Lee. Coefficients for combining concept
classes in a collection. In the proceedings of the 11th Annual ACM SIGIR
Conference of Research and Development in Information Retrieval, (SIGIR-88),
pp. 291-308, 1988.
(Frakes, Baeza-Yates, 1992) W. B. Frakes, R. Baeza-Yates. Information Retrieval:
Data structures and algorithms. Prentice Hall, 1992.
(Fuentes et al., 2003) M. Fuentes, H. Rodriguez, L. Alonso, Mixed Approach to
Headline Extraction for DUC 2003. In the Proceedings of the HLT/NAACL
Workshop on Automatic Summarization/Document Understanding Conference
(DUC 2003), 2003.
(Furnas et al., 1987) G. W. Furnas, T. K. Landauer, L. M. Gomez, S. Dumais. The
Vocabulary Problem in Human-System Communication. CACM, Vol. 30, No. 11,
pp. 964-971, 1987.
(Gale et al., 1992) W. Gale, K. Church, D. Yarowsky. One Sense Per Discourse. In
the Proceedings of the 4th DARPA Speech and Natural Language Workshop, pp.
233-237, 1992.
(Galley, McKeown, 2003) M. Galley, K. McKeown. Improving Word Sense
Disambiguation in Lexical Chaining. In the Proceedings of the 18th International
Joint Conference on Artificial Intelligence (IJCAI-03), 2003.
(Gloub, Van Loan, 1996) G. H. Golub, C. Van Loan. Matrix Computations. Johns
Hopkins University Press, 3rd Edition, 1996.
(Gonzalo et al., 1998) J. Gonzalo, F. Verdejo, I. Chugur, J. Cigarran. Indexing with
WordNet Synsets can Improve Text Retrieval. In the Proceedings of the Workshop
on the Usage of WordNet in Natural Language Processing Systems (COLING-ACL-98), S. Harabagiu (editor), pp. 38-44, 1998.
(Gonzalo et al., 1999) J. Gonzalo, A. Penas, F. Verdejo. Lexical ambiguity and
Information Retrieval revisited. In the Proceedings of the Joint SIGDAT
Conference on Empirical Methods in NLP and Very Large Corpora (EMNLP/VLC-99), 1999.
(Green, 1997a) S. J. Green. Building hypertext links in Newspaper Articles using
Semantic Similarity. In the Proceedings of the 3rd Workshop on Applications of
Natural Language to Information Systems (NLDB’97), pp. 178-190, 1997.
(Green, 1997b) S. J. Green. Automatically Generating Hypertext by Computing
Semantic Similarity. PhD Thesis, University of Toronto, 1997.
(Greiff et al., 2000) W. Greiff, A. Morgan, R. Fish, M. Richards, A. Kundu. MITRE
TDT-2000 segmentation system. In the Proceedings of the TDT 2000 Workshop,
2000.
(Grosz, Sidner, 1986) B. J. Grosz, C. L. Sidner. Attention, intentions, and the
structure of discourse. Computational Linguistics, Vol. 12, No. 3, pp. 175-204,
1986.
(Halliday, Hasan, 1976) M. A. K. Halliday, R. Hasan. Cohesion in English.
Longman, 1976.
(Halliday, 1995) M. A. K. Halliday. Spoken and Written Language. Oxford
University Press, 1985.
(Hasan, 1984) R. Hasan. Coherence and Cohesive Harmony. Understanding
Reading Comprehension: Cognition, Language and the Structure of Prose. James
Flood (ed.), Newark, Delaware: International Reading Association, pp. 184-219,
1984.
(Hatch, 2000) P. Hatch. Lexical Chaining for the Online Detection of New Events.
MSc thesis, Department of Computer Science, University College Dublin, 2000.
(Harabagiu, 1999) S. Harabagiu. From Lexical Cohesion to Textual Coherence: A
Data Driven Perspective. Journal of Pattern Recognition and Artificial Intelligence,
Vol. 13, No. 2, pp. 247-265, 1999.
(Hearst, Plaunt, 1993) M. Hearst, C. Plaunt. Subtopic structuring for full-length
document access. In the Proceedings of the 16th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, (SIGIR-93),
pp. 59-68, 1993.
(Hearst, 1997) M. Hearst. TextTiling: Segmenting Text into Multi-Paragraph
Subtopic Passages. Computational Linguistics, Vol. 23 No. 1, pp. 33-64, 1997.
(Hirst, 1995) G. Hirst. Near-synonymy and the structure of lexical knowledge. In
the Working Notes of the AAAI Spring Symposium on Representation and
Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pp. 51-56, 1995.
(Hirst, St-Onge, 1998) G. Hirst, D. St-Onge. Lexical chains as Representations of
Context for the Detection and Correction of Malapropisms. Chapter 13, In
WordNet: An Electronic Lexical Database, C. Fellbaum (editor), pp. 305-332, The
MIT Press, Cambridge, MA, 1998.
(Hirschberg, Litman, 1993) J. Hirschberg, D. Litman. Empirical studies on the
disambiguation of cue phrases. Computational Linguistics, Vol. 19, No. 3, pp. 501-530, 1993.
(Jarmasz, Szpakowicz, 2003) M. Jarmasz, S. Szpakowicz. Not as Easy as It Seems:
Automating the Construction of Lexical Chains Using Roget's Thesaurus. In the
Proceedings of the Canadian Conference on Artificial Intelligence, pp. 544-549,
2003.
(Jiang, Conrath, 1997) J. J. Jiang, D. W. Conrath. Semantic Similarity Based on
Corpus Statistics and Lexical Taxonomy. In the Proceedings of the International
Conference on Research in Computational Linguistics (COLING-97), 1997.
(Jin, Hauptmann, 2002) R. Jin, A. G. Hauptmann. A new probabilistic model for
title generation. In the Proceedings of the International Conference on
Computational Linguistics (COLING-02), 2002.
(Joachims, 2002) T. Joachims. Learning to Classify Text using Support Vector
Machines, PhD Dissertation, Kluwer, 2002.
(Jobbins, Evett, 1998) A. C. Jobbins, L. J. Evett. Text Segmentation Using
Reiteration and Collocation. In the Proceedings of the Joint International
Conference on Computational Linguistics with the Association for Computational
Linguistics (COLING-ACL 1998), pp. 614-618, 1998.
(Kaszkiel, Zobel, 1997) M. Kaszkiel, J. Zobel. Term-ordered query evaluation
versus document-ordered query evaluation for large document databases. In the
Proceedings of the 20th International ACM-SIGIR Conference on Research and
Development in Information Retrieval, (SIGIR-97), pp. 343-344, 1997.
(Kaszkiel, Zobel, 2001) M. Kaszkiel, J. Zobel. Effective ranking with arbitrary
passages. Journal of the American Society of Information Science and Technology,
Vol. 54, No. 4, pp. 344-364, 2001.
(Kaufmann, 2000) S. Kaufmann.
Second-order Cohesion. Computational
Intelligence. Vol. 16, No. 4, pp. 511-524, 2000.
(Kazman et al., 1995) R. Kazman, W. Hunt, M. Mantei. Dynamic Meeting
Annotation and Indexing. In the Proceedings of the 1995 Pacific Workshop on
Distributed Meetings, pp. 11-18, 1995.
(Kazman et al., 1996) R. Kazman, R. Al-Halimi, W. Hunt, M. Mantei. Four
Paradigms for Indexing Video Conferences. In IEEE Multimedia, pp. 63-73, 1996.
(Justeson, Katz, 1995) J. Justeson, S. M. Katz. Technical terminology: some
linguistic properties and an algorithm for identification in text. Natural Language
Engineering Vol. 1, No.1, pp. 9-27, 1995.
(Katz, 1996) S. M. Katz. Distribution of context words and phrases in text and
language modelling. Natural Language Engineering, Vol 2, No. 1, pp. 15-59, 1996.
(Katzer et al., 1982) J. Katzer, M. McGill, J. Tessier, W. Frakes, P. DasGupta. A
study of the overlap among document representations. Information Technology:
Research and Development, Vol. 1, No. 4, pp.261-274, 1982.
(Kazman et al., 1997) R. Kazman, J. Kominek. Accessing Multimedia through
Concept Clustering. In the Proceedings of Computer-Human Interaction (CHI-97),
pp. 19-26, 1997.
(Kilgarriff, Yallop, 2000) A. Kilgarriff, C. Yallop. What’s in a thesaurus? In the
Proceedings of the 2nd Conference on Language Resources and Evaluation (LREC-00), pp. 1317-1379, 2000.
(Klavans, Min-Yen, 1998) J. Klavans, Min-Yen Kan. Role of Verbs in Document
Analysis. In the Proceedings of the Joint International Conference on Computational
Linguistics with the Association for Computational Linguistics (COLING-ACL
1998), pp. 680-686, 1998.
(Kozima, Furugori, 1993a) H. Kozima, T. Furugori. Similarity between words
computed by spreading activation on an English dictionary. In the Proceedings of
the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93), pp. 232-239, 1993.
(Kozima, 1993b) H. Kozima. Text Segmentation based on Similarity between
Words. In the Proceedings of the 31st Meeting of the Association of Computational
Linguistics (ACL-93), pp. 286-288, 1993.
(Kozima, Ito, 1997) H. Kozima, A. Ito. Context Sensitive Word Distance by
Adaptive Scaling of a Semantic Space. In R. Mitkov and N. Nicolov (editors),
Recent Advances in Natural Language Processing: Selected Papers from RANLP
1995, Volume 136 of Amsterdam Studies in the Theory and History of Linguistic
Science: Current Issues in Linguistic Theory, Chapter 2, pp. 111-124, John
Benjamins Publishing Company, Amsterdam/Philadelphia, 1997.
(Krovetz, Croft, 1992) R. Krovetz, W. B. Croft. Lexical Ambiguity and Information
Retrieval. In the ACM Transactions on Information Systems, Vol. 10, No. 2,
pp.115-141, 1992.
(Lee, 1997) J. Lee. Analysis of multiple evidence combination. In the Proceedings
of the 20th Annual ACM SIGIR Conference on Research and Development in IR,
pp. 267-276, (SIGIR-97), 1997.
(Lin, Hovy, 2003) C. Lin, E. Hovy. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In the Proceedings of the Joint Human Language
Technology and North American Chapter of the Association for Computational
Linguistics Conference (HLT-NAACL 2003), 2003.
(Mandala et al., 1999) R. Mandala, T. Tokunaga, H. Tanaka. Combining Multiple
Evidence from Different Types of Thesaurus for Query Expansion. In the
Proceedings of the 22nd Annual ACM SIGIR Conference on Research and
Development in IR, (SIGIR-99), pp 191-197, 1999.
(Mani et al., 1997) I. Mani, D. House, M. T. Maybury, M. Green. Towards Content-Based Browsing of Broadcast News Video. In Intelligent Multimedia Information
Retrieval, M. T. Maybury (ed.), AAAI/MIT Press, pp. 241-258, 1997.
(Mann, Thompson, 1987) W. Mann, S. Thompson. Rhetorical Structure Theory: A
Theory of Text Organization. In The Structure of Discourse, L. Polanyi (editor),
Norwood, N.J.: Ablex Publishing Corporation, 1987.
(Manning, 1998) C. D. Manning. Rethinking text segmentation models: An
information extraction case study. Technical report SULTRY-98-07-01, University
of Sydney, 1998.
(Marcu, 1997) D. Marcu. From Discourse Structures to Text Summaries. In the
Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text
Summarization, pp. 82-88, 1997.
(McGill et al., 1979) M. McGill, M. Koll, T. Noreault. An evaluation of factors
affecting document ranking by information retrieval systems. Final report for grant
NSF-IST-78-10454 to the National Science Foundation, Syracuse University, 1979.
(Mc Hale, 1998) M. Mc Hale. A comparison of WordNet and Roget’s Taxonomy
for Measuring Semantic Similarity. In the Proceedings of the COLING/ACL
Workshop on Usage of WordNet in Natural Language Processing Systems, pp. 115-120, 1998.
(McKeown et al., 2002) K. McKeown, D. Evans, A. Nenkova, R. Barzilay, V.
Hatzivassiloglou, B. Schiffman, S. Blair-Goldensohn, J. Klavans, S. Sigelman. The
Columbia Multi-Document Summarizer for DUC 2002. In the Proceedings of the
ACL workshop on Automatic Summarization/Document Understanding Conference
(DUC 2002), 2002.
(Meyers et al., 1998) A. Meyers. Using NOMLEX to produce nominalization
patterns for information extraction. In the Proceedings of the COLING-ACL
Workshop on Computational Treatment of Nominals, 1998.
(Miller et al., 1990) G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller.
Five papers on WordNet. In the International Journal of Lexicography. Vol. 3 No.
4, Cambridge, MA, 1990.
(Miller et al., 1993) G. A. Miller, C. Leacock, T. Randee, R. Bunker. A Semantic
Concordance. In the Proceedings of the 3rd DARPA Workshop on Human
Language Technology, pp. 303-308, 1993.
(Miller, 1998) G. A. Miller. Nouns in WordNet. In WordNet: An Electronic Lexical
Database, C. Fellbaum (Editor), pp.23-46, Cambridge, Massachusetts, USA: The
MIT Press, 1998.
(Min-Yen et al., 1998) Min-Yen Kan, J. Klavans, K. McKeown. Linear
Segmentation and Segment Relevance. In the Proceedings of 6th International
Workshop of Very Large Corpora (WVLC-6), pp. 197-205, 1998.
(Mihalcea, Moldovan, 2001) R. Mihalcea, D. Moldovan. eXtended WordNet:
Progress Report. In the Proceedings of NAACL Workshop on WordNet and Other
Lexical Resources, pp.95-100, 2001.
(Mittal et al., 1999) V. Mittal, M. Kantrowitz, J. Goldstein, J. Carbonell. Selecting
text spans for document summaries: Heuristics and metrics. In Proceedings of the
16th National Conference on Artificial Intelligence, pp. 467-473, 1999.
(Mittendorf, Schauble, 1994) E. Mittendorf, P. Schauble. Document and passage
retrieval based on hidden Markov models. In the Proceedings of the 17th Annual
International ACM SIGIR Conference on Research and Development in IR,
(SIGIR-94), pp. 318-327, 1994.
(Mochizuki et al., 1998) H. Mochizuki, T. Honda, M. Okumura. Text Segmentation
with Multiple Surface Linguistic Cues. In the Proceedings of the Joint International
Conference on Computational Linguistics with the Association for Computational
Linguistics, (COLING-ACL-98), pp. 881-885, 1998.
(Mochizuki et al., 2000) H. Mochizuki, M. Iwayama, M. Okumura. Passage Level
Document Retrieval Using Lexical Chains. RIAO 2000, Content Based Multimedia
Information Access, pp. 491-506, 2000.
(Moffat et al., 1994) A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel. Retrieval
of partial documents. In the Proceedings of the 2nd Text Retrieval Conference,
(TREC-2), pp. 181-190, 1994.
(Moldovan, Novischi, 2002) D. Moldovan, A. Novischi, Lexical Chains for
Question Answering. In the Proceedings of the International Conference on
Computational Linguistics (COLING-02), pp. 674-680, 2002.
(Morris, Hirst, 1991) J. Morris, G. Hirst. Lexical Cohesion Computed by Thesaural
Relations as an Indicator of the Structure of Text. Computational Linguistics, Vol.
17, No. 1, pp. 21-48, 1991.
(Navarro, 2001) G. Navarro. A guided tour to approximate string matching. ACM
Computing Surveys, Vol. 33 No. 1, pp. 31-88, 2001.
(Nomoto, Nitta, 1994) T. Nomoto, Y. Nitta. A Grammatico-Statistical Approach to
Discourse Partitioning. In the Proceedings of the 15th International Conference on
Computational Linguistics (COLING-94), pp. 1145-1150, 1994.
(Okumura, Honda, 1994) M. Okumura, T. Honda. Word Sense Disambiguation and
Text Segmentation Based on Lexical Cohesion. In the Proceedings of the 15th
International Conference on Computational Linguistics (COLING-94), pp. 755-761,
1994.
(Papka, 1999) R. Papka. On-Line New Event Detection, Clustering and Tracking.
Department of Computer Science, UMASS, Amherst, PhD Dissertation, 1999.
(Passonneau, Litman, 1993) R. Passonneau, D. Litman. Intention-based
segmentation: Human reliability and correlation with linguistic cues. In the
Proceedings of Association of Computational Linguistics, (ACL-93), pp. 148-155,
1993.
(Passonneau, Litman, 1997) R. Passonneau, D. Litman. Discourse Segmentation by
Human and Automated Means. Computational Linguistics, Vol. 23, No. 1, pp. 103-139, 1997.
(Pedersen, 1996) T. Pedersen. Fishing for Exactness. In the Proceedings of the
South-Central SAS Users Group Conference (SCSUG-96), 1996.
(Pevzner, Hearst, 2002) L. Pevzner, M. Hearst. A Critique and Improvement of an
Evaluation Metric for Text Segmentation. Computational Linguistics, Vol. 28, No.
1, pp. 19-36, 2002.
(Polanyi, 1998) L. Polanyi. A formal model of discourse structure. Journal of
Pragmatics, Vol. 12, pp 601-638, 1998.
(Ponte, Croft, 1998) J. Ponte, W. B. Croft. A language modeling approach to
information retrieval. In the Proceedings of the 21st Annual ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR-98), pp.
275-281, 1998.
(Porter, 1997) M. F. Porter. An Algorithm for Suffix Stripping. In Readings in
Information Retrieval, K. Sparck Jones and P. Willett (editors), pp. 313-316, Morgan
Kaufmann Publishers, 1997.
(Quinlan, 1993) J. R. Quinlan. C4.5: Programs for machine learning. Morgan
Kaufman Publishers, 1993.
(Rada et al., 1989) R. Rada, H. Mili, E. Bicknell, M. Blettner. Development and
Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and
Cybernetics, Vol. 19, No. 1, pp. 17-30, 1989.
(Resnik, 1999) P. Resnik. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language.
Journal of Artificial Intelligence Research, (JAIR), Vol. 11, pp. 95-130, 1999.
(Richardson, Smeaton, 1995) R. Richardson, A. F. Smeaton. Using WordNet in a
Knowledge-Based Approach to Information Retrieval. Working Paper CA-0395,
School of Computer Applications, Dublin City University, 1995.
(Richmond et al., 1998) K. Richmond, A. Smith, E. Amitay. Detecting subject
boundaries within text: A language independent statistical approach. In the
Proceedings of the 2nd Conference on Empirical Methods in Natural Language
Processing (EMNLP-97), pp. 47-54, 1997.
(Reynar, 1994) J. Reynar. An automatic method of finding topic boundaries. In the
Proceedings of the Association of Computational Linguistics (ACL-94), 1994.
(Reynar, 1998) J. Reynar. Topic Segmentation: Algorithms and Applications. Ph.D.
thesis, Computer and Information Science, University of Pennsylvania, 1998.
(Salton, McGill, 1983) G. Salton, M. J. McGill. Introduction to Modern
Information Retrieval. McGraw-Hill Book Co., New York, 1983.
(Salton, 1989) G. Salton. Automatic Text Processing: The Transformation, Analysis
and Retrieval of Information by Computer. Addison-Wesley, Reading, Massachusetts,
1989.
(Salton et al., 1993) G. Salton, J. Allan, C. Buckley. Approaches to passage
retrieval in full text information systems. In the Proceedings of the Annual 16th
ACM SIGIR Conference on Research and Development in Information Retrieval,
(SIGIR-93), pp. 49-58, 1993.
(Salton et al., 1996) G. Salton, J. Allan, A. Singhal. Automatic Text Decomposition
and Structuring. Information Processing and Management, Vol. 32, No. 2, pp. 127-138, 1996.
(Sanderson, 1994) M. Sanderson. Word Sense Disambiguation and Information
Retrieval, in the Proceedings of the 17th International Conference on Research and
Development in Information Retrieval, (SIGIR-94), pp. 142-151, 1994.
(Sanderson, 1997) M. Sanderson. Word Sense Disambiguation and Information
Retrieval. PhD Thesis, Technical Report (TR-1997-7) of the Department of
Computing Science at the University of Glasgow, 1997.
(Sanderson, 2000) M. Sanderson. Retrieving with good sense. Information
Retrieval, Vol. 2, No. 1, pp. 49-69, 2000.
(Schutze, 1997) H. Schutze. Ambiguity resolution in language learning. CSLI
Publications, Stanford CA, 1997.
(Schutze, 1998) H. Schutze. Automatic word sense disambiguation. Computational
Linguistics, Vol. 24, No. 1, pp. 97-123, 1998.
(SENSEVAL-2, 2001) SENSEVAL-2: Sense Disambiguation Workshop 2001,
www.sle.sharp.co.uk/senseval2/, 2001.
(Silber, McCoy, 2000) H. G. Silber, K. F. McCoy. Efficient text summarization
using lexical chains. In the Proceedings of Intelligent User Interfaces 2000, pp. 252-255, 2000.
(Silber, McCoy, 2002) H. G. Silber, K. F. McCoy. Efficiently Computed Lexical
Chains as an Intermediate Representation for Automatic Text Summarization.
Computational Linguistics, Vol. 28, No. 4, pp. 487-496, 2002.
(Slaney, Ponceleon, 2001) M. Slaney, D. Ponceleon. Hierarchical segmentation
using latent semantic indexing in scale space. In the Proceedings of the IEEE
International Conference on Acoustics, Speech, & Signal Processing, 2001.
(Slonim, 2002) N. Slonim. The Information Bottleneck: Theory and Applications.
Ph.D. thesis, The Hebrew University, Jerusalem, 2002.
(Smeaton et al., 2003) A. F. Smeaton, H. Lee, N. O'Connor, S. Marlow, N. Murphy.
TV News Story Segmentation, Personalisation and Recommendation. In the
Proceedings of Advances in Artificial Intelligence (AAAI-03) Spring Symposium
on Intelligent Multimedia Knowledge Management, 2003.
(Stairmand, 1996) M. A. Stairmand. A Computational Analysis of Lexical Cohesion
with Applications in Information Retrieval. PhD Thesis, Department of Language
Engineering, University of Manchester Institute of Science and Technology, 1996.
(Stairmand, Black, 1997) M. A. Stairmand, W. J. Black. Conceptual and Contextual
Indexing using WordNet-derived Lexical Chains. In the Proceedings of BCS IRSG
Colloquium on Information Retrieval, pp. 47-65, 1997.
(Stairmand, 1997) M. A. Stairmand. Textual context analysis for information
retrieval. In the Proceedings of the 20th Annual ACM SIGIR Conference on
Research and Development in IR, (SIGIR-97), pp. 140-147, 1997.
(Stevenson, 2002) M. Stevenson. Combining Disambiguation Techniques to Enrich
an Ontology. In the Proceedings of the 15th European Conference on Artificial
Intelligence (ECAI-02), Workshop on Machine Learning and Natural Language
Processing for Ontology Engineering, 2002.
(Stokes et al., 2000a) N. Stokes, P. Hatch, J. Carthy. Topic Detection, a new
application for lexical chaining? In the Proceedings of the 22nd BCS IRSG
Colloquium on Information Retrieval. pp.94-103, 2000.
(Stokes et al., 2000b) N. Stokes, P. Hatch, J. Carthy. Lexical semantic relatedness
and online news event detection. In the Proceedings of the Annual 23rd ACM SIGIR
Conference on Research and Development in IR, (SIGIR-00), pp.324-325, 2000.
(Stokes et al., 2000c) N. Stokes, P. Hatch, J. Carthy. Lexical Chaining for Web-Based Retrieval of Breaking News. In the Proceedings of the International
Conference on Adaptive Hypermedia and Adaptive Web-Based Systems AH2000,
pp. 327-330, 2000.
(Stokes, Carthy, 2001a) N. Stokes, J. Carthy. Using Data Fusion to Improve First
Story Detection. In the Proceedings of the 23rd BCS-IRSG European Conference on
IR Research, pp. 78-90, 2001.
(Stokes et al., 2001b) N. Stokes, J. Carthy. First Story Detection using a Composite
Document Representation. In the Proceedings of the Human Language Technology
Conference, (HLT-01), 2001.
(Stokes et al., 2001c) N. Stokes, J. Carthy. Combining Semantic and Syntactic
Document Classifiers to Improve First Story Detection. In the Proceedings of the
24th Annual ACM SIGIR Conference on Research and Development in Information
Retrieval, (SIGIR-01), pp. 424-425, 2001.
(Stokes et al., 2002) N. Stokes, J. Carthy, A.F. Smeaton. Segmenting Broadcast
News Streams using Lexical Chaining. In the Proceedings of the Starting Artificial
Intelligence Researchers Symposium (STAIRS-02), Vol. 1, pp. 145-154, 2002.
(Stokes, 2003) N. Stokes. Spoken and Written News Story Segmentation using
Lexical Chaining. In the Proceedings of the Student Workshop at the Joint Human
Language Technology and North American Chapter of the Association for
Computational Linguistics Conference (HLT/NAACL-03), Companion Volume, pp. 49-54,
2003.
(Stokes et al., 2004a) N. Stokes, J. Carthy, A. F. Smeaton. SeLeCT: A Lexical
Cohesion based News Story Segmentation System. To appear in the Journal of AI
Communications, 2004.
(Stokes et al., 2004b) N. Stokes, E. Newman, J. Carthy, A. F. Smeaton. Broadcast
news gisting using lexical cohesion analysis. To appear in the Proceedings of the
26th BCS-IRSG European Conference on Information Retrieval (ECIR-04),
Sunderland, U.K., 2004.
(Stokoe et al., 2003) C. Stokoe, M. Oakes, J. Tait. Word Sense Disambiguation in
Information Retrieval Revisited. In the Proceedings of the 26th Annual ACM-SIGIR
Conference on Research and Development in IR, (SIGIR-03), pp. 159-166, 2003.
(Stolcke et al., 1999) A. Stolcke, E. Shriberg, D. Hakkani-Tur, G. Tur, Z. Rivlin, K.
Sonmez. Combining words and speech prosody for automatic topic segmentation.
In the Proceedings of the DARPA Broadcast News Workshop, pp. 61-64, 1999.
(St-Onge, 1995) D. St-Onge. Detecting and Correcting Malapropisms with Lexical
Chains. Technical Report CSRI-319, Master’s thesis, University of Toronto, March
1995.
(Sussna, 1993) M. Sussna. Word Sense Disambiguation for Free-Text Indexing
Using a Massive Semantic Network. In the Proceedings of the 2nd International
Conference on Information and Knowledge Management (CIKM-93), pp. 67-74,
1993.
(Utiyama, Isahara, 2001) M. Utiyama, H. Isahara. A statistical model for domain-independent text segmentation. In the Proceedings of the 9th Conference of the
European Chapter of the Association for Computational Linguistics, (EACL-01),
pp. 491-498, 2001.
(van Mulbregt et al., 1999) P. van Mulbregt, I. Carp, L. Gillick, S. Lowe, J.
Yamron. Segmentation of automatically transcribed broadcast news text. In
Proceedings of the DARPA Broadcast News Workshop, pp. 77-80. Morgan
Kaufman Publishers, 1999.
(van Rijsbergen, 1979) C. J. van Rijsbergen, Information Retrieval, Butterworths,
1979.
(Voorhees, 1993) E. M. Voorhees. Using WordNet to Disambiguate Word Senses
for Text Retrieval. In the Proceedings of the 16th Annual ACM-SIGIR Conference
on Research and Development in IR, (SIGIR-93), pp. 171-180, 1993.
(Voorhees, 1994) E. M. Voorhees, Query Expansion using Lexical-Semantic
Relations. In the Proceedings of the 17th Annual ACM SIGIR Conference on
Research and Development in IR, (SIGIR-94), pp. 61-69, 1994.
(Voorhees, 1998) E. M. Voorhees. Using WordNet for Text Retrieval. In WordNet:
An Electronic Lexical Database, C. Fellbaum (Editor), pp.285-303, Cambridge,
Massachusetts, USA: The MIT Press, 1998.
(Vossen, 1998) P. Vossen (Editor). EuroWordNet: A Multilingual Database with
Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, 1998.
(Wallis, 1993) P. Wallis. Information retrieval based on paraphrase. In the
Proceedings of the Pacific Association for Computational Linguistics (PACLING-93), 1993.
(Wilkinson, 1994) R. Wilkinson. Effective retrieval of structured documents. In the
Proceedings of the 17th Annual ACM SIGIR Conference on Research and
Development in IR, pp.311-317, 1994.
(Witbrock, Mittal, 1999) M. Witbrock, V. Mittal. Ultra-Summarisation: A
Statistical approach to generating highly condensed non-extractive summaries. In
the Proceedings of the Annual ACM SIGIR Conference on Research and
Development in IR, (SIGIR-99), pp. 315-316, 1999.
(Witten et al., 1998) I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In the Proceedings of
the 4th ACM Digital Libraries Conference, pp. 254-255, 1999.
(Xu, Broglio, Croft, 1994) J. Xu, J. Broglio, W. B. Croft. The design and
implementation of a part of speech tagger for English. Technical Report IR-52,
Center for Intelligent Information Retrieval, Department of Computer Science,
University of Massachusetts, 1994.
(Yamron et al., 1998) J. P. Yamron, I. Carp, L. Gillick, S. Lowe, P. van Mulbregt. A
hidden Markov model approach to text segmentation and event tracking. In the
Proceedings of the IEEE International Conference on Acoustics, Speech, & Signal
Processing, (ICASSP-98), Vol. 1, pp. 333-336, 1998.
(Yamron et al., 2002) J. P. Yamron, L. Gillick, P. van Mulbregt, S. Knecht.
Statistical Models of Topical Content. Chapter 6, In Topic Detection and Tracking:
Event-based Information Organization. J. Allan (editor), Kluwer Academic
Publishers, 2002.
(Yang, Pedersen, 1998) Y. Yang, J. O. Pedersen. A comparative study on feature
selection in text categorization. In Proceedings of the 14th International Conference
on Machine Learning (ICML-97), pp. 412-420, 1998.
(Yang et al., 1998) Y. Yang, T. Pierce, J. Carbonell. A Study on Retrospective and
On-line Event Detection. In the Proceedings of the Annual ACM SIGIR Conference
on Research and Development in IR, (SIGIR-98), pp. 28-36, 1998.
(Yang et al., 2002) Y. Yang, J. Carbonell, R. Brown, J. Lafferty, T. Pierce, T. Ault.
Multi-Strategy learning for TDT. Chapter 5, In Topic Detection and Tracking:
Event-based Information Organization. J. Allan (editor), Kluwer Academic
Publishers, 2002.
(Yaari, 1997) Y. Yaari. Segmentation of expository texts by hierarchical
agglomerative clustering. In the Proceedings of the Conference on Recent
Advances in Natural Language Processing (RANLP-97), pp. 59-65, 1997.
(Yarowsky, 1993) D. Yarowsky. One sense per collocation. In the Proceedings of
ARPA Human Language Technology Workshop, 1993.
(Youman, 1991) G. Youmans. A new tool for discourse analysis: the vocabulary
management profile. Language, Vol. 67, pp. 763-789, 1991.
(Zajic, Dorr, 2002) D. Zajic, B. Dorr. Automatic headline generation for newspaper
stories. In the Proceedings of the ACL Workshop on Automatic
Summarization/Document Understanding Conference (DUC 2002), 2002.
(Zhou, Hovy, 2003) L. Zhou, E. Hovy. Headline Summarization at ISI. In the
Proceedings of the HLT/NAACL Workshop on Automatic
Summarization/Document Understanding Conference (DUC 2003), 2003.