Applications of Lexical Cohesion Analysis in the Topic Detection and Tracking Domain Nicola Stokes B.Sc. A thesis submitted for the degree of Doctor of Philosophy in Computer Science Supervisor: Dr Joseph Carthy Department of Computer Science Faculty of Science National University of Ireland, Dublin Nominator: Prof. Mark Keane April 2004 ii Abstract This thesis investigates the appropriateness of using lexical cohesion analysis to improve the performance of Information Retrieval (IR) and Natural Language Processing (NLP) applications that deal with documents in the news domain. More specifically, lexical cohesion is a property of text that is responsible for the presence of semantically related vocabulary in written and spoken discourse. One method of uncovering these relationships between words is to use a linguistic technique called lexical chaining, where a lexical chain is a cluster of related words, e.g. {government, regime, administration, officials}. At their core, traditional approaches to IR and NLP tasks tend to treat a document as a ‘bag-of-words’, where document content is represented in terms of word stem frequency counts. However, in these implementations no account is taken of more complex semantic word associations such as the thesaural relationships synonymy (e.g. home, abode), specialisation/generalisation (e.g. cake, dessert) and part/whole (e.g. musician, orchestra). In this thesis we present a novel news-oriented chaining algorithm, LexNews, which provides a means of exploring the lexical cohesive structure of a news story. Unlike other chaining approaches that only explore standard thesaural relationships, the LexNews algorithm also examines domain-specific statistical word associations and proper noun phrase repetition. We also report on the performance of some challenging, real-world applications of lexical cohesion analysis with respect to the performance of ‘bag-of-words’ approaches to these problems. In particular, we attempt to enhance New Event Detection and News Story Segmentation performance: two tasks currently being investigated by the Topic Detection and Tracking (TDT) initiative, a research programme dedicated to the intelligent organisation of broadcast news and newswire data streams. Our results for the New Event Detection task are mixed, and a consistent improvement in performance was not achieved. However, in contrast, our News Story Segmentation results are very positive. In addition, we also explore the effect of lexical cohesion analysis on News Story Gisting (i.e. a type of summarisation that generates a news story title or headline), which although not defined as an official TDT task, is still an important component of any real-world TDT system. Our experiments show that News Story Gisting performance improves when a lexical chaining approach to this task is adopted. iii Acknowledgements First and foremost I would like to thank my thesis supervisor Joe Carthy for having provided unfailing support, feedback and ideas over the course of my research. Without his belief in me none of this would have been possible. I would also like to thank John Dunnion for introducing me to this topic in the final year of my degree, and for his programming expertise and English grammar lessons! Thank you also to my proofreaders, Fergus, Will, Eamonn, and Ray, and to Gerry Dunnion for filling in the ‘vast chasms’ in my technical expertise. 
I would also like to acknowledge the Eurokom gang for happy memories and much devilment especially my roommates Paula and Maf, my coffee buddies Doireann, Aibhin, Colm and Dave, and my long standing partner-in-crime Ray ‘Boldie’ Rafter. During the course of my research, two wonderful collaborations emerged. Firstly, in the Autumn/Winter semester of 2001 I spent an enlightening few months at the Center for Intelligent Information Retrieval, University of Massachusetts. Many thanks to James Allan and Victor Lavrenko for their guidance and support during my stay. Secondly, from 2002 to the end of 2003 I worked on the Físchlár News Stories System with the Department of Engineering and the Centre for Digital Video Retrieval at Dublin City University. Thanks to Alan Smeaton and ‘the lads’ for making this a very pleasant experience. Outside of my ‘college cocoon’ there has been a wealth of support from friends and family, including Gra, Dee, Yvette, Sinead, Lisa; Paula and Ray (again); my parents, landlords, and financial benefactors Joan and Brian1; my little brother Cu; my nana Halligan; my wonderful boyfriend Gar and his parents Betty and George. In particular, apologies and special recognition must go to Gar for (un)successfully feigning interest in the merits of lexical chaining over the past 3 years, and for his love, patience and support. Is iomaí cor sa tsaol. 1 Financial support from Enterprise Ireland is also gratefully acknowledged. iv Table of Contents Abstract iii Acknowledgements iv Chapter 1 Introduction 1 1.1 Thesis Goals 3 1.2 Thesis Outline 4 Chapter 2 Lexical Cohesion Analysis through Lexical Chaining 7 2.1 Cohesion, Coherence and Discourse Analysis 8 2.2 The Five Types of Cohesion 9 2.3 Lexical Cohesion 10 2.4 Semantic Networks, Thesauri and Lexical Cohesion 13 2.4.1 Longmans Dictionary of Contemporary English 13 2.4.2 Roget’s Thesaurus 14 2.4.3 WordNet 15 2.4.4 The Pros and Cons 17 2.5 Lexical Chaining: Techniques and Applications 20 2.5.1 Morris and Hirst: The Origins of Lexical Chain Creation 22 2.5.2 Lexical Chaining on Japanese Text 25 2.5.3 Roget’s Thesaurus-based Chaining Algorithms 26 2.5.4 Greedy WordNet-based Chaining Algorithms 28 2.5.5 Non-Greedy WordNet-based Chaining Algorithms 33 2.6 Discussion 40 Chapter 3 LexNews: Lexical Chaining for News Analysis 42 3.1 Basic Lexical Chaining Algorithm 43 3.2 Enhanced LexNews Algorithm 48 3.2.1 Generating Statistical Word Associations 50 3.2.2 Candidate Term Selection: The Tokeniser 55 3.2.3 The Lexical Chainer 3.3 58 Parameter Estimation based on Disambiguation Accuracy v 61 3.4 Statistics on LexNews Chains 67 3.5 News Topic Identification and Lexical Chaining 69 3.6 Discussion 74 Chapter 4 TDT New Event Detection 4.1 76 Information Retrieval 77 4.1.1 Vector Space Model 80 4.1.2 IR Evaluation 82 4.1.3 Information Filtering 84 Topic Detection and Tracking 86 4.2 4.2.1 Distinguishing between TDT Events and TREC Topics 87 4.2.2 The TDT Tasks 89 4.2.3 TDT Progress To Date 93 New Event Detection Approaches 94 4.3 4.3.1 UMass Approach 95 4.3.2 CMU Approach 97 4.3.3 Dragon Systems Approach 99 4.3.4 Topic-based Novelty Detection Workshop Results 100 4.3.5 Other notable NED Approaches 102 4.4 Discussion 104 Chapter 5 Lexical Chain-based New Event Detection 5.1 Sense Disambiguation and IR 105 106 5.1.1 Two IR applications of Word Sense Disambiguation 107 5.1.2 Further Analysis of Disambiguation for IR 108 5.2 Lexical Chaining as a Feature Selection Method 111 5.3 LexDetect: Lexical Chain-based Event Detection 114 5.3.1 The 
‘Simplistic’ Tokeniser 115 5.3.2 The Composite Document Representation Strategy 115 5.3.3 The New Event Detector 116 The TDT Evaluation Methodology 120 5.4 5.4.1 TDT Corpora 120 5.4.2 Evaluation Metrics 123 5.5 TDT1 Pilot Study Experiments 125 vi 5.5.1 System Descriptions 125 5.5.2 New Event Detection Results 127 5.5.3 Related New Event Detection Experiments at UCD 129 5.6 TDT2 Experiments 132 5.6.1 System Descriptions 133 5.6.2 New Event Detection Results 134 5.7 Discussion 138 Chapter 6 News Story Segmentation 6.1 142 Segmentation Granularity 143 6.1.1 Discourse Structure and Text Segmentation 143 6.1.2 Fine-grained Text Segmentation 144 6.1.3 Coarse-grained Text Segmentation 145 6.2 Sub-topic/News Story Segmentation Approaches 147 6.2.1 Information Extraction Approaches 148 6.2.2 Lexical Cohesion Approaches 151 6.2.3 Multi-Source Statistical Modelling Approaches 159 6.3 Discussion 161 Chapter 7 Lexical Chain-based News Story Segmentation 7.1 SeLeCT: Segmentation using Lexical Chaining 7.1.1 7.2 The Boundary Detector 162 163 164 Evaluation Methodology 168 7.2.1 News Segmentation Test Collections 169 7.2.2 Evaluation Metrics 169 7.3 News Story Segmentation Results 172 7.3.1 CNN Broadcast News Segmentation 173 7.3.2 Reuters Newswire Segmentation 176 7.3.3 The Error Reduction Filter and Segmentation Performance 178 7.3.4 Word Associations and Segmentation Performance 180 7.4 Written versus Spoken News Story Segmentation 182 7.4.1 Lexical Density 183 7.4.2 Reference and Conjunction in Spoken Text 187 7.4.3 Refining SeLeCT Boundary Detection 189 vii 7.5 Discussion 192 Chapter 8 News Story Gisting 195 8.1 Related Work 196 8.2 The LexGister System 197 8.3 Experimental Methodology 198 8.4 Gisting Results 200 8.5 Discussion 204 Chapter 9 Future Work and Conclusions 206 9.1 Further Lexical Chaining Enhancements 206 9.2 Multi-document Summarisation 209 9.3 Thesis Contributions 210 9.4 Thesis Conclusions 212 Appendix A The LexNews Algorithm 216 A.1 Basic Lexical Chaining Algorithm 216 A.2 Lexical Chaining Stopword List 221 Appendix B LexNews Lexical Chaining Example 222 B.1 News Story Text Version 223 B.2 Part-of-Speech Tagged Text 224 B.3 Candidate Terms 225 B.4 Weighted Lexical Chains 226 Appendix C Segmentation Metrics: WindowDiff and Pk 228 Appendix D Sample News Documents from Evaluation Corpora 231 D.1 TDT1 Broadcast News Transcript 232 D.2 TDT2 Broadcast News Transcript 234 D.3 TDT Newswire Article 235 D.4 RTÉ Closed Caption Material 237 References 238 viii Table of Figures 1.1 News story extract illustrating lexical cohesion in text. 1 2.1 Sample category taken from Roget’s thesaurus. 15 2.2 Number of synsets for each of part-of-speech in WordNet. 17 2.3 A generic lexical chaining algorithm. 23 3.1 Example of a spurious relationship between two nouns in WordNet by not following St-Onge and Hirst’s rules. 45 3.2 Diagram illustrating the process of pushing chains onto the chain stack. 47 3.3 LexNews system architecture. 49 3.4 Examples of statistical word associations generated from the TDT1 corpus. 53 3.5 Example of noun phrase repetition in a news story. 56 3.6 Graph showing relationship between disambiguation error and number of senses or different contexts that a noun may be used in. 66 3.7 Graph showing the dominance of extra strong and medium-strength relationships during lexical chain generation. 68 3.8 Graph showing a breakdown of all relationship occurrences in the chaining process. 68 3.9 Sample broadcast news story on the Veronica Guerin movie. 
72 3.10 WordNet noun phrase chains for sample news story in Figure 3.9. 73 3.11 Non-WordNet proper noun phrase chains for sample news story in Figure 3.9. 74 4.1 IR metrics precision and recall. 83 4.2 Typical Information Filtering System. 85 4.3 TDT system architecture. 92 5.1 System architecture of the LexDetect system. 114 5.2 The effect on TDT1 NED performance when a combined document representation is used. 128 5.3 Example of cross chain comparison strategy. 130 5.4 DET graph showing performance of the SYN system using two alternative lexical chain-based NED architecture. ix 131 5.5 DET graph showing performance of two alternative lexical chain-based NED architectures. 132 5.6 DET graph showing performance of the LexDetect and CHAIN systems (using the basic LexNews chaining algorithm), and the UMass system for the TDT2 New Event Detection task. 135 5.7 DET graph showing performance of the LexDetect and CHAIN system (using the enhanced LexNews chaining algorithm), and the UMass system for the TDT2 New Event Detection task. 135 6.1 Example of fine-grained segments detected by Passonneau and Litman’s segmentation technique. 144 6.2 Extract taken from CNN transcript which illustrates the role of domain independent cue phrases in providing cohesion to text. 6.3 A timeline diagram of a news programme and some domain cues. 149 150 6.4 Graph representing the similarity of neighbouring blocks determined by the TextTiling algorithm for each possible boundary or block gap in the text. 153 6.5 Extract of CNN report illustrating the role of lexical cohesion in determining related pieces of text. 156 7.1 SeLeCT news story segmentation system architecture. 163 7.2 Sample lexical chains generated from concatenated news stories. 165 7.3 Chain span schema with boundary point detected at end of sentence 1. 166 7.4 Diagram showing characteristics of chain-based segmentation. 167 7.5 Diagram illustrating allowable margin of error. 170 7.6 Accuracy of segmentation algorithms on CNN test set. 174 7.7 Graph illustrating effects on F1 measure as margin of allowable error is increased for CNN segmentation results. 7.8 Accuracy of segmentation algorithms on Reuters test set. 175 177 7.9 Graph illustrating effects on F1 measure as margin of allowable error is increased for Reuters segmentation results. 177 7.10 Graph illustrating the effect of the error reduction filter on SeLeCT’s F1 measure for the CNN collection as the margin of allowable error increases. 179 x 7.11 Graph illustrating the effect of the error reduction filter on SeLeCT’s recall and precision for the CNN collection as the margin of allowable error increases. 179 7.12 Graph showing effect of word relationships on segmentation accuracy. 180 7.13 Example of the effect of weak semantic relationships on the segmentation process. 181 7.14 CNN transcript of movie review with speaker identification information. 188 7.15 Diagram Illustrating how cohesion information can help SeLeCT’s boundary detector resolve clusters of possible story boundaries. 190 8.1 Recall, Precision and F1 values measuring gisting performance for 5 distinct extractive gisting systems and a set of human extractive gists. 201 A.1 Chaining example illustrating the need for multiple searches. 219 C.1 Diagram showing system segmentation results and the correct boundaries defined in the reference segmentation. xi 229 Table of Tables 2.1 Semantic relationships between nouns in WordNet. 16 3.1 Contingency table of frequency counts calculated for each bigram in the collection. 
52 3.2 Disambiguation accuracy results taken from Galley and McKeown. 63 3.3 Results of Galley and McKeown’s evaluation strategy using the ‘accuracy’ metric and default sense assignments, compared with the recall, precision and F1 values when all disambiguated nouns are not assigned default senses. 64 3.4 Comparing the effect of different parameters on the disambiguation performance of the LexNews algorithm. 64 3.5 Chains statistics for chains generated on subset of the SemCor collection. 69 3.6 Weights assigned to lexical cohesive relationships between chain terms. 70 5.1 Values used to calculated TDT system performance. 123 5.2 Miss and False Alarms rates of NED systems for optimal value of the Reduction Coefficient R on the TDT1 corpus. 128 5.3 Breakdown of TDT2 results into broadcast and newswire system performance. 137 5.4 Breakdown of document lengths in the TDT1 and TDT2 corpora. 138 7.1 Precision and Recall values from segmentation on concatenated CNN news stories. 174 7.2 Precision and Recall values from segmentation on concatenated Reuters news stories. 177 7.3 Results of SeLeCT segmentation experiments when verbs are adding into the chaining process. 185 7.4 Results of C99 and TextTiling segmentation experiments when nominalised verbs are adding into the segmentation process. 187 7.5 Improvements in system performance as a result of system modifications discussed in Sections 7.4.1 and 7.4.3. 7.6 Paired Samples T-Test on initial results from Table 6.1 and 6.2. xii 191 191 7.7 Pair Samples T-Test p-values on refined results taken from Table 6.4. 192 C.1 Error calculations for each metric for each window shift in Figure C.1. 230 xiii Chapter 1 Introduction Humans instinctively know when sentences in a text are related. Cohesion and coherence are two properties of text that help us to make this judgement, where coherence refers to the fact that a text makes sense, and cohesion to the fact that there are elements in the text that are grammatically related (e.g. ‘John’ referred to as ‘he’) or semantically related (e.g. ‘BMW’ referred to as a ‘car’). Of these two properties cohesion is the easiest to compute because it is a surface relationship and coherence requires a deeper textual understanding. The main suspect in the shooting dead of a 58-year-old woman with a hunting rifle was today jailed for life following the return of a unanimous verdict by a Dublin jury. The accused originally denied the charge, but pleaded guilty to second-degree murder when a witness came forward to testify that the murder weapon belonged to the defendant. Figure 1.1: News story extract illustrating lexical cohesion in text. For example, consider the news story extract in Figure 1.1. In this text we can see that although these sentences have no vocabulary in common (apart from stopwords, e.g. a, the, to) they are unequivocally related due to the presence of lexical cohesive relationships between their words (i.e. the commonest form of cohesion found in text). In particular, looking only at the nouns in this text we find the following clusters of related noun phrases: {suspect, accused, defendant}, {murder weapon, hunting rifle}. These clusters are examples of lexical chains generated from the text using thesaural-based relationships where, according to the WordNet thesaurus (Miller et al., 1990), ‘suspect’ is a synonym of ‘defendant’, the ‘accused’ is a specialisation of both ‘suspect’ and ‘defendant’, and ‘hunting rifle’ is a specialisation of a ‘murder weapon’. 
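These thesaural links can be checked programmatically. The following short sketch is illustrative only: it uses the NLTK interface to WordNet (a more recent release than the WordNet versions referred to in this thesis), and the helper name noun_tie and its path threshold are ad hoc choices rather than part of any system described here. It simply shows how clusters such as {suspect, accused, defendant} can be grounded in explicit synonymy, specialisation/generalisation and part/whole links.

```python
# Illustrative sketch only: NLTK's bundled WordNet is a later release than the
# versions discussed in this thesis, so the exact links found may differ.
from nltk.corpus import wordnet as wn

def noun_tie(word1, word2, max_path=2):
    """Return a label for the strongest WordNet tie found between any pair of
    noun senses of the two words, or None if no tie is found."""
    pairs = [(s1, s2) for s1 in wn.synsets(word1, pos=wn.NOUN)
                      for s2 in wn.synsets(word2, pos=wn.NOUN)]
    if any(s1 == s2 for s1, s2 in pairs):
        return "synonymy (shared synset)"
    if any(s2 in s1.hypernyms() or s1 in s2.hypernyms() for s1, s2 in pairs):
        return "specialisation/generalisation (direct IS_A link)"
    if any(s2 in s1.part_meronyms() + s1.member_meronyms() +
                 s1.part_holonyms() + s1.member_holonyms() for s1, s2 in pairs):
        return "part/whole"
    dists = [s1.shortest_path_distance(s2) for s1, s2 in pairs]
    dists = [d for d in dists if d is not None and d <= max_path]
    if dists:
        return "indirect link (%d edges apart)" % min(dists)
    return None

for pair in [("suspect", "defendant"), ("rifle", "weapon")]:
    print(pair, "->", noun_tie(*pair))
```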
In the course of this thesis, we analyse lexical cohesion in text using this technique of lexical chaining. However, unlike previous approaches we also 1 examine lexical cohesive relationships that cannot be defined in terms of thesaural relationships, but are considered ‘intuitively’ related due to their regular cooccurrence in text. There are many examples of these co-occurrence relationships between words in the text extract in Figure 1.1. More specifically, these words are related through their frequency of use in similar news story contexts relating to criminal law, e.g. {life, verdict, jury, guilty, witness, charge, murder}. However, the main focus of this thesis is the use of lexical cohesion analysis in challenging Natural Language Processing (NLP) and Information Retrieval (IR) tasks, with the intention of improving performance over standard techniques. One of the most common approaches to text analysis is based on an examination of word frequency occurrences in text, where the intuition is that high frequency words represent the essence of a particular discourse. For example, word frequency information has been used to build extractive summaries, where sentences that contain many high frequency words are included in the resultant summary. Word frequency information also forms the basis of most approaches to IR, where for example in an ad hoc retrieval situation documents are ranked in order of relevance to a query based on the frequency of occurrence of the query words in each document. However, in these ‘bag-of-words’ techniques frequency counts are calculated with respect to exact syntactic repetition, while other forms of repetition such as synonymy, specialisation/generalisation and part/whole relationships are ignored. Hence, a ‘bag-of-words’ analysis of the previously discussed news extract would have found little similarity between the two sentences except for stopwords. Thus it appears that there are many tasks that could benefit from the additional textual knowledge provided by a lexical cohesion analysis of text. In this thesis we test this hypothesis in the relatively new IR research area of Topic Detection and Tracking (TDT). The TDT initiative (Allan et al., 1998a; Allan, 2002a) is concerned with the organisation of streams of broadcast news and newswire data into a collection of structured information that satisfies a set of user needs, in particular: News Story Segmentation: The segmentation of broadcast news programmes into distinct news stories. Event Tracking: The tracking of a known event (given a set of related news stories) as documents arrive on the input stream. 2 Cluster Detection: The clustering of similar news stories into distinct nonoverlapping groups. New Event Detection: The detection of breaking news stories as they arrive on the news stream. The novelty of these tasks in comparison to previous IR research is that they are required to operate on real-time news streams from a variety of media sources (radio, television and newswire) rather than on a static, retrospective newspaper collection. This requirement makes these filtering-based tasks (excluding Cluster Detection) more difficult than standard query-based retrieval and text classification tasks as relevancy decisions must be made based on only those documents seen so far on the input stream and without any knowledge of subsequent postings. Also the TDT evaluation defines a finer-grained notion of ‘aboutness’ than is found in standard IR evaluations. 
For example, a TDT system is not only required to find all documents on a topic like ‘the OJ Simpson trial’, but also the system must be able to distinguish between the different events making up this topic, e.g. ‘OJ Simpson’s arrest’ and ‘the DNA evidence presented in the trial’. What makes this an interesting application domain for lexical cohesion analysis is that TDT participants have found that standard IR approaches to these tasks have reached a performanceplateau, and that new techniques are required in order to effectively tackle these complex problems. 1.1 Thesis Goals This thesis addresses five primary research goals: To develop a novel lexical chaining method that provides a full analysis of lexical cohesive relationships in text by considering not only thesaural-based links between words, as previous chaining approaches have done, but also to explore domain-specific statistical associations between these words that are not defined in the WordNet taxonomy. To establish which IR and NLP applications most benefit from the lexical cohesion analysis provided by lexical chains. In particular, to determine whether New Event Detection, Story Segmentation and News Story Gisting performance can be improved by considering a richer semantic view of a text than that which is provided by a ‘bag-of-word’-based approach to these problems. 3 To determine the performance of these applications in a previously unexplored domain for lexical cohesion analysis, i.e. TDT broadcast news. To use large-scale evaluation methodologies for ascertaining application performance. This goal is prompted by the observation that previous research efforts involving applications of lexical chaining have involved small-scale evaluations which provide little conclusive evidence of system effectiveness. To comment on the extent to which an NLP technique like lexical chaining is affected by ‘noisy’ data sources such as speech transcripts and closed caption material taken from broadcast news reports. A set of secondary goals arising from the above are also addressed in the thesis: To establish the sense disambiguation accuracy of our LexNews lexical chaining algorithm, as this facet of the algorithm has implications on the performance of the IR and NLP applications explored in this thesis. To propose a novel method of integrating lexical cohesion information into an IR model, and to investigate how this technique performs with respect to the standard conceptual indexing strategy put forward by many lexical chaining approaches to IR problems, where words are replaced with WordNet synsets, e.g. concepts such as {airplane, aeroplane, plane}. Our chosen task for this investigation is TDT New Event Detection. To discover whether a lexical chain-based segmentation strategy that was previously proposed for sub-topic segmentation analysis is a reliable method for determining the boundaries between adjacent news stories in a broadcast news programme. To implement and evaluate a lexical chain-based News Story Gisting system that verifies that text summarisation tasks are appropriate vehicles for lexical cohesion analysis. 1.2 Thesis Outline This thesis is organised into four parts: the first part is dedicated to lexical chaining (Chapters 2 and 3), the second to New Event Detection (Chapters 4 and 5), the third to News Story Segmentation (Chapters 6 and 7), and the fourth to our initial experiments concerning News Story Gisting (Chapter 8). 
4 Chapter 2 is concerned with the notion of lexical cohesion as a property of text, and how this textual characteristic can be analysed using a word clustering method called lexical chaining. Since many types of lexical cohesion can be discovered by examining thesaural relationships between words in a text, the merits and weaknesses of three knowledge sources capable of providing these relationships are discussed. A detailed overview of contemporary approaches to lexical chain generation and its applications are discussed in the remainder of the chapter. Chapter 3 presents our lexical chaining algorithm, LexNews. We establish the values of a number of important parameters in the algorithm using the SemCor corpus, i.e. a collection of documents manually tagged with WordNet synsets. This collection also facilitates the comparison of the performance of our disambiguation algorithm with another approach to chain generation developed at Columbia University. The chapter ends with a motivating example of how lexical chains capture the topicality of a news document. Chapter 4 presents an overview of common approaches used to address information retrieval problems, since these techniques form the basis of many of the approaches taken in TDT implementations. The research objectives of the TDT initiative are then introduced, while the remainder of the chapter focuses on New Event Detection, TDT participant approaches to the problem, and a number of important conclusions drawn from TDT workshops. Chapter 5 describes the LexDetect system, our lexical chain-based approach to the New Event Detection task. The lexical cohesion information provided by the LexNews algorithm is used as a means of representing the essence of a news story. This linguistically-motivated document representation is then integrated into a traditional vector space modelling (VSM) IR approach. The experiments described in this chapter are split into two parts: those performed on the TDT1 corpus using our own implementation of the VSM, and those performed on the TDT2 corpus using the UMass New Event Detection (VSM-based) system where both systems incorporate a lexical chain representation of a news story in their implementations. Chapter 6 provides some background on text segmentation methods, another application of our LexNews algorithm. This chapter examines segmentation approaches with respect to the granularity of the text segments that they produce. 5 Coarse-grained approaches, such as News Story Segmentation techniques, are explored in detail as a prelude to the work described in the following chapter. Chapter 7 describes the SeLeCT system, our lexical chain-based approach to News Story Segmentation. The performance of the SeLeCT system is evaluated with respect to two other well-known lexical cohesion approaches to segmentation: the C99 and TextTiling algorithms. This chapter also investigates the effect of different news media (i.e. spoken broadcast news versus written newswire) on segmentation performance. Chapter 8 discusses the results of our initial experiments on the application of lexical cohesion analysis to News Story Gisting. The evaluation of the LexGister system is two-fold: firstly, the results of an automatic evaluation based on recall and precision values are reported, and secondly the results of a manual evaluation involving a group of human judges are discussed. 
Finally, in Chapter 9, our plans for future work are presented followed by a summary of the research contributions and conclusions arising from this work. 6 Chapter 2 Lexical Cohesion Analysis through Lexical Chaining In this chapter we introduce the fundamental linguistic concepts necessary for understanding one of the main focuses of this thesis: The identification of lexical cohesion in text using a linguistic technique called lexical chaining that discovers naturally occurring clusters of semantically related words in text, e.g. {jailbreak, getaway, escape} and {shotgun, firearm, weapon}. In the following sections we establish where lexical cohesion fits into the general framework of textual properties, and how lexical cohesion in text can be represented using lexical chains. Since lexical cohesion is realised in text through the use of related vocabulary, knowledge sources such as thesauri and dictionaries have been used as a means of identifying lexical cohesive ties between words. Hence, we review the pros and cons of three lexical knowledge sources: Longmans Dictionary of Contemporary English, Roget’s thesaurus and the WordNet taxonomy. This is followed by an in-depth review of the different approaches to lexical chain generation discussed in the literature and the performance of various NLP and IR applications of this technique. 7 2.1 Cohesion, Coherence and Discourse Analysis When reading any text it is obvious that it is not merely made up of a set of unrelated sentences, but that these sentences are in fact connected to each other through the use of two linguistic phenomenon, namely cohesion and coherence. As Morris and Hirst (1991) point out, cohesion relates to the fact that the elements of a text (e.g. clauses) ‘tend to hang together’; while coherence refers to the fact that ‘there is sense (or intelligibility) in a text’. Observing the interaction between textual units in terms of these properties is one way of analysing the discourse structure of a text. Most theories of discourse result in a hierarchical tree-like structure that reflects the relationships between sentences or clauses in a text. These relationships may, for example, highlight sentences in a text that elaborate, reiterate or contradict a certain theme. Meaningful discourse analysis like this requires a true understanding of textual coherence which in turn often involves looking beyond the context of the text, and drawing from real-world knowledge of events and the relationships between them. Hasan, in her paper on ‘Coherence and Cohesive Harmony’ (1984), hypothesises that the coherence of a text can be indirectly measured by analysing the degree of interaction between cohesive chains in a text. Analysing cohesive relationships in this manner is a more manageable and less computationally expensive solution to discourse analysis than coherence analysis. For example, Morris and Hirst (1991) note that, unlike research into cohesion, there has been no widespread agreement on the classification of different types of coherence relationships2. Furthermore, they note that even humans find it more difficult to identify and agree on textual coherence because, although identifying cohesion and coherence are subjective tasks, coherence requires a definite ‘interpretation of meaning’, while cohesion requires only an understanding that terms are about ‘the same thing’. 
2 Some coherence relationships that have been identified between sentences or clauses are elaboration, cause, support, exemplification, contrast and result. In recent work by Harabagiu (1999), attempts have been made to map specific patterns of lexical cohesion coupled with the occurrence of certain discourse markers directly to these coherence relationships, e.g. contrast in text is indicated by the existence of both the discourse marker ‘although’ and an antonymy relationship (e.g. dead-alive or happy-sad). 8 To get a better idea of the difference between the two, consider the following example: After a night of heavy drinking the party fizzled out at around 6am. They then ate breakfast while watching the sunrise. These sentences are only weakly cohesive. Consequently, a deeper understanding of the concept ‘morning’ makes the existence of a coherence relationship between the two sentences highly plausible. However, in the more usual case where an area of text shares a set of cohesively related terms, Morris and Hirst hypothesise that cohesion is a useful indicator of coherence in text especially since the identification of coherence itself is not computationally feasible at present. Stairmand (1996) further justifies this hypothesis by emphasising that although cohesion fails to account for grammatical structure (i.e. readability) in the way that coherence does, cohesion can still account for the organisation of meaning in a text, and so, by implication, its presence corresponds to some form of structure in that text. 2.2 The Five Types of Cohesion As stated in the previous section, cohesion refers to the way in which textual units interact in a discourse. Halliday and Hasan (1976) classify cohesion into five (not always distinct) classes: Conjunction is the only class which explicitly shows the relationship between two sentences, ‘I have a cat and his name is Felix’. Reference and lexical cohesion, on the other hand, indicate sentence relationships in terms of two semantically equivalent or related words. o In the case of reference, pronouns are the most likely means of conveying referential meaning. For example, consider the following sentences: ‘“Get inside now!” shouted the teacher. When nobody moved, he was furious’. In order for the reader to understand that ‘the teacher’ is being referred to by the pronoun ‘he’ in the second sentence, they must refer back to the first sentence. o Lexical cohesion arises from the selection of vocabulary items and the semantic relationships between them. For example, ‘I parked outside the library, and then went inside the building to return my books’, where 9 cohesion is represented by the semantic relationships between the lexical items ‘library’, ‘building’ and ‘books’. Substitution and Ellipsis are grammatical relationships, as opposed to relationships based on word meaning or semantic connection. o In the case of nominal substitution, a noun phrase such as ‘a vanilla icecream cone’ can be replace by the indefinite article ‘one’ as shown in the following example, ‘As soon as John was given a vanilla ice-cream cone, Mary wanted one too’. o Ellipsis is closely related to substitution as it is often described as the special case of ‘zero substitution’, where a phrase such as ‘in my exams’ is left out as it is implied by the preceding sentence which contains the phrase ‘in your exams’. For example, ‘Did you get a first in your exams? No, I only got a third’. 
For automatic identification of these relationships, lexical cohesion is the easiest to resolve since less implicit information is needed to discover these types of relationship between words in a text. In the sample sentence used to define lexical cohesion we identified a generalisation relationship between ‘library’ and ‘building’ and a has-part relationship between ‘library’ and ‘books’. However, there are five further lexical cohesive relationships that are explored in the following section.

2.3 Lexical Cohesion

Lexical cohesion ‘is the cohesion that arises from semantic relationships between words’ (Morris, Hirst, 1991). Halliday and Hasan (1976) define five types of lexical cohesive ties that commonly occur in text. Here are a number of examples taken from a collection of CNN news story transcripts, since the news story domain is the focus of our analysis in this thesis:

Repetition (or Reiteration) – Occurs when a word form is repeated in a later section of the text. ‘In Gaza, though, whether the Middle East’s old violent cycles continue or not, nothing will ever look quite the same once Yasir Arafat come to town. We expect him here in the Gaza Strip in about an hour and a half, crossing over from Egypt.’

Repetition through synonymy – Occurs when words share the same meaning, but have two unique syntactical forms3. ‘Four years ago, it passed a domestic violence act allowing police, not just the victims, to press charges if they believe a domestic beating took place. In the past, officers were frustrated, because they’d arrive on the scene of a domestic fight, there’d be a clearly battered victim and yet, frequently, there’d be no one to file charges.’

Word association through specialisation/generalisation – Occurs when a specialised/generalised form of an earlier word is used. ‘They’ve put a possible murder weapon in O.J. Simpson’s hands; that’s something that no one knew before. And it shows that he bought that knife more than a month or two ahead of time and you might, therefore, start the theory of premeditation and deliberation.’

Word association through part-whole/whole-part relationships – Occurs when a part-whole/whole-part relationship exists between two words, e.g. ‘committee’ is made up of smaller parts called ‘members’. ‘The Senate Finance Committee has just convened. Members had been meeting behind closed doors throughout the morning and early afternoon.’

Word association through collocation – These types of relationship occur when the nature of the association between two words cannot be defined in terms of the above relationship types. They are most commonly found by analysing word co-occurrence statistics, e.g. ‘Osama bin Laden’ and ‘the World Trade Centre’. Halliday and Hasan also classify antonymy in this category of word relationship. Antonyms are words that are exact semantic opposites or complementaries, e.g. male-female, boy-girl, adult-child.

All of these relationships, except statistical word co-occurrences, are types of lexicographical relationships that can be extracted from a domain-independent thesaurus. Statistical associations between words, on the other hand, are generated from domain-specific corpora that reflect the most commonly used senses of words

3 More formally, synonymy refers to the relationship between semantically equivalent words which are interchangeable in all textual contexts. In reality, true synonyms are rare, where near-synonyms or plesionyms are the most common form of synonymy in text.
Halliday and Hasan’s definition of synonymy also includes these near-synonyms. Hirst (1995) states that true-synonymy is mostly limited to technical terms like groundhog/woodchuck. He provides the following example of nearsynonymy: lie/misrepresentation, where a lie is a deliberate attempt to deceive, while the use of misrepresentation tends to imply an untruth told merely out of ignorance. So depending on the context, these terms may be intersubstitutable. 11 in domains such as American Broadcast News or Inorganic Chemistry. The role of the thesaurus in identifying lexical cohesive structure is discussed in the following section, while a description of how to generate these co-occurrences is given in Section 3.2.1. We stressed in the previous section that one of the advantages of analysing cohesion is that it is a surface relationship and thus, unlike coherence, it is relatively easier to model discourse structure by observing cohesive ties. With regard to lexical cohesion (one of the five identified categories of cohesion), Hasan explains that it is the most prolific form of cohesion in text, and as already stated it is the least computational expensive means of identifying cohesive ties. As Hasan and many other researchers have observed, nouns usually convey most of the information in a written text, and so identifying lexicosemantic connections between nouns is an adequate means of determining cohesive ties between textual units. Although verbs also make an undeniable contribution to the grammatical and semantic content of a text (Klavans, Min-Yen, 1998), they are more difficult to deal with than nouns for the following reasons: 1. Verbs are more polysemous than nouns. Fellbaum (1998) states that nouns in the Collins English Dictionary have an average of 1.71 senses whereas verbs have an average of 2.11 senses. For example WordNet defines 3 sense for the noun form of the word ‘close’ and 14 senses for its verb form. Fellbaum suggests the reason for this is that there are fewer verbs than nouns in the English lexicon (in spite of the fact that all sentences need a verb), and so to compensate for this, verb meanings tend to be more flexible. More specifically, the meaning of a verb (especially ambiguous verbs) tends to be dictated by the noun accompanying it in a clause. For example, consider the meaning of the verb ‘have’ in the following contexts: She had a baby. => She gave birth. He had an egg for breakfast. => He ate an egg for breakfast. 2. Fewer verbs are truly synonymous. This depends on how rigid a definition of synonymy is used, but, in general, this means the presence of lexical cohesion in text through synonymous verbs is limited. 3. Not all verb categories suit being cast into a taxonomic framework. Fellbaum states that in the design of WordNet it was possible to associate a 12 large majority of verb categories using an entailment4 relationship between verb-pairs which is not one of the cohesive relationships described by Hasan. Also, unlike the noun hierarchy, a verb sense may belong to one type (or baseconcept), but have as its superordinate another verb sense of a different type (Kilgarriff, Yallop, 2000). As we will see in the next section, taxonomic frameworks are an essential resource in the identification of lexical cohesive ties. From this point on in our discussion all lexical cohesive relationships will refer to relationships between nouns. 
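The sense counts quoted in point 1 above refer to WordNet 1.7.1. As a rough check, the same comparison can be repeated with the NLTK interface to WordNet; this is a side illustration rather than part of our method, and a newer WordNet release will give somewhat different counts.

```python
# Compare noun and verb polysemy for a few words. Counts depend on the WordNet
# release bundled with NLTK, so they will not exactly match the WordNet 1.7.1
# figures quoted in the text.
from nltk.corpus import wordnet as wn

for word in ["close", "have", "run"]:
    nouns = wn.synsets(word, pos=wn.NOUN)
    verbs = wn.synsets(word, pos=wn.VERB)
    print("%s: %d noun sense(s), %d verb sense(s)" % (word, len(nouns), len(verbs)))
```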
2.4 Semantic Networks, Thesauri and Lexical Cohesion In this section we illustrate how semantic networks generated from machinereadable thesauri and dictionaries have been used to identify cohesive links between noun pairs. We will focus much of this discussion on the WordNet thesaurus as it has become a standard resource among researchers in the NLP, AI and IR fields. It is also the knowledge source used by our lexical chaining algorithm, LexNews, described in Chapter 3. 2.4.1 Longmans Dictionary of Contemporary English The Longmans Dictionary of Contemporary English (LDOCE) was the first available machine-readable dictionary. Its popularity as a lexical resource can also be attributed to the simplicity of its design, since it was created with non-native English speakers in mind. More specifically, the dictionary was written so that all gloss definitions were described with respect to a controlled vocabulary of 2,851 words referred to as the Longmans Defining Vocabulary (LDV). In their paper, which looks at calculating the similarity between words based on spreading activation in a dictionary, Kozima and Furugori (1993a) took advantage of this design feature, and generated a semantic network from it using the gloss entries of a subset of the words in the dictionary. This subset of the dictionary is called Glossème and consists of all words included in the LDV. They then created their semantic network Paradigme by connecting all LDV words that share gloss terms resulting in a network of 2,851 nodes connected by 295,914 links. 4 A verb X entails Y if X cannot be done unless Y is, or has been, done. ‘Snoring’ entails ‘sleeping’ as a person cannot snore unless he/she is sleeping. 13 Kozima (1993b) defines lexical cohesion in terms of the semantic similarity between two words, where the similarity between two words w and w’ is measured in the following way: “Produce an activation pattern by activating a node for w and then observe the activity of the second node w’ in the activation pattern”. A similarity score or strength of association between w and w’ can then be calculated based on the significance of w and the activity of the node representing w’ in the activity pattern for w in the network. This measurement results in a score ranging from 0 to 1, where 1 indicates a significant lexical cohesive tie and 0 no relationship between the terms. Kozima (1993b) also defines a method of segmenting text by generating a Lexical Cohesion Profile, a means of representing the cohesiveness of a text in terms of the cohesiveness of windows of word sequences in the text. Kozima’s work will be returned to again in the discussion on text segmentation in Chapter 6. 2.4.2 Roget’s Thesaurus Roget’s Thesaurus is one of a category of thesauri (like the Macquarie Thesaurus) that were custom built as aids to writers who wish to replace a particular word or phrase with a synonymous or near-synonymous alternative. Unlike a dictionary, they contain no gloss definitions, instead they provide the user with a list of possible replacements for a word and leave it up to the user to decide which sense is appropriate. The structure of the thesaurus provides two ways of accessing a word: 1. By searching for it in a list of over 1,042 pre-defined categories, e.g. Killing, Organisation, Amusement, Physical Pain, Zoology, Mankind, etc. 2. By searching for the word in an alphabetical index that lists all the different categories in which the word occurs, i.e. analogous to defining sense distinctions. 
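To make the index structure just described concrete, the toy sketch below tests whether two words share a thesaurus category. The category names and memberships are invented for illustration (loosely echoing the ‘Sociality’ entry shown in Figure 2.1) and are not taken from the actual thesaurus.

```python
# Toy illustration of relatedness via shared thesaurus categories. The index
# contents below are invented; a real implementation would read them from a
# machine-readable copy of the thesaurus.
index = {
    "ball":   {"Rotundity", "Propulsion", "Sociality", "Amusement"},
    "party":  {"Sociality", "Assemblage"},
    "sphere": {"Rotundity", "Space"},
}

def shared_categories(word1, word2):
    """Two words are taken to be related if their index entries share a category."""
    return index.get(word1, set()) & index.get(word2, set())

print(shared_categories("ball", "party"))    # e.g. {'Sociality'} -> related
print(shared_categories("party", "sphere"))  # empty set          -> unrelated
```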
Lexical cohesive relationships between words can also be determined using a resource of this nature, since words that co-exist in the same category are semantically related. There is a hierarchical structure above (classes, sub-classes and sub-sub-classes) and below (sub-categories) these categories in the thesaurus, and it is this structure that facilitates the inferring of a small range of semantic strengths between words; however, unlike LDOCE, ‘no numerical value for semantic distance can be obtained’ (Budanitsky, 1999). In Section 2.5.1, we will explain how Morris and Hirst (1991) used Roget’s thesaurus to find cohesive ties between words in order to build lexical chains. Figure 2.1 shows an extract from the index entry for the noun ‘ball’ in Roget’s thesaurus. Each of the categories defined after the word ‘ball’ represents one of its different senses in various contexts. For example, the ‘dance’ sense of ‘ball’ is found in the ‘Sociality’ category, which contains related words such as ‘party’, ‘entertainment’ and ‘reception’.

Ball: #249 Rotundity; #284 [Motion given to an object situated in front.] Propulsion; …; #892 Sociality
#892 Sociality: party, entertainment, reception, at home, soiree; evening party, morning party, afternoon party, bridge party, garden party, surprise party; kettle, kettle drum; housewarming; ball, festival; smoker, smoker-party; sociable [U.S.], stag party, hen party; tea-party; #840 Amusement
Figure 2.1: Sample category from Roget’s thesaurus.

2.4.3 WordNet

WordNet (Miller et al., 1990; Fellbaum, 1998a) is an online lexical database whose design is inspired by current psycholinguistic theories of human lexical memory. WordNet is divided into 4 distinct word categories: nouns, verbs, adverbs and adjectives. The most important relationship between words in WordNet is synonymy. The WordNet definition of synonymy also includes near-synonymy. Hence, WordNet synonyms are only interchangeable in certain contexts (Miller, 1998). A unique label called a synset number identifies each synonymous set of words (a synset) in WordNet. Each node or synset in the hierarchy represents a single lexical concept and is linked to other nodes in the semantic network by a number of relationships. Different relationships are defined between synsets depending on which semantic hierarchy they belong to. For example, most verbs are organised around entailment (synonymy and a type of verb hyponymy called troponymy (Fellbaum, 1998b)), adjectives and adverbs around antonymy (opposites such as big-small and beautifully-horribly) and synonymy. Nouns, on the other hand, are predominantly related through synonymy and hyponymy/hypernymy. In addition, 9 other lexicographical relationships are also defined between nodes in the noun hierarchy. Table 2.1 defines each of these relationships, where 80% of these links are attributed to the hypernymy/hyponymy relationships (Budanitsky, 1999).

Hyponymy (KIND_OF) – Specialisation: apple is a hyponym of fruit since apple is a kind of fruit.
Hypernymy (IS_A) – Generalisation: celebration is a hypernym of birthday since birthday is a type of celebration.
Holonymy (HAS_PART) – HAS_PART_COMPONENT: tree is a holonym of branch. HAS_PART_MEMBER: church is a holonym of parishioners. IS_MADE_FROM_OBJECT: tyre is a holonym of rubber.
Meronymy (PART_OF) – OBJECT_IS_PART_OF: leg is a meronym of table. OBJECT_IS_A_MEMBER_OF: sheep is a meronym of flock. OBJECT_MAKES_UP: air is a meronym of atmosphere.
Antonymy (OPPOSITE_OF) – Girl is an antonym of boy.
Table 2.1: Semantic relationships between nouns in WordNet.

Figure 2.2 illustrates how each WordNet word category has expanded in size from version 1.5 to version 1.7.1. It is evident from this graph that the noun part of the semantic network is by far the largest. It is also the noun word category that exhibits the most connectivity between its elements, making it an ideal resource for discovering cohesive relationships between nouns in a text. The noun hierarchy is organised around a set of unique beginners, which are simply synsets that do not have any hypernyms; these include ‘entity’, ‘state’, ‘phenomenon’ and ‘abstraction’. Words are then connected vertically via hypernymy and hyponymy relationships, and horizontally via meronymy and holonymy relationships. Unfortunately, there is little interconnectivity between the noun, verb, adverb and adjective files in the WordNet taxonomy: the verb file has no relations with any of the other files, the adverb file has only unidirectional relations with the adjective file, and there are only a limited number of ‘pertains to’ relationships linking adjectives to nouns.

Figure 2.2: Number of synsets for each part-of-speech in WordNet (versions 1.5, 1.6 and 1.7.1).

2.4.4 The Pros and Cons

We will now explore some of the advantages and disadvantages of using the knowledge sources outlined in the preceding subsections to identify lexical cohesive links between words in a text. The main reason why WordNet has become the most commonly used knowledge resource in NLP circles is that it is both freely available and part of an ongoing research initiative involving Princeton University and other research communities. In contrast, only the original Roget’s Thesaurus (1911 edition) has been made available in machine-readable format (by Project Gutenberg in 1991), while the most up-to-date version, Roget’s International Thesaurus, is still unavailable due to copyright restrictions. As we will see in Section 2.5.3, Roget’s 1911 edition has been successfully used to find relationships between words, even though it has a number of limitations: a lack of up-to-date words and phrases; it contains a number of obsolete words; and it has no word index, so that looking for the various uses of a word is a laborious task that requires searching in each of the various categories of the thesaurus until the word is found. WordNet, on the other hand, is a lexical database which contains an index of words separated into different parts of speech which are explicitly linked to related words in the taxonomy. The LDOCE is also readily available upon request for academic research purposes. However, unlike WordNet, the LDOCE comes in a dictionary format that must be transformed into a semantic network, as was described in Section 2.4.1, before it can be used to establish lexicographical relationships between words. Another advantage that WordNet has over other online thesauri and dictionaries is that it is in a continuous state of transition where improvements are made based on the findings and suggestions of the research community that use it. It is currently in its 8th edition, which is particularly important to applications like ours that work on documents from the news story domain, since new vocabulary, word meanings and world events are constantly being added to the general English lexicon.
Consider the recent media obsession with the use of military jargon in Iraqi War reports. For example, consider phrases such as ‘blue on blue’ or events such as ‘Operation Iraqi Freedom’. Obviously, in any NLP application it is preferable to have a lexical resource that covers these terms and explicitly relates them to other words in the taxonomy. The LDOCE is also continually expanded with novel vocabulary and word usages; however, it is being developed as a commercial product rather than as a research resource. Another important knowledge source that is also available in the public domain, that has not been discussed so far, is the Macquarie Thesaurus. This thesaurus has been mapped onto a WordNet-like structure so it can be used for language engineering problems. It is an impressive body of work that covers general English terms as well as Australian English, Aboriginal English and elements of English spoken in South-East Asia. Like WordNet, it is available for academic purposes (for a small annual fee). Although this resource has not been used for the generation of 18 lexical chains, it has been used in Question Answering at TREC and SENSEVAL tasks5. So far we have looked at the advantages of using the WordNet semantic network. Nevertheless, there are also a number of well-documented problems with the taxonomy that can have a significant effect on the performance of NLP applications that avail of it. One of the most striking differences between estimating semantic distances with WordNet and the other dictionaries/thesauri mentioned so far is that relationships between words are explicitly defined in terms of a set of semantic ties, which in turn leads to certain advantages and disadvantages. On the one hand, WordNet is missing a lot of explicit links between intuitively related words. Fellbaum (1998) refers to such obvious omissions in WordNet as the ‘tennis problem’ where nouns such as ‘nets’ and ‘rackets’ and ‘umpires’ are all present in the taxonomy, but WordNet provides no links between these related ‘tennis’ concepts. A thesaurus like Roget’s or a semantic network like Kozima’s LDOCE has a much richer set of relationships between words due to the organisation of terms into categories such as the ‘sociality’ category shown in Figure 2.1. However, it has also been observed that ‘the price paid for this richness is a somewhat unwieldy tool with ambiguous links’ (Kilgarriff, 2000). This unwieldiness makes any measurement of semantic similarity into a near binary decision, i.e. we decide whether two words are related but we can’t quantify how strong this relationship is. In contrast, since the length of paths between related nodes in a taxonomy, such as WordNet, can be measured in terms of edges, one might think that semantic relatedness can be measured in terms of semantic distance with more accuracy. However, this is not the case, as such a measure is based on the following two assumptions outlined by Mc Hale (1998) that do not hold true for any known thesaurus, dictionary or semantic network: 1. Every edge in the taxonomy is of equal length. 2. All branches in the taxonomy are equally dense. Mc Hale suggests that edge length gets shorter as the depth in the hierarchy increases. In the case of WordNet it is well known that certain categories such as those relating to plants and animals are more developed than others. 
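The edge-counting idea, and the uniform edge length assumption it rests on, can be made concrete in a few lines. The sketch below uses the NLTK interface to WordNet and is only a simple illustration of the basic measure; it is not one of the adapted metrics surveyed by Mc Hale or Budanitsky.

```python
# Basic edge counting over the WordNet noun hierarchy: every edge is treated
# as having the same length, which is exactly the assumption questioned above.
from nltk.corpus import wordnet as wn

def min_edge_distance(word1, word2):
    """Smallest number of edges between any noun senses of the two words."""
    dists = [s1.shortest_path_distance(s2)
             for s1 in wn.synsets(word1, pos=wn.NOUN)
             for s2 in wn.synsets(word2, pos=wn.NOUN)]
    dists = [d for d in dists if d is not None]
    return min(dists) if dists else None

# A pair near the abstract top of the hierarchy and a pair deep inside a dense
# branch are scored on the same scale, even though the deeper edges mark much
# finer-grained distinctions.
print(min_edge_distance("entity", "object"))
print(min_edge_distance("oak", "birch"))
```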
To address 5 TREC stands for Text Retrieval Conference (see Section 4.1.2) and SENSEVAL is a workshop dedicated to the development of sense disambiguation systems (see Section 5.1.2). 19 these discrepancies many researchers have tried to adapt the basic edge counting measure with taxonometric information such as depth and density, and measures of statistical word association such as mutual information. For a more detailed examination of these metrics we refer the reader to two excellent sources: Mc Hale (1998) and Budanitsky (1999). Another complaint commonly encountered when using WordNet is that its level of sense granularity is too fine. Polysemy is represented in WordNet as a list of different synset numbers for a particular syntactic form, while in Roget’s thesaurus polysemy is captured by assigning the same syntactic form, such as the word ‘bank’, to a number of different heads or categories. Wilks (1998) claims that this contributes to ‘the fine level of sense distinction present in WordNet’ and to the fact that ‘it lacks much of the abstract classification of Roget’s’. The aim of the lexicographer when annotating dictionary entries such as those of WordNet or the LDOCE is to define and distinguish between all the distinct meanings of a word. However, when the lexicographer is assigning words to categories in a thesaurus, senses of words that are very similar will tend to be placed (only once, to avoid repetition) in the same category and so ‘many dictionary sense distinctions will get lost in the thesaurus’ (Kilgarriff, 2000). Despite the fact that WordNet is an imperfect lexical resource the general consensus in the NLP and CL (Computational Linguistics) communities is that it is still a very valuable one. As a testimony to its indispensability a number of largescale research initiatives have germinated from the WordNet project, including the Euro WordNet project (Vossen, 1998); the Annual WordNet Conference; the SENSEVAL-2 task (sense disambiguation evaluated with respect to WordNet synsets) (SENSEVAL-2, 2001); the numerous attempts to combine WordNet with other lexical resources; the recent release in March 2003 of the eXtended WordNet (implicitly links words through their glosses) (Mihalcea, Moldovan, 2001); and the recently released of WordNet version 2.0. 2.5 Lexical Chaining: Techniques and Applications So far in this chapter we have established what lexical cohesion is, and how its various relationships are manifested in coherent text. We have also examined the merits of different knowledge sources for identifying cohesive ties between nouns 20 in a text. In the remainder of this chapter we focus on lexical chaining as a method of representing the lexical cohesive structure of a text. Lexical chains are in essence sequences of semantically related words, where lexical cohesive relationships between words are established using an auxiliary knowledge source such as a dictionary or a thesaurus. 
Lexical chains have many practical applications in IR, NLP and CL research, such as the following:

Discourse Analysis (Hirst, Morris, 1991)
Text Segmentation (Okumura, Honda, 1994; Min-Yen, Klavans, McKeown, 1998; Mochizuki et al., 2000; Stokes, Carthy, Smeaton, 2002; Stokes, 2003, 2004a)
Word Sense Disambiguation (Okumura, Honda, 1994; Stairmand, 1997; Galley, McKeown, 2003)
Query-based Retrieval (Stairmand, 1996)
A Term Weighting Scheme (Bo-Yeong, 2003)
Multimedia Indexing (Kazman et al., 1996; Al-Halimi, Kazman, 1998)
Hypertext Construction (Green, 1997a; 1997b)
Text Summarisation (Barzilay, Elhadad, 1997; Silber, McCoy, 2000; Brunn, Chali, Pinchak, 2001; Bo-Yeong, 2002; Alemany, Fuentes, 2003; Stokes et al., 2004)
Malapropism Detection in Text (St-Onge, 1995; Hirst, St-Onge, 1998)
Web Document Classification (Ellman, 2000)
Topic Detection and Tracking (Stokes et al., 2000a, 2000b, 2000c; Stokes, Carthy, 2001a, 2001b, 2001c; Carthy, Smeaton, 2000; Carthy, Sherwood-Smith, 2002; Carthy, 2002)
Question Answering (Moldovan, Novischi, 2002)

In 1991 Morris and Hirst published their seminal paper on lexical chains with the purpose of illustrating how these chains could be used to explore the discourse structure of a text. At the time of writing their paper no machine-readable thesaurus was available, so they manually generated chains using Roget’s Thesaurus. Since then lexical chaining has developed from an idea on paper to a fully automated process that captures not only cohesive relationships, but also discourse properties such as thematic focus. In the following subsections, we review some of the principal chaining approaches proposed in the literature, and how lexical chains have been used to solve some of the complex research problems listed above.

2.5.1 Morris and Hirst: The Origins of Lexical Chain Creation

Morris and Hirst (1991) used lexical chains to determine the intentional structure of a discourse using Grosz and Sidner’s (1986) discourse theory. In this theory, Grosz and Sidner propose a model of discourse based on three interacting textual elements: linguistic structure (segments indicate changes in topic), intentional structure (how segments are related to each other), and attentional structure (shifts of attention in text that are based on linguistic and intentional structure). Obviously, any attempt to automate this process will require a method of identifying linguistic segments in a text. Morris and Hirst believed these discourse segments could be captured using lexical chaining, where each segment is represented by the span of a lexical chain in the text.

Morris and Hirst manually generated lexical chains using Roget’s Thesaurus, which consists of an index entry for each word that lists synonyms and near-synonyms for each of its coarse-grained senses, followed by a list of category numbers that are related to these senses. A category in this context consists of a list of related words and pointers to related categories. They used the following rules to glean semantic associations from the thesaurus during the chain generation process, where two words are related if any of the following relationship rules apply:

1. They have a common category in their index entries.
2. One word has a category in its index entry that contains a pointer to a category of the other word.
3. A word is either a label in the other word’s index entry or it is listed in a category of the other word.
4. Both words have categories in their index entries that are members of the same class/group.
5.
Both words have categories in their index entries that point to a common category.

Morris and Hirst also introduced a general algorithm for generating chains, shown in Figure 2.3, on which most other chaining implementations are based.

A General Lexical Chaining Algorithm
1. Choose a set of candidate terms for chaining, t1 … tn. These terms are usually closed-class nouns, i.e. highly informative words as opposed to stopwords.
2. Initialise: The first candidate term in the text, t1, becomes the head of the first chain, c1.
3. for each remaining term ti do
4.   for each chain cm do
5.     Find the chain that is most strongly related to ti with respect to the following chaining constraints:
       a. Chain Salience
       b. Thesaural Relationships
       c. Transitivity
       d. Allowable Word Distance
6.     If the relationship between a chain and ti adheres to these constraints then ti becomes a member of cm, otherwise ti becomes the head of a new chain.
7.   end for
8. end for
Figure 2.3: A generic lexical chaining algorithm

The constraints listed in statement 5 of the algorithm are critical in controlling the scope, size and in many cases the validity of the relationships within a chain. If these constraints are not adhered to or if suitable parameters are not chosen for each of them, then the occurrence of spurious chains (chains that contain weakly related or incorrect terms) will be greatly increased. We now look at each of these constraints in turn:

Chain Salience: This constraint refers to the notion that words should be added to the most recently updated chain. This intuitive rule appeals to the notion that terms are best disambiguated with respect to active chains, i.e. active themes or speaker intentions in the text.

Thesaural Relationships: Regardless of the knowledge source used to deduce semantic similarity between terms, a set of appropriate knowledge source relationships must be decided on. Morris and Hirst state that their relationship rules 1 and 2, defined above, based on Roget’s thesaural structure, account for nearly 90% of relationships between chain words. On the other hand, in WordNet-based chaining the specialisation/generalisation hierarchy of the noun taxonomy is responsible for the majority of associations found between nouns.

Transitivity: Another factor to consider when searching for relationships between words is transitivity. In particular, although weaker transitive relationships (such as a is related to c because a is related to b and b is related to c) increase the coverage of possible word relationships in the taxonomy, they also increase the likelihood of spurious chains, as they tend to be even more context specific than strong relationships such as synonymy. For example, consider the following tentative relationship found in WordNet: ‘foundation stone’ is indirectly related to ‘porch’ since ‘house’ is directly related to both ‘foundation stone’ and ‘porch’. Deciding whether these transitive relationships are useful is a difficult decision as one must also consider the loss of possible valuable relationships if they are ignored, e.g. ‘cheese’ is indirectly related to ‘perishable’ since ‘dairy product’ is directly related to both words according to WordNet.

Allowable Word Distance: This constraint works on a similar assumption as Chain Salience, where the relationships between words are best disambiguated with respect to the words that they lie nearest to in the text.
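Before returning to the distance constraint, the generic procedure of Figure 2.3 can be sketched in code. This is a minimal illustration only: the related() test is a crude stand-in for whatever thesaural relationship test is used, chain salience is reduced to ‘prefer the most recently updated chain’, and the word-distance threshold is an invented value rather than a parameter proposed by Morris and Hirst.

MAX_DISTANCE = 40  # allowable word distance between a term and a chain (illustrative)

def related(term_a, term_b):
    # Crude stand-in for the thesaural relationship test (e.g. a shared Roget's
    # category or a WordNet relation): here, exact repetition only.
    return term_a == term_b

def build_chains(candidate_terms):
    """candidate_terms: list of (position, word) pairs in text order."""
    chains = []  # each chain is a list of (position, word); chains[0] is the most recently updated
    for pos, word in candidate_terms:
        home = None
        # Chain salience: scan the most recently updated chains first.
        for chain in chains:
            last_pos = chain[-1][0]
            close_enough = (pos - last_pos) <= MAX_DISTANCE
            if close_enough and any(related(word, w) for _, w in chain):
                home = chain
                break
        if home is not None:
            home.append((pos, word))
            chains.remove(home)
            chains.insert(0, home)           # move the updated chain to the head
        else:
            chains.insert(0, [(pos, word)])  # the term seeds a new chain
    return chains

Real implementations differ chiefly in which relationships related() accepts (the Thesaural Relationships and Transitivity constraints) and in how the distance threshold is allowed to vary with the strength of the relationship, which is the point taken up next.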
The general rule is that relationships between words that are situated far apart in a text are only permitted if they exhibit a very strong semantic relationship such as repetition or synonymy. As well as these constraints Morris and Hirst also refer to the notion of chain returns, a final chaining step where related chains are merged after the initial chain formation process is complete. Chain returns are chains which share candidate terms with an earlier created chain. In their own words, chain returns occurs when a theme in a text represented by a chain ‘has clearly stopped’, and is then returned to by the speaker later on in the text, thus creating the second occurrence of the chain, i.e. a chain return. If such chains returns are linked back to the original chain and concatenated with it then this resulting chain will represent a single theme or ‘structural text entity’ in the discourse. In many subsequent lexical chaining implementations, including our own, there are no occurrences of chain returns, i.e. no word belongs to more than one chain. This is made possible by relaxing the allowable word distance parameters for strong relationships (to span the entire text), 24 so as to ensure that all repetitions of a particular word will occur in the same chain. However, since Morris and Hirst’s aim was to model intentional shifts in the text, shorter word distance constraints are necessary. Morris and Hirst conclude that lexical chains are good indicators of text structure with respect to Grosz and Sidner’s structural analysis method. Nevertheless, they admit that further research is needed in order to uncover the true impact of ‘unfortunate chain linkages’, the stability of parameters and constraints across different textual styles, and how a static knowledge source such as Roget’s might impact on the practical limitations of their work. 2.5.2 Lexical Chaining on Japanese Text The first fully automated version of Morris and Hirst’s algorithm was implemented by Okumura and Honda (1994) for use on Japanese text which uses a thesaurus similar in design to Roget’s called Bunrui-goihyo. However, unlike Morris and Hirst, they only avail of a subset of all possible relationships provided by the thesaurus. More specifically, for two words to be related they must either be repetitions or share the same category number. Okumura and Honda’s algorithm begins by choosing a set of candidate terms for chaining. In this case all nouns, adjectives and verbs are chosen. Unlike the generic algorithm detailed in Figure 2.3 words are not immediately added to chains. Instead, an attempt is made to disambiguate term ti with respect to the other words that occur in the current sentence. Any terms that remain ambiguous after this step will be disambiguated when added to the most appropriate chain in the next step. Again, as in the case of the generic algorithm, a term is added to the chain that is most salient and strongly related to a particular sense of ti. If a satisfactory relationship between a chain cm and ti is found then ti becomes a member of cm, otherwise ti becomes the head of a new chain. Since one of the side effects of the lexical chaining process is sense resolution, Okumura and Honda decided to test how well their chaining algorithm could disambiguate words in a small collection of 5 Japanese texts. They found that chain-based disambiguation obtained an average accuracy of 63.4%. According to the authors this is ‘a tolerable level of performance without any training’. 
Disambiguation errors were attributed to two sources: morphological analysis errors and errors due to word senses being ‘dragged into the wrong contexts’.

Okumura and Honda also experimented with lexical chains as a means of segmenting text into semantic units. They observed that distinct segments tend to use related words. Therefore, an area of the text that exhibits a high number of chain-end and chain-begin points indicates a transition between old and new themes in the text. This lexical chaining application will be returned to again in Chapters 6 and 7, where a more detailed look at the authors’ approach to this task is given. The performance of their segmentation method was evaluated on a collection of only 5 texts and appeared to be poor, with an average recall and precision of 0.52 and 0.25 respectively6. However, they noted that this work was in a preliminary stage, and that with a number of refinements, such as taking account of discourse markers (clue words) and including a measure of chain importance, they could improve these results.

6 In this context, recall refers to the number of correctly identified boundaries divided by the total number of boundaries in the text, while precision is the number of correctly identified boundaries divided by the total number of boundaries returned by the system.

More recently, in Mochizuki et al. (2000), Okumura and Honda’s algorithm was used to segment Japanese documents into sub-topic segments. These segments were then used in passage-level retrieval experiments, where relevant passages or segments of long documents are returned in response to user queries instead of an entire set of semi-relevant documents. Mochizuki et al.’s results showed that by combining passage-level retrieval with a combination of keyword retrieval and lexical chain-derived segments, their system could outperform a method that used either one of these techniques. Using lexical chains as a means of segmenting text will be returned to again in Chapters 6 and 7.

2.5.3 Roget’s Thesaurus-based Chaining Algorithms

In their original chaining algorithm, Morris and Hirst (1991) manually identified lexical cohesive relationships between chain words using a version of Roget’s Thesaurus. Subsequent attempts to automate the chaining process have predominantly focussed on finding lexical cohesive relationships using WordNet. St-Onge (1995) and Stairmand (1996) both cite the following convincing arguments as to why they chose WordNet over Roget’s (a number of these points have already been discussed in more detail in Section 2.4.4):

The 1911 version of Roget’s has no index (only categories are listed, hence this index must be automatically generated), and although the 1993 version includes an index it is not freely available.
Roget’s lacks new words and so is not well suited to processing contemporary texts.
Defining the strength of a cohesive link using Roget’s is more difficult, as most relationships between words are found by their co-existence in thesaural categories with no accompanying explanation of how related they are. On the other hand, relationships between terms in WordNet are defined in a more principled and explicit manner.
Concepts in WordNet are organised around psycholinguistic theories of human lexical memory structures.
WordNet versions are accompanied by a library of functions providing access to its database (a brief illustrative sketch of such programmatic access is given below).
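As an illustration of this last point, the short sketch below uses the NLTK interface to WordNet (one freely available wrapper around the WordNet database; the choice of library and of the words queried is ours, not St-Onge’s or Stairmand’s) to look up the kinds of lexicographical relationships a chainer needs.

from nltk.corpus import wordnet as wn  # assumes the WordNet data files have been installed

# Each sense of a noun is a synset; polysemy is simply a list of synsets.
for synset in wn.synsets("bank", pos=wn.NOUN)[:3]:
    print(synset.name(), "-", synset.definition())

# Relationships of the kind used by WordNet-based chainers:
helicopter = wn.synset("helicopter.n.01")
print(helicopter.lemma_names())    # synonyms grouped in the same synset
print(helicopter.hypernyms())      # semantically more general synsets
print(helicopter.part_meronyms())  # part/whole relations, where defined

Comparable lookups against the 1911 Roget’s would first require reconstructing an index automatically, which is part of the difficulty noted in the first point above.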
However, in spite of these inadequacies of Roget’s, there have been two recent attempts at automating chain creation using Roget’s Thesaurus and Morris and Hirst’s original cohesive link definitions (listed in Section 2.5.1). Ellman (2000) details a system called Hesperus that generates lexical chains and calculates document similarity using these chains. Ellman acknowledges the difficulties outlined by Stairmand (1996) and St-Onge (1995), but also points out that Roget’s, unlike WordNet, contains intuitive associations between words, which are an important part of the chaining process. He also notes that Roget’s has a more balanced structure than WordNet, and so might prove more efficient for establishing cohesive ties. He further suggests that the difficulty in determining the strength of relationships using Roget’s might not be such a disadvantage, since determining semantic proximity in WordNet has also proven to be a very difficult problem.

Ellman’s Hesperus system is designed to enhance search engine retrieval results. The lexical chaining element of his research involves using cohesive chains to build document representations based around Roget’s categories, called Generic Document Profiles (GDPs). Two documents are then considered very similar if their GDPs have a number of Roget’s categories in common. Ellman evaluates the Hesperus document similarity strategy by comparing its ranking of texts to that of human judges who were asked to rank a random set of texts in order of similarity to a set of texts taken from Encarta on diverse topics ranging from artificial intelligence to socialism. The results of his experiment were mixed: rankings for some topics were shown to be statistically significant when compared to the gold-standard human rankings, while results for other topics were disappointing. Ellman also found that document representations composed of fine levels of Roget’s category granularity, with no explicit sense disambiguation, worked best in most cases.

Jarmasz and Szpakowicz (2003) also generated lexical chains using Roget’s. However, they used a machine-readable 1987 version of Penguin’s Roget’s Thesaurus of English Words and Phrases. Their research group at the School of Information Technology and Engineering, University of Ottawa, has recently been granted permission to work with this version of the thesaurus. According to the authors, initial experiments have shown that, as expected, Roget’s provides a very broad range of associations between words using Morris and Hirst’s relationship rules. However, they conclude that these ‘thesaural relations are too broad to build well-focussed chains or too computationally expensive to be of interest’. Ellman came to a similar conclusion and found that only a sub-set of Morris and Hirst’s relationship types were useful (see Section 2.5.1 for a list of relationship types).

2.5.4 Greedy WordNet-based Chaining Algorithms

The generic algorithm shown in Figure 2.3 is an example of a greedy chaining algorithm, where each addition of a term ti to a chain cm is based only on those words that occur before it in the text. A non-greedy approach, on the other hand, postpones assigning a term to a chain until it has seen all possible combinations of chains that could be generated from the text. One might expect that there is only one possible set of lexical chains that could be generated for a specific text, i.e. the correct set. However, in reality, terms have multiple senses and could be added into a variety of different chains.
For example consider the following piece of text: Among the many politicians who sought a seat on the space shuttle, John Glenn was the obvious, perfect choice. There are 10 distinct senses of ‘space’ defined in the WordNet thesaurus, ranging from the topological sense of space to the ‘blank space on a form’ sense. It is 28 obvious when reading the above sentence that the correct sense of ‘space’ used in this context is WordNet’s sense number 5, i.e. any region of space outside the earth’s atmosphere. However, a greedy algorithm would have chosen WordNet’s sense 6 defined as ‘an area reserved for a purpose’. More specifically, the algorithm was forced to make this disambiguation error as sense 1 of ‘seat’ is a type of ‘space’ and so the word ‘space’ will be added to a chain containing the word ‘seat’. This is an example that fortifies the argument against a greedy chaining approach, where the algorithm could have made an more informed decision, and chosen the correct sense of ‘space’, if it had postponed disambiguating ‘space’ at such an early stage in the text and waited until it had seen additional vocabulary such as ‘shuttle’, ‘John Glenn’, ‘NASA’, ‘space station’, and ‘MIR’. On the other hand, non-greedy chaining approaches are still prone to disambiguation errors and are less efficient. Further discussion of these points is left to Section 2.5.5, while in the following section we will look at some greedy chaining algorithms proposed in the literature and how they have been applied to various IR and NLP tasks. St-Onge and Hirst St-Onge and Hirst’s algorithm (St-Onge, 1995; Hirst and St-Onge, 1998) was the first published work to use WordNet as a tool for building lexical chains. Their intention was to use lexical chains as a means of detecting malapropisms in text. StOnge (1995) defines a malapropism as ‘the confounding of one word with another word of similar sound and/or similar spelling that has a quite different meaning’. StOnge provides the following example, ‘an ingenuous machine for peeling oranges’ where ‘ingenuous’ is confused with ‘ingenious’. A traditional spell checker would not have picked up this type of error because its purpose is to identify incorrect spellings and, in some cases, grammatical mistakes. St-Onge and Hirst’s biggest contribution to the study of lexical chains is the mapping of WordNet relations and paths (transitive relationships) to Morris and Hirst’s word relationship types. St-Onge defines three categories of WordNet-based relationships in a text: Extra Strong Relations include all repetition-based relationships, e.g. ‘men’ and ‘man’. 29 Strong Relations include all synonyms (bike, bicycle), holonyms/meronyms (arm, biceps), antonyms (night, day), and hypernyms/hyponyms (aircraft, helicopter). Medium-Strength Relations include all relationships with allowable paths in WordNet (with a maximum path length 5). By defining what constitutes an allowable path between words in the taxonomy, StOnge and Hirst aimed to limit spurious and tentative links between words in chains. Since St-Onge and Hirst’s algorithm forms an integral part of our own approach to lexical chain generation, we leave an in-depth discussion of the exact details of their algorithm to Chapter 3, which documents our enhanced version of their algorithm. Many other chaining algorithms supervening their work are also based on their approach. 
In particular, the systems of both Green (1997b) and Kozima and Ito (1997), described below, use St-Onge’s algorithm, LexC, to enhance hypertext generation and to improve multimedia indexing, respectively. St-Onge and Hirst base their malapropism detector on the following hypothesis: words that do not form lexical chains with other words in a text are potential malapropisms, as they appear to be semantically dissimilar to the general context of the text. Once these potential malapropisms have been detected St-Onge’s algorithm then tries to find slight spelling variations of these words that fit into the overall semantics of the document. Hence, if one of these spelling alternatives forms a relationship with one of the lexical chains, then St-Onge’s algorithm concludes that the original word was incorrectly used, and that the variation of the word was the intended use. An evaluation of this system on 500 Wall Street Journal articles, that were deliberately corrupted with roughly one malapropism every 200 words (1409 in total) yielded a precision of 12.5% and a recall of 28.2%7. In further experiments by Budanitsky (1999), it was shown that malapropism detection could be improved by using a more simplistic approach that analysed the semantic distance between all terms in the text, rather than one based on lexical chains. 7 In this context recall is defined as the percentage of correctly identified malapropisms as a portion of the total number of malapropisms and precision is defined as the percentage of correctly identified malapropisms as a portion of the total number of malapropisms detected by the system (Budanitsky, 1999). 30 Green and Hirst Green’s system (1997b) for the construction of hypertext between newspaper articles creates links between related paragraphs within a document (intra-article links) and across documents (inter-article links). Inter-article links are generated based on the cosine similarity of their WordNet synset vectors, where each synset vector is made up of all lexical chain members generated for that document. Green states that these synset vectors ‘can be seen as a conceptual or semantic representation of an article, as opposed to the traditional IR method of representing a document by the words it contains’. A different strategy is used for generating intra-article hypertext links, where the similarity between each pair of paragraphs is calculated based on two factors: the number of words they contain that occur in the same lexical chain and the importance of that chain to the overall topic of the text. Chain importance is weighted with respect to the relative length of the chain in the text, i.e. the number of elements in the chain divided by the total number of content words in the document. Green evaluated the usefulness of his hypertext links by examining how well subjects answered a set of questions by following links within and between documents in a TREC corpus of newspaper articles. Green found that users experienced no significant advantage when answering questions using lexical chainbased links over links generated using a simple vector space model of document similarity. Green suggests that a more appropriate evaluation would involve directly analysing and scoring the validity of the intra and inter-article links generated by each approach rather than indirectly evaluating links with respect to a task. Kazman et al. Kazman et al. 
(1995, 1997) used lexical chains as a means of creating a ‘meaningful’ index of the recorded words spoken during a meeting or videoconference. By ‘meaningful’ Kazman et al., like Green, refer to topic/concept/theme-based indexing rather than a traditional keyword-based indexing approach. This indexing strategy is just one component of a larger multimedia indexing system called Jabber, which provides users with a unified browsing/searching interface. As well as concept indexing, there are three further indexing utilities: indexing by human-interaction patterns, indexing by human- 31 prepared meeting agendas, and indexing by a participant’s use of a shared application. Initially, Jabber captured topics discussed in a meeting by generating lexical trees. These trees are essentially lexical chains that have been reorganised into a hierarchical structure where the most general word in the chain is placed at the root of the tree which represents the general concept of the chain or tree. Later lexical trees were replaced by concept clusters that only use relationships in the is-a (hypernym/holonym) hierarchy to generate chains because Kazman found that it is easier to characterise or label the cluster with the lowest common hypernym. According to Kazman et al. (1996), lexical trees compared favourably with human generated trees created from a journal article of 1,800 words where more than 75% of subject-assigned words that characterised the theme of their cluster were found in the characteristic tree (i.e. the strongest automatically generated lexical tree). They go onto conclude that the results of their experiment are ‘a strong indication of the usefulness of the lexical trees in information indexing and retrieval’ (Kazman et al., 1996). Brunn et al. Brunn et al. (2001, 2002) at the University of Lethbridge use lexical chains to extract relevant sentences from a text that should be included in a summary. They use a greedy chaining algorithm that is similar in nature to St-Onge’s except that they only consider relationships between words that have a path length no longer than two edges in WordNet. Their algorithm also requires that the relationship between words in a chain is pairwise mutual. This means that every word in a chain must be related to every other word in the chain. In contrast, St-Onge only requires one chain word to be related to the target word during chain generation. The most significant difference between their technique and the other approaches, discussed so far in this section, is their preprocessing step for choosing candidate nouns for chaining. In most cases, preprocessing involves part-of-speech tagging in order to identify nouns, proper nouns and noun phrases. This is often followed by a morphological analysis that transforms nouns into their singular form. A stoplist of ‘noisy’ nouns is usually used at this stage to eliminate words that cause spurious chains to be formed and/or contribute very little to the subject of the text. However, Brunn et al. (2001) suggest a more attractive alternative to static 32 ‘noisy’ noun removal using a stopword list, where nouns are dynamically identified as ‘information poor’ by applying the following hypothesis: ‘nouns contained within subordinate clauses are less useful for topic detection than those contained within main clauses’. The problem then becomes how to identify subordinate clauses in sentences– no easy task according to the authors. 
Hence, they use the following heuristic on parsed text: if a noun is either ‘the first noun phrase or the noun phrase included in the first verb phrase taken from the first sub-sentence of each sentence’, then it is a candidate term and will take part in chaining; otherwise, the noun belongs to a subordinate clause and is eliminated from the chaining process. Although this noun filtering heuristic is appealing, the authors did not compare the performance of their lexical chain-based summariser when it used this preprocessing component to when it used the traditional method of filtering ‘troublesome’ nouns from the chaining process using a stopword list. Hence, it is unclear if their technique actually improves chain performance. A more detailed list of heuristics for identifying subordinate clauses can be found in Brunn et al. (2002), and information on multi-document summarisation using their chaining technique can be found in Chali et al. (2003). 2.5.5 Non-Greedy WordNet-based Chaining Algorithms As stated previously, a non-greedy approach to lexical chaining postpones resolving ambiguous words in a text until it has analysed the entire context of the document. Barzilay and Elhadad (1997) were the first to discuss the advantages of a nongreedy chaining approach. They argued that disambiguating a term after all possible links between it and the other candidate terms in the text have been considered was the only way to ensure that the optimal set of lexical chains for that text would be generated. In other words, the relationships between the terms in each chain will only be valid if they conform to the intended interpretation of the terms when they are used in the text. For example, chaining ‘jaguar’ with ‘animal’ is only valid if ‘jaguar’ is being referred to in the text as a type of cat and not a type of car. However, with this potential improvement in lexical chain quality comes an exponential increase in the runtime of the basic chaining algorithm, since all possible chaining scenarios must be considered (Silber, McCoy, 2002). This has lead to a number of recent initiatives to develop a linear time non-greedy algorithm that attempts to improve chaining accuracy without over-burdening CPU and 33 memory resources. In Chapter 3, we look more closely at the assumption that nongreedy chaining produces better chains. In particular, we show that this extra computational effort does not necessarily result in improved disambiguation accuracy during chaining. Two categories of non-greedy chaining approaches have been proposed in the literature: those that attempt to create all possible chains and then choose the best of these chains (Stairmand, 1996; Barzilay, 1997; Silber McCoy, 2002); and those that disambiguate terms before noun clustering begins resulting in a single set of chains (Bo-Yeong, 2003; Galley and McKeown, 2003). We now examine approaches from both of these categories in detail. Stairmand and Black Stairmand (1996), like St-Onge (1995), was one of the first to attempt the automatic realisation of lexical chains using WordNet. Stairmand’s thesis (1996) examines how lexical cohesion information about a text can be used to improve traditional keyword-based approaches to IR problems. More specifically, Stairmand used lexical chains for segmenting text (similar to Okumura and Honda’s technique, described in Section 2.5.2) and for disambiguating word senses (Stairmand, 1997) and for improving ad hoc retrieval. 
However, his main focus was his experimental IR system COATER (Context-Activated Text Retrieval) (Stairmand, 1996; Stairmand, Black, 1997; Stairmand, 1997). COATER’s task was to take a set of disambiguated TREC queries, consisting of WordNet senses, and subsequently determine the relevance of a document to a query by observing the level of activation of each query word’s concepts in a document representation consisting of lexical clusters of WordNet synsets.

In line with other lexical chaining techniques, Stairmand’s chaining algorithm QUESCOT (Quantification and Encapsulation of Semantic Content) first chooses a set of candidate terms (nouns) for chaining. The second phase of his chaining procedure, the non-greedy aspect of his algorithm, then establishes all the possible term senses in the text that are related to each other through direct and indirect links found in WordNet. In this context, direct links are repetitions, synonyms, hypernyms, holonyms, meronyms and hyponyms, while indirect links are hypernym paths in WordNet where no maximum path length is defined. After establishing all possible links between each sense of each word in the text, the next step involves generating a set of lexical clusters. Stairmand’s lexical clusters are not mutually exclusive, so the same word can occur in different chains. The next step is to merge any lexical clusters that share the same sense of a word. This allows transitive relationships to be found between word senses that are not explicitly defined in the taxonomy. Once merging is completed, all lexical clusters are broken up into lexical chains by splitting clusters at points in the cluster where the allowable distance threshold of 80 words between adjacent terms is exceeded. Any chain resulting from this split must exceed 3 words. Consequently, there will be some words in the cluster that are not included in the resultant chains. These chains are then weighted with respect to the portion of text spanned by the chain (their span) and their density (the number of terms divided by the span of the chain). The overall span of a cluster then becomes the average span of each of its chains, while its overall density becomes the average density of each of its chains; a brief sketch of this span and density weighting is given below.

Stairmand’s main evaluation of his lexical chains centres on ad hoc TREC retrieval. Document representations in COATER consist of a set of synsets derived from the chain clusters, where each synset is weighted with respect to the strength of its cluster, i.e. the normalised product of its span and density. In this way, concepts in a text are weighted in terms of their overall contribution to the message of the text, where concepts that are members of low-scoring clusters are assigned a low score in the document representation. COATER was evaluated with respect to the SMART information retrieval system on a set of 50 queries taken from a TREC evaluation corpus. Stairmand found that COATER ranked relevant documents more precisely than SMART if the query’s terms refer to the main theme or an important sub-theme of the ranked documents. However, the SMART system was more adept at distinguishing between documents that were mildly relevant or not relevant at all to the query. Stairmand believes that this result is encouraging and shows that ‘textual context can improve precision performance’.
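As referenced above, the span and density weighting can be sketched as follows. This is a simplified reading of Stairmand’s scheme: positions are word offsets, a cluster’s strength is taken to be the product of its average span and average density, and the normalisation by document length is an assumption rather than his exact formulation.

def span_and_density(positions):
    """positions: sorted word offsets of one chain's members in the text."""
    span = positions[-1] - positions[0] + 1   # stretch of text covered by the chain
    density = len(positions) / span           # chain members per word of span
    return span, density

def cluster_strength(chains, text_length):
    # A cluster's span and density are averaged over its chains; the cluster
    # score is the (here length-normalised) product of the two.
    spans, densities = zip(*(span_and_density(c) for c in chains))
    avg_span = sum(spans) / len(spans)
    avg_density = sum(densities) / len(densities)
    return (avg_span * avg_density) / text_length

# Example: a cluster that was split into two chains in a 1000-word article.
print(cluster_strength([[10, 20, 35, 50], [400, 430, 455]], 1000))

Each WordNet synset in the COATER document representation then inherits the strength of the cluster from which it was derived.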
Stairmand also admits, however, that the recall performance of COATER is relatively poor (due to gaps in WordNet coverage, especially in the case of proper noun coverage), which he concludes ‘prohibits the use of COATER in a real world IR scenario’ (Stairmand, 1997).

Barzilay and Elhadad

Barzilay and Elhadad (1997) were the first to coin the phrase ‘a non-greedy or dynamic solution to lexical chain generation’. They proposed that the most appropriate sense of a word could only be chosen after examining all possible lexical chain combinations that could be generated from a text. Their dynamic algorithm begins by extracting nouns and noun compounds. Barzilay reduces both non-WordNet and WordNet noun compounds to their head noun, e.g. ‘elementary_school’ becomes ‘school’. As each target word arrives, a record of all possible chain interpretations is kept, and the correct sense of the word is decided only after all chain combinations have been completed.

As stated at the beginning of this section, one of the main problems associated with non-greedy algorithms is that they exhibit an exponential runtime. To reduce this algorithmic complexity, Barzilay’s dynamic algorithm continually assigns each chain interpretation a score determined by the number and weight of the relations between chain members. When the number of active chain interpretations for a particular word sense exceeds a certain threshold (i.e. 10 chains), weaker interpretations with lower scores are removed from the remainder of the chaining process. Further reductions in the runtime of the algorithm are also achieved by Barzilay’s stipulation that relationships between words are only permitted if the words occur in the same text segment. Hearst’s TextTiling algorithm was used to segment the document into sub-topics or text segments8. Consequently, chain merging is a necessary component of the algorithm, as same-sense words often occur, due to this stipulation, in different chains. Once all chains have been generated only the strongest chains are retained.

8 In further experiments Barzilay showed that segmentation does not improve chain disambiguation accuracy (Barzilay, 1997). However, Silber and McCoy (2000; 2002), described next, use segmentation to ‘reduce the complexity of their algorithm’, which is a more efficient version of Barzilay and Elhadad’s.

Barzilay provides a more rigorous justification of her chain weighting scheme than Stairmand does. In particular, she uses a human evaluation to determine what chain characteristics are indicative of strong chains (representing pertinent topics) in a text, i.e. chain length, chain word distribution in the text, chain span, chain density, graph topology (of chain word relationships in WordNet) and the number of word repetitions in the chain. Barzilay found that the best predictors of chain importance or strength were the chain length (the number of words in the chain plus repetitions) and the homogeneity index (one minus the number of distinct occurrences of words in the chain divided by its length). A single measure of chain strength is calculated by combining chain length with homogeneity index. In order for a chain to be retained, its strength score must exceed the average chain strength score plus twice the standard deviation of the chain strength scores; a brief sketch of this selection criterion is given below. The main focus of Barzilay’s thesis is to investigate whether a useful summary can be built from a lexical chain representation of the original text.
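As noted above, the selection of strong chains can be sketched directly from these definitions. The toy chains below are invented for illustration; combining length and homogeneity as a simple product, and using the sample standard deviation, are assumptions, while the individual definitions follow the text above.

import statistics

def chain_score(chain_words):
    """chain_words: the chain's members in text order, repetitions included."""
    length = len(chain_words)               # number of words plus repetitions
    distinct = len(set(chain_words))
    homogeneity = 1.0 - distinct / length   # homogeneity index
    return length * homogeneity

def strong_chains(chains):
    scores = [chain_score(c) for c in chains]
    threshold = statistics.mean(scores) + 2 * statistics.stdev(scores)
    return [c for c, s in zip(chains, scores) if s > threshold]

chains = [
    ["government", "regime", "government", "administration", "government"],
    ["porch", "house"],
    ["school", "teacher", "school"],
]
print([chain_score(c) for c in chains])   # [2.0, 0.0, 1.0]
# With only a handful of toy chains nothing clears the mean + 2 * stdev bar;
# in a real document, with many more chains, only a few pertinent chains do.
print(strong_chains(chains))              # []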
Since chains represent pertinent themes in a document, significant sentences that should be included in a summary can therefore be identified by examining the distribution of chains throughout the text. More specifically, Barzilay found that the following heuristic worked well: for each chain choose a representative word (i.e. the term with the highest frequency of occurrence), then extract the first sentence in the text that contains a representative word for each of the chains (i.e. extract one sentence for each chain). A data set of 40 TREC news articles containing roughly 30 sentences each was chosen for the evaluation. Human subjects were asked to produce summaries of length 10% and 20% respective of the length of the original document. Barzilay then compared the similarity of these manually constructed summaries to those generated by her lexical chain-based system, the Microsoft Word 1997 Summariser, and Marcu’s summariser based on discourse structure analysis (Marcu, 1997). Results from these experiments show that Barzilay’s lexical chain-based summaries are closer to human generated summaries than either of the other two systems. Results from these experiments showed that lexical chains are a strong intermediate representation of a document, and hence should perform well in other applications that would benefit from a more meaningful representation of a document than a ‘bag of words’ representation. Silber and McCoy An important extension of Barzilay and Elhadad’s work has been Silber and McCoy’s (2000, 2002) linear time version of their lexical chaining algorithm. They make two modifications to Barzilay’s algorithm in order to reduce its runtime. 37 The first modification relates to the WordNet searching strategy used to determine word relationships in the taxonomy. In her original implementation Barzilay uses the source code accompanying WordNet to access the database, resulting in a binary search of the input files. Silber and McCoy note that chaining efficiency could be significantly increased by re-indexing the noun database by line number rather than file position, and saving this file in a binary indexed format. Consequently, this also meant writing their own source code for accessing and taking advantage of this new arrangement of the taxonomy. Their second modification to Barzilay’s algorithm related to the way in which ‘chain interpretations’ are stored, where Barzilay’s original implementation explicitly stores all interpretations (except for those with low scores), resulting in a large runtime storage overhead. To address this, Silber and McCoy’s implementation creates ‘a structure that implicitly stores all chain interpretations without actually creating them, thus keeping both the space and time usage of the program linear’. Once all chain interpretations, or meta-chains as Silber and McCoy refer to them, are created, their algorithm must then decide which meta-chains are members of the optimal set of lexical chains for that document. To decide this, their algorithm makes a second pass though the data taking each noun in the text, and deciding which meta-chain it contributes the most to. The strength of a noun’s contribution to a chain depends on two factors: how close the word is in the text to the word in the chain to which it is related, and how strong the relationship between the two words is. 
For example, if a noun is linked by a hypernym relationship to a chain word that is one sentence away then it gets assigned a score of 1, if the words are 3 sentences away the score is lowered to 0.5. Silber and McCoy define an empirically-based scoring system for each of the WordNet relationships found between terms during chaining. The subsequent steps of their algorithm proceed in a similar fashion to Barzilay’s, where only chains that exceed a threshold (twice the standard deviation of the mean of the chain scores plus the mean chain score) are selected for the final stage of the summarisation process. Although Silber and McCoy evaluate their chaining approach with respect to a summarisation task, they do not compare the results of their algorithm to Barzilay and Elhadad’s. However, in theory their algorithm should perform better since they did not have to employ a pruning cycle in their algorithm as Barzilay did, in order 38 to improve the runtime of her algorithm by periodically eliminating low scoring chain interpretations. Instead they focus on the fact that their algorithm can complete the chaining of a document in 4 seconds that would have taken Barzilay’s implementation 300 seconds. Due to these improvements in time/space complexity, Silber and McCoy approach does not impose an upper limit on document size. Bo-Yeong Unlike the previous three chaining algorithms, Bo-Yeong’s technique (Bo-Yeong, 2002, 2003) belongs to the second category of non-greedy algorithm that disambiguates words before chain formation proceeds. Her idea is similar to Okumura and Honda’s approach in that an attempt is made at disambiguating nouns within a certain local context, or, as Bo-Yeong calls it, a semantic window. The larger this window size the more nouns are examined, i.e. for a given noun, if the window size is n then 2n nouns will be involved in its disambiguation. The most likely sense of a noun is the sense that links most frequently with the terms in the semantic window, where each sense is assigned a score depending on the strength of this link (Bo-Yeong, 2002). Once as many terms as possible are disambiguated in this way, the chaining algorithm proceeds in a similar manner to the generic algorithm in Figure 2.3. Bo-Yeong ran a similar experiment to Barzilay’s, and also found that her chaining method produced summaries that were closer in content to human generated summaries than Microsoft Office 2000 summaries. Her chaining method was also evaluated as a keyword extraction technique, which, when compared to the KEA extraction system (Witten et al., 1998), extracted more nouns that were deemed topically important by human judges. However, this evaluation ignored other parts of speech such as verbs and adjectives, which may have significantly impacted KEA’s performance. Galley and McKeown Like Bo-Yeong, Galley and McKeown (2003) devised a non-greedy chaining method that disambiguates nouns prior to the processing of lexical chains. Their method first builds a graph, called a disambiguation graph, representing all possible links between word senses in the document, where each node corresponds to a distinct sense of a word. Like most other techniques these relationships between 39 words are weighted with respect to two factors: the strength of the semantic relationship between them, and their proximity in the text. Galley’s weighting scheme is nearly identical to Silber and McCoy’s (2002). 
However, their relationship-weight assignments are in general lower for terms that are further apart in the text. Once the graph is complete, nouns are disambiguated by summing the weights of all the edges or paths emanating from each sense of a word to other words in the text. The word sense with the highest score is considered the most probable sense and all remaining redundant senses are removed from the graph. Once each word has been fully disambiguated lexical chains are generated from the remaining disambiguation graph. Another interesting aspect of Galley and McKeown’s paper is that they evaluated the performance of their lexical chaining algorithm with respect to the disambiguation accuracy of the nouns in the resultant lexical chains. Their evaluation showed that, in this respect, their algorithm was more accurate than Barzilay and Elhadad’s and Silber and McCoy’s. A more in-depth discussion on the format of this experiment is left until Section 3.3 where we report the findings of a similar experiment undertaken in the course of our research. 2.6 Discussion In this chapter, we have explored the notion of lexical cohesion and how it relates to the discourse structure of a text. Our definition of lexical cohesion, taken from Halliday and Hasan (1984), encompasses five different types of lexical cohesion: repetition, synonymy, generalisation/specialisation, whole-part/part-whole and collocation or statistical word association. A number of lexicosemantic knowledge sources have been used by researchers to capture these lexical cohesive relationships between words in text. We looked at the advantages and disadvantages of three such resources: Longmans Dictionary of Contemporary English, the WordNet online thesaurus and Roget’s Thesaurus. In the course of our research we rely on the WordNet thesaurus for establishing associations between words in text due to its popularity in NLP circles. However, although it covers four of the five forms of lexical cohesion, WordNet is still an inadequate resource for capturing statistically significant or ‘intuitive’ word associations. Most researchers try and indirectly capture these relationships by following path lengths in WordNet. 40 However, it has been shown that this is an unreliable means of establishing semantic relationships between words because the existence of a path between two words in the taxonomy (especially a long path) does not necessarily correspond to a strong semantic relationship between these words. In the following chapter, we give details of how statistical word associations can be incorporated into a lexical chaining framework. This framework is based on StOnge and Hirst’s lexical chaining algorithm, described in Section 2.5.4, which like the majority of approaches is WordNet-based. We make two modifications to their original algorithm which complement it and help to improve the generation of lexical chains in a news environment. Our algorithm, LexNews, falls into the greedy category of chaining algorithms. As stated in Section 2.5, various lexical chaining researchers, such as Barzilay and Elhadad, have stressed the need for nongreedy approaches to chaining. The assumption is that by postponing the disambiguation of a noun until all possible lexical chains (or senses of the noun) have been considered, disambiguation accuracy will improve and the occurrence of spurious chains will be reduced. 
However, in Section 3.3 we show that a nongreedy approach to chaining does not necessarily improve disambiguation accuracy, and that St-Onge and Hirst’s greedy algorithm is as effective as a non-greedy approach. 41 Chapter 3 LexNews: Lexical Chaining for News Analysis Most IR and NLP applications such as question answering, machine translation, text summarisation, information retrieval and information filtering, to name but a few, are developed and tested on collections of news documents. News text is a popular area for such research for two reasons: 1. There is a genuine demand for intelligent and automatic tools for managing ever-expanding repositories of daily news. 2. Large volumes of news stories are freely available on the Internet, on television and radio broadcasts and in newswire and print formats. Hence, it is relatively easy to gather a large collection of related news documents for experimental purposes, as opposed to collecting documents of a more sensitive and confidential nature which would lead to access restrictions and other additional overheads. As already stated, the focus of the work in this thesis is based around the tasks defined by the Topic Detection and Tracking (TDT) initiative which, unlike the majority of other news related investigations, looks at broadcast news (automatically recognised speech transcripts and closed-caption transcripts from radio and television broadcasts) as well as documents from standard newswire collections. With this in mind, we have designed and developed a lexical chaining method called LexNews that is specifically suited to building lexical chains for documents in the news domain. Unlike previous chaining approaches, LexNews integrates domain dependent statistical word associations into the chaining process. As already stated in Section 2.3, statistical word associations represent an additional type of lexical cohesive relationship that is not found in WordNet. LexNews also recognises the importance of analysing the lexical cohesion resulting from the use of named entities, such as people and organisations that are ignored by standard chaining techniques since such proper noun phrases are also absent from the WordNet taxonomy. 42 We begin this chapter with a discussion of the basic chaining algorithm used in our LexNews implementation. This algorithm was devised by St-Onge and Hirst and was categorised in the previous chapter as a ‘greedy WordNet-based chaining approach’. In Section 3.2, the LexNews algorithm is described in terms of the enhancements made to this basic chaining algorithm. This is followed by a more indepth analysis of the lexical chains generated by the LexNews algorithm, which includes an evaluation of the quality of the chains with respect to disambiguation accuracy, details of chain statistics, and finally a discussion on how lexical chains can be used as an intermediary natural language representation of news story topics. 3.1 Basic Lexical Chaining Algorithm In Section 2.5.4, we described a lexical chaining algorithm by St-Onge and Hirst that establishes relationships between nouns in a text using the WordNet taxonomy. This algorithm is an important focal point of the work described in this thesis as it forms the basis of our own lexical chaining approach, LexNews. As already stated, LexNews has been adapted in a number of ways, which gives it an advantage over other chaining algorithms in a news domain. 
In this section, we will review some of the main aspects of St-Onge’s chainer, and look more closely at how the algorithm traverses WordNet’s noun taxonomy seeking out valid relationships between noun phrases along its pathways. We begin our dissection of St-Onge and Hirst’s approach with their definition of the three possible link directions between words in the WordNet thesaurus:

Horizontal Link: Includes antonym relationships, i.e. nouns that are opposites such as ‘life’ and ‘death’.
Upwards Link: Includes semantically more general relationships such as meronyms (‘building’ is more general than ‘office_building’) and hypernyms (‘flower’ is more general than ‘rose’).
Downwards Link: Includes semantically more specific relationships such as hyponyms (‘fork’ is more specific than ‘cutlery’) and holonyms (‘school_year’ is more specific than ‘year’).

Coupled with these link direction definitions are three categories of relationship type, based on repetition relationships and lexicographic WordNet relationships:

Extra Strong Relations: These relationships include word repetitions, e.g. mouse/mice.
Strong Relations: These relationships are split into the following subtypes:
o Two words are related if they have the same synset number in WordNet, e.g. telephone/phone
o Two synsets are related if they are connected by a horizontal link
o Two synsets are related if an upward link or downward link exists between them
o Two words are related if one word is a compound noun that contains the other, e.g. orange_tree, tree
Medium-strength Relations: Two synsets are related if an allowable path of length greater than 1, but no more than 5, exists between them in WordNet, where an allowable path is defined by the following two rules:
o No other direction must precede an upward link
o No more than one change of direction is allowed, except when a horizontal link is used to make the transition from an upward to a downward link

As explained in Section 2.4.4, the use of WordNet to measure the semantic distance between words is based on the premise that semantic distance is directly related to semantic proximity (or the number of edges) between terms in the taxonomy. As stated before, this basic assumption is incorrect as it presumes that every edge and all branches (sub-hierarchies) in the taxonomy are of equal length. Unfortunately, this is not the case and a strategy to compensate for these inadequacies is needed. Hence, St-Onge and Hirst define link directions and link categories in order to limit the number of semantically proximate but unrelated terms that can be found by following illogical paths in the taxonomy. The two rules that define an allowable path in a medium-strength relationship are based, according to Budanitsky (1999), on ‘psycholinguistic theories concerning the interplay of generalisation, specialisation, and coordination9’. However, only intuitive reasons for these rules are given in St-Onge’s thesis.

9 Coordination is a structural relationship that exists between words due to the use of the conjunctions ‘and’ and ‘or’. An example of a coordinate relationship between terms is a string of words such as ‘all green vegetables are good for you especially spinach, cabbage and broccoli’. This is an obvious list of related words, but less obvious context-dependent relationships may be identified by words or even clauses being listed together in this way, e.g. ‘Tom, Dick and Harry’ are also associated through a coordination relationship. This type of relationship is captured in WordNet by the inclusion of noun phrases like ‘skull_and_crossbones’ and ‘seek_and_destroy_mission’. It is less specific than specialisation or generalisation associations.
With regard to the first rule, stating that no link may precede a general link, St-Onge explains that once the context has been narrowed down by an antonym or a specific relationship, ‘enlarging the context by following a general link doesn’t make much sense’. With regard to the second rule, St-Onge explains that changes of direction constitute large semantic steps, and therefore must be limited except in the case of horizontal links, which represent small semantic steps. However, the only horizontal link possible between two nouns is an antonym relationship, which means that this exception to the second rule is a rare occurrence. Hence, most medium-strength relationships will consist of paths of generalisations (upward links) or paths of specialisations (downward links). The example shown in Figure 3.1, of an inaccurate link between the words ‘handbag’ and ‘airbag’, helps to illustrate the need for these rules.

Figure 3.1: Example of a spurious relationship between two nouns in WordNet found by not following St-Onge and Hirst’s rules (the figure connects ‘handbag’ to ‘airbag’ through the intermediate nodes ‘restraint’, ‘fastener’ and ‘clasp’ via generalisation and specialisation edges).

As well as defining relationship categories, St-Onge and Hirst also rank them in order of strength. Thus, when their chaining algorithm is searching for a relationship between a target word and a word in a chain, it seeks an extra strong relationship first, then a strong relationship, and finally a medium-strength relationship, if the preceding relationships were not found. Unlike the relationships that make up the extra strong and strong link categories, medium-strength relationships differ in strength with respect to two factors: the length of the path and the number of direction changes in the path. Hence, even if the algorithm comes across a medium-strength relationship it must continue searching for all remaining medium-strength relationships between the target word and all currently created chains, in order to ensure that it has found the strongest possible medium-strength link. St-Onge defines a formula to weight these medium-strength links:

Link Strength = C − path length − k × (number of direction changes)    (3.1)

where C and k are constants. According to this equation, medium-strength relationships with short path lengths and a low number of direction changes are assigned higher weights. St-Onge and Hirst also impose further restrictions on the allowable word-chain distance between related terms, where the word-chain distance is defined as the difference between the sentence number of the current word being added to the chain and the sentence number of the chain word with the highest (i.e. closest) sentence number to the current word. More specifically, St-Onge limits the search scope to a distance of 7 sentences for strong relationships and 3 sentences for medium-strength links. No distance restriction is defined for extra strong (repetition) relationships, as St-Onge’s algorithm concurs with the ‘one sense per discourse’ assumption (Gale et al., 1992).
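A small sketch of this relationship ranking and of the link weighting in Equation 3.1 is given below. The three relationship tests are placeholders assumed to be supplied by the caller, and the constants C and k are illustrative values rather than St-Onge’s published settings.

C, K = 8, 1  # constants of Equation 3.1; the values here are illustrative only

def medium_strength_weight(path_length, direction_changes):
    # Equation 3.1: shorter allowable paths with fewer changes of direction score higher.
    return C - path_length - K * direction_changes

def best_relationship(target, chain_words, is_extra_strong, is_strong, allowable_paths):
    """Seek a relation with a chain's words in decreasing order of strength.
    The three relationship tests are assumed to be supplied externally."""
    for word in chain_words:
        if is_extra_strong(target, word):   # repetition: the search can stop at once
            return ("extra strong", None)
    for word in chain_words:
        if is_strong(target, word):         # synonymy, antonymy, single up/down link, ...
            return ("strong", None)
    # Medium-strength links vary in weight, so every allowable path within the
    # search scope must be examined and only the strongest one kept.
    best = None
    for word in chain_words:
        for path_length, changes in allowable_paths(target, word):
            weight = medium_strength_weight(path_length, changes)
            if best is None or weight > best[1]:
                best = ("medium strength", weight)
    return best

In the full algorithm this search is further constrained by the sentence-distance limits just described: 7 sentences for strong links, 3 for medium-strength links, and no limit for repetitions.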
St-Onge and Hirst's chainer is categorised in Chapter 2 as a greedy lexical chaining approach. This implies that the algorithm adds a word to a chain by considering only the context to the left of that word up to that point in the text, and that no information regarding the context to the right of the word is considered in the chaining process. However, this is not entirely true, as St-Onge's implementation attempts to chain words in a sentence first (the word's immediate context) before committing the algorithm to choosing a particular sense for a word or making it the seed of a new chain. This process is implemented using a queue data structure where each word in a sentence n is added to the queue, and extra strong relationships are sought between these sentence words and all currently created chains. The search through the chain stack for a relationship with the current queue member halts as soon as an extra strong relationship is found, whereupon the candidate term is removed from the queue and added to the related chain. Strong relationships are then sought between all remaining members of the sentence word queue and each lexical chain. Again, any related words are added to their respective chains and removed from the queue as soon as a strong relationship is found. This process is repeated for medium-strength relationships. However, in this case, as already stated, all medium-strength connections within the search scope must be sought, since the weight of a medium-strength relationship can vary (unlike extra strong or strong links). Once all medium-strength relationships are found and weighted using Equation 3.1, the current queue word is deleted from the queue and added to the chain with the strongest medium-strength weight. At this point in the algorithm, if there are still unchained words in the sentence queue, a new chain for each unchained queue member is created. Use of this queue structure is similar to a windowing approach where words are disambiguated with respect to their left- and right-hand contexts. The size of these left- and right-hand contexts depends on the position of the word in the sentence, and the sentence's begin and end points in the text.

[Figure omitted: three stack states.
Panel 1, incoming word 'kid': stack head → 3 {foster_home, home}; 2 {guardian, ward, deputy, official}; 1 {child}; 0 {fund, money, assistance, welfare}.
Panel 2, 'kid' chained: stack head → 3 {child, kid}; 2 {foster_home, home}; 1 {guardian, ward, deputy, official}; 0 {fund, money, assistance, welfare}.
Panel 3, incoming word 'school' seeds a new chain: stack head → 4 {school}; 3 {child, kid}; 2 {foster_home, home}; 1 {guardian, ward, deputy, official}; 0 {fund, money, assistance, welfare}.]
Figure 3.2: Diagram illustrating the process of pushing chains onto the chain stack.

One final important feature of St-Onge's algorithm is chain salience. This facet of the algorithm ensures that words are not only added to the most strongly related chain, but also to the most recently updated chain. This idea is based on the notion that chains that are currently active in the text will be the most appropriate context in which to disambiguate the current noun. This feature of their algorithm is implemented using the 'chain stack' data type illustrated in Figure 3.2. In this example, the next word in the sentence queue to be added to the chain stack is the word 'kid', which is added to the chain at position 1 in the stack. Since 'kid' and 'child' are synonyms (a type of strong relationship), the search for a related term is complete, and 'kid' is added to this chain, which is in turn moved to the head of the chain stack.
The next word in the current sentence queue is 'school'; however, no lexicographical relationship is found between this word and any of the chains in the chain stack, so 'school' becomes the seed of a new chain. This chain is then pushed onto the head of the chain stack.

3.2 Enhanced LexNews Algorithm

In the previous section we reviewed St-Onge and Hirst's approach to chain generation. In this section we give details of our lexical chaining system called LexNews, which is an enhanced version of their algorithm. LexNews consists of two components: a 'Tokeniser' for text preprocessing and selection of candidate terms for chaining, and a 'Chainer' which clusters related candidate terms into groups of semantically associated terms. Our novel approach to chaining differs from previous chaining attempts (see Chapter 2) in two respects:

It incorporates genre-specific information into the chaining process in the form of statistical word associations.
It acknowledges the importance of considering proper nouns in the chaining process when dealing with text in a news domain, and builds a distinct set of chains representing the repetition relationships between these parts of speech in the text.

The motivation for including statistical word associations in the chaining procedure is discussed in the following section, and our method of generating these co-occurrence relationships is also described. This is followed by a description of our tokeniser, based on the identification of technical terms in text proposed by Justeson and Katz (1995), and of our enhanced lexical chainer. Figure 3.3 gives an overview of the LexNews architecture.

[Figure omitted: block diagram showing input news articles passing through the Tokeniser (sentence boundary identifier, part-of-speech tagger, noun phrase parser) to produce candidate terms, which are passed to the Lexical Chainer (drawing on WordNet and statistical word associations) to produce lexical chains for each news article.]
Figure 3.3: LexNews system architecture.

3.2.1 Generating Statistical Word Associations

As previously stated, existing approaches to generating lexical chains rely on the existence of either repetition or lexicographical relationships between nouns in a text. However, there are a number of important reasons for also considering co-occurrence relationships in the chaining process. These reasons relate to the following missing elements in the WordNet taxonomy:

Missing Noun Relationships: The fact that all relationships between nouns in WordNet are defined in terms of synonymy, specialisation/generalisation, antonymy and part/whole associations means that a lot of intuitive cohesive relationships, which cannot be defined in these terms, are ignored. This characteristic of the taxonomy was referred to in Section 2.4.4 as 'the tennis problem' (Fellbaum, 1998b), where establishing links between topically related words such as 'tennis', 'ball' and 'net' is often impossible.
Missing Nouns and Noun Senses: WordNet's coverage of nouns is continually improving with the release of each new version. However, a lot of British and Hiberno-English phrases are still missing, including the 'sweater' sense of 'jumper', the 'bacon' sense of 'rasher' and the 'drugstore' sense of 'chemist'. On occasion there are also noun phrase omissions, for example 'citizen band radio' and its abbreviation 'CBR'. However, the omission of examples such as these usually only becomes obvious or critical when working in a very specific domain.
Missing Compound Noun Phrases: Many compound nouns are also absent from the noun taxonomy.
For example, there are a number of important news-related compound noun phrases such as 'suicide bombing' or 'peace process' that are not listed. Quite often these compound nouns are 'media phrases' that, in time, will find their way into future versions of the WordNet thesaurus.

The generation of co-occurrence relationships between nouns from the news story domain is one method of addressing these inadequacies. In fact, augmenting WordNet with these types of statistical relationships has been, and still remains, a 'hot topic' in computational linguistics research. Many researchers believe that this is the most appropriate means of both improving the connectivity of a general ontology and of adapting it to better suit a particular domain-specific application (Resnik, 1999; Agirre et al., 2000; Mihalcea, Moldovan, 2001; Stevenson, 2002). In our case, rather than attempting the complex task of fitting word associations into the taxonomy (i.e. disambiguating word associations with respect to synsets), and the subsequent re-organisation of the taxonomy, we view our co-occurrence data instead as an auxiliary knowledge source that our algorithm can 'fall back on' when a relationship cannot be found in WordNet. We will now look at how these co-occurrences were generated using a log-likelihood association metric on the TDT1 broadcast news corpus (Allan et al., 1998).

Tokenisation of the TDT1 corpus is the first step in generating bigram statistics for token pairs, where tokens in this context are simply all WordNet nouns and compound nouns (excluding proper nouns) identified by the JTAG tagger (Xu, Broglio, Croft, 1994). These nouns are transformed, if necessary, into their singular form using handcrafted inflectional morphological rules. With regard to the identification of proper noun phrase relationships, we felt this process would complicate the estimation of frequencies, as some form of normalisation would be needed in order to tackle the disambiguation and mapping of phrases such as 'Hillary Clinton' and 'Senator Clinton' to a single concept, while correctly identifying that 'Bill Clinton' is an entirely different entity. However, identifying these types of relationships would be a definite advantage during the chaining process, and a possible avenue for future research. The next step is the collecting and counting of bigram frequencies in a window size of four nouns, where all combinations of noun bigrams within this window are extracted and their frequency counts are updated. We emphasise the phrase 'window size of four nouns', as most co-occurrence statistics are calculated based on the entire vocabulary of the corpus. We are (as previously stated) only interested in relationships between nouns. Therefore, we simplify this process by providing the bigram identifier with only the nouns in this active window, where the window is limited by sentence and document boundaries. In addition, the left-right ordering of the nouns in each bigram in the window is preserved with respect to their original ordering in the text. Once the relevant statistical counts are collected, the algorithm then uses a statistical association metric to determine which bigrams occur in the TDT1 corpus more often than would be expected by chance. Dunning (1994) highlights a number of common statistical measures of co-occurrence and their inadequacies. In particular, Dunning states that it is incorrect to assume that all words are either normally or approximately normally distributed in a large corpus.
For example, 'simple word counts made on a moderate sized corpus show that words which have a frequency of less than one in 50,000 words make up about 20% - 30% of typical English newswire reports'. Hence, these low frequency words are too rare to expect standard statistical techniques (that rely heavily on the assumption of normality) to work. In particular, Dunning criticises the use of measures like Pearson's χ² test and z-score tests, since they tend to over-estimate the significance of the occurrence of rare events. Instead, he suggests that likelihood ratio tests that 'do not depend so critically on assumptions of normality' are more suitable for textual analysis. He also stresses that a likelihood ratio test such as the G2 statistic is easier to interpret than a t-test, z-test or χ²-test as it does not have to be looked up in a table. Instead, it can be directly interpreted as follows: G2 measures the deviation between the expected value of the frequency of the bigram AB and the observed value of its occurrence in the corpus. We use the following log-likelihood formula taken from Pedersen (1996) to measure the independence of the words in a bigram, where Table 3.1 is a contingency table showing the counts needed to estimate G2.

Bigram = AB    A      ¬A     Total
B              n11    n12    n1+
¬B             n21    n22    n2+
Total          n+1    n+2    n++

Table 3.1: Contingency table of frequency counts calculated for each bigram in the collection.

Here, n11 is the frequency of the bigram AB in the collection, n1+ is the number of bigrams with the word B in the right position, n+1 is the number of bigrams with A in the left position, and n++ is the total number of bigrams in the corpus. The maximum likelihood estimate is then calculated as follows:

m_ij = (n_i+ × n_+j) / n_++    (3.1)

and the log-likelihood ratio is defined as:

G2 = 2 Σ_{i,j} n_ij log(n_ij / m_ij)    (3.2)

So, after filtering out less significant bigrams (i.e. removing all bigrams with G2 < 18.9) from a corpus of 1,565,988 nouns, we collected 25,032 significant bigrams or collocates, which amounted to 3,566 nouns that had an average of 7 collocates each. This is admittedly a relatively small number of co-occurrence relationships; however, our intention here is to capture only the most strongly related and domain-specific on-topic noun relationships in the corpus, as most other relationships can be found using the WordNet taxonomy.
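The maximum likelihood estimate and the G2 formula above are straightforward to implement. The sketch below shows one way they might be coded, assuming the cell and marginal counts of Table 3.1 have already been collected for each bigram; the function and variable names are illustrative rather than taken from the thesis software.

```python
from math import log

def g2_score(n11, n1p, np1, npp):
    """Log-likelihood ratio (G2) for a bigram AB.
    n11 = frequency of the bigram AB, n1p = bigrams with B in the right
    position, np1 = bigrams with A in the left position, npp = total bigrams."""
    # Fill in the remaining cells and marginals of the 2x2 contingency table.
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11
    n2p = npp - n1p          # row total for ¬B
    np2 = npp - np1          # column total for ¬A
    cells = [(n11, n1p, np1), (n12, n1p, np2),
             (n21, n2p, np1), (n22, n2p, np2)]
    total = 0.0
    for nij, row_total, col_total in cells:
        if nij > 0:
            mij = row_total * col_total / npp   # expected count m_ij
            total += nij * log(nij / mij)
    return 2 * total

# Bigrams can then be filtered with the threshold used in the text, e.g.:
# significant = {pair for pair, counts in bigram_counts.items()
#                if g2_score(*counts) >= 18.9}
```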
AIDS: virus 0.993134, HIV 0.950758, patient 0.897451, research 0.806503, disease 0.8009, infection 0.788194, vaccine 0.753551, activist 0.662563, epidemic 0.64386, drug 0.635456, researcher 0.625237, health 0.528567, cure 0.41355, counselor 0.408026, testing 0.16426, treatment 0.0705887, cancer 0.0479009
Airport: flight 0.978022, runway 0.926649, passenger 0.845762, plane 0.813131, hijacker 0.779987, arrival 0.771899, convoy 0.628275, port 0.59517, harbor 0.528054, airline 0.506392, airlift 0.503038, delay 0.482481, facility 0.408617, perimeter 0.404356, departure 0.309304, cargo 0.214528, pilot 0.158775
Glove: estate 0.995581, scene 0.973169, blood 0.966698, property 0.955137, murder 0.935448, walkway 0.920652, home 0.874014, crime 0.870699, defense 0.859927, driveway 0.834991, match 0.820273, evidence 0.817314, guesthouse 0.773595, racist 0.702691, knit 0.697956, house 0.682884, detective 0.675308, new_yorker 0.667298, trail 0.630485, hair 0.617661, grounds 0.582741, article 0.582623, left_hand 0.549755, bronco 0.540522, mansion 0.46512, prosecutor 0.428464, plastic 0.338857, bag 0.302399, body 0.0952888, football 0.0610795, prosecution 0.0288431
Plutonium: reactor 0.944839, nuclear_weapon 0.927557, ingredient 0.890152, uranium 0.84229, bomb 0.840949, material 0.674085, Munich 0.605903, waste 0.5494, fuel 0.360125, Korea 0.339765, expert 0.202612, weapon 0.202573, site 0.0967882, facility 0.0877131

Figure 3.4: Examples of statistical word associations generated from the TDT1 corpus.

Figure 3.4 contains examples of statistically derived word associations (in order of normalised strength) for four nouns in the TDT1 collection. The majority of these relationships could not be found using the WordNet thesaurus. For example, the compound noun phrases 'AIDS activist', 'AIDS epidemic' and 'Plutonium reactor', and the word associations 'airport-passenger', 'airport-plane' and 'AIDS-HIV', cannot be associated using the taxonomy and the basic chaining algorithm described in Section 3.1. Although the relationship between AIDS and HIV is represented in WordNet, a path length of 16 edges must be traversed in order to establish a link between these words, and our chaining algorithm only looks at relationships with a maximum path length of four edges. Although Figure 3.4 contains a number of motivating word relationships, there are also examples of weakly related word associations that have surprisingly high strength of association scores. In particular, we noticed a number of strange relationships with the noun 'glove', e.g. 'glove-estate', 'glove-scene', 'glove-murder', and 'glove-blood'. These co-occurrences are classic examples of how corpus statistics can be skewed when generated from a document collection containing a large number of stories on a particular topic. In this case the topic is the OJ Simpson trial, where a blood-soaked glove found near the crime scene was discussed at length in the trial by the prosecution, and consequently in the TDT1 collection. One method of tackling this problem is to remove documents from some of the larger topics in the collection when calculating associations. However, we felt that a number of important word relationships pertaining to the 'judicial system' may have been lost, so all 16,856 TDT1 news stories were considered.
We also found that the occurrence of weak links like these in the chaining process is infrequent, as the likelihood of finding, for example, the nouns 'glove' and 'blood' in the same news story in a news corpus which does not cover the OJ Simpson case is low, thus ensuring the integrity of relationships between lexical chain members. However, incorporating statistical word associations into the chaining process does pose one significant problem when generating WordNet-based lexical chains. More specifically, statistical word associations fail to consider instances of polysemy, where the sense of a word defined in a chain may not be related to the intended sense of the statistically associated word. For example, 'gun' is statistically related to the word 'magazine', and so they should in theory be added to the same lexical chain. However, in the context of the lexical chain {book, magazine, new edition, author}, it is evident that the 'publication' sense of the word 'magazine' is intended rather than the 'ammunitions' sense, and so the noun 'gun' should not be added to this chain. Unfortunately, errors such as these will result in the generation of spurious chains. For the remainder of this section, we will examine the LexNews chain generation process.

3.2.2 Candidate Term Selection: The Tokeniser

The objective of the chain formation process is to build a set of lexical chains that capture the cohesive structure of the input stream. However, before work can begin on lexical chain identification, all sentence boundaries in each sample text must be identified. We define a sentence boundary in this context as any word delimited by any of the characters in the following pattern: [ ! | . | ." | .' | ? ]+. Exceptions to this rule include abbreviated words that contain full stops, such as social titles (e.g. 'Prof.', 'Sgt.', 'Rev.' and 'Gen.'), qualifications (e.g. 'Ph.D.', 'M.A.' and 'M.D.'), first and middle names reduced to initials (e.g. 'W.B. White'), and abbreviated place names (e.g. 'Mass.' and 'U.K.'). Once all sentences have been identified in this way, the text is then tagged using the JTAG part-of-speech tagger (Xu, Broglio, Croft, 1994). All tagged nouns in the text are then identified and morphologically analysed: all plurals are transformed into their singular state, adjectives pertaining to nouns are nominalized, and all sequences of words that match grammatical structures of compound noun phrases are extracted, e.g. WordNet compounds such as 'red wine', 'act of god' or 'arms deal'. By considering such noun phrases as a single unit, we can greatly reduce the level of ambiguity in a text. This will significantly reduce lexical chaining errors caused by phrases such as 'man of the cloth' (priest) which do not reflect the meaning of their individual parts. To identify these sequences of proper noun/noun phrases our algorithm uses a series of regular expressions. Noun phrases that match these patterns are often referred to as technical terms, an idea first proposed by Justeson and Katz (Justeson, Katz, 1995). Another advantage of scanning part-of-speech tagged news stories for these technical terms is that important non-WordNet proper noun phrases, such as White House aide, PLO leader Yasir Arafat and Iraqi leader Saddam Hussein, are also discovered. In general, news story proper noun phrases will not be present in WordNet, since keeping an up-to-date repository of such words is a substantial and unending problem.
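As a rough illustration of this kind of technical term extraction, the sketch below encodes a Justeson and Katz-style noun phrase pattern over a part-of-speech tag sequence. The tag encoding (Penn Treebank-style tags) and the simplified pattern are assumptions for illustration only, not the exact rule set used by the LexNews Tokeniser.

```python
import re

# Multi-word noun phrase pattern ((A|N)+ | (A|N)*(N P)?(A|N)*) N, where
# A = adjective, N = noun/proper noun, P = preposition, x = anything else.
JK_PATTERN = re.compile(r"((?:[AN])+|(?:[AN])*(?:NP)?(?:[AN])*)N")

def encode(tag):
    if tag.startswith("NN"):      # NN, NNS, NNP, NNPS
        return "N"
    if tag.startswith("JJ"):
        return "A"
    if tag == "IN":
        return "P"
    return "x"

def technical_terms(tagged_sentence):
    """tagged_sentence: list of (word, tag) pairs for one sentence."""
    tags = "".join(encode(t) for _, t in tagged_sentence)
    terms = []
    for m in JK_PATTERN.finditer(tags):
        if m.end() - m.start() > 1:               # keep multi-word candidates only
            words = [w for w, _ in tagged_sentence[m.start():m.end()]]
            terms.append("_".join(words))
    return terms

# e.g. technical_terms([("Iraqi", "JJ"), ("leader", "NN"),
#                       ("Saddam", "NNP"), ("Hussein", "NNP")])
# -> ["Iraqi_leader_Saddam_Hussein"]
```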
However, non-WordNet proper nouns are still useful to the chaining process since they provide a further means of capturing patterns of lexical cohesion through repetition in the text. For example, consider the following news story extract in Figure 3.5.

Iraqi President Saddam Hussein has for the past two decades the dubious distinction of being the most notorious enemy of the Western world. Saddam was born in a village just outside Takrit in April 1937. In his teenage years, he immersed himself in the anti-British and anti-Western atmosphere of the day. At college in Baghdad he joined the Baath party.
Figure 3.5: Example of noun phrase repetition in a news story.

There are two distinct technical term references to Saddam Hussein in Figure 3.5: Iraqi President Saddam Hussein and Saddam. As is evident in this passage, the main problem with retaining words in their compound proper noun format is that they are less likely to have exact syntactic repetitions elsewhere in the text. Hence, we introduce into our lexical chaining algorithm a fuzzy string matcher that looks first for a full syntactic match (Saddam_Hussein → Saddam_Hussein) and then a partial syntactic match (Iraqi_President_Saddam_Hussein → Saddam). Approximate string matching spans a very large and varied area of research, which includes computational biology, IR and signal processing, to name but a few (Navarro, 2001). The most common algorithm for determining gradations in string similarity is one based on calculating the edit distance between two words, i.e. the minimum possible cost of transforming one word into the other so that they match exactly. This process may involve a number of insertions, deletions or replacements of letters in the respective strings, where the edit distance is the sum of the costs attached to each transformation. So, for example, if each change has a cost of 1 then the edit distance between tender and tenure is 3, since this transformation involves three replacements: d to u, e to r, r to e. Another popular alternative to the edit distance measure is to calculate the longest common subsequence (LCS), which measures the longest (order-dependent) sub-string common to both words. Since we are looking at string matching as a means of approximating semantic similarity between words, any matching function that we use must have strict matching constraints in order to ensure that tentative links between compound proper noun phrases are ignored. Hence, our LCS distance measure imposes the following limitations:

Let S ∈ Σ* (where Σ is a finite alphabet) be a string of length |S|. Let P ∈ Σ* be a string or pattern of length |P| ≤ |S|. Let d(S, P) be a distance metric such that P ⊂ S, where d(S, P) is the length of the longest sub-string in P that matches a sub-string in S. However, a match between P and S is only found if the following conditions hold:
1. |P| > 3 and d(S, P) > 3.
2. If P ⊄ S and |P| > 6, then find d(S, P′) such that P′ ⊂ P, where P′ is P with the last three letters of P deleted. In line with condition 1, the following inequalities must hold true: |P′| > 3 and d(S, P′) > 3.
3. Pattern P or P′ must occur at text position 0 to |P| or |P′| respectively, in string S.

Condition 1 ensures that words like Prince, Prime_minister and Prisoner do not match.
Immediately one might think that there are prefixes longer than 3 that could cause the same problem, like fore-, ante-, trans-, semi-; however, our compound noun phrases mainly consist of proper nouns which, unlike regular nouns, in general do not take prefixes such as these. Condition 3 ensures that a string like Stan in Stan_Laurel does not match within a word like Pakistan or Afghanistan, or that Tina in Tina_Turner does not match Argentina. Longer proper nouns often have slight variations due to pluralisation or the genitive case, so condition 2 ensures that the phrase Amateur_Footballer's_Organization matches the proper noun phrase Football_Association_Ireland. These heuristics for capturing variations in compound noun phrases are not foolproof; for example, in the case of a story on "citizen band radio" the matching function was unable to associate CBR with CBers (people who use CBRs) due to Condition 1. Also, the algorithm is unable to find the link between an acronym such as CBR and the phrase citizen band radio, which would need to be resolved using an entity normalisation technique commonly used in Information Extraction applications. However, this fuzzy matching technique is sufficient for our purposes. In summary, then, the Tokeniser produces tokenised text consisting of noun and proper noun phrases, including information on their location in the text, i.e. word number and sentence number. This information is then given as input to the next step in the LexNews algorithm, the Lexical Chainer. During the course of our work, both the Chainer and the Tokeniser used version 1.6 of WordNet. The fuzzy matching algorithm for proper noun phrases described in this section is used during the chain creation step, which is described in more detail in the next section.

3.2.3 The Lexical Chainer

In Section 3.1, we described St-Onge and Hirst's lexical chaining approach, which, as already stated, forms the basis of our own lexical chaining algorithm. The aim of our chainer is to find relationships between tokens (nouns, proper nouns, and compound nouns) in the data set using the WordNet thesaurus and a set of statistical word associations. Our algorithm follows all of the following chaining constraints implemented by St-Onge and discussed in Section 3.1:

Word relationship types and strengths (extra strong, strong and medium-strength).
Maximum allowable word-chain distances between the current word and a chain word (applies to strong and medium-strength relationships).
Chain salience (implemented using the chain stack data structure).
Sentence-based disambiguation (implemented using the sentence queue data structure).
Rules defining admissible paths between related words in the WordNet taxonomy.

All of these constraints help to eliminate spurious or weakly cohesive chains. An example of a spurious link between words would be associating 'gas' with 'air' (hypernymy) when 'gas' refers to 'petroleum' (synonymy). This example illustrates the necessity of seeking out semantic word relationships based on the ordering set out by St-Onge, where extra strong relationships precede strong relationships, and strong relationships are followed by a medium-strength relationship search in the taxonomy. In the 'gas-petroleum-air' case both relationships are strong; however, synonymy precedes hypernymy in the search for a strong word connection. This point leads us on to the question of where statistical word associations should fit into this searching strategy.
Let us first define what a statistical word association is in this context:

A statistical word association exists between a word and a chain word if the log-likelihood association metric indicates that the co-occurrence of these words in the TDT1 news corpus is greater than chance.

In Section 2.3, we stated that this type of word connection occurs when there is an intuitive link between words, but the nature of the association cannot be defined in terms of repetition, synonymy, antonymy, specialisation/generalisation or part/whole relationships. However, when determining these relationships using a statistical measure of association, some lexicographical relationships defined in WordNet will also be found. For example, in Figure 3.4 the co-occurrence of 'AIDS' and 'disease' is statistically significant, and this relationship is also captured in the WordNet taxonomy, where 'AIDS' is a type of (i.e. a hyponym of) 'infectious disease' and 'infectious disease' is in turn a type of 'disease'. Since these statistical word associations are not mapped to any synset numbers in WordNet, they do not provide the algorithm with an explicit means of disambiguating a related word when it is added to a chain. Hence, our algorithm puts statistical word associations last in its relationship search (i.e. after medium-strength relationships). However, statistical word associations are for the most part strong evidence of a connection between words, so we define the maximum allowable word-chain distance for this relationship as 5 sentences (the same distance constraint imposed on medium-strength relationships).

Up to this point in our description of the LexNews algorithm we have reported on the generation of one type of lexical chain consisting of WordNet nouns. However, an integral part of our algorithm is the inclusion of proper noun phrases in the chaining process (e.g. Chairman Bill Gates, Economist Alan Greenspan). Other chaining algorithms ignore these phrases, as their coverage in WordNet is either sparse or non-existent. The chaining procedure for proper noun chains is simpler than for their noun-only counterparts, since the algorithm is not concerned with either statistical or lexicographical relationships. Instead, it uses the repetition-based fuzzy matching function described in Section 3.2.2 to find associations between compound proper noun phrases in the text. As was the case for noun chaining, word associations are searched for in order of strength. The following is a list of fuzzy matches in order of strength, starting with the strongest match first:

1. Exact Full Phrase Match: Helmut_Kohl → Helmut_Kohl.
2. Partial-Phrase Exact-Word Match: Hubble_Telescope → Space_Telescope_Science_Institute.
3. Partial-Phrase Partial-Word Match: National_Caver's_Association → Irish_Cave_Rescue_Organisation.

As previously stated, during noun-only chaining, allowable chain-word distances are longer for stronger relationships than for weaker associations. A distance parameter is also enforced during proper noun chaining; however, in this case all types of phrase match are classified as repetition relationships, and so are assigned an unrestricted word-chain distance (confined only by the length of the document). Optimal values for these distance parameters are discussed in Section 3.3. The result of this part of the chaining process is a distinct set of proper noun lexical chains that, coupled with the noun-based chains, form a representation of the lexical cohesive structure of a news story document.
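To illustrate how these three match levels might be ordered in code, the sketch below checks a candidate phrase against a chain phrase and returns the strongest applicable level. The helper names and the prefix-based approximation of the constrained LCS test from Section 3.2.2 are illustrative assumptions, not the exact LexNews implementation.

```python
def words(phrase):
    return phrase.lower().split("_")

def partial_word_match(s, p):
    """Rough stand-in for the constrained LCS test of Section 3.2.2: the
    shorter word must be longer than 3 characters and share an initial
    sub-string of more than 3 characters with the longer word."""
    shorter, longer = sorted((s, p), key=len)
    if len(shorter) <= 3:
        return False
    prefix = 0
    for a, b in zip(shorter, longer):
        if a != b:
            break
        prefix += 1
    return prefix > 3

def match_strength(phrase, chain_phrase):
    """Return 1 (strongest) to 3, or None if the phrases do not match."""
    if phrase.lower() == chain_phrase.lower():
        return 1                                    # exact full phrase match
    w1, w2 = set(words(phrase)), set(words(chain_phrase))
    if w1 & w2:
        return 2                                    # partial-phrase, exact-word match
    if any(partial_word_match(a, b) for a in w1 for b in w2):
        return 3                                    # partial-phrase, partial-word match
    return None

# e.g. match_strength("Iraqi_President_Saddam_Hussein", "Saddam") -> 2
```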
Appendix A contains a more formal description of the chaining part of our LexNews algorithm. Appendix A also includes a stopword list of 'problematic' WordNet nouns that are eliminated from the chaining process, since they are often the root cause of spurious chains, i.e. chains containing incorrectly disambiguated or weakly cohesive chain members. Stopword lists are also widely used in other lexical chaining implementations (St-Onge, 1995; Stairmand, 1997; Green, 1997b). In our case, most of these offending nouns were found by automatically looking for concepts that subordinate a higher than average number of nouns in the taxonomy, e.g. the concept 'entity'. Hence, for the most part these nouns or concepts tend to lie in the top level of the taxonomy and were considered too general an indication of semantic similarity between nouns. A number of manually identified nouns were also added to this list. The remainder of this chapter will focus on performance and parameter estimation issues arising from the generation of the chains, as well as a concrete example of how chains capture cohesion in a text.

3.3 Parameter Estimation based on Disambiguation Accuracy

In Section 2.5.1, the importance of choosing correct chaining parameters in order to reduce the number of spurious or incorrect chains being generated was mentioned. No formal method of parameter estimation has yet been developed for lexical chaining analysis. Parameter estimation is usually the process of maximising the performance of a system on an initial training collection, and then applying these parameters to a test collection from which the system's performance is determined. The hypothesis is that parameters that worked well on the training set should work well on the test collection (assuming they are drawn from the same data sample). The key to finding optimal parameters is finding optimal performance. However, lexical chain performance or quality cannot be evaluated directly. Instead, lexical chains can only be evaluated with respect to a task-oriented evaluation strategy. However, if the lexical chains perform poorly on a specific task, for example as an indexing strategy for an IR engine, then this may be due to the unsuitability of the application rather than a reflection on the quality of the chains. Hence, a task-oriented evaluation of lexical chains must be based on the performance of a fundamental operation involved in the lexical chaining algorithm. As already stated, a side-effect of lexical chain creation is noun disambiguation. Consequently, by measuring the disambiguation accuracy of our lexical chainer we can indirectly establish the chaining performance of our algorithm, since a disambiguation error implies that a word has been incorrectly added to a chain. We use the Semantic Concordance corpus (SemCor version 1.6) to evaluate lexical chain disambiguation accuracy. SemCor is a collection of documents on a variety of topics taken from the Brown Corpus that have been manually annotated with synset numbers from WordNet (Miller et al., 1993). Using SemCor and the IR metrics recall, precision and F1, we can indirectly measure the quality of a set of lexical chains generated from the original text of the SemCor corpus. In this context these IR metrics are defined as follows:

Recall is the number of correctly disambiguated nouns returned by the disambiguator divided by the total number of nouns in our SemCor test set.
Precision is the number of correctly disambiguated nouns returned by the disambiguator divided by the total number of SemCor nouns disambiguated by the system. In general, there will be instances where the disambiguator will not be able to decide on the correct sense of a noun, thus producing different denominators in the recall and precision formulae.
F1 is the harmonic mean of the recall and precision values for a system, which represents a single overall measure of system performance, or in this case disambiguation effectiveness.

Initially the main purpose of this disambiguation-based evaluation of lexical chaining quality was to establish whether the original chaining parameters suggested by St-Onge (1995) were optimal. In his thesis St-Onge suggested, based on a manual observation of his results, allowable word-chain distances for strong and medium-strength relationships and a set of rules for following paths in WordNet, so as to minimise the occurrence of spurious chains. At the outset of our research we verified these parameters by observing their effect on disambiguation accuracy. More recently we ran further experiments on a subset of documents taken from the SemCor corpus. This second set of experiments was driven by Galley and McKeown's (2003) recent publication. In this paper, they investigate the effectiveness of their non-greedy chaining algorithm with respect to two other non-greedy approaches by Barzilay (1997) and Silber and McCoy (2003) (for descriptions of these algorithms see Section 2.5.5). Galley and McKeown ran these experiments on a (74 document) subset of the SemCor corpus because Barzilay's algorithm, due to its high run-time and space requirements, could only run on shorter length documents10. Galley and McKeown define a single disambiguation accuracy measure:

Accuracy is calculated as the number of correctly disambiguated nouns divided by the total number of nouns in the SemCor corpus.

This definition is identical to the definition of system recall. This implies that precision values are also needed in order to gain a true picture of system effectiveness. However, in Galley and McKeown's experiment, recall and precision values are equal since each system is required to return a default sense for each word that it fails to disambiguate. Consequently, the denominators in both metrics are equivalent. This point should be clearer following a more formal definition of these metrics:

Precision = c / (c + i)
Recall = c / (c + i + d)

where c is the number of nouns disambiguated correctly by the system, i is the number of nouns disambiguated incorrectly by the system, and d is the number of nouns not disambiguated by the system. So, in the case of the systems involved in Galley and McKeown's experiments, all systems disambiguated all nouns in the collection, i.e. d = 0. Hence, recall is equivalent to precision. Table 3.2 shows the results of this experiment published in (Galley, McKeown, 2003) as well as the results of the same experiment using our lexical chaining algorithm.

Algorithm                 Accuracy
Barzilay and Elhadad      56.56%
Silber and McCoy          54.48%
Galley and McKeown        62.09%
LexNewsbasic              60.39%

Table 3.2: Disambiguation Accuracy results taken from Galley and McKeown (2003).

Using this evaluation strategy Galley and McKeown's system outperforms the two non-greedy algorithms and our semi-greedy approach.

10 Established during personal communications with the authors.
However, there is a flaw in this evaluation methodology due to the default assignment of sense 1 to all non-disambiguated nouns returned by each system. In WordNet all senses of a particular word are ordered with respect to their frequency of use in the English language. In order to generate these frequency counts the WordNet designers created the SemCor corpus, and counted all the occurrences of specific senses in an attempt to estimate their frequency of use in the English language. Consequently, if an algorithm were to choose the first sense of all nouns in the SemCor corpus then it would achieve an accuracy, recall, precision and F1 value of 76.32%. Hence, any system in Galley and McKeown's experiment that returns a relatively low number of disambiguated terms will significantly outperform a system that returns a higher number of disambiguated terms, since all non-disambiguated nouns will have a 76.32% chance of being correct. This gives low recall systems an advantage in this experimental set-up. In order to address these problems we ran additional experiments, with the help of the authors, using an evaluation methodology that required that all systems did not use a default sense assignment when they failed to make a decision on the sense of a word. The results of this experiment are shown in Table 3.3, where coverage is defined as the percentage of nouns that were disambiguated (either correctly or incorrectly) by the disambiguator. As expected, Galley and McKeown's accuracy measure is biased towards systems with lower recall values. Using F1 values to evaluate chain disambiguation accuracy, we see that our algorithm marginally outperforms Galley and McKeown's technique.

Algorithm            Accuracy   Precision   Recall   F1       Coverage
LexNewsbasic         60.39%     59.45%      56.92%   58.20%   94.77%
Galley and McKeown   62.02%     59.64%      56.00%   57.65%   93.89%

Table 3.3: Results of Galley and McKeown's evaluation strategy using the 'accuracy' metric and default sense assignments, compared with the recall, precision and F1 values when non-disambiguated nouns are not assigned default senses.

Algorithm                                                       Accuracy   Precision   Recall   F1       Coverage
LexNewsbasic A                                                  60.39%     59.45%      56.92%   58.20%   94.77%
LexNewsbasic B (Synonymy Relations only)                        71.20%     71.61%      30.12%   42.41%   42.06%
LexNewsbasic C (Strong Relations only)                          66.70%     64.13%      46.41%   53.85%   72.37%
LexNewsbasic D (All relations and St-Onge Path Restrictions)    60.54%     59.55%      56.44%   57.95%   94.77%

Table 3.4: Comparing the effect of different parameters on the disambiguation performance of the LexNews algorithm.

The results in Table 3.4 also verify the bias toward low recall systems inherent in Galley and McKeown's accuracy measure. However, the purpose of the results in Table 3.4 is to corroborate a number of parameter settings and design issues in the LexNewsbasic algorithm, i.e. St-Onge and Hirst's original chaining procedure without the use of fuzzy proper noun matching and statistical word associations.

1. Strong and medium relations: LexNewsbasic A, as described in Section 3.1, searches for extra-strong, strong and medium-strength relationships during chain generation. In order to verify the use of these different relationships we ran LexNewsbasic B and C, where B only looks at synonym relationships and C looks at all strong relationships. In both cases we observe a decrease in coverage, where precision dramatically increases at the expense of recall.
A comparison of the F1 measures for LexNewsbasic A with the B and C versions verifies the use of medium-strength as well as strong relationships in the chaining process.

2. St-Onge path restrictions: In Section 3.1 we described a number of path restrictions that St-Onge suggested should be imposed on WordNet paths in the taxonomy that exceeded length 1, in order to avoid finding spurious links between nouns. From the results of LexNewsbasic D in Table 3.4, we see that restricting paths using St-Onge's rules marginally improves precision performance. However, a comparison of F1 measures suggests that allowing all paths (LexNewsbasic A) slightly improves recall with a slight decrement in precision. So it seems St-Onge's rules have little or no effect on chaining accuracy. The most likely reason for this outcome is that these rules were written to take care of spurious relationships with path lengths greater than the 4-edge limit that we use in our implementation. We hypothesise that these rules are more critical for chaining algorithms that use a broader search scope than this when searching for related nouns in WordNet.

3. Allowable distances between related terms: St-Onge and Hirst originally suggested an allowable distance of 7 sentences (roughly a 130 word distance) for strong chain word relationships and a distance of 3 sentences (roughly a 60 word distance) for medium-strength relationships. Our experiments confirm these allowable distance values; in fact, even small variations in these distances resulted in a slight deterioration. For example, a distance of 50 words for strong associations and 120 words for medium-strength relationships resulted in an F1 of 58.11, while a distance of 60 words and 140 words respectively produced an F1 value of 58.08. Although a wide range of other parameter values were also experimented with, no improvement could be gained over the original 60-word and 130-word estimates proposed by St-Onge. This result is quite surprising since, as already stated, St-Onge and Hirst chose these parameter values based on a manual observation of their results.

Figure 3.6 examines the results of the LexNewsbasic A system in more detail. This graph plots four lines representing:

The percentage of polysemous nouns in SemCor with n senses that were not disambiguated by the LexNewsbasic system, i.e. % Not Disambiguated.
The percentage of polysemous nouns in SemCor with n senses that were incorrectly disambiguated by the LexNewsbasic system, i.e. % Incorrectly Disambiguated.
The percentage of polysemous nouns in SemCor with n senses that either failed to be disambiguated by the LexNewsbasic system, or were incorrectly disambiguated. More specifically, the sum of the previous two errors, i.e. % of Noun Errors.
The percentage of polysemous nouns with n senses in the SemCor corpus, i.e. % of Nouns in Collection.

[Figure omitted: line graph plotting % Not Disambiguated, % Incorrectly Disambiguated, % of Noun Errors, and % of Nouns in Collection (y-axis: % of nouns, 0–18) against the number of senses per noun (x-axis: 2–10).]
Figure 3.6: Graph showing the relationship between disambiguation error and the number of senses or different contexts in which a noun may be used.

So, for example, taking all nouns with two senses in the SemCor corpus, we find that these nouns represent 16.08% of all nouns in the collection and 9% of all errors made by LexNewsbasic, where 6.67% of these errors were due to incorrectly disambiguated nouns, and 2.33% to failing to disambiguate these nouns.
Due to the system's large coverage (94.77% of nouns are disambiguated), errors attributed to the failure of the system to disambiguate a noun are low, and errors caused by incorrect disambiguation by the system are high. We can see that the system has most trouble with words that have 5 or more senses, and that nouns with fewer senses are more easily handled by the system. An interesting additional experiment would be to compare these disambiguation-error trends with a non-greedy approach to chaining, in order to determine if the same categories of polysemous nouns are responsible for the majority of disambiguation errors produced by the algorithm.

As a result of the findings described in this section of the thesis, all further experiments involving our lexical chain algorithm, LexNews, will use St-Onge's suggested parameters. In spite of the slight degradation in performance when St-Onge path restrictions were incorporated into the search for medium-strength relationships, we still included these rules in all further applications of the chains so as to ensure that the resultant chains were as accurate as possible. This decision was motivated by the observation that some spurious chains and incorrect chain additions could be avoided when these rules were applied.

3.4 Statistics on LexNews Chains

In this section, we examine a number of characteristics of lexical chains, including the types of relationship that dominate chain creation, and the size and span of chains in a document. As explained throughout this chapter, our chaining algorithm uses three categories of word relationship strengths when clustering related noun word senses in a text. Figure 3.7 shows the percentage of each relationship category that participates in the chaining process, where extra strong relationships (repetition) account for 55.83% of relationships, strong relationships (synonymy and other WordNet associations) account for 13.03%, and medium-strength relationships (path lengths greater than 1 in WordNet) account for 31.14%. Figure 3.8 presents a breakdown of the results in Figure 3.7, where we see that hyponymy and hypernymy (specialisation/generalisation) are the most dominant strong relationships, followed by synonymy, and last of all by meronymy and holonymy (part/whole). The sparsest relationship type in the WordNet taxonomy is antonymy, accounting for a mere 0.01% of word relationships.

[Figure omitted: bar chart of the percentage of chain word relationships in each of St-Onge's relationship categories (Extra Strong, Strong, Medium); y-axis: % of chain word relationships, 0–60.]
Figure 3.7: Graph showing the dominance of extra strong and medium-strength relationships during lexical chain generation.

[Figure omitted: bar chart of the percentage of chain word relationships broken down by relationship type (REP, SYN, HYPO, HYPR, MER, HOL, PATH > 1); y-axis: % of chain word relationships, 0–60.]
Figure 3.8: Graph showing a breakdown of all relationship occurrences in the chaining process. REP = repetition or extra strong relationships, SYN = synonymy, HYPO = hyponymy, HYPR = hypernymy, MER = meronymy, HOL = holonymy, and PATH > 1 = all medium-strength relationships.

Table 3.5 presents the average number of nouns per chain, chains per document and the length/span of a chain in a document in the SemCor collection, where the average document length is 2022.12 words (standard deviation 11.31). On average 442.4 of these words are nouns (standard deviation 50.2). However, in all cases for each of these statistics the standard deviation is high, due to the fact that a document will contain a number of long chains followed by a series of short, often unimportant chains.
This implies that a strong chain spans a large section of a topic, thus capturing a central theme in the discourse. However, the strength of the relationships between the words in the chain is also an important factor. In the next section we look in more detail at this idea of chain strength or importance and how it can be measured. The next section also examines the type of chains that are generated by the LexNews algorithm (this time also incorporating non-WordNet proper nouns and statistical word associations), and how these chains reflect themes in a document.

Chain Statistic                              Average    Standard Deviation
Nouns per Chain (including repetitions)      11.54      20.17
Chains per Document                          29.80      8.49
Word Span of Chains in Text                  187.01     162.08
Sentence Span of Chains in Text              41.16      36.76

Table 3.5: Chain statistics for chains generated on the subset of the SemCor collection referred to in Section 3.4.

3.5 News Topic Identification and Lexical Chaining

Figure 3.9 shows a broadcast news document discussing the premiere of the film 'Veronica Guerin'. Appendix B shows the part-of-speech output for this text and the candidate terms selected by the Tokeniser discussed in Section 3.2.2. As mentioned in Section 3.2.3, our chaining algorithm, LexNews, produces two distinct sets of lexical chains for a news text: WordNet noun phrase chains and non-WordNet proper noun phrase chains, which are shown in Figures 3.10 and 3.11 respectively. In both these figures, chains are ordered with respect to their last ordering in the chain stack. Chain numbers represent the order in which the chains were created by the LexNews algorithm, where all chains containing only one candidate term (with a frequency of one) are filtered out at the end of the chaining process. Chain words in each chain are displayed according to the order of their addition to the chain, where the word tagged as 'SEED' is the first word to be added to the chain. Chain spans are represented in terms of words and sentences, where a chain span represents the portion of text covered by a lexical chain, i.e. in the case of the sentence span, this represents the maximum and minimum sentence number of the chain's members. Also, each chain word has a frequency and weight assigned to it. This weight represents the strength of the relationship between the term and the most strongly related chain member that was added to the chain based on this relationship. These relationship weights, shown in Table 3.6, were chosen after a manual analysis of different weighting schemes by the author on news documents.

Relationship Type                                          Weight
Repetition                                                 1.0
Synonymy                                                   0.9
Antonymy, Hyponymy, Meronymy, Holonymy, and Hypernymy      0.7
Path lengths greater than 1 in WordNet                     0.4
Statistical Word Associations                              0.4

Table 3.6: Weights assigned to lexical cohesive relationships between chain terms.

Consider, for example, the word 'murder' in Chain 2 in Figure 3.10,

[murder (investigation) Freq 2 WGT 0.7 STRONG]

The seed of this chain is 'investigation', and 'murder' was added to this chain based on a statistical relationship with the noun 'investigation' (unfortunately, the compound noun 'murder_investigation' is not listed in the WordNet noun database). However, the word 'murder' is assigned a score of 0.7 (not 0.4) since it is responsible for the addition of the word 'killing' to the chain, where a strong relationship (or a hyponym relationship, to be exact) exists between 'killing' and 'murder' in the taxonomy.
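Table 3.6 can be read as a simple lookup from relationship type to weight. The sketch below shows one plausible way such weights could be attached to chain members and aggregated into a rough chain score; the aggregation function is an illustrative assumption only, and the weighting strategy actually used for gisting is the one described in Section 8.2.

```python
# Relationship weights from Table 3.6.
RELATION_WEIGHTS = {
    "repetition": 1.0,
    "synonymy": 0.9,
    "antonymy": 0.7, "hyponymy": 0.7, "meronymy": 0.7,
    "holonymy": 0.7, "hypernymy": 0.7,
    "medium": 0.4,        # path lengths greater than 1 in WordNet
    "statistical": 0.4,   # statistical word associations
}

def chain_score(chain):
    """chain: list of (word, frequency, relation) tuples.
    An illustrative chain score: a frequency-weighted sum of the relationship
    weights of the chain's members (not the formula of Section 8.2)."""
    return sum(freq * RELATION_WEIGHTS[rel] for _, freq, rel in chain)

# e.g. chain_score([("investigation", 1, "statistical"),
#                   ("murder", 2, "hyponymy"),
#                   ("killing", 1, "hyponymy")])
```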
Chain characteristics such as chain span, chain density and chain word relationship strengths are all useful indicators of the importance of a chain to a text, where high scoring chains represent dominant themes in the discourse. In Section 8.2, we use these chain scores as a means of assigning lexical cohesion-based weights to terms in a broadcast news story in order to generate an extractive gist (a short summary no 70 greater than a sentence) that captures the essence of that story. A more detailed description of this linguistically-motivated weighting strategy is left until Section 8.2, while in the remainder of this section we will concentrate on analysing the words that make up the chains. Looking first at the WordNet noun phrase chains in Figure 3.10, the four most important chains in this list cover the following themes: ‘the film’ (chain 4), ‘the location of the movie premiere’ (chain 3), and ‘the film topic: the murder of Veronica Guerin’ (chain 2, chain 1). Chains 10, 14, 7, and 9 all represent lower ranked chains, while chains 7 and 9 should be amalgamated with chain 4 as they are related to ‘the film’ theme, and chain 10 should be merged with chain 3 of the proper noun chain set because ‘Veronica Guerin’ is a specialisation of a ‘journalist’. However, neither of our auxiliary knowledge sources can relate the members of these two chains. Finally, consider chain 14: CHAIN 14; No. Words 3; Word Span 151-243; Sent Span 11-17; [energy (SEED) Freq 1 WGT 0.4 MEDIUM] [nature (energy) Freq 1 WGT 0.4 MEDIUM] [pressure (energy) Freq 1 WGT 0.4 MEDIUM], This is a perfect example of a weak spurious chain. Although the noun ‘energy’ is being used in the ‘vigour’ sense, the noun ‘pressure’ in the ‘personal stress’ sense, and the noun ‘nature’ in the ‘type of something’ sense, the algorithm has incorrectly linked these nouns through their alternative ‘scientific’ senses. Unfortunately, these types of chain are common occurrences in lexical chaining output, due to the ambiguous nature of these nouns in the text. However, the values of chain characteristics, such as the relatively low span of this chain in the text and the weak strength of the relationships between its chain members, are good indications that this chain represents a minor theme in the general discourse of the news story. Looking next at the chains in Figure 3.11 we see that our fuzzy matching component has identified the relationship between the phrases ‘Veronica’ and ‘Veronica_Guerin’, and between ‘Cate_Blanchett’ and ‘Blanchett’. Capturing the relationships between proper noun phrases such as these is an important part of our chaining algorithm as it helps to reduce the generation of spurious chains by removing ambiguity from the text. In particular, we have noticed that a large number of first names and surnames are documented in WordNet as noun phrases that bear no resemblance to their intended use in the news story. For example, the name ‘Veronica’ is indexed as a type of plant in the WordNet taxonomy, and the 71 noun ‘Savoy’ in the phrase ‘Savoy_cinema’ is described as a type of cabbage. Consequently, under the St-Onge chaining regime or even a non-greedy strategy, ‘savoy’ and ‘veronica’ would be deemed related and added to the same chain as they are both types of plant. This is a prime example of the need for the tokenisation component of the LexNews algorithm, which identifies compound proper noun and noun phrases such as these. 
Thus, helping to reduce the ambiguity of the candidate terms chosen for the lexical chaining step. As Gardai launch an investigation into gangland murders in Dublin and Limerick a film opened in Dublin tonight which recalls the killing of another victim of organised crime in 1996. The world premiere of the Veronica Guerin movie took place in Dublin's Savoy Cinema, with Cate Blanchett in the title role. The film charts the events leading up to the murder of the Irish journalist. Crowds gathered outside the Savoy Cinema as some of Ireland's biggest names gathered for the premiere of Veronica Guerin, the movie. It recounts the journalists attempts to exposed Dublin drug gangs. But for many the premiere was mixed with sadness. “It' s odd. It can' t be celebratory because of the subject matter.” Actress Cate Blanchett takes on the title role in the movie. It was a part she says she felt honoured to play. “I got this complete picture of this person full of life and energy. And so that' s when it became clear the true nature of the tragedy of the loss of this extraordinary human being, and great journalist.” Apart from Blanchett every other part is played by Irish actors. Her murderer was later jailed for 28 years for drug trafficking. The film-makers say it' s a story of personal courage, but for the director, there was only one person' s approval that mattered. “A couple of months ago I brought the film to show to her mother. It was the most pressure I' ve ever felt.” But he needn' t have worried. “I see it as a tribute to Veronica, a worldwide tribute.” Figure 3.9: Sample broadcast news story on the Veronica Guerin movie. 72 WordNet Noun Phrase Chains CHAIN 4; No. Words 14; Word Span 11-264; Sent Span 1-19; [film (SEED) Freq 3 WGT 0.9 STRONG] [movie (film) Freq 3 WGT 0.9 STRONG] [premiere (film) Freq 3 WGT 0.4 MEDIUM] [subject_matter (film) Freq 1 WGT 0.7 STRONG] [actress (movie) Freq 1 WGT 0.7 STRONG] [picture (film) Freq 1 WGT 0.9 STRONG] [actor (actress) Freq 1 WGT 0.7 STRONG] [film_maker (film) Freq 1 WGT 0.4 MEDIUM] [approval (subject_matter) Freq 1 WGT 0.7 STRONG] [story (subject_matter) Freq 1 WGT 0.4 MEDIUM] [director (actor) Freq 1 WGT 0.4 STATISTICAL] [tribute (approval) Freq 2 WGT 0.7 STRONG] CHAIN 14; No. Words 3; Word Span 151-243; Sent Span 11-17; [energy (SEED) Freq 1 WGT 0.4 MEDIUM] [nature (energy) Freq 1 WGT 0.4 MEDIUM] [pressure (energy) Freq 1 WGT 0.4 MEDIUM] CHAIN 1; No. Words 6; Word Span 4-198; Sent Span 1-14; [gangland (SEED) Freq 1 WGT 0.4 MEDIUM] [world (gangland) Freq 1 WGT 0.4 MEDIUM] [crowd (gangland) Freq 1 WGT 0.4 MEDIUM] [gang (crowd) Freq 1 WGT 0.9 STRONG] [drug (gang) Freq 2 WGT 0.4 STATISTICAL] CHAIN 3; No. Words 7; Word Span 7-187; Sent Span 1-13; [Dublin (SEED) Freq 4 WGT 0.7 STRONG] [Ireland (Dublin) Freq 3 WGT 0.7 STRONG] CHAIN 2; No. Words 9; Word Span 3-168; Sent Span 1-12; [investigation (SEED) Freq 1 WGT 0.4 STATISTICAL] [murder (investigation) Freq 2 WGT 0.7 STRONG] [killing (murder) Freq 1 WGT 0.7 STRONG] [victim (killing) Freq 1 WGT 0.4 STATISTICAL] [crime (victim) Freq 1 WGT 0.4 STATISTICAL] [life (murder) Freq 1 WGT 0.4 MEDIUM] [loss (life) Freq 1 WGT 0.4 STATISTICAL] [murderer (victim) Freq 1 WGT 0.4 MEDIUM] CHAIN 10; No. Words 3; Word Span 63-177; Sent Span 3-12; [journalist (SEED) Freq 3 WGT 0] CHAIN 7; No. Words 2; Word Span 48-123; Sent Span 2-9; [title_role (SEED) Freq 2 WGT 0] CHAIN 9; No. 
Words 2; Word Span 54-112; Sent Span 3-8; [event (SEED) Freq 1 WGT 0.4 MEDIUM] [celebration (event) Freq 1 WGT 0.4 MEDIUM] Figure 3.10: WordNet noun phrase chains for sample news story in Figure 3.9. 73 Non-WordNet Proper Noun Phrase Chains CHAIN 3; No. Words 3; Word Span 33-260; Sent Span 2-19; [Veronica_Guerin (SEED) Freq 2 WGT 0.8] [Veronica (Veronica_Guerin) Freq 1 WGT 0.8] CHAIN 5; No. Words 3; Word Span 44-180; Sent Span 2-13; [Cate_Blanchett (SEED) Freq 2 WGT 0.8] [Blanchett (Cate_Blanchett) Freq 1 WGT 0.8] CHAIN 4; No. Words 2; Word Span 40-68; Sent Span 2-4; [Dublin’s_Savoy_cinema (SEED) Freq 1 WGT 0.8] [Savoy_cinema (Dublin’s_Savoy_cinema) Freq 1 WGT 0.8] Figure 3.11: Non-WordNet proper noun phrase chains for sample news story in Figure 3.9. 3.6 Discussion This chapter has primarily focused on our news-oriented lexical chaining algorithm, LexNews. LexNews addresses a number of inadequacies in previous chaining approaches that generate chains for documents in the news story domain. More specifically, our algorithm incorporates domain knowledge and named entities such as ‘people’ and ‘organisations’ into the chaining process in order to capture important lexical cohesive relationships in news stories that have been ignored by previous chaining techniques. LexNews consists of two principal components: a tokeniser that selects candidate words for chaining, and a lexical chainer that creates two distinct sets of lexical chains, i.e. non-WordNet proper noun chains and WordNet noun chains. This chaining algorithm is based on St-Onge and Hirst’s greedy WordNet-based approach. In our review of chaining approaches in Chapter 2, we differentiated between greedy and non-greedy chaining techniques where greedy methods have been largely dismissed based on the assumption that delaying noun disambiguation until all possible lexical chains have been generated greatly improves chaining accuracy (Barzilay, 1997). However, our experiments in Section 3.3 show that noun disambiguation accuracy for both greedy and non-greedy approaches remains stable at an F1 value of around 58%. This is an interesting outcome as it shows that StOnge and Hirst’s semi-greedy chaining approach, which disambiguates words in their local context (i.e. the sentence in which they occur), works as well as an algorithm that considers the entire document before assigning a sense to a noun. 74 This counter-intuitive result may be explained by Voorhees (1998) observation that although WordNet sense definitions are satisfactory as reference units for disambiguation, the lexicographical relationships defined between these senses are not sufficient to achieve full and accurate disambiguation. Voorhees suggests that syntagmatic information (such as corpus statistics) is needed, in addition to lexicographical relationships, in order to achieve improvements in disambiguation accuracy. This statement has somewhat been confirmed by the results of the SENSEVAL-2 workshop (SENSEVAL-2, 2001), which found that in general supervised disambiguation algorithms where the most accurate WordNet-based disambiguators. The SemCor experiment, described in Section 3.3, also provides us with a means of parameter tuning our chaining algorithm. In particular, we looked at the effect of various word relationship types and distance constraints between related words, on noun disambiguation accuracy. 
A closer analysis of these chains showed that repetition relationships between words were responsible for 55.83% of chain word additions, followed by medium-strength relationships (i.e. path lengths greater than 1 in WordNet) with 31.14%, and strong relationships, which accounted for only 13.03% of all relationships. In the final part of this chapter, we examined a single news story in order to illustrate how lexical chains can be used as an intermediary natural language representation of news story topics. It is this representation of a document's content, in terms of its lexical cohesive structure, that is used in the TDT applications of our lexical chaining technique described in the remainder of this thesis. More specifically, Chapters 4 and 5 look at our attempts to improve New Event Detection performance using lexical cohesion analysis, while Chapters 6 and 7 examine the use of lexical chains as a means of segmenting a broadcast news stream into its constituent news stories. The thesis ends with a discussion of some preliminary results relating to on-going work on News Story Gisting using LexNews chains.

Chapter 4 TDT New Event Detection

This is the first of two chapters on New Event Detection (NED): a task that deals with the automatic detection of breaking news stories as they arrive on an incoming broadcast news and newswire data stream. The aim of this chapter is to provide some background on this task and on the techniques that are commonly used to accomplish it. In Section 4.1, we discuss the most commonly used Information Retrieval model, the vector space model. The popularity of this model is due not only to its effectiveness as an IR model, but also to the simplicity of its implementation. This retrieval model is the underlying technique used in the implementation of our approach and most other approaches to the NED task. NED is one of five tasks defined by the Topic Detection and Tracking (TDT) initiative. In Section 4.2, we explore the notion of a topic as defined by the TDT community, which forms the basis of all TDT task definitions. This discussion is followed by an overview of previous NED approaches in Section 4.3. In contrast to these approaches, our NED technique augments the basic vector space model with some linguistic knowledge derived from a set of lexical chains capturing the cohesive structure of each news story. A detailed description of our approach can be found in Chapter 5, where its performance with respect to the TDT1 and TDT2 evaluation methodologies is also discussed.

4.1 Information Retrieval

This section contains a brief overview of some Information Retrieval terminology that is essential for an understanding of our New Event Detection system, its evaluation described in Chapter 5, and the various approaches to New Event Detection examined in the remainder of this chapter. For a comprehensive introduction to general topics in Information Retrieval, we refer the reader to the following core textbooks on the subject (Van Rijsbergen, 1979; Salton and McGill, 1983; Baeza-Yates, Ribeiro-Neto, 1999). Information Retrieval (IR) research deals with the representation, storage, accessibility and organisation of information items (Baeza-Yates, Ribeiro-Neto, 1999). Hence, an IR system is responsible for processing and responding to a user query by presenting the user with a ranked list of the most relevant documents in a large collection (e.g. the World Wide Web) that relate to that query.
Most contemporary IR systems represent documents as a set of index terms or keywords, where the relevance of a document to a query is calculated using a matching function that determines how frequently the query terms occur in the set of index terms representing the document. Hence, if a set of query terms occurs frequently in a document, a high relevance score will be assigned to that document. There are a variety of techniques or IR models that are based on this intuitive representation of document content, some of which include:

The Boolean Model: This model specifies a document only in terms of the words it contains and disregards how frequently they occur. Hence, the weight of a query term in a document is 1 if it is present, and 0 if it is absent. Consequently, a document is either relevant or non-relevant and no ranking of documents is possible. A Boolean query is expressed in terms of the Boolean operators: and, or and not. Although the Boolean model is attractive due to its neat formalism, it is not commonly used by IR systems due to its ineffectiveness at ranking documents and the difficulty that users have with formulating complex queries.

The Vector Space Model: This model is one of the most popular approaches used by researchers in the IR community. Unlike the Boolean model, the vector space model employs a term weighting scheme which facilitates document ranking. In this model, documents and queries are represented as vectors in n-dimensional space, where the intuition is that documents and queries that are similar will lie closer together in the vector space than dissimilar documents. This model is discussed in more detail in the next section.

The Probabilistic Model: This model, like the vector space model, is capable of ranking documents with respect to their relevance to a query. More specifically, the probabilistic model ranks documents by their probability of relevance given the query. Also, index term weights are all binary variables as in the Boolean model. The similarity of the document to the query is defined using the following odds ratio:

sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}    (4.1)

where P(R \mid d_j) is the probability that the document belongs to the set R of documents relevant to the query, and P(\bar{R} \mid d_j) is the probability that the document is a member of the non-relevant set \bar{R}. Assuming independence of the terms in a document (a strong assumption, since the occurrence of a word is in some way related to the occurrence of other words in the text), and using Bayes' rule, documents can be ranked in terms of the following equation:

sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)    (4.2)

where P(k_i \mid R) is the probability that the index term k_i is present in a document randomly selected from the set of relevant documents R, P(k_i \mid \bar{R}) is the probability that the index term k_i is present in a document randomly selected from the non-relevant document set \bar{R}, and w_{i,q} and w_{i,j} are the weights of the index term k_i in the query and the document respectively. The problem then becomes how to estimate P(k_i \mid R) and P(k_i \mid \bar{R}). To begin with, assumptions are made regarding the values of these probabilities, i.e. P(k_i \mid R) is constant for all index terms k_i (usually 0.5), and P(k_i \mid \bar{R}) is approximated using the distribution of index terms among all the documents in the collection.
After the initial ranking, subsequent rankings are made, and the values of these probabilities are refined as more and more information is known about the distribution of terms in the relevant and non-relevant portions of the collection.

The Language Model: A language model can be defined as a probabilistic model for generating natural language text. A language modelling approach to query-based retrieval assigns a score to the query for each document representing the probability that the query was generated from that document, in contrast to the probabilistic model, which estimates the probability of relevance of the document to the query (Ponte, Croft, 1998). Ponte and Croft acknowledge that these two measures are correlated but distinct. Equation 4.3 represents a more formal definition of a language model, in this case a unigram language model:

P(q_1, q_2, \ldots, q_n \mid d) = \prod_{i=1}^{n} P(q_i \mid d)    (4.3)

The most natural method of estimating P(q_i \mid d), the probability of observing query word q_i in document d, is to use the maximum likelihood of observing q as a sample of d:

P(q_i \mid d) = \frac{freq(q_i, d)}{length(d)}    (4.4)

While this estimate may be unbiased, it suffers from one fatal problem: if the document contains no instances of one particular query word then P(q_i \mid d) = 0, which implies that P(q_1, q_2, \ldots, q_n \mid d) is also zero. However, we cannot assume that because q fails to appear in this document it could never occur in another document on the same topic. Hence, we come to a core area of research in the language modelling community called discounting methods, which address this zero frequency or data sparseness problem. All these discounting methods work by decreasing the probability of previously seen events, so that there is a little probability mass left over for previously unseen events, in this case query terms, while still preserving the requirement that the total sum of the probability masses is 1. This process of discounting is often referred to as smoothing, since a probability distribution with no zeros is smoother than one with zeros.

11 An n-gram model attempts to model sequences of words in a text, i.e. which words tend to follow other words. If n is 1, as in the case of the unigram model, then probabilities are calculated based on single words and no prior word information. In contrast, the bigram model uses the previous word in the text to predict the next word, the trigram model uses the previous two, and so on.

Research has shown that retrieval effectiveness is sensitive to smoothing parameters, and that unleashing the true potential of language modelling depends greatly on the understanding and selection of these parameters, thus providing the motivation for more research in this area (Chen and Goodman, 1996).
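As an illustration of how such a smoothed query-likelihood score might be computed, the sketch below mixes the document's maximum likelihood estimate (Equation 4.4) with a collection-wide estimate using linear interpolation (Jelinek-Mercer smoothing), one common discounting strategy. The sketch and its parameter names are illustrative; it is not the formulation used by any particular system discussed in this thesis.

# Illustrative query-likelihood scorer (Equations 4.3 and 4.4) with
# Jelinek-Mercer smoothing, one common way of handling the zero-frequency
# problem discussed above. Names and the choice of smoothing are illustrative.
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(q1..qn | d), mixing document and collection term distributions."""
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    log_prob = 0.0
    for q in query:
        p_doc = doc_counts[q] / doc_len if doc_len else 0.0    # Equation 4.4
        p_coll = coll_counts[q] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll                   # smoothed estimate
        # a query word unseen in the entire collection still yields -infinity
        log_prob += math.log(p) if p > 0 else float("-inf")
    return log_prob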
We will now look at the vector space model in more detail, along with some important preprocessing steps that are commonly used by IR systems.

4.1.1 Vector Space Model

The vector space model (VSM), as already stated, ranks a document with respect to its similarity to a given query. According to the VSM, this similarity can be estimated by calculating the cosine of the angle between the document vector and a query vector. More formally:

sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \times |q|}    (4.5)

sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} (w_{i,j})^2} \times \sqrt{\sum_{i=1}^{t} (w_{i,q})^2}}    (4.6)

where |d_j| and |q| are the norms of the document and query vectors. Both the document and query vectors are weighted; that is, an index term in the document, w_{i,j}, and in the query, w_{i,q}, represented by the same position i in their respective vectors, will be assigned a weight. The value of this weight will depend on the weighting scheme used, of which there are many variations in IR research. The simplest possible weight for a term is the frequency of the term in the document, or the normalised frequency when it is divided by the length of the document, tf. However, an even better measure is one that also considers the inverse document frequency or idf value, which is based on the number of documents in the entire collection in which the term occurs. Incorporating an idf count into a term weighting scheme is useful because, although a term may appear to be important in a document due to its high frequency, this term may not be very useful at distinguishing this document from others in the collection if it also occurs frequently in other documents in the corpus. Therefore, a good index term can be defined as one that has a high tf count but occurs in relatively few other documents in the collection, and hence has a high idf value. The most common weighting scheme combining both these measures is defined as follows (Baeza-Yates, Ribeiro-Neto, 1999):

tf.idf = w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}    (4.7)

where f_{i,j} is the normalised tf value (i.e. the frequency of term k_i in document d_j divided by the maximum word frequency in d_j), and the log expression is the idf value, with N the total number of documents in the collection and n_i the number of documents in which the term appears.
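The weighting scheme of Equation 4.7 and the cosine measure of Equation 4.6 are straightforward to realise in code; the sketch below does so over simple token lists purely for illustration (the function names are not taken from any system described here).

# Illustrative tf.idf weighting (Equation 4.7) and cosine similarity
# (Equation 4.6) over toy token lists. Function names are illustrative.
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, num_docs):
    """Weight each term by normalised tf times log(N / ni)."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values()) if counts else 1
    return {t: (f / max_freq) * math.log(num_docs / doc_freq.get(t, 1))
            for t, f in counts.items()}

def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm = math.sqrt(sum(w * w for w in vec_a.values())) * \
           math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / norm if norm else 0.0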
In the VSM two important preprocessing steps are generally undertaken before term weighting occurs:

Stopword Removal is the process of identifying and eliminating frequently occurring words (often closed-class words) that add very little information to a document representation. Before processing begins, a list is usually compiled that contains such words as auxiliary and modal verbs ('be', 'have', 'should'), determiners ('the'), conjunctions ('because'), and vague concepts ('nobody', 'anyone'). Since the tf.idf scheme would have assigned these types of words low scores anyway, one may ask whether there is any need to remove them at all. The main advantage of stopword removal is that it improves the speed of execution of subsequent processing steps (e.g. document-query similarity calculations) because the number of vocabulary items in the collection is reduced, and hence all vector lengths are shortened.

Stemming is the process of reducing terms to their root form. A stemmer uses morphological and derivational transformation rules to accomplish this. So, for example, noun plurals such as 'chocolates' are transformed into 'chocolate', and suffixes such as 'ing', 'es', 's' and 'ed' are removed from verbs. However, some words do not conform to these transformation rules, so many stemmers use a table of exceptions to identify and correctly reduce these words to their correct root, e.g. 'children' to 'child', and 'ate' to 'eat' and not 'at'. The major benefit of stemming is that it increases the accuracy of the term weighting process, thus improving the recall of the system, i.e. more documents will be retrieved in response to a query. However, since words are reduced to a root form, the original semantics of the word may be lost. For example, consider the case of the words '(chemical) plant' and '(tobacco) plantation', which will both be represented as 'plant' in a term index after stemming, even though one occurrence refers to a 'factory' and the other to an 'area of foliage'. In spite of this, however, researchers have found that a marginal improvement in retrieval performance is possible using a stemming algorithm on index terms (Frakes, Baeza-Yates, 1992).

4.1.2 IR Evaluation

In IR research, two aspects of system performance are explored when determining the effectiveness of a given retrieval system. Performance evaluation looks at the real-world practicality of the system by examining the trade-off between the time and space complexity of the system. Such issues were, at one time, of great concern to the IR community. However, with the advent of high-speed processors and the relative inexpensiveness of memory, this facet of system performance is not of as much interest in current evaluation methodologies. System effectiveness (or retrieval performance), on the other hand, is what drives most IR research. This type of evaluation looks at how well a system can match a set of ground-truth retrieval results (i.e. what a human expert believes the system should return in response to a query given a particular document collection). Creating such a test collection and employing experts to generate these relevancy judgements is a very time-consuming and expensive endeavour. However, it is a small price to pay for an evaluation procedure that can stand up to scientific scrutiny, allow different systems to be compared on an even platform, and consequently ensure that real and transparent progress is being made in a particular area of IR research. The most influential large-scale evaluation of IR strategies that follows this line of thinking is the TREC (Text REtrieval Conference) initiative, which began in the early 1990s and is now in its 11th year. This annual conference holds separate forums on different English text retrieval tasks and on other diverse areas such as multi-lingual and digital video retrieval. However, in this thesis we focus on another well-known large-scale evaluation initiative called the Topic Detection and Tracking (TDT) initiative. A more detailed discussion of the TDT evaluation methodology is given in Section 4.2. For the remainder of this sub-section we will focus on a more general description of IR evaluation.

12 For a discussion of other important IR test collections, e.g. ISI and CACM, refer to (Baeza-Yates, Ribeiro-Neto, 1999).

Given a large test collection (~10 Gigabytes) and a ground-truth or gold-standard set of judgements for a given task, the first step in an IR evaluation is to measure the degree of overlap between the list of relevant documents the system returns in response to a query and the gold-standard output for that query produced by a set of human judges. There are two standard metrics for measuring this degree of overlap: recall and precision. Section 3.3 defines these metrics with respect to disambiguation accuracy; however, they are more commonly defined in the context of an IR evaluation. Figure 4.1, taken from (Baeza-Yates, Ribeiro-Neto, 1999), represents a set theory approach to defining these metrics, where A is the set of relevant documents in the collection, R is the set of documents retrieved by the system, and S is their intersection, i.e. the set of relevant documents retrieved.

Figure 4.1: IR metrics precision and recall.

Recall is the number of relevant documents returned by the system divided by the total number of relevant documents in the corpus.

Recall = \frac{|S|}{|A|}    (4.8)

Precision is the number of relevant documents returned by the system divided by the total number of documents retrieved by the system.

Precision = \frac{|S|}{|R|}    (4.9)

F measure is an attempt to combine recall and precision into a single score.
It is calculated by finding the harmonic mean of the two numbers, precision p and recall r:

F_1 = \frac{2pr}{p + r}    (4.10)

Typically in IR experiments one finds that a trade-off exists between precision and recall, i.e. as the precision of the system improves the recall deteriorates, and vice versa. Hence, reports of systems achieving 100% precision and recall are very rare. Although recall, precision and F1 are the most commonly used metrics for evaluating system performance, the TDT evaluation and the work described in this thesis measure performance using two alternative system error metrics: misses and false alarms. The decision by the TDT community to choose a signal detection model of evaluation is based on the fact that most information filtering (see Section 4.1.3) tasks can be viewed as a detection process. For example, consider a security alarm (that detects intruders) or a retina-scanning device (that detects unauthorised users), where the goal of the system is to minimise both the number of false alarms and the number of misses. However, some errors are more critical than others depending on the fault-tolerance bias of the system. For example, consider a smoke alarm that is highly sensitive and produces many false alarms; the annoyance caused by the smoke alarm falsely detecting a fire is less critical than the loss of human life if the system were to miss the occurrence of a real fire. This tolerance of false alarms over misses can be integrated into a cost function that combines misses and false alarms (in the same way that the F1 measure combines recall and precision) and penalises the system more heavily when it commits a critical error (in this case a miss). Like recall and precision, a trade-off also exists between misses and false alarms, where a reduction in system misses will often lead to an increase in false alarms. A fuller explanation of these error metrics in terms of the TDT evaluation methodology can be found in Section 5.4.2.
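For illustration, the sketch below computes the measures discussed in this sub-section, precision, recall and F1 (Equations 4.8 to 4.10), together with the miss and false alarm rates favoured by the TDT evaluation, from sets of retrieved and relevant documents. It is a generic sketch and does not reproduce the exact TDT cost function described in Section 5.4.2.

# Illustrative computation of precision, recall, F1 and the miss / false alarm
# rates discussed above. A generic sketch, not the TDT cost function itself.
def evaluate(retrieved, relevant, collection_size):
    hits = len(retrieved & relevant)                      # |S|, relevant documents returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    misses = len(relevant - retrieved)                    # relevant items not returned
    false_alarms = len(retrieved - relevant)              # non-relevant items returned
    miss_rate = misses / len(relevant) if relevant else 0.0
    nonrelevant = collection_size - len(relevant)
    false_alarm_rate = false_alarms / nonrelevant if nonrelevant else 0.0
    return precision, recall, f1, miss_rate, false_alarm_rate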
4.1.3 Information Filtering

Information Filtering tasks are often referred to as IR sub-tasks. This is mainly because IR strategies like the models described in Section 4.1 have also been successfully applied to information filtering problems. Although an IR system compares the similarity of each document in the collection to a query, the same similarity measure can be applied to an information filtering task, where the query/document comparison process is replaced with a series of document/document similarity comparisons. However, in spite of their resemblance in this regard, there are still some striking differences between ad hoc/query-based retrieval and information filtering. Firstly, as just stated, a filtering system does not deal with an explicit query like an IR system does. A 'filtering query' has been described as a long-term query that anticipates a future user's need by organising a dynamic document collection into a structure representative of its content. In this way filtering is often seen as a classification task where a document is either relevant or not relevant. An example of a filtering task would be a clustering task that partitions a collection of documents into a set of clusters, where each cluster contains a series of documents discussing a particular topic. Secondly, although filtering tasks can handle static document collections, unlike IR systems they may also have to deal with dynamic collections of documents, e.g. streams of newswire documents arriving every few minutes. This collection characteristic means that a filtering system may have to make an online decision regarding the relevance or non-relevance of a document. In contrast, an IR system has the entire collection at its disposal and so has the luxury of a 'retrospective' decision making process. TDT tasks such as New Event Detection and Topic Tracking are examples of such filtering tasks. So far in our discussion of filtering, we have assumed that no interaction occurs between a user and the filtering system; however, this is not strictly the case. Figure 4.2 shows Arampatzis's (2001) model of document filtering, which clearly illustrates that interaction between the user and the collection, and between the user and the filtering or selection process, is possible.

Figure 4.2: Typical Information Filtering System.

In the case of the collection, the user may have control over what sources of information are to be filtered, e.g. only European newswire. The filtering or selection process itself may be aided or controlled by personal information regarding the user's interests, level of expertise or age. However, any subsequent information displayed to the user is not ranked and it is basically up to the user to browse through this subset of information to find what interests them. If, at this stage, the filtered information is ranked in order of relevancy to a user's preferences, then this special case of filtering is called routing.

4.2 Topic Detection and Tracking

'TDT is a body of research and an evaluation paradigm that addresses the event-based organisation of broadcast news' (Allan, 2002b). More specifically, TDT research is concerned with the detection of new events in a stream of news stories taken from multiple sources, and the tracking of these known events. The initiative hopes to provide an alternative to traditional query-based retrieval, by providing the user with a set of documents stemming from a generic question like: 'What has happened in the news today/this week/in the past month?' In this way, TDT tasks are seen as classification or filtering tasks as they deal with static queries which present the user with an organised event structure and then allow the user to decide what is of interest to them. Although TDT tasks focus on the organisation of large volumes of news stories, the techniques and methods being developed can be used in a variety of other scenarios that match this type of static information need. A few examples of other possible applications include stock market analysis, email alerting, junk mail filtering, and incident and accident analysis. The TDT initiative began in 1997 and is still an active area of research after having completed one pilot study and six 'open and competitive evaluations' on four distinct test collections (Allan, 2002b). TDT was originally funded and supported by DARPA (Defense Advanced Research Projects Agency), but is now under the control of the TIDES (Translingual Information Detection, Extraction and Summarization) program. Most years the evaluation attracts roughly 11 participants, including its founding members the University of Massachusetts, Carnegie Mellon University, and Dragon Systems, and other important participants such as IBM Watson and the University of Maryland. In the TDT pilot study and TDT 1998 evaluation, only English language news sources were provided.
However, TDT is now a multilingual forum that also focuses on Chinese (TDT2 corpus), Arabic, Spanish, Korean, and Farsi news stories (soon to be released in the TDT4 corpus). Another important feature of the TDT collections is that they contain not only newswire (text), but also audio transcriptions of radio and television news broadcasts. The contribution of the audio sources is an important motivation for the TDT researchers, since previous large-scale filtering and classification work has focussed on organising 'clean' newspaper sources. Therefore, TDT systems must be robust enough to be able to deal with error-prone manually and automatically transcribed, and translated, broadcasts. Another important feature of the TDT paradigm, described in the next section, relates to its definition of news topics and events, and how this has affected the creation of the TDT corpora by LDC (Linguistic Data Consortium) annotators and the evaluation methodology set out by the TDT community.

13 Farsi is the most widely spoken Persian language, with over 30 million speakers, including 50% of Iranians and 25% of Afghans in their respective countries (Source: http://www.farsinet.com/farsi/, July 2003).

4.2.1 Distinguishing between TDT Events and TREC Topics

During the 1997-1998 pilot study evaluation, the TDT participants at the time settled on the following definition of an event: 'something that happens at some point in time' (Allan et al., 1998a). Later it was admitted that this definition was somewhat vague; however, it was still considered important, as the realisation of this definition, and the discussion around its formulation, marks the first large-scale attempt in IR research to move away from a broad notion of 'aboutness' to a finer-grained definition of how news topics grow and expand on a day-to-day basis. As Allan (2002b) explains, much of the TREC filtering and retrieval work done before the TDT initiative centred around the classification and retrieval of documents that discuss broad subject areas such as stories on 'earthquakes'. TDT topics, on the other hand, differentiate between different instances of 'earthquakes' in that general topic. For example:

o 17th of January 1995, Kobe earthquake
o 30th of May 1998, northern Afghanistan earthquake
o 17th of August 2000, northwest Turkey earthquake
o 25th of March 2002, northeast Afghanistan earthquake

The TDT definitions of a topic and an event that are still used in current evaluations were agreed upon during the second TDT evaluation in 1998, and are defined as follows (Papka, 1999):

A topic is a seminal event or activity along with all directly related events and activities.

An event is something that happens in a specific time and place. (Specific elections, accidents, crimes and natural disasters are examples of events.)

An activity is a connected set of actions that have a common focus or purpose. (Specific campaigns, investigations, and disaster relief efforts are examples of activities.)

Two important areas of TDT research have emerged from these definitions. Firstly, notice the emphasis on 'time'. The temporal nature of news topics is a common phenomenon in news streams, and incorporating this into TDT technology presented an exciting challenge for TDT researchers. Yang et al. (1998) outlined three important observations on this characteristic of news which helped to focus the development of their TDT systems: news stories discussing the same event tend to be temporally proximate, i.e. occur in news bursts.
A time gap between bursts of topically similar stories is often an indication of different TDT topics. New events are characterised by large shifts in vocabulary in the data stream, especially where proper nouns are concerned. Secondly, the event, topic and activity definitions contain a notion of event evolution. To illustrate this, consider a seminal event such as a storm warning that triggers the start of a topic on the devastation caused by a ferocious tropical hurricane. As the story progresses over the course of the topic, different events or activities will arise that are not clearly related to other events, but are associated by the seminal event that triggered them. So some events that might originate from the hurricane event include: the resulting rescue attempt; the rebuilding of damaged houses and fundraising activities; the subsequent rise in home insurance in the area. When an event evolves in this way, clearly, there will be a gradual shift in vocabulary and focus as the story develops. Allan (2002b) points out that this is another difference between subject-based topics and event-based topics, where the 88 relevance of a news story to a topic is dependent on time, while the relevance of a news story to a general subject is time independent. The general effect of an event-based definition of a topic on corpus creation is that a large amount of annotation time is spent on ensuring that annotators understand what seminal events are, and what constitutes a set of related events for a particular topic. In the development of the TDT2 corpus a set of ‘rules of interpretation’ were formulated to help annotators with this task. Once annotators have assigned labels to documents relating to specific topics, a number of quality assurance tests are performed in order to ensure that different annotators agree on these labels. Cieri et al. (2002) at the LDC note that when they measured human annotation consistency using the kappa statistic they found that ‘kappa scores on TDT2 were routinely in the range of 0.59 to 0.89 … scores for TDT3 ranged from 0.72 to 0.86’, where 0.6 indicates marginal consistency and 0.7 measures good consistency. These scores show that annotating a corpus using an event-based topic definition is a challenging task. More recently, the TDT organisers have discovered ‘an unrecognised (but always present) problem with topic annotation’. The LDC defined a topic by first selecting a random story from the corpus, then identifying the seminal event that triggered the story, and finally building a topic for the seminal event by finding all related documents using the aforementioned rules of interpretation. The problem with this procedure is that ‘if the LDC were to randomly sample a story from that topic and then re-apply the process, it might not get the original topic back. The issue is in how the seminal event is chosen from the sample story, and that depends on which story is selected.’14 Currently, the organisers are planning to determine the impact of this finding on their evaluation methodology by comparing cluster detection results (see 5.1.2 for definition of this task) from different sites on the new TDT4 corpus. 4.2.2 The TDT Tasks The goal of a TDT system is to monitor a stream of broadcast news stories, and to find the relationships between these stories based on the real-world proceedings or events that they describe. 
Five technical tasks have been outlined within the TDT study (Allan, 2002b): 14 Source: http://ciir.cs.UMass.edu/research/tdt2003/guidelines.html last accessed 16th of July, 2003. 89 Segmentation is the task of breaking a broadcast news stream into its constituent news stories. This task opens up a whole new area of discussion that has not been explored so far in the thesis. The necessity of this task relates to the added difficulty of working with broadcast radio and television transmissions. Since unlike written sources of news, which contain title, paragraph and story boundary information, a broadcast news transcript or closed caption material will not contain any mark-up indicating where stories begin and end in the data stream. This has prompted an entirely new avenue of research (discussed at length in Chapters 6 and 7) that requires systems to automate the story segmentation process. As a consequence of this, TDT systems must incorporate more robust filtering technologies that can tackle noisy input due to segmentation errors (e.g. a missed story) and additional errors contained in ASR (Automatic Speech Recognition) system output such as a lack of capitalisation, and errors due to pronunciation similarity between different word forms, e.g. ‘ice cream’ and ‘I scream’. Much of the LDC’s work in creating the TDT corpora was taken up adding boundary information to automatic ASR output and closed-caption transcripts. Segments in TDT text must also be classified as one of the following: a news story, a miscellaneous news item such as reporter chit-chat or commercials and untranscribed text containing incomplete stories where there is not enough information present in the text to identify its topic (Cieri et al., 2002). These human identified topic boundaries are then used to evaluate the performance of TDT segmentation systems. The TDT community has also investigated the impact of automatic segmentation errors on other TDT tasks, where it has found that ‘segmentation has little-effect on tracking tasks, but does dramatically effect the impact of various detection tasks’ (Allan, 2002b). Detection is the task of identifying similar (on-topic) and dissimilar (off-topic) news stories in the news stream. Detection can be further subdivided into new event detection, cluster detection, and link detection tasks. o (Online) New Event Detection (NED) is the task of recognising seminal events as they arrive on the data stream. In TDT 1999 – 2002 this task was referred to as First Story Detection, however, in its current incarnation the TDT community have reverted back to calling it 90 by its original task name used in the 1997 pilot study. In all evaluations, the task definition remains the same: to find the document that is first to discuss a breaking news story for each event in the collection. This is an online filtering task so the system can only make this decision (first story or not a first story) for the current document by considering only those documents that it has seen so far on the input stream. o Cluster Detection has been referred to as either Event Detection or Retrospective Event Detection in previous TDT evaluations. The task definition for an event detection system is: to retrospectively divide the data stream into clusters of related events by considering all the documents in the TDT collection rather than just those that occur before the current document in the input stream, as in the case of online new event detection. 
This task has proved to be considerably more popular than its new event detection counterpart due to the similarity of this technology with previous research efforts such as clustering-based TREC tasks.

o Story Link Detection is the task of classifying a pair of news stories as on-topic (they belong to the same topic) or off-topic (they belong to different topics). The TDT initiative has emphasised the importance of this task as it is a 'core technology for all other tasks' (Allan, 2002b). This claim is easily understood since all IR and filtering systems are concerned with the determination of document similarity. It is hoped that by refining this aspect of TDT research a breakthrough in other tasks may be possible.

Tracking is the task of finding all subsequent stories in the news stream pertaining to a certain known event represented by the first n sample stories on that event. It is analogous to the TREC information filtering task. In essence, the tracking problem involves classifying each successive story on the input stream as either on-topic (it describes the target event) or off-topic (it is not related to the target event). Quantifying an optimal value for n is a major part of tracking research, where it is of paramount importance to a real-time system that tracking can begin as soon as possible (i.e. small n) after the seminal event has been identified.

Figure 4.3 shows a possible TDT architecture that integrates the tasks defined above. The system inputs are a broadcast news and newswire stream. An ASR system converts a television or radio broadcast speech signal to a text transcript which is then segmented into its constituent news stories.

Figure 4.3: TDT system architecture.

The data stream is then fed to each of the TDT components. Given a specific time span (e.g. summer 2003), the retrospective detection component will provide the user with a list of news topic clusters in that time frame. The user can then specify a news event of interest that can then be tracked in the remainder of the data stream using the event tracking component. The final component, the new event detector, alerts the user to all breaking news stories as they arrive on the data stream. An ad hoc retrieval component could also be included in this system architecture, where given a user query (e.g. 'Kofi Annan') the system will return a ranked list of all relevant event clusters from the set of event clusters detected by the retrospective event detection step.

4.2.3 TDT Progress To Date

Since the beginning of the TDT initiative a significant amount of progress has been made in the development of systems based on the tasks defined in the previous section. Allan (2002b) states that tracking technology is now at an acceptable level of accuracy for integration into a real-time system. However, all of the detection tasks have experienced less exciting levels of progress. In particular, this is true of the NED task. In a paper by Allan et al. (2000b), it was shown that NED, or First Story Detection (FSD) as they refer to it, is a special instance of event tracking, and that current TDT tracking approaches used to solve the FSD problem are 'unlikely to succeed'.
This logic follows from the observation that during the detection of first stories multiple tracking of documents is actually taking place. More specifically, each first story identified is a potential event that must be tracked in order to discover further ‘first’ stories that digress from all previously identified events. Allan et al. (2000) show that TDT filtering results are comparable with TREC results on a similar filtering task. Hence, they conclude that a huge effort is needed to get current FSD effectiveness to the level of current tracking effectiveness. In fact, they believe that this improvement in FSD performance will require a 20-fold improvement in current tracking effectiveness. It is generally agreed that the first phase of TDT research has largely been taken up with investigating how well traditional IR filtering solutions would perform in a TDT evaluation environment. Allan (2002b) concludes that parameter tweaking existing techniques has got TDT research this far, but that if further improvements 93 are to be made, future TDT investigations must focus more on modelling the essential entities involved in the event definition – time, location and people and how these entities relate to each other with respect to event evolution. In the next section we examine some of the principal approaches to NED used by the TDT participants. In Chapter 5 we document our novel lexical chaining approach to this task. 4.3 New Event Detection Approaches Most approaches to TDT tracking and detection tasks have used elements of the traditional IR models described in Section 4.1. This section describes the techniques of four different TDT participants who have submitted results for the NED/FSD task. Typically these techniques differ in their implementation of the following three NED system components: Feature Extraction and Weighting: This component reduces documents to a set of index terms or features, and then weights these features with respect to their discriminating power as an element of the resultant document classifier (or document representation). Similarity Function: A similarity function is used to determine the strength of association between document representations or classifiers. This component coupled with a similarity threshold is used to determine if two classifiers are ontopic (an old event has been detected) or off-topic (a first story has been detected). Detection Algorithm: All NED algorithms are based around some sort of cluster algorithm. A document clustering algorithm groups documents into sets or clusters that contain a high overlap of highly weighted features. Unlike text categorisation, another type of text classification task, document clustering is an unsupervised task with no apriori knowledge about the types of categories (in our case events) present in the collection. Another added difficulty associated specifically with NED clustering algorithms is that retrospective clustering is prohibited. That means that documents must be processed sequentially, and that the classification of a document as a new or old event must be based only on the documents that have occurred before this point on the data stream. This implies implementing the detection algorithm using a single-pass clustering algorithm 94 (van Rijsbergen, 1979), which uses the similarity function and a feature weighting strategy in its decision to assign a document to a cluster. 
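To make the shape of such a detection algorithm concrete, the following skeleton shows a generic single-pass clustering loop for NED; the vectoriser, similarity function and threshold are deliberately left abstract, and the sketch is not the implementation of any particular site's system described below.

# Generic single-pass clustering skeleton for NED: each story is compared only
# to material seen earlier on the stream; if nothing is similar enough it is
# flagged as a first story and seeds a new cluster. A sketch only, with the
# vectoriser, similarity function and threshold supplied by the caller.
def single_pass_ned(stories, vectorise, similarity, threshold):
    clusters = []                      # each cluster holds the vectors of its member stories
    first_stories = []
    for story in stories:              # strictly in stream order - no retrospection
        vec = vectorise(story)
        best_sim, best_cluster = 0.0, None
        for cluster in clusters:
            sim = max(similarity(vec, member) for member in cluster)
            if sim > best_sim:
                best_sim, best_cluster = sim, cluster
        if best_cluster is not None and best_sim >= threshold:
            best_cluster.append(vec)   # old event: attach to the closest cluster
        else:
            clusters.append([vec])     # new event: story seeds its own cluster
            first_stories.append(story)
    return first_stories, clusters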
The technical details of this process will be described in more detail with respect to each of the four NED techniques covered in this section. Another important contribution to NED research, which will also be looked at, is the set of conclusions drawn from the Workshop on Topic-based Novelty Detection held at Johns Hopkins University in Summer 1999 (Allan et al., 1999). Although TDT is now a multi-lingual evaluation environment, we will only focus on aspects of TDT research that deal with English multi-source news stories, as this is the main focus of our research. For more information on multi-lingual TDT, we refer the reader to various site contributions found in (Allan, 2002a).

4.3.1 UMass Approach

The initial UMass NED system that took part in the pilot study, and the TDT 1998 and 1999 evaluations, is based on work by Papka (1999) and Allan et al. (1998a, 1998b, 1998c, 1998d, 1999, 2000a, 2000b). In their NED vector space model implementation, they use a single-pass clustering algorithm, and avail of the InQuery system framework (Callan et al., 1992) for representing documents and measuring similarity between an incoming document and a document in a cluster. In the UMass system, an initial classifier for the current document on the input stream is created from the n most frequently occurring stopped and stemmed terms. The InQuery weighting function, a variation on the tf.idf measure, is then used to assign 'belief values' to each of the terms in a document using the following formulae taken from (Papka, 1999):

d_{j,k} = 0.4 + 0.6 \times tf_k \times idf_k    (4.11)

tf_k = \frac{t_k}{t_k + 0.5 + 1.5 \times \frac{dl_j}{avg\_dl}}    (4.12)

idf_k = \frac{\log \frac{C + 0.5}{cf_k}}{\log(C + 1)}    (4.13)

where d_{j,k} is a document feature in document d_j, t_k is its frequency in that document, dl_j is the document's length and avg_dl is the average document length in the test collection. As already stated, the TDT task requirement for NED states that classification decisions are to be made online rather than retrospectively. This not only affects the type of clustering algorithm the system can use, but also the calculation of the idf part of any tf.idf measure. More specifically, the problem arises from the fact that it is difficult to generate meaningful idf measures from the small number of previously seen documents that are permitted for this calculation during the detection process. UMass use an auxiliary corpus of TREC documents (all on general news topics) to estimate idf values for each term in the test collection. Therefore, in the above idf_k formula, C is the number of documents in the auxiliary collection, and cf_k is either the number of documents containing the feature in this auxiliary collection, or the default value of 1 if the term is not present in the auxiliary collection. A document classifier is represented as a weighted vector of its features. The similarity between any two classifiers is calculated using the InQuery scoring metric #WSUM:

sim(q_i, d_j) = \frac{\sum_{k=1}^{N} q_{i,k} \cdot d_{j,k}}{\sum_{k=1}^{N} q_{i,k}}    (4.14)

where q_{i,k} is the relative weight of a feature in an existing document classifier, and d_{j,k} is the weight of a feature in the incoming document's classifier, calculated using Equation 4.11. The threshold used to determine if these two classifiers are on-topic is dynamically determined using the following formula:

threshold(q_i, d_j) = 0.4 + \alpha \times (sim(q_i, d_i) - 0.4) + \beta \times (date_j - date_i)    (4.15)

where \alpha = 0.2 and \beta = 0.0005 control the effect of sim(q_i, d_i) and the time parameter date_j - date_i, and 0.4 is an InQuery constant. This time parameter is used as a means of modelling the temporal nature of news in the detection task, and as a means of controlling the similarity of the classifiers so that documents that are far apart on the input stream seem less similar than they actually are. This is based on the observation that 'an event is less and less likely to be reported as time passes [because] ... it slowly becomes news that is no longer worth reporting' (Allan et al., 1998c).
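A minimal sketch of how Equations 4.11 to 4.13 and 4.15 fit together is given below; it follows the formulae above, but the function names and the treatment of unseen terms are illustrative rather than a reconstruction of the actual UMass code.

# Illustrative sketch of the UMass-style belief weighting (Equations 4.11-4.13)
# and the dynamic threshold of Equation 4.15. Names are illustrative.
import math

def belief_weight(term_freq, doc_len, avg_doc_len, aux_docs, aux_doc_freq):
    """InQuery-style belief value for one term in one document."""
    tf = term_freq / (term_freq + 0.5 + 1.5 * (doc_len / avg_doc_len))      # Eq 4.12
    cf = max(aux_doc_freq, 1)                # default to 1 if unseen in the auxiliary corpus
    idf = math.log((aux_docs + 0.5) / cf) / math.log(aux_docs + 1)          # Eq 4.13
    return 0.4 + 0.6 * tf * idf                                             # Eq 4.11

def dynamic_threshold(sim_self, date_doc, date_classifier, alpha=0.2, beta=0.0005):
    """On-topic threshold that rises as the time gap between stories grows."""
    return 0.4 + alpha * (sim_self - 0.4) + beta * (date_doc - date_classifier)  # Eq 4.15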
Another important element of Equation 4.15 is the calculation of sim(q_i, d_i), i.e. the similarity value between the (existing) classifier and the document from which it was originally formulated. In Papka's NED implementation, a document classifier is continually reformulated as new documents arrive on the input stream. This classifier reformulation step is based on the notion that as a news topic grows a variation in vocabulary will also occur, and so, in order to ensure that discriminating features in a cluster are weighted correctly, they must be re-weighted each time a new document is added to that cluster. More specifically, classifiers are re-weighted with respect to how often their features occur in the relevant document set (the cluster in which the classifier resides), tf_rel, and in the non-relevant document set (all other document classifiers), tf_nonrel:

q_{i,k} = c_1 \times tf_{rel} - c_2 \times tf_{nonrel}    (4.17)

where c_1 and c_2 are equal to 0.5. Using the above weighting and thresholding strategies, and a single-pass clustering algorithm, an incoming document is first compared to each previously seen classifier in each cluster. In each case the system assigns a decision score to each document comparison using the following formula:

decision(q_i, d_j) = sim(q_i, d_j) - threshold(q_i, d_j)    (4.16)

where a positive value indicates that an old event has been found, while a negative value indicates that the incoming document discusses a seminal event or first story. Papka's clustering algorithm employs an online single-link strategy when comparing a cluster to a target document. More specifically, rather than maintaining and updating a cluster centroid classifier, this comparison strategy maintains a set of individual classifiers representing all the documents that have been added to a particular cluster during the clustering process. The single-link comparison strategy takes 'the maximum positive decision score for the classifiers contained in a cluster' as the similarity value between a cluster and an incoming document (Papka, 1999). Once an incoming document classifier has been added to an existing cluster (i.e. an old event has been detected), or forms the seed of a new cluster (i.e. a new event has been detected), the next document in the input stream is read in and the detection process begins again.
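The decision rule of Equation 4.16 and the online single-link comparison just described can be sketched as follows; again the names are illustrative, and the similarity and threshold functions are assumed to be supplied (e.g. Equations 4.14 and 4.15).

# Illustrative sketch of the decision rule (Equation 4.16) combined with the
# online single-link comparison strategy described above. Names are illustrative.
def decision(sim, threshold):
    """Positive: old event (on-topic); negative: potential first story."""
    return sim - threshold

def best_cluster_decision(doc_vector, clusters, similarity, threshold_for):
    """Single-link: take the maximum decision score over all classifiers in all clusters."""
    best, best_cluster = float("-inf"), None
    for cluster in clusters:
        for classifier in cluster:     # every stored document classifier in the cluster
            d = decision(similarity(classifier, doc_vector),
                         threshold_for(classifier, doc_vector))
            if d > best:
                best, best_cluster = d, cluster
    return best, best_cluster          # best <= 0 means a new event has been detected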
4.3.2 CMU Approach

The Carnegie Mellon University approach to NED uses a vector space model to represent documents and clusters during first story detection (Yang et al., 1998; 2002). However, unlike the original UMass NED approach (Papka, 1999), they use a variation on the single-pass clustering algorithm that compares incoming documents to other recently occurring documents on the input stream rather than to documents in existing clusters. Like UMass, the CMU researchers also addressed the difficulties of calculating idf statistics online. To tackle this problem they devised an incremental vector space model (Incr.VSM) which calculates the idf statistic in two ways:

Retrospective idf statistics can be generated from a same-domain corpus in order to approximate the real idf values in the current corpus.

Incremental idf statistics generated from the current corpus are captured by recomputing the statistics as new information arrives on the data stream. In other words, the feature frequency counts in the documents seen so far on the input stream are used to update and augment the retrospective idf values obtained a priori.

The incremental idf measure is defined as follows in (Yang et al., 1998):

idf(t, p) = \log_2 \left( \frac{N(p)}{n(t, p)} \right)    (4.18)

where p is the current time, t is the term, N(p) is the number of accumulated documents up to the current point (including the retrospective corpus if used), and n(t, p) is the number of documents which contain term t up to the current point on the input stream. Terms are then weighted using the following version of the tf.idf measure, also taken from (Yang et al., 1998):

w(t, d) = \frac{(1 + \log_2 tf(t, d)) \times idf(t, p)}{\|d\|}    (4.19)

where the denominator \|d\| is the 2-norm of vector d, i.e. the square root of the sum of the squares of all the elements in that vector. This equation and all document preprocessing (stopword removal and stemming) are provided by the SMART 11.0 system developed at Cornell University (Salton, 1989). CMU's detection algorithm, as already stated, compares incoming documents to other previously seen documents rather than to cluster representations of topics. Furthermore, their approach calculates document-document similarity using the standard cosine measure; however, document-document comparison is more computationally expensive than comparing documents to clusters. Hence, they implement a time-window component which improves their algorithm's efficiency by limiting the number of 'target document to existing document' comparisons to those that exist within a window of m previously seen stories up to the current point on the input stream. Using a window size of 2000 documents (about 1.5 months of news time), Yang et al. (1998) found that this windowing technique actually improved performance rather than compromised it. This result is in part due to the temporal nature of news, where documents become less related in the input stream as time goes on, and so it becomes unnecessary to compare all old documents to the target document. To further model this idea of news temporality, the CMU team also incorporate a decay function into the similarity measure as follows:

score(x) = 1 - \max_{d_i \in window} \left\{ \frac{i}{m} \, sim(x, d_i) \right\}    (4.20)

where x is the current document, d_i is the i-th document in the window, and i = 1, 2, ..., m. The value i/m is the decay factor, which ensures that documents that lie far apart on the input stream will be assigned a reduced similarity score. With regard to feature extraction, all features (excluding stopwords) are retained for each document representation, and the optimal similarity threshold was found to be 0.16. This means that an incoming document must not exceed the 0.16 similarity limit with another document in the time-window in order to be classified as a 'first story'. This CMU system took part in the TDT pilot study, TDT 1998 and TDT 1999 evaluation initiatives. A lot of CMU's work has also focused on using multiple-classifier and multiple-method approaches (the BORG approach) to improve tracking and cluster detection performance; however, these techniques were not suitable for the NED task just described. More information on this work and their multi-lingual TDT findings can be found in (Yang et al., 2002).
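The sketch below illustrates the incremental idf of Equation 4.18 and the windowed, decayed novelty score of Equation 4.20; it assumes the window is ordered from oldest to most recent story, and the class and function names are illustrative rather than taken from the CMU system.

# Illustrative sketch of the CMU-style incremental idf (Equation 4.18) and the
# window-plus-decay novelty score (Equation 4.20). Names are illustrative.
import math

class IncrementalIdf:
    def __init__(self, retro_docs=0, retro_doc_freq=None):
        self.num_docs = retro_docs                   # N(p), seeded from a retrospective corpus
        self.doc_freq = dict(retro_doc_freq or {})   # n(t, p)

    def add_document(self, terms):
        self.num_docs += 1
        for t in set(terms):
            self.doc_freq[t] = self.doc_freq.get(t, 0) + 1

    def idf(self, term):
        return math.log2(max(self.num_docs, 1) / max(self.doc_freq.get(term, 1), 1))  # Eq 4.18

def novelty_score(doc_vec, window, cosine):
    """score(x) = 1 - max over the window of (i/m) * sim(x, di); high score = first story."""
    m = len(window)                                  # window ordered oldest to most recent
    if m == 0:
        return 1.0
    return 1.0 - max((i + 1) / m * cosine(doc_vec, d) for i, d in enumerate(window))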
4.3.3 Dragon Systems Approach

Dragon Systems use a language modelling approach to NED (Allan et al., 1998a; Yamron et al., 2002). This means that documents are represented using n-gram frequencies (in this case unigrams). Like CMU, Dragon use an auxiliary corpus to generate word statistics; however, their approach does not use this information to collect idf values to weight terms. Instead, discriminator topic models are built from an available same-domain corpus using an iterative k-means clustering algorithm. The Dragon approach then uses a single-pass clustering algorithm to determine which stories discuss seminal events, where a new event is detected if the story is closer to a discriminator topic model than to an existing story cluster. Dragon use a variation on the Kullback-Leibler divergence metric to measure the distance between the tf distributions of a document and a cluster, which is defined more formally as:

d = \sum_{n} \frac{s_n}{S} \log \frac{u_n / U}{c'_n / C} + decay\ term    (4.21)

In this equation, c'_n is the smoothed cluster count for word w_n, and u_n / U, s_n / S and c'_n / C are the relative frequencies of w_n in the background unigram model (or discriminator model), the story unigram model and the cluster unigram model respectively. The decay term is used to make clusters containing old news stories have a greater d, i.e. appear less similar, and is defined in (Allan et al., 1998) as 'the product of a decay parameter and the difference between the number of the story representing the distribution s_n and the number midway between the first and last story in the cluster'. Dragon Systems only participated in the pilot study evaluation for NED, preferring to concentrate, like CMU, on the segmentation, tracking and cluster detection tasks in further TDT evaluations. In later tracking and detection approaches Dragon combine two complementary statistical models, a beta-binomial model and a unigram model, to improve the performance of these tasks. However, no attempt was made to observe the effects of this combination technique on NED performance. Yamron et al. (2002) give additional details on these tracking and retrospective detection approaches.

15 Kullback-Leibler divergence (or relative entropy) is an information-theoretic measure of how different two probability distributions (over the same event space) are. This metric is commonly used as a distance measure, although it is asymmetrical.

16 Smoothing and its importance to language modelling-based retrieval is discussed briefly in Section 4.1.

4.3.4 Topic-based Novelty Detection Workshop Results

The Topic-based Novelty Detection Workshop took place in Summer 1999 at the Johns Hopkins University Center for Language and Speech Processing (CLSP) (Allan et al., 1999). Its aim was to investigate two novelty detection-based tasks: New Event Detection, which looks at measuring novelty at a document level, and New Information Detection, which looks at detecting novelty at a sentence level. The findings of this workshop resulted in a number of publications and a new avenue of research for the CIIR at UMass on temporal news summarisation using New Information Detection techniques (Allan et al., 2001). One of the major conclusions of the workshop, already mentioned in Section 4.2.3, was that further
The Novelty Detection Workshop also gave the UMass team an opportunity to review elements of their initial approach to NED described in Section 4.3.1. This work also led to the implementation of a ‘new-look’ TDT system for the TDT 2000 evaluation. Details on the clustering, similarity and weighting strategies supported by the UMass system can be found in Section 5.6.1. They found that optimal NED performance could be achieved using a VSM, a KNN clustering strategy, the cosine similarity metric, the basic tf.idf metric and a feature vector containing all terms in the document, i.e. full dimensionality (Allan et al., 2000c, 2002c). Stopping and stemming also appeared to be useful preprocessing steps as they reduced the number of tokens in the detection process without hurting the effectiveness of the system (Allan et al., 1999). Research into the effect of named-entity recognition on NED performance was a major focus of the Detection Workshop. Using the BBN named-entity detector, a variety of attempts were made to integrate named-entity information into the UMass vector space model. The basic intuition of this work was that named entities such as people or companies play a significant role in the description of a topic as they help to distinguish one topic from another. The simplest method of including named entities in a VSM is to assign these phrases higher weights in a term index. However, this was shown to have little effect on system performance as these phrases tend to have high tf and idf values, and so are assigned higher weights anyway. Other integration approaches worked on the notion that a first story will have a greater percentage of previously unseen named entities, since it is describing an entirely new event. However, this assumption was found to be false in a lot of cases because there are many high-profile named entities that are mentioned continuously in different topics, such as the phrase ‘President Clinton’, which turns up in topics as diverse as his State of the Union address, his affair with Monica Lewinsky and his attempts at brokering peace in Northern Ireland. Named entities of this nature are partially responsible for missed events. It was also found that there were a number of topics that did not contain a significant number of original named entities in their first stories, particularly if details of the incident were ‘shady’ at that point in the story, and so the names of the individuals involved were not known. This ‘first story’ characteristic was also responsible for missed new events. In addition, this characteristic was also the cause of many false alarms, where new named entities were introduced as a topic developed, thus leading the system to believe that a new event had been found. The most promising technique involving named entities explored by the workshop used a two-phase nearest neighbour approach that first found the n closest documents to the target document based on the similarity of their feature vectors disregarding all their named entities. This initial search was a comparison based purely on ‘content words’, and the second phase then involved counting the number of novel named entities in the target document that were not present in any of its nearest neighbours.
Any document that contained a high percentage of new entities relative to its number of old events was classified as a ‘first story’. However, the performance of this technique still did not exceed that of their baseline (tf.idf, vector space model) NED strategy. 4.3.5 Other notable NED Approaches In 1999 and 2000 only 2 groups, UMass and National University of Taiwan (NUT), participated in the NED track at the TDT workshops. In both cases the UMass system outperformed NUT’s contribution. The NUT approach to the TDT problem is still an interesting one as they attempt to extend the simple VSM model with some linguistic knowledge present in the text. In particular, they use a part-ofspeech tagger to identify noun phrases and verbs in the text since according to Chen and Ku (2002) these parts-of-speech are the main contributors to the event description in a news story. They also use a centroid-based clustering algorithm to detect first stories. However, when a new document is added to a cluster and the centroid is updated with its terms, their algorithm assigns time labels to all terms in the centroid, and deletes older candidate terms that have not appeared for a while in the input stream. This ensures that the important terms and the latest terms are retained in each topic centroid. Chen and Ku also acknowledge that short documents in the input stream tend to have very low similarities. Hence, they adopt a form of query expansion to help identify the occurrence of synonymous terms in the story and centroid representations. For Chinese documents their algorithm uses a Chinese thesaurus, while for English documents it uses the WordNet thesaurus 102 where a term is expanded using synonyms, and the weight of an expanded term is half that of the original term. Chen and Ku use a standard tf.idf weighting scheme to weight terms in each document and cluster vector. As already stated, none of these augmentations to the basic VSM model could improve upon the performance of the UMass system. However, some improvements have been shown in the link detection task, details of which can be found in (Chen, Chen, 2002) In 2001, IBM’s NED system17 outperformed attempts by CMU, UMass and University of Iowa (UIowa). IBM’s NED system like many others uses an unsupervised single-pass clustering algorithm with a document/centroid comparison strategy and the Okapi weighting function. After a document has been assigned part-of-speech tags, morphological analysis is performed and all noun bigrams and the remainder of the terms are used to represent the story. IBM’s method differs, however, in the way in which a document/cluster similarity is calculated. They use a combined approach where the similarity between a document and a cluster is a weighted linear combination of the ‘traditional’ similarity of their vectors and a similarity measure based on the novelty of the terms in the centroid. This term novelty weighting scheme aims to capture the temporal nature of news by decreasing the weight of terms that occurred much earlier in the news stream. At the same TDT workshop, the UIowa system performed worst of all four submitted NED system results. This system was based on a named-entity recognition approach where noun phrases are identified and sorted in the following entity categories: persons, organisations, locations and events (all other parts-ofspeech are ignored). Each document/cluster similarity is calculated as the weighted sum of similarities of the vectors for each of these entity types. 
Eichmann and Srinivasan (2002) comment only on the effect of this approach on tracking performance, but their observation that sparse or empty entity vectors had a detrimental effect on similarity calculations also holds true for their NED system. 17 Since no formal publication of IBM’s NED results exists, we were only able to gather details on their technique from a presentation given at the TDT 2001 workshop which can be found at http://www.nist.gov/speech/tests/tdt/tdt2001/paperpres.htm (as of March, 2004). NED system descriptions are also sketchy for the UIowa and NUT systems. However, personal correspondences with the researchers at NUT has confirmed that their system is based on the techniques described in (Chen, Ku, 2002). 103 4.4 Discussion The aim of this chapter was to provide some background on the New Event Detection task as a general Information Retrieval problem, and as a TDT task. This chapter also provided background on the TDT initiative, and a detailed review of the previous solutions to New Event Detection proposed by various research groups participating in this project. An important conclusion drawn by the TDT community, and from this chapter, is that although much progress has been made in the area of automatic organisation of news streams into a manageable information source for newsreaders, there is much room for improvement. In particular, New Event Detection has gained a reputation as the most challenging of the information filtering tasks defined by the TDT researchers (Allan et al., 2000b). Consequently, we have focused our energies on improving the performance of this filtering task over a typical approach that uses a general baseline vector space model. The next chapter details the results of the evaluation of our lexical chainbased approach to NED on the TDT pilot study and TDT2 evaluation corpora. In particular, we describe a novel document representation strategy which attempts to improve upon the traditional view of a document as a mere set of term frequencies, and instead aims to capture the essence of an incoming story with respect to its lexical cohesive structure. 104 Chapter 5 Lexical Chain-based New Event Detection In the previous chapter, we looked in detail at the Topic Detection and Tracking initiative. In particular, we focussed on the New Event Detection (NED) task and how various participants had addressed this problem using traditional IR methods. As previously stated, NED is a classification task that identifies all breaking news stories discussed in a news stream. In Chapter 2, we reviewed work by Morris and Hirst (1991) which concluded that lexical chains can be used to identify prominent sub-topics and themes in texts that correspond well with the discourse units described in Grosz and Sidner’s Theory of Discourse Structure (1986). Based on this observation, we investigate whether lexical chains can be used as a means of differentiating between informative and unimportant terms in a text. In this chapter, we attempt to improve NED performance by using chain words in conjunction with a standard keyword indexing strategy to represent document content. The first set of experiments, described in Section 5.5, represent a preliminary investigation into the suitability of our hybrid model of textual content for identifying new events in the TDT1-pilot study corpus; while the experiments described in Section 5.6 attempt to replicate these results on the TDT2 corpus by integrating our linguistic indexing strategy with the UMass NED system. 
Before we report on the results of these TDT experiments, we first look at the relationship between disambiguation accuracy and previous attempts at integrating lexical chains into an IR model, and how these systems influenced the design of our NED system, LexDetect. In Section 5.2, we also examine how lexical cohesive relationships in text can be used to identify pertinent themes in news stories, and how this differs from a traditional word frequency-based approach. We conclude this chapter with a review of other lexical chain-based NED research also conducted at University College Dublin. 105 5.1 Sense Disambiguation and IR In Chapter 2, we explored various methods of creating lexical chains using different lexical knowledge sources. The resultant chains were then used to address a variety of NLP and IR problems. In particular research efforts by Stairmand, Green and Ellman looked at using information derived from the generation of the chains to improve hypertext generation (Green, 1997a; 1997b) and query-based retrieval (Stairmand, 1996; Ellman, 2000). In the case of Ellman, he attempted to model document content in terms of Roget’s categories, while Stairmand and Green represented documents in terms of concepts indexed using WordNet synsets. Assigning WordNet synsets or Roget’s categories to words in a text requires a method of sense disambiguation which, as we saw in Chapter 3, is one of the side effects of lexical chain generation, since noun phrase clustering based on semantic similarity requires a decision on which context a word is being used in. Using disambiguated words to index documents has been the focus of much interest in the IR community as it was hoped that a deeper understanding of document content might improve retrieval performance. There are two linguistic phenomena that motivate this intuition. They are defined as follows with respect to their effect on retrieval performance in terms of recall and precision: Synonymy is the phenomenon that occurs when two distinct syntactic phrases are used which share the same meaning (e.g. ‘domestic animal’ and ‘pet’). Since the core operation of any traditional IR model is the measurement of similarity in terms of syntactic word matching, synonymy will cause documents and queries to appear less similar than they actually are, thus reducing the recall of the IR system18. Polysemy is the phenomenon that occurs when a word has more than one word sense depending on the context in which it is used (e.g. the ‘financial’ and ‘river’ senses of ‘bank’ is a typical example). Polysemy has the effect of causing documents and queries to appear more similar than they actually are, thus reducing the precision of the IR system. However, despite the fact that addressing synonymy and polysemy to improve IR performance makes sense, disambiguation-based IR experiments have by and large 18 Furnas et al. (1987) refer to this phenomenon as the vocabulary problem, which states that ‘people tend to use a surprisingly great variety of words to describe the same thing’. 106 been unsuccessful. This is also true for Stairmand’s, Ellman’s and Green’s attempts at replacing keyword indexing with WordNet synsets and Roget’s categories. For the remainder of this section, we will briefly touch on some of the reasons stated in the literature for these disappointing results. In Section 5.2, we discuss how these results prompted us to use lexical chaining as a feature selection method rather than as a disambiguating and indexing strategy. 
5.1.1 Two IR applications of Word Sense Disambiguation Throughout the nineties, interest in sense disambiguation for IR escalated due to the release of online dictionaries and thesauri like Roget’s Thesaurus, Longmans Dictionary of Contemporary English (LDOCE) and WordNet (see Section 2.4 for details), which made it possible to automate the sense disambiguation process using the sense definitions defined in these lexical resources. According to Sanderson (2000), the first large scale IR experiments investigating the usefulness of disambiguation were carried out by Voorhees (1993; 1994; 1998) and Sussna (1993) using WordNet, and Wallis (1993) using the LDOCE. Since we are primarily interested in WordNet-based indexing approaches, we will focus on the conclusions drawn by Voorhees which concur with the results of subsequent experiments by Sussna (1993), and Richardson and Smeaton (1995). Voorhees (1998) looked at two applications of word sense disambiguation for query-based retrieval: conceptual indexing and query expansion. Since a WordNet synset represents a single concept or set of synonymous words, building a vector of synsets as a representation of a document in the vector space model (VSM) is referred to as conceptual indexing. Like most WordNet-based sense resolution approaches, Voorhees’s disambiguator assigns a synset number to a word if that sense is the most active node in the WordNet network in the context of a specific document, i.e. the word sense that is most related to other words in the document. Once the conceptual index for the document collection has been built for each query, Voorhees’s system disambiguates each incoming query resulting in a synset query vector. The traditional VSM is then used to retrieve and rank documents relevant to this query. Voorhees ran experiments on five popular IR collections19: CASM, CISI, CRAN, MED, TIME. However, in each case she found that the 19 Descriptions and statistics on these test collections can be found in Chapter 3 of the IR textbook by Baeza-Yates et al. (1999). 107 effectiveness of the sense-based vectors was worse than the traditional stem-based vectors. She suggests two main causes for this degradation in retrieval performance: disambiguation errors and the inability of the disambiguator to resolve word senses in short queries due to a lack of context. In a second set of experiments, Voorhees uses WordNet as a source of words for expanding queries, in order to widen the breath of the search for relevant documents in a TREC collection. More specifically, she looked at the effect of WordNet-based query expansion on retrieval performance when query terms were manually disambiguated and then automatically expanded with related terms from the WordNet taxonomy during the retrieval process. For example, Voorhees adds the following words to the query containing the word furniture: table, dining, board, refectory. These words are all specialisations of the word furniture. Unfortunately, this query expansion technique did not outperform the traditional VSM approach to query-based retrieval. However, when words were manually disambiguated and manually expanded (not all related WordNet terms where chosen) then a significant improvement in retrieval performance was observed. Only certain lexicographically related words are useful in the expansion process, because a hypernym path between words in the WordNet taxonomy does not always indicate a useful query expansion term. 
The reason for this relates to the fact that not every edge in the WordNet taxonomy is of equal length and not all branches in the taxonomy are equally dense (see Section 2.4.4 for further discussion). Hence, Voorhees’s query expansion experiments indicate that semantic distance in WordNet cannot be used to approximate semantic relatedness with sufficient accuracy for use in this application. 5.1.2 Further Analysis of Disambiguation for IR A major weakness of Voorhees’s experiments was that no evaluation of disambiguation performance was conducted. Hence, it is impossible to know to what extent disambiguation error reduces IR performance. However, measuring disambiguation accuracy is a very time consuming process as a gold standard set of documents must be manually assigned senses before such an evaluation can take place. This prompted Sanderson (1994; 1997; 2000) to investigate the impact of disambiguation errors on IR effectiveness using a technique that artificially adds ambiguity to a test collection. This technique, which was first proposed by 108 Yarowsky (1993), is based on the addition of pseudo-words to a test collection. A pseudo-word is an artificially created ambiguous word, generated by randomly selecting a sequence of n words and concatenating them together. An example of a size 2 pseudo-word would be ‘cat/spade’ were every instance of ‘cat’ and ‘spade’ in the corpus would be replaced by this pseudo-word. Sanderson showed, using this technique, that adding ambiguity to queries and collections has little effect on IR performance compared to the effect of adding disambiguation errors to the collection (e.g. replacing ‘cat/spade’ with ‘cat’ in a particular document where this instance of the pseudo-word should have been disambiguated as ‘spade’). Consequently, Sanderson concluded that only low levels of disambiguation error (less than 10%) would result in improvements over a basic word stem-based IR model. This result is in agreement with earlier work by Krovetz and Croft (1992) who undertook a manual investigation of thousands of query/document word sense matches after retrieval, and concluded that sense ambiguity (caused by polysemy) did not downgrade retrieval performance as much as was originally expected. They pinpointed two reasons for this, as highlighted by Sanderson (2000): The query word collocation effect where query words implicitly disambiguate each other by the fact that in a ranked list the highest ranking documents will have occurrences of all or most of the query words. Therefore, one can presume that these documents are using these words in the context intended by the query. Hence, the effect of polysemy on retrieval performance is less than expected. 75% of words in a corpus are either unambiguous or have skewed sense distributions and so are used in the majority sense in most queries. A term is said to exhibit a skewed sense distribution if one of its senses is used more frequently in a particular domain. For example, it would be unusual to find the ‘friendship’ or ‘chemical’ sense of ‘bonds’ used in the Financial Times newspaper. Again this contributes to the fact that the effect of polysemy on retrieval performance is less than expected. Furthermore, in his thesis Buitelaar (1998) states that only 5% of word stems in WordNet are truly unrelated. 
This means that stemming words and conflating particular instances to a common stem, as is carried out in most traditional IR models, is not as harmful as researchers might have expected since 95% of words originate from a core related sense. Hence, stemming ‘computation’ and ‘computer’ 109 to ‘comput’ is actually good for retrieval in contrast to say using WordNet-based conceptual indexing where ‘computer’ and ‘computation’ will be assigned two distinct synset numbers, hence contributing to the dissimilarity between a document and a query. All of these points in some way account for why resolving polysemy has not proved to be an effective means of improving IR performance. However, there is evidence to suggest that improvements are possible by resolving synonymy. Gonzalo et al. (1998; 1999) report on the results of a series of experiments on the SemCor collection. SemCor (Miller et al., 1993) is a publicly available subset of the Brown Corpus that has been hand-tagged with WordNet synsets. Gonzalo et al. adapted SemCor by splitting up large documents into coherent self-contained fragments and writing synset tagged summaries of each fragment. These summaries were then used as queries in an IR-style experiment. More specifically, for each query the system was only required to retrieve one known item, in this case the document from which the query/summary was generated. Gonzalo et al. found that synset indexing ranked the correct document in first place 62% of the time compared with 53.2% for word sense indexing, and 48% for basic word indexing. The first two results show that resolving synonym relationships between words is responsible for a much greater improvement over the traditional keyword indexing performance than resolving and differentiating between polysemous words in the collection, i.e. word sense indexing versus basic word indexing. Gonzalo et al. also suggest that Sanderson’s original estimate of 90% disambiguation accuracy as the minimum cut-off point for observing any improvement in retrieval performance is too high, and that this cut-off point is nearer 60%. They believe that the reason for the difference between these two accuracy estimates is due to the fact that pseudo-words do not always behave like real ambiguous words in text, and that their experiment is much closer to how real ambiguity works. The results of Stokoe et al.’s (2003) (Web) TREC retrieval experiments provide some evidence to support Gonzalo et al.’s claim, where their high-recall disambiguator performed with an accuracy of only 62.1%, but still managed to outperform the traditional VSM by 1.76% with regard to average recall. However, this experiment differed from previous WordNet-based concept indexing experiments where an initial ranked list of documents is retrieved for a given query 110 using the VSM, and then this ranking is refined with respect to the similarity of the disambiguated query to each disambiguated document in the ranked list. Stokoe et al. also state that this improvement may be limited to certain types of retrieval, and may only be useful to ad hoc retrieval systems that deal with very short queries (one to two words) where the query collocation effect tends not to apply. However, regardless of where this cut-off point lies current state-of-the-art WordNet-based disambiguation has only reached a maximum of 69% accuracy20, which is on the lower end of the scale of acceptable disambiguation accuracy for IR applications. 
Consequently, it will be some time before traditional approaches will be significantly outperformed by conceptual indexing approaches to IR. 5.2 Lexical Chaining as a Feature Selection Method In the previous section we looked at reasons why integrating sense disambiguation strategies into IR system indexing has not been as successful as researchers had anticipated. This discussion also gives us an insight into why in the past lexical chain-based IR tasks have not been as successful as expected. For example, Green (1997b) found that users experienced no significant advantage when answering questions using lexical chain-based hypertext links over links generated by a simple vector space model of document similarity (see Section 2.5.4). Both Stairmand (1996) and Ellman (2000) used lexical chains as a means of improving query-based retrieval. In both cases, even though Ellman used Roget’s categories rather than WordNet synsets, they reported mixed results where improvements were observed in certain cases but not in others (see Sections 2.5.5 and 2.5.3). Kazman et al. (1995, 1996) also looked at chain-based dialogue indexing but no formal IR-based evaluation was performed (see Section 2.5.4). These researchers, apart from Ellman, were also elusive on the effect of the disambiguation accuracy of their algorithms on the performance of their particular lexical chaining application. For example: Green (1997b) partially evaluated his intra-document hypertext linking strategy by clustering documents from six topics (from the 50 available) in a TREC collection. His clustering technique was based on a conceptual indexing strategy 20 This result was taken from the Senseval-2 evaluation – a WordNet-based disambiguation workshop. For more information see http://www.sle.sharp.co.uk/senseval2/ (February, 2004). 111 using WordNet synsets derived from the lexical chains generated for each document as features. However, Green’s evaluation only comments on the fact that ‘the similarity function for synset weight vectors works as expected, that is, higher thresholds result in less connections, i.e. additions to clusters’. More specifically, no comparison with a clustering strategy that uses traditional keyword weighted vectors was performed. Stairmand’s (1996) evaluation on the other hand is more comprehensive, since he compares his conceptual indexing technique using chain synsets with a traditional VSM of query-based retrieval. However, his experiments would still be considered small on an IR scale, as he only evaluates system performance on 12 carefully chosen queries. Stairmand only selected queries that contained nouns in the WordNet index so as to ensure that degradations in system performance could be attributed to the indexing strategy rather than limitations in WordNet’s coverage. Stairmand compared his system with the SMART retrieval system and found that his system exhibited a higher rate of precision. However, the systems low recall levels and limited ability to deal with all types of queries made it unsuitable as a real replacement for a traditional VSM. Stairmand suggests that ‘a hybrid approach is required to scale up to real-world IR scenarios’. Like Green, Stairmand did not directly evaluate the disambiguation accuracy of his algorithm. In Chapter 3, details were given of the disambiguation accuracy of three lexical chaining algorithms on the SemCor corpus. 
It was found that both greedy and non-greedy lexical chaining approaches can only hope to attain recall and precision values, representing disambiguation accuracy, that lie between 55% and 60%. This means that these algorithms are capable of disambiguating just over half of the nouns in the SemCor collection correctly. Both Sanderson’s and Gonzalo’s analyses of IR system tolerance to disambiguation errors suggest that between 55% and 60% accuracy is on the low side of this tolerance level. Also, it must be remembered that their estimates were based on full-text disambiguation, in contrast to lexical chain-based disambiguation which only looks at one part-of-speech, i.e. nouns. This will undoubtedly have caused a further degradation in performance, as valuable content information would have been missing from the document representations, in particular parts-of-speech such as verbs and adjectives. As a result of these inadequacies in previous chaining attempts, we proposed a more suitable method of incorporating lexical chaining into an IR model based on the following hypotheses:
1. Lexical chaining can be viewed as a method of feature selection, where our feature selection hypothesis states that nouns in the text that form clusters of cohesive words are considered to be pertinent in describing the overall topic of the text.
2. A document representation strategy consisting of chain terms is more appropriate than one based on chain synsets. This hypothesis is put forward based on the fact that our lexical chaining algorithm achieves a relatively low level of disambiguation accuracy (see Chapter 3) with respect to Gonzalo’s suggested IR performance breakeven point of 55%-60%.
3. A data fusion document representation strategy that combines a lexical chain term representation with a free-text representation of text will perform better than a document representation based solely on lexical chain words.
As mentioned previously, the focus of this chapter is the use of lexical chains as a means of improving New Event Detection (NED) in the TDT domain. So in this case we will be testing these hypotheses on a text classification task rather than an ad hoc retrieval task. However, the techniques discussed in Section 5.3 should also be applicable to any VSM-based system. Our evaluation, and the other work on TDT at University College Dublin (Hatch, 2000; Carthy, 2002), represents the first large-scale evaluation of lexical chaining as an indexing strategy in a text classification or IR task. According to Yang and Pedersen (1997), automatic feature selection methods ‘include the removal of non-informative terms according to corpus statistics and the construction of new features which combine lower level features (i.e. terms) into higher level orthogonal dimensions’21. In contrast to this definition, our feature selection method is based solely on a linguistic analysis of a text rather than a statistical one. In addition, most of the interest in feature selection research has grown out of a need for smaller feature spaces when using computationally intensive machine learning-based text classification techniques like neural networks and Bayes’ belief networks.
21 For more information on statistical feature selection we refer the reader to Yang and Pedersen (1997), who compare and contrast a number of statistical-based techniques like mutual information, document frequency, information gain, and the χ2-test. Latent Semantic Indexing (Deerwester et al., 1990), Support Vector Machines (Joachims, 2002) and, more recently, statistical word clustering are also popular means of extracting features from text. For more details on statistical word clustering see Baker and McCallum (1998), Slonim (2002) and Dhillon et al. (2003).
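As a concrete reading of hypotheses 1 and 3, the sketch below shows one way a chain-word sub-vector and a free-text sub-vector could be built for a single story once its lexical chains are available. The chaining algorithm itself (Chapter 3) is treated as a black box here, and the particular selection rule (keep words from chains with at least two members, capped at 30 terms and weighted by within-document frequency) is our own illustrative assumption rather than the exact LexNews procedure.

```python
from collections import Counter

def composite_representation(tokens, chains, min_chain_size=2, max_chain_terms=30):
    """Build the two sub-vectors of a composite document representation.

    tokens -- the (stopped/stemmed) tokens of the story: the free-text view.
    chains -- the story's lexical chains, each a list of related nouns, e.g.
              [["government", "regime", "administration"], ["vote", "ballot"]].
    """
    free_text_vector = Counter(tokens)

    # Hypothesis 1: nouns that participate in cohesive clusters are pertinent,
    # so only words drawn from sufficiently large chains survive as features.
    chain_vocab = {w for chain in chains if len(chain) >= min_chain_size for w in chain}
    chain_vector = Counter(w for w in tokens if w in chain_vocab)
    chain_vector = Counter(dict(chain_vector.most_common(max_chain_terms)))

    # Hypothesis 3: the chain-word view supplements, rather than replaces,
    # the free-text view, so both sub-vectors are returned.
    return free_text_vector, chain_vector
```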
In the context of this thesis, we are using lexical chains as a means of augmenting a basic VSM with additional information regarding the main themes of a news story. Hence, a lexical chain representation is not meant to replace a free-text representation but to improve it, so we combine two distinct document representations using an extended vector space model proposed by Fox (1983). In this model an extended vector is actually a collection of sub-vectors, where the overall similarity between two extended vectors is the weighted sum of the similarities of their corresponding sub-vectors. A more detailed explanation of this composite document representation is given in the following section.
5.3 LexDetect: Lexical Chain-based Event Detection
In the following sub-sections, we will describe how we have integrated our approach to lexical chain-based New Event Detection (NED) with a traditional vector space model approach. Figure 5.1 gives an overview of the system architecture of the LexDetect system.
[Figure 5.1: System architecture of the LexDetect system — a news stream (newswire, radio and television) is passed to the LexNews Tokeniser and Lexical Chainer, which produce a free-text document representation and a lexical chain word document representation; these feed the New Event Detector, whose output is the set of breaking news stories.]
We evaluate this hybrid system with respect to a traditional keyword-based NED system using the TDT1 pilot study evaluation in Section 5.5, and with respect to the UMass NED system on the TDT2 collection in Section 5.6.
5.3.1 The ‘Simplistic’ Tokeniser
In Section 3.2.2, we described a complex lexical chaining candidate selection step (the tokeniser) that used a part-of-speech tagger and a parser to find useful proper noun phrases and noun compounds, and a set of morphological rules that changed adjectives to nouns. Our initial TDT1 pilot study experiments, undertaken at an earlier stage in our research, used a much simpler tokenisation process that did not avail of the part-of-speech tagging and proper noun/noun phrase identification steps. Instead, like many other lexical chaining approaches (St-Onge, 1995; Stairmand, 1997), a term was considered a candidate term for chaining if it was listed in the WordNet noun database. However, we found this selection process to be unsatisfactory in many cases, as it led to additional ambiguity in the chaining process which was responsible for spurious lexical chains, e.g. the verb ‘to drive’ was incorrectly identified as the noun ‘a drive’, which has 12 defined senses in WordNet. The ‘simplistic’ version of the Tokeniser used in the TDT1 experiments described in this chapter changed all terms in a text that occurred in the WordNet noun database to their singular form (if necessary). However, adjectives pertaining to nouns, noun compounds longer than two words, and proper noun phrases were not identified, and so did not take part in the chaining process.
5.3.2 The Composite Document Representation Strategy
As explained in Section 5.1, we use lexical chains as a means of filtering noisy terms from a document representation, where only those terms that are cohesively linked with many other terms in the text are retained, since we hypothesise that they capture the essence of the news story.
However, as already stated we consider this chain word representation as partial evidence in a composite document representation that also includes a free text representation. In practical terms this means that determining the similarity between two documents involves calculating the cosine similarity between their respective chain word vectors and free text vectors (where both sub-vectors weight their tokens with respect to their frequency within the document), and then combining these two scores into a single measure of 115 similarity. This process of combining evidence is also referred to in the literature as data fusion. In Croft’s (2000) review of data fusion techniques used in IR, he states that combining different text representations or search strategies has become a standard technique for improving IR effectiveness. In the case of combining search systems the class of ‘meta-search’ engines such as MetaCrawler have been very successful. Similar improvements have been seen when multiple representations of document content are used within a single IR search strategy (McGill et al. 1979; Katzer, 1982; Fox, 1983; Fox et al., 1988). As Croft explains, there are many different classes of representation that have been used in these experiments, such as single words from the text of a document (used in the vector space model), representations based on controlled index terms (a list of key words composed by an indexer to describe a set of documents), citations (references to other texts within a document), passages (where documents are seen as a set of self contained parts rather than a monolithic block of text), phrases and proper nouns (documents are described in terms of people, companies, locations and phrases such as ‘budget deficit’), and multimedia (where documents are seen as complex multimedia objects represented by references to other media such as sound bites and video images). In general, researchers have found that when combining these different text representations the best results are obtained when a free text representation (i.e. traditional representation containing all parts of speech) is used as stronger evidence than any other class of representation. In practice this means that higher weights are given to free text representations, with alternative representations seen as additional rather than conclusive evidence of similarity. 5.3.3 The New Event Detector As stated in Chapter 4, New Event Detection or First Story Detection is in essence a classification problem where documents arriving in chronological order on the input stream are tagged with a ‘YES’ flag if they discuss a previously unseen news event, or a ‘NO’ flag when they discuss an old news topic. However, unlike detection in a retrospective environment a story must be identified as novel before subsequent stories can be considered. A single-pass clustering algorithm bases its clustering methodology on the same assumption and has been used successfully by UMass, CMU and DRAGON systems to solve the problem of NED. 116 In general, this type of clustering algorithm takes as input a set of S objects, and outputs a partition of S into non-overlapping subsets S1, S2, S3,…Sn where n is a positive integer. In our implementation of a single-pass algorithm no limit is imposed on n (the number of clusters). Instead, this number is indirectly controlled by a thresholding methodology which determines the minimum similarity between a document and a cluster that will result in the addition of that document to the cluster. 
Determining the similarity between an incoming document and a cluster and controlling which clusters are compared to that document is managed by a cluster comparison strategy and a thresholding strategy. The following explanation encapsulates how these strategies are integrated into the single-pass clustering algorithm: 1. Convert the current document on the input stream into a lexical chain word vector and a ‘free text’ vector. 2. The first document on the input stream will become the first cluster and its chain word vector and ‘free text’ vector will form two distinct cluster centroids. 3. All subsequent incoming documents are compared with all previously created clusters up to the current point in time. A comparison strategy is used here to determine the extent of the similarity between an incoming document and the existing cluster centroid vectors. 4. When the most similar cluster to the current document is found, the thresholding strategy is used to discover if this similarity measure is high enough to warrant the addition of that document to the cluster, i.e. the event has been previously detected so the current document is classified as ‘an old event’. If this document does not satisfy the minimum similarity condition for the cluster determined by the thresholding methodology, then that document is classified as discussing a new, previously unseen, event, i.e. a first story. This document will then form the seed of a new cluster representing this new event. 5. The clustering process will continue until all documents in the input stream have been classified. 117 Cluster Comparison Strategy There are two facets to the cluster comparison strategy: a similarity measure and a Time_Window. In some clustering implementations the addition of a document to a cluster involves maintaining the documents original representation for subsequent comparisons. So for example, in the single link cluster comparison strategy the similarity between the incoming document and the cluster is taken as the maximum similarity score between the incoming document and the document representations in the cluster. However in the LexDetect implementation, to improve the overall efficiency of the algorithm a cluster representative or centroid is used which is an average representation of all the documents in the cluster. This process involves merging (updating) the centroid representation every time a new member is added to the cluster. However, before this merging can occur the most similar cluster to the current document must be found. In accordance with the VSM, each document/cluster representation is characterised in the detection process as a vector of length t, which can be expressed as a unique point in t-dimensional space. The importance of this is that document/cluster vectors that lie close together in this tspace will contain many of the same terms. This closeness or similarity is calculated using the cosine similarity measure (Section 4.1.1, Equations 4.5 and 4.6). So far we have described a cluster comparison strategy based on a traditional VSM approach using keyword-based document classifiers. The data fusion element of our research, as described in Section 5.3.2, involves the use of two distinct representations of document content to identify first stories in a single cluster run. 
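The sketch below pulls the pieces described so far together: a single-pass loop over the input stream, centroid clusters restricted to a Time_Window of the n most recently updated clusters, and a threshold test that decides between ‘old event’ and ‘first story’. For simplicity it uses a single term-frequency vector per document and a fixed threshold; the two-sub-vector similarity and the CST thresholding strategy actually used by LexDetect are given in Equations 5.1–5.3 below. The function names and data structures are illustrative assumptions, not the thesis implementation.

```python
import math
from collections import Counter

def cosine(v1, v2):
    num = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    den = (math.sqrt(sum(w * w for w in v1.values()))
           * math.sqrt(sum(w * w for w in v2.values())))
    return num / den if den else 0.0

def single_pass_ned(doc_stream, threshold=0.2, time_window=30):
    """Single-pass NED sketch: doc_stream yields token lists in broadcast order.
    Returns one YES/NO flag per document (YES = first story)."""
    clusters = []   # each cluster: {"centroid": Counter, "size": int, "last_seen": int}
    decisions = []
    for position, tokens in enumerate(doc_stream):
        vec = Counter(tokens)
        # Step 3: compare only the time_window most recently updated clusters
        active = sorted(clusters, key=lambda c: c["last_seen"], reverse=True)[:time_window]
        best = max(active, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            # Step 4: similar enough to an existing cluster => old event.
            # A running sum of term counts stands in here for the averaged centroid.
            best["centroid"].update(vec)
            best["size"] += 1
            best["last_seen"] = position
            decisions.append("NO")
        else:
            # Otherwise the document seeds a new cluster => first story.
            clusters.append({"centroid": Counter(vec), "size": 1, "last_seen": position})
            decisions.append("YES")
    return decisions

stream = ["quake hits kobe overnight".split(),
          "kobe quake rescue continues".split(),
          "senate debates budget bill".split()]
print(single_pass_ned(stream))   # ['YES', 'NO', 'YES']
```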
In our alternative IR model, we use sub-vectors to describe our two document representations, where the overall similarity between a document/cluster pair is computed as the linear combination of the similarities for each sub-vector. So, for free-text vectors Cword and Dword and chain word vectors Cchain word and Dchain word, the similarity function used by our LexDetect system when comparing document D to cluster C is:

Sim(C, D) = (Kword × Sim(Cword, Dword)) + (Kchain word × Sim(Cchain word, Dchain word))     (5.1)

where Kword and Kchain word are coefficients that influence the weight of evidence each document representation contributes to the similarity measure. As in the case of the traditional NED system, vector similarity is determined using the cosine similarity function. The final component of the comparison strategy is a Time_Window, which aims to exploit the temporal nature of broadcast news. In general, when a significant news story breaks, many stories discussing the same event will occur over a certain time span. This means that stories closer together on the input stream are more likely to discuss related topics than stories further apart on the stream. Hence, we impose a Time_Window within which documents can be clustered, by only allowing the n most recently updated clusters to be compared with the current document. Thresholding Strategy Working in tandem with the cluster comparison strategy is a thresholding methodology, which influences the decision for generating a new cluster. When the system has established the most similar cluster to a particular document, that document may only become a member of that cluster if it exceeds the cluster similarity threshold (CST). The CST is calculated by finding the similarity between the updated cluster centroid (after the document representation has been merged with it) and the newest document member of the cluster. On its own this similarity would set the threshold too high, so it is reduced by multiplying in a reduction coefficient R:

CST(Cupdate, D) = Sim(Cupdate, D) × R     (5.2)

This reduction coefficient plays an important role in the resulting cluster formation, controlling the size of these clusters and consequently the classification of new and old events. R is one of three system parameters (the other two being the Dimensionality, i.e. the length of the document classifiers, and the Time_Window parameter) that have varying effects on the detection process. In particular, increasing R will decrease system misses as it makes it easier for documents to be classified as new events, increasing the Dimensionality will reduce precision but increase recall, and decreasing the Time_Window parameter will increase the efficiency of the system. If a sensible Time_Window value is used then this should help decrease the number of system false alarms by eliminating irrelevant (old news) clusters from the cluster/document comparison step of the detection algorithm. The LexDetect implementation of the thresholding strategy deviates slightly from the traditional NED system described above. In particular, when a document representation is merged with a cluster representation, two separate merging processes are required: vector Dword is merged with Cword and Dchain word is merged with Cchain word, which results in the updated cluster vectors (Cword)update and (Cchain word)update.
Equation 5.3 is then used to calculate the overall updated cluster similarity threshold, applying the reduction factor R (as in Equation 5.2) to the weighted combination of the two sub-vector similarities:

CST(Cupdate, D) = ((Kword × Sim((Cword)update, Dword)) + (Kchain word × Sim((Cchain word)update, Dchain word))) × R     (5.3)

The cluster comparison and thresholding strategies are the only differences between the implementations of the traditional NED system and our lexical chain-based NED system, LexDetect.
5.4 The TDT Evaluation Methodology
The TDT community is currently about to embark on its eighth evaluation. Since the beginning of the TDT initiative in 1997, four distinct corpora have been created (the latest, TDT-4, was released for the TDT 2003 evaluation). Initially, TDT research focussed on mono-lingual English language newswire and broadcast news stories; however, since the advent of the TDT2 corpus multi-lingual data has been made available, and most of the TDT participants have built systems that can filter both types of news data. In this thesis we focus only on English news sources because, unlike most other TDT approaches, our technique relies on an additional knowledge source (i.e. WordNet) for text understanding, in contrast to the mainly statistical approaches of other TDT participants. In what follows, we describe the TDT1 pilot study corpus, the TDT2 English corpus, and the evaluation methodologies used in the pilot study and subsequent TDT workshops.
5.4.1 TDT Corpora
New Event Detection is concerned with detecting the occurrence of a new event such as a plane crash, a murder, a jury trial result or a political scandal in a stream of news stories from multiple sources. To assist in the research of this task the TDT initiative has developed a number of event-based corpora, two of which have been used during the course of our work. The TDT1 pilot study corpus comprises 15,863 news stories spanning from the 1st of July 1994 through to the 30th of June 1995. These stories were randomly chosen from Reuters news articles and CNN news transcripts from this period, and were assigned an ordering that represents the order in which they were published or broadcast. This corpus is accompanied by a file of relevance judgements created for a set of 25 events, some of which include ‘the Kobe earthquake’, ‘DNA evidence in the OJ Simpson trial’ and ‘the arrest of Carlos the Jackal’. These recognised events are only a subset of the total number of distinct events in the corpus, and were chosen for their ‘interestingness’, their uniqueness and the fact that there were an acceptable number of stories on each of these events in the corpus. In total, 1132 stories were judged relevant, 250 stories were judged to contain brief mentions, and 10 stories overlapped between the set of relevant stories and the set of brief mentions. However, these ‘brief mentions’ and overlaps are removed from the evaluation process, so classification is measured on relevant and non-relevant stories only. The TDT2 corpus consists of 64,000 stories spanning the first six months of 1998 taken from six different news sources:
TV Broadcast News Transcripts: Cable News Network (CNN) Headline News, American Broadcasting Company (ABC) World News Tonight,
Radio Broadcast News Transcripts: Public Radio International (PRI) The World, Voice of America (VOA) English news programs,
Newswire: Associated Press Worldstream (APW) News Service, New York Times (NYT) News Services.
For the TDT 1998 evaluation the corpus was split into training, development and evaluation test sets. Both training and development corpora are always provided for any initial dry-run experiments conducted by the participants. The evaluation test set on the other hand is used for the final ‘blind’ evaluation and is only sent to the participants a few weeks in advance of the TDT workshop. Since its release in 1998 the TDT2 corpus has existed in three distinct versions: Version 1 was used in the TDT2 evaluation and was annotated against 100 target topics (only 96 first stories could be identified for these 100 topics). 121 Version 2 was augmented with three Mandarin news sources which were annotated against 20 target topics from the original 100 topics identified in version 1. This version was released in June 1999, and used as development and training data in the TDT 1999 evaluation, while the TDT3 corpus provided the evaluation test data. Version 3.2 is the current version of the corpus on offer from the LDC which was released on the 6th of December 1999. It consists of the same Mandarin and English news sources as version 2 (a number of bugs were also fixed). This version was annotated against an additional 97 topics that in total provides 193 first stories on which NED performance is based. These new topics are, however, only partially annotated (not all stories belonging to the event have been added to the relevance files), and were initially created to facilitate NED research at the John Hopkins University Novelty Detection Workshop (Allan et al., 1999) (see Section 4.3.4). This version was used as development and training data in the TDT 2000 evaluation runs with the TDT3 corpus being used as the evaluation test set. All broadcast news in the TDT2 collection is available in audio or transcripted format. These transcripts were generated by the Dragon and BNN automatic speech recognisers, and boundaries between adjacent stories in the audio streams were determined by the LCD annotators. Some manual text transcriptions are also available which include closed caption material taken from television news streams and some Federal Document Clearing House (FDCH) formats. In the experiments that follow, we use the TDT1 pilot study corpus and version 3.2 of the English TDT2 corpus. Unfortunately, since the TDT 1998 evaluation did not include the NED task and all subsequent evaluations used TDT2 as a training and development resource we are unable to directly compare our system results in Sections 5.6 with other TDT participants. However, during a visit to the Center for Intelligent Information Retrieval at UMass in 2001 a number of experiments were conducted and evaluated with respect to the UMass NED system. The UMass system was the best performing system at the 1999 TDT evaluation and was marginally outperformed by the IBM NED system in the 2000 evaluation. Before we report on these results we will first look at the evaluation metrics used to determine system performance by the TDT community. 122 5.4.2 Evaluation Metrics As stated in Section 4.1.2, IR performance is generally measured in terms of three metrics recall, precision and the F1 measure. However, in the TDT evaluation methodology two system errors (misses and false alarm probabilities) are used to assess the effectiveness of the classification task. 
Misses occur when the system fails to detect the first story discussing a new event, and false alarms occur when a document discussing a previously detected event is classified as a new event. These definitions are now described more formally with respect to Table 5.1. For completeness, definitions of the traditional IR evaluation metrics are also included.

                                     # Retrieved by the system    # Not Retrieved by the system
# Relevant Stories in Corpus                     A                              C
# Non-relevant Stories in Corpus                 B                              D

Table 5.1: Values used to calculate TDT system performance. A, B, C, D are document counts.

recall = r = A / (A + C), if A + C > 0, otherwise undefined
precision = p = A / (A + B), if A + B > 0, otherwise undefined
Pmiss = 1 − recall = C / (A + C), if A + C > 0, otherwise undefined
Pfa = B / (B + D), if B + D > 0, otherwise undefined
F1 = 2pr / (p + r) = 2A / (2A + B + C), if 2A + B + C > 0, otherwise undefined

Since the TDT1 evaluation (Allan et al., 1998a) was only based on a relevance file of 25 identified first stories, an evaluation methodology was developed which expanded the number of trials and effectively increased the number of decisions that the system could be judged on. This was achieved by calculating miss and false alarm rates based on 11 system passes through the input data, where the goal of the first pass is to detect the first story to discuss one of the 25 events on the input stream, and the goal of the second pass, after all first stories have been removed, is to identify all the ‘second stories’. This process is then iterated until the 10th document on the event has been skipped. If an event has fewer than the required number of documents to participate in an iteration then it is ignored. Final performance metrics are obtained by calculating the macro average of the respective miss and false alarm rates for each of the 11 passes. As well as requiring that each NED system tag each document with a declaration that it either discusses a new event (a ‘YES’ tag) or discusses an old event (a ‘NO’ tag), the TDT evaluation also requires that the system produce a confidence score to accompany these tags, i.e. a score that indicates how sure the system was about its declaration. Often this score is based on the maximum similarity between the target document and its most similar document in the corpus. An example of the UMass confidence score, the decision score decision(qi, dj) of Equation 4.16, was given in Section 4.3.1. The confidence scores returned by the NED system are then used to generate a Detection Error Tradeoff (DET) graph which represents the trade-off between miss and false alarm rates for a system. The TDT evaluation software22 constructs a DET graph from the confidence score space by calculating Pmiss and Pfa for a large range of decision thresholds, i.e. how well the system performs if only documents with confidence scores exceeding X are tagged as ‘first stories’, where X is incremented in small steps from 0 to 1. During this ‘threshold sweep’ average Pmiss and Pfa values are computed across topics. Once this process is completed, a topic-weighted DET curve can be generated by plotting each point (Pmiss, Pfa) for each threshold. These points are plotted on a Gaussian scale rather than a linear one, as this helps to ‘expand the high performance region’ of the graph, making it easier to differentiate between similarly performing systems, where the curve closest to the origin represents the best performing system.
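The following sketch computes the metrics defined above from the four counts in Table 5.1 and shows how a threshold sweep over confidence scores yields the (Pmiss, Pfa) points of a DET curve. It is an illustration of the definitions only, not the NIST evaluation software: the per-topic averaging used in the official topic-weighted curves is omitted, and all names are ours.

```python
def tdt_metrics(a, b, c, d):
    """Metrics from Table 5.1: a = relevant retrieved, b = non-relevant retrieved,
    c = relevant not retrieved, d = non-relevant not retrieved."""
    return {
        "recall":    a / (a + c) if (a + c) > 0 else None,
        "precision": a / (a + b) if (a + b) > 0 else None,
        "Pmiss":     c / (a + c) if (a + c) > 0 else None,
        "Pfa":       b / (b + d) if (b + d) > 0 else None,
        "F1":        2 * a / (2 * a + b + c) if (2 * a + b + c) > 0 else None,
    }

def det_points(scored_stories, steps=100):
    """scored_stories: (confidence, is_first_story) pairs for every document.
    Sweep the decision threshold X from 0 to 1; at each X a document is tagged
    'first story' only if its confidence score exceeds X."""
    points = []
    for i in range(steps + 1):
        x = i / steps
        a = sum(1 for score, new in scored_stories if score > x and new)
        b = sum(1 for score, new in scored_stories if score > x and not new)
        c = sum(1 for score, new in scored_stories if score <= x and new)
        d = sum(1 for score, new in scored_stories if score <= x and not new)
        m = tdt_metrics(a, b, c, d)
        points.append((m["Pmiss"], m["Pfa"]))
    return points

# e.g. three documents: two genuine first stories (confidence 0.9 and 0.4), one old story (0.3)
print(det_points([(0.9, True), (0.4, True), (0.3, False)], steps=4))
```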
A welcome enhancement to the TDT pilot study evaluation was the formulation of a cost function for the TDT2 evaluation. The purpose of the cost function was to provide participants with a single measure that could define TDT performance in terms of miss and false alarm probabilities. Like the F1 measure, used to combine recall and precision, the TDT cost function does not perfectly characterise detection effectiveness. However, it has been shown to be useful for parameter tuning purposes. The general form of the TDT cost function is:

CDet = Cfa × Pfa × (1 − Pevent) + Cmiss × Pmiss × Pevent     (5.4)

For the TDT2 evaluation, cost was defined with constants Pevent = 0.02 and Cfa = Cmiss = 1.0, where Pevent is the a priori probability of finding a target (in this case a first story). Fiscus and Doddington (2002) point out that although this measure (Equation 5.4) is useful, it is difficult to determine what exactly constitutes a well-performing system with respect to a specific task. To address this, they suggest a normalised CDet which is calculated by dividing CDet by the ‘minimum expected cost achieved by either answering YES to all decisions or answering NO to all decisions’ (Fiscus, Doddington, 2002), i.e.

(CDet)Norm = CDet / MIN(Cmiss × Pevent, Cfa × (1 − Pevent))     (5.5)

The TDT evaluation software also calculates the value of a topic-weighted minimum normalised cost function for the system’s DET curve, which represents the point on the curve where the optimal Pmiss and Pfa were achieved.
5.5 TDT1 Pilot Study Experiments
A number of experiments were conducted on the TDT1 collection in order to explore the effect on NED performance when lexical chains are used in conjunction with a free text representation to represent a document. This involved comparing the effectiveness of a number of different chain-based approaches to NED with a traditional VSM approach to the problem. Details of these systems are described in the following section. The results described in this section were published in (Stokes et al., 2001a; 2001b; 2001c).
5.5.1 System Descriptions
Four distinct detection systems, TRAD, CHAIN, SYN and LexDetect, took part in the following set of experiments. The main difference between these systems is that TRAD, SYN and CHAIN use a single text representation of a document, while LexDetect uses two distinct representations of document content. The TRAD system, our benchmark system in these experiments, is a basic NED system that expresses document content in terms of a free text representation and computes detection on the syntactic similarity between documents and clusters within the vector space model framework described in Section 4.1. Classification of a new event occurs in a similar manner to that described in Section 5.3.3. Three TRAD schemes23 are used, TRAD_30, TRAD_50, and TRAD_80, which differ only in the length of their document representations (i.e. varying the Dimensionality parameter by selecting the n most frequently occurring terms).
22 Our experiments on the TDT2 collection used the TDT3eval_v2.1 evaluation software available at http://www.nist.gov/
23 An IR ‘system’ and an IR ‘scheme’ are used in this context to describe two different concepts. An IR system refers to the physical implementation of an IR algorithm, which can have various operational modes or various parameter settings. The same IR system may be used to execute different IR schemes by adjusting these parameters (Lee, 1997).
Fixing the Dimensionality parameter is a common strategy in IR and filtering systems based on the VSM, since the cosine measure can become distorted when calculating the similarity between vectors of uneven length. Hence, we experimented with a number of different dimensionalities, ranging from 30 to full dimensionality (all the words in the text), in order to determine the optimal value for the TRAD system on the TDT1 corpus. We found that a dimensionality of 50 produced optimal TRAD performance, which corresponded with other NED results reported by Allan et al. (1998d). Another important parameter of the TRAD system is the Time_Window parameter, which exploits the news stream characteristic that stories closer together on the input stream are more likely to discuss related topics than stories further apart on the stream. Thus the Time_Window parameter ensures that only the t most recently updated (or active) clusters are compared to the current document on the input stream. A time window of 30 clusters was chosen as a suitable value for t and is employed in the TRAD, CHAIN, SYN and LexDetect experiments described below.

The design of our second system, LexDetect, has been described in detail in Section 5.3. Unlike TRAD, the dimensionality of LexDetect (80 words) remains static throughout these experiments. Using our basic lexical chaining method, just under 72% of documents contained 30 or more chain words. We therefore normalised the length of chain word representations by imposing a dimensionality value of 30 on all LexDetect schemes. In theory, it is possible to vary the length of the free text representation in our combined representation; however, in these experiments all schemes contain free text representations of length 50, with a combined document representation length of 80. The final system parameters to be varied in these experiments are the weighting coefficients Kword and Kchain word used in Equations 5.1 and 5.2 to control the importance of the similarity evidence derived from the chain word and free text sub-vectors.

The design of our third and fourth systems, CHAIN and SYN, is similar to TRAD in that they use a single document representation during NED. However, both these systems incorporate chain words into the document representation strategy during the detection process. In the case of CHAIN a syntactic chain word representation is used, in contrast to SYN which uses WordNet synset numbers to represent a document. The use of synsets instead of terms as a document representation has been discussed in Sections 5.1 and 5.2, and is referred to in the literature as conceptual indexing. By including this representation in our experiment we can explore the effect of disambiguation performance on our NED task, the results of which are discussed in more detail in the next section. As in the case of LexDetect, the dimensionality of the chain representations used in SYN and CHAIN is also limited to 30 features.

5.5.2 New Event Detection Results

The objective of this experiment is to determine if LexDetect’s combined representation approach can exceed the NED performance of our single representation systems TRAD, SYN and CHAIN. Figure 5.2 is a Detection Error Tradeoff (DET) graph showing the impact of our combined representation on detection performance. As explained in Section 5.4.2, a DET graph illustrates the trade-off between misses and false alarms, where points closer to the origin indicate better overall performance.
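Before turning to these results, the weighted combination of evidence that underlies the LexDetect schemes can be summarised with a short sketch. This is only an illustration of the weighted-sum idea behind the coefficients Kword and Kchain word in Equations 5.1 and 5.2, not the exact formulation used by the system (which may, for instance, also normalise by the coefficient sum); the example vectors and weights below are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_similarity(doc, cluster, k_word=0.5, k_chain=1.0):
    """Weighted sum of the free text and chain word sub-vector similarities;
    k_word and k_chain play the role of Kword and Kchain word."""
    sim_text = cosine(doc["text"], cluster["text"])
    sim_chain = cosine(doc["chain"], cluster["chain"])
    return k_word * sim_text + k_chain * sim_chain

doc = {"text": {"strike": 0.4, "pay": 0.3}, "chain": {"union": 0.6, "worker": 0.5}}
cluster = {"text": {"strike": 0.5, "talks": 0.2}, "chain": {"union": 0.7}}
print(combined_similarity(doc, cluster))
```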
The points on a DET graph are plotted from the false alarm and miss rates of each of the four systems using a range of reduction coefficients R (from 0.1 to 0.9) for each of their 11 iterations. The average miss and false alarm values of these iterations are then plotted on the DET graph. As can be seen, the curve with the closest point to the origin belongs to the LexDetect system. The error bars (at 5% statistical significance) lead us to conclude that a composite document representation using chain words and free text words marginally outperforms a system containing either one of these representations, i.e. CHAIN and TRAD. This result is in agreement with two of the three hypotheses set out in Section 5.2, namely that lexical chaining works well as a feature selection method and that a data fusion experiment involving a combination of chain word and free text representations outperforms a traditional keyword-based approach. The third hypothesis, that chain words are better index terms than chain synsets, also holds as the SYN system performs significantly worse than the CHAIN system. This is also an important result as it clearly shows that there is no advantage in using WordNet synsets in a conceptual indexing strategy for an NED classification task. Since we have now established that SYN is an inferior representation, no further experiments with this system are performed.

Figure 5.2: The effect on TDT1 NED performance when a combined document representation is used. (DET curves of % Misses against % False Alarms for the LexDetect, LexDetect (No Weighting), CHAIN, TRAD and SYN systems.)

System                      Optimal R    % Misses    % False Alarms
LexDetect                     0.30         19.0          30.66
LexDetect (No Weighting)      0.30         19.0          33.67
TRAD_50                       0.40         28.0          30.86
TRAD_80                       0.40         30.0          31.19
CHAIN                         0.15         43.0          27.44
SYN                           0.15         40.0          38.1

Table 5.2: Miss and False Alarm Rates of NED systems for the optimal value of the Reduction Coefficient R on the TDT1 corpus.

The graph in Figure 5.2 also makes reference to a version of the LexDetect system called LexDetect (No Weighting). In this version of the algorithm the weight coefficients defined in Equation 5.2 are assigned equal weight, i.e. Kword = Kchain word = 1. A range of other values, incremented in steps of 0.1 over 0.1 ≤ K ≤ 0.9 for both coefficients, was also tried, where Kchain word = 1 and Kword = 0.5, used in the LexDetect schema, were found to be optimal. This is an interesting result, as similar experiments using composite document representations to improve search system performance based on ranking only experienced optimal effectiveness when they allowed free text evidence to bias the retrieval process (McGill et al., 1979; Fox, 1983; Fox et al., 1988; Katzer et al., 1982). Table 5.2 summarises the optimal miss and false alarm rates achieved by each of the systems in Figure 5.2.

5.5.3 Related New Event Detection Experiments at UCD

The initial phase of the NED research (1999-2000) carried out in this thesis was conducted in conjunction with Hatch (2000). Separate implementations of the NED system and the lexical chaining algorithm described in Section 5.3 were used to pursue two different avenues of NED research24:

o Using lexical chain words as distinct features in a VSM, i.e. the work in this thesis.
o Using lexical chains as features in a VSM (Hatch, 2000).
More specifically, Hatch’s work looked at determining document similarity at a chain rather than a chain word level, where a document is a set of chain word vectors (one vector for each chain) and the pairwise comparison of these chain vectors is used to calculate the similarity between two document representations. Figure 5.3 illustrates the pairwise comparison of chains in documents A and B, where each chain is represented as a weighted vector of its chain words, and the similarity value between chains can be measured using the cosine similarity metric. Once these pairwise chain comparisons have been calculated, only the maximum similarity value for each document chain vector (with a cluster chain vector) is retained and used to determine the overall similarity between a target document and a document cluster, defined more formally as follows in Hatch (2000):

    simmax(dcj, cluster) = max{ sim(dcj, cck) : 1 ≤ k ≤ m }   (5.8)

where simmax(dcj, cluster) is the maximum similarity between a document chain dcj and each of the m cluster chains cck, and

    sim(document, cluster) = (1/n) Σj=1..n simmax(dcj, cluster)   (5.9)

where sim(document, cluster) is the average of the maximum similarities assigned to each of the n document chains dcj in the previous step.

24 The C and Perl programming languages in a Unix environment were used to implement the work covered in this thesis, while Hatch’s implementations were developed and run on a Java/Unix platform.

Figure 5.3: Example of cross chain comparison strategy. Red arrows indicate an overlap between chains.

An interesting question arises from considering whole chains in the document/cluster similarity metric: once a document is to be added to a cluster, how should its chain representation be merged with the cluster centroid representation? Hatch identifies two possibilities and comments on how these might affect recall and precision values:

o Taking the UNION of the document and cluster chains as the centroid representation will improve recall.
o Taking the INTERSECTION of these two chain sets will improve precision.

Hatch chose the union of the two chain representations as a method of merging new documents with the centroid representations, as it ‘broadens the event definition and captures the evolution of the event’. The validity of this statement is obvious when one considers the following example, where the union of two chains clearly increases the number of terms in the centroid chain representation:

    if C1 = {a, c, d} and C2 = {a, e}, then C1 ∪ C2 = {a, c, d, e} and C1 ∩ C2 = {a}.   (5.10)

Hatch also evaluated this lexical chain-based NED prototype using the TDT1 corpus and evaluation methodology. Figure 5.4 is a DET graph comparing the performance of both SYN systems (WordNet concept indexing): one based on a comparison of chain terms between document and cluster representations (SYN (Stokes, 2001a)), the other based on a comparison of lexical chains (SYN (Hatch, 2000)). It is evident from this graph that the former technique works best for comparing chain representations in a simple VSM. Hence, as confirmed by the error bars (at 5% statistical significance), we can conclude that no gain in performance occurs when Hatch’s computationally more expensive prototype is used. Consequently, in subsequent experiments involving the TDT2 corpus and evaluation, described in Section 5.6, only the approach documented earlier in this chapter is considered.
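The chain-level comparison described by Equations 5.8 and 5.9 can be restated as a minimal sketch, assuming each chain is represented simply as a dictionary of chain-word weights. This is an illustration of the idea only, not Hatch’s actual Java implementation; the example chains are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two chains, each a dict of chain-word weights."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def chain_level_similarity(doc_chains, cluster_chains):
    """For each document chain keep only its best match among the cluster
    chains (cf. Equation 5.8), then average these maxima over the n document
    chains (cf. Equation 5.9)."""
    if not doc_chains or not cluster_chains:
        return 0.0
    best = [max(cosine(dc, cc) for cc in cluster_chains) for dc in doc_chains]
    return sum(best) / len(doc_chains)

doc_chains = [{"government": 1.0, "regime": 0.5}, {"famine": 1.0}]
cluster_chains = [{"government": 0.8, "minister": 0.4}, {"aid": 1.0, "famine": 0.6}]
print(chain_level_similarity(doc_chains, cluster_chains))
```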
Figure 5.4: DET graph (% Misses against % False Alarms) showing the performance of the SYN system under two alternative lexical chain-based NED architectures: SYN (Stokes, 2001) and SYN (Hatch, 2000).

Hatch also performed a number of data fusion experiments using a similar LexDetect architecture to that described in Section 5.3.2 (Equations (5.2) and (5.3)), where the weighted sum of the term and chain representations was used to calculate the similarity between a document and a cluster. However, Hatch only experimented with a conceptual indexing based chain representation in her data fusion experiment. Figure 5.5 shows a DET graph of the NED results for both the chain-based VSM (Hatch, 2000), which is a combination of SYN and TRAD, and the chain word-based VSM (Stokes, 2001a), which is a combination of CHAIN and TRAD. It is evident from this graph that Stokes’s LexDetect schema marginally outperforms Hatch’s, which is in agreement with the SYN results in Figure 5.4. This result is also statistically significant at the 5% level for lower miss and false alarm rates, as indicated by the error bars on the LexDetect (Stokes, 2001a) curve.

Figure 5.5: DET graph (% Misses against % False Alarms) showing the performance of two alternative lexical chain-based NED architectures: LexDetect (Stokes, 2001) and LexDetect (Hatch, 2000), alongside the TRAD, SYN (Stokes, 2001) and SYN (Hatch, 2000) systems.

5.6 TDT2 Experiments

In this section, we repeat the NED experiments described in the previous section. However, in this case the underlying retrieval model used by the LexDetect system to detect new events is the UMass approach to NED. These experiments also differ from those described in Section 5.5 in that the TDT2 corpus and evaluation methodology are used to evaluate system accuracy. In particular, the TDT2 cost function, (CDet)Norm (not available in the TDT1 evaluation), provides a single measure of system performance based on a combination of the miss and false alarm probabilities calculated for a given system run. In addition, the TDT2 corpus is over three times the size of the TDT1 corpus and contains 96 identified new events compared with only 25 events in the TDT pilot study evaluation, and hence supports a more comprehensive evaluation. The TDT2 corpus also provides a more realistic evaluation environment where NED systems are required to process error-prone ASR text, with limited capitalisation, some spelling errors and segmentation errors. The effect of this facet of the evaluation on NED performance will be discussed in due course.

5.6.1 System Descriptions

In Chapter 4, we reviewed the preliminary UMass system proposed by Papka (1999). His NED implementation used a single-pass clustering algorithm based on the vector space model using the cosine similarity metric and an InQuery term weighting scheme (see Section 4.3.1). This system participated in the pilot study evaluation in 1997, and the TDT 1998 and TDT 1999 evaluation workshops. However, the 1999 workshop at Johns Hopkins University prompted a re-design of Papka’s system. The participants found that optimal NED performance could be achieved using a vector space model, a k-NN clustering strategy, the cosine similarity metric, an InQuery tf.idf metric and a feature vector containing all terms in the document, i.e. full dimensionality (Allan et al., 2000c). Stopword removal and stemming were also found to have a positive effect on performance.
Their system also supports a number of language modelling approaches to the TDT tasks; however, optimal performance was achieved using a vector space modelling approach. This system participated in the TDT 2000 evaluation and was found to be the best performing NED system in this evaluation run. The basic UMass system is controlled by a number of command-line switches which provide a flexible means of testing various combinations of term weighting, clustering, and similarity strategies. The strategies supported by the UMass system, taken from (Allan et al., 2000c), are as follows:

Topic Models:
o k-NN clustering
o Agglomerative centroid-based clustering

Similarity Functions:
o InQuery weighted sum (Equation 4.14)
o Vector cosine similarity (Equations 4.5, 4.6)
o Language modelling approach (Equations 4.3, 4.4)
o Kullback-Leibler divergence (or relative entropy): an information theoretic measure which calculates the divergence between two distributions, in this case the document distribution D and a topic model M. More formally:

      KL(D, M) = − Σi di log(mi / di)

  where di and mi are the relative frequencies of word i in D and M respectively (both smoothed appropriately).

Term Weighting Schemes:
o Basic tf weighting
o Basic tf.idf weighting
o InQuery tf.idf weighting (Equations 4.11, 4.12, 4.13)

As described in Section 5.3.3, the LexDetect system combines the similarity of two distinct sub-vectors: one a free text document representation and the other a lexical chain-based document representation. For the experiments described in this section we combined these forms of evidence within the UMass NED framework, with the hope of outperforming the basic UMass system, as was achieved in our preliminary TDT1 exploration using our own NED retrieval model.

5.6.2 New Event Detection Results

As explained in Section 5.3, the basic LexNews lexical chaining algorithm (Section 3.1) was used to drive the feature selection method discussed in Section 5.2. Figures 5.6 and 5.7 contrast the performance of the basic and the enhanced LexNews chaining algorithms. As described in Chapter 3, both these algorithms look at repetition-based and lexicographical relationships between words. However, the enhanced version also looks at compound nouns, proper nouns and statistical word associations. In both these graphs the LexDetect systems are marginally outperformed by the basic UMass system. Unfortunately, there is no significant difference between the performance of the LexDetect system using the basic chaining algorithm and the system using the enhanced algorithm.

Figure 5.6: DET graph showing performance of the LexDetect and CHAIN systems (using the basic LexNews chaining algorithm), and the UMass system, for the TDT2 New Event Detection task.

Figure 5.7: DET graph showing performance of the LexDetect and CHAIN systems (using the enhanced LexNews chaining algorithm), and the UMass system, for the TDT2 New Event Detection task.

However, the CHAIN system in Figure 5.7 (minimum TDT cost 0.7869) does exhibit a marked improvement over the CHAIN system in Figure 5.6 (minimum TDT cost 0.9174), implying that the features detected by the enhanced LexNews chaining algorithm better encapsulate the essence of the news story25. This outcome was largely expected since the exclusion of proper nouns in the original chain word representation would have greatly affected its performance as a document classifier.
Moreover, the inclusion of statistically associated words is bound to increase the dimensionality of the resultant chain word document representation. Table 5.3 shows a breakdown of the TDT2 results in terms of the broadcast news (ABC, CNN, PRI, VOA) and newswire (NYT, APW) NED performance of each system in Figures 5.6 and 5.7. As previously mentioned, the broadcast news portion of the TDT2 corpus is a ‘noisier’ information source than its newswire equivalent due to the presence of segmentation errors, spelling errors and, most notably (for an NLP system), a lack of capitalisation. The effectiveness of a number of preprocessing steps in the LexNews chaining algorithm, in particular the part-of-speech tagging and noun phrase identification steps, is greatly compromised, which has a knock-on effect on chaining accuracy. For example, consider the following spelling mistake made by an ASR system:

‘earnest gold wrote the score for numerous productions including the film the exorcist’

In this sentence the proper name ‘Ernest’ has been incorrectly transcribed as the adjective ‘earnest’ by the ASR system. Due to the lack of capitalisation, the phrase ‘earnest gold’ will be incorrectly identified as an adjective-noun phrase by the part-of-speech tagger and then incorrectly interpreted by the lexical chainer as a reference to a ‘sincere precious metal’. This compounding of errors during the chain formation process is evident when one compares the degradation in performance between the LexDetect and CHAIN systems on the newswire and broadcast news portions of the TDT2 corpus. In the case of the basic UMass system, performance degradation is 8.74% compared with 20.61% for the CHAIN (enhanced) system. More moderate degradations were experienced by the other systems, the most important conclusion being that in each case the UMass system outperforms all other systems on both the broadcast and newswire portions of the corpus and on the entire TDT2 corpus. Hence, unlike in the TDT1 experiments, the combined document representation in the LexDetect system could not outperform the baseline NED system, in this case the UMass system.

25 Unlike the TDT1 experiments, full dimensionality was used in both the free text and lexical chain word document representations, since this resulted in optimal NED performance on the TDT2 corpus.

Another difference between the TDT2 and TDT1 corpora that greatly affected the performance of the LexDetect system in these experiments was the occurrence of very short broadcast news stories. From Table 5.4, we can see that 15.5% of documents in the TDT2 collection have fewer than 50 words (roughly 2 sentences), which means that in many instances the chainer fails to select any defining features of the document, resulting in a very low-dimensionality document representation. More specifically, the LexNews chaining algorithm has great difficulty analysing the lexical cohesive structure of very short documents, as they tend to lack even weakly cohesive ties like statistical word associations between words. In the TDT1 collection, by comparison, only 0.9% of documents fall into the ‘very short’ category and only 3.5% into the ‘short’ category, compared with 19.8% in the TDT2 collection. An interesting extension to this work would be to investigate the impact of document length both on the effectiveness of the LexNews chaining algorithm and on TDT performance in general.
System                  Broadcast    Newswire      % ∆      All News
UMass                    0.6508       0.5634      -8.74      0.6302
CHAIN (enhanced)         0.8569       0.6508     -20.61      0.7869
CHAIN (basic)            0.9546       0.8558      -9.88      0.9174
LexDetect (enhanced)     0.6688       0.5981      -7.07      0.6444
LexDetect (basic)        0.6918       0.6845      -0.73      0.6498

Table 5.3: Breakdown of TDT2 results into broadcast and newswire system performance. The LexDetect (basic) and CHAIN (basic) results were shown in Figure 5.6, and the enhanced versions of these algorithms in Figure 5.7. % ∆ is the percentage degradation in NED performance when broadcast news performance is compared with newswire performance.

Document Type     No. of Words     TDT1 % of Corpus    TDT2 % of Corpus
Very Short           <= 50                0.9                 15.5
Short                51 – 100             3.5                 19.8
Short-Medium         101 – 250           17.1                 20.1
Medium-Long          251 – 500           41.5                 18.6
Long                 501 – 1000          30.4                 17.3
Very Long            > 1000               6.6                  8.7

Table 5.4: Breakdown of document lengths in the TDT1 and TDT2 corpora.

5.7 Discussion

In this chapter we introduced our New Event Detection system LexDetect, which uses a hybrid model of document content where document-to-cluster similarity is based on the overlap between the free text and chain word representations of the document/cluster pair. Inadequacies of previous chaining attempts were discussed in Section 5.2 and a novel approach to integrating lexical chains into a vector space IR model was described in Section 5.3. The architecture of the LexDetect system is unique in three respects compared to previous research efforts by Stairmand, Green, Kazman et al. and Ellman. In what follows, we justify these design decisions based on the results of our TDT1 pilot study evaluation described in Section 5.4.

Firstly, unlike previous chain-based indexing strategies, the LexDetect system uses the syntactic form of chain words as features rather than the WordNet synsets identified during the chaining process. These two indexing strategies were implemented as the CHAIN and SYN systems, where the results of our experiments showed that the CHAIN system using chain words significantly outperforms the SYN system using WordNet synsets. Secondly, the LexDetect system uses lexical chains as a feature selection method where terms that form strong cohesive relationships with other terms in the text are used to represent crucial threads of information in a news story. Thirdly, we recognise that lexical chain words are only partial evidence of document similarity, so we combined this evidence with a traditional keyword-based representation of a document when seeking first stories in a news stream. Justification of these last two design decisions was presented in the evaluation of the TRAD, CHAIN and LexDetect systems, where LexDetect, a combination of the TRAD and CHAIN systems, outperforms either one alone. These initial experiments indicated that lexical chains provide some useful additional information regarding the topic of a document that can be used to enhance a traditional ‘bag-of-words’ approach to the problem.

Another important justification of LexDetect’s design was discussed in Section 5.5.3, where we reviewed work by Hatch (2000), who also looked at improving NED using lexical chain information. In her work, Hatch proposed an alternative lexical chain document representation scheme. In this implementation lexical chains rather than chain words are used as features in the detection process.
However, a comparison with the implementation of the LexDetect system described in this thesis showed that Hatch’s more computationally expensive implementation is marginally outperformed by our chain word-based approach.

Since the work at University College Dublin is the first large-scale attempt at ascertaining the suitability of lexical chaining as a solution to a real-world IR problem like new event detection, the results discussed in Section 5.5 were a preliminary investigation into this question. The results discussed in Section 5.6 provided a more thorough evaluation of our hypothesis, where lexical chains were integrated into the UMass NED system architecture. However, when the experiments were repeated on the TDT2 corpus, similar improvements in system performance were not observed. We identified three reasons for this disparity between the pilot study and the TDT2 evaluation results:

1. The UMass system is a superior baseline system to the one described in Section 5.3, due to its continual refinement as a result of its yearly participation in the TDT workshop. Hence, it was difficult to improve upon its baseline TDT2 NED performance.
2. The effectiveness of a number of key preprocessing steps in the LexNews chaining algorithm was greatly reduced due to their dependence, like many other NLP techniques, on capitalised text. This characteristic of the ASR broadcast news transcripts affected both the accuracy of the resultant lexical chains and their ability to extract pertinent features in the news stories.
3. Since 35.3% of documents in the TDT2 corpus consist of fewer than 100 words, it was difficult for the LexNews chaining algorithm to explore the lexical cohesive structure of these texts due to their brevity, which in turn reduced the algorithm’s effectiveness as a feature selection method.

These final two points are primarily responsible for the LexDetect and CHAIN systems’ inconsistent performance on the broadcast and newswire portions of the TDT2 corpus, where NED performance dropped by 7.07% and 20.61% respectively for broadcast news document classification.

In Section 4.3.5, we referred briefly to the work of Eichmann and Srinivasan (2002), who also experimented with a document/cluster similarity measure based on the weighted sum of a number of sub-vectors, one for each of the following named entity types: persons, organisations, places and events. From their participation in the TDT2000 workshop, Eichmann and Srinivasan concluded that tracking performance was severely downgraded when sparse or empty entity vectors occurred in similarity calculations. This is analogous to our observation that LexDetect performance is poorer on the TDT2 corpus due to the presence of short, weakly cohesive documents resulting in sparse or empty chain word vectors. Furthermore, these results are in agreement with similar lexical chain-based Event Tracking research also conducted at University College Dublin. Carthy (2002) used a similar lexical chaining implementation and document cluster comparison strategy to Hatch (see Section 5.5.3) in his LexTrack system, where document-cluster similarity is determined based on the cross-comparison of chains between a centroid and a document representation. However, Carthy calculates the Overlap Coefficient (van Rijsbergen, 1979) between two chains rather than their cosine similarity as Hatch did.
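Carthy’s measure, given formally as Equation 5.6 below, can be illustrated with a minimal sketch in which chains are treated simply as sets of chain words. This is an illustration of the overlap idea only, not Carthy’s actual LexTrack implementation; the example chains are invented.

```python
def overlap_coefficient(chain1, chain2):
    """Overlap Coefficient between two chains treated as sets of chain words:
    a short chain wholly contained in a longer chain scores 1.0."""
    if not chain1 or not chain2:
        return 0.0
    return len(chain1 & chain2) / min(len(chain1), len(chain2))

# A two-word chain fully subsumed by a three-word chain scores 1.0.
print(overlap_coefficient({"government", "regime"},
                          {"government", "regime", "administration"}))
```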
More specifically, the similarity between two chains c1 and c2 is calculated as follows:

    scorei = |c1 ∩ c2| / min(|c1|, |c2|)   (5.6)

Using this measure ensures that if the elements of a short chain are all subsumed in a longer chain, a high matching value is still achieved. Once all the pair-wise comparisons between the document and event descriptor chains have been calculated, the overall similarity between the document and cluster is the sum of all these comparisons. Hence, the target document is on-topic (i.e. sufficiently similar to the event being tracked) if this similarity exceeds a certain threshold. Like the LexDetect system, the LexTrack system uses a composite document representation consisting of keyword and lexical chain representations. However, overall the LexTrack system could not outperform the three baseline systems that Carthy developed for comparison, and he concluded that considering the temporal nature of news stories26 during the event tracking process had a more positive impact on tracking performance than the inclusion of lexical cohesion information.

26 Carthy extended a baseline tracking system with a time penalty factor that gradually increased the similarity threshold as the distance between the target story and the event descriptor on the data stream increased. This similarity threshold is the similarity value (between a document and a cluster) that must be exceeded in order for the target document to be deemed on-topic.

Chapter 6 News Story Segmentation

Text segmentation can be defined as the automatic identification of boundaries between distinct textual units (segments) in a textual document. The aim of early segmentation research was to model the discourse structure of a text. Consequently, segmentation research has focussed on the detection of fine-grained topic shifts at a clausal, sentence or passage level (Hearst, 1997). More recently, with the introduction of the TDT initiative (Allan et al., 1998a), segmentation research has concentrated on the detection of coarse-grained topic shifts, in particular, the identification of story boundaries in news feeds. The aim of this chapter is to set the scene for Chapter 7, which explores another TDT application of our lexical chaining algorithm: news story segmentation.

Segmentation literature spans nearly four decades of published research. In Section 6.1, we review a number of these segmentation approaches, which we have categorised into two levels of segmentation granularity: fine-grained segmentation and coarse-grained segmentation. Since we are primarily interested in techniques that deal with news story segmentation, Section 6.2 covers three of the most common approaches to sub-topic or topic segmentation, including methods based on information extraction, lexical cohesion analysis, and multi-source and statistical modelling approaches.

6.1 Segmentation Granularity

In general the objective of a text segmentation algorithm is to divide a piece of text into a distinct set of segments. Segmentation literature abounds with definitions of what unit of text a segment should represent. These definitions have varied in form and size from a shift in speaker focus (a span of speaker utterances) (Passonneau, Litman, 1997) to a distinct topical unit like a news story (a set of multiple paragraphs) (Allan et al., 1998a). The realisation of these diverse segment types requires different levels of textual analysis or segmentation granularity, which must be reflected in the design of the particular segmentation algorithm.
In this section we look at the relationship between text segmentation and linear and hierarchical discourse analysis. This is followed by some examples of fine and coarse-grained segmentation, and a description of how segmentation analysis can be applied to a variety of IR and NLP tasks.

6.1.1 Discourse Structure and Text Segmentation

Early text segmentation work stemmed from the desire to model the discourse structure of a text. Discourse analysis, as explained in Section 2.1, examines the interdependencies between utterances (words, phrases, clauses) and how these dependencies contribute to the overall coherence of a text. Segmentation is one method of exploring discourse structure. Many prominent theories of discourse state that relationships between utterances are hierarchically structured (Grosz and Sidner, 1986; Mann and Thompson, 1987), where a hierarchical structure, according to Grosz and Sidner (1986), is a tree-like formation that represents multi-utterance segments and the relationships between them. In spite of this, the majority of approaches, including our own, partition text in a linear fashion. In fact, even in cases where researchers have based their segmentation approach on a hierarchical theory of discourse, they still resort to evaluating their technique with respect to a linear structure (Passonneau, Litman 1997; Yaari 1997). For example, Passonneau and Litman (1997) explained that in order to have a comprehensive evaluation methodology, they had to use a linear strategy since asking human judges to consider hierarchical relationships between segments is an inordinately larger task than asking them to segment text linearly.

6.1.2 Fine-grained Text Segmentation

Not only do segmentation techniques differ with regard to the type of discourse analysis (linear or hierarchical) that they perform, but they also differ with respect to the level of segmentation granularity they produce. Passonneau and Litman’s (1997) work represents an example of fine-grained segmentation. Their technique identifies discourse structure in speech transcripts by extracting linguistic features such as referential noun phrases, cue phrases and speaker pauses from a training set. The C4.5 algorithm (Quinlan, 1993) is then used to induce a suitable decision tree that combines the segmentation evidence derived from these features. The decision tree is then used to fragment dialogue into segments or spans of utterances, which according to Passonneau and Litman form coherent units in the dialogue.

Segment Description: Speaker recommends movie.
Transcript: Well it’s a really great movie, really beautiful scenery. You should see it, I recommend it, I really do.

Segment Description: Introduces main character and where the movie takes place.
Transcript: The first part of it just sets up how Kevin Kostner’s [sic] character goes out West. He’s a soldier in the Civil War, and he was a hero so he can be posted where he wants, and he asks to go to the frontier, out to North Dakota ’cause he’s kind of romantic about the West.

Segment Description: Describes the fort and the countryside.
Transcript: Well, he gets this army post and it’s abandoned. He’s all alone in the wilderness, with lots of supplies but no people. It’s beautiful country, really wide open, hardly any trees, golden grass waving in the breeze, all like that.

Figure 6.1: Example of fine-grained segments detected by Passonneau and Litman’s segmentation technique. Note that the noun phrases ‘frontier’ and ‘North Dakota’ refer back to ‘the West’, and ‘wilderness’ refers to ‘country’ in the third segment.
Figure 6.1 is an extract of a speech transcript taken from Passonneau and Litman’s evaluation corpus (1997), which illustrates the fine-grained nature of their segments. Cue phrases (e.g. well, uh, finally, because, also) are highlighted in the transcript in bold and referential noun phrases (including pronouns) are highlighted in italics. This mark-up was hand-coded by the authors in order to provide the C4.5 machine learning algorithm with labelled training examples. Passonneau and Litman define a segment as a unit of text that represents a speaker intention, i.e. a specific idea or point that the speaker is trying to articulate to the listener.

One NLP task that can benefit from fine-grained segments like these is anaphor resolution, i.e. the identification and resolution of referential relationships between pronouns and noun phrases in a text. An experiment by Reynar (1998) showed that by restricting the number of candidate antecedents (possible referents) to those words that exist in the same segment as the pronoun, the efficiency of the resolution algorithm can be greatly improved without compromising its effectiveness. Passonneau and Litman (1993) also present a motivating example of how segmental structure is essential for pronoun resolution. However, in this case no formal evaluation of the effectiveness of the claim was ever reported. Other NLP techniques that have benefited from fine-grained segmentation analysis include speaker turn identification and dialogue generation.

6.1.3 Coarse-grained Text Segmentation

Coarser-grained segmentation breaks text into multi-sentence or multi-paragraph sized chunks. These types of segments have been used to improve IR, summarisation and text display tasks. In the case of IR applications, Hearst and Plaunt (1993), Reynar (1998), and Mochizuki et al. (2000) have tried to determine the usefulness of linguistically motivated segments in passage-level retrieval. In the early nineties, there was a surge in research relating to passage-level retrieval: the retrieval of smaller units of text rather than full documents in response to user queries (Hearst and Plaunt, 1993; Salton et al., 1993; Callan, 1994; Moffat et al., 1994; Mittendorf, Schauble, 1994; Wilkinson, 1994; Salton et al., 1996). This work was motivated by the idea that long documents (e.g. expository texts) contain a number of heterogeneous sub-topics that make their word frequency statistics unrepresentative of any particular sub-topic in the document. Consequently, a long document that contains a passage relevant to a particular user query will quite likely not be retrieved, since the passage is hidden in a myriad of other textual information included in the document.

The units used to represent blocks of text in passage-level retrieval have varied in size from sentences and paragraphs to fixed windows of text and sub-topic segments. Hearst and Plaunt (1993) represent passages as sub-topic segments which consist of multi-paragraph blocks of text. They found that retrieval based on sub-topics improved retrieval performance. However, comparable performance gains were also achieved when arbitrarily chosen fixed-size segments were used as passages. Reynar (1998) reported similar results, which showed that his topic segments were slightly outperformed by Kaszkiel and Zobel’s (1997; 2001) overlapping passages. Kaszkiel and Zobel’s technique divides documents into fixed-length passages beginning at every word in the document.
However, as Reynar points out, although this method is effective, the size of the index greatly increases with even a small increase in collection size, thus magnifying the space and time requirements of the algorithm. Mochizuki et al. (2000) reported some confusing results using lexical chain-based segments. In contrast to all other passage retrieval work, they found that optimal retrieval performance occurred when the original documents rather than fixed block passages were retrieved in response to a set of queries on a Japanese text collection. In this experiment Mochizuki et al. compared a number of segmentation techniques: fixed-length segments, paragraph-based segments and lexical chain-based segments. However, their results did show that a method that combined the passage-level retrieval results of keyword retrieval with lexical chain-derived segment retrieval could outperform a method that used either one of these techniques.

Hearst (1997) suggests text display as a more motivating example of how segmentation information can be used to help users ‘home in’ on the relevant passages in a document without actually having to read the document in its entirety. Her system TileBars offers users a means of viewing the distribution of their query terms in each passage (or segment) that was deemed relevant to their specific query. The interface also provides users with links leading directly to the positions in the document that are most relevant to their query. Other uses of Hearst’s segmentation algorithm TextTiling include restricting the context surrounding words in order to improve the generation of lexical chains (Barzilay, 1997)27 and the gathering of co-occurrence statistics for a thesaurus (Mandala et al., 1999). Sub-topic segments have also been used to improve text summarisation tasks (Mittal et al., 1999).

27 However, Barzilay’s work also showed that the effectiveness of lexical chain disambiguation actually improved when documents were divided into paragraphs rather than Hearst’s sub-topic segments.

In recent times coarse-grained segmentation has found what could be described as its niche IR application. As heterogeneous multimedia data, such as television and radio broadcast news streams, becomes more readily available, a new challenge for IR research also arrives, where systems that were traditionally used for processing demarcated text (containing title, section, paragraph, and story boundary information) must now work on un-segmented streams of error-prone ASR transcripts. Since the TDT initiative started in 1997, news story segmentation has become the main focus of segmentation research (Allan et al., 1998a; Stolcke et al., 1999; van Mulbregt et al., 1999; Beeferman et al., 1999; Eichmann et al., 1999; Eichmann, Srinivasan, 2002; Dharanipragada et al., 1999, 2002; Greiff et al., 2000; Mani et al., 1997; Stokes et al., 2002, 2004; Stokes, 2003; Yamron et al., 2002). In this form of coarse-grained segmentation, segments are defined as coherent units of text that pertain to distinct news stories in a news stream. In Section 4.2.2, we defined the TDT tracking and detection tasks. One of the prerequisites of these systems is good structural organisation of the incoming data stream, where news story boundaries must be correctly identified in order to maximise system performance. Developing robust segmentation strategies is also important since manual segmentation of news transcripts is a very time-consuming process.
Cieri et al. (2002) state that manual segmentation of the TDT collections represented the largest portion of the LDC annotation effort (compared with the time spent on topic annotation and transcription). In the TDT pilot study, Allan et al. (1998a) concluded that segmentation error rates between 10% and 20% were adequate for TDT applications. However, this conclusion was only drawn from event tracking experiments. In subsequent TDT research, Allan conceded that although segmentation errors have little effect on the tracking task, these errors do have a more dramatic impact on the various detection tasks, i.e. New Event Detection, Link Detection and Retrospective Detection (Allan, 2002b). For the remainder of this chapter we will look at a variety of approaches that have been used to tackle the problem of coarse-grained segmentation, in particular, sub-topic identification and news story segmentation.

6.2 Sub-topic/News Story Segmentation Approaches

According to Manning (1998), text segmentation techniques can be roughly separated into two distinct approaches: those that rely on lexical cohesion and those that rely on statistical Information Extraction (IE) techniques such as cue phrase extraction. In this section, we also look at two newer story segmentation methods: one based on a Hidden Markov Modeling approach and the other on a combined approach that uses multiple sources of segmentation evidence. However, we do not limit this literature review to coverage of news story segmentation systems, since many sub-topic approaches have also been successfully adapted to tackle story segmentation. One of the key elements in the following section is our description of lexical cohesion-based approaches (Section 6.2.2), since our own segmentation approach falls into this category. In addition, in Section 7.3 our system is evaluated with respect to two other notable lexical cohesion approaches: the C99 (Choi, 2000) and TextTiling (Hearst, 1997) algorithms.

6.2.1 Information Extraction Approaches

IE approaches to segmentation are based on the existence of cue phrases or words that contribute little to the overall message of a text, but are still important as they help to indicate thematic shifts of focus in a text. There are two types of cues identified in the literature: domain independent cues and domain specific cues (Reynar 1998).

Domain Independent Cues

Domain independent cues are generally cue phrases that are applicable to many genres, and include certain conjunctions, adverbs, and pronouns. Depending on the level of segmentation granularity required, it is either the presence or absence of these cues at potential boundary points that indicates a coarse or fine-grained topic shift. For example, in Figure 6.1, taken from Passonneau and Litman (1993), the cue word well is a good indicator that the intentional focus of the speaker has changed. Other examples of this type of cue include also, therefore, yes, so, basically, finally, and actually (further discussion can be found in Hirschberg and Litman (1993)). Passonneau and Litman also use the occurrence of pronouns such as it, that, and he to determine segment boundaries. However, in contrast to adverbial/conjunctive cues, pronoun usage can imply either the presence or the absence of a new segment depending on a number of different factors: the location of the clause that the pronoun occurs in, and whether or not the pronoun provides a referential link to another word in the current segment or to a proper noun in the immediately preceding clause.
Domain independent cues are also valuable indicators of cohesion in text when coarse-grained topic shifts such as news story boundaries are required. More specifically, if cue words such as conjunctions or pronouns can be used to indicate sub-topic shifts, then their presence is also evidence of the continuation of the current coarse-grained topic. Figure 6.2 is an extract taken from a CNN broadcast illustrating this point. The cue words and, so, finally and again each link the sentence in which they occur to the previous sentence, thus giving the text a continuous rather than a disjoint quality. Similarly, the referential pronoun in Sentence 6 links it to Sentence 5. Identifying these cues helps segmentation performance as it reduces the number of possible segment boundary points between sentences 1 and 7. This type of segmentation evidence is used to enhance our segmentation algorithm, resulting in a notable improvement in system performance (Section 7.4.2).

1. The French forces appeared reluctant to help.
2. So the Rwandan soldier jumped out of the jeep and into the second one.
3. The scene was repeated.
4. Again there was a struggle, with the French providing no help.
5. Finally, the Rwandan soldier realized the French troops would not intervene.
6. He jumped off the jeep and started running.
7. And an RPF soldier shouted “maliza yeye” (finish him).

Figure 6.2: Extract taken from a CNN transcript which illustrates the role of domain independent cue phrases in providing cohesion to text.

So far we have illustrated how effective domain independent cues are in the segmentation process. However, although they can in general be applied across genres, the list of cues defined for each test set must be fine-tuned a little in order to either eliminate misleading cues or include missing ones. Of course, one method of achieving this is to hand-code these lists. However, there is sufficient evidence to suggest that the extraction and weighting of these cues is better left to a machine learning IE technique, e.g. multiple regression analysis (Mochizuki et al. 1998), the C4.5 algorithm (Passonneau, Litman, 1997), exponential language modeling (Beeferman et al., 1999) and maximum entropy modeling (Reynar, 1998). These techniques have also been used to identify domain specific cues in text, described in more detail in the following section.

Domain Specific Cues

For domain specific cues to work some explicit structure must be present in the text. For example, Manning’s segmenter (1998) was required to identify boundaries between real estate classified advertisements which, in general, contain the same types of cue information, for example house price, location, acreage, and number of bedrooms. Similarly, in news transcripts an inherent structure exists: an introduction, followed by a series of news stories interspersed with commercial breaks, and finally a summation of the main news stories covered. Some researchers involved in the TDT initiative (Reynar, 1998; Beeferman et al., 1999; Dharanipragada et al., 1999) have put this structure to good use by extracting cue phrases in news transcripts that are reliable indicators of topic shifts in the dialogue, such as ‘Good Morning’, ‘stay with us’, ‘welcome back’ or ‘reporting from PLACE’. Reynar (1998), who identified these phrases by hand from the HUB-4 broadcast news transcripts, divides these domain cues into a number of different categories: ‘Greeting’, ‘Introductory’, ‘Pointer’, ‘Return from commercial’, and ‘Sign-off’ cues.
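A toy sketch of how such a hand-built cue inventory might be matched against transcript sentences is given below. The cue phrases are drawn from the examples in Figure 6.3 below, but the grouping, function names and matching strategy are illustrative assumptions rather than Reynar’s actual implementation, and a match would only ever be treated as secondary evidence of a nearby story boundary.

```python
# Illustrative cue inventory; phrases taken from the examples in Figure 6.3.
BOUNDARY_CUES = {
    "greeting": ["welcome", "good evening", "top stories this hour"],
    "commercial": ["we'll be right back", "welcome back", "and we're back"],
    "sign_off": ["reporting from", "live from", "i'm"],
}

def cue_evidence(sentence):
    """Return the cue categories matched in a sentence; a non-empty result
    suggests (but does not prove) that a story boundary is nearby."""
    lower = sentence.lower()
    return [category for category, phrases in BOUNDARY_CUES.items()
            if any(phrase in lower for phrase in phrases)]

print(cue_evidence("Good evening, and welcome to the six o'clock news."))
```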
Figure 6.3 illustrates how these domain cues reflect news programme structure.

Figure 6.3: A timeline diagram of a news programme (news summary, news stories and commercial breaks over a 30-minute broadcast) annotated with some domain cues: Greeting cues (‘welcome’, ‘good evening’, ‘top stories this hour’), Commercial cues (‘we’ll be right back’, ‘welcome back’, ‘and we’re back’), Introductory cues (‘this just in’, ‘let’s begin’, ‘and finally’) and Sign-off cues (‘i’m PERSON’, ‘reporting from PLACE’, ‘live from PLACE’).

One of the main problems, however, with these domain cues is that they are genre-specific conventions used only in news transcripts. Furthermore, often these cues are news programme specific as well. For example, in European news broadcasts, in contrast to their American counterparts, news programmes are never ‘brought to you by a PRODUCT NAME’. Newscaster styles also change across news stations because some presenters favour certain catch phrases more than others. Consequently, new lists of cues must be generated either manually or automatically for each news sample. Hence, segmenters that rely heavily on these types of cues tend to be highly sensitive to small changes in news programme structure, which can have a detrimental effect on segmentation performance. A more measured approach to segmentation would be to use cue phrase information as secondary evidence of a topic shift, and consider a domain independent technique like lexical cohesion analysis as primary evidence of the existence of a story boundary. In the following section we examine how lexical cohesion, as a textual characteristic, can be successfully used to segment text into distinct topical units.

6.2.2 Lexical Cohesion Approaches

Research has shown that lexical cohesion is a useful device for detecting sub-topic and topic shifts in texts. The central hypothesis of segmenters based on lexical cohesion analysis is that portions of text that contain high numbers of semantically related words (cohesively strong links) generally constitute a single unit or segment. Consequently, areas of the text that exhibit very low levels of cohesion are said to be representative of a topic/sub-topic shift in the discourse. Most approaches to segmentation using lexical cohesion only examine patterns of lexical repetition in the text and ignore the four other types of lexical cohesion (as discussed in Section 2.3). There are two methods of capturing these other forms of lexical cohesion; one is to examine semantic association between words using a thesaurus, while the other finds associations based on co-occurrence statistics generated from an auxiliary (same-domain) corpus.

Lexical Repetition

Segmenters that examine lexical repetition work on the notion that the repetition of lexical items occurs more frequently in areas of text that are about the same topic. In the case of a news programme, sharp bursts in proper noun and noun phrase repetition will often mark the beginning of the next news report. In general, lexical cohesion-based approaches to segmentation analyse these repetition bursts, where graphical representations of the peaks and troughs in similarity between textual units are used to determine segment boundaries. We will now look in detail at two such repetition-based systems, since they participate in our evaluation methodology described in Section 7.2. The first of these segmentation systems, called TextTiling, was developed by Hearst (Hearst, 1997).
Hearst’s algorithm begins by artificially fragmenting text into fixed blocks of pseudo-sentences (also of fixed length). The algorithm uses the cosine similarity metric to measure cohesive strength between adjacent blocks in the text, where words are weighted with respect to their frequency within the block. Depth scores are then calculated for each block gap (candidate segment boundary) based on the similarity between a block and its neighbouring blocks in the text, as follows:

1. Find the similarity at gap n, i.e. the similarity between block n and block n+1.
2. Find the similarity between n and every block to the left of it until the similarity decreases. Record the difference between the similarity at gap n and the highest encountered similarity.
3. Repeat this procedure for block n+1, comparing it to every block on its right.
4. The depth score for this gap is the sum of the two differences calculated in steps 2 and 3.

Figure 6.4 is a graphical representation of the similarity scores calculated for each block boundary point. A depth score, as Reynar (1998) points out, is basically the sum of the differences between the top of the ‘peak’ immediately to the left and right of a ‘valley’. High values of these depth scores indicate topic boundary points as they represent areas in the text that exhibit major drops in similarity. The jagged horizontal line in Figure 6.4 represents the cut-off point above which all depth scores are definite segment boundaries. This cut-off is a function of the average and standard deviation of the depth scores for the text under analysis. Vertical lines are the multi-paragraph boundaries chosen by the TextTiling algorithm, which are slightly out of line with the block gap numbers on the x-axis since block positions are mapped back onto the original paragraphs in which they occurred in the text. Also note that Hearst prevents the occurrence of very close adjacent boundaries by checking that there are at least 3 pseudo-sentences between boundaries. Hearst’s technique is similar to work done by Youman (1991) on the generation of Vocabulary Management Profiles (VMP). A VMP is effectively a plot of the number of first-time uses of words in a fixed window as a function of word position within the text. Similarly, peaks and valleys in these plots are indications of vocabulary shifts. However, Nomoto and Nitta (1994), who implemented and tested Youman’s technique, concluded that it failed to consistently detect patterns of vocabulary shift in text. Hearst (1997) implemented an improved version of the algorithm, renaming it the Vocabulary Introduction Method. However, she showed that this method could not outperform her TextTiling algorithm.

Figure 6.4: Graph representing the similarity of neighbouring blocks (neighbouring block similarity scores plotted against block gap numbers) determined by the TextTiling algorithm for each possible boundary or block gap in the text.

The second system to take part in our evaluation is Choi’s segmenter C99 (Choi, 2000). This is a three-step algorithm that uses image-processing techniques to interpret a graphical representation of the pair-wise similarity of the sentences in the text, as follows:

1. Generate a sentence pair similarity matrix using the cosine similarity measure.
2. Replace each value in the similarity matrix Mi,j by its rank Ri,j, where Ri,j is the proportion of neighbouring elements that have lower similarity values than Mi,j.
Choi explains that the purpose of this step is to limit the effect of the sensitivity of the cosine metric when short text units are being compared, i.e. the occurrence of a common word between two short sentences in the text could cause a disproportionate increase in relative similarity. By replacing each Mi,j by its Ri,j, the absolute similarity values between individual units become irrelevant and only their relative ranking with respect to their neighbours is considered during further processing.

3. Use a divisive clustering algorithm to determine the final topic boundaries. This algorithm iteratively sub-divides the original document (one large segment) into smaller segments that maximise the inter-sentence similarity values within each segment, until a sharp drop in similarity occurs indicating that an optimal set of segments has been found.

Choi’s algorithm is based on another divisive clustering algorithm developed by Reynar (1994). The main difference between these techniques is that Choi’s algorithm uses the cosine similarity metric and a ranking strategy, while Reynar’s algorithm examines the repetition of words with respect to their position in the text (rather than sentence similarities), illustrated using a dotplot diagram. Creating the dotplot involves plotting points on a graph which correspond to word repetitions in the document. As more and more word occurrences are added to the dotplot, square regions (representing text areas containing a high number of word repetitions) begin to emerge along the diagonal axis of the graph. A maximisation algorithm is then used to maximise the density of these square regions, where sub-topic shifts are located by determining the points at which the outside densities are minimised. This segmentation strategy is similar to Hearst’s TextTiling algorithm in that both methods determine boundaries based on a comparison of neighbouring blocks. However, Reynar points out that his approach involves a global rather than a local comparison strategy since each region is compared with all other regions.

Coupled with these divisive clustering (top-down) methods are a number of lexical repetition approaches that detect boundaries using agglomerative clustering techniques (bottom-up) (Yaari, 1997; Eichmann et al., 1999). However, regardless of the clustering algorithm used, most repetition-based segmenters calculate similarity between units in a text using the cosine similarity measure. The few exceptions to this trend are Reynar (1994), Youman (1991) and Richmond et al. (1997). In particular, Richmond et al. weight word significance using Katz’s (1996) notion of the ‘burstiness’ of content words in text. Burstiness, according to Katz, is an observable characteristic of important topic words, where multiple occurrences of topic words tend to occur in close proximity to each other in a text. Richmond et al. define a significance weight for each word in a text, where words which exhibit a ‘bursty’ distribution are weighted higher than other words. The similarity between textual units is then calculated with respect to the significance of words using the following overlap metric:

    Correspondence = ( (|A′| − |A′′|) / |A| + (|B′| − |B′′|) / |B| ) / 2   (6.1)

where |A| is the sum of all the significance weights assigned to the words in textual unit A, |A′| is the sum of the weights of the words that A has in common with B, and |A′′| is the sum of the weights of the words unique to A with respect to B.
Richmond et al. claim that incorporating this measure of word significance into the segmentation process leads to improved accuracy over Hearst’s TextTiling algorithm without sacrificing language independence.
One of the main disadvantages of a lexical cohesion-based approach to segmentation that only looks at repetition relationships between words is that synonymous and semantically related word occurrences are not considered, making some areas of text appear less cohesive than they actually are. Figure 6.5 illustrates this point using two sentences taken from a report on the SARS epidemic in Toronto. To a repetition-based segmenter these sentences will appear unrelated, since they have no terms in common. However, looking a little closer we see that warning is a synonym of caution and Toronto is a city in Canada (hyponym relationship). In sub-topic or intention-based segmentation (see Section 6.1.2), the segmenter would be correct in classifying these two sentences as separate segments. However, if the required level of granularity is the identification of distinct news stories then considering the synonym and hyponym relationships between the two sentences is essential. In the following two sub-sections, we examine in more detail how these types of relationships have been identified using statistical word associations and thesaural relations.
The World Health Organization today issued a SARS-related travel caution for Toronto, saying that they believe the virus had not been effectively contained there yet. This latest warning is expected to hurt an already struggling economy in Canada's largest city, which accounts for about 20 percent of national gross domestic product.
Figure 6.5: Extract of CNN report illustrating the role of lexical cohesion in determining related pieces of text.
Statistical Word Association
In Section 2.3, we described statistical word associations as ‘intuitive’ word relationships that are not represented in a standard thesaurus, since they cannot be defined in terms of generalisation, specialisation or part-whole relationships. However, as their name suggests, these types of lexical cohesive word relationships can be automatically identified by gathering word co-occurrence statistics from a large domain-specific corpus. Since lexicographically related words (like the relationship between ‘vehicle’ and ‘car’) are also commonly found in similar contexts, it can be said that statistical associations implicitly consider these types of lexical cohesive relationships as well.
Ponte and Croft’s segmenter (Ponte, Croft, 1998) uses a word co-occurrence technique called Local Context Analysis (LCA) to determine the similarity between adjacent sentences. These similarities are then used in the boundary detection phase to find segments using a dynamic programming technique. LCA works by expanding the context surrounding each sentence by finding other words and phrases that occur frequently with these sentence words in an auxiliary corpus. LCA was originally created as a query expansion method (Xu and Croft, 1996), where related words are added to a query in order to improve retrieval performance. The advantage of this technique over other statistical approaches is that it calculates sentence similarity based on the co-occurrence of multiple words, thus ensuring that two terms in different sentences are related based on their intended sense usage in the context of the sentence.
The authors show that segmentation based on LCA is particularly suited to texts containing extremely short segments (similar to the example in Figure 6.5) which share very few terms due to their brevity. Their evaluation compared the LCA-based segmenter to a similar segmenter that examines word frequencies only, and found that the LCA method outperformed the other. Further improvements were possible when co-occurrence statistics were generated from a more recent auxiliary corpus that covered more up-to-date topics, and hence shared more vocabulary with the evaluation corpus.
Another technique closely related to Ponte and Croft’s word expansion technique is Kaufmann’s VecTile system (Kaufmann, 2000), which augments the basic TextTiling algorithm with a more sophisticated approach to determining block similarity. However, instead of LCA, VecTile uses Schutze’s WordSpace model (Schutze, 1997; 1998) to replace words by vectors containing information about the types of contexts that they are most commonly found in. The WordSpace technique calculates a co-occurrence vector containing the co-occurrence frequencies of 1000 content words with each term in a 20,500-word dictionary. Since these matrices tend to be quite sparse (a number of co-occurrences will be zero), Kaufmann reduced the matrix from 20,500 × 1000 to 20,500 × 100 using Singular Value Decomposition (Golub and van Loan, 1989): an operation, Kaufmann explains, that relocates the vectors into a lower-dimensional space so as to summarise the most important (i.e. defining) parts whilst simultaneously filtering out any noise. When finding the similarity between two units of text, the VecTile algorithm produces one vector for each unit by adding together the vectors of each of the words in the unit. This calculation maps the unit and its contents onto a single position and direction in the 100-dimensional space. The remainder of the VecTile algorithm stays true to Hearst’s original TextTiling algorithm, where the cosine similarity metric determines similarity between vectors representing blocks of text, which are in turn used to find boundary points between sub-topic units in the text. Kaufmann used Pearson’s correlation coefficient to evaluate the results of the two algorithms and found that the VecTile algorithm outperformed its TextTiling counterpart.
Two approaches similar to this technique are Choi (2001) and Slaney and Ponceleon (2001), who both used Latent Semantic Analysis (LSA) to determine the similarity between textual units. LSA uses a truncated form of singular value decomposition (Deerwester et al., 1990). Slaney and Ponceleon found that their technique was an effective method of segmenting a broadcast news programme; however, no formal evaluation was conducted. Choi’s paper (2001), on the other hand, showed that an LSA approach could outperform his C99 algorithm.
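The general mechanics behind this family of approaches, reducing a word co-occurrence matrix with a truncated SVD and comparing text blocks by the cosine of their summed word vectors, can be sketched as follows. This is a minimal illustration only: the dimensions, the co-occurrence matrix cooc and the word-index lists are placeholders, not Kaufmann’s or Choi’s actual settings.

import numpy as np

def reduced_word_vectors(cooc, k=100):
    # truncated SVD: keep the k strongest dimensions, discarding the rest as noise
    u, s, _ = np.linalg.svd(cooc, full_matrices=False)
    return u[:, :k] * s[:k]

def block_vector(word_indices, vectors):
    # a block of text is represented by the sum of the vectors of the words it contains
    return vectors[word_indices].sum(axis=0)

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))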
Thesaural-based Word Association
There are relatively few segmentation techniques that consider thesaural-based word association in the boundary detection phase, and most of those that do are lexical chain-based approaches (Okumura and Honda, 1993; Stairmand, 1997; Min-Yen et al., 1998; Mochizuki et al., 1998, 2000). In the majority of cases, thesaural word relationships, like statistical co-occurrence information, are combined with lexical repetition information rather than used as conclusive evidence of a topic shift. There are both advantages and disadvantages to analysing lexicographical rather than statistical relationships between words.
Lexicographical relationships are hand-coded, domain-independent relationships, so they cover a much wider range of word relationships than lists of statistically derived relationships. For example, a very obvious holonym relationship exists between ‘fish’ and ‘fin’ in a story on the ‘Asian shark-fin trade’. However, finding a significant number of co-occurrences of these words in a news corpus in order to establish a statistical relationship between them is difficult, as these terms occur infrequently compared to terms such as ‘violence’ or ‘death’ in this domain. On the other hand, there are many domain-specific relationships between words that are not captured by thesaural relationships, including many compound nouns such as ‘cocaine addiction’ or ‘dirty bomb’ and related nouns such as ‘corruption’ and ‘money’ or ‘NASA’ and ‘Mars’.
One example of published work that uses thesaural word relationships outside the realms of lexical chain-based segmentation is a technique by Jobbins and Evett (1998). Their algorithm is similar to TextTiling in that it looks for areas of low cohesion in text by comparing fixed-size windows of neighbouring text. However, they tested a number of combinations of semantic similarity using lexical repetition, collocations, and thesaural relationships from Roget’s Thesaurus. On a very small test set of 42 pairs of concatenated topical articles, they found that a combination of word repetitions and collocation information performed best. In a similar experiment where sub-topic shifts were identified by a number of judges, they found that a combination of word repetition and thesaural relationships worked best.
Lexical chain-based approaches to text segmentation determine segment boundaries by analysing repetition as well as other forms of lexical cohesion derived from a thesaurus such as WordNet. Like other lexical cohesion approaches to segmentation, a boundary is defined as an area of low similarity between neighbouring blocks in a text. More specifically, in the case of lexical chaining-based segmentation, a boundary is defined as a point in the text where a number of lexical chains end and a number of new chains begin. This corresponds with the idea that each chain represents a sub-topical element of the text, and if a number of these chains terminate at the same point in the text where a number of new sub-topics or chains begin then we can say that we have detected a topic shift. This boundary detection process is described in more detail in Section 7.1, which documents our own approach to the problem.
There have been three previous attempts to tackle text segmentation using lexical chains. The first, by Okumura and Honda (1994), involved an evaluation based on five Japanese texts; the second, by Stairmand (1997), used twelve general-interest magazine articles; and the third, by Min-Yen et al. (1998), used fifteen Wall Street Journal and five Economist articles. The evaluation methodology described in Section 7.2 uses a substantially larger data set consisting of CNN broadcast news and Reuters newswire documents. Broadcast news story segmentation represents a previously unexplored evaluation domain for lexical chain-based segmentation, as Okumura and Honda’s approach looked to identify sub-topics in newspaper articles, while Stairmand and Min-Yen et al.’s experiments centred around the identification of concatenated newspaper articles. In each of these experiments lexical chains were found to be an adequate means of segmenting text into (sub-)topical units.
6.2.3 Multi-Source Statistical Modelling Approaches
So far in this chapter, we have looked at different pieces of textual evidence that can be used to break text up into coherent segments. Although some segmenters use either IE or lexical cohesion techniques, many researchers have found that a combination of evidence works best. For example, in Reynar’s (1998) approach to news story segmentation he showed that significant gains could be achieved by combining cue information with other feature information such as named entities, character n-grams (sequences of word forms of length n), and lexical cohesion analysis. On the other hand, Eichmann et al. (1999) combined a tf.idf measure of similarity with pausal information in the news audio stream (i.e. speaker pause duration between textual units, where a longer pause is evidence of a story boundary). Combination approaches such as these work by learning the best indicators of segment boundaries from an annotated corpus and then combining these diverse sources of evidence in a theoretically sound framework, such as a feature-based language modelling approach (Beeferman et al., 1999), a cue-based maximum entropy model (Reynar, 1998) or a decision tree-based probabilistic model (Dharanipragada et al., 1999).
Another important statistical approach that has been successfully used for news story segmentation is Hidden Markov Modelling (HMM): a method more commonly used in speech recognition applications (Yamron et al., 1998; van Mulbregt et al., 1999; Blei and Moreno, 2001; Greiff et al., 2000). Each state in the HMM is representative of a topic, so that given a word sequence the HMM assigns each word a topic, thus producing the maximum-probability topic sequence. Finding topic boundaries is then equivalent to finding topic transitions; in other words, finding where adjacent word topic-labels differ.
One of the main disadvantages of building statistical models to solve segmentation is that they have to be trained and fine-tuned on domain-specific data. For example, in experiments by van Mulbregt et al. (1998) on the TDT2 collection, 48,000 stories (15 million words) were used to train their HMM. Similarly, Beeferman et al. (1999) trained their language modelling approach on a 2 million word subset of the TDT1 broadcast news collection. However, Utiyama and Isahara (2001) have proposed a domain-independent statistical model for text segmentation, where no training data is needed since word statistics are estimated from the given text. Although their approach was not compared to a trained version of their model, their results did compare favourably with Choi’s C99 (2000) word repetition-based approach.
6.3 Discussion
In this chapter we have examined a number of text segmentation approaches that have been used for a variety of purposes depending on the granularity of the segments required. For example, applications of fine-grained text segments include intention-based discourse analysis, anaphoric resolution and language generation. Coarse-grained text segments have been used in systems performing tasks such as passage-level retrieval, text summarisation, and news story segmentation. In the context of this thesis the goal of a news story segmentation system is to automatically detect the boundaries between news stories in a television news broadcast. Techniques that combine lexical cohesion analysis with domain-specific cues have proven to be an effective means of segmenting broadcast news streams.
In Chapter 7, we compare the performance of a number of lexical cohesion-based approaches to this problem. Two of these techniques, TextTiling (Hearst, 1994; 1997) and C99 (Choi, 2000), analyse news transcripts by examining patterns of lexical repetition in the text. Our approach, the SeLeCT system, also analyses other forms of cohesion, namely statistical and lexicographical word associations, using the LexNews chaining algorithm presented in Chapter 3. Hence, one of the goals of the experiments described in the next chapter is to determine if an analysis of these additional lexical cohesion relationships can enhance segmentation performance.
Chapter 7 Lexical Chain-based News Story Segmentation
In Chapter 6, we explained that un-segmented streams of broadcast news present a challenging real-world application for text segmentation approaches, since the success of other tasks such as Topic Tracking or New Event Detection depends heavily on the correct identification of boundaries between news stories. In this chapter we evaluate the performance of our segmentation system SeLeCT with respect to two well-known lexical cohesion-based segmenters: TextTiling and C99. Using the Pk and WindowDiff evaluation metrics we show that SeLeCT outperforms both systems on spoken news transcripts (CNN), while the C99 algorithm performs best on the written newswire collection (Reuters). We also examine the differences between spoken and written news styles and how these differences affect segmentation accuracy. The work described in this chapter was published in (Stokes et al., 2002; 2004a) and (Stokes, 2003).
7.1 SeLeCT: Segmentation using Lexical Chaining
In this section we present our topic segmenter, SeLeCT (Segmentation using Lexical Chaining on Text). This system takes a concatenated stream of text and returns segments consisting of single news reports. The system consists of two components: the LexNews component, made up of a tokeniser for text preprocessing and the lexical chainer, and the boundary detector component that uses these chains to determine news story boundaries. The LexNews component was described in detail in Section 3.2. Figure 7.1 illustrates the general architecture of the system, where a broadcast news programme is input and a segmented stream of news stories is output.
Figure 7.1: SeLeCT news story segmentation system architecture (Broadcast News Programme input, processed by the LexNews Tokeniser and Lexical Chainer, followed by the Boundary Detector consisting of a Boundary Strength Scorer and an Error Reduction Filter, producing News Story Segments).
Unlike other lexical chain-based approaches to segmentation, the SeLeCT system uses a broader notion of lexical cohesion to analyse the text for topic shifts. More specifically, the LexNews chaining component examines repetition, synonymy, antonymy, generalisation/specialisation relationships, part-whole/whole-part relationships (provided by WordNet), and statistical word associations. The tokenisation process gathers candidate terms (proper noun and noun phrases) for the chain generation process. In Section 7.3, we examine the effect of these LexNews enhancements with respect to segmentation performance using the evaluation methodology described in Section 7.2, but first we will look at how chains are used to detect news story boundaries in the SeLeCT system architecture.
7.1.1 The Boundary Detector
As already stated, the LexNews chaining algorithm is used to generate lexical chains; however, due to the temporal nature of news streams, stories related to important breaking-news topics will tend to occur in close proximity in time. If unlimited distance were allowed between word repetitions then some chains would span the entire text if two stories discussing the same topic were situated at the beginning and end of a news programme. Consequently, we impose a maximum distance of m words between candidate terms that exhibit an extra strong relationship (i.e. a repetition-based relationship) in the chaining process. However, the distance restrictions set out in Section 3.3 (i.e. a maximum of 130 words for strong relationships and 60 words for medium-strength relationships) are still adhered to in the experiments described in this chapter.
Once lexical chains have been generated, the final step in the segmentation process is to partition the text into its individual news stories based on the patterns of lexical cohesion identified by the chains. Our boundary detection algorithm is a variation on one devised by Okumura and Honda (1994), and is based on the following observation paraphrased from Morris and Hirst’s seminal paper on lexical chaining (1991):
Since lexical chain spans (i.e. start and end points) represent semantically related units in a text, a high concentration of chain-begin and chain-end points between two adjacent textual units is a good indication of a boundary point between two distinct news stories.
We define the boundary strength, w(n, n+1), between each pair of adjacent textual units in our test set as the sum of the number of lexical chains whose span ends at paragraph n and the number of chains whose span begins at paragraph n+1. When all boundary strengths between adjacent paragraphs have been calculated we then take the mean of all the non-zero cohesive strength scores. This mean value plus a constant x then acts as the minimum allowable boundary strength score (or threshold) that must be exceeded if the end of textual unit n is to be classified as the boundary point between two news stories28.
To illustrate how boundary strengths based on lexical cohesion are calculated, consider the piece of text in Figure 7.2 containing one topic shift (all nouns are highlighted), accompanied by the lexical chains derived from this text fragment, where the chain format is: {word1(frequency) … wordn(frequency) | sentence no. where chain starts, sentence no. where chain ends}
CHAINS:
{hearing(1), testimony(1) | 1, 1}
{tomorrow(1), night(1), holiday(1), weekend(1), time(1) | 1, 3}
{O.J. Simpson(2) | 1, 1}
{airport(2) | 1, 1}
{president(1), organisation(1) | 2, 2}
{checkpoints(2) | 2, 3}
{murders(1), fatalities(1) | 1, 3}
TEXT:
[1] Coming up tomorrow when the hearing resumes, we hear testimony from the limousine driver that brought O.J. Simpson to the airport - who brought O.J. Simpson to the airport June 12th, the night of the murders. [2] The president of Mothers Against Drunk Driving discusses her organisation's support of sobriety checkpoints over the holiday weekend. [3] She hopes checkpoints will be used all the time to limit the number of fatalities on the road.
Figure 7.2: Sample lexical chains generated from concatenated news stories.
28 Optimal values for the constant x (used in the calculation of the boundary strength threshold) were found to be x = 1 in the case of the Reuters collection and x = 2 for the CNN collection. The results of these experiments are discussed in Section 7.3.
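A minimal sketch of this scoring step, assuming each chain is reduced simply to the index of the textual unit where its span starts and ends, reproduces the w(n, n+1) values used in this example (the data structure and function name are our own, not the actual SeLeCT code):

# Sketch of the summation-based boundary strength score w(n, n+1).
# Each chain is assumed to be a (start_unit, end_unit) pair, e.g. (1, 3) for a chain
# spanning sentences 1 to 3; units are numbered from 1 as in Figure 7.2.

def boundary_strengths(chains, num_units):
    strengths = {}
    for n in range(1, num_units):                    # gap between unit n and unit n+1
        ends = sum(1 for start, end in chains if end == n)
        begins = sum(1 for start, end in chains if start == n + 1)
        strengths[(n, n + 1)] = ends + begins
    return strengths

chains = [(1, 1), (1, 3), (1, 1), (1, 1), (2, 2), (2, 3), (1, 3)]
print(boundary_strengths(chains, 3))                 # {(1, 2): 5, (2, 3): 1}, as in Figure 7.3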
Figure 7.3: Chain span schema with boundary point detected at the end of sentence 1. The w(n, n+1) values for each of these points are w(1, 2) = (3 + 2) = 5 and w(2, 3) = (1 + 0) = 1.
Figure 7.3 illustrates the start and end points of the above chain spans and the position of the two distinct boundary points in the text, between sentences 1 and 2 and between sentences 2 and 3. No boundary strength score is calculated for the boundary at the end of sentence 3, since this is the last sentence in the text and so by default must be the end of a particular story. As previously stated, a boundary strength score is the sum of the number of chain-end points and the number of chain-begin points at a particular boundary. Therefore, w(1, 2) has a higher score than w(2, 3) (5 versus 1 respectively) and the algorithm correctly labels the boundary between sentences 1 and 2 as the end of the O.J. Simpson story in the broadcast, since this score exceeds the boundary strength threshold for this piece of text, i.e. w(1, 2) > ((5 + 1)/2) + 1, where the threshold is the mean of these two scores plus the constant x = 1.
This is a very simple example of segmentation using lexical chains; however, in reality news stories tend to be much longer than this (around 500 words) and news broadcasts consist of many more than two stories concatenated together. Consequently, the segmentation decision process gets increasingly more difficult, for the following reasons:
1. Boundary scores tend to increase within a narrow region of text surrounding a true boundary position rather than directly at that point in the text. This is a natural consequence of the fact that not all sub-topics in a story will end in the last line or paragraph of the story. Hence, lexical chain spans will tend to end and begin in the general vicinity of a topic shift. This results in a cluster of high-scoring adjacent boundary points, labelled in Figure 7.4 as regions A and C.
2. In news broadcasts it is common to follow a story with a related news report, so quite often this means that lexical chain spans will stretch across story boundaries since they share some common or related vocabulary. For example, in Figure 7.3 we saw that the relationship between the words ‘murders’ (sentence 1) and ‘fatalities’ (sentence 3) was captured by a specific lexical chain that spanned across the ‘O.J. Simpson’ and ‘Drink Driving’ story lines. This results in a solitary boundary point that is very close to a cluster of adjacent boundary points. This is labelled in Figure 7.4 as region B.
Figure 7.4: Diagram showing characteristics of chain-based segmentation over an example sequence of boundary strength scores (6 4 0 0 5 5 5 3 0 0 0 0 0 0 0 0 0 0 0 0), with clustered regions A and C and solitary region B. All numbers greater than zero are possible boundary positions, while zero scores represent no story boundary point between these two textual units. Only the boundaries ringed in red are retained after the results are run through the error reduction filter.
Both of these characteristics of chain-based segmentation add noise to the detection process. However, their effect on segmentation performance can be lessened by using an error reduction filter.
Our error reduction filter, the final element of the boundary detection process, examines all boundary detection scores that exceed the required threshold and searches for system-detected boundary points that are separated by less than d textual units from a higher-scoring boundary, where d is too small to be a ‘reasonable’ story length. This filter has the effect of smoothing out local maxima in the boundary score distribution, thus increasing segmentation precision. This means that for regions A and C, which represent clusters of adjacent boundary points, only the boundary with the highest score in the cluster is retained as the true story boundary. Therefore, when d = 5, the boundary which scores 6 is retained in region A, while in region C both points have the same score, so in this case we consider the last point in region C to be the correct boundary position. Finally, the story boundary in region B, a solitary boundary point, is also eliminated because it is situated too close to the boundary points in region C and it has a lower score than either of those boundaries.
At the beginning of this section we defined the boundary strength w(n, n+1) between each pair of adjacent textual units in our test set as the sum of the number of lexical chains whose span ends at paragraph n and the number of chains that begin their span at paragraph n+1. Our decision to choose the summation of chain-begin and chain-end points over the product of these two numbers is based on the observation that a multiplication-driven score would eliminate all potential high-scoring boundary points that have either zero chain-ends or zero chain-begins. For example, if a boundary point has five chain-ends and no chain-begins then its strength is zero and so it is eliminated from the set of potential boundaries given to the error filter. However, high-scoring end or begin counts are still good indicators of a topic shift, so we chose a summation-driven boundary strength score. Other scoring approaches were also considered, such as the weighted sum of the number of chain-end and chain-begin points. However, these combinations reduced system performance, leading us to conclude that the position of chain-end points as evidence of topic shifts in a text is just as important as the position of the chain-begin points in a text.
7.2 Evaluation Methodology
In this section we give details of the evaluation metrics used to determine news story segmentation performance in Sections 7.3 and 7.4. In order to determine the effect of different language modes on segmentation performance, these experiments were run on two test collections: one containing CNN broadcast news transcripts and another containing Reuters newswire articles. Both types of news document were taken from the TDT1 pilot study corpus (Allan et al., 1998a), details of which can be found in Section 5.4.1.
7.2.1 News Segmentation Test Collections
For most test collections used as input to segmentation algorithms, a lot of time and effort is spent gathering human annotations, i.e. human-judged sub-topic shifts. The difficulty with these annotations lies in determining their reliability, since human judges are notoriously inconsistent in their agreement on the beginning and end points of these fine-grained boundaries (Passonneau, Litman, 1993). A different approach to segmentation evaluation is available to us due to the nature of the segments that we wish to detect.
By concatenating distinct stories from a specific news source and using this as our test set, we eliminate subjectivity from our boundary judgments. A boundary can therefore be explicitly defined as the joining point between two news stories; in contrast with other test collections, there is no need for a set of judges to make any subjective decisions on what constitutes a segment in the collection.
In Sections 7.3 and 7.4 we report segmentation results gathered from two test collections, each consisting of 1000 news stories randomly selected from the TDT1 corpus. The first test set contains 1000 news stories extracted from CNN news programme transcripts. These stories were reorganised into 40 files each containing 25 stories. This procedure was repeated on the Reuters test set, which also consists of 1000 written articles. Consequently, all experimental results in Sections 7.3 and 7.4 are averaged scores calculated over each of the 40 samples. In a previously reported set of segmentation results (Stokes, Carthy, Smeaton 2002), experiments involving SeLeCT and TextTiling were run on a single file of 1000 CNN stories. However, splitting the corpus was necessary for the experiments reported here and in (Stokes, 2003) and (Stokes et al., 2004a), because Choi’s C99 program was implemented to handle only small amounts of input data. These earlier results (Stokes et al., 2002) will also be discussed in the following section.
7.2.2 Evaluation Metrics
There has been much debate in the segmentation literature regarding appropriate evaluation metrics for estimating segmentation accuracy. Earlier experiments favoured an IR-style evaluation that measures performance in terms of recall and precision, which we define as follows:
Recall: The number of correctly detected story boundaries divided by the number of actual news story boundaries in the test set.
Precision: The number of correctly detected story boundaries divided by the total number of boundaries returned by the system.
However, unlike retrieval tasks where documents are classified as either relevant or non-relevant, the notion of segmentation accuracy is a fuzzier concept. For example, if a system suggests a boundary point that is one sentence away from the true story-end point it is unfair to penalise this system as heavily as a system that has missed the same boundary by 10 sentences. In other words, recall, precision and their harmonic mean, the F1 measure, all fail to take into account near-boundary misses. Consequently, these metrics are insufficiently sensitive when trying to find system parameters that yield optimal system performance (Beeferman et al., 1999). Other researchers (Reynar, 1998; Ponte, Croft 1998; Stokes et al. 2002) have tried to remedy this problem by measuring recall and precision values at varying margins of error. More specifically, a system boundary is considered correct if it exists within a certain window of allowable error. So a margin of error of +/-n means that if the system identifies a boundary n paragraphs before or n paragraphs after the correct boundary point then this end point is still counted as correct. This evaluation strategy is illustrated in Figure 7.5, where vertical lines represent boundaries: a red line represents the correct boundary, while grey lines represent the range of system boundaries surrounding this point that are considered correct within a margin of error of +/-3 sentences.
Figure 7.5: Diagram illustrating the allowable margin of error (+/-3), where the red vertical line represents the correct boundary between two stories, the grey lines represent boundaries that lie within the allowable margin of error and so would still be considered correct if a segmentation system returned them, and black lines are incorrect boundary positions outside this range.
The only stipulation when calculating recall and precision with respect to this margin of error is that each boundary may only be counted once as a correct boundary. This double-counting problem occurs when the value of n is high and has the effect of exaggerating improvements in system performance as n increases. This is the first of four metrics used in our evaluation, which we define more formally as follows:

f_{error} = \begin{cases} 1 & \text{if } |ref - hyp| \leq n \\ 0 & \text{otherwise} \end{cases}    (7.1)

where ferror is an error function and n is the allowable distance in units between the actual boundary ref and the system or hypothesised boundary hyp.
Since the arrival of the TDT initiative, Beeferman et al.’s metric, which tried to address the inadequacies of recall and precision, has become the standard for segmentation evaluations. They proposed a probabilistic evaluation metric Pk that aims to incorporate gradations of segmentation accuracy in terms of false positives (falsely detected segments), false negatives (missed segments) and near-misses (very close but not exact boundaries). More specifically, Pk is defined as ‘the probability that a randomly chosen pair of words a distance k words apart is inconsistently classified’ (Beeferman et al., 1999):

P_k(ref, hyp) = \sum_{1 \leq i \leq j \leq n} D(i, j)\,(\delta_{ref}(i, j) \oplus \delta_{hyp}(i, j))    (7.2)

where \delta_{ref}(i, j) and \delta_{hyp}(i, j) are binary-valued functions which are 1 when sentences i and j are in the same topic segment. The symbol \oplus represents the XNOR function29, which is 1 when its arguments are equal and 0 otherwise. The function D is a distance probability distribution which is estimated based on the average segment size (i.e. story length) in the collection.
29 XNOR (exclusive NOR) is (A ∧ B) ∨ (¬A ∧ ¬B).
However, in a recent publication Pevzner and Hearst (2002) highlight several faults with the Pk metric. Most notably, they criticise Pk firstly for its inability to deal with different types of error in an even-handed manner and secondly for its over-sensitivity to large variances in segment size in the test set. In the latter case, Pk becomes more lenient as the variance increases, and in the former it unfairly penalises false negatives more than false positives while over-penalising near-misses. The authors show through empirical evidence and different segmentation scenarios that their proposed alternative metric, called WindowDiff, alleviates these problems and provides a fairer and more accurate measure of segmentation performance. WindowDiff, like Pk, works by moving a window of fixed size across the test set and penalising the algorithm whenever a missed or erroneous boundary occurs. However, unlike Pk it calculates this error by counting ‘how many discrepancies occur between the reference and the system results’ rather than ‘determining how often two units of text are incorrectly labelled as being in different segments’ (Pevzner, Hearst 2002).
WindowDiff is defined more formally as follows:

WindowDiff(ref, hyp) = \frac{1}{N - k} \sum_{i=1}^{N-k} \left( \left| b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k}) \right| > 0 \right)    (7.3)

where k is the size of the window (based on the average segment size in the text), b(i, j) represents the number of boundaries between positions i and j in the text, and N represents the number of textual units (e.g. sentences) in the text.
The fourth and final metric to be used in our evaluation also comes from Pevzner and Hearst. It is referred to as the Pk′ metric and is an exact implementation of Pk except that it doubles the false positive penalty to compensate for the over-penalisation of false negatives. However, as the authors explain, this metric is still inferior to WindowDiff as it solves only one of many identified problems with Pk. We include this metric in our evaluation as it helps to shed some light on the style of segmentation returned by the segmenter being evaluated. By segmentation style we mean the types of mistakes that a segmenter is prone to making in terms of near-misses, false positives, and false negatives. Appendix C contains a more detailed explanation of the differences between the WindowDiff and Pk metrics.
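For reference, both window-based metrics can be computed directly from boundary positions. The sketch below is an illustrative implementation (not the official TDT scoring software), assuming a segmentation is represented as the set of positions after which a story boundary falls and that k is set, as is conventional, from the average reference segment size:

# Illustrative implementations of Pk and WindowDiff (a sketch, not the official scorer).
# A segmentation is a collection of boundary positions: boundary i means a story break
# falls between textual units i and i+1 (units numbered 0..n_units-1).

def _num_boundaries(bounds, i, j):
    return sum(1 for b in bounds if i <= b < j)

def pk(ref, hyp, n_units, k):
    errors = 0
    for i in range(n_units - k):
        same_ref = _num_boundaries(ref, i, i + k) == 0
        same_hyp = _num_boundaries(hyp, i, i + k) == 0
        if same_ref != same_hyp:          # the pair (i, i+k) is inconsistently classified
            errors += 1
    return errors / (n_units - k)

def window_diff(ref, hyp, n_units, k):
    errors = 0
    for i in range(n_units - k):
        # penalise whenever the number of boundaries inside the window disagrees
        if _num_boundaries(ref, i, i + k) != _num_boundaries(hyp, i, i + k):
            errors += 1
    return errors / (n_units - k)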
7.3 News Story Segmentation Results
In this section we present performance results for each segmenter on both the CNN and Reuters test sets, with respect to the aforementioned evaluation metrics. As explained in Section 7.1, we determine the effectiveness of our SeLeCT system with respect to two other lexical cohesion-based approaches to segmentation, namely the TextTiling (Hearst, 1997) and C99 algorithms (Choi, 1999)30. A detailed description of these algorithms was given in Section 6.2. Like the SeLeCT system, they are both lexical cohesion-based segmenters; however, they only examine one form of lexical cohesion, namely lexical repetition. These types of systems work on the notion that areas of text where lexical repetition is at a minimum represent transitions between topics in the news stream. The SeLeCT system augments this hypothesis with an extended notion of relatedness, so that topic transitions are represented as areas of text where there are low numbers of repetition, lexicographical and statistical relationships between tokens, in our case noun and proper noun phrases. In addition to the C99 and TextTiling performance results, we also include results from a random segmenter that returns 25 random boundary positions for each of the 40 files in both test sets. These results were averaged over 50 random trials and represent a lower bound on segmentation performance. Furthermore, all results in this section are calculated using paragraphs as the basic unit of text. Since both our test sets are in SGML format, each of the segmentation systems makes boundary decisions based on SGML paragraph boundaries31, where the beginning of a paragraph is indicated by a speaker change tag in a CNN transcript or a paragraph tag in the case of a Reuters newswire story.
30 We use Choi’s Java implementations of TextTiling and C99, available for free download at www.cs.man.ac.uk/~choif.
31 In (Choi, 2000) boundaries are hypothesised using sentences as the basic unit of text. However, both C99 and TextTiling can take advantage of paragraph information when the input is formatted so that carriage returns indicate breaks between paragraphs.
7.3.1 CNN Broadcast News Segmentation
The graph shown in Figure 7.6 summarises the results of each segmentation system on the CNN data set, evaluated with respect to the four metrics. All values for these metrics range from 0 to 1 inclusively. However, F1 results are expressed as 1-F1 since a score of 0, in line with the other metrics, will then represent the highest measure of system performance. Consequently, the system with the lowest score in each metric is the best performing algorithm. From Figure 7.6, a visualisation of the results in Table 7.1, we can see that the accuracy of our SeLeCT segmentation algorithm is greater than the accuracy of either C99, TextTiling or the Random segmenter for all four evaluation metrics. Although many combinations of lexical cohesive relationships were experimented with, optimal performance of the SeLeCT system was achieved when only patterns of proper noun and noun repetition were examined during the boundary detection stage. For the remainder of this subsection we will comment on the segmentation style of each of the algorithms and some interesting characteristics of each of the evaluation metrics when determining segmentation accuracy.
The 1-F1 value for TextTiling gives us a prime example of how traditional IR metrics, precision and recall, fail as informative measures of segmentation performance. Under their all-or-nothing approach to measuring segmentation performance, TextTiling rates as the worst performing system, with the highest overall 1-F1 score. A breakdown of this score shows that TextTiling’s recall and precision values are very low, 27.2% and 22.8% respectively. However, these values take no account of the fact that TextTiling is producing near-misses rather than ‘pure’ false negatives, i.e. ‘just’ missing boundaries rather than failing to detect them at all. To verify this we can observe from Figure 7.7 that recall and precision percentages significantly improve as the margin of error is incremented in units of +/-1 paragraph. In the case of TextTiling, this graph strongly indicates that the system is more prone to near-misses than false negatives, as recall and precision values increase to 68.2 and 53.9 respectively at +/-1 paragraphs.
Figure 7.6: Accuracy of segmentation algorithms on CNN test set.

System       Recall %   Precision %   1 - F1   Pk      Pk′     WindowDiff
SeLeCT       53.4       55.8          0.446    0.25    0.365   0.253
TextTiling   27.9       22.4          0.752    0.259   0.425   0.299
C99          64.1       44.0          0.475    0.294   0.524   0.351
Random       7.5        7.5           0.925    0.421   0.604   0.480

Table 7.1: Precision and recall values from segmentation on concatenated CNN news stories.
Figure 7.7: Graph illustrating effects on the F1 measure as the margin of allowable error is increased for CNN segmentation results.
Section 7.2.2 explained how the WindowDiff metric corrects Pk’s over-penalisation of false negatives and near-misses. With this in mind we would expect TextTiling to perform better under WindowDiff than Pk. However, it is evident that TextTiling also suffers considerably from false positive errors, as the difference between its Pk and Pk′ (which doubles the false positive penalty) scores in Table 7.1 is relatively large. Therefore we observe, from an analysis of all four evaluation metrics, that the TextTiling segmentation style is a combination of false positives (over-segmentation) and near-misses rather than false negatives (under-segmentation).
Similarly, if we look at the values for the other two systems we see that C99 is also more prone to false positives than false negatives, while SeLeCT shows no particular bias towards producing false positives since its Pk and Pk′ scores remain relatively stable. Another interesting observation from these results is that although C99 has a much lower 1-F1 measure than TextTiling in Table 7.1, both Pk and WindowDiff rank it as the worst performing system. Taking a closer look at the results explains why this is the case. C99 returns nearly 3 times more ‘true’ false positives than TextTiling, since more of TextTiling’s false positives are in fact near-misses. This again is not reflected in the recall and precision values. However, Figure 7.7 somewhat illustrates this point by the fact that C99’s performance shows the least improvement as the margin of error increases. Overall we observe that, in spite of the fact that WindowDiff penalises each system more than Pk does, the overall ranking of the systems with respect to these two measures is the same. In the case of C99 and TextTiling, however, WindowDiff distinguishes between their levels of accuracy with more certainty than Pk does.
7.3.2 Reuters Newswire Segmentation
Table 7.2 and Figure 7.8 summarise the performance of each system on our Reuters newswire test collection. In this experiment we observe that the C99 algorithm outperforms the SeLeCT, TextTiling and Random segmenters with respect to all four evaluation metrics. Optimal performance for the SeLeCT system was once again achieved by analysing only patterns of proper noun and noun phrase repetition. Overall the results show an improvement in performance for each of the systems when segmenting concatenated Reuters news stories rather than CNN transcripts. The difference between the WindowDiff scores (improvement in performance) for C99 and SeLeCT is, however, less than was observed for the experiments on CNN transcripts, i.e. 0.12 versus 0.059 respectively. In Figure 7.9, we notice that TextTiling performance improves dramatically as the margin of error is incremented from 0 to +/-1 paragraph, which is reflected in its WindowDiff and Pk scores, ranking it a close third to the SeLeCT system. We see in Table 7.2, as in Table 7.1, that although WindowDiff penalises systems more heavily than Pk, the ranking of system accuracy remains the same. Pevzner and Hearst also comment on Pk’s sensitivity to variation in segment size in the test set. In our experiment CNN stories vary in length more than Reuters articles do. Consequently, we observe a smaller deviation between WindowDiff and Pk scores on the Reuters collection in comparison to the CNN collection.
In conclusion then, both WindowDiff and Pk attempt to represent each type of segmentation error in a single value of system accuracy. Combining different error information in a unified manner is a difficult problem that has drawn as much attention from the IR community, with the formulation of combination metrics such as the F1 and E-measures (van Rijsbergen, 1979), as it has from segmentation researchers. The main reason that recall and precision measures are useful in segmentation evaluation is that they reflect how a user who expects 100% accuracy might perceive the segmentation results. However, judging system performance rather than user satisfaction is what metrics like WindowDiff and Pk are good at, and so they also play an important role in measuring system effectiveness.
Figure 7.8: Accuracy of segmentation algorithms on Reuters test set.

System       Recall %   Precision %   1 - F1   Pk      Pk′     WindowDiff
C99          70.0       74.9          0.276    0.128   0.189   0.148
SeLeCT       60.6       79.1          0.314    0.191   0.246   0.207
TextTiling   32.1       41.0          0.640    0.221   0.291   0.244
Random       9.3        9.3           0.907    0.490   0.731   0.514

Table 7.2: Precision and recall values from segmentation on concatenated Reuters news stories.
Figure 7.9: Graph illustrating effects on the F1 measure as the margin of allowable error is increased for Reuters segmentation results.
7.3.3 The Error Reduction Filter and Segmentation Performance
In Section 7.1.1, we described the error reduction filter used to improve the performance of the SeLeCT system in the boundary detection phase. This filter works by seeking out clusters of adjacent high-scoring boundaries that are separated by less than d textual units (in our case paragraphs), and then deciding which one of these boundaries is the correct one using the heuristics discussed in Section 7.1.1. During the course of our segmentation experiments on the CNN and Reuters news corpora, we found that the optimal value for d is 7 sentences for both collections. The problem with a parameter of this nature is that the value of d could be accused of over-fitting the data set in question. More specifically, if d were the average distance between stories in the data set, then it could be argued that this element of the boundary detection process was more important than the information provided by the lexical chains. However, this is not the case for two reasons. Firstly, roughly 85% of CNN and Reuters news articles are longer than 7 sentences (see Table 5.4 for a breakdown of document lengths in the TDT1 corpus). Secondly, the error filter is only responsible for a modest increase in SeLeCT’s performance.
Figure 7.10 illustrates the degree to which the error filter improves SeLeCT performance on the CNN data set. In particular, we can see from this graph that the F1 scores at a margin of error of +/-0 units are similar (0.52 without the filter versus 0.55 with the filter). However, after this point these F1 scores tend to plateau as the margin of error increases for the SeLeCT schema without the filter. In contrast, the F1 score steadily increases when the filter is employed, meaning that it improves performance by hypothesising more near-misses than the other schema. In Figure 7.11, we can visualise this segmentation characteristic in terms of recall and precision. From this graph we can see that the ‘without-filter’ schema is capable of achieving higher recall values at the expense of precision, whereas the ‘with-filter’ schema can achieve much greater precision without causing a significant deterioration in recall.
Figure 7.10: Graph illustrating the effect of the error reduction filter on SeLeCT’s F1 measure for the CNN collection as the margin of allowable error increases.
Figure 7.11: Graph illustrating the effect of the error reduction filter on SeLeCT’s recall and precision for the CNN collection as the margin of allowable error increases.
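For completeness, the d-separation heuristic discussed above can be sketched as follows. This is a simplified rendering of the behaviour described in Section 7.1.1 (retain only the highest-scoring boundary within any group of above-threshold boundaries closer than d units apart, preferring the later boundary on ties), not the exact SeLeCT implementation:

# Simplified sketch of the error reduction filter: boundaries above the threshold that lie
# within d units of a higher-scoring (or equal-scoring, later) boundary are discarded.

def filter_boundaries(scored, d=7):
    # `scored` is a list of (position, score) pairs that already exceed the threshold
    kept = []
    for pos, score in sorted(scored):
        better = [(p, s) for p, s in scored
                  if p != pos and abs(p - pos) < d
                  and (s > score or (s == score and p > pos))]
        if not better:
            kept.append(pos)
    return kept

# e.g. filter_boundaries([(10, 6), (11, 4), (30, 5), (34, 5)], d=5) keeps positions 10 and 34.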
7.3.4 Word Associations and Segmentation Performance
As stated in Sections 7.3.1 and 7.3.2, optimal SeLeCT performance was achieved on the CNN and Reuters test sets when only repetition relationships were used to determine topic shifts during the boundary detection phase32. Figure 7.12 illustrates how segmentation performance deteriorates with the use of additional lexical cohesive relationships. The graph also shows that this trend is consistent across spoken and written forms of news stories.
32 Similar SeLeCT performance is reported in (Stokes, Carthy, Smeaton 2002). However, this preliminary system did not include ‘fuzzy’ syntactic matching of noun phrases (Section 3.2). In that paper optimal performance was achieved when story boundaries were determined by examining patterns of repetition and WordNet-based relationships in the text. Furthermore, although the same CNN stories are involved in both papers, the (Stokes, Carthy, Smeaton 2002) evaluation was run on a single file of 1000 CNN stories rather than 40 files each containing 25 stories. Splitting the corpus was necessary for the experiments reported here and in (Stokes, 2003) as Choi’s C99 program was implemented to handle only small data sets.
Figure 7.12: Graph showing the effect of word relationships on segmentation accuracy (WindowDiff scores for the CNN and Reuters collections plotted against the relationship-type combinations R; R,S; R,S,C; R,C; R,S,S/G,P/W; and R,S,S/G,P/W,C). R=Repetition, S=Synonymy, C=Co-occurrences or statistical word associations, S/G=Specialisation/Generalisation, P/W=Part/Whole.
Although this is a disappointing and somewhat counterintuitive result, a closer examination of the effect of using weaker semantic relationships to segment text into topics reveals why the use of these relationships, identified by the lexical chains, is inappropriate in a text segmentation application. Firstly, cohesion and coherence are independent, which means that cohesion can exist in sentences that are not related coherently (Morris, Hirst, 1991).
[1] Dwi Sumadji, who was released yesterday after a judge decided there was insufficient evidence against him, said he was willing to testify in any hearings on the case relating to his imprisonment.
[2] Dr Ian Wilmut, the head of the Roslin Institute in Edinburgh, released results earlier today proving that Dolly the sheep and her donor’s DNA were identical, using the same DNA typing technique that is now accepted as standard in most courts.
Repetition-based Chains: {DNA, DNA typing technique}
Repetition + WordNet-based Chains: {DNA, DNA typing technique} {judge, case}
Repetition + WordNet + Statistical Word Association-based Chains: {judge, evidence, hearing, case, DNA, DNA typing technique, court}
Figure 7.13: Example of the effect of weak semantic relationships on the segmentation process.
For example, consider the text extract in Figure 7.13, which contains two unrelated sentences, one taken from a story on ‘Dwi Sumadji’s release from prison’ and the other on ‘Dolly the sheep’. Below these sentences are three sets of lexical chains generated from this piece of text using a variety of word relationships: repetitions only, repetitions and WordNet relationships, and repetitions, WordNet relationships and collocations. Looking only for repetition relationships between these sentences we see that they have no nouns in common.
Looking for WordNet relationships between these sentences, we find that although ‘judge’ and ‘case’ can be related by following paths in the taxonomy, the related word ‘court’ in the second sentence is not identified. However, when statistical word associations are included in the chaining process we find that ‘evidence’ in the first sentence is related to ‘DNA’ in the second, and ‘court’, ‘hearing’ and ‘case’ are also added into the same chain. It is evident from this example that weaker semantic relationships can add noise to the detection process by blurring topic shifts between news stories. Although this example only highlights the negative effect of collocations on segmentation accuracy, there are also many instances where WordNet relationships are responsible for finding similar weak associations between words.
The second reason why the weaker lexical cohesive relationships identified by the chains fail to improve segmentation accuracy is that word interpretations must occur in context. More specifically, when WordNet or statistical word relationships are used to analyse cohesion in concatenated news stories, spurious chains are more likely to be generated in a text that is disjoint and incoherent. For example, consider the following two seemingly unrelated sentences: ‘There are approximately 336 dimples on a golf ball.’ ‘They finally had all the wrinkles in the plan pretty much ironed out.’ In this example, the lexical chaining algorithm incorrectly associates the words ‘dimples’ and ‘wrinkles’ by a specialisation relationship with the WordNet concept ‘depression, impression or imprint’. However, it is obvious that ‘wrinkle’ in the second sentence is being used in the ‘minor difficulty’ sense of the word. Therefore, we can say that if our chaining algorithm finds a lexicographical or statistical association in a fixed context (i.e. a single news story) then we can assume that this relationship is reliable; otherwise we cannot.
7.4 Written versus Spoken News Story Segmentation
It is evident from the results of our segmentation experiments on the CNN and Reuters test collections that system performance is dependent on the type of news source being segmented, i.e. spoken texts are more difficult to segment. This disagreement between result sets is a largely unsurprising outcome, as it is well documented by the linguistic community that written and spoken language modes differ greatly in the way in which they convey information. At first glance, it is obvious that written texts tend to use more formal and verbose language than their spoken equivalents. However, although CNN transcripts share certain spoken text characteristics, they lie somewhere nearer written documents on a spectrum of linguistic forms of expression, since they contain a mixture of speech styles ranging from formal prepared speeches from anchor people, politicians, and correspondents, to informal interviews and comments from ordinary members of the public. Furthermore, spoken language is also characterised by false starts, hesitations, backtrackings, and interjections; however, information regarding prosodic features and these characteristics is not represented in CNN transcripts. In this section we look at some grammatical differences between spoken and written text that are actually evident in CNN transcripts. In particular, we look at the effect that these differences have on part-of-speech distributions, and how these impact segmentation performance.
7.4.1 Lexical Density
One method of measuring the grammatical intricacy of speech compared to written text is to calculate the lexical density of the language being used. The simplest measure of lexical density, as defined by Halliday (1985), is ‘the number of lexical items (content words) as a portion of the number of running words (grammatical words)’. Halliday states that written texts are more lexically dense while spoken texts are more lexically sparse. In accordance with this we observe, based on part-of-speech tag information, that the CNN test set contains 8.58% fewer lexical items than the Reuters news collection33. Halliday explains that this difference in lexical density between the two modes of expression can be attributed to the following observation: written language represents phenomena as products, while spoken language represents phenomena as processes. In real terms this means that written text tends to convey most of its meaning through nouns (NN) and adjectives (ADJ), while spoken text conveys it through adverbs (ADV) and verbs (VB). To illustrate this point consider the following written and spoken paraphrases of the same information:
Written: Improvements/NN in American zoos have resulted in better living/ADJ conditions for their animal residents/NN.
Spoken: Since/RB American zoos have been improved/VB the animals residing/VB in them are now/RB living/VB in better conditions.
33 Lexical items included all nouns, adjectives and verbs, except for function verbs like modals and auxiliary verbs. Instead these verbs form part of the grammatical item lexicon with all remaining parts of speech. Our CNN and Reuters data sets consist of 43.68% and 52.26% lexical items respectively.
Although this example is a little contrived, it shows that in spite of changes to the grammar, by and large the vocabulary has remained the same. More specifically, these paraphrases illustrate how the products in the written version, improvements, residents, and living, are conveyed as processes in spoken language through the use of verbs. The spoken variant also contains more function words, in particular the two adverbs ‘now’ and ‘since’, where adverbs are a grammatical necessity that provides cohesion to text when processes are being described in verb clauses. So looking at the ratio of function words in the written and spoken forms, we find that for every one function word in the written text there are 1.8 function words in the spoken form, i.e. 1:1.8. On the other hand, the ratio of content words is almost one for one.
Lexical Density and SeLeCT
As explained in Section 3.2, the LexNews chaining algorithm, used by the SeLeCT segmenter, only looks at cohesive relationships between nouns, proper nouns and nominalised adjectives in a text. This accounts partly for SeLeCT’s lower performance on the CNN test set, since the extra information conveyed through verbs in spoken texts is ignored by the lexical chainer. The simplest solution to this problem is to repeat the SeLeCT experiments, this time including all verbs (except function verbs such as modals) in the chaining process. The best method of dealing with morphological variations between same-stem verbs is to reduce all verbs to their root form during tokenisation. This ensures that irregular verbs such as ‘to ring’ (ring - rang - rung) or ‘to swim’ (swim - swam - swum) will appear syntactically equivalent during the chaining process.
Lexical Density and SeLeCT

As explained in Section 3.2, the LexNews chaining algorithm, used by the SeLeCT segmenter, only looks at cohesive relationships between nouns, proper nouns and nominalised adjectives in a text. This accounts partly for SeLeCT’s lower performance on the CNN test set, since the extra information conveyed through verbs in spoken texts is ignored by the lexical chainer. The simplest solution to this problem is to repeat the SeLeCT experiments, this time including all verbs (except function verbs such as modals) in the chaining process. The best method of dealing with morphological variations between same-stem verbs is to reduce all verbs to their root form during tokenisation. This ensures that irregular verbs such as ‘to ring’ (ring - rang - rung) or ‘to swim’ (swim - swam - swum) will appear syntactically equivalent during the chaining process. So, excluding these irregular verbs (WordNet lists these exceptions), all verbs are reduced to their root form by the tokeniser using standard inflection-derived rules. In Section 2.3, we briefly discussed the difficulty of finding semantic relationships between verbs using the WordNet taxonomy; consequently, we only examine repetition relationships between these parts of speech.

Table 7.3 shows the negative effect on segmentation performance when verb stems (excluding function verbs) are included in the chaining process. More specifically, SeLeCT’s performance deteriorates by 3.1% on the CNN collection and 4.3% on the Reuters collection. From the results of this experiment we observe that the standard set of textual function verbs is not enough for speech text processing tasks and that these lists should be extended to include other common ‘low information’ verbs. These types of verbs are not necessarily characterised by large frequency counts in the spoken news collection, such as the domain-specific verbs ‘to report’ or ‘to comment’. Instead, these verbs tend to have no equivalent nominal form, such as the verbs ‘to let’, ‘to hear’, ‘to look’ or ‘to try’. With this in mind, we repeated this experiment, this time including only nominalised verbs, the usual proper noun and noun phrases, and nominalised adjectives in the chaining process. As expected, these experimental results, presented in the last row of Table 7.3, show a 1.2% decrease in system error on the CNN collection over the initial SeLeCT system. A similar decrease in error on the Reuters test collection was not observed, since written text conveys most of its meaning through the use of nouns, so verbs can be ignored with little or no effect on segmentation performance in the context of this experiment.

System                               CNN WindowDiff        ∆ Error    Reuters WindowDiff    ∆ Error
                                     Before    After                  Before    After
SeLeCT (stopped and stemmed verbs)   0.253     0.284       +3.1%      0.207     0.250       +4.3%
SeLeCT (nominalised verbs)           0.253     0.241       −1.2%      0.207     0.209       +0.2%

Table 7.3: Results of SeLeCT segmentation experiments when verbs are added into the chaining process.
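The verb-to-root reduction step described above can be sketched as follows, using WordNet's morphological processor via NLTK; WordNet's own exception lists handle the irregular forms. This is an illustration of the idea only, not the LexNews tokeniser itself.

```python
# A minimal sketch of reducing verbs to their root form before chaining,
# using WordNet's morphological processor via NLTK; WordNet's exception
# lists already map irregular forms such as "rang"/"rung" to "ring".
# This illustrates the idea only and is not the LexNews tokeniser itself.
from nltk.corpus import wordnet as wn

def verb_root(token):
    root = wn.morphy(token.lower(), wn.VERB)
    return root if root is not None else token.lower()

for form in ["rang", "rung", "swam", "swum", "improved", "residing"]:
    print(f"{form} -> {verb_root(form)}")
```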
Lexical Density, C99 and TextTiling

Comparing Tables 7.1 and 7.2, as in the case of the SeLeCT results, we also notice a decrease in C99 and TextTiling segmentation performance on the CNN collection compared with the Reuters collection results. For the SeLeCT system, we concluded that this performance difference was caused by the loss of valuable topic information when ‘nominalisable’ verbs are excluded from the chain creation phase. However, since C99 and TextTiling use all parts of speech in their analysis of the text, the replacement of products (nouns) with processes (verbs) is not the reason for a similar deterioration in their performance. More specifically, both C99 and TextTiling rely on stopword lists to eliminate spurious inter-segment links between function words that by their nature do not indicate common topicality. For the purpose of their original implementations, these stopword lists contained mostly pronouns, determiners, adverbs, and function verbs such as auxiliary and modal verbs. However, we observed from the SeLeCT results in Table 7.3 that the standard set of textual function verbs is not enough for speech text processing tasks. In order to observe if a similar improvement in results could be achieved, we re-ran the C99 and TextTiling experiments on the Reuters and CNN collections, using only nouns, adjectives, nominalised verbs (provided by NOMLEX (Meyers et al., 1998)), and nominalised adjectives as input. Alternatively, we could have provided these systems with an extended stopword list that included general stopwords and ‘low information’ verbs; however, it is more desirable and effective in this case to limit the input of the system to content words only.

Our results in Table 7.4 show that there is a decrease in the WindowDiff error for the C99 system on both the CNN collection (an 8.3% reduction in error) and the Reuters collection (a 2.7% reduction in error). Similarly, we observe an improvement in the WindowDiff-based performance of the TextTiling system on the CNN data set (a 2.5% reduction in error). However, we observe a marginal fall in performance on the Reuters data set (a 0.3% increase in error). These results again illustrate the increased dominance of verbs in spoken text and the importance of function verb removal by our verb nominalisation process for CNN segmentation performance.

System                             CNN WindowDiff        ∆ Error    Reuters WindowDiff    ∆ Error
                                   Before    After                  Before    After
C99 (nominalised verbs)            0.351     0.268       −8.3%      0.148     0.121       −2.7%
TextTiling (nominalised verbs)     0.299     0.274       −2.5%      0.244     0.247       +0.3%

Table 7.4: Results of C99 and TextTiling segmentation experiments when nominalised verbs are added into the segmentation process.
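A minimal sketch of this content-word filtering of the segmenter input follows, assuming NLTK for POS tagging and lemmatisation. The NOMINALISABLE_VERBS set stands in for a NOMLEX-style lexicon and its entries are purely illustrative.

```python
# A minimal sketch of restricting segmenter input to content words (nouns,
# adjectives and nominalisable verbs), assuming NLTK for POS tagging and
# lemmatisation; NOMINALISABLE_VERBS stands in for a NOMLEX-style lexicon
# and its entries are illustrative only.
import nltk
from nltk.corpus import wordnet as wn

NOMINALISABLE_VERBS = {"improve", "appoint", "investigate", "report"}  # hypothetical entries

def content_words(sentence):
    kept = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if tag.startswith(("NN", "JJ")):               # nouns, proper nouns, adjectives
            kept.append(word.lower())
        elif tag.startswith("VB"):                      # verbs: keep only nominalisable ones
            root = wn.morphy(word.lower(), wn.VERB) or word.lower()
            if root in NOMINALISABLE_VERBS:
                kept.append(root)
    return kept

print(content_words("Since American zoos have been improved the animals are living in better conditions."))
```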
7.4.2 Reference and Conjunction in Spoken Text

‘A picture paints a thousand words’, and since news programme transcripts are accompanied by visual and audio cues in the news stream, there will always be a loss in communicative value when transcripts are interpreted independently. It is well known that conversational speech is accompanied by prosodic and paralinguistic contributions (facial expressions, gestures, intonation, etc.), which are rarely conveyed in spoken transcripts. However, there are also explicit (exophoric) references in the transcript to events occurring outside the lexical system itself. These exophoric references in CNN transcripts relate specifically to audio references such as speaker change, musical interludes and background noise, and visual references such as event, location and people shots in the video stream. All of these exophoric cues help to give context to the dialogue in a news report and commonly contain information which is not always repeated in the accompanying transcript. In particular, this ‘spoken’ news story characteristic is repeatedly seen in human-interest stories and entertainment reports, which are less structured than other news reports in the collection. Although some of these exophoric cues are explicitly tagged in TDT news transcripts (e.g. speaker change), TDT segmenters have only made use of this information to identify boundaries between sentences. For example, Figure 7.14 contains an extract from a CNN story on the movie “The Shadow”. One of the first things one notices is that much of the text produced by the film clip is largely irrelevant. In the context of document indexing the addition of this text would by and large have little effect on the term frequencies and idf scores. However, in story segmentation these types of interludes often appear to automatic segmenters as areas of dissimilarity in the text, which can in turn lead to incorrect story boundary assignments. Consequently, identifying and ignoring these text units during boundary detection may improve segmentation performance and be a fruitful area for future speech-based news story segmentation research. However, with respect to the deterioration in segmentation performance on the CNN test collection in our experiments, we believe that this property of transcribed news is a contributory factor.

[Film clip from ‘The Shadow’]
LONE, stars as Shiwan Khan: How did you know what was happening to me? How did you know who I am?
ALEC BALDWIN, stars as the Shadow: The Shadow knows.
CHARLIE COATS, News Correspondent: Of course he knows. That's the whole idea. The Shadow is the super hero who has some loosely-defined mind-reading powers and a dark side - something about evil that lurks in the hearts of men. But he's a good guy and a snappy dresser.
[Film clip from ‘The Shadow’]
LONE: That is a lovely tie, by the way. May I ask where you acquire it?
BALDWIN: Brooks Brothers.
LONE: Is that midtown?
BALDWIN: Forty-fifth and Madison. You are a barbarian.
LONE: Thank you.

Figure 7.14: CNN transcript of movie review with speaker identification information.

In addition to the occurrence of exophoric references, speech transcripts also contain examples of endophoric (anaphora and cataphora) reference. Resolving endophoric reference has long been recognised as a very difficult problem, requiring pragmatic, semantic and syntactic knowledge. However, there are simple heuristics commonly employed by text segmentation algorithms that we use to take advantage of the increased presence of this form of reference in spoken text. One such heuristic is based on the observation that when common referents such as personal and possessive pronouns, and possessive determiners, appear at the beginning of a sentence, this indicates that these referents are linked in some way to the previous textual unit (in our case the previous paragraph). The resolution of these references is not of interest to our algorithm, but the fact that two textual units are linked in this way gives the boundary detection process an added advantage when determining story segments in the text. In addition, an analysis of conjunction (another form of textual cohesion) can also be used to provide the detection process with useful evidence of related paragraphs, since paragraphs that begin with conjunctions (because, and, or, however, nevertheless) and conjunctive phrases (in the meantime, in addition, on the other hand) are particularly useful in identifying cohesive links between units in conversational/interview sequences in the transcript.
7.4.3 Refining SeLeCT Boundary Detection

In Section 7.1.1, we described in detail how the boundary detection phase uses lexical chaining information to determine story segments in a text. One approach to integrating referential and conjunctive information with the lexical cohesion analysis provided by the chains is to remove all paragraphs from the system output that contain a reference or conjunctive relationship with the paragraph immediately following them in the text. The problem with this approach is that Pk and WindowDiff errors will increase if ‘incorrect’ segment end points are removed that represented near system misses rather than ‘pure’ false positives. Hence, we take a more measured approach to integration that uses conjunctive and referential evidence in the final filtering step of the detection phase, to eliminate boundaries in boundary clusters (Section 7.1.1) that cannot be story end points in the news stream. Figure 7.15 illustrates how this technique can be used to refine the filtering step. Originally, the boundary with score six in region A would have been chosen as the correct boundary point. However, since a conjunctive phrase links the adjacent paragraphs at this boundary position in the text, the boundary which scores five is deemed the correct boundary point by the algorithm.

[Figure: a sequence of candidate boundary scores (6 5 4 0 5 5 3 0 0 0 0 0 0 0) with boundary clusters marked A, B and C.]
Figure 7.15: Diagram illustrating how cohesion information can help SeLeCT’s boundary detector resolve clusters of possible story boundaries.

Using this technique and the verb nominalisation process described in Section 7.4.1 on both news media collections, we observed an improvement in SeLeCT system performance on the CNN data set (a decrease in error from 0.241 to 0.225), but no such improvement on the Reuters collection. Again, the ineffectiveness of this technique on the Reuters result can be attributed to differences between the two modes of language expression, where conjunctive and referential relationships resolve 51.66% of the total possible set of boundary points between stories in the CNN collection and only 22.04% in the Reuters collection. In addition, these references in the Reuters articles mostly occur between sentences in a paragraph rather than between paragraphs in the text, thus providing no additional cohesive information.
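The refinement just described can be sketched as follows: a paragraph opening with a referent or conjunctive phrase is treated as tied to the previous paragraph, so a story boundary cannot be placed directly before it, and the best-scoring admissible candidate in each boundary cluster is kept. The word lists and data structures below are illustrative, not the SeLeCT implementation.

```python
# A minimal sketch of the refinement described above: a paragraph that opens
# with a referent or a conjunctive phrase is treated as tied to the previous
# paragraph, so a story boundary cannot be placed directly before it. The word
# lists and data structures are illustrative, not the SeLeCT implementation.
REFERENT_STARTERS = {"he", "she", "it", "they", "his", "her", "its", "their",
                     "this", "that", "these", "those"}
CONJUNCTIVE_STARTERS = ("because", "and", "or", "however", "nevertheless",
                        "in the meantime", "in addition", "on the other hand")

def tied_to_previous(paragraph):
    """True if the paragraph begins with a referent or conjunctive phrase."""
    words = paragraph.lower().split()
    if not words:
        return False
    opening = " ".join(words[:4])
    return words[0] in REFERENT_STARTERS or opening.startswith(CONJUNCTIVE_STARTERS)

def resolve_cluster(cluster, paragraphs):
    """cluster: list of (boundary_index, score), where a boundary at index i
    falls between paragraphs[i] and paragraphs[i + 1]. Returns the highest
    scoring admissible boundary index, or None if every candidate is ruled out."""
    admissible = [(i, score) for i, score in cluster
                  if i + 1 < len(paragraphs) and not tied_to_previous(paragraphs[i + 1])]
    if not admissible:
        return None
    return max(admissible, key=lambda pair: pair[1])[0]
```

In the Figure 7.15 example, the score-six candidate in region A would be ruled out because the paragraph following it opens with a conjunctive phrase, and the score-five candidate would be returned instead.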
A summary of the improved results discussed in this section and in Section 7.4.1 is shown in Table 7.5.34 This table is followed by two tables (Tables 7.6 and 7.7) containing the results of a paired samples t-test for each pair of systems in our evaluation. In these tables, the symbol * indicates that the difference in WindowDiff scores between the two systems is statistically significant at the 95% level, while ** indicates statistical significance at the 99.9% level. These results are based on a two-sided t-test of the null hypothesis of equal means, where all tests are performed on 39 degrees of freedom, i.e. a sample size of 40. Table 7.6 shows that our original SeLeCT system is more accurate than both the original TextTiling and C99 systems. However, Table 7.7 shows that after refinements were made to the systems (see Section 7.4.1), only the difference in means between the SeLeCT and TextTiling systems is deemed to be statistically significant (to a level of 95% confidence). Hence, SeLeCT, C99 and TextTiling performance on the CNN collection is equivalent. Further experimentation on a larger CNN test collection might help to distinguish between the performance of these systems. With regard to the Reuters results, the system refinements discussed in Section 7.4.1 were shown to have made no impact on segmentation performance. Hence, Table 7.6 contains the original system results, where a paired samples t-test shows that all differences are statistically significant to a level of 99.9% confidence. Consequently, C99 performs best, followed by SeLeCT and then TextTiling on the Reuters collection.

34 The C99, TextTiling and SeLeCT implementations yield optimal results using the following parameters. On both the CNN and Reuters data sets C99 was run with an 11 x 11 ranking mask as suggested in (Choi, 2000), while TextTiling runs best with window size = 300 and step size = 20 on the CNN test set, and window size = 300 and step size = 30 on the Reuters test set. For a more detailed explanation of these parameters see (Choi, 2000) and the README file which accompanies Choi’s version of TextTiling, which can be downloaded at http://www.cs.man.ac.uk/~mary/choif/frame.html (checked March 2004). SeLeCT yielded optimal performance on the CNN test set using x = 2, distance = 7, and a maximum extra strong or repetition-based relationship distance of 750 words, while on the Reuters test set the following parameter settings were used: x = 1, distance = 7 and a maximum extra strong relationship distance of 400 words, which was referred to as parameter m in Section 7.1.1.

System        CNN WindowDiff       ∆ Error    Reuters WindowDiff    ∆ Error
              Before    After                  Before    After
SeLeCT        0.253     0.225      −2.8%      0.207     0.209       +0.2%
C99           0.351     0.268      −8.3%      0.148     0.121       −2.7%
TextTiling    0.299     0.274      −2.5%      0.244     0.247       +0.3%

Table 7.5: Improvements in system performance as a result of the system modifications discussed in Sections 7.4.1 and 7.4.3.

Initial Results:            CNN                       Reuters
Paired Samples T-Test       t-statistic   p-value     t-statistic   p-value
SeLeCT – C99                -6.802        0.00**      7.406         0.00**
SeLeCT – TextTiling         -6.464        0.00**      -4.911        0.00**
C99 – TextTiling            3.051         0.04*       -11.916       0.00**

Table 7.6: Paired samples t-test on the initial results from Tables 7.1 and 7.2. All results marked with * are statistically significant to 95% confidence and those marked with ** are statistically significant to 99.9% confidence.

Refined Results:            CNN
Paired Samples T-Test       t-statistic   p-value
SeLeCT – C99                -1.537        0.13
SeLeCT – TextTiling         -2.99         0.05*
C99 – TextTiling            -0.66         0.51

Table 7.7: Paired samples t-test on the refined results taken from Table 7.5. All results marked with * are statistically significant to 95% confidence.
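The significance figures above come from paired comparisons of per-file WindowDiff scores. A minimal sketch of such a paired, two-sided t-test follows, assuming SciPy and using placeholder score arrays rather than the actual experimental values.

```python
# A minimal sketch of the paired, two-sided t-test used to compare per-file
# WindowDiff scores (40 files of 25 stories each); the score arrays here are
# placeholders, not the actual experimental values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
select_scores = rng.normal(0.25, 0.05, size=40)      # hypothetical per-file WindowDiff
texttiling_scores = rng.normal(0.30, 0.05, size=40)  # hypothetical per-file WindowDiff

t_stat, p_value = stats.ttest_rel(select_scores, texttiling_scores)
print(f"t = {t_stat:.3f} on {len(select_scores) - 1} degrees of freedom, p = {p_value:.4f}")
```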
7.5 Discussion

In this chapter we described our lexical chain-based approach to news story segmentation, the SeLeCT system. This system uses the LexNews chaining algorithm to build a set of lexical chains from a concatenated set of news stories. The start and end points of these chains in the text are then used to discern where topic transitions or story boundaries occur in the text. On a CNN news story collection of spoken news transcripts, using the WindowDiff metric as a means of determining segmentation accuracy, we found that the SeLeCT system outperformed the C99 and TextTiling algorithms. However, on a similar Reuters news collection of concatenated written newswire documents the best performing system was the C99 algorithm, followed by the SeLeCT system and then the TextTiling algorithm. In both experiments the SeLeCT system performed best when only patterns of repetition were analysed. The lack of success of the weaker semantic relationships (WordNet relationships and statistical word associations) in determining boundaries between news stories was attributed to the fact that a text in this application is a non-coherent collection of paragraphs, and lexical chains can only be reliably built from a coherent text. Otherwise spurious chains are created which add noise to the boundary detection step. Also, we noted that cohesion and coherence are independent textual properties, so even unrelated sentences can have correctly identified cohesive ties between them, making text segmentation an unsuitable application for a full analysis of lexical cohesive patterns in text.

The deterioration in segmentation accuracy of all three systems on the spoken news collection was explained in terms of the propensity of written text to express phenomena as products (i.e. nouns), in contrast to speech where phenomena are more commonly expressed as processes (i.e. verbs). With respect to the SeLeCT system, we observed that by including nominalised verbs in the segmentation process the accuracy of the algorithm improved. Further improvements were also observed when all non-nominalisable verbs were eliminated from the TextTiling and C99 input. These verbs tend to be less ‘informative’ and more commonly occurring; therefore they are considered additional noise in the segmentation process. The final refinement to the SeLeCT boundary detection step involved the inclusion of reference and conjunction information, which helped to improve segmentation performance. On the other hand, none of these refinements improved the performance of any of the systems on the Reuters test collection. Also, the improvement in performance of the TextTiling and C99 systems on the CNN collection resulted in the SeLeCT system only marginally outperforming these systems in the end.

In this chapter we described the results of our own segmentation evaluation methodology. However, as Section 4.2.2 stated, News Story Segmentation is an official TDT task. As a result, the official TDT1 pilot study evaluation provides a means of determining segmentation accuracy on the TDT1 corpus. Like our evaluation, it uses an error metric (an earlier version of the Pk metric) to directly evaluate the ability of each system to determine boundaries between stories. However, the TDT1 segmentation evaluation is also based on an indirect measurement of segmentation quality with respect to the effect on event tracking performance of automatically segmented news stories. An interesting future experiment would be to re-evaluate the SeLeCT system in this more comprehensive evaluation methodology. However, this evaluation could not involve Choi’s implementation of the C99 algorithm because it only supports segmentation of short documents, and the TDT1 evaluation requires the segmentation of three streams: the Reuters news stream, the CNN news stream and the entire TDT1 collection. As already stated, in order to include the C99 system in our own segmentation evaluation we had to calculate performance as the average performance of each system on 40 files each containing 25 news stories. Another advantage of using our own evaluation format was that it allowed us to determine to what extent repetition-based segmenters are useful in News Story Segmentation. In comparison, the systems that have been involved in the official TDT segmentation task have primarily focussed on domain-specific techniques, like those described in Section 6.2.3, that are trained on news data and are sensitive to the occurrence of domain-specific cues in the news text, e.g. ‘news just in’. By comparing our method with other domain-independent repetition-based segmenters, we were able to directly establish how well lexical chains perform with respect to these approaches, and to explore the effect of broader lexical cohesive relations (i.e. WordNet and statistical relationships) on the segmentation process.
Another interesting area for investigation would be the combination of SeLeCT’s segmentation evidence with domain-specific information using one of the multi-source statistical models described in Section 6.2.3. In addition, it is not clear from our experiments how well the SeLeCT system would perform on error-prone ASR news transcripts. This would also be an interesting area for further research, since ASR transcripts were shown to have significantly affected the performance of our lexical chain-based New Event Detection system, LexDetect, in Chapter 5.

Chapter 8 News Story Gisting

In this chapter we discuss some promising initial results obtained from our final application of lexical cohesion analysis: News Story Gisting. A gist is a very short summary, ranging in length from a single phrase to a sentence, that captures the essence of a piece of text in much the same way as a title or section heading in a document helps to convey the text’s central message to a reader. Like News Story Segmentation and New Event Detection, News Story Gisting is a prerequisite for the successful organisation and presentation of news streams to users. News Story Gisting also represents another interesting and novel application of the LexNews algorithm in the broadcast news domain. In this chapter we describe the results of some on-going collaborative work with the Dublin City University (DCU) Centre for Digital Video Processing. More specifically, this part of our research focuses on the creation of news story gists for streams of news programmes used in the DCU Físchlár-News-Stories system (Smeaton et al., 2003): a multi-media system that allows users to search, browse and play individual news stories from Irish television news programmes. In its current incarnation the Físchlár-News-Stories system segments video news streams using audio and visual analysis techniques. Like all real-world applications, these techniques will at times place erroneous story boundaries in the resultant segmented video stream. In addition, since the closed caption material accompanying the video is generated live during the broadcast, a time lag exists between the discussion of an item of news in the audio stream and its appearance in the teletext on the video stream. Consequently, segmentation errors will be present in the closed caption stream, where for example the end of one story might be merged with the beginning of the next story. Previous work in this area, undertaken at the DUC summarisation workshops35 and by other research groups, has predominantly focussed on generating gists from clean data sources such as newswire (Witbrock, Mittal, 1999), thus avoiding the real issue of developing techniques that can deal with the erroneous data that underlies this problem. The work described in this chapter was published in (Stokes et al., 2004b).

35 Document Understanding Conferences (DUC): www-nlpir.nist.gov/projects/duc/intro.html

8.1 Related Work

Automatic text summarisation is the task of generating a more concise version of a source text while trying to retain the essence of its original information content. Summaries range in sophistication from simple extractions to more complex abstractions. In the case of extractions, a text summariser simply returns the set of sentences (verbatim) that it believes represents the central theme of a document.
Abstractions, on the other hand, involve deep-level textual analysis and subsequent paraphrasing of an extract into a more coherent whole. Our approach to gisting is extractive. More specifically, the LexGister system determines a representative sentence for a text based on the strength of the lexical cohesive relationships between that sentence and the rest of the text. In our experimental methodology, Section 8.3, we determine the performance of the LexGister system with respect to a random extractor, a lead sentence extractor and a tf.idf approach to the problem. Other notable extractive gisting approaches discussed in the literature include Kraaij et al.’s (2002) probabilistic approach, Alfonseca et al.’s (2003) genetic algorithmic approach, and Copeck et al.’s (2003) approach based on the occurrence of features that denote appropriate summary sentences. These lexical, syntactic and semantic features include the occurrence of discourse cues, the position of the sentence in the text, and the occurrence of content phrases and proper nouns. Biasing the extraction process with additional textual information such as these features is a standard approach to headline generation that has proved to be highly effective in most cases (Kraaij, 2002; Alfonseca et al., 2003; Copeck et al., 2003; Zhou, Hovy, 2003). An alternative to extractive gisting approaches is to view the title generation process as being analogous to statistical machine translation. Witbrock and Mittal’s (1999) paper on ‘ultra-summarisation’ was one of the first attempts to generate headlines based on statistical learning methods that make use of large amounts of training data. More specifically, during title generation a news story is ‘translated’ into a more concise version using the Noisy Channel model. The Viterbi algorithm is then used to search for the most likely sequence of tokens in the text that would make a readable and informative headline. This is the approach adopted by Banko et al. (2000), Jin and Hauptmann (2001), Berger and Mittal (2000) and most recently by Zajic and Dorr (2002). In the following section we describe our lexical chain-based gisting approach.

8.2 The LexGister System

Like the LexDetect and SeLeCT systems, our news gister, the LexGister system, uses the same candidate term selection and LexNews chaining algorithm when generating a news story summary. Once lexical chains have been generated, the next step is to identify the most important or highest scoring proper noun and noun chains for a story. This step is necessary as it helps to identify the central themes in the text by discarding cohesively weak chains. The overall cohesive strength of a chain is measured with respect to the strength of the relationships between the words in the chain. Table 3.6 in Section 3.5 showed the strength of the scores assigned to each cohesive relationship type participating in the chaining process, i.e. repetition = 1; synonymy = 0.9; antonymy, hyponymy, meronymy, holonymy, and hypernymy = 0.7; path lengths greater than 1 in WordNet = 0.4; statistical word associations = 0.4. The chain weight, score(chain), then becomes the sum of these relationship scores, which is defined more formally as follows:

score(chain) = \sum_{i=1}^{n} ( reps(i) + rel(i, j) )    (8.1)

where i is the current chain word in a chain of length n, reps(i) is the number of repetitions of term i in the chain, and rel(i, j) is the strength of the relationship between term i and the term j, where j was deemed related to i during the chaining process. For example, the chain {hospital, infirmary, hospital, hospital} would be assigned a score of ((reps(hospital) + rel(hospital, infirmary)) + (reps(infirmary) + rel(infirmary, hospital))) = 5.8, since ‘infirmary’ and ‘hospital’ are synonyms.
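The following is a minimal sketch of this scoring scheme, using the relationship weights quoted above; the chain representation (term, repetition count, linking relation) is illustrative rather than LexGister's own data structure.

```python
# A minimal sketch of the chain scoring scheme in Equation 8.1, using the
# relationship weights quoted above. The chain representation (term,
# repetition count, linking relation) is illustrative, not LexGister's own.
REL_SCORES = {"repetition": 1.0, "synonymy": 0.9, "antonymy": 0.7,
              "hyponymy": 0.7, "meronymy": 0.7, "holonymy": 0.7,
              "hypernymy": 0.7, "wordnet_path": 0.4, "statistical": 0.4}

def score_chain(chain):
    """chain: list of (term, repetitions, relation) where `relation` is the
    cohesive relationship that linked the term into the chain."""
    return sum(reps + REL_SCORES.get(relation, 0.0)
               for _term, reps, relation in chain)

# The {hospital, infirmary, hospital, hospital} example worked through above:
chain = [("hospital", 3, "synonymy"), ("infirmary", 1, "synonymy")]
print(score_chain(chain))  # (3 + 0.9) + (1 + 0.9) = 5.8
```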
Chain scores are not normalised, in order to preserve the importance of the length of the chain in the score(chain) calculation. Once all chains have been assigned a score, the highest scoring proper noun chain and noun chain are retained for the next step in the extraction process. If the highest score is shared by more than one chain (of either chain type) then these chains are also retained. Once the key noun and proper noun phrases have been identified, the next step is to score each sentence in the text based on the number of key chain words it contains, as follows:

score(sentence) = \sum_{i=1}^{n} score(chain)_i    (8.2)

where score(chain)_i is zero if word i in the current sentence of length n does not occur in one of the key chains; otherwise score(chain)_i is the score assigned to the chain in which i occurred. Once all sentences have been scored and ranked, the highest ranking sentence is then extracted and used as the gist for the news article.36 This final step in the extraction process is based on the hypothesis that the key sentence in the text will contain the most key chain words. This is analogous to saying that the key sentence will be the sentence that is most cohesively strong with respect to the rest of the text. If it happens that more than one sentence has been assigned the maximum sentence score then the sentence nearest the start of the story is chosen, since lead sentences in a news story tend to be better summaries of its content. Another consideration in the extraction phase is the occurrence of dangling anaphors in the extracted sentence, e.g. references to pronouns such as ‘he’ or ‘it’ that cannot be resolved within the context of the sentence. In order to address this problem we use a common heuristic which states that if the gist begins with a pronoun then the previous sentence in the text is chosen as the gist. However, when we tested the effect of this heuristic on the performance of our algorithm we found that the improvement was insignificant. We have since established that this is the case because the extraction process is biased towards choosing sentences with important proper nouns, due to the inclusion of proper noun chains in the gisting process. The effect of this is an overall reduction in the occurrence of dangling anaphors in the resultant gist.

36 At this point in the algorithm it would also be possible to generate longer-style summaries by selecting the top n ranked sentences.
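A minimal sketch of this sentence scoring and extraction step (Equation 8.2), including the earliest-sentence tie-break and the leading-pronoun heuristic, is given below; the data structures and pronoun list are illustrative, not the LexGister implementation.

```python
# A minimal sketch of the sentence scoring and gist extraction step described
# above (Equation 8.2), including the earliest-sentence tie-break and the
# leading-pronoun heuristic; data structures are illustrative, not LexGister's.
PRONOUNS = {"he", "she", "it", "they", "this", "that"}

def score_sentence(sentence_tokens, key_chain_scores):
    """key_chain_scores maps a chain word to the score of the key chain it
    belongs to; words outside the key chains contribute zero."""
    return sum(key_chain_scores.get(tok.lower(), 0.0) for tok in sentence_tokens)

def extract_gist(sentences, key_chain_scores):
    """sentences: list of token lists in document order."""
    scores = [score_sentence(s, key_chain_scores) for s in sentences]
    best = scores.index(max(scores))          # index() returns the earliest maximum
    if sentences[best] and sentences[best][0].lower() in PRONOUNS and best > 0:
        best -= 1                             # back off to the previous sentence
    return " ".join(sentences[best])
```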
8.3 Experimental Methodology

Our evaluation methodology establishes gisting performance using an automatic evaluation based on the same framework proposed by Witbrock and Mittal (1999), where recall, precision and the F1 measure are used to determine the similarity between a gold standard or reference title and a system generated title. In the context of this experiment these IR evaluation metrics are defined as follows. Recall is the number of words that the reference and system titles have in common divided by the number of words in the reference title. Precision is the number of words that the reference and system titles have in common divided by the number of words in the system title. The F1 measure is the harmonic mean of the recall and precision metrics (defined in Section 4.1.2, Equation 4.10).

In order to determine how well our lexical chain-based gister performs, the automatic part of our evaluation compares the recall, precision and F1 metrics of four baseline extractive gisting systems with the LexGister system. A brief description of the techniques employed in each of these systems is provided below:

- A baseline lexical chaining extraction approach (LexGister(b)) that works in the same manner as the LexGister system except that it uses a basic version of the LexNews chaining algorithm, i.e. it ignores statistical associations between words in the news story and proper nouns that do not occur in the WordNet thesaurus.
- A tf.idf-based approach (TFIDF) that ranks sentences in the news story with respect to the sum of the tf.idf weights of each word in a sentence. The idf statistics were generated from the TDT1 corpus.
- A lead sentence-based approach (LEAD) that in each case chooses the first sentence in the news story as its gist. In theory this simple method should perform well due to the pyramidal nature of news stories, i.e. the most important information occurs at the start of the text followed by more detailed and less crucial information. In practice, however, due to the presence of segmentation errors in our data set, it will be shown in Section 8.4 that a more sophisticated approach is needed.
- A random approach (RANDOM) that randomly selects a sentence as an appropriate gist for each news story. This approach represents a lower bound on gisting performance for our data set.

In Chapters 5 and 7 our New Event Detection and News Story Segmentation evaluations focussed on newswire and broadcast news transcripts taken from the TDT1 and TDT2 corpora. However, for our gisting evaluation we collected a corpus of 246 error-prone closed caption news stories captured from RTÉ Irish broadcast news programmes. We manually annotated this corpus with a set of gold standard human generated titles taken from the www.rte.ie/news website. There is a marked difference between what is meant by ‘error-prone closed caption material’ and ‘error-prone ASR broadcast news transcripts’: TDT ASR transcripts are primarily affected by limited capitalisation, and some segmentation and spelling errors. On the other hand, the RTÉ closed caption material is capitalised, but suffers from breaks in transmission (in some cases missing words/sentences), and story segmentation errors are more prevalent than in the TDT transcripts due to an LDC ‘clean-up’ attempt on the TDT2 corpus.37 The extrinsic evaluation results discussed in the following section are generated from all 246 stories.

37 Appendix D contains sample documents from each of the news collections used in the thesis: TDT newswire, TDT1 broadcast transcripts, TDT2 broadcast transcripts, and RTÉ closed caption material.

8.4 Gisting Results

As previously explained, recall, precision and F1 measures are calculated based on a comparison of the 246 generated news titles against a set of reference titles taken from the RTÉ news website. However, before the overlap between a system and reference headline for a news story is calculated, both titles are stopped and stemmed using the standard InQuery stopword list (Callan et al., 1992) and the Porter stemming algorithm (Porter, 1997). The decision to stop reference and system titles before comparing them is based on the observation that some title words are more important than others.
For example, if the reference title is ‘Government still planning to introduce the proposed anti-smoking law’ and the system title is ‘The Vintners Association are still looking to secure a compromise’, then they share the words ‘the’, ‘still’, and ‘to’, and the system title will have successfully identified 3 out of the 9 words in the reference title, resulting in misleadingly high recall (0.33) and precision (0.3) values. Another problem with automatically comparing reference and system titles is that there may be instances of morphological variants in each title, such as ‘introducing’ and ‘introduction’, that without the use of stemming will make titles appear less similar than they actually are.

[Figure: bar chart of Recall, Precision and F1 values for the Human, LexGister, LexGister(b), TFIDF, LEAD and RANDOM gisters.]
Figure 8.1: Recall, Precision and F1 values measuring gisting performance for 5 distinct extractive gisting systems and a set of human extractive gists.

Figure 8.1 shows the automatic evaluation results, using the stopping and stemming method, for each of the extractive gisting methods discussed in Section 8.3. For this experiment we also asked a human judge to extract the sentence that best represented the essence of each story in the test set. Hence, the F1 value of 0.25 achieved by these human extracted gists represents an upper bound on gisting performance. As expected, our lower bound on performance, the RANDOM system, is the worst performing system with an F1 measure of 0.07. The LEAD sentence system also performs poorly (F1 0.08), which helps to illustrate that a system that simply chooses the first sentence is not, in this instance, an adequate solution to the problem. A closer inspection of the collection shows that 69% of stories have segmentation errors, which accounts for the low performance of the LEAD and RANDOM gisters. On the other hand, the LexGister outperforms all other systems with an F1 value of 0.20. A breakdown of this value shows a recall of 0.42, which means that on average 42% of words in a reference title are captured in the corresponding system gist generated for a news story. In contrast, the precision value for the LexGister is much lower, where only 13% of words in a gist are reference title words. The precision values for the other systems show that this is a characteristic of extractive gisters, since extracted sentences are on average two thirds longer than reference titles. This point is illustrated in the following example, where the recall is 100% but the precision is 50% (in both cases stopwords are ignored). Gist: “The world premier of the Veronica Guerin movie took place in Dublin's Savoy Cinema, with Cate Blanchett in the title role.” Reference Title: “Premier of Veronica Guerin movie takes place in Dublin”. This example also shows that some form of sentence compression is needed if the LexGister were required to produce titles as opposed to gists, which would in turn help to increase the precision of the system. However, the higher recall of the LexGister system verifies that lexical cohesion analysis is better at capturing the focus of a news story than a statistical-based approach using a tf.idf weighting scheme.
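A minimal sketch of the stop, stem and word-overlap computation used in this automatic evaluation follows. It treats titles as sets of normalised word stems, and NLTK's Porter stemmer and English stopword list stand in for the Porter implementation and InQuery list actually used, so exact figures may differ slightly from those reported above.

```python
# A minimal sketch of the automatic title evaluation described above: both
# titles are stopped and stemmed before word overlap is computed. NLTK's Porter
# stemmer and stopword list stand in for the Porter implementation and InQuery
# list actually used, so the exact numbers may differ slightly.
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def normalise(title):
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", title.lower())
    return {STEMMER.stem(t) for t in tokens if t not in STOPWORDS}

def overlap_scores(system_title, reference_title):
    sys_words, ref_words = normalise(system_title), normalise(reference_title)
    common = sys_words & ref_words
    recall = len(common) / len(ref_words) if ref_words else 0.0
    precision = len(common) / len(sys_words) if sys_words else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return recall, precision, f1

print(overlap_scores(
    "The Vintners Association are still looking to secure a compromise",
    "Government still planning to introduce the proposed anti-smoking law"))
```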
Another important result from this experiment is the justification of our enhanced LexNews algorithm (which incorporated statistical word associations and non-WordNet proper nouns in the chaining process). Figure 8.1 illustrates how the LexGister system (F1 0.20) outperforms the baseline version, LexGister(b) (F1 0.17). Although our data set for this part of the experiment may be considered small in IR terms, a two-sided t-test of the null hypothesis of equal means shows that all system results are statistically significant at the 1% level, except for the difference between the RANDOM and LEAD results, and the TFIDF and LexGister(b) results which are not significant. One of the main criticisms of an automatic evaluation experiment, such as the one just described, is that it ignores important summary attributes such as readability and grammatical correctness. It also fails to recognise cases where synonymous or semantically similar words are used in a system and reference title for a news story, e.g. ‘killed’ or ‘murdered’ or ‘9-11’ and ‘September 11’.This is a side effect of our experimental methodology where the set of gold standard human generated titles contain many instances of words that do not occur in the original text of the news story. Examples like these account for a reduction in gisting performance, and illustrate the importance of an intrinsic or user-oriented evaluation when determining the ‘true’ quality of a gist. 202 Consequently, we conducted an evaluation experiment, based on one proposed by Jin and Hauptmann (1999), involving human judges that addresses these concerns. However, due to the overhead of relying on human judges to rate gists for all of these news stories we randomly selected 100 LexGister gists for the manual part of our evaluation. We then asked six judges to rate LexGister’s titles using five different quality categories ranging from 5 to 1 where ‘very good = 5’, ‘good = 4’, ‘ok = 3’, ‘bad = 2’ and ‘very bad = 1’. Judges were asked to read the closed caption text for a story, and then rate the LexGister headline based on its ability to capture the focus of the news story. The average score for all judges over each of the 100 randomly selected titles was an average score of 3.56 (i.e. gists were ‘ok’ to ‘good’) with a standard deviation of 0.32 indicating strong agreement among the judges. Since judges were asked to rate gist quality based on readability and content there were a number of situations where the gist may have captured the crux of the story but its rating was low due to problems with its fluency or readability. These problems are a side effect of dealing with error-prone closed caption data that contains both segmentation errors and breaks in transmission. To estimate the impact of this problem on the rating of the titles we also asked judges to indicate if they believed that the headline encapsulated the essence of the story disregarding grammatical errors. This score was a binary decision (1 or 0), where the average judgement was that 81.33% of titles captured the central message of the story with a standard deviation of 10.52 %. This ‘story essence’ score suggests that LexGister headlines are in fact better than the results of the automatic evaluation suggest, since the problems resulting from the use of semantically equivalent yet syntactically different words in the system and reference titles (e.g. Jerusalem, Israel) do not apply in this case. 
However, reducing the number of grammatical errors in the gists is still a problem as 36% of headlines contain these sorts of errors due to ‘noisy’ closed caption data. An example of such an error is illustrated below where the text in italics at the beginning of the sentence has been incorrectly concatenated to the gist due to a transmission error: “on tax rates relating from Tens of thousands of commuters travelled free of charge on trains today.” 203 It is hoped that the sentence compression strategy, briefly discussed in Section 8.5, will be able to remove unwanted elements of text like this from the gists. One final comment on the quality of the gists relates to the occurrence of ambiguous expressions, which occurred in 23% of system generated headlines. For example, consider the following gist which leaves the identity of ‘the mountain’ to the reader’s imagination: “A 34-year-old South African hotel worker collapsed and died while coming down the mountain”. To solve this problem a ‘post-gisting’ component would have to be developed that could replace a named entity with the longest sub-string that co-refers to it in the text [22], thus solving the ambiguous location of ‘the mountain’. 8.5 Discussion In this chapter we briefly discussed the area of text summarisation, in particular News Story Gisting in the broadcast news domain. In Section 8.2, we presented our novel lexical chaining-based approach to news story gisting, the LexGister system, and in Section 8.4 we explored the robustness of this technique with respect to ‘noisy’ closed caption material from news programmes. The results of our intrinsic and extrinsic evaluation methodologies indicate that this technique is a more effective means of generating a compact and readable headline from a news story text than a ‘bag-of-words’ technique. Another important outcome of our gisting experiment was the notable improvement in performance when the enhanced LexNews algorithm was used, indicating that there are benefits from generating a more comprehensive representation of the lexical cohesive structure of a text when generating an extractive summary. The next stage in our research is to explore current trends in title generation that use linguistically motivated heuristics to reduce a gist to a skeletal form that is grammatically and semantically correct (Fuentes et al., 2003; Dorr, Zaijc, 2003; McKeown et al., 2002; Daume et al., 2002). We have already begun working on a technique that draws on parse tree information for distinguishing important clauses in sentences using the original lexical chains generated for the news story to weight each clause. This will allow the LexGister to hone in on the grammatical unit of the 204 sentence that is most cohesive with the rest of the news story, resulting in a more compact news story title. Re-evaluating the performance of the LexGister using the new ROUGE evaluation metric is also a future goal of our research. ROUGE is a recall oriented metric that calculates the n-gram overlap between a set of reference (i.e. human generated) summaries and a single system generated summary. Experiments have shown that this metric corresponds well with human summary quality judgements (Lin, Hovy, 2003), which represents an exciting development for summarisation research because of the large effort involved in manually determining summary quality. 
The large scale usability of ROUGE is currently being investigated in the context of DUC (Document Understanding Conference) 2004: a research initiative similar to TDT that invites participants to generate summaries38 for a previously unseen corpus of documents, which are evaluated and then discussed at the annual DUC workshop. 38 The DUC 2004 initiative has defined five tasks: (task1) very short single document English summary generation (~10 words); (task2) short English multi-document summary generation (~100 words); (task3 and task 4) generate English summaries from manually and automatically translated Arabic news documents replicating task 1 and task 2; (task 5) generate answers to ‘Who is X’ style questions where X is a person or group of people. 205 Chapter 9 Future Work and Conclusions In this chapter we discuss some future directions for this work including, some suggestions for improving our LexNews algorithm and a description of how multidocument summarisation can benefit from lexical cohesion analysis. The remainder of the chapter highlights the research contributions and conclusions of this thesis. 9.1 Further Lexical Chaining Enhancements In Chapter 3 we proposed our own novel approach to lexical chaining based on an algorithm proposed by Hirst and St-Onge (1998). One of the fundamental operations of any WordNet-based chaining algorithm is the estimation of semantic relatedness between nouns with respect to their semantic distance in the WordNet taxonomy. As explained in Section 3.1, St-Onge and Hirst’s measure of semantic association is calculated with respect to the number of edges between two nouns in the taxonomy and the number of direction changes on this path (i.e. semantically opposite relations), where a number of ‘rules-of-thumb’ are used to ensure that spurious links are minimised. However, as explained in Section 2.4.4, an edgecounting measure like St-Onge’s assumes that all edges in the taxonomy are of equal length and that all branches are equally dense. However, these assumptions are false in the case of WordNet, and so edge counting is at best a very rough estimate of semantic relatedness. We have found that this measure of association used during the generation of lexical chains is, on occasion, less than satisfactory. Consequently, in the next phase of our research we intend to experiment with a number of different measures of semantic distance in WordNet, such as those recently compared and contrasted by Budanitsky and Hirst with respect to their effect on the performance of malapropism correction and detection (Budanitsky, 1999; Budanitsky and Hirst, 2001). Their overall finding was that the weakest measure of semantic distance was the aforementioned St-Onge and Hirst approach, which Budanitsky describes as ‘being far too promiscuous in its judgement of relatedness’. In addition, Budanitsky 206 found that Jiang and Conrath’s information theoretic-based measure yielded the best malapropism detection results (Jiang, Conrath, 1997). This approach attempts to improve a basic edge counting metric by verifying the correctness of a WordNet association with respect to a set of corpus statistics, i.e. it considers both the number of edges between the two nodes, and the conditional probability of finding an instance of a child node given the occurrence of a parent node. However, like the other approaches that Budanitsky examined, the Jiang-Conrath measure only looks at specialisation/generalisation relationships between nouns. 
Hence, Budanitsky concedes that St-Onge’s measure might work better if a more constrained version were used, e.g. if paths that traverse from the specialisation/generalisation taxonomy to the whole/part taxonomy were ignored. Another interesting avenue for future research arises from the recent release of two expanded versions of the WordNet thesaurus. Firstly, the latest offical release of the taxonomy, WordNet 2.0, by the Cognitive Science Laboratory at Princeton University is the first attempt at improving the connectivity of the taxonomy with respect to three areas: Topical Clustering: WordNet 2.0 has organised related nouns into topical categories such as ‘criminal law’ or ‘the military’ although much of the work in this area has focussed on vocabulary related to terrorism. Derivational Morphology: Currently, WordNet 2.0 links derivationally related and semantically related noun/verb pairs such as ‘summarise/summary’ and ‘examine/examination’, which results in 42,000 new connections between these two syntactic categories. However, links between other parts-of-speech such as adverbs-adjectives (quickly-quick) are planned for future releases. Gloss Term Disambiguation: In future releases, gloss definitions will be tagged with synset numbers. This will provide a broader context for each sense of a particular word form and will also help to increase the connectivity between different syntactic categories in the taxonomy. In addition to this research effort at Princeton, the University of Texas at Dallas has released an enhanced version of WordNet called eXtended WordNet. The objectives of both these research projects are closely related since they both aim to increase the connectivity of the semantic network by exploiting information contained in the gloss for each word sense. In this regard, the research effort at the 207 University of Texas leads the way, since they have already developed and implemented automatic methods that syntactically parse and semantically tag each gloss word (noun, verb, adjective and adverb) with a synset number39. In the course of our research we also investigated a method of increasing noun connectivity by incorporating statistical word associations into the LexNews chaining process. However, in Section 3.2.1 we highlighted a number of inadequacies with this approach: Words linked with respect to this form of lexical cohesion are not mapped to WordNet synsets. Hence, statistical word associations in the chaining process fail to consider instances of polysemy. For example, the noun ‘noise’ would be added to the following chain {racket, sports implement, hockey stick} through a statistical relationship with the word ‘racket’ in spite of the fact that ‘racket’ is being used in the ‘sport’ sense in this context. Biases in the training corpus can create less than intuitive associations, e.g. in the TDT1 corpus there is a strong statistical association between ‘glove’ and ‘blood’ due to the large portion of documents discussing the OJ Simpson case. In some instances statistical associations also capture thesaural relationships between words, because (not surprisingly) these types of word pairs also tend to co-occur frequently in text. This leads to additional and unnecessary word comparisons during the search for statistical associations between related candidate terms in the chaining process since thesaural relationship searches (i.e. Extra Strong, Strong and Medium Strength relationship searches) precede the statistical association search. 
In Section 3.2.1, we explained that in order to provide a full lexical cohesive analysis of a text, as defined by Halliday and Hasan (1976), statistically derived relationships would also have to be considered in the chaining process. However, using eXtended WordNet in the chaining process would provide a ‘cleaner’ method of considering these associations than our method of integration using corpus statistics, since the chaining algorithm would not be affected by the problems outlined above. Consequently, this represents a viable and promising direction for 39 WordNet 2.0 is available for download from http://www.cogsci.princeton.edu/~wn/ and the latest version of eXtended WordNet can be downloaded from http://xwn.hlt.utdallas.edu/index.html (last checked March 2004). 208 future lexical chaining research. However, the inclusion of statistically derived associations will more than likely still play a role in the chaining process for two reasons. Firstly, word associations established though gloss definitions will have a binary score (i.e. the word either exists in the gloss of another word or it does not). Secondly, there will be many instances of loosely related associations based on these gloss definitions. Hence, a more sensitive measure of relatedness will be needed, which could be measured using corpus statistics gathered from an appropriate domain. In the following section we propose another lexical chaining application that represents a promising future direction for our research. 9.2 Multi-document Summarisation Many of the lexical chaining approaches reviewed in Chapter 2 were applied to text summarisation tasks, in particular single document summarisation (Barzilay, Elhadad, 1997; Silber, McCoy, 2000; Brunn, Chali, Pinchak, 2001; Bo-Yeong, 2002; Alemany, Fuentes, 2003). Like many of these researchers we found that lexical chain-based gisting performance could outperform a standard ‘bag-ofwords’ technique (Section 8.4), thus fortifying the consensus that summarisation is an area that can benefit greatly from lexical cohesion information. Another important outcome of our gisting experiment was the notable improvement in performance when the enhanced LexNews algorithm was used. This indicates that there are benefits from generating a more comprehensive representation of the lexical cohesive structure of a text when generating an extractive summary. Hence, another promising area for future work would be the application of this technique to other summarisation tasks. In particular, there is a lot of scope for experimentation in the area of multi-document summarisation where the task is to provide an overview of content, given a cluster of related articles. Multi-document summarisation has proven to be a more challenging area of summarisation research for a number of reasons, including: The need for summary adaptation with respect to the strength of association between the documents in the cluster. For example, if the documents are strongly related then the summary should be based on their commonality. However, if the documents are loosely related (i.e. contain a lot of non- 209 overlapping novel information) it is more difficult to decipher which novel elements should be included. Fluency also plays a more dominant role in multi-document summarisation, since unlike single document summaries sentences cannot be ordered with respect to their original position in their source document. 
Fluency is also a major concern within a summary sentence if the summarisation technique attempts to fuse different pieces of information scattered across related sentences into a single sentence40. With regard to summary adaptation, our composite document representation described in Chapter 5 could be a promising approach to this problem. While determining similarity in the New Event Detection task involves on-topic and offtopic (or unrelated) document comparisons, similarity in the multi-document summarisation sense only requires a measurement of how similar related documents are to each other, which might prove to be a more appropriate application for a measure of similarity based on lexical cohesion analysis. With regard to fluency, lexical chains could also play an important role in enforcing thematic continuity when ordering sentences in a multi-document scenario, since one of the characteristics of fluent, coherent text is the presence of lexical cohesion. To date there have been very few attempts to develop a lexical chain-based multi-document summarisation strategy (Chali et al., 2003). Hence, this represents an exciting new application for lexical cohesion analysis. 9.3 Thesis Contributions In Sections 9.1 and 9.2 we highlighted areas of our lexical chaining research that would benefit from future work. In this section we summarise the main contributions of the work presented in this thesis, which are organised under the following headings: Lexical Chaining, New Event Detection, News Story Segmentation and News Story Gisting. Lexical Chaining The design and development of a novel news-oriented lexical chaining algorithm, the LexNews system, that considers, in addition to the standard 40 A more detailed discussion on information fusion, fluency and sentence ordering for multidocument summarisation can be found in (Barzilay, 2003). 210 lexical cohesive relationships found in WordNet, news-specific statistical word associations and an extended definition of lexical repetition based on the fuzzy matching of noun phrases referring to named entities such as people, places and organisations. An exploration of greedy and non-greedy approaches to lexical chaining, evaluated with respect to disambiguation accuracy on a portion of the SemCor corpus. An investigation into the suitability of lexical cohesion analysis to Topic Detection and Tracking tasks. New Event Detection The design, implementation and evaluation of a novel method of integrating lexical cohesion analysis into an IR model used to detect breaking news stories on a news stream, motivated by the TDT initiative request for techniques that went beyond the traditional ‘bag of words’ approach. An investigation into the extent to which our lexical chain-based approach, the LexDetect system, is affected by ‘noisy’ broadcast news data. News Story Segmentation The design, implementation and evaluation of our lexical chain based-approach to text segmentation, the SeLeCT system, in a novel application of text segmentation. An examination of segmentation performance in terms of a variety of evaluation metrics. An investigation into the effects of lexical cohesion relationships on segmentation accuracy. An exploration of the effects of written and spoken news sources on segmentation performance. News Story Gisting The design, implementation and evaluation of our lexical chain-based News Gister, the LexGister system. An analysis of LexGister performance on error-prone closed caption material using intrinsic and extrinsic evaluations. 
9.4 Thesis Conclusions

In this, the final section of the thesis, we clarify and summarise the principal findings of our work. These conclusions are split into two main areas: those pertaining to our lexical chaining algorithm LexNews, and those relating to the success of its application to the TDT tasks explored in the thesis.

Greedy and Non-Greedy Lexical Chaining: Despite the fact that greedy approaches to lexical chain generation have been largely disregarded of late by researchers, the results of the evaluation described in Chapter 3 indicate that no significant gain in performance is achieved when a non-greedy, more computationally expensive approach is used. Lexical chaining performance, in this instance, was measured with respect to WordNet-based disambiguation accuracy on the SemCor corpus. This result validated the use of our semi-greedy chaining approach, LexNews, to address the various IR and NLP tasks examined in this thesis.

News Story Document Representation using Lexical Chains: Our composite document representation strategy, presented in Chapter 5, attempted to amalgamate information regarding the lexical cohesive structure of a news story with a traditional 'bag-of-words' representation of its content. Our experiments showed that:

- A lexical chain representation of a document is best used as additional rather than conclusive evidence of news story similarity in the New Event Detection task (Section 5.5.2).
- Considering chain words as features in a vector space model marginally outperforms a New Event Detection system that considers the chains themselves as features in a representation of the document (Section 5.5.3).
- Using word stems instead of WordNet concepts (i.e. synset numbers) improved the performance of the chain word document representation with respect to the New Event Detection task (Section 5.5.2).
- Results from experiments on the TDT2 corpus showed that New Event Detection (NED) performance deteriorated when broadcast rather than newswire stories were processed. This was evident from both the performance of the UMass NED system and our lexical chain-based NED system, LexDetect. Two factors were identified as being responsible for this decline in LexDetect's performance: a lack of capitalisation, which added errors to lexical chaining preprocessing steps (e.g. noun phrase identification based on part-of-speech tag information), and the presence of very short news reports (35.3% of documents in the TDT2 corpus consist of fewer than 100 words), making it difficult for our lexical chaining algorithm, LexNews, to perform a lexical cohesive analysis of these short texts (Section 5.6.2).
- Although initial experiments on the TDT1 corpus indicated that the LexDetect system could outperform a simple 'bag-of-words' approach to the NED problem, these results could not be replicated on the 'noisier' TDT2 corpus when compared with the performance of the UMass NED system.

News Story Segmentation using Lexical Chains: In Chapter 7, our News Story Segmentation system, SeLeCT, was evaluated with respect to two well-known lexical cohesion-based approaches to segmentation, the C99 and TextTiling algorithms, on a collection of CNN broadcast news stories and Reuters news articles. Our experiments showed that:

- The SeLeCT algorithm outperformed the C99 and TextTiling algorithms on the CNN news collection. However, on the Reuters newswire collection the C99 algorithm performed best, followed by the SeLeCT system and finally the TextTiling algorithm (Section 7.3).
- Only pure lexical repetition proved to be a useful lexical cohesive relationship for detecting boundaries between adjacent stories (Section 7.3.4).
- Like the New Event Detection results, significant degradations in performance were observed on the CNN (spoken) collection compared with the Reuters (written) news collection, even though in this case the CNN broadcast news documents were manually transcribed and so were correctly punctuated and capitalised. This led us to investigate the differences between written and spoken language modes, and how they convey information in a news story. Using Halliday's observation that 'written language represents phenomena as products (nouns), while spoken language represents phenomena as processes (verbs)', we adapted our SeLeCT system and showed that gains in broadcast (spoken) News Story Segmentation performance could be achieved by including nominalisable verbs in the chaining process. In addition, the performance of the C99 and TextTiling algorithms improved when all non-nominalisable verbs were eliminated from their input, since these verbs tend to be less 'informative' and more commonly occurring, hence adding additional noise to the segmentation process (Section 7.4).

News Story Gisting using Lexical Chains: Our work on News Story Gisting, described in Section 8.2, provides further evidence that lexical cohesion analysis can make a positive contribution to text summarisation tasks. Another interesting conclusion of this initial investigation is that lexical chaining is robust enough to be able to deal with broadcast news segmentation errors in closed caption material, leading us to conclude that although ASR transcripts pose a significant challenge for any NLP-based approach, there is still use for these techniques in the broadcast news environment on 'cleaner' data samples like closed caption material.

The performance of the LexNews chaining algorithm: In Chapter 3 we introduced the enhanced LexNews algorithm, which focuses on the exploration of lexical cohesive relationships in a broadcast news environment. Throughout the thesis the impact of these enhancements on the various TDT applications was considered. More specifically:

- In Section 5.6, a version of our NED system LexDetect using the basic LexNews algorithm was outperformed by a version using the enhanced LexNews algorithm.
- However, as explained in Section 7.3.4, News Story Segmentation performance deteriorates when additional lexical cohesive relationships beyond exact repetition are explored, prompting the conclusion that the scope of the lexical cohesion analysis provided by both the basic and the enhanced versions of the LexNews algorithm is too broad for this task.
- On the other hand, the News Story Gisting application of our chaining algorithm produced some definite evidence to support the inclusion of statistical word associations and extended noun phrase matching, as presented in this thesis, in future implementations of lexical chain-based summarisation. However, it remains to be seen whether the additional lexical cohesive links provided by the recently released WordNet 2.0 and eXtended WordNet taxonomies can lead to further gisting improvements.

Appendix A
The LexNews Algorithm

The purpose of this appendix is to provide the reader with a more formal description of the LexNews algorithm described in Section 3.2.3 of the thesis. The LexNews chaining approach was designed and developed with the intention of building lexical chains for broadcast news and newswire text.
The underlying algorithm is based on a technique by Hirst and St-Onge (1998), and Section A.1 is based on St-Onge's formal description (1995) of his own algorithm. Section A.2 contains a list of stopwords, more specifically WordNet nouns that are excluded from the chaining process due to their tendency to cause spurious chains, i.e. chains containing incorrectly disambiguated or weakly cohesive chain members. This list of 'problematic' nouns is a combination of manually (domain-specific) and automatically identified words that tend to subordinate a higher than average number of nouns in the WordNet taxonomy, and consequently are responsible for the creation of weakly cohesive lexical chains.

A.1 Basic Lexical Chaining Algorithm

LexNews chaining algorithm, where:
XS_R  = Extra Strong Relation
FXS_R = Fuzzy (proper noun) Extra Strong Relation
S_R   = Strong Relation
MS_R  = Medium-Strength Relation
SW_R  = Statistical Word Relation

 1.  if (new_word.sentence_number == current_sentence_number) then
 2.      if ((current_word.type == `NN`) and (current_word.type != stopword)) then
 3.          NN_queue.push(new_word);
 4.      end if
 5.      else /* current_word.type == `PN` */ then
 6.          PN_queue.push(new_word);
 7.      end if
 8.  end if
 9.  else
10.      current_sentence_number = new_word.sentence_number;
11.      /* Begin NN chaining */
12.      for current_word from NN_queue.first to NN_queue.last do
13.          if (NN_chain_stack.try_to_chain(current_word, XS_R)) then
14.              NN_queue.remove(current_word);
15.          end if
16.      end do
17.      for current_word from NN_queue.first to NN_queue.last do
18.          if (NN_chain_stack.try_to_chain(current_word, XS_R) or
19.              NN_chain_stack.try_to_chain(current_word, S_R)) then
20.              NN_queue.remove(current_word);
21.          end if
22.      end do
23.      for current_word from NN_queue.first to NN_queue.last do
24.          if (NN_chain_stack.try_to_chain(current_word, XS_R) or
25.              NN_chain_stack.try_to_chain(current_word, S_R) or
26.              NN_chain_stack.try_to_chain(current_word, MS_R)) then
27.              NN_queue.remove(current_word);
28.          end if
29.      end do
30.      for current_word from NN_queue.first to NN_queue.last do
31.          if (NN_chain_stack.try_to_chain(current_word, XS_R) or
32.              NN_chain_stack.try_to_chain(current_word, S_R) or
33.              NN_chain_stack.try_to_chain(current_word, MS_R) or
34.              NN_chain_stack.try_to_chain(current_word, SW_R)) then
35.              NN_queue.remove(current_word);
36.          end if
37.      end do
38.      for current_word from NN_queue.first to NN_queue.last do
39.          NN_chain_stack.create_chain(current_word);
40.          NN_queue.remove(current_word);
41.      end do
42.      /* Begin PN chaining */
43.      for current_word from PN_queue.first to PN_queue.last do
44.          if (PN_chain_stack.try_to_chain(current_word, FXS_R)) then
45.              PN_queue.remove(current_word);
46.          end if
47.      end do
48.      for current_word from PN_queue.first to PN_queue.last do
49.          PN_chain_stack.create_chain(current_word);
50.          PN_queue.remove(current_word);
51.      end do
52.  end if

As explained in Section 3.2, the tokeniser identifies candidate terms that should be included in the chaining process. There are two types of candidate terms: WordNet noun phrases (NN) and non-WordNet proper noun phrases (PN). During the chaining process separate sets of non-overlapping NN and PN chains are generated. These two sets of chains are non-overlapping because the algorithm has no means of associating non-WordNet proper nouns with WordNet nouns, as they are not linked in the noun database. The algorithm begins by reading in candidate terms as they occur in the original source text.
The algorithm then pushes each term from the current sentence onto either the NN_sentence_queue or the PN_sentence_queue, depending on the type assigned to the phrase by the tokeniser. In the case of the WordNet noun phrases, only non-stopword terms are pushed onto the NN_sentence_queue (lines 1-8). Once all words for a particular sentence have been read in, the chain formation process can begin. The algorithm begins by generating NN chains. So for each word in the NN_sentence_queue, an extra strong relationship is sought in the NN_chain_stack, which stores all WordNet noun phrase chains in the order in which they were last updated. If a match is found then the candidate term in the sentence queue is added to the related chain, the chain is then moved to the head of the chain stack, and the noun phrase is removed from the NN_sentence_queue. Otherwise, if a match is not found, then the current phrase remains in the sentence queue and awaits further processing (lines 12-16). The algorithm then iterates through the sentence queue again, this time searching for strong relationships between queue words and chain words. Again, if a sentence word is related to a chain word then it is added to the chain, the chain becomes the head of the chain stack, and the word is removed from the sentence queue (lines 17-22). This process is repeated for the medium-strength (lines 23-29) and statistical association (lines 30-37) searches. However, in each of these loops the algorithm checks again for the possibility of the relationships that preceded it. The reason for these additional searches is that if, for example, a word is added to a chain based on a medium-strength relationship (lines 23-29), this might create the possibility of a strong relationship between the recently added word and a member of the sentence queue that did not exist during the first strong relationship search (lines 17-22). Figure A.1 helps to illustrate this point, where during the first strong search iteration (lines 17-22) a relationship is sought between 'resident' in the sentence queue and 'town' in the chain stack. However, the chaining algorithm could not establish a relationship, since their path length in WordNet exceeds 4 edges. In fact, these terms can only be related through a transitive relationship: once a relationship between the words 'state' and 'town' has been established at line 19, and a second strong relationship search has been completed at line 25, the algorithm finds that a relationship between 'state' and 'resident' exists.

[Figure A.1: Chaining example illustrating the need for multiple searches. The sentence queue contains the terms 'resident', 'state' and 'Belfast'; the chain stack contains the chains {town, city_limits} and {dissident, radical}.]

When all searches have been completed and there are no remaining relationships between the members of the noun phrase sentence queue and the members of the chain stack, each remaining candidate term in the sentence queue becomes the head of a new chain in the NN_chain_stack (lines 38-41). This point in the algorithm marks the end of the noun phrase chaining process and the beginning of the chaining of non-WordNet proper noun (PN) phrases. Proper noun chaining is similar to noun chaining in that each phrase in the PN_sentence_queue is compared to each member of each chain stored in the PN_chain_stack (lines 42-47). However, this comparison process does not require any WordNet lookup. Instead, the fuzzy matching function, described in Section 3.2.3, is used to seek out links between proper noun phrases.
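To make the multi-pass search strategy described above more concrete, the following Python sketch mimics the control flow for the candidate terms of a single sentence: each pass over the sentence queue widens the set of permissible relations (extra strong, then strong, then medium-strength, then statistical), stronger relations are always re-checked first, and a chain that accepts a word is moved to the head of the chain stack. This is an illustrative sketch under our own naming, not the LexNews implementation: the relation test and the fuzzy proper noun matcher below are trivial placeholders for the WordNet- and corpus-based tests of Section 3.2.3.

    # Illustrative sketch of the multi-pass chain-building loop described above.
    from collections import deque

    # Relation strengths, searched in this fixed order (strongest first).
    RELATION_ORDER = ["XS_R", "S_R", "MS_R", "SW_R"]

    def related(word, chain_word, relation):
        # Placeholder relation test: only extra-strong (exact repetition) is
        # implemented here; the real tests use WordNet paths and corpus statistics.
        if relation == "XS_R":
            return word == chain_word
        return False

    def fuzzy_match(pn_a, pn_b):
        # Crude stand-in for the fuzzy proper noun matcher of Section 3.2.3, used
        # for the separate PN chains: two phrases match if they share a token,
        # e.g. 'Cate_Blanchett' and 'Blanchett'.
        return bool(set(pn_a.lower().split("_")) & set(pn_b.lower().split("_")))

    def try_to_chain(chain_stack, word, max_relation_index):
        # Try to add `word` to an existing chain using any relation up to and
        # including RELATION_ORDER[max_relation_index]; on success the updated
        # chain becomes the head of the chain stack.
        for idx, chain in enumerate(chain_stack):
            for relation in RELATION_ORDER[: max_relation_index + 1]:
                if any(related(word, member, relation) for member in chain):
                    chain.append(word)
                    chain_stack.insert(0, chain_stack.pop(idx))
                    return True
        return False

    def chain_sentence(sentence_queue, chain_stack):
        # Repeated passes over the queue of candidate terms for one sentence,
        # widening the set of permissible relations on each pass.
        queue = deque(sentence_queue)
        for max_idx in range(len(RELATION_ORDER)):
            remaining = deque()
            while queue:
                word = queue.popleft()
                if not try_to_chain(chain_stack, word, max_idx):
                    remaining.append(word)      # word waits for the next, weaker pass
            queue = remaining
        for word in queue:                      # unchained words seed new chains
            chain_stack.insert(0, [word])

    if __name__ == "__main__":
        # With the placeholder relation test none of these words chain, so each
        # seeds a new chain; with real WordNet tests 'state' and then 'resident'
        # would join the {town, city_limits} chain, as described above.
        chains = [["town", "city_limits"], ["dissident", "radical"]]
        chain_sentence(["resident", "state", "belfast"], chains)
        print(chains)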
As with the medium-strength relationship search, not all relationships of this type are assigned the same strength, so all possible relationships are sought between the sentence queue phrase and each chain. The phrase is then added to the chain with which it holds the strongest relationship. Once all possible proper noun phrase additions to the PN_chain_stack have been made, the algorithm then makes each remaining sentence queue proper noun phrase the head of a new proper noun chain (lines 48-51). This process is repeated until each sentence in the source text has been processed. Once all lexical chains have been generated, only proper noun and noun phrase chains that have more than one member take part in any further processing, i.e. New Event Detection, News Story Segmentation or News Story Gisting.

A.2 Lexical Chaining Stopword List

This stopword list is available on request from the author.

abstraction act action activity afternoon amount artefact artifact attribute being bit blank cause content course day daybreak dimension distance edge effect end entity existence extent form front function group grouping hour human_action human_activity instance instrumentality instrumentation kind large length level life_form light line little living_thing location look lot lots manner matter mean minute morning mortal natural night noon nothing now number object old order organism part past people person phenomenon physical_object piece point portion position process quality region relation right second series side size somebody someone something soul standard state status stuff subject thing time try type unit use way while year

Appendix B
LexNews Lexical Chaining Example

The following is a sample news story ('clean' closed-caption material) taken from an Irish broadcast news programme. In Section 3.5 we explored the generation of lexical chains using the LexNews algorithm on this piece of text. In the following sections we provide the original version of the text (Section B.1), a tagged version (Section B.2) used as input to the Tokeniser, the candidate terms identified by the Tokeniser (Section B.3), and the lexical chains generated from this set of candidate terms using the enhanced version of the LexNews algorithm (Section B.4).

B.1 News Story Text Version

All noun phrases and adjectives pertaining to nouns are marked in bold in the following piece of text.

As Gardai launch an investigation into gangland murders in Dublin and Limerick a film opened in Dublin tonight which recalls the killing of another victim of organised crime in 1996. The world premiere of the Veronica Guerin movie took place in the Dublin's Savoy Cinema, with Cate Blanchett in the title role. The film charts the events leading up to the murder of the Irish journalist. Crowds gathered outside the Savoy Cinema as some of Ireland's biggest names gathered for the premiere of Veronica Guerin, the movie. It recounts the journalists attempts to exposed Dublin drug gangs. But for many the premiere was mixed with sadness. "It' s odd. It can' t be celebratory because of the subject matter." Actress Cate Blanchett takes on the title role in the movie. It was a part she says she felt honoured to play. "I got this complete picture of this person full of life and energy. And so that' s when it became clear the true nature of the tragedy of the loss of this extraordinary human being, and great journalist." Apart from Blanchett every other part is played by Irish actors. Her murderer was later jailed for 28 years for drug trafficking.
The film-makers say it' s a story of personal courage, but for the director, there was only one person' s approval that mattered. “A couple of months ago I brought the film to show to her mother. It was the most pressure I' ve ever felt.” But he needn' t have worried. “I see it as a tribute to Veronica, a worldwide tribute.” 223 B.2 Part-of-Speech Tagged Text This tagged text was generated using the JTAG tagger (Xu, Broglio, Croft, 1994) algorithm. ***000001 284 As/CS Gardai/NP launch/VB an/AT investigation/NN into/TOIN gangland/NN murders/NNS in/IN Dublin/NP and/CC Limerick/NP a/AT film/NN opened/VBD in/IN Dublin/NP tonight/NN which/WDT recalls/VBZ the/AT killing/NN of/IN another/DT victim/NN of/IN organised/VBD crime/NN in/IN 1996/CD ./. The/AT world/NN premiere/NN of/IN the/AT Veronica/NP Guerin/NP movie/NN took/VBD place/NN in/IN Dublin/NP 's/$ Savoy/NP Cinema/NP ,/, with/IN Cate/NP Blanchett/NP in/IN the/AT title/NN role/NN ./. The/AT film/NN charts/VBZ the/AT events/NNS leading/VBG up/RP to/TOIN the/AT murder/NN of/IN the/AT Irish/JJ journalist/NN ./. Crowds/NNS gathered/VBN outside/IN the/AT Savoy/NP Cinema/NP as/CS some/DTI of/IN Ireland/NP 's/$ biggest/JJ names/NNS gathered/VBN for/IN the/AT premiere/NN of/IN Veronica/NP Guerin/NP ,/, the/AT movie/NN ./. It/PPS recounts/VBZ the/AT journalists/NNS attempts/NNS to/TOIN exposed/VBN Dublin/NP drug/NN gangs/NNS ./. But/CC for/IN many/AP the/AT premiere/NN was/BEDZ mixed/VBN with/IN sadness/NN ./. It/PPS 's/BEZ odd/JJ ./. It/PPS can't/MD be/BE celebratory/JJ because/CS of/IN the/AT subject/JJ matter/NN ./. Actress/NN Cate/NP Blanchett/NP takes/VBZ on/IN the/AT title/NN role/NN in/IN the/AT movie/NN ./. It/PPS was/BEDZ a/AT part/NN she/PPS says/VBZ she/PPS felt/VBD honoured/VBN to/TO play/VB ./. I/PPSS got/VBD this/DT complete/JJ picture/NN of/IN this/DT person/NN full/JJ of/IN life/NN and/CC energy/NN ./. And/CC so/CS that/DT 's/BEZ when/WRB it/PPS became/VBD clear/RB the/AT true/JJ nature/NN of/IN the/AT tragedy/NN of/IN the/AT loss/NN of/IN this/DT extraordinary/JJ human/JJ being/NN ,/, and/CC great/JJ journalist/NN ./. Apart/RB from/IN Blanchett/NP every/AT other/AP part/NN is/BEZ played/VBN by/IN Irish/JJ actors/NNS ./. Her/PP$ murderer/NN was/BEDZ later/RB jailed/VBN for/IN 28/CD years/NNS for/IN drug/NN trafficking/NN ./. The/AT film-makers/NNS say/VB it/PPS 's/BEZ a/AT story/NN of/IN personal/JJ courage/NN ,/, but/CC for/IN the/AT director/NN ,/, there/EX was/BEDZ only/RB one/CD person/NN 's/$ approval/NN that/WPS mattered/VBD ./. A/AT couple/NN of/IN months/NNS ago/RB I/PPSS brought/VBD the/AT film/NN to/TO show/VB to/TOIN her/PP$ mother/NN ./. It/PPS was/BEDZ the/AT most/AP pressure/NN I/PPSS 've/HV ever/RB felt/VBD ./. But/CC he/PPS needn't/NP have/HV worried/VBN ./. I/PPSS see/VB it/PPO as/CS a/AT tribute/NN to/TOIN Veronica/NP ,/, a/AT worldwide/JJ tribute/NN ./. 224 B.3 Candidate Terms Below is a list of candidate terms (proper noun and noun phrases) identified by the tokeniser for chaining. All terms highlighted in bold are stopwords (as defined by Section A.2), and do not take part in the chaining process. 
The candidate term information is in the following format where nn refers to a WordNet noun and pn a non-WordNet proper noun: Document identifier; Word Number; Sentence Number; Term Tag; Term 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 pn gardai 3 1 nn launch 5 1 nn investigation 7 1 nn gangland 8 1 nn murder 10 1 nn dublin 12 1 pn limerick 14 1 nn film 17 1 nn dublin 18 1 nn tonight 22 1 nn killing 25 1 nn victim 28 1 nn crime 32 2 nn world 33 2 nn premiere 36 2 pn veronica_guerin 38 2 nn movie 43 2 pn dublin’s_savoy_cinema 47 2 pn cate_blanchett 51 2 nn title_role 54 3 nn film 57 3 nn event 62 3 nn murder 65 3 nn ireland 66 3 nn journalist 67 4 nn crowd 71 4 pn savoy_cinema 76 4 nn ireland 79 4 nn names 83 4 nn premiere 85 4 pn veronica_guerin 89 4 nn movie 93 5 nn journalist 94 5 nn attempt 96 5 nn exposition 97 5 nn dublin 98 5 nn drug 99 5 nn gang 104 6 nn premiere 108 6 nn sadness 115 8 nn celebration 119 8 nn subject_matter 121 9 nn actress 122 9 pn cate_blanchett 127 9 nn title_role 225 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 131 135 142 147 150 153 155 166 169 172 176 181 184 187 189 191 192 194 200 202 205 210 213 218 224 226 230 232 237 242 247 262 264 268 9 nn movie 10 nn part 10 nn player 11 nn picture 11 nn person 11 nn life 11 nn energy 12 nn nature 12 nn tragedy 12 nn loss 12 nn human_being 12 nn journalist 13 pn blanchett 13 nn part 13 nn player 13 nn ireland 13 nn actor 14 nn murderer 14 nn years 14 nn drug 15 nn film_maker 15 nn story 15 nn courage 15 nn director 15 nn person 15 nn approval 16 nn couple 16 nn month 16 nn film 16 nn mother 17 nn pressure 19 nn tribute 19 pn veronica 19 nn tribute B.4 Weighted Lexical Chains The mark-up and weighting scheme adopted in the following sets of lexical chains is explained in Section 3.5. WordNet Noun Phrase Chains CHAIN 4; No. Words 14; Word Span 11-264; Sent Span 1-19; [film (SEED) Freq 3 WGT 0.9 STRONG] [movie (film) Freq 3 WGT 0.9 STRONG] [premiere (film) Freq 3 WGT 0.4 MEDIUM] [subject_matter (film) Freq 1 WGT 0.7 STRONG] [actress (movie) Freq 1 WGT 0.7 STRONG] [picture (film) Freq 1 WGT 0.9 STRONG] [actor (actress) Freq 1 WGT 0.7 STRONG] [film_maker (film) Freq 1 WGT 0.4 MEDIUM] [approval (subject_matter) Freq 1 WGT 0.7 STRONG] [story (subject_matter) Freq 1 WGT 0.4 MEDIUM] [director (actor) Freq 1 WGT 0.4 STATISTICAL] [tribute (approval) Freq 2 WGT 0.7 STRONG] CHAIN 14; No. Words 3; Word Span 151-243; Sent Span 11-17; [energy (SEED) Freq 1 WGT 0.4 MEDIUM] [nature (energy) Freq 1 WGT 0.4 MEDIUM] [pressure (energy) Freq 1 WGT 0.4 MEDIUM] CHAIN 1; No. Words 6; Word Span 4-198; Sent Span 1-14; [gangland (SEED) Freq 1 WGT 0.4 MEDIUM] [world (gangland) Freq 1 WGT 0.4 MEDIUM] [crowd (gangland) Freq 1 WGT 0.4 MEDIUM] [gang (crowd) Freq 1 WGT 0.9 STRONG] [drug (gang) Freq 2 WGT 0.4 STATISTICAL] CHAIN 3; No. Words 7; Word Span 7-187; Sent Span 1-13; [Dublin (SEED) Freq 3 WGT 0.7 STRONG] [Ireland (Dublin) Freq 3 WGT 0.7 STRONG] CHAIN 2; No. Words 9; Word Span 3-168; Sent Span 1-12; [investigation (SEED) Freq 1 WGT 0.4 STATISTICAL] [murder (investigation) Freq 2 WGT 0.7 STRONG] [killing (murder) Freq 1 WGT 0.7 STRONG] [victim (killing) Freq 1 WGT 0.4 STATISTICAL] [crime (victim) Freq 1 WGT 0.4 STATISTICAL] [life (murder) Freq 1 WGT 0.4 MEDIUM] [loss (life) Freq 1 WGT 0.4 STATISTICAL] [murderer (victim) Freq 1 WGT 0.4 MEDIUM] CHAIN 10; No. Words 3; Word Span 63-177; Sent Span 3-12; [journalist (SEED) Freq 3 WGT 0] CHAIN 7; No. 
Words 2; Word Span 48-123; Sent Span 2-9; [title_role (SEED) Freq 2 WGT 0]
CHAIN 9; No. Words 2; Word Span 54-112; Sent Span 3-8; [event (SEED) Freq 1 WGT 0.4 MEDIUM] [celebration (event) Freq 1 WGT 0.4 MEDIUM]

Non-WordNet Proper Noun Phrase Chains
CHAIN 3; No. Words 3; Word Span 33-260; Sent Span 2-19; [Veronica_Guerin (SEED) Freq 2 WGT 0.8] [Veronica (Veronica_Guerin) Freq 1 WGT 0.8]
CHAIN 5; No. Words 3; Word Span 44-180; Sent Span 2-13; [Cate_Blanchett (SEED) Freq 2 WGT 0.8] [Blanchett (Cate_Blanchett) Freq 1 WGT 0.8]
CHAIN 4; No. Words 2; Word Span 40-68; Sent Span 2-4; [Dublin's_Savoy_cinema (SEED) Freq 1 WGT 0.8] [Savoy_cinema (Dublin's_Savoy_cinema) Freq 1 WGT 0.8]

Appendix C
Segmentation Metrics: WindowDiff and Pk

In this appendix we provide a more detailed explanation of the difference between the WindowDiff and Pk metrics, which were used to evaluate segmentation performance in Section 7.2.2 (Equations 7.2 and 7.3). In that section, we referred briefly to Pevzner and Hearst's (2002) paper on the shortcomings of Beeferman et al.'s Pk metric (1999), and their proposed alternative, the WindowDiff metric, which they state is a more intuitive and accurate means of determining segmentation performance. In this paper, Pevzner and Hearst informally define these two error metrics as follows:

- Pk uses a sliding window method for calculating error, where if the two ends of the window are in different segments in the reference segmentation and in the same segment in the system's segmentation (or vice versa), then an error has been detected and the error counter is incremented by 1.

- WindowDiff also uses a sliding window; however, this metric compares the number of boundaries in the window in the reference segmentation (r) with the number of boundaries in the system's segmentation (s), and if the number of boundaries is not equal, then errors have been detected and the error counter is incremented by the absolute difference between these two numbers, i.e. |r - s|.

In both these metrics the window size is half the average size of the segments in the reference segmentation. Figure C.1 shows the window incrementing in units of 1, where each block numbered 0-20 represents a unit of text, at the beginning and end of which exists a possible boundary point. We also notice that the system has made two segmentation errors: it has placed a false boundary point between blocks 7 and 8 (a false positive), and it has missed a boundary between blocks 12 and 13 (a false negative). Table C.1 shows how the Pk and WindowDiff metrics calculate errors for each of the shifting windows in Figure C.1, numbered 1 to 10. The error scores for the Pk metric for each window show that although it detects the false negative error, it fails to identify the false positive error, because in each window from 1 to 5 the start and end of the window lie in different segments in both the reference and system segmentations. In comparison, the WindowDiff metric correctly identifies both errors, because its error metric is based on the difference between the number of boundaries in each window in the system and reference segmentations. This example illustrates one of Pk's flaws outlined by Pevzner and Hearst, i.e. the Pk metric has the potential to penalise false negatives more than false positives. It also helps to illustrate how the Pk and WindowDiff metrics are calculated. For an explanation of other Pk flaws and an empirical justification of the WindowDiff metric see (Pevzner, Hearst, 2002).
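To make the two definitions above concrete, the following Python sketch computes both metrics over a pair of segmentations. It is an illustrative sketch rather than the evaluation code used in this thesis: segmentations are assumed to be encoded as boundary indicator lists (1 if a boundary follows the corresponding unit of text, 0 otherwise), and the default window size follows the convention stated above, i.e. half the average segment size in the reference segmentation.

    def default_window(reference):
        # Half the average reference segment size (number of segments = boundaries + 1).
        return max(1, round(0.5 * len(reference) / (sum(reference) + 1)))

    def p_k(reference, hypothesis, k=None):
        # Pk: count an error whenever the two ends of the sliding window fall in the
        # same segment in one segmentation but in different segments in the other.
        n = len(reference)
        k = k or default_window(reference)
        errors = 0
        for i in range(n - k):
            same_ref = sum(reference[i:i + k]) == 0   # no boundary between unit i and unit i+k
            same_hyp = sum(hypothesis[i:i + k]) == 0
            if same_ref != same_hyp:
                errors += 1
        return errors / (n - k)

    def window_diff(reference, hypothesis, k=None):
        # WindowDiff: in each window compare the number of reference boundaries (r)
        # with the number of hypothesised boundaries (s); the error counter is
        # incremented by the absolute difference |r - s|.
        n = len(reference)
        k = k or default_window(reference)
        errors = 0
        for i in range(n - k):
            errors += abs(sum(reference[i:i + k]) - sum(hypothesis[i:i + k]))
        return errors / (n - k)

    if __name__ == "__main__":
        # Toy example: ten units of text, one true boundary after unit 4; the
        # system places its single boundary one unit too late (after unit 5).
        ref = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
        hyp = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
        print(p_k(ref, hyp, k=3), window_diff(ref, hyp, k=3))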
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 REF SYS 1 2 3 4 5 6 7 8 9 10 Figure C.1: Diagram showing system segmentation results and the correct boundaries defined in the reference segmentation. Blocks in this diagram represent units of text. 229 Window Iteration Pk WindowDiff 1 0 1 2 0 1 3 0 1 4 0 1 5 0 0 6 1 1 7 1 1 8 1 1 9 1 1 10 0 0 Table C.1: Error calculations for each metric for each window shift in Figure C.1. 230 Appendix D Sample News Documents from Evaluation Corpora In this appendix we provide sample text from each of the ‘clean’ and ‘noisy’ news sources used in the experiments described in this thesis: New Event Detection: TDT newswire articles (clean), TDT1 broadcast transcripts (clean), TDT2 broadcast transcripts (noisy). News Story Segmentation: TDT newswire articles (clean), TDT1 broadcast transcripts (clean). News Story Gisting: RTÉ closed caption material (noisy). As stated in Chapter 8, TDT ASR transcripts are affected primarily by limited capitalisation, and some segmentation and spelling errors. The RTÉ closed caption material, on the other hand, is capitalised, but suffers from breaks in transmission (missing words/sentences). In addition, story segmentation errors are more prevalent in this data source than in TDT transcripts, due to the manual ‘clean-up’ conducted on these transcripts by the LDC before they were released. 231 D.1 TDT1 Broadcast News Transcript <DOC> <DOCID> CNN786-5.940701 </DOCID> <TDTID> TDT000010 </TDTID> <SOURCE> CNN Daybreak </SOURCE> <DATE> 07/01/94 </DATE> <TITLE> Arafat Returns to Gaza Without Promised Money </TITLE> <SUBJECT> Live Report </SUBJECT> <SUBJECT> News </SUBJECT> <SUBJECT> International </SUBJECT> <TOPIC> Arafat, Yasser </TOPIC> <TOPIC> Middle East--Politics and government </TOPIC> <TOPIC> Palestinian Arabs </TOPIC> <TOPIC> Palestinian self-rule areas </TOPIC> <SUMMARY> Yasir Arafat is expected to return in about an hour-anda-half for a weekend stay in the Gaza Strip. Spirits and security are both high, as Arafat's many supporters and enemies try to make their points. </SUMMARY> <TEXT> <SP> BOB CAIN, Anchor </SP> <P> As we reported to you earlier, Palestine Liberation Organization leder Yasir Arafat is returning to Palestine today for the first time in 27 years - that we know of. He will be crossing the border from Egypt into Gaza shortly. CNN Correspondent Bill Delaney is in Gaza City with the latest developments. Bill? </P> <SP> BILL DELANEY, Correspondent </SP> <P> Bob, the relative nonchalance with which many Gazans greeted the news of PLO Chairman Yasir Arafat's arrival here is finally giving way to excitement, expectation and there is a lot of security out on the streets. The hotel behind me is covered with security on the roof - has been for the last 24 hours or so. Yasir Arafat's entourage is expected to stay there and everywhere else in Gaza, there's evidence of how concerned Palestinians are about keeping their living legend alive. </P> <P> By the truckload, Yasir Arafat's loyal legion, the Palestine Liberation Army Brigade, fanned out in streets still unfamiliar to many who've only themselves so recently returned, awaiting the man these soldiers see as above all other - the keeper of the flame. </P> <SP> 1st RESIDENT </SP> <P> We are here, all of us, to protect Yasir Arafat and to say to him `hello.' </P> <SP> DELANEY </SP> <P> A brigadier general said he was not worried about security because everyone loves Arafat. 
An exaggeration evidenced by the dragnet of security everywhere in Gaza - at the hotel where Arafat's entourage will stay and in the plaza where Arafat's expected to address tens of thousands as crowds slowly gathered in the Mediterranean heat, Arafat's long journey home to Gaza, where his mother's family once lived, changes everything forever for Palestinians. </P> <SP> 2nd RESIDENT </SP> <P> 232 All the people very happy. </P> <SP> 3rd RESIDENT </SP> <P> My father's happy today. </P> <SP> DELANEY </SP> <P> Dissenters were heard from - after an attack Thursday on Israeli soldiers in Gaza, an Islamic group claimed responsibility for an attack on Jewish settlers on the West Bank. In Gaza, though, whether the Middle East's old violent cycles continue or not, nothing will ever look quite the same once Yasir Arafat's come to town. For young children, the 27-year Israeli occupation won't be much of a memory - Yasir Arafat's arrival surely will be. This boy said `We can't live without him. He is our leader. He is our love.' Still, Yasir Arafat returns as he vowed repeatedly he would not, still only barely solvent, with relatively few of the hundreds of millions of dollars he's been pledged actually in his pocket. </P> <P> Arafat's latest gamble seems to be to so irrevocable stake his claim here that everything else he so desperately needs will follow. We expect him here in the Gaza Strip in about an hour and a half, crossing over from Egypt. Bob? </P> <SP> CAIN </SP> <P> Bill, Arafat apparently poses a security problem not only to the Palestinians, but to the Israelis as well. Do you know how long he's going to be there in Gaza City? </P> <SP> DELANEY We expect him plans are, of understand at </SP> <P> to spend the weekend and leave Monday. Now, those course, always subject to change, but that's what we the moment. </P> <SP> CAIN </SP> <P> Bill Delaney, in Gaza City. </P> <COPYRIGHT> The preceding text has been professionally transcribed. However, although the text has been checked for errors, in order to meet rigid distribution and transmission deadlines, it may not have been proofread against tape. (c) Copyright 1994 Cable News Network. All rights reserved. </COPYRIGHT> </TEXT> </DOC> <DOC> 233 D.2 TDT2 Broadcast News Transcript <DOC> <DOCNO> CNN19980106.1600.0984 </DOCNO> <DOCTYPE> NEWS STORY </DOCTYPE> <DATE_TIME> 01/06/1998 16:16:24.99 </DATE_TIME> <BODY> <TEXT> shares of apple computer were among those in the plus column today. apple says cost-cutting and strong demand for its new computers helped return to company to profitability at the end of last year. that good news helped the company stock soar nearly 20%. apple closed up just over $3, at $19 a share. the acting chief of the bruised computer maker credits his workers. <TURN> every group at apple has been burning the midnight oil over the last six months. the product groups hardware and software have been doing great. our sales and marketing groups all around the world are manufacturing, distribution , and we're starting to really see the results. <TURN> apple also says its eyeing the sub $1,000 personal computer market. other major computer makers including hewlett packard and compaq are already churning out cheap pcs they've become one of the hottest products in the computer industry. 
</TEXT> </BODY> <END_TIME> 01/06/1998 16:17:20.00 </END_TIME> </DOC> 234 D.3 TDT Newswire Article <DOC> <DOCNO> NYT19980109.0937 </DOCNO> <DOCTYPE> NEWS STORY </DOCTYPE> <DATE_TIME> 01/09/1998 21:13:00 </DATE_TIME> <HEADER> A4332 &Cx1f; taf-z u i &Cx13; &Cx11; BC-IRAN-U.S.-POLICY-270& 01-09 0830 </HEADER> <BODY> <SLUG> BC-IRAN-U.S.-POLICY-270&ADD-NYT </SLUG> <HEADLINE> U.S. OFFICIALS WARMING UP TO INFORMAL LINKS WITH IRAN </HEADLINE> (Eds., see related IRAN-U.S) &QL; (bl) &QL; By STEVEN ERLANGER &LR; &QL; &UR; c.1998 N.Y. Times News Service &QC; &LR; &QL; <TEXT> WASHINGTON _ After a few days of reflection on Iranian President Mohammed Khatami's address to the American people, senior U.S. officials are changing their tone and embracing the idea of cultural exchanges that fall short of a formal, government-togovernment dialogue. But that formal dialogue is vital to any real improvement in relations with Iran, the officials repeated, adding that atmosphere also matters. Noting that the interview with Ayatollah Khatami on Wednesday was also broadcast in Iran and received a mixed reception, U.S. officials say they have a fuller understanding both of the courage of his address and what is possible within the divided politics of theocratic Iran. ``When the president of Iran, a country with whom we've had a very bad relationship for a long time, gets on CNN and addresses the American people and starts praising our values and our civilization and talks about a dialogue, then it behooves us to respond,'' a senior U.S. official said. ``When he says he regrets the hostage-taking and talks about America as a great civilization and these things get criticized in Iran,'' the official continued, ``it is an indication to us that he's interested in breaking down this distrust and finding a way to engage with us.'' All that ``is important on a rhetorical level,'' the official said/ But he cautioned that ``we have some real problems with Iranian behavior'' that can only be resolved in ``authorized, government-to-government talks'' of the kind Washington has been seeking _ publicly and privately, through various diplomatic channels _ for many months. <ANNOTATION> (STORY CAN END HERE. OPTIONAL MATERIAL FOLLOWS) </ANNOTATION> U.S. diplomatic overtures for new talks on the substantive problems of the relationship were passed to Iranian officials in Tehran by Saudi intermediaries in June and early July, The Los Angeles Times reported in July, before Khatami took office in August. 235 Another overture, sometime after Khatami's inauguration, was made in a letter delivered by the Swiss, who represent U.S. interests in Tehran, where there is no U.S. diplomatic representation, The Washington Post reported. But these overtures _ and less formal efforts made through Washington-based research groups _ produced little at the time, officials said. ``A real improvement in Iran's behavior and relations with the United States will depend more on domestic political change in Tehran than anything we do or say,'' a senior official said. ``And what we do or say will have an exaggerated impact over there. There is a real risk in saying too much and doing in the guy who's trying to make things better.'' While wanting to be receptive to the overture from Khatami, U.S. officials do not want to be ``bounced,'' one said, into aimless talks that harm U.S. efforts to isolate Iran and produce no discernible change in Tehran's behavior. So State Department spokesman James P. 
Rubin says the United States will ``take a serious, hard look'' at Khatami's vague proposal for a more formalized expansion of cultural and educational exchanges. But limited informal exchanges already exist, Rubin said, and what matters to Washington remains now what it was last week: a halt in Iranian support for terrorism; a halt to Iran's pursuit of weapons of mass destruction and ballistic missiles to deliver them, and a halt in Iran's active support for radicals opposed to the Middle East peace effort. The U.S. response to Khatami is ``designed to make clear to him that we listened and we heard, both the good things, the things we appreciate, and the things we do not appreciate,'' a senior official said. It is also designed not to cause any inadvertent damage to Khatami's standing in Iran _ but without appearing to take sides in the struggle between conservative adherents of Iran's spiritual leader, Ayatollah Ali Khamenei, and those who look to Khatami to soften Iran's religious fervor and encourage the trend toward the more moderate brand of Islam he appears to represent. The United States remains a metaphor for the more fundamental battle inside Iran, just as it was during the 1979 revolution against Shah Mohammed Riza Pahlevi that brought the ayatollahs to power. The administration applauded Khatami's call for relations built on ``mutual respect'' and his suggestion that terrorist violence aimed at Israeli citizens is useless and counterproductive. His comments about the United States, an official said, ``were a breath of fresh air, quite contrary to the paranoid, vitriolic view of Western values and culture put around in Iran for many years.'' Among the less attractive comments, to American ears, was Khatami's description of Israel as ``a racist, terrorist regime.'' </TEXT> </BODY> <TRAILER> NYT-01-09-98 2113EST </TRAILER> </DOC> 236 D.4 RTÉ Closed Caption Material <DOC id="story16942_3"> <TEXT> the result of which has caused fierce disagreement among Palestinian militants. In the North, a five-year-old boy suffered a serious eye injury when the taxi he was in came under attack by youths throwing stones. The incident was one of a series of minor skirmishes in Belfast, after what has been the quietest 12th of July weekend for many years. But Jeffrey Donaldson, the Ulster Unionist MP, confirmed he will increase efforts to overturn the policies of David Trimble. One week after a calm Drumcree, a quiet 12th. In North Belfast, after a day's marching and in some cases a day's drinking, Orangemen paraded past nationalists with no serious confrontations. it has been the calmest July since the Troubles began. But for both communities, there are different realities still. They will have to be grasped Last night in North Belfast, the prominent Republicans helping to keep the peace included Bobby Storey, the man at the centre of unproven Unionist allegations about the break-in at Castlereagh pol?? 4?tion. At times it was an extremely tense situation. I saw Sinn Fein's Gerry Kelly intervene to stop a number of youths from getting involved in confrontations with the police. Policing remains a crucial issue for Sinn Fein. Until Nationalist communities support the new policing structures, security issues will be contentious. On the Unionist side, Jeffrey Donaldson may now believe But in his campaign to oust David Trimble, he is looking for support from the likes of Sir Reg Empey, who supports the Good Friday Agreement. 
Donaldson's strategy seems to be - sort out our own camp and then try to negotiate with Nationalists. I say that I and the people I represent can bring a concensus on the Unionist side to this process, provided our concerns are addressed. Normally in the North, politicians take summer holidays and the streets become tense. The current situation is different – </TEXT> </DOC> 237 References (Agirre et al., 2000) E. Agirre, O. Ansa, E. Hovy, D. Martinez. Enriching very large ontologies using the WWW. In the Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence (ECAI-00), 2000. (Alemany, Fuentes, 2003) L. Alemany, M. Fuentes. Integrating cohesion and coherence for automatic summarization. In the Proceedings of the 11th Meeting of the European Chapter of the Association for Computational Linguistics (EACL-03), 2003. (Alfonseca et al., 2003) E. Alfonseca, P. Rodriguez. Description of the UAM system for generating very short summaries at DUC 2003. In the Proceedings of the HLT/NAACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2003), 2003. (Al-Halimi, Kazman, 1998) R. Al-Halimi, R. Kazman, Temporal Indexing through Lexical Chaining. In WordNet: an Electronic Lexical Database. Chapter 14, pp. 33352, C. Fellbaum (editor), The MIT Press, Cambridge, M.A., 1998. (Allan et al., 1998a) J. Allan, J. Carbonell, G. Doddington, J. Yamron, Y. Yang. Topic Detection and Tracking Pilot Study Final Report. In the Proceedings of the DARPA Broadcasting News Transcript and Understanding Workshop 1998, pp. 194-218, 1998. (Allan et al., 1998b) J. Allan, V. Lavrenko, R. Papka. Event Tracking. Computer Science Department, University of Massachusetts, Amherst, CIIR Technical Report IR-128, 1998. 238 (Allan et al., 1998c) J. Allan, R. Papka, V. Lavrenko. On-line New Event Detection and Tracking. In the Proceedings of the 21st Annual ACM SIGIR Conference of Research and Development in Information Retrieval (SIGIR-98), pp. 37-45, 1998. (Allan et al., 1998d) J. Allan, R. Papka, V. Lavrenko, On-line New Event Detection using Single Pass Clustering. University of Massachusetts, Amherst, Technical Report 98-21, 1998. (Allan et al., 1999) J. Allan, H. Jin, M. Rajman, C. Wayne, D. Gildea, V. Lavrenko, R. Hoberman, D. Caputo. Topic-Based Novelty Detection. Summer Workshop Final Report, Center for Language and Speech Processing, Johns Hopkins University, 1999. (Allan et al., 2000a) J. Allan, V. Lavrenko, D. Malin, R. Swan. Detections, Bounds, and Timelines: UMass and TDT-3. In the Proceedings of Topic Detection and Tracking Workshop (TDT-3), 2000. (Allan et al., 2000b) J. Allan, V. Lavrenko, H. Jin. First Story Detection in TDT Is Hard. In the Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM-00), 2000. (Allan et al., 2000c) J. Allan, V. Lavrenko, D. Frey, V. Khandelval. UMass at TDT 2000. In the Proceedings of Topic Detection and Tracking Workshop (TDT-2000), Gaithesburg, MD, 2000. (Allan et al., 2001) J. Allan, V. Khandelwal, R. Gupta. Temporal Summaries of News Topics. In the Proceedings of the 24th Annual ACM SIGIR Conference of Research and Development in Information Retrieval (SIGIR-01), pp. 10-18, 2001. (Allan, 2002a) J. Allan (editor). Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers, 2002. 239 (Allan, 2002b) J. Allan. Introduction to Topic Detection and Tracking. Chapter 1, In Topic Detection and Tracking: Event-based Information Organization. J. 
Allan (editor), Kluwer Academic Publishers, 2002. (Allan et al., 2002c) J. Allan, V. Lavrenko, R. Swan. Explorations within Topic Detection and Tracking. Chapter 10, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002. (Arampatzis, 2001) A. Arampatzis. Adaptive and Temporally-dependent Document Filtering. Ph.D. thesis, University of Nijmegen, The Netherlands, 2001. (Baker, McCallum, 1998) L. D. Baker, A. K. McCallum. Distributional clustering of words for text classification. In the Proceedings of the 21st Annual ACM SIGIR Conference of Research and Development in Information Retrieval (SIGIR-98), pp. 96-103, 1998. (Baeza-Yates, Ribero-Neto, 1999) R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, Addisson-Wesley, 1999. (Banko et al., 2000) M. Banko V. Mittal, M. Witbrock. Generating Headline-Style Summaries. In the Proceedings of the Association for Computational Linguistics (ACL-00), 2000. (Barzilay, Elhadad, 1997) R. Barzilay, M. Elhadad. Using Lexical Chains for Text Summarization. In the Proceedings of the Association for Computational Linguistics and the European Chapter of the Association for Computational Linguistics (ACL-97/EACL-97), Workshop on Intelligent Scalable Text Summarization, pp. 10-17, 1997. (Barzilay, 1997) R. Barzilay. Lexical chains for summarisation. Master’s Thesis, Ben-Gurion University, Beer-Sheva, Israel, 1997. 240 (Barzilay, 2003) R. Barzilay. Information Fusion for Mutlidocument Summarization: Paraphrasing and Generation. PhD Thesis, Columbia University, 2003. (Beeferman et al., 1999) D. Beeferman, A. Berger, J. Lafferty. Statistical Models for Text Segmentation. In Machine Learning, Vol. 34, pp. 1-34, 1999. (Berger, Mittal, 2000) A. Berger, V. Mittal. OCELOT: a system for summarizing Web. In the Proceedings of the 23rd Annual ACM SIGIR Conference of Research and Development in Information Retrieval, (SIGIR-00), pp.144-151, 2000. (Blei, Moreno, 2001) D. M. Blei, P. J. Moreno. Topic segmentation with an aspect hidden Markov model. In the Proceedings of the 24th Annual ACM SIGIR Conference of Research and Development in Information Retrieval, (SIGIR-01), pp. 343-348, 2001. (Bo-Yeong, 2002) Bo-Yeong Kang. Text Summarization through Important Noun Detection Using Lexical Chains. M.S. Thesis, Kyungpook National University, 2002. (Bo-Yeong, 2003) Bo-Yeong Kang. A novel approach to semantic indexing based on concept. In the Proceedings of the Association for Computational Linguistics Student Session (ACL-03), 2003. (Brunn, Chali, Pinchak, 2001) M. Brunn, Y. Chali, C.J. Pinchak. Text Summarization Using Lexical Chains. In the Proceedings of the Document Understanding Conference (DUC-2001), pp. 135 - 140, 2001. (Brunn, Chali, Dufour, 2002) M. Brunn, Y. Chali, D. Dufour. The University of Lethbridge Text Summarizer at DUC 2002. In the Proceedings of the Document Understanding Conference (DUC-2002), pp. 39-44, 2002. 241 (Budanitsky, 1999) A. Budanitsky. Lexical Semantic Relatedness and its Application in Natural Language Processing. PhD Thesis, Technical Report CSRG390, Computer Systems Research Group, University of Toronto, August 1999. (Budanitsky, Hirst, 2001) A. Budanitsky, G. Hirst. Semantic Distance in WordNet: An experimental, application oriented-evaluation of five measures. In the Proceedings of the Workshop on WordNet and Other Lexical Resources, in the North American Chapter of the Association for Computational Linguistics (NAACL-2001), Pittsburgh, PA, June 2001. 
(Buitelaar, 1998) P. Buitelaar. CORELEX: Systematic Polysemy and Underspecification. Ph.D. thesis, Brandeis University, 1998. (Callan et al., 1992) J. P. Callan, W. B. Croft, S. M. Harding. The INQUERY Retrieval System. In the Proceedings of the 3rd International Conference on Database and Expert System Applications, pp. 78-83, 1992. (Callan. 1994) J. P. Callan. Passage level evidence in document retrieval. In the Proceedings of the 17th Annual ACM SIGIR Conference of Research and Development in Information Retrieval, (SIGIR-94), pp. 302-310, 1994. (Carthy, Smeaton, 2000) J. Carthy, A. F. Smeaton. The Design of a Topic Tracking System. In the Proceedings of the 22nd Annual Colloquium on IR Research, (BCSIRSG-00), 2000. (Carthy, Sherwood-Smith, 2002) J. Carthy, M. Sherwood-Smith. Lexical Chains for Topic Tracking. In the Proceedings of the IEEE International Conference on Systems Management and Cybernetics, 2002. (Carthy, 2002) J. Carthy. Lexical Chains for Topic Tracking. PhD thesis, Department of Computer Science, University College Dublin, 2002. 242 (Cieri et al., 2002) C. Cieri, S. Strassel, D. Graff, N. Martey, K. Rennert, M. Liberman. Corpora for Topic Detection and Tracking. Chapter 10, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002. (Chali et al., 2003) Y. Chali, M. Kolla, N. Singh, Z. Zhang. The University of Lethbridge Text Summarizer at DUC 2003. In Proceedings of the Document Understanding Conference (DUC-2003), pp. 148-152, 2003. (Chen, Chen, 2002) Y. Chen, H. Chen. NLP and IR Approaches to Monolingual and Multilingual Link Detection. In the Proceedings of the 19th International Conference on Computational Linguistics (ACL-02), pp. 176-182, 2002. (Chen, Ku, 2002) H. Chen, L. Ku. An NLP and IR approach to topic detection. Chapter 12, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002. (Choi, 2000) F. Y. Y. Choi. Advances in domain independent linear text segmentation. In the Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-00), pp. 26-33, 2000. (Choi, 2001) F. Y. Y. Choi, P. Wiemer-Hastings, J. Moore. Latent semantic analysis for text segmentation. In the Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing (EMNLP-01), pp. 109-117, 2001. (Copeck et al., 2003) T. Copeck, S. Szpakowicz. Picking phrases, picking sentences. In the Proceedings of the HLT/NAACL workshop on Automatic Summarization/Document Understanding Conference (DUC 2003), 2003. (Croft, 2000) W. B. Croft. Combining approaches to information retrieval. In Chapter 1, Advances in Information Retrieval, W. B. Croft (editor), pp. 1-36. Kluwer Academic Publishers, 2000. 243 (Daume et al., 2002) H. Daume, D. Echihabi, D. Marcu, D. S. Munteanu, R. Soricut. GLEANS: A generator of logical extracts and abstracts for nice summaries. In the Proceedings of the ACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2002), 2002. (Deerwester et al., 1990) S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41, pp. 391-407, 1990. (Dharanipragada et al., 1999) S. Dharanipragada, M. Franz, J. S. McCarley, S. Roukos, T. Ward. Story Segmentation and Topic Detection for Recognised Speech. In the Proceedings of Eurospeech, 1999. (Dharanipragada et al., 2002) S. 
Dharanipragada, M. Franz, J. S. McCarley, T. Ward, W. -J. Zhu. Segmentation and Detection at IBM. Chapter 7, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002. (Dhillon, et al., 2003) I. Dhillon, S. Mallelaa, R. Kumar. A Divisive InformationTheoretic Feature Clustering Algorithm for Text Classification. To appear in the Journal of Machine Learning Research,: Special Issue on Variable and Feature Selection, Vol. 3 pp. 1265-1287, 2003. (Dorr, Zajic, 2003) B. Dorr, D. Zajic. Hedge Trimmer: A parse-and-trim approach to headline generation. In the Proceedings of the HLT/NAACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2003), 2003. (Dunning, 1994) T. Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, Vol. 19, No. 1 pp. 61-74, 1994. 244 (Eichmann et al., 1999) D. Eichmann, M. Ruiz, P. Srinivasan, N. Street, C. Culy, F. Menczer. A cluster-based approach to tracking, detection and segmentation of broadcast news. In the Proceedings of the DARPA Broadcast News Workshop, 1999. (Eichmann, Srinivasan, 2002) D. Eichmann, P. Srinivasan. A cluster-based approach to broadcast news. Chapter 8, In Topic Detection and Tracking: Eventbased Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002. (Ellman, 2000) J. Ellman. Using Roget’s Thesaurus to Determine the Similarity of Texts. PhD thesis, Department of Computer Science, University of Sunderland, 2000. (Fellbaum, 1998a) C. Fellbaum (editor). In WordNet: An Electronic Lexical Database and some of its Applications. MIT Press, Cambridge, MA, 1998. (Fellbaum, 1998b) C. Fellbaum. A semantic network of English verbs. Chapter 3, In WordNet: An Electronic Lexical Database and some of its Applications. C. Fellbaum (editor), MIT Press, Cambridge, MA, 1998. (Fiscus, Doddington, 2002). J. Fiscus, G. Doddington, Topic Detection and Tracking Overview. Chapter 2, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002. (Fox, 1983) E. Fox. Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types. PhD thesis, Cornell University, 1983. (Fox et al., 1988) E. Fox, G. Nunn, W. Lee. Coefficients for combining concept classes in a collection. In the proceedings of the 11th Annual ACM SIGIR Conference of Research and Development in Information Retrieval, (SIGIR-88), pp. 291-308, 1988. 245 (Frakes, Baeza-Yates, 1992) W. B. Frakes, R. Baeza-Yates. Information Retrieval: Data structures and algorithms. Prentice Hall, 1992. (Fuentes et al., 2003) M. Fuentes, H. Rodriguez, L. Alonso, Mixed Approach to Headline Extraction for DUC 2003. In the Proceedings of the HLT/NAACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2003), 2003. (Furnas et al., 1987) G. W. Furnas, T. K. Landauer, L. M. Gomez, S. Dumais. The Vocabulary Problem in Human-System Communication. CACM, Vol. 30, No. 11, pp. 964-971, 1987. (Gale et al., 1992) W. Gale, K. Church, D. Yarowsky. One Sense Per Discourse. In the Proceedings of the 4th DARPA Speech and Natural Language Workshop, pp. 233-237, 1992. (Galley, McKeown, 2003) M. Galley, K. McKeown. Improving Word Sense Disambiguation in Lexical Chaining. In the Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), 2003. (Gloub, Van Loan, 1996) G. H. Golub, C. Van Loan. Matrix Computations. 
Johns Hopkins University Press, 3rd Edition, 1996. (Gonzalo et al., 1998) J. Gonzalo, F. Verdejo, I. Chugur, J. Cigarran. Indexing with WordNet Synsets can Improve Text Retrieval. In the Proceedings of the Workshop on the Usage of WordNet in Natural Language Processing Systems, (COLINGACL-98), S. Harabagiu (editors), pp. 38-44, 1998. (Gonzalo et al., 1999) J. Gonzalo, A. Penas, F. Verdejo. Lexical ambiguity and Information Retrieval revisited. In the Proceedings of the Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora (EMNLP/VLC99), 1999. 246 (Green, 1997a) S. J. Green. Building hypertext links in Newspaper Articles using Semantic Similarity. In the Proceedings of the 3rd Workshop on Applications of Natural Language to Information Systems (NLDB’97), pp. 178-190, 1997. (Green, 1997b) S. J. Green. Automatically Generating Hypertext by Computing Semantic Similarity. PhD Thesis, University of Toronto, 1997. (Greiff et al., 2000) W. Greiff, A. Morgan, R. Fish, M. Richards, A. Kundu. MITRE TDT-2000 segmentation system. In the Proceedings of the TDT 2000 Workshop, 2000. (Grosz, Sidner, 1986) B. J. Grosz, C. L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, Vol. 12, No. 3, pp. 175-204, 1986. (Halliday, Hasan, 1976) M. A. K. Halliday, R. Hasan. Cohesion in English. Longman, 1976. (Halliday, 1995) M. A. K. Halliday. Spoken and Written Language. Oxford University Press, 1985. (Hasan, 1984) R. Hasan. Coherence and Cohesive Harmony. Understanding Reading Comprehension: Cognition, Language and the Structure of Prose. James Flood (ed.), Newwark, Delaware: International Reading Association, pp.184-219, 1984. (Hatch, 2000) P. Hatch. Lexical Chaining for the Online Detection of New Events. MSc thesis, Department of Computer Science, University College Dublin, 2000. (Harabagiu, 1999) S. Harabagiu. From Lexical Cohesion to Textual Coherence: A Data Driven Perspective. Journal of Pattern Recognition and Artificial Intelligence, Vol. 13, No. 2, pp. 247-265, 1999. 247 (Hearst, Plaunt, 1993) M. Hearst, C. Plaunt. Subtopic structuring for full-length document access. In the Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (SIGIR-93), pp. 59-68, 1993. (Hearst, 1997) M. Hearst. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, Vol. 23 No. 1, pp. 33-64, 1997. (Hirst, 1995) G. Hirst. Near-synonymy and the structure of lexical knowledge. In the Working Notes of the AAAI Spring Symposium on Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pp. 5156, 1995. (Hirst, St-Onge, 1998) G. Hirst, D. St-Onge. Lexical chains as Representations of Context for the Detection and Correction of Malapropisms. in Chapter 13 WordNet: An Electronic Lexical Database, C. Fellbaum (editors), pp. 305-332, The MIT Press, Cambridge, MA, 1998. (Hirschberg, Litman, 1993) J. Hirschberg, D. Litman. Empirical studies on the disambiguation of cue phrases. Computational Linguistics, Vol. 19, No. 3, pp. 501530, 1993. (Jarmasz, Szpakowicz, 2003) M. Jarmasz, S. Szpakowicz. Not as Easy as It Seems: Automating the Construction of Lexical Chains Using Roget' s Thesaurus. In the Proceedings of the Canadian Conference on Artificial Intelligence, pp. 544-549, 2003. (Jiang, Conrath, 1997) J. J. Jiang, D. W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. 
(Jin, Hauptmann, 2002) R. Jin, A. G. Hauptmann. A new probabilistic model for title generation. In the Proceedings of the International Conference on Computational Linguistics (ACL-02), 2002.
(Joachims, 2002) T. Joachims. Learning to Classify Text using Support Vector Machines. PhD Dissertation, Kluwer, 2002.
(Jobbins, Evett, 1998) A. C. Jobbins, L. J. Evett. Text Segmentation Using Reiteration and Collocation. In the Proceedings of the Joint International Conference on Computational Linguistics with the Association for Computational Linguistics (COLING-ACL 1998), pp. 614-618, 1998.
(Kaszkiel, Zobel, 1997) M. Kaszkiel, J. Zobel. Term-ordered query evaluation versus document-ordered query evaluation for large document databases. In the Proceedings of the 20th International ACM-SIGIR Conference on Research and Development in Information Retrieval, (SIGIR-97), pp. 343-344, 1997.
(Kaszkiel, Zobel, 2001) M. Kaszkiel, J. Zobel. Effective ranking with arbitrary passages. Journal of the American Society for Information Science and Technology, Vol. 54, No. 4, pp. 344-364, 2001.
(Kaufmann, 2000) S. Kaufmann. Second-order Cohesion. Computational Intelligence, Vol. 16, No. 4, pp. 511-524, 2000.
(Kazman et al., 1995) R. Kazman, W. Hunt, M. Mantei. Dynamic Meeting Annotation and Indexing. In the Proceedings of the 1995 Pacific Workshop on Distributed Meetings, pp. 11-18, 1995.
(Kazman et al., 1996) R. Kazman, R. Al-Halimi, W. Hunt, M. Mantei. Four Paradigms for Indexing Video Conferences. In IEEE Multimedia, pp. 63-73, 1996.
(Justeson, Katz, 1995) J. Justeson, S. M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, Vol. 1, No. 1, pp. 9-27, 1995.
(Katz, 1996) S. M. Katz. Distribution of context words and phrases in text and language modelling. Natural Language Engineering, Vol. 2, No. 1, pp. 15-59, 1996.
(Katzer et al., 1982) J. Katzer, M. McGill, J. Tessier, W. Frakes, P. DasGupta. A study of the overlap among document representations. Information Technology: Research and Development, Vol. 1, No. 4, pp. 261-274, 1982.
(Kazman et al., 1997) R. Kazman, J. Kominek. Accessing Multimedia through Concept Clustering. In the Proceedings of Computer-Human Interaction (CHI-97), pp. 19-26, 1997.
(Kilgarriff, Yallop, 2000) A. Kilgarriff, C. Yallop. What's in a thesaurus? In the Proceedings of the 2nd Conference on Language Resources and Evaluation (LREC-00), pp. 1317-1379, 2000.
(Klavans, Min-Yen, 1998) J. Klavans, Min-Yen Kan. Role of Verbs in Document Analysis. In the Proceedings of the Joint International Conference on Computational Linguistics with the Association for Computational Linguistics (COLING-ACL 1998), pp. 680-686, 1998.
(Kozima, Furugori, 1993a) H. Kozima, T. Furugori. Similarity between words computed by spreading activation on an English dictionary. In the Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93), pp. 232-239, 1993.
(Kozima, 1993b) H. Kozima. Text Segmentation based on Similarity between Words. In the Proceedings of the 31st Meeting of the Association for Computational Linguistics (ACL-93), pp. 286-288, 1993.
(Kozima, Ito, 1997) H. Kozima, A. Ito. Context Sensitive Word Distance by Adaptive Scaling of a Semantic Space. In R. Mitkov and N. Nicolov (editors), Recent Advances in Natural Language Processing: Selected Papers from RANLP 1995, Volume 136 of Amsterdam Studies in the Theory and History of Linguistic Science: Current Issues in Linguistic Theory, Chapter 2, pp. 111-124, John Benjamins Publishing Company, Amsterdam/Philadelphia, 1997.
(Krovetz, Croft, 1992) R. Krovetz, W. B. Croft. Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems, Vol. 10, No. 2, pp. 115-141, 1992.
(Lee, 1997) J. Lee. Analysis of multiple evidence combination. In the Proceedings of the 20th Annual ACM SIGIR Conference on Research and Development in IR, (SIGIR-97), pp. 267-276, 1997.
(Lin, Hovy, 2003) C. Lin, E. Hovy. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In the Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, (HLT-NAACL 2003), 2003.
(Mandala et al., 1999) R. Mandala, T. Tokunaga, H. Tanaka. Combining Multiple Evidence from Different Types of Thesaurus for Query Expansion. In the Proceedings of the 22nd Annual ACM SIGIR Conference on Research and Development in IR, (SIGIR-99), pp. 191-197, 1999.
(Mani et al., 1997) I. Mani, D. House, M. T. Maybury, M. Green. Towards Content-Based Browsing of Broadcast News Video. In Intelligent Multimedia Information Retrieval, M. T. Maybury (editor), AAAI/MIT Press, pp. 241-258, 1997.
(Mann, Thompson, 1987) W. Mann, S. Thompson. Rhetorical Structure Theory: A Theory of Text Organization. In The Structure of Discourse, L. Polanyi (editor), Norwood, N.J.: Ablex Publishing Corporation, 1987.
(Manning, 1998) C. D. Manning. Rethinking text segmentation models: An information extraction case study. Technical Report SULTRY-98-07-01, University of Sydney, 1998.
(Marcu, 1997) D. Marcu. From Discourse Structures to Text Summaries. In the Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, pp. 82-88, 1997.
(McGill et al., 1979) M. McGill, M. Koll, T. Noreault. An evaluation of factors affecting document ranking by information retrieval systems. Final report for grant NSF-IST-78-10454 to the National Science Foundation, Syracuse University, 1979.
(Mc Hale, 1998) M. Mc Hale. A comparison of WordNet and Roget's Taxonomy for Measuring Semantic Similarity. In the Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, pp. 115-120, 1998.
(McKeown et al., 2002) K. McKeown, D. Evans, A. Nenkova, R. Barzilay, V. Hatzivassiloglou, B. Schiffman, S. Blair-Goldensohn, J. Klavans, S. Sigelman. The Columbia Multi-Document Summarizer for DUC 2002. In the Proceedings of the ACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2002), 2002.
(Meyers et al., 1998) A. Meyers. Using NOMLEX to produce nominalization patterns for information extraction. In the Proceedings of the COLING-ACL Workshop on Computational Treatment of Nominals, 1998.
(Miller et al., 1990) G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller. Five papers on WordNet. International Journal of Lexicography, Vol. 3, No. 4, 1990.
(Miller et al., 1993) G. A. Miller, C. Leacock, T. Randee, R. Bunker. A Semantic Concordance. In the Proceedings of the 3rd DARPA Workshop on Human Language Technology, pp. 303-308, 1993.
(Miller, 1998) G. A. Miller. Nouns in WordNet. In WordNet: An Electronic Lexical Database, C. Fellbaum (editor), pp. 23-46, Cambridge, Massachusetts, USA: The MIT Press, 1998.
(Min-Yen et al., 1998) Min-Yen Kan, J. Klavans, K. McKeown. Linear Segmentation and Segment Relevance. In the Proceedings of the 6th International Workshop on Very Large Corpora (WVLC-6), pp. 197-205, 1998.
(Mihalcea, Moldovan, 2001) R. Mihalcea, D. Moldovan. eXtended WordNet: Progress Report. In the Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, pp. 95-100, 2001.
(Mittal et al., 1999) V. Mittal, M. Kantrowitz, J. Goldstein, J. Carbonell. Selecting text spans for document summaries: Heuristics and metrics. In the Proceedings of the 16th National Conference on Artificial Intelligence, pp. 467-473, 1999.
(Mittendorf, Schauble, 1994) E. Mittendorf, P. Schauble. Document and passage retrieval based on hidden Markov models. In the Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in IR, (SIGIR-94), pp. 318-327, 1994.
(Mochizuki et al., 1998) H. Mochizuki, T. Honda, M. Okumura. Text Segmentation with Multiple Surface Linguistic Cues. In the Proceedings of the Joint International Conference on Computational Linguistics with the Association for Computational Linguistics, (COLING-ACL-98), pp. 881-885, 1998.
(Mochizuki et al., 2000) H. Mochizuki, M. Iwayama, M. Okumura. Passage Level Document Retrieval Using Lexical Chains. In the Proceedings of RIAO 2000: Content-Based Multimedia Information Access, pp. 491-506, 2000.
(Moffat et al., 1994) A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel. Retrieval of partial documents. In the Proceedings of the 2nd Text Retrieval Conference, (TREC-2), pp. 181-190, 1994.
(Moldovan, Novischi, 2002) D. Moldovan, A. Novischi. Lexical Chains for Question Answering. In the Proceedings of the International Conference on Computational Linguistics (COLING-02), pp. 674-680, 2002.
(Morris, Hirst, 1991) J. Morris, G. Hirst. Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text. Computational Linguistics, Vol. 17, No. 1, pp. 21-48, 1991.
(Navarro, 2001) G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, Vol. 33, No. 1, pp. 31-88, 2001.
(Nomoto, Nitta, 1994) T. Nomoto, Y. Nitta. A Grammatico-Statistical Approach to Discourse Partitioning. In the Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pp. 1145-1150, 1994.
(Okumura, Honda, 1994) M. Okumura, T. Honda. Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion. In the Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pp. 775-761, 1994.
(Papka, 1999) R. Papka. On-Line New Event Detection, Clustering and Tracking. PhD Dissertation, Department of Computer Science, University of Massachusetts, Amherst, 1999.
(Passonneau, Litman, 1993) R. Passonneau, D. Litman. Intention based segmentation: Human reliability and correlation with linguistic cues. In the Proceedings of the Association for Computational Linguistics, (ACL-93), pp. 148-155, 1993.
(Passonneau, Litman, 1997) R. Passonneau, D. Litman. Discourse Segmentation by Human and Automated Means. Computational Linguistics, Vol. 23, No. 1, pp. 103-139, 1997.
(Pedersen, 1996) T. Pedersen. Fishing for Exactness. In the Proceedings of the South-Central SAS Users Group Conference (SCSUG-96), 1996.
(Pevzner, Hearst, 2002) L. Pevzner, M. Hearst. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, Vol. 28, No. 1, pp. 19-36, 2002.
(Polanyi, 1998) L. Polanyi. A formal model of discourse structure. Journal of Pragmatics, Vol. 12, pp. 601-638, 1998.
(Ponte, Croft, 1998) J. Ponte, W. B. Croft. A language modeling approach to information retrieval. In the Proceedings of the 21st Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-98), pp. 275-281, 1998.
(Porter, 1997) M. F. Porter. An Algorithm for Suffix Stripping. In Readings in Information Retrieval, K. Sparck Jones and P. Willett (editors), pp. 313-316, Morgan Kaufmann Publishers, 1997.
(Quinlan, 1993) J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
(Rada et al., 1989) R. Rada, H. Mili, E. Bicknell, M. Blettner. Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics, Vol. 19, No. 1, pp. 17-30, 1989.
(Resnik, 1999) P. Resnik. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research (JAIR), Vol. 11, pp. 95-130, 1999.
(Richardson, Smeaton, 1995) R. Richardson, A. F. Smeaton. Using WordNet in a Knowledge-Based Approach to Information Retrieval. Working Paper CA-0395, School of Computer Applications, Dublin City University, 1995.
(Richmond et al., 1998) K. Richmond, A. Smith, E. Amitay. Detecting subject boundaries within text: A language independent statistical approach. In the Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-97), pp. 47-54, 1997.
(Reynar, 1994) J. Reynar. An automatic method of finding topic boundaries. In the Proceedings of the Association for Computational Linguistics (ACL-94), 1994.
(Reynar, 1998) J. Reynar. Topic Segmentation: Algorithms and Applications. PhD thesis, Computer and Information Science, University of Pennsylvania, 1998.
(Salton, McGill, 1983) G. Salton, M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York, 1983.
(Salton, 1989) G. Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison Wesley, Reading, Massachusetts, 1989.
(Salton et al., 1993) G. Salton, J. Allan, C. Buckley. Approaches to passage retrieval in full text information systems. In the Proceedings of the 16th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, (SIGIR-93), pp. 49-58, 1993.
(Salton et al., 1996) G. Salton, J. Allan, A. Singhal. Automatic Text Decomposition and Structuring. Information Processing and Management, Vol. 32, No. 2, pp. 127-138, 1996.
(Sanderson, 1994) M. Sanderson. Word Sense Disambiguation and Information Retrieval. In the Proceedings of the 17th International Conference on Research and Development in Information Retrieval, (SIGIR-94), pp. 142-151, 1994.
(Sanderson, 1997) M. Sanderson. Word Sense Disambiguation and Information Retrieval. PhD Thesis, Technical Report (TR-1997-7) of the Department of Computing Science at the University of Glasgow, 1997.
(Sanderson, 2000) M. Sanderson. Retrieving with good sense. Information Retrieval, Vol. 2, No. 1, pp. 49-69, 2000.
(Schutze, 1997) H. Schutze. Ambiguity resolution in language learning. CSLI Publications, Stanford, CA, 1997.
(Schutze, 1998) H. Schutze. Automatic word sense disambiguation. Computational Linguistics, Vol. 24, No. 1, pp. 97-123, 1998.
(SENSEVAL-2, 2001) SENSEVAL-2: Sense Disambiguation Workshop 2001, www.sle.sharp.co.uk/senseval2/, 2001.
(Silber, McCoy, 2000) H. G. Silber, K. F. McCoy. Efficient text summarization using lexical chains. In the Proceedings of Intelligent User Interfaces 2000, pp. 252-255, 2000.
(Silber, McCoy, 2002) H. G. Silber, K. F. McCoy. Efficiently Computed Lexical Chains as an Intermediate Representation for Automatic Text Summarization. Computational Linguistics, Vol. 28, No. 4, pp. 487-496, 2002.
(Slaney, Ponceleon, 2001) M. Slaney, D. Ponceleon. Hierarchical segmentation using latent semantic indexing in scale space. In the Proceedings of the IEEE International Conference on Acoustics, Speech, & Signal Processing, 2001.
(Slonim, 2002) N. Slonim. The Information Bottleneck: Theory and Applications. PhD thesis, The Hebrew University, Jerusalem, 2002.
(Smeaton et al., 2003) A. F. Smeaton, H. Lee, N. O'Connor, S. Marlow, N. Murphy. TV News Story Segmentation, Personalisation and Recommendation. In the Proceedings of the AAAI-03 Spring Symposium on Intelligent Multimedia Knowledge Management, 2003.
(Stairmand, 1996) M. A. Stairmand. A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval. PhD Thesis, Department of Language Engineering, University of Manchester Institute of Science and Technology, 1996.
(Stairmand, Black, 1997) M. A. Stairmand, W. J. Black. Conceptual and Contextual Indexing using WordNet-derived Lexical Chains. In the Proceedings of the BCS IRSG Colloquium on Information Retrieval, pp. 47-65, 1997.
(Stairmand, 1997) M. A. Stairmand. Textual context analysis for information retrieval. In the Proceedings of the 20th Annual ACM SIGIR Conference on Research and Development in IR, (SIGIR-97), pp. 140-147, 1997.
(Stevenson, 2002) M. Stevenson. Combining Disambiguation Techniques to Enrich an Ontology. In the Proceedings of the 5th European Conference on Artificial Intelligence (ECAI-02), Workshop on Machine Learning and Natural Language Processing for Ontology Engineering, 2002.
(Stokes et al., 2000a) N. Stokes, P. Hatch, J. Carthy. Topic Detection, a new application for lexical chaining? In the Proceedings of the 22nd BCS IRSG Colloquium on Information Retrieval, pp. 94-103, 2000.
(Stokes et al., 2000b) N. Stokes, P. Hatch, J. Carthy. Lexical semantic relatedness and online news event detection. In the Proceedings of the 23rd Annual ACM SIGIR Conference on Research and Development in IR, (SIGIR-00), pp. 324-325, 2000.
(Stokes et al., 2000c) N. Stokes, P. Hatch, J. Carthy. Lexical Chaining for Web-Based Retrieval of Breaking News. In the Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH 2000), pp. 327-330, 2000.
(Stokes, Carthy, 2001a) N. Stokes, J. Carthy. Using Data Fusion to Improve First Story Detection. In the Proceedings of the 23rd BCS-IRSG European Conference on IR Research, pp. 78-90, 2001.
(Stokes et al., 2001b) N. Stokes, J. Carthy. First Story Detection using a Composite Document Representation. In the Proceedings of the Human Language Technology Conference, (HLT-01), 2001.
(Stokes et al., 2001c) N. Stokes, J. Carthy. Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection. In the Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, (SIGIR-01), pp. 424-425, 2001.
(Stokes et al., 2002) N. Stokes, J. Carthy, A. F. Smeaton. Segmenting Broadcast News Streams using Lexical Chaining. In the Proceedings of the Starting Artificial Intelligence Researchers Symposium, (STAIRS-02), Vol. 1, pp. 145-154, 2002.
(Stokes, 2003) N. Stokes. Spoken and Written News Story Segmentation using Lexical Chaining. In the Proceedings of the Student Workshop at the Joint Human Language Technology Conference and Annual Meeting of the North American Chapter of the Association for Computational Linguistics, (HLT/NAACL-03), Companion Volume, pp. 49-54, 2003.
(Stokes et al., 2004a) N. Stokes, J. Carthy, A. F. Smeaton. SeLeCT: A Lexical Cohesion based News Story Segmentation System. To appear in the Journal of AI Communications, 2004.
(Stokes et al., 2004b) N. Stokes, E. Newman, J. Carthy, A. F. Smeaton. Broadcast news gisting using lexical cohesion analysis. To appear in the Proceedings of the 26th BCS-IRSG European Conference on Information Retrieval (ECIR-04), Sunderland, U.K., 2004.
(Stokoe et al., 2003) C. Stokoe, M. Oakes, J. Tait. Word Sense Disambiguation in Information Retrieval Revisited. In the Proceedings of the 26th Annual ACM-SIGIR Conference on Research and Development in IR, (SIGIR-03), pp. 159-166, 2003.
(Stolcke et al., 1999) A. Stolcke, E. Shriberg, D. Hakkani-Tur, G. Tur, Z. Rivlin, K. Sonmez. Combining words and speech prosody for automatic topic segmentation. In the Proceedings of the DARPA Broadcast News Workshop, pp. 61-64, 1999.
(St-Onge, 1995) D. St-Onge. Detecting and Correcting Malapropisms with Lexical Chains. Technical Report CSRI-319, Master's thesis, University of Toronto, March 1995.
(Sussna, 1993) M. Sussna. Word Sense Disambiguation for Free-Text Indexing Using a Massive Semantic Network. In the Proceedings of the 2nd International Conference on Information and Knowledge Management (CIKM-93), pp. 67-74, 1993.
(Utiyama, Isahara, 2001) M. Utiyama, H. Isahara. A statistical model for domain independent text segmentation. In the Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, (EACL-01), pp. 491-498, 2001.
(van Mulbregt et al., 1999) P. van Mulbregt, I. Carp, L. Gillick, S. Lowe, J. Yamron. Segmentation of automatically transcribed broadcast news text. In the Proceedings of the DARPA Broadcast News Workshop, pp. 77-80, Morgan Kaufmann Publishers, 1999.
(van Rijsbergen, 1979) C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
(Voorhees, 1993) E. M. Voorhees. Using WordNet to Disambiguate Word Senses for Text Retrieval. In the Proceedings of the 16th Annual ACM-SIGIR Conference on Research and Development in IR, (SIGIR-93), pp. 171-180, 1993.
(Voorhees, 1994) E. M. Voorhees. Query Expansion using Lexical-Semantic Relations. In the Proceedings of the 17th Annual ACM SIGIR Conference on Research and Development in IR, (SIGIR-94), pp. 61-69, 1994.
(Voorhees, 1998) E. M. Voorhees. Using WordNet for Text Retrieval. In WordNet: An Electronic Lexical Database, C. Fellbaum (editor), pp. 285-303, Cambridge, Massachusetts, USA: The MIT Press, 1998.
(Vossen, 1998) P. Vossen (editor). EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, 1998.
(Wallis, 1993) P. Wallis. Information retrieval based on paraphrase. In the Proceedings of the Pacific Association for Computational Linguistics, (PACLING-93), 1993.
(Wilkinson, 1994) R. Wilkinson. Effective retrieval of structured documents. In the Proceedings of the 17th Annual ACM SIGIR Conference on Research and Development in IR, pp. 311-317, 1994.
(Witbrock, Mittal, 1999) M. Witbrock, V. Mittal. Ultra-Summarisation: A Statistical approach to generating highly condensed non-extractive summaries. In the Proceedings of the Annual ACM SIGIR Conference on Research and Development in IR, (SIGIR-99), pp. 315-316, 1999.
(Witten et al., 1998) I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In the Proceedings of the 4th ACM Digital Libraries Conference, pp. 254-255, 1999.
(Xu, Broglio, Croft, 1994) J. Xu, J. Broglio, W. B. Croft. The design and implementation of a part of speech tagger for English. Technical Report IR-52, Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts, 1994.
(Yamron et al., 1998) J. P. Yamron, I. Carp, L. Gillick, S. Lowe, P. van Mulbregt. A hidden Markov model approach to text segmentation and event tracking. In the Proceedings of the IEEE International Conference on Acoustics, Speech, & Signal Processing, (ICASSP-98), Vol. 1, pp. 333-336, 1998.
(Yamron et al., 2002) J. P. Yamron, L. Gillick, P. van Mulbregt, S. Knecht. Statistical Models of Topical Content. Chapter 6, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002.
(Yang, Pedersen, 1998) Y. Yang, J. O. Pedersen. A comparative study on feature selection in text categorization. In the Proceedings of the 14th International Conference on Machine Learning (ICML-97), pp. 412-420, 1998.
(Yang et al., 1998) Y. Yang, T. Pierce, J. Carbonell. A Study on Retrospective and On-line Event Detection. In the Proceedings of the Annual ACM SIGIR Conference on Research and Development in IR, (SIGIR-98), pp. 28-36, 1998.
(Yang et al., 2002) Y. Yang, J. Carbonell, R. Brown, J. Lafferty, T. Pierce, T. Ault. Multi-Strategy learning for TDT. Chapter 5, In Topic Detection and Tracking: Event-based Information Organization. J. Allan (editor), Kluwer Academic Publishers, 2002.
(Yaari, 1997) Y. Yaari. Segmentation of expository texts by hierarchical agglomerative clustering. In the Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP-97), pp. 59-65, 1997.
(Yarowsky, 1993) D. Yarowsky. One sense per collocation. In the Proceedings of the ARPA Human Language Technology Workshop, 1993.
(Youman, 1991) G. Youmans. A new tool for discourse analysis: the vocabulary management profile. Language, Vol. 67, pp. 763-789, 1991.
(Zajic, Dorr, 2002) D. Zajic, B. Dorr. Automatic headline generation for newspaper stories. In the Proceedings of the ACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2002), 2002.
(Zhou, Hovy, 2003) L. Zhou, E. Hovy. Headline Summarization at ISI. In the Proceedings of the HLT/NAACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2003), 2003.