Information Processing and Management 41 (2005) 569–586
www.elsevier.com/locate/infoproman

Generic technologies for single- and multi-document summarization

Marie-Francine Moens *, Roxana Angheluta, Jos Dumortier
Interdisciplinary Centre for Law & IT (ICRI), Katholieke Universiteit Leuven, Tiensestraat 41, B-3000 Leuven, Belgium
Received 30 July 2003; accepted 17 December 2003
Available online 5 March 2004

Abstract

The technologies for single- and multi-document summarization that are described and evaluated in this article can be used on heterogeneous texts for different summarization tasks. They cover the extraction of important sentences from the documents, the compression of the sentences to their essential or relevant content, and the detection of redundant content across sentences. The technologies were tested at the Document Understanding Conference, organized by the National Institute of Standards and Technology, USA, in 2002 and 2003. The system obtained good to very good results in this competition. We also tested our summarization system on a variety of English Encyclopedia texts and on Dutch magazine articles. The results show that relying on generic linguistic resources and statistical techniques offers a basis for text summarization.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Text summarization; Topic detection; Headline construction

1. Introduction

Automatic text summarization aims at condensing text to its essential content and assists in filtering and selecting information. With the explosion of text found on the World Wide Web, the need for summarization tools that operate on heterogeneous texts is large. Moreover, summarization tools should be able to represent the content of single- and multiple-document texts at various levels of detail. There is also a need for the automatic creation of generic and viewpoint-oriented summaries. A generic or topic-general summary is a summary that truly reflects the main content of the original text.
A viewpoint-oriented summary summarizes text according to a certain viewpoint, which might express the information need of the user in a retrieval system. Our “generic” summarization technologies aim at processing texts from a variety of sources without the need for a priori knowledge acquisition about the collection, and use only knowledge resources that are or can be made generally available (e.g., natural language processing tools such as a stochastic part-of-speech (POS) tagger that identifies the most probable syntactic class of a word or a parser that detects the grammatical structure of a sentence).

In summarization, it is important to exploit the discourse structure of text and sentences in order to detect main content. We have developed technologies that detect important content in a single discourse based on linguistic theories of topic and comment or focus and on patterns of thematic progression in texts. In this way we can build hierarchical topic trees of individual texts, and perform topic segmentation and summarization at various levels of topical detail. In order to reduce the content of an individual sentence, we analyze the parsed sentence. The parser allows us to distinguish the main grammatical constructions from the dependent ones and gives an indication of the semantic relationships between content items. Sentence reduction is especially needed when the summaries are in the form of headlines. Finally, when summarizing multiple documents, it is important to detect redundant content. This can be done with statistical techniques that cluster the lexical and syntactic features of sentences.

* Corresponding author. Tel.: +32-16-325383; fax: +32-16-325438.
E-mail addresses: [email protected] (M.-F. Moens), [email protected] (R. Angheluta), [email protected] (J. Dumortier).
0306-4573/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2003.12.006
The technologies are implemented in the SUMMA summarization system of the K.U. Leuven. In this paper we first discuss the generic summarization technologies and focus on the detection of important sentences at different levels of topical detail, on the compression of sentences and on the detection of redundant content. Then, we describe our participation in DUC 2002 and 2003 and our own experiments with English Encyclopedia texts and Dutch magazine articles, including the evaluation procedures. We present and discuss the results, compare our technologies with existing summarization techniques and discuss future improvements. Before concluding we describe related research.

2. Methods

2.1. Preprocessing of the texts

The modules for optional preprocessing of the texts focus on an initial cleaning, a refined tokenization into single and compound words, and grammatical analysis. Initial cleaning of the text consists of removing tags, text between certain tags, very short sentences, parenthetical text and direct speech. For some texts, direct speech might not be very important with regard to their content (e.g., in a news story: “I don't hold much hope” she added in a telephone.). This is contrary to indirect speech, which might contain valuable information.

The SUMMA system also contains modules for text tokenization and lexical analysis. The most important functionalities are sentence break detection and the detection of compound terms. For the latter task we use the likelihood ratio for testing the hypothesis that two words, $w_1$ and $w_2$, occur independently in a corpus (Dunning, 1993), i.e., $P(w_2 \mid w_1) = p_1 = p_2 = P(w_2 \mid \neg w_1)$. We assume a binomial probability distribution for the two events, compute the corresponding likelihood function and calculate the ratio $\lambda$ of the maximum value of the likelihood function over the subspace represented by $p_1 = p_2$ and its maximum value over the entire parameter space of $p_1$ and $p_2$.
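The likelihood ratio test described above can be sketched as follows. This is a minimal illustration and not the SUMMA code: the function names, the count-based interface and the clamping constant are assumptions made for the example. It returns the statistic $-2 \ln \lambda$, which can be compared with a chi-square critical value for one degree of freedom (3.84 at the 5% level).

```python
import math


def _log_l(k, n, p):
    """Binomial log-likelihood of k successes in n trials.

    The binomial coefficient is omitted because it cancels in the ratio.
    """
    # Clamp p away from 0 and 1 so that log() stays defined.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)


def llr(c1, c2, c12, n):
    """Return -2 ln(lambda) for the candidate bigram (w1, w2).

    c1, c2: corpus counts of w1 and w2 (hypothetical inputs);
    c12: count of w2 directly following w1;
    n: total number of bigram positions in the corpus.
    """
    p = c2 / n                   # H0: P(w2|w1) = P(w2|not w1) = p
    p1 = c12 / c1                # H1: P(w2|w1)
    p2 = (c2 - c12) / (n - c1)   # H1: P(w2|not w1)
    log_lambda = (_log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
                  - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2))
    return -2.0 * log_lambda
```

A word pair whose statistic exceeds the chosen critical value would be kept as a collocation candidate; the threshold actually used in SUMMA is not stated in the text.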
Asymptotically, $-2 \ln \lambda$ is $\chi^2$ distributed with one degree of freedom. We can accept or reject the hypothesis with a degree of certainty. When the hypothesis is rejected, we have a strong indication that the words are correlated or form a collocation. We use this technique of hypothesis testing to detect collocations of bi- or tri-grams composed of nouns and bi- or tri-grams composed of adjective(s) followed by noun(s).

For the grammatical analysis of English texts we use the LT CHUNK software of the University of Edinburgh, which assigns part-of-speech (POS) tags (of the Penn treebank) to the words of the text and detects the boundaries of simple syntactic groups (Mikheev, 1998). Our Dutch texts were POS tagged by Antal Van den Bosch (Tilburg University) and the tags were translated to the Penn treebank tag set. For the English headlines, we rely upon Charniak's parser, which uses Maximum Entropy Classification for detecting the phrases and their dependencies in individual sentences (Charniak, 2000).

2.2. Detection of important sentences at different levels of topical detail

We have developed an algorithm that detects the hierarchical topic segmentation of a text (Moens & De Busser, 2001). The algorithm consists of the following steps: (1) detection of the main topic of a sentence; (2) computation of the term distributions of a text; (3) construction of the hierarchical topic tree.

The main topic of a sentence is the word or word group that reflects the aboutness or the topical participant of that sentence. A sentence is composed of a topic and of additions (comments to or focus of the topic, e.g., properties of the topic, relationships with other items, modifications of the topic). This structure is closely connected to the way we communicate orally and is considered to be common to all languages.
Linguistic theories give us a number of heuristics for detecting the main topic of a sentence that are valid for many languages (e.g., Givón, 1983, 1988; Grosz & Sidner, 1998; Gundel, 1988; Meinunger, 2000, p. 90), among which are the initial position of the noun phrase and the persistence, across consecutive sentences, of the term (or its referent) that indicates the topic. Especially in SVO (Subject-Verb-Object) languages such as English and Dutch, noun phrases in initial position are stressed. Related to persistence is the recurring pattern in which a comment to the topic becomes the main topic of the next sentence.

We also compute the distribution of each term in a text, which gives us information on term frequency, co-occurrence and proximity. Knowing the topics of sentences and the term distributions allows us to compute topic shifts, nested topics (i.e., subtopics of another topic) and semantic returns (i.e., a topic is suspended at one point and resumed later in the discourse), and to find the topic tree. The text is processed in a way similar to how humans read it and make inferences about its content while considering the most recent discourse context. The hierarchical table of contents or topic tree is gradually built and corrected as more evidence becomes available (Moens & De Busser, 2001). It indicates the more general and more detailed subtopics of a text. For each topic, the text segment that covers the topic is represented by its boundaries, i.e., its begin and end positions in the text in terms of character positions. Each topic is described with one or more extracted terms (an example is given in Fig. 1).

A topic tree allows zooming in and out of the content of a text. It can be exploited in two different ways for automatic summarization. A selected number of the topical levels can be chosen as the basis of the summary.
Or the topics with the highest coverage (i.e., that cover the largest segments as indicated by the character pointers in the text) can be selected. Both approaches are flexible because different levels of topical detail can be chosen. Moreover, the tree allows selecting certain topical viewpoints and computing their salience. As a next step, sentences that introduce the topics, i.e., the first sentence of the corresponding topical segment in the text or phrases that relate the topics, can be extracted from the text to form the summary.

Fig. 1. Example of a topic tree (DUC-2002 set d061j document AP880911-0016.S).

2.3. Sentence compression

In the foregoing section we have discussed how we can select sentences that reflect different levels of topical detail. For summarization tasks in which brevity is crucial, it is often necessary to reduce the length of these sentences to their relevant content. SUMMA uses a number of linguistically motivated cues for sentence compression. Two main parameters are considered in the sentence condensation process: (1) syntactic and morphosyntactic information; and (2) the presence of important topic terms in the sentence. The aim is to find the semantic relationships between the topic terms.

Fig. 2. Three-step sentence compression (DUC-2003 set d100a document NYT19990925.0263).

A condensed summary sentence (also called a headline) is constructed by outputting the phrase with the smallest length that spans the largest number of topic terms in its respective clause (Fig. 2). This corresponds to segmenting all sentences into separate clauses (embedded clauses are extracted from their superordinate clauses and treated as separate entities); ordering all clauses according to their overlap with the topic terms; and outputting the substring between two or more topic terms in the highest ranked clauses.
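The core of the condensation step, finding the shortest substring of a clause that spans the largest number of topic terms, can be sketched as follows. This is an illustration under simplifying assumptions and not the SUMMA implementation: the clause is assumed to be already segmented and tokenized, `topic_terms` is a hypothetical set of lower-cased single-word topic terms, and the search is a plain exhaustive scan.

```python
def shortest_topic_span(tokens, topic_terms):
    """Return the shortest contiguous token span of a clause that covers
    every distinct topic term occurring in that clause.

    Returns an empty list when the clause holds fewer than two distinct
    topic terms, since the substring is taken between two or more terms.
    """
    lowered = [t.lower() for t in tokens]
    present = {t for t in lowered if t in topic_terms}
    if len(present) < 2:
        return []
    best = None
    for start, tok in enumerate(lowered):
        if tok not in present:
            continue  # a minimal span always starts on a topic term
        seen = set()
        for end in range(start, len(lowered)):
            if lowered[end] in present:
                seen.add(lowered[end])
            if seen == present:
                # Keep the span only if it is strictly shorter than the best so far.
                if best is None or end - start < best[1] - best[0]:
                    best = (start, end)
                break
    return tokens[best[0]:best[1] + 1]
```

Multi-word topic terms and the ranking of clauses by topic-term overlap are left out of this sketch.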
In a final stage, auxiliary verbs and determiners are removed from the outputted substrings. From a linguistic point of view, the sentence condensation method results in truncated simple clauses or phrases that indicate a semantic relation between two or more topic terms. The reduction scheme also causes the phrasal context premodifying the first and postmodifying the last topic term to be removed, except when it is part of a collocation or compound term.

SUMMA's sentence reduction scheme effectively deletes subclauses from a candidate clause. When deleting relative clauses without coreference resolution, we might miss important content clauses that refer to topical entities. However, deletion of embedded clauses seems justified in most cases. When appositive clauses contain sufficient information relevant to the summarization task, they have a fair chance of ending up in the summary as an independent clause. Non-restrictive relative clauses usually function as a gloss of a noun phrase and can be deleted safely. Restrictive relative clauses contain information that affects the interpretation of the main clause. However, it could be argued that they function like a definite article and restrict the set of referents, to which their antecedent is referring, to a definite set of instances. In summary headlines, references to definiteness (i.e., articles and restrictive relative clauses) are usually deleted under the assumption that “it is possible for the listener to identify the intended referent.” (Mårdh, 1980, p. 117). We assume here that the final removal of determiners and auxiliaries from the headlines will not substantially affect the content of the headline, although the removal of modal auxiliaries might in some cases lead to grammatical incorrectness (e.g., “Poll might not show increase in popularity” would be reduced to “Poll not show increase in popularity”).

2.4. Detection of redundant content

When summarizing multiple documents, the texts should be condensed considerably and redundant content should be eliminated. In the SUMMA system we implemented two clustering methods, the covering and the k-medoid method (Kaufman & Rousseeuw, 1990, p. 68 ff.; Moens, 2000, p. 175 ff.), and use the clustering to detect sentences similar in content and to select the most representative sentence (medoid) of each cluster of related sentences. Both are non-hierarchical (partitioning) methods that are based on the selection of representative objects (i.e., medoids). A candidate medoid attracts the most similar sentences from the set of remaining sentences based on a criterion or constraint of cluster goodness. The problem that both algorithms try to solve can be seen as an optimization problem. The mathematical models for the algorithms use the following notations:

• The set of n objects (i.e., sentences) to be clustered is denoted by $X = \{x_1, x_2, \ldots, x_n\}$.
• The similarity between objects $x_i$ and $x_j$ (also called objects i and j) is denoted by $s(i, j)$. In our implementation it was computed as the cosine between the term vectors of the sentences. Feature words can be selected based on their POS class. In the experiments described below we restrict the words to nouns, verbs and adjectives.
• A solution to the model is determined by two types of decisions:
  - The selection of objects as representative objects in clusters: $y_i$ is defined as a 0–1 variable, equal to 1 if and only if object i is selected as the representative object ($i = 1, \ldots, n$).
  - The assignment of each object j to one of the selected representative objects: $z_{ij}$ is a 0–1 variable, equal to 1 if and only if object j is assigned to the cluster of which i is the representative object.

The covering and k-medoid algorithms differ in the constraints they impose on the clusters.
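Before turning to the two models, the similarity $s(i, j)$ defined in the notation above can be sketched as the cosine between term vectors restricted to nouns, verbs and adjectives. The sketch makes two assumptions that the text does not confirm: raw term frequencies are used as vector weights, and each sentence arrives as a list of (word, Penn Treebank tag) pairs; the names are illustrative only.

```python
import math
from collections import Counter

# Penn Treebank tag prefixes for nouns, verbs and adjectives.
CONTENT_TAGS = ("NN", "VB", "JJ")


def term_vector(tagged_sentence):
    """Build a term-frequency vector from (word, tag) pairs,
    keeping only nouns, verbs and adjectives."""
    return Counter(word.lower() for word, tag in tagged_sentence
                   if tag.startswith(CONTENT_TAGS))


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0
```

Computing `cosine` for every sentence pair yields the similarity matrix on which both clustering models operate.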
In the covering method the objective is to minimize the number of clusters (equivalent to minimizing the number of medoids) so that the similarity between the medoid and the objects of each cluster is at least a threshold S. This can be represented in a mathematical model as:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} y_i \\
\text{where} \quad & \sum_{i=1}^{n} z_{ij} = 1, && j = 1, 2, \ldots, n \\
& z_{ij} \le y_i, && i, j = 1, 2, \ldots, n \\
& s(i, j) \ge S \ \text{where} \ z_{ij} = 1, && i, j = 1, 2, \ldots, n
\end{aligned}
$$

The covering algorithm has the advantage that the threshold values can be used to tune the level of similarity or redundancy in a cluster of sentences.

The k-medoid algorithm attempts to partition a set of objects into k clusters. The objective of the algorithm is to maximize the sum of similarities between the objects and their medoid:

$$
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{n} \sum_{j=1}^{n} s(i, j)\, z_{ij} \\
\text{where} \quad & \sum_{i=1}^{n} z_{ij} = 1, && j = 1, 2, \ldots, n \\
& z_{ij} \le y_i, && i, j = 1, 2, \ldots, n \\
& \sum_{i=1}^{n} y_i = k, && k = \text{number of clusters}
\end{aligned}
$$

The complexity of the above algorithms grows exponentially with the number of objects to be clustered (e.g., in the k-medoid method the number of possible combinations of n objects into k clusters is tested, i.e., $n!/(k!\,(n-k)!)$ combinations). Therefore we implemented a good, but not optimal solution for the clustering when the number of clusters exceeds a threshold value. For the covering algorithm individual subsets of objects are clustered and clusters are merged when the similarity of their medoids exceeds the required threshold similarity value (S), after which the medoid of the merged cluster is recomputed. The k-medoid algorithm starts with an initial clustering around k medoids based on a greedy divisive clustering into k clusters. Secondly, objects are swapped iteratively until no better clustering can be found (i.e., the average similarity of an object and its medoid cannot be increased). The medoid sentences make up the summary.
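Two simple approximations of the models above can be sketched as follows. They are not the SUMMA heuristics: `greedy_covering` swaps in a textbook greedy set-cover strategy instead of the subset-and-merge procedure, and `k_medoids` starts from the first k objects rather than from a greedy divisive clustering. Both take a symmetric similarity matrix `sim` as a list of lists, and all names are hypothetical.

```python
def greedy_covering(sim, threshold):
    """Greedy approximation of the covering model.

    Repeatedly picks the uncovered object that attracts the most uncovered
    objects with similarity >= threshold. Returns {medoid: [members]}.
    """
    uncovered = set(range(len(sim)))
    clusters = {}
    while uncovered:
        # Ties are broken in favour of the lowest index.
        best = max(sorted(uncovered),
                   key=lambda i: sum(sim[i][j] >= threshold for j in uncovered))
        members = sorted(j for j in uncovered
                         if j == best or sim[best][j] >= threshold)
        clusters[best] = members
        uncovered -= set(members)
    return clusters


def k_medoids(sim, k):
    """Swap-based local search for the k-medoid model.

    Maximizes the sum of similarities between each object and its most
    similar medoid. Returns the sorted list of medoid indices.
    """
    n = len(sim)
    medoids = list(range(k))  # naive start; an assumption of this sketch

    def score(meds):
        return sum(max(sim[i][j] for i in meds) for j in range(n))

    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                # Accept a swap only when it strictly improves the objective.
                if score(trial) > score(medoids) + 1e-12:
                    medoids, improved = trial, True
    return sorted(medoids)
```

In both cases the sentences at the returned medoid indices would form the summary.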
We implemented the algorithms in such a way that the output can flexibly adapt to the required summary length. This is useful when redundancy detection is used as a final step in the summarization process. The covering algorithm allows choosing a minimum ($S_{min}$) and a maximum ($S_{max}$) threshold similarity. Between these two limits, the threshold can take a variable number of values. We split the interval [$S_{min}$, $S_{max}$] into smaller intervals (their number is a parameter in the program) and for each small interval a solution is computed. The solution that best fits the required length of the summary is picked. The size of the interval and the parameter for the further division of the interval determine the computational complexity of this solution. When determining the best k in the k-medoid method, we start with a small k value and gradually increase it until the required summary length in words is reached.

The above algorithms have advantages over common hierarchical algorithms (e.g., single linkage, group average) as they identify a distinctive medoid sentence and group with this medoid sentences that are quite similar in content. A medoid sentence is most representative of all the other sentences in the cluster and is included in the summary. The algorithms, even in their good solution implementations, are less greedy than the above hierarchical clustering methods, but in general computationally more complex. This is not a problem, because we use the algorithms for redundancy detection between the sentences of a summary.

The above technologies make up the current SUMMA system and are implemented in separate modules in C and Perl.

3. Our experiments, results and discussion

3.1. Participation at the DUC conferences

We participated in the Document Understanding Conference of 2002 and 2003 organized by the National Institute of Standards and Technology (NIST) (a limited account of the results can be found in Angheluta, De Busser, & Moens, 2002, 2003).
In 2002 we were involved in the following tasks: single-document summaries of ca. 100 words, multi-document abstracts of ca. 10 (headlines), 50, 100 and 200 words, and multi-document extracts of ca. 200 and 400 words. Extracts refer to summaries composed of literal sentences. The test corpus consisted of 59 sets of documents obtained from newspapers such as the Wall Street Journal, AP Newswire, San Jose Mercury News, Financial Times, LA Times and FBIS (Foreign Broadcast Information Service). In 2003 we participated in the following tasks: very short single-document summaries of ca. 10 words (headlines), multi-document summaries of ca. 100 words and viewpoint-oriented multi-document summaries of ca. 100 words. The test corpus here consisted of 60 sets of documents of the AP Newswire, New York Times Newswire and Xinhua News Agency. A set for generic multi-document summaries was on topic; for the viewpoint-oriented abstracts there were no restrictions on the topics in a set. Except for the viewpoint-oriented ones, the summaries have a generic character, i.e., they should reflect the main content of single documents or a set of documents. Multi-document summaries are made per set. A set contains 10 documents and ca. 330–360 sentences.

NIST has manually created abstracts and extracts of the documents, and automatically created summaries by extracting the first words or sentences from the texts, which form the baseline summaries used in evaluation (Over & Ligett, 2002; Over & Yen, 2003). For evaluation, a model summary for each document is chosen from the summaries that are manually made. Our system-created summary is evaluated with regard to: (1) the coverage or recall of each model unit in the summary; (2) the quality of the summary; (3) the usefulness (only for headlines).
Coverage is computed from completeness judgments of 0%, 20%, 40%, 60%, 80% or 100% for a system-created summary as compared to the model summary (Lin & Hovy, 2002). The aim is to maximize coverage while minimizing the length of the summary. This is why a length-adjusted coverage is also computed. For DUC 2002 length-adjusted coverage is defined as the weighted sum of coverage and brevity, where coverage is twice as important as brevity and brevity is zero when the summary exceeds the target length. For DUC 2003 the improved length-adjusted coverage has the constraint that it is zero when coverage is zero. The length-adjusted coverage with penalty corrects the coverage with the factor “target length/actual length”, which additionally penalizes a summary that is over the target length. For the extracts of DUC 2002, recall (R), precision (P) and their equally weighted combination ($\beta = 1$) in the F-measure are computed:

$$
R = \frac{\text{number of correct extracts}}{\text{number of desired extracts}}, \qquad
P = \frac{\text{number of correct extracts}}{\text{number of extracts}}, \qquad
F_{\beta=1} = \frac{2PR}{P + R}
$$

The quality of the summary is, for instance, measured in terms of the correctness of the logical and temporal order of the sentences in the summary, the grammatical correctness of the sentences or in terms of the redundancy of the information. In DUC 2003, the usefulness of a headline was evaluated as a score that is inversely proportional to the rank that two assessors gave each summary of the document with regard to how well it would lead them to select the document (e.g., in a retrieval task), assuming that the content of the document is relevant.

In the following sections we discuss summarization with the SUMMA system and its results.

3.2. Single-document summaries

For each document a hierarchical topic tree is constructed. The predefined length of the summary dictates the level of topical detail required for the summary. At a chosen level of topical detail, the first sentence of each topical segment is included in the summary.
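The selection of segment-initial sentences at a given level of topical detail can be sketched as follows. The data structure is hypothetical: the text only states that each topic carries descriptive terms and the begin and end character positions of its segment, so the `Topic` class, the assumption that segments start at sentence boundaries, and the rule of taking the deepest level that still fits the word budget are choices made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Node of a topic tree (illustrative structure, not SUMMA's)."""
    terms: list
    begin: int   # character offset where the topical segment starts
    end: int     # character offset where the topical segment ends
    children: list = field(default_factory=list)


def topics_up_to(topic, max_level, level=0):
    """Yield all topics from the root (level 0) down to max_level."""
    yield topic
    if level < max_level:
        for child in topic.children:
            yield from topics_up_to(child, max_level, level + 1)


def depth(topic):
    """Number of levels in the tree rooted at topic."""
    return 1 + max((depth(c) for c in topic.children), default=0)


def summarize(root, sentences, max_words):
    """Return the first sentence of each topical segment, in reading order,
    at the deepest level of detail that stays within max_words.

    sentences: list of (begin_offset, text) pairs in reading order.
    """
    best = []
    for level in range(depth(root)):
        starts = {t.begin for t in topics_up_to(root, level)}
        chosen = [text for begin, text in sentences if begin in starts]
        if sum(len(s.split()) for s in chosen) > max_words:
            break
        best = chosen
    return best
```

Because the sentence list is scanned in document order, the extracted sentences come out in reading order without an extra sort.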
The extracted sentences are presented in reading order (an example is given in Fig. 3). We performed the preprocessing steps except for the removal of direct speech (see above), because this module was not yet fully implemented. The modules used in this step are presented in Fig. 4a.

Fig. 5 shows the results of the mean coverage and mean length-adjusted coverage in comparison with other teams that participated in DUC-2002. If we combine the scores of mean coverage and mean length-adjusted coverage and weight these scores equally,¹ there are only three systems among the 11 competitor systems that score better, with the best system having an increase of 3% in mean coverage and of 4% in mean length-adjusted coverage. These systems use additional knowledge in the form of manually acquired domain knowledge (Harabagiu & Lăcătușu, 2002), or use a classifier (i.e., a support vector machine classifier in Hirao, Sasaki, Isozaki, & Maeda, 2002, or a combination of a logistic regression and Hidden Markov Model classifiers in Schlesinger et al., 2002) that is trained upon example texts and their summaries. This makes the better systems less portable to new or heterogeneous text domains. If we consider the mean error with regard to the quality questions, we score about average.

¹ This combination is justified because some systems produce very short or empty summaries, which make their mean coverage value low and mean length-adjusted coverage high.

Fig. 3. Example of a 100-word single-document abstract (DUC-2002 set d061j document AP880911-0016.S).

Fig. 4. Architecture of the summarizer: (a) single-document summaries; (b) multi-document extracts and abstracts; (c) single-document headlines; (d) multi-document viewpoint-oriented abstracts.
It can be seen that the SUMMA system, which uses no additional knowledge sources and does not need to be trained on example texts and summaries, scores quite well in this competition. Single-document summarization depends entirely upon the construction of the hierarchical topic trees, which supports the validity of this technology.

Fig. 5. Mean coverage of the SUMMA system compared to the other DUC-2002 systems for 100-word single-document abstracts. The baseline score is 0.37 and the manual scores range between 0.46 and 0.57.

3.3. Multi-document extracts and abstracts

The 200-word and 400-word extract summaries are made by clustering the literal sentences of 100-word summaries of single documents and, in an additional experiment, by clustering the sentences of 50-word single-document summaries.² The term vectors (which are restricted to the open word classes of nouns, adjectives and verbs) of the sentences of the summaries are clustered with the k-medoid method (used for the 200-word extracts) or the covering method (used for the 400-word extracts). The representative sentences or medoids are included in the summaries. Sentences from different documents are ordered according to the chronology of the documents. Sentences from the same document are presented in reading order. The modules are represented in Fig. 4b.

² We thank Hans Van Halteren, Katholieke Universiteit Nijmegen, for the evaluation of the extracts based on 50-word single-document summaries.

Fig. 6 shows that the average F-measure obtained by our system in DUC-2002 for the 400-word extracts made from 50-word single-document summaries was beaten only by a system that trains with a Weighted Probability Distribution Voting classifier on example texts and their summaries (Van Halteren, 2002).
Because we had access to the ideal extracts after DUC-2002, we could perform additional experiments to evaluate the clustering. We computed precision, recall and F-measure of multi-document extracts obtained from single-document summaries of different lengths, from the documents' lead paragraphs³ and from all sentences of the texts. Clustering all document sentences in order to generate the multi-document extract summary gave the worst results, while clustering only the sentences of the 50-word single-document summaries gave the best results (Fig. 7). This is despite the fact that recall can only be maximal when all document sentences are used, because the summaries lacked some of the relevant sentences to be extracted. Clustering the sentences of the lead paragraphs of the texts fits this pattern. On average the lead paragraph contained 71 words. In our experiments we saw that if we cluster many sentences into a small number of clusters, the overlap between the sentences in the final clusters comes down to just one word or no words, resulting in clusters that do not group similar content. Moreover, the more varied the content of a cluster is, the harder it is for its medoid to represent the complete cluster. Clustering around representative objects (sentences) or medoids seems to us especially useful for grouping and eliminating redundant content.

The availability of the suggested summary sentences also allowed us to test the performance of the covering and k-medoid clustering. Analysis of variance showed that the results of these two algorithms were not statistically different.

³ When the lead paragraph was not marked in the text, we considered the first three sentences of the text.

Fig. 6. Mean F-measure of the SUMMA system compared to the other DUC-2002 systems for 400-word multi-document extracts. The baseline score is 0.294 and the manual score is 0.270.
Our multi-document abstracts are constructed in the same way as the extracts (an example is given in Fig. 8), except that rhetorical connectives are removed from the summary sentences. For DUC-2002 we consistently used the covering algorithm for clustering the sentences, for DUC 2003 the k-medoid algorithm. The results of our system were about average in terms of coverage and length-adjusted coverage. We made no effort to further reduce sentences to their main content, which could explain our average coverage scores. Because we use a very simplistic heuristic for ordering the sentences in the summary for both the extracts and the abstracts, we have errors for the question concerning the time sequence and cause–effect relationship. However, a zero error score on the quality question on unnecessarily repeated information shows that the clustering is effective for eliminating redundant content.

The creation of multi-document extracts and abstracts relies on the construction of the hierarchical topic trees and adds the technique of sentence clustering based on terms selected by their POS tag. Statistical clustering as well as part-of-speech tagging are generic technologies that are generally available.

3.4. Single-document headlines

The headlines are constructed from the most important topic terms of a text. They are selected as the k (in our experiments k was empirically set to six) terms in the hierarchical topic tree with the highest coverage in the text, i.e., representing the topics that span the largest text segments in terms of character pointer differences. A number of sentences that contain the selected topic terms are identified in the text and are ordered according to the number of distinct topic terms they contain, the proximity of the topic terms, the size of the clause (with priority to short clauses) and the distance of the clause from the beginning of the text.
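The ordering criteria listed above can be sketched as a single sort key. Several details are assumptions of this illustration, since the text does not specify them: the criteria are applied lexicographically in the order in which they are listed, proximity is measured as the token distance between the first and last topic-term occurrence, and clauses arrive as token lists in text order.

```python
def rank_clauses(clauses, topic_terms):
    """Return clause indices ordered from best to worst headline candidate.

    clauses: list of token lists in text order (hypothetical input format).
    topic_terms: set of lower-cased topic terms.
    """
    def key(item):
        position, tokens = item
        hits = [i for i, t in enumerate(tokens) if t.lower() in topic_terms]
        distinct = len({tokens[i].lower() for i in hits})
        # Smaller spread means the topic terms lie closer together.
        spread = hits[-1] - hits[0] if hits else len(tokens)
        # More distinct terms first, then proximity, brevity, early position.
        return (-distinct, spread, len(tokens), position)

    return [pos for pos, _ in sorted(enumerate(clauses), key=key)]
```

A weighted combination of the four criteria would be an equally plausible reading of the description; the lexicographic order is simply the easiest variant to inspect.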
Headlines are constructed from the parse trees of each of the top-ranked sentences using the sentence compression technique described above (see Fig. 4c). The results are encouraging. We rank third with regard to coverage (ca. 36%, compared to 40% for the best system) and second with regard to usefulness of the headlines (ca. 2.2 on a scale from 0 to 4, where the best system scores about 2.4) among the 12 competing systems (Fig. 9). Apparently the extracted phrases that relate the important topic terms are considered very useful. The better-scoring system also uses topical clues obtained from topic segmentation and lexical chain information in order to select the most important topical sentences, and it further reduces the size of the selected sentences with the help of a parser (Chali, Kolla, Singh, & Zhang, 2003). Our system currently has the disadvantage that a clause must contain at least two topic terms in order to be compressed. This makes the reduction easy, but carries the risk that no suitable clause can be found, which happened in ca. 30% of the documents.

Fig. 7. Mean F-measure of the SUMMA system for 200-word (a) and 400-word (b) multi-document extracts, starting from single-document summaries of different lengths and using the k-medoid or the covering clustering algorithm.

Fig. 8. Example of a 50-word multi-document abstract (DUC-2002 set d061j).

Fig. 9. Mean coverage (a) and usefulness (b) of the SUMMA system compared to the other DUC-2003 systems for 10-word single-document abstracts. The baseline scores are 0.47 (a) and 2.63 (b) and the manual scores range between 0.48 and 0.6 (a) and 2.7 and 3.43 (b).

The generation of headlines uses hierarchical topic trees for the selection of important sentences and a syntactic parse of the selected sentences. As syntactic parsers become generally available for different languages, we might label this technology as generic.

3.5. Multi-document viewpoint-oriented abstracts

Given a viewpoint description in the form of a natural language string, the task was to construct a 100-word summary of a set of documents that integrates the viewpoint-relevant parts of the different documents (an example is given in Fig. 10). For this task, we construct the hierarchical topic tree of each document of the set in order to extract its most important sentences, based on the 100-word summaries. Then we rank the important sentences by their term overlap with the viewpoint words and select those sentences that have a minimum overlap of one noun, verb or adjective. Finally, we cluster the selected sentences in order to remove redundant content. Here we use the k-medoid method, because it results in a natural clustering without the need to set a similarity threshold for cluster membership. The schema of this process is shown in Fig. 4d. An alternative schema, in which we first find the sentences that overlap with the viewpoint and then build a topic tree of their concatenation, did not yield good results. This is to be expected, because the concatenated sentences lack the coherence required for topic tree building.

A mean coverage of about 17%, a mean improved length-adjusted coverage of more than 19% and a mean length-adjusted coverage with penalty of 12% put us in each case in fourth place among the 11 competing automated systems, with a 2% difference from the best system for mean coverage and mean improved length-adjusted coverage and only about a 0.5% difference from the best system for the mean length-adjusted coverage with penalty.
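The viewpoint pipeline described in this section (rank by term overlap, keep sentences with an overlap of at least one content term, then cluster to drop redundant sentences) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: every function name is hypothetical, each sentence is assumed to arrive as a set of nouns, verbs and adjectives already extracted by a POS tagger, similarity is assumed to be Jaccard overlap, the number of clusters `k` is simply passed in, and the medoids are seeded with a farthest-point heuristic chosen only to keep the example deterministic.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of content terms."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def viewpoint_filter(sentences, viewpoint_terms, min_overlap=1):
    """Indices of sentences sharing >= min_overlap terms with the viewpoint,
    ordered by decreasing overlap (ties keep document order)."""
    kept = []
    for idx, terms in enumerate(sentences):
        overlap = len(terms & viewpoint_terms)
        if overlap >= min_overlap:
            kept.append((overlap, idx))
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in kept]

def k_medoids(items, k, max_iter=100):
    """Plain alternating k-medoids; returns sorted medoid indices."""
    if not items or k <= 0:
        return []
    k = min(k, len(items))
    # Assumed seeding: start from item 0, then repeatedly add the item
    # farthest from the medoids chosen so far.
    medoids = [0]
    while len(medoids) < k:
        candidates = [i for i in range(len(items)) if i not in medoids]
        medoids.append(max(
            candidates,
            key=lambda i: min(jaccard_distance(items[i], items[m]) for m in medoids),
        ))
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, item in enumerate(items):
            nearest = min(medoids, key=lambda m: jaccard_distance(item, items[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance in its cluster.
        new_medoids = []
        for members in clusters.values():
            if not members:
                continue  # can happen when two medoids have identical terms
            new_medoids.append(min(
                members,
                key=lambda c: sum(jaccard_distance(items[c], items[j]) for j in members),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

def viewpoint_summary(sentences, viewpoint_terms, k):
    """Filter by viewpoint overlap, then keep one medoid per cluster."""
    kept = viewpoint_filter(sentences, viewpoint_terms)
    medoids = k_medoids([sentences[i] for i in kept], k)
    return [kept[m] for m in medoids]

sentences = [
    {"hubble", "telescope", "repair", "mission"},
    {"hubble", "telescope", "mission", "repair", "successful"},
    {"astronaut", "replace", "camera"},
    {"weather", "delay", "launch"},
]
viewpoint = {"hubble", "mission", "astronaut"}
summary_indices = viewpoint_summary(sentences, viewpoint, k=2)
```

In this toy input the last sentence shares no term with the viewpoint and is filtered out, and the two near-identical Hubble sentences fall into one cluster, so only one of them reaches the summary. Because the medoid of each cluster is an actual sentence, the representative can be emitted verbatim.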
The best system also relies upon discourse entity or topic overlap between the sentences of the documents and the question, but uses coreference resolution in the text to compute the importance of a topic more accurately (Litowski, 2003). Again, a zero error score on redundancy of the summary shows the usefulness of the clustering.

The current limitation of our approach is that the viewpoint-oriented abstracts are constructed from the single-document abstracts. This works well when the viewpoint information is found in them (as is the case, e.g., for set d112 on the Hubble Space Telescope Service Mission, with a coverage of more than 50%, which is well above the average of 33%). When the viewpoint words are found in sentences on more marginal topics, our technology, as it is currently implemented, will fail. However, there is certainly room for improvement. We think it is important to use the topic tree for ranking the sentences by topical importance and to give priority to important viewpoint sentences.

Another limitation is that simple term overlap is insufficient as a way of precisely selecting viewpoint sentences. The viewpoints contain certain concepts, and understanding them often requires additional knowledge of what the concepts may comprise. For instance, the viewpoint “Chronology of ‘Peanuts’ creator Charles Schutz's career” (d102) or “Steps leading to the introduction of the Euro” (d127) involves knowing what the important components of a career or of the introduction of a currency might be, which is knowledge that must be available a priori. Other viewpoints, such as “The republican Presidential primaries of 2000 showed problems with the system as eleven announced candidates dropped out one by one until George W. Bush was left with no active opponents almost six months before the election” (d108), contain many facets that should be synthesized in one short summary of 100 words. Here we definitely need to reduce relevant sentences to the content most relevant for the viewpoint and possibly fuse the resulting phrases into a coherent sentence.

Fig. 10. Example of a viewpoint description and a 100-word multi-document viewpoint abstract (DUC-2003 set d100ai).

The creation of multi-document viewpoint-oriented abstracts relies on the construction of the hierarchical topic trees and adds the technique of sentence clustering based on terms selected by their POS tag. The results also show the limits of the generic approach: additional domain knowledge is sometimes necessary.

3.6. Summarization of other texts

In a limited experiment, a researcher manually made 50-word single-document summaries of 10 English encyclopedia texts from the Columbia Encyclopedia and of 10 Dutch magazine articles from the magazines Knack and Trends, in order to demonstrate the generic character of the technologies. We used the tool developed by Chin-Yew Lin (Lin & Hovy, 2002) (see above) to compare the system-made summaries with the manually made ones. For the encyclopedia texts, the mean coverage was 0.449 and the improved mean length-adjusted coverage with penalty was 0.316; for the Dutch articles they were 0.312 and 0.213 respectively. The results are comparable with those obtained for the DUC corpus, which is an indication of the generic character of the techniques.

4. Related research

Recent overviews of single- and multi-document summarization can be found in Radev, McKeown, and Hovy (2002), Hovy (2002) and Moens, Angheluta, and De Busser (2003). Topic segmentation and detection have proven effective in text summarization. Other top-performing systems at DUC-2003 also use a form of topic segmentation (e.g., Chali et al., 2003).
Existing topic segmentation algorithms usually produce a linear segmentation of the text, assuming that the topics are sequentially organized (e.g., Choi, 2000; Hearst, 1997; Kan, Klavans, & McKeown, 1998; Ponte & Croft, 1997). Our topic segmentation algorithm detects both the hierarchical and the sequential topical segments, including the semantic return of a topic and the level of topical detail of an entity or sentence. Although hierarchical topic segmentation is acknowledged to be very valuable for generic and viewpoint-oriented text summarization, text searching and navigation (Kan, McKeown, & Klavans, 2001; Yang & Wang, 2003; Zizi & Beaudouin-Fafon, 1995), the number of approaches that hierarchically segment loosely structured or unstructured text is limited. The hierarchical segmentation algorithm of Yaari (2000) is based on a hierarchical clustering of the sentences. This technique requires that a subtopic sentence contain the terms of its more general topics, or their coreferents, which is not always the case. This drawback might be alleviated by grouping paragraphs (Salton, Singhal, Buckley, & Mitra, 1996), but the clustering also often relies on threshold similarity values for cluster membership, which might be difficult to set across heterogeneous texts. Marcu (2000) detects the rhetorical structure of a text based on rhetorical cues found in it and represents the text as a tree that indicates the salience of each sentence or clause. This approach is sometimes not feasible because rhetorical cues are lacking or ambiguous, but it would be useful to integrate into our current topic segmentation algorithm in order to detect topic salience in a more refined way. The hierarchy of topics and subtopics of a text can also be detected by a knowledge-rich (and therefore domain-dependent) semantic parsing of the text (Hahn, 1990).
For text summarization, lexical chains of topic terms (Morris & Hirst, 1991) can be computed per text segment (Barzilay & Elhadad, 1999; Chali et al., 2003). The chains are built from repeated terms or from related terms, and the salience of a topic term can be computed from the number of members of its chain. The generic approach of frequency counting of phrasal chunks is still an effective technique for generating very short summaries (Copeck & Szpakowicz, 2003). Our algorithm combines topic segmentation and detection and can identify the broader or narrower context in which a term is used in the discourse in hierarchically nested text segments.

Where we use parsing to reduce the content of sentences, our work is most closely related to the work of Knight and Marcu (2001) and to the Hedge Trimmer system (Dorr, Zajic, & Schwartz, 2003). Knight and Marcu use a training set of text/abstract pairs and train a classifier to find the best compression of a parsed sentence, separately using a statistical noisy-channel model and induction of decision rules to compress the parse tree. Hedge Trimmer, like our work, uses linguistically motivated heuristics to compress the parse tree. Our work differs from Hedge Trimmer in that we select important sentences from the document as those that contain the topic terms with the largest coverage, as computed by our hierarchical topic segmentation algorithm. Hedge Trimmer always uses the first sentence of the document, which might be a good choice for news stories, but not for other texts.

Sentence clustering has been used quite often in single- and multi-document summarization (e.g., Hatzivassiloglou et al., 2001). When clustering is used in text summarization, usually all sentences of the documents are clustered, which does not yield good results for multi-document summarization.
We use clustering techniques only for eliminating redundant content from a set of relevant sentences. Using sentence similarity to detect redundant content is quite popular (e.g., Maximal Marginal Relevance (MMR) in Carbonell & Goldstein, 1998; the probability of novelty in Allan, Gupta, & Khandelwal, 2001). The clustering algorithms that we use select, from a set of sentences similar in content, a representative sentence to be included in the summary. Another approach, which summarizes the similarities and differences among related documents, exploits cohesion relationships between terms (synonymy/hypernymy, repetition, adjacency and coreference) represented as graphs, and includes these relationships in the computation of the salience of a term. Similarities and differences between text segments are then described as, respectively, the overlapping and non-overlapping salient terms of the segments (Mani & Bloedorn, 1999). Our summarization techniques might be improved by correct synonym detection and coreference resolution (see Section 5). In addition, our clustering of sentences might be improved by refining the features used, i.e., by considering not only lexical items but also semantic roles (e.g., actor, location, time) in the matching of sentences. The usefulness of such an approach has been demonstrated by Guo and Stylios (2003).

Many text summarization algorithms require a training set of text/abstract pairs. Our algorithms do not need any training, which makes them portable to many texts, subject to the constraints of our current topic segmentation algorithm (i.e., texts are written in an SVO (subject-verb-object) language and are sufficiently coherent) and to the availability of a parser for sentence reduction.

5. Conclusions and future improvements

We developed the SUMMA system as part of the project “Generic technology for information extraction from texts”.
Our aim was to develop generic building blocks for automatic text summarization that can be used for a variety of summarization tasks and a variety of texts. The technologies rely upon resources that are generally available and do not require training on example texts and their example summaries. They comprise: (1) reducing the texts to their most relevant sentences with regard to the generic or viewpoint-oriented abstracts at different levels of topical detail; (2) reducing the selected sentences to relevant content; and (3) especially in the case of multi-document summarization, eliminating redundant content. We assume that these tasks are essential in an automated summarization system. We acknowledge that the technologies can be refined by relying upon additional knowledge sources or upon knowledge acquired from training, but our purpose here was to test the bare technologies. In the tests of DUC-2002 and DUC-2003 our summarization system could compete with the best systems, many of which use training, and consistently scored good to very good on the different summarization tasks (i.e., headline summarization, generic single- and multi-document abstracts and extracts, and viewpoint-oriented multi-document abstracts).

For each of the above technologies we see further room for improvement. For the detection of important sentences at different levels of topical detail, we could incorporate the use of rhetorical cues into the topic segmentation algorithm. A rhetorical structure analysis of the text has proven very useful for detecting the salience of a sentence or clause in the discourse (Marcu, 2000). However, rhetorical cues (e.g., certain adverbs, tense and aspect of verbs) are not always explicitly present in the text or might have an ambiguous meaning. When they are present, they might aid in building the hierarchical topic tree.
For the detection of relevant content of sentences, we should further analyze the parse trees of the sentences and investigate how best to exploit the relevant topical entities and their syntactic and semantic relationships for text summarization. By relevant entities we mean, when building generic summaries, important entities that have a large coverage in the texts and, when building viewpoint-oriented summaries, viewpoint entities that can be found in a question or topic statement of the user. We will explore better ways of reducing content based on the parse output of a sentence when it contains only one important topic term (cf. Lacatușu, Parker, & Harabagiu, 2003).

For detecting redundant content, we use the simple approach of clustering the sentences based on noun, verb or adjective overlap. Such an approach can be useful in summarizing multiple news stories, which often contain almost identical sentences obtained from the same news agency. We could investigate how to refine this redundancy detection by incorporating additional features or by weighting features differently (e.g., by topical prominence in the sentence, temporal features, etc.) (cf. Hatzivassiloglou et al., 2001).

All tasks might be refined by having reliable tools for synonym resolution and coreference resolution of noun phrases in texts. The former requires a machine-readable resource with synonym terms and the contextual terms that are needed for word sense disambiguation (Angheluta & Moens, 2002). Noun phrase coreference resolution is the ability to relate each noun phrase (including pronouns) in a text to its referent in the real world. We have developed algorithms for detecting “identity” relationships in a text, which cluster coreferring entities based on a number of linguistic features (Mitra, Angheluta, Jeuniaux, & Moens, 2003). We intend to incorporate this tool in our summarization system.
Acknowledgements

The research was sponsored by the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), Belgium. We thank Rik De Busser for his valuable comments. We also want to thank Donna Harman, Paul Over and their team (National Institute of Standards and Technology, USA) for the evaluation of the work. We are grateful to Hans van Halteren for evaluating extract summaries. We also want to thank an anonymous reviewer for the valuable comments. Shorter versions of part of the research were presented at the Document Understanding Conference 2002 and 2003.

References

Allan, J., Gupta, R., & Khandelwal, V. (2001). Temporal summaries of news topics. In W. B. Croft, D. J. Harper, D. H. Kraft, & J. Zobel (Eds.), Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 10–18). New York: ACM.
Angheluta, R., De Busser, R., & Moens, M.-F. (2002). The use of topic segmentation for automatic summarization. In U. Hahn & D. Harman (Eds.), Proceedings of the workshop on automatic summarization, Philadelphia, Pennsylvania, USA, July 11–12, 2002 (pp. 66–70). Gaithersburg, MD: NIST.
Angheluta, R., & Moens, M.-F. (2002). A study about synonym replacement in news corpora. In Proceedings of the third Dutch–Belgian information retrieval workshop. Leuven, Belgium: Katholieke Universiteit Leuven.
Angheluta, R., Moens, M.-F., & De Busser, R. (2003). K.U. Leuven summarization system for DUC 2003. In D. Radev & S. Teufel (Eds.), Proceedings of the text summarization workshop and 2003 document understanding conference May 31 and June 1, 2003 (pp. 110–117). Gaithersburg, MD: NIST.
Barzilay, R., & Elhadad, M. (1999). Using lexical chains for text summarization. In I. Mani & M. T. Maybury (Eds.), Advances in automatic text summarization (pp. 111–121). Cambridge, MA: MIT Press.
Carbonell, J., & Goldstein, J. G. (1998). The use of MMR, diversity based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 335–336). New York: ACM.
Chali, I., Kolla, M., Singh, N., & Zhang, Z. (2003). The University of Lethbridge text summarizer at DUC-2003. In D. Radev & S. Teufel (Eds.), Proceedings of the text summarization workshop and 2003 document understanding conference May 31 and June 1, 2003 (pp. 148–152). Gaithersburg, MD: NIST.
Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the ANLP-NAACL'2000, Seattle, Washington.
Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. In Proceedings of the 1st North American chapter of the association for computational linguistics, Seattle, Washington (pp. 26–31).
Copeck, T., & Szpakowicz, S. (2003). Picking phrases, picking sentences. In D. Radev & S. Teufel (Eds.), Proceedings of the text summarization workshop and 2003 document understanding conference May 31 and June 1, 2003 (pp. 179–186). Gaithersburg, MD: NIST.
Dorr, B., Zajic, D., & Schwartz, R. (2003). Hedge Trimmer: a parse-and-trim approach to headline generation. In D. Radev & S. Teufel (Eds.), Proceedings of the HLT-NAACL 2003 workshop on text summarization (pp. 1–8). Omnipress.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74.
Givón, T. (1983). Introduction. In T. Givón (Ed.), Topic continuity in discourse: a quantitative cross-language study (pp. 1–41). Amsterdam: John Benjamins.
Givón, T. (1988). The pragmatics of word-order: predictability, importance and attention. In M. Hammond, E. Moravcsik, & J. Wirth (Eds.), Studies in syntactic typology (pp. 243–284). Amsterdam: John Benjamins.
Grosz, B. J., & Sidner, C. L. (1998). Lost intuitions and forgotten intentions. In M. A. Walker, A. K. Joshi, & E. F. Prince (Eds.), Centering theory in discourse (pp. 39–51). Oxford, UK: Clarendon Press.
Gundel, J. (1988). Universals of topic-comment structure. In M. Hammond, E. Moravcsik, & J. Wirth (Eds.), Studies in syntactic typology (pp. 209–239). Amsterdam: John Benjamins.
Guo, Y., & Stylios, G. (2003). A new multi-document summarization system. In D. Radev & S. Teufel (Eds.), Proceedings of the text summarization workshop and 2003 document understanding conference May 31 and June 1, 2003 (pp. 102–109). Gaithersburg, MD: NIST.
Hahn, U. (1990). Topic parsing: accounting for text macro structures in full-text analysis. Information Processing & Management, 26(1), 135–170.
Harabagiu, S. M., & Lacatușu, F. (2002). Generating single and multi-document summaries with GISTexter. In U. Hahn & D. Harman (Eds.), Proceedings of the workshop on automatic summarization, Philadelphia, Pennsylvania, USA, July 11–12, 2002 (pp. 30–38). Gaithersburg, MD: NIST.
Hatzivassiloglou, V., Klavans, J. L., Holcombe, M. L., Barzilay, R., Kan, M.-Y., & McKeown, K. R. (2001). SimFinder: a flexible clustering tool for summarization. In Proceedings of the NAACL workshop on automatic summarization, Pittsburgh, PA.
Hearst, M. A. (1997). TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23, 33–64.
Hirao, T., Sasaki, Y., Isozaki, H., & Maeda, E. (2002). NTT's text summarization system for DUC-2002. In U. Hahn & D. Harman (Eds.), Proceedings of the workshop on automatic summarization, Philadelphia, Pennsylvania, USA, July 11–12, 2002 (pp. 104–107). Gaithersburg, MD: NIST.
Hovy, E. H. (2002). Automated text summarization. In R. Mitkov (Ed.), Oxford University handbook of computational linguistics. Oxford: Oxford University Press.
Kan, M.-Y., Klavans, J. L., & McKeown, K. R. (1998). Linear segmentation and segment relevance. In Proceedings of 6th international workshop of very large corpora (WVLC-6), Montreal, Quebec, Canada, August 1998 (pp. 197–205).
Kan, M.-Y., McKeown, K. R., & Klavans, J. L. (2001). Domain-specific informative and indicative summarization for information retrieval. In D. Harman & D. Marcu (Eds.), Proceedings of DUC 2001 workshop on text summarization. Available: http://wwwnlpir.nist.gov/projects/duc/pubs.html/#2001.
Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: an introduction to cluster analysis. New York: John Wiley & Sons.
Knight, K., & Marcu, D. (2001). Statistical-based summarization step one: sentence compression. In Proceedings of the 17th national conference of the American association for artificial intelligence AAAI'2000, Austin, Texas, July 30–August 3, 2000.
Lacatușu, V. F., Parker, P., & Harabagiu, S. M. (2003). LITE-GISTexter: generating short summaries with minimal resources. In D. Radev & S. Teufel (Eds.), Proceedings of the text summarization workshop and 2003 document understanding conference May 31 and June 1, 2003 (pp. 122–128). Gaithersburg, MD: NIST.
Lin, C.-Y., & Hovy, E. (2002). Manual and automatic evaluation of summaries. In U. Hahn & D. Harman (Eds.), Proceedings of the workshop on automatic summarization, Philadelphia, Pennsylvania, USA, July 11–12, 2002. Gaithersburg, MD: NIST.
Litowski, K. C. (2003). Text summarization using XML-tagged documents. In D. Radev & S. Teufel (Eds.), Proceedings of the text summarization workshop and 2003 document understanding conference May 31 and June 1, 2003 (pp. 63–70). Gaithersburg, MD: NIST.
Meinunger, A. (2000). Syntactic aspects of topic and comment. Amsterdam: John Benjamins.
Mani, I., & Bloedorn, E. (1999). Summarizing similarities and differences among related documents. In I. Mani & M. T. Maybury (Eds.), Advances in automatic text summarization (pp. 357–379). Cambridge, MA: The MIT Press.
Marcu, D. (2000). The theory and practice of discourse parsing and summarization. Cambridge, MA: The MIT Press.
Mårdh, I. (1980). Headlines: a grammar of English text headlines. Lund, Sweden: Gleenup Lund.
Mikheev, A. (1998). Part-of-speech guessing rules: learning and evaluation. Available: http://www.ltg.ed.ac.uk/software/pos/ (visited November 24th, 2003).
Mitra, R., Angheluta, R., Jeuniaux, P., & Moens, M.-F. (2003). Progressive fuzzy clustering for noun phrase coreference resolution. In Proceedings of the 4th Dutch–Belgian information retrieval workshop. Amsterdam, The Netherlands: University of Amsterdam.
Moens, M.-F. (2000). Automatic indexing and abstracting of document texts. Boston, MA: Kluwer Academic Publishers.
Moens, M.-F., Angheluta, R., & De Busser, R. (2003). Summarization of texts found on the World Wide Web. In W. Abramowicz (Ed.), Knowledge-based information retrieval and filtering from the Web. Boston, MA: Kluwer Academic Publishers.
Moens, M.-F., & De Busser, R. (2001). Generic topic segmentation of document texts. In W. B. Croft, D. J. Harper, & J. Zobel (Eds.), Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 418–419). New York: ACM. A more elaborate description of the algorithm is submitted.
Morris, J., & Hirst, G. (1991). Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17, 21–43.
Over, P., & Ligett, W. (2002). Introduction to DUC-2002: an intrinsic evaluation of generic news text summarization systems. Gaithersburg, MD: NIST.
Over, P., & Yen, J. (2003). An introduction to DUC-2003: intrinsic evaluation of generic news text summarization systems. In Proceedings of the 2003 document understanding conference. Gaithersburg, MD: NIST.
Ponte, J. M., & Croft, W. B. (1997). Text segmentation by topic. In Proceedings of the first European conference on research on advanced technology for digital libraries (pp. 113–125).
Radev, D. R., McKeown, K., & Hovy, E. (2002). Introduction to the special issue on summarization. Computational Linguistics, 28(4), 399–408.
Salton, G., Singhal, A., Buckley, C., & Mitra, M. (1996). Automatic text decomposition using text segments and text themes. Hypertext'96 (pp. 53–65).
Schlesinger, J. D. et al. (2002). Understanding machine performance in context of human performance for multi-document summarization. In U. Hahn & D. Harman (Eds.), Proceedings of the workshop on automatic summarization, Philadelphia, Pennsylvania, USA, July 11–12, 2002 (pp. 71–77). Gaithersburg, MD: NIST.
Van Halteren, H. (2002). Writing style recognition and sentence extraction. In U. Hahn & D. Harman (Eds.), Proceedings of the workshop on automatic summarization, Philadelphia, Pennsylvania, USA, July 11–12, 2002 (pp. 66–70). Gaithersburg, MD: NIST.
Yaari, Y. (2000). NLP-assisted exploration of texts. In Proceedings RIAO'2000 content-based multimedia information access, Paris, April 12–14, 2000. Paris: CID-CASIS.
Yang, C., & Wang, F. L. (2003). Fractal summarization for mobile devices to access large documents on the Web. In Proceedings of the twelfth international World Wide Web conference, Budapest, Hungary, May 20–24, 2003. New York: ACM.
Zizi, M., & Beaudouin-Fafon, M. (1995). Hypermedia exploration with interactive dynamic maps. International Journal of Human–Computer Studies, 43(3), 441–464.