CE314/CE887 Lecture 19: Summarization: Ordering, Evaluation
Annie Louis, December 16, 2016
With some slides adapted from Ani Nenkova

Recap
Motivation and definition of the task.
How to develop methods for content selection?
- Unsupervised method: based on frequency cues
- Supervised method: used a range of features such as frequency, document position and sentence length

Today
- How to organize an output summary?
- How do you know if a summary is good?
  - What does good mean?
  - How can you do this evaluation repeatedly during system development?

Information ordering
The selected content must be ordered so as to make a coherent summary.
Why is information ordering needed?

Revisiting the frequency-based content selection algorithm
Step 1. Estimate word weights (probabilities)
Step 2. Estimate sentence weights as a function of the weights of the words in the sentence, e.g. the average word probability: weight(S) = (1/|S|) Σ_{w ∈ S} p(w)
Step 3. Choose the best sentence
Step 4. Update the word weights
Step 5. Go to Step 2 if the desired length is not reached
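As a rough illustration of this greedy loop, here is a minimal Python sketch. It assumes whitespace tokenization, takes the sentence weight to be the average probability of its words, and squares the probabilities of already-covered words in the update step (one common choice, as in SumBasic); the function and variable names are illustrative, not from the lecture.

```python
from collections import Counter

def frequency_summarizer(sentences, summary_length=3):
    """Greedy frequency-based content selection (SumBasic-style sketch)."""
    tokens = [w.lower() for s in sentences for w in s.split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}      # Step 1: word probabilities

    summary = []
    while len(summary) < min(summary_length, len(sentences)):
        # Step 2: sentence weight = average probability of the words in it
        def weight(s):
            toks = [w.lower() for w in s.split()]
            return sum(prob[w] for w in toks) / max(len(toks), 1)

        # Step 3: choose the best remaining sentence
        best = max((s for s in sentences if s not in summary), key=weight)
        summary.append(best)

        # Step 4: down-weight words already covered, to reduce redundancy
        for w in set(w.lower() for w in best.split()):
            prob[w] = prob[w] ** 2
        # Step 5: loop until the desired length is reached
    return summary
```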
Selecting sentences in order of importance does not necessarily lead to coherent text
But critics say Hong Kong Disneyland is overrated. The opening of Hong Kong Disneyland is expected to turn on a new page of Hong Kong tourism, with focus on family tourists, she said. Hong Kong Disneyland will stage its rehearsal from Aug. 16 to the theme park's opening on Sept. 12, the park said in a press release on Friday. Many in Hong Kong are ready to give Mickey Mouse a big hug for bringing Disneyland to them. Hong Kong Disneyland Hotel starts at 1,600 (euro 170) a night and Disney's Hollywood Hotel's cheapest room costs 1,000 (euro 106).
(A summary produced by an automatic system: System 24, in TAC 09, for input D0940-A)

Information ordering: the task of taking selected content and creating a coherent order for it
Especially important in multidocument summarization:
- For single documents, we can order the sentences according to their order in the source document.
- For multiple documents, the sentences come from different documents and hence a different approach is needed.

Information ordering is not the only task for improving the linguistic quality of summaries
Redundancy removal: related to preserving important content as well as linguistic quality
A. Andrew hit the Bahamas as a Category 5 hurricane and barreled toward South Florida.
B. Hurricane Andrew lasted eleven days from August 16-27, 1992 and hit land in the Bahamas and southern Florida.
Sentence compression: remove parts of a sentence which are not central to the summary's point
Long: To experts in international law and relations, the US action demonstrates a breach by a major power of international conventions.
Compressed: Experts say the US are in breach of international conventions.
Sentence fusion: combine the common information across sentences to create a new fluent sentence
Set of sentences with related content:
1. The forest is about 70 miles west of Portland.
2. Their bodies were found Saturday in a remote part of Tillamook State Forest, about 40 miles west of Portland.
3. Elk hunters found their bodies Saturday in the Tillamook State Forest, about 60 miles west of the family's hometown of Portland.
4. The area where the bodies were found is in a mountainous forest about 70 miles west of Portland.
Fused sentence: The bodies were found Saturday in the forest area west of Portland.
Barzilay & McKeown 2005. Sentence Fusion for Multidocument News Summarization. Computational Linguistics.

How to do information ordering
Many, many techniques. Most of the standard and powerful ones are probabilistic in nature, e.g. using Hidden Markov Models. We will look at a very simple technique.

What is needed to do information ordering?
Quantifiable measures of what generally constitutes well-structured text, combined with domain knowledge to indicate what sequences of events and topics are likely.
Challenging to find orderings for longer summaries: n! possible orderings for n sentences.

Majority Ordering Algorithm [Barzilay et al 2002]
Follow the order of topics (or themes) that exists in a majority of the input documents.
Each sentence is associated with a theme (using an approach such as clustering).
Weights = number of input documents where that ordering between two themes is observed.
An approximate solution based on topological sort: find the path with maximum weight visiting every node only once, as in the sketch below.
Barzilay et al 2002. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. JAIR.
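The following Python sketch shows one greedy approximation in this spirit: given pairwise precedence counts between themes, it repeatedly picks the remaining theme with the largest outgoing-minus-incoming weight. This is only an illustration of the idea, not necessarily the exact procedure of Barzilay et al.; the theme names and counts in the example are invented.

```python
def greedy_theme_order(themes, weight):
    """Greedily order themes by precedence weights.

    weight[(a, b)] = number of input documents in which theme a appears
    before theme b.  At each step, pick the remaining theme with the
    largest (outgoing - incoming) weight, a common greedy approximation
    to the maximum-weight path over all themes.
    """
    remaining = set(themes)
    order = []
    while remaining:
        def score(t):
            out_w = sum(weight.get((t, u), 0) for u in remaining if u != t)
            in_w = sum(weight.get((u, t), 0) for u in remaining if u != t)
            return out_w - in_w
        best = max(remaining, key=score)
        order.append(best)
        remaining.remove(best)
    return order

# Example: three themes with pairwise precedence counts from the input documents
w = {("opening", "reaction"): 4, ("reaction", "opening"): 1,
     ("opening", "prices"): 3, ("prices", "reaction"): 2}
print(greedy_theme_order(["opening", "reaction", "prices"], w))
```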
Summary evaluation
We care about both content selection quality as well as linguistic quality.

Document Understanding Conference (DUC) / Text Analysis Conference (TAC)
Large-scale evaluation workshops conducted by NIST (National Institute of Standards and Technology), US Dept. of Commerce.
Runs yearly evaluations of summarization systems (among other tasks).
Has made a big impact in the field by testing and validating different evaluation methods.

Typically evaluation is divided into two facets
Content selection quality
Linguistic quality

Content selection quality
Typically, we make use of model summaries, also known as reference summaries or gold-standard summaries.
A human subject reads the full input set of documents and writes a summary - the model (reference) summary.
Treated as the desired/target content in a good summary.
- In the DUC evaluations, these summaries were written by retired Information Analysts who worked for the US government
- So they are expertly-written summaries

Based on a single model summary: content coverage scores
Followed in the initial years of evaluation (up to 2005)
- The model summary is split into content units or clauses
- Similarly, the peer (machine) summary is also split into clauses
- Score = average coverage of model units by the peer units
Average coverage is computed as follows:
- For each model unit, find all peer units expressing the information
- Judge how well the set of peer units expresses that model unit (judged as a %)
- Average over all the model units

Remember that content coverage is based on a single model summary, and that is problematic
Humans differ a lot in the content they choose to express in their summary.
The same machine summary can receive widely different scores when evaluated against different human summaries!
A gold standard created from multiple human summaries is better → Pyramid Method

Multiple human summaries as the gold standard: Pyramid method [Nenkova et al., 2007]
Main idea: facts/propositions which multiple people mentioned in their summaries must get more weight than those mentioned in one summary only.
Based on Semantic Content Units (SCUs):
- Emerge from the analysis of several texts
- Link different surface realizations with the same meaning
Nenkova et al. The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation, 2007.

SCU example
Let's say S1, S2, S3 are three different human summaries. Annotators mark the semantic units in them and link them up.
S1 Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.
S2 Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
S3 Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.

SCU: label, weight, contributors
Label: London was where Pinochet was arrested
Weight = 3 (number of human summaries where the SCU was expressed)
Contributors: the sentences in the human summaries where the SCU was expressed (S1, S2 and S3 above)

After annotating SCUs, a pyramid can be constructed to represent them
The number of model summaries n determines the maximum number of tiers in the pyramid.
- Each tier contains only the SCUs with the same weight.
- The top tier has units expressed in all the summaries; the bottom tier has units present in one summary only.
Example: a pyramid from 4 model summaries
- 2 SCUs were present in all 4 model summaries
- 3 SCUs were present each in 3 model summaries
- ...

Ideally informative summary
Using the pyramid, we can define an ideally informative summary: it does not include an SCU from a lower tier unless all SCUs from higher tiers are included as well.
Different summaries that satisfy this property are equally good.

Scoring using the pyramid
Many scores have been defined; we will not go into the details. The idea is to compare the weights of units in the machine summary to the score possible for an ideally informative summary, as in the sketch below.
Remember that every step in this evaluation is done manually.
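As a rough illustration, the original pyramid score divides the total weight of the SCUs expressed in the machine summary by the weight of an ideally informative summary containing the same number of SCUs. A minimal Python sketch under that definition (SCU identification itself is manual; the numbers in the example are invented):

```python
def pyramid_score(peer_scu_weights, pyramid_weights):
    """Original pyramid score: observed SCU weight over the weight of an
    ideally informative summary with the same number of SCUs.

    peer_scu_weights: weights of the SCUs found in the machine summary.
    pyramid_weights:  weights of all SCUs in the pyramid (one per SCU).
    """
    observed = sum(peer_scu_weights)
    # An ideally informative summary with the same number of SCUs takes
    # them from the top tiers of the pyramid first.
    ideal = sum(sorted(pyramid_weights, reverse=True)[:len(peer_scu_weights)])
    return observed / ideal if ideal > 0 else 0.0

# Pyramid from 4 model summaries: two SCUs of weight 4, three of weight 3, ...
pyramid = [4, 4, 3, 3, 3, 2, 2, 1, 1, 1]
print(pyramid_score([4, 3, 1], pyramid))   # machine summary expressed 3 SCUs
```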
Can we automate some aspects of summary evaluation?
During system development, we need to repeatedly perform evaluations to understand changes to the system.
ROUGE: Recall-Oriented Understudy for Gisting Evaluation [Lin, 2004]
- Once model summaries are available, the comparison between model summary and machine summary is done automatically
- Based on n-gram overlaps between the model and machine summaries
- Can be computed with either a single model summary or multiple model summaries
Lin, 2004. ROUGE: a package for automatic evaluation of summaries. ACL Text Summarization Workshop.

ROUGE for content quality evaluation
The ROUGE comparison is highly correlated with human judgements of similarity, but not perfect:
- Above 0.9 correlation with human judgements
- ROUGE is the standard measure during development of summarization systems
Many versions of ROUGE are available (they vary in the type of n-grams and the inclusion of word-similarity information); a simple version is sketched below.
Tool available at http://www.berouge.com/Pages/default.aspx
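For example, here is a minimal Python sketch of ROUGE-N recall under simplifying assumptions: plain whitespace tokenization, lowercasing, no stemming or synonym handling, and n-gram counts pooled over the model summaries (the official toolkit offers other multi-reference options). The example texts are invented.

```python
from collections import Counter

def rouge_n(peer, models, n=2):
    """ROUGE-N sketch: n-gram recall of the peer (machine) summary
    against one or more model summaries."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    peer_counts = ngrams(peer)
    matched, total = 0, 0
    for model in models:
        model_counts = ngrams(model)
        total += sum(model_counts.values())
        # Clipped overlap: each model n-gram counts as matched at most as
        # many times as it occurs in the peer summary.
        matched += sum(min(c, peer_counts[g]) for g, c in model_counts.items())
    return matched / total if total else 0.0

print(rouge_n("hong kong disneyland opened in september",
              ["disneyland opened in hong kong in september 2005"], n=2))
```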
Linguistic quality evaluation
Mostly done manually. Rate the summary on a scale of 1 to 5 for each of the following questions:
- Are the sentences grammatical? (no datelines, capitalization errors or fragmented sentences)
- Does the summary contain redundant information? (no unnecessary repetition, e.g. using a full noun phrase 'Bill Clinton' when a pronoun 'he' would suffice)
- Are the references to different entities clear? (it should be clear what a pronoun or entity stands for)
- Does the summary have a flow, building up sentence by sentence? (it should have a focus and be well-structured and organized)

Overall summary quality: responsiveness
Used as a measure of both content and linguistic quality. Judges are asked to rate a summary on a scale of 1 to 5 for overall quality.
No model summaries are needed, so this is useful in new domains where resource creation is costly.

Summary of evaluation methods
Content quality evaluation:
- Fully manual
  - Single model: content coverage score
  - Multiple models: pyramid score
- Model summaries created manually, but the comparison between model and machine summary is automatic
  - ROUGE
Linguistic quality evaluation:
- Manual ratings for different quality aspects
Both content and linguistic quality: responsiveness
- Manual, but no model summaries needed

When to use which evaluation?
Automatic methods such as ROUGE aim to approximate human judgements, but they are not perfect. As a result, we need automatic methods during system development as well as human evaluation when comprehensiveness is needed.
For some aspects of summaries, such as linguistic quality, current automatic methods are not very good, and so evaluation is primarily manual.