CE314/CE887 Lecture 19: Summarization: Ordering, Evaluation

Annie Louis
December 16, 2016
With some slides adapted from Ani Nenkova
Recap
Motivation, definition of the task
How do we develop methods for content selection?
- Unsupervised method: based on frequency cues
- Supervised method: uses a range of features, such as frequency, document position, and sentence length
Today
- How to organize an output summary?
- How do you know if a summary is good?
  - What does "good" mean? How can you do this evaluation repeatedly during system development?
Information ordering
The selected content must be ordered so as to make a coherent
summary.
Why is information ordering needed?
Revisiting the frequency-based content selection algorithm
Step 1. Estimate word weights (probabilities)
Step 2. Estimate sentence weights
weight(S) = (1/|S|) Σ_{w ∈ S} p(w), i.e. the average probability of the words in S
Step 3. Choose best sentence
Step 4. Update word weights
Step 5. Go to Step 2 if the desired length is not reached
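A minimal Python sketch of this loop (the slides do not prescribe an implementation). It assumes the input sentences are already tokenized into lowercase words, and since Step 4 does not fix an update rule, it uses the probability-squaring rule from the SumBasic system as one plausible choice.

from collections import Counter

def frequency_summary(sentences, max_sentences=3):
    """Greedy frequency-based selection following Steps 1-5 above.

    sentences: list of sentences, each a non-empty list of
    lowercase tokens.
    """
    # Step 1: word probabilities estimated from the whole input
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}

    summary, remaining = [], list(sentences)
    while remaining and len(summary) < max_sentences:
        # Step 2: sentence weight = average probability of its words
        def weight(s):
            return sum(p[w] for w in s) / len(s)
        # Step 3: choose the best-scoring sentence
        best = max(remaining, key=weight)
        summary.append(best)
        remaining.remove(best)
        # Step 4: discount words already covered (SumBasic-style
        # squaring; an assumption, the slide leaves the rule open)
        for w in set(best):
            p[w] = p[w] ** 2
        # Step 5: the while-condition re-checks the length budget
    return summary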
Selecting sentences in order of importance does not
necessarily lead to coherent text
But critics say Hong Kong Disneyland is overrated.
The opening of Hong Kong Disneyland is expected to turn on a
new page of Hong Kong tourism, with focus on family tourists,
she said.
Hong Kong Disneyland will stage its rehearsal from Aug. 16 to
the theme park’s opening on Sept. 12, the park said in a press
release on Friday.
Many in Hong Kong are ready to give Mickey Mouse a big hug
for bringing Disneyland to them.
Hong Kong Disneyland Hotel starts at 1,600 (euro 170) a night
and Disney’s Hollywood Hotel’s cheapest room costs 1,000 (euro
106).
A summary produced by an automatic system (System 24 in TAC 2009, for input D0940-A)
Information ordering: The task of taking selected content
and creating a coherent order for it
Especially important in multidocument summarization
- For single documents, we can order the sentences according to their order in the source document.
- For multiple documents, the sentences come from different documents, and hence a different approach is needed.
Information ordering is not the only task for improving the
linguistic quality of summaries
Redundancy removal: related to preserving important content as well as to linguistic quality
A. Andrew hit the Bahamas as a Category 5 hurricane and
barreled toward South Florida.
B. Hurricane Andrew lasted eleven days from August 16-27, 1992
and hit land in the Bahamas and southern Florida.
Sentence compression: removes parts of a sentence that are not central to the summary's point
Long: To experts in international law and relations, the US action demonstrates a breach by a major power of international conventions.
Compressed: Experts say the US are in breach of international conventions.
Information ordering is not the only task for improving the
linguistic quality of summaries
Sentence fusion: combines the common information across sentences to create a new, fluent sentence
Set of sentences with related content:
1. The forest is about 70 miles west of Portland.
2. Their bodies were found Saturday in a remote part of Tillamook
State Forest, about 40 miles west of Portland.
3. Elk hunters found their bodies Saturday in the Tillamook State Forest, about 60 miles west of the family's hometown of Portland.
4. The area where the bodies were found is in a mountainous
forest about 70 miles west of Portland.
Fused sentence: The bodies were found Saturday in the forest
area west of Portland.
Barzilay & McKeown 2005. Sentence Fusion for Multidocument News
Summarization. Computational Linguistics
How to do information ordering
Many techniques exist. Most of the standard and powerful ones are probabilistic in nature, e.g. using Hidden Markov Models.
We will look at a very simple technique.
What is needed to do information ordering?
Quantifiable measures of what generally constitutes well-structured
text
Combined with domain knowledge to indicate what sequences of
events and topics are likely
Challenging to find orderings for longer summaries: n! possible
orderings for n sentences
Majority Ordering Algorithm [Barzilay et al. 2002]
Follow the ordering of topics (or themes) observed in the majority of the input documents.
Each sentence is associated with a theme (using an approach such
as clustering)
Barzilay et al. 2002. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. JAIR.
Majority Ordering Algorithm
Edge weights = the number of input documents in which that ordering between two themes is observed
An approximate solution based on topological sort
Find the path with maximum weight that visits every node exactly once
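A sketch in Python of one greedy way to approximate this, assuming each input document is already mapped to the sequence of themes it mentions (the function name and representation are illustrative, not Barzilay et al.'s exact procedure). It repeatedly emits the theme whose outgoing precedence weight most exceeds its incoming weight among the themes still left.

def majority_order(themes, doc_orders):
    """Greedy approximation to the majority ordering.

    themes: the set of themes to order.
    doc_orders: for each input document, its themes in order of
    appearance (each theme drawn from `themes`).
    """
    # Edge weight w[a][b] = number of documents in which theme a
    # appears before theme b
    w = {a: {b: 0 for b in themes if b != a} for a in themes}
    for order in doc_orders:
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                if a != b:
                    w[a][b] += 1

    # Repeatedly pick the theme that "should come first" among the
    # remaining ones: max(outgoing weight - incoming weight)
    ordered, left = [], set(themes)
    while left:
        def score(t):
            out_w = sum(w[t][u] for u in left if u != t)
            in_w = sum(w[u][t] for u in left if u != t)
            return out_w - in_w
        nxt = max(left, key=score)
        ordered.append(nxt)
        left.remove(nxt)
    return ordered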
Summary evaluation
We care about both content selection quality as well as
linguistic quality.
Document Understanding Conference (DUC) / Text Analysis Conference (TAC)
Large-scale evaluation workshops conducted by NIST (National Institute of Standards and Technology), US Dept. of Commerce.
Runs yearly evaluations of summarization systems (among other tasks)
Has made a big impact in the field by testing and validating
different evaluation methods
Typically evaluation is divided into two facets
Content selection quality
Linguistic quality
Content selection quality
Typically, we make use of model summaries, also known as reference summaries or gold-standard summaries.
A human subject reads the full input set of documents and writes a summary: the model or reference summary.
Treated as the desired/target content in a good summary.
- In the DUC evaluations, these summaries were written by retired Information Analysts who worked for the US government
- So they are expertly written summaries
Based on a single model summary: Content coverage scores
Followed in the initial years of evaluation (up to 2005)
- The model summary is split into content units or clauses
- Similarly, the peer (machine) summary is also split into clauses
- Score = average coverage of the model units by the peer units
Average coverage is computed as follows:
- For each model unit, find all peer units expressing the same information
- Judge how well the set of peer units expresses that model unit (judged as a percentage)
- Average over all the model units
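As a hypothetical worked example: if a model summary has three model units judged as 100%, 40% and 0% covered by the peer units, the coverage score is (100 + 40 + 0) / 3 ≈ 47%.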
Remember that content coverage is based on a single
model summary, and that is problematic
Humans differ a lot in the content they choose to express in their
summary.
The same machine summary can receive widely different scores
when evaluated against different human summaries!
A gold-standard created from multiple human summaries is better
→ Pyramid Method
Multiple human summaries as the gold-standard: Pyramid
method [Nenkova et al., 2007]
Main idea: Facts/propositions which multiple people mentioned in
their summaries must get more weight than those mentioned in
one summary only.
Based on Summary Content Units (SCUs)
- Emerge from the analysis of several texts
- Link different surface realizations with the same meaning
Nenkova et al. 2007. The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation.
SCU example
Let’s say S1, S2, S3 are three different human summaries.
Annotators mark the semantic units in them and link them up.
S1 Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.
S2 Former Chilean dictator Augusto Pinochet has been arrested in
London at the request of the Spanish government.
S3 Britain caused international controversy and Chilean turmoil by
arresting former Chilean dictator Pinochet in London.
SCU: label, weight, contributors
Label: London was where Pinochet was arrested
Weight: 3 (the number of human summaries where the SCU was expressed)
Contributors: the sentences in the human summaries where the SCU was expressed
S1 Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.
S2 Former Chilean dictator Augusto Pinochet has been arrested in
London at the request of the Spanish government.
S3 Britain caused international controversy and Chilean turmoil by
arresting former Chilean dictator Pinochet in London.
After annotating SCUs, a pyramid can be constructed to
represent them
The number of model summaries n determines the maximum
number of tiers in the pyramid
- Each tier contains only the SCUs with the same weight.
- The top tier has units expressed in all the summaries; the bottom tier has units present in only one summary.
Example: a pyramid from 4 model summaries.
- 2 SCUs were present in all 4 model summaries
- 3 SCUs were each present in 3 model summaries
- ...
Using the pyramid, we can define an Ideally Informative
summary
Does not include an SCU from a lower tier unless all SCUs from
higher tiers are included as well
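For example, with the 4-summary pyramid above (2 SCUs of weight 4, 3 SCUs of weight 3, ...), an ideally informative summary containing 4 SCUs must include both weight-4 SCUs plus any two of the weight-3 SCUs, for a total weight of 2 × 4 + 2 × 3 = 14; any 4-SCU summary reaching this total is equally informative.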
Different equally good summaries
Scoring using the pyramid
Many scores have been defined. We will not go into the details.
Compare the total weight of the units in the machine summary to the maximum score possible for an ideally informative summary.
Remember that every step in this evaluation is done manually.
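One concrete instance: the original pyramid score divides the total weight of the SCUs that appear in the machine summary by the maximum total weight achievable with the same number of SCUs (the score of an ideally informative summary of that size). A sketch, assuming the manual SCU annotation has already produced these inputs:

def pyramid_score(peer_scu_weights, tier_sizes):
    """peer_scu_weights: weights of the SCUs found in the machine
    summary. tier_sizes: {weight: number of SCUs with that weight}
    in the pyramid."""
    observed = sum(peer_scu_weights)
    # Maximum achievable with the same number of SCUs: fill the
    # budget from the top tier downwards
    budget, maximum = len(peer_scu_weights), 0
    for weight in sorted(tier_sizes, reverse=True):
        take = min(budget, tier_sizes[weight])
        maximum += take * weight
        budget -= take
        if budget == 0:
            break
    return observed / maximum if maximum else 0.0

# Hypothetical example (tier sizes below weight 3 are invented):
# pyramid_score([4, 3, 1], {4: 2, 3: 3, 2: 5, 1: 8}) == 8 / 11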
Can we automate some aspects of summary evaluation?
During system development, we need to repeatedly perform
evaluations to understand changes to the system.
ROUGE: Recall-Oriented Understudy for Gisting Evaluation [Lin, 2004]
- Once model summaries are available, the comparison between the model summary and the machine summary is done automatically
- Based on n-gram overlaps between the model and machine summaries
- Can be computed with either a single model summary or multiple model summaries
Lin, 2004. ROUGE: a package for automatic evaluation of summaries.
ACL Text Summarization Workshop
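A minimal sketch of ROUGE-N recall against one or more model summaries, assuming whitespace-tokenized text; the released package additionally supports stemming, stopword removal, and other variants omitted here.

from collections import Counter

def rouge_n(peer, models, n=2):
    """ROUGE-N recall: matched n-grams / total n-grams in the models.

    peer: machine summary as a list of tokens.
    models: list of model summaries, each a list of tokens.
    """
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    peer_counts = ngrams(peer)
    matched = total = 0
    for model in models:
        model_counts = ngrams(model)
        # Clipped counts: an n-gram matches at most as many times
        # as it occurs in the peer summary
        matched += sum(min(c, peer_counts[g])
                       for g, c in model_counts.items())
        total += sum(model_counts.values())
    return matched / total if total else 0.0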
ROUGE for content quality evaluation
The ROUGE comparison is highly correlated with human
judgements of similarity but not perfect.
- Above 0.9 correlation with human judgements
- ROUGE is the standard measure during development of summarization systems.
Many versions of ROUGE are available (vary in the type of
n-grams and inclusion of word similarity information).
Tool available at
http://www.berouge.com/Pages/default.aspx
Linguistic quality evaluation
Mostly done manually. Rate the summary on a scale of 1 to 5 for each of:
- Are the sentences grammatical?
  - no datelines, capitalization errors, or fragmented sentences
- Does the summary contain redundant information?
  - no unnecessary repetition, e.g. using a full noun phrase 'Bill Clinton' when a pronoun 'he' would suffice
- Are the references to different entities clear?
  - it should be clear what a pronoun or entity stands for
- Does the summary have a flow, build up sentence by sentence?
  - it should have a focus and be well-structured and organized
Overall summary quality: Responsiveness
Used as a measure of both content and linguistic quality.
Judges are asked to rate a summary on a scale of 1 to 5 for overall
quality.
No model summaries are needed. Useful in new domains where
resource creation is costly.
Summary of evaluation methods
Content quality evaluation:
- Fully manual
  - Single model: content coverage score
  - Multiple models: pyramid score
- Model summaries created manually, but the comparison between model and machine summary is automatic
  - ROUGE
Linguistic quality evaluation:
- Manual ratings for different quality aspects
Both: Responsiveness
- Manual, but no model summaries needed
When to use which evaluation?
Automatic methods such as ROUGE aim to approximate human
judgements. But they are not perfect.
As a result, we need automatic methods during system development, as well as human evaluation when a comprehensive assessment is needed.
For some aspects of summaries such as linguistic quality, current
automatic methods are not very good, and so primarily manual
evaluation is done.