Kathy McKeown, "Lessons Learned from Evaluation of

Lessons Learned from Evaluation of Summarization Systems:
Nightmares and Pleasant Surprises

Kathleen McKeown
Department of Computer Science
Columbia University
Major contributors: Ani Nenkova, Becky Passonneau
Questions
• What kinds of evaluation are possible?
• What are the pitfalls?
• Are evaluation metrics fair?
• Is real research progress possible?
• What are the benefits?
• Should we evaluate our systems?
What is the feel of the evaluation?
• Is it competitive?
• Does it foster a feeling of community?
• Are the guidelines clearly established ahead of time?
• Are the metrics fair? Do they measure what you want to measure?
The night Max wore his wolf suit and made mischief of one kind
and another
His mother called him “WILD THING” and he said “I’LL EAT YOU UP!” so
he was sent to bed without eating anything.
DARPA GALE: Global Autonomous Language Exploitation
• Three large teams: BBN, IBM, SRI
  • SRI team: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMass, NCRI, Systran, Fair Isaac, Ohio State
• Generate responses to open-ended questions
  • 17 templates: definitions, biographies, events, relationships, reactions, etc.
• Using English, Chinese, and Arabic text and speech, from blogs to news
• Find all instances when a fact is mentioned (redundancy)
GALE Evaluation
• Can systems do at least 50% as well as a human?
  • If not, the GALE program will not continue
  • The team that does worst may be cut
• Independent evaluator: BAE
  • Has never done text evaluation before
  • Has experience with task-based evaluation
• Gold standard
  • System responses graded by two judges
  • Relevant facts added to the pool
  • Granularity of scoring: nuggets
• Metrics (see the sketch below)
  • Weighted variants of precision/recall
  • Document citations
  • Redundancy
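The slide names the metrics without defining them; as a minimal sketch (an assumption for illustration, not the official BAE/GALE scorer), weighted nugget precision/recall combined by an F-measure might look like this:

```python
# Illustrative nugget scoring: weighted recall over a pooled gold standard,
# precision over the facts a response asserts, combined by F-beta.
def nugget_f(matched, response_fact_count, pool, beta=1.0):
    """matched: gold nugget ids found in the system response.
    response_fact_count: number of distinct facts the response asserts.
    pool: dict mapping gold nugget id -> judge-assigned weight."""
    total_weight = sum(pool.values())
    matched_weight = sum(pool[n] for n in matched)
    recall = matched_weight / total_weight if total_weight else 0.0
    precision = len(matched) / response_fact_count if response_fact_count else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example: 3 of 5 pooled nuggets found; the response asserted 6 facts.
pool = {"n1": 2, "n2": 1, "n3": 1, "n4": 3, "n5": 1}
print(nugget_f({"n1", "n3", "n4"}, 6, pool))
```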
Year 1: Sample Q&A
LIST FACTS ABOUT [The Trial of Saddam Hussein]
 The judge , however, that all people should have heard voices, the order
of a court to solve technical problems. (Chi)
 His account of events surrounding the torture and execution of more than
140 men and teenage boys from the Dujail , appeared to do little to
advance the prosecution's goal of establishing Saddam 's "command
responsibility" for the deaths.
 A trial without Saddam could be an embarrassment for the U.S.
government, which has worked hard to help create a tribunal that would be
perceived by Iraqis as independent and fair.
 As the trial got under way, a former secret police officer testified that he
had not received any orders from Saddam during the investigations that
followed an assassination attempt against him in Dujail in 1982 .
11
Year 1: Results
• F-measure (beta = 1)
  • Machine average: 0.230
  • Human average: 0.353
  • Machine-to-human average: 0.678
DUC – Document Understanding Conference
• Established and funded by DARPA TIDES
• Run by independent evaluator NIST
• Open to the summarization community
• Annual evaluations on common datasets (2001-present)
• Tasks
  • Single-document summarization
  • Headline summarization
  • Multi-document summarization
  • Multilingual summarization
  • Focused summarization
  • Update summarization
DUC is changing direction again
• DARPA GALE effort cutting back participation in DUC
• Considering co-locating with TREC QA
• Considering new data sources and tasks
DUC Evaluation
• Gold standard
  • Human summaries written by NIST
  • From 2 to 9 summaries per input set
• Multiple metrics
  • Manual
    • Coverage (early years)
    • Pyramids (later years)
    • Responsiveness (later years)
    • Quality questions
  • Automatic
    • ROUGE (-1, -2, skip bigrams, LCS, BE)
• Granularity
  • Manual: sub-sentential elements
  • Automatic: sentences
TREC definition pilot
• Long answer to a request for a definition
• As a pilot, less emphasis on results
• Part of TREC QA
Evaluation Methods
• Pool system responses and break them into nuggets
• A judge scores nuggets as vital, OK, or invalid
• Measure information precision and recall
• Can a judge reliably determine which facts belong in a definition?
Considerations Across Evaluations
• Independent evaluator
  • Not always as knowledgeable as researchers
  • Impartial determination of approach
  • Extensive collection of resources
• Determination of task
  • Appealing to a broad cross-section of the community
  • Changes over time
    • DUC 2001-2002: single- and multi-document
    • DUC 2003: headlines, multi-document
    • DUC 2004: headlines, multilingual and multi-document, focused
    • DUC 2005: focused summarization
    • DUC 2006: focused and a new task, up for discussion
  • How long do participants have to prepare?
  • When is a task dropped?
• Scoring of text at the sub-sentential level
Task-based Evaluation
• Use the summarization system as a browser to do another task
  • Newsblaster: write a report given a broad prompt
  • DARPA utility evaluation: given a request for information, use question answering to write a report
Task Evaluation
• Hypothesis: multi-document summaries enable users to find information efficiently
• Task: fact gathering given a topic and questions
  • Resembles an intelligence analyst's task
User Study: Objectives
• Does multi-document summarization help?
  • Do summaries help the user find information needed to perform a report-writing task?
  • Do users use information from summaries in gathering their facts?
  • Do summaries increase user satisfaction with the online news system?
  • Do users create better quality reports with summaries?
  • How do full multi-document summaries compare with minimal 1-sentence summaries such as Google News?
User Study: Design
• Compared 4 parallel news browsing systems
  • Level 1: Source documents only
  • Level 2: One-sentence multi-document summaries (e.g., Google News) linked to documents
  • Level 3: Newsblaster multi-document summaries linked to documents
  • Level 4: Human-written multi-document summaries linked to documents
• All groups write reports given four scenarios
  • A task similar to an analyst's
  • Can only use Newsblaster for research
  • Time-restricted
User Study: Execution
• 4 scenarios
  • 4 event clusters each: 2 directly relevant, 2 peripherally relevant
  • Average 10 documents/cluster
• 45 participants
  • Balance between liberal arts, engineering
• 138 reports
• Exit survey
  • Multiple-choice and open-ended questions
• Usage tracking
  • Each click logged, on or off-site
“Geneva” Prompt
• The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the “road map for peace”, a diplomatic effort sponsored by ……
• Who participated in the negotiations that produced the Geneva Accord?
• Apart from direct participants, who supported the Geneva Accord preparations and how?
• What has the response been to the Geneva Accord by the Palestinians?
Measuring Effectiveness
• Score report content and compare across summary conditions
• Compare user satisfaction per summary condition
• Compare where subjects took report content from

Newsblaster (screenshot)
User Satisfaction
• More effective than a web search with Newsblaster
  • Not true with documents only or single-sentence summaries
• Easier to complete the task with summaries than with documents only
• Enough time with summaries, but not with documents only
• Summaries helped most (by condition):
  • 5% single-sentence summaries
  • 24% Newsblaster summaries
  • 43% human summaries
User Study: Conclusions
• Summaries measurably improve a news browser's effectiveness for research
• Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News
• Users want search
  • Not included in this evaluation
Potential Problems
That very night in Max’s room a forest grew
And grew
And grew until the ceiling hung with vines and the walls became the world all around
And an ocean tumbled by with a private boat for Max and he sailed all through the night and day
And he sailed in and out of weeks and almost over a year to where the wild things are
And when he came to where the wild things are they roared their terrible roars and gnashed their terrible teeth
Comparing Text Against Text
• Which human summary makes a good gold standard? Many summaries are good
• At what granularity is the comparison made?
• When can we say that two pieces of text match?
Measuring variation
• Types of variation between humans, by application:
  • Translation: same content, different wording
  • Summarization: different content??, different wording
  • Generation: different content??, different wording
Human variation: content words (Ani Nenkova)
• Summaries differ in vocabulary
  • Differences cannot be explained by paraphrase
• Data: 7 translations each of 20 documents; 7 summaries each of 20 document sets
• Faster vocabulary growth in summarization
Variation impacts evaluation
• Comparing content is hard
• All kinds of judgment calls
  • Paraphrases
  • VP vs. NP
    • Ministers have been exchanged
    • Reciprocal ministerial visits
  • Length and constituent type
    • Robotics assists doctors in the medical operating theater
    • Surgeons started using robotic assistants
Nightmare: only one gold standard
• A system may have chosen an equally good sentence that is not in the one gold standard
  • Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.
  • Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
• In DUC 2001 (one gold standard), the human model had a significant impact on scores (McKeown et al.)
• Five human summaries are needed to avoid changes in rank (Nenkova and Passonneau)
  • DUC 2003 data
  • 3 topic sets: 1 highest scoring and 2 lowest scoring
  • 10 model summaries
How many summaries are enough?
Scoring
• Two main approaches used in DUC
  • ROUGE (Lin and Hovy)
  • Pyramids (Nenkova and Passonneau)
• Problems:
  • Are the results stable?
  • How difficult is it to do the scoring?
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
• N-gram co-occurrence metrics measuring content overlap
• ROUGE-N = (count of n-grams overlapping between the candidate and the model summaries) / (total n-grams in the model summaries); see the sketch below
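A minimal sketch of the n-gram recall just described (illustrative only; not Lin's ROUGE toolkit, which also supports stemming, stopword removal, and jackknifing):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, models, n=1):
    """Clipped n-gram overlap between a candidate summary and the model
    (reference) summaries, divided by the total n-grams in the models."""
    cand = ngrams(candidate.lower().split(), n)
    overlap = total = 0
    for model in models:
        ref = ngrams(model.lower().split(), n)
        total += sum(ref.values())
        overlap += sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total if total else 0.0

models = ["pinochet was arrested in london",
          "former chilean dictator pinochet arrested in london"]
print(rouge_n_recall("pinochet arrested in london by british police", models, n=1))
```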
ROUGE
• Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip bigrams, basic elements
• Automatic and thus easy to apply
• Important to consider confidence intervals when determining differences between systems (see the bootstrap sketch below)
  • Scores falling within the same interval are not significantly different
  • ROUGE scores place systems into large groups: it can be hard to say definitively that one is better than another
• Sometimes results are unintuitive:
  • Multilingual scores as high as English scores
  • Use in speech summarization shows no discrimination
• Good for training regardless of intervals: can see trends
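One common way to obtain such intervals (an illustration; not necessarily the procedure NIST used) is bootstrap resampling over the evaluation topics:

```python
# Bootstrap confidence interval for a system's mean ROUGE score over topics:
# resample topics with replacement and take percentiles of the resampled means.
import random

def bootstrap_ci(per_topic_scores, n_resamples=10000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    k = len(per_topic_scores)
    means = sorted(
        sum(rng.choices(per_topic_scores, k=k)) / k for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-topic ROUGE scores for two systems; overlapping intervals
# mean the two systems cannot be called significantly different.
scores_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.10, 0.12]
scores_b = [0.13, 0.14, 0.12, 0.15, 0.12, 0.17, 0.11, 0.13]
print(bootstrap_ci(scores_a), bootstrap_ci(scores_b))
```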
Pyramids
• Uses multiple human summaries
• Information is ranked by its importance
• Allows for multiple good summaries
• A pyramid is created from the human summaries
  • Elements of the pyramid are content units
• System summaries are scored by comparison with the pyramid
Content units: a better study of variation than sentences
• Semantic units
• Link different surface realizations with the same meaning
• Emerge from the comparison of several texts
Content unit example
S1: Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.
S2: Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
S3: Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.
SCU: A cable car caught fire (Weight = 4)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
SCU: The cause of the fire is unknown (Weight = 1)
A. The cause of the fire was unknown.
(Summaries B, C, and D, the same sentences as above, do not express this content unit, hence the weight of 1.)
Idealized representation
• Tiers of differentially weighted SCUs (diagram: tiers labeled W=3, W=2, W=1); scoring against the pyramid is sketched below
  • Top: few SCUs, high weight
  • Bottom: many SCUs, low weight
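A minimal sketch of pyramid scoring as described on the preceding slides (illustrative, not the official annotation or scoring tool): an SCU's weight is the number of model summaries expressing it, and a peer summary's observed weight is normalized by the best score achievable with the same number of SCUs.

```python
# Illustrative pyramid scoring.
def build_pyramid(scu_contributors):
    """scu_contributors: dict mapping SCU id -> set of model summaries
    that express that content unit. Weight = number of contributors."""
    return {scu: len(models) for scu, models in scu_contributors.items()}

def pyramid_score(peer_scus, pyramid):
    """Observed weight of the SCUs the peer expresses, divided by the maximum
    weight achievable with the same number of SCUs."""
    observed = sum(pyramid[scu] for scu in peer_scus)
    ideal = sum(sorted(pyramid.values(), reverse=True)[:len(peer_scus)])
    return observed / ideal if ideal else 0.0

# Toy pyramid modeled on the slides: "a cable car caught fire" (weight 4),
# "the cause of the fire is unknown" (weight 1), plus one more unit.
pyramid = build_pyramid({
    "cable_car_caught_fire": {"A", "B", "C", "D"},
    "cause_of_fire_unknown": {"A"},
    "fire_in_mountain_tunnel": {"B", "C", "D"},
})
print(pyramid_score({"cable_car_caught_fire", "cause_of_fire_unknown"}, pyramid))
```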
Comparison of Scoring Methods in DUC05
• Analysis of scores for the 20 pyramid sets
  • Columbia prepared the pyramids
  • Participants scored systems against the pyramids
• Comparisons between Pyramid (original, modified), responsiveness, and ROUGE-SU4
  • Pyramid score is computed from multiple humans
  • Responsiveness is just one human's judgment
  • ROUGE-SU4 equivalent to ROUGE-2
Creation of pyramids
• Done at Columbia for each of 20 out of the 50 sets
• Primary annotator, secondary checker
• Held round-table discussions of problematic constructions that occurred in this data set
  • Comma-separated lists
    • Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation.
  • General vs. specific
    • Eastern Europe vs. Hungary, Poland, Lithuania, and Turkey
Characteristics of the Responses
• Proportion of SCUs of weight 1 is large
  • 44% (D324) to 81% (D695)
• Mean SCU weight: 1.9
• Agreement among human responders is quite low
(Chart: number of SCUs at each weight; x-axis: SCU weights, y-axis: # of SCUs)
Preview of Results
• Manual metrics
  • Large differences between humans and machines
  • No single system the clear winner
    • But a top group identified by all metrics
• Significant differences
  • Different predictions from manual and automatic metrics
• Correlations between metrics
  • Some correlation, but one cannot be substituted for another
  • This is good
Human performance / Best system
• Pyramid (original): human B 0.5472, human A 0.4969; best system (14) 0.2587
• Pyramid (modified): human B 0.4814, human A 0.4617; best system (10) 0.2052
• Responsiveness: human A 4.895, human B 4.526; best system (4) 2.85
• ROUGE-SU4: human A 0.1722, human B 0.1552; best system (15) 0.139
• Best system ~50% of human performance on manual metrics
• Best system ~80% of human performance on ROUGE (recomputed in the snippet below)
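The "~50%" and "~80%" figures follow directly from the numbers on this slide; a quick check:

```python
# Recomputing best-system / best-human ratios from the figures above.
best = {
    "Pyramid (original)": (0.2587, 0.5472),  # system 14 vs. human B
    "Pyramid (modified)": (0.2052, 0.4814),  # system 10 vs. human B
    "Responsiveness":     (2.85,   4.895),   # system 4  vs. human A
    "ROUGE-SU4":          (0.139,  0.1722),  # system 15 vs. human A
}
for metric, (system, human) in best.items():
    print(f"{metric}: {system / human:.0%} of human performance")
# Manual metrics come out in the 43-58% range; ROUGE-SU4 comes out near 81%.
```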
Per-system scores, by metric (system ID: score)

Pyramid (original): 14: 0.2587, 17: 0.2492, 15: 0.2423, 10: 0.2379, 4: 0.2321, 7: 0.2297, 16: 0.2265, 6: 0.2197, 32: 0.2145, 21: 0.2127, 12: 0.2126, 11: 0.2116, 26: 0.2106, 19: 0.2072, 28: 0.2048, 13: 0.1983, 3: 0.1949, 1: 0.1747

Pyramid (modified): 10: 0.2052, 17: 0.1972, 14: 0.1908, 7: 0.1852, 15: 0.1808, 4: 0.177, 16: 0.1722, 11: 0.1703, 6: 0.1671, 12: 0.1664, 19: 0.1636, 21: 0.1613, 32: 0.1601, 26: 0.1464, 3: 0.145, 28: 0.1427, 13: 0.1424, 25: 0.1406

Responsiveness: 4: 2.85, 14: 2.8, 10: 2.65, 15: 2.6, 17: 2.55, 11: 2.5, 28: 2.45, 21: 2.45, 6: 2.4, 24: 2.4, 19: 2.4, 6: 2.4, 27: 2.35, 12: 2.35, 7: 2.3, 25: 2.2, 32: 2.15, 3: 2.1

ROUGE-SU4: 15: 0.139, 4: 0.134, 17: 0.1346, 19: 0.1275, 11: 0.1259, 10: 0.1278, 6: 0.1239, 7: 0.1213, 14: 0.1264, 25: 0.1188, 21: 0.1183, 16: 0.1218, 24: 0.118, 12: 0.116, 3: 0.1198, 28: 0.1203, 27: 0.110, 13: 0.1097
Significant Differences
• Manual metrics
  • Few differences between systems
    • Pyramid: system 23 is worse
    • Responsiveness: systems 23 and 31 are worse
  • Both humans better than all systems
• Automatic (ROUGE-SU4)
  • More differences between systems
  • One human indistinguishable from 5 systems
Correlations: Pearson’s, 25 systems
Metrics compared: Pyr-orig, Pyr-mod, Resp-1, Resp-2, R-2, R-SU4 (pairwise correlations shown as a triangular matrix on the slide; computation sketched below):
0.96, 0.77, 0.86, 0.84, 0.80, 0.81, 0.90, 0.90, 0.86, 0.83, 0.92, 0.92, 0.88, 0.87, 0.98
Among them: Pyr-orig vs. Pyr-mod 0.96; Pyr-orig vs. Resp-1 0.77; R-2 vs. R-SU4 0.98.
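For concreteness, a minimal sketch of how such a system-level correlation is computed; the handful of scores below are copied from the per-system table earlier in the deck (a subset only, so the value will not reproduce the slide's full-data correlations):

```python
# Pearson's r between two metrics' per-system scores.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Five systems' Pyramid (original) and ROUGE-SU4 scores from the earlier table.
pyr_orig = {"14": 0.2587, "17": 0.2492, "15": 0.2423, "10": 0.2379, "4": 0.2321}
rouge_su4 = {"14": 0.1264, "17": 0.1346, "15": 0.1390, "10": 0.1278, "4": 0.1340}
systems = sorted(pyr_orig)
print(pearson([pyr_orig[s] for s in systems], [rouge_su4[s] for s in systems]))
```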
From the same correlation table: questionable that responsiveness could be a gold standard.
Pyramid and responsiveness: high correlation, but the metrics are not mutually substitutable.
Pyramid and ROUGE: high correlation, but the metrics are not mutually substitutable.
Correlations
• Original and modified pyramid can substitute for each other
• High correlation between manual and automatic metrics, but automatic is not yet a substitute
• Similar patterns between pyramid and responsiveness
Nightmare
• A scoring metric that is not stable is used to decide funding
• Insignificant differences between systems determine funding
Is Task Evaluation Nightmare Free?
• Impact of user interface issues
  • Can have more impact than the summary itself
• Controlling for a proper mix of subjects
• The number of subjects and the time required are large
Till Max said “Be still!” and tamed them with the magic trick
Of staring into their yellow eyes without blinking once
And they were frightened and called him the most wild thing of all
And made him king of all wild things
“And now,” cried Max “Let the wild rumpus start!”
Are we having fun yet?
Benefits of evaluation
• Emergence of evaluation methods
  • ROUGE
  • Pyramids
  • Nuggetteer
• Research into characteristics of metrics
• Analyses of sub-sentential units
• Paraphrase as a research issue
Available Data
• DUC data sets
  • 4 years of summary/document set pairs
    • Multi-document summarization training data not available beforehand
  • 4 years of scoring patterns
• Led to analysis of human summaries
• Pyramids
  • Pyramids and peers for 40 topics (DUC04, DUC05)
  • Many more from Nenkova and Passonneau
  • Training data for paraphrase
  • Training data for abstraction -> see systems moving away from pure sentence extraction
Wrapping up
Lessons Learned
• The evaluation environment is important
• Find a task with broad appeal
• Use an independent evaluator
  • At least a committee
• Use multiple gold standards
• Compare text at the content-unit level
• Evaluate the metrics
• Look at significant differences
Is Evaluation Worth It?
• DUC: creation of a community
  • From ~15 participants in year 1 to 30 participants in year 5
  • No longer impacts funding
• Enables research into evaluation
  • At the start, no idea how to evaluate summaries
• But results do not tell us everything
And he sailed back over a year, in and out of weeks and through a day
And into the night of his very own room where he found his supper waiting for him … and it was still warm.