
Evaluation and STS
Workshop on a Pipeline for Semantic Text Similarity (STS)
March 12, 2012
Sherri Condon
The MITRE Corporation
For Internal MITRE Use
© 2012 The MITRE Corporation. All rights reserved.
Evaluation and STS Wish-list
■ Valid: measure what we think we’re measuring (definition)
■ Replicable: same results for same inputs (annotator agreement; see the example after this list)
■ Objective: no confounding biases (from language or annotator)
■ Diagnostic
– Not all evaluations achieve this
– Understanding factors and relations
■ Generalizable
– Makes true predictions about new cases
– Functional: if not perfect, good enough
■ Understandable
– Meaningful to stakeholders
– Interpretable components
■ Cost-effective
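To make the replicability item concrete (an illustration added here rather than taken from the slides): agreement between two annotators' 0–5 STS-style similarity ratings is commonly summarized with a correlation coefficient, for example Pearson's r in Python:

# Hedged illustration: Pearson correlation between two annotators' 0-5
# similarity judgments as a simple replicability / agreement check.
from scipy.stats import pearsonr

annotator_a = [5.0, 4.2, 3.1, 1.0, 0.5, 2.8]   # hypothetical 0-5 similarity ratings
annotator_b = [4.8, 4.0, 2.7, 1.5, 0.0, 3.0]

r, p_value = pearsonr(annotator_a, annotator_b)
print(f"agreement r = {r:.3f}")

An r near 1.0 indicates the two annotators largely replicate each other's judgments.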
Quick Foray into Philosophy
■ Meaning as extension: same/similar denotation
– Anaphora/coreference and time/date resolution
– The evening star happens to be the morning star
– “Real world” knowledge = true in this world
■ Meaning as intension
– Truth (extension) in the same/similar possible worlds
– Compositionality: inference and entailment
■ Meaning as use
– Equivalence for all the same purposes in all the same contexts
– “Committee on Foreign Affairs, Human Rights, Common Security and Defence Policy” vs “Committee on Foreign Affairs”
– Salience, application specificity, implicature, register, metaphor
■ Yet remarkable agreement in intuitions about meaning
DARPA Mind’s Eye Evaluation
■ Computer vision through a human lens
– Recognize events in video as verbs
– Produce text descriptions of events in video
■ Comparing human descriptions to system descriptions raises all the STS issues
– Salience/importance/focus (A gave B a package. They were standing.)
– Granularity of description (car vs. red car, woman vs. person)
– Knowledge and inference (standing with motorcycle vs. sitting on motorcycle: motorcycle is stopped)
– Unequal text lengths
■ Demonstrates value of/need for understanding these factors
Mind’s Eye Text Similarity
■ Basic similarity scores based on dependency parses (see the sketch after this list)
– Scores increase for matching predicates and arguments
– Scores decrease for non-matching predicates and arguments
– Accessible synsets and ontological relations expand matches
■ Salience/importance
– Obtain many “reference” descriptions
– Weight predicates and arguments based on frequency
■ Granularity of description
– Demonstrates influence of application context
– Program focus is on verbs, so nouns match loosely
■ Regularities in evaluation efforts that depend on semantic similarity promise common solutions
(This is work with Evelyne Tzoukermann and Dan Parvaz)
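The scoring just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual Mind's Eye scorer: the (relation, predicate, argument) triple representation, the frequency-based weights over reference descriptions, and the fixed penalty for unmatched triples are assumptions here, and synset/ontology expansion of matches is omitted.

# Minimal sketch (not the actual Mind's Eye scorer) of weighted
# predicate-argument overlap between a system description and a set of
# human reference descriptions. Triples are (relation, predicate, argument).
from collections import Counter

def pair_weights(reference_pa_sets):
    """Weight each triple by how many reference descriptions contain it
    (salience approximated by frequency across references)."""
    counts = Counter()
    for pa_set in reference_pa_sets:
        counts.update(set(pa_set))
    return counts

def similarity(system_pas, reference_pa_sets, miss_penalty=1.0):
    """Add the weight of each matched triple; subtract a penalty for each
    system triple that appears in no reference description."""
    weights = pair_weights(reference_pa_sets)
    score = 0.0
    for pa in set(system_pas):
        if weights[pa] > 0:
            score += weights[pa]      # matching predicate/argument
        else:
            score -= miss_penalty     # non-matching predicate/argument
    return score

# Toy usage: two reference descriptions, one system description
refs = [
    {("nsubj", "close", "woman"), ("dobj", "close", "can")},
    {("nsubj", "put", "woman"), ("dobj", "put", "lid")},
]
system = {("nsubj", "close", "woman"), ("dobj", "close", "jar")}
print(similarity(system, refs))

Running the toy example prints 0.0: the matched subject pair earns its reference-frequency weight of 1, and the unmatched object pair costs the assumed penalty of 1.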
Test Sentences
Relation   Predicate   Argument   Weight
dobj       close       can        1
dobj       close       it         3
dobj       hold        lid        6
dobj       place       it         1
dobj       place       lid        2
dobj       put         it         1
dobj       put         lid        5
dobj       take        lid        1
nsubj      close       woman      1
nsubj      hold        woman      2
[The slide also marks which of these pairs each of the six test sentences matches and tallies nominal, verbal, and predicate errors, their weights, totals, and per-sentence scores.]
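As a rough illustration of how these weights combine (assuming, per the previous slide, that the weight of each matched predicate-argument pair is added to the score): a test description containing (nsubj, close, woman) and (dobj, close, can) would contribute 1 + 1 = 2 from matches, and one containing (dobj, hold, lid) would contribute 6, before weighted penalties for nominal, verbal, and predicate errors are subtracted to produce the totals and scores shown on the slide.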