Relevance
in
Open Domain Question Answering:
Theoretical Framework and Application
Marco De Boni
PhD
University of York
Department of Computer Science
2004
Abstract
While there is a very large amount of written information available in electronic format, there
is no easy way to automatically find a reliable answer to simple questions such as “Who is the
president of the US?”. Research in Question Answering (QA) systems addresses these issues by trying to find a method for answering a question through the search for a precise response in a
collection of documents. Current QA systems, however, are no more than prototypes, and,
while there is agreement amongst researchers on the generic aim of QA systems, little work
has been done on clarifying the problem beyond the establishment of a standard evaluation
framework. There is consequently a significant lack of theoretical understanding of QA
systems and a considerable amount of confusion about their aims and evaluation.
This thesis addresses the need for a theoretical investigation into QA systems by employing
the notion of relevance to clarify their purpose and elucidate their constituent structure,
showing how the theory developed can be applied in practice.
Initially we examine the concept of answerhood as applicable to open domain QA systems and
we argue that there are limits as to what can be considered an answer to a question. In order to
understand the nature of these limits we then examine the concept of relevance, showing that
to talk about an answer is really to speak about the relevance of that answer in relation to a
question; we maintain that it is misleading to talk about absolutely correct or incorrect
answers: we should instead be referring to answers which are more or less relevant to a
question. We then analyse the concept of relevance in more detail, illustrating how it is composed of
semantic relevance, dealing with the relationship in meaning between question and answer;
goal-directed relevance, dealing with questioner and answerer goals; logical relevance,
dealing with the more formal relationship which considers whether an answer provides the
information which the question sought; and morphic relevance, dealing with the form an
answer takes in relation to a question.
From the notion of relevance we build a model of QA systems which illustrates the constraints
under which they operate: we show how an answer is constrained by the questioner's and the answerer's prior knowledge, goals, rules of inference and answer form preferences, as well as by the questioner's and the answerer's approach to giving relevance judgements from the point of
view of semantic, goal-directed, logical, morphic and overall relevance.
We then illustrate how the framework can be used to improve current TREC-style QA
systems by considering each component of relevance individually and implementing that
component starting from a “standard” TREC-style QA system, YorkQA:
• We implement semantic relevance in YorkQA and use it both to improve results in the TREC evaluation and to move beyond such an evaluation to provide relevant answers, not simply “correct” answers.
• We implement goal-directed relevance in YorkQA through the use of clarification dialogue, developing a new algorithm for the recognition of clarification dialogue in open-domain question answering.
• We demonstrate how logical relevance could be implemented in YorkQA to provide not simply an indication of a unique “correct” answer which meets the information need set out by the question, but a ranking of relevant answers from a logical point of view through the use of a number of rules to gradually relax the constraints on the proof of an answer.
• We show how the idea of morphic relevance can be implemented trivially in YorkQA and how we can move beyond this implementation to find morphically relevant answer sentences. Finally, we investigate how the performance of the individual components could be improved by augmenting the background knowledge, showing how the knowledge base used (WordNet) could be expanded automatically to provide new, useful relationships.
• We then establish how the individual implementations of the components could be brought together to meet the requirements of a system which goes beyond TREC-style question answering and how an evaluation of the integration of the components would have to take place.
Finally we discuss a number of issues raised by this investigation, showing what questions
need to be asked in order to address them.
Contents
Chapter 1: Introduction........................................................................................................ 13
1 Question Answering Systems...................................................................................... 13
2 Thesis motivation ........................................................................................................ 14
3 Thesis aims .................................................................................................................. 15
4 Thesis contribution ...................................................................................................... 15
5 Chapter summary......................................................................................................... 16
Chapter 2: Question Answering Systems: An Overview ................................................... 18
1 Previous work on Question Answering Systems......................................................... 18
   1.1 The early years: databases, cognitive science and limited domains ................... 18
   1.2 Answering questions through cognitive science: Schank and Abelson ............... 20
   1.3 Answering questions through cognitive science: Dyer........................................ 21
   1.4 Limitations of Schank and Dyer's approach ....................................................... 23
   1.5 Beyond cognitive psychology: open-domain Question Answering...................... 23
2 Question Answering, Information Retrieval and Information Extraction ................... 24
   2.1 Information Extraction through template filling ................................................. 24
   2.2 Information Retrieval .......................................................................................... 25
   2.3 Query relevant summarisation ............................................................................ 25
3 TREC-QA and Open-Domain Question Answering ................................................... 26
   3.1 Open-Domain Question Answering..................................................................... 26
   3.2 The TREC QA Evaluation Framework ................................................................ 27
   3.3 TREC-8 QA.......................................................................................................... 27
   3.4 TREC-9 QA.......................................................................................................... 31
   3.5 TREC-10 QA........................................................................................................ 38
   3.6 TREC-11 QA........................................................................................................ 39
   3.7 Critical evaluation of the TREC QA track........................................................... 41
   3.8 Other research: future directions in QA.............................................................. 44
4 Related work in other fields ........................................................................................ 45
5 Linguistics ................................................................................................................... 45
6 Psychology and Cognitive Science.............................................................................. 46
7 Pedagogy ..................................................................................................................... 46
8 Philosophy ................................................................................................................... 47
   8.1 Hermeneutics and Question Answering .............................................................. 47
   8.2 Hermeneutics: from Plato to Scheiermacher ...................................................... 47
   8.3 Gadamer and the dialectic of Question and Answer ........................................... 48
   8.4 Practical application of hermeneutics................................................................. 49
   8.5 Eco and semiotics: a critique of Derrida ............................................................ 50
   8.6 Eco: inference as abduction ................................................................................ 51
   8.7 Practical application of semiotics ....................................................................... 52
   8.8 Relevance Theory and Grice ............................................................................... 52
9 Logic............................................................................................................................ 54
   9.1 Entailment and relevance logic ........................................................................... 55
   9.2 Relevance logic, entailment logic and knowledge engineering........................... 55
   9.3 Application of relevance logic for Information Retrieval.................................... 56
10 Conclusion.................................................................................................................. 57
Chapter 3: The YorkQA TREC-Style QA System ............................................................. 60
1 The TREC Evaluation Framework .............................................................................. 60
   1.1 TREC-10 .............................................................................................................. 60
   1.2 TREC-11 .............................................................................................................. 61
2 Objective of the experiments....................................................................................... 61
3 YorkQA at TREC-10................................................................................................... 61
   3.1 Design.................................................................................................................. 62
   3.2 Novel aspects of YorkQA at TREC-10................................................................. 71
   3.3 Results ................................................................................................................. 71
4 YorkQA at TREC-11................................................................................................... 72
   4.1 Design.................................................................................................................. 72
   4.2 Novel aspects of YorkQA at TREC-11................................................................. 77
   4.3 Results ................................................................................................................. 78
5 Limitations................................................................................................................... 78
   5.1 Limitations in System Design for TREC .............................................................. 78
   5.2 Limitations of the TREC evaluation framework .................................................. 79
6 Moving Beyond TREC................................................................................................ 81
Chapter 4: A Definition of Relevance for Automated Question Answering .................... 82
1 Introduction ................................................................................................................. 82
2 Theoretical limitations to the framework .................................................................... 84
   2.1 A psychological model?....................................................................................... 84
   2.2 An empirical analysis? ........................................................................................ 85
   2.3 A Process? ........................................................................................................... 87
   2.4 A general theory of question answering? ............................................................ 87
3 Relevance for automated Question Answering ........................................................... 88
   3.1 The notion of answer in TREC QA and previous approaches............................. 88
   3.2 Beyond current approaches: relevant answers vs. unique answers .................... 89
   3.3 A definition of relevance for automated QA ........................................................ 90
4 Previous work.............................................................................................................. 91
   4.1 Relevance, related concepts and categories........................................................ 91
   4.2 Limitations of these approaches to relevance ..................................................... 96
5 Relevance categorisation for QA systems ................................................................... 97
   5.1 Proposed categories ............................................................................................ 97
   5.2 Sufficiency of the Relevance Categories............................................................ 101
   5.3 Necessity of the relevance Categories ............................................................... 103
6 Moving beyond TREC-style QA systems ................................................................. 106
   6.1 Clarifying the limitations of TREC-style QA systems........................................ 106
   6.2 Overcoming the limitations of TREC-style QA systems .................................... 108
7 Conclusion................................................................................................................. 109
Chapter 5: A Relevance-Based Theoretical Framework for Automated QA ................ 110
1 Introduction ............................................................................................................... 110
2 General Framework ................................................................................................... 111
3 Definition of a Question Answering system.............................................................. 112
4 Determining an answer.............................................................................................. 114
5 Answerer Prejudices.................................................................................................. 117
6 Advantages ................................................................................................................ 123
7 Conclusion................................................................................................................. 124
Chapter 6: Semantic Relevance ......................................................................................... 125
1 Introduction ............................................................................................................... 125
2 Previous Work ........................................................................................................... 126
3 A sentence similarity metric for Question Answering .............................................. 127
   3.1 Basic Semantic Relevance Algorithm ................................................................ 128
   3.2 Use of WordNet to calculate similarity ............................................................. 129
   3.3 Disambiguation ................................................................................................. 133
4 Relevant Features ...................................................................................................... 135
5 Evaluation.................................................................................................................. 137
   5.1 Method............................................................................................................... 137
   5.2 Usefulness for finding definite answers............................................................. 141
   5.3 Beyond definite answers .................................................................................... 142
6 Using semantic relevance to improve the YorkQA system....................................... 144
7 Conclusion................................................................................................................. 144
Chapter 7: Goal-directed Relevance.................................................................................. 146
1 Introduction ............................................................................................................... 146
2 Modelling Goals ........................................................................................................ 148
   2.1 The need to examine user models ...................................................................... 148
   2.2 Types of User Models ........................................................................................ 150
   2.3 Constructing user models .................................................................................. 153
   2.4 An alternative to user goal modelling: clarification dialogue .......................... 156
3 Clarification dialogues in Question Answering......................................................... 157
4 Using clarification dialogue to implement goal-directed relevance .......................... 159
5 The TREC Context Experiments............................................................................... 160
6 Analysis of the TREC context questions................................................................... 161
7 Experiments in Clarification Dialogue Recognition.................................................. 162
8 Collection of new data............................................................................................... 162
   8.1 Dialogue Collection A ....................................................................................... 162
   8.2 Dialogue Collection B ....................................................................................... 164
9 Clarification Recognition Algorithm......................................................................... 165
10 Sentence Similarity Metric ...................................................................................... 168
11 Results ..................................................................................................................... 168
12 Usefulness of Clarification Dialogue Recognition .................................................. 172
13 Conclusion............................................................................................................... 174
Chapter 8: Logical Relevance............................................................................................. 176
1 Introduction ............................................................................................................... 176
2 The difference between logical relevance and previous approaches......................... 177
3 Approaches to finding the logical connection between a question and an answer.... 180
   3.1 Work prior to TREC .......................................................................................... 180
   3.2 Logical connections in TREC-style QA systems................................................ 181
   3.3 Logical inference in TREC-style QA systems .................................................... 183
   3.4 Inference of answers.......................................................................................... 183
4 Implementation issues ............................................................................................... 188
   4.1 Analysis of the complexity of TREC questions and related documents ............. 188
   4.2 Analysis of the complexity of inference in clarification Dialogues ................... 194
   4.3 Issues in answer proving ................................................................................... 196
5 From Logical Proof to Logical Relevance ................................................................ 197
   5.1 Building on TREC-style logical connection ...................................................... 197
   5.2 Implementing relevance through relaxation rules............................................. 197
   5.3 Relaxation rules for the TREC test data ............................................................ 201
   5.4 Evaluation on the TREC collection ................................................................... 210
6 Limitations................................................................................................................. 211
7 Conclusion................................................................................................................. 212
Chapter 9: Morphic Relevance .......................................................................................... 214
1 Introduction ............................................................................................................... 214
2 Implementation.......................................................................................................... 214
3 From morphically correct answers to morphic relevance.......................................... 216
4 Goals and morphic relevance .................................................................................... 217
5 Conclusion................................................................................................................. 218
Chapter 10: Background knowledge for relevance judgements ..................................... 219
1 Introduction ............................................................................................................... 219
2 Automated discovery of telic relations for WordNet ................................................ 220
   2.1 Telic Relationships ............................................................................................ 221
   2.2 Related Work ..................................................................................................... 221
   2.3 Method............................................................................................................... 221
   2.4 Identification and extraction of telic words....................................................... 222
   2.5 Telic Word Sense Disambiguation .................................................................... 224
   2.6 Results ............................................................................................................... 226
3 Effect of augmented knowledge ................................................................................ 227
4 Conclusion................................................................................................................. 228
Chapter 11: Integrating and Evaluating the Components of Relevance........................ 229
1 How the purpose of a QA system determines overall relevance ............................... 229
2 Examples of relevance integration ............................................................................ 231
   2.1 A teaching system .............................................................................................. 231
   2.2 An advertising system ........................................................................................ 234
3 A framework for evaluating relevance ...................................................................... 237
4 Using the relevance framework to improve and evaluate a TREC-style system....... 240
   4.1 Making reference to the relevance framework developed ................................. 240
   4.2 Improving a TREC-style QA system .................................................................. 243
   4.3 Evaluation.......................................................................................................... 250
5 Conclusion................................................................................................................. 254
Chapter 12: Conclusion and Open Issues.......................................................................... 255
1 Overview ................................................................................................................... 255
2 Conclusion................................................................................................................. 255
3 Open Issues................................................................................................................ 258
   3.1 Question answering as an engineering problem ............................................... 258
   3.2 Machine Learning ............................................................................................. 258
   3.3 User interface issues in question answering system design .............................. 260
   3.4 Other implementation issues ............................................................................. 260
   3.5 Cultural bias in question answering theory ...................................................... 261
   3.6 Legal and Ethical Issues.................................................................................... 261
Bibliography......................................................................................................................... 263
Acknowledgement
I would like to thank my supervisor, Dr. Suresh Manandhar, for his encouragement and
advice in carrying out this research.
I would also like to thank my assessor, Dr. James Cussens, for his support, and a number of
anonymous reviewers for their comments on the papers which helped form this thesis.
Author’s declaration.
The following papers helped form this thesis:
Chapter 3 (The YorkQA question answering system):
De Boni, M., Jara, J.L., Manandhar, S., “The YorkQA prototype question answering
system”, in Proceedings of the 11th Text Retrieval Conference (TREC-11),
Gaithersburg, US, 2003.
(The author of this Thesis, M. De Boni, produced the overall design and implementation of the
system as well as the individual QA components; J.-L. Jara-Valentia contributed the sentence
splitter and Named Entity recogniser)
Alfonseca, E., De Boni, M., Jara, J.L., Manandhar, S., “A prototype Question
Answering system using syntactic and semantic information for answer retrieval”, in
Proceedings of the 10th Text Retrieval Conference (TREC-10), Gaithersburg, US,
2002.
(The author of this thesis, M. De Boni, produced the overall design and implementation of the
system as well as the individual QA components and Information Retrieval module; E.
Alfonseca contributed the NP chunker and Part-of-Speech tagger;
J.-L. Jara-Valentia
contributed the sentence splitter; J.-L. Jara-Valentia and S. Manandhar contributed the Named
Entity recogniser)
Chapter 6 (Semantic relevance):
De Boni, M. and Manandhar, S., “The Use of Sentence Similarity as a Semantic
Relevance Metric for QA”, in Proceedings of the AAAI Symposium on New
Directions in Question Answering, Stanford, 2003.
Chapter 7 (Goal-directed relevance)
De Boni, M. and Manandhar, S., “Implementing Clarification Dialogues in Open
Domain Question Answering”, accepted for publication in Journal of Natural
Language Engineering, 2004.
De Boni, M. and Manandhar, S., “An Analysis of Clarification Dialogue for Question
Answering”, in Proceedings of HLT-NAACL 2003, Edmonton, Canada, 2003.
Chapter 10 (Background knowledge)
De Boni, M. and Manandhar, S., “Automated Discovery of Telic Relations for WordNet”, in
Proceedings of the First International WordNet Conference, India, 2002.
Chapter 1
Introduction
Executive Summary
Question Answering systems are introduced and the limitations of research in this area are
outlined. We show how this thesis addresses these limitations by providing and implementing
a theoretical framework based on the notion of relevance. We then give an outline of the
contributions of this thesis to research in this area, providing a breakdown of the thesis by
chapter.
1 Question Answering Systems
While there is a very large amount of written information available in electronic format, there
is no easy way of quickly and reliably accessing this information: a vast amount of
information has, for example, been stored on the Internet, but there is no straightforward
means of finding an answer to a simple question such as “Who is the current President of the
USA?”; the full text of the past few years’ Financial Times and Economist is available on
CD-ROM, but there is no easy way of answering a question such as “Which country had the
lowest inflation in 1999?”.
Previous attempts to solve this problem have relied on Information Retrieval systems to
answer questions indirectly by giving an indication of which documents probably contain
information related to the user’s question: this has been the approach used, for example, in
Internet search engines. The drawback of this approach is that, once a set of documents has been provided, it is then up to the user to examine a subset of all the given documents and then read the chosen documents in order to find an answer, a time-consuming and often frustrating
process (search engines sometimes retrieve millions of documents relating to a topic), which
makes this solution far from ideal. Another downside of Information Retrieval systems is that
they do not attempt to understand the meaning of users’ questions (and in fact questions are
referred to as queries, which are considered a bag of words) and often present as an “answer”
a set of documents which are unrelated to the question, being unable, for example, to
distinguish the (not too) subtle difference between “Who loves Natalie Portman?” and “Who
does Natalie Portman love?”.
Current research in Question Answering (QA) systems addresses these issues by trying to find a
method for answering a question by searching for a precise response in a collection of
documents. Unlike search engines, QA systems seek to understand users’ questions and aim
to present a user with an answer as opposed to a set of documents containing an answer:
given a question such as “When was Shakespeare born?” a search engine would provide a set
of documents containing the biographical details of the playwright and (possibly) also a large
number of unrelated documents, containing, for example, biographical information about
living actors who have taken part in Shakespeare plays; QA systems, on the other hand,
would be able to present the user with the exact answer, “1564”.
2 Thesis motivation
Research into the philosophical, linguistic and psychological foundation of question
answering can be traced to Socrates’ analysis of dialogue, and research in automated question
answering systems spans more than thirty years. Nevertheless there is a significant lack of
theoretical understanding of QA systems and consequently a considerable amount of
confusion about their aims and evaluation. While ideas from psychology or philosophy may
be helpful in understanding QA from a theoretical point of view, these investigations have
lacked a coherent focus and there is little agreement on their conclusions (see chapter 2,
paragraphs 4 onwards); moreover there has been no systematic attempt to apply these ideas to
QA systems and initial attempts to build systems based on ideas from cognitive science had
very limited applicability and have long since been abandoned (see for example Schank 1975;
Schank and Abelson 1977; Dyer 1983; Graesser and Black 1985; see chapter 2, paragraph 1).
The study of question answering systems appears to have made slow progress and current QA
systems are no more than prototypes, being able to answer only the simplest of questions
(concept-filling questions such as “Who painted the Gioconda?” or “When did Goethe write
Faust?”, but not more complicated questions asking “Why did Leonardo paint the Gioconda?” or
“What were the first reactions to Goethe’s Faust?”) and even then only with low accuracy
(see Moldovan et al. 2003 for a description of the most successful current system). While
there is agreement amongst researchers on the generic aim of QA systems (presenting an
answer to a question as opposed to a set of documents associated with a query) little work has
been done on clarifying the problem beyond the establishment of a standard evaluation
framework for QA, the Text Retrieval Conference (TREC) QA track set out by the US
National Institute of Standards and Technology (the evaluation framework set out for TREC-8 can be found in
Voorhees and Tice 2000 and Voorhees 2000, and was subsequently refined in Voorhees 2001,
2002, 2003). Current programmes, for example the AQUAINT programme, make reference to
this “standard” framework; we shall show that this framework has a number of shortcomings,
mainly due to the lack of clarity in the problem setting, which have generated confusion about
what exactly the framework is evaluating and the value of the evaluation itself. It is unclear,
for example, what user model is implicitly employed when judging the appropriateness of
answers and how such a user model would constrain an answer. The status of prior knowledge
is also badly defined and unclear in the final evaluation of the systems. In addition there is
confusion as to the notion of “good” answer, with ambiguity as to whether this should be
taken to be an “exact” answer or something different. Another limiting factor has been that
most current research has either aimed at solving the engineering problem of building and improving systems capable of achieving high scores within this framework, without questioning the solidity of the framework itself, or has looked at “future directions” without first clarifying the problem setting and where these directions should lead.
There is consequently a need for a theoretical investigation into QA systems which would
clarify their purpose and elucidate their constituent structure and hence dispel the confusion
which is currently present in the design and evaluation of these systems.
3 Thesis aims
This thesis addresses the shortcomings of current QA research by:
• Examining from a philosophical point of view the concept of answerhood as applicable to open domain QA systems, and, in particular, the conditions which determine the answerhood of an answer in relation to a question.
• Providing a theoretical framework for open domain QA systems research, based on the concept of answerhood examined above
• Showing that this new framework moves beyond the limitations of the TREC-style evaluation by clarifying exactly what is sought in QA, what the constraints are, what should be evaluated and what is needed in terms of research directions
• Illustrating how the framework can be applied to improve current TREC-style QA systems by implementing and evaluating it in a working system
4 Thesis contribution
The original contribution of this thesis will be to:
• provide an account of the conditions of answerhood based on the notion of relevance, showing the limitations of previous accounts of relevance, developed in other fields such as epistemology and information retrieval, when applied to question answering; we will then demonstrate how relevance for question answering is made up of a number of categories: semantic, goal-directed, logical and morphic relevance.
• set out an explicit theoretical framework for automated question answering, based on the notion of relevance, which spells out the constraints under which question answering systems must operate and paves the way for future developments which move beyond the limitations of current QA systems
• implement a new TREC-style QA system (YorkQA), submitted for official evaluation at the TREC QA track, indicating its limitations in the light of the limitations of the TREC QA framework
• show how the QA system can be improved and its design made clearer with reference to the theory developed, by taking into account the notions of semantic relevance, goal-directed relevance, logical relevance and morphic relevance
• implement semantic relevance on the developed system and show its usefulness for improving answer retrieval in a typical system
• implement goal-directed relevance on the developed system through the use of clarification dialogue and show how this can be used to improve performance of a QA system
• show how logical relevance may be implemented, clarifying how it moves beyond the notions of question categorisation, named entity recognition and logical inference
• show how morphic relevance is implemented on the developed system and explain the usefulness of clarifying its status
• show how machine learning techniques can be applied to enhance the QA system’s knowledge base (WordNet), a crucial element in the design of QA systems.
• show how the elements of relevance previously implemented can be brought together into a complete system
• provide a framework for the evaluation of the relevance framework on a complete system

5 Chapter summary
The thesis will be organised as follows:
Chapter 2 will outline previous research on QA systems, with particular emphasis on the
TREC evaluation framework. It will also look at related work in linguistics and philosophy
and show how research in these areas can profitably be used for QA systems.
Chapter 3 will describe the YorkQA TREC-style QA system, with an evaluation of its
performance against the TREC QA criteria and a discussion on the limitations of such an
evaluation. The system will be used as a baseline for the experiments carried out in the
following chapters.
Chapter 4 will introduce the notion of relevance and show how relevance can be applied to
question answering. In particular it will show that relevance should be understood as made up
of a number of components: semantic, goal-directed, logical and morphic relevance.
Chapter 5 will introduce a theoretical framework for QA systems based on the notion of
relevance described in chapter 4, explicitly identifying the constraints that determine what is
to be considered a relevant answer to a question.
Chapter 6 will describe a detailed implementation of semantic relevance for question
answering and an evaluation of its implementation for improving the YorkQA system
previously introduced.
Chapter 7 will describe an implementation of goal-directed relevance through clarification
dialogue and an evaluation of the implementation.
Chapter 8 will describe an implementation of logical relevance, showing how it moves
beyond question type categorisation and answer inference.
Chapter 9 will describe how morphic relevance is implemented in TREC-style systems and
the importance of its differentiation from other types of relevance.
Chapter 10 will argue the importance of automatically learning background knowledge for
QA and will describe a method for learning new WordNet relations and its application for
learning telic relationships.
Chapter 11 will describe how the components of relevance described in the previous chapters
can be integrated in a coherent system, providing a framework for the overall evaluation of
the system.
Chapter 12 will present some conclusions with a discussion of open issues and future
research directions.
Chapter 2
Question Answering Systems: An Overview
Executive Summary
In this chapter we provide an overview of research into Question Answering systems. We
show what research has been carried out in the past, what research is currently being
undertaken, and the shortcomings of current and past research.
1 Previous work on Question Answering Systems
1.1 The early years: databases, cognitive science and limited domains
Research in QA systems is not new: a number of systems attempting to understand and
answer natural language questions have been developed since the early sixties (see Simmons
1965, who reviewed 15 different systems which had been implemented to that date). These
early systems, such as the BASEBALL program described by Green et al. (1961), were based
on retrieving information in a very limited domain (in this case baseball games played over
one season in the American league) from a database. Another early experiment in this
direction was the SHRDLU system (Winograd 1972), which answered simple questions about
a world constituted of moveable blocks.
Simmons (1973) presents one of the earliest generic question answering algorithms, which proceeds as follows (a minimal sketch of the selection and matching steps is given after the list):
1. Accumulate a database of semantic structures representing sentence meanings.
2. Select a set of structures that appears relevant to the question. Relevance is measured by the number of lexical concepts in common between the proposed answer and the question. This is done by ordering the candidates according to the number of Token values they have in common with the question.
3. Match the question structure against each candidate. This is done by:
• Determining if the head verb of the question matches the head verb of the candidate. If there is no direct match, a paraphrase rule is applied to see if the question structure can be transformed into the structure of the answer. Paraphrase rules are stored as part of the lexicon and an examination of the lexical structures of two words will be able to determine if there is a rule (path) connecting the two. If there is not, the set of words that the first transforms into is recursively examined to see if it can be transformed into the second word. If this fails, the transformation rules are recursively applied to the second word to see if a match can be found. This procedure continues until either a match is found or an arbitrarily set depth is reached.
• Applying the same procedure to the other words in the question and the candidate answer in order to transform the question structure into the form of the candidate answer.
• Examining quantifiers and modalities to see if quantificational, tense and negation relationships are matched.
• Examining the question’s semantic structure to determine if the question word type (the wh-word) is present and satisfied in the answer.
The key to this algorithm is the notion of semantic structure, which became a key theme in
QA research in the seventies, with the emergence of systems based on research in cognitive
psychology, attempting to model human intelligence. Lehnert (1978), for example, sought to
understand the nature of questions, in particular their classification, based on ideas from
dependency theory set out for example in Schank and Abelson (1977). Related to this is the
work of Dyer (1983) who built the BORIS system, which attempted to understand short
narratives (in a restricted domain) and answer questions related to the stories. A similar
approach was also taken by Bobrow et al. (1977), who built the GUS system for modelling
human dialogue.
The common characteristic of all these systems was their limited scope: the domain which the
systems attempted to answer questions about was very limited and questions were restricted
to that limited domain; linked to this was the fact that there was no attempt to find “real” user questions, with systems only being able to answer the “toy” questions prepared by the researchers. We shall now examine in more detail the basis of the systems of Schank, Abelson and Dyer.
1.2 Answering questions through cognitive science: Schank and Abelson
One of the basic problems of understanding is the apparent complexity of the world (and
therefore of a collection of documents as a “world” of text) and the need to be able to explain
situations without resorting to unduly complex calculation and without needing a full
comprehension of the events at hand. This complexity is reduced, according to Riesbeck and
Schank (1989), by the use of “scripts”, guides for acting in stereotypical situations. Thinking
could therefore be seen as the application of scripts to particular situations: thinking is about
finding the right “pre-packaged” script, not generating new ideas.
The idea of a script is linked by Schank and Abelson (1977) to the idea of “knowledge
structures”, conceptual nets (the examination of which is referred to in Schank (1975) as
“Conceptual Dependency Theory”) that link together various pieces of information about a
narrative. According to Schank, to understand a text it is necessary to understand the “script”
or the “plan” underlying it, which gives additional information regarding the context and
meaning of words in that context, fitting the words into a structure which can be used for
reasoning. Scripts are “specific knowledge” that enable us to understand a situation without
requiring a huge amount of processing, providing “connectivity” in stories: they are “a
predetermined, stereotyped sequence of actions that defines a well-known situation”, thus
allowing for new objects to be referenced as if they had already been mentioned. A plan is “a
repository for general information that will connect events that cannot be connected by use of
an available script” or causal chains: it contains information about “how actors achieve
goals”, allowing us to make sense of seemingly disconnected sentences. Goals are needs like
“sleep” or “hunger”, which in turn are explained by “themes”, such as “success” or “love”.
These structures are built up from the “conceptual roles” (such as actor, action, object and
state) that make up the Conceptual Dependency Theory. In turn, the conceptual roles (or
types) make up a number of conceptual categories (such as PP, a physical object which may
be an actor; ACT, what can be done by an actor to an object; T, a point in time; PA, attributes
of an object). Various rules are given which define the ways (and the only ways) in which
conceptual categories can be combined (e.g. “Certain PPs can ACT” or “Conceptualizations
can result in state changes for PPs”). A number of primitive actions are then identified, such
as ATRANS (transfer of an abstract relationship), GRASP (grasping of an object by an actor)
and SPEAK (an action producing sounds). Finally, a list of self-explanatory primitive states is
given, for example HEALTH, FEAR, ANGER, MENTAL STATE and PHYSICAL STATE.
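A minimal, purely illustrative encoding of such a conceptualisation is sketched below; it is not Schank's notation or implementation, and the class and field names are assumptions chosen to mirror the roles and primitives just described.

# Minimal illustrative encoding of a Conceptual Dependency style structure.
# Not Schank's notation or code; class and field names are assumptions chosen
# to mirror the roles and primitive actions described above.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Conceptualization:
    actor: str                       # a PP that can ACT
    action: str                      # a primitive ACT, e.g. ATRANS, GRASP, SPEAK
    obj: Optional[str] = None        # the object acted upon
    recipient: Optional[str] = None  # "to" slot for transfers
    time: Optional[str] = None       # T, a point in time
    states: dict = field(default_factory=dict)  # PA attributes / primitive states

# "John gave Mary a book" as an abstract transfer of possession (ATRANS)
give = Conceptualization(actor="John", action="ATRANS", obj="book",
                         recipient="Mary", time="past")
print(give)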
Based on these structures it is possible to make “conceptual inferences” in order to understand
a text. This type of inference, a “new piece of information which is generated from other
pieces of information” (Rieger 1975), differs from logical deductions and theorem proving in
that:
• it is a “reflex response”: inferences are not only drawn from “interesting” concepts, but from all concepts
• it is not necessarily a logically valid deduction, and may even lead to inconsistencies: the intent of inference making is to fill in missing information from an utterance, trying to tie pieces of (possibly partial) information together, in order to discover something interesting. It may even be the case that many of the inferences made prove to be useless
• it is neither true nor false; rather, there is a degree to which the information is likely to be true
• it does not have a “direction”, and therefore does not seek to prove a “goal”. The aim of conceptual inferences is “to see what they can see”, in the hope that something interesting will appear
• the strength with which some information is believed may vary depending on new information: it is therefore important to remember why a certain piece of information exists, i.e. what inference it derived from
1.3 Answering questions through cognitive science: Dyer
The idea of a script is re-elaborated by Dyer (1983) who presents the idea of “Thematic
Abstraction Units”, knowledge structures for understanding (and answering questions about)
events (or stories). Thematic Abstraction Units are more abstract than the scripts presented by
Schank and moreover contain processing information for dealing with plans relating to the
events at hand. In other words, they define abstract knowledge structures which aim to
capture a deep level of understanding that is used for story categorization. Story
categorization is achieved through “cross contextual remindings” and by recalling abstract
story types (adages). Adages provide planning information, i.e., given a particular situation,
they determine what plan should be used, what its intended effect is, why the plan failed and a
recovery strategy for the plan.
Given a text (a story), therefore, a system following the ideas of Dyer would construct a
complex knowledge dependency structure which would characterize the story in terms of
plans, failed plans and attempted recovery of plans. It would do so by constructing so-called
I-links (intentional links), “common sense” relationships between goals, plans and events, in
order to “organize and package” the internal components of the knowledge structures. I-links
are used to represent the motivations and the intentions of characters in a narrative. Examples
of I-links are linking Event to Event (forces, forced-by), linking Goal to Event (thwarted-by,
motivated-by, achieved-by) and linking Plan to Event (realizes, blocked-by).
In order to answer a question the system uses search heuristics to traverse the I-links in order
to determine the meaning of the question and to retrieve an appropriate answer. This is done
initially by categorizing the question, an approach similar to the one which, as we shall see below, is usual in the current TREC Question Answering evaluation framework. Question categorization is
done following the conceptual categorization scheme developed by Lehnert (1978) and
expanded by Kolodner (1980), and adding some new categories. Lehnert divided questions
into the following types (although only the first seven are implemented by Dyer’s system):
• causal antecedent (e.g. why did Tom do this?)
• goal orientation (e.g. why did John want this from Bill?)
• causal consequent (e.g. what happened after... ?)
• verification (e.g. did this happen?)
• instrumental (e.g. how did John do this?)
• concept completion (e.g. who won the race?)
• expectational (e.g. why didn’t Susan do this?)
• feature spec (e.g. how old is Mary?)
• quantificational (e.g. how many people did this?)
• enablement (e.g. what did Bill need in order to do this?)
• disjunction (e.g. did Kate do this or that?)
• judgmental (e.g. what should I do now?)
• request (e.g. could you pass me the pen?)
Kolodner added the following categories to the ones proposed by Lehnert:
• time (e.g. when did this happen?)
• setting (e.g. where was Pete?)
• identification (e.g. who was Anne?)
• duration (e.g. how long did the meeting last?)
Dyer added the following (a minimal rule-based sketch of this style of categorisation is given after the lists):
• event spec (e.g. what happened at the party?)
• affect/empathy (e.g. how did Sam feel about that?)
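The sketch below shows this style of categorisation in its simplest rule-based form; the patterns and the (partial) category coverage are assumptions made for illustration and do not reproduce Lehnert's or Dyer's systems.

# A minimal rule-based sketch of question categorisation in the spirit of the
# Lehnert/Kolodner/Dyer taxonomy above. The patterns and partial category
# coverage are simplifications chosen for illustration only.
import re

RULES = [
    (r"^why didn'?t\b", "expectational"),
    (r"^why\b", "causal antecedent"),
    (r"^how many\b", "quantificational"),
    (r"^how long\b", "duration"),
    (r"^how old\b|^how (tall|big|far)\b", "feature spec"),
    (r"^how\b", "instrumental"),
    (r"^when\b", "time"),
    (r"^where\b", "setting"),
    (r"^who\b|^what\b", "concept completion"),
    (r"^(did|is|was|does|do)\b", "verification"),
    (r"^(could|can|would) you\b", "request"),
]

def categorise(question):
    """Return the first matching category, or a fallback label."""
    q = question.lower().strip()
    for pattern, category in RULES:
        if re.search(pattern, q):
            return category
    return "unknown"

for q in ["When did Goethe write Faust?",
          "Why didn't Susan do this?",
          "How many people did this?"]:
    print(q, "->", categorise(q))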
Having determined the question type, the system starts traversing the I-links contained in the
Thematic Abstraction Units which represent the narrative in order to find an appropriate
answer concept linked to the question concept. It is interesting to note that in order to avoid
giving “obvious” answers to questions (e.g. Q: Why did George drive home? A: To get home)
the system performs a “knowledge state assessment” (again following Lehnert 1978) to
determine that the answer could not have been reconstructed by the questioner on his own.
Moreover, question parsing and searching of knowledge structures occur at the same time, so
that it may well be possible that an answer may be known before the question has been fully
parsed.
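The following sketch illustrates, under stated assumptions, how such a search might look when I-links are stored as a labelled graph: the link labels follow the examples given above, but the encoding and the breadth-first heuristic are simplifications, not Dyer's BORIS implementation.

# Illustrative sketch of I-links as a labelled graph with a simple breadth-first
# traversal from a question concept towards candidate answer concepts. This is
# not Dyer's BORIS implementation; the graph encoding and the traversal strategy
# are assumptions based on the link examples above.
from collections import deque

# (source concept, link label, target concept)
i_links = [
    ("goal:get_home", "achieved-by", "event:drive_home"),
    ("event:drive_home", "forced-by", "event:car_breakdown"),
    ("plan:take_taxi", "blocked-by", "event:no_money"),
]

def neighbours(concept):
    """All concepts reachable in one I-link step, in either direction."""
    for src, label, dst in i_links:
        if src == concept:
            yield dst, label
        elif dst == concept:
            yield src, label

def find_answer(question_concept, is_answer):
    """Breadth-first traversal of I-links until a concept satisfies is_answer."""
    seen, queue = {question_concept}, deque([(question_concept, [])])
    while queue:
        concept, path = queue.popleft()
        if is_answer(concept):
            return concept, path
        for nxt, label in neighbours(concept):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))
    return None, []

# "Why did George drive home?" -> follow links back to a motivating goal
print(find_answer("event:drive_home", lambda c: c.startswith("goal:")))

Here the question concept is linked in one step to a motivating goal, which is returned together with the path of link labels followed.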
1.4 Limitations of Schank and Dyer’s approach
Scripts and Thematic Abstraction Units lack any attempt at a comprehensive implementation,
and are used in very limited settings (a restaurant setting and a divorce story). No indication is
given as to how these scripts could be constructed (manually? following intuition? using some
form of machine learning?) and no thought is given to whether these ideas could be applied where the texts examined are much longer than those in the “toy” systems presented by Schank and Dyer. Moreover the given ideas (for example, Schank’s “conceptual primitives” or Dyer’s I-link types) appear to be, if not arbitrary, certainly not based on any “scientific” examination
of language, the use of language and relationships between concepts.
The idea of question categorisation has however been considerably influential, forming the
basis of most current question answering systems developed for the TREC QA evaluation
framework, although only the simplest categories (concept completion and feature
specification, refined as quantification, time, setting, identification, duration) have been
considered. We shall see though that these systems ignore the results of cognitive science and
dependency theory and instead use the methods developed in the field of information
extraction for Named Entity recognition to find answers based on the given question
categories.
1.5 Beyond cognitive psychology: open-domain Question Answering
A first attempt to move beyond the limited domain question answering systems was the
FAQFinder system of Burke et al. (1997), which tried to link users’ questions to a set of
previously stored “question and answer” files (taken from the “Frequently Asked Questions”
posts of a number of newsgroups) by locating the most similar question in the document
collection and therefore the most probable answer: questions could therefore be phrased at
will on a very large number of different topics, the topics being limited by the previously
stored question and answer files. The task performed by the FAQFinder system however is
more accurately described as answer finding rather than question answering (see Berger et al.
2000, who describe similar work trying to find a statistical relationship between questions and
answers, rather than the semantic relationship described by Burke), i.e. the search for an
answer to a question in a collection of ready-made answers, as opposed to a collection of
generic documents: in other words, the system is not required to actively seek and construct
an answer from unrestricted text or a knowledge base.
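The distinction can be made concrete with a small sketch: answer finding only needs to match a new question against stored question/answer pairs. The word-overlap measure below is a simplification chosen for illustration and is not the actual FAQFinder algorithm.

# Illustrative sketch of answer finding in the FAQFinder spirit: match a new
# question against stored FAQ question/answer pairs and return the answer of
# the most similar stored question. The Jaccard word-overlap measure is a
# simplification, not the actual FAQFinder algorithm.
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

faq = [
    ("How do I reset my password?", "Use the 'forgot password' link on the login page."),
    ("Who wrote Faust?", "Faust was written by Goethe."),
]

def answer(question):
    best_question, best_answer = max(faq, key=lambda qa: jaccard(question, qa[0]))
    return best_answer

print(answer("Who was the author of Faust?"))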
2 Question Answering, Information Retrieval and Information Extraction
2.1 Information Extraction through template filling
A method of searching for information in some ways similar to Question Answering is
Information Extraction. Information Extraction systems use templates which specify what
information should be extracted from texts according to user requirements. Thus, for example,
a user requirement for information about car accidents might use a template made of fields
such as “Number of injured”, “Number of cars involved”, “Names of victims”, “Location”.
The Information Extraction engine would then attempt to fill these fields as if entering the
information in a database. This problem has been examined in detail in the Message
Understanding Conferences (see for example Hobbs 1993), which aimed to produce systems
which could accurately carry out tasks such as identifying names, dates and organizations in a
text. The results have been very encouraging and the “Template Element” task (filling in
fields in a template) has been carried out with a reported accuracy of over 95% for the best
systems (see Wilks and Catizone 2000).
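To make the idea concrete, the sketch below fills a toy accident template with simple regular expressions; the field names follow the example above, but the patterns are assumptions made for illustration and bear no relation to the actual MUC systems.

# Illustrative sketch of Information Extraction as template filling: a template
# is a set of named fields, each filled by a simple pattern over the text.
# Real MUC-style systems are far more sophisticated; the field names and regexes
# here are assumptions for illustration only.
import re

TEMPLATE = {
    "number_of_injured": r"(\d+)\s+(?:people\s+)?(?:were\s+)?injured",
    "number_of_cars_involved": r"(\d+)\s+cars?",
    "location": r"\bin\s+([A-Z][a-z]+)",
}

def fill_template(text):
    """Return the template with each field filled from the first matching pattern."""
    filled = {}
    for field_name, pattern in TEMPLATE.items():
        match = re.search(pattern, text)
        filled[field_name] = match.group(1) if match else None
    return filled

report = "2 cars collided in York yesterday; 3 people were injured."
print(fill_template(report))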
There are however a number of limitations to template filling. Templates are usually handcrafted by human experts to suit a particular domain and therefore cannot be easily ported to a
new domain. Consequently there has been the need to seek alternative approaches such as
automatically learning extraction templates. Early experiments in this sense were the LIEP
system described by Huffman (1995) which induced patterns from positive training examples
and user input; the PALKA system described by Kim and Moldovan (1993) which
inductively built IE patterns but required prior knowledge of a semantic hierarchy, an
associated lexicon and answer keys; the system described by Bagga et al. (1997), which
generalizes from sentences selected by an expert. A semiautomatic method for inducing
information extraction patterns is the ELA algorithm described by Català et al. (2000). This
algorithm minimizes human intervention and ensures that the Information Extraction system
is portable amongst different domains. A fully unsupervised template learning system was
presented in Collier (1998): there are however doubts that this system could effectively learn
anything other than simple prototype results (see Wilks and Catizone 2000 and ECRAN
1998).
The need to customize templates for the needs of a new domain can be considered a subproblem of the more general issue of adapting a generic IE system to the needs of a particular
user. A user-driven IE system would adapt to the requests of the particular user without the
need for expert intervention, for example changing the template for a text to match the
particular interests of a particular user. User-driven IE is however a very poorly researched
area, almost a “concept only” (Wilks and Catizone 2000). Question Answering on the other
hand improves on Information Extraction through templates and is much more in line with the
hand improves on Information Extraction through templates and is much more in line with the
idea of user-driven Information Extraction, allowing users to specify exactly what they want
the extraction machine to provide.
2.2 Information Retrieval
Closely related to question answering is the idea of information retrieval, typically
exemplified by internet search engines (for an overview see Baeza-Yates and Ribeiro-Neto
1999 and Sparck Jones 1997). Information retrieval systems however do not respond to a
question but to a query, a set of words whose syntactic structure is ignored, and do not return
an answer, but a set of documents which are considered relevant to the query, i.e. which it is
hoped will be useful to the user. As will be shown below, open-domain QA systems generally
make use of information retrieval engines in order to narrow down the number of documents
to be searched and processed in order to find an exact answer to a question.
2.3 Query-relevant summarisation
A different approach to Question Answering is found in the Information Retrieval task of
Query-Relevant (or User-Focused) Summarization (see Mani and Bloedorn, 1998 and Berger
and Mittal, 2000). Given a document d and a query q, query-relevant summarization aims to
extract a portion s from d which best relates d to q. The system’s aim is therefore to select a
span of text (complete sentences or paragraphs) from the original document that will best
answer the question posed (where the question could be either a natural language question or
a query made up of a set of keywords or another type of sentence). This is done through
statistical analysis and only a very limited application of syntactic and semantic knowledge.
The main drawback of this approach is therefore an output which makes sense statistically but
which rarely resembles the output that would be presented by a human.
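A minimal sketch of the statistical core of query-relevant summarization as described above (not the Berger and Mittal model, which is probabilistic): each sentence of the document is scored by word overlap with the query, and the contiguous span of sentences with the highest total score is returned as the summary. The splitting and scoring functions below are deliberately naive and purely illustrative.

    import re

    def split_sentences(document):
        # Naive sentence splitter; real systems use more robust segmentation.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def overlap(query, sentence):
        q = set(re.findall(r"[a-z]+", query.lower()))
        s = set(re.findall(r"[a-z]+", sentence.lower()))
        return len(q & s)

    def query_relevant_summary(document, query, span=2):
        # Return the contiguous span of sentences that best relates the document to the query.
        sents = split_sentences(document)
        scores = [overlap(query, s) for s in sents]
        span = min(span, len(sents))
        best = max(range(len(sents) - span + 1), key=lambda i: sum(scores[i:i + span]))
        return " ".join(sents[best:best + span])

    document = ("The company was founded in 1990. Its headquarters are in Leeds. "
                "It employs over two thousand people.")
    print(query_relevant_summary(document, "Where are the company headquarters?"))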
Query-Relevant Summarization is heavily reliant on statistical techniques such as word
frequency count, and does not entail any linguistic understanding on the part of the machine.
This means that a user will never be given a definite answer to a question and they will need
to process the given answer by scanning the portion of text returned in order to find the
information that was actually requested. In order to return the specific information required,
natural language processing techniques would have to be applied both to the query posed by
the user and to the portion of text returned by the information retrieval system. But natural
language techniques have been applied to information retrieval tasks with mixed results (see
for example Voorhees 2000b): queries have been found to be particularly difficult to deal with
because they are short and offer little to assist linguistic processing.
3 TREC-QA and Open-Domain Question Answering
3.1 Open-Domain Question Answering
Open domain question answering seeks to move beyond the limitations of previous
approaches to question answering research by looking at a “realistic” setting of the problem
where:
• Users can ask any question, phrased in any way, on any topic
• Knowledge is not readily encoded, and answers must be sought and constructed from a collection of documents made up of “real” text, with the errors, omissions and complications these contain.
The emphasis is therefore much more on solving a practical problem rather than constructing
a cognitive theory. On the other hand this often means that research is directed at solving
challenging engineering problems (see for example Chu-Carroll et al. 2003) but without the
theoretical underpinning that can give research a clear understanding of the issue at hand.
Paraphrasing Kant, while initial research on QA was interested in ideas but lacked the content
given by experience and hence was in a sense “empty” in that the ideas did not have a
“realistic” practical application, open-domain QA lacks vision in that it is interested in
practical, “realistic” problems but lacks the framework that can only come from a deep
conceptual understanding of the problem. These difficulties will be addressed in more depth
below, where we will develop a theoretical framework for open-domain question answering
which will provide the conceptual underpinning necessary for research in this area.
3.2 The TREC QA Evaluation Framework
The Text REtrieval Conferences (TREC) are a series of workshops with the aim of advancing
the state-of-the-art in text retrieval (Voorhees and Tice 2000, Voorhees 2000). Starting from the
8th TREC (TREC-8), a Question Answering track was added to the conferences, with the aim
of providing a common evaluation framework for the development of open-domain QA
systems.
3.3 TREC-8 QA
3.3.1 TREC-8 QA: Problem Setting and Evaluation
Participants in the TREC-8 QA track were given a very large set of documents from a variety
of sources (The Financial Times, The Los Angeles Times and the Foreign Broadcast
Information Service) and a number of questions to be answered in a limited time (one week)
by using the documents. The questions were limited to short-answer, fact-based questions,
and it was guaranteed that at least one document in the collection contained the answer to the
question. Systems were expected to return a ranked list of five pairs of the form {documentid, answer-string} per question where the document-id specified the id of the document the
string was derived from, and the answer string was a string containing the answer. Answer
strings had a limit of either 50 or 250 bytes, which meant that specific answers to questions
were not expected. Rather, a “window” within which the answer was to be found was to be
returned by the systems. However, answer strings which contained the correct answer, but
also other entities of the same semantic type, with no means of deciding which was the “true”
answer, were not considered to be correct.
The evaluation of each question was carried out by a team of human assessors who decided if
the answer provided was correct or not. A score was then computed based on the mean
reciprocal rank of the answers: each question received a score that was the reciprocal of the
rank at which it placed the correct answer (i.e. 1 for an answer at rank 1, 1/4 for an answer at
rank 4) or 0 if no correct answer was returned. The overall score was the mean of these
individual question scores. Systems did not receive credit for returning multiple correct
answers, nor did they receive credit for realizing they did not know the answer.
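The scoring just described can be stated concisely; in the sketch below the correctness test stands in for the human assessors’ judgement, and the toy data are invented purely to illustrate the arithmetic.

    def reciprocal_rank(ranked_answers, is_correct):
        # is_correct(answer) stands in for the assessors' judgement.
        for rank, answer in enumerate(ranked_answers, start=1):
            if is_correct(answer):
                return 1.0 / rank
        return 0.0  # no credit when no correct answer is returned

    def mean_reciprocal_rank(runs, keys):
        # runs[i] is the ranked list of (at most five) answer strings for question i.
        return sum(reciprocal_rank(run, lambda a, k=key: a == k)
                   for run, key in zip(runs, keys)) / len(runs)

    # Toy example: the correct answer appears at rank 1, at rank 4, and not at all.
    runs = [["1867", "1869"], ["Paris", "Lyon", "Nice", "Rome"], ["blue"]]
    keys = ["1867", "Rome", "red"]
    print(mean_reciprocal_rank(runs, keys))   # (1 + 1/4 + 0) / 3, i.e. about 0.417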
3.3.2 TREC QA algorithms
The systems presented at all the TREC conferences follow a similar algorithm, with only
slight variations:
1. Create a database of terms from the document collection using a standard Information
Retrieval engine.
2. Select a set of documents from the collection which are relevant to the question using the
Information Retrieval engine.
3. Find the portions of the document set that are most relevant to the type of question asked.
This is done by a) initially matching entity-types (person, location, …) to wh-words in the
questions (who, where, …) and then by b) narrowing down the choice further by using
some form of pattern matching and/or making use of syntactic and semantic information,
possibly with some form of logic prover.
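The three steps can be sketched schematically as follows; the retrieval function and the entity tagger below are deliberately naive stand-ins (word overlap and capitalisation/digit heuristics) for whatever Information Retrieval engine and named-entity recogniser a particular TREC system used, and the example documents are invented.

    import re

    EXPECTED_TYPE = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}

    def retrieve(collection, question, k=10):
        # Steps 1-2: rank documents by overlap with the question treated as a query.
        q = set(re.findall(r"[a-z]+", question.lower()))
        return sorted(collection,
                      key=lambda d: len(q & set(re.findall(r"[a-z]+", d.lower()))),
                      reverse=True)[:k]

    def tag_entities(passage):
        # Stand-in named-entity recogniser: capitalised words become PERSON,
        # four-digit numbers become DATE (deliberately crude).
        for token in re.findall(r"[A-Z][a-z]+|\d{4}", passage):
            if token in ("The", "A", "An"):
                continue
            yield token, ("DATE" if token.isdigit() else "PERSON")

    def answer(question, collection):
        # Step 3: keep entities whose type matches the wh-word of the question.
        wanted = EXPECTED_TYPE.get(question.lower().split()[0])
        candidates = []
        for document in retrieve(collection, question):
            for entity, entity_type in tag_entities(document):
                if wanted is None or entity_type == wanted:
                    candidates.append(entity)
        return candidates[:5]

    documents = ["Yuri Gagarin orbited the Earth in 1961.",
                 "The marathon was won by Abebe Bikila."]
    print(answer("When did Gagarin orbit the Earth?", documents))   # ['1961']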
There is no significant difference in design concept between the systems presented at the early
conferences (e.g. TREC-8) and the later conferences (TREC-11): later systems tend to
improve on earlier algorithms without any radical modifications other than what is needed to
conform to the slight variations in evaluation setting.
3.3.3 TREC-8 QA: approaches
In this first TREC workshop, all systems applied statistical techniques from Information
Retrieval; a number of participants then attempted to improve performance by adding other
algorithms which sought to actually understand (to varying degrees) the document texts and
the question.
Given a query (question), Information Retrieval systems generally apply statistical methods to
find appropriate documents. Extending this idea to question answering, systems based on
Information Retrieval algorithms answer questions by adopting a “bag-of-words” approach
which considers the question independently from its grammatical or semantic characteristics
and applies statistical techniques to return answer passages (see Prager et al. 2000). While
systems based on this approach coped with the 250 byte task, they proved inadequate for
retrieving the much shorter 50 byte strings.
Having attempted the question-answering track both with a passage-based (bag-of-words
approach) system and a system that used simple natural language processing techniques,
Singhal et al. (2000) concluded that while the passage-based system coped very well with
retrieving 250 byte strings, a more sophisticated, linguistic-based system was required to
retrieve shorter (50 byte) strings, due to the fact that linguistic-based systems were able to
consider context when seeking the correct answer. Their passage-based system searched for
the correct answer by:
• retrieving the top fifty documents for a question using a straight vector match
• breaking up each section of the retrieved documents into sentences, which were then assigned a score calculated using an algorithm which took into account such elements as query term weight, the appearance of bigrams, and query words contained in adjoining sentences
• ranking the passages depending on the score given
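This style of sentence scoring can be sketched as follows: each sentence receives credit for question terms it contains, a bonus for matching question bigrams, and a smaller bonus for question terms appearing in adjoining sentences. The weights are invented for illustration and do not reproduce the actual formula of Singhal et al.

    import re

    def words(text):
        return re.findall(r"[a-z]+", text.lower())

    def bigrams(tokens):
        return set(zip(tokens, tokens[1:]))

    def score_sentences(question, sentences, w_term=1.0, w_bigram=2.0, w_adjacent=0.5):
        # Illustrative weights; the original algorithm also used per-term query weights.
        q_words, q_bigrams = set(words(question)), bigrams(words(question))
        tokenised = [words(s) for s in sentences]
        scored = []
        for i, toks in enumerate(tokenised):
            score = w_term * len(q_words & set(toks))
            score += w_bigram * len(q_bigrams & bigrams(toks))
            for j in (i - 1, i + 1):              # query words in adjoining sentences
                if 0 <= j < len(tokenised):
                    score += w_adjacent * len(q_words & set(tokenised[j]))
            scored.append((score, sentences[i]))
        return sorted(scored, reverse=True)

    sentences = ["The Hubble Space Telescope was launched in 1990.",
                 "It was carried into orbit by the shuttle Discovery.",
                 "Budget overruns delayed the project for years."]
    for score, sentence in score_sentences(
            "When was the Hubble Space Telescope launched?", sentences):
        print(round(score, 1), sentence)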
Takaki (2000) describes a similar system based on an algorithm which ranked documents
based on the results of a formula taking parameters such as query term weight, number of
documents in the collection, number of documents known to be relevant to a specific topic,
number of relevant documents containing a particular term, the frequency with which a term
appears in a topic and the document length. Once the relevant documents have been retrieved,
the best answer text is calculated using a formula dependent on variables such as the distance
of a query term from a particular position in the text and the relative distances of query terms.
Again good results were achieved when retrieving strings under 250 bytes long, but results
were poor in the 50 byte category, which led the author to speculate that applying linguistic
techniques to question analysis would yield better results.
Even where it was recognised that linguistic processing was necessary, the number of
documents to be analysed was such that it ruled out any full processing of the entire text
collection for each question. All systems that made use of linguistic knowledge therefore first
used conventional Information Retrieval techniques to filter out candidate relevant documents
or passages. Participants then applied linguistically based techniques to different extents:
some used semantic knowledge for query expansion; others used named-entity recognition
and a shallow parse of the question to find a correspondence between the answer and the
documents; another approach taken was to use relatively deep parsing of the question or the
documents to extract the correct answers from these chosen documents.
Most systems making use of linguistic techniques processed the questions posed in order to
establish the type of question the user intended. This superficial parse, or semantic tagging of
wh-words (Oard and Wang 2000) determined, for example, that questions beginning with
“Who” needed an entity of type “Person” as an answer, questions beginning with “When”
required an entity of type “Date” as an answer, and questions beginning with “Where”
required an entity of type “Location” (see for example Breck et al. 2000; Hull 2000;
Moldovan et al. 2000; Singhal et al. 2000). Different systems added different levels of
complexity to this basic idea: Hull (2000), for example, added heuristics that would correctly
determine the question type even in the absence of a standard wh-word; in this system, a
question starting with “How” was deemed to be of type
“Quantity” if followed by the adjectives “long” or “short”, but was correctly interpreted as
“Money” if followed by the adjectives “rich” or “poor”; questions such as “Find the price of
X” were also correctly interpreted as being of type “Money” given the presence of the
keyword “price”. All systems used some type of named-entity recognition to parse the
documents in order to recognize the type of entities included in the text (time, location,
person, etc.) and match them to the required question type.
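This kind of question typing can be sketched by combining the basic wh-word mapping with a few heuristics in the style of Hull (2000); the rules shown are a small illustrative subset rather than any system’s actual rule base.

    WH_TYPE = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}

    def question_type(question):
        # Map a question to the semantic type its answer entity should have.
        tokens = question.lower().rstrip("?").split()
        if tokens[0] == "how" and len(tokens) > 1:
            if tokens[1] in ("long", "short", "far", "many", "much"):
                return "QUANTITY"
            if tokens[1] in ("rich", "poor"):
                return "MONEY"
        if "price" in tokens or "cost" in tokens:
            return "MONEY"
        return WH_TYPE.get(tokens[0], "UNKNOWN")

    for q in ["Who wrote Hamlet?", "How rich is Bill Gates?",
              "Find the price of X", "How long is the Nile?"]:
        print(q, "->", question_type(q))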
A number of systems attempted to improve performance by employing some form of deeper
parsing, including the use of grammar, co-reference resolution and semantic information.
Humphreys et al. (2000) used a custom-built question grammar based on the test questions
given before the task. This produced a quasi-logical-form of the question which contained a
“question-variable” indicating the entity requested by the question. The answer to the
question was then considered as a hypothesis and the system’s coreference mechanism
(Gaizauskas and Humphreys 1996) attempted to find an “antecedent” for the hypothesis from
the subset of documents retrieved. The documents had also been previously analyzed in order
to translate their contents into quasi-logical form. Pairs of entities from the question and the
documents were then compared for a similarity measure, by a) looking at semantic classes of
entities, b) considering the values of “immutable” attributes (i.e. fixed single-valued attributes
such as gender or number) of the entities, and c) identifying events of compatible classes
which apply to the entities under consideration by making reference to the
hypernym/hyponym relations of WordNet (Fellbaum 1998). While WordNet was only used to
compare event classes, the authors envisaged the possibility of using a similar approach in
order to compare object classes.
Oard and Wang (2000) used dependency grammars to match questions to answers in the
retrieved documents: after collecting a set of sentences that contain matching words in the
query (looking at word roots, stemming and wh-words) the dependency trees in the query and
the document acted as constraints on the possible answers.
A different approach was taken by Litkowski (2000) who aimed ultimately to tie together the
elements of a document using lexical chains and coreference to create a hierarchical view of
the document. The implementation for TREC-8 did not achieve this objective, but used parse
trees to extract semantic relation “triplets” consisting of a discourse entity, a semantic relation
(characterizing the role of the entity in the sentence) and a governing word to which the entity
stands in semantic relation. These triplets were then inserted in a database and compared with
the records for the questions (also represented as triplets acquired in the same manner) to
decide if they provide an answer.
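The triplet representation and the database comparison can be sketched as follows; here the triplets are written out by hand, standing in for the output of the parser, and the matching rule (same relation and same governing word as the question’s wh-entity) is a simplification for illustration.

    from collections import namedtuple

    # (discourse entity, semantic relation, governing word)
    Triplet = namedtuple("Triplet", "entity relation governor")

    # Hand-written triplets standing in for those extracted from parse trees.
    document_triplets = [
        Triplet("Lee Harvey Oswald", "SUBJ", "assassinate"),
        Triplet("President Kennedy", "OBJ", "assassinate"),
        Triplet("Dallas", "LOCATION", "assassinate"),
    ]

    # Triplets for the question "Who assassinated President Kennedy?"
    question_triplets = [
        Triplet("who", "SUBJ", "assassinate"),
        Triplet("President Kennedy", "OBJ", "assassinate"),
    ]

    def candidate_answers(question_triplets, document_triplets):
        # A document entity is a candidate answer if it stands in the same relation
        # to the same governing word as the question's wh-entity.
        wh = next(t for t in question_triplets
                  if t.entity in ("who", "what", "where", "when"))
        return [t.entity for t in document_triplets
                if t.relation == wh.relation and t.governor == wh.governor]

    print(candidate_answers(question_triplets, document_triplets))   # ['Lee Harvey Oswald']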
3.3.4 TREC-8 QA: limitations and results
The questions for the track were a selection of 200 questions from a total of 1500 questions
submitted by the NIST team, the assessors, the participants and question logs from the
FAQFinder system from Cornell University. The questions were not, however, checked
thoroughly before being given to the participating systems, and a number were found to have
ambiguous or non-existent answers. Judging correctness for a question was not found to be a
clear-cut decision and it was finally decided that three separate assessors had to evaluate each
question. A number of “human” errors were, as could be expected, associated with evaluation,
with assessors marking right answers as “wrong” or simply pressing the wrong button when
evaluating a response. These problems with evaluation prompted Voorhees and Tice (2000) to
refer to the “myth of the obvious answer” to illustrate the fact that no answer is ever
obviously right and the “granularity” of an acceptable response depends on both the question
and the person receiving the answer. This difficulty in evaluation was however deemed to be
useful as the technology developed for “live” use would have to deal with differing user
opinions and expectations.
The problems with evaluation meant that the results could not be wholly relied upon to
indicate the performance of a system and had to be taken with care. However, the systems that
performed best were the ones using shallow parsing to determine the question type and a form
of Named Entity recognizer to find the relevant answer. The best systems that used this
approach were the ones presented by Srihari and Li (2000), who obtained a mean reciprocal
rank of 0.66, and Moldovan et al. (2000), who obtained a mean reciprocal rank of 0.64. Some
of the worst performing systems appeared to be the ones attempting some form of deep
parsing, with, for example, the system by Humphreys et al. (2000) obtaining a mean
reciprocal rank between 0.111 and 0.60.
3.4 TREC-9 QA
3.4.1 TREC-9 QA: Problem Setting and Evaluation
The systems presented at the 9th TREC QA track continued along the lines of the systems
developed for TREC-8, relying on a combination of statistical and shallow parsing techniques
in order to provide an answer to questions on a very large set of documents (the document set
was expanded to include text from AP Newswire, the San Jose Mercury and the Wall Street
Journal) (see Voorhees 2001 for an overview).
A very large collection of documents from the Foreign Broadcast Information Service, Los
Angeles Times, Financial Times, Wall Street Journal, Associated Press Newswire and San
Jose Mercury News (which totals about 1 gigabyte of compressed data) was given, together
with a set of 693 questions which required an answer from the given documents. The
participants were required to provide (depending on the sub-track they were taking part in) a
sentence containing the answer in less than a 50 byte string (“lenient” answer), a 50 byte
string containing a “strict” (i.e. supported) answer, or a 250 byte string containing the answer
(strict and lenient).
The scores were calculated as mean reciprocal ranks of the best answer over all questions (5
ranked answers were allowed for each question). Scott and Gaizauskas (2001) report the mean
scores of all participants as being:
• Strict answer, 50 bytes: 0.220 (31% of answers in Top 5)
• Lenient answer, 50 bytes: 0.227 (32.2% of answers in Top 5)
• Strict answer, 250 bytes: 0.351 (49% of answers in Top 5)
• Lenient answer, 250 bytes: 0.363 (50% of answers in Top 5)
3.4.2 TREC-9 QA: approaches
The best results were obtained by the system described by Harabagiu et al. (2001) which
combined Natural Language Processing and Knowledge Engineering techniques and used an
iterative approach for the information retrieval part of the problem. The system proceeded as
follows:
• The system checks whether the question, or a similar question, has already been asked before, in which case the cached answer is given. A similarity measure is calculated between the given question and previously answered questions in order to see if the given question is a reformulation of a similar question.
• The answer type is determined, using semantic information and WordNet.
• The question keywords are passed to a Boolean retrieval engine which returns a set of paragraphs which are thought to possibly contain the answer.
• The paragraphs are then analyzed and if they are not of sufficient quality (measured by the number of retrieved paragraphs) words are added to the information retrieval query, and the information retrieval engine is called again. This is repeated until sufficient paragraph quality is obtained.
• The retrieved paragraphs are transformed into a semantic form and unification between the question and answer semantic forms is attempted. If the unification fails, lexical
alternations of the keywords (e.g. synonyms and morphological derivations) are
considered and added to the information retrieval query; the system then restarts the
information retrieval process.
• Logical proof of the answer is sought through abductive backchaining from the answer to the question. If this succeeds, the answer is deemed to have been found; otherwise, semantic alternations are used for the information retrieval query (finding related concepts in WordNet), and the process is restarted from the information retrieval stage.
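The feedback loops just described can be summarised schematically; the functions passed in (retrieve, unify, prove, expand) are placeholders for the system’s Boolean retrieval engine, semantic-form unification, abductive prover and keyword alternation steps, and the thresholds are invented for illustration.

    def answer_with_feedback(question, retrieve, unify, prove, expand,
                             min_paragraphs=5, max_rounds=5):
        # Enlarge the retrieval query until enough paragraphs are found and an
        # answer can be unified with the question and justified by the prover.
        keywords = set(question.lower().split())
        for _ in range(max_rounds):
            paragraphs = retrieve(keywords)
            if len(paragraphs) < min_paragraphs:
                keywords |= expand(keywords)        # add words to the retrieval query
                continue
            for paragraph in paragraphs:
                binding = unify(question, paragraph)       # match semantic forms
                if binding and prove(question, paragraph, binding):
                    return binding                         # answer found and justified
            keywords |= expand(keywords)            # lexical/semantic alternations
        return None                                 # give up after max_rounds

In the actual system the expansion step differs between rounds (first extra keywords, then lexical alternations, then semantic alternations from WordNet), a distinction collapsed here into a single expand placeholder.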
The results were very good, with 59.9% of answers in the top 5 retrieved strings of text for the
lenient 50 byte track, 58% for the strict 50 byte track, and 77.8% and 76% for the lenient and
strict 250 byte tracks, with mean reciprocal ranks of almost 0.6 and over 0.7 for the 50 and 250
byte tracks respectively.
The system described by Litkowski (2001) is typical, in that it relies on a “generic”
(statistical) information retrieval engine to narrow down the search for the answer to the given
question by taking the top 50 documents retrieved from the collection. The chosen documents were then
processed as follows:
• An initial function divided the documents into sentences, as the given format did not split the documents into definite paragraphs. A number of errors occurred in this function due to a misunderstanding of the format of some of the document collections.
• The sentences were then passed to a parser that output bracketed parse trees describing parts of speech and giving lexical entries for each word in the sentence.
• The parse tree was then examined and “semantic relation triplets” extracted and stored in a database in order to be used to answer the question. A semantic relation triplet consists of a discourse entity (which in general corresponds to what information extraction calls “named entities”, i.e. noun sequences, ordinals, time phrases etc.), a semantic relation which characterizes the entity’s role in the sentence (i.e. roles such as agent, theme, location, modifier, manner, purpose, time, etc.), and a governing word to which the entity stands in the semantic relation (i.e. for “SUBJ”, “OBJ” and “TIME”, the main verb of the sentence; for prepositional phrases, the noun or verb that was modified; for adjectives and numbers, the modified noun). The same function was applied to the questions to create a set of records for each question.
• A “coarse filtering” function was then applied to the database record, selecting sentences which had discourse entities which were substrings of discourse entities in the given question, in order to further reduce the sentences to be processed. A machine-readable dictionary was used to aid the recognition and correct division of multi-word units.
• The question type was then determined (i.e. time, location, who, what, size and number) and ambiguous cases were resolved (e.g. “what was the year” was recognized as a “time” question, not a “what” question).
• Potential answer sentences were then identified, given the question type and the semantic relations in the database record triplets.
• Answer quality was then determined by looking in the sentences for key verbs, nouns and adjectives for the key nouns. This was aided by the use of the lexicon (for example looking for the noun “assassination” where the key verb was “assassinate”). Anaphora resolution was also used at this stage.
The results were relatively poor, with scores of 0.287, 0.119, 0.296, 0.135 in the two (strict
and lenient) 50 byte and 250 byte subtracks. However, this was partly due to the documents
that were retrieved by the initial search engine, as 130 questions did not have an answer in the
top 50 documents retrieved. It was also noted that the incremental improvement from
processing more documents was small, thus indicating that simply retrieving more documents
was not a solution to the problem. Adjusting the score to reflect the number of
questions that could be answered, given the retrieved documents, the scores improve to 0.412,
0.170, 0.394 and 0.179 respectively.
Breck et al. (2001) also used a standard Information Retrieval engine to limit the number of
documents to be subsequently analyzed. Given the set of returned documents the system
proceeded as follows:
• The question was processed and (hand-crafted) answer types assigned to it
• A set of taggers found entities of the type assigned to the question and related types in the retrieved set of documents
• The entities were ranked using a probability function (tuned by using a training corpus), which looked at features such as question/context overlap and type match to calculate the most probable answer
It was reported however that the system contained a “trivial” bug which caused the answers to
be ranked randomly, thus invalidating any results. The unofficial results they presented,
having corrected the bug, are 0.206 (33.2% answers in top 5) for the 50 byte lenient run and
0.282 (41.7% answers in top 5) for the 250 byte lenient run.
The system presented by Takaki (2001) also relied on a generic information retrieval engine
to select a limited number of candidate documents to be processed further. The system:
• Determined the answer type (person, location, etc.) and extracted the query terms, which were passed to the IR engine
• Used dictionaries to identify word types (corresponding to answer types) in the documents selected by the IR engine, and ranked the documents by query term and answer type
• Restricted the output to the specified length
The system’s best scores were 0.231 (37% answers in top 5) and 0.391 (57% answers in top
5) for the 50 and 250 byte answer categories (it is unclear whether this is in the strict or
lenient tracks).
The system described by Scott and Gaizauskas (2001) carried out an initial search with a
generic information retrieval engine in order to select a small number of document passages
to be further processed. Once this was done:
• Candidate passages and the question were split into sentences, tokenized and given a “quasi-logical form”, i.e. a predicate argument representation that does not consider sense disambiguation and quantifier scoping
• Each identified entity in the quasi-logical form was added to a semantic net in order to provide a framework for integrating information from multiple sentences in the input
• Coreferences between entities in the question and entities in the candidate answer passages were established, and each sentence was given a score, indicating its likelihood as an answer, based on the similarity between question and sentence and whether it contained a possible answer entity
• The sentences were ranked and the top 5 output as answers
The system’s best scores were 0.220 (31% answers in top 5), 0.227 (32.2% answers in top 5),
0.351 (49% answers in top 5) and 0.363 (50.5% answers in top 5) for the strict and lenient 50
and 250 byte answer categories.
The system described by Elworthy (2001) was similar to the system described by Scott and
Gaizauskas (2001), using the same generic Information Retrieval engine to retrieve a set of
candidate documents (using the question as a query) and then:
• Splitting the retrieved documents into sentences which are ranked (ordered according to the number of terms from the question they contained) and the top ones fully analyzed linguistically and given a “logical form”
• Analysing the question in a similar manner and producing a logical form for it
• Applying a matching algorithm to select the answer from the sentences, using a set of “ad hoc” rules, which in some cases are admittedly “hacks” which “seemed to work”. The rules were used to look for relationships between nodes in the question logical form and in the sentence logical form, and to measure the proximity of the sentences to the most significant question terms
The system did not perform very well, and its best scores were 0.196, 0.203, 0.264 and 0.274
for the strict and lenient 50 and 250 byte answer categories.
A simple question-answering machine was presented by Cooper and Rueger (2001). They
divided the given documents into paragraphs and gave the paragraphs to a standard search
engine for initial retrieval, thus getting passage retrieval as opposed to document retrieval.
The selected passages were then processed as follows:
• The question was parsed and its focus determined (when, who, where, why, describe, define)
• Entities were then recognized by using lists of place names, proper names, etc.
• The document paragraphs were split, using a set of heuristics, into sentences
• Candidate sentences were chosen by matching them with a regular expression made up of the disjunction of the WordNet hyponyms of the question’s answer concept, with a few hand-crafted rules added
• A set of heuristics were used to evaluate the probability that a sentence contains the answer (counting word occurrences and location of punctuation)
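The hyponym-based filter can be sketched as follows, assuming NLTK and its WordNet corpus are available; the helper is illustrative and not the authors’ code.

    import re
    from nltk.corpus import wordnet as wn   # assumes the WordNet data has been downloaded

    def hyponym_regex(concept):
        # Collect lemma names of all transitive hyponyms of the concept's noun senses
        # and turn them into a single alternation pattern.
        names = set()
        for synset in wn.synsets(concept, pos=wn.NOUN):
            for hyponym in synset.closure(lambda s: s.hyponyms()):
                names.update(l.name().replace("_", " ") for l in hyponym.lemmas())
        if not names:
            return None
        alternation = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
        return re.compile(alternation, re.IGNORECASE)

    pattern = hyponym_regex("colour")
    sentence = "The flag is mostly crimson with a narrow azure band."
    if pattern:
        print(pattern.findall(sentence))    # e.g. ['crimson', 'azure']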
Results were not given although the authors claim their system “compared favorably” against
other participants.
The system presented by Woods et al. (2001) adopted a different approach from most other
participants, in that, instead of relying on a generic retrieval engine to retrieve an initial set of
documents to be analyzed, it indexed and searched the entire document collection for an
answer. The system proceeded as follows:
• At indexing time a taxonomy of the words and phrases in the document collection was built, based on morphological structure of words, syntactic structure of phrases and semantic relations between words in the system’s lexicon
• The question was analyzed to determine the answer type of the desired result (e.g. “when”
questions are of type “date”, “where” questions are of type “location”) and then processed
to strip out unnecessary punctuation (e.g. “US” instead of “U. S.”) and words (e.g. “name
of”, “take place”), replace inflected word forms with base forms, generalize certain words
and skip (if necessary) parenthetical comments
• Concepts that are subsumed by the terms in the given query are retrieved and potentially relevant passages in the document collection are located
• Answers are located in the retrieved passages by selecting the 250 byte window that
contains as many candidate answers as possible
The system only took part in the 250 byte track, where its best results were 0.345 (strict) and
0.354 (lenient) with 46.9% and 47.5% of answers in the top 5 hits. It did however suffer from
a bug which restricted the way it looked up the lexicon; having corrected the bug, the strict
run improved to 0.380, with 52.6% of answers in the top 5 hits.
3.4.3 TREC-9 QA: limitations
In common with TREC-8 there appeared to be problems with evaluation, with a number of
systems giving unreliable results due to trivial bugs or to the unreliability of the
“standard” information retrieval engine provided by NIST. An interesting question is posed
by Litkowski (2000), who, having noted that his system’s failures were heavily dependent on
the failures of the “standard” information retrieval engine provided, asks whether the QA
track is measuring retrieval performance or question-answering ability. All the systems
presented are in fact heavily dependent on a “standard” information retrieval engine that
filters the large number of documents provided and passes a (small) number of (probably)
relevant documents to a different module for further processing. As seen above, this is due to
the fact that the second step, which typically applies shallow natural language processing
techniques, is computationally intensive and could not realistically be applied to the whole set
of documents to be analyzed.
The main limitation of the systems presented was their restricted linguistic knowledge,
limited to the examination of wh-words and question structure to determine the type of entity
that a relevant answer would correspond to, and only occasionally extended to employing
simple logical forms to attempt to infer answers from the text. In particular, the semantic
knowledge of the systems was very poor, considering only relations such as synonymy and is-a relations. In order to improve performance, systems would have to have a much deeper
linguistic knowledge, including the knowledge of the complex semantic structures that
humans make use of to determine if a portion of text is relevant (and therefore may contain an
answer) to a question.
3.5 TREC-10 QA
As in the previous TREC QA tracks, TREC 2001 (TREC-10 – see Voorhees 2002) required
participant systems to find the answers to a set of 500 questions from a given set of
documents provided by NIST (derived from various newspaper articles in electronic format).
Unlike previous tracks, however, answer strings had to be less than 50 bytes long, thus
constraining systems to be more precise in the pinpointing of an answer.
The TREC-10 QA systems displayed an architecture not much different from the architecture
of the previous (TREC-8 and TREC-9) systems, relying on an information retrieval engine to
choose a subset of documents which are then analysed by some sort of Named Entity
recogniser to find a Named Entity corresponding to the question type (e.g. person for “who”
questions, place for “where” questions etc.). The best systems did not appear to be the ones
making use of any novel ideas, but rather the ones that managed to “tune” each component to
perform optimally.
The most interesting aspect of TREC-10 was the very good performance of the system
described in Soubbotin and Soubbotin (2002) which was heavily reliant on pattern matching
techniques derived from text summarisation. The summarisation system it was based on
selected a number of text fragments from a given set of source documents, arranging them
according to certain discourse patterns, thus automatically producing new texts (even
unrelated to the given documents): this was achieved through the use of “massive […]
heuristic indicators”, i.e. a very large number of (non-trivial) patterns derived through a
systematic analysis of source texts. Similar patterns were used to locate answers, given a
question type: so, for example, given the question “When was X born”, a pattern such as “X
(AAAA-YYYY)” was sought, where AAAA, representing a year, was deemed to be the
answer. Their QA system outperformed all other participants, answering 70.1% of the
questions correctly (lenient score).
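An illustrative reconstruction of this kind of surface pattern (a simplification, not the authors’ actual pattern set): for “When was X born”, a parenthesised life-span following the name is sought and its first year taken as the answer.

    import re

    def birth_year(name, text):
        # Look for "Name (YYYY-YYYY)" and return the first year as the birth date.
        pattern = re.escape(name) + r"\s*\((\d{4})\s*[-–]\s*\d{4}\)"
        match = re.search(pattern, text)
        return match.group(1) if match else None

    text = "Giuseppe Verdi (1813-1901) composed more than twenty operas."
    print(birth_year("Giuseppe Verdi", text))   # 1813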
The system presented by LCC (Harabagiu et al. 2002), which obtained the second-best score
(67.7% - lenient score) used, in addition to the “standard” components of a QA system (IR
engine, Named Entity Recognition), a form of abduction which attempted to match questions
with answers using lexical chains which found relations between concepts present in the
question and concepts present in the answer.
Other novel approaches included the use of an Internet search engine to gather possible
answers to questions from the World Wide Web before looking for the answer in the given
(standard) set of documents (Brill et al. 2002). While this approach did yield some positive
results (with a score of 60% correct answers - lenient score), it is doubtful whether this
approach could be widely used: the answers to the kind of generic questions used in the
TREC evaluation may well be found on the Internet, but it is doubtful that a production
system, which would probably be required to find answers in a very specific set of documents
(a company Intranet, or a collection of legal documents, for example) would be able to first
find the answers on the Internet.
In addition to the main task, there was also a context question task, where systems had to
return answers to a set of related and sequential questions about the same context, and a list
task, which required systems to return answers containing more than one occurrence of an
entity (e.g. answers to the question “What are 9 novels written by John Updike?”). These new
sub-tasks however did not produce any new insights into the problem of question answering,
as the same systems used for the main task were used for these sub-tasks, without any major
modification.
3.6 TREC-11 QA
The TREC 2002 QA task differed from previous years in that this time a definite answer (in
form of a single noun or noun phrase) was required as an answer to the given questions, as
opposed to a full sentence or sentence snippet containing the answer.
The top system was produced by the LCC (Language Computer Corporation) team,
(Moldovan et al. 2003). The LCC system performed exceptionally well, with 83% of correct
answers, a result which was far better than any other of the systems which were evaluated; the
next ranking scores were 58%, National University of Singapore (Yang and Chua 2003); and
54%, InsightSoft (Soubbotin and Soubbotin 2003). Significantly, only 6 of the 67 runs
presented managed to correctly answer over 31% of the questions, which makes the 83%
obtained by the “top” system all the more remarkable.
Again, most of the participating systems made use of the “standard” question answering
architecture, which consists of retrieving a subset of documents, ranking paragraphs or
sentences, and looking within these for a named entity appropriate for the question type. Given
that this year a sentence containing the answer was no longer acceptable as an answer, and the
answer had to be a noun, noun compound or noun phrase containing the answer and nothing
but the answer, systems were heavily reliant on accurate Named Entity Recognition.
It is worth examining the LCC system in detail however, as it has a number of elements which
are different from other systems, in particular:
• a customised parser
• a customised form of logical form representation
• the use of lexical chains which enhanced the logical form representation
• the use of a customised logical prover
The parser was a probabilistic parser improved over a number of years and trained especially
for the Question Answering task. It identified simple noun phrases, verb phrases and particles
which may be significant. Only relevant passages of relevant documents are parsed. In
addition to this there is a coreference resolution component which determines if a noun phrase
refers to the same entity as another noun phrase (e.g. “President Bush” and “George Bush”)
and resolves definite and indefinite pronoun coreference. A light form of temporal
coreference was also implemented.
The logical form representation was devised to be as close as possible to English and
syntactically simple. It was designed to represent syntax-based relationships such as syntactic
subjects and objects, prepositional attachments, complex nominals and adjectival/adverbial
adjuncts. The logical form is derived using a small subset (10) of grammar rules used for the
parse tree of a sentence.
Lexical chains, which link semantically disambiguated words, are used in order 1) to increase
document retrieval recall and 2) to improve answer extraction by providing world knowledge
axioms that link question and answer words. Lexical chains use WordNet relations and
WordNet glosses to find links between concepts. A highly accurate parser specialised for
WordNet glosses was developed to transform the glosses into a logical form which could then
be used as an axiom (an axiom is a logical form expression of a synset and its gloss).
The logic prover was a version of Otter which was extended specifically for the Question
Answering task. It made use of the Logical Form representation of the question and the
possible answers together with a number of relevant axioms derived from the lexical chains
linking question and answer concepts: these axioms are then used to guide the proof. No
indication is given as to how problems such as computability are resolved, nor how well the
theorem prover performs.
It is important, however, to note that no exhaustive evaluation of the individual components is
given, nor any in-depth analysis of the relative performance or contribution of the single
components to the overall performance of the system.
3.7 Critical evaluation of the TREC QA track
Although the TREC QA track is a useful benchmark for testing QA systems, it presents a number of
problems, which mean results must be analyzed with extreme caution.
3.7.1 Mixing Information Retrieval and Question Answering
The most serious problem appears to be the fact that the QA track is simultaneously assessing
systems’ effectiveness in both Information Retrieval and Question Answering. Participants
were given a very large collection of documents and a number of questions to be answered
from these documents. However, the number of documents was so high that systems could
not attempt to analyze all the texts for an answer, but had to restrict themselves to a relatively
small subset. Participants had the choice of either using a standard set of 50 documents
retrieved by the organizers’ Information Retrieval system or using their own systems. This
would indicate that the organizers did not consider the information retrieval task crucial or of
particular interest. Nevertheless, systems that used custom-built information retrieval engines
performed consistently better than systems using the “standard” documents, thus indicating
that the retrieval task was, in fact, crucial to the successful solution of the question answering
problem.
Information retrieval from questions is certainly different from information retrieval from
query terms, as a question will give many more clues than a simple “bag of words” to the type
of document sought and therefore is a legitimate research area separate from more traditional
information retrieval. However the retrieval of relevant documents, given a question, is also a
separate task from the analysis of the documents to find an actual answer to the question, and
therefore it makes sense to separate the tasks. All participants in the TREC QA task designed
their systems in a modular fashion, distinguishing between the information retrieval and the
answer-seeking stages of the overall system, thus confirming the desirability of a
separation between these functions.
However, the track evaluation did not separate the tasks, making it very difficult to compare
the results, given that errors in the information retrieval module would propagate in the
question answering module. As an example, a system with 90% accuracy in spotting the
correct documents and 80% accuracy in finding the answer within a document would find the
correct answer 72% of the time on average, while a system with a 60% information retrieval
accuracy but a 90% question answering accuracy would find the correct answer only 54% of
the time: the only measure given was however the final score, thus making it very difficult to
judge which systems performed best at which task. In particular, there was no attempt in the
evaluation to distinguish between systems that had used the “standard” document set and
systems that had attempted their own information retrieval, thus confusing the issue of
whether systems performed better because of their information retrieval capabilities or
because of their question answering capabilities.
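The arithmetic above generalises straightforwardly: if errors in the two modules are independent, the end-to-end accuracy is the product of the per-module accuracies, which is why a combined score says little about either module in isolation.

    def end_to_end_accuracy(retrieval_accuracy, extraction_accuracy):
        # Assumes errors in the two modules are independent.
        return retrieval_accuracy * extraction_accuracy

    print(round(end_to_end_accuracy(0.9, 0.8), 2))   # 0.72
    print(round(end_to_end_accuracy(0.6, 0.9), 2))   # 0.54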
It would be desirable to compare systems’ performance in similar tasks (information retrieval
and answer extraction) and while some preliminary analysis (e.g. Litkowski 2001) does
indicate a marked difference between the overall performance of a system and the
performance of the individual modules, unfortunately the given results do not allow for an in-depth evaluation of this type.
3.7.2 Knowledge Base
A similar problem is the fact that systems are not given a “standard” knowledge base, which
again confuses the issue of what exactly is being measured. A number of systems used
WordNet, others a highly customised version of WordNet, others electronic dictionaries
which are not freely available; others still used the Internet as a huge encyclopaedia. It would
be interesting to be able to assess how much the different knowledge bases contributed to the
overall performance, but the type of knowledge base used is ignored in the evaluation making
comparisons very difficult. On the other hand the evaluation, being concerned with building
QA systems, and not with building knowledge bases, should perhaps provide some “standard”
knowledge base which would allow systems to be fairly evaluated on their ability to answer
questions and not on the strength of the knowledge bases used.
3.7.3 Exact Answers
TREC-11 sought to improve on the previous conferences by moving away from the
requirement of returning five text snippets, one of which should have been a string containing
an answer, and introducing instead the requirement for a unique exact answer. Nevertheless there are
still a number of problems with evaluation. One problem has been the definition of “exact”
answer, with assessors judging various “exact” answers in different ways: for example, given
the question “What river is called Big Muddy?” there was considerable debate whether
“Mississippi” was the only “exact” answer or whether “Mississippi River”, “Mississip”[sic]
were also acceptable answers. At the same time the TREC evaluation framework has ignored
the fact that, rather than having no answer at all, users may prefer partial answers or
answers which are somehow related to the question without directly giving an answer.
3.7.4 Concept-filling Questions
There has been no attempt to answer more complex, non-Named Entity questions (for
example “Why” or definition questions). Logical inference (as used for example in Moldovan
et al. 2003) can only work where we are looking for single concepts and it is difficult to say
how easily it could be adapted to more complex questions. Similarly, pattern matching
(e.g. Soubbotin 2003) will only work for concept-filling questions and it is difficult to see
how this could be applied to questions asking, for example, for the reasons behind a certain event, which
would require gathering and processing knowledge from a number of different sources.
3.7.5 Multiple answers
While TREC-11 specifies that systems may only return one answer, this may not be what a
user wants, as it may well be possible that a document collection contains multiple, possibly
contradictory answers, all of which could be of interest. In the case of definition questions, for
example, a number of answers could be considered correct: given the question
“What is paracetamol?”, both “an analgesic” and “a pain reliever” would be acceptable
answers: a good system should really be able to identify and present all acceptable answers,
limiting output only if required to do so.
On the other hand there may be some partial answers that, while not giving the user all the
information required, provide some useful information: a good system should be able to
provide this extra information where an answer has not been identified, but possibly also where an
answer has been found, as background information that could prove useful to a user.
3.7.6 User model
Another limitation of the TREC evaluation framework is the absence of any notion of user
model. Without an explicit user model it becomes impossible to correctly evaluate the quality
of answers, as the correctness, incorrectness or usefulness of an answer are to a large extent
subjective judgements dependent on the questioner. Given a question such as “Who was
Kierkegaard?”, for example, a very young child might be content with the answer “A famous
man who wrote books about the meaning of life”, but would probably be puzzled by the meaning
of the answer “A philosopher”, while a knowledgeable adult would probably prefer an answer
such as “A Danish philosopher and theologian”.
3.7.7 Answerer model
Another problem is the absence of any answerer model. Depending on whether the person
giving an answer is a social worker, a philosopher or a politician, we will get different
answers to the question “What can we do about unemployment?”, reflecting their
individual agendas and prejudices. As will be shown below (see the section on Gadamer and
the dialectic of question and answering) it is not possible to shed all prejudice when
answering a question, and it therefore becomes necessary to state explicitly (so far as
possible) what the prejudices are when answering a question. A complete evaluation should
explicitly state the answerer model as well as the questioner model.
3.7.8 Lack of a clear theoretical foundation
In summary, the limitations seen in the TREC evaluation framework are due to the lack of any
clear theoretical foundation to the problem of question answering, a foundation which should
clearly set out what question answering is trying to achieve and the conditions (e.g. questioner
model, answerer model, knowledge base) which constrain how QA systems can and should
perform. TREC-inspired research has been for the most part concerned with the how of
question answering while ignoring the what. Drawing any conclusions from the TREC
evaluation is therefore difficult given that it is not at all clear exactly what problem is being
addressed. It thus becomes important to clarify the problem of question answering,
providing a clear theoretical foundation which can provide an unambiguous framework for
work in this area.
3.8 Other research: future directions in QA
While TREC has been the main point of reference for researchers in QA, other research
programmes (e.g. AQUAINT; see AQUAINT 2003) have sought to push the boundaries of
knowledge in this area even further. Research has been carried out, for example, on setting the
foundations for temporal question answering, i.e. making sense of temporal relationships
between portions of text in order to give meaningful answers (see Pustejovsky et al. 2003;
Schilder and Habel 2003); answering questions about videos (Katz et al. 2003); recognising
“opinion” answers (Wiebe et al. 2003; Cardie et al. 2003).
Hirschman and Gaizauskas (2001), following on from Breck et al. (2000), provide a set of
criteria for answer evaluation which would enable researchers to move beyond the TREC
evaluation framework. According to this view an answer given by a QA system should have
the following properties:
• Relevance: an answer should be a response to the question.
• Correctness: an answer should be factually correct.
• Conciseness: an answer should not contain extraneous or irrelevant information.
• Completeness: a partial answer should not get full credit.
• Coherence: an answer should be phrased in such a manner that a questioner can read it easily.
• Justification: an answer should allow a reader to determine why it was chosen as an answer to the question.
There is however very little discussion of the specific meaning of these properties and no in-depth justification for including these properties and not others: the term
“relevance” is highly ambiguous, as will be shown in Chapter 4; the notion of “factually
correct” is highly contentious from a philosophical point of view; there may be cases when
questioners would want “extraneous” information in addition to a precise answer; in some
cases an answerer may not want the questioner to easily understand an answer (for example,
when a teacher wants to give a challenge to a pupil). Moreover there is no concrete proposal
as to how these points could be implemented in actual QA systems.
In common with TREC, these research programmes lack a solid theoretical underpinning
setting out exactly what is being sought through question answering and clearly defining
concepts such as “answer”, and setting out questioner and answerer models explicitly. Again
it becomes important to clarify the problem of question answering, providing a clear
theoretical framework for work in this area.
4 Related work in other fields
We saw above how cognitive science informed some of the earliest research in question
answering; nevertheless, research in this direction produced little beyond “toy” systems which
worked in very limited domains. Subsequent research in QA systems, in particular the type of
research carried out within the TREC evaluation framework has largely ignored theoretical
problems and has concentrated almost exclusively on practical issues. On the other hand, the
theoretical concepts underpinning question answering have been examined independently in
research in fields such as linguistics, philosophy and mathematical logic. We shall now
examine how researchers in these areas have tackled these problems.
5 Linguistics
Questions, answers, and the relationship between questions and answers have all been
examined by linguists, typically investigating very narrow problems in a very limited setting.
Hagstrom (1998) for example examines the syntactical problems of questions in Japanese,
Sinhala and Shuri Okinawan; Lambrecht and Michaelis (1998) examine accent in
informational questions; Ginzburg (1998) examines ellipsis in dialogue questions; Romero
and Han (2001) examine yes/no questions. One of the most far-reaching works is by Ginzburg
and Sag (2001) who attempt a comprehensive look at questions in English, but neglect the
relationship between questions and answers and only form a fragmented theory. Given the
fragmented and specialised nature of research in questions and answers in linguistics and the
complexity and limited applicability of the formalisms proposed, it is difficult however to see
how this research could be applied in practice beyond giving some general ideas and pointers.
6 Psychology and Cognitive Science
As seen above cognitive psychology had a significant influence on research in question
answering, with a number of authors attempting to mimic human cognitive processes through
computer simulations. The psychology of questions and question answering has been
examined from a multitude of perspectives, for example by collecting empirical evidence on how
questions come to mind during text comprehension and on the differences between children’s and
adults’ questions (see Graesser and Black 1985). Graesser (1990) proposed a model of human
question answering (QUEST) which simulates the psychological processes and answers of
adults when they answer questions. According to this model the following components make
up the process of answering a question:
• Interpreting the Question, which consists of parsing the question into logical form and identifying the appropriate question category
• Use of Information Sources such as episodic knowledge structures, generic knowledge structures and knowledge represented as conceptual graph structures
• Convergence Mechanisms such as the intersection of nodes from different sources, arc search procedures, structural distance gradient and constraint satisfaction
• Pragmatics: taking into account the goals of questioner and answerer, the common ground between questioner and answerer and the informativity of an answer
While containing some very interesting ideas, this model (like all psychological models) seeks
to give an account of human behaviour, and does not aim to show what the conditions of
answerhood are: the way humans behave may or may not be the best way to solve a problem
and it may well be the case that humans are incapable of solving certain problems in a
satisfactory way.
7 Pedagogy
Correctly asking and answering questions is a fundamental part of learning and there has been
considerable research into the process of question asking and question answering from a
pedagogical point of view. The earliest work in this direction was the Platonic investigation of
dialectics as a method for searching for the truth; more recent investigations have covered
topics such as how text comprehension is influenced by questions regarding the text
(Reynolds and Anderson 1982; Olson et al. 1985), the type of questions produced by teachers
and students in a classroom setting (Singer and Donlan 1982; Bean 1985), and how students
answer questions in an examination setting (Pollitt and Ahmed 1999).
8 Philosophy
8.1 Hermeneutics and Question Answering
The problem of answering a question given a text is neither new nor simple and has in fact
been the subject of intense philosophical debate at least since the heyday of Classical Greece
(a survey of Hermeneutics can be found in Gadamer 1960 and Martini 1991). The field of
hermeneutics aims to examine the problems associated with trying to answer questions
relating to a written source, be it a literary, religious, juridical or other document. The word
itself has been associated with the Greek god Hermes (the counterpart of the Egyptian Thoth),
the messenger who brought news from the Immortals, usually in response to a question
(prayer, oracle) posed by humans. Although etymologically incorrect (in reality its origin
derives from a word indicating the effectiveness of linguistic expressions), this association is
a good metaphor for the general problem of hermeneutics: given a document and a question,
the study of hermeneutics tries to justify the way in which answers are derived from the
document, or, in other words, it explains how a reader can act as a messenger, relating the
meaning of the text to the given question.
8.2 Hermeneutics: from Plato to Schleiermacher
The hermeneutike techne, or art of interpreting written texts, was initially examined (without
any pretence of, or interest in, completeness) both by Plato, in works such as the Symposium
and the Epinomis, and by Aristotle in his work On Interpretation. The problem gained
importance, however, in mediaeval philosophical and theological thought with the interest in
the interpretation of the meaning of the Bible, culminating in the formalization of the doctrine
of the “four senses” (literal, allegorical, moral, spiritual) that any text was deemed to contain
in response to a question. Hermeneutics did not, however, rise to the status of an autonomous
field of study until the work of the eighteenth-century German philosopher Friedrich
Schleiermacher, who can be considered the precursor of twentieth-century hermeneutic thought.
Schleiermacher introduced a number of principles which were not intended to be applied just
to Biblical texts but to every text, and which have since been at the forefront of hermeneutical
thought. In particular Schleiermacher proposed:
•
The principle of understanding an author better than the selfsame author, i.e. the idea that
a reader should see in a text more than what the author saw and be able to find concepts
that the author was not aware of or was only partially aware of.
•
The principle of the hermeneutic process being a creative re-production of the past, i.e.
the idea that the reader is not simply a passive receptor, but actively interprets and gives
rather than receives meaning from a text.
•
The idea that any historical knowledge is interpretative in nature, i.e. the application of
the hermeneutical principles of textual interpretation to the wider field of Humanities
(Geisteswissenschaften).
•
The doctrine of the “linguisticity” of comprehension, or the principle that the only
presupposition in hermeneutics is language, i.e. the fact that when a reader interprets a
text, the reader must already know the linguistic rules that the text follows.
These ideas were analyzed in greater depth by Dilthey in the nineteenth century and, in the
following century, by Heidegger, who extended the concept of the interpretation of texts to the
idea of the interpretation of the whole of “being” (i.e. the comprehension of the meaning of
life is seen as a hermeneutical process). Heidegger’s philosophy is, however, more concerned
with deep ontological and metaphysical problems than with the problem of textual
interpretation. A more relevant analysis has been given by Hans-Georg Gadamer,
whose work brought hermeneutics to the centre of “continental” philosophical debate in the
late twentieth century.
8.3 Gadamer and the dialectic of Question and Answer
Gadamer’s major treatise, Truth and Method (Gadamer 1960), a far-reaching work which
expands the concept of hermeneutics to include deeper metaphysical questions reaching into
the fields of ethics, aesthetics and ontology, gives a very clear exposition of his ideas on
textual comprehension. The principal ideas which are of interest from the perspective of the
problem of textual understanding are:
•
The “hermeneutical circle”: the idea that a text cannot be understood “once and for all”,
but that comprehension requires the reader to continuously re-visit the document, with
each reading revealing more about the meaning of the document and enabling the reader
to start another, even more revealing, reading. In other words, a text is never fully
understood: comprehension is a never ending process through which previous
(necessarily partial) readings of the text enable the reader to understand the text better and
therefore have a less partial understanding of the text at the following reading.
•
The positive role of “prejudice”. The act of understanding a text is seen as something
fundamentally different from the “natural sciences”: while physicists, biologists and
chemists try to carry out their investigations of nature with a “clear”, unprejudiced mind
(Descartes’ tabula rasa), to understand a text prejudice is a necessity. To understand a text
it is necessary to have some sort of preformed judgement about the text itself: this
prejudice could be a prior understanding of the language in which the text is written (a
central element of the hermeneutical process), but could also be “traditions” about the
text, for example “authoritative” interpretations by other readers.
•
The “fusion of horizons”: comprehension is reached when the reader’s prejudices meet
the text. In other words, a reader’s questions find an answer when the prejudices which
form the basis of the questions form a unified whole with the text and find something in
common with the text.
•
The dialectic of question and answer. The hermeneutical process is seen as a dialogue
(dialectic) between the reader (who is asking a question of the text) and the text (which is
answering the reader’s question). Understanding a text means considering the text as an
answer to a question that the author had in mind when writing and therefore finding the
original question that the author of the text was trying to answer. But a reader wants to
understand a text because the reader has a question that needs answering. So
understanding a text means trying to find a relationship between the question that the
author was trying to answer and the question the reader is seeking to answer. On the other
hand, a reader has a question because the text itself moves the reader to ask certain
questions: a book on geography moves a reader to ask questions about geography, not
psychology. So the reader’s questions depend in a way on the questions that the author
had asked.
8.4 Practical application of hermeneutics
The central problem of hermeneutics can be seen as seeking a relevant answer to a reader’s
question, given a text. An application to automated Question Answering systems would
include an examination of the following:
•
The active role of the reader in understanding (and therefore answering), the idea of the
positive role of prejudice and the “linguisticity” of comprehension: a Question Answering
system must have a certain amount of prior knowledge in order to carry out its task
effectively. This prior knowledge must be linguistic knowledge, but also general
(authoritative) ideas about the content of text and the way in which it should be
interpreted: a system which looked for answers to questions in a collection of newspapers,
for example could take letters to the editor as being far less reliable than court circulars; a
system examining an English history book could assume that “King John” refers to the
King of England and is not a translation of the name of a Spanish monarch. This type of
“prejudice” would give a system a better idea of what could be considered relevant to a
question and what could not.
•
The “hermeneutical circle”: comprehension is a circular activity, a continuous (and
possibly infinite) process of refinement. Question Answering systems should therefore
strive to improve their given answer by revisiting the textual evidence in the light of the
best answer so far.
•
The dialectic of question and answer: Question Answering systems should be aware that
the nature of a text limits the type of question that can be asked of it: not all questions are
relevant in a given context. Seeking an answer is therefore finding what constitutes a
relevant question to a given text.
8.5 Eco and semiotics: a critique of Derrida
One of the most fruitful examinations of the philosophical foundations of interpretation (and
consequently of question answering as interpretation of questions, answers and text which
answers a question) has been given within the area of semiotics (the study of signs), the major
exponent of which is Umberto Eco. I will examine his more recent thoughts, as given, for
example, in Eco (1990), rather than other, more controversial works, such as Eco (1962),
which, although interesting from a philosophical perspective, have little relevance to the
problem at hand.
The starting point of Eco’s argument can be seen as a critique of the radical point of view of
the “drift” of meaning expressed by Derrida (see for example Derrida 1967), whereby
(simplifying the argument) a text does not have any fixed meaning and therefore may respond
to a question with any answer that the reader pleases (a parallel may be drawn here with the
point of view of “classical” formal logic, whereby given that most “real” texts contain
contradictions one may infer quodlibet, anything whatsoever): in other words, from the point
of view of question answering, Derrida claims that any answer whatsoever can be taken to be
a meaningful answer to a question. This point of view is likened to the approach to textual
interpretation of the Hermetic philosophical tradition which culminated with the alchemists of
the Renaissance: the “Hermetic semiosis” of this tradition was characterized by an infinite
search for sense which never reached, and, more importantly, never wanted to reach, a
definite answer to a question posed to a text (the typical question being, in this case, “How is
the philosopher’s stone produced?”).
8.6 Eco: inference as abduction
To this, Eco counters the idea of “limits of interpretation”, constraints forced upon the reader
when examining a text, whereby a text may only give a limited number of answers to a
question asked about it. According to this point of view readers adhere to a number of
principles in order to make sense of what they are reading. These include:
•
The idea of textual “economy”, whereby simple explanations of the meaning of a text
prevail over more complicated ones
•
Considering texts as “small worlds” which have a certain degree of coherence and
completeness but must be integrated with knowledge derived from our experience of the
wider, “real” world.
•
Continuously making assumptions about missing information in order to understand the
meaning of the text.
One of the most interesting points, however, is the way in which readers manage to apply
these principles in order to make inferences from the text. Inferences are not drawn using
deduction (from first principles to conclusions) but through abduction, a process which is
likened to the work of an investigator, proceeding from the evidence at hand to a hypothesis
which explains the evidence. Abductions are classified as:
• Hypercodified abductions, or hypotheses, an automatic type of abduction which does not
require a great deal of thought.
• Hypocodified abductions, where the rule must be selected from a set of equally probable
rules given by our current knowledge of the world.
• Creative abductions, which require a rule to be invented “from scratch”.
• Meta-abductions, which consist of deciding whether the possible universe given by the
previous types of abduction is the same universe as the universe of our experience.
Using abductions, therefore, a reader will infer a rule which gives a particular meaning to a
text and therefore answers the question asked of the text.
8.7 Practical application of semiotics
The central idea expressed by Eco is the thesis that, when examining a text to find an answer
to a question, “any answer” is not an answer. Only a limited number of answers may be
considered relevant, and there are a number of principles which may be applied to find these
relevant answers. A number of these principles are applicable to Question Answering
systems, in particular:
•
The “economy” criterion, whereby the simplest answer is deemed to be the best
•
The “small world” hypothesis, whereby if a text lacks coherence or completeness an
“outside” explanation should be sought, preferably through knowledge of the world, but if
necessary by making assumptions.
•
The idea of inference through abduction, whereby a system would not attempt to “prove”
an hypothesis (answer) by deduction, but would creatively “seek” an hypothesis (answer)
through abduction
8.8 Relevance Theory and Grice
As seen in the examination of Gadamer and Eco, an answer does not follow a question by
chance: questions and answers are utterances which stand in a definite relationship; in
particular, there are a number of “limits” or “prejudices” which define what is an acceptable
answer. Outside the field of hermeneutics, the relationship between utterances, and
specifically what makes us believe it is better to follow on from an utterance in a certain way
rather than in another, has been the subject of much debate from the point of view of the
philosophy of language and logic. Jean Piaget, the founder of genetic epistemology, introduced
the notion of relevance as a bridge which stands between meaning and implication: relevance
distinguishes genuine implications between antecedents and consequents (typical of
a strictly logical point of view) from a weaker form of relationship (see Piaget 1980 and
Piaget and Garcia 1991 for an overview). From the point of view of Anglo-Saxon philosophy,
the work of Grice on communication has been significant for the understanding of the concept
of relevance (see Grice 1957, 1961, 1967, 1989).
Grice argues that communication is achieved by producing and interpreting evidence: the
audience infers what the communicator intends to communicate based on the evidence
provided for this purpose. However this evidence alone is not usually sufficient to determine
the speaker’s intentions, due to problems such as semantic ambiguities and referential
ambivalences. The inference on the part of the audience must therefore be guided in some
manner; according to Grice guidance is given to the audience by the so-called co-operative
principle, which limits the number of hypotheses that must be considered according to a set of
maxims:
• Maxims of quantity:
  - Make your contribution as informative as required
  - Do not make your contribution more informative than is required
• Maxims of quality:
  - Do not say what you believe to be false
  - Do not say that for which you lack adequate evidence
• Maxim of relation:
  - Be relevant
• Maxims of manner:
  - Avoid obscurity of expression
  - Avoid ambiguity
  - Be brief
  - Be orderly
The ideas of Grice have been considerably influential in Anglo-Saxon philosophy; Sperber
and Wilson (1986, 2003), for example, build on the co-operative principle to develop a theory
of relevance which seeks to explain what makes some utterances more relevant than others as
responses, setting out the following principles:
• Relevance of an input (e.g. an utterance) to an individual
  - The greater the positive cognitive effects achieved by processing an utterance, the
    greater the relevance of the input to the individual at that time.
  - The greater the processing effort expended, the lower the relevance of the input to the
    individual at that time.
• Cognitive Principle of Relevance
  - Human cognition tends to be geared to the maximisation of relevance.
• Communicative Principle of Relevance
  - Utterances presume their own optimal relevance, where an utterance is optimally
    relevant if a) it is relevant enough to be worth the audience’s processing effort and b)
    it is the most relevant utterance compatible with the communicator’s abilities and
    preferences.
From this they develop a relevance-theoretic comprehension procedure, which sets out the
following rules that a hearer must follow, given an input (a purely illustrative sketch of such a
procedure is given after the list):
• Follow a path of least effort in computing cognitive effects; interpretive hypotheses
(disambiguations, reference resolutions, implicatures, etc.) should be tested in order of
accessibility.
• Stop when expectations of relevance are satisfied.
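This procedure is stated at the level of cognition rather than computation but, purely as an illustration of its shape, it can be caricatured as a loop over candidate interpretations ordered by accessibility. The sketch below is our own reading, not part of Sperber and Wilson's formulation; the names hypotheses, cognitive_effect and expected_relevance are illustrative placeholders.

```python
def interpret(hypotheses, cognitive_effect, expected_relevance):
    """Caricature of the relevance-theoretic comprehension procedure.

    hypotheses: candidate interpretations of an input, already ordered by
    accessibility (least processing effort first); cognitive_effect: an
    (assumed) function estimating the positive cognitive effects of accepting
    an interpretation; expected_relevance: the hearer's expectation of relevance.
    """
    for hypothesis in hypotheses:                  # follow the path of least effort
        if cognitive_effect(hypothesis) >= expected_relevance:
            return hypothesis                      # expectations of relevance satisfied: stop
    return None                                    # no sufficiently relevant interpretation found
```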
Relevance theory has had a significant impact on Information Retrieval (for some of the
earliest definitions of relevance for Information Retrieval, see Bar-Hillel 1958, Goffman 1964
and Cooper 1971; for an overview, see Greisdorf 2000 and Borlund 2003), but has been
largely ignored in Question Answering, except insofar as QA systems make use of
Information Retrieval engines in order to find answers. While the work of Grice has received
some acknowledgement (see for example Harabagiu and Moldovan 1996), there has been no
attempt to fully implement his ideas and no discussion of the wider issue of the concept of
relevance. The ideas of Sperber and Wilson, on the other hand, appear to have had no impact
on practical research in fields which explicitly deal with responses to utterances, such as
Question Answering, perhaps because their emphasis on cognition puts their work close to the
type of research in cognitive psychology seen above, which worked for limited domains but
could not easily be extended to open domain Question Answering.
9 Logic
Question answering systems such as those of Schank and Dyer described above rely on the
ability to transform natural language sentences into some form of logic and then to use a
mechanical procedure to discover (or prove) an answer. A number of researchers have examined the
logical form of questions (e.g. Hamblin 1973) and questions as logical entities distinct from
statements (see for example Prior and Prior 1955, who introduced the term “erotetic logic” to
describe this field of study; Belnap and Steel 1976; and Harrah 1984 for a comprehensive
review of work in this area). The limitation of this work, however, is that it only considers a
limited part of the problem of question answering, neglecting usage and taking into account
only very simple examples of questions, usually closed-concept questions such as “Who did
Verdi write his Requiem for?”, while ignoring more complicated open-ended questions such as
“Why is Manzoni important in the history of the Italian language?”.
9.1 Entailment and relevance logic
As seen above, when answering a question humans seek an answer sentence which is
relevant given the previous utterance. The idea of relevance has been examined by a number
of researchers in the field of logic, independently of, and apparently unaware of, the work of
authors such as Grice.
In logic, relevance is the idea that in “if ... then” propositions the antecedent must be in some
way “relevant” to the consequent. In human logical thinking, the conditional between two
propositions, as in “if X then Y”, depends not only on the truth of X and Y, but also on a
causal relationship between X and Y (Cheng 1991). This is in opposition to the mathematical
treatment of “if ... then” propositions, where the relevance of antecedent to consequent plays
no part whatsoever and in fact propositions such as (¬X ∧ X) → Y are considered “true”, even
though natural human logical thinking would immediately reject them. The idea of relevance
in logic, although central in classical philosophers from Aristotle until the beginning of the
nineteenth century, was abandoned by the more mathematically inclined tradition of Frege,
Whitehead and Russell (Anderson and Belnap 1975). “Relevance logic”, brought to the
attention of mathematicians by Ackermann (1956), tries to give a mathematically satisfactory
way of grasping the idea of “relevance” in implication.
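The “paradox” alluded to here is easy to check mechanically: under the classical truth-table reading of implication, (¬X ∧ X) → Y comes out true under every assignment, however unrelated Y is to X. The following lines are a simple illustrative check, not part of any relevance-logic system:

```python
from itertools import product

# Classical material implication: A -> B is defined as (not A) or B.
def implies(a, b):
    return (not a) or b

# (not X and X) -> Y holds for every assignment of X and Y, even though
# Y has no connection whatsoever with X: the "implicational paradox".
assert all(implies((not x) and x, y) for x, y in product([True, False], repeat=2))
print("(not X and X) -> Y is a classical tautology")
```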
9.2 Relevance logic, entailment logic and knowledge engineering
A logic for knowledge engineering, i.e. the representation of human thoughts, should,
according to Cheng (1991), have the following properties:
• Naturalness: there should be a natural correspondence between the semantics of the logic
and the logical structure of events in the real world
• Soundness and completeness: the logic must be based on a well-defined formal
foundation
• Steadfastness: no proposition should be syntactically or semantically entailed by a
contradiction. This means that the logic must not have (¬X ∧ X) → Y as a theorem.
• Nonmonotonicity: a set of propositions A may not syntactically or semantically entail X
even if X is syntactically or semantically entailed by a subset of A. In other words, the
conclusion of an argument can only be reasoned from those premises that are relevant to
the conclusion
• Expressiveness: the logic must be able to represent different kinds of domain-specific
knowledge
Cheng also adds the following requirement, specifically aimed at the implementation of
entailment logics:
• Tractability: reasoning based on the logic must be able to be carried out efficiently by
computers
Entailment logics try to satisfy all these conditions, in particular by trying to avoid the “fallacies
of relevance” and “implicational paradoxes” of mathematical logic (Anderson and Belnap
1975; Dunn 1986) such as A → (B → A) and A → (B → B). Entailment logics therefore
require that logical theorems should be valid forms of reasoning in human logical thinking,
and therefore reject the notion that the implication X → Y is simply an abbreviation for (¬X ∨ Y) and
that every logical theorem is a tautology according to its truth-value semantics.
A number of logic systems that attempt to avoid the “implicational paradoxes” of
mathematical logic have been devised. One of the earliest mathematicians to argue that in
order for the sentence A → B to be true there should be “some connection in meaning between
A and B” was Nelson (1933); a number of relevance logic systems have been developed since
then, notably the “system R of Relevant Implication”, first formulated (independently) by
Moh (1950) and Church (1951); the “system PI′ of rigorous implication” presented in Ackermann
(1956); the “system E of entailment” in Anderson and Belnap (1975); the logic “E(fde) of
tautological entailments” of Dunn (1976), based on a four-valued denotational semantics
developed independently by Belnap (1975); the “systems Cm, Cn and Cnd of entailment” in
Lin (1987); and the “system C of conditional” described by Cheng (1991).
9.3 Application of relevance logic for Information Retrieval
While relevance logic has not been taken into consideration for question answering, it has
been examined for the related task of information retrieval, i.e. the problem of finding a set of
relevant documents in response to a user query.
Meghini and Straccia (1996) formalise the information retrieval task with a logical model
called the terminological model, which must be augmented in order to address the following
problems:
• Capturing of relevance: in order for information retrieval to be successful, there must be a
tight connection in meaning between the query and the documents retrieved in response to
the query. Relevance logic helps in this by substituting the two-valued semantics of
“classical” logic with a four-valued semantics where the truth values are the powerset of
{t,f}: {t}, indicating that the system has evidence to the effect that a proposition is true;
{f}, indicating that the system believes that the proposition is false; {}, corresponding to
lack of knowledge; {t,f}, indicating inconsistent knowledge. This allows for only a
“mild” form of modus ponens, avoiding the “paradoxes” of logical implication: in
particular, contradictory premises are not allowed to entail anything (i.e. (A ∧ ¬A) → B is
not allowed), which is important in information retrieval as it is unrealistic to assume that
the content representations of all the documents will be consistent; and a tautology is not
allowed to be implied by anything (i.e. A → t is not allowed), which is important to ensure
that the premises will be relevant to the conclusion. (A minimal sketch of such a
four-valued semantics is given after this list.)
• Challenging the “open-world” assumption: “classical” logics interpret a set of logical
sentences as a description of a state of the world which may be partial, thus opening the
possibility that an incomplete set of documents may entail neither a sentence A nor its
negation ¬A. This could be extremely inconvenient when reasoning about documents, as
both what the document says and what it does not say must be explicitly set out in order
to obtain the desired behaviour. It would therefore seem useful for a logic for an
Information Retrieval system to adopt, in certain circumstances, a closed-world view of
the set of documents, assuming that the inability to establish a fact should be taken as
evidence to the contrary of that fact.
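As a minimal illustration of such a four-valued semantics (a sketch in the spirit of Belnap's four truth values, not the actual formalisation used by Meghini and Straccia), truth values can be modelled as subsets of {t, f}; negation and conjunction behave classically on {t} and {f}, while contradictory evidence simply yields the value {t, f} instead of licensing arbitrary conclusions:

```python
# Truth values as subsets of {t, f}: evidence for, evidence against, neither, or both.
T, F, NONE, BOTH = frozenset("t"), frozenset("f"), frozenset(), frozenset("tf")

def neg(a):
    # Negation swaps evidence for truth and evidence for falsity.
    return frozenset({"t": "f", "f": "t"}[x] for x in a)

def conj(a, b):
    # A conjunction is evidenced true only if both conjuncts are,
    # and evidenced false if either conjunct is.
    value = set()
    if "t" in a and "t" in b:
        value.add("t")
    if "f" in a or "f" in b:
        value.add("f")
    return frozenset(value)

# Inconsistent evidence about A gives A the value BOTH; the contradiction
# A AND NOT A is then simply recorded as BOTH rather than being "true",
# so it cannot serve as a premise for detaching arbitrary conclusions.
A = BOTH
assert conj(A, neg(A)) == BOTH
assert conj(A, neg(A)) != T
```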
An implementation of this model, using relevance logic for information retrieval (Meghini
1996) has been shown to be decidable, but EXPTIME-hard. It is therefore doubtful that
current implementations of relevance logic could be successfully applied to a more complex
problem such as question answering. More importantly, however, there is no indication that
relevance logic would be able to represent the complex linguistic structures that make up
language and that enable humans to make relevance judgments about texts.
10 Conclusion
Question Answering Systems try to solve the problem of determining an answer to a question
by searching for a response in a collection of documents. While research in this area spans
almost three decades (see Winograd 1972 and Dyer 1983 for early systems; Moldovan et al.
2003 for the most successful recent example), progress has been slow, and even the more
successful recent systems are very limited (Moldovan et al. 2003 for example find a correct
answer for just over 83% of the questions examined; the questions were however limited to
closed-type questions requiring a single concept as an answer and avoided the more
complicated queries asking “why” or “how”). Moreover, research in Question Answering
systems has been characterised by a lack of theoretical underpinnings, and a consequent
confusion regarding the aim of such systems: while there were initial attempts to
characterise the problem in more detail, either by developing generic question answering
algorithms (Simmons 1973) or by proposing a generic framework within which to work (e.g.
Lehnert 1978 and 1986, working within the conceptual dependency model developed in
cognitive science), there has been little in this direction since, with work such as Graesser and
Franklin (1990) being focused on developing a cognitive model of human question answering
(and trying to partially implement this model) rather than providing a model for the problem
of automated question answering systems. Thus, it is often unclear what Question Answering
systems are trying to achieve: Berger (2000), for example, provides a good example of answer
finding, i.e. the problem of searching for a response to a question in a collection of ready-made
answers, as opposed to searching in a collection of generic documents; Brill et al. (2002)
offer a good example of lack of clarity about whether an answer should be sought within a
collection of documents or within a knowledge base; Prager et al. (2002) highlight the
difficulties in providing answers to questions without first defining the concept of user model.
At the same time, linguists and philosophers have been examining the problem of determining
the general nature of questions and answers (e.g. Gadamer 1960; Hiz 1978; Eco 1990;
Ginzburg 1995a and 1995b) and what determines a “relevant” answer to a question (Grice
1967; Brown and Yule 1983; Wilson and Sperber 1986; Sperber and Wilson 1995).
Nevertheless there is little, if any, evidence of interaction between the theoretical and
practical strands of question answering research.
A new theory therefore needs to be developed specifically for the practical problem of
automated question answering, given that current systems, aiming at an ambitious application
of automated systems to an “open domain” with “open” questions, lack a solid theoretical
underpinning and research in this area is limited by the ambiguity of what is being evaluated
and the uncertainty of the direction research should take. The following chapters will address
this problem by examining the what of question answering, as opposed to most current
research, which is solely concerned with the how, presenting a clear theoretical foundation
which can provide an unambiguous framework for work in this area. This will give a clear
explanation of the problem question answering systems aim to solve and a theoretical
foundation for the evaluation of question answering systems. In order to do this we shall take
as a starting point Eco’s critique of the extreme point of view given by Derrida that any
answer is an acceptable answer to a question: we shall therefore seek to identify the limits or
prejudices (in a Gadamerian sense) which constrain what is considered a “good” answer to a
question. These limits will be set out by reinterpreting the notion of “good” answer as the
notion of “relevant” answer: our theory will consequently build on research carried out in
relevance theory to set out the conditions which define the answer to a question. From this we
will then be able to formulate a generic framework for automatic question answering which
will then be implemented.
Before introducing the new theoretical framework, however, we shall present YorkQA, a
typical TREC-style question answering system, which will be used as a baseline a) to show
the limitations of current systems and, once we have introduced the new framework, b) to
show the improvements given by the adoption of the proposed theory.
Chapter 3
The YorkQA TREC-Style QA System
Executive Summary
We present a novel Question Answering system, YorkQA, which was evaluated in the TREC-10 and
TREC-11 QA tracks. The system and its evaluations are described in detail and the limitations of the
system and the evaluation framework are discussed. The TREC-11 YorkQA system is used as the
starting point for the experiments set out in the following chapters.
1 The TREC Evaluation Framework
1.1 TREC-10
As seen above in Chapter 2, the TREC-10 QA track aimed at providing a standard evaluation
framework for QA systems. NIST provided:
•
a set of documents (electronic versions of US and UK newspapers) which were to be
used to find answers
•
a set of 500 questions which were provided at the start of the evaluation period
Systems were then given one week to process the questions and find appropriate answers in
the document collection. Each system could return 5 ranked answers for each question, each
answer consisting of a portion of text no more than 50 bytes long. The answers were then
examined by human assessors at NIST and systems were provided feedback consisting of:
• a percentage of “correct” answers, given under both “lenient” scoring, under which an
answer counted as correct even if it was unsupported by the document, i.e. the document
itself could not be used on its own to infer the answer (e.g. answering “Bach”, based
on the sentence “Bach composed a number of choral works”, in response to the
question “Who wrote the famous Toccata and Fugue in D minor?”: “Bach” is a
correct answer, but the sentence does not justify this, as it does not state that Bach
composed the Toccata and Fugue), and “strict” scoring, under which the document had
to justify the answer.
• a mean reciprocal rank (MRR), indicating the average, over all questions, of the
reciprocal of the rank at which a system placed a correct answer (a standard formulation
is given after this list).
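For completeness, the mean reciprocal rank can be stated in its usual form (the standard formulation, not a quotation from the TREC guidelines):

\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathit{rank}_i} \]

where Q is the set of evaluation questions and rank_i is the rank of the first correct answer returned for question i, with 1/rank_i taken to be 0 when no correct answer is returned among the five allowed.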
1.2 TREC-11
TREC-11 differed from TREC-10 in that systems could not return text snippets as answers,
but had to return “exact” answers: thus, given a question such as “Who composed Don
Giovanni?”, an answer such as “Mozart, who lived in Vienna” was not acceptable as it
contained irrelevant information: acceptable answers were instead “Mozart” or “W. A.
Mozart”; moreover, given a question such as “What was the writer Hugo’s first name?”,
“Victor Hugo” was considered incorrect, the only correct answer being “Victor”, as the
surname “Hugo” was not requested by the question. Again, a set of 500 questions was made
public and systems were given one week to produce a single exact answer for each question. A group of
human assessors at NIST evaluated the answers and systems were given a percentage score
indicating the number of correct answers.
2 Objective of the experiments
Building a fully-fledged QA system with a performance similar to the best TREC systems
would have required considerable time and manpower, tailoring the various components
specifically for the task at hand (for example ensuring that the tagger and Named Entity
recogniser could cope with the type of text given in the TREC document collection). Our
objective was not to build a fully optimised system, but to provide a system which we could
use to understand the limitations of current technology, and which could then be used as a
baseline for subsequent experiments to test new ideas.
3 YorkQA at TREC-10
The YorkQA system was designed for the TREC-10 QA track evaluation (see Alfonseca et al.
2002) [1]. This was our first entry at TREC and the system we presented was, due to time
constraints, an incomplete prototype. Our main aims were to verify the usefulness of syntactic
analysis for QA and to experiment with different semantic similarity metrics in view of a
more complete and fully integrated future system. To this end we made use of a part-of-speech
tagger and NP chunker in conjunction with entity recognition and semantic similarity
metrics. Unfortunately, due to time constraints, no testing and no parameter tuning was carried
out prior to TREC, which meant both that a number of small bugs negatively influenced our
results and that the system did not achieve optimal performance. Nevertheless we obtained
reasonable results, the best score being 18.1% of questions answered correctly (with lenient
judgements).

[1] It should be noted that the author of this Thesis, M. De Boni, was responsible for the overall design
and implementation of the system as well as the individual QA components and the Information Retrieval
module; E. Alfonseca contributed the NP chunker and Part-of-Speech tagger; J.-L. Jara-Valentia
contributed the sentence splitter; J.-L. Jara-Valentia and S. Manandhar contributed the Named Entity
recogniser.
3.1 Design
3.1.1 Question-Answering algorithm
Our algorithm proceeded as follows (a schematic restatement of the pipeline is given after the list):
• Index the TREC-supplied documents using a standard Information Retrieval engine
• Read in the next question, and repeat the following until there are no more questions:
  1. Analyse each question to determine the question category, i.e. the type of entity
     the answer should contain
  2. Decide the query to send to the IR engine and send the query
  3. Pass the retrieved documents to the sentence splitter, tagger, chunker and named
     entity recogniser
  4. Analyse the sentences to determine if they contain the correct entity type
  5. Rank the sentences containing the correct entity type according to their semantic
     similarity to the question
  6. Look for the best entity in the top ranked sentences or, if an entity of the
     appropriate type is not found, an appropriate 50 byte window
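The following skeleton restates this pipeline in code form. It is purely illustrative: every name (classify_question, build_query, retrieve, split_and_annotate, contains_entity, similarity, extract_answer) is a placeholder for the components described in the rest of this chapter, not the actual YorkQA implementation, and the indexing of the document collection is assumed to have been done when the components were constructed.

```python
def answer_questions(questions, components):
    """Illustrative restatement of the YorkQA TREC-10 pipeline (not the actual code).

    `components` is assumed to expose the modules described in this chapter as
    callables: classify_question, build_query, retrieve, split_and_annotate,
    contains_entity, similarity and extract_answer (all placeholder names).
    """
    answers = {}
    for question in questions:
        q_type = components.classify_question(question)            # 1. expected answer entity type
        query = components.build_query(question)                   # 2. query for the IR engine
        documents = components.retrieve(query, limit=50)            #    (SMART, or the NIST Prise documents)
        sentences = components.split_and_annotate(documents)       # 3. split, tag, chunk, NER
        candidates = [s for s in sentences
                      if components.contains_entity(s, q_type)]    # 4. keep sentences with the right entity type
        candidates.sort(key=lambda s: components.similarity(question, s),
                        reverse=True)                              # 5. rank by semantic similarity
        answers[question] = [components.extract_answer(s, q_type)  # 6. best entity or 50-byte window
                             for s in candidates[:5]]
    return answers
```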
We envisaged making use of a parser to determine the answer by finding a match between the
question and the ranked sentences, but the fact that the TREC evaluation was carried out on
“real” documents with “real” questions meant that the parser, which had only been tested on toy
documents, could not cope with the TREC data.
What follows is a more detailed description of the main parts of our system.
[Figure: overview of the YorkQA TREC-10 architecture. The question is tagged and chunked and its category determined; a query is extracted and sent to the SMART IR engine (the NIST-supplied Prise IR documents were used in a second run); the relevant documents are tagged, chunked and passed through the Named Entity recogniser; sentences are ranked by similarity to the question; an answer (a Named Entity or a 50-byte window) is located and chosen; and the top 5 answers are formatted for output.]
3.1.2 Question Type Identification
The question analyser followed the generic design set out in the previous TREC QA tracks
(e.g. Harabagiu et al. 2001; Breck et al. 2001; Takaki 2001) and used pattern matching based
on wh-words and simple part-of-speech information combined with the use of semantic
information provided by WordNet (Miller 1995) to determine question types. However,
unlike some of the question analysers constructed for previous TREC QA tracks (e.g.
Harabagiu et al. 2001) our question type analyser only recognised a small number of different
types, in view of the fact that our Named Entity recogniser only recognised a small number of
entities and so a finer granularity in the recognition of question types could not have been
exploited. Consequently, question types were limited to time, place, currency, organisation,
length, quantity, reason, constitution, person, mode, recommendation, truth and
a fallback category of thing. A number of these, for example recommendation (“Should I do
X?”), reason (“Why did X occur?”) and truth (“Is X Y?”), proved unnecessary
for the track as no questions of those types were present.
The question type recogniser proceeded by first eliminating redundant pleasantries used as
initial words such as “Tell me”, “Please” etc. which could not be usefully employed. In other
words, it transformed questions containing these words as follows:
•
“Tell me who is the president of the US” becomes “Who is the president of the US?”
•
“Please let me know what is the capital of Luxemburg” becomes “What is the capital
of Luxemburg”
It then looked for a number of simple wh-patterns such as the following:
•
“Who” (recognised as looking for a person – the fact that it could be a company, as in
“Who produces the Tomb Raider games?”, was ignored in this first version) ,
•
“Where” (recognised as being a place),
•
“When” (recognised as being a time)
If a match was not found it then looked for more complicated patterns such as the following:
•
“What… made of?” (recognised as being constitution)
•
“How long … ?”, “How wide … ?” (recognised as length)
•
“How much … ?” (recognised as quantity)
The next step was to use WordNet hypernyms to find types: in the case of questions starting
with “What”, the first noun following the wh-word was looked up and all its hypernyms
examined to find a correspondence with a question type (a sketch of such a lookup is given
below). For example:
• “What American president visited China?” was recognised as a question of type
“person”, as the first noun after the wh-word is “president”, which is a hyponym of
“person”.
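A lookup of this kind can be reproduced with the WordNet interface in NLTK (assuming the WordNet data is installed); the anchor synsets below are our own illustrative choices rather than the actual YorkQA mapping:

```python
from nltk.corpus import wordnet as wn   # assumes the WordNet corpus is available (nltk.download("wordnet"))

# Illustrative anchor synsets for a few question types (not the actual YorkQA table).
TYPE_ANCHORS = {
    "person": wn.synset("person.n.01"),
    "place": wn.synset("location.n.01"),
    "time": wn.synset("time_period.n.01"),
}

def question_type_from_noun(noun):
    """Return a question type if any sense of `noun` has a type anchor among its hypernyms."""
    for sense in wn.synsets(noun, pos=wn.NOUN):
        hypernyms = set(sense.closure(lambda s: s.hypernyms())) | {sense}
        for q_type, anchor in TYPE_ANCHORS.items():
            if anchor in hypernyms:
                return q_type
    return "thing"   # fall-back category

print(question_type_from_noun("president"))   # -> person
# "satellite" may also come back as "person" (via a "follower"-type sense),
# the kind of misclassification discussed below.
```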
The use of WordNet was, however, of limited benefit as, due to the absence of a word sense
disambiguation module, it incorrectly classified a number of words: an example was
“satellite” (as in “Which satellite of Saturn…”), which was classified as a person (a satellite
also being a type of vassal). Moreover, WordNet did not contain useful hypernym
relationships for “place” words such as “lake”, “river”, “sea” etc. (as in “What is the longest
river in…”, which we would have liked to be automatically labelled as “place”, given that our
named entity recogniser labelled river names as places).
Some question types, such as definition (“What is aspirin?”) and explanation of a famous
person (“Who was Galilei?”), were ignored and given the generic label “thing”: the system
did not have the capability of finding answers of these types, as they could not be answered by
looking for a Named Entity and instead required some more complex form of pattern matching.
Finally, if the system could not recognise the question type at all it gave the question the generic
type “thing”.
The question recogniser was shown to have an accuracy of 83%. Of the 17% of questions which
were misclassified, 5% were due to problems with the sentence splitter, which incorrectly split
questions before passing them to the question type recogniser.
3.1.3 Information Retrieval
The initial steps of our algorithm were carried out by
a) using the SMART Information Retrieval system (for an introduction to SMART, see
Paijmans 1999) to index the documents and retrieve a set of at most 50 documents
using the question as the query; and
b) taking the first 50 documents supplied by the Prise Information Retrieval engine, as
provided by the TREC organisers.
Due to the tight time constraints of the NIST evaluation framework we were unable to tune
the SMART IR system for the task at hand and in fact the documents retrieved by the
SMART engine were much worse (very low in both precision and recall) than the documents
provided by NIST. Experiments in using a simple form of query expansion (using WordNet
hypo- and hypernyms) also failed to improve performance.
The YorkQA system then transformed the output of the retrieval engine into appropriate
XML documents and used a number of tools to process them linguistically, in particular a
tokeniser, sentence splitter, part-of-speech tagger, morphological analyser, entity recogniser,
and QP- and NP-chunker.
3.1.4 Tagging and chunking
The texts were processed using the TnT Part-of-Speech tagger (Brants, 2000) and a Noun
Phrase chunk parser based on Transformation Lists (Ramshaw and Marcus, 1995) to find
non-recursive noun phrases. To improve the quality of the training data, and hence the accuracy of
the chunker, we trained it on a copy of the Wall Street Journal corpus that had been manually
corrected by a human annotator. While performance during training was satisfactory
(around 95% accuracy), both modules gave poor results on the actual TREC data (questions
and answers), with accuracy below 80%.
The sentence splitter was based on the one given by Mikheev (2002; an initial version was
made available by the author in 1999). It processed dot-ending words by first deciding
whether or not they were abbreviations. Non-abbreviations ending with a dot always
indicated sentence endings, while abbreviations followed by a non-capitalised word were
never considered sentence ends. The second step addressed the difficult case, when an
abbreviation is followed by a capitalised word, where several heuristics were used to decide
whether a sentence ended there or not.
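The first two decisions can be expressed in a few lines. The sketch below is a simplification under our own assumptions: the abbreviation list and the final heuristic are invented placeholders, not Mikheev's actual rules.

```python
ABBREVIATIONS = {"mr.", "dr.", "prof.", "etc.", "u.s.", "inc."}   # illustrative list only

def ends_sentence(word, next_word):
    """Decide whether a dot-ending word marks a sentence boundary (simplified sketch)."""
    if not word.endswith("."):
        return False
    if word.lower() not in ABBREVIATIONS:
        return True                        # a non-abbreviation ending in a dot ends the sentence
    if next_word is None or not next_word[:1].isupper():
        return False                       # abbreviation followed by a lower-case word: no boundary
    # Difficult case: an abbreviation followed by a capitalised word.
    # Mikheev's splitter applies several heuristics here; as a stand-in we
    # simply ask whether the next word is a frequent sentence starter.
    return next_word in {"The", "A", "In", "It", "He", "She", "They"}
```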
The performance of these components on the actual TREC data was however rather poor,
especially for the questions, given that the training data used contained very few questions.
The actual documents also proved problematic as they not only contained a number of
“difficult” sentences, but also tables, captions and notes which were hard to process.
3.1.5 Named Entity Recognition
This was an important step in the processing of the text as the YorkQA system initially tried
to find sentences containing an appropriate entity that might answer a given question.
Nevertheless this module was, for this version, the weakest link in the processing pipeline.
Six types of entities were recognised (a small pattern-based sketch in the style of these
sub-modules is given after the list):
•
Currency expressions, such as “$10,000”, “2.5 million dollars”, etc. The evaluation
showed a very high accuracy for this sub-module (over 99%), but all the expressions
found were in dollars and it is not clear that this performance could be obtained for
other types of currencies. Its precision however should remain very high in any
stricter evaluation.
•
Locations, such as “Chicago”, “Thames”, “Mount Kilimanjaro”, etc. This sub-module
is programmed to recognise locations by context words, so that it recognises “New
York City” but not “New York”. This limitation is clearly reflected in its performance
as this module missed all location expressions (0% accuracy) in the sentences
selected for the manual evaluation.
•
Organisations, such as “Climbax Corp.”, “Nobel Prize Committee”, “The National
Library of Medicine”, etc. The approach used by this sub-module is to look for a clear
organisational word at the end of a sequence of capitalised words. Therefore, it
recognises “National Cancer Institute” but not “Massachusetts Institute of
Technology”. Consequently, we determined that only 35% of the organisations
mentioned in the evaluated text were correctly tagged and that most of the errors were
due to poor recall.
•
People names, such as “Reagan”, “Marilyn Monroe”, “G. Garcia Marquez”, etc.
Again precision is preferred to recall in this sub-module. Therefore, people's names
are marked when they are surrounded by clear context words such as personal titles
(“Mr.”, “Dr.”, etc.) and common pre-name positions (“President ...”, “Sen. ...”, etc.)
or both the forename(s) and the surname(s) are found in a gazetteer of common
names in several languages. The evaluation is consistent with this bias as most of the
69% of incorrect answers from this module were the consequence of poor recall.
• Quantity expressions, such as “twelve”, “11/2 billion”, etc. Because of the relative
regularity of this kind of expression, this sub-module achieved fairly good accuracy
(82%), and most errors were misinterpretations of the words “a” and “one”.
•
Time expressions, such as “June, 28 1993”, “Today”, “late 2000”, etc. The accuracy
of this sub-tool was around 50%, mainly because it missed many relative time
expressions (such as “the day after the great storm”) which were difficult to capture
by regular expressions only.
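As an indication of the pattern-based flavour of these sub-modules, the following fragment recognises simple currency expressions with a regular expression; it is a rough illustration written for this discussion, not the recogniser that was actually used:

```python
import re

# Matches expressions such as "$10,000", "$2.5 million" or "2.5 million dollars".
CURRENCY = re.compile(
    r"""(?: \$\s?\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion|trillion))?   # $10,000 / $2.5 million
          | \d[\d,]*(?:\.\d+)?\s(?:million|billion|trillion)?\s?dollars  # 2.5 million dollars
        )""",
    re.IGNORECASE | re.VERBOSE,
)

text = "The company paid $10,000 in fines and lost 2.5 million dollars in revenue."
print(CURRENCY.findall(text))   # ['$10,000', '2.5 million dollars']
```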
The system was limited both by the small range of entity types recognised and by inaccuracies
in recognising the entities themselves. In particular the “quantity” entity would have been better
subdivided into speed, weight, length, duration, etc.
3.1.6 Measuring Semantic Similarity
The central part of our system was the semantic similarity measuring module, which analysed
the tagged and chunked sentences in the documents selected by the Information Retrieval
engine, comparing them with the question in order to find the sentence which was most
similar to the question. Due to the time constraints imposed by the evaluation, the algorithm
described below was not entirely satisfactory and was subsequently reviewed and improved
considerably. What we had was therefore an initial sketch of a similarity algorithm; Chapter 6
will describe the redesigned and improved algorithm in much more detail.
To calculate similarity it was necessary to calculate the semantic (or conceptual) distance
between a sentence and a question. A number of algorithms have been presented to measure
semantic (or conceptual) distance (see for example Miller and Teibel 1991, Mihalcea et al.
1989). WordNet (Miller 1995) has been shown to be useful in this respect, as it explicitly
defines semantic relationships between words (see, for example, Mihalcea and Moldovan
1999; Harabagiu et al. 1999). Our system, however, differed from previous approaches in that
it did not limit itself to considering the WordNet relationships of synonymy and
hypernymy, but made use of all the semantic relationships available in WordNet (is_a, satellite,
similar, pertains, meronym, entails, etc.) as well as information provided by the noun phrase
chunker. The similarity algorithm took the question and answer sentences as bags of words
Q={q1, q2, …} and A={a1, a2, …} which ignored stop-words such as articles and prepositions
and took the base form of the words: so, for example, given the question
“Who was the first president of the US?”
and the sentence
“Before becoming president, Reagan was an actor who took part in a number of Westerns”
we would have
Q={first, president, US}
A={become, president, Reagan, actor, take, part, number, Western},
Similarity was then calculated as the sum of the distance between each word in A and each
word in Q, where the distance was given by the “closeness” between the two words according
to WordNet relations; words were not disambiguated and the relationships were therefore
calculated as the relationships between all the possible meanings of the words. The
“closeness” of words was determined in the following order:
1. identical words
2. synonyms
3. words which were in a hypernym or hyponym relation
4. meronyms and the other available relationships
In other words, relations which were intuitively closest were given a high score, while
relations which were intuitively weaker were given a lower score. Where there was no
immediate relation between two words, “chains” of meaning were constructed, up to a set
limit, in order to find a relationship: so, for example, “man” and “nurse” were found to be
related as both are hyponyms of “person”. In the example above,
•
“become” would be compared to each of the words in Q with no relationship found
•
“president” would be compared to each of the words in Q finding an identity
relationship between the word “president” in A and the word “president” in Q, thus
being given a high score
•
“Reagan” would be compared without finding any relationships
•
“actor” would be compared to each of the words in Q and would be found to be
related to “president” (both are hyponyms of “person”), thus being assigned a score
which is lower than the score for identical words
•
the other words would be compared in a similar manner
An overall “similarity” score would then be calculated as the sum of the given similarity
scores, with an increased score being given if a group of similar words were part of the same
noun phrase (as for example in the two noun phrases “the first president” and “the first Head
of State”).
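A simplified version of this scoring scheme can be written using the WordNet interface provided by NLTK. The weights below (1.0 for identical words, 0.8 for synonyms, 0.5 for a direct hypernym/hyponym link, 0.2 for a weaker connection) are arbitrary illustrative values, the stop-word list is a stand-in, and the noun-phrase bonus is omitted; the sketch also sums, for each question word, only its strongest match in the candidate sentence, a slight simplification of the all-pairs sum described above.

```python
from nltk.corpus import wordnet as wn

STOPWORDS = {"the", "a", "an", "of", "in", "was", "is", "who"}   # illustrative stop-word list

def bag(sentence):
    """Lower-cased bag of base-form content words (a rough approximation)."""
    words = [w.strip("?.,").lower() for w in sentence.split()]
    return {wn.morphy(w) or w for w in words if w and w not in STOPWORDS}

def closeness(w1, w2):
    """Score the strongest WordNet relation found between any senses of the two words."""
    if w1 == w2:
        return 1.0                                     # identical words
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if any(a == b for a in s1 for b in s2):
        return 0.8                                     # synonyms: the words share a synset
    for a in s1:
        if (set(a.hypernyms()) | set(a.hyponyms())) & set(s2):
            return 0.5                                 # direct hypernym/hyponym link
    if any((a.path_similarity(b) or 0) > 0.2 for a in s1 for b in s2):
        return 0.2                                     # weaker chain of relations
    return 0.0

def similarity(question, answer_sentence):
    q, a = bag(question), bag(answer_sentence)
    return sum(max((closeness(qw, aw) for aw in a), default=0.0) for qw in q)

print(similarity("Who was the first president of the US?",
                 "Before becoming president, Reagan was an actor who took part in a number of Westerns"))
```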
As already noted, due to the time constraints imposed by the evaluation, the algorithm was still a
prototype and was not optimised for best performance: the similarity algorithm was subsequently
redesigned and considerably improved, and will be discussed in detail in Chapter 6, where it is
shown to be the basis of semantic relevance, one of the relevance metrics which will be seen to be
fundamental to the understanding of the problem of question answering.
3.1.7 Answer identification
In the case of questions of type quantity, person, place, time and currency, an appropriate
Named Entity was sought in the retrieved sentences. The five sentences which were ranked
highest by the similarity algorithm and which also contained an appropriate Named Entity
were then taken to contain the answer and a 50 byte text window surrounding (and including)
the Named Entity was returned. There was a check to ensure that no other Named Entity of the
same type was contained in the text window (as in “Thatcher met Reagan”), in order to
conform to the evaluation guidelines, which judged sentences that did in effect contain the
answer to the question as wrong if there was ambiguity as to which Named Entity was the
answer.
In the case of questions of type reason, recommendation, mode and difference, heuristics
based on an analysis of the TREC-supplied documents, as well as of the questions given in the
TREC-9 QA track, were used to find an appropriate portion of text, as illustrated by the
following examples (a small pattern-matching sketch follows the list):
•
the part of sentence following the word “because” was taken to probably be a reason
for what preceded
•
the part of sentence preceding “in order to” was taken to be the reason for what
followed
•
the part of sentence following “is made of” was taken to probably indicate what
constituted the concept mentioned immediately before.
•
the part of sentence following “the difference between … and … is” was taken to be
the difference between the two previously mentioned concepts
•
the part of sentence following “one should” or “you should” or “it is advisable to”
was taken to be a recommendation
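Heuristics of this kind amount to surface pattern matching; the following sketch shows the flavour for the “reason” patterns (the actual patterns used in YorkQA were derived from the TREC data and are not reproduced here):

```python
import re

# Illustrative "reason" patterns: the text after "because", or before "in order to",
# is taken as a likely reason (the real patterns were tuned on the TREC data).
REASON_AFTER = re.compile(r"\bbecause\b\s+(.*)", re.IGNORECASE)
REASON_BEFORE = re.compile(r"(.*?)\s+\bin order to\b", re.IGNORECASE)

def candidate_reasons(sentence):
    candidates = []
    match = REASON_AFTER.search(sentence)
    if match:
        candidates.append(match.group(1))     # "... because [the reason]"
    match = REASON_BEFORE.search(sentence)
    if match:
        candidates.append(match.group(1))     # "[the reason] in order to ..."
    return candidates

print(candidate_reasons("The plant closed because demand for coal collapsed."))
# ['demand for coal collapsed.']
```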
The five sentences containing an appropriate set of keywords which were ranked highest by
the similarity algorithm were then taken to contain the answer and a 50 byte text window
surrounding (and including) what was considered to be the answer was returned.
If the question analyser failed to recognise any question type (i.e. the answer type was the
catch-all category “thing”), the system took the five sentences which were ranked highest by
the similarity algorithm and looked for a portion of text 50 bytes long within a sentence which
was closest to the words contained in the question, but contained the lowest number of
question words: this “naïve” heuristic was derived experimentally and was shown to have the
highest probability of finding an answer (although a low overall probability of containing an
answer due to the simplicity of the algorithm). This heuristic proved to be useful for
attempting to find answers to definition and “famous person” questions, as in the case of:
Who was Galileo?
where the system correctly guessed that the answer would probably be somewhere in the
following snippets:
“discovered in 1610 by Italian astronomer Galileo”
“Galileo what to do and when to do it”
“a 760-pound Galileo probe and a 4,890-pound”
“Galileo studied and taught in the late 1500s”
“Galileo 's new solid-fueled booster is less”
It was not considered useful to improve the heuristic due to the fact that it was meant to be a
fall-back option and improvements on the previous aspects of processing (e.g. more precise
question type and Named Entity recognition, the use of patterns for recognising definitions)
should have made this particular part redundant.
3.2 Novel aspects of YorkQA at TREC-10
Although the overall design was not new in the context of the TREC evaluation framework,
the weight given to the sentence similarity algorithm was a novel idea. This enabled the
system to give satisfactory results even with a largely incomplete Named Entity recogniser, a
rudimentary question type recogniser and a disappointing syntactic analyser. Although a
number of systems attempted paragraph retrieval as opposed to document retrieval at the
initial stage of processing, this differed considerably from our use of a sentence similarity
measure: while paragraph retrieval simply relied on a statistical analysis of words (query
terms) in common between a question and a document paragraph (and hence had to be
supplemented with more complex methods of answer-finding), sentence similarity sought to
find complex relationships in meaning between a question and a possible answer sentence,
relationships which were limited only by the ontology used (WordNet).
3.3 Results
Given that our system was a simple prototype which had been neither tested nor tuned, the
results we obtained were encouraging, the best score being 18.1% of the questions correct (with
lenient judgements), which was an average system performance.
The official results given by the NIST evaluators are summarised in the following table:

                     Strict judgement    Lenient judgement
  YorkQA01 MRR            0.040               0.048
  YorkQA01 Correct         6.7%                7.7%
  YorkQA02 MRR            0.111               0.121
  YorkQA02 Correct        16.5%               18.1%
As can be seen from the table, YorkQA02, which made use of the documents NIST retrieved
using the Prise information retrieval engine, performed significantly better than YorkQA01,
which made use of the SMART information retrieval engine, both in terms of percentage of
correct answers and mean reciprocal rank (MRR). The MRR gave an indication of how well
our system managed to rank answers correctly within the five possible answer sentences (a
perfect system would always rank a correct answer at the top of the five sentences).
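For clarity, the MRR figure can be computed as in the following minimal sketch (rank positions are 1-based, and a question with no correct answer among the five returned sentences contributes zero):

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR over a set of questions: `first_correct_ranks` holds, for each question,
    the rank (1-5) of the first correct answer among the five returned sentences,
    or None if no correct answer was returned."""
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

# e.g. three questions: correct answer ranked first, ranked third, and not found at all
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 = 0.444...
```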
4
YorkQA at TREC-11
The YorkQA system described above was modified for the TREC-11 QA track evaluation
(see De Boni et al. 2003²) and again submitted for evaluation to NIST.
4.1
Design
The aim of our system in the TREC-11 QA track was 1) to see how much our previous system
could be improved simply by removing bugs and inconsistencies and 2) to test new
techniques on “real” data.
The system produced was therefore again a prototype experimental system rather than a
“complete” machine ready for deployment. In particular it still lacked an information retrieval
engine (we abandoned the SMART IR engine and relied entirely on the documents retrieved
by NIST's PRISE IR engine), and had only an outline Named Entity Recogniser and an
incomplete answer extraction module.
4.1.1
Improvements on the previous version
A close examination of the system used for TREC 2001 revealed a number of small bugs
across all modules which were significant enough to affect performance. These were
corrected for the TREC-11 entry. Furthermore, a close analysis of our results which took into
account the contribution of each module revealed that a number of the components we used did
not improve performance and were in fact detrimental. The Noun Phrase Chunker which we
had used the previous year was removed as its output was not precise enough to be used
productively; the sentence splitter was modified considerably; and the SMART information
retrieval engine was abandoned, so that we relied wholly on the NIST-supplied documents.
² It should be noted that, as for the TREC-10 system, the author of this Thesis, M. De Boni, was
responsible for the overall design and implementation of the system as well as the individual QA
components; J.-L. Jara-Valentia contributed the sentence splitter and Named Entity recogniser.
The question recogniser which we previously used proved very hard to maintain and improve,
based as it was on a large number of patterns and exceptions with very limited use of
linguistic resources. It was therefore re-written from scratch in a more elegant way, making
far greater use of linguistic resources such as WordNet.
The Named Entity recogniser was also rewritten from scratch, as was the answer extractor,
which had to cope with the TREC-11 track aim of extracting an exact answer as opposed
to a string of words.
[Figure: architecture of the YorkQA TREC-11 system, showing the main components: question tagging and categorisation, query extraction, the NIST documents retrieved by the Prise IR engine, document tagging and Named Entity recognition, sentence similarity ranking of relevant sentences, answer location (with NE location handling unrecognised Named Entities) and output formatting of the unique answer.]
4.1.2
Question Recogniser
We improved the previous version of YorkQA by rationalising the set of rules used to
determine the focus of the question, allowing for more complex phrasing of questions, for
example:
•
“What Belgian city … ?” recognised as being of type “place”.
•
“Who was the first Brazilian president?”, recognised as being of type “person”
The initial recognition, however, was too refined to be of much use: given a question such as
“What president's wife commented on the affair”, it would return a question type of
“president's wife”. The WordNet is-a hierarchy was therefore used to determine a less fine-grained answer type. So, we would have, for example:
•
“What actor’s father … ?” recognised as being of type “person”
As noted above in analysing the TREC-10 system, WordNet is deceptive without accurate
word-sense disambiguation: a satellite is a person if “satellite” is intended in the sense of synset
107546753, “a person who follows or serves another”, but not if it is intended in the more
common meaning of synset 103275905, “a man-made object that orbits around the earth”.
Given that we did not have an accurate word-sense disambiguation module, we resorted to
assuming that the meaning of any word was its most common sense and added a hand-crafted
series of rules, based on an analysis of the questions used in TREC-9 and TREC-10, which
reflected this (e.g. our rules state that “satellite” is not a person, but is a “thing”).
On the other hand, WordNet could not always be relied upon to determine whether a question
was looking for MUC entities (e.g. locations) due to inconsistencies such as the fact that
according to the hypernym hierarchy, a city is a location, but a lake is not; again hand-crafted
rules based on the previous TREC questions were applied to state facts such as “sea is a
location” and hence determine that a question asking “What sea…” could be answered by
looking for a location Named Entity.
Overall the question type recogniser had an accuracy of 96% on the TREC-11 questions.
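The following sketch illustrates the general shape of this type-coarsening step, using the NLTK interface to WordNet (which postdates the original system); the override table and the coarse type inventory shown here are illustrative placeholders rather than the actual YorkQA rules.

```python
from nltk.corpus import wordnet as wn

# Illustrative stand-ins for the hand-crafted override rules described above.
TYPE_OVERRIDES = {"satellite": "thing", "sea": "location", "lake": "location"}
COARSE_TYPES = {"person": "person", "location": "location", "organization": "organisation"}

def coarse_question_type(focus_noun):
    """Map a fine-grained question focus (e.g. "actor") onto a coarse answer type by
    walking the WordNet is-a hierarchy of its most common sense, with hand-crafted
    overrides applied first (no word-sense disambiguation is assumed)."""
    if focus_noun in TYPE_OVERRIDES:
        return TYPE_OVERRIDES[focus_noun]
    synsets = wn.synsets(focus_noun, pos=wn.NOUN)
    if not synsets:
        return "thing"
    for path in synsets[0].hypernym_paths():   # first synset = most frequent sense
        for synset in reversed(path):          # walk from specific towards general
            name = synset.lemma_names()[0]
            if name in COARSE_TYPES:
                return COARSE_TYPES[name]
    return "thing"

print(coarse_question_type("actor"))      # -> "person"
print(coarse_question_type("satellite"))  # -> "thing" (overridden by hand-crafted rule)
```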
4.1.3
Named Entity Recogniser
Texts were initially tokenized by applying several hand-coded heuristics. Then they were
tagged using the TnT part-of-speech tagger used for the TREC-10 evaluation. Sentences were
then split by an algorithm that utilised a number of heuristics and a list of abbreviations
extracted by another algorithm that used active learning for highly ambiguous cases.
Although there was no rigorous evaluation of this sentence splitter, an informal evaluation
showed that it worked well in most cases and did not replicate the errors noticed in the
previously employed splitter.
For Named Entity recognition the YorkQA system used our own implementation of
Nymble (Bikel et al. 1997), which utilised hidden Markov models to identify named entities in
the text. An improved version of this algorithm showed high accuracy in MUC-7. It was
estimated that our version of Nymble reached about 70% recall and 80% precision.
These results were a big improvement compared to the TREC-10 version of YorkQA.
However, the program only identified MUC entities (person names, organisation names,
location names, dates, time expressions, currency expressions and percentage expressions), and
ignored other entities such as speeds, distances, durations, etc. which were needed for a
correct identification of an answer.
Our Named Entity Recogniser was also insufficiently precise to correctly identify subtypes of
the MUC entities: thus we were unable to distinguish, within the type “person name”,
between first names, second names and nicknames; within the type “date” we could not
distinguish between year, month and day; within “location” we could not distinguish between
countries, regions and cities. This proved to be a serious problem as the TREC-11 evaluation
required very precise answers which did not contain “irrelevant” information, where
information such as “England” in “London, England” was considered irrelevant (and hence
the whole answer judged incorrect) if the question asked “What city…?”.
4.1.4
Semantic Similarity Metric
The semantic similarity algorithm used for TREC-10 was improved considerably and then
subjected to a thorough analysis of its components. In particular, we evaluated how the use of
WordNet information, part-of-speech information and head-phrase chunking could provide a
reliable estimate for deciding whether an answer sentence was relevant to a question from a
semantic point of view. Our experiments indicated that a semantic similarity metric based on
all WordNet relations (is-a, satellite, similar, pertains, meronym, entails, etc) and using
compound word information, proper noun identification, part-of-speech tagging and word
frequency information gave best results. This metric and the extensive evaluation which we
carried out on the algorithm will be discussed in more detail below in Chapter 6, on semantic
relevance.
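As a rough illustration of the shape of such a metric (the full YorkQA algorithm and its evaluation are described in Chapter 6), the sketch below computes a simple question-to-sentence score from WordNet path similarity using NLTK; it omits the compound-word, proper-noun, part-of-speech and word-frequency components mentioned above and is not the metric actually used.

```python
from itertools import product
from nltk.corpus import wordnet as wn

def word_relatedness(w1, w2):
    """Best path-based similarity between any same-part-of-speech senses of two words."""
    if w1 == w2:
        return 1.0
    scores = [s1.path_similarity(s2) or 0.0
              for s1, s2 in product(wn.synsets(w1), wn.synsets(w2))
              if s1.pos() == s2.pos()]
    return max(scores, default=0.0)

def sentence_similarity(question, sentence):
    """Average, over the question words, of the best relatedness to any sentence word."""
    q_words = [w.lower() for w in question.split() if w.isalpha()]
    s_words = [w.lower() for w in sentence.split() if w.isalpha()]
    if not q_words or not s_words:
        return 0.0
    return sum(max(word_relatedness(q, s) for s in s_words) for q in q_words) / len(q_words)

print(sentence_similarity("Who wrote Paradise Lost", "Milton was the author of the poem"))
```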
Nevertheless, while the new semantic similarity algorithm was very accurate in ranking
sentences containing an answer (it was able to rank an answer sentence in the top
position 31% of the time and in the first five positions 57% of the time, which was
comparable to the top performing systems in TREC-10: the top five systems returned an
answer in the top five sentences on average 62.1% of the time), this was not enough for the
TREC-11 evaluation framework, which judged such answer sentences incorrect, as it required
that an answer contain nothing more than the nouns making up the exact answer.
4.1.5
Answer Word Finder
To determine an answer, the answer word finder took the sentence which was ranked highest
by the sentence similarity algorithm and retrieved the Named Entity corresponding to the
question type, after checking that the Named Entity was not already in the question; where
there was more than one Named Entity of the correct type it guessed, taking as the answer the
one closest to the highest number of question words. As an example, given a question such
as:
Who succeeded Ferdinand Marcos?
and assuming that the highest ranked sentence given by the sentence similarity algorithm was:
Aquino succeeded Ferdinand E. Marcos
The question type recogniser would have recognised that the question was seeking a Named
Entity of type “person” and the Named Entity recogniser would have tagged the document
sentence as follows:
“<person>Aquino</person> succeeded <person>Ferdinand E. Marcos</person>”
The answer word finder would hence take “Aquino” as the answer as this word is the correct
Named Entity type and does not appear in the question.
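A minimal sketch of this selection step follows; the data structures (a token list plus (text, type, position) entity triples) are our own simplification of the actual YorkQA pipeline rather than its real interfaces.

```python
def find_answer_entity(sentence_tokens, entities, question_words, expected_type):
    """Pick the answer Named Entity from the top-ranked sentence: it must be of the
    expected type, must not already occur in the question, and, if several qualify,
    should lie closest to the question words that occur in the sentence."""
    q_positions = [i for i, t in enumerate(sentence_tokens) if t.lower() in question_words]

    def in_question(text):
        # treat an entity as "already in the question" if any of its words appear there
        return any(w.lower() in question_words for w in text.split())

    def proximity(start):
        # negative distance to the nearest question word occurring in the sentence
        return -min((abs(start - p) for p in q_positions), default=len(sentence_tokens))

    candidates = [(text, start) for text, etype, start in entities
                  if etype == expected_type and not in_question(text)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: proximity(c[1]))[0]

tokens = "Aquino succeeded Ferdinand E. Marcos".split()
entities = [("Aquino", "person", 0), ("Ferdinand E. Marcos", "person", 2)]
print(find_answer_entity(tokens, entities,
                         {"who", "succeeded", "ferdinand", "marcos"}, "person"))  # -> Aquino
```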
As noted above however, our Named Entity recogniser only recognised a small number of
entity types, which did not correspond to all types identified by the question type recogniser;
the question type recogniser was, for example, able to recognise very specific questions such
as
What disease is carried by mosquitoes?
correctly recognised as being of type “disease”, and
What colours make up orange?
correctly recognised as being of question type “colour”.
It would have been very impractical to design a Named Entity recogniser for such rare
entities. In these cases, in order to determine possible answers, the system made use of
WordNet's is-a relationships, looking in the answer sentence which had the highest semantic
similarity measure for words which had the question type as a hypernym. This approach appeared
to be fruitful in determining “rare” entities for which it would be impractical to build a Named
Entity recogniser: a good example was the question:
What's the name of King Arthur's sword?
where the question type was correctly identified as “sword” and the answer found in the documents
was “Excalibur”, as Excalibur was listed in WordNet as a kind of sword (i.e. with “sword” as a hypernym).
When the answer finder failed to find a Named Entity or a word with a hypernym of the correct type, or
when the answer type was the catch-all “thing”, it “guessed” the correct answer by retrieving a
text window made up of the nouns closest to the highest number of question words.
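The WordNet look-up described here can be sketched as follows (again using the NLTK interface, with illustrative function names of our own); whether a particular candidate such as “Excalibur” is actually found depends on the WordNet version in use.

```python
from nltk.corpus import wordnet as wn

def is_kind_of(candidate, question_type):
    """True if some noun sense of `candidate` has a sense of `question_type` among its
    (instance) hypernyms, i.e. the candidate denotes a kind or instance of that type."""
    type_synsets = set(wn.synsets(question_type, pos=wn.NOUN))
    for synset in wn.synsets(candidate, pos=wn.NOUN):
        ancestors = set(synset.closure(lambda s: s.hypernyms() + s.instance_hypernyms()))
        if ancestors & type_synsets:
            return True
    return False

def wordnet_answer(sentence_nouns, question_type, question_words):
    """Return the first noun of the top-ranked sentence which is a kind of the question
    type and does not already appear in the question."""
    for noun in sentence_nouns:
        if noun.lower() not in question_words and is_kind_of(noun, question_type):
            return noun
    return None

print(wordnet_answer(["King", "Excalibur", "stone"], "sword",
                     {"what", "name", "king", "arthur", "sword"}))  # expected: "Excalibur"
```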
4.2
Novel aspects of YorkQA at TREC-11
The originality of the TREC-11 YorkQA system lay again in the reliance on an accurate
semantic similarity measure to find the correct answer sentence. This enabled the system to
correctly identify answers without relying on complex forms of pattern matching (as used by
Soubbutin and Soubbutin 2002) or logical inference (as used in Moldovan et al. 2003). As
will be shown in more detail in the following chapters this algorithm gave a very good
performance. Nevertheless, due to the evaluation rules, it needed to be complemented with a
strong Named Entity recognition element to give good results within the TREC framework.
In view of this limitation, another novel idea was the use of a WordNet-based answer finder
to complement the given Named Entity recogniser where it was unfeasible to construct a
recogniser for rare entity types such as animal species or drugs. This proved a fruitful idea
which was able to recognise a number of answers which would otherwise have gone
undetected. The low number of questions actually requiring this approach, however, made it
difficult to quantify its benefit exactly.
4.3
Results
The results of the official evaluation provided by NIST staff were again average and are
summarised in the following table:
Right                    8.4%
Unsupported              1.2%
Inexact                  3.0%
Total Correct answers    12.6%

Table 1
The TREC-11 evaluation confirmed that, from an engineering perspective, Question
Answering is a complex task which requires accurate component parts: underperformance in
any of the components has a detrimental effect on the system as a whole. In such a
complex system it is not a simple task to determine why things go wrong and indeed why
things go right. Nevertheless it is important to determine how the individual parts interact in
order to improve performance, and, while such an analysis can be extremely laborious, it is
necessary. The disadvantages of working in a very small team were also highlighted, as the
complexity of the problem requires extensive work in disparate areas.
5
Limitations
5.1
Limitations in System Design for TREC
Systems designed for the TREC evaluation framework are necessarily limited by the
constraints given by the evaluation rules. It therefore becomes easy to concentrate on
improving the evaluation score, placing a large emphasis on solving largely engineering
problems such as the interaction between modules, speed and error propagation, while forgetting
the larger picture, i.e. what users of question answering systems would want and what question
answering systems should be doing to meet those requirements. Thus there has been little
change in the overall design of question answering systems from TREC-8 to TREC-11, with
systems generally improving by becoming faster and more accurate at the same tasks
(information retrieval, question type recognition, named entity recognition, logical inference),
and therefore able to achieve a better score in the TREC evaluation, but not seeking to move
towards more realistic, usable systems.
5.2
Limitations of the TREC evaluation framework
While the TREC evaluation framework presents a useful common benchmark for question
answering systems’ evaluation, it does have a number of limitations, as highlighted above in
chapter 2, section 3; we can now clarify these limitations further in order to understand what
needs to be done to move beyond current QA systems.
The TREC framework has concentrated on how to ensure a standard in evaluation, i.e. an
agreement amongst the people who evaluate the systems’ performance (Voorhees 2003b), but
has ignored fundamental issues about what automated QA is, what QA systems should be
aiming for and what it should mean to give an answer to a question. Leaving aside difficulties
such as agreement between assessors, many aspects of the problem setting for the framework
itself are unclear:
•
Lack of clarity about the meaning of “answer”. The evaluation has concentrated on
systems’ ability to pinpoint exact answers to concept-filling questions (such as “Who
is the president of the US?” or “Where was Frank Sinatra born?”), but has ignored
whether this is what users would want from a QA system. Much more clarity is
needed into what is meant by a satisfactory answer, and what conditions need to be
met in order to have a satisfactory answer.
•
Multiple answers or one answer? A number of questions (e.g. the definition
questions, but also many concept-filling questions) had a number of equally valid but
different answers, yet the assessors were only looking for a unique answer (TREC-11) or ignored multiple answers (TREC-10). Thus systems which were able to
retrieve multiple answers, possibly representing multiple opinions in the document set,
were not rewarded in any way, whereas, especially if there are conflicting opinions or
answers with different levels of reliability in the document collection, a user would
probably want more than one answer.
•
Lack of clarity about what was being evaluated. The use of the NIST-provided Prise
Information Retrieval system, as opposed to a custom IR system, was detrimental to
performance as the Prise system did not always retrieve documents containing an
answer; systems which made use of Prise were therefore penalised, but there was no
way of telling which systems had made use of the NIST document set. A similar
problem arose with previous knowledge: different TREC systems made use of
previous knowledge to different extents, with some systems (e.g. YorkQA) using
dictionaries such as WordNet, while other systems made use of the Internet as a vast
Encyclopaedia (see for example Lin et al. 2003). It is unclear however in the
evaluation results how the use of different knowledge sources affects systems’
performance, and clarity, if not uniformity, would be desirable in order to be able to
compare systems rigorously. As the evaluation stands, comparisons are largely
meaningless and it is unclear whether the systems which scored highest would also be
the best systems in a realistic setting.
•
User modelling (henceforth referred to as Questioner modelling). It is unclear who the user
is assumed to be for the purposes of evaluation. Depending on the type of user,
different questions will have different answers: for example, a question such as
“What is an atom?” should have a different answer depending on whether the user is a
curious elementary school child or a high-school student studying physics. We shall
henceforth refer to user models as questioner models, clarifying the fact that we are
dealing specifically with question answering systems.
•
User goals (henceforth referred to as Questioner goals). A particular aspect of questioner
modelling is the goals of the user. Answers should be different to accommodate
different user goals. For example, the question “Who is a German Philosopher?”
could be interpreted as meaning “I want some examples of German philosophers” or
as meaning “What characteristics distinguish German philosophers from Anglo-Saxon philosophers?”. We shall henceforth refer to user goals as questioner goals,
clarifying the fact that we are dealing specifically with question answering systems.
•
System goals (henceforth referred to as Answerer goals). Answers should also vary
depending on the goals of the question answering system itself: a doctor, for example,
and hence a question answering system which is designed to provide answers similar
to a doctor’s answers, will answer the question “What is Acetaminophen?” in a
different way from a scientist or a salesperson. We shall henceforth refer to system
goals as answerer goals, and system models as answerer models, thus clarifying the
fact that we are dealing specifically with question answering systems.
•
Question context. Questioners rarely ask questions in isolation: questions are more
likely to be part of a wider exchange (dialogue) which provides a context
constraining what should be considered a relevant answer. Although TREC-10
proposed a “context” sub-track which provided an initial attempt to
address this problem (see Voorhees 2002; Harabagiu et al. 2002), experiments in this
direction have subsequently been abandoned, and TREC-11 and, more recently, TREC-12
have continued to examine questions in isolation.
6
Moving Beyond TREC
It therefore becomes necessary to define exactly the problem setting for question answering,
addressing concerns such as:
1. What makes a good answer?
2. Is it possible for more than one answer to be a “good” answer?
3. What is automated question answering trying to achieve?
4. What are the constraints under which automated question answering systems operate?
5. How do issues such as questioner models, questioner goals, answerer goals and
background knowledge and question context fit into question answering?
Once these problems have been solved it becomes possible to evaluate QA systems by
subdividing the problem into smaller parts and evaluating the single components individually.
In the next chapter we shall clarify what makes a good answer; we shall then examine the
problem setting for automated question answering, providing a theoretical framework for
automated question answering systems; in the following sections we shall show how, once we
have shed light on the problem setting, it becomes possible to tackle question answering in a
clearer way by looking at its individual components separately, and this enables us to push
research into question answering beyond the limits of current systems.
Chapter 4
A Definition of Relevance for Automated Question
Answering
Executive Summary
While previous approaches to QA, including TREC, have sought to find unique answers to questions,
we show the importance of the concept of relevance, the idea that there is no clear-cut distinction
between “good” and “bad” answers, but rather varying degrees of relevance. We examine this
concept in depth, showing that relevance may be seen from multiple points of view. We then propose a
new approach to relevance arguing that for automated QA relevance should be considered as being
made up of a number of categories (semantic, goal-directed, logical and morphic relevance). This sets
the foundation for the theoretical framework for automated QA which will be developed in the
following chapter.
1
Introduction
The TREC evaluation framework has been shown to be limited as it leaves a number of
important matters unresolved: it is primarily focused on consistent evaluation while ignoring
the wider issue of what QA is trying to achieve, and hence is not concerned with providing a
firm theoretical foundation for the problem. Other research frameworks such as the
AQUAINT programme, building on the TREC experiments, have also lacked a solid
theoretical backing and have been concerned with advancing the state of the art without a
prior thorough analysis of the problem setting. At the same time, past approaches such as
Lehnert's or Dyer's, which attempted a theoretical foundation based on modelling human
cognitive processes, have not been fruitful as they have been shown to be applicable only to
limited domains; moreover, while they attempted to model actual human cognitive processes
involved in question answering they ignored the more fundamental problem which asks
whether these processes are best suited for automated question answering, i.e. what
automated question answering should be as opposed to what human question answering is.
We shall address these issues by providing a theoretical framework for automated open
domain question answering which sets out the conditions necessary for successfully
answering a question. We saw above, in the discussion on philosophical approaches to the
problem of question answering, how, following authors such as Derrida, it is possible to claim
that any answer is a good answer to a question, as any number of different, and equally valid,
interpretations can be given to both question and answer. Eco provided a plausible objection
to this extreme view, arguing against the “drift of meaning” and proposing instead that
meaning in general (and hence the meaning of specific questions and answers) must be
limited by a number of practical rules for interpretation which are derived through an
abductive interpretation of a text. This idea ties in with the concept of “prejudice” put forward
by Gadamer: when asking a question or providing an answer to a question we follow a
predetermined route (a judgement which has already been made: a pre-judgement) which
constrains the type of answer we are able to provide and the meaning we give to both question
and answer; for each question there are at least as many answers as there are interpretations of
the question, but the types of interpretation that are possible are limited by some a priori
judgement. But, as can be expected of philosophers, who are explicitly concerned with clarifying
theory rather than with providing designs which could be useful from a technical or engineering
point of view, these authors lack the sort of detail that is needed to provide a theoretical
underpinning to automated question answering: Gadamer does not offer any concrete and
detailed proposal as to how prejudices could be formulated, while Eco does not present any
rules for interpretation which could be easily put to use in an actual system. One philosophical
concept that does however seem to be potentially useful in clarifying the nature of answers
and the limits to what may be considered an answer to a question, is the notion of relevance.
Relevance theory has examined the conditions by which sentences in conversations are said to
follow on from previous utterances³ in an appropriate (pertinent, felicitous) manner;
moreover, it has been shown to provide ideas which can be applied in practice to research
areas such as information retrieval which are closely related to question answering.
Consequently, in order to address the shortcomings of theories such as those of Eco and
Gadamer, we examine the idea of relevance to understand what rules (pre-judices in the
Gadamerian sense) can and should be applied in automated systems when endeavouring to
understand and answer a question.
We shall set out the theoretical framework in two stages:
•
In the present chapter we examine what we consider the fundamental concept for QA:
relevance.
³ Note that in this work I do not distinguish between “sentence” and “utterance”, taking both to mean
grammatically well-formed as well as “colloquial” expressions.
•
Once the concept of relevance has been suitably clarified we shall move, in the
following chapter, to incorporate it in a general theory of automated question
answering which rigorously defines what QA systems are and under what constraints
they operate.
Having established this framework, we shall then demonstrate in chapters 6-11 its usefulness
for improving TREC-style open domain question answering systems, and in particular the
YorkQA system described in the previous chapter.
2
Theoretical limitations to the framework
Before we set out the framework in detail it is useful to clarify what this framework does and
does not intend to do, in order to avoid any confusion or misunderstanding.
2.1
A psychological model?
The proposed framework makes no claim to being a psychological model of QA. Modelling
actual human cognitive processes involved in question answering ignores the question of whether
such a model is best suited to automated question answering. Humans cannot always answer questions easily, often
giving ambiguous, incoherent, illogical, misleading or wrong answers; moreover, humans
have difficulty in quickly finding answers to questions about topics on which they have little
expertise and therefore usually ask questions in a cooperative manner, asking the right
question to the right person at the right moment.
At a deeper level, there are numerous philosophical objections to the attempt to build a model
of cognitive processes which would invalidate much of the work carried out in cognitive
science and hence any attempt to formulate a psychological model of question answering.
Kant’s Kritik der reinen Vernunft of 1781 remains a classic in this sense (and no summary can
do justice to the original; we recommend Cassirer 1981 for a “classical” introduction): Kant
examined the cognitive faculties in seeking to clarify the conditions which would allow us to
claim that certain assertions, such as the assertions of mathematics and physics, are universal
and necessary. Taking as starting point Hume’s sceptical empiricism, which questions
whether it is possible to make any universal assertions based on empirical evidence alone,
Kant recognises that an empirical examination of knowledge could not move beyond this
scepticism; attempting to do so would, in a circular manner, simply return the investigator to
the starting point, i.e. the contingency of any empirical observation: if empirical observation
is not universally true (for example it is subject to errors and conflicting interpretation of
phenomena), it cannot be used to show how we can arrive at universally accepted true
statements such as the truths of mathematics. Kant therefore proposes a transcendental
investigation, by which he means an investigation into the conditions which make it possible
to claim that certain assertions have universal truth. This type of investigation is carried out
by reason alone, or in Kant’s terminology, pure reason, without recourse to empirical
observation and without any claim to empirical validity: the conditions of knowledge which
he sets out are not meant to be a model of actual human cognitive processes (a model of the
human brain for example) but an explanation of the reasons why certain knowledge may
claim universality. This approach to investigating knowledge was refined by thinkers such as
Schelling, Fichte and Hegel, who sought to investigate the foundations of knowledge not
through an empirical observation of the working of the mind but through a purely theoretical
examination of the concept of “mind” and its manifestations (see Marcuse 1985 for an
overview of Hegelian and post-Hegelian philosophy). This approach has not been without
objections: philosophers such as Wittgenstein have objected that without empirical
verification there is no way of telling whether the observations made by Kant and Hegel
actually make sense; on the other hand, Wittgenstein (in his so-called “second” period)
himself recognised that even such a claim is ultimately “transcendental” (in the Kantian
sense), requiring the thinker to metaphorically climb the ladder of thought and then discard the
ladder as useless once the conclusion has been reached (see Wittgenstein 2001 for a recent
edition of his major work). A similar claim was made by A. J. Ayer in his work on
“Language, Truth and Logic”, but again was recognised as being itself transcendental in
nature and hence retracted in the second edition of the book (see Ayer 2001 for a recent
edition).
It is however beyond the scope of the current work to examine the complex arguments
brought forward in this area by researchers in transcendental philosophy (Kantian
philosophy), philosophy of Mind (Anglo-Saxon philosophy), theoretical philosophy
(Neohegelian philosophy) or gnoseology (Classical philosophy). We shall instead avoid
embroiling ourselves in these arguments by declining to attempt any psychological model of
question answering and by not making any claims about the psychological validity of the
framework presented.
2.2
An empirical analysis?
Close to the idea of a psychological model is the idea of an empirical model of human
behaviour. Within the field of Information Retrieval a number of researchers have sought to
identify methods to determine the relevance of documents to a query: approaches have
included attempts to model the human cognitive processes involved in defining relevance
(e.g. Ingwersen 1996), an analysis of the satisfaction users felt about retrieved documents
(Gluck 1996), an examination of the notion of “value” to a user of search results (Su 1998),
an examination of the notion of “utility” to a user of search results (Bates 1996), an
examination of the notion of “pertinence”, i.e. the perceived correspondence of results to
information need (Howard 1994), an examination of the relationship with the task which the
user is performing (Belkin 1990). Other approaches have included the ethnographic study of
the relevance judgements of individuals carried out by Anderson (2001). The results of such
ethnographic studies are however far from final and there is no agreement as to how concepts
such as value, utility or pertinence should be defined and measured, which makes the
statement made by Cuadra and Katter (1967) that relevance is a “black box” still convincing.
Here we shall not attempt to build a model of how questioners would judge the relevance of
answers in actual fact by observation or experimentation. Instead, following on from what we
said above in paragraph 2.1, we shall pursue a Neohegelian methodology of conceptual
analysis (see Hegel 1812 for a key work illustrating this methodology and De Boni and
Prigmore 2003 for an example of a recent application to another problem). While it is beyond
the remit of the present work to discuss the merits of conceptual analysis versus empirical
analysis, and to consider the ongoing debate as to the validity of both these approaches, we
shall present the fundamental motivation for this approach below, with the understanding that
it is necessarily incomplete.
Observing the behaviour of questioners could not yield definitive results from a theoretical
point of view as it would, at most, provide a description of what occurs when people measure
what they perceive as relevance, not what should be perceived as relevance. Aside from issues
such as the fact that different questioners would understand the word “relevance” in different
ways, and hence we would not be observing people judging the same phenomena, the more
fundamental objection is methodological: we are not attempting to model human behaviour;
instead we are attempting to construct a model which will be used by a system to carry out a
specific task. Observing human behaviour may well give some interesting insight into the
process of judging relevance, but human behaviour is not necessarily the best way of
understanding the problem: humans are often inaccurate, unpredictable and confused. By
observing the way people make relevance judgements we would obtain a more or less
accurate description of what people do, not an understanding of the concept of relevance.
Taking an extreme example, the fundamental objection is that observing people’s behaviour
when they judge relevance in order to understand the concept of relevance is similar to an
astronomer observing people’s use of the word “sun” to understand the sun.
2.3
A Process?
We set out to formulate the conditions which need to be met in order for a system to be able
to satisfactorily carry out the process of answering a question. We do not aim to formulate a
general process that question answering systems should follow to correctly answer a question.
In other words we shall examine the conditions of answerhood, not how these conditions are
to be met; to put it in yet another way, instead of showing how a human or a system goes
about answering a question we shall show the properties that make us say that something is
an answer to a question. Different systems will attempt to meet these conditions in different
ways, some more efficient than others, some more elegant than others, some more
successfully than others. We argue that, independently of how question answering systems
actually go about formulating an answer, they fit within a common generic framework and
their results must conform to a well defined common set of conditions. This approach is in
stark contrast with the TREC evaluation framework (and similar methodologies) which
examine the how of question answering while largely ignoring the what: as has been shown
with the TREC evaluation, ignoring the “what” leaves a number of unresolved issues which
make solving the “how” very problematic. By setting out clearly the conditions under which
question answering should occur, it should be easier to work on solving the problem of
implementing these conditions in an automated question answering system.
2.4
A general theory of question answering?
What we set out is a theoretical framework for automated question answering systems. We are
concerned with giving an account of automated, man-made systems which take as input a
question and a set of documents and give as output an answer. We are not formulating a
philosophy of question answering and therefore ignore the more general problems involved in
considering the wider context in which the question answering process occurs; the issue, for
example, of how a questioner formulates a question (with, for instance, an analysis of
“leading” questions, which prejudice the way an answerer will go about finding an answer)
and what a questioner does with answers (for instance, the fact that a questioner may interpret
an answer at will to mean anything at all, if we follow the idea of “drift of meaning” set out
by Eco) is beyond the scope of this work. Our framework also does not attempt to model
question answering on the part of human beings, a different task from building a
psychological model of question answering, which would have to take into account results
from fields such as linguistics, physiology and sociology: humans do not simply take into
consideration the verbal outputs of answerers, but also note gestures, expressions, intonation,
speed of speech, social context, etc. which convey the full meaning of a response.
3
Relevance for automated Question Answering
Before setting out in detail a theoretical framework for question answering, it is necessary to
examine the notion of relevance, showing its importance for question answering and
analysing its constituent parts. We argue that while previous approaches to QA have been
concerned with finding an answer to questions, we should really be seeking relevant answers:
answers cannot be simply divided into good and bad, true and false etc. but must be seen as
being part of a continuum, a graded scale of worthiness to a questioner. Ignoring the fact that
for every question there is possibly more than one answer is doing a disservice to the
questioner, withholding potentially useful and interesting information. While in a practical
implementation it may well be decided to design a system which only gives one answer at a
time (as TREC-style systems do), a good system should be aware that this answer is not the
only relevant answer and be able to retrieve further “interesting” answers if required to do so.
On the other hand, although such a design would model human answering behaviour
accurately (humans usually only give one answer at a time to a question) this would not
necessarily be the best solution: it does not take advantage of the option that an automated
QA system has of giving multiple answers at any one time, ranked by relevance, thus
providing a wealth of information a human answerer could not easily offer.
3.1
The notion of answer in TREC QA and previous approaches
The TREC QA evaluation framework is looking for true answers, not relevant answers,
ignoring the fact that a question may have more than one correct answer and that an answer
may be helpful (or good, useful or interesting) without necessarily being a full (and hence
true) answer to the question. The evaluation has been concerned with finding an answer,
whether this be an answer within five attempts at responding to a question (TREC-9, TREC-10) or an answer within a single attempt (TREC-11): it has not concerned itself with the
possibility of degrees of answerhood and multiple answers, but with single definite answers.
Thus evaluation has concentrated on the capability of systems to give one “correct” answer
either by measuring a Mean Reciprocal Rank (TREC 9 and 10) or simply as a percentage of
correct answers (TREC-11). The same objection holds for other approaches to automated QA,
for example the systems developed in the context of the AQUAINT programme or the
systems developed by Lehnert, Dyer, Schank and Abelson, all of which seek unique answers
in slightly more complex settings. It therefore becomes necessary to define relevance for QA
systems.
We argue that we should not be looking for a unique answer: rather, we should be looking for
answerhood as a property which determines the relevance of different responses to a question.
3.2
Beyond current approaches: relevant answers vs. unique answers
Wilson and Sperber (2002) question the notion that communication is governed by a norm of
truthfulness (as was argued, for example by Grice 1967 and Lewis 1975) and instead propose
that verbal communication is also governed by expectations of relevance. They define
relevance as “a property of inputs to cognitive processes which makes them worth
processing”, which, applied to QA, can be translated as saying that to affirm that an answer is
relevant to a question is to claim that certain features of the answer make it worthwhile for the
questioner to examine that answer more carefully: we are not therefore talking about fully
meeting the expectations of a questioner (providing an answer) but about making it
worthwhile considering the particular answer which has been given; in other words a relevant
answer is an answer that, while not necessarily providing all the information desired,
nevertheless does provide some information that it is worth having, information which is in
some way helpful or useful or interesting.
Another way to look at relevance is to say, paraphrasing what has been written regarding
information retrieval systems, that it is the criterion used to quantify the phenomena involved
when questioners gauge the “utility, importance, degree of match, fit, proximity,
appropriateness, closeness, pertinence, value of bearing” (Rees 1966) of responses to a
question: in other words, answers aren’t classified by questioners simply as right or wrong (a
judgement based solely on truthfulness), but are judged according to many different criteria
and ranked accordingly.
Answers do not fit neatly either into a set of good or a set of bad answers: what we are talking
about are degrees of answerhood which vary depending on circumstances; answerhood is a
fuzzy concept with a possibly infinite number of degrees between an utterance being a good
or a bad answer. Again, answers are good answers to a certain extent, depending on the
situation, but cannot be considered unconditionally good answers. From a philosophical point
of view we are therefore arguing, in the tradition of Protagoras, for the relativism of answers:
answers are true in relation to the person who measures their truthfulness or utility depending
on the particular circumstances of that individual, not in an independent, absolute sense.
Consider for example the question:
Q: “Who wrote Paradise Lost?”
and the answers:
A1: “An English poet”
A2: “Milton”
A3: “It wasn’t Shakespeare”
A4: “Have you tried looking in the Britannica?”
A5: “Camomile tea is considered a good relaxant”
While it is clear that A5 is not a good answer to the question, all the answers A1-A4 are in a
sense correct, but which one is the best answer depends on the circumstances in which the
question is asked.
If the questioner was a high school student writing an essay about Shakespeare, A3 would
probably be the best answer, followed by A4, then A2 and A1: the questioner probably thinks
Shakespeare wrote Paradise Lost and is about to add this fact to their essay; telling them that
it wasn’t Shakespeare will prevent them from making a gross mistake, while pointing them in
the direction of the Encyclopaedia Britannica is probably the most pedagogically sound
approach; telling them the author was Milton helps them avoid a mistake, but not as quickly
as telling them it wasn’t Shakespeare; telling them it was an English poet, on the other hand, is
not going to be helpful, as it won’t stop the student from making a mistake.
Conversely, if the questioner was a non-European student trying to get a broad overview of
European culture, answer A1 may well be sufficiently informative; A2 might be useful if the
questioner knew who Milton was; A4 could be valuable if the questioner has access to
the Encyclopaedia Britannica, while A3 is not going to be at all helpful.
It is clear from this simple example that while some answers are patently of no use to the
questioner it is difficult to claim that there is one correct answer: a number of answers may be
useful to a questioner and, depending on the circumstances in which a question has been
asked, some answers will be more relevant than others; in some cases it may well be that a
number of answers can be considered equally relevant.
3.3
A definition of relevance for automated QA
Unlike the definitions of relevance seen above which define relevance in terms of a
relationship between an answer and a questioner, we claim that relevance must also be
defined in relation to the answerer. QA systems are not solely aiming at maximising
relevance for the user: in certain settings (for example a QA system used in education), the
answerer (i.e. the system, and implicitly the system designers, or the stakeholders who
sponsored the design: in an educational setting, teachers) may decide that it is more important
to consider relevance from the answerer’s point of view. A QA system in a school, for
example, will certainly want to provide answers which pupils consider helpful (and hence
relevant from their point of view), but, if the pupils are asking questions to help them
compose an essay, it may well be pedagogically more sound (and hence more relevant from
the system’s – and indirectly the teacher’s - point of view) to provide hints as opposed to full
answers.
Following from the above discussion we propose the following definition of relevance for QA
systems:
From the point of view of automated QA, relevance is the concept which
expresses the worthiness of an answer in relation to a previously asked
question, a questioner and an answerer.
It will now be necessary to examine in more detail how this “worthiness” is to be measured
and what features contribute to making a worthy answer.
4
Previous work
4.1
Relevance, related concepts and categories
A number of authors have examined the concept of relevance directly, by explicitly
referring to the relevance relationship between a question and an answer; others have
examined this concept indirectly, by analysing the relationship between question and answer
with terminology which is different but nevertheless closely related to the idea of relevance.
The following are some of the approaches taken:
•
Researchers in Information Retrieval have talked explicitly of relevance but have
referred to the relevance of a set of documents to a query as opposed to the relevance
of an answer to a question
•
Philosophers have talked of “principles” which link generic utterances in a felicitous
manner
•
Linguists have talked of “conditions” of answerhood.
By interpreting this body of work in terms of relevance, we can see that the concept of
relevance has effectively been divided into a number of distinct categories, representing the
different ways in which relevance may be considered. We shall examine some representative
classifications which have been proposed, without claiming exhaustiveness. It will become
clear that there is little interaction between these different proposals, but we shall show how
the differences between these classifications can be reconciled through a reduced number of
new relevance categories.
As seen above, Grice (1957; 1961; 1967; 1989) set out a number of principles speakers should
follow when responding to an utterance. While these principles are not explicitly referred to
as different types of relevance, nor do they apply solely to question answering, they do
point to a number of elements which can be used to judge the relevance of an utterance as a
response. While only the maxim of relation talks explicitly about relevance, all the maxims
can be considered types of relevance, as they all set out principles which “co-operative”
speakers should follow: to be “co-operative” is to follow on from a previous utterance in a manner
which the other partner in conversation would consider helpful, and hence relevant. We can
therefore paraphrase Grice by positing the following “Gricean” relevance categories:
•
Quantity, which looks at the informativeness of a response: a response should give no
more and no less information than is required
•
Quality, which takes into consideration the truthfulness of a response: a response should be
true and not provide information for which there is not enough evidence
•
Relation, which considers the “aboutness” of a response, explicitly defined as relevance
•
Manner, which looks at the way a response is formulated: responses should be clear,
unambiguous, brief and orderly
A number of objections have been made to this categorisation (see for example, Sperber and
Wilson 1995, who argue that the fundamental relationship should be taken to be Relevance,
which in turn should be interpreted as the ability of the hearer to make relevant inferences
from a given utterance). From the point of view of question answering the following
questions arise:
•
Quantity. It may not always be the case that a speaker wants to give the information
required and nothing more: a teacher for example may want to stimulate a pupil by
making an utterance more informative than required by the pupil’s question; on the other
hand the same teacher may want to withhold information to push the student to think
harder.
•
Quality. While Grice assumes that a “good” response is truthful, wouldn’t a speaker at
times want to give half-truths depending on their goals? Wouldn’t a speaker at times want
to give (and a hearer want to hear) answers for which there is no adequate evidence rather
than be silent (e.g. speculations or hypotheses)? And doesn’t the notion of truth depend on
the logical framework that the speaker and hearer employ, a framework which may be
different or even incompatible?
•
Manner. It may not always be the case that a speaker should be brief, orderly and
unambiguous: some hearers may prefer the challenge of a complex, obscure argument to
a more straightforward response.
Nevertheless, if we ignore the specific content which we argued against, we can derive the
following general points:
•
Answers are related to questions in logical terms (quantity and quality: informativeness
and truthfulness)
•
Answers are related to questions in terms of the topic they are concerned with, or, in other
words, in terms of their generic meaning (relation)
•
Answers are related to questions in virtue of the way in which they are expressed
(manner)
Ginzburg (1995b, but see also 1995a and Ginzburg and Sag 2000) argues that in order to
determine the relationship between an answer and a question, it is necessary to consider the
questioner's mental situation, in particular the questioner's beliefs and goals, as determined by
that mental situation. This, together with some notion of consequence, will determine which
of all the potentially resolving answers actually resolve the question in that context, which
partially resolve the question, and which fulfil the questioner's goals. Accordingly, Ginzburg
classifies the “answerhood” of answers to a question as:
•
being “about” the question, meaning the subject matter of the answer is appropriate to the
question, i.e. it provides some sort of direct answer.
•
fully resolving the question, meaning the question is no longer open: this is relative to the
purpose or goal of the question and the belief or knowledge state of the questioner.
•
partially resolving the question: this would include answers which direct the questioner to
a resolving response, as, for example, when, given the question “What platform does the
train leave from?” one answers “Ask the guard”.
•
fulfilling the goals associated with the asking of the question.
As was seen for Grice, Ginzburg does not talk about relevance. Nevertheless, the concept of
an answer being able to partially resolve a question implicitly recognises that answers are not
simply correct or incorrect, but relevant to varying degrees: the most relevant answers would
fully resolve a question, while less relevant answers would be partially resolving.
Berg (1991) identified four different types of relevance which can be used when considering
the formulation of responses in conversation to determine if the response is relevant to the
preceding exchange(s):
•
semantic relevance: a response can be deemed relevant in virtue of the meanings of the
expressions in the sentence. Thus the semantic relations between a sentence and the
response to the sentence will determine whether the response is relevant or not.
•
topical relevance: a response is relevant when it refers to the same topic as the preceding
exchange.
•
inferential relevance: a response is relevant when the questioner can draw a number of
interesting inferences from the answer.
•
goal directed relevance: a relevant response is a response that addresses the needs of the
conversants, i.e. helps achieving the goals that the conversants intended to achieve by
engaging in the conversation. Goal directed relevance, therefore does not concentrate
simply on the verbal aspects of a conversation, i.e. the syntactic and semantic qualities of
the conversation: rather it focuses on the behavioural aspects of the conversation, the fact
that conversations are also actions which take place in particular situations.
A number of problems arise from this classification, in particular:
•
It is unclear what the difference is between “topical” and “semantic” relevance: surely
topicality is related to the “meaning” of a sentence? A response to a question which is “on
topic” will necessarily be semantically related to the question; at the same time, as the
“topicality” of the answer diminishes there nevertheless remains a sense in which the
meanings of question and answer are related.
•
What is meant by “interesting” in inferential relevance? “Interesting” to whom? Does this
not depend on the questioner’s goals, and possibly the answerer’s goals? A teacher may
want to impart information that is interesting from a pedagogical point of view, but rather
tedious from a child’s point of view. A question answering system which acts as a
customer care centre for a corporation may want to give customers “interesting”
information about new products or services, while the customer is only interested in an
answer to the particular question at hand.
Saracevic (1996 and 2003), in the context of Information Retrieval systems, identifies the
following types of relevance:
•
Systems or algorithmic relevance: also referred to as comparative effectiveness, this is a
relation between a query and files of a system as retrieved or failed to be retrieved by a
given algorithm.
•
Topical or subject relevance: also referred to as “aboutness”, this is a relation between
topic of the query and topic covered by the retrieved objects.
•
Cognitive relevance or pertinence: also referred to as informativeness and novelty, this
indicates a relation between the knowledge state and cognitive informational need of a
user and the objects provided.
•
Motivational or affective relevance: also referred to as satisfaction, this indicates a
relation between intents, goals and motivations of a user and objects retrieved by a
system.
•
Situational relevance or utility: also referred to as usefulness in decision-making or
reduction of uncertainty, it refers to the relation between the task or problem-at-hand and
the objects retrieved.
While Saracevic is concerned with the relationship between a query and the documents
retrieved by an information retrieval system, substituting “question” for “query” and
“answers” for “documents” or “retrieved objects” in his argument gives an interesting insight
into different types of relevance which could be considered for question answering. One
problem with this classification, however, is that it is unclear what the difference is
between motivational, situational and cognitive relevance: do they not all refer to user needs,
i.e. the goals of the user?
Morris Engel (1980), in the context of an analysis of argumentation, distinguishes between:
•
Logical relevance: indicating that some form of reasoning brings us to conclude that a
sentence is relevant in relation to another
•
Deductive relevance: a particular form of logical relevance which relies on the idea of
deduction
•
Psychological relevance: this is based on the idea that the stream of consciousness often
includes associations between ideas that are not at all logically related
While this classification was devised to explain how arguments follow on from each other in
argumentation, it could well be applied to question answering, as an answer to a question may
be seen as an argument which follows, logically or psychologically, from the preceding
question.
Other authors have talked about yet more aspects of relevance: in the context of Information
Retrieval, Kando (2002) for example talks about:
•
Topical or Objective Relevance, looking at the topicality of a query through
correspondence of query terms
•
Subjective or Psychological Relevance, looking at needs.
•
Situational Relevance, Interactive Relevance, looking at the context in which the query is
posed
•
Motivative Relevance: looking at the reasons why a query has been asked
•
Interpretational relevance: looking at the “horizon” within which the person who is
making the query will consider the results.
Here it is difficult to see a fundamental difference between subjective, situational, interactive,
motivational and interpretational relevance, as they all refer to the particular needs and goals
of a user, whether this be the particular situation of the user, their interpretational horizon or
their subjective situation.
4.2
Limitations of these approaches to relevance
While this brief review of the different types of relevance which have been identified (mostly
in areas other than question answering) does not claim to be exhaustive, it is representative of
the variety of opinion expressed by researchers in this area. The use of these categories for an
understanding of relevance for question answering is however problematic for a number of
reasons:
•
Not all the categorisations above (for example those of Grice and Ginzburg) explicitly refer
to relevance
•
Very few of the given categorisations were conceived directly for question
answering (most were conceived in the area of information retrieval) and they
must therefore be adapted for this area
•
It is unclear what the relationship is between the various categorisations, as there
has been little (if any) discussion by each author of the approaches proposed by the
other authors
•
The meaning of the categorisations is not always fully clear, and there is
considerable ambiguity in a number of the proposed relevance categories,
especially as regards their practical implementation
•
It is unclear how the categories would be applied in practice as there has been no
attempt at a comprehensive implementation of the proposals
We shall now seek to resolve these issues and make sense of the variety of relevance
categories by proposing a reduced number of relevance categories for question answering
which:
•
Explicitly refer to the notion of relevance
•
Explicitly refer to the problem of automated question answering
•
Take into consideration the variety of categories proposed by other authors
•
Are clearly defined
•
Are satisfactory from a practical point of view, i.e. can be implemented in a
working system (the implementation of the given relevance categories will be
examined in Chapters 6-10 below).
5
Relevance categorisation for QA systems
5.1
Proposed categories
Above we defined relevance, from the point of view of question answering systems, as the
concept which expresses the worthiness of an answer in relation to a previously asked
question, a questioner and an answerer. We shall now define what exactly is meant by
worthiness, i.e. what conditions make an answer worth listening to.
As seen, relevance is not a simple concept, and it can be understood from different
viewpoints. The different types of relevance identified above have, however, been shown to be
limited (paragraph 4.2 above) and cannot therefore simply be taken “as is” and applied to the
problem of automated QA. In order to overcome these limitations we therefore propose a number
of categories which, we will show, are sufficient for automated QA systems to capture the various
points of view from which relevance can be expressed (the relationship between the proposed
categories and the categories identified in the theories above will be discussed in paragraphs
5.2 and 5.3):
•
Semantic Relevance, relating the meaning of question and answer (see chapter 6.1
below for a more detailed discussion and examples). Semantic relevance would
examine features such as the relationship between the subject matter of question and
answer and the relationship between the general meanings associated with the words
contained in the question and the answer; an example of a relation of semantic
relevance between a question and an answer is:
Q: Why did Napoleon occupy Venice?
A1: Wars are always evil
In this example, although the answer does not provide the questioner with the requested
information, it may nevertheless be considered of interest, and certainly of more interest
than an answer such as:
A2: The rabbit looked at his watch
A2 has no semantic relationship with Q, while A1 shares a number of connotations with
the question (wars entail occupations; wars are carried out by armies and Napoleon was
the head of an army).
•
Goal-directed Relevance, relating answers to informational goals, trying to satisfy the
informational needs of the questioner and answerer (see chapter 7.1 below for a more
detailed discussion and examples); an example of a relation of goal-directed
relevance is:
Q: When does the next train leave for London?
A1: There is a timetable over there
where the answer, although not providing all the needed information, helps the questioner
meet their goal of finding out the train time by pointing them to a timetable, which will
hopefully contain an answer. Consider instead the answer:
A2: In the future
which answers the question, but does not provide any useful information.
•
Logical Relevance, relating answers to questions through some process of reasoning,
for example logical inference (see chapter 8.2 and 8.5.2 for a more detailed discussion
and examples). An example is:
Q: Where was Johann Sebastian Bach born?
A1: Johann Sebastian Bach was born in Eisenach.
where A1 is considered to be a correct answer as it provides the information needed (a
birthplace) and refers to all the constraints given in the question (the birthplace of Johann
Sebastian, not Johann Christian or Hans Bach). Compare instead:
A2: Bach adapted some of Vivaldi’s concertos for the organ.
where A2 is relevant (it is about the same topic as the question), but from a semantic point
of view (it considers the same subject matter: a musician named Bach), not from a logical
point of view (it does not provide the information requested by the questioner and cannot
be used to infer the information).
•
Morphic Relevance, relating answers to the questioner and the answerer through the
way the answer is expressed (see chapter 9.3 for a more detailed discussion and examples).
Consider a hurried questioner who asks a system:
Q: When is the next train to London?
And the possible answers:
A1: 10:15, platform 1.
A2: There will shortly be a train which will depart from platform one at ten fifteen.
Although the information provided by both answers is similar, the questioner, being in a
hurry, would find A1 more relevant than A2, as its format is shorter and quicker to read.
These relevance categories are the conditions which give answers their worth: to judge the
relevance of an answer, and hence its worthiness, is to consider the answer from the point of
view of semantic, goal-directed, logical and morphic relevance. The following table
summarises the meaning of these categories:
Category                  Relation
Semantic Relevance        Considers how questions and answers are related through their meaning
Goal-directed Relevance   Considers the informational goals associated with a question and its associated answer, both in the mind of the questioner and the answerer
Logical Relevance         Considers the relationship that exists between the unknown information a question is asking about and an answer, in virtue of the way we reason about answers in relation to questions
Morphic Relevance         Considers the way the answer is expressed, i.e. the outer form of the answer
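To make the distinction between these categories more concrete, here is a small illustrative sketch in Python (it is not the YorkQA implementation discussed in later chapters): semantic relevance is approximated by content-word overlap and logical relevance by a single hand-written pattern, using the Bach question discussed above; the stopword list and the regular expressions are assumptions made purely for this toy example.

import re

# A toy stopword list, assumed purely for this illustration.
STOPWORDS = {"was", "where", "the", "of", "for", "a", "an", "some", "did", "in"}

def content_words(text):
    return {w.lower() for w in re.findall(r"[A-Za-z]+", text)} - STOPWORDS

def semantic_score(question, answer):
    # Crude semantic relevance: proportion of the question's content words shared by the answer.
    q, a = content_words(question), content_words(answer)
    return len(q & a) / len(q) if q else 0.0

def logically_relevant(question, answer):
    # Crude logical relevance, for "Where was X born?" questions only:
    # the answer must actually assert a birthplace for the person asked about.
    m = re.match(r"where was (.+) born\?", question, re.IGNORECASE)
    if not m:
        return False
    person = re.escape(m.group(1))
    return re.search(person + r".*born in \w+", answer, re.IGNORECASE) is not None

q = "Where was Johann Sebastian Bach born?"
a1 = "Johann Sebastian Bach was born in Eisenach."
a2 = "Bach adapted some of Vivaldi's concertos for the organ."
for a in (a1, a2):
    print(round(semantic_score(q, a), 2), logically_relevant(q, a), "-", a)
# A1 scores highly on both counts; A2 shares subject matter with the question
# (semantic relevance) but fails the logical test, mirroring the distinction drawn above.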
In proposing these categories I will need to prove that they are:
•
Necessary to encompass all the relevance types referred to by the authors above
•
Sufficient to explain all the relevance types referred to by the authors above
And
•
Able to clarify the limits of TREC-style QA systems
•
Able to provide a solution to the limits of TREC-style QA systems
It should be noted that we do not propose a formal argument in the tradition of mathematical
logic, but rather an informal, though nevertheless rigorous, argument, as is appropriate for this
subject matter. Consequently, it will not always be feasible to separate necessity and
sufficiency, as the arguments for both often overlap.
5.2
Sufficiency of the Relevance Categories
I shall first show that the relevance categories mentioned are sufficient to encompass the types
of relevance described above. While superficially there appears to be a significant
disagreement on the number and types of relevance and relationships between question and
answer, a closer examination reveals that these have a number of common features and can be
grouped in a reduced number of categories which correspond to the relevance categories
proposed above: semantic, goal-directed, logical and morphic relevance. The following table
illustrates this relationship. It should be noted, however, that the vagueness of some of the
definitions given means that what we propose is not an exact match, but an attempt to find the
features common to the given categories.
Common feature: Meaning (Relation)
    Non-relevance categories - Grice: Relevance; Ginzburg: About, Partially resolving
    Relevance categories - Saracevic: Topical or Subject Relevance; Berg: Semantic Relevance, Topical Relevance; Morris Engel: Psychological Relevance; Kando: Topical or Objective Relevance
    Proposed relevance category: Semantic Relevance

Common feature: Goals (Fulfilling goals)
    Non-relevance categories - Ginzburg: Fulfilling goals
    Relevance categories - Saracevic: Motivational Relevance, Cognitive Relevance, Situational Relevance; Berg: Goal-directed Relevance, Inferential Relevance; Morris Engel: Psychological Relevance; Kando: Subjective or Psychological, Situational, Interactive, Motivative, Interpretational Relevance
    Proposed relevance category: Goal-directed Relevance

Common feature: Inference
    Non-relevance categories - Grice: Quantity, Quality; Ginzburg: Resolving
    Relevance categories - Saracevic: Algorithmic Relevance, Cognitive Relevance, Topical Relevance; Morris Engel: Logical Relevance, Deductive Relevance
    Proposed relevance category: Logical Relevance

Common feature: Form
    Non-relevance categories - Grice: Manner
    Proposed relevance category: Morphic Relevance
As pointed out above, I do not claim that the authors discussed have proposed the only or the
“best” theories of relevance, or the best analysis of the properties of answers, but they are
representative of the body of opinion expressed on relevance: it can therefore be inferred that
any conclusions drawn about these theories, and in particular about the relationship between these
theories and the proposed categorisation of relevance, will plausibly be applicable in general.
It can be seen from the table that, unless the correspondences shown are incorrect, the
proposed categories are sufficient to encompass the other categories. We shall now seek to
show the correctness of the correspondence by analysing the conditions of felicitous
answerhood proposed by Grice and Ginzburg and the relevance types proposed by Berg,
Saracevic, Morris Engel and Kando, showing their limitations and the way in which they can
be subsumed under the proposed relevance categories.
Semantic Relevance: considering the relationship in meaning between question and answer,
this encompasses the “relevance” relationship identified by Grice and the “aboutness” identified
by Ginzburg. “Topical” relevance (Saracevic, Berg, Kando; interestingly, Kando refers to
this as “objective” relevance, while Saracevic calls it “subject relevance”), concerned with the
topic of sentences, is again looking at relationships between meanings: we prefer to use the
term “semantic” relevance as opposed to “topical” relevance, as the notion of topicality is too
restrictive, referring more to the focus of a question and answer rather than the meaning of all
the constituent parts. On the other hand, the psychological relevance of Morris Engel also
encompasses meaning as one of the non-logical features which link questions and answers.
Goal-directed relevance. Ginzburg explicitly talks of answers fulfilling the goals associated
with a question, and Berg explicitly talks about goal-directed relevance. But Berg’s concept of
inferential relevance, with its reference to “interesting” inferences to be made by a questioner,
also implicitly relates to goals, as something can only be judged “interesting” when the
context (i.e. the aim of the question) is taken into consideration: but this context is made up of
questioner goals, and hence inferential relevance can be subsumed under goal-directed
relevance. The same argument can be applied to the concepts of Motivational Relevance,
Situational Relevance (Saracevic, Kando), Cognitive Relevance (Saracevic), Psychological or
Subjective Relevance (Kando, Morris Engel), Interactive and Interpretational Relevance
(Kando): motivation refers to a goal, as it is an aim which motivates; cognitive, interactive
and interpretational relevance, referring to the “informational need” of the questioner, again
refer to a goal; situational relevance, referring to the “task or problem at hand”, is again
looking at goals; finally, psychological or subjective relevance is concerned with the
individual’s goals and their relationship to an answer.
Logical relevance: the Gricean maxims of quantity and quality, looking at information
content, truthfulness and evidence, are in effect looking for a logical relationship between
utterances, a logic which may be, according to Morris Engel, either inductive or deductive.
Another way of looking at induction or deduction is to say, following Saracevic, that there is
an algorithm (i.e. a mechanism, and hence a logic) which can be used to construct the answer;
the use of such an algorithm, i.e. the process of relating two utterances through our cognitive
faculties, would, again in the terminology of Saracevic, be a particular example of cognitive
relevance. Yet another way of considering the logical relationship between a question and an
answer is to say, following Ginzburg, that an answer “resolves” the question: from the answer
it follows logically that the information required in the question is instantiated, i.e. resolved.
The notion of logical relevance however goes beyond the concept of resolvedness by
introducing the idea that different answers may be considered resolving to different degrees
while still instantiating the information required in the question.
Morphic relevance: the Gricean maxims of manner are concerned with the way in which
answers are expressed and are therefore subsumed under morphic relevance. On the other hand,
the maxims of manner are not sufficient to encompass morphic relevance: not only do they
not include the notion of relevance (i.e. the idea that some sentences may be more
morphically relevant than others, while nevertheless still being morphically relevant), but they
also only specify a fixed number of constraints on responses: clarity, concision and order.
Morphic relevance is much more wide-ranging, encompassing all the forms in which a
response may be expressed.
5.3
Necessity of the Relevance Categories
While both Grice’s and Ginzburg’s divisions of the attributes of answers have a neat correspondence
with some of the proposed categories, neither talks explicitly about relevance, only about properties
of unique answers: to talk about relevance is to say that there is no clear-cut distinction
between “good” and “bad” answers, and to claim instead that answers are more or less good.
The given relevance categories are therefore necessary to make explicit the fact that we are
talking about properties of relevant answers, not simply properties of answers: in other words
we are making explicit the fact that there is no neat distinction between answers and non-answers,
but that answers have varying degrees of relevance to a question. On the other hand, the
Gricean maxims of manner are not sufficient to encompass morphic relevance, as they have
been shown to be closely linked to goal-directed relevance and only specify a fixed number of
constraints on responses, e.g. clarity or concision. As long as it can be proven that the
categories cannot be reduced to each other in a different manner, the necessity of the proposed
categories is therefore established.
I will now show that the proposed relevance categories cannot be meaningfully reduced to a
smaller number and are therefore necessary to encompass the relevance categories described
above: in other words, I will show that none of the proposed categories can be removed
without also giving up one or more important relations between question and answer.
The first necessary distinction is between semantic and goal-directed relevance. The necessity
of this distinction is given by the difference between topical or subject relevance, a relation of
aboutness, and a relation which considers the goals associated with the utterances being
compared. Semantic relevance ignores the issues associated with considering the
goals which an utterance could or could not help fulfil: it is only interested in similarities in
subject matter, independently of how the utterers want to or could use the subject matter; the
necessity of goal-directed relevance is therefore given by the need to consider the purpose
for which utterances are made. On the other hand, goal-directed relevance only considers the
goals associated with utterances and does not necessarily consider the wider context in which
the utterances are given: utterances may well be about a topic without necessarily fulfilling
any goals which the utterers may immediately have; semantically relevant utterances may be
useful because, together with other information, they may help the questioner. The necessity
of semantic relevance is therefore given by the need to consider the subject matter of
utterances independently of the uses to which these utterances may be put.
The second necessary distinction is between semantic and logical relevance. Logical
relevance is telling us there is a good reason to believe an answer is effectively an answer to a
question: it is telling us there is compelling evidence for considering an answer an answer,
through some sort of logical argument, perhaps a formal argument (of the type aÆ b) or by
using more informal, persuasive or rhetoric (in the sense of Aristotle) arguments. Logical
relevance differs from semantic relevance in that it is not saying an answer is “about” a
question, containing similar subject matter and potentially interesting for this reason, but is
considering whether it can be “proven” that an utterance is an answer to a question through
some form of reasoning: an answer may well be of relevant subject matter without being an
answer which logically follows from the question. On the other hand an answer may well be
logically an answer without necessarily being of the same subject matter: the question “When
did Picasso die?” is “about” painters and death, while the sentence “1973” is “about” a date;
but, given a sentence in a document with the words “Picasso died in 1973”, “1973” is a
logical answer to the question. As another example, given the question “who killed Aldo
Moro”, the answer “politicians often meet a violent end” is semantically relevant, as it talks
about politicians and their death, as did the question, but is not logically relevant as it cannot
be reasoned to contain an answer (the question remains unresolved: the answer does not
contain the names of any culprits). On the other hand an answer like “Fate” would certainly
be logically relevant if the document collection contains the sentence “Fate often kills men”,
but would not be very relevant from a semantic point of view as it does not talk about murders
or politics (see Chapter 8, p. 182, for further examples).
The third necessary distinction is between goal directed and logical relevance. Goal directed
relevance takes into account the goals associated with the user’s questions. Goal directed
relevance is therefore distinct from logical relevance as an answer may help the questioner
achieve their goals without necessarily being an answer from a logical standpoint: if a
questioner has the goal of learning about Richard Strauss and, asking the question “When
was Richard Strauss born?”, gets the answer “Richard Strauss was assistant conductor to
Bülow”, the answer is certainly relevant from a goal-directed point of view, as it helps the
questioner understand more about Richard Strauss’ life, but is irrelevant from a logical point
of view as the question (with its main focus being “When?”) remains unresolved.
The next necessary distinction is between morphic relevance and semantic relevance.
Morphic relevance is concerned with the form the answer takes in relation to users’
preferences, irrespective of its content. An answer to a given question may or may not be
semantically relevant, but can still take different forms which the questioner may or may not
prefer: “Red and Black” may or may not be a relevant answer to a question, but can take
various forms when given as an answer, for example “The answer is: «Red and Black»”, or
“<answer>Red and Black</answer>”.
The same argument applies to the distinction between logical and morphic relevance.
Morphic relevance is clearly distinct from logical relevance: it is concerned with the way an
answer appears, not the logical properties of the answer or the way an answer has been
derived; a logically valid answer may on the other hand be given to the user in a variety of
different ways which may or may not please the questioner.
Finally, morphic relevance must be distinguished from goal-directed relevance. Morphic
relevance may or may not be linked to goal-directed relevance depending on the situation, but
this optionality entails the necessity of a distinction. A questioner may want an answer to be
output by a QA system using simple language as this may help their informational goal of
understanding the answer. On the other hand, they may want an answer to be output by a QA
system in some mark-up language because they then want to process it using another program
(not an informational goal, but a higher “life” goal); or they may well want an XML-like output
simply because it looks neat (i.e. a preference as opposed to a goal). In a sense morphic
relevance could therefore be said to be related to the aesthetic sensibilities of the questioner
and the answerer, rather than their informational needs and related goals.
6
Moving beyond TREC-style QA systems
6.1
Clarifying the limitations of TREC-style QA systems
We shall now show that the proposed relevance categories enable us to understand current
QA systems more precisely, but also give us the theoretical foundation necessary to solve the
issues raised in chapters 2 and 3 in the discussion of the limitations of current approaches to QA
systems.
We shall first show that the proposed relevance categories clarify the structure of a TREC-style QA system. As seen above, the TREC QA evaluation requires:
•
A text snippet containing a precise answer (TREC-8, 9, 10) or a precise answer without
the addition of other material (TREC-11), justified by the document which contains the
answer.
•
A specific format for the output from the systems set by NIST
Answers which do not conform to these criteria are considered incorrect. We can see that
these requirements are met by considering:
•
Logical relevance: this ensures that the answer is logically connected to the question (i.e.
is justified by the document which contains it), does not contain any irrelevant
information (i.e. is precise), and hence resolves the problem posed by the question.
•
Morphic relevance: this ensures that the answer is given in an appropriate manner, with
appropriate tags and required information (e.g. document id and question id).
Table 2
Proposed relevance categories    TREC evaluation criteria
Semantic Relevance               (none)
Goal-directed Relevance          (none)
Logical Relevance                Text snippet containing exact answer justified by document, or exact answer justified by document
Morphic Relevance                Output rules: formatting according to the NIST requirements
Overall relevance                Not relevance: a single correct answer, or a correct answer out of five attempts
The given relevance criteria therefore provide an insight into the limitations of TREC-style
QA, in particular:
•
The lack of a concept of semantic relevance: answers are either correct or incorrect;
sentences which for example provide background information or contextual
information are considered incorrect, even if no “precise” answer is given
•
The lack of a concept of goal-directed relevance: questioner and answerer goals are
ignored, even if different goals would require different, and possibly incompatible,
answers
•
The lack of a concept of relevance: TREC requires a single correct answer, not a set
of more or less relevant answers; even when the TREC evaluation required
participants to provide five answers, this was not a case of relevance, but simply
giving systems five opportunities to find a single correct answer: the evaluation was
not concerned with whether more than one answer could be given to a question and only
considered one “correct” answer, even when it allowed systems to return more than one
answer.
Note that this does not mean that systems do not use some form of semantic or goal-directed
relevance in their processing: most systems for example use a form of semantic relevance,
through the use of an information retrieval engine, to narrow down the search for documents
containing an answer; from the point of view of the evaluation this is however irrelevant, as
evaluation is based solely on the observation of exact answers, within a text snippet or in
isolation: semantically relevant sentences are ignored and only one logically relevant answer
is considered when judging systems’ performance.
6.2
Overcoming the limitations of TREC-style QA systems
In chapter 3 we examined the limits of the TREC framework and summarised the unresolved
issues raised by the framework with the following questions:
1. What makes a good answer?
2. Is it possible for more than one answer to be a “good” answer?
3. What is automated question answering trying to achieve?
4. What are the constraints under which automated question answering systems operate?
5. How do issues such as questioner models, questioner goals, answerer goals, background
knowledge and question context fit into question answering?
Taking the concept of relevance as fundamental to question answering, we can now
provide the following solutions:
1. A good answer is an answer which is relevant. In question answering we cannot simply
take “good” to mean correct or incorrect, true or false, but we must recognise that for
each question there are a number of answers which may be considered more or less
interesting or helpful but all of which are nevertheless “good” answers to different
degrees.
2. More than one answer can be “good”, given that more than one answer can be relevant to
a question; we should not therefore talk about “good” answers as opposed to “bad”
answers, but of answers which are relevant to varying degrees.
3. Automated question answering aims to provide a set of relevant answers to questions, not
simply a single, “correct” answer.
4. Automated question answering systems constrain answers by looking at their relevance
from various points of view: semantic, goal-directed, logical and morphic relevance.
5. The different categories of relevance ensure that user and system (questioner and
answerer) models, goals and background knowledge are taken into consideration when
answering a question; in Chapter 5 we shall clarify this aspect further, explicitly showing
how we need to refer to questioner models and answerer models. Furthermore we shall
ensure that question context is taken into consideration by explicitly referring to
previously asked questions and answers.
In the following chapter we shall clarify these points more rigorously by applying the notion
of relevance and relevance categories to define a formal theoretical framework which
explicitly and unambiguously sets out the objective of automated question answering, the
constraints under which automated question answering systems operate and how questioner
models, questioner goals, answerer goals, background knowledge and question context fit into
question answering.
7
Conclusion
We have shown how the concept of relevance clarifies the relationship between question and
answer: answers are not simply correct or incorrect, right or wrong, good or bad, but should
be seen on a sliding scale of interest and usefulness (“worthiness”). This interest and
usefulness is determined by four relevance categories (semantic, goal-directed, logical and
morphic relevance), which determine the conditions an answer must meet to be a relevant
answer.
In the following chapter we shall show how this concept of relevance can be used within a
semiformal theory of automated QA. The relevance categories give the conditions an answer
must meet to be relevant, but to meet these conditions, a QA system must take into
consideration a number of constraints such as questioner and answerer knowledge and goals:
we shall show how these other constraints can be reconciled with the proposed concept of
relevance.
Chapter 5
A Relevance-Based Theoretical Framework for
Automated QA
Executive Summary
A semiformal theoretical framework for automated QA is set out, implementing the relevance
theory previously introduced. The framework sets out the constraints which will determine
how a question answering system will be implemented to be able to give a relevant answer to
a question thus clarifying exactly what a question answering system is trying to achieve and
the limits within which it must operate.
1
Introduction
We have shown above (chapters 2 and 4) that, from a theoretical point of view, it is necessary
to counter the extreme philosophical view, as set out for example by Jacques Derrida, that any
answer can be provided as a valid response to any question: Gadamer’s idea of prejudice gave
a first indication of how the set of suitable answers to a question should be limited by some
prior rules (pre-judgement) on the part of both the answerer and the questioner; Eco’s idea of
limits on interpretation also gave an indication of how the set of possible answers to a
question should be limited by a number of interpretation rules. The ideas of Gadamer and Eco
were not however considered to be appropriate for a practical application, and in the previous
chapter we examined relevance theory as an area of research which could provide a more
practically applicable theory of the limitations of the interpretation of questions and answers.
From this we introduced a theory of automated question answering based on the concept of
relevance, showing that to talk about an answer to a question is to speak about a relevant
answer to a question. In other words, we showed the conditions of answerhood, i.e. what
makes one answer more relevant than another. Here we build a metatheory which
incorporates the given relevance theory by presenting a high-level formal framework for
automated question answering (strictly speaking, from the point of view of mathematical logic,
what we present is a semi-formal framework: we do not define such things as sets and symbols
and instead, following common practice in computer science, assume them as given); the
metatheory will show how a generic model for a question
answering system can be built which incorporates the concept of relevance previously
defined.
2
General Framework
From the discussion in chapters 2, 3 and 4, it follows that we want a framework that:
1. Implements the notion of relevance, i.e. the idea that there can be more than one
answer to a question (see chapter 4, paragraph 2.2): a question answering system must
provide an ordered set of relevant answers
2. Recognises that an overall relevance judgement is based on the components of
relevance (see chapter 4, paragraph 5): semantic, goal-directed, logical and morphic
relevance: the ordered set of relevant answers will be determined by an overall
relevance judgement based on semantic, goal-directed, logical and morphic
relevance
3. Explicitly recognises that the prejudices of the answerer constrain what is considered
an answer to a question (see chapter 2, paragraph 8.3 and 10): the relevance
judgements made will depend on the prejudices of the answerer
4. From points 2 and 3 above it follows that prejudices must be associated with each
component of relevance; prejudices will be specific for semantic, goal-directed, logical and morphic relevance: the prejudices will be made up of
background knowledge (semantic relevance), goals (goal-directed relevance),
inference mechanisms and knowledge (logical relevance), answer form preferences
(morphic relevance)
5. From the discussion in chapter 3 (paragraph 5.2) it emerged that we need to explicitly
recognise the existence of both a questioner and an answerer model: the answerer
prejudices will refer both to an answerer model and a questioner model
6. From the discussion in chapter 3 (paragraph 5.2) it was noted that the question
answering process does not occur in isolation, and that an answer may also, for
example, depend on a wider interaction which has taken place between questioner
and answerer: prejudices will also be made up of context, such as previously asked
questions and previously given answers. In a more futuristic vision, context could
also be taken to be, for example, hand gestures accompanying a question.
7. Explicitly recognises that knowledge of an answer may come from both internal and
external sources: in chapters 2 and 3 we saw that current open-domain question
answering systems do not rely solely on in-built knowledge, but seek an answer in an
external set of documents: the answer will also depend on an external source of
knowledge, for example a set of documents
In the following sections we shall address these points in order to construct a satisfactory
theoretical framework.
3
Definition of a Question Answering system
We have:
•
From section 2, point 1, a question answering system can be taken to be a function
which returns an ordered set A of relevant answers to a question.
•
From section 2, point 3, we have that A will be constrained by the answerer’s
prejudices P.
•
From section 2, point 7, it follows that A will be also constrained by a set of
documents D which can be consulted to find an answer.
From this we give the following definition of a question answering system:
Given a questioner with an information need expressed by an utterance q (the
question) in the context of a set of documents D and the system’s prejudices
P, a question answering system is a function which returns an ordered set A
of answers derived from D which are relevant to q.
Formally, therefore, a question answering system is a function
qa( q, D, P ) = A
where
•
q is an utterance representing an information need. This utterance would usually be a
question (e.g. “Who was Caesar?”), but could well be another expression of an
information need such as a command (“Tell me who killed Caesar”), a wish (e.g. “I
would like to know why Gaul was divided into three parts”), or a statement (“I need
to know when Caesar crossed the Rubicon”). Note that restricting q to an utterance
representing an information need is narrowing the problem to that of modelling a
question answering system, as opposed to a conversational system.
•
P is a non-empty set representing the answerer’s prejudices. As seen in Chapter 2 in
the discussion on Hermeneutics, it is not possible to answer (or even to ask) a
question without a pre-judgement about what the answer will be. The answer given
by a question answering system will therefore depend on the answerer’s prejudices,
and will change with different prejudices.
•
D is a non-empty set representing the documents in which an answer is to be sought.
By specifying a set of documents as opposed to a knowledge base (a set of documents
could be seen as a very loosely structured knowledge base), we distinguish Question
Answering systems from expert systems which answer questions from a given
structured knowledge base, and hence find ourselves following the line of enquiry
started by the TREC QA framework, where answers are to be sought in generic
natural-language documents such as newspaper articles. From this point of view,
expert systems can be considered a particular type of question answering system
where either the set of documents is highly structured (in the limit they are written in
a logical form which can be immediately manipulated by a machine) or the set of
documents is empty and answers are to be sought solely within the answerer’s
prejudices (i.e. an internal knowledge base).
•
A is an ordered set of relevant answers. The need for an ordered set was set out
above: it has been shown that it is incorrect to talk about good or bad answers, as
answers should be considered in their relevance to a question; but some answers are
more relevant than others to a question, hence the ordering. In particular, a_m ≺ a_n
indicates that a_m is more relevant to q than a_n in the context of P and D. As
noted above, restricting A to a set of relevant answers is narrowing the problem to
that of modelling a question answering system, as opposed to a conversational
system, which would also be able to respond to a question with another question.
A consequence of this framework is that in an implemented QA system the user or operator
could choose to change the document collection during a session. Moreover, assuming a
dynamic set of prejudices, as the session goes on a different answer could be given to the
same question, as the answerer’s prejudices change. Lastly, we have a function because we
want one and the same answer set for the same question, documents and prejudices: the system is a deterministic machine.
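As a minimal sketch of the definition just given, the following Python fragment treats qa( q, D, P ) as a deterministic function from a question, a document collection and a set of prejudices to an ordered list of answers; the representation of the prejudices as a dictionary carrying an overall relevance scorer, and the use of sentences as candidate answers, are assumptions made only to keep the illustration short.

def qa(q: str, documents: list, prejudices: dict) -> list:
    # Deterministic function from a question q, a document set D and prejudices P
    # to an ordered set A of relevant answers (most relevant first).
    # Candidate answers: here, crudely, every sentence in the document collection.
    candidates = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    # The overall relevance judgement (standing in for RL, RL' and CR) is assumed
    # to be supplied as part of the prejudices.
    score = prejudices["overall_relevance"]
    # a_m precedes a_n in A whenever a_m is judged more relevant to q.
    return sorted(candidates, key=lambda a: score(q, a, prejudices), reverse=True)

# Example use with a trivial word-overlap scorer standing in for the relevance logic:
docs = ["Johann Sebastian Bach was born in Eisenach. Bach adapted concertos by Vivaldi."]
p = {"overall_relevance":
     lambda q, a, p: len(set(q.lower().split()) & set(a.lower().split()))}
print(qa("Where was Johann Sebastian Bach born?", docs, p))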
4
Determining an answer
The following will describe how the function qa will determine the set A of relevant answers.
The set of possible answers to a generic question is constrained by the set of documents
answers are to be sought in and the prejudices peculiar to the answerer. An answer to a
question q is an element a of the set A representing all the relevant answers to q. The
elements of set A are in turn a subset of the set PA of possible answers to q.
The set of possible answers PA is determined by the documents D and the answerer’s
prejudices P: the answerer is not omniscient and cannot give any possible answer. In other
words:
PA = f(D, P)
The set PA could well be infinite: a question answering system will therefore only consider a
subset RA of all these answers, which we shall call the set of relevant answers:
RA ⊂ PA
From paragraph 2, point 1, above, a question answering system is not seeking a single answer,
but the ordered set A, which is made up of the elements of RA and represents a ranking of
relevant answers to q. In other words:
A = ( RA, ≺)
But we must have a mechanism to assess the relative ranking of all the answers in RA: we
shall call such a mechanism relevance logic. From paragraph 2, point 5, above, we need to
take into consideration the way in which both questioner and answerer will rank the answer:
in other words A will be determined according to relevance logics RL and RL’. But we also
need a conflict resolution process CR to resolve cases where RL and RL’ provide different
rankings. The reason we distinguish between RL and RL’ is that questioner and answerer may
take different approaches to forming an overall relevance judgment, given the individual
relevance assessments from the point of view of the different relevance categories; CR will
resolve any conflict by setting priorities or seeking compromises, for example ignoring the
questioner’s approach to making an overall relevance judgment.
From paragraph 2, point 2, the framework must explicitly show that relevance is a complex
concept made up of a number of different categories: in order to determine A, RL will have to
take into account the different relevance categories identified in the previous chapter through
the use of the mechanisms S-CR, G-CR, L-CR, M-CR and O-CR, which provide a means of
assessing semantic, goal-directed, logical, morphic and overall relevance. One approach is to
consider S-CR, G-CR, L-CR, M-CR and O-CR as algorithms with which to construct the
ordered sets S-ANSWER, G-ANSWER, L-ANSWER, M-ANSWER and A of relevant
answers in accordance with the relevance categories identified. These sets will be constructed
by a number of functions which take as input the question to be answered, the documents
where the answers are to be found and a subset of the answerer’s and questioner’s prejudices.
We will have a function
semantic-relevance( q, PA, K, K’ ) = S-ANSWER
which, under the constraints of a question q, possible answers PA, the answerer’s prior
knowledge K and the questioner’s prior knowledge K’, will return an ordered set S-ANSWER
of answers which are semantically relevant to q, using the conflict resolution process CR to
resolve any conflicts there may be between the questioner’s and answerer’s judgements of
relevance from a semantic point of view.
In a similar manner we will have a function
goal-directed-relevance( q, PA, G, G’ ) = G-ANSWER
which, under the constraints of a question q, possible answers PA, the goals associated with q
from the answerer’s point of view G and the questioner’s point of view G’, will return an
ordered set G-ANSWER of answers which are relevant to q from a goal-directed point of
view.
We will have a function
logical-relevance( q, PA, L, L’ ) = L-ANSWER
which, under the constraints of a question q, possible answers PA, the answerer’s inference
mechanisms L and the questioner’s inference mechanisms L’, will return an ordered set L-ANSWER of answers which are logically relevant to q.
We will have a function
morphic-relevance( q, PA, M, M’ ) = M-ANSWER
which, under the constraints of a question q, possible answers PA, the answerer’s answer
form (morphic) preferences M and the questioner’s answer form (morphic) preferences M’,
will return an ordered set M-ANSWER of answers which are morphically relevant to q.
Given the above functions we are looking for a function relevant-answer which gives an
overall relevance judgement: relevant-answer takes as input q, the sets of relevant answers
determined according to the different relevance categories, relevance logics RL and RL’ and a
conflict resolution process CR (which will determine how the various components are
balanced against one another) and gives as output the ordered set A of answers relevant to q.
In other words:
relevant-answer( q, S-ANSWER, G-ANSWER, L-ANSWER, M-ANSWER, RL, RL’, CR ) = A
The function relevant-answer will take into account the relevance of an answer to the
question q from the point of view of semantic, goal-directed, logical and morphic relevance
and give an overall judgment on the relevance of the answer to the question: the various
relevance categories may therefore be given different weightings of importance, or indeed
ignored.
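The shape of this computation can be illustrated with the following Python sketch. The reduction of each relevance judgement to a numeric score, the averaging used in place of the per-category conflict resolution, and the weighted rank aggregation used in place of RL, RL’ and CR are all assumptions made only for illustration; they are not the mechanisms developed in the implementation chapters.

from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]   # (question, candidate answer) -> relevance score

def rank(q: str, pa: List[str], answerer: Scorer, questioner: Scorer,
         resolve: Callable[[float, float], float]) -> List[str]:
    # Order the possible answers PA using both the answerer's and the questioner's
    # judgements, with a simple conflict resolution step combining the two scores.
    return sorted(pa, key=lambda a: resolve(answerer(q, a), questioner(q, a)), reverse=True)

def semantic_relevance(q: str, pa: List[str], k: Scorer, k_prime: Scorer) -> List[str]:
    # semantic-relevance( q, PA, K, K' ) = S-ANSWER; goal-directed-relevance,
    # logical-relevance and morphic-relevance can be rendered in exactly the same
    # way, with G/G', L/L' and M/M' supplying the scoring judgements.
    s_cr = lambda x, y: (x + y) / 2      # stands in for the per-category conflict resolution
    return rank(q, pa, k, k_prime, s_cr)

def relevant_answer(q: str, ranked: Dict[str, List[str]],
                    weights: Dict[str, float]) -> List[str]:
    # relevant-answer: form the overall ordering A from the per-category orderings
    # (S-ANSWER, G-ANSWER, L-ANSWER, M-ANSWER) by weighted rank aggregation;
    # the weights stand in for the balancing performed by RL, RL' and CR.
    # (q is kept only to mirror the signature given in the text.)
    candidates = next(iter(ranked.values()))
    def overall(a: str) -> float:
        # A candidate scores more the nearer the top it sits in each category's ordering.
        return sum(weights.get(cat, 1.0) * (len(order) - order.index(a))
                   for cat, order in ranked.items())
    return sorted(candidates, key=overall, reverse=True)

# Example: the logical ordering is weighted more heavily than the semantic one.
orders = {"semantic": ["answer B", "answer A"], "logical": ["answer A", "answer B"]}
print(relevant_answer("q?", orders, {"semantic": 1.0, "logical": 2.0}))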
It is worth noting that it may be argued that CR determines the final goal (in Aristotelian
terms) of the answerer, differently from RL and G, which are partial goals, possibly in
conflict with RL’ and G’; CR could also be seen as a meta-goal of the system as it is on a
different conceptual level from G and RL: it represents the goal of the designer or the sponsor
of the question answering system, as opposed to the goals of the question answering system
itself. As an example, a question answering system for a bank may have as goal to advertise
new products, whereas customers using the system may have as goal to find out about interest
rates or to find financial advice. The designers of the system may wish the system to react in
different ways to customers with different goals, at times making the system’s goal of
advertising take precedence and at other times allowing the customer’s goals to take first
place: it is useful from a conceptual point of view to distinguish in this case between the fixed
goals of the system (to advertise) and the changing goals of the system’s sponsors (at times to
advertise, at times to satisfy customers’ information needs).
5
Answerer Prejudices
As noted in paragraph 2, point 3, an answerer cannot give an answer without prejudice (in the
Gadamerian sense). From paragraph 2, point 6, we can say that, in the case of QA systems,
the set of prejudices P is made up of the question context (e.g. previously asked and answered
questions), and from paragraph 2, point 5, a questioner model and an answerer model;
formally,
P = {QC, QM, AM}
where:
•
QC is the context in which a question is uttered: where, as context, we take the
previously asked questions and given answers, QC is a finite, possibly empty,
ordered set of ordered pairs ((q_1, a_1), ..., (q_n, a_n)), where q_i represents a previous
question (posed by the current questioner) and a_i represents the answer given to that
question. The ordering of the set represents the temporal order in which the utterances
were made. Note that by doing this we have limited our problem to question
answering systems: a system which also engaged the user in conversation would be
able to pose questions and would need to understand answers, hence would be a
conversational system; conversational systems would also allow pairs of the form
(q_n, q'_n), (a_n, q'_n), (a_n, a'_n). Examining conversational systems is beyond the scope of this
work as it adds a substantial complication in that it must investigate question
formulation and the relevance relationship between the system’s questions and the
user’s answers, a different problem from the examined relevance relationship
between the user’s questions and the system’s answers.
•
QM is a model representing the questioner. The importance of considering notions
such as questioner needs and desires was highlighted in the previous chapter on
relevance.
•
AM is a model representing the answerer (i.e. the question answering system itself).
While most models of question answering have considered attributes such as
questioner needs and goals, the discussion in the previous chapter highlighted the
importance of modelling the answerer, as characteristics such as the goals of the
answerer (the system itself) also influence what answer is given.
The reason for including question history is that in a system with the notion of session this
may influence both questioner and answerer model: it may, for example, become apparent to
the system that the questioner has different needs from those initially assumed, or that
the questioner is a different type of user from what was initially thought, e.g. a high school
student as opposed to a primary school pupil, which may in turn influence the answerer
model.
We saw in the previous chapter that to answer a question is to seek a relevant answer, and that
relevance could be seen as being made up of semantic, goal-directed, logical and morphic
relevance. The questioner and answerer models will therefore need to be made up of elements
which are sufficient to ensure these relevance conditions are met. In particular it will be
necessary to consider prior knowledge, in particular as regards semantic relationships for
semantic relevance, goals or desires for goal-directed relevance, prior knowledge and
inference mechanisms for logical relevance, preferences for morphic relevance; the answerer
will also need a mechanism for determining overall relevance given the different relevance
categories, and a way of resolving any conflicts between the needs of the questioner and the
needs of the answerer. We therefore have:
QM = {K’, G’, L’, M’, RL’ }
AM = {K, G, L, M, RL, CR }
Where
•
K and K’ represent prior knowledge
•
G and G’ represent goals associated with questions and answers
•
L and L’ represent inference mechanisms (logics) that link answers to questions.
These need not necessarily be formal logics, but may also be in the form of rhetorical
argumentation or persuasion. We need to take into consideration the questioner’s
inference mechanisms as the questioner may not believe the answerer’s logic. On the
other hand the answerer may have a better understanding of logic than the questioner
and should not necessarily be limited by the questioner’s limitations; we therefore need a
means of resolving any potential conflict.
•
M and M’ represent preferences for the form (morphe) an answer should take.
•
RL and RL’ represent a mechanism (logic) that enables the formulation of an overall
relevance judgement, given the relevance of an answer in relationship to the different
relevance categories. Again questioner and answerer may approach this in different
manners.
•
CR is a conflict resolution mechanism, as RL and RL’ may be contradictory:
questioner and answerer goals and preferences may be incompatible; the questioner
may not agree with the answerer’s inference mechanisms; the questioner’s prior
knowledge may be in conflict with the answerer’s prior knowledge; in all these cases
there must be a mechanism to decide which takes priority when seeking a relevant
answer.
The questioner model QM may be constructed in a variety of ways, which we shall examine in
detail in the chapter on the implementation of goal-directed relevance; one way it may be
built is through the previously asked question-answer pairs QC: if a questioner interacts with
the QA system over a period of time, asking more than a single question, the questioner
model may change as new information about the questioner is gathered. Other approaches
could be the use of stereotypes or the use of implicit models.
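A heavily simplified sketch of the first of these possibilities is given below (Python); the idea of treating recurring content words in past questions as a proxy for the questioner’s presumed interests is an assumption made purely for illustration, not the mechanism developed in the later chapters.

from collections import Counter

def update_questioner_model(qm: dict, qc: list) -> dict:
    # qc is the ordered list of previously asked (question, answer) pairs;
    # qm is the current questioner model, here a plain dictionary.
    topics = Counter(w.lower() for question, _ in qc
                     for w in question.split() if len(w) > 3)
    revised = dict(qm)
    revised["presumed_interests"] = [w for w, _ in topics.most_common(5)]
    return revised

# Example: after two questions about Bach, "bach" surfaces as a presumed interest.
history = [("Where was Johann Sebastian Bach born?", "Eisenach"),
           ("When did Bach die?", "1750")]
print(update_questioner_model({}, history))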
RL and RL’ will be made up of mechanisms (logics) to provide relevance judgements from
the point of view of the different relevance categories: we will therefore have different
mechanisms for semantic relevance, goal-directed relevance, logical relevance and morphic
relevance; in addition we will have a mechanism that enables the formulation of an overall
relevance judgement. RL and RL’ will be made up as follows:
RL = { S-RL, G-RL, L-RL, M-RL, O-RL }
where
•
S-RL represents a mechanism for determining semantic relevance.
•
G-RL represents a mechanism for determining goal-directed relevance.
•
L-RL represents a mechanism for determining logical relevance.
•
M-RL represents a mechanism for determining morphic relevance.
•
O-RL represents a mechanism for determining overall relevance.
As is the case for RL, CR will have to take into consideration the various aspects of relevance
as well as the overall relevance judgement. We therefore have:
CR = { S-CR, G-CR, L-CR, M-CR, O-CR }
where
•
S-CR represents a mechanism for resolving conflicts between questioner and answerer judgements of semantic relevance.
•
G-CR represents a mechanism for resolving conflicts between questioner and answerer judgements of goal-directed relevance.
•
L-CR represents a mechanism for resolving conflicts between questioner and answerer judgements of logical relevance.
•
M-CR represents a mechanism for resolving conflicts between questioner and answerer judgements of morphic relevance.
•
O-CR represents a mechanism for resolving conflicts between questioner and answerer judgements of overall relevance.
The mechanisms for determining relevance (relevance logic) will have to make use of
information given by prior knowledge, goals, preferences and inference logics, as follows.
Prior knowledge will be made up of dictionary knowledge, encyclopaedic knowledge and
question frames (or scripts in the terminology of Schank and Dyer). The sets K and K’ will
therefore be made up as follows:
K = { DK, EK, QF }
Where
•
DK represents dictionary knowledge such as definitions
•
EK represents encyclopaedic knowledge which links terms in a rich and varied
manner
•
QF represents a number of question frames which represent knowledge about the
context within which certain questions are asked
Given that both questioner and answerer may have a number of different goals which may be
conflicting, the sets G and G’ will be made up as follows:
G = { GH, GCR }
where
•
GH is a hierarchy of goals representing priorities and subgoals
•
GCR is a set of rules which decide what should be done should there be a conflict
between goals
In the same vein, questioner and answerer may have a number of different ways of reasoning
or drawing inferences about answers (logics) which may be conflicting. The sets L and L’
will be made up as follows:
L = { LH, LCR }
where:
•
LH is a hierarchy of logics which may be used for finding a logical connection
between question and answer, representing factors such as priorities and confidence
•
LCR is a set of rules which decide what should be done should there be a conflict
between the different logics
Again, questioner and answerer may have a number of different preferences as regards the
form an answer should take, which may be conflicting. The sets M and M’ will be made up as
follows:
M = { MH, MCR }
where:
•
MH is a hierarchy of preferences for the form (morphe) of an answer, representing
factors such as priorities
•
MCR is a set of rules which decide what should be done should there be a conflict
between the different preferences
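The decomposition just given can be summarised, purely as an illustrative sketch, by the following Python data containers, which mirror the sets P = {QC, QM, AM}, K = {DK, EK, QF}, G = {GH, GCR}, L = {LH, LCR}, M = {MH, MCR}, RL = {S-RL, ..., O-RL} and CR = {S-CR, ..., O-CR} defined above; the concrete field types are assumptions made only for the sketch.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class PriorKnowledge:                    # K / K'
    dictionary: Dict[str, str] = field(default_factory=dict)             # DK
    encyclopaedia: Dict[str, List[str]] = field(default_factory=dict)    # EK
    question_frames: List[dict] = field(default_factory=list)            # QF

@dataclass
class Goals:                             # G / G'
    hierarchy: List[str] = field(default_factory=list)                   # GH
    conflict_rules: List[Callable] = field(default_factory=list)         # GCR

@dataclass
class InferenceLogics:                   # L / L'
    hierarchy: List[Callable] = field(default_factory=list)              # LH
    conflict_rules: List[Callable] = field(default_factory=list)         # LCR

@dataclass
class FormPreferences:                   # M / M'
    hierarchy: List[str] = field(default_factory=list)                   # MH
    conflict_rules: List[Callable] = field(default_factory=list)         # MCR

@dataclass
class RelevanceLogic:                    # RL / RL': one mechanism per relevance category
    semantic: Optional[Callable] = None          # S-RL
    goal_directed: Optional[Callable] = None     # G-RL
    logical: Optional[Callable] = None           # L-RL
    morphic: Optional[Callable] = None           # M-RL
    overall: Optional[Callable] = None           # O-RL

@dataclass
class ConflictResolution:                # CR: S-CR, G-CR, L-CR, M-CR, O-CR
    semantic: Optional[Callable] = None
    goal_directed: Optional[Callable] = None
    logical: Optional[Callable] = None
    morphic: Optional[Callable] = None
    overall: Optional[Callable] = None

@dataclass
class QuestionerModel:                   # QM
    knowledge: PriorKnowledge = field(default_factory=PriorKnowledge)
    goals: Goals = field(default_factory=Goals)
    logics: InferenceLogics = field(default_factory=InferenceLogics)
    preferences: FormPreferences = field(default_factory=FormPreferences)
    relevance_logic: RelevanceLogic = field(default_factory=RelevanceLogic)

@dataclass
class AnswererModel(QuestionerModel):    # AM: as QM, plus the conflict resolution logic
    conflict_resolution: ConflictResolution = field(default_factory=ConflictResolution)

@dataclass
class Prejudices:                        # P = {QC, QM, AM}
    question_context: List[Tuple[str, str]] = field(default_factory=list)   # QC
    questioner_model: QuestionerModel = field(default_factory=QuestionerModel)
    answerer_model: AnswererModel = field(default_factory=AnswererModel)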
Prejudice =
[ Questioner Model
    [ Prior Knowledge    [ Dictionary Knowledge, Encyclopaedic Knowledge, Frames ]
      Goals              [ Goal Hierarchy, Goal Conflict Resolution Process ]
      Inference Logics   [ Logic Hierarchy, Logic Conflict Resolution Process ]
      Form Preferences   [ Preference Hierarchy, Preference Conflict Resolution Process ]
      Relevance Logic    [ Semantic Relevance, Goal-directed Relevance, Logical Relevance,
                           Morphic Relevance, Overall Relevance ] ]
  Answerer Model
    [ Prior Knowledge    [ Dictionary Knowledge, Encyclopaedic Knowledge, Frames ]
      Goals              [ Goal Hierarchy, Goal Conflict Resolution Process ]
      Inference Logics   [ Logic Hierarchy, Logic Conflict Resolution Process ]
      Form Preferences   [ Preference Hierarchy, Preference Conflict Resolution Process ]
      Relevance Logic    [ Semantic Relevance, Goal-directed Relevance, Logical Relevance,
                           Morphic Relevance, Overall Relevance ]
      Conflict Resolution Logic
                         [ Semantic Relevance, Goal-directed Relevance, Logical Relevance,
                           Morphic Relevance, Overall Relevance ] ]
  Question Context [ Previous Question-Answer Pairs ] ]
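The structure above can be written down directly as a nested data structure. The Python sketch below is purely illustrative: the field names follow the figure, but the representation of each component as a list of strings is an assumption made only for the sake of the example.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class PriorKnowledge:
        dictionary_knowledge: List[str] = field(default_factory=list)     # DK
        encyclopaedic_knowledge: List[str] = field(default_factory=list)  # EK
        frames: List[str] = field(default_factory=list)                   # QF

    @dataclass
    class AgentModel:
        prior_knowledge: PriorKnowledge
        goal_hierarchy: List[str]             # GH: goals ordered by priority
        goal_conflict_rules: List[str]        # GCR
        logic_hierarchy: List[str]            # LH
        logic_conflict_rules: List[str]       # LCR
        preference_hierarchy: List[str]       # MH: preferred answer forms
        preference_conflict_rules: List[str]  # MCR
        relevance_logic: List[str]            # semantic, goal-directed, logical, morphic, overall

    @dataclass
    class Prejudice:
        questioner_model: AgentModel
        answerer_model: AgentModel
        conflict_resolution_logic: List[str]     # S-CR, G-CR, L-CR, M-CR, O-CR
        question_context: List[Tuple[str, str]]  # previous question-answer pairs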
6
Advantages
We saw above in chapters 2 and 3 that the aims of TREC-style QA systems were far from
clear and that consequently it was difficult to define exactly what these systems were trying to
achieve, how they could be evaluated and how they could be improved. This framework
solves the issues identified with TREC-style (and similarly inspired) QA systems by
clarifying the problem and explicitly showing what question answering is trying to achieve
(the outputs of a question answering system) and the component parts of question answering
systems (the various relevance judgements and prejudices); in particular, we have met the
requirements set out in paragraph 2 above by:
1. Employing the concept of relevance to show how question answering systems should
be seeking a set of relevant answers to a question, not a single “precise” answer:
answers to a question should be considered more or less appropriate in relation to a
question, not simply right or wrong.
2. Examining how an overall relevance judgement about answers should be reached by
making reference to the component parts of relevance: semantic, goal-directed,
logical and morphic relevance.
3. Explicitly recognising the prejudices which influence the formation of the set of
relevant answers, or, in other words, the constraints which define how a QA system
will go about finding an answer set.
4. Analysing the prejudices which constrain what are considered relevant answers,
showing that they are made up of the necessary constituents to provide a judgement
about semantic, goal-directed, logical and morphic relevance: in particular,
background knowledge will constrain semantic relevance judgement, goals will
constrain goal-directed relevance, inference mechanisms and knowledge will
constrain logical relevance and answer form preferences will constrain morphic
relevance.
5. Further analysing the prejudices, showing that they can be grouped into a questioner
model and an answerer model, showing how both the questioner and the answerer
model will influence the way relevance judgements are made, and making explicit
reference to a conflict resolution process which will resolve contradictions
(differences in beliefs, aims, ways of reasoning or preferences) between questioner
and answerer models.
6. Recognising that questions do not always occur in isolation and that, if they occur in a
context of previously asked questions and provided answers, this context will be part
of the answerer prejudices which constrain an answer.
7. Openly separating the question, prejudices and the set of documents from which
answers are to be gathered, thus clarifying the problem setting.
Having clarified the problem setting we are now in a position to attempt to improve current
systems by applying the given framework. In chapters 6-11 we shall show how the framework
can be implemented to improve a standard QA system such as YorkQA. Having shown that
the theory may be applied in practice we will then provide, in chapter 11, an evaluation
framework, based on the theory, which will enable us to compare systems in a rigorous and
clear manner.
7
Conclusion
We have introduced a theoretical framework which encompasses the relevance theory
described in the previous chapter, outlining the overall shape of a question answering system.
From this framework we can derive a number of implementations which may or may not
realize all of the characteristics noted above: a question answering system, for example, may
have no concept of user model (a TREC-style system, for example) or may have only a
rudimentary concept of semantic or goal-directed relevance. We shall show in the next
chapters an implementation of the framework which puts into practice the different notions of
relevance identified, demonstrating that this framework can be used to improve the YorkQA
question answering system through the use of mechanisms for explicitly establishing
semantic, goal-directed, logical and morphic relevance.
Chapter 6
Semantic Relevance
Executive Summary
We present an implementation of semantic relevance for question answering. An algorithm is given to
calculate the semantic relevance of an answer sentence in relation to a given question. A number of
features of the algorithm are analysed to determine which best contribute towards calculating semantic
relevance and an in-depth evaluation of these features and their usefulness is given.
1
Introduction
We saw above how relevance is made up of a number of relevance categories, including
semantic relevance. We shall now show how semantic relevance can be implemented in a QA
system and provide an evaluation of such an implementation, demonstrating that even on its
own (i.e. without the other components of relevance) it can be used to improve the
performance of a QA system.
The idea of semantic relevance is the notion that two sentences can be deemed to be relevant
in respect to each other in virtue of the meanings of the words contained in each sentence, as
exemplified by the following examples.
Given the question
Q1: What are we having for dinner?
and the possible answers
A11: Spaghetti
A12: Strict counterpoint.
A listener would immediately take A11 to be a reasonable answer (independently of its truth or
its completeness) as the meaning of the word dinner is more related to the word “spaghetti”
(one eats at dinner, and spaghetti is something to eat) than to the word “Strict counterpoint”.
In other words, A11 would be considered more semantically relevant than A12 in respect to Q1.
On the other hand, at times a semantically relevant answer may not be a full answer to a
question but may reveal some interesting information. Given the question
Q2: Where can I buy a new battery?
and the possible answers
A21: Peter’s shop sells all sorts of useful items.
A22: Leaving a battery in the sun for a while can partially recharge it.
A23: Christina Rossetti wrote a poem about Goblins.
A24: At Frank’s hardware store.
while A23 would be considered irrelevant to the question (there is no meaning associated with
A23 which could connect it to the question) and A24 would be considered a satisfactory answer
(as it gives the questioner all the information required to find somewhere to buy a new
battery), A21 and A22 also contain possibly interesting information (if a shop sells useful items
it may sell batteries; if the questioner can’t get a new battery a temporary solution may be to
attempt to recharge it); if the answer A24 was not available, the questioner would probably
prefer A21 or A22 rather than no answer. But A21 and A22 are clearly semantically related to the
question: the meaning of “buy” refers to shops, selling and items for sale in A21, while the
meaning of battery is closely related to battery itself (in a rather obvious fashion) but also to
the idea of recharging in A22. Even on its own (without considering for example issues such as
user goals) semantic relevance can prove to be a useful indicator for finding “interesting”
answers.
What follows is an implementation of semantic relevance which purposefully ignores issues
such as user goals (which shall be instead examined in the following chapters which
implement the other aspects of relevance) and is instead concerned with being able to give a
precise indication of the semantic relevance of an answer to a question.
2
Previous Work
To talk about semantic relevance is to talk about the semantic similarity between sentences.
WordNet (Miller 1995; Fellbaum 1998), a lexical database which organizes words into
synsets, sets of synonymous words, and specifies a number of relationships such as
hypernym, synonym, meronym which can exist between the synsets in the lexicon, has been
shown to be fruitful in the calculation of semantic similarity between sentences. One approach
has been to determine similarity by calculating the length of the path of relations connecting
the words which constitute sentences (see for example Green 1997); different approaches
have been proposed (for an evaluation see Budanitsky and Hirst 2001), either using all
WordNet relations (Hirst and St-Onge 1998) or only is-a relations (Resnik 1995; Jiang and Conrath 1997; Lin 1998). Mihalcea and Moldovan (1999), Harabagiu et al. (1999) and De Boni and Manandhar (2002 and 2003a) found WordNet glosses, considered as micro-contexts, to be useful in determining conceptual similarity. These approaches have however
been primarily concerned with general similarity between sentences, not the specific problem
of using a similarity measure for question answering; Lee et al. (2002) have applied
conceptual similarity to the Question Answering task, giving an answer A a score dependent
on the number of matching terms in A and the question; nevertheless, although specifically
intended for question answering, this measure lacks the sophistication of the approaches
mentioned above, and, simply relying on matching terms, cannot claim full success.
Here we build on these ideas, applying them to the task of calculating semantic relevance for
question answering, and evaluating what new features (for example noun phrase chunking,
part-of-speech tagging or word frequency information) can provide a reliable measure for
semantic relevance.
3
A sentence similarity metric for Question Answering
In order to determine the relevance of a possible answer to a question, a basic semantic
relevance algorithm was devised and then augmented with a number of features in order to
evaluate what additions could improve the measure.
We want to be able to implement the function seen previously
semantic-relevance( q, PA, K, K’)
which returns a set of answers derived from PA, ordered according to their relevance to q,
given the prior knowledge of the answerer K and the prior knowledge of the questioner K’. In
order to simplify the problem we shall take K=K’, from which we construct a function
Semantic_Relevance_Order(q, U, K) = OU
which takes as input a question q, a given set of utterances U, and the knowledge K, and maps
this to an ordered set OU containing all the elements of U ranked in order of relevance.
Compared to the more general function, the implementation combines questioner and
answerer knowledge (and can therefore ignore conflicts and contradictions) and ranks all the
utterances in U, including utterances which are not semantically relevant to q (which will
therefore be given the lowest rank).
In order to rank a set of utterances we can proceed by giving a score to each utterance in
turn and then comparing those scores. We can therefore define a function which maps an
utterance, a question and some knowledge to a score:
Semantic_Relevance_Score(u, q, K) = s
where Semantic_Relevance_Score(u1, q, K) < Semantic_Relevance_Score(u2, q, K) represents the fact that utterance u1 is less relevant than u2 in respect to the question q and the given knowledge K.
But what aspects of each u ∈ U and of q should be considered to determine relevance? We
saw above that we are interested in semantic relevance, so we are interested in the meanings
in u and q. But what features of u are important from this perspective? The function
Semantic_Relevance_Score is determined by
Semantic_Relevance( UF, QF, K) = s
where UF is a set of features of u and QF a set of features of q which are important when
determining relevance. We must now determine the set of features which need to be
considered in order to successfully determine relevance for the task at hand.
3.1
Basic Semantic Relevance Algorithm
In the simplest case, K is the empty set Ø (i.e. there is no background knowledge) and UF and
QF are the sets made up of the words in u and q; in other words:
UF = { uw | uw is a word in u }
QF = { qw | qw is a word in q }
The relevance score s is in this case given by the number of words in common between the two utterances, i.e. the items in common between UF and QF:
s = | UF ∩ QF |
In other words, what we are looking for is simply a word match between u and q.
The next level of complexity considers as relevant features the base form of the words in u
and q which are not in a set of stop-words SW (e.g. “the”, “a”, “to”) which are too common to
be able to be usefully employed to estimate semantic similarity.
UF = { uw | uw is the base form of a word in u ∧ uw ∉ SW }
QF = { qw | qw is the base form of a word in q ∧ qw ∉ SW }
Again, in this case
s = | UF ∩ QF |
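A minimal sketch of this baseline in Python follows; the stop-word list is a small illustrative subset and base forms are approximated by simple lower-casing rather than proper lemmatisation:

    STOP_WORDS = {"the", "a", "an", "to", "of", "is", "are", "and", "who", "what", "for", "we"}

    def features(sentence: str) -> set:
        # UF/QF: the (lower-cased, punctuation-stripped) words of a sentence, minus stop-words.
        words = (w.strip(".,?!;:").lower() for w in sentence.split())
        return {w for w in words if w and w not in STOP_WORDS}

    def word_match_score(u: str, q: str) -> int:
        # s = |UF ∩ QF|: the number of content words shared by utterance and question.
        return len(features(u) & features(q))

    print(word_match_score("We are having spaghetti for dinner tonight",
                           "What are we having for dinner?"))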
3.2
Use of WordNet to calculate similarity
At the next level of complexity, we are no longer simply looking for matching words, but are
using a knowledge base to calculate similarity. In this case, the similarity between an utterance u and a question q is given by

∑1≤p≤n maxm similarity(qwp, uwm, K)

(where qwp is the p-th word of q and uwm the m-th word of u)
where
similarity( w1, w2, K) = m
with 0 ≤ m ≤ 1, representing how semantically related the two words are, given knowledge K; similarity( wi, wj, K) < similarity( wi, wk, K) represents the fact that the word wj is less semantically related than wk in respect to the word wi. In particular similarity = 0 if two words are not at all semantically related and similarity = 1 if the words are the same.
In our implementation, K is taken to be the knowledge provided by WordNet. While a
number of different knowledge bases are available which could also have been used
(examples are Cyc - see Lenat 1995 - and Roget’s thesaurus - see Roget 1991), WordNet was
chosen for the following reasons:
•
It has been widely used in QA research to improve systems’ performance by
providing background knowledge (see for example Harabagiu et al. 2001; Harabagiu
et al. 2002; Moldovan et al. 2003).
•
It explicitly names the type of relations which hold between words (is-a, similar, etc.), which in turn means it is both easy to understand how knowledge is encoded and to correct any errors, omissions or inconsistencies.
•
It is easily extensible and customisable.
•
A version of WordNet is available in the major European languages (EuroWordNet, see Vossen 1997) and a number of authors have worked on versions of WordNet for yet more languages (see for example Petrova and Nikolov 2002; Kahusk and Vider 2002; and Chakrabarti et al. 2002). An algorithm which uses this knowledge base is therefore easily applicable to languages other than English, which means that the semantic relevance algorithm developed could be applied for non-English question answering.
The value of the function similarity is calculated using WordNet as follows:

    similarity(w1, w2, K) returns a number between 0 and 1:
        if w1 = w2 return 1
        else if synonym(w1, w2) return value3
        else if hypernym(w1, w2) return value1
        else if hyponym(w1, w2) return value2
        … // similarly for all WordNet relationships
        else if exists hypernym chain return value10
        else if exists hyponym chain return value11
        else return 0
The words w1 and w2 are compared using all the available WordNet relationships (is-a, satellite, similar, pertains, meronym, entails, etc.), with the additional relationship, “same-as”, which indicates that two words are identical. Each relationship is given a weighting (value1, value2, etc. above) indicating how closely related two words are. The weighting was given
after an analysis of the type of relationships found in WordNet and an analysis of the types of
relationships found between questions and answers in the TREC-8 QA dataset, in order to
gauge the strength of the relationship; different weightings were tried (but no formal
evaluation was carried out, as this was not the main focus of the experiments) with the final
weighting taking the “same as” relationship to be the closest relationship, followed by
synonym relationships, hypernym, hyponym, then satellite, meronym, pertains, entails. In the
limit, s=1 where two words were exactly the same and s=0 where no relationship could be
found between the words.
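A rough, runnable approximation of this scheme, using NLTK's WordNet interface, is sketched below. The specific weights and the subset of relations consulted are illustrative guesses rather than the weightings used here; the intention is only to show the shape of the computation.

    from nltk.corpus import wordnet as wn

    # Illustrative weights: same word > synonym > hypernym/hyponym > other relations.
    W_SAME, W_SYN, W_HYPER, W_HYPO, W_OTHER = 1.0, 0.9, 0.8, 0.7, 0.5

    def word_similarity(w1: str, w2: str) -> float:
        # Return a weight in [0, 1] for the strongest WordNet relation found between w1 and w2.
        if w1 == w2:
            return W_SAME
        s1, s2 = wn.synsets(w1), wn.synsets(w2)
        best = 0.0
        for a in s1:
            for b in s2:
                if a == b:                       # the two words share a synset: synonymy
                    best = max(best, W_SYN)
                elif b in a.hypernyms():         # w2 is a direct hypernym of w1
                    best = max(best, W_HYPER)
                elif b in a.hyponyms():          # w2 is a direct hyponym of w1
                    best = max(best, W_HYPO)
                elif b in a.member_holonyms() + a.part_meronyms():
                    best = max(best, W_OTHER)    # a sample of the weaker relations
        return best

    print(word_similarity("walk", "go"))   # related through a direct hypernym link in WordNet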
In the case of hypernym and hyponym relationships, chains of more than one relation were
allowed if an immediate relation was not found, within a specified limit, defined by
experimentation. So, for example while there is no immediate relationship between “daisy”
and “plant” we could construct the chain “daisy is-a flower” and “flower is-a plant”, thus
discovering the relationship “daisy is-a plant”. In order to avoid the use of misleading
relationships such as “flower is-a thing” (misleading given that “thing”, being a very generic
word, is semantically related to almost every word in any sentence and hence does not give
any useful clues as to its semantic relevance) a limit was put on the number of hypernyms that
could be used to construct a chain, the limit being determined by the vagueness of the terms
used as hypernyms: words such as “thing” and “being” were therefore not allowed to be used
in finding hypernym relationships. A number of experiments were also carried out on the use
of chains for other WordNet relationships and the use of chains using different relationships, a
chain made up, for example of a synonym, an is-a relationship, another synonym and finally a
meronym, as in the following diagram:
Word1 – Word2 (Synonym)
|
Word3 (Hyponym) – Word4 (Meronym)
It was found, however, that, apart from triangular hypernym-hyponym relationships (such as A is-a B and C is-a B, relating A to C through the common hypernym B), complex chains of semantic relationships were detrimental to performance as they gave a number of very misleading (due to their tenuous nature) relationships between words, an effect similar to the “drift of meaning” mentioned in
the discussion of Eco in chapter 2, which allows a reader to make any connection whatsoever
between any two words. This type of misleading association can be seen in the following
example, which shows how a semantic relationship can be found between the words “set” and
“quarterback”:
Set - Collection (Synonym)
|
Team (Hyponym) – Quarterback (Meronym)
Hypernym-hyponym relationships, on the other hand, were usefully employed to find
common hypernyms between words, as illustrated by the following diagram:
Word3 (Hypernym)
|                      |
Word1                  Word4 (Hypernym)
                       |
                       Word5
An example is given by the relationship between the words “paracetamol” and “nicotine”:
Drug (Hypernym)
|                      |
Nicotine               Analgesic (Hypernym)
                       |
                       Non-narcotic Analgesic (Hypernym)
                       |
                       Paracetamol
Again, a limitation was given on the length of the chain created, in order to avoid misleading
relationships such as the relationship between “nurse” and “cat” (nurse is a human being,
which is a living thing, cat is an animal which is a living thing).
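A depth-limited search for a common hypernym of this kind can be approximated with NLTK as sketched below; the depth limit and the exclusion list of over-generic synsets are illustrative choices, not the ones used in the implementation described here.

    from nltk.corpus import wordnet as wn

    TOO_GENERIC = {"entity.n.01", "physical_entity.n.01", "object.n.01", "whole.n.02"}
    MAX_DEPTH = 4   # illustrative limit on the combined length of the two hypernym chains

    def ancestors(synset, max_depth):
        # All hypernyms reachable from synset within max_depth steps, with their distance.
        found = {}
        frontier = [(synset, 0)]
        while frontier:
            s, d = frontier.pop()
            if d >= max_depth:
                continue
            for h in s.hypernyms():
                if h.name() not in TOO_GENERIC and d + 1 < found.get(h, max_depth + 1):
                    found[h] = d + 1
                    frontier.append((h, d + 1))
        return found

    def common_hypernym(word1, word2, max_depth=MAX_DEPTH):
        # Find the closest shared, non-generic hypernym of any noun senses of the two words.
        best = None
        for s1 in wn.synsets(word1, pos=wn.NOUN):
            a1 = ancestors(s1, max_depth)
            a1[s1] = 0
            for s2 in wn.synsets(word2, pos=wn.NOUN):
                a2 = ancestors(s2, max_depth)
                a2[s2] = 0
                for shared in set(a1) & set(a2):
                    dist = a1[shared] + a2[shared]
                    if dist <= max_depth and (best is None or dist < best[1]):
                        best = (shared, dist)
        return best

    print(common_hypernym("cat", "dog"))   # e.g. a shared "carnivore" hypernym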
In using WordNet to find semantic relationships between words, a number of problems arose
due to limitations in the way WordNet was constructed. A major difficulty was the fact that
WordNet is not consistent in the way it handles knowledge, with parts of the knowledge base
being much more fine-grained than others, meaning that calculating the strength of is-a
relationships was not a straightforward task, as, for example, a “chain” of two is-a
relationships (e.g. a Siamese is a Cat and a Cat is a Feline) in one part of the knowledge base
was not equivalent to two is-a relationships in a more (or less) fine-grained part of the
knowledge base. In order to maintain consistency, hypernym relationships were weighted
according to the polysemy of the synsets involved in the relationship: relationships between
words with few or no synonyms (e.g. between “Panadol” and “analgesic”) were considered
stronger than relationships between words with more synonyms (e.g. between “chairman” and
“man”). Another limitation was the presence of errors in the knowledge base, including some
recursive definitions (A is-a B, is-a C, is-a A) which had to be manually corrected.
As a very simple example of the way in which semantic relevance was calculated, given the
question
Who went to the mountains yesterday?
and the possible answer
Fred walked to the big mountain and then to mount Pleasant
initially, QF and UF would be constructed from the base form of the words in each sentence,
ignoring stop-words. The words “Who” “to”, “the”, “and” “then” would be ignored as they
are common words and hence part of the list of stop-words. We would therefore have:
QF ={go, mountain, yesterday}
and
UF = {Fred, walk, big, mountain, mount, Pleasant}
In order to calculate similarity the algorithm would consider each word in turn. “Go” would be related to “walk” in an is-a (hypernym) relationship (to walk somewhere is an instance of going somewhere) and receive a score s1. “Mountain” would be considered most similar to “mountain” (same-as relationship) and receive a score s2: “mount” would be in a synonym relationship with “mountain” (“mountain” and “mount” both refer to a similar object) and hence would be given a lower score than a same-as relationship, so it is ignored (as seen, the function takes the maximum score, i.e. considers the strongest relationship found between words). “Yesterday” would receive a score of 0 as there are no semantically related words in the answer. The similarity measure of Q in respect to A would therefore be given by s1 + s2.
3.3
Disambiguation
In order to use WordNet to find relationships between words it is not possible to simply use
the root form of a word (e.g. “book” for “books”, “see” for “saw” etc.), but it is necessary to
associate the word with a synset, which represents a particular meaning of the word: to work
with WordNet it is therefore necessary to disambiguate the meaning of the words in a
sentence in order to choose the particular meaning a word takes within that sentence. There
are however a number of difficulties with this approach:
•
The senses given in WordNet are often too fine-grained to be of practical use, making very fine distinctions between word senses which appear unnecessary. Moreover,
deciding how many meanings a word has and distinguishing between meanings is
neither a simple nor an uncontroversial task and WordNet cannot be considered a
definite authority in this area.
•
Often, isolated questions (without reference to a wider context), being made up of
very few words, are difficult, or indeed impossible to disambiguate (e.g. the question
“Where is the bank?”: does this mean where is the river bank? or where is the Bank
of England? or where is a high-street bank?). But QA systems do not always have
access to a wider context and therefore it becomes an impossible task to fully
disambiguate the meanings of all the words in a question. In particular:
•
most of the questions used for the TREC QA evaluation are too short to be able to
carry out any meaningful disambiguation.
•
Different meanings for words are often the result of some form of metaphor (e.g.
“green” in the sense of naïve is derived from the fact that immature fruit is green in
colour and naivety is often associated with immaturity); it can therefore be argued
that metaphoric relationships are a particular type of semantic relationship and hence
something which must be taken into consideration when calculating semantic
relevance.
Given these difficulties, we therefore decided, unlike other authors (e.g. Mihalcea and
Moldovan (1999), Harabagiu et al. (1999)), not to carry out word disambiguation on the
sentences prior to calculating semantic relevance. A small number of experiments which were
carried out by manually disambiguating the sentences before processing did not improve
results, indicating the correctness of this approach. The algorithm therefore took into
consideration all possible meanings of the words in the question and the answer sentences,
finding the strongest relationships between any of these meanings. In other words, the sets QF
and AF were made up of the synsets representing each word, i.e. QF and AF were made up as
follows: { {synseta, synsetb, synsetc}, {synsetd}, {synseta, synsetb}, … }, where each group of synsets represented a word. In the example above we would have:
" go"
" yesterday"
⎧64444
mountain"
744448 644"4
4744448 64444744448 ⎫⎪
⎪
QF = ⎨{synset1, synset 2, synset 3}, {synset 3, synset 4, synset 5}, {synset 6, synset 7, synset8}⎬
⎪
⎪
⎩
⎭
where synset1, synset2, synset3 represent the synsets which include the word “go”, synset4,
synset5, synset6 represent the synsets which include the word “mountain”, etc. In the case of
proper nouns (e.g. “Fred” in the example above), no synset would be associated with the
word, and the word itself would be retained. Again following the example above, we would
have:
" big"
⎧6
" Fred"
"4
walk"
744444
8 644444744444
8 ⎫⎪
⎪ 78 64444
AF = ⎨" Fred ", {synset 9, synset10, synset11}, {synset12, synset13, synset14},...⎬
⎪
⎪
⎭
⎩
The sets of synsets associated with each word in QF would then be compared with the set of
synsets associated with each word in AF to find a link. A relationship would therefore be
sought between synset1 and synset9, synset10 and synset11, then between synset2 and
synset9, synset10 and synset11, and so on; the strongest relationship found would then be
considered the strongest semantic relationship between the words “go” and “walk”.
Consequently, none of the senses of “go” or of “walk” would be lost. In practice highly
ambiguous words (e.g. “do” or “be”), associated with a high number of different synsets, were
usually ignored as they were members of the set of stop-words which were considered too
common to be of any use; only a limited number of synsets were therefore associated with
each word, which simplified processing and avoided in most cases gross errors due to a wrong
interpretation of the meaning of words.
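Putting the pieces together, the sentence-level score can be sketched as the sum, over the words of the question, of the strongest similarity to any word of the candidate answer, with every sense of every word kept in play exactly as described above. The Python below is illustrative only: the stand-in word similarity (identical word, shared synset, direct hypernym or hyponym link) is a simplification of the weighted measure described earlier.

    from nltk.corpus import wordnet as wn

    STOP_WORDS = {"the", "a", "an", "to", "of", "who", "what", "and", "then"}

    def word_similarity(w1: str, w2: str) -> float:
        # Simplified stand-in: 1 for identical words, 0.9 for words sharing a synset,
        # 0.7 for a direct hypernym/hyponym link between any of their senses, else 0.
        if w1 == w2:
            return 1.0
        s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
        if s1 & s2:
            return 0.9
        for a in s1:
            if s2 & set(a.hypernyms()) or s2 & set(a.hyponyms()):
                return 0.7
        return 0.0

    def content_words(sentence: str):
        words = (w.strip(".,?!").lower() for w in sentence.split())
        return [w for w in words if w and w not in STOP_WORDS]

    def semantic_relevance_score(answer: str, question: str) -> float:
        # Sum, over the question words, of the strongest similarity to any answer word.
        aw = content_words(answer)
        return sum(max((word_similarity(q, a) for a in aw), default=0.0)
                   for q in content_words(question))

    q = "Who went to the mountains yesterday?"
    a = "Fred walked to the big mountain and then to mount Pleasant"
    print(semantic_relevance_score(a, q))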
4
Relevant Features
The semantic relevance algorithm was constructed to make use of a number of different
features which contained intuitively useful information about sentences. The simplest
algorithm was the following:
Word Match (WM). This simply took into consideration words in a sentence, but ignored
any semantic relationships apart from equality.
On the next level of complexity the algorithm made use of the following:
WordNet Relationships (WNR). This took into account the wider meaning of words by
considering WordNet relationships (hypernym, synonym, etc.) between sentence words, as
shown above.
WordNet relationships were then used in conjunction with other features. In other words, the
sets UF and QF of relevant features were expanded and the function similarity was modified
to not only take into account individual words, but also other properties, as follows.
NP chunking (NPC). The intuition behind the use of NP chunks to determine similarity is that
“When did [the big bad wolf] eat [red riding hood] ?” is more similar to “[the bad wolf] ate
[red riding hood] [who] carried [a pink chicken]” than “[the wolf] who wore [a pink riding
hood] ate [the bad red chicken]”. Words appearing in similar NPs in both question and answer
were therefore considered more relevant than words that did not. The Noun Phrase chunk
parser was based on Transformation Lists (Ramshaw and Marcus, 1995). In addition to this
we experimented with a "naïve" NP chunker (NNP) which used a very simple algorithm
(gathering words found between articles and verbs) to determine chunks and was used as a
benchmark to test the performance of the chunker.
Compound noun information (CN) WordNet was used to find compounds, the motivation
being the fact that the word “United” in “United States” should not be considered similar to
“United” as in “Manchester United”. As opposed to when using chunking information, when
using noun compound information the compound is considered a single word, as opposed to a
group of words: chunking and compound noun information may therefore be combined as in
“[the [United States] official team]” or “[the [Coca cola] company]”. Compound nouns were
identified by looking in WordNet to see if pairs of consecutive words were present as a
compound.
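With NLTK this check can be made directly, since multi-word WordNet entries are stored with underscores; the following sketch is illustrative:

    from nltk.corpus import wordnet as wn

    def find_compounds(words):
        # Return adjacent word pairs that WordNet lists as a single entry
        # (e.g. "United States" is stored as the lemma "united_states").
        compounds = []
        for w1, w2 in zip(words, words[1:]):
            if wn.synsets(f"{w1}_{w2}".lower()):
                compounds.append(f"{w1} {w2}")
        return compounds

    print(find_compounds("the United States official team".split()))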
Part-of-Speech tagging (POS) The intuition behind the use of Part-of-Speech tagging is that
this should disambiguate the senses of the words sufficiently to avoid gross errors in
determining similarity, for example when considering the word relative in the two phrases
“the facts relative [adjective] to the issue” and “John visited his relatives [noun]”. Part-of-Speech tagging was based on the TnT Part-of-Speech tagger (Brants, 2000). In addition to the
usual form of part-of-speech tagging we also used a “flexible” tagging algorithm (FPOS),
which allowed for confusion between tagging of nouns and verbs in order to further test how
the performance of the tagger influenced the overall algorithm.
Proper noun information (PN) The intuition behind this is that titles (of books, films, etc.)
should not be confused with the “normal” use of the same words: “Blue Lagoon” as in the
sentence “the film Blue Lagoon was rather strange” should not be considered as similar to the
same words in the sentence “they swam in the blue lagoon”. A simple set of rules looking for
words starting with a capitalized letter was sufficient to successfully recognise proper nouns.
Word frequency information (WF) This is a step beyond the use of stop-words, following
the intuition that the more common a word is, the less useful it is in determining similarity
between sentences. So, given the sentences “metatheoretical reasoning is common in
philosophy” and “metatheoretical arguments are common in philosophy”, the word
“metatheoretical” should be considered more important in determining similarity than the
words “common”, “philosophy” and “is”, as it is much rarer and therefore less likely to be
found in irrelevant sentences. Given that the questions examined were generic queries which
did not necessarily refer to a specific set of documents, the word frequency for individual
words was taken to be the word frequency given in the British National Corpus (see BNCFreq
2002 for a list of the most frequent 3000 words in English according to the British National
Corpus). The top 100 words (see BNCFreq 2002), making up 43% of the English Language,
were then used as stop-words and were not used in calculating semantic similarity. The
similarity between words was therefore weighted to take into account the commonality of the
words in generic English.
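One way of realising this weighting, shown purely as a sketch, is to scale each shared word's contribution by an inverse-frequency factor; the frequency table below is invented for the example, whereas the implementation described here drew its counts from the British National Corpus.

    import math

    # Hypothetical frequencies (per million words); real values would come from the BNC.
    FREQ = {"common": 500.0, "philosophy": 60.0, "reasoning": 40.0, "metatheoretical": 0.1}

    def weight(word: str, default_freq: float = 10.0) -> float:
        # Rarer words receive a larger weight; very frequent words contribute little.
        return 1.0 / math.log(2.0 + FREQ.get(word, default_freq))

    def weighted_overlap(u_words, q_words):
        # Word-match score in which each shared word is weighted by its rarity.
        return sum(weight(w) for w in set(u_words) & set(q_words))

    print(weighted_overlap("metatheoretical reasoning is common in philosophy".split(),
                           "metatheoretical arguments are common in philosophy".split()))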
5
Evaluation
5.1
Method
The semantic relevance measure was used to rank answer sentences (taken from the TREC
document collection) for a total of 200 questions randomly chosen from the test questions
used in the NIST TREC-10 Question Answering track (Voorhees 2002). The reasons for
using the TREC answer and document collection were:
•
Real questions: although the TREC questions are limited, as they contain for the most
part only concept filling questions, they are taken from “real” collections of questions
and hence reflect different ways in which questions may be formulated.
•
Real text: although the TREC document collection is limited to newspaper articles, it
contains “real” text, taken “as is” and not preprocessed and hence reflects the type of
text a system should be able to cope with.
•
Availability: the TREC collection is widely available and accepted as a standard test
collection, which ensures that the experiments we carried out would be reproducible.
In order to evaluate the ability of the algorithm to measure semantic relevance it was
necessary to find some objective measure which could be used to determine whether the
algorithm was functioning as desired. Given a question from the TREC collection, evaluating
a semantic relevance ranking of all possible answer sentences from the provided documents
would prove an all but impossible task, given the very large number of possible answer
sentences (if each document contained 50 sentences, not a large number in the document
collection, this would require an examination of 2,500 ranked sentences), and the highly
subjective nature of a human ranking (it would be very hard for a human to ignore the full
questioner model, including the goals of the questioner) which would require an evaluation on
the part of more than one subject. It was therefore decided to seek a more objective
measurement of the algorithm’s performance by noting that sentences containing what the
NIST evaluators consider an answer to a question are very relevant from the point of view of
semantic relevance. It was therefore conjectured that a good semantic relevance algorithm
would rank a sentence containing an answer (as defined by the TREC QA track evaluators) as
one of the most relevant sentences, and the evaluation therefore sought to measure the
algorithm’s ability to place sentences containing an answer amongst the top ranked sentences.
It should be noted however that, as indicated in Chapters 4 and 5, semantic relevance is only
one component of answerhood and it would not be expected on its own to provide an answer.
The answer sentences were sought in the set of 50 documents also provided by NIST for each
question, which should have contained (but were not guaranteed to contain) an answer to the
questions. The documents were manually checked in order to verify that an answer was
actually present in the texts and questions for which no answer was given in the documents
were discarded (about 32% of the questions did not have an answer in the given set of
documents). The documents were then split into sentences using the YorkQA sentence splitter
and ranked using the semantic relevance algorithm; finally, the first 15 answer sentences were
manually checked to establish at which rank (if any) an answer sentence was found. In line
with the type of evaluation carried out for the TREC QA workshop (Voorhees 2002), the
mean reciprocal rank (MRR), i.e. the average reciprocal of the rank at which the answer was
found, was then computed.
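The mean reciprocal rank itself is straightforward to compute; a small sketch (ranks are 1-based, and questions for which no answer sentence was found contribute zero):

    def mean_reciprocal_rank(ranks):
        # ranks: for each question, the 1-based rank of the first answer sentence,
        # or None if no answer sentence was found among the inspected sentences.
        scores = [1.0 / r if r is not None else 0.0 for r in ranks]
        return sum(scores) / len(scores) if scores else 0.0

    # Example: answers found at ranks 2, 1 and 5; one question with no answer found.
    print(mean_reciprocal_rank([2, 1, 5, None]))   # (0.5 + 1 + 0.2 + 0) / 4 = 0.425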
As an example, given the TREC-10 question:
What was the last year that the Chicago Cubs won the World Series?
The system would rank the sentences contained in the documents retrieved by the NIST IR
engine for this question in the TREC-10 evaluation (documents which contained news articles
from sources such as the Wall Street Journal, the San Jose Mercury and the Los Angeles
Times). What follows is an example of the output, showing the seven highest ranked
sentences, in decreasing order of semantic relevance (the code preceding each sentence
indicates the document the sentence was taken from):
WSJ890915-0148, Naturally , when the Cubs won the pennant in 1932 , we wanted
to see a World Series game , " he goes on
WSJ910520-0171, " Yes , the Cubs have not won the World Series since 1908 , and
yes , they have not been in a World Series since the end of World War II
SJMN91-06302107, A record 51 players were used in the 1945 Detroit Tigers-Chicago Cubs World Series , but 10 players weren 't used
LA101089-0082, Clark might have noticed that life is also about squirming , which is
what the Cubs did in that eighth inning , when their hopes for their first World Series
in 44 years ended
WSJ910520-0171, I never expected you to lower yourself to the cruelty of " Chicago
Cubs Bashing
WSJ910520-0171, However , your article stating that the poorly equipped , fledgling
teams of the Romanian Baseball Federation were not " even the Chicago Cubs , " was
a needless rubbing of salt into decades-old wounds
SJMN91-06302188, Catcher Damon Berryhill and pitcher Mike Bielecki , obtained
from the Chicago Cubs in the season 's final week -- too late to be eligible for the
postseason -- were voted $500 each
In the case of the example the answer which resolves the questioner’s informational needs is
“the Cubs have not won the World Series since 1908”, and the first sentence containing the
answer is ranked second in order of semantic relevance, giving a reciprocal rank score of 0.5
(1/2). Note however how reading all the semantically relevant sentences, not just the sentence
containing the answer, gives some interesting background information that may also aid the
questioner, such as the fact that the Cubs were once a successful team, winning the pennant
in 1932 and playing in the 1945 World Series, but now have a reputation as a
very poor side which nevertheless still hopes to rebuild its past glory. This last point will be
discussed in more detail below where we examine how semantic relevance enables a system
to move beyond unique, definite answers.
A correct answer sentence was strictly taken to be a sentence which “justified” an answer, i.e. one from which an intelligent reader could infer the answer without relying on external knowledge which could have been used to derive the answer without consulting the document at all. In a number of instances sentences which could have been deemed to contain an
answer were judged as non-answers as they did not justify the answer: for example, given the
question “When was the first stamp issued”, the sentence “A block of “penny black” postage
stamps, issued in 1840, were sold at auction Tuesday for about $442,680, an auction house
said.” was not judged to contain an answer, as the correct answer (1840) could not be inferred
from this sentence alone.
A number of runs were carried out in order to determine which of the features identified
above would best determine relevance (results are summarised in Table 1). For each set of
features the ranked answers were examined manually to identify the exact rank of the
sentence containing an answer. In line with the type of evaluation carried out for the TREC
QA workshop (Voorhees 2002), the mean reciprocal rank (MRR), i.e. the average reciprocal
of the rank at which the answer was found, was then computed for the given results.
Table 1

Features               Ans. 1 (%)    Ans. 1-5 (%)    MRR
WNR                    24.5          53              0.52
WNR+NPC                14.3          55.1            0.433
WNR+NNP                16.3          51              0.43
WNR+POS                28.6          55.1            0.505
WNR+POS+NPC            18.4          51              0.436
WNR+POS+NNP            24.5          55.1            0.465
WNR+FPOS               24.5          53              0.347
WNR+FPOS+NPC           28.6          57.1            0.519
WNR+FPOS+NNP           24.5          49              0.496
WNR+POS+WF+NPC         20.4          51              0.434
WNR+POS+WF+CN          28.6          53              0.451
WNR+POS+WF+PN          24.5          53              0.481
WNR+POS+WF+CN+PN       30.7          57.1            0.484
WM                     18            37              0.2
The semantic relevance algorithm performs much better than a simple word match. However
the individual features contribute in different measure to this result. Experiments were carried out using combinations of the features described in section 4 above, as summarised in Table 1.
Initial experiments indicated that NP chunking did not contribute to improving performance.
It was speculated that this was in large part due to the POS tagger, which incorrectly tagged
adjectives and nouns on a number of occasions. In order to verify this, experiments were
carried out with “flexible” tagging which allowed words to be considered both as nouns and
adjectives. With this correction NP chunking gave the best results. Another possible cause for
the poor results in NP chunking was incorrect chunking. In order to verify this, experiments
were carried out using a simple NP chunker that gathered together words found between
articles and verbs. This gave better results when using the standard POS tagging, but worse
results in all other cases, indicating that the given NP chunker was indeed better than a simpler
chunker. Further analysis of the questions indicated that the reason NP chunking did not
prove as useful as initially hoped was the type of questions given, which were usually very
short, sometimes with NPs consisting of no more than one word. The addition of information
about Proper nouns (taken to be words beginning with a capital letter) and compound nouns
(taken to be compounds identified by WordNet, e.g. “United States”) improved performance.
It is important to note that recognising compound nouns gave different information from the
recognition of NP chunks: a simple example is the NP “the Coca Cola company” where the
compound noun was taken to be simply “Coca Cola”. Experiments were also carried out on
the use of word frequency information. This approach also gave good results. A question
answering system tailored for a specific task, would however have to use a word frequency
table specific to that task: “analgesic” may be an infrequent term in everyday usage, but is
quite common in medical documents. While word frequency improved the accuracy of the
measurement not employing NP chunking, this approach had a detrimental effect when
combined with NP chunking, probably due to the fact that “chunks” containing very common
words were not able to furnish any useful information gain.
5.2
Usefulness for finding definite answers
Semantic relevance has been shown to be valuable when looking for a definite answer: the
best results (using as features WNR, POS, WF, CN, PN) indicated that a sentence containing
an answer would be ranked as the top answer sentence in the document collection in 30.7% of
cases. An answer sentence would be ranked amongst the top five most relevant sentences in
57.1% of cases. In most of the remaining cases (around 90%) an answer was found in the top
fifteen sentences considered relevant, but the ranking could not be correct due to very short
questions (e.g. “Where is Perth?”): indeed, in a number of cases all sentences were assigned
the same similarity score. In other cases the answer sentence required additional background
knowledge that was not provided by WordNet: examples which caused answers to be missed
are the lack of any connection between the words “midway” and “middle” and the very
tenuous connection between “man made object” and satellite, which requires an elaborate
traversal of a number of different semantic relations. In other cases some form of non-trivial
reasoning would have been necessary to deduce an answer: for example, there was no answer
in the set of documents to the question “During which season do most thunderstorms occur?”,
but an intelligent reader could have counted the number of times thunderstorms were
mentioned in articles about the weather and deduced a statistical analysis of the occurrence of
thunderstorms during the year. Yet more questions required assumptions about questioners’
intentions, e.g. the question “What is Wimbledon?”, which could have had as answer “A part
of London”, or “A tennis competition”.
To get an idea of the performance that a system using this algorithm, relying simply on semantic relevance, could achieve in carrying out the TREC QA task, a useful comparison is with the results of the TREC-10 QA workshop, the last workshop which allowed systems to
return multiple answer sentences. The best system at TREC-10, which employed a very rich
and complex system of pattern matching, found an answer in the top 5 sentences 69.1% of the
time, with a MRR of 0.676 (Soubbotin 2002); the top five systems had an average MRR of
0.518 and an answer in the top five sentences on average 62.1% of the time (see Voorhees
2002). Simply using documents provided by a standard Information Retrieval engine, the
semantic relevance algorithm presented above correctly ranked the sentence containing an
answer in the top 5 sentences 57.1% of the time, with a MRR of 0.519. Combining this
semantic relevance algorithm with the techniques used by the TREC participants (complex
pattern matching, precise question-type and named-entity recognition, simple forms of logical
inference or abduction) could therefore give significantly improved results.
5.3
Beyond definite answers
On the other hand, the usefulness of semantic relevance goes beyond being able to pinpoint a
specific answer to a question: what it gives is a ranking of sentences which are about an
answer, which may be of interest to a questioner (for example by providing background
information about an answer) even when they do not give an answer as such. Given a
question such as the TREC question:
What mineral helps prevent osteoporosis?
While a user would certainly want a definite answer, e.g. “Calcium and phosphorus”, the
following sentences, ranked as relevant, i.e. as being “about” the question according to our
semantic relevance algorithm, would probably also be considered interesting and worthwhile
reading:
Giving calcium (the main mineral in bone) directly as a dietary supplement does not
appear to prevent osteoporosis in well-nourished women.
Boron, an element long used as a water softener and mouthwash, apparently plays an
important role in hardening bones and could help prevent osteoporosis.
UCLA is conducting a study using a nasal spray medication called calcitonin that
prevents osteoporosis and has none of the side effects of estrogen.
While it is recommended that women get the recommended amount of calcium in
their diet, taking calcium supplements alone will not prevent osteoporosis.
Another example is the question
Where are the Rocky Mountains?
The document set contains the exact answer “In north-central Colorado”, but also the very
useful sentence:
For more information about Rocky Mountain National Park, call the National Park
Service at ( 303 ) 586-2371 , or write : c/o Superintendent, Rocky Mountain National
Park , Colo. 80517-8397
The use of semantic relevance will give a system the ability to move beyond simply providing
“an answer”: being able to pinpoint a number of different sentences with different pieces of
information which nevertheless are about the same topic as the answer will be much more
helpful to a user than a single, short, albeit correct, answer.
Taking “interesting information” to mean a sentence which provides some background
information about the question or the answer to the question, we evaluated the ability of our
semantic relevance algorithm to provide additional “interesting information” in the top 5
ranked sentences (using the algorithm which provided the best results above, using as features
WNR, POS, WF, CN, PN). In 82% of cases, the top five answer sentences contained
interesting information in addition to, or instead of, an answer; where a precise answer was
given in one of the top 5 sentences, additional relevant answer sentences containing interesting
information were found in 93.9% of cases; where no precise answer was given in the top 5
sentences, these nevertheless provided interesting information in the form of semantically
relevant answer sentences in 66.3% of cases.
6
Using semantic relevance to improve the YorkQA system
Using the semantic relevance algorithm to improve the YorkQA system described in chapter
3 above would:
•
give significantly better results in the TREC evaluation
•
allow the system to move beyond some of the limitations of the TREC evaluation
framework
The system evaluated for TREC-10 (which gave systems the possibility of presenting five
sentences as answers) would have its score improved from 18.1%, with a MRR of 0.121, to
57.1%, with a MRR of 0.519. The system evaluated for TREC-11 (which gave the systems
the opportunity to present a unique and precise answer) would have, using an accurate named
entity recogniser, a score improved from 12.6% to 30.7%.
The use of the algorithm would moreover allow the YorkQA system to move beyond the
limitations of the TREC evaluation framework by:
•
having the capability to find relevant answers to complex questions, not only concept-filling questions
•
giving the user relevant answers, not simply correct answers, by presenting
information that may be of interest to the questioner even if it does not directly
answer the given question.
7
Conclusion
We have shown how a particular aspect of relevance, i.e. semantic relevance, can be
implemented. An in-depth analysis and evaluation of the component parts of the algorithm
was then given and it was shown that even considered in isolation from the other forms of
relevance identified, semantic relevance can be usefully employed to improve “standard”
TREC-style QA systems to tackle the problem set out in the TREC evaluation framework.
The real strength of the use of semantic relevance is however the ability to find useful answer
sentences which may not immediately resolve the information need of the questioner. TREC-style question answering systems have been concerned with providing a unique and
straightforward answer, giving the response “no answer” if no definite answer was found;
semantic relevance on the other hand enables a question answering system to provide
responses which, although not immediately resolving the informational needs of the
questioner, nevertheless provide useful information which could be valuable to the questioner
in addition to a definite answer or indeed instead of no response if no definite answer has
been found.
Chapter 7
Goal-directed Relevance
Executive Summary
We present an implementation of goal-directed relevance using clarification dialogues, a mechanism
for allowing users to meet their informational needs by asking a series of related questions either by
refining or expanding on previous questions with follow-up questions. We develop an algorithm for
clarification dialogue recognition through the analysis of collected data on clarification dialogues and
examine the importance of clarification dialogue recognition for question answering. The algorithm is
evaluated and shown to successfully recognize the occurrence of clarification dialogue in the majority
of cases. We then show the usefulness of the algorithm by demonstrating how the recognition of
clarification dialogue can simplify the task of answer retrieval.
1
Introduction
We saw above how relevance is made up of a number of relevance categories, and showed an
implementation of semantic relevance; we shall now illustrate how goal-directed relevance
can be implemented in a QA system through the use of clarification dialogue and present an
evaluation of such an implementation.
Goal-directed relevance considers the overall informational goals that motivate the questioner
to ask a question (goal-directed relevance therefore is not concerned with more general “life”
goals, such as being awarded a Nobel prize). Considering goals helps determine relevance
where more than one sentence is a possible answer to a question but only one answer answers
the question in the context of the questioner’s goal.
So, for example, given the question:
Q: Where was Kennedy shot?
And the possible answers:
A1: Kennedy was shot in Dallas
A2: Kennedy was shot in the liver
The most relevant answer would be considered to be the answer which satisfies most closely
the goals that the questioner has in asking Q. Thus, if the questioner is a journalist writing an
article on political assassinations, the goal immediately relevant for Q is to find out the city
where Kennedy was shot, so A1 is most relevant. But if the questioner is a medical researcher
writing a paper on gunshot wounds, A2 is probably more relevant.
Another example is the question:
Q: What is aspirin?
And the answers
A1: Aspirin is a drug used to relieve fever, mild pain and inflammation.
A2: Aspirin is a non-narcotic analgesic and antiplatelet drug.
A1 would be the most relevant answer if the goal of the questioner is to have a general
understanding of what aspirin is, but A2 would be more relevant if the goal was to understand
in detail what type of drug aspirin was.
On the other hand the answerer’s goals may also influence what is to be considered a relevant
answer. For example, given the question:
Q: What is the hypotenuse of a right-angled triangle with sides measuring 3 cm and 4 cm?
And the possible answers:
A1: 5cm
A2: The hypotenuse is given by Pythagoras’ theorem
If the answerer is a question answering system for helping schoolchildren with their
homework, A2 may be considered the more relevant answer as it is consistent with the
system’s (the answerer’s) goal to help rather than to provide answers to homework questions.
In this instance, the answerer’s goals may take precedence over the questioner’s goals (which
are probably to get the homework done as quickly and painlessly as possible).
2
2.1
Modelling Goals
The need to examine user models
As seen above, a requirement for a question answering system is to understand the overall
relevance of an answer to a question in terms of the questioner’s and the answerer’s goals (goal-directed relevance), knowledge and inference mechanisms (semantic and inferential
relevance) and preferences (morphic relevance). But this is equivalent to constructing a model
of the questioner and the answerer. Examining the construction of user models is therefore a
prerequisite for successfully and satisfactorily answering user questions.
One approach, which we used above for the implementation of semantic relevance, is to avoid
the problem of modelling a specific user and instead to adopt a generic knowledge base (the
specific implementation we presented used WordNet) representing the knowledge of a
generic questioner which will then be applied to all users of the system. In the implementation
above we further simplified the problem by assuming that questioner and answerer knowledge
were equivalent and hence we were able to ignore the problem of potential conflicts between
questioner and answerer models.
While there has been little mention of user modelling in research on open-domain question
answering systems, there is a large amount of research on user modelling for natural language
dialogue, and hence, indirectly, for question answering. Wahlster and Kobsa (1989) proposed
a classification of dialogue systems according to their requirements for increasingly refined
user modelling components:
a) question answering and biased consultation
b) cooperative question-answering
c) cooperative consultation
d) biased consultation pretending objectivity
In this classification, systems belonging to a) would give short, factual, database-derived
answers to questions concerning simple problems such as flight or train times; systems of type
b) try and anticipate users’ needs in answering questions; systems of type c) attempt to give
advice which would realise users’ goals; finally, systems of type d) would act like salesmen,
attempting to convince users that the answers given best meet their needs even if the system
knows they do not.
While this classification was constructed for generic dialogue systems this same classification
may be adapted to the question answering problem considered here, indicating increasingly
sophisticated question answering. The classifications a) to c) can in fact be roughly mapped
onto the “QA roadmap” (Burger et al. 2001) which anticipates an ever increasing complexity
in the requirements for question answering systems, from questions regarding simple facts to
broad and complex queries requiring judgement and understanding of a user’s context. The
QA roadmap does not envisage anything which would correspond to d), possibly because it
was conceived in the context of military intelligence rather than commercial use. The
classification itself however is not entirely satisfactory as it ignores the concept of system
goals (although a system acting like a salesman must have at least one goal: to sell some
merchandise); moreover, it is unclear what the difference is between cooperative question
answering (type b) and cooperative consultation (type c), as fully anticipating user needs must
also help realise the user’s goals. We therefore propose the following adaptation of Wahlster
and Kobsa’s categorisation for question answering systems to clarify the need for increasingly
sophisticated user modelling when considering goals:
a) simple, fully biased, question answering: does not take into account user models and
hence goals, and simply attempts to find a factual answer which best fits a question.
Current systems (as for example seen in the TREC workshops) correspond to this
category.
b) Cooperative question answering: takes into account user models, in particular the
goals, knowledge and expectations of questioners. Such systems would overcome
some of the more obvious limitations of the TREC systems, as highlighted for
example in the debate on what constitutes a good answer to a “definition” question
(Prager et al. 2002) which followed the TREC-10 workshop. With the use of more
refined user models such systems would also be able to anticipate users’ goals in
order to fully meet users’ expectations.
c) Knowingly biased question answering: would draw on complex user models,
especially regarding user beliefs, credulity and proneness to persuasion, but would
also draw on the notion of answerer (or system) goals and beliefs. Such systems
would be able to knowingly give a wrong or misleading answer, and could be used,
for example, for advertising (a bank’s question answering system could tell a user that
there is no better deal elsewhere when asked “Can I get a better interest rate at
another institution?” even though there might be a better deal elsewhere; a disclaimer
at the beginning of the question answering session would overcome any legal
objections).
A complete question answering system would therefore require the ability to model the
knowledge, beliefs and expectations both of the questioner and the answerer. The problem
therefore arises of what such models would look like and how they would be constructed.
2.2 Types of User Models
Users may be modelled in many ways. One of the most influential paradigms developed to
represent users is the BDI (belief, desire, intention) model for users engaged in a dialogue
(see Cohen and Perrault 1979; Perrault and Allen 1980; Allen and Perrault 1980); this model
axiomatizes actions and planning with the aim of inferring a user’s plans from a formal
description of a user’s dialogue contribution. The user model is built up using a number of
axiom schemas (e.g. B(A,P) ∧ B(A,Q) ⇒ B(A,P ∧ Q) ) used to manipulate predicates such
as “S believes P”, “S knows that P”, “S knows whether P”, “S wants P”; in addition to this are
a number of “action schemas”, consisting of action preconditions, action effects and goal
states to be achieved in performing an action, as in the following example, which formally
defines the act whereby a speaker informs a hearer of something:
INFORM( S, H, P) is defined as:
Constraints: Speaker(S) ∧ Hearer(H) ∧ Proposition(P)
Precondition: Know(S,P) ∧ W(S, INFORM(S,H,P))
Effect: Know(H, P)
Body: B(H, W(S, Know(H,P)))
where B( X, Y) means X believes that Y and W( X, Y) means X wants Y.
In addition to this there are a number of “plan inference rules”, heuristics which define
plausible inferences from speech acts, or, in other words, constrain the way in which speech
acts will be interpreted. An example is the Precondition-Action-Rule, defined as:
For all agents S and H, if X is a precondition of action Y and H believes S wants X to obtain,
then it is plausible that H believes that S wants Y to be done.
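To make the shape of such a representation concrete, the fragment below sketches in Python how the INFORM schema and the Precondition-Action-Rule might be encoded as simple data structures; it is a hypothetical illustration only (the Agent and Action classes and the string-based predicates are assumptions introduced here, not part of the BDI literature's formalism or of any system discussed in this thesis).

from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    beliefs: set = field(default_factory=set)   # propositions the agent believes
    wants: set = field(default_factory=set)     # propositions or actions the agent wants

@dataclass
class Action:
    name: str
    preconditions: list   # propositions which must hold before the action
    effects: list         # propositions which hold after the action

def inform(S: Agent, H: Agent, P: str) -> Action:
    # INFORM(S, H, P): S informs H of proposition P.
    return Action(
        name=f"INFORM({S.name},{H.name},{P})",
        preconditions=[f"Know({S.name},{P})", f"Want({S.name},INFORM({S.name},{H.name},{P}))"],
        effects=[f"Know({H.name},{P})"],
    )

def precondition_action_rule(H: Agent, S: Agent, Y: Action) -> bool:
    # If H believes that S wants a precondition X of action Y to obtain,
    # it is plausible that H believes that S wants Y to be done.
    return any(f"Want({S.name},{X})" in H.beliefs for X in Y.preconditions)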
This model has provided the foundation for a wide range of work in this area, with numerous
adaptations and simplifications. Kautz and Allen (1986), for example, examined the
representation of user goals, presenting a “Decomposition Hierarchy” and a “Generalization
Hierarchy” of goals. The decomposition hierarchy explains how complex actions are
executed, while the generalization hierarchy makes explicit the relation that exists between
more general and more specific actions, allowing properties to be “inherited” within the
hierarchy. Both these structures need, however, to be “filled” with appropriate information
(i.e. concrete goals) in order to be of any use and methods must be found to infer the user’s
goals.
However, while such models are certainly very expressive and capable of addressing the subtlest uses of dialogue acts, there are no examples of systems built on them which are capable of coping with the extensive knowledge needed for a realistic and non-specific task such as open-domain
question answering. This is probably due to the fact that a) the very richness of these
structures means that they are extremely time-consuming to develop and maintain; and b) the
complexity of the inferences used in this model would give rise to serious computability
issues.
Another approach to user modelling has been the use of stereotypes. Stereotypes enable users
to be grouped into categories with common features. An early example of the use of
stereotypes is described by Rich (1979), who developed a book recommendation system
based on assumptions about user characteristics such as educational level, genre preference
and tolerance for descriptions of violence or suffering. These assumptions form a user
stereotype built up initially from a small amount of information requested of the user before
any interaction takes place, but which can be revised by the system in light of questions posed
to the user about previous recommendations. Another example is KNOME (Chin 1989),
which classifies users as beginners, novices, intermediates and experts and makes assumptions about their operating system knowledge (and therefore their need for help) based on this classification. Strachan et al. (1997) used a very simple ("minimalist") form of user
stereotyping: their system (an aid for financial planning) modelled users as novices or experts
and then further as generic users and financial planners. The user model itself contained
information about settings which allowed adaptations to each user model and information
about the user’s past interaction with the system’s strategies and assistants. The initial model
was created based on each user’s job title, a self-assessment about the user’s general
proficiency within the Windows operating system environment, and previous experience with
the system. Once this was in place, the system adapted the user model dynamically once a day
(and, interestingly, with the user’s consent) as a consequence of the user’s interactions with
the system.
Similar to stereotypes are the attempts to build general models of user goals and specifically
the efforts by a number of researchers in the field of Information Retrieval to examine the
problem of information seeking models. Schneiderman (1998), for example, identified four
unstructured tasks describing information seeking goals:
• specific fact finding,
• extended fact finding (more open questions),
• open-ended browsing (for example finding out whether desertification and carbon monoxide levels are related), and
• exploration of availability (seeing what information is available on a topic).
Ellis (1989) abstracted user behaviour (based on observations of sample users) into six
categories:
• starting (e.g. focussing on known references or authors),
• chaining (following references from the documents at hand),
• browsing (scanning on-line material or shelves),
• differentiating (making use of differences in quality between information sources),
• monitoring (maintaining awareness of developments), and
• extracting (systematically working through a particular source in order to locate material of interest).
Chen and Dhar (1991), on the basis of protocol analysis of actual user behaviour when
interacting with librarians or with an information retrieval system, abstracted a number of
approaches used to traverse the problem-space:
• Known-item instantiation (retrieval of a known item, for example an author name or title, to get index terms to be used in following queries)
• Search-option heuristics (specifying queries by employing different search options, for example keyword, title or subject search)
• Thesaurus-browsing
• Screen-browsing
• Trial-and-error
A more general approach was developed by Taylor (1991) in the area of communication
science. Taylor identified eight classes of information use generated by the perceived needs of
groups of users in particular situations:
• Enlightenment: the desire for context information in order to make sense of a situation
• Problem understanding: the need for a better understanding of particular problems
• Instrumental: the need to find out how to do something or what to do
• Factual: the need for precise data
• Confirmational: the need to verify some information
• Projective: the need to be future oriented (estimates and probabilities)
• Motivational: the need to find additional information based on personal involvement with a task
• Personal or Political: the desire to control situations (relationships, status, reputation)
Stereotypes are however necessarily rigid (a flexible stereotype would no longer be a
stereotype), which could cause problems if the user model which is constructed is based on
erroneous assumptions. Moreover, being general models, they can only cope with very
general goals and cannot model the specific needs of a specific user at a specific time.
Another problem is the limited applicability of the stereotypes as they are usually constructed
for specific domains and cannot be easily transferred to other problem areas.
Once a decision has been made on the type of representation that is to be used to model the
user’s goals, it is necessary to ask how such a representation can be constructed: how can we
decide which stereotype to use in a particular situation? How can we decide what intentions to
ascribe to a user in a BDI model?
2.3 Constructing user models
Examples of methods to construct user models can be found in the work of Kautz (1990) and
Pollack (1986 and 1990) who attempted to provide techniques for plan recognition in natural
language discourse using inference; following on from this work, Appelt and Pollack (1992)
used a form of weighted abduction for plan recognition; another approach has been the use of probabilistic methods such as Bayesian networks in order to determine user models (see, for example, Horvitz et al. 1998 and Horvitz and Paek 2000). A number of researchers have also investigated plans in the context of particular task-oriented dialogues, for example in the domain of train timetables or university course enrolment (see, for a representative sample, Ardissono et al. 1993 and Allen et al. 1995) and a number of simple systems have been built based on the acquisition of assumptions about users' knowledge and beliefs (see for example Kobsa 1985, which derives assumptions about users' beliefs and goals from their natural-language assertions and questions).
Wahlster and Kobsa (1989) identify a number of knowledge sources which can be used to
obtain information about users:
• Default assumptions: a system may make assumptions about certain user characteristics such as general knowledge and goals which it then applies indiscriminately to all users unless some evidence is found which invalidates these assumptions.
• Initial user models: a system may possess a model for a particular user gained as a result of previous interactions. Rich (1979), for example, presented a system which requested information from a user before any interaction took place, in order to classify the user according to certain given models.
• Assumptions based on user input: in this case a system gradually constructs a user model based on the history of the input the user gives the system.
• Assumptions based on dialogue contributions made by the system: the user may respond in different ways to the system's responses, possibly confirming or disproving assumptions made by the system. If a system used a particular word assuming that the user had a particular linguistic competence, but the user subsequently asked the meaning of that word, the assumption about the user's linguistic competence would have to be dropped.
• Assumptions based on non-linguistic input of the user: such systems would make use of the visual, acoustic and other external cues which human beings use when making assumptions about interlocutors. An example of a system which uses acoustic cues to build a user model is Horvitz and Paek (2001), which enhances speech recognition with the use of probabilistic user models, making use of non-verbal cues such as hums or moments of silence.
The systems described above are, however, very restricted in their domain of discourse and it
is doubtful whether some of the techniques described could be applied outside the specific
domains within which they were developed (see Kobsa 1990 for a detailed discussion).
Nevertheless, this classification is still relevant for the problem of question answering and can
be used (bar the last point, which is beyond the scope of the current investigation, as this work does not concern itself with non-verbal cues) to show how question answering systems could determine user
models. A simplified version of the above, specifically relevant to question answering, is
therefore proposed; a question answering system could determine user models through a
combination of:
• Prior assumptions about questioners: even the simplest type of question answering system will have to make assumptions about the questioner (for example the fact that the answer is expected to be in English and not in Finnish), and hence all systems will have at least one user model, which is consequently a default user model. The next step in complexity would be the presence of a number of models which have been previously constructed and which may be applied to current users prior to any question answering exchange taking place.
• Assumptions based on questioners' responses: at the highest level of complexity questioner models will be gradually constructed based on the questioner's queries (e.g. the phrasing of a question, which could indicate a specific level of linguistic competence) and responses to the proposed answers (e.g. requests for clarification, which could indicate that the assumed goals have not been met satisfactorily).
These two points are however closely linked, as assumptions could never be entirely based on
questioners’ input, but always rely on the presence of a (at least implicit) user model: the
answerer therefore is prejudiced and the prejudices (prior assumptions) determine the possible
questioner models which will be constructed.
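By way of illustration, the two knowledge sources just discussed might be combined as in the following sketch (a hypothetical Python fragment, not YorkQA code; the model attributes and the crude length-based heuristic are assumptions chosen only to show the structure).

from dataclasses import dataclass

@dataclass
class QuestionerModel:
    # Prior (default) assumptions applied to every questioner.
    language: str = "English"
    linguistic_competence: str = "average"

def default_questioner_model() -> QuestionerModel:
    return QuestionerModel()

def refine_from_question(model: QuestionerModel, question: str) -> QuestionerModel:
    # Assumptions based on the questioner's input: long, elaborate phrasing is
    # (very crudely) taken here as evidence of a higher linguistic competence.
    if len(question.split()) > 15:
        model.linguistic_competence = "high"
    return model

model = refine_from_question(default_questioner_model(), "Where is Philadelphia?")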
An alternative approach to building user models has been the use of heuristics to determine
user intentions by applying a set of domain-independent rules to user input (see Pohl et al.
1995). Examples of these heuristics are:
• If a user employs objects correctly, then the user is familiar with these objects (see for example Sukaviriya and Foley 1993, Nwana 1991 and Chin 1989). Applied to question answering it can be said, for example, that if a user employs words correctly (in the correct context, with correct spelling, etc.), that user must be familiar with these words and must have a linguistic competence sufficient to understand these words and words of similar complexity.
• If the user uses objects incorrectly then the user is not familiar with those objects (see for example Quilici 1989). Applied to question answering, if a user employs words incorrectly (in the wrong context, misspelled, etc.) the user cannot be familiar with those words, and therefore must have a linguistic competence which does not allow those words, and words of similar complexity, to be used correctly.
• If the user requests an explanation for a concept, the user cannot be familiar with that concept (see Chin 1989 and Boyle and Encarnacion 1994). In question answering this could be applied by hypothesising that, if a user asks for the meaning of a word, the user cannot be familiar with that word and consequently must have a linguistic competence which does not allow that word, and words of similar complexity, to be used correctly.
• If the user wants to be informed about an object in more detail, the user must be familiar with this object (see Boyle and Encarnacion 1994); in this case the user is not looking for an explanation which would enable the understanding of the concept through simpler, more familiar, concepts, but is seeking to explore the concept in more complex detail. This may be applied in question answering to hypothesise that, if a user requests an answer to be elaborated, the user must already be familiar, to a certain extent, with the answer and is looking for new information that has not yet been given.
A user model may be built on the preceding heuristics following the simple rule that if the
feedback a user gives, following a system output that was based on the assumptions in the
user model, is positive, the plausibility of these assumptions should be increased; conversely,
if the feedback a user gives is negative, the plausibility of these assumptions should be
decreased (see Rich 1979). As an example application to question answering, if a user is
assumed to have a low linguistic competence and an answer is given based on this model of
the user (e.g. a simple answer is preferred over a more complex one in order to cater for the
low linguistic competence) and the user objects to the answer (for example by engaging in a
clarification dialogue which requires much more detail), the given user model should be
considered less plausible than before and perhaps a new user model (that states, for example,
that the user’s linguistic competence is higher than previously assumed) should be adapted in
answering the following questions.
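The feedback rule just described could be sketched as follows (a minimal, hypothetical fragment: the numerical plausibility scores and the fixed step size are assumptions, since Rich (1979) and the discussion above describe the rule only qualitatively).

def update_plausibility(plausibility: float, positive_feedback: bool, step: float = 0.1) -> float:
    # Positive feedback on an answer increases the plausibility of the
    # assumptions in the user model that produced it; negative feedback
    # (e.g. a clarification dialogue demanding much more detail) decreases it.
    if positive_feedback:
        return min(1.0, plausibility + step)
    return max(0.0, plausibility - step)

# Example: the assumption "low linguistic competence" loses plausibility after
# the user objects to a deliberately simple answer.
low_competence = update_plausibility(0.7, positive_feedback=False)   # 0.6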
2.4 An alternative to user goal modelling: clarification dialogue
The short (and incomplete) survey above has shown the complexity both of deciding how to model a user and of constructing a model for a particular user within a given modelling framework:
• although a number of experimental systems have been developed, few "live" systems have used any form of complex user modelling, and those that have done so have in any case been limited to specific domains;
• the very richness of structures such as the BDI model means that they are extremely time-consuming to develop and maintain;
• the use of heuristics to build stereotypes is problematic due to the fact that constructing the rules is a complex and laborious task.
Even if these issues were resolved, other problems would remain. A generic question answering system could seek to identify the user's current goal by choosing from a general taxonomy of goals the specific goal which is most coherent with the observed user actions (the question posed by the user); nevertheless, the problem remains that:
• The taxonomy may be incomplete
• The model that has been constructed of the user may be incorrect
Therefore, simply using questioner models derived from previous dialogue may not give the
answer the questioner is looking for. Consequently, a different solution needs to be found: one
such solution could be a dialogue with the user which would enable the answering system to
refine its understanding of the questioner’s needs. In this way, the questioner would be given
the possibility of engaging in clarification dialogue with the answerer in order to reach the
specific goals associated with the question.
We therefore propose a different approach to ensuring user goals are met by using
clarification dialogues to ensure that user questions are not always considered independently
of any context, but are considered as part of a larger interaction made up of a number of
related questions which together enable users to meet their goals.
A clarification dialogue would enable the questioner to refine a previously asked question or
to request an elucidation of an answer. This would ensure that a goal associated with a
previous question which had not been adequately addressed by the previously given answer
would eventually be satisfied (unless, of course, in an extreme case of incomprehension, the
questioner simply gave up and abandoned the initial goal, having concluded that the answerer
could not possibly understand the actual goal).
3 Clarification dialogues in Question Answering
Often a single question is not enough to meet users' goals: what is required is a wider dialogue (which, in the case of QA systems, is limited to a series of question/answer pairs and does not, as happens in human dialogue, also include question/question pairs), either elaborating and building on information gathered or clarifying previously given information, i.e. a dialogue which will enable the user to fully achieve their informational goals. We shall refer to this type of interaction as "Clarification Dialogue", by virtue of the fact that we are a) examining a
dialogue, albeit a very limited one where only one party in the dialogue asks questions and
only one party gives answers; and b) we are considering a dialogue which clarifies some
concept in the questioner’s mind, whether this be by asking for some new information related
to the topic investigated or asking for an explanation of something already given.
One example of clarification dialogue is in the form of questions which seek to clarify the
meaning of an answer, for example when the user has not understood a term contained in the
answer, as in the following exchange:
(1) Q1: What is a fairy tale?
A1: The American Heritage dictionary tells me it is a fanciful tale of legendary deeds and
creatures.
Q2: What does fanciful mean?
A2: …
Other times users want to expand on a given answer in order to have more details, as in the
following example, where the user, having discovered a need (the necessity to have a license
to fish) wants more details about how to go about fulfilling that need (the cost of the license):
(2) Q1: Do I need a license to fish in the Tiber river?
A1: Yes.
Q2: How much?
A2: …
In other cases the user’s goal is to form a broad picture about some topic and a number of
separate questions are needed in order to achieve the breadth of information required:
(3) Q1: Where was Frank Sinatra born?
A1: Hoboken, N.J.
Q2: Where did he grow up?
A2: Hoboken, N.J.
Q3: What kind of childhood did he have?
A3: …
The common link between the above dialogue fragments is the fact that the question/answer
sequences form coherent units of discourse quite different from an interaction such as the
following:
(4) Q: What is caffeine?
A: A stimulant.
Q: What imaginary line is halfway between the North and South Poles?
A: The equator.
Q: Where is John Wayne airport?
A: …
In (4) there is no relationship between the questions or between the questions and previous
answers and hence in seeking an answer there is no immediate reason to take into
consideration previously asked questions or previously given answers. In fragments (1) to (3),
however, in order to answer the questions correctly it is necessary to take into consideration
the previous context in order to satisfy the user’s goals. In (1), for example, the user isn’t
asking for the generic meaning of the word fanciful (the American Heritage Dictionary, for
example, gives three separate meanings for the word fanciful) but the specific meaning that
word takes in the sentence “it is a fanciful tale of legendary deeds and creatures”. Similarly in
(2) the question “How much?” makes no sense, and cannot be answered, without reference to
the context. Exchanges such as those in examples (1) to (3) therefore have in common the
feature that to answer a question satisfactorily some reference must be made to previously
asked questions and previously given answers. We shall hence refer to such exchanges as
“clarification” dialogues, as the questions that constitute them either clarify previous
questions or answers or clarify the mental picture the user is trying to build by elaborating on
previously asked or given information.
4 Using clarification dialogue to implement goal-directed relevance
The way we implement goal-directed relevance is not to explicitly construct a function which
ranks answers according to their relevance to a question with reference to a model of goals.
Instead we ensure that answers are relevant from a goal-directed point of view by allowing
the users to clarify previous questions and thus reach their goals through a type of dialogue
with the answering system.
We improve on TREC-type systems by considering questions not in isolation, but in relationship with each other. Implicitly, therefore, we have a function goal-directed-relevance which divides answers into those which would be relevant if we were to consider the current question as part of a clarification dialogue and those which would be relevant if this were not the case.
What follows is a description of how the YorkQA system was improved with the addition of
clarification dialogue, first looking at how clarification dialogue was implemented and then
seeing how the use of clarification dialogue helps find answers which are relevant to the goal
of clarifying a previously asked question.
While a number of researchers have looked at clarification dialogue from a theoretical point
of view (e.g. Purver et al. 2002; Ginzburg 1998b; Ginzburg and Sag 2000; van Beek et al.
1993), or from the point of view of task oriented dialogue within a narrow domain (e.g.
Ardissono and Sestero 1996), there has been little work on clarification dialogue for open
domain question answering systems such as the ones presented at the TREC workshops,
where the task that the user is pursuing and the subject matter of the user’s investigations are
not known a priori. Initial work in this direction includes the experiments carried out for the
(subsequently abandoned) “context” task in the TREC-10 QA workshop (Voorhees 2002;
Harabagiu et al. 2002) and the initial experiments presented by De Boni and Manandhar
(2003b). Here we seek to partially address this problem by looking at a particular aspect of
clarification dialogues in the context of open domain question answering: the problem of
recognizing that a clarification dialogue is occurring, i.e. how to recognize that the current
question under consideration is part of a previous series (i.e. clarifying previous questions) or
the start of a new series; we then show how the recognition that a clarification dialogue is
occurring can simplify the problem of answer retrieval.
5 The TREC Context Experiments
The TREC-2001 QA track included a "context" task which aimed at testing systems' ability to
track context through a series of questions (Voorhees 2002). In other words, systems were
required to respond correctly to a kind of clarification dialogue in which a full understanding
of questions depended on an understanding of previous questions. In order to test the ability to
answer such questions correctly, a total of 42 questions were prepared by NIST staff, divided
into 10 series of related questions which therefore constituted a type of clarification dialogue; the series varied in length between 3 and 8 questions, with an average of 4
questions per dialogue. These clarification dialogues were however presented to the question
answering systems already classified and hence systems did not need to recognize that
clarification was actually taking place. Consequently systems that simply looked for an
answer in the subset of documents retrieved for the first question in a series performed well
without any understanding of the fact that the questions constituted a coherent series.
In a more realistic approach, systems would not be informed in advance of the start and end of
a series of clarification questions and would not be able to use this information to limit the
subset of documents in which an answer is to be sought.
6 Analysis of the TREC context questions
We manually analyzed the TREC context question collection in order to establish what features could be used to determine whether a question was part of a longer question series, with the following conclusions:
1. Pronouns and possessive adjectives. For example:
- What does transgenic mean?
- What was the first transgenic mammal?
- When was it born?
where questions were referring to some previously mentioned object through a pronoun
(“it”). The use of personal pronouns (“he”, “it”, …) and possessive adjectives (“his”,
“her”,…) which did not have any referent in the question under consideration was
therefore considered an indication of a clarification question. Notice there is no need to
use any form of coreference resolution to classify these questions as being part of a wider
series.
2. Ellipsis, as in:
- What type of vessel was the modern Varyag?
- …
- In what country is this facility located?
- On what body of water?
- How long is the Varyag?
- How wide?
where the incomplete syntactical construction is a clear indication that the question
referred to some previous question or answer.
3. Semantic relations between words in question series, as in the following:
- Which museum in Florence was damaged by a major bomb explosion?
- Which galleries were involved?
- How many people were killed?
- …
- How much explosive was used?
where there is a clear semantic relation between "museum" and "galleries", and between "explosion", "killing" and "explosive". Questions belonging to a series were "about" the same subject, and this aboutness could be seen in the use of semantically related words. A
particular case of semantic relation between words was the repetition of proper nouns, as
in:
- What type of vessel was the modern Varyag?
- …
- How many aircraft was it designed to carry?
- How long was the Varyag?
where the repetition of the proper noun indicates that the same subject matter is under
investigation.
7 Experiments in Clarification Dialogue Recognition
We speculated that an algorithm which made use of these features would successfully
recognize the occurrence of clarification dialogue. In order to verify this hypothesis we
collected two sets of new data on which to test the algorithm: we made use of the first set to
carry out an initial evaluation; following this initial evaluation changes were made to the
algorithm and therefore a second set of data was necessary in order to test the changes. The
collected questions were fed into a system implementing the algorithm, which attempted to
recognize the occurrence of a clarification dialogue and the results given by the system were
compared to the manually assigned clarification dialogue tags. We then conducted a
number of experiments to verify the usefulness of clarification dialogue recognition in
improving answer retrieval performance in a question answering system.
8 Collection of new data
This was done in two stages: a first collection which aimed at testing the algorithm and
understanding problems associated both with the algorithm and the collection of the dialogue
data itself; a second collection which improved the data collection process in light of the
problems noted in collecting the first data and was used to test any modifications of the
algorithm made in light of the first collection.
8.1 Dialogue Collection A
Given that the only available data was the collection of “context” questions used in TREC-10,
it was felt necessary to collect further data in order to test our algorithm rigorously. This was
necessary both because of the small number of questions in the TREC data and the fact that
there was no guarantee that an algorithm built for this dataset would perform well on “real”
user questions. A collection of 253 questions was therefore put together using the method
described below.
A number of questions on a wide range of topics were chosen from the TREC collection used
for the TREC QA open domain question answering track (not the context question collection)
and used as a collection of possible dialogue topics. 24 users were then invited to interact with
a question answering system using the “Wizard of Oz” method (see Preece et al. (1994) for a
description of this methodology). First they were required to choose a topic from the given
collection of topics. They were then asked to seek information on the given topic; no details
were given as to how much information or what type of information was to be collected and
users were told it was up to them what information and how much information was to be
gathered. As an “ice-breaker” they were then invited to ask the QA system the exact question
on the topic as given in the TREC collection, after which they were free to interact with the
system as they felt appropriate. In order to ensure a realistic interaction which was not dependent on the performance of the Question Answering system itself, the questions were answered by an operator using the Google search engine to find relevant answers. The questions thus collected made up 24 clarification dialogues, varying in length from 3
questions to 23, with an average length of 12 questions. A typical interaction would therefore
proceed as follows:
Question topic chosen: Philadelphia.
Initial “ice-breaker” question (n. 1526 in the TREC-11 question series): “What is the
city of brotherly love?”
System: “Philadelphia”
At this point the user was free to ask any question to the system and
proceeded with:
User: “Where is Philadelphia?”
System: “USA”
User: “Where more precisely?”
etc.
Once collected the questions were recorded electronically and manually tagged to recognize
the occurrence of clarification dialogue.
Users were chosen from a variety of age groups (20-60), educational and social backgrounds (from researchers with PhDs to administrators with middle school qualifications), and care was taken to ensure that the sexes were equally balanced in the sample population. This ensured that, in the spirit of "open domain" question answering, we were not targeting any specific user population and instead gathered data from potential users with a wide range of different needs and expectations.
Unlike the TREC collection, the collection we gathered highlighted the tendency of users to make use of cue-words such as "exactly" or "precisely" (as in "where exactly?") to clarify previously asked questions. Other than this there did not appear to be any significant difference between the types of questions asked in the TREC collection and in the collection we gathered (apart from one single user who attempted to have a lifelike dialogue with the system, asking some very intricate questions). The average length of a dialogue interaction was, however, significantly higher in our collection (an average of 12 questions per dialogue as opposed to an average of 4), with, as would be expected, a wider range of extremes (the shortest dialogue in our collection was of 3 questions and the longest 23, as opposed to 4 and 8 in the TREC context questions).
8.2 Dialogue Collection B
It was noted that the method used to collect dialogue data was lacking in rigour due to:
a) not having given the users a specific task to accomplish: without a specific goal in mind it was difficult to compare the dialogues;
b) not having a well-defined procedure for deciding what answer to give the user in response to follow-up questions.
More experimental data was therefore gathered. In this case, users were given the task of
collecting enough information from the system in order to then be able to write a short
paragraph on a given topic which they had no information about beforehand (they were asked
if they had any information about a topic before starting the experiment). They then interacted
with the QA system using the wizard-of-Oz methodology in order to gather information. The
operator of the QA system consulted the Google search engine for an answer and provided the
first answer found in the retrieved documents by taking the sentence snippet from the
document which answered the question in the most concise manner.
A total of 16 dialogues were collected: dialogue length was much more consistent than in Collection A, with an average length of 9 questions per dialogue, a maximum of 17 and a minimum of 5. There appeared to be no significant difference compared to the previously collected dialogues.
The differences between the TREC "context" collection and the new collections are summarized in the following table:

                Groups   Qs    Av. len   Max   Min
TREC            10       41    4         8     4
Collection A    24       253   12        23    3
Collection B    16       144   9         17    5

9 Clarification Recognition Algorithm
Our approach to clarification dialogue recognition looks at certain features of the question
currently under consideration (e.g. pronouns and proper nouns) and compares the meaning of
the current question with the meanings of previous questions to determine whether they are
“about” the same matter.
Empirical work such as (Ginzburg 1998b) and (Purver et al. 2002) indicates that questioners
do not usually refer back to questions which are very distant, and this was consistent with our
data. In particular Purver et al. analyzed the English dialogue transcripts of the British
National Corpus, finding that clarification request source separation (CSS, the distance
between a question and the question or answer which it is attempting to clarify) was at most
15 sentences and usually less than 10 sentences. We therefore set the question window to be the average length of the clarification dialogues in the two sets of data, and hence considered the 8 previously asked questions, i.e. set the question window n = 8. This is
consistent with the empirical observations of Purver et al. as our maximum distance of 8
questions is equivalent to a CSS of 16 sentences (i.e. 8 pairs of questions and answers).
A question is deemed to be a clarification of a previous question if:
1. There are direct references to nouns mentioned in the previous n questions through the use of pronouns (he, she, it, …) or possessive adjectives (his, her, its, …) which have no referent in the current question; this was altered after the experiments carried out on the first sample of collected data to also include cue-words such as "precisely", "exactly", etc., clearly indicating a reference to a previous question or answer.
2. The question does not contain a verb phrase
3. There are explicit references to proper and common nouns mentioned in the previous n
questions, i.e. repetitions which refer to an identical object; or there is a strong sentence
similarity between the current question and the previously asked questions.
In other words, given
- a question q0
- a question window n which determines how far back a question can refer within the clarification dialogue sequence
- n previously asked questions q-1..q-n
we have a function Clarification_Question which is true if a question is considered a clarification of a previously asked question:

Clarification_Question( q0, q-1..q-n )

is true if any of the following are true:
1. q0 has pronoun or possessive adjective references to q-1..q-n
2. q0 does not contain any verbs
3. q0 has repetition of common or proper nouns in q-1..q-n, or q0 has a strong semantic similarity to some q ∈ q-1..q-n
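A minimal sketch of this function in Python is given below; the helper functions contains_verb and semantic_similarity are assumed to exist (for instance built on a part-of-speech tagger and on the WordNet-based metric of chapter 6), the word lists are abbreviated, and the similarity threshold is purely illustrative.

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}
POSSESSIVES = {"his", "her", "its", "their"}
CUE_WORDS = {"exactly", "precisely"}      # added after the Collection A experiments
SIM_THRESHOLD = 0.5                       # illustrative value only

def clarification_question(q0, previous_questions, contains_verb, semantic_similarity):
    # True if q0 is judged to clarify one of the n previously asked questions.
    words = {w.strip("?,.").lower() for w in q0.split()}
    # 1. Pronouns, possessive adjectives or cue words with no referent in q0 itself.
    if words & (PRONOUNS | POSSESSIVES | CUE_WORDS):
        return True
    # 2. Elliptical question: no verb at all.
    if not contains_verb(q0):
        return True
    # 3. Repeated nouns (approximated here by word overlap) or strong semantic similarity.
    for q in previous_questions:          # q-1 .. q-n
        overlap = words & {w.strip("?,.").lower() for w in q.split()}
        if overlap or semantic_similarity(q0, q) > SIM_THRESHOLD:
            return True
    return False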
This basic algorithm was improved by using a sigmoid function to simulate decaying
importance of previous questions in order to avoid a sharp step. The algorithm above can be
considered a step function with an abrupt cut-off point which makes a binary decision on the
possible relevance of previous questions to the current question. In particular, a question
which is too far back to be included in the question window n will be ignored, no matter how
strongly it is related to the current question; furthermore, no differentiation is made between
recently asked questions and questions which are significantly more distant in time as long as
they are within the allowed window: hence, when looking at similarities between previous
questions and the current question, there is no distinction between the question
which
immediately preceded the current one and a question which was uttered 6 moves ago in the
dialogue. What we really want is to require stronger similarity the further away a question is: we are
therefore looking for a scaling factor which ensures that in the limit weak similarities between
very close questions are more important than strong similarities between very distant
questions.
Fig. 1 shows the initial function used in the algorithm above, a step function which ignores any questions outside the given window of 8 questions. Fig. 2 shows the function used instead of the step function, derived from a basic sigmoid function (see note 5 below): this is much more satisfactory as it ensures that questions outside the given window are not simply ignored but considered relevant with a lower degree of probability.
Fig. 1: the step function, plotting value against question distance (value 1 for questions within the window of 8, 0 beyond it).
Fig. 2: the sigmoid-derived function, plotting value against question distance (decaying smoothly towards 0 beyond the window of 8).
The algorithm above was therefore modified so that when looking at similarities between
current and past questions the distance between the questions was also taken into account. In
particular the similarity measure was weighted according to the formula given in point 3.a below. This ensured that as
the distance between questions increased the similarity between the questions had to increase
in order to be considered relevant, i.e. part of the same dialogue. We therefore have:
5 Given the basic sigmoid function y = 1 / (1 + exp(-x)), we are seeking a function which rapidly decays once it has reached the given question window n. The sought-after function is therefore y = 1 - 1 / (1 + e^(-(x - n))), which in our case (n = 8) becomes (fig. 2): y = 1 - 1 / (1 + e^(-(x - 8))).
3.a q0 has repetition of common or proper nouns in some q ∈ q-1..q-n, or q0 has a strong semantic similarity to some q ∈ q-1..q-n, weighted according to the distance between q0 and q using the formula similarity = similarity * (1 - (1 / (1 + exp(-(m - 8))))), where m indicates the number of questions between q0 and q.
Similarity between questions is therefore measured as a decaying function, not a step
function, allowing for references to questions far back in the dialogue, but without running
into the danger of overplaying similarities which, due to the distance between the questions,
should not be considered significant.
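The weighting of point 3.a can be written directly as a small function; the sketch below simply transcribes the formula (with the window n = 8 used in this chapter) and is not taken from the YorkQA source.

import math

def distance_weight(m: int, n: int = 8) -> float:
    # Decaying sigmoid weight: close to 1 for recent questions and falling
    # towards 0 once the distance m exceeds the question window n.
    return 1.0 - 1.0 / (1.0 + math.exp(-(m - n)))

def weighted_similarity(similarity: float, m: int) -> float:
    # similarity = similarity * (1 - (1 / (1 + exp(-(m - 8)))))
    return similarity * distance_weight(m)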
10 Sentence Similarity Metric
A major part of our clarification dialogue recognition algorithm is the sentence similarity
metric which looks at the similarity in meaning between the current question and previous
questions. The sentence similarity measure was given by the same method described above to
calculate semantic similarity (see chapter 6), using WordNet as background knowledge and
taking into consideration information about compound nouns, proper nouns and word
frequency.
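For illustration only, a very rough approximation of such a metric could be written with NLTK's WordNet interface as below; this sketch uses simple path similarity between word senses and omits the compound-noun, proper-noun and word-frequency information used by the actual measure of chapter 6.

from nltk.corpus import wordnet as wn

def word_similarity(w1: str, w2: str) -> float:
    # Best path similarity between any senses of the two words (0 if none found).
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def sentence_similarity(q1: str, q2: str) -> float:
    # Average, over the words of q1, of the best match found among the words of q2.
    words1 = [w.strip("?,.").lower() for w in q1.split()]
    words2 = [w.strip("?,.").lower() for w in q2.split()]
    if not words1 or not words2:
        return 0.0
    return sum(max(word_similarity(a, b) for b in words2) for a in words1) / len(words1)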
11 Results
An implementation of the algorithm was evaluated on the TREC context questions used to
develop the algorithm and then on the collection of new clarification dialogue questions.
Initially, the individual components of the algorithm were evaluated separately to gauge their
contribution to the overall performance of the algorithm. The overall algorithm, optimized for
best performance, was then evaluated. The evaluation consisted in testing the algorithm’s
ability to:
• recognize a new series of questions (indicated by N, for "New", in the results table)
• recognize that the current question is clarifying a previous question (indicated by C, for "Clarification", in the table)
In order to understand the overall performance of the algorithm an F-measure or weighted harmonic mean (van Rijsbergen 1979) is used, based on the formula:

F-measure = 2NC / (N + C)
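For instance, the full algorithm's figures on the TREC data reported below (N = 90%, C = 77%) combine as follows (a trivial worked example):

def f_measure(n: float, c: float) -> float:
    # Harmonic mean of new-series recognition (N) and clarification recognition (C).
    return 2 * n * c / (n + c)

print(round(f_measure(0.90, 0.77), 2))   # 0.83, i.e. the 83% reported for the TREC data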
The individual components were evaluated as follows:
• Common Words (CW). This was a baseline method which did not use any linguistic information and simply took a question to be a clarification question if it had any words in common with the previous n questions, else took the question to be the beginning of a new series.
• Reference Words (RW). This method employed point 1 of the algorithm described in section 6 by looking for "reference" keywords such as he, she, this, so, etc. which clearly referred to previous questions. Interestingly this did not misclassify any "new" questions.
• Absence of Verbs (AV)+RW. This method employed points 1 and 2 of the algorithm described in section 6 by looking for the absence of verbs combined with keyword lookup.
• Noun Similarity (NS1)+AV+RW. This method employed the full algorithm described in section 6 by looking at the similarity between nouns in the current question and nouns in the previous questions, in addition to reference words and the absence of verbs. Note that the data was manually checked to identify and correct gross errors in compound noun identification which were noted in initial experiments (and which were responsible for degrading performance in the identification of new sequences of Collection A to 67%).
• Noun Similarity (NS2)+AV+RW. This differed from the previous method in that it specified a similarity threshold when employing the similarity measure.
• Decaying Function (DF)+PS+NS2+AV+RW. This implemented the full algorithm, with the use of a decaying function to give a weighting to the similarities between questions in a sequence (i.e. using point 3.a above as opposed to 3). Moreover this method employed a similarity threshold which was optimized for best performance.
• An examination of the results of experiments carried out on test Collection A with the features examined above indicated that it was also necessary to consider Answer Similarity (ANS) (this feature was not evident in the TREC data collection due to the way the data was put together), for instance clarifying the meaning of a word contained in the answer, or building upon a concept defined in the answer. An example was the question "What did Antonio Carlos Tobim play?" following "Which famous musicians did he play with?" in the context of a series of questions about Frank Sinatra: Antonio Carlos Tobim was referred to in the answer to the previous question, and nowhere else in the exchange. ANS indicated a strong semantic relationship between the current question and the answer given immediately before the question was asked. In the gathered data questions referred to the immediately preceding answer, and not to answers within a given "window"; consequently, our algorithm only considered the immediately preceding answer. The fact that this feature was identified following an analysis of the experiments carried out on Collection A made it necessary to gather and experiment on Collection B to verify the strength of the algorithm.
The results on the TREC data, which was used to develop the algorithm, are summarized in
the following table:
TREC          CW     RW     RW+AV   RW+AV+NS1   RW+AV+NS2   RW+AV+NS2+DF
N             90%    90%    90%     60%         80%         90%
C             47%    53%    59%     78%         72%         77%
F-measure     62%    67%    71%     68%         76%         83%
The results for the same experiments conducted on the collected data were as follows:
Collected A   CW     RW     RW+AV   RW+AV+NS1   RW+AV+NS2   RW+AV+NS2+DF   RW+AV+NS2+DF+ANS
N             100%   100%   100%    71%         87%         96%            96%
C             64%    62%    66%     91%         89%         93%            96%
F-measure     78%    77%    80%     80%         88%         94%            96%
The same experiments, using all the features available to the full algorithm, were then carried
out on the Dialogue Collection B; the only modification to the algorithm was the addition of
cue words to point 1 of the algorithm above. Results were similar to the results for Collection A, as can be seen in the following summary table:
Collected B   Full Algorithm
N             93%
C             96%
F-measure     94%
In the experiments above the data was provided to the algorithm in the order in which it was collected. In order to verify how the particular order in which the questions were given influenced the results, another series of experiments was carried out in which the dialogues from Collection A and Collection B were reshuffled and fed into the full algorithm. Results are given in the following table:
              No. of permutations   Average F-measure   Standard Deviation
Collected A   6                     95.9                0.84
Collected B   6                     94.3                0.98
As can be seen, the particular order in which the dialogues were given had a negligible effect
on the results.
Problems noted were:
• False positives: questions following a similar but unrelated question series. E.g. "Are they all Muslim countries?" (talking about religion, but in the context of a general conversation about Saudi Arabia) followed by "What is the chief religion in Peru?" (also about religion, but in a totally unrelated context). Optimization of the similarity threshold partially solved this problem.
• Absence of relationships in WordNet, e.g. between "NASDAQ" and "index" (as in share index). Absence of verb-noun relationships in WordNet, e.g. between "to die" and "death", between "battle" and "win" (i.e. after a battle one side generally wins and another side loses), "airport" and "visit" (i.e. people who are visiting another country use an airport to get there).
As can be seen from the tables above, the same experiments conducted on the TREC context
questions yielded slightly worse results; a failure analysis revealed this was mostly due to the
inability to find semantic relationships in WordNet between words in the domain of two of
the ten questions: explosives (explosion – explosives – bomb) and wine growing (winery –
grape). The small sample size meant that errors in only two of the questions significantly
affected the results for the overall performance (the errors in the two questions were in fact
responsible for 71% of the mistakes made by the system).
The performance of the individual components of the algorithm followed a similar pattern in
both the TREC collection and collection A, with the individual components cumulatively
contributing to increasing precision in clarification recognition or new dialogue recognition,
which confirmed the usefulness of the components. The results on Collection B confirmed the
strength of the algorithm even with a slightly different type of dialogue setting.
12 Usefulness of Clarification Dialogue Recognition
Recognizing that a clarification dialogue is occurring only makes sense if this information can
then be used to improve answer retrieval performance. We hypothesized that clarification
dialogue recognition would in fact enable us to simplify the answer retrieval process (and
hence improve performance) by adding constraints to what the answer could be. Noting that a
questioner is trying to clarify previously asked questions is in fact important in order to
determine the context in which an answer is to be sought: answers to certain questions are
constrained by the context in which they have been uttered. The question “What does
attenuate mean?”, for example, may require a generic answer outlining all the possible
meanings of “attenuate” if asked in isolation, or a particular meaning if asked after the word
has been seen in an answer (i.e. in a definite context which constrains its meaning). In other
cases, questions do not make sense at all out of a context: no answer could be given to the
question “where?” asked on its own, while following a question such as “Does Sean have a
house anywhere apart from Scotland?” it becomes an easily intelligible query.
The usual way in which Question Answering systems constrain possible answers is by
restricting the number of documents in which an answer is sought by filtering the total
number of available documents through the use of an information retrieval engine. The
information retrieval engine selects a subset of the available documents based on a number of
keywords derived from the question at hand. In the simplest case, it is necessary to note that
some words in the current question refer to words in previous questions or answers and hence
use these other words when formulating the IR query. For example, the question “Is he
married?” cannot be used as is in order to select documents, as the only word passed to the IR
engine would be “married” (possibly the root version “marry”) which would return too many
documents to be of any use. Noting that the “he” refers to a previously mentioned person (e.g.
“Sean Connery”) would enable the answerer to seek an answer in a smaller number of
documents. Moreover, given that the current question is asked in the context of a previous
question, the documents retrieved for the previous related question could provide a context in
which to initially seek an answer.
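A sketch of this kind of query reformulation is given below (hypothetical Python, not YorkQA's implementation; the pronoun list and the choice of the most recently mentioned entity are deliberate simplifications).

PRONOUNS = {"he", "she", "it", "they", "him", "her", "his", "its", "their"}

def reformulate(question: str, previous_entities: list) -> str:
    # Replace unresolved pronouns with the most recently mentioned entity
    # before the question is passed to the IR engine.
    if not previous_entities:
        return question
    entity = previous_entities[-1]
    tokens = [entity if t.strip("?,.").lower() in PRONOUNS else t
              for t in question.split()]
    return " ".join(tokens)

print(reformulate("Is he married?", ["Sean Connery"]))   # "Is Sean Connery married?"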
In order to verify the usefulness of constraining the set of documents in which to seek
answer, a subset made of 15 clarification dialogues (about 100 questions) from the given
question data was analyzed by taking the initial question for a series, submitting it to the
Google Internet Search Engine and then manually checking to see how many of the questions
in the series could be answered simply by using the first 20 documents retrieved for the first
question in a series. The results are summarized in the following figure and in the list below:
[Figure: breakdown of clarification questions by the search technique needed to answer them (first question in series, words in question, coreference, mini-clarification, other).]
• 69% of clarification questions could be answered by looking within the documents used for the first question in the series, thus indicating the usefulness of noting the occurrence of clarification dialogue. In contrast, using a standard QA system which simply used the clarification question as a query to be sent to the search engine, only 37% of the answers could be found in the first 20 retrieved documents.
• For the remaining 31% a different approach had to be taken, in particular:
  • 6% could be answered after retrieving documents by using the words in the current question as search terms (e.g. "What caused the Boxer uprising?");
  • 14% required some form of coreference resolution and could be answered by combining the words in the question with the words to which the pronouns in the question referred (e.g. "What film is he working on at the moment", with the reference to "he" resolved, which gets passed to the search engine as "What film is Sean Connery working on at the moment?");
  • 7% required more than 20 documents to be retrieved by the search engine or other, more complex techniques: for example, a question such as "Where exactly?" required both an understanding of the context in which the question was asked ("Where?" makes no sense on its own) and the previously given answer (which was probably a place, but not restrictive enough for the questioner);
  • 4% constituted subdialogues within a larger clarification dialogue (a slight deviation from the main topic which was being investigated by the questioner) and could be answered by looking at the documents retrieved for the first question in the subdialogue.
Recognizing that a clarification dialogue is occurring therefore can simplify the task of
retrieving an answer by specifying that an answer must be in the set of documents used to
answer the previous questions. This is consistent with the results found in the TREC context
task (Voorhees 2002), which indicated that systems were capable of finding most answers to
questions in a context dialogue simply by looking at the documents retrieved for the initial
question in a series. As in the case of clarification dialogue recognition, simple techniques can
resolve the majority of cases; nevertheless, a full solution to the problem requires more
complex methods. The last case indicates that it is not enough simply to look at the
documents provided by the first question in a series in order to seek an answer: it is necessary
to use the documents found for a previously asked question which is related to the current
question (i.e. the questioner could "jump" between topics). For example, given the following
series of questions starting with Q1:
Q1: When was the Hellenistic Age?
[…]
Q5: How did Alexander the great become ruler?
Q6: Did he conquer anywhere else?
Q7: What was the Greek religion in the Hellenistic Age?
where Q6 should be related to Q5 but Q7 should be related to Q1, and not Q6. In this case,
given that the subject matter of Q1 is more immediately related to the subject matter of Q7 than Q6 (although the subject matter of Q6 is still broadly related, it is more of a specialized subtopic), the documents retrieved for Q1 will probably be more relevant to Q7 than the documents retrieved for Q6 (which would probably be the same documents retrieved for Q5).
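The following sketch illustrates how the documents of the most closely related previous question could be selected as the initial search space; the word-overlap measure, the stopword list and the function names are illustrative assumptions rather than the procedure actually used.

# Illustrative sketch (Python): reuse the documents retrieved for the most
# closely related previous question as the initial search space for the
# current question. The overlap measure and stopword list are simplifying
# assumptions, not the actual YorkQA procedure.

STOPWORDS = {"the", "a", "an", "of", "in", "was", "is", "did", "what",
             "when", "where", "who", "how", "he", "she", "it", "this"}

def content_words(question):
    """Lower-cased content words of a question."""
    return {w.strip("?.,").lower() for w in question.split()} - STOPWORDS

def most_related_documents(current_q, history):
    """history: list of (previous_question, retrieved_documents) pairs,
    oldest first. Returns the documents of the previous question sharing
    the most content words with the current question; falls back to the
    first question in the series when there is no overlap."""
    best_docs, best_overlap = history[0][1], 0
    for prev_q, docs in history:
        overlap = len(content_words(current_q) & content_words(prev_q))
        if overlap > best_overlap:
            best_docs, best_overlap = docs, overlap
    return best_docs

In the series above, Q7 shares the words “Hellenistic” and “Age” with Q1 but no content words with Q6, so under this sketch the documents retrieved for Q1 would be chosen.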
13 Conclusion
In order to ensure that an answer is relevant from a goal-directed point of view, i.e. in order to
fully meet users’ needs and goals, attention must be paid to the occurrence of clarification
dialogues: in the case of clarification dialogue, the meaning of, and the intention behind,
users’ questions is constrained by previously asked questions and previously given answers.
But this in turn constrains the answer: in order to satisfactorily answer a question it is
therefore necessary to refer to the previous exchange of questions and answers. From this it
follows that recognizing that the question that is currently being examined by a question
answering system is part of a clarification dialogue is an important task to be carried out
before attempting to find an answer. An algorithm was developed, based on the “context” question sequences prepared by NIST for the TREC-QA task, which recognized the occurrence of clarification dialogue over 94% of the time (a combined measure indicating the ability to recognize that a new clarification dialogue is taking place or that the current question is part of an ongoing clarification dialogue). The algorithm was then tested on a new collection of clarification dialogues which was gathered for this purpose. The component parts of the algorithm were examined in detail and it was shown to be able to recognize the occurrence of a clarification dialogue. Finally, it was shown experimentally that in automated
open-domain question answering the use of an algorithm to recognize that a clarification
dialogue is occurring can simplify the task of answer retrieval by constraining the subset of
documents in which an answer is to be sought.
Chapter 8
Logical Relevance
Executive Summary
The idea of logical relevance is introduced and is shown to link question type analysis and logical
proofs of answers in TREC-style systems. A possible implementation based on the idea of constraint
relaxation rules is then presented and justified and is evaluated on the TREC data collection.
1 Introduction
Logical relevance is concerned with finding a relationship between an answer and the
unknown element that the questioner is seeking to know by asking a question. The
relationship is given by the way we reason about it, i.e. by some form of logic: syllogisms, mathematical logic, persuasion or other methods of reasoning.
TREC-11 systems have been concerned with recognising correct and exact answers by
finding a logical connection between questions and answers, adopting variations of simple
logical rules such as the following:
If a question starts with the word “where”,
then the answer must be a place
else if a question starts with the word “when”,
then the answer must be a time
Used in combination with rules such as:
If a word in a document is in a database of place names, or it
starts with a capital letter and is preceded by “in”, …
then it must represent a place
else if a word in a document is in a database of person names,
or is preceded by “Mr.”, “Dr.”,…
then it must represent a name
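Purely for illustration, rules of this kind could be sketched as follows; the gazetteer, name list and titles are placeholders, not the resources actually used by TREC-style systems.

# Illustrative sketch (Python) of the simple answer-typing and
# entity-typing rules above; the word lists are placeholders.

PLACE_NAMES = {"Rome", "Tokyo", "New Orleans"}   # placeholder gazetteer
PERSON_NAMES = {"Brutus", "Rachel"}              # placeholder name list
TITLES = {"Mr.", "Dr.", "Mrs."}

def expected_answer_type(question):
    first_word = question.split()[0].lower()
    if first_word == "where":
        return "place"
    if first_word == "when":
        return "time"
    return "unknown"

def entity_type(word, previous_word):
    if word in PLACE_NAMES or (word[:1].isupper() and previous_word == "in"):
        return "place"
    if word in PERSON_NAMES or previous_word in TITLES:
        return "person"
    return "unknown"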
(The actual rules can be much more complex, for example making use of statistical
information). More precision has been sought by seeking to find not only a logical
connection, but a logical proof which connects question and answer, as in the following
example:
Given the question “Who killed Caesar?”,
if it is true that “Brutus killed Caesar.”
then the answer is “Brutus”
Proven (informally) as follows:
there is an x such that x is a person and x killed Caesar,
but Brutus is a person and Brutus killed Caesar,
therefore Brutus is the person who killed Caesar.
We seek here to move beyond both logical connection and logical proof, to introduce
the idea of logical relevance, which links both logical connection and logical proof.
2 The difference between logical relevance and previous approaches
Logical relevance aims to:
a) move beyond the rigidity of formal proofs, which cannot be applied to “real” user questions and answer sentences, both of which contain errors and inconsistencies,
but, more importantly,
b) provide a measure of relevance, i.e. move beyond the notion of a “true” or “false”
answer and instead present a ranking of answers which are to a greater or lesser
degree logically related to a question
a) While from a theoretical point of view a sentence should only be considered as an answer if
it can be transformed into some logical form which we can then logically prove to provide an
answer to a question, in practice this is not always possible. Often, “real” text such as
newspaper articles and “real” user questions contain inconsistencies, bad grammar and
complex phrasing which mean they cannot be easily transformed into a logical form.
Moreover, even when a question and a candidate answer sentence have been transformed into
logical form, it is not always possible to prove the answerhood of an answer due to problems
such as gaps in the knowledge base used by the theorem prover or contradictions in the text.
b) Research such as that carried out by Moldovan et al. (2003 and 2003a), which has
successfully made use of theorem proving for finding answers to questions in a (relatively)
realistic setting such as the TREC experiments, has recognised that it is not always possible to
have a “perfect” proof that an answer sentence is an answer to a question and hence has made
use of methods to prove an answer by eliminating some of the constraints on the proof.
Nevertheless, research in this area has ignored the notion of relevance (as set out above in
Chapter 4).
From an intuitive point of view, there are a number of reasons why a questioner could be
interested in being given a ranking of logically relevant answers as opposed to receiving a
unique answer which has been proven through some form of logical mechanism: the
questioner might not fully trust the document from which an answer is derived; or might find
additional information useful; or perhaps is unsure about some of the details in the question
(e.g. was Kennedy’s name “Rob” or “Robert”, when asking “When was Rob Kennedy
born?”). While giving an answer which is a logical answer to the given question, by using
logical relevance an answerer does not limit itself to the “obvious”, but provides a whole
range of potentially useful information.
Here we shall build on the idea that eliminating constraints on the proof of an answer can be
used not only to find answers where answers are malformed, contradictory or where there is
insufficient knowledge, but more importantly to be able to provide a ranking of answers in
order of relevance. What we are trying to do is show how different answers, while logically
connected to the question, and while giving an answer to the problem the questioner is asking,
cannot all be considered equally truthful, nor equally false. For example, given the question:
Q: Where does Charlotte live?
And the answer sentences:
A1: Charlotte was born in Austria.
A2: Charlotte works in Madrid.
A3: Charlotte owns a house in Rome.
A4: Charlotte lives in Tokyo.
While it is clear that A4 contains the most “logical” answer as we could “prove” from Q and A4 that the answer to Q is “Tokyo”, logical relevance tells us not to simply discard the other information. While the answer “Madrid” would contradict the information given in A4, there is the nagging question of how, if Charlotte lives in Tokyo, she can be working in Madrid, as Spain and Japan are hardly within commuting distance. On the other hand if Charlotte owns a house in Rome, she may be either a property speculator, or may enjoy holidays in Italy, in which case she does live in Rome for at least part of the year. Finally, it is not unusual for people to live in the place they were born, or near the place they were born, in which case “Austria” could provide some clues as to the whereabouts of Charlotte if it turned out that the information in A2, A3 and A4 was incorrect. Logical relevance attempts to resolve these issues by ranking the answers by relevance as opposed to declaring all answers but A4 to be false (as a typical TREC machine would). Logical relevance would therefore present the questioner with a ranking of the answers above by judging A4 to contain the most relevant answer (Tokyo), followed by A2 (Madrid), then A3 (Rome) and finally A1 (Austria). If the questioner considered that the document containing the sentence A4 was unreliable they would therefore still have some useful information with which to work.
Summarising, logical relevance has the following properties:
•
It attempts to prove a logical relationship between a question and an answer sentence: it attempts to show that an answer “resolves” a question, not simply that it is “about” the question. In semantic relevance the wh-word in a question is ignored; in logical relevance it becomes crucial.
•
While in TREC-style systems correct answers must be linked to a question through a
logical connection (question type and answer type correspond) or (in the case of the more
refined systems) logical inference (a proof), logical relevance is not the only property
which defines the overall relevance of an answer to a question, nor is it necessarily the
most important component of relevance. Logical relevance provides supporting evidence
given the relevance of an answer to a question from a semantic, goal-directed and
morphic point of view: an answer may be relevant even if it is not strictly logically
relevant (we can accept less than a perfect or strict proof that an answer is an answer to a
question). Given the following, for example:
Q: When is the train leaving?
A1: Ask the conductor over there, he'll put you on it.
A2: In the future
While A2 is a relevant answer from a logical point of view (it is a time, and the question is asking about a time, as indicated by the word “When”), A1, which is relevant from a semantic point of view, and, to a certain extent, from a goal-directed point of view, but not from a logical point of view (it does not directly give the required information), is a much better answer. TREC-style systems, looking for a connection between the question’s wh-word and the answer type, would however classify A2 as “the” correct answer.
•
Logical relevance is not equivalent to the strict logical inference of answers. We don't
necessarily require an answer to strictly entail a question using the methods of
mathematical logic, as it may be the case that our background knowledge is lacking or our
theorem proving methods incomplete or inconsistent: logical relevance is seeking
reasonable proof that something could be an answer, not a mathematical proof that it is
an answer.
The implementation we propose will build on the work carried out for TREC-style question
answering. We shall therefore first describe how logical connections and inferences can be
implemented in TREC-style systems and then show how connections and inferences can be
seen as the two extremes of logical relevance, proposing a method for determining relevant
answers falling between these two extremes.
3 Approaches to finding the logical connection between a question and an answer
3.1 Work prior to TREC
Besides the work seen above in Chapter 2 by logicians such as Belnap and Steel (1976) on
question answering, theorem proving has long been used as a model for question answering
systems. Some of the earliest examples of this approach are Green and Raphael (1968), Green
(1969), Luckham and Nilsson (1971) and Reiter (1978) who used resolution refutation to
prove answers; in this framework, the answer to a “yes/no”-type question Q is given by
adding the negation of Q to a knowledge base K: if this addition renders the
knowledge base inconsistent, then the answer to the question is positive. More recent work
has included that by Cholvy and Demolombe (1986) and Cholvy (1990) who used a logic
framework in order to understand generic answers; Borgida and McGuinness (1996), who
used description logics in order to find both generic and specific answers to questions; and
Burhans (2002), who carried out an in-depth analysis of the use of resolution refutation for
question answering.
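As a minimal illustration of the resolution refutation approach described above (with invented facts), suppose the knowledge base K contains:

bird( tweety ).
all x (bird( x ) -> flies( x )).

and the “yes/no” question Q is “Does Tweety fly?”, i.e. flies( tweety ). Adding the negation -flies( tweety ) to K renders the knowledge base inconsistent, since flies( tweety ) follows from the two clauses in K; the answer to Q is therefore positive.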
Most of this work has however been concerned with the idea of finding (or proving) unique
answers rather than relevant answers. An exception in this sense is Burhans, who explicitly
talks of relevant answers, and partitions them into specific, generic and hypothetical answers.
However, while this body of work provides some satisfactory analysis from a theoretical
perspective, it is difficult to see how these ideas could be realistically applied to a
task such as the TREC QA track, which uses actual user questions and documents such as
newspaper articles which cannot always be easily transformed into a logical form and require
extensive background knowledge to be understood correctly. Here we shall present a method
of overcoming these limitations by making use of the notion of relevance and showing how
logical relevance may be implemented in a realistic setting, within the framework provided by
the TREC QA test data (while the TREC test data does have its limitations - for example the
fact that the document collection is made up of news articles, ignoring texts such as novels
and manuals - it still offers a useful indication of the sort of problems that a QA system would
encounter with “real” data).
3.2 Logical connections in TREC-style QA systems
TREC-style systems such as the YorkQA system described above attempt to find a logical
connection between question and answer by first analysing the question to determine the
question type and then analysing the answer documents to find a Named Entity of the
appropriate type.
For example, given a question such as
Where did Donald Degnan live?
And a document such as the following (part of NYT19990821 in the TREC/AQUAINT
document collection):
Donald E. Degnan, an advertising executive who helped guide the sport of competitive croquet through a period of extraordinary growth in the late 1980s and early 1990s, died on Aug. 7 while on a business trip to New Orleans. He was 76 and lived in Palm Beach, Fla., and Westhampton Beach, N.Y. The cause was a heart attack, said his wife, Connie Merlino.
A system such as YorkQA would start by analysing the question, noting that the question starts with the wh-word “where” and concluding that the answer should be of type “place”, through a rule such as:
Where X? → answer type = place
It would then analyse the documents (in this case a single document), splitting them into
separate sentences and seeking Named Entities of the correct type, tagging the document as
follows (the YorkQA system used XML to tag documents for Named Entities, but also to tag
such information as beginning and end of sentences and paragraphs and part-of-speech
relations; for simplicity the following snippet is tagged in a more readable fashion only for
named entities):
Sentence 1: <person>Donald E. Degnan</person>, an advertising executive who helped guide the sport of competitive croquet through a period of extraordinary growth in the late <date>1980s</date> and early <date>1990s</date>, died on <date>Aug. 7</date> while on a business trip to <location>New Orleans</location>.
Sentence 2: He was <number>76</number> and lived in <location>Palm Beach, Fla.</location>, and <location>Westhampton Beach, N.Y.</location>
Sentence 3: The cause was a heart attack, said his wife, <person>Connie Merlino</person>.
In this case both Sentence 1 and Sentence 2 contain entities of type “location” and may
therefore contain an answer. Sentence 3 on the other hand does not contain any entities of the
correct type and is therefore ignored. The YorkQA system then proceeds to identify the
sentence containing the answer by looking at the semantic similarity between question and
answer: sentence 2 should therefore be chosen as the sentence most likely to contain an answer, the two locations “Palm Beach, Fla” and “Westhampton Beach, N.Y.” will be considered candidate answers, and the one closest to the main verb of the sentence, “live”, will be chosen as the final answer (the system does not allow multiple answers).
This procedure is typical of TREC-style systems, which attempt to find an answer by categorising the question, looking for an entity of the same category as the question and, if there is more than one entity of the same type, using some heuristics to select a final answer.
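The selection step just described could be outlined as follows; the semantic similarity function, the entity token positions and the main verb position are assumed to be supplied by earlier stages, so this is a sketch of the general procedure, not YorkQA's actual code.

# Illustrative sketch (Python) of the typical TREC-style selection step:
# keep the sentences containing an entity of the expected answer type, rank
# them by semantic similarity to the question and return the entity closest
# to the main verb. similarity() and the token positions are assumed to be
# provided by earlier processing stages.

def select_answer(question, sentences, expected_type, similarity):
    """sentences: list of dicts with keys 'text', 'entities' (a list of
    (entity_text, entity_type, token_position) tuples) and
    'main_verb_position'."""
    candidates = [s for s in sentences
                  if any(etype == expected_type for _, etype, _ in s["entities"])]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: similarity(question, s["text"]))
    typed = [(text, pos) for text, etype, pos in best["entities"]
             if etype == expected_type]
    # the entity closest to the main verb is chosen as the final answer
    return min(typed, key=lambda t: abs(t[1] - best["main_verb_position"]))[0]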
3.3 Logical inference in TREC-style QA systems
A small number of systems have attempted to move beyond heuristics and to “prove” that an Entity is an answer to a given question by transforming both question and answer into
logical form and then trying to show that the answer implies the question. The most
successful attempt at using logical inference in TREC systems has been described in
Moldovan et al. (2003 and 2003a), who used resolution refutation to “prove” the correctness
of an answer. They proceeded by transforming the question and the candidate answer
sentence into logical form through the use of a custom-built parser, and making use of
background knowledge provided by an extension of WordNet. A theorem prover was then used to attempt to prove the answer: in their first experiments (Moldovan et al. 2003), Otter (see McCune 1994) was used, while in subsequent experiments a custom-built prover, COGEX, was used. The prover was not, however, used to derive all answers, and the
overall system uses a combination of logical connection (through answer type and named
entity recognition) and logical inference in order to find an answer: in the most successful
case, 63.4% of the answers were given by “standard” methods of answer retrieval, while
19.6% were given by the logic prover.
We shall now build on this research, in particular following Moldovan et al. (2003 and
2003a), providing a method for proving that an answer is an answer to a question, given some
background knowledge, and then working with this method to move beyond logical inference
to implement logical relevance.
3.4 Inference of answers
To understand the complexity of the process of using refutation to prove an answer, we shall
consider a short example, showing that even a simple question and answer sentence requires
significant effort in order to logically “prove” an answer. A simple example is given by
question 1465 in the TREC QA collection:
What company makes Bentley cars?
The question itself is very short, and hence easily transformed into a logical form, and does
not contain any significant ambiguities (company has more than one meaning, but this does
not pose any significant problems).
A sentence containing an answer can be found in the TREC collection (NYT-1998-08-06):
Volkswagen has made no secret of its plan to develop a more modest Bentley.
Again, this is not a particularly long sentence, and it does not contain complicated constructions or significant ambiguities.
In order to prove the answer sentence, we would first have to transform it into some logical
form and then make use of appropriate background knowledge and a theorem prover. We
therefore need:
1) a parsing algorithm which could build a parse tree for the sentence,
2) a process to resolve coreference (e.g. the pronoun “its”, which explicitly becomes “of
Volkswagen”, which is taken to refer to the same object as the first “Volkswagen” in
the sentence)
3) a named entity recogniser (which would correctly recognise “Volkswagen” as being
the name of a company and not of a car)
4) appropriate background knowledge
5) a theorem prover
We shall now show how these points are addressed in our implementation.
1) Transforming a natural language sentence into a logic form is not a straightforward task: Rus and
Moldovan (2002) and Moldovan et al. (2003a) have sought to provide methods for high precision logic
form transformation applicable to documents such as the ones used for the TREC experiments. In our
work we did not aim to tackle these problems, which were not the central focus of our research. We
therefore took “as given” a method for providing logical forms of sentences, through the use of simple
rules which were manually tweaked as required. What is given below is an outline of a possible method
for carrying out this task, a method which has no claim to being the only or the best solution: in other
words, we describe what we did in order to be able to concentrate on the problem at hand, i.e. logical
relevance.
As an example of a parsing algorithm we made use of the link parser given by Lafferty (2000)
(for an introduction to link grammars and an in-depth explanation of the meaning of the parse
tree and the symbols used in it, see Lafferty, Sleator and Temperley 1992, Sleator and
Temperley 1993 and Temperley 1999):
(for reasons of space only a part of the parse tree produced is shown)
[Link parser output (partial): linkage diagram over “LEFT-WALL Volkswagen has.v made.v no.d secret.n of its plan.n”, with links X, W, S, PP, O, J, D and M]
The link grammar was used given the simplicity with which a sentence could be transformed into logical form from the parsed output. Note, however, that in this example, as in all the following examples, there are a number of ways of parsing the sentence, using a variety of parsing methods and grammars: what is presented is not an “absolutely” correct parse (it is doubtful that such a thing exists), but an illustration of what could be the result of a parse. How the sentence was parsed is immaterial to the overall argument.
2) In order to correctly infer the question from the answer sentence we would have to recognise that “its” is to be taken to mean “Volkswagen’s”. This could be (relatively easily) provided by a coreference resolution system and a named entity recognition system. In our implementation coreference resolution was carried out manually in order to minimise
errors (we were concerned with testing our ideas for logical relevance, not with testing or
improving a coreference resolution algorithm).
3) We also need to know that Volkswagen is a company; named entity recognition was carried out through the YorkQA named entity recogniser. Given the information that
Volkswagen is a company and resolving “its” into “of Volkswagen”, the parsed sentence
could then be put into a simple logical form, resulting in:
exists x x1 x2 x3 (Volkswagen( x ) & company( x ) & has_made(
x, x1 ) & no( x1) & secret( x1 ) & of( x1, x2 ) & plan( x2 ) &
of( x2, x ) & Volkswagen( x) & develop( x, x3 ) & Bentley( x3 )
& more_modest( x3 ) ).
(note that for simplicity the articles “the” and “a” have been ignored, the verb “has made” has
been contracted to a single word, as has the compound adjective “more modest”).
4) In addition, we would have to use the background knowledge that
•
a Bentley is a car.
•
to develop something is to make it
this knowledge would have to be provided by an extended dictionary such as WordNet or by
some sort of encyclopaedic compendium of knowledge. In our implementation, background
knowledge was provided by a combination of WordNet and manual input, again in order to
minimise errors. We could consequently assume the prior knowledge:
all x (Bentley( x ) -> car( x )). % a Bentley is a car
all x x1 (develop( x, x1 ) -> make( x, x1 )). % to develop something
is to make it
The same process could be applied to the question, resulting in the logical form:
exists x x1 x2 ( company( x ) & make( x, x1) & car( x1 ) &
make( x, x2) & Bentley( x2) ).
5) We saw above how resolution refutation has been used to prove answers: a recent
application has been given by Moldovan et al. (2003 and 2003a), who have shown that
refutation can be used to prove answers in “realistic” data such as the TREC documents. In
order to implement logical relevance, we build on this work, attempting to prove that an
answer a is an answer to a question q, given background knowledge K, if K ∪ { a } ∪ {¬ q}
is inconsistent. A number of theorem provers are available which could be used for resolution
refutation in order to prove an answer: examples are SNePS (Shapiro and Rapaport 1987 and
1992) and ANALOG (Ali and Shapiro 1993). Having examined the TREC data collection, we
concluded that the full capabilities of a system such as SNePS, capable of representing higher
order logic, were not required: we therefore followed Moldovan et al. (2003) in using the
Otter theorem prover (McCune 1994). We shall not concern ourselves here with the details of
Otter and its underlying theoretical grounding (see McCune 1994 for details) as they are
irrelevant from the point of view of our discussion: we are not concerned with any particular
approach but instead are interested in the generic idea of proving an answer. Using Otter in
order to prove that the question “What company makes Bentley cars?” can be derived from
the answer “Volkswagen has made no secret of its plan to develop a more modest Bentley”,
we provide Otter with the logical form of the answer sentence above (using the link parser
detailed above), the background knowledge identified above, again in logical form, and
proceed per absurdum by negating the question:
-( exists x x1 x2 ( company( x ) & make( x, x1) & car( x1 ) &
make( x, x2) & Bentley( x2))).
The proof given by Otter is as follows:
given clause #1: (wt=2) 5 [] Volkswagen($c5).
given clause #2: (wt=2) 6 [] company($c5).
given clause #3: (wt=2) 8 [] no($c4).
given clause #4: (wt=2) 9 [] secret($c4).
given clause #5: (wt=2) 11 [] plan($c3).
given clause #6: (wt=3) 7 [] has_made($c5,$c4).
given clause #7: (wt=2) 14 [] Bentley($c1).
given clause #8: (wt=2) 15 [] more_modest($c1).
given clause #9: (wt=2) 16 [hyper,14,1] car($c1).
given clause #10: (wt=3) 10 [] of($c4,$c3).
given clause #11: (wt=3) 12 [] of($c3,$c5).
given clause #12: (wt=3) 13 [] develop($c5,$c1).
given clause #13: (wt=3) 17 [hyper,13,3] make($c5,$c1).
---------------- PROOF ----------------
1 [] |(-(Bentley(x)),car(x)).
3 [] |(-(develop(x,y)),make(x,y)).
4 [] |(-(company(x)),|(-(make(x,y)),|(-(car(y)),|(-(make(x,z)),-(Bentley(z)))))).
6 [] company($c5).
13 [] develop($c5,$c1).
14 [] Bentley($c1).
16 [hyper,14,1] car($c1).
17 [hyper,13,3] make($c5,$c1).
18 [hyper,17,4,6,17,16,14] $F.
Since the negation of the question sentence contradicts the answer sentence, the answer
sentence implies the question sentence. Notice however that we have only proven that the
answer sentence implies the question, and we have not provided a specific answer (e.g.
Volkswagen); furthermore we have said nothing about the quality of the answer sentence (e.g.
how much irrelevant material is contained in the answer sentence).
Notice that only the currently relevant background knowledge was given to Otter in order to
search for a proof. This is necessary in order to limit the search space and facilitate the proof
of the theorem. A possible approach to this could be to first seek possible semantic paths
between question concepts and answer concepts and then use these paths as the formulas to be
used in guiding the proof. A similar approach was taken by Moldovan et al. (2003), who first found
the relevant lexical chains between answer concepts and question concepts and then used the
axioms derived from these lexical chains to guide the proof.
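For illustration, this kind of filtering could be approximated by keeping only the background axioms whose predicates overlap with those appearing in the question and answer logical forms; the naive regular-expression extraction below is an assumption made for exposition, not the lexical chain method of Moldovan et al.

# Illustrative sketch (Python): select background axioms whose predicate
# symbols overlap with those of the question and answer logical forms.
# Predicate extraction is a naive regular expression; axioms that introduce
# new predicates extend the set of wanted predicates, allowing simple chains.

import re

def predicates(logical_form):
    """Predicate symbols in a formula such as 'all x (Bentley( x ) -> car( x ))'."""
    return set(re.findall(r"([A-Za-z_]\w*)\(", logical_form))

def relevant_axioms(question_lf, answer_lf, background):
    wanted = predicates(question_lf) | predicates(answer_lf)
    selected = []
    for axiom in background:
        if predicates(axiom) & wanted:
            selected.append(axiom)
            wanted |= predicates(axiom)
    return selected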
4 Implementation issues
4.1 Analysis of the complexity of TREC questions and related documents
While there is a relatively straightforward procedure to infer whether a sentence is an answer
to a particular question, using simple proof by negation, an analysis of actual questions and
document sentences from the TREC collection indicates that a number of issues need to be
addressed before this procedure can be applied successfully. We shall now show the results of
an analysis of a sample of questions and answer sentences from the TREC-9, TREC-10 and
TREC-11 question and document collections, highlighting a number of problems which arise
when trying to apply logical proof using these documents.
Equivalent grammatical constructs: one problem often found with “real” questions is the
use of a number of different grammatical constructs for similar concepts. As an example,
consider the following question (1008 in the TREC collection):
What is Hawaii's state flower?
and the answer found in the Los Angeles Times document LA081989-0091
Yellow hibiscus is the state flower of Hawaii
Although there is a clear-cut connection between question and answer, either the parser would
first need to transform the word in the genitive case “Hawaii's” to the explicit construct “of
Hawaii” or the logic prover would need to know the equivalence between the two
grammatical constructs. Another example is given by question 1436:
What was Andrew Jackson's wife's name?
With the answer sentence given in the AP newswire document APW19990810:
The duel began as an argument over a horse race and escalated when Dickinson
insulted Jackson's wife, Rachel.
In order to find a logical proof of the answer there would need to be a rule to recognise appositions (comma name) as being equivalent to a construct with the preceding noun and the word “to be”, i.e. in the sentence above “Jackson's wife, Rachel” should be taken to be equivalent to “Jackson's wife, who is Rachel”, which, once the pronoun has been resolved, becomes “Jackson's wife, Jackson's wife is Rachel”.
Background knowledge: one of the most serious problems is the extensive background
knowledge required to successfully understand the information given in the documents and
find a logical connection between the documents and the questions. As a simple example, take
question 1408:
Which political party is Lionel Jospin a member of?
With the answer found in document XIE19970604.0015:
France's new Socialist Prime Minister Lionel Jospin has indicated that he will set up
a “clean” government whose members are not involved or suspected of being
involved in any corruption case.
which requires the knowledge that a socialist is a member of the socialist party and that the
socialist party is a political party. More complicated is the knowledge needed to understand the
link between the question and another sentence in the same document:
No government officials will be allowed to hold a second post or to be on a second
payroll, and Jospin himself will resign from the post as Socialist Party chief in
November, said the officials.
Which requires the knowledge that if someone resigns from a post in a party they must have
been a member of that party.
Another example is question 1555:
When was the Tet offensive in Vietnam?
With the answer sentence found in document APW19981029:
Mizo, a Boston native who now lives in Stuttgart, Germany, was a sergeant in an
artillery unit when he was injured in a rocket attack in January 1968 in the buildup to
the Tet offensive.”
Where the theorem prover needs to know that if the buildup to an event A occurred at a certain date D, the event A itself occurred not long after the date D.
Knowledge about phrasing: a particular form of background knowledge is knowledge about
phrasing, as can be seen for the same question above (1555):
When was the Tet offensive in Vietnam?
With the answer in the New York Times document NYT19980903:
During the 1968 Tet Offensive, Hue fell to the Communists and then was retaken in a
battle that some historians regard as the war's worst.
Where it is necessary to know that, if an event occurred in a certain time, this is often
indicated by the phrasing “the YYYY Event”, where YYYY represents a year and Event
represents an event (e.g. The 1848 Revolution, The 1914-18 War). Another example is given
by question 1605:
How far would you run if you participate in a marathon?
with the answer in document XIE19961004:
Two races will be held for the 1997 event -- the full-length marathon of 42.195
kilometers and a half-marathon, 21.098.
Where the theorem prover would need to know a) the equivalence of “How far would you
run if you participate in a X?” and “How long is an X?” and b) the fact that the phrasing “the
marathon of Y kilometres” indicates that the marathon measures Y kilometres. A further
example can be seen in question 1012:
What was the last year that the Chicago Cubs won the World Series?
With the answer sentence in the San Jose Mercury document SJMN91-0689112:
It has been 83 years, since 1908, since the Cubs last won a World Series.
Where it is necessary to know that the phrasing “it has been ... since YYYY, since Event” is
equivalent to saying “YYYY was the last year that Event”, where YYYY is a year and Event
is some event.
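Such phrasing knowledge could, for instance, be approximated by surface patterns of the following kind; the patterns shown are illustrative approximations, not the rules used in our experiments.

# Illustrative sketch (Python) of surface patterns for two of the phrasings
# discussed above; rough approximations only.

import re

def year_of_event(sentence, event):
    """Matches phrasings such as 'the 1968 Tet Offensive'."""
    m = re.search(r"\bthe\s+(\d{4})\s+" + re.escape(event), sentence, re.I)
    return m.group(1) if m else None

def last_year_since(sentence):
    """Matches phrasings such as 'It has been 83 years, since 1908, since ...'."""
    m = re.search(r"it has been .*?since\s+(\d{4})", sentence, re.I)
    return m.group(1) if m else None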
Combining knowledge from more than one sentence: often it is not possible to infer that a
sentence contains an answer without reference to other related sentences. For example, given
question 1482:
What county is Wilmington, Delaware, in?
And the two sentences found in the New York Times document NYT19991228
The couple, Richard and Dawn Kelso of Exton, Pa., were arrested Monday and
charged with abandonment of a child after they wheeled their 10-year-old son,
Steven, into the emergency room of a Wilmington hospital on Sunday morning,
telling a receptionist they wanted him admitted, the police said.
and
The receptionist went to look for a nurse and when she returned, the Kelsos were
gone, said Lt. Vincent Kowal, a spokesman for the New Castle County Police
Department in Delaware.”
in order to prove that New Castle County is the answer, we need to use information provided
by the first sentence (the events occurred in a hospital in Wilmington) combined with the
information in the second (the police involved were from New Castle County) to infer the
answer (Wilmington must be in New Castle County).
Another example is question 1399:
What mythical Scottish town appears for one day every 100 years?
With the answer provided by the New York Times NYT19990404:
In 1947 Brooks won the role of Tommy, a young American traveller who happens
upon a mysterious Scottish village that awakens only every 100 years, in a new
musical by Alan Jay Lerner and Frederick Loewe.
and
The musical, “Brigadoon” went on to play 581 performances.
Where, in order to prove the answer, we must be able to link the name of the musical (which can be assumed to also be the name of the village) given in the second sentence with the summary of the musical given in the first sentence (which links with the facts outlined in the question).
Coreference resolution: a major task which needs to be carried out successfully in order to
be able to use a theorem prover to prove that a sentence contains an answer to a question is
coreference resolution. In some cases it is a matter of understanding that a name can be simplified, as in question 1421:
When did Mike Tyson bite Holyfield's ear?
Which has an answer in document XIE19991019.0306:
Former heavyweight world champion Tyson was suspended from boxing for 15
months for biting off a piece of Holyfield's ear in a June 1997 title bout
Where it is necessary to understand that “Mike Tyson” and “Tyson” refer to the same person.
A similar problem is understanding the use of nicknames, as in question 1409:
What vintage rock and roll singer was known as “The Killer”?
Where the answer, found in NYT19990302.0130, reads:
One of his schoolmates was the sister of singer Jerry Lee Lewis, and he regularly
visited the Killer's Memphis home during his boyhood.
In this case it is necessary to recognise that “the Killer” refers to “Jerry Lee Lewis” mentioned
earlier.
In other cases it is the more “usual” problem of recognising referents for pronouns. For
example, given question 1411:
What Spanish explorer discovered the Mississippi River?
And the document NYT19981124 containing the sentence:
The 16th-century Spanish explorer Hernando De Soto, who discovered the
Mississippi River, wrote about these nuts.
It is necessary to link “who” to “Spanish explorer Hernando De Soto”.
More complicated is the case where pronouns refer back to previous sentences, as opposed to
nouns previously used in the same sentence. For example, the answer to question 1397,
What was the largest crowd to ever come see Michael Jordan?
can be found in document NYT19990221:
Jordan also drew crowds more than any other player in history, selling out all home
games and just about all road games.
When he made what was expected to be his last trip to play in Atlanta last March, an
NBA record 62,046 fans turned out to see him and the Bulls.
But in order to prove the answer we must know that the “he” in the second sentence refers to
“Jordan” in the first sentence (and in turn we must recognise that “Jordan” in this context is
the same person as “Michael Jordan”).
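The simplest of these name-matching cases could, for illustration, be handled by treating a name as coreferent with a longer one when its tokens are a subset of the longer name's tokens; this is a crude assumption made for exposition, and it does not cover nicknames such as “the Killer” or pronouns.

# Illustrative sketch (Python): 'Tyson' is taken to co-refer with
# 'Mike Tyson' because its tokens are a subset of the longer name's tokens.
# Nicknames and pronouns require knowledge sources beyond this heuristic.

def same_person(short_name, full_name):
    short = set(short_name.lower().split())
    full = set(full_name.lower().split())
    return bool(short) and short <= full

# same_person("Tyson", "Mike Tyson")            -> True
# same_person("Jordan", "Michael Jordan")       -> True
# same_person("the Killer", "Jerry Lee Lewis")  -> False (needs a nickname list)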
Parsing sub-sentences: a complication arises with the use of subphrases within sentences;
these must be correctly parsed, with pronoun references resolved as appropriate in order to
prove that they contain an answer. An example is question 1533:
Who directed the film “Fail Safe”?
With the answer from document NYT20000406:
Directed by Sidney Lumet (who also directed the “Fail-Safe” film), the climax came
off convincingly in rehearsal.
Where the answer to the question is within the parenthetical remark.
4.2 Analysis of the complexity of inference in clarification dialogues
We saw how goal-directed relevance may be implemented through the use of clarification
dialogue. A system which used both clarification dialogue and logical inference would need
to tailor inference rules etc. to cope with the peculiarities of clarification, for example the extensive use of words referring to terms in previous questions. We shall now examine some of
the issues which would have to be dealt with in order for logical inference to be able to be
used in a system which allows the user to have a clarification dialogue. The examples are all
taken from the clarification dialogue data which was collected for the experiments carried out
for goal-directed relevance.
Coreference: Clarification questions often explicitly reference previous questions, as in the
following example:
Who painted “sunflowers”?
How much is it worth?
Where “How much is it worth?” should be taken to mean “How much is “sunflowers”
worth?”, i.e. “it” = “sunflowers”.
Other times they reference previous answers (not necessarily the immediately preceding one), as in:
Who painted “sunflowers”?
How much is it worth?
When did he live?
Where “When did he live?” should be taken to mean “When did Van Gogh [answer to the
first question] live?”, i.e. “he” = “Van Gogh”.
Other more complex forms of coreference are also found, for example:
Which country colonized Hong Kong?
When did this happen?
Where “When did this happen?” should be taken to mean “When did Britain colonize Hong
Kong?”, i.e. “This happen” = “Britain colonize Hong Kong”.
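For illustration only, the simplest of these substitutions could be sketched as follows; the pronoun lists and the choice between the quoted entity and the previous answer are simplifying assumptions, and real clarification questions require far more than this single heuristic.

# Illustrative sketch (Python): expand a clarification question by replacing
# a pronoun with the quoted entity of the previous question or with the
# previous answer. A simplifying assumption made for exposition only.

import re

def expand_clarification(question, previous_question, previous_answer):
    quoted = re.search(r'"([^"]+)"', previous_question)
    if quoted and re.search(r"\b(it|this)\b", question, re.I):
        return re.sub(r"\b(it|this)\b", quoted.group(1), question, flags=re.I)
    if previous_answer and re.search(r"\b(he|she)\b", question, re.I):
        return re.sub(r"\b(he|she)\b", previous_answer, question, flags=re.I)
    return question

# expand_clarification('How much is it worth?',
#                      'Who painted "sunflowers"?', None)
#   -> 'How much is sunflowers worth?'
# expand_clarification('When did he live?',
#                      'How much is it worth?', 'Van Gogh')
#   -> 'When did Van Gogh live?'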
Need to make implicits explicit (not simply coreference): For example, in the exchange:
How did Adolf Hitler die?
What was the instrument of death?
The question “What was the instrument of death?” needs to be made more explicit, i.e. needs
to be rephrased as “What was the instrument of Adolf Hitler's death?”.
A different kind of example is:
Did the Boxer uprising cause many deaths?
How many amongst the English?
Where “How many amongst the English?” should be read as “How many deaths were
amongst the English in the Boxer uprising?”, where there is the addition of deaths, the
addition of the verb “were” and the addition of the reference to the Boxer uprising.
Need to recognise meta-questions: For example in the series
Did Gandhi's death change the direction of politics?
What does attenuate mean?
The meaning of “attenuate” a) needs to be sought in the context of the answer document and b) will probably not be found in the documents themselves, so reference will probably have to be made to some repository of prior knowledge (for example a dictionary).
Need to recognise missing words: a number of questions have missing verbs, for example
How many deaths were there?
How many amongst the English?
Where the second question should be taken as “How many deaths were amongst the English?” with the addition, in this case, of both a noun (deaths) and the main verb.
In other cases only the question word is present and the answerer must reconstruct an almost entirely missing sentence, as in the example:
Does he have a house anywhere else?
Where?
Where “Where?” should be read as “Where else does he have a house?”.
Need to recognise specification, i.e. the exclusion of previous answers: for example, given
Where is Philadelphia?
Where exactly?
The question “Where exactly?” must first be expanded to read “Where exactly is
Philadelphia?” and then must be taken to implicitly contain the clause “the answer must not
be the previous answer”.
4.3 Issues in answer proving
When trying to prove that an answer is an answer to a question, the scope for errors can be
severe, given that problems can arise:
•
with the parser, for example constructing a parse tree incorrectly, wrongly tagging parts-of-speech both in the question and the answer sentence, incorrectly parsing sub-sentences,
or incorrectly assigning question types
•
with the named entity recogniser: incorrectly tagging entities in the answer sentence
would frustrate any attempt to prove the answer
•
with the co-reference mechanism, within a sentence, between sentences and between
questions and answers
•
with missing or incorrect information in the background knowledge
•
with the inference engine itself: if the engine cannot find a proof, this may well be a
limitation of the engine rather than the non-existence of the proof.6

6 The completeness of any proof is limited by the resources and the algorithms employed by the theorem prover: a question answering system, for example, could not afford a guarantee of a complete proof because of time constraints (a questioner would not be prepared to wait an unacceptable length of time for a proof). Another problem is that the soundness of any proof of an answer is difficult to prove, given that this would entail that a) the procedure to convert natural language (or parsed) statements into logical form is provably correct (sound) and b) the theorem prover is provably sound: theorem provers are usually an aid to theorem resolution, requiring the user to verify the proven theorems a posteriori (e.g. Otter does not guarantee soundness).
If a proof cannot be found for a given answer sentence either it is not an answer, or something
has gone wrong with the inference mechanism (e.g. lack of knowledge): the proof is therefore
supporting evidence which points to an answer, not the final answer. A perfect answer
proving mechanism would have to address all the issues above, a task which is far from
simple. In the experiments we carried out below we have assumed these issues have been
resolved by manually checking the input to the theorem prover, making corrections as
necessary and providing the necessary background information. While in a working system
such manual input would not be possible, such simplification of the problem is necessary in
order to be able to concentrate on the focus of our research, which is the notion of relevance,
in order to show how logical relevance may be implemented.
5 From Logical Proof to Logical Relevance
5.1 Building on TREC-style logical connection
We shall now show a possible implementation of logical relevance, building on the
techniques that have been used for TREC-style question answering, but moving from a logical
connection between question and answer to the concept of logical relevance. In particular, we
shall show that there is a continuum between systems such as the ones described above in
paragraph 4, which use logical inference (or abduction) to decide what constitutes an answer
(e.g. Moldovan et al. 2003; Moldovan et al. 2003a) and systems which use some form of
answer categorisation combined with named-entity recognition. The continuum will be shown to be given by a progressive relaxation of the constraints on which any logical “proof” of the answerhood of an answer depends.
5.2 Implementing relevance through relaxation rules
In order to implement logical relevance, we introduce the idea of a measured simplification
(where the measuredness of the simplification limits the ease of inference from unrelated
sentences) of the answer sentence (and, perhaps of the question sentence), through a
relaxation of the inference rules, which may give an insight as to how much the answer
sentence resolves the question. The measuredness would ensure that inferential and semantic
relevance do not equate.
For example, given the question:
Q: Who killed the mocking bird on Sunday?
and the possible answers:
A1: Mary killed the mocking bird.
A2: John killed the mocking bird on Friday.
A3: Sam is an assassin who is interested in ornithology.
A4: Peter killed something last week.
Which we could transform into logical form as follows:
Q: Exists x y e ( person( x) & killed(e, x, y) & bird( y ) & mocking( y
) & sunday( e ) & time( e )).
A1: Exists x x1 e (person( x) & mary(x) & killed( e, x, x1) & bird( x1
) & mocking( x1 )).
A2: Exists x x1 e
(person( x) & john(x) & killed( e, x, x1) & bird( x1
) & mocking( x1 ) & friday( e ) & time( e )).
A3: Exists x y ( person( x) & sam(x) & assassin(x ) & interest( x, y) &
ornithology(y) ).
A4: Exists x x1 e ( person( x) & peter(x) & killed( e, x, x1 ) & last_week( e ) & time( e ) ).
None of A1..A4 can imply Q, as a possible answer would have to be (minimally, as this is the minimum set of constraints which would be compatible with the question's request) similar to the following:
A5: Emmanuelle killed the mocking bird on Sunday.
with the logical form
exists x x1 e (person( x) & emmanuelle(x) & killed( e, x, x1) & bird( x1
) & mocking( x1 )& sunday( e ) & time( e )).
In order for A1..A4 to be considered answers from an inferential point of view we need to
“relax” the answer conditions, for example through a rule such as the following:
Relaxation rule 1: given a question, we can relax conditions
on the question by removing constraints of time and place
From which we can infer the new question sentence:
Who killed the mocking bird?
Hence the (negated) logical form:
-( exists x y e (person(x) & killed(e, x, y) & bird(y) & mocking(y)))
From which we can infer that A1 is a possible answer, as is A2. But ideally we would want A2 to be “less relevant” than A1, as it states that the event happened on Friday, while the question asked for Sunday. Perhaps relaxation rule 1 should read:
Relaxation rule 1 (revised): given a question, we can relax
conditions on the question by removing constraints of
time and place, unless the answer sentence contains a
constraint of time or space
In order for A2 to be considered relevant we need a counterpart to rule 1 that applies to answer
sentences:
Relaxation rule 2: given an answer sentence, we can relax
conditions on the answer sentence by removing constraints
of time and place
Now, if we apply rule 2 to A2 we will be left with a sentence without constraints of time or
space, and we can therefore apply rule 1, which allows us to prove that A2 is in fact an answer
to the question. Note that with these rules A3 and A4 are still not considered relevant from an
inferential point of view. In order for us to be able to consider A4, we would need a rule such
as the following:
Relaxation rule 3: given a question, we can relax conditions
on the question by removing the object of the verb
Finally, to be able to consider A3, we would need to use rules 2 and 3 in combination with
some background knowledge such as the fact that if someone is an assassin they have
probably killed someone. But the use of background knowledge adds effort to the theorem
proving process and will probably be less immediately evident as an answer to a questioner;
hence it is reasonable to add a rule to this effect:
Relevance rule 1: the more effort is required to prove a
sentence to be an answer, the less relevant the sentence
is
Relevance rule 1 subsumes the following:
Corollary 1: an answer sentence is more relevant the fewer relaxation rules are applied to it
Corollary 2: an answer sentence is more relevant the less
background knowledge is required to prove it
Given the relevance rule, sentences A1…A5 can be ranked according to logical relevance, given that A5 does not require any relaxation rules to be applied in order to be proven to be an answer; A1 requires rule 1 to be applied; A2 requires (in order) rule 2 and rule 1; A4 requires rule 3 in addition to rules 2 and 1; and A3 requires rules 2 and 3 together with background knowledge. Given that we have not specified that the use of background knowledge requires more effort than the application of a rule and vice versa, A3 and A4 appear to be equally relevant. We would therefore have the ranking A5, A1, A2, then A4 and A3 (considered equally relevant):
A5: Emmanuelle killed the mocking bird on Sunday.
A1: Mary killed the mocking bird.
A2: John killed the mocking bird on Friday.
A4: Peter killed something last week. / A3: Sam is an assassin who is interested in ornithology.
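The ranking itself amounts to sorting the candidate answers by the effort their proofs required; the following sketch assumes the rules and background knowledge used by each proof are supplied by hand, as in the worked example above.

# Illustrative sketch (Python) of the ranking step: each candidate answer is
# paired with the relaxation rules and background axioms its proof required,
# and answers are ordered by that total effort (lower effort = more relevant,
# ties ranked equally). The proof records are supplied by hand here.

def rank_by_effort(proof_records):
    """proof_records: list of (answer, rules_applied, background_used)."""
    scored = [(len(rules) + len(background), answer)
              for answer, rules, background in proof_records]
    return sorted(scored)

example = [
    ("A5: Emmanuelle killed the mocking bird on Sunday.", [], []),
    ("A1: Mary killed the mocking bird.", ["rule 1"], []),
    ("A2: John killed the mocking bird on Friday.", ["rule 2", "rule 1"], []),
    ("A4: Peter killed something last week.", ["rule 3", "rule 2", "rule 1"], []),
    ("A3: Sam is an assassin who is interested in ornithology.",
     ["rule 3", "rule 2"], ["an assassin has probably killed someone"]),
]
# rank_by_effort(example) yields efforts 0, 1, 2, 3, 3:
# A5, A1, A2, then A3 and A4 tied on equal effort.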
As noted above (paragraph 2) a questioner could be interested in a ranking of logically
relevant answers because, for example, the questioner might not fully trust the document from
which A5 is derived; or might not be aware that another mocking bird was killed last week
and would find this additional information useful; or perhaps was unsure if the mocking bird
was killed on Friday or Sunday and would be interested to know that one was killed on both
days. While giving an answer, “Emmanuelle” (A5), which is certainly a logical answer to the
given question, by using logical relevance an answerer does not limit itself to the “obvious”,
but provides a whole range of potentially useful information.
5.3 Relaxation rules for the TREC test data
5.3.1 Method
We shall now show how a number of relaxation rules can be developed for the TREC
questions and how they can be used to calculate logical relevance.
As already noted, a full solution of the problems highlighted above in the analysis of the use
of theorem proving for the TREC questions and documents and our clarification dialogue
collection is beyond the scope of the present work, requiring extensive research in areas such
as coreference resolution, ontology construction and parsing. We shall therefore assume these
problems to be solved (in the experiments carried out below the provision of background
knowledge and co-reference resolution were carried out manually by a human evaluator).
Instead we shall focus our attention on the problem of constructing simplification rules for the
TREC questions.
We proceeded through a manual method which would allow us to overcome any errors due to
parsing, co-reference resolution or lack of appropriate background knowledge by
systematically analysing the TREC questions, comparing them to the answer sentences found
in the TREC documents and inducing a number of rules which could be used to eliminate
constraints and hence prove the answerhood of the answer sentences.
A subset of 400 of the TREC-9, TREC-10 and TREC-11 questions was analysed to determine how, if at all, they could be rephrased in a simpler manner for inference, i.e. what
rules could be used to eliminate some of the constraints in the question. In particular,
questions were examined to see what constraints could be taken away from the questions
without significant loss to the meaning of the question itself, i.e. without the focus of the
question becoming too ambiguous. In order to do this, the answers to the questions were
sought within the TREC documents and within generic documents retrieved from the Internet
through the Google search engine (we did not limit ourselves to the TREC documents as they
did not always provide a sufficient variety of answers). Sentences which contained what an
intelligent reader would consider an answer but did not meet all the constraints specified in
the questions were kept; we then tried to determine what simplification rules would be
necessary in order for the answer sentences to be proven to be answers to the question through
the use of the Otter theorem prover.
A number of relaxation rules were derived in this manner. In this first attempt at formulating
such rules, they were expressed in an informal manner, the objective being to understand their
general structure rather than to provide well defined algorithms: the implementation could be
as much through a pattern-matching procedure as through some highly formal logical
procedure. In order to evaluate the rules they were implemented and their usefulness was
verified using a second subset of 400 of the TREC questions (different from the questions
used for the analysis). As above, answers to the questions were sought within documents from
the TREC collection and the Internet, and answers which could not be proved by enforcing all the constraints specified in the question were kept; usefulness was then measured as the ability to use the rules to discard constraints given by the question in order to prove answers through the use of the Otter theorem prover.
5.3.2 Rules derived
We shall now give an overview of the rules which were derived, showing how they were used
to infer answers which could not otherwise have been proven. We will then present the results
of the evaluation.
Relaxation rule:
Relax constraints in the question by removing adjectives
and substantiated adjectives
Take for example the question:
What is the name of the volcano that destroyed the ancient city of Pompeii?
and the sentence containing the answer (From XIE-1996-10-04):
“...set off for the Italian resort city lying beside the Vesuve volcano which destroyed
the city of Pompeii”
Given that the answer sentence does not contain the constraint that Pompeii is an “ancient”
city, in order to prove the answer, we must introduce a rule which allows a simplification such
as: “What is the name of the volcano that destroyed the ancient city of Pompeii?” → “What is
the name of the volcano that destroyed the city of Pompeii?”, i.e. a rule which allows us to
simplify the question by removing the constraint given by the adjective. Examples of
questions to which this process may be applied are:
•
“What is Australia's state flower?”, which becomes “What is Australia's flower?”
•
“What was the last year that the Chicago Cubs won the World Series?” which becomes
“What was the last year that the Cubs won the World Series?”
•
“The sun is mostly made up of what two gasses?”, which becomes “The sun is mostly
made up of what gasses?”
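For illustration, the adjective-removal rule could be approximated with the output of a part-of-speech tagger; the tagger, the Penn Treebank tag names and the restriction to plain adjectives are assumptions, and in our experiments the rules were applied manually.

# Illustrative sketch (Python) of the adjective-removal relaxation rule:
# drop tokens tagged as adjectives from the question. The (word, tag) pairs
# are assumed to come from an external part-of-speech tagger using the
# Penn Treebank tagset (JJ, JJR, JJS for adjectives).

def remove_adjectives(tagged_question):
    """tagged_question: list of (word, tag) pairs, e.g.
    [('the', 'DT'), ('ancient', 'JJ'), ('city', 'NN'), ...]."""
    kept = [word for word, tag in tagged_question
            if tag not in ("JJ", "JJR", "JJS")]
    return " ".join(kept)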
Relaxation rule:
Relax constraints in the question by removing adverbs
As in the case of adjectives above, answer sentences often do not contain adverbs which
constrain the question. We therefore need a rule to simplify questions by removing
constraining adverbs. Examples of questions to which this process may be applied are:
•
“What name is horror actor William Henry Pratt better known by?”, which becomes
“What name is horror actor William Henry Pratt known by?”
•
“What card game uses only 48 cards?”, which becomes “What card game uses 48 cards?”
Relaxation rule:
Relax constraints in the question by removing verbs in
combinations such as “used in”, “employed in”, “helps”,
“aids”
For example, given the question:
What is the currency used in China?
and the following answer sentence:
The currency for China is called Renminbi or RMB and is issued by the People's
Bank of China
(from www.1uptravel.com/international/asia/china/essentials.html), in order to prove the
answer we must remove the verb “use”, i.e. we must have a rule which allows the
transformation “What is the currency used in China?” → “What is the currency in China?”. In
other cases we must ensure that the main verb is put into the correct form, as in the following
example, where “prevent” must become “prevents”:
• “What mineral helps prevent osteoporosis?” which becomes “What mineral prevents osteoporosis?”
Relaxation rule:
Relax constraints in the question by removing time and
place constraints, i.e. expressions such as “in Time”,
“in Place”, “at Time”, “at Place”, “on Time”, “on Place”
For example, given question 1594:
Which long Lewis Carroll poem was turned into a musical on the London stage?
We would only be able to prove an answer in the form “Lewis Carroll’s poem… was made
into a musical” by removing the constraint “on the London stage”, i.e. having a rule which
allowed the transformation: “Which long Lewis Carroll poem was turned into a musical on
the London stage?” → “Which long Lewis Carroll poem was turned into a musical?”, and then
applying the rule above on adjectives to also remove “long”, leaving us with “Which Lewis
Carroll poem was turned into a musical?”. Another example is:
• “In the late 1700s British convicts were used to populate which colony?”, which becomes: “British convicts were used to populate which colony?”.
Relaxation rule:
Relax constraints in the question by removing nouns indicating works of art such as “film”, “movie”, “story”, “poem”, “novel”, “book”, “bestseller” followed by the title of the work of art, i.e. when they are in expressions such as “the book Title”
Often questions specify that titles are titles of books, films, TV series etc., while answer
sentences usually omit such information. Take for example the question:
What was the name of the dog in the Thin Man movies?
and the answer sentence (found in NYT-1998-08-12):
Asta in the “Thin Man” (1934): Asta, the wire haired terrier ... was ... fashionably art
deco...
In order to prove the answer we need a rule which allows the transformation “What was the
name of the dog in the Thin Man movies?” → “What was the name of the dog in the Thin
Man?”. Other examples are:
• “Who directed the film “Fail Safe”?”, which becomes “Who directed “Fail Safe”?”
• “When did the story of Romeo and Juliet take place?”, which becomes “When did “Romeo and Juliet” take place?”
Relaxation rule:
Relax constraints in the question by removing specifications of place names in combinations such as “District, Region”
Often questions specify the exact location of towns, cities, regions etc. by adding the country,
region, state, etc. in which they are found, e.g. by stating “Rome, Italy”, or “Stanford,
California”. Most documents, however, do not use such specifications (containing, in the
examples above, simply a reference to “Rome” or “Stanford”) and in order for the answer to
be proven rules must allow the removal of these specifications. For example,
What is the current population of Bombay, India?
There are numerous documents which answer the question by giving details of Bombay
without explicitly mentioning that it is in India: in order to be answered, the question becomes
“What is the current population of Bombay?”.
Relaxation rule:
Relax constraints in the question by removing first names, middle names, surnames or titles (both preceding and following names) where these are in conjunction with other proper names, as in the combination “Title Name Surname”
At times questions contain both surname and name of a given person, but documents contain
only a surname (and vice versa); the same problem often occurs with titles (President Bush
being referred to as Bush, for example). Examples of questions where such a rule would be
useful are:
• “What was Frank Sinatra’s nickname?”, which becomes “What was Sinatra’s nickname?”
• “What year was president Kennedy killed?”, which becomes “What year was Kennedy killed?”
A similar rule to the above is:
Relaxation rule:
Relax constraints in the question by removing appositive phrases, i.e. sub-sentences of the form “noun, noun phrase” which describe in more detail a noun, for example providing information such as job titles
Documents often take for granted that the reader knows enough about the subject not to need
“obvious” information to be spelled out: “everyone” knows that Frank Sinatra was a singer
and therefore documents often omit to mention this fact. Examples of questions to which this
rule applies are:
• “Where did Roger Williams, pianist, grow up?” which becomes “Where did Roger Williams grow up?”
• “What year was Ebbets Field, home of Brooklyn Dodgers, built?”, which becomes “What year was Ebbets Field built?”
Another similar rule is:
Relaxation rule:
Relax constraints in the question by removing modifier
nouns, i.e. nouns which clarify other nouns
Often noun phrases in questions contain clarifying information which is assumed as
“obvious” in answer documents. As an example, we saw above the question
What company makes Bentley cars?
with the answer sentence:
Volkswagen has made no secret of its plan to develop a more modest Bentley.
where the author of the answer assumed that the reader was aware that Bentley was a car,
without having to spell this information out explicitly. We therefore need a rule which allows
the transformation “What company makes Bentley cars?” → “What company makes
Bentleys?”.
Relaxation rule:
Relax constraints in the question by simplifying possessive expressions such as “X’s Y” to “Y”
As an example, take the question:
What part of the eye continues to grow throughout a person's life?
with the answer sentence (found in NYT-2000-08-07):
the lens …, a protein-filled disk... continues to grow throughout life
in order for this answer sentence to be proven to be an answer, we must apply a
transformation such as the following: “What part of the eye continues to grow throughout a
person’s life?” → “What part of the eye continues to grow throughout life?”.
A related rule is:
Relaxation rule:
Relax constraints in the question by simplifying specifications such as “Y of X” to “X” where Y is a more generic term for X, i.e. a hypernym of X
Often noun phrases containing specifications (e.g. town of X, science of Y) can be
summarised without loss of information by the noun which is being specified (X and Y in the
examples above). As an example, take the question:
What name is given to the science of map making?
There are a number of documents about map making which do not mention explicitly that
map making is a science; by applying the rule above, the question can be related to the answer
sentences by being simplified to “What name is given to map making?”.
Relaxation rule:
Relax constraints in the question by removing parenthetical remarks
Often, constraints which are either obvious or unimportant are found within brackets in a
question. For example, given the question:
How fast does an Iguana travel (mph)?
there are a number of documents which contain an answer in Kilometres per hour. While the
questioner would obviously prefer an answer given in miles, an answer in kilometres would
essentially contain the same information: it could not be considered as relevant as an answer
in miles, but it should nevertheless be considered relevant. A rule which allowed a
transformation such as the following would provide for this: “How fast does an Iguana travel
(mph)?” → “How fast does an Iguana travel?”.
Relaxation rule:
Relax constraints in the question by removing relative
clauses
Relative clauses in questions often specify additional information which is either taken for
granted or ignored in answer sentences; a rule therefore needs to be applied to remove these
clauses. For example:
Which disciple received 30 pieces of silver for betraying Jesus?
There are a number of documents which refer to the disciple being given 30 pieces of silver,
without specifying the reason; in order to infer the answer the question must therefore be
rephrased as “Which disciple received 30 pieces of silver?”.
Relaxation rule:
Relax constraints in the question by removing platitudes
Questions are often formulated as commands/requests, of the form “Tell me…” or “Please
could you let me know...”. It is necessary to transform such sentences into regular questions
in order for the theorem prover to be used successfully. An example is:
• “Tell me where the DuPont company is located”, which should become “Where is the DuPont company located?”
Relaxation rule:
Relax constraints in the question by transforming complex
idioms into simple sentences, i.e. transforming sentences
such as “X is home to Y” into “Y is in X”; “the name of X is
Y” into “X is Y”; “X is the person who did Y” into “X did
Y”
Often questions are formulated in a manner which is more complex than necessary, with
answer sentences being expressed with much simpler language. It is therefore necessary to
rephrase these questions to successfully prove the answer sentences. Examples are:
• “Which Italian city is home to the cathedral of Santa Maria del Fiore?”, which should become “In which Italian city is the cathedral of Santa Maria del Fiore?”
• “What was the name of the first Russian astronaut to do a spacewalk?”, which should be rephrased as “Who was the first Russian astronaut to do a spacewalk?”
• “What was the man’s name who was killed in a duel with Aaron Burr?”, which becomes “Who was killed in a duel with Aaron Burr?”
Given this set of rules, we would first try to infer (by negation) the question from the answer
without applying the rules. The answers that allow an inference without the application of any
relaxation rule are most relevant from a logical point of view. Where the question cannot be
inferred from an answer in this way, the question is then simplified, attempting to apply the given rules in succession: the
more rules need to be applied, the less relevant the answer is. Implicitly therefore we would
have a confidence score for each answer sentence: the more rules have been applied to prove
the answer, the less confident we are that this is actually an answer.
Here we purposefully ignore the question of giving a weighting to the rules, but this would be
an important consideration in a full implementation, as would be a consideration of the order
in which the rules are applied: a combination of the weighting, the order and the number of
rules applied would determine the ranking of the answer sentences.
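The ranking procedure just described can be summarised in a short sketch. The following is purely illustrative: the relaxation rules and the call out to the Otter theorem prover are hidden behind placeholder parameters (relaxation_rules, can_prove) whose interfaces are assumptions rather than the actual YorkQA code, and, as in the discussion above, rule weighting and ordering are ignored.

def logical_relevance(question, answer, relaxation_rules, can_prove):
    """Number of relaxation rules applied before the answer could be proven
    (0 = most logically relevant), or None if it cannot be proven at all."""
    current_question = question
    for rules_applied, rule in enumerate([None] + list(relaxation_rules)):
        if rule is not None:
            current_question = rule(current_question)   # relax one more constraint
        if can_prove(current_question, answer):          # e.g. a call out to Otter
            return rules_applied                          # fewer rules => more relevant
    return None                                           # not logically relevant

def rank_answers(question, answers, relaxation_rules, can_prove):
    """Rank answer sentences by the number of rules needed to prove them."""
    scored = [(logical_relevance(question, a, relaxation_rules, can_prove), a)
              for a in answers]
    provable = [(score, a) for score, a in scored if score is not None]
    return [a for score, a in sorted(provable, key=lambda pair: pair[0])]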
5.4 Evaluation on the TREC collection
A number of relaxation rules were derived from the above analysis and then implemented; a
subset of 400 of the TREC-11, TREC-10 and TREC-9 questions (equally distributed),
different from the questions used for the analysis, was then employed to verify the usefulness
of the relaxation rules. Usefulness was taken to be the ability to use the rules to discard
constraints given by the question and use the Otter theorem prover to prove answers which
could not be proved by enforcing all the constraints specified in the question.
In order to test this ability, answer sentences which provided an answer to the question were
sought within the TREC document collection and on documents retrieved through the Google
search engine. Answer sentences were then manually examined, and those which could be
proven without the use of the rules were discarded: these discarded answers would be
considered the “most relevant” answers from the point of view of logical relevance; we were
interested in the rules’ ability to recognise other, less logically relevant, but nevertheless
logically relevant, answer sentences, i.e. the use of these rules to determine logical relevance,
not simply logical inference, as currently done by the best performing TREC-style QA
systems. The relaxation rules were then applied to the questions in order to determine whether
the application of these rules would enable the theorem prover to prove the answerhood of the
remaining answer sentences. Coreference resolution and background knowledge was provided
by hand in order to standardise the evaluation. While, as shown above, the use of background
knowledge should contribute to the evaluation of the relevance of an answer, for simplicity
we ignored this problem, focusing our attention on the constraint relaxation rules.
The results show that in the TREC collection most questions are already minimal, and usually
cannot be reduced further without loss of crucial information, as for example definition
questions such as
Who was Galilei?
to which none of the relaxation rules above could be applied. The overall results of the
evaluation follow.
27.6% of TREC-11 questions could be reduced (and then used to prove answer sentences)
using the relaxation rules. The rest were already in minimal form and could not be reduced
further without serious loss of information. This compares with 11.2% of TREC-10 questions.
This was due to the grammatical simplicity of TREC-10 questions, which were very short,
often asking for simple definitions (TREC-11 did not envisage definition questions). On the
other hand, 19.2% of TREC-9 questions could be reduced through the use of these rules.
Again the higher number of questions to which the rules were applicable was due to the fact
that they did not contain the large number of definition questions found in TREC-10. The
examined questions did not contain any “surprises” in the form of constraints which required
additional rules in order for the answer sentences to be proven.
6 Limitations
As noted above, in order to be applicable to a “live” question answering system, the given
implementation must resolve a number of issues which have been purposefully ignored
(where these issues caused difficulty to the evaluation, human intervention on the part of the
operator provided a suitable solution), in particular the fact that:
• “real” sentences such as the ones contained in the TREC document collection are not
as simple as the ones usually used in logic examples, containing difficult matters such
as the use of complex grammatical constructions, colloquialisms, grammatical and
typographical errors
• while there is a considerable body of work on discourse analysis, systems which
implement some form of logical inference for question answering currently only work
for single sentences, not for multiple sentences. A full system should be able to gather
knowledge from multiple sentences, including the ability to correctly resolve coreferences
• work on logical inference for question answering has only considered closed concept
questions (questions typically starting with words such as “who”, “what X”, or
“where”), and has ignored the complexity of answering open-ended questions such as
“why” questions
• a full implementation would require considerable background knowledge, including
an understanding of temporal and spatial relationships
• the initial work we have carried out in identifying a number of constraints which may
be relaxed to calculate logical relevance needs to be expanded employing some form
of machine learning to automatically create relaxation rules. This would require
significant resources as detailed above in paragraph 5.3.1
A fully satisfactory implementation would necessarily have to resolve these (rather complex)
issues, but while we do not claim that the implementation presented here is the most
satisfactory implementation possible, it nevertheless gives an idea of how QA systems can
move from the current limited approach to finding a logical connection between questions and
answers to a more satisfactory method which employs the concept of relevance, moving away
from the idea of logically “correct” and “incorrect” answers to the idea of logically relevant
answers.
Having clarified what form the relaxation rules will take, it now becomes possible to consider
a method for inducing relaxation rules through some form of machine learning. A number of
issues need to be solved before working in this direction, however; in particular:
• In order to successfully employ machine learning we would need to have a robust
method of transforming both question and candidate answer sentences into logical
form, as errors in this area would propagate into the proving mechanism, hence
invalidating the conclusions. The TREC data is often “dirty”, with long sentences
containing complex phrasing, grammar and spelling mistakes.
• The same considerations apply to coreference resolution, as, for example, incorrectly
resolved pronouns would again negatively impact the proving process
• The results would be highly dependent on the background knowledge used, as
previously known inference rules could obviate the need for eliminating
constraints
7 Conclusion
Logical relevance improves the performance of TREC-style systems such as YorkQA by
ensuring a strong logical connection between question and answer. More importantly, though,
by introducing the notion of logical relevance, we overcome the limitations of an “either right
or wrong” answer approach usually given by TREC-style QA systems. At one extreme,
conventional TREC-style systems implicitly relax all logical constraints in the question, apart
from the constraint given by the wh-word, by seeking a Named Entity corresponding to the
answer type; at the other extreme they attempt to “prove” the answerhood of an answer
sentence by transforming the whole question into logical form and using some form of
theorem prover to demonstrate that the answer is an answer to that question. However,
conventional TREC-style systems fail to establish the connection between these two
extremes, opting for one or the other to provide a single answer sentence. Our notion of
logical relevance moves beyond this dualism, seeing the property of answerhood as a
continuum between the extremes of a full application of the constraints set out in the question
and a complete indifference to these same constraints. A method was presented for the
calculation of logical relevance for question answering through the use of relaxation rules that
gradually drop the logical constraints on the answer given in the question. A number of
relaxation rules were formed and we then showed how these rules could be applied to the
TREC questions, analysing the limitations of this approach.
Chapter 9
Morphic Relevance
Executive Summary
The idea of morphic relevance, concerned with the form an answer takes when it is given in response to
a question, is investigated. We show how morphic relevance has been implemented in the YorkQA
system and clarify the difference between morphic relevance and goal-directed relevance.
1 Introduction
Morphic relevance is concerned with the external form an answer takes when it is presented
to a questioner. An answer which satisfies a questioner’s goals, is semantically related to a
question and can be logically proven to be an answer may still be expressed in a number
of different ways, for example as plain text or in some XML format. The questioner (and the
answerer) may have a preference as to how the answer should be presented, a preference
which may be associated with an informational goal (and hence closely related to goal-directed relevance, as when a questioner who is not an expert in a particular field to which a
question relates would like a definition to be given in simple terms as opposed to technical
jargon), but does not necessarily have to be associated with such a goal: a visually impaired
questioner may prefer tables in an answer to be presented in a different form, such as a list; if
the answer is given though a speech synthesiser, this may speak in a variety of tones and
accents. Morphic relevance must therefore be considered separately from the other relevance
categories. Another way of expressing the meaning of morphic relevance could be to say that
it is related to the aesthetic sensibilities of the questioner and the answerer, rather than their
informational needs and related goals.
2 Implementation
The YorkQA system, in common with other TREC-style QA systems, seeks a potential
answer to a given question and then constructs an answer sentence which conforms to the
NIST formatting standards based on this. In other words, after identifying an answer it
generates an answer sentence to be given to the questioner.
The system built for TREC-10 generated answers either by taking the named entity
recognised as an answer or by taking a 50 byte portion of text surrounding what was
identified as the answer, and then formatting the answers according to the TREC guidelines
by outputting question number, a code specified by NIST, the document from which the
answer was retrieved, the rank given to the answer, a system-determined score for the answer,
the name of the system and the answer, as in the following examples, showing the output for a
question with a named entity and with a text snippet as an answer:
894 Q0 SJMN91-06154228 1 1.33333 yorkqa02 10 miles
894 Q0 SJMN91-06154228 2 1.33333 yorkqa02 41 miles
894 Q0 LA021289-0057 3 1.33333 yorkqa02 two links
894 Q0 AP880613-0058 4 1.33333 yorkqa02 Six miles
894 Q0 AP880613-0056 5 1.33333 yorkqa02 Six miles
901 Q0 AP890802-0229 1 4.33678 yorkqa02 is also the land of beautiful flowers , and those
901 Q0 FT921-15572 2 3.83506 yorkqa02 a country and no tradition of flower buying
901 Q0 FT921-15572 3 3.67011 yorkqa02 is sending Italian and Japanese immigrants back to
901 Q0 FT921-6170 4 3.16839 yorkqa02 , who remains Australia 's head of state
901 Q0 AP890227-0279 5 3.16839 yorkqa02 Australia , fruits , vegetables and flowers that
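As a minimal illustration of the line format just shown (question number, NIST code, source document, rank, score, run tag and answer), a formatter along the following lines would be sufficient; the function and its arguments are illustrative and are not the actual YorkQA output code.

def format_trec10_line(qid, doc_id, rank, score, run_tag, answer, code="Q0"):
    # question number, NIST code, source document, rank, score, system (run) tag, answer
    return f"{qid} {code} {doc_id} {rank} {score:g} {run_tag} {answer}"

print(format_trec10_line(894, "SJMN91-06154228", 1, 1.33333, "yorkqa02", "10 miles"))
# -> 894 Q0 SJMN91-06154228 1 1.33333 yorkqa02 10 miles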
The TREC-11 system worked in a similar manner, outputting the answer in the new format
specified for the evaluation:
1394 yorkqa01 XIE19960405.0213 Greece
1394 yorkqa01
1395 yorkqa01 APW19990423.0019 Tom Cruise and Nicole Kidman
1395 yorkqa01
1396 yorkqa01 XIE19961004.0048 However , both sides
1396 yorkqa01
1397 yorkqa01 NYT19990113.0529 I
1397 yorkqa01
1398 yorkqa01 APW20000202.0021 1933.
1398 yorkqa01
1399 yorkqa01 XIE20000921.0050 Lockerbie
1399 yorkqa01
The idea behind the concept of morphic relevance is that the above answers could be
presented in a number of different ways, for example:
Question ID: 1395
Who is Tom Cruise married to ?
APW-1999-04-23-0019, 64.1178/0/0/0/0, LOS ANGELES ( AP ) -- Tom Cruise and Nicole
Kidman have filed a libel suit against the Star for a story claiming they needed a sex therapist
to perform love scenes in `` Eyes Wide Shut . ''
The implementation of morphic relevance in the YorkQA system is therefore trivial, given the
simplicity of the questioner preferences expressed by the NIST evaluation guidelines. Other
implementations, however,
may require a higher degree of complexity, for example
summarising the document from which an answer was taken, or providing an answer “in
tone” with the question, for example by giving prolix, polite answers to prolix, polite
questions and giving straightforward, unadorned answers to straightforward, unadorned
questions.
3 From morphically correct answers to morphic relevance
While the YorkQA system effectively divides answer sentences into morphically correct and
incorrect, only ever outputting morphically correct, i.e. correctly formatted, answers, the idea
of relevance would indicate that it is possible to have answers which are to a greater or lesser
degree morphically relevant. Relevance therefore requires systems to move beyond the
concepts of “correct” and “incorrect” in order to implement the idea of relevance. To see how
this could be the case, consider the question
Q: What is the “devil in music”?
and the possible answer sentences:
A1: The so-called “devil in music” is the interval of augmented fourth.
A2: interval of augmented fourth
If the constraint from the point of view of morphic relevance is that answers must be less than
30 characters long, A2 is more relevant from a morphic point of view than A1. The notion of
morphic relevance allows a system to give the questioner both answers, but in a ranked order
of relevance, with A2 being the most relevant answer, thus preserving potentially useful
information (e.g. a learner of English might find the structure of A1 useful to learn grammar
rules) while still giving an answer which best meets the questioner’s preferences. Similar
examples could easily be found for “famous person” questions such as “Who was Galilei?”,
which could be answered through a terse comment such as “an astronomer”, or more
comprehensively, elaborating on Galilei’s life and times: which answer was to be considered
more relevant would not only depend on the questioner’s informational goal (to write an essay
on Galilei? To understand a reference in another text?), but also on the questioner’s
preference for minimalist answers or more elaborate cogitations.
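Purely as an illustrative sketch, a ranking of this kind could be computed from a simple length preference such as the 30-character limit used in the example above; the scoring function below is an assumption made for illustration, not the YorkQA answer generation module.

def morphic_relevance(answer, max_length=30):
    """1.0 when the answer meets the length preference, decreasing with the excess length."""
    excess = max(0, len(answer) - max_length)
    return 1.0 / (1.0 + excess)

answers = [
    'The so-called "devil in music" is the interval of augmented fourth.',   # A1
    'interval of augmented fourth',                                          # A2
]
for a in sorted(answers, key=morphic_relevance, reverse=True):
    print(round(morphic_relevance(a), 3), a)
# A2 (28 characters) ranks above A1, but A1 is still returned as a less morphically relevant answer.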
Applied to the YorkQA system, an implementation of the above generates not only TREC-compliant answers, formatted according to the TREC guidelines, but also non-compliant, and hence less relevant, answers, which give the full sentence from which the
answer is derived. While these other answers will be ignored by the NIST evaluators, they
could nevertheless provide useful information about the performance of the system to the
questioner, who, in the case of YorkQA, is the scientist carrying out the TREC experiments.
The output above therefore becomes:
1394 yorkqa01 XIE19960405.0213 Greece
1394 yorkqa01
1394 yorkqa01 XIE19960405.0213 Greece 's bid to host the 1996 Olympics , to which it
believed it had a special right because it was the country in which the games originated and
were revived , particularly in view of the centenary anniversary , was rejected by the IOC .
1394 yorkqa01
1395 yorkqa01 APW19990423.0019 Tom Cruise and Nicole Kidman
1395 yorkqa01
1395 yorkqa01 LOS ANGELES ( AP ) -- Tom Cruise and Nicole Kidman have filed a libel
suit against the Star for a story claiming they needed a sex therapist to perform love scenes in
“Eyes Wide Shut.”
1395 yorkqa01
providing both a fully relevant answer, which could be submitted to the NIST assessors, and a
partially relevant answer, which could be useful for other purposes, for example providing the
questioner with an idea of the reliability of the system.
4 Goals and morphic relevance
It could be argued that morphic relevance, dependent on the preferences of the questioner,
refers, in some sense, to a goal of the questioner: the preference in TREC-style systems for a
50 byte answer, for example, is determined by the goal of submitting the answers to NIST for
evaluation. It may therefore be argued that this kind of preference should really be dealt with
by goal-directed relevance. But such a characterisation of goal-directed relevance would be so
generic as to be, ultimately, useless: what we have attempted to do is to identify specific
aspects of relevance: as we saw above, goal-directed relevance deals with the immediate
informational goals related to the specific question at hand, not the overall goals of the
questioner: questions are asked in order to fulfil some need for information, a need for
information which arises for a specific purpose (the information will be used for something);
this specific purpose is the informational goal which goal-directed relevance is concerned
with. In the example above, the TREC questions are asked in the context of goals such as
“find an answer appropriate for a graduate student to understand”. If a question answering
system was to take goal-directed relevance as being concerned with all the goals of the
questioner, it could arguably be considered to be more a lifestyle consultant than a question
answering system. The overall goals of a questioner will find expression in the questioner’s
relevance judgements not only from the point of view of goal-directed relevance, but also
from the point of view of semantic, logical and morphic relevance; moreover, there will be
some questioner goals (e.g. eating, drinking, sleeping) which will be irrelevant from the point
of view of forming a relevance judgement for question answering.
5 Conclusion
Morphic relevance, being concerned with the form that an answer takes when it is given in
response to a question, may be implemented through an answer generation module in question
answering systems. It was shown how this could be achieved trivially through an answer
generation module in the YorkQA system for the TREC evaluation. We then introduced the
distinction between morphic correctness and morphic relevance and showed how the YorkQA
system could implement morphic relevance. Finally we spelled out the difference between
morphic and goal-directed relevance, underlining the fact that goal-directed relevance is
concerned with informational goals, not generic questioner goals.
Chapter 10
Background knowledge for relevance judgements
Executive Summary
We examine how background knowledge could be expanded to improve relevance judgements, by
devising a method for extending the knowledge base used in our QA experiments, WordNet, by
automatically learning “telic” semantic relations.
1 Introduction
We saw above that background knowledge is a fundamental constituent of answerer
prejudices, and has a strong influence on the relevance judgements made by a Question
Answering system. Background knowledge, in the form of dictionary and encyclopaedic
knowledge is used to determine semantic relevance, and, in our implementation of goal-directed relevance, to decide if the current question is part of a wider dialogue; moreover, in
the form of inference rules, background knowledge is used to determine logical relevance.
Any shortcomings in this background knowledge will therefore be reflected in the reliability
of the relevance judgements which ultimately depend on this knowledge: a satisfactory
implementation of relevance will consequently require a satisfactory knowledge base, or a
knowledge base which can be easily improved.
In our experiments the primary source for background knowledge was WordNet. The
knowledge given in WordNet is however limited to a small number of fundamental semantic
relations, and while these are satisfactory in most cases, as shown above there is room for
improvement. It would therefore appear that in order to improve the performance of our
implementation of relevance we would need to be able to improve on WordNet. We verified
this hypothesis by seeking to establish in what ways a knowledge base such as WordNet
could be expanded and what effect such an expansion would have on the relevance
judgements made by a Question Answering system. The following paragraphs will describe
our work in this area, which sought to improve the WordNet knowledge base through the
addition of telic relationships between words.
2 Automated discovery of telic relations for WordNet
As noted above, WordNet (Miller 1995; Fellbaum 1998) is a lexical database which organizes
words into synsets, sets of synonymous words, and specifies a number of relationships such as
hypernym, synonym, meronym which can exist between the synsets in the lexicon. The
explicit relationships in WordNet do not exhaust (nor claim to exhaust) the set of possible
relationships between words and there is scope for expansion and improvement. Pustejovsky
(1995), for example, presents a model in which each lexical item in a dictionary would be
characterized by an argument structure (specifying the number and type of arguments that a
lexical item carries), an event structure (characterizing the event type of a lexical item and its
internal structure), qualia structure (representing the different modes of predication possible
with a lexical item) and a lexical inheritance structure (identifying how a lexical structure is
related to other structures in the dictionary). Each of these structures could then be used to
infer a very complex (but also, hopefully, comprehensive) net of relationships between lexical
items. Qualia structures, for example, would provide information on “constitutive”
relationships (the relation between an object and its constitutive parts, e.g. material, weight,
parts and component elements), “formal” relationships (that which distinguishes an object
within a larger domain, e.g. orientation, magnitude, shape, dimensionality, colour, position),
“telic” relationships (the purpose or function of an object, e.g. the purpose an agent has in
performing an act or the built-in function or aim which specifies certain activities) and
“agentive” relationships (factors involved in the origin or “bringing about” of an object, e.g.
creator, artefact, natural kind, causal chain). Of the relationships identified by Pustejovsky,
WordNet partially considers argument structure (only in the case of verbs, as verb groups),
inheritance structure (hyponym relationships, but not the “complex type” relationships
identified by Pustejovsky) and qualia structure. In the case of qualia structure it only
considers “constitutive” relationships, in the form of meronym relationships (member,
substance and part) and, in part, “agentive” relationships (in the form of entailment and causal
relationships).
There is scope therefore for enhancing WordNet by adding relationships such as argument
structure for words that are not verbs, event structure, complex inheritance, and qualia
structures such as formal and telic relationships.
We focused on telic relationships and in particular examined a method by which telic
relationships could be automatically discovered from the glosses contained in WordNet
and used to augment WordNet itself.
2.1 Telic Relationships
The qualia structures identified by Pustejovsky are derived in part from the Aristotelian view
of word meaning which identified a set of “modes of explanation” (aitiai) that could be
applied to words. These modes of explanation identify particular aspects of meaning (qualia)
which can be used to connect words in a lexicon. One particular aspect of meaning is the
“telic” of an object, indicating the purpose or function of an object, for example the purpose
an agent has in performing an act or the built-in function or aim which specifies certain
activities. Thus the telic of milk would be drink, as the purpose of milk is to be drunk.
Although Pustejovsky never mentions multiple relationships, it is conceivable that a word
may have more than one telic relation, as, for example, wood, used both for burning and for
making furniture.
The objective was therefore to extend WordNet by creating a new set of relations
telic( A, B )
linking two synsets and indicating that there exists a telic relationship between A and B, such
that if A is a synset representing a word, B is the telic of A, or, in other words, that A is used
to achieve B.
2.2 Related Work
Machine readable dictionaries and encyclopaedias have been shown to be useful tools in the
creation of knowledge-bases. Different approaches have been applied, including pattern-matching (e.g. Chodorow et al., 1985), and specially constructed or broad coverage parsers (see
for example Wilks et al. 1996; Richardson et al. 1998; Kang and Lee 2001; Katz et al. 2001).
WordNet glosses, brief explanations describing the particular meaning of individual synsets
within WordNet, have been successfully used to semi-automatically enhance and create
knowledge bases. Moldovan and Rus (2001), and Harabagiu et al. (1999), for example,
parsed the text of the glosses in order to transform them into a logical form to be used
respectively as axioms in reasoning about world-knowledge and to enhance WordNet with
new derivational morphology relations. Attempts have also been made to automatically build
qualia relations from corpora (Pustejovsky et al. (1993); Bouillon et al. (2001)).
2.3 Method
In order to use the WordNet glosses to add telic relations to WordNet itself, it was necessary
to:
• Extract the telic relation (or, possibly, relations) from the gloss using some parsing
method. The result of this process was expected to be a word or a group of words
representing the telic relation(s) for the synset which provided the gloss.
• Transform the telic word into a synset by disambiguating its meaning, thus avoiding
the creation of misleading relationships
In the present experiments, pattern-matching, enhanced by a very simple part-of-speech
parser was used to find the relevant information, i.e. a word representing the telic of the given
synset. The extracted information was then passed to a word disambiguation module which
returned a synset for the given telic word.
2.4 Identification and extraction of telic words
Initially the WordNet glosses were cleaned, removing example sentences. The glosses were
then analyzed for patterns indicating that the gloss contained information about the telic
relations for a particular synset. It was noted that within the given glosses telic relations could
be found in the presence of the following patterns:
• “... to TELIC_VERB by the use of...”, as in mammography (synset id 100649306), which has
as gloss “a diagnostic procedure to detect breast tumors by the use of X rays”, indicating that
the telic relation for mammography is to detect breast tumors (i.e. it is used to detect breast
tumors).
• “... used for TELIC”, as in tracing_paper (synset id 110816432), with gloss “a
semitransparent paper that is used for tracing drawings”, indicating that the telic relation for
tracing paper is to trace drawings.
• “... used to TELIC”, as in cardiac_glycoside (synset id 110805579), defined as “obtained from
a number of plants and used to stimulate the heart in cases of heart failure”, indicating that the
telic relation for cardiac glycoside is the stimulation of the heart.
• “... use of ... to TELIC” as in trickery (synset id 100485559), with gloss “the use of tricks to
deceive someone (usually to extract money from them)”, indicating that the telic of trickery is
to deceive someone.
• “... used as ... in TELIC_ING-VERB” as in Plasticine (synset id 110453999), defined as “a
synthetic material resembling clay but remaining soft; used as a substitute for clay or wax in
modeling (especially in schools)”, indicating that the telic relation for Plasticine is modeling.
• “... used in TELIC_ING-VERB” as in seal_oil (synset id 110781016), with gloss “a pale
yellow to red-brown fatty oil obtained from seal blubber; used in making soap and dressing
leather and as a lubricant”, indicating that the telic relations for seal_oil should be making
soap, dressing leather and lubrication.
• “... used in ... as a TELIC” as in giant_taro (synset id 108093257) with gloss “large evergreen
with extremely large erect or spreading leaves; cultivated widely in tropics for its edible
rhizome and shoots; used in wet warm regions as a stately ornamental”, indicating that the
telic relation of a giant taro is its use as a stately ornamental.
• “... for use as TELIC” as in houseboat (synset id 102838388), defined in its gloss as “a barge that
is designed and equipped for use as a dwelling” indicating that the telic relation for a
houseboat is a dwelling.
• “... for use in ... TELIC_ING-VERB” as in wherry (synset id 103611080), with gloss “light
rowboat for use in racing or for transporting goods and passengers in inland waters and
harbors”, indicating that the telic relation for a wherry is racing and the transportation of
goods and passengers.
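Purely for illustration, two of the surface patterns listed above could be captured by regular expressions along the following lines; the actual extraction also used a simple part-of-speech parser, covered all of the patterns, and split glosses on conjunctions and semicolons, as described in the next paragraph.

import re

# Illustrative patterns for "... used to TELIC" and "... used for TELIC" only.
TELIC_PATTERNS = [
    re.compile(r"\bused to (?P<telic>\w+(?: \w+){0,3})"),
    re.compile(r"\bused for (?P<telic>\w+(?: \w+){0,3})"),
]

def extract_telic_candidates(gloss):
    """Return the raw telic phrases found in a WordNet gloss (before disambiguation)."""
    candidates = []
    for pattern in TELIC_PATTERNS:
        for match in pattern.finditer(gloss):
            candidates.append(match.group("telic"))
    return candidates

print(extract_telic_candidates(
    "a semitransparent paper that is used for tracing drawings"))
# -> ['tracing drawings']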
A subset of the WordNet glosses possibly containing information regarding telic relations was
therefore taken and each gloss was split where more than one telic relation was indicated, as
in the presence of conjunctions or disjunctions (as in synset 103357011, “a sailing ship […]
used in fishing and sailing along the coast”, where two telic relations, a) fishing, and b)
sailing, are present ) and semicolons (as in synset 110842812, “any of a group of synthetic
steroid hormones used to stimulate muscle and bone growth; sometimes used illicitly by
athletes to increase their strength” where two telic relations are present, a) the stimulation of
muscle and bone growth, and b) to increase strength).
It was then necessary to identify one word (or compound word) that could summarize the telic
relationship found in the gloss. In particular, it was necessary to identify one verb or noun that
would represent the telic for a chosen synset. In order to avoid over-generalization, words
such as “be”, “do”, “make” and “thing” were avoided and where these were found, a more
specific word was sought. So, for example, in seeking the telic relationship for conditioner
(synset id 102485262), defined as “a substance used in washing (clothing or hair) to make
things softer”, the relationship that was sought was to “make soft”, i.e. to soften, not simply to
“make”, which is far too general to be of any use. In these cases a more specific noun or verb
was sought and, in the absence of a more specific noun or verb the adjective attached to the
noun was modified to find a more specific verb (in the case of conditioner, the adjective
“soft” was used to derive the verb “soften”).
2.5 Telic Word Sense Disambiguation
Having identified one word that summarized the telic relationship, it was necessary to identify
the specific sense of the telic word, i.e. to find the synset which represented the telic
relationship. It was therefore necessary to consider the set of possible synsets to which the
telic word could belong and choose the synset that best represented the meaning of the telic
word. One approach to disambiguation of word sense is to use some measure of relatedness
between the word and its context, or, in other words, to calculate the semantic similarity or
conceptual distance between the word and its context (Miller and Teibel 1991, Rada et al.
1989). As already noted, WordNet has been shown to be fruitful in the calculation of semantic
similarity, determining similarity by calculating the length of the path or relations connecting
two concepts; different approaches have been applied, using all WordNet relations (Hirst-St.Onge (1998)) or only a subset of the available relations (specifically, is-a relations) (Resnik
(1995); Jiang-Conrath (1997); Lin (1998); Leacock-Chodorow (1998); for an overall
evaluation see Budanitsky and Hirst (2001)). In determining conceptual distance, Mihalcea
and Moldovan (1999) and Harabagiu et al. (1999) usefully employed WordNet glosses,
considered as micro-contexts. A similarly inspired approach was taken in this study, with the
meaning of a telic word being constrained by a) the gloss from which it was extracted and b)
the set of glosses of the synsets to which it could belong.
Given a word w, whose meaning was represented by the gloss (definition) GWw, and its telic
word t, it was necessary to find the synset ts representing the correct meaning of t from the set
of all synsets T to which t could belong. The correct synset ts (in other words, the correct
sense) for the telic word t was taken to be the synset that maximized the semantic similarity
between GWw and ts, as follows:
ts = argmaxts∈T sd( GWw, ts )
where sd( a, s ) is a function calculating the semantic similarity sd between a sentence a and a
particular meaning of word w, represented by its synset s, which returns a number between 0
and 1 indicating the relatedness between the sentences, where 1 indicates they have the same
meaning and 0 indicates they have no meaning in common. The function sd was a simpler
version of the function used to calculate semantic relevance, and was modified slightly to
tailor it to the specific task at hand, which was word sense disambiguation.
In order to calculate sd the set TSts was constructed by taking all the words in the gloss GT of
ts and all the words in the glosses of all the hyponyms and hypernyms of ts (to a depth of 3
hypernyms and 3 hyponyms) as follows:
TSts = { w : w ∈ GT ∨ w ∈ hyperg( ts, 3 ) ∨ w ∈ hypog( ts, 3 ) }
where w represents a word; hyperg( s, d ) is a function which returns a set made up of all the
words in the glosses of the hypernyms of a synset s, to a depth d; and hypog( s, d ) is a
function which returns a set made up of all the words in the glosses of the hyponyms of a
synset s, to a depth d.
TS was then compared with GWw (which was considered the set of words making up a gloss)
by using a form of term overlap measure to measure their semantic relatedness. Initially, a set
of stop-words SW was used to ignore words that were too common to be of any use (e.g.
“the”, “do”), thus producing the two reduced sets RGWw and RTS:
RGWw = GWw – SW
RTS = TS – SW
The remaining words were then analyzed to find their stems, thus producing the two sets
SGWw and STS made of the stems of the words belonging to RGWw and RTS. Each word in
RGWw was then compared to all the words in RTS, using all the available WordNet
relationships (is_a, satellite, similar, pertains, meronym, entails, etc.), with the additional
relationship, “same_as”, which indicated that two words were identical. Each relationship was
given a weighting indicating how related two words were, with a “same as” relationship
indicating the closest relationship, followed by synonym relationships, hypernym, hyponym,
then satellite, meronym, pertains, entails. Each word wi in RGWw was therefore assigned a
weighting ri indicating its relatedness to RTS, and the total semantic similarity tsd between
RGWw and RTS was calculated as the normalized sum of all the weightings r of RGWw. The
normalization was carried out by dividing the sum of the weightings by the number of words
in RTS plus one, in order for short glosses not to be disadvantaged:
tsd = ( Σw∈RGWw r( w, RTS ) ) / ( |RTS| + 1 )
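As a purely illustrative sketch of the disambiguation step just described, the code below uses the modern NLTK WordNet interface for convenience and reduces the relationship weighting r to a crude stem-overlap test; the function names, the stop-word list and the simplified traversal are assumptions made for the sketch, not part of the YorkQA implementation.

from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "of", "to", "do", "is", "in", "for", "and", "or"}
stem = PorterStemmer().stem

def context_set(synset, depth=3):
    """Stems of the words in the gloss of a synset and of its hyper-/hyponyms (a rough RTS)."""
    words, frontier = set(synset.definition().lower().split()), [synset]
    for _ in range(depth):
        frontier = [s for syn in frontier for s in syn.hypernyms() + syn.hyponyms()]
        for s in frontier:
            words.update(s.definition().lower().split())
    return {stem(w) for w in words if w not in STOP_WORDS}

def tsd(gloss_words, candidate):
    """Normalised overlap between the source gloss and the candidate synset's context set."""
    rts = context_set(candidate)
    overlap = sum(1 for w in gloss_words if stem(w) in rts)   # stands in for r( w, RTS )
    return overlap / (len(rts) + 1)

def disambiguate_telic(source_gloss, telic_word):
    """Choose the synset of the telic word maximising tsd against the gloss it came from."""
    gloss_words = [w for w in source_gloss.lower().split() if w not in STOP_WORDS]
    candidates = wn.synsets(telic_word)
    return max(candidates, key=lambda s: tsd(gloss_words, s)) if candidates else None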
A number of experiments were conducted to see to what depth the hyper- and hyponyms of a
candidate synset should be considered, i.e. to decide, given a synset S and its hyponyms HS,
whether the hyponyms HS' of the set HS should also be considered, whether the hyponyms
HS'' of HS' should be considered, and so forth to an arbitrary depth n. It was found that a depth of 3 hypernyms
and 3 hyponyms gave satisfactory results in an acceptable time.
2.6 Results
A total of 2449 telic relationships were derived, relating to 1841 different synsets (i.e. a synset could
have more than one telic relationship). A sample (about 10% of the total) of the derived
relationships was examined manually and it was estimated that 78% of the relationships were
actually telic relationships, while the rest either denoted other types of relationships (e.g. the
context in which something is used), or denoted telic relationships that could not be
summarized in one word (as in synset 110556533, “activating agent”, whose telic should be
“increase the attraction to a specific mineral”). 9% of the relationships were in effect telic
relationships, but were counterintuitive, in part because of the limitations posed by the
adopted method, which considered telics in isolation from their possible objects (e.g. a cancer
drug having as telic “kill”, because its function is to kill cancer cells). Of the correct
relationships, the correct synset (i.e. the correct meaning) was chosen 77% of the time. The
disambiguation algorithm failed mainly where there were very subtle differences in meaning
in WordNet, as in the difference in meaning for the word “represent” between synset
201841374 (take the place of), 200566766 (express indirectly; be a symbol of) and
200265192 (to establish a mapping (of mathematical elements or sets)), or the difference in
meaning for the word “stain”, between synset 200196870 (produce or leave stains; “Red wine
stains table cloths”) and 201053918 (make a spot or mark onto; “The wine spotted the
tablecloth”). The following table summarises the results.
Total relationships found: 2449
Number of different synsets: 1841
Actual telic relationships: 78% (1910)
Of which with correct synset: 77% (1470)
The following are examples of some of the telic relationships derived:
Example 1: from synset 102853717, indicating “incubator, brooder”, defined as “a
box designed to maintain a constant temperature by the use of a thermostat; used for
chicks or premature infants” it was derived that incubators are used for (or: the telic
of an incubator is) maintaining a constant temperature: the telic word was therefore
correctly identified as “maintain”, with the particular meaning given by synset
201829600, i.e. “keep in a certain state, position, or activity”, which was correctly
chosen by the algorithm in preference to, for example, synset 200723279 (maintain
by writing regular records), synset 200607420 (support against an opponent) and
200496801 (observe correctly).
Example 2: from synset 100450328, indicating “desensitization technique,
desensitization procedure, systematic desensitization”, defined as “a technique used
in behavior therapy to treat phobias and other behavior problems involving anxiety;
client is exposed to the threatening situation under relaxed conditions until the
anxiety reaction is extinguished”, the algorithm correctly inferred that the telic of
desensitization technique was “treat” in the particular meaning given by synset
200054862, i.e. “provide treatment for” as in “The doctor treated my broken leg”;
this meaning was chosen in favour of incorrect alternatives such as 201547305
(provide with a treat) and 200699711 (deal with verbally or in some form of artistic
expression).
Example 3: from synset 102399372, indicating “cash_register, register”, defined as “a cashbox
with an adding machine to register transactions; used in shops to add up the bill” it
was correctly inferred that the telic was 201805970, “add up”, with the meaning “add
up in number or quantity”, and not “add up” as in synset 201786912, “be reasonable
or logical or comprehensible” or in synset 201792159, “develop into”.
3 Effect of augmented knowledge
We analysed the TREC-11 question collection to evaluate the usefulness of the new
relationship. Telic relationships were shown to be useful in 0.8% of the total number of
questions (4/500). Examples of the telic relationships used to find semantically relevant
answer sentences are (for ease of reading we have given the actual words in the relationship
as opposed to the synsets):
telic( uranium, nuclear fuels)
telic( uranium, nuclear weapons)
indicating that uranium is used for nuclear fuels and to make nuclear weapons, needed for
Question 1547, “What is the atomic number of uranium?”. Another example is
telic( telegraph, communicate)
indicating that the telegraph is used to communicate, found useful for finding semantically
relevant answer sentences to Question 1400, “When was the telegraph invented?”.
Telic relations between words are very weak compared to relations such as synonymy and
consequently can only be found on rare occasions when comparing questions and answers.
The given results are therefore expected, showing that such a relationship is only applicable to
a small number of questions. Nevertheless, the extension of the knowledge base used to
define relevance relationships in this manner is important if correct relevance judgements are
to be made: correct relevance judgements rely on a solid and comprehensive knowledge base;
but comprehensiveness can only come about by a gradual addition of what at first sight may
appear only marginally important knowledge relationships, but which nevertheless provide a
valid contribution.
4 Conclusion
We showed how background knowledge could be increased to improve relevance judgements
by expanding WordNet, the knowledge base used in YorkQA. While WordNet contains a
limited number of explicitly defined relationships between words, it also contains a
significant amount of implicit information in the form of synset glosses. We have shown how
this implicit information can be made explicit by automatically extracting new relationships.
In particular, the algorithm presented usefully extended WordNet by automatically inducing a
number of telic relationships from the glosses. The fact that we constructed the new
relationships automatically raises the possibility of errors (as we verified through an analysis of
the discovered relationships), and a manual review of the relationships found is necessary to
ensure that the derived relationships are of sufficiently high quality to be used in practice.
Chapter 11
Integrating and Evaluating the Components of Relevance
Executive Summary
We show how the components of relevance, semantic, goal-directed, logical and morphic relevance can
be integrated into a full system, showing the complexity of deciding how overall relevance is to be
calculated, and how this depends on the design requirements of the question answering system.
1 How the purpose of a QA system determines overall relevance
We saw above that relevance is determined from a number of different perspectives and we
have shown how each of these particular perspectives could be implemented to give a
relevance judgement from that particular point of view. We will now show how these may be
integrated in a full system to provide an overall relevance judgement.
Each question answering system has (using the terminology of chapters 2, 4 and 5) a set of
prejudices that constrain what is considered a relevant answer; but these prejudices can be
seen to correspond to the overall purpose for which the system was built; in other words, there
is no such thing as a generic question answering system: all question answering systems are,
implicitly or explicitly, built for a particular purpose. Only once we have determined the
purpose of the system will we be able to determine the rules which will be applied to form
relevance judgements and resolve any conflicts between the judgements of questioner and
answerer. In other words, there must be a policy decision on the way a question answering
system judges overall relevance, a policy which will be determined by the scope of the
system. Consequently, we propose that:
the way in which the components of relevance combine to provide an overall
relevance judgement in a question answering system will be determined by
the purpose for which the system is built.
The purpose of the system will be given by capturing and analysing the requirements of the
sponsor(s) of the system, who will
a) determine how the relevance judgements of the answerer should be made
b) decide the way in which the relevance judgements of the questioner will be made
(identifying, for example, who the users of the system will be)
c) provide a mechanism for resolving conflicts between the judgements of the
questioner and the answerer
Using the terminology introduced in chapter 5, integrating the components of relevance
means constructing the function relevant-answer given
• the sets S-ANSWER, G-ANSWER, L-ANSWER, M-ANSWER of answers from
the points of view of each type of relevance
• relevance logics RL, RL’ and a conflict resolution process CR, which indicate that,
for example, some systems may want to emphasise semantic relevance, others may
wish to emphasise logical relevance and yet others may wish to emphasise the
relevance judgements given by the questioner as opposed to the judgements given by
the answerer
The purpose or requirements of the system would provide the relevance logics RL, RL’ and
the conflict resolution process CR as follows:
a) RL will specify how the system (the answerer) is required to rank answers
b) RL’ will specify how it is assumed that users of the system (the questioners) will rank
answers (we may even decide to ignore users who do not conform to the given model
as “uninteresting” customers)
c) CR will be a policy which specifies how the system is required to act if there is a
conflict between the ranking that would be given by the system and the ranking that it is
assumed would be given by the user
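To make the roles of RL, RL’ and CR concrete, the following sketch (in present-day Python, purely as an illustration) shows one way the pieces could fit together; the function names and signatures are our own assumptions rather than part of the framework:

```python
from typing import Callable, List

Ranking = List[str]  # answer sentences ordered from most to least relevant


def relevant_answer(
    s_answer: Ranking,  # ranking by semantic relevance (S-ANSWER)
    g_answer: Ranking,  # ranking by goal-directed relevance (G-ANSWER)
    l_answer: Ranking,  # ranking by logical relevance (L-ANSWER)
    m_answer: Ranking,  # ranking by morphic relevance (M-ANSWER)
    rl: Callable[[Ranking, Ranking, Ranking, Ranking], Ranking],        # answerer relevance logic RL
    rl_prime: Callable[[Ranking, Ranking, Ranking, Ranking], Ranking],  # assumed questioner logic RL'
    cr: Callable[[Ranking, Ranking], Ranking],                          # conflict resolution policy CR
) -> Ranking:
    """Combine the component rankings into one overall ranking.

    RL and RL' each turn the four component rankings into an overall ordering
    (one from the answerer's point of view, one from the questioner's);
    CR settles any disagreement between the two orderings."""
    answerer_view = rl(s_answer, g_answer, l_answer, m_answer)
    questioner_view = rl_prime(s_answer, g_answer, l_answer, m_answer)
    return cr(answerer_view, questioner_view)
```

A system that wished to emphasise logical relevance could, for instance, supply an RL that simply returns the logical ranking, while a system that trusted its users could supply a CR that returns the questioner’s view.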
We shall now show how an overall relevance judgement would be made in two example
systems.
2 Examples of relevance integration

2.1 A teaching system
As an example of the problems faced in implementing a QA system, we shall consider a
system built to be used by learners of English as a foreign language. A teaching system would
have to resolve possible conflicts between questioner and answerer goals: depending on the
requirements, overall relevance will be calculated in different ways. In the case of a teaching
aid the goals of the answerer (to provide hints to answers as opposed to complete answers)
would probably take precedence over the goals of the questioner (who would probably prefer
a system that gave complete answers). Morphic relevance could also cause a conflict if the
answerer prefers long, challenging answers to short, easy ones while the questioner would
prefer whatever requires less effort. If on the other hand the answerer was designed as a
substitute teacher, the answerer’s and questioner’s goals could coincide, being for example to
provide an answer which is as complete and easily understood as possible.
The sponsor of the system, an English teacher, might have the following requirements:
a) Answerer requirements: Improve English language learners’ command of
English by engaging the questioner, providing friendly, short and easily understood
answers which give the questioner an idea of the context in which words are found
and the manner in which they are employed.
These requirements translate into the following relevance logic:
• From the point of view of semantic relevance answers must be engaging, providing interesting information about the question, not simply an answer
• From the point of view of goal-directed relevance answers must be friendly and easily understood, must show context and the manner in which words are employed
• From the point of view of logical relevance answers should try and satisfy the informational need of the questioner, but showing context is more important
• From the point of view of morphic relevance answers must be short
• The most relevant answers will be the shortest answers that provide interesting information, then the shortest answers which show the context in which words are used, then the shortest answers which satisfy the informational need
The goal of the questioner (a student of English) could be simply (and naively from a
pedagogical point of view):
b) Questioner requirements: Learn the meaning of individual words as quickly as
possible
These requirements translate into the following relevance logic:
• From the point of view of goal-directed relevance answers must provide the meaning of words in an easily understood, immediate manner
• From the point of view of morphic relevance answers must be short and to the point
• Providing answers which explain the meaning of words is more important than providing short answers
Notice that there are different requirements between questioner and answerer, especially
regarding goal-directed relevance: the questioner wants immediately understandable answers,
while the answerer wants to show context and the manner in which words are employed. To
solve this potential conflict we then have a
c) Conflict resolution policy: the teacher’s requirements are more important
than the students’ and precedence should be given to the goals of the
answerer as opposed to the questioner goals.
Given the question:
What is a redwood?
The answerer may find the following semantically (as they are all “about” redwoods) and
logically (because they all explain the meaning of “redwood”) relevant sentences in the TREC
data collection:
NYT19990607.0395: “A strapping prince of a redwood growing in splendid isolation
12 miles west of Ukiah now holds the title of being the world's tallest tree.”
APW19990605.0149: “A towering redwood surrounded by other giant trees in
Northern California has been identified as the world's tallest living thing.”
NYT19981123.0272: “Always an awe-inspiring sight, the giant redwoods that tower
along the California coast are perhaps at their majestic best on foggy days, when
these ancients, among the botanical wonders of the world, can be glimpsed through
wisps of swirling mist.”
NYT19981123.0272: “Coastal redwoods, or Sequoia sempervirens, are found
patchily mostly along the California coast and into southern Oregon.”
From these sentences we can construct the following (morphically relevant) brief answers:
A redwood is the world's tallest tree
A redwood is the world's tallest living thing
A redwood is among the botanical wonders of the world
A redwood is Sequoia sempervirens
If the questioner was a native speaker of a Romance language, the most relevant answer
would be “sequoia sempervirens” giving a Latin definition which is immediately
understandable to a speaker of these Latin-derived languages. The next most relevant answer
would be “the world’s tallest tree”, requiring an understanding of the words “tree” and “tall”
and perhaps “world”, but giving a short and precise answer; following this we would have
“the world’s tallest living thing”, slightly more generic and more ambiguous than the
previous, and therefore potentially misleading, followed by “botanical wonder”, which does
not provide a direct answer to the question. Summarising, from the point of view of the
questioner the relevance ordering should be:
1. A redwood is Sequoia sempervirens
2. A redwood is the world's tallest tree
3. A redwood is the world's tallest living thing
4. A redwood is among the botanical wonders of the world
From the point of view of the answerer, however, “sequoia” does not provide any clues as to
how the word “redwood” is used in English (few native speakers of English would refer to a
redwood as a Sequoia, or employ the word Sequoia in ordinary conversation) and does not
engage the questioner with interesting information; it also does not give the questioner an idea
of the context in which the word “redwood” would be found, i.e. the words which usually can
be found in a document or conversation containing this word. On the other hand saying that a
redwood is a “botanical wonder” would teach the questioner more interesting words and more
interesting contextual words than an answer containing the very generic “living thing”. The
answerer’s relevance ranking would probably be:
1. A redwood is the world's tallest tree
2. A redwood is among the botanical wonders of the world
3. A redwood is the world's tallest living thing
4. A redwood is Sequoia sempervirens
Given that we considered the teacher’s requirements more important than the students’, the
answerer would therefore provide a different relevance ranking from what the questioner
would prefer, giving precedence to the goals of the answerer as opposed to the questioner
goals.
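Continuing the illustrative sketch above, the teaching system’s conflict resolution policy amounts to returning the answerer’s (the teacher’s) ordering whenever the two orderings disagree; the rankings below are simply those of the worked example:

```python
# Hypothetical rankings taken from the redwood example (illustrative only).
questioner_ranking = [
    "A redwood is Sequoia sempervirens",
    "A redwood is the world's tallest tree",
    "A redwood is the world's tallest living thing",
    "A redwood is among the botanical wonders of the world",
]
answerer_ranking = [
    "A redwood is the world's tallest tree",
    "A redwood is among the botanical wonders of the world",
    "A redwood is the world's tallest living thing",
    "A redwood is Sequoia sempervirens",
]


def teacher_precedence_cr(answerer_view, questioner_view):
    """Conflict resolution policy for the teaching system: the teacher's
    (answerer's) requirements take precedence, so the answerer's ordering
    is returned unchanged whenever the two orderings disagree."""
    return answerer_view


overall = teacher_precedence_cr(answerer_ranking, questioner_ranking)
# overall == answerer_ranking: the system presents the teacher's ordering.
```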
2.2 An advertising system
An extreme case where there may be conflicts between questioners’ and answerers’ relevance
judgements is the case of a QA system used for marketing. While the questioner’s overall aim
would be to get information, the answerer’s overall aim would be to make a sale. In the limit,
the answerer could even seek to mislead the questioner by providing information which is
irrelevant to the questioner’s goals but relevant to the answerer’s goals.
As an example, consider the question:
What is the cheapest way to get from London to Paris?
Where the answerer, sponsored by an air company, has the following requirement:
a) Answerer requirement: provide the truest answer which puts the sponsor in
a favourable light.
This requirement translates into the following relevance logic:
• From the point of view of goal-directed relevance answers must put the sponsor in a favourable light
• From the point of view of logical relevance answers should be as truthful as possible
• It is more important to show the sponsor in a favourable light than to provide truthful answers, and answers may not tell the entire truth if this is favourable to the sponsor
While the questioner has the following requirement:
b) Questioner requirement: a short and true answer to the question.
This requirement translates into the following relevance logic:
• From the point of view of logical relevance answers must satisfy the informational need of the questioner
• From the point of view of morphic relevance answers must be short
• It is more important for answers to satisfy the informational need than to be short
We then have
c) Conflict resolution policy: the goal of the answerer to put the sponsor in a
favourable light takes precedence over all other requirements even if the
questioner does not want this.
If the answerer retrieved a set of relevant answer sentences made of the following:
(www.travelselect.co.uk) “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle
Air £56.95 Lufthansa”
(www.travelselect.co.uk) “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle
Air £61.65 Bmi British Midland”
(www.travelselect.co.uk) “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle
Air £65.53 Air France”
(www.travelselect.co.uk) “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle
Air £70.47 British Airways”
(www.gobycoach.co.uk) “Eurolines advance return - must be booked 30 days prior to
departure date and are valid for one month from the departure date - £32.00”
(www.gobycoach.co.uk) “Eurolines economy return - available up to two days prior
to departure and valid for up to six months - £50.00”
From the questioner’s point of view the top four answers should be ranked as follows:
1. “Eurolines advance return - must be booked 30 days prior to departure date and are valid for one month from the departure date - £32.00”
2. “Eurolines economy return - available up to two days prior to departure and valid for up to six months - £50.00”
3. “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle Air £56.95 Lufthansa”
4. “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle Air £61.65 Bmi British Midland”
The answerer however may wish to overlook the fact that it is cheaper to travel by coach,
providing the questioner with a set of answers which solely provide information about air
journeys. From the answerer’s point of view, therefore, the top four answers should be ranked
as follows:
1. “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle Air £56.95 Lufthansa”
2. “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle Air £61.65 Bmi British Midland”
3. “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle Air £65.53 Air France”
4. “(LHR) - London Heathrow Intl (CDG) - Charles Degaulle Air £70.47 British Airways”
Notice that it is not strictly true that the cheapest way to get from London to Paris is
by using a plane costing £56.95; this is, however, the most relevant answer. If the overall aim
of the system is to satisfy the answerer’s goals even when these conflict with the questioner’s
goals or when this implies ignoring valid (logically relevant) answers which are not relevant
from a goal directed point of view, this will also be the overall ranking given to the
questioner.
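In the same illustrative style, the advertising system’s conflict resolution policy can be sketched as a filter that drops answers which do not favour the sponsor; the candidate strings are abbreviated from the example above and the predicate is an assumption made only for the sketch:

```python
# Candidate answers from the example above, abbreviated for the sketch
# and listed cheapest first.
candidates = [
    "Eurolines advance return - £32.00",
    "Eurolines economy return - £50.00",
    "LHR-CDG Air £56.95 Lufthansa",
    "LHR-CDG Air £61.65 Bmi British Midland",
    "LHR-CDG Air £65.53 Air France",
    "LHR-CDG Air £70.47 British Airways",
]


def favours_sponsor(answer: str) -> bool:
    # Assumed predicate: the sponsor is an air company, so only air
    # journeys count as goal-directed relevant for the answerer.
    return "Air" in answer


def sponsor_precedence_cr(ranked_answers):
    """Conflict resolution for the advertising system: answers which do not
    put the sponsor in a favourable light are dropped, even though the
    questioner's own ranking would have placed the coach fares first."""
    return [a for a in ranked_answers if favours_sponsor(a)]


overall_ranking = sponsor_precedence_cr(candidates)
# overall_ranking now contains only the four air fares, cheapest first.
```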
A side issue in this case would be the ethical and legal implications of such a system: it may
be considered morally reprehensible to construct such a “deceptive” system and it may also be
necessary to have some sort of disclaimer to avoid accusations of misrepresentation if the
system is to be used to help sell goods.
3 A framework for evaluating relevance
Having shown how the components of relevance would be brought together in a working
system to provide an overall relevance judgement with respect to an answer, it is now necessary
to provide a method for evaluating such a system.
• We propose an evaluation framework, alternative to the current TREC framework (which we have shown in Chapters 2 and 3 to be unsatisfactory), which could be used to evaluate QA systems which implement our relevance-based theoretical framework.
As set out in Chapters 4 and 5, our framework:
• Takes the notion of relevance as the basis for judging the appropriateness of an answer, as opposed to the notion of correctness
• Explicitly defines the notions of questioner and answerer
• Explicitly defines the constraints (prejudices) which contribute to the relevance of an answer in relation to a questioner and an answerer, in particular: background knowledge, preferences, goals and the context in which a question is asked (i.e. previously asked questions)
An evaluation framework based on this notion will hence:
• Judge the relevance of a number of answers, as opposed to the correctness of a single answer
• Allow systems to provide answers in different formats which could be judged more or less relevant (e.g. whole sentences, single concepts, snippets of text)
• Explicitly define the notion of questioner model, including background knowledge, preferences and goals
• Explicitly define the notion of answerer model, including background knowledge, preferences and goals
• Explicitly recognise the context in which questions are asked
Although TREC-10 and previous evaluations allowed systems to return more than one
answer, this is not equivalent to a notion of a system returning more than one relevant answer.
The TREC evaluation was effectively giving systems more than one chance to return the
correct answer. What we propose is to require systems to attempt to return more than one
relevant answer. In proposing that systems return more than one answer we are therefore
not simply returning to a situation similar to the TREC-10 evaluation, which required systems
to provide five snippets of text as answers: the TREC judges examined these snippets to find a
unique answer and ignored whether systems were able to return more than one correct (but
possibly different) answer. In other words, TREC-10 systems did not provide five answers to the
judges, but five attempts at a single answer.
Systems would be judged by their ability to:
• Provide more than one distinct answer
• Correctly rank the sentences in order of relevance
The ability to rank sentences in order of relevance would be judged based on:
• Questioner requirements
• Answerer requirements
• Explicit rules for resolving conflicts between questioner and answerer
One of the difficulties with the TREC evaluation has been the problem of judging the
correctness of an answer without reference to a user model; this in turn has led to a difficulty
in comparing systems which make use of different components, such as knowledge bases, and
in understanding how much these components influenced results. Our evaluation would
resolve this problem by explicitly defining the following (see the sketch after this list):
• Answerer background knowledge
• Answerer inference logic
• Answerer goals
• Answerer preferences
• Questioner background knowledge
• Questioner inference logic
• Questioner preferences
• Questioner goals
• Question context (e.g. whether the question is part of a wider dialogue)
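These explicit definitions could be recorded as a simple specification object; the following Python sketch is purely illustrative, and the class and field names are our own assumptions rather than part of the evaluation framework itself:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentModel:
    """Explicit model of a questioner or an answerer for evaluation purposes."""
    background_knowledge: List[str] = field(default_factory=list)  # e.g. "all WordNet relations"
    inference_logic: str = ""                                      # e.g. "Otter with relaxation rules"
    goals: List[str] = field(default_factory=list)
    preferences: List[str] = field(default_factory=list)


@dataclass
class EvaluationSpecification:
    """Everything that must be fixed before relevance judgements can be compared."""
    questioner: AgentModel
    answerer: AgentModel
    conflict_resolution_policy: str   # how questioner/answerer disagreements are settled
    question_context: List[str]       # previously asked questions and answers, if any
```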
The following table summarises this discussion by comparing the TREC evaluation with the
proposed evaluation:
Evaluation criterion | TREC | Proposed

Answer assessment:
Satisfactory answer | Correctness | Relevance
Number of answers | One (e.g. TREC-11); one out of five attempts (e.g. TREC-10) | Many

Ensuring consistency:
Questioner requirements | n/a | Defined in terms of relevance
Answerer requirements | Equivalent to the TREC evaluation criteria | Defined in terms of relevance
Conflict resolution mechanism | n/a | Policy for resolving conflicts between answerer and questioner requirements
Questioner knowledge | n/a | Explicitly defined
Questioner inference logic | n/a | Explicitly defined
Questioner goals | n/a | Explicitly defined
Questioner preferences | n/a | Explicitly defined
Answerer knowledge | Various (e.g. WordNet, Cyc, custom-built), not explicitly defined | Explicitly defined
Answerer inference logic | Various (e.g. Otter, custom-built), not explicitly defined | Explicitly defined
Answerer goals | n/a | Explicitly defined
Answerer preferences | Formatting rules: single concept (TREC-11); text window (TREC-10) | Explicitly defined
Question context | No context given | Explicitly defined
This framework overcomes the limitations of the TREC evaluation framework identified in
chapter 3 as follows:
• Lack of clarity about the meaning of “answer”: the proposed framework explicitly refers to the notion of relevance, understood in terms of questioner and answerer requirements
• Multiple answers or one answer: by employing the notion of relevance the framework explicitly acknowledges that there may be more than one satisfactory answer to any question
• Lack of clarity about what was being evaluated: the framework ensures consistency of evaluation by ensuring that all the components that help provide a relevance judgement are explicitly set out
• User modelling: both user and system models are set out explicitly as questioner and answerer models
• User goals: the questioner model includes the notion of questioner goals
• System goals: the answerer model includes the notion of answerer goals
• Question context: the framework ensures consistency of evaluation by explicitly setting out the context in which questions should be considered
4 Using the relevance framework to improve and evaluate a TREC-style system
While in chapters 7-10 we showed how a TREC-style QA system could be improved from the
point of view of the individual relevance categories we shall now show how such a system
may be improved and evaluated by making reference to the overall relevance framework we
have developed:
• The theoretical framework developed in chapters 4 and 5 will enable us to understand, from a theoretical point of view, the limitations of the YorkQA system
• We shall then show how we can construct a more satisfactory system making use of the implementation ideas set out in chapters 7-10 and the framework for integration set out above in paragraphs 1-2
• We will then make use of the evaluation framework set out in paragraph 3 to evaluate theoretically the performance of the new system, showing that it does in fact present an improvement on the previous system
4.1 Making reference to the relevance framework developed
The theoretical framework for question answering that we provided is based on
1) the notion of relevance
which in turn has been shown to be made up of:
2) semantic relevance
3) goal-directed relevance
4) logical relevance
5) morphic relevance
Moreover, an answering system provides an answer that depends on the answerer’s
prejudices, which we have seen consist of:
6) an answerer model
7) a questioner model
8) previous question/answer pairs
We shall now examine the YorkQA TREC-style QA system in the light of these points,
showing how the relevance framework highlights the shortcomings of the system.
1) The notion of relevance. The YorkQA system, in common with all TREC-style QA
systems, does not attempt to provide relevant answers, but a unique and correct answer to a
question: this may be considered a limit case of relevance, where answers are deemed either
wholly relevant or wholly irrelevant, with no intermediate distinction. Although TREC-10
systems provided five answer sentences for each question, they still cannot be said to have
provided five differently relevant answers: the objective of TREC-10 was to arrive at a single
answer; systems were given five chances of getting the correct answer, a rather different task
from asking a system to provide five answers of varying degrees of relevance.
2) Semantic relevance. TREC-style QA systems narrow down the number of
documents to be examined to find an answer by employing an information retrieval engine to
retrieve a small subset of the document collection based on a query constructed from the
question. From this point of view TREC-style QA systems can be said to employ a limited
notion of semantic relevance: answer sentences are considered potentially relevant to a
question if they belong to a document which has been judged to be relevant to the question;
documents can therefore be said to be semantically relevant to the question as they contain
terms in common with the question or closely related to words in the question. The
unsatisfactory nature of this approach is illustrated by referring to a document retrieved
through the Google search engine by asking the question “What is creative accounting?”:
“Enron has made everyone from politicians to comedians aware of the potentially
disastrous results of creative accounting practices. And though the jury is still out on
what exactly caused the sudden collapse of the seventh largest company in the
country, investors are now scrutinizing financial statements, and companies are
scrambling to squash investor doubts by improving disclosure.
As well, professionals in the field are expressing strong opinions on the debacle. A
recent poll conducted by BusinessWeek and Financial Executives International -- a
professional association whose members include CFOs, controllers, and treasurers -- revealed that nearly half of those surveyed believe Enron is not an isolated situation.
It is merely the most extreme example of problematic financial reporting, suggest
some (1).” (www.smartpros.com)
The first paragraph of the document is certainly more relevant to the question than the second;
but even if we retrieved (as some systems do) relevant paragraphs as opposed to relevant
documents, we would still have the problem of distinguishing between more or less relevant
sentences within the paragraph.
Another problem with the approach given in current TREC-style systems is that the very
limited (and implicit) notion of semantic relevance which is employed is used negatively as a
filtering mechanism to aid processing (processing the whole document collection is seen as
the ideal approach, but is considered impractical) as opposed to being used positively to help
provide better (i.e. more relevant) answers. YorkQA attempts in part to move beyond the idea
of a negative filter by using an information retrieval engine as an initial filter, but then using a
semantic similarity measure to rank sentences in order of relevance. The top ranking sentence
which has a named entity corresponding to the question type is then considered to be the
answer sentence. Nevertheless, even this approach does not fully make use of the notion of
semantic relevance as ultimately the system does not return relevant answers but a single
correct answer.
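For clarity, the sentence-selection strategy just described can be restated schematically; the following Python sketch is an illustration only, with similarity and entity_types standing in as assumed helper functions rather than YorkQA’s actual interfaces:

```python
def rank_candidate_sentences(question, documents, similarity, answer_type, entity_types):
    """Schematic restatement (not the actual YorkQA code) of the strategy
    described above: an IR engine has already narrowed `documents` down;
    candidate sentences are ranked by a semantic similarity measure, and the
    top-ranked sentence containing a named entity of the expected answer
    type is returned."""
    sentences = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]
    ranked = sorted(sentences, key=lambda s: similarity(question, s), reverse=True)
    for sentence in ranked:
        if answer_type in entity_types(sentence):  # e.g. "PERSON", "DATE", "LOCATION"
            return sentence
    return ranked[0] if ranked else None
```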
3) Goal-directed relevance. TREC-style systems, including YorkQA, have so far ignored the
issues surrounding questioner and answerer goals. The overall relevance judgement of an
answer therefore ignores the contribution of goal-directed relevance.
4) Logical relevance. As already noted in the chapter on logical relevance, TREC-style
systems have a very limited concept of logical relevance which ignores the concept of degrees
of relevance and instead seeks a logical connection between question and answer. The
YorkQA system seeks this connection through the notion of answer type and seeks to provide,
as the answer, a named entity corresponding to the question type.
5) Morphic relevance. TREC-style systems implement the notion of morphically correct
answers, formatting their output according to the NIST standards. As noted above when
discussing relevance, TREC-style systems implement a dualistic concept of relevance, which
classifies answers as fully relevant or irrelevant without any intermediate grades.
6) Answerer model. TREC-style systems, and hence YorkQA, do not have any explicit
concept of answerer model. Implicitly, however, the answerer model for YorkQA consists of
prior knowledge provided by WordNet, an inference logic which connects questions and
answers through the notion of question type, and answer form preferences which correspond
to the formatting rules specified by NIST for answers.
7) Questioner model. TREC-style systems, and hence YorkQA, do not have any explicit
concept of questioner model.
8) Previous question/answer pairs. TREC-style systems, and hence YorkQA, ignore any
previous questions and answers.
4.2 Improving a TREC-style QA system
We saw above in paragraph 1 that a necessary step in designing a QA system is to determine
the overall purpose of the system, i.e. to capture the system requirements. Before starting work
on improving YorkQA, we must therefore decide what exactly we want the system to do. As a
simple example, we have the following
a) Answerer requirements: we want a system which provides answers which (in order
of importance):
• Meet the questioner’s informational needs and goals
• Can provide some additional information
• Are as short as possible
b) Questioner requirements: these are taken to be the same as the answerer
requirements
c) Conflict resolution process: not needed as questioner and answerer requirements
are identical
By making reference to the evaluation criteria set out above we can set out these points more
rigorously in order to be able to determine the performance of an implemented system:
Evaluation criterion | Explanation

Answer assessment:
Satisfactory answer | The answer given as the most relevant meets the questioner’s informational needs by providing the information which was sought in the question, but the system also provides additional information, again as short as possible, in the form of answer sentences which do not provide the information sought directly in the question
Number of answers | A number of different answer sentences given

Ensuring consistency:
Questioner requirements | Answers must meet the questioner’s informational needs and goals; must be able to provide some additional information; must be as short as possible
Answerer requirements | Answers must meet the questioner’s informational needs and goals; must be able to provide some additional information; must be as short as possible
Conflict resolution mechanism | Not needed as questioner and answerer requirements are taken to be equal
Questioner knowledge | Use all WordNet relations
Questioner inference logic | Use the Otter theorem prover with the relaxation rules defined for logical relevance
Questioner goals | Defined dynamically by referring to previous question/answer pairs through clarification dialogue
Questioner preferences | Shortest answer sentence which could meet information requirements
Answerer knowledge | Use all WordNet relations
Answerer inference logic | Use the Otter theorem prover with the relaxation rules defined for logical relevance
Answerer goals | Defined dynamically by referring to previous question/answer pairs through clarification dialogue
Answerer preferences | Shortest answer sentence which could meet information requirements
Question context | Questions to be considered as part of a wider dialogue with the user
What follows is a more detailed explanation of how we will meet the requirements by
reference to the framework developed.
1) The notion of relevance. One of our requirements was to provide additional information:
using the notion of answers which can be relevant to different degrees, as opposed to simply
using the idea of correct or incorrect answer, we can provide additional answers which may
give the questioner some interesting information.
2) Semantic relevance. One way of providing additional information is by giving answers
which are about a similar topic to the question, but do not directly answer the question. One
approach to this could be to first provide an answer that is logically relevant to the question
(an answer which gives the questioner the “unknown” that the question was seeking) and then
to provide additional answer sentences which are less logically relevant but are interesting
from the point of view of semantic relevance.
3) Goal-directed relevance. The system should provide answers which meet the informational
needs of the questioner. The questioner’s goals must therefore be taken into account when
formulating an answer, and answers should strive to meet these goals. The implementation of
goal-directed relevance which we presented earlier would ensure that questioner goals are met
through the use of clarification dialogue, choosing answer sentences from the sentences which
are considered most relevant to the questioner, given the questioner’s previous questions and
the answerer’s previous answers.
4) Logical relevance. One way of meeting the information needs of the questioner will be to
provide answers which fill the information gap which the questioner expresses through the
use of a question. We will need to ensure that the most relevant answer given is an answer
which is maximally relevant from a logical point of view. But we also want additional
(different) information: one approach could be given by finding answer sentences which are
less logically relevant than the sentence considered to give the most relevant (the “best”)
answer but which have a strong semantic relevance to the question.
5) Morphic relevance. Answers need to be as short as possible. From the point of view of
morphic relevance, therefore, answers which are equally relevant from the point of view of
semantic, goal-directed and logical relevance should be ranked in order of length, the shortest
answers being the most relevant from the point of view of morphic relevance.
6) Answerer model. The prior knowledge will be the same as in YorkQA. Given that the
requirements specify that the questioner’s informational needs take priority, and make no
separate mention of the answerer’s own goals, the answerer’s goals, form preferences and
inference logic are taken to be equivalent to those of the questioner. The inference logic will need to be able to
distinguish between grades of relevance and therefore will be as described in the chapter on
logical relevance. The form preferences will be a series of rules specifying that shorter
answers are preferable to longer answers.
7) Questioner model. Questioner goals will be modelled implicitly through the use of the
implementation of clarification dialogue described in the chapter on goal-directed relevance
and will change as the question answering session with the questioner progresses.
8) Previous question/answer pairs. Previously asked questions and previously given answers
will be taken into account in order to ensure goal-directed relevance through the idea of
clarification dialogue, as specified above.
We can now specify how an overall relevance judgement will be made by the system, i.e. we
can elucidate how the function relevant-answer (see chapter 5) proceeds to return an ordered
set of relevant answers to the question.
We shall first apply a filter to retain the top n answer sentences which are most relevant from
the point of view of goal-directed relevance. We shall then judge these sentences from the
point of view of logical relevance, retaining the highest ranked sentence, AL; if there is more
than one sentence ranked at the highest level, the shortest (as was specified from the point of
view of morphic relevance) will be taken to be the more relevant answer. We shall then take
the remaining m answer sentences (taken from the n) which are less relevant from the point of
view of logical relevance than AL and rank them according to semantic relevance; again, if
there is more than one sentence ranked at the same level, the shortest (as was specified from
the point of view of morphic relevance) will be taken to be the more relevant answer. The
result will be an answer AL which is maximally relevant from the point of view of goal-directed
and logical relevance, and therefore meets the requirement for answers which meet
the informational need of the questioner: this will be considered the most relevant answer
from the point of view of overall relevance. But we shall also have m answer sentences
which, not being fully relevant from the point of view of logical relevance, but being relevant
from the point of view of semantic relevance, say something about the question matter
without fully answering the question: in other words, as required, we will have m sentences
which provide additional information to the “main” answer.
Using a more formal notation, we start with the function
relevant-answer(q, S-ANSWER, G-ANSWER, L-ANSWER, M-ANSWER) = A

In order to build A, we will take the n answer sentences which are maximally relevant from
the point of view of goal-directed relevance, i.e. we shall filter the answer sentences and take
the top n elements of the ordered set G-ANSWER to have a subset G-A ⊂ G-ANSWER.
We then take the subset L-A ⊂ L-ANSWER of logically relevant answers such that L-A ⊂
G-A. The highest ranking element of L-A will be the highest ranking element of A. If there is
more than one top ranking answer sentence, i.e. there happens to be more than one element a
of L-A such that ¬∃x: x ≻ a, we take the sentence which has the highest ranking in M-ANSWER
to be the most relevant and hence the highest ranking element of A; we shall call this answer a1.

We now take those elements b of L-A such that a1 ≻ b, and construct a subset L2-A ⊂ L-A,
i.e. we take those answer sentences which are less relevant than a1 from the point of
view of logical relevance. We now construct a subset S-A ⊂ S-ANSWER of semantically
relevant answers such that S-A ⊂ L2-A. From S-A we shall take the top m ranked answer
sentences to form the elements a2 … am of A, where ap ≻ aq in A if and only if ap ≻ aq in S-A, i.e.
the ranking of the answers from the point of view of semantic relevance determines the
ranking of the sentences from the point of view of overall relevance. As above, in the case of
equally relevant answers, we take the sentence which has the highest ranking in M-ANSWER to be
the most relevant. The following diagram illustrates the process by which relevant-answer
constructs the set A of relevant answers:
[Figure: the relevant-answer process. The question, the document collection and previous questions and answers are passed to the Dialogue Manager (goal-directed relevance), which produces G-A; logical relevance then produces L-A and semantic relevance S-A; morphic relevance resolves ties between equally ranked sentences, yielding the ordered set A of answer sentences.]
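As a concrete illustration of this particular combination policy, the procedure can also be sketched in Python; this is a simplified, assumed rendering of the steps above (score dictionaries stand in for the ordered sets and sentence length stands in for morphic rank), not the YorkQA implementation:

```python
def build_relevant_answers(s_score, g_score, l_score, n, m):
    """Simplified sketch (not the YorkQA implementation) of the procedure above.
    s_score, g_score and l_score map each candidate answer sentence to its
    semantic, goal-directed and logical relevance score (higher is more
    relevant); morphic relevance is approximated by preferring shorter
    sentences.  Returns the ordered set A of relevant answers."""
    candidates = list(g_score)

    # G-A: the top n sentences by goal-directed relevance.
    g_a = sorted(candidates, key=lambda s: g_score[s], reverse=True)[:n]
    if not g_a:
        return []

    # a1: the most logically relevant sentence in G-A, ties broken by
    # morphic relevance (the shortest sentence wins).
    top_l = max(l_score.get(s, 0) for s in g_a)
    a1 = min((s for s in g_a if l_score.get(s, 0) == top_l), key=len)

    # L2-A: sentences strictly less logically relevant than a1, ranked by
    # semantic relevance (S-A) with ties again broken by length; keep m.
    l2_a = [s for s in g_a if l_score.get(s, 0) < top_l]
    additional = sorted(l2_a, key=lambda s: (-s_score.get(s, 0), len(s)))[:m]

    return [a1] + additional
```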
4.3 Evaluation
Evaluation will be carried out by showing how the system would answer an example
question. In the example below, the Google search engine was used to retrieve appropriate
documents; the individual relevance modules were then used to carry out appropriate
processing; to simplify what would have been a complex task of integrating the modules in a
satisfactory manner, data was passed between the modules through the intervention of an
operator.
Take, for example, the question
When was it founded?
in the context of a dialogue about the Bank of England. The dialogue manager would identify
documents relating to the Bank of England as being the most relevant from the point of view
of goal-directed relevance (the informational goal of the questioner is therefore taken to be
“find information about the Bank of England”). In the documents, the following sentences
would then be identified as the most relevant from the point of view of logical relevance (the
numbers preceding the sentences indicate their rank; only the top 6 answer sentences are
given):
1. The Bank of England, founded in 1694, has been a focal point of precious metal
trading in London for three centuries
1. Bank of England founded 1694
2. In 1694, William Paterson (a Scotsman) founded the Bank of England to assist the
crown by managing the public debt
3. In the seventeenth century many banks were founded by imitating the Bank of
Amsterdam, such as the Bank of Hamburg in 1619, the Bank of Sweden in 1656, and
the Bank of England in 1694 (Clapham, 1966).
4. At last, the king adopted the plan to establish the Bank of England in 1694
5. Many economists have explored how the Bank of England has evolved since 1694
from a private business to the central bank (see e.g. Bowen (1995)).
Note that the first two sentences are considered equally relevant from the point of view of
logical relevance as they have similar structure. We now use morphic relevance to decide
between the two answer sentences, giving:
1. Bank of England founded 1694
as the most relevant, due to its brevity. The other sentences that contain a direct answer to the
question will now be ignored and semantic relevance will rank the remaining sentences in the
documents, giving (again we show the top 6 ranked sentences, with the ranking given in each
case at the side of the sentence):
1. Andreades (1966, 43-59) describes concisely the founding of the Bank of England
2. In 1697, its position of prominence was secured when parliament forbade the
formation of any further joint-stock banks in England (a writ that did not run in the
legislatively independent Scotland - where the Bank of Scotland was established, by
an Englishman, in 1695).
3. If not for the Bank, England might have been defeated by France in the economic
and power competition in the eighteenth century
3. The Bank of England became the national reserve for the British Isles
3. The Bank engineered enormous financial resources to the army and the navy,
helping England dominate in the world
4. The Bank of Amsterdam, established in 1609, was often regarded as the antecedent
of many European public or semi-public banks
Morphic relevance will then be used to decide on the final ranking of the three sentences
above which have been judged to be equally relevant by taking the shortest answers to be the
most relevant, giving:
3.1. The Bank of England became the national reserve for the British Isles
3.2. The Bank engineered enormous financial resources to the army and the navy,
helping England dominate in the world
3.3. If not for the Bank, England might have been defeated by France in the
economic and power competition in the eighteenth century
We can now give the top six answers to the question in order of overall relevance:
1. Bank of England founded 1694
2. Andreades (1966, 43-59) describes concisely the founding of the Bank of England
3. In 1697, its position of prominence was secured when parliament forbade the
formation of any further joint-stock banks in England (a writ that did not run in the
legislatively independent Scotland - where the Bank of Scotland was established, by
an Englishman, in 1695).
4. The Bank of England became the national reserve for the British Isles
5. The Bank engineered enormous financial resources to the army and the navy,
helping England dominate in the world
6. If not for the Bank, England might have been defeated by France in the economic
and power competition in the eighteenth century
We can verify informally that the system meets the answerer requirements set out above:
• the answer given as the most relevant meets the questioner’s informational needs by providing the information which was sought in the question;
• at the same time the most relevant answer is the shortest answer sentence which could meet these requirements;
• the system also provided additional information, again as short as possible, in the form of answer sentences which do not provide the information sought directly in the question (this information has already been given by the highest ranked answer sentence), but give information (such as the fact that an author called Andreades describes the founding of the Bank in a book published in 1966) which the questioner may find of interest.
By making reference instead to the evaluation framework set out above we can provide a
more rigorous evaluation which would enable us to compare different implementations of the
required system:
Evaluation criterion | Is the system adequate?

Answer assessment:
Satisfactory answer | Yes. The answer given as the most relevant meets the questioner’s informational needs, but the system also provides additional information, in the form of answer sentences which do not provide the information sought directly in the question
Number of answers | Yes, six different answer sentences are given

Ensuring consistency:
Questioner requirements | Yes. Answers meet the questioner’s informational needs and goals; additional information is given and is as short as possible
Answerer requirements | Yes. Answers meet the questioner’s informational needs and goals; additional information is given and is as short as possible
Conflict resolution mechanism | Yes: not needed as questioner and answerer requirements are taken to be equal
Questioner knowledge | Yes. All relations in WordNet used to calculate semantic relevance
Questioner inference logic | Yes. Otter and the relaxation rules used for logical relevance
Questioner goals | Yes. They are defined dynamically through clarification dialogue
Questioner preferences | Yes. Shortest answer sentence which could meet information requirements is given
Answerer knowledge | Yes. All relations in WordNet used to calculate semantic relevance
Answerer inference logic | Yes. Otter and the relaxation rules used for logical relevance
Answerer goals | Yes. They are defined dynamically through clarification dialogue
Answerer preferences | Yes. Shortest answer sentence which could meet information requirements is given
Question context | Yes, through clarification dialogue
By using this framework we are in a position to
• immediately note any theoretical shortcomings of the system (e.g. the absence of an answerer model, or the lack of explicitly defined goals)
• immediately note how the implementation compares with the desired system requirements
• rigorously evaluate different systems which attempt to satisfy the given requirements by comparing, for example, their different use of background knowledge and inference mechanisms
5 Conclusion
We have shown how the individual components of relevance (semantic, goal-directed, logical,
morphic) may be integrated to provide an overall relevance judgement for an answer in
relation to a question. We underlined the fact that the way in which the components will come
together to provide an overall judgement will depend on the purpose, i.e. design requirements,
of the QA system: different systems built for different objectives will make use of the
components of relevance in different ways, emphasising different aspects of relevance and
having different approaches to resolving conflicts. Finally, an important consideration has
been shown to be the relationship between the requirements of the questioner and the
answerer, which may in some cases be incompatible, as in the case of a system used for
teaching or a question answering system which is attempting to advertise a particular product
as opposed to simply providing an answer to a question.
We then provided an evaluation framework for question answering systems, based on the
relevance theory for question answering that we developed in chapters 4 and 5, showing how
it could be used to judge an improved version of the YorkQA system.
Chapter 12
Conclusion and Open Issues
Executive Summary
Issues beyond the scope of the current investigation are identified for further investigation, in particular
problems in the areas of systems integration, human-computer interaction, cultural bias and legal and
ethical concerns. We then give an overview of the thesis, showing how it met the aims set out initially.
1 Overview
In the previous chapters we have set out a framework, based on the notion of relevance, which
allows us to have a clear understanding of the question answering task and hence puts us in a
strong position to implement a satisfactory question answering system. We have outlined how
the individual components of relevance (semantic, goal-directed, logical and morphic
relevance) may be implemented. We then examined how the implementation of these
individual relevance components could be improved by augmenting the knowledge base we
used. Finally we showed how the individual relevance components could be integrated to give
an overall relevance judgement and how a complete question answering system could be
evaluated.
2 Conclusion
The aims of the thesis were to:
• Examine from a philosophical point of view the concept of answerhood as applicable to open domain QA systems, and, in particular, the conditions which determine the answerhood of an answer in relation to a question.
• Provide a theoretical framework for open domain QA systems research, based on the concept of answerhood examined above
• Show that this new framework moves beyond the limitations of the TREC-style evaluation by clarifying exactly what is sought in QA, what the constraints are, what should be evaluated and what is needed in terms of research directions
• Illustrate how the framework can be used to improve current TREC-style QA systems by demonstrating how it can be implemented and evaluated in a working system
We met the aims which we set out initially by:
•
Examining from a philosophical point of view the concept of answerhood as
applicable open domain QA systems. We argued, following Eco’s critique of Derrida,
that there are limits as to what can be considered an answer to a question. In order to
understand the nature of these limits we then examined the concept of relevance,
showing that to talk about an answer is really to speak about the relevance of that
answer in relation to a question: we maintained that it was misleading to talk about
absolutely correct or incorrect answers and that instead we should be referring to
answers which are more or less relevant to a question. We then examined the concept
of relevance in detail, illustrating how it could be seen to be composed of semantic
relevance, dealing with the relationship in meaning between question and answer;
goal-directed relevance, dealing with questioner and answerer goals; logical
relevance, dealing with the more formal relationship which considers whether an
answer provides the information which the question sought; and morphic relevance,
dealing with the form an answer takes in relation to a question.
•
Providing a theoretical framework for open domain QA systems research. From the
notion of relevance we built a model of QA systems which illustrated the constraints
under which they operate, i.e. what have been called questioner prejudices in
philosophical discussions of question answering: we showed how an answer is
constrained by the questioner and the answerer’s prior knowledge, goals, rules of
inference, answer form preferences as well as the questioner and the answerer’s
approach to giving relevance judgements from the point of view of semantic, goal-directed, logical, morphic and overall relevance.
•
Showing that this new framework moves beyond the limitations of TREC-style
evaluation. By clarifying the concept of answerhood through the idea of relevance
and spelling out the constraints under which QA systems operate we overcame much
of the confusion found in TREC-inspired research and we set out clearly a number of
research directions corresponding to the components of relevance. Finally, we
showed how the notion of design requirement must be taken into consideration in
order to decide how to integrate the various components of relevance into a coherent
system.
•
Illustrating how the framework could be used to improve current TREC-style QA
systems. This was done by implementing each component of relevance individually
starting from a “standard” TREC-style QA system and showing how these
components could be brought together in a complete system. In order to do this we:
-
Designed and implemented YorkQA, a system built for a standard TREC QA
evaluation; the system contained a number of novel ideas, but was not built
on the theoretical framework which we developed and instead followed on
from “standard” research which had been previously carried out for the
TREC evaluation
-
Implemented semantic relevance in YorkQA and showed how it could be
used both to improve results in the TREC evaluation but also to move beyond
such an evaluation to provide relevant answers, not simply “correct” answers
-
Implemented goal-directed relevance in YorkQA through the use of
clarification dialogue, developing a new algorithm for the recognition of
clarification dialogue in open-domain question answering
-
Demonstrated how logical relevance could be implemented in YorkQA to
provide not simply an indication of a unique “correct” answer which meets
the information need set out by the question, but a ranking of relevant
answers from a logical point of view through the use of a number of rules to
gradually relax the constraints on the proof of an answer
-
Showed how the idea of morphic relevance was implemented trivially in
YorkQA and how we could move beyond this implementation to find
morphically relevant answer sentences
-
Investigated how the performance of the individual components could be
improved by augmenting the background knowledge, showing how the
knowledge base used (WordNet) could be expanded automatically to provide
new, useful relationships
-
Established how the individual implementations of the components could be
brought together to meet the requirements of a system which went beyond
TREC-style question answering
-
Established how an evaluation of the integration of the components would
have to take place
We have shown question answering to be a complex matter both from a theoretical
perspective and from the point of view of practical implementation and we have endeavoured
to shed some light on this complexity by clarifying the theoretical status of question
answering and providing a practical implementation of the theory developed.
Having provided a response to some of the issues which have arisen in relation to research in
question answering, a new world of unresolved possibilities has opened up: new answers,
solving old problems, give birth to new questions, which seek new solutions: in phoenix-like
fashion, the end is a fresh beginning.
3 Open Issues
It is now necessary to examine what issues remain unresolved and require further
investigation.
3.1 Question answering as an engineering problem
Constructing a question answering system is a non-trivial engineering task. As seen, an
overall judgement of relevance depends on a judgement made from the point of view of
semantic, goal-directed, logical and morphic relevance. But in turn, to provide a correct
judgement from these different points of view in an actual system, we will need to carry out a
number of complex tasks such as (and not limited to) part-of-speech tagging, sentence
splitting, named entity recognition, sentence parsing, co-reference resolution and theorem
proving, for which there is as yet no perfect implementation. Each of these subtasks raises a
number of complex and unresolved issues which warrant an in-depth examination in their
own right. Moreover, constructing a question answering system becomes a complex systems
integration project, which requires the coordination of a variety of research efforts. But while
integrating different modules, written in a variety of programming languages, with differing
input and output formatting standards, is a challenging task, it does not provide any
interesting insight into the problem of question answering. Consequently we have deliberately
ignored important issues such as how to combine the different modules which constitute a QA
system for optimal performance: issues such as speed were considered beyond the scope of
the current investigation, which set out to provide a theoretical framework for QA, showing
how this framework could be implemented, but stopping short of claiming that the given
implementation was the best possible in terms of, for example, speed (possibly a key
requirement for users), memory usage (a necessary consideration for implementing QA
systems on small devices such as mobile phones) and portability.
3.2 Machine Learning
Our analysis has set out the initial foundations necessary to be able to make use of machine
learning techniques to implement the theoretical framework:
• We have clarified the task we are seeking to carry out, by specifying what question answering systems aim to do
• We have provided a performance metric, by elucidating what is meant by saying that an answer is a “good” answer to a question
• We have narrowed down the hypothesis space by providing an initial implementation of the notion of relevance functions and their integration
The main hindrance to the use of machine learning is now the lack of available data giving
relevant (not simply correct) answers. Future work will have to address this issue, providing
the necessary data for the specific task of analysing relevance and making use of machine
learning to automatically or semiautomatically construct algorithms to calculate relevance.
A particularly interesting problem is whether we could determine the function relevant-answer
(Chapters 5 and 11), which integrates the components of relevance into a complete
system, through some automated method. Such an approach would be a highly complex
matter, which would have to overcome the following difficulties:
• A unique function for each system. A different version of the function relevant-answer would have to be constructed for each possible questioner and answerer model and conflict resolution procedure: answers are not relevant in isolation, but in relation to a questioner and an answerer model, i.e. in relation to the purpose, or requirements, for which the system was built.
• Not a single answer. The output of the function is not a single answer, but an ordered set of relevant answers: while there is data containing correct answers to questions (for example the TREC data), there is no data which provides relevant answers
• Lack of answer sets. There is no training data and the set of relevant answers would have to be constructed from scratch, a task which would be extremely time consuming (ranking one hundred answers for just one hundred questions would require a relevance judgement to be made on 10,000 answer sentences)
• Agreement between annotators. A number of different annotators would have to be used to provide the relevance judgements in order to avoid errors or bias in understanding the requirements of the system. On the other hand, even if the requirements were clear, different annotators would probably provide slightly different relevance judgements, meaning some method would have to be found to reconcile any differences
• A new annotation for each new system. Each annotator would have to understand the overall requirements of the QA system, i.e. the answerer’s purpose, in order to provide an appropriate ranking of answer sentences, providing a different ranking for each different QA system. This in turn means that if any requirement changed, a new set of annotated data would have to be provided and used to derive the function relevant-answer.
It remains to be investigated to what extent these issues can be resolved.
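If the data and annotation problems above were overcome, one standard route would be pairwise learning to rank; the sketch below is our own illustration (not something implemented in this thesis) and assumes each answer has been reduced to a feature vector, for instance its semantic, goal-directed, logical and morphic relevance scores:

```python
from typing import List, Sequence, Tuple


def train_pairwise_ranker(
    pairs: Sequence[Tuple[List[float], List[float]]],  # (features of the preferred answer, features of the other)
    epochs: int = 10,
    lr: float = 0.1,
) -> List[float]:
    """Ranking perceptron: learn weights w such that w·better > w·worse for
    each annotated pair of answers, one possible way of deriving a
    relevant-answer ranking function from relevance-judged data."""
    if not pairs:
        return []
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for better, worse in pairs:
            margin = sum(wi * (b - c) for wi, b, c in zip(w, better, worse))
            if margin <= 0:  # misranked pair: nudge the weights towards the preferred answer
                w = [wi + lr * (b - c) for wi, b, c in zip(w, better, worse)]
    return w


def score(weights: List[float], features: List[float]) -> float:
    """Rank candidate answers by this score once the weights have been learned."""
    return sum(wi * fi for wi, fi in zip(weights, features))
```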
3.3 User interface issues in question answering system design
Designing a complete question answering system will require an understanding of the ways in
which users will interact with the system, i.e. an understanding of the principles of human
computer interaction. Issues to be investigated would include the best method to allow a user
to input a question, answering questions such as: will the size of the input textbox influence
the average length of questions? Should grammatical errors in the question be highlighted as
the user types? What should be displayed, if anything, while the system searches for an
answer? Another issue is the best method of displaying an answer; open questions are, for
example: should previous question-answer pairs be displayed? If so, how many should be
displayed? What font should be used as default? Should the system provide the user with a
single answer initially, waiting for some signal before displaying other relevant answers, or
should the system display a certain number of relevant answers immediately? If the latter is
the best approach, how many relevant answers should be displayed at any one time? Again,
these issues were deliberately set aside as beyond the scope of the current investigation, since they arise separately from the design and implementation of a system based on our relevance framework.
3.4 Other implementation issues
As highlighted above, the implementation of question answering is dependent on the results
of a large number of research fields, such as co-reference resolution, theorem proving, part-of-speech tagging, etc. The complexity of the system means that even a partial failure of any
of these contributing areas will have negative repercussions on the entire system. One matter
that needs to be addressed is therefore how to implement a fault-tolerant system which can
cope with uncertain output from its component parts. Amongst the implementation issues
which remain to be solved is also the problem of how to design components such as parsers
and part-of-speech taggers specifically for question answering: the components we used,
trained for the most part on corpora such as newspapers, were unable to cope with question
sentences, few of which had been encountered during the training stage. Improving background knowledge and inference rules is a further problem that will have to be addressed. Perhaps the most interesting subject to be tackled is implementing the framework for questions which are more complex than the (mostly) short “concept-filling” questions found in the TREC data collection, for example by carrying out empirical research to establish what sort of “open-ended” questions users actually ask.
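With regard to the fault-tolerance issue raised above, one plausible approach is to wrap each contributing component so that a failure, rather than stopping the whole system, causes a fall-back to a cruder but more robust analysis whose lower reliability is made explicit to the rest of the system. The sketch below is purely illustrative: the component interfaces and the fall-back strategy are assumptions, not part of YorkQA.

# Illustrative sketch only: degrading gracefully when a linguistic component
# (here a parser) fails on question-style input it was not trained on.
from typing import List, Tuple

def parse_question(question: str) -> List[str]:
    """Stand-in for a full syntactic analysis, assumed to fail on some
    question forms never encountered during training."""
    raise RuntimeError("parser could not analyse this question")

def bag_of_words(question: str) -> List[str]:
    """Crude but robust fall-back analysis: lower-cased tokens only."""
    return question.lower().rstrip("?").split()

def analyse(question: str) -> Tuple[List[str], bool]:
    """Attempt the full analysis; on any failure fall back to the bag of words
    and flag the result as degraded, so that downstream components can lower
    the confidence they attach to it."""
    try:
        return parse_question(question), False
    except Exception:
        return bag_of_words(question), True

analysis, degraded = analyse("Who is the president of the US?")
print(analysis, "(degraded)" if degraded else "(full analysis)")

In a complete system the degraded flag would propagate to the relevance components, which could weight their own judgements accordingly.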
3.5 Cultural bias in question answering theory
The theory we developed builds on the work of what some would define as “upper-middle
class European Caucasian male” philosophers such as Gadamer, Eco and Grice. Although
philosophers strive to formulate a pure theory divested of any bias, it could be argued that the cultural background of these thinkers will nevertheless have influenced their thinking to a greater or lesser extent. A theory built on their philosophy may therefore not be applicable to other social groups. An interesting line of investigation from a theoretical
point of view would be whether such a theory has universal applicability or whether on the
contrary we need to develop a separate theory depending on the gender, ethnicity or economic
background of the questioner and answerer.
3.6 Legal and Ethical Issues
Question answering systems are not meant to be deployed in a vacuum. Their design will
necessarily have to take into account the constraints set out by the legal and cultural
environment in which they are meant to be operated. An interesting avenue of research will be
to investigate how legal and ethical considerations will influence the application of the theory
developed. An example of an interesting legal problem arises when considering the concept of
a user model: should the user model be disclosed to the user if the user requests so? What if
the user model uses the stereotype approach, classifying users in unflattering groups such as
“gullible consumer” or “arrogant yuppie”? How does this fit in with legislation such as the
various data protection laws in force in a number of countries, which require disclosure
of any personal information that is used for automatic processing? Should such personal
information be kept hidden from the developers and the operators of the system?
Another issue arises when we consider the possibility that the answerer’s goals may take
precedence over the questioner’s goals, for example in an educational setting, but also in
commercial settings where answers may include more or less covert advertising. But while it is widely
accepted that in an educational setting a teacher probably knows what is in the best interest of
the pupil and acts accordingly, it is doubtful that an advertising system would work in the best
interest of the consumer. While from a legal point of view this may not be a problem (except
in the case of malicious deception or when providing grossly misleading information), from
an ethical point of view it is cause for concern.
Chapter 13
Bibliography
Ackermann, W., “Begruendung einer strengen Implikation”, The Journal of Symbolic Logic,
vol.21, pp. 113-128, 1956.
Alfonseca, E., De Boni, M., Jara, J.L., Manandhar, S., “A prototype Question Answering
system using syntactic and semantic information for answer retrieval”, in Proceedings of the
10th Text Retrieval Conference (TREC-10), Gaithersburg, US, 2002.
Ali, S. S., and Shapiro, S. C., “Natural Language Processing using a propositional semantic network with structured variables”, Minds and Machines, 3:421-451, 1993.
Allen, J. and Perrault, C. R., “Analysing intention in utterances”, in Artificial Intelligence, 15,
143-178, 1980.
Allen, J., et al., “The TRAINS project: a case study in building a conversational planning
agent”, Journal of Experimental and Theoretical AI, 7:7-48, 1995.
Anderson, A. R., and Belnap, N. D. Jr., Entailment - The logic of relevance and necessity,
Princeton University Press, 1975.
Anderson, T. D., “Situating relevance: exploring individual relevance assessments in
context”, Information Research, Vol. 6 No. 2, January 2001.
Appelt, D. E., and Pollack, M. E., "Weighted Abduction for Plan Ascription", User Modeling
and User-Adapted Interaction, 2(1-2):1-25, 1992.
AQUAINT, Proposer Information Pamphlet (PIP) for the AQUAINT (Advanced Question Answering for Intelligence) Program Phase 2, Advanced Research and Development Activity, Fort George G. Meade, MD, available from http://www.nbc.gov/aquaint.cfm (last accessed 29/8/2003), 2003.
Ardissono, L, Lesmo, L., Lombardo, A., Sestero, D., “Production of cooperative answers on
the basis of partial knowledge in information-seeking dialogues”, in Lecture Notes in
Artificial Intelligence n. 728, Springer, Berlin, 1993.
Ardissono, L. and Sestero, D., "Using dynamic user models in the recognition of the plans of
the user". User Modeling and User-Adapted Interaction, 5(2):157-190, 1996.
Ayer, A. J., Language, Truth and Logic, Ben Rogers (ed.), Penguin Modern Classics, 2001.
Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, Addison-Wesley, 1999.
Bagga, A., et al., “The Role Of WordNet in The Creation of a Trainable Message
Understanding System” Proceedings of the Thirteenth National Conference on Artificial
Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference,
1997.
Bar-Hillel, Y., “Summary of Area 6 Discussion”, Proceedings of the International
Conference on Scientific Information, Washington D.C., 1958.
Bates, M. J., “Document familiarity, relevance and Bradford’s Law: The Getty online
searching project report n. 5”, Information Processing and Management, 32(6), 1996.
Bean, T. W., “Classroom questioning strategies: directions for applied research”, in Graesser
and Black (eds.), The Psychology of Questions, 1985.
Belkin, N. J., “The cognitive viewpoint in information science”, Journal of Information
Science: Principles and Practice, 16(1), 1990.
Belnap, N. D., and Steel, T. B., The Logic of Questions and Answers, Yale University Press,
1976.
Belnap, N. D., “How a computer should think”, in Contemporary aspects of philosophy:
Proceedings of the Oxford international symposium, pp. 30-56, Oxford, 1975.
Berg, J., “The Relevant Relevance”, Journal of Pragmatics, 16, pp. 411-425, 1991.
Berger, A. et al., “Bridging the lexical chasm: Statistical approaches to answer-finding”, in
Proceedings of SIGIR, 2000.
Berger, A., and Mittal, V. O., “Query-relevant summarization using FAQs”, Proceedings of
ACL-2000, 2000.
BNCFreq, English Word Frequency List, available from http://www.eecs.umich.edu/~qstout/586/bncfreq.html, 2003 (last accessed March 2003).
Bobrow, et al., “GUS, a frame driven dialog system”, Artificial Intelligence, 8:155-173, 1977.
Borlund, P., “The concept of relevance in IR”, In Journal of the American Society for
Information Science and Technology, vol. 54, no. 10, 913-925, 2003.
Bikel, D., Miller, S, Schwartz, R, and Weischedel, R. “Nymble: a high-performance learning
name-finder”. In Proceedings of the Fifth Conference on Applied Natural
Language Processing, 194-201, 1997.
Borgida, A., and McGuinness, D. L., “Asking queries about frames”, Proceedings of KR-96,
Cambridge, MA., 1996.
Bouillon, P., Claveau, V., Fabre, C., Sebillot, P., “Using part-of-speech and semantic tagging
for the corpus-based learning of qualia”, in Bouillon and Kanzaki (eds.), Proceedings of the
First International Workshop on Generative Approaches to the Lexicon, Geneva, 2001.
Boyle, C., and Encarnacion, A. O., “MetaDoc: an adaptive hypertext reading system”, in
User Modeling and User-Adapted Interaction, 4(1): 21-45, 1994.
Brants, T., TnT - A Statistical Part-of-Speech Tagger, User manual, 2000.
Bratman, M. E., Intentions, Plans, and Practical Reason. Harvard University Press:
Cambridge, MA, 1987.
Breck, E., et al., “A Sys Called Qanda”, Proceedings of TREC-8, NIST, 2000.
Breck, E. et al., “Another Sys called Quanda”, Proceedings of TREC-9, NIST, 2001.
Brill, E., et al., “Data Intensive Question Answering”, Proceedings of TREC-10, NIST, 2002.
Brown, G. and Yule, G., Discourse Analysis, Cambridge, 1983.
Budanitsky, A., and Hirst, G., “Semantic distance in WordNet: an experimental, application-oriented evaluation of five measures”, in Proceedings of the NAACL 2001 Workshop on
WordNet and other lexical resources, Pittsburgh, 2001.
Burger, J. et al., “Issues, Tasks and Program Structures to Roadmap Research in Question &
Answering”, NIST, 2001.
Burhans, D. T., A Question Answering Interpretation of Resolution Refutation, PhD
Dissertation, State University of New York at Buffalo, Department of Computer Science and
Engineering, 2002.
Burke, R., et al., “Question Answering from Frequently-Asked Question files: experiences
with the FAQ Finder system”, Technical Report, University of Chicago Computer Science
Department, June 1997.
Cardie, C., et al., “Combining Low-Level and Summary Representations of Opinions for
Multi-Perspective Question Answering”, Proceedings on the AAAI Symposium on New
Directions in Question Answering, Stanford, 2003.
Cassirer, E., Kant's life and thought, tr. by James Haden, New Haven, Conn., Yale U.P.,
1981.
Catala', N., Castell, N., Martin, M., “ESSENCE: a portable methodology for acquiring
information extraction patterns”, Proceedings of the 14th European Conference on Artificial
Intelligence, Berlin, 2000.
Cuadra, C. A. & Katter, R. V., “Opening the black box of relevance”, Journal of
Documentation, 23(4), 291-303, 1967.
Chakrabarti, D. et al., “Experiences in Building the Indo WordNet - A WordNet for Hindi”,
Proceedings of the 1st International WordNet Conference, India, 2002.
Chen, H., and Dhar, V., “Cognitive process as a basis for intelligent retrieval system design”,
Information Processing and Management, 27(5):405-432, 1991.
Cheng, J., Logical Tool of Knowledge Engineering: using entailment logic rather than
mathematical logic, Proceedings of the 19th annual conference on computer science, San
Antonio, TX, USA, 1991.
Chin, D. N., “KNOME: modelling what the user knows in UC”, in Kobsa and Wahlster
(eds.), User models in dialog systems, Springer, 1989.
Chodorow, M., Byrd, R., Heidorn, G., “Extracting semantic hierarchies from a large on-line
dictionary", In Proceedings of the 23rd Annual Meeting of ACL, 1985.
Cholvy, L, “Answering queries addressed to a rule base”, Revue d’Intelligence Artificielle,
4(1), 1990.
Cholvy, L., and Demolombe, R., “Querying a rule base”, Proceedings of the First
International Workshop on Expert Database Systems, 1986.
Chu-Carroll, J., et al., “In Question Answering, Two Heads Are Better Than One”,
Proceedings of HLT-NAACL 2003, Edmonton, 2003.
Church, A., “The weak theory of implication”, in Menne-Wilhelmy-Angsil (ed.), Kontrolliertes Denken, Untersuchungen zum Logikkalkuel und der Logik der Einzelwissenschaften, pp. 22-37, Munich, 1951.
Cohen, P. R. and Perrault, C. R., “Elements of a plan-based theory of speech acts”, in
Cognitive Science, 3(3), 177-212, 1979.
Collier, R., Automatic template creation for information extraction, PhD thesis, Department
of Computer Science, University of Sheffield, 1998.
Cooper, W.S., “A definition of relevance for information retrieval”, Information Storage and
Retrieval, 7, 19-37 (1971).
Cooper, R. J. and Rueger, S., “A simple Question Answering system”, Proceedings of TREC-9, NIST, 2001.
De Boni, M. and Manandhar, S., “Automated Discovery of Telic Relations for WordNet”,
Proceedings of the 1st International WordNet Conference, India, 2002.
De Boni, M. and Manandhar, S. “The Use of Sentence Similarity as a Semantic Relevance
Metric for Question Answering”. Proceedings of the AAAI Symposium on New Directions in
Question Answering, AAAI Press, 2003a.
De Boni, M. and Manandhar, S., “An Analysis of Clarification Dialogue for Question
Answering”, Proceedings of the HLT-NAACL Conference, Edmonton, 2003b.
De Boni, M., and Prigmore, M., “Information Privacy as an Economic Right”, accepted for
publication in Contemporary Political Theory, 2003.
De Boni, M., Jara, J.L., Manandhar, S., “The YorkQA prototype question answering system”,
Proceedings of the 11th Text Retrieval Conference (TREC-11), Gaithersburg, US, 2003.
Derrida, J., De la grammatologie, Paris, 1967
Dunn, J. M., “Intuitive semantics for first-degree entailments and coupled trees”,
Philosophical studies, 29:149-168, 1976.
Dyer, M. G., In-depth Understanding, MIT Press, 1983.
Eco, U., I limiti dell’interpretazione, Milano, 1990
Eco, U., La struttura assente, Milano, 1962
ECRAN, Extraction of Content: Research at Near Market. Deliverable of Task 3.1.1. Domain
Modeling and Templates Customization, The ECRAN consortium, September 1998.
Ellis, D., “A behavioural approach to information retrieval system design”, Journal of
Documentation, 45(3):171-212, 1989.
Elworthy, D., “Question answering using a large NLP system”, Proceedings of TREC-9,
NIST, 2001.
Fellbaum, C., Wordnet, An electronic lexical database, MIT Press, 1998.
Gadamer, H. G., Wahrheit und Methode, Tuebingen, 1960.
Gaizauskas, R., Humpreys, K., “Quantitative Evaluation of Coreference Algorithms in an
Information Extraction System”, in Botley, S., and McEnery, T. (eds.), Corpus-Based and
Computational approaches to discourse anaphora, John Benjamins, Amsterdam, 2000.
Georgeff, M., et al., The Belief-Desire-Intention Model of Agency, Springer Publishers, 1998.
Ginzburg, J., “Resolving questions I”, in Linguistics and Philosophy, Vol. 18(5), 459-527,
1995a.
Ginzburg, J., “Resolving questions II”, in Linguistics and Philosophy, Vol. 18(6), 567-609,
1995b.
Ginzburg, J., “Semantically-based ellipsis resolution with syntactic presuppositions”, in Bunt
and Muskens (eds.), Current Issues in Computational Semantics, Kluwer, 1998.
Ginzburg, J., “Clarifying Utterances”, in J. Hulstijn and A. Nijholt (eds.), Proceedings of the
2nd Workshop on the Formal Semantics and Pragmatics of Dialogue, Twente, 1998b.
Ginzburg, J., and Sag, I. A., Interrogative Investigations: the Form, Meaning and Use of
English Interrogatives, CSLI Publications, Stanford, 2000.
Gluck, M., “Exploring the relationship between user satisfaction and relevance in information
systems”, Information Processing and Management, 32(1), 1996.
Goffman, W., “On relevance as a measure”, Information Storage and Retrieval, 2, 1964.
Graesser, A.C, Franklin, S P, “QUEST: a cognitive model of question answering”, Discourse
Processes, 13, 279-303, 1990
Graesser, A C, Black J B (eds), The Psychology of Questions, Hillsdale, NJ, Lawrence
Erlbaum Associates, 1985.
Green, C., “Theorem proving by resolution as a basis for question answering systems”,
Michie and Melzer (eds.), Machine Intelligence 4, Edinburgh University Press, 1969.
Green, C., and Raphael, B., “The use of theorem proving techniques in question answering
systems”, in Blue (ed.), Proceedings of the 23rd National Conference of the Association for Computing Machinery, Princeton, N. J., 1968.
Green, S. J., Automatically generating hypertext by computing semantic similarity, Technical
Report n. 366, University of Toronto, 1997.
Green, et al., “BASEBALL: an automatic question answerer”, Proceedings of the Western
Joint Computer Conference, 1961.
Grice, H. P., “Meaning”, Philosophical Review 66: 377–388, 1957 (Reprinted in Grice 1989).
Grice, H. P., “The causal theory of perception”, Aristotelian Society Proceedings,
Supplementary Volume 35: 121–152, 1961 (Reprinted in Grice 1989).
Grice, H. P., Logic and Conversation. William James Lectures, Harvard University, 1967
(Reprinted in Grice 1989).
Grice, H. P., Studies in the Way of Words. Harvard University Press, Cambridge MA, 1989.
Griesdorf, H., “Relevance: an Interdisciplinary and Information Science Perspective”,
Informing Science, Vol. 3, n. 2, 2000.
Hagstrom, P., Decomposing Questions, MIT Working Papers in Linguistics, 1998.
Hamblin, C., “Questions in Montague English”, Foundations of Language, 10, 1973.
Harabagiu, S. and Moldovan, D., “Testing Gricean Constraints on a WordNet-based
Coherence Evaluation System” in the Working Notes of the AAAI Spring Symposium on
Computational Implicature, AAAISS-96, Stanford CA., 1996.
Harabagiu, S. A., Miller, A. G., Moldovan, D. I., “WordNet2 - a morphologically and
semantically enhanced resource”, In Proceedings of SIGLEX-99, University of Maryland,
1999.
Harabagiu, S.,et al., “FALCON - Boosting Knowledge for Answer Engines”. In Proceedings
of TREC-9, NIST, 2001.
Harabagiu, S., et al., “Answering Complex, List and Context Questions with LCC's Question-Answering Server”, Proceedings of TREC-10, NIST, 2001.
Harrah, D., “The logic of questions”, in Gabbay and Guenthner (eds.), Handbook of
Philosophical Logic, vol. II, Reidel, 1984.
Hegel, G. W. F., Enzyklopaedie der philosophischen Wissenschaften im Grundrisse, Berlin,
1812.
Hirschman, L., and Gaizauskas, R., “Natural language question answering: the view from here”, Natural Language Engineering, 7(4), 2001.
Hirst, G., and St-Onge, D., “Lexical chains as representations of context for the detection and
correction of malapropisms”, in Fellbaum (ed.), WordNet: an electronic lexical database,
MIT Press, 1998.
Hiz, H. (ed.), Questions, Reidel, Holland, 1978.
Hobbs, J. R., “The generic Information Extraction system”. In Proceedings of the fifth
Message Understanding Conference, (MUC5), Morgan Kaufman, 1993.
Horvitz, E., Breese, J., Heckerman, D., Hovel, D., Rommelse, K., “The lumiere project:
Bayesian user modelling for inferring the goals and needs of software users”, in Proceedings
of the 14th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufman, 1998.
Horvitz, E., and Paek, T., “DeepListener: Harnessing Expected Utility to Guide Clarification
Dialog in Spoken Language Systems”, 6th International Conference on Spoken Language
Processing (ICSLP 2000), Beijing, November 2000.
Horvitz, E. and Paek, T., “Harnessing models of users’ goals to mediate clarification dialog in
spoken language systems”, Proceedings of User Modeling 2001: 8th International
Conference, Springer, 2001.
Howard, D. L., “Pertinence as reflected in personal constructs”, Journal of the American
Society for Information Science, 45(3), 1994.
Huffman, S. B., “Learning Information Extraction patterns from examples”, IJCAI-95 Joint
Workshop on New Approaches to Learning for NLP, 1995.
Hull, D., “Xerox TREC-8 Question Answering Track Report”, Proceedings of TREC-8,
NIST, 2000.
Humphreys, K., et al., “University of Sheffield TREC-8 Q & A System”, Proceedings of
TREC-8, NIST, 2000.
Ingwersen, P., “Cognitive perspectives of information retrieval interaction: elements of a
cognitive IR theory”, Journal of Documentation, 52(1), 1996.
Jiang, J. J., and Conrath, D. W., “Semantic similarity based on corpus statistics and lexical
taxonomy”, in Proceedings of ICRCL, Taiwan, 1997.
Kahusk, N., Vider, K., “Estonian WordNet Benefits from Word Sense Disambiguation”,
Proceedings of the 1st International WordNet Conference, India, 2002.
Kando, N. “Relevance Re-Examined: In the Context of Information Retrieval System
Testing”, International Symposium on the Logic of Real-World Interaction (LoRWI 2002),
Tokyo, Japan, Jan. 30-31, 2002.
Kang, S-J, Lee, J-H, "Semi-Automatic Practical Ontology Construction by Using a
Thesaurus, Computational Dictionaries and Large Corpora", Proceedings of the Workshop on
Human Language Technology, ACL-2001, Toulouse, 2001.
Kant, I., Kritik der reinen Vernunft, Riga, Hartknoch, 1781.
Katz, B., Lin, J., Felshin, S., "Gathering Knowledge for a Question Answering System from
Heterogeneous Information Sources", Proceedings of the Workshop on Human Language
Technology, ACL-2001, Toulouse, 2001.
Katz, B., et al., “Answering Questions about Moving Objects in Surveillance Videos”,
Proceedings on the AAAI Symposium on New Directions in Question Answering, Stanford,
2003.
Kautz, H. A., “A circumscriptive theory of plan recognition”, in Cohen, Morgan and Pollack
(eds.), Intentions in Communication, 1990.
Kim, J. and Moldovan, D. “Acquisition of semantic patterns for information extraction from
corpora”. Proceedings of the Ninth IEEE Conference on Artificial Intelligence for
Applications, 1993.
Kobsa, A., “VIE-DPM: a user model in a natural language dialogue system”, in Proceedings
of the 8th German Workshop on Artificial Intelligence, Springer, 1985
Kobsa, A., “Modelling the user’s conceptual knowledge in BGP-MS, a user modelling shell
system”, Computational Intelligence, 6:193-208, 1990.
Kobsa and Wahlster (eds.), User models in dialog systems, Springer, 1989.
Kolodner, J. R., “Organizing Memory and Keeping it Organized”, Proceedings of AAAI,
1980.
Lafferty, J., The Link Parser API, http://www.link.cs.cmu.edu/link/api/index.html, last
modified Aug25 2000, (last accessed 17 Oct 2003), 2000.
Lafferty, J., Sleator, D., and Temperley, D., “Grammatical Trigrams: A Probabilistic Model
of Link Grammar”. Proceedings of the AAAI Conference on Probabilistic Approaches to
Natural Language, 1992.
Lambrecht, K., and Michaelis, L., “Sentence accent in information questions: default and
projection”, Linguistics and Philosophy, 21, 1998.
Lee, G. G., et al., “SiteQ: Engineering High Performance QA System Using Lexico-Semantic
Pattern Matching and Shallow NLP”, Proceedings of TREC-10, NIST, 2002.
Lehnert, W. G., The Process of Question Answering, New Jersey, 1978.
Lenat, D. B. "Cyc: A Large-Scale Investment in Knowledge Infrastructure." Communications
of the ACM, 38, no. 11, 1995.
Lewis, D., “Languages and Language”, in Lewis, D., Philosophical Papers, volume I.
Oxford, Oxford University Press, 1983 (originally published 1975).
Lin, B., “The motivation of constructing entailment logic”, Proceedings of the 8th
International Conference of Logic, Methodology and Philosophy of science, vol. 5, Moscow,
1987.
Lin, D., “An information-theoretic definition of similarity”, in Proceedings of the 15th
International Conference on Machine Learning, Madison, 1998.
Lin, J., et al., “Extracting Answers from the Web Using Data Annotation and Knowledge
Mining Techniques”, Proceedings of TREC-11, NIST, 2003.
Litkowski, K. D. Question-Answering Using Semantic Relation Triples, Proceedings of
TREC-8, NIST, 2000.
Litkowski, K., “Syntactic Clues and lexical resources in Question-Answering”, Proceedings
of TREC-9, NIST, 2001.
Luckham, D., and Nilsson, N., “Extracting information from resolution proof trees”, Artificial
Intelligence, 2, pp. 27-54, 1971.
Mani, I., and Bloedorn, E., “Machine learning of generic and user-focused summarization”,
Proceedings of AAAI-98, 1998.
Marcuse, H., Reason and Revolution: Hegel and the Rise of Social Theory, Routledge, 1985
Martini, M. L., Verita' e metodo di Gadamer ed il dibattito ermeneutico contemporaneo,
Torino, 1991
McCune, W. W., Otter 3.0 Reference Manual and Guide, Argonne National Laboratory, 1994.
Meghini, C., et al., “A model of information retrieval based on a terminological logic”, in
Korfhage, R. (ed.), Proceedings of SIGIR-93, pp. 298-307, ACM Press, Baltimore, 1993.
Meghini, C., Straccia, U., “A relevance terminological logic for information retrieval”,
Proceedings of the 15th Annual ACM SIGIR Conference on Research and Development in
Information Retrieval, 1996.
Mihalcea, R., Moldovan, D. “A Method for Word Sense Disambiguation of Unrestricted
Text”. Proceedings of ACL, 1999.
Mikheev, A. “Periods, Capitalized Words, etc.”. Computational Linguistics 28(3): 289-318,
2002.
Miller, G. A., “WordNet: A Lexical Database”, Communications of the ACM, 38 (11), 1995.
Miller, G, and Teibel, D., “A proposal for lexical disambiguation”, in Proceedings of DARPA
Speech and natural Language Workshop, California, 1991.
Moh, S-K, “The deduction theorems and two new logical systems”, in Methodos, vol. 2, pp.
56-75, 1950.
Moldovan, D, and Rus, V., "Logic Form Transformation of WordNet and its Applicability to
Question Answering", in Proceedings of ACL-2001, Toulouse, 2001.
Moldovan, D., et al., “LASSO: A Tool for Surfing the Answer Net”, Proceedings of TREC-8,
NIST, 2000.
Moldovan, D., et al., “LCC Tools for Question Answering”, Proceedings of TREC-11, NIST,
2003.
Moldovan, D., et al., “COGEX: A Logic Prover for Question Answering”, Proceedings of
HLT-NAACL, Edmonton, 2003a.
Morris Engel, S., Analyzing Informal Fallacies, pp. 95-99, Prentice-Hall, 1980.
Nelson, E. J., On three logical principles in intension, The Monist, 43, 1933.
Nwana, H. S., “User modelling and user adapted interaction in an intelligent tutoring system”,
in User Modeling and User-Adapted Interaction, 1(1):1-32, 1991.
Oard, D. W., Wang, J., “TREC-8 Experiments at Maryland: CLIR, QA and Routing”,
Proceedings of TREC-8, NIST, 2000.
Olson, G. M., Duffy, S. A., Mack, R. L., “Question asking as a component of text
comprehension”, in Graesser and Black (eds.), The Psychology of Questions, 1985.
Paijmans, Hans, “SMART Tutorial for beginners”, available from http://pi0959.kub.nl:2080/Paai/Onderw/Smart/hands.html, 1999 (last accessed July 2003).
Perrault, C. R., and Allen, J., “A plan-based analysis of indirect speech acts”, in American
Journal of Computational Linguistics, 6(3-4), 167-182, 1980.
Petrova, K. and Nikolov, T., “Bulgarian WordNet as a Source for (Psycho) Linguistic
Studies”, Proceedings of the 1st International WordNet Conference, India, 2002.
Piaget, J., “The constructivist approach: Recent studies in genetic epistemology”, Cahiers de
la Fondation Archives Jean Piaget, No. 1, 1-7, 1980.
Piaget, J. and Garcia, R., Toward a logic of meaning, Hillsdale, NJ, Erlbaum, 1991.
Pohl et al., “User model acquisition heuristics based on dialogue acts”, International
Workshop on the Design of Cooperative Systems, pp. 471-486, Antibes-Juan-les-Pins,
France, 1995.
Pollack, M., Inferring domain plans in question-answering, PhD dissertation, University of
Pennsylvania, 1986.
Pollack, M., “Plans as complex mental attitudes”, in Cohen et al. (eds.), Intentions in
Communication, MIT Press, Cambridge, MA, 1990.
Pollitt, A., and Ahmed, A., “A new model of the question answering process”, Proceedings of
the International Association for Educational Assessment, Slovenia, 1999.
Prager, J., Radev, D., Brown, E., Coden, A., “The use of predictive annotation for question
answering in TREC-8”, in Proceedings of TREC-8, NIST, 2000
Prager, J., Chu-Carroll, J., and Czuba, K., “Use of WordNet Hypernyms for Answering What-Is Questions”, Proceedings of TREC-10, NIST, 2002.
Preece, J., et al. “Human Computer Interaction”, Addison-Wesley, 1994.
Prior, M. L., and Prior, A. N., “Erotetic Logic”, The Philosophical Review, 64(1), 1955.
Purver, M., Ginzburg, J., and Healey, P., “On the Means for Clarification in Dialogue.” In R.
Smith and J. van Kuppevelt, editors, Current and New Directions in Discourse and Dialogue,
Kluwer Academic Publishers, 2002.
Pustejovsky, J., Bergler, S., Anick, P., “Lexical Semantic techniques for corpus analysis”,
Computational Linguistics, 19 (2), 1993.
Pustejovsky, J., The Generative Lexicon, MIT Press, 1995.
Pustejovsky, J., et al., “TimeML: Robust Specification of Event and Temporal Expressions in
Text”, Proceedings on the AAAI Symposium on New Directions in Question Answering,
Stanford, 2003.
Quilici, A., “AQUA: a system that detects and responds to user misconceptions”, in Kobsa
and Wahlster (eds.), User models in dialog systems, Springer, 1989.
Rada, R., Mili, H., Bicknell, E. and Blettner, M., "Development and application of a metric on
semantic nets", in IEEE Transactions on Systems, Man and Cybernetics, vol.19, n.1, 1989.
Ramshaw, Lance A. and Marcus, Mitchell P. , “Text Chunking Using Transformation-Based
Learning”, Proceedings of the Third ACL Workshop on Very Large Corpora, 82-94, Kluwer,
1995.
Rees, A. M., “The relevance of relevance to the testing and evaluation of document retrieval
systems”, Proceedings of Aslib, 18(11), 1966.
Reiter, R., “Deductive question answering in relational databases”, in Gallaire and Minker
(eds.), Logic and Databases, pp. 149-177. New York, 1978.
Resnik, P., “Using information content to evaluate semantic similarity”, in Proceedings of the
14th IJCAI, Montreal, 1995.
Reynolds, R., and Anderson, R. C., “Influence of questions on the allocation of attention
during reading”, Journal of Educational Psychology, n. 74, 1982.
Rich, E., “User modelling via stereotypes”, in Cognitive Science, 3:329-354, 1979.
Richardson, S. D., Dolan, W. D., Vanderwende, L., "Mindnet: acquiring and structuring
semantic information from text", in Proceedings of COLING-98, 1998.
Rieger, C. J. III, “Conceptual memory and inference”, in Schank, R. C., Conceptual
Information Processing, The Netherlands, 1975.
Riesbeck, C. K., and Schank, R. C., Inside Case-Based Reasoning, New Jersey, 1989.
Roget, P. M., Roget's Thesaurus, Project Gutenberg. Etext #22. - ID:23, Urbana, Illinois
(USA), 1991.
Romero, M., and Han, C., “Yes/no questions and epistemic implicatures”, Sinn und
Bedeutung VI, October 2001.
Rus, V. and Moldovan, D., “High Precision Logic Form Transformation”, International
Journal on Artificial Intelligence Tools, vol. 11 no. 3, September 2002
Saracevic, T., “Relevance reconsidered. Information science: Integration in perspectives”.
Proceedings of the Second Conference on Conceptions of Library and Information Science.
Copenhagen (Denmark), 201-218, 1996.
Saracevic, T., Search strategy & tactics Governed by effectiveness & feedback Searching, available from http://www.scils.rutgers.edu/~tefko/Courses/530/Lectures-current/Search%20tactics.ppt, 2003 (last accessed 10/6/2003).
Schank, R. C., Conceptual Information Processing, The Netherlands, 1975.
Schank, R. C., and Abelson, R. P., Scripts, Plans, Goals and Understanding, New Jersey,
1977
Schilder, F. and Habel, C., “Temporal Information Extraction for Temporal Question
Answering”, in Proceedings on the AAAI Symposium on New Directions in Question
Answering, Stanford, 2003.
Shneiderman, B., Designing the User Interface, Addison-Wesley, 1998.
Scott, S., and Gaizauskas, R. University of Sheffield TREC-9 Q&A system, Proceedings of
TREC-9, NIST, 2001.
Shapiro, S. C., and Rapaport, W. J., “SNePS considered as a fully intensional propositional
semantic network”. In Cercon and McCalla (eds.), The Knowledge Frontier, Essays in the
Representation of Knowledge, Springer, New York, 1987.
Shapiro, S. C., and Rapaport, W. J., “The SNePS family”, in Lehman (ed.), Semantic
Networks in Artificial Intelligence, Pergamon Press, 1992.
Simmons, R. F., “Answering English questions by computer: a survey”, in Communications
of the ACM, 8(1):53-70, 1965.
Simmons, R. F., “Semantic Networks: computation and use for understanding English
sentences”, in Schank, R. C. and Colby, K. M., Computer Models of Thought and Language,
San Francisco, 1973.
Singhal, A. et al., “AT&T at TREC-8”, in Proceedings of TREC-8, NIST, 2000
Singer, H., and Donlan, D., “Active comprehension: Problem solving schema with question
generation for comprehension of complex short stories”, Reading Quarterly, n. 17, 1982.
Sleator, D. and Temperley, D., “Parsing English with a Link Grammar”, Third International
Workshop on Parsing Technologies, 1993.
Soubbotin, M. M., and Soubbotin, S. M., “Patterns of potential answer expressions as clues to
the right answers”, Proceedings of TREC-10, NIST, 2002.
Soubbotin, M. M. and Soubbotin, S. M., “Use of Patterns for Detection of Likely Answer
Strings: A Systematic Approach”, in Proceedings of TREC-11, NIST, 2003.
Sparck Jones, K., et al. (eds.), Readings in Information Retrieval, Morgan Kaufmann, 1997.
Sperber, D. & Wilson, D., Relevance: Communication and Cognition. Blackwell, Oxford and
Harvard University Press, Cambridge MA, 1986.
Srihari, R., and Li, W., “Information Extraction Supported Question Answering”,
Proceedings of TREC-8, NIST, 2000.
Strachan, L., et al., “Pragmatic user modelling in a commercial software system”, in
Proceedings of the Sixth International Conference on User Modeling (UM97), Springer,
1997.
Su, L. T., “Value of search results as a whole as the best single measure of information
retrieval performance”, Information Processing and Management, 34(5), 1998.
Sukaviriya and Foley, “A built-in provision for collecting individual task usage information in UIDE: the User Interface Design Environment”, in Schneider-Hufschmidt et al. (eds.), Adaptive User Interfaces: Principles and Practice, Amsterdam, 1993.
Takaki, T., “NTT-DATA: Overview of system approach at TREC-8 ad-hoc and question-answering”, in Proceedings of TREC-8, NIST, 2000.
Taylor, R., “Information use environments”, in Progress in Communication Science, pp. 217-255, Norwood, NJ: Ablex, 1991.
Temperley, D., An Introduction to the Link Grammar Parser, available from
http://www.link.cs.cmu.edu/link/dict/introduction.html, Last modified: Mon Mar 22 09:25:46
EST 1999, (last accessed 17 Oct 2003), 1999.
van Beek, P., Cohen, R. and Schmidt, K., “From plan critiquing to clarification dialogue for
cooperative response generation”, Computational Intelligence 9:132-154, 1993.
Voorhees, E., “The TREC-8 Question Answering Track Report”, in Proceedings of the 8th Text Retrieval Conference, NIST, 2000.
Voorhees, E., “Natural Language Processing and Information Retrieval”, in M.T. Pazienza
(ed.), Information Extraction, Springer, 2000b.
Voorhees, E., “Overview of the TREC-9 Question Answering Track”, Proceedings of TREC-9, NIST, 2001.
Voorhees, E., “Overview of the TREC 2001 Question Answering Track”, Proceedings of
TREC-10, NIST, 2002.
Voorhees, E., “Overview of the TREC 2002 Question Answering Track”, Proceedings of
TREC-11, NIST, 2003.
Voorhees, E., “Evaluating the Evaluation: A Case Study Using TREC 2002 QA”,
Proceedings of HLT-NAACL 2003, Edmonton, Canada, 2003b.
Voorhees, E., Tice, D. M., “The TREC-8 Question Answering Track evaluation”, in
Proceedings of the 8th Text Retrieval Conference, NIST, 2000.
Vossen, P., “EuroWordNet: a multilingual database for information retrieval,” In Proceedings
of the DELOS workshop on Cross-language Information Retrieval, Zurich, 1997.
Wahlster, W. and Kobsa, A., “User models in Dialog Systems”, in Kobsa and Wahlster (eds.),
User models in dialog systems, Springer, 1989.
Wiebe, J. et al., “Recognizing and Organizing Opinions Expressed in the World Press”,
Proceedings on the AAAI Symposium on New Directions in Question Answering, Stanford,
2003.
Wilks, Y., and Catizone, R., “Can we make information extraction more adaptive?”, in
Proceedings of the SCIE99 Workshop, Springer-Verlag, Berlin. Rome,1999.
Wilks, Y. A., Slator, B. M., Guthrie, L. M., Electric Words: dictionaries, computers and
meaning, MIT Press, 1996.
Wilson, D., and Sperber, D., “On defining 'relevance'”, in Grandy and Warner (eds.),
Philosophical grounds of rationality, Oxford, 1986.
Wilson, D., and Sperber, D., “Relevance Theory”, in G. Ward and L. Horn (eds) Handbook of
Pragmatics. Oxford, 2003.
Winograd, T., Understanding Natural Language, NY Academic Press, 1972.
Wittgenstein, L., Tractatus Logico-philosophicus, Routledge Classics, 2001
Woods, W. A., et al., “Halfway to Question Answering”, Proceedings of TREC-9, NIST,
2001.
Yang, H., and Chua, T.-S., “The integration of lexical knowledge and external resources for
Question Answering”, in Proceedings of TREC-11, NIST, 2003.