Computing Pronoun Antecedents in an English Query System

Kurt Godden
Computer Science Department
General Motors Research Laboratories
Warren, Michigan
48090
Abstract
In this paper I discuss how the Datalog English query system resolves pronominal references to extra-sentential antecedents that represent database records. When the system encounters a pronoun in a query, it searches through saved representations of earlier queries for an antecedent. A number of criteria must be satisfied before a proposed antecedent will be accepted. Among these are satisfaction of the pronoun's grammatical features and tests for contradictions and tautologies. Additional discriminators are applied in the event that there are two competing antecedents being considered. Of special interest is the use of a hold queue mechanism which allows relaxation of the grammatical features of number and gender expressed by a personal pronoun. All of the strategies are independent of any application domain and do not fall into those parts of the system that need to be replaced or modified to interface Datalog to a new database.
1 Introduction
In this paper I will describe in detail the specific algorithms and heuristics by which pronominal antecedents are found in an English database front end, as currently implemented independently of the domain in the Datalog system [Godden, 1985, Hafner, 1984, Hafner and Godden, 1985]. Some of the system's abilities in responding to queries with pronouns are shown in (1). The examples of Datalog exchanges are taken from an implementation that interfaces to a simple fictional personnel database. The Datalog system including the pragmatics component is portable in the sense that it may easily be interfaced to other relational databases as well [Hafner and Godden, 1985]. In the examples, user input is in sans serif and Datalog's answers are in slanted type.
(1) a. Does Smith work for the same dept as Stevens?
       No, it's not true for the employee called Smith that department = electronics.
       What dept does he work for?
       Note: 'he' refers to a female.
       Name     Department
       Smith    Physics
1498
Speech and Natural Language
    b. Is the avg age of the men in electronics greater than 35?
       Yes, for the 7 employees where sex is male and dept is electronics, the average age is 45.86
       Is it more than the avg age of the women in electronics?
       Yes, it is true for those employees that averageage > 42
    c. How many men work in math?
       There are 2 employees where sex is male and where department is mathematics.
       Does Jones work for them?
       No, it is not true for the employee called Jones that department is mathematics.
At present, Datalog supports three types of antecedents: entities, such as employees represented as database records (1a); data objects, which are values that are computed such as the average age of the men (1b); and attribute values, such as the name of an employee's department (1c). This paper will discuss only the type of pronouns that refer to entities. For a discussion of the other two types see [Godden, 1988].
English has many different categories of pronouns as well as several linguistic devices for anaphoric reference. Datalog currently recognizes a limited subset of these, specifically the definite personal pronouns in their various inflected forms: he, him, his, she, her, hers, it, its, they, them, their, and theirs. The genitive forms (e.g. his) may be used as adjectival modifiers (his age) or may be nominalized (greater than his). As currently implemented, Datalog interprets these pronouns as having antecedents that appear in previous queries. The remainder of this paper discusses the methods used in this interpretation.
2 Knowledge Structures Used
Datalog uses frame-structured semantic representations which are constructed by the semantics component [Hafner, 1984]. The semantic frame structure that represents the pronoun he is shown in Figure 1.
    (anaphor (var g1234) (q nil) (pro he)
      (restrictions
        (pref (att sex) (relop is) (value m))
        (pref (feat number) (relop is) (value sing))))
Figure 1. Semantic Frame for 'HE'
Of interest are the slots for var and restrictions. The purpose of the pragmatics component is to bind the variable in the var slot of an anaphor frame to its antecedent. The restrictions list contains the morphological features of the pronoun for number and gender which are used by pragmatics in testing potential antecedents. As explained below, these features are only preferences and not strict constraints on antecedents. The preference feature of number is always built for a pronoun frame. In contrast, the gender preference structure is only built when the underlying data can be distinguished by gender through an associated attribute such as sex. Further, the gender structure is only added to anaphor frames for singular pronouns, since plural pronouns are not so distinguished in English.
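The rules for building the restrictions list can be illustrated with a minimal Python sketch. It is not taken from the Datalog implementation: the nested-tuple encoding, the names PRONOUN_FEATURES and build_anaphor_frame, and the gender_attribute parameter are all assumptions made for this illustration.

```python
# Number and gender per pronoun. The pronoun list is the paper's; this
# encoding (None = no gender preference) is an assumption of the sketch.
PRONOUN_FEATURES = {
    "he": ("sing", "m"), "him": ("sing", "m"), "his": ("sing", "m"),
    "she": ("sing", "f"), "her": ("sing", "f"), "hers": ("sing", "f"),
    "it": ("sing", None), "its": ("sing", None),
    "they": ("plur", None), "them": ("plur", None),
    "their": ("plur", None), "theirs": ("plur", None),
}


def build_anaphor_frame(pronoun, var, gender_attribute="sex"):
    """Return a nested-tuple frame shaped like Figure 1.

    gender_attribute is None when the database has no attribute that
    distinguishes records by gender (hypothetical parameter).
    """
    number, gender = PRONOUN_FEATURES[pronoun]
    restrictions = []
    # Gender preference: only for singular, gender-marked pronouns, and
    # only when the underlying data can be distinguished by gender.
    if number == "sing" and gender is not None and gender_attribute is not None:
        restrictions.append(
            ("pref", ("att", gender_attribute), ("relop", "is"), ("value", gender)))
    # Number preference: always built.
    restrictions.append(
        ("pref", ("feat", "number"), ("relop", "is"), ("value", number)))
    return ("anaphor", ("var", var), ("q", None), ("pro", pronoun),
            ("restrictions", tuple(restrictions)))
```

Under these assumptions, build_anaphor_frame("he", "g1234") carries both preferences of Figure 1, while the frame for they carries only the number preference.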
Potential antecedents are found in the discourse history that is maintained throughout a user session. Each parsed query is saved on this list and antecedent searching takes place in reverse order, inspecting the most recent query first. Each query is represented in the discourse history as a pair consisting of the semantic structure built for the query and a list of records retrieved in processing that query. If the list of records is found to represent the entities referred to by a pronoun, then the variable of that pronoun is bound to the list. A global parameter defines a 'horizon' beyond which the history list is not searched. This parameter has been arbitrarily set to allow consideration of the five preceding queries and has seemed adequate in use. If an antecedent is not found within the horizon, then the parse fails.
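A discourse history with a search horizon might be sketched as follows. This is only an illustration under stated assumptions: the class name DiscourseHistory, its methods, and the use of plain Python lists are hypothetical. Only the pair structure, the reverse search order, and the horizon of five come from the description above.

```python
HORIZON = 5  # the setting reported above: the five preceding queries


class DiscourseHistory:
    """Pairs of (semantic frame, retrieved records), oldest first."""

    def __init__(self, horizon=HORIZON):
        self.horizon = horizon
        self._pairs = []

    def save(self, frame, records):
        # Each parsed query is saved together with the records it retrieved.
        self._pairs.append((frame, list(records)))

    def candidates(self):
        """Yield pairs most recent first, never looking past the horizon."""
        start = max(0, len(self._pairs) - self.horizon)
        for pair in reversed(self._pairs[start:]):
            yield pair
```

If seven queries have been saved, candidates() yields the seventh first and stops after the third.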
It was just mentioned that the structure pointed at by an anaphor's variable is a list of database records in the case of entity type antecedents. This list may come from a discourse history pair as noted above, or it may be a list that is the referent of a noun phrase substructure of some earlier query. All referring expressions, not just anaphors, have variables in their semantic structures that are bound to their referents. When it is determined that an anaphor refers to one of these referents, the anaphor's variable is set to the same list as that of the variable in the antecedent expression. Thus, entity-referring pronouns are interpreted as extensional expressions.
3 Control and the Hold Queue
This section presents the control algorithm used to search the discourse history list for an antecedent of a pragmatic pronoun. At the most general level, the algorithm first determines which of the three implemented types of antecedents to seek for a given pronoun and then it tries to find an antecedent of that type within the search horizon. In general, the desired antecedent type is easily determined from the overall semantic structure of the query containing the pronoun. For example, as in Figure 2, if the subject of a query, called a topic in Datalog, is an anaphor and the predicate specifies some attribute-value restriction pair, then that anaphoric topic clearly refers to entities, or database records. This is because it is meaningless to predicate an attribute value of some computed data object, which is itself a value.
    Do they work in math?
    (query (perform (test true?))
      (topic (anaphor (var g157) (q nil) (pro they)
        (restrictions
          (pref (feat number) (relop is) (value plur)))))
      (predicate
        (prop (att dept) (relop is) (value math))))
Figure 2. Anaphor Referring to an Entity
On the other hand, the type of the intended antecedent cannot always be so easily determined. The query in Figure 3 could either be requesting a list of previously mentioned employees, or a list of ages or some other values just referred to by another query.
    List them.
    (query (perform (display))
      (topic (anaphor (var g158) (q nil) (pro them)
        (restrictions
          (pref (feat number) (relop is) (value plur)))))
      (predicate t))
Figure 3. Anaphor of Unknown Antecedent Type
In cases like this, the preceding query is inspected to determine what the referent is. It is assumed that if the immediately preceding query is a request for a data object, then the current anaphor is intended to refer to that value. Otherwise, the anaphor is taken to refer to the entities picked out by the preceding query. If the preceding query had no entities associated with it, as would be the case given a negative response to a yes/no question, then the entities chosen as antecedent are those referred to by the preceding query's topic.
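The three-way fallback just described can be written as a short decision function. The sketch below is illustrative only: the dictionary keys data_object, final_set and topic_set are hypothetical names for information that would be read off the preceding query's history entry.

```python
def resolve_ambiguous_referent(preceding):
    """Pick the referent of an anaphor whose antecedent type is unclear.

    preceding is a dict describing the immediately preceding query, with
    the hypothetical keys 'data_object', 'final_set' and 'topic_set'.
    """
    # A preceding request for a computed value: the anaphor means that value.
    if preceding.get("data_object") is not None:
        return ("data_object", preceding["data_object"])
    # Otherwise prefer the entities the preceding query picked out.
    if preceding["final_set"]:
        return ("entities", preceding["final_set"])
    # No entities retrieved (e.g. a negative yes/no answer): use the topic.
    return ("entities", preceding["topic_set"])
```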
Once the type of antecedent has been determined, control is transferred to the search algorithm, shown in simplified form in Figure 4. This top-level control algorithm searches back through the discourse history list checking each previous query's semantic structure for an antecedent of the appropriate type. The first appropriate antecedent found is returned.
    1. Check next pair on history list for antecedent
    2. If no antecedent found within horizon or
       if end of history list is reached,
         then if hold queue is not empty
                then print diagnostic and
                     return front of hold queue
                else return nil (parse fails)
         else if acceptable antecedent is found
                then return that antecedent
                else loop back to step 1.
Figure 4. Search Control Algorithm
The hold queue is used to store alternative antecedents. Alternative antecedents are those matching the desired type but differing from the pronoun in number or gender. How alternative antecedents are placed on the hold queue is discussed in section 4.1 below. The need for this hold queue became evident when targeted end users (and visiting researchers) made incorrect assumptions concerning the data which confused them during a session with Datalog. In (2), a typical example, the user assumes that Smith is a male by using he.
(2) Tell me whether Smith works for the same dept as the tallest woman.
    No, it is not true for the employee called Smith that department = computer-sci
    What dept does he work for?
However, it turns out for the data in question that this assumption is incorrect because Smith is a woman. Without the hold queue, Datalog would either find no antecedent at all, or else a male antecedent from a different sentence, leading to confusion since that was not the antecedent intended by the user. With the hold queue, however, the record for Smith is placed on the hold queue and the search for an antecedent continues. If no antecedent is found within the horizon that does satisfy the appropriate grammatical features as well as other tests to be described later, Smith's record is retrieved from the hold queue, and the system responds as shown in (3). A similar diagnostic is printed when the cardinality of the antecedent conflicts with the grammatical number of the pronoun.
(3) Note: 'he' refers to a female.
    Name     Department
    Smith    Physics
As this example illustrates, the hold queue provides a fair degree of flexibility since the system prefers to satisfy the grammatical features of a pronoun when linking it to an antecedent, but these features may be relaxed.
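The control loop of Figure 4, together with the hold queue, can be sketched in Python roughly as follows. This is a simplified illustration and not the Datalog code: the function names, the classify callback with its three verdicts, and the record dictionaries are assumptions. The classifier shown handles only the pronoun he and ignores the contradiction and tautology tests described later.

```python
from collections import deque


def search_for_antecedent(history, horizon, classify):
    """Search history (oldest first) backwards within the horizon.

    classify(pair) is a hypothetical callback returning one of
    ('accept', antecedent), ('hold', antecedent) or ('reject', None).
    Returns (antecedent, relaxed); relaxed is True when the antecedent
    came from the hold queue, so a diagnostic should be printed.
    """
    hold_queue = deque()
    for pair in list(reversed(history))[:horizon]:
        verdict, antecedent = classify(pair)
        if verdict == "accept":
            return antecedent, False
        if verdict == "hold":
            hold_queue.append(antecedent)
    if hold_queue:
        return hold_queue.popleft(), True  # front of the hold queue
    return None, False  # parse fails


def classify_for_he(pair):
    """Toy classifier for 'he': one male record is an exact match."""
    _frame, records = pair
    if not records:
        return "reject", None
    if len(records) == 1 and records[0]["sex"] == "m":
        return "accept", records
    return "hold", records  # right type, wrong number or gender
```

With a history containing only the query about Smith, the search returns Smith's record with relaxed set to True, corresponding to the note in (3). If an earlier query within the horizon had retrieved a single male record, that record would be accepted first and the hold queue would never be consulted.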
4 Search Strategies for Entities
Let us now consider the procedure that searches a semantic query frame for an antecedent of type entity. In database queries, pronouns most likely refer to antecedents that correspond either to an entire query (4a) or to a semantic subject (4b).
(4) Which men work in math?
    a. Which of them (men in math) are over 40?
    b. Which of them (men) work in physics?
Datalog first considers these potential antecedents before others referenced in predicates, embedded clauses, and other constituents. For convenience, let us call the entities referenced by an entire query the final set and those referenced by a query's topic slot the topic set. Recall that the final set of a query is saved on the history list paired with the semantic structure of the query that selected it.
4.1 Choosing between Topic Set and Final Set
There are four general strategies plus one default to determine which of the topic set or final set is the more appropriate antecedent. These strategies are invoked sequentially, the next being called only when the current strategy cannot make a choice. The final default strategy is used as a last recourse after attempting the other four. Even when a choice between topic set and final set is made, this choice is subject to placement on the hold queue.
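The sequential invocation of strategies with a final default can be pictured as a simple cascade. The sketch below is hypothetical: each strategy is assumed to be a function returning 'topic', 'final', or None when it cannot decide, and the second return value merely flags that the user should be told which antecedent was assumed.

```python
def choose_antecedent(topic_set, final_set, strategies):
    """Run strategies in order; fall back to the final set by default.

    Returns (choice, announce). announce is True only when the default
    was used, since that decision is arbitrary and should be reported.
    """
    for strategy in strategies:
        choice = strategy(topic_set, final_set)
        if choice in ("topic", "final"):
            return choice, False
    return "final", True
```

Hold-queue placement is deliberately left out here, since it applies to whichever set is chosen.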
The first strategy is for queries that result in an empty final set, as for a negative response to a yes/no question. For these cases, only the non-null topic set needs to be
confirmed for reasonableness (cf. discussion of contradictions and tautologies below) to accept it as the antecedent. If both sets are null, they are not even considered since pronouns are assumed to refer successfully. When both the topic set and final set are non-null, the system cannot yet choose and moves to the next strategy. The other logical possibility is that the topic set is empty while the final set is not. This is an impossible case, however, since the final set is derived from the topic set and is, therefore, a subset of it.
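The four logical cases of the first strategy can be enumerated in a few lines. The function name and return values in this sketch are assumptions; the 'topic' result still has to pass the reasonableness check described next.

```python
def empty_final_set_strategy(topic_set, final_set):
    """First strategy: decide only when the final set is empty.

    Returns 'skip' (this query offers no antecedent), 'topic', or None
    when both sets are non-empty and a later strategy must decide.
    """
    if not topic_set and not final_set:
        return "skip"  # pronouns are assumed to refer successfully
    if not final_set:
        return "topic"  # still subject to the contradiction/tautology tests
    if not topic_set:
        # The final set is derived from the topic set, so this cannot occur.
        raise ValueError("final set must be a subset of the topic set")
    return None
```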
A pair of specialized reasoning procedures are called to look for contradictions and tautologies as the second strategy. For example, if the assignment of one proposed antecedent would result in a contradictory or tautological reading involving the anaphor, then that proposed antecedent is rejected. This is how Datalog chooses the topic set as the antecedent in (5).
(5) Are any of the men older than 30?
    Are any of them younger than 30?
(6) Is Bell in math?
    What dept is Jones in?
    Is he the same age as Jones?
Detection of contradictions and tautologies involves more than numeric comparisons, however. This is true in part because attributes may range over symbolic as well as numeric values. But even with numeric attributes comparisons are not always needed, as (6) illustrates. In (6), Jones is rejected as a possible antecedent without performing a numeric comparison on age and the system continues searching through earlier queries on the history list, where it finds and accepts Bell as the antecedent. It should be emphasized that the system computes these conditions symbolically and domain independently. If both the topic set and final set are ruled out due to resulting contradictions or tautologies, then the current query from the history list is examined for possible antecedents embedded elsewhere. If a potential antecedent is found in an embedded constituent, then it too is tested for contradiction or tautology and is subjected to a test for placement on the hold queue.
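The paper does not spell out the reasoning procedures, so the following Python sketch is only one possible reading of what a symbolic test could look like. It compares a restriction known to hold of a candidate set with the proposition asked of the anaphor, using only the attribute names, relational operators and constants. The triple format, the supported operators, and both function names are assumptions.

```python
def classify_reading(known, asked):
    """Compare two (attribute, relop, value) triples symbolically.

    known holds of every member of the candidate antecedent; asked is
    what the query predicates of the anaphor. Supported relops (an
    assumption of this sketch): '>', '<' and 'is'.
    Returns 'contradiction', 'tautology', or None if neither is detected.
    """
    k_att, k_op, k_val = known
    a_att, a_op, a_val = asked
    if k_att != a_att:
        return None
    if k_op == "is" and a_op == "is":
        return "tautology" if k_val == a_val else "contradiction"
    if k_op == ">" and a_op == ">":
        return "tautology" if a_val <= k_val else None
    if k_op == "<" and a_op == "<":
        return "tautology" if a_val >= k_val else None
    if k_op == ">" and a_op == "<":
        return "contradiction" if a_val <= k_val else None
    if k_op == "<" and a_op == ">":
        return "contradiction" if a_val >= k_val else None
    return None


def is_self_comparison(candidate_names, compared_names):
    """Comparing entities with themselves, as with Jones in (6), is vacuous."""
    return set(candidate_names) == set(compared_names)
```

For (5), the final set of the first query is known to satisfy age > 30, so asking whether its members are younger than 30 is classified as a contradiction and the final set is rejected. The topic set (the men) carries no age restriction and survives. For (6), binding he to Jones makes the query a self-comparison, which is detected without looking at any age value.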
It is often the case that the topic set or final set cannot be chosen on the basis of contradictions or tautologies. The next strategy is a heuristic based upon the repetition of an attribute in the query with the pronoun. This heuristic is stated in (7).
(7) The topic set is chosen as the antecedent if any attribute from the predicate of the proposed antecedent's query is repeated in the predicate of the query containing the anaphor.
As an example, consider the queries in (8). In (8a) the antecedent is taken to be the topic set: all men over 30. This contrasts with (8b) whose predicate does not contain a reference to the attribute weight, and here the system assigns the final set as the antecedent (see below).
(8) a. How many men older than 30 weigh more than 200 pounds?
       Which of them weigh less than 210 pounds?
    b. How many men older than 30 weigh more than 200 pounds?
       Which of them are taller than 70 inches?
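The heuristic in (7), applied to the queries in (8), amounts to a set intersection over predicate attributes. In this sketch a predicate is assumed to be a list of (attribute, relop, value) triples, and the function names are hypothetical.

```python
def predicate_attributes(predicate):
    """Collect the attribute names mentioned in a predicate."""
    return {attribute for attribute, _relop, _value in predicate}


def repeated_attribute_choice(antecedent_predicate, anaphor_predicate):
    """Return 'topic' if any attribute is repeated, else None (undecided)."""
    repeated = (predicate_attributes(antecedent_predicate)
                & predicate_attributes(anaphor_predicate))
    return "topic" if repeated else None


# The first query of (8): the predicate restricts weight.
first_query = [("weight", ">", 200)]
choice_8a = repeated_attribute_choice(first_query, [("weight", "<", 210)])
choice_8b = repeated_attribute_choice(first_query, [("height", ">", 70)])
```

Here choice_8a is 'topic' because weight is repeated, while choice_8b is None, which leaves the decision to the later strategies and, ultimately, to the default.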
The heuristic enforces the principle that a user's continued exploration of various ranges or values of an attribute expressed in the predicate is probably meant to discover different partitions based on that attribute of some fixed set of entities referred to by the anaphor. If the repeated attribute were instead intended to be a more specific narrowing of the previous set, then the first query itself would probably have contained a more specific restriction on the attribute in question. In (8a), if the user were really interested in seeing those men who are older than 30 and whose weight falls between 200 and 210 pounds, then he would have asked for that in his first query rather than using two queries. (However, see below.) This justification for the heuristic is well-motivated on general linguistic principles. The Least Effort Hypothesis [Zipf, 1949] of communication states that given alternative means to express some concept, language users will tend to choose that alternative that requires the least effort. While the Repeated Attribute Heuristic works in many instances, it is only an approximation and needs much improvement as (9) shows.
(9) a. Are any employees heavy?
    b. Are any of them tall?
Datalog assigns the final set of (9a) to the pronoun of (9b) as its antecedent since there is no repeated attribute and because the final set is the default assignment (see below). But the more natural reading of (9b) suggests that the topic set is the appropriate antecedent.
Notice also that although (9a) refers (via the adjective) to the attribute weight and (9b) to height, there is an important difference from the sequence in (8b). In (8b) there is a similar sequence of reference to the weight and height attributes, yet the assignment of the antecedent for (8b) to the final set seems correct in contrast to that same assignment in (9b). The difference seems to be that in the examples in (9) where fuzzy predicates (the adjectives heavy and tall) are used, there is the intuitive feeling that the attributes weight and height are somehow related in a way not exhibited by the use in (8) of specific reference to points in the domains of those attributes.
But there is another complicating factor involved here, namely the influence of syntactic and semantic parallelism. Sidner [1979] notes this same phenomenon and the difficulties it presents for her algorithm of anaphora resolution using focus. When we look at the sequence of (8a) a certain parallelism between the predicates is evident (where such parallelism is not so pronounced in (8b)). It is this same parallelism that may help explain why the favored interpretation of the pronoun in (9b) is to assign it to the topic set. If Datalog were to detect such parallelism (which it currently does not do), then this information could be used to avoid the default assignment of final set in (9b) noted above, and in similar examples where appropriate.
The fourth strategy compares the grammatical gender and number expressed by the pronoun against the same features found in the entities referred to by both the topic set and the final set. So for example, if the pronoun used is she, the candidate antecedent sets are tested for the existence of a single female entity. Only the marked forms of male and female gender are tested. If the pronoun has neuter gender, e.g. it, no test is performed for gender. The queries in (10) show how Datalog responds to choices made by this strategy.
(10) a. How many metallurgy employees are women?
        There is 1 employee where department is metallurgy and where sex is female.
        Is she older than 20?
        Yes, it is true for that employee that age > 20.
     b. How many metallurgy employees are women?
        There is 1 employee where department is metallurgy and where sex is female.
        Are they older than 20?
        Yes, it is true for those employees that age > 20.
The resulting outcomes of comparisons using number and gender include the one where both potential antecedents match both features of the pronoun, whereupon no decision is made to choose one over the other. Another possibility is that neither of the potential antecedents matches either of the pronoun features. If this is the case, no decision is made and the potential antecedents are placed on the hold queue. A choice is made if only one feature matches only one of the proposed antecedents, or else if both features match the same proposed antecedent. In both situations, the decision of course favors the antecedent that is matched by the features.
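One way to picture these outcomes is to count, for each candidate set, how many of the pronoun's features it satisfies. The sketch below is an assumption-laden illustration: the record format, the function names, and the order in which sets are put on the hold queue are hypothetical. A gender match is taken here to mean that every record in the set has the pronoun's gender. Ties other than those described above are simply left undecided.

```python
def feature_matches(number, gender, records):
    """Return one boolean per pronoun feature that is actually tested."""
    if number == "sing":
        matches = [len(records) == 1]
    else:
        matches = [len(records) > 1]
    # Only the marked genders are tested; neuter and plural pronouns skip this.
    if gender in ("m", "f"):
        matches.append(bool(records) and all(r["sex"] == gender for r in records))
    return matches


def choose_by_features(number, gender, topic_set, final_set):
    """Return (choice, held): choice is 'topic', 'final' or None."""
    topic_score = sum(feature_matches(number, gender, topic_set))
    final_score = sum(feature_matches(number, gender, final_set))
    if topic_score == 0 and final_score == 0:
        return None, [final_set, topic_set]  # both go to the hold queue
    if topic_score > final_score:
        return "topic", []
    if final_score > topic_score:
        return "final", []
    return None, []  # equally good matches: no decision
```

For (10), assuming the metallurgy department has three employees of whom one is a woman, she selects the final set (one female record) and they selects the topic set (all three employees), in line with the singular and plural wording of the two answers.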
If none of our preceding strategies was able to suggest a choice for antecedent, then the decision is arbitrarily made to choose the final set as antecedent. This subsumes the additional special case where the topic set is the same as the final set. Because this final decision is arbitrary, a message is printed to the user that indicates which antecedent is assumed by the system. An example of this is shown in (11), which precedes the answer to the second query of (8b).
(11) By 'them' I assume you mean the employees where sex is male and age > 30 and where weight > 200.
4.2 Pronoun Sequences as Implicit Focus
There is also a case where the antecedent is automatically chosen. This is to deal with a sequence of queries where the first specifies some referent that is referred to in succeeding queries by using pronouns. An example would be a sequence such as (12).
(12) How many women are in math?
     Do they have PhD's?
     Which of them earn less than $40,000?
Once the first such pronoun in the sequence is bound to an antecedent (the women in math in this example), subsequent compatible pronouns are automatically bound to the same antecedent. For this purpose, compatible pronouns are defined to be those with the same number and gender (if applicable), but may have different case. Thus, he is compatible with him and his, but not with it, she or they. It should be pointed out that the sequence of queries with compatible pronouns need not be an unbroken sequence. There may be intervening queries with no pronouns at all, or some with pronouns that are not compatible with those of the sequence.
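Compatibility as defined here depends only on number and gender, so it can be sketched as a table lookup. The table encoding and the names FEATURES, compatible and ImplicitFocus are assumptions for illustration. The sketch also omits the contradiction and tautology tests that a reused binding must still pass.

```python
# (number, gender) per pronoun; case is deliberately not recorded.
FEATURES = {
    "he": ("sing", "m"), "him": ("sing", "m"), "his": ("sing", "m"),
    "she": ("sing", "f"), "her": ("sing", "f"), "hers": ("sing", "f"),
    "it": ("sing", "n"), "its": ("sing", "n"),
    "they": ("plur", None), "them": ("plur", None),
    "their": ("plur", None), "theirs": ("plur", None),
}


def compatible(first, second):
    """Same number and gender; case may differ."""
    return FEATURES[first] == FEATURES[second]


class ImplicitFocus:
    """Remember the antecedent last bound for each feature combination."""

    def __init__(self):
        self._bound = {}

    def bind(self, pronoun, antecedent):
        self._bound[FEATURES[pronoun]] = antecedent

    def lookup(self, pronoun):
        # Intervening queries with other pronouns do not disturb the entry.
        return self._bound.get(FEATURES[pronoun])
```

In (12), binding they to the women in math lets a later them retrieve the same antecedent, even if a query using he intervenes.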
Treating sequences of compatible pronouns as coreferential can be viewed as a non-explicit use of the notion of focus. The first referential phrase selects what amounts to a focus: an object or set of objects which remains the antecedent of a subsequent series of pronouns. That original 'focus' will remain the antecedent of subsequent pronouns as long as those pronouns continue to be compatible with their predecessors and as long as the resulting bindings do not result in contradictory or tautological readings.
If an acceptable antecedent is not found using the methods described up to this point, then other structures are searched in the current query frame from the history list for entity-referring expressions. If any such expressions are found, then the proposed antecedent is tested for contradictions and tautologies, and is also subject to placement on the hold queue depending on how it matches the grammatical features of the anaphor. If no entity-referring expressions are found, then the search continues with the next query on the history list.
5 Problems
The first problem is that the current implementation requires that an antecedent occur as a single constituent. This means that in (13a) they is prevented from referring to both Jones and Smith.
(13) a. Is Jones older than Smith?
        Are they in the math dept?
     b. Are Jones and Smith older than 35?
        Are they in the math dept?
Therefore Datalog will skip over the first query in (13a) to inspect earlier queries in the discourse history list. This contrasts with (13b) where they does get properly bound to the combination of Jones and Smith, but the difference is that in (13b) Jones and Smith form a single constituent at both the syntactic and semantic levels. There is no straightforward way in which the 'discontinuous antecedents' of (13a) could be considered by the system. One approach that could be taken would be to build machinery to posit combinations of individually considered potential antecedents. This is the basic function of the compose operation of the NLC system [Ballard, 1982]. It remains to be seen in practice whether or not there is a great need for such operations in the environment of database queries.
A more general problem occurs when the antecedent the user had in mind is passed over by the system in favor of some other referent, or when the correct antecedent is not even considered because the system accepts another antecedent before the intended one is encountered.
This occurs most often as a result of a user's misconception concerning the number or gender of the intended antecedent. While the hold queue was designed for this eventuality, its use is sometimes circumvented due to the content of earlier queries within the search horizon. Consider again the example in (2), repeated here as (14), where he is intended to refer to Smith, who is a female.
(14) Tell me whether Smith works in the same dept as the tallest woman?
     What dept does he work for?
Another male referent may be chosen from an earlier query, before Smith can be retrieved from the hold queue. This could occur, for example, if Jones (a male) were referred to within the horizon of the search space in a context where all of the tests for inclusion or exclusion of an antecedent indicate Jones as an acceptable antecedent for the pronoun he in (14).
The sequence in (15) shows another situation where the wrong antecedent is chosen in the second query.
(15) Which of the men is the shortest?
     3 employees are in this group.
     Name       Department     Height
     Wilson     Polymers       64
     Collier    Polymers       64
     Bell       Electronics    64
     Are they taller than the woman named Smith?
     Yes, it is true for 23 out of 26 of those employees that height > 66.
Here, the user probably intends to refer to the three shortest men, but because of the repeated attribute heuristic the system binds the pronoun to the topic set: all 26 men in the database. One possible avenue out of this difficulty would be to use the notion of perspective as outlined by McCoy [1986]. In (15), the final set of the first query would be more salient than the topic set with regard to the attribute height. Since the second query asks about some group (they) from the perspective of height, the final set is therefore chosen. This would seem to be in direct conflict with the repeated attribute heuristic. Just how this conflict would be resolved remains for further study.
6 Relations to Other Work
Webber [1978] lists eight categories of objects that may serve as antecedents for anaphoric expressions. What I have referred to as entities coincides largely with her two categories of individuals and sets. The other two categories I mentioned, computed data objects and attribute values, are also fairly natural in the database domain but do not have neat counterparts in Webber's list. Webber takes the antecedent of an anaphor to be "the unique description of [a discourse entity] conveyed to the listener by the immediately preceding text" (p. 28). In contrast, I have already pointed out how in Datalog pronouns refer directly to their antecedent objects in the world, i.e. the database. In [Godden, 1988] I discuss how Datalog's responses to queries with pronouns could be made more informative if the antecedent were an intensional structure instead. Such a change would bring Datalog's treatment of the semantics of anaphors into closer alignment with Webber's view.
Some of the best-known work in discourse anaphora involves the notion of focus [Grosz, 1977, Sidner, 1979]. However, the concept of focus, while appearing to be especially well-suited to task-oriented dialogues, seems less well-suited to free-wheeling database dialogues where there is no specific task or goal to naturally constrain the flow of discourse. Therefore, the intent of the current investigation has been to explore alternative strategies for discourse anaphora while retaining the attempt to base these strategies on sound linguistic principles.
Brennan, Friedman, and Pollard [1987] discuss an algorithm for pronoun resolution based on the notion of
centering [Grosz et al., 1983]. While they state that their
algorithm has been implemented in a natural language
interface to a database query system, all their examples
are taken from short story-like narratives. They mention
that their system shares some similarities with Sidner's
notions of focusing, but without database examples it
is difficult to evaluate the appropriateness of their approach vis-a-vis my criticism of focusing for this application area. Their system does account for agreement
features between a pronoun and its antecedent and also
deals with coreference constraints, which Datalog does
not. On the other hand, they reserve for future research
the use of information of the kind discussed previously under contradictions, tautologies, and the heuristics used by the Datalog system. In their system, potential antecedents (the "forward looking centers") are
rank-ordered by their participation in grammatical relations. As a result, the system favors as an antecedent
the "subject, object, and object2, followed by other subcategorized functions, and finally adjuncts" (p. 156).
Datalog favors such potential antecedent structures in a
similar order, although it does so indirectly, as a consequence of the
search order for possible antecedents in the history list.
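The rank ordering quoted above can be illustrated with a minimal sketch. It is neither the Brennan et al. implementation nor Datalog code (Datalog obtains a similar order only indirectly, through its history-list search); the relation labels, the `Candidate` type, and the function name are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass

# Rank order taken from the quoted passage (subject, object, object2,
# other subcategorized functions, adjuncts). The label strings are hypothetical.
GRAMMATICAL_RANK = {
    "subject": 0,
    "object": 1,
    "object2": 2,
    "other_subcategorized": 3,
    "adjunct": 4,
}


@dataclass(frozen=True)
class Candidate:
    """A potential antecedent together with its grammatical relation."""
    phrase: str
    relation: str


def rank_candidates(candidates):
    """Return candidates ordered by grammatical relation.

    The sort is stable, so candidates sharing a relation keep their
    original (e.g. textual) order. Unknown relations are placed last.
    """
    worst = len(GRAMMATICAL_RANK)
    return sorted(candidates, key=lambda c: GRAMMATICAL_RANK.get(c.relation, worst))
```

Under these assumptions, a subject is always proposed before an object, and an object before an adjunct, regardless of the order in which the candidates were collected.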
Rich and LuperFoy [1988] discuss a blackboard architecture used in the Lucy system for anaphora resolution
in a front end for knowledge-based systems. Lucy handles both bound and pragmatic pronouns. There is a separate module for each strategy used. Essentially, the different modules specify potential antecedents along with a
score and a confidence factor for that score. Another procedure selects the "best" of the possible antecedents after
this scoring is done by each strategy module. In contrast
with Datalog, if the system cannot make a choice, Lucy
asks the end-user to choose among the alternative antecedents. Lucy employs some strategies that have no
counterpart in Datalog, and vice versa. Some strategies in Lucy deal with bound pronouns (e.g. the familiar
structural constraints on coreference) and others are applicable to pronouns in general, such as "semantic consistency" (selectional restrictions). Where the strategies
used by Datalog and Lucy overlap, there are significant
differences in their use, e.g. Lucy strictly enforces the
constraints of number and gender, and recency is found
as an overt strategy module in Lucy, as is focusing.
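The general shape of such a score-and-confidence scheme can be sketched as follows. This is only an illustration of the idea as described above, not the Lucy implementation: the combination rule (summing score times confidence), the `margin` threshold, and the function name are all assumptions, since the description here does not say how Lucy combines module outputs.

```python
from collections import defaultdict


def select_antecedent(proposals, margin=0.1):
    """Pick the best antecedent from strategy-module proposals.

    proposals: iterable of (candidate, score, confidence) triples, one per
    module/candidate pair. Combining them as a sum of score * confidence is
    an assumed rule chosen for illustration.

    Returns the winning candidate, or None when there are no proposals or
    the top two totals differ by less than `margin`. In the latter case a
    Lucy-like system would ask the end-user to choose.
    """
    totals = defaultdict(float)
    for candidate, score, confidence in proposals:
        totals[candidate] += score * confidence
    if not totals:
        return None
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None  # too close to call
    return ranked[0][0]
```

Returning `None` rather than forcing a choice mirrors the contrast drawn above: a system in this style can defer to the user when the evidence is indecisive.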
7
Conclusions
In this paper I have discussed how the Datalog system resolves pronominal references to extra-sentential
antecedents. Of special interest are a) the five strategies
used to decide between competing potential antecedents
of type entity, and b) use of the hold queue mechanism
to relax the grammatical features of number and gender
expressed by a personal pronoun. All of the strategies
used by the system are independent of any application
domain and do not fall into those parts of the system
that need to be replaced or modified to interface Datalog to a new database.
References
[Ballard, 1982] Bruce W. Ballard. Pronominal processing in a computational domain. Research Report CS1982-1, Durham, NC: Duke University, 1982.
[Brennan et al., 1987] Susan E. Brennan, Marilyn W.
Friedman, and Carl J. Pollard. A centering approach
to pronouns. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics,
pp. 155-162, Stanford, CA: ACL, June 1987.
[Godden, 1985] Kurt Godden. Categorizing natural language queries for intelligent responses. In Proceedings
of the 1985 National Computer Conference, pp. 67-73,
Reston, VA: AFIPS Press, July 1985.
[Godden, 1988] Kurt Godden. Strategies for interpreting pragmatic pronouns. Research Publication GMR 6208, Warren, MI: General Motors Research Laboratories, 1988.
[Grosz, 1977] Barbara J. Grosz. The representation and
use of focus in dialogue understanding. Technical Note
151, Menlo Park, CA: SRI International, 1977.
[Grosz et al., 1983] Barbara J. Grosz, Aravind K. Joshi,
and Scott Weinstein. Providing a unified account of
definite noun phrases in discourse. In Proceedings of
the 21st Annual Meeting of the Association for Computational Linguistics, pp. 44-50, Cambridge, MA:
ACL, June 1983.
[Hafner, 1984] Carole D. Hafner. Interaction of knowledge sources in a portable natural language interface.
In Proceedings of the 10th International Conference on
Computational Linguistics, pp. 57-60, Stanford, CA:
ACL, July 1984.
[Hafner and Godden, 1985] Carole D. Hafner and Kurt
Godden. Portability of Syntax and Semantics in Datalog. In ACM Transactions on Office Information Systems, 3(2):141-164, April 1985.
[McCoy, 1986] Kathleen F. McCoy. The ROMPER System: Responding to Object-Related Misconceptions
using Perspective. In Proceedings of the 24th Annual
Meeting of the Association for Computational Linguistics, pp. 97-105, New York, NY: ACL, June 1986.
[Rich and LuperFoy, 1988] Elaine Rich and Susann LuperFoy. An Architecture for Anaphora Resolution. In
Proceedings of the Second Conference on Applied Natural Language Understanding, pp. 18-24, Austin, TX:
ACL, 1988.
[Sidner, 1979] Candace L. Sidner. Towards a Computational Theory of Definite Anaphora Comprehension in
English Discourse. Technical Report 537, Cambridge,
MA: MIT AI Lab, 1979.
[Webber, 1978] Bonnie Lynn Webber. A Formal Approach to Discourse Anaphora. Report No. 3761. Cambridge, MA: Bolt, Beranek and Newman, Inc., 1978.
[Zipf, 1949] George K. Zipf. Human Behavior and the
Principle of Least Effort. Cambridge, MA: Addison-Wesley Press, Inc., 1949.