Generating and Interpreting Referring Expressions in
Context
by
Dustin Arthur Smith
B.S., Wake Forest University (2005)
S.M., Massachusetts Institute of Technology (2007)
Submitted to the Program of Media Arts and Sciences, School of Architecture and
Planning
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2013
© MIT, 2013.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Program of Media Arts and Sciences, School of Architecture and Planning
September 1, 2013
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Henry A. Lieberman
Principal Research Scientist, MIT Media Lab
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Pattie Maes
Associate Academic Head, Program of Media Arts and Sciences
Generating and Interpreting Referring Expressions in Context
by
Dustin Arthur Smith
Submitted to the Program of Media Arts and Sciences, School of Architecture and Planning
on September 1, 2013, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
Referring expressions with vague and ambiguous modifiers, such as “a quick visit” and “the
big meeting,” are difficult for computers to interpret because their meanings are defined in part
by context. For the hearer to arrive at the speaker’s intended meaning, he must consider the
alternative decisions that the speaker was faced with in context. To address these challenges, I
propose a new approach to both generating and interpreting referring expressions based on belief-state planning and plan recognition. Planning in belief space offers a way to capture referential uncertainty and the incremental nature of generation and interpretation, because each belief state
represents a complete interpretation. The contributions of my thesis are as follows:
(1) A computational model of reference generation and interpretation that is fast, incremental,
and non-deterministic. This model includes a lexical semantics for a fragment of English noun
phrases, which specifies the encoded meanings of determiners (quantifiers and articles) and of gradable and ambiguous modifiers. It performs in real time, even when the hypothesis space grows very large. Because it is incremental, it avoids considering possibilities that will later turn out to be irrelevant.
(2) The integration of generation and interpretation into a single process. Interpretation is
guided by comparison to alternatives produced by the generation module. When faced with an
underspecified description, the system uses what it could have said and compares that to what the
speaker did say. Reasoning about alternative decisions facilitates inferences of this sort: “She ate some of the tuna” means not all of it; otherwise the speaker would have said, “She ate the tuna.”
This approach has been implemented and evaluated using a computational model, AIGRE. I also
created a testbed for comparing human judgments of referring expressions to those produced
by our algorithm (or others). In an online experiment with Mechanical Turk, we attained 94%
coverage of human responses in a simple geometrical domain, as well as lower, but still encouraging,
coverage in a more complex, real-world domain.
The model, AIGRE, demonstrates that managing the vagueness and ambiguity in natural language,
while still not easy, is nevertheless possible. The day when we will routinely talk to our computers
in unconstrained natural language is not far off.
Thesis Supervisor: Henry A. Lieberman
Title: Principal Research Scientist, MIT Media Lab
Reader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Marvin Minsky
Professor of Media Arts & Sciences
MIT Media Lab
Reader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Agustín Rayo
Associate Professor
MIT Department of Linguistics & Philosophy
This thesis is dedicated to Push Singh
1972-2006
Acknowledgments
I owe my advisor, Henry Lieberman, enormous gratitude for sharing invaluable advice on research and on all aspects of my intellectual life. I thank my readers: Marvin Minsky, for a decade of support, inspiring ideas, and courageously big thinking; and Agustín Rayo, for his energy, encouragement and precise ways of thinking about vagueness.
None of my pursuits would have been possible without my parents’ unwavering support and love—Art, Eric & Sharon. I’m thankful for a large and loving family, especially my siblings:
Alexis, Allison, Erin & Matt, and Roy; and my grandparents: Joe & Shelia, and Lily & Arthur.
Unfortunately my grandfather, Arthur Norbert Smith, is no longer around; he was an engineer
with a great sense of humor and ability to get to the core of matters quickly. I am grateful for the
support from the Ahren, Brown, Coleman, Cabral and Cuneo families. Marcio, for his love, humor
and companionship. Marcio, Erin, Art and Sharon deserve special recognition for copyediting
under short deadlines.
I am grateful for all my friends; they’ve kept me insane throughout the years, especially: Simon,
Bo, Scotty, the Nicos, Nick, Spencer, Ernie, Kristin, Kyle, Karthik, Dan, Ryan and Chip.
I would like to thank Nicholas Negroponte for creating the Media Lab, and all of the sponsors who
support it. The lab has given me tremendous freedom and support over the years and exposed me
to people with amazing passion and enthusiasm.
My research has benefited from the support and discussion of numerous friends, colleagues and
administrators at MIT and elsewhere, including: Clio Andris, Ken Arnold, Nathan Artz, Barbara
Barry, Walter Bender, Kristina Bonikowski, Amna Carreiro, Brendan Casey, Yin Fu Chen, Jaewoo
Chung, David Dalrymple, Ben Deen, Karthik Dinakar, Nick Dufour, Ian Eslick, Charles Fracchia,
Christopher Fry, Errin Fulp, Felice Gardner, Catherine Havasi, Kasia Hayden, Mako Hill, Roarke
Horstmeyer, Gleb Kuznetsov, Simon Laflamme, Cameron Levy, Hugo Liu, Marcus Lowe, Anmol Madan, Pattie Maes, Sean Markan, Bill Martin, electronic Max, Pranav Mistry, Manas Mittal, Bo
Morgan, Elan Pavlov, Nicolas Pinto, Tom Roberts, Luke Schiefelbein, Amelia Servi, Push Singh,
Rob Speer, Gerry Sussman, Alea Teeters, Scotty Vercoe, Ben & Rebecca Waber and Aaron Zinman.
Thank you!
Contents

Title Page
Abstract
Acknowledgments
Contents

1 Linguistic Reference in Context
  1.1 Introduction
  1.2 What are referring expressions?
  1.3 What do referring expressions mean?
    1.3.1 Referring expressions have intensional and extensional meanings
  1.4 Architectural constraints on the reference processes
    1.4.1 Fast and Incremental
    1.4.2 Non-deterministic
  1.5 Constraining context
    1.5.1 Identifying relevant contextual factors
    1.5.2 Collecting data in a constrained communication setting
    1.5.3 Context Set = Referential Domain + combinatoric possibilities
  1.6 Characterizing the two reference tasks
    1.6.1 Formalizing the referring expression generation (REG) task
    1.6.2 Formalizing the referring expression interpretation (REI) task
    1.6.3 Defining communication success and failure
  1.7 Conclusion

2 AIGRE, a belief-state planning approach to reference
  2.1 Generating Referring Expressions as Planning
  2.2 Interpretation as Plan Recognition
  2.3 Representing Lexical Items as Actions
  2.4 Representing Intensions as Belief States
    2.4.1 Belief states, a representation for uncertainty
    2.4.2 Quantifying the extensional complexity of meaning
    2.4.3 Belief state implementation details
    2.4.4 Action implementation details
  2.5 Conclusion

3 Building the Lexicon
  3.1 Referring to sets
    3.1.1 Plurals and Cardinals
  3.2 Referring to sets of sets
    3.2.1 Representing unspecific meanings
    3.2.2 An unconventional treatment of English quantifiers
    3.2.3 Representing free choice in interpretations
    3.2.4 A simple meaning for the definite article
    3.2.5 Representing Negation
  3.3 A function for deriving the extension from the intension
  3.4 Referring to mutually incompatible sets
    3.4.1 Lexical Ambiguity
    3.4.2 Vagueness and Gradability
  3.5 Ellipsis and the problem of missing words
    3.5.1 Representing syntactic state in the intension
    3.5.2 Assuming default actions when needed
  3.6 Conclusion

4 Controlling Search
  4.1 AIGRE’s search framework
  4.2 Defining the search components for both reference tasks
    4.2.1 Goal-test functions
    4.2.2 Action proposal functions
    4.2.3 Effect sorting functions
    4.2.4 Heuristic functions
    4.2.5 Commit-test function
    4.2.6 Get-next-node function
  4.3 Weighing decision factors
    4.3.1 Using costs for guiding search
    4.3.2 Using benefits for guiding search
    4.3.3 Using costs for comparing plans
  4.4 Search Strategies

5 Evaluation
  5.1 Collecting the Turk Dataset
  5.2 Coverage evaluation for generation (REG)
    5.2.1 Analysis of generation errors in coverage
  5.3 Coverage evaluation for interpretation (REI)
    5.3.1 Ablation analysis for interpretation
    5.3.2 Analysis of interpretation errors in coverage
  5.4 Computational evaluations of REG performance
    5.4.1 Evaluating task complexity
    5.4.2 Evaluating lexicon size
  5.5 Qualitative evaluations of REG output

6 Related Work
  6.1 Abandoning serial pipeline architectures
    6.1.1 Processing architectures for interpretation
    6.1.2 Processing architectures for generation
    6.1.3 An anti-modular, inferential architecture for both processes
  6.2 Planning-based approaches to generation
  6.3 Planning-based approaches to interpretation

7 Conclusion
  7.1 Computational models of reference production
    7.1.1 Comments about modeling syntax
  7.2 Integrating interpretation and generation

Glossary
Bibliography
Chapter 1
Linguistic Reference in Context
1.1 Introduction
This dialogue contains examples of reference in context, the subject of this thesis:
• Sally: Hans, what’s a good way to get to the new bank?
• Hans Frei (Phone): Take your second left to get to the credit union.
In this brief exchange, Sally generated a referring expression, the new bank, and her phone
interpreted her expression to mean the location she intended. Although natural language interfaces
are the future of human-computer interaction, their progress has been stifled by pragmatic issues
such as ambiguity (what type of “bank” did Sally mean?) and vagueness (how old must the bank
be to be considered “new”?). To interpret this kind of referring expression, the hearer must make
decisions that draw from “contextual” information because the linguistically-encoded meaning
falls short of the full meaning the speaker intended.
Such cases where meaning depends on context create major problems for shallow-semantic
approaches to interpretation, because they ignore frequently occurring words like “the” and “new.”
These modifiers are what give the hearer an opportunity to recover the assumptions underlying
Sally’s dialogue. By using the word “the,” Sally indicated that there was a single bank that was mutually known to her and her phone, and in describing the bank as “new,” she conveyed that if she had simply said “the bank,” the description might have included some older bank that she wished to exclude. Furthermore, sometimes deriving the full meaning that the speaker intended requires the
hearer to reason about the alternative decisions that the human speaker was faced with in that
context. To facilitate both the recovery of implicit assumptions and anticipated inferences, the
interpretation process benefits from being able to accurately emulate the decisions at each step of
the generation process.
This thesis claims that (1) both of these problems can be formulated using techniques from
automated planning in belief space, and (2) to improve communication efficacy, generation and
interpretation processes should be modeled in tandem. I begin with these definitions:
• Generation: Given an intended meaning in context, produce a referring expression that
will allow a human hearer to derive its intended meaning.
• Interpretation: Given a human speaker’s referring expression in context, infer the
speaker’s intended meaning.
The remainder of this chapter is devoted to more clearly defining these processes in terms of
their processing architecture and its inputs and outputs. First, I discuss the referring expressions
themselves and why they are a compelling microcosm of linguistic communication. By focusing
on reference, it is possible to investigate linguistic meaning. Second, I develop a theory of their
meaning: what information is conveyed when a speaker uses a referring expression. Third, I
explain additional constraints on reference processing architectures by summarizing theoretical,
computational and experimental evidence. Fourth, I discuss the miscellaneous “contextual” inputs
and how they can be minimized in a restricted communication setting. Finally, I formalize both
tasks in terms of inputs, outputs and architectural constraints.
1.2 What are referring expressions?
Reference is the act of communicating the identity of an entity (or entities) to an audience. As an
act of communication, reference minimally involves two participants: a speaker, who produces a
referring expression, and a hearer, who interprets it. For instance, in the earlier story the speaker, Sally, produced the referring expression “the new bank” to communicate to her phone the identity of the location she desired directions to. Referring expressions can be single words or complete
phrases, and come in the following forms (to use the categorization of Cruse [2011]):
• proper names (e.g., Joefish, Miami)
• pronouns (e.g., it, her, his)
• deixis¹ (e.g., tomorrow, there, I, this, now)
• descriptive referring expressions
– definite noun phrases (e.g. "the new bank")
– indefinite noun phrases (e.g. "a new bank")
– generic noun phrases (e.g. "banks are corrupt" [meaning the entire class of banks]).
¹ Instead of deixis, philosophers use the term indexicals to describe words that change reference depending on who says them, when and where.
Throughout this thesis, I will use feminine pronouns (e.g. “she”) when talking about a speaker
(including writers); masculine pronouns when describing the hearer (or reader); and this style to
denote referring expressions.
Written and spoken dialogue is rife with acts of reference. Analyses of linguistic corpora have estimated that the prevalence of referring expressions could be as high as one out of every three words
(Kibrik [2013]). But estimates based on automatic counting are hard to trust because computers are
currently incapable of distinguishing whether a given linguistic form, such as it, is truly a referring
expression. The defining characteristic of a referring expression is that the speaker intended it to
refer. In any case, referring expressions should be abundant because before information can be
exchanged about something, the speaker and audience must agree on its referents (van Deemter
et al. [2011b]). Referring expressions are usually embedded within larger linguistic constructions;
if a speaker wishes to convey “who did what to whom,” referring expressions are used to direct
the hearer’s attention to the relevant who, what and whom components.
1.3 What do referring expressions mean?
The generation process converts meaning into a referring expression, and the interpretation process converts a referring expression into its meaning. Meaning is the input to the generation process and the output of the interpretation process; so what do referring expressions mean?
Most philosophers and linguists who have attempted to answer this have done so from the
perspective of the interpretation procedure: what information does a hearer derive from a speaker’s
use of a referring expression? In this section, I will give a short summary of the history of answers
to this question. My explanation, along with many examples in this thesis, will draw upon the
following two referential domains, Circles and Kindles, which are expressed as visual scenes:
Figure 1-1: The Circles referential domain containing referents c1, c2 and c3.
Figure 1-2: The Amazon Kindles referential domain with 5 referents: k1, k2, k3, k4 and k5.
Given one of these two domains, we can start by defining the meaning of a referring expression
as the set(s) of elements in the referential domain it refers to, or its extension. Visual referential
domains give us an avenue for bringing the private meaning of a referring expression into
public view: We can simply ask human subjects (or ourselves) to “point out” its extension—for
example, by handing a person a picture of the Circles domain and asking the person to “Select
the second biggest green one.” Referring expressions can also refer to non-physical entities, like
events and concepts. However, the use of visual scenes such as Kindles and Circles in controlled
task settings helps ensure the speaker and hearer share referential domains (section 1.5.1).
1.3.1 Referring expressions have intensional and extensional meanings
The relation between a referring expression and its extension can be viewed as a mapping, provided
by the extension(·) function:
extension(the biggest one) = {c1}
But there is more to an expression’s meaning than its extension alone. The referring expressions
the biggest one and the blue circle have the same extension, but we wouldn’t go as far as to say
they have exactly the same meanings. And what can be said about the meaning of empty expressions
such as the red square? Because its extension is empty, does it not have a meaning? To answer
these questions, Gottlob Frege introduced the concept of the sense of an expression to distinguish
it from its extension, which he called its reference (Frege [1892]).² Throughout this thesis, I will
use “extension” instead of “reference” or “denotation” to describe the entities it refers to, and
“intension” instead of “sense” to describe the rest of its hidden information structure.
Separating a referring expression’s intension from its extension helps to explain why its extension
differs even when the linguistically-encoded meaning seems to remain the same. Examples of this
can most easily be seen when the referring expression takes the form of a pronoun or proper
name, because its extension changes when the referential domain changes. For example, consider
² J.S. Mill came up with a similar distinction earlier, with different terminology (Mill [1851]).
how the extension of the expression, “Meet John,” would change from one party to the next. This
also happens with descriptive referring expressions. For example, compare when the leftmost one
is evaluated with respect to the Circles:
extension(the leftmost one) = {c1}
And when its meaning is computed with respect to Kindles:
extension(the leftmost one) = {k1}
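The domain-relativity of the extension(·) function can be made concrete with a small sketch. This is purely illustrative (the dictionary encodings and property names below are invented, and AIGRE's actual representation of intensions is richer): the same descriptive content yields different extensions when evaluated against different referential domains.

```python
# Illustrative sketch only: a description, here a dict of property
# constraints, picks out different extensions depending on which
# referential domain it is evaluated against.

def extension(description, domain):
    """Return the set of entities in `domain` satisfying every constraint."""
    return {name for name, props in domain.items()
            if all(props.get(attr) == value for attr, value in description.items())}

# Toy encodings of the Circles and Kindles domains (properties invented).
circles = {"c1": {"position": "leftmost"}, "c2": {"position": "middle"},
           "c3": {"position": "rightmost"}}
kindles = {"k1": {"position": "leftmost"}, "k2": {"position": "middle"},
           "k3": {"position": "rightmost"}, "k4": {"position": "middle"},
           "k5": {"position": "middle"}}

the_leftmost_one = {"position": "leftmost"}
print(extension(the_leftmost_one, circles))  # {'c1'}
print(extension(the_leftmost_one, kindles))  # {'k1'}
```

Note that an empty description such as {"color": "red"} simply returns the empty set here, mirroring the red square example above: the expression still has an intension even when its extension is empty.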
Here the referring expression’s intension contains a description of its extension and can take on
different extensions depending on various contextual factors. Referring expressions’ descriptions
are used by the speaker to help the hearer derive her intended extension, but in many cases they
can be used for the additional purposes of conveying the speaker’s attitude (e.g. the ugly one) or to
inform the hearer (e.g. the tiffany blue circle, when the speaker believes the hearer doesn’t know the color tiffany blue). The descriptive component of a referring expression is always partially
referential, but it may also be used for non-referential communication goals.
The intension and extension distinction turns out to be very important for solving a great number
of philosophical conundrums. Here are two illuminating problems that arise when a referring
expression is embedded in a modal or epistemic context.
First, when a referring expression is embedded, it can be important to know who came up with the intensional, descriptive portion of the referring expression. For example, “Oedipus wanted to marry his mother” leaves it unclear whether Oedipus’ intention was to marry the woman who happened to be the extension of the description his mother, or the more disreputable intention to marry his mother—i.e., his intent included both the referring expression’s intension and its extension (Abbott and Abbott [2010]). In the first reading, the intensional description came from only the speaker, whereas in the second, the description was also attributed to Oedipus.
Second, in these embedded contexts, indefinite referring expressions can result in an ambiguity
about why an unspecific description was used instead of a specific one. For example, consider
the referring expression in the assertion “Mary wants to marry a Norwegian banker” [Cruse, 2011,
pp. 384]. Mary’s intention could have included a specific extension (she knows the banker already)
or it could be a description that matches a range of alternative extensions (she is still looking for
such a banker). The decision to make the description unspecific by using “a” may have been the speaker’s (i.e., Mary may have a specific banker in mind, but it would have been irrelevant for the speaker to identify him) or a result of Mary’s situation (i.e., she has not yet found the banker of her dreams).
Confusion can be avoided by understanding that a speaker uses the intensional component to
serve both referential and non-referential communication goals.
Finally, there is an important terminological difference in the research on generating referring
expressions from the natural language generation (NLG) community. Taking the speaker’s
perspective, the NLG community calls the intended extension the target set. When the speaker’s
intention is purely referential, namely, to get the hearer to resolve the same extension that she
had in mind, the task is to find a description that contains enough information to distinguish the
target set, the intended (group of) entities in the referential domain, from the rest of its members,
called distractors. I use both terms “extension” and “target set” throughout this thesis; however
“target set” means more specifically the intended (set of) entities the speaker wants to convey.
1.4 Architectural constraints on the reference processes
In this section, I summarize arguments about the architectural constraints operating on human reference processing. The conclusion is that a model of generating and interpreting referring expressions needs to be fast, incremental and non-deterministic.
1.4.1 Fast and Incremental
One source of constraints on the architectures for generation and interpretation processes comes
from the field of psycholinguistics, which seeks to understand the processes by which humans “produce” (generate) and “comprehend” (interpret) natural language. Psycholinguistics has accumulated a large body of evidence that both processes are highly incremental and draw upon disambiguating information from any source as soon as it is available (Altmann and Steedman [1988]; Gibson [1991]; Kempen and Hoenkamp [1982]). For interpretation, readers’ on-line disambiguation choices are influenced by the extension of their current dominant interpretation, which is updated incrementally and frequently while reading.
In psycholinguistic studies of reference, the common experimental methodology is the visual
world paradigm (see Tanenhaus [2007] for an overview), in which a subject’s eyes are monitored
over time as a referring expression is spoken to them. Hearers tend to focus on their current most
likely interpretation, thereby allowing experimenters to investigate how humans process definite
descriptions in visual referential scenes (Cooper [1974]).
Using the visual world paradigm, Tanenhaus et al. [1995] gave verbal instructions to subjects that
contained structurally ambiguous referring expressions, including “Put the apple on the towel in the
box”³ while varying the referential domains (the number and positions of apples, towels and boxes
in the environment). The results showed that people are able to integrate all available information
to predict the outcome of an ambiguous decision within a fraction of a second after the onset
of the choice point (e.g., the prepositional phrase). If there are two apples, then subjects are primed at the onset of the prepositional phrase to treat it as if it modifies the noun. Sedivy
et al. [1999] showed that by simply using a gradable adjective, such as “big”, hearers anticipate a
contrasting element in the distractor set. For example, when given a referential domain containing three objects (two glasses, one short and one tall, and a pitcher taller than both glasses), subjects looked at the taller glass after hearing the big–.
These results give evidence that human hearers incrementally compute the extension of the
current best hypothesis about the speaker’s intended meaning. For computational models, this
means that either syntactic and semantic processing are part of the same module, or the outputs of each module are frequently fed back into one another.

³ This could be parsed as “Put [the apple on the towel] in the box” or “Put the apple [on the towel in the box]”.
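These behavioral findings can be caricatured in code: the hearer maintains a set of candidate referents and prunes it as each word arrives, so disambiguating information takes effect immediately rather than after a full parse. The domain and word meanings below are invented, and this flat candidate set is only a crude stand-in for the belief states developed in Chapter 2; notably, it cannot capture the contrast inference that "big" anticipates a same-category competitor, which is exactly the kind of reasoning about alternatives this thesis adds.

```python
# Sketch only: incremental interpretation as word-by-word filtering of
# candidate referents, using the glass/pitcher scene from Sedivy et al.

DOMAIN = {
    "short_glass": {"type": "glass", "size": "short"},
    "tall_glass":  {"type": "glass", "size": "tall"},
    "pitcher":     {"type": "pitcher", "size": "tall"},
}

LEXICON = {  # invented lexical entries: each word is a predicate on an entity
    "the":   lambda e: True,                  # uniqueness check deferred
    "big":   lambda e: e["size"] == "tall",   # gradable word, treated crudely
    "glass": lambda e: e["type"] == "glass",
}

def interpret_incrementally(words):
    """Return the shrinking candidate set after each word is heard."""
    candidates = set(DOMAIN)
    history = []
    for word in words:
        candidates = {name for name in candidates if LEXICON[word](DOMAIN[name])}
        history.append((word, candidates))
    return history

for word, cands in interpret_incrementally(["the", "big", "glass"]):
    print(word, sorted(cands))  # candidates shrink as each word arrives
```

After "big," this crude model still retains both tall objects; only "glass" disambiguates. Human hearers, by contrast, already favor the tall glass at "big," which motivates modeling the speaker's alternatives rather than filtering alone.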
Another line of arguments for incremental interpretation comes from computational models.
In terms of computational resources, the information from one module can often constrain the
possible decisions in the next module. For example, the decision about which noun a given
prepositional phrase modifies is under-constrained by syntactic information alone: as Mellish
[1981] observed, only a small portion of the syntactic possibilities are semantically plausible.
Consider these two example sentences from [Jackson and Moulinier, 2007, pp. 4]:
(1.1) The woman boarded the airplane with two bags.
(1.2) The woman boarded the airplane with two engines.
Using only syntactic information, both of these sentences appear the same:
(1.3) DT NN VBD DT NN IN CD NNS
And the decision about which NN to attach IN CD NNS to is left to chance, despite a human
reader’s semantic bias that “with two bags” modifies “woman” and “with two engines” modifies
“plane”. One could imagine an exotic context in which the opposite attachments are favored,
e.g., the woman is moving engines via a forklift, or the plane only has two bags boarded, etc.
In any case, failing to consider semantic information at the same time as syntactic information
will require the interpretation process to spend time generating and testing wrong hypotheses.
Furthermore, putting these stages into different modules typically involves a wasteful recreation
of different data structures for each module (Mellish [1981]).
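The argument can be made concrete with a sketch: if attachment decisions consult a measure of semantic plausibility as soon as the prepositional phrase is read, the implausible hypothesis is never expanded. The scores below are invented stand-ins for whatever world knowledge would supply them; nothing here reflects an actual parser.

```python
# Sketch with invented plausibility scores: deciding PP attachment by
# semantic plausibility rather than leaving it to chance, as the bare
# tag sequence (DT NN VBD DT NN IN CD NNS) would.

PLAUSIBILITY = {  # plausibility of "<head> with <noun>"; values are made up
    ("woman", "bags"): 0.90, ("airplane", "bags"): 0.20,
    ("woman", "engines"): 0.01, ("airplane", "engines"): 0.95,
}

def attach_pp(candidate_heads, pp_noun):
    """Pick the attachment site that most plausibly takes the with-PP."""
    return max(candidate_heads,
               key=lambda head: PLAUSIBILITY.get((head, pp_noun), 0.0))

print(attach_pp(["woman", "airplane"], "bags"))     # woman
print(attach_pp(["woman", "airplane"], "engines"))  # airplane
```

An exotic context, such as the forklift scenario above, would amount to overriding these scores, which is one reason they must be supplied contextually rather than fixed in the lexicon.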
1.4.2 Non-deterministic
Over repeated trials, the same human speaker will tend to produce different referring expressions for the same target set (Viethen and Dale [2009]). In other words, the human generation process is non-deterministic: it produces multiple outcomes.
Many models of reference generation have been developed to produce referring expressions that
over-specify and include more information than is necessary to distinguish the target from the
distractors (Dale and Reiter [1995]; van Deemter et al. [2011a], summarized in Viethen et al. [2012]).
These models are motivated by the observation from psycholinguistics that humans sometimes
include redundant modifiers in 21% (Pechmann [1989]) to 30% (Engelhardt et al. [2006]) of
descriptive referring expressions. The mainstream approach to modeling over-specification is to
use a greedy search that commits to local improvements based on a fixed preference ordering
of attributes. Still, as van Deemter et al. [2011b] observed, very few computational models
of generation from the NLG research community produce non-deterministic output. From a
computational perspective, this is surprising because a straightforward way to sometimes over-specify would be to introduce a stochastic element to the decision-making process.
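The stochastic over-specification idea can be illustrated with a minimal sketch. The domain, attribute values, and preference order below are hypothetical illustrations, not taken from any cited model: a greedy, incremental-algorithm-style generator that sometimes adds a redundant attribute.

```python
import random

# Hypothetical toy domain: entities with attribute values.
DOMAIN = {
    "c1": {"color": "green", "size": "small", "type": "circle"},
    "c2": {"color": "green", "size": "large", "type": "circle"},
    "c3": {"color": "blue",  "size": "small", "type": "circle"},
}
PREFERENCE_ORDER = ["color", "size", "type"]  # fixed attribute preference

def generate(target, domain, p_redundant=0.3, rng=random):
    """Greedy attribute selection with a stochastic over-specification step."""
    distractors = {e for e in domain if e != target}
    description = {}
    for attr in PREFERENCE_ORDER:
        value = domain[target][attr]
        ruled_out = {d for d in distractors if domain[d][attr] != value}
        # Include the attribute if it rules out distractors, or, with
        # probability p_redundant, even when it is redundant.
        if ruled_out or rng.random() < p_redundant:
            description[attr] = value
            distractors -= ruled_out
    return description, not distractors  # (description, success flag)
```

Run repeatedly, the non-deterministic outputs can be tallied into a probability distribution over descriptions, in the spirit of summarizing repeated trials by the same speaker.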
CHAPTER 1. LINGUISTIC REFERENCE IN CONTEXT
Are decisions made serially or in parallel?
Another type of architectural question concerns how many hypotheses are generated and when
in the process their generation occurs. This is often cast as the following question: Are all
hypotheses considered in parallel or is a single best hypothesis maintained and revised throughout
interpretation?
There are a few reasonable arguments against the notion that hearers consider all hypotheses at once. In an otherwise unconstrained reference resolution step, the pronoun “he” could resolve
to several billion possibilities (Norvig [1988]). Likewise, for compound nominals, when two
or more nouns are put together, e.g. “dog door”, the number of interpretations for their hidden
relationship is seemingly infinite: the door the dog uses, the door near the dog, the door shaped like
a dog, the door that smells like dogs. . . Clearly, not all of these hypotheses can be exhaustively tested when
inferring the speaker’s intended meaning (Hobbs et al. [1993]).
Psycholinguists who investigate this kind of question often use garden path sentences (Gibson
[1991]): sentences with at least one ambiguity where the hearer has a preference for a particular
resolution that turns out to be incompatible with the rest of the sentence. Garden-path sentences also
serve as compelling examples of the incremental nature of interpretation. Consider the issue of
lexical ambiguity: when a lexical item maps to multiple meanings, as in the lexically ambiguous
noun form of the word “bank”:
(1.4) Let’s go stop by the bank
(1.5) Let’s go fish by the bank
Both referring expressions contain the canonical lexically ambiguous word bank, whose primary senses include bank1, a financial institution, and bank2, the land along the edge of a river. The
linguistic context of (1.4) biases the reader to resolve “bank” as bank1 (finance) while (1.5) biases
the reader toward bank2 (river).
The plausible readings are also constrained by what options (i.e., distractors) are available: if
(1.4) were uttered in a rural community that did not have any financial institutions but did have
accessible rivers, we would expect the meaning of bank to only describe bank2 (river); and, because
it is singular and definite (i.e., begins with ‘the’), it presupposes that the referring expression in the
current context is enough for the hearer to arrive at a meaning that denotes a single river bank.
The garden-path effect occurs when initial disambiguation choices are reversed by adding linguistic
context:
(1.4b) Let’s go stop by the bank of the Charles River → bank2 (river)
(1.5b) Let’s go fish by the Bank of Commerce → bank1 (finance)
Garden-path sentences can also be constructed for vague, gradable adjectives such as “tall”, “hot”
or “long.”
(1.6) Let’s watch a short movie → short for movies: less than 2 hours? 1.5 hours?
(1.6b) Let’s watch a short movie by Ken Burns → less than 12 hours?
These garden-path examples are interesting because they show how the intension of an interpretation is non-monotonic: the combined meaning up to word w_{i+1} may not be included in (or entailed by) the meaning up to word w_i. However, there is no reason to think that the same
representation is augmented throughout the process: a backtracking mechanism can be used to
replace the current over-constrained hypothesis with a different hypothesis with less information.
What, if anything, does this tell us about the human interpretation architecture? To my knowledge, no arguments for or against parallel interpretation have considered the full range of computational possibilities, so the answers are inconclusive. It is possible for a
backtracking search to maintain only the previous decision points in memory, not their outcomes,
and return to them when the previous decision resulted in a failure (e.g., by passing a continuation
function). And there are many ways to search the hypothesis space: a local search algorithm considers only the first improved successor, while a beam search considers only the k-best hypotheses. The search behavior also varies with how the programmer decides to rank the hypotheses, and whether their scores are recomputed when one of their dependencies changes. More generally, to what extent can a partial solution for a given hypothesis be reused by its successor? Does the new solution have to be recomputed from scratch, or can it retain all of the still-valid partial decisions made by its predecessor?
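As an illustration of one point in this design space, here is a minimal beam-search sketch over interpretation hypotheses. The word-to-sense table and scoring function are invented for the example:

```python
import heapq

def beam_interpret(words, extend, score, k=2):
    """Incremental beam search: after each word, keep only the k best
    hypotheses rather than all of them or a single one."""
    beam = [()]  # start from one empty hypothesis
    for word in words:
        successors = [h for hyp in beam for h in extend(hyp, word)]
        beam = heapq.nlargest(k, successors, key=score)
        if not beam:
            raise ValueError(f"no surviving hypothesis at {word!r}")
    return beam

# Invented toy lexicon: each word maps to its candidate senses.
SENSES = {"fish": ["fish_v"], "by": ["by"], "the": ["the"],
          "bank": ["bank1", "bank2"]}  # bank1: finance, bank2: river

def extend(hyp, word):
    return [hyp + (sense,) for sense in SENSES[word]]

def score(hyp):
    # Toy semantic bias: fishing favors the river sense of "bank".
    return 1.0 if ("fish_v" in hyp and "bank2" in hyp) else 0.0
```

Calling `beam_interpret(["fish", "by", "the", "bank"], extend, score)` keeps both senses of “bank” alive but ranks the river sense first, illustrating how ranking and beam width jointly determine which hypotheses survive.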
In short, the range of computational solutions is vast and hasn’t been systematically explored.
To avoid making a premature commitment to an arbitrary architecture, I will remain agnostic about these processing choices. Of course, any particular solution requires a commitment to
an approach; and the architecture I define in Chapter 4 enables several heuristic search strategies
in one framework.
1.5 Constraining context
“The less personal knowledge the dialogue participants have about one another, the
greater the reliance that must be placed on the conventions shared by a larger society
to which they all belong” [Hobbs, 1988, pp. 83].
The aim of Chapter 1 is to characterize the reference generation and interpretation process in terms
of their inputs, outputs and processing architectures. While many architectures posit that there is
a “context” parameter, I see appeals to a “context” input as a modelling deficiency. An adequate
model should isolate these miscellaneous factors that influence a referring expression’s meaning
and either include them in the model, or create a controlled setting in which their influence is
minimized. In this section, I review several contextual factors and describe how their influence
can be mitigated in a constrained communication task.
1.5.1 Identifying relevant contextual factors
Michael Tomasello described discourse context as "information that is available to [both speaker
and hearer] in the environment, along with what is ‘relevant’ to the social interaction, that is,
what each participant sees as relevant and knows that the other sees as relevant as well—and
knows that the other knows this as well, and so on, potentially ad infinitum. This kind of shared
intersubjective context is what we may call, following Clark [1996], common ground. . . it takes
[hearer and speaker] beyond their own egocentric perspective on things” [Tomasello, 2008, pp. 76].
Common ground is taken to be the information mutually believed by both speaker and hearer,
which includes knowledge about what knowledge is mutually believed. Of course, true common
ground is a fiction: in addition to being paradoxically recursive, neither speaker nor hearer are
omniscient so neither could ever know precisely what beliefs they share. However, it is nonetheless
important for speaker and hearer to be able to reason about each other’s beliefs, and it may be
useful to envisage and speak of the “common ground” as an idealized state both the speaker and
hearer’s inferential processes strive toward in order to make the reference task succeed. A less
stringent notion of “intersubjective context” views this roughly as the relevant information each
dialogue participant can expect the other to possess or infer within reason.4
Following the pragmatic theory of Roberts [2004], the relevant information needs are derived from the tasks
shared between speaker and hearer. These information needs constitute the dialogue’s questions
under discussion, and provide the impetus for communication. Questions under discussion give
rise to communication goals, which are fulfilled by communication acts toward these goals (e.g.,
speaking, gesturing). For reference, the communication goal is at least in part referential: to make
the intended referent(s) mutually known to hearer and speaker (i.e., in the proverbial common
ground). Consequently, the critical way to constrain context is to embed it into a shared task in
which the operating constraints on the possible referents are well understood.
In the following table, I review known contextual factors that can affect the generation and interpretation of referring expressions, and describe how they are accounted for in the constrained reference task of section 1.5.2:
4. I added the “within reason” hedge to exclude information that is cognitively available but impractical or inconvenient to derive, such as the book 27cm from the wall or the 9th tallest student.
Factor: Shared Task
Evidence: The shared task setting motivates the need for reference (questions under discussion) and the constraints on hypothetical intended meanings (answers to these questions) [Tanenhaus, 2007, pp. 313]; Clark and Marshall [2002].
How it can be constrained:
• Explaining the shared task to the speaker or hearer
• Using visual referential domains that are co-present (e.g., Circles and Kindles), which helps to align the speaker's and hearer's hypothesis spaces
• Using neutral instructions with verbs that do not further constrain the arguments: “Select the items that correspond to [NP]”

Factor: Communication Goals
Evidence: Referring expressions can also serve non-referential communication goals, including (1) communicating additional descriptive information or (2) the speaker's attitude toward its extension (Appelt [1981]; Jordan [2005]).
How it can be constrained: For interpretation tasks, use words that are purely descriptive, without attitudinal information.

Factor: Dialogue History
Evidence: Speakers and hearers “align” their linguistic behavior to reflect the other's choice of content and words, syntactic frames, and prosody. After successful reference, participants reuse the same referring expression and extension pairings (Branigan et al. [2010]; Goudbeek and Krahmer [2012]) even if the expression describes the referent inaccurately (Brennan [1996]).
How it can be constrained: Use “one-shot” referring expressions when possible. When collecting data, use time stamps to record all of the subjects' history.

Table 1.1: Contextual factors and how they are constrained.
To summarize, we are after purely referential, one-shot referring expressions that are acquired
from a fixed task in a visual domain.
1.5.2 Collecting data in a constrained communication setting
I have developed a web-based experimental platform to collect data from humans generating and
interpreting referring expressions in this constrained communication setting. For an interpretation task, participants are presented with a set of entities, and asked to select the ones that were
intended by a particular referring expression (see Figure 1-3). For generation tasks, participants
are asked to enter an English description for the indicated entities (see Figure 1-4).
This experimental platform allows us to conduct large human experiments using Amazon’s
Mechanical Turk and rapidly acquire linguistic data from a population of self-identified native
English speakers from the United States. In the evaluation section 5.1, I discuss how experiments conducted on this platform were used to collect the data for evaluating the models of generation and interpretation.
Figure 1-3: An example interpretation task, where subjects are asked to check the targets that correspond
to the referring expression.
Figure 1-4: An example generation task, where subjects are asked to generate a referring expression for
the target Kindle.
1.5.3 Context Set = Referential Domain + combinatoric possibilities
The NLG community typically formulates the reference task with respect to a formal description of all possible target sets, called the context set. The term “context set” comes from theoretical linguistics, where it describes the viable candidate meanings for an interpretation, which evolve over
the course of dialogue (Rayo [2010]; Stalnaker [2004]). In the NLG community, each model of
referring expression generation states the contents of the context set: can target sets contain more
than one member? can the member be repeated more than once? In this section, I will develop
this notion of context set that will be used to formalize both generation and interpretation tasks.
If the context set represents the hypothesis space of interpretations, and humans are able to
assimilate evidence from the contextual factors to constrain the interpretations, then the context
set can be described as an abstraction that summarizes all contextual constraints on the possible
meanings.
What does a context set contain? Given that a referring expression’s extension is our one way into
investigating its meaning, the members of the context set can be seen as the referential domain’s
possible extensions. Because each extension can be described by a number of different intensions,
this is only a coarse view of the possible meanings.
Given a referential domain, what extensions are possible? Answers to this question vary depending on the kinds of linguistic phenomena one wants to cover. Earlier, the idea of an indefinite
description was introduced as an example of communicating an unspecific extension. For example,
with the Circles domain, the extension of a green circle or any green circle is a single referent
whose exact identity is one of multiple possibilities. Essentially the meaning of green circle
has restricted the referential domain to {c1 , c2 } while the meaning of the determiners, a or any,
restricts the possible targets that can be formed out of that domain. Whereas the meaning of
the two green circles has an extension containing the set {c1, c2}, the meaning of a green circle has an extension containing either {c1} or {c2}. To express the former, plural description, the context set must be of at least Plural extensional complexity, while expressing the latter, unspecific description requires the larger Unspecific extensional complexity class.
The topic of the extensional complexity of a context set is returned to in section 2.4.2, where it is shown that the representational expressiveness required to handle ambiguity and vagueness is doubly exponential in the size of the referential domain: a problem the remainder of this thesis is devoted to solving.
For the purpose of what follows in the remainder of this chapter, the context set can be assumed
to belong to the smaller, Singleton complexity class, in which only single, definite descriptions
are valid hypotheses:
contextset(Circles) = {{c1 } , {c2 } , {c3 }}
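The two complexity classes introduced so far can be written down directly. This is a minimal sketch; the function names are mine, while the class names follow the text:

```python
from itertools import combinations

def singleton_context_set(domain):
    """Singleton complexity: only single, definite referents are hypotheses."""
    return [frozenset([e]) for e in domain]

def plural_context_set(domain):
    """Plural complexity: every non-empty subset of the domain is a hypothesis."""
    return [frozenset(c)
            for r in range(1, len(domain) + 1)
            for c in combinations(domain, r)]

circles = ["c1", "c2", "c3"]
# At Singleton complexity, contextset(Circles) is {{c1}, {c2}, {c3}};
# at Plural complexity it is the power set minus the empty set (7 members).
```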
1.6 Characterizing the two reference tasks
Section 1.5 replaced the elusive concept of context with a single abstraction, the context set,
which expresses the sum of all contextual constraints on the speaker’s possible target sets and
the hearer’s possible extensions. Now, we are in a position to characterize the broader referential
tasks of the dialogue participants in terms of their inputs and outputs.
1.6.1 Formalizing the referring expression generation (REG) task
The speaker completes a referring expression generation (REG) task: given an initial context set and a designated subset of it, the target set, she produces a referring expression that she expects will enable the hearer to infer her intended target set from the rest of the elements in the context set, called distractors (Dale and Reiter [1995]):
REG(context set, target set) → referring expression    (1.7)
The speaker’s referring expression is a function of the target set, the distractors and the lexical
items (e.g. morphemes, words, phrases or idioms) available. As the distractors get more similar to
the targets, there are fewer words that could be used to differentiate the targets.
This formalism of the REG task can be extended to support non-determinism (section 1.4.2). If we run our REG procedure n times, we should be able to summarize its output compactly as a probability distribution, denoted Pr(·), over the referring expressions it produced:

REG-Aggregate(context set, target set, n) → Pr(referring expression)    (1.8)

1.6.2 Formalizing the referring expression interpretation (REI) task
A hearer completes a referring expression interpretation (REI) task: given a referring expression, his goal is to jointly infer the context set and the meanings that the speaker intended:

REI-Full(referring expression) → Pr(ĉontext set)    (1.9)

The outcome of a REI-Full task is a probability distribution over all elements of the hypothetical context set, ĉontext set, inferred by the hearer.
The constrained interpretation tasks in this thesis assume that the context set is the same between
speaker and the hearer, so the context set no longer has to be inferred and the problem is simpler
than REI-Full:
REI(context set, referring expression) → Pr(context set)    (1.10)
Here is an example of a REI task: the referring expression a green one should result in a probability distribution that assigns p({c1}) = 0.5, p({c2}) = 0.5 and p({c3}) = 0.
How does incremental processing affect formulations of the REG and REI tasks?
A characteristic of an incremental architecture is that it is able to generate partial results at any
stage of the output. Therefore it gives us more chances to probe the extensional meaning of a
referring expression.
For example, consider the REI task of interpreting a green circle in the Circles domain (again,
I assume the Singleton context set containing 3 members). Delineating lexical items requires
a theory of lexical semantics; for now, assuming that all lexical items are word tokens, this example gives us three opportunities to investigate the referring expression's extensional meaning:
1. REI(contextset(Circles), a) → p({c1}) = p({c2}) = p({c3}) = 1/3
2. REI(contextset(Circles), a green) → p({c1 }) = p({c2 }) = 0.5
3. REI(contextset(Circles), a green circle) → p({c1 }) = p({c2 }) = 0.5
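The three probes above can be reproduced with a small incremental-filtering sketch. The constraint table is an assumption: it hard-codes “green” as true of c1 and c2 only, and treats “a” and “circle” as non-restrictive in this domain:

```python
def incremental_rei(context_set, words, constraints):
    """Interpret word by word; after each lexical item, return a uniform
    distribution over the still-viable target sets in the context set."""
    viable = list(context_set)
    history = []
    for word in words:
        pred = constraints.get(word, lambda ts: True)  # unknown words: no-op
        viable = [ts for ts in viable if pred(ts)]
        p = 1.0 / len(viable) if viable else 0.0
        history.append({ts: (p if ts in viable else 0.0) for ts in context_set})
    return history

context_set = [frozenset({"c1"}), frozenset({"c2"}), frozenset({"c3"})]
constraints = {"green": lambda ts: ts <= {"c1", "c2"}}  # assumed: c1, c2 are green
steps = incremental_rei(context_set, ["a", "green", "circle"], constraints)
```

After “a” the distribution is uniform over all three singletons; after “green” it narrows to 0.5/0.5/0; “circle” leaves it unchanged.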
However, if you ask a number of human subjects to “Select the elements that correspond to ‘a’”, the answers you collect are not illuminating. The meaning of an article comes largely from how it combines with the other lexical items in its expression. When considered alone, it is semantically unacceptable.
1.6.3 Defining communication success and failure
A given act of linguistic reference comprises both a REG and a REI task. A reference success occurs when the target set input to the REG task is the sole extension of the REI task's output. A reference failure is any mismatch between the speaker's intended extension and the one (or ones) yielded by the hearer's interpretation.
What causes a referring expression to fail?
First, the speaker and hearer can have different context sets, because they have a different set of
assumptions in operation. This is the nature of the REI-Full task, and in everyday interpersonal
communication, hearers are expected to revise their implicit assumptions while interpreting a
referring expression. However, in this thesis, I assume that the measures taken in section 1.5.2 to constrain the task setting justify focusing on the simpler REI task.
Second, decisions made during REI while decoding the meanings of the referring expression may have been incorrect, for example due to ambiguity, ellipsis or vagueness. In Chapter 3, I present a model that handles these three problematic cases.
Third, even when the hearer has successfully decoded the referring expression's intensional description during REI, it may be insufficiently descriptive to yield the extension that the speaker
intended. One reason may be that the referring expression is actually underspecified and that a
human hearer would not be able to successfully interpret it. The other reason is that the speaker
may have intended the hearer to deploy implicit expected inferences (implicatures) to recover
her full intended meaning.
1.7 Conclusion
I began this chapter by explaining how referring expressions’ extensions provide an avenue for
studying linguistic meaning. Then, I accumulated processing constraints from related theoretical,
computational, and psycholinguistic work to make a case that models of interpretation and
generation should be fast, incremental and non-deterministic. Finally, I formalized the two reference tasks, REG and REI.
What computational model allows us to perform syntactic and semantic analysis incrementally, and to maintain the current best intension and its extension at each choice point? In the next chapter,
a case is made for using techniques from automated planning over belief states, where each state
represents a complete interpretation.
Chapter 2

AIGRE, a belief-state planning approach to reference
This chapter presents AIGRE,1 a computational model of both generation (REG) and interpretation
(REI) tasks. By framing the REG problem within automated planning, the related REI task
becomes a problem of plan recognition.
For REG, automated planning provides a unified treatment of linguistic decisions, and a set of
techniques for making decisions (belief-changing actions) toward communication goals while
balancing them against their costs. For REI tasks, plan recognition offers an abductive framework
for balancing the costs of potentially ad-hoc interpretation decisions against the benefits. Addressing both tasks together allows researchers to transfer ideas between NLG and natural language
processing (NLP) research communities, and engineers to reuse components and mechanisms.2
A video demonstration of AIGRE’s behavior can be found at: http://web.media.mit.
edu/~dustin/aigre/
2.1 Generating Referring Expressions as Planning
The origins of the anti-modular approaches (see Chapter 6) to word-level decisions for REG began
with the SPUD system (Stone et al. [2003]). SPUD pioneered the lexical approach in which each
lexical entry encoded a surface lexical item along with its syntactic, semantic, and conventional
pragmatic meanings. Lexical approaches presume that lexical entries can be designed to contain
all of the ingredients required to synthesize a phrase or sentence. As such, the REG task is reduced
to choosing (i.e., content selection and lexical choice) and serializing lexical units (putting them
into a flat sequence), which, as Koller and Stone [2007] observed, bears strong similarities to
automated planning (Ghallab et al. [2004]).
1. Automatic interpretation and generation of referring expressions. In French, it means “sour”.
2. As described in Chapter 5, by using the same representations between both tasks, once a linguistic meaning can be expressed by the interpretation process, being able to generate it is simply a matter of directing the search process by learning the decision weights.
Automated planners try to find plans (sequences of actions), given (1) a fixed planning domain
that describes how the relevant aspects of the world are changed by actions, and (2) a problem
instance: a description of the initial state and the desired goal states. The goal states are implicitly defined by a goal-test function.
Problem instances can be solved using heuristic search (Bonet and Geffner [2001]). Starting at
the initial state, the search process applies actions to generate hypothetical successor states until
it has found a goal state. The actual search process does not operate on states, but on search nodes
that contain states plus additional meta-data. Minimally, the search node’s meta-data contains a
backpointer expressing the ⟨action, previous state⟩ pair that created it. Once a goal is found,
the backpointers are used to recover the plan.
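The node-plus-backpointer bookkeeping can be sketched as follows. Breadth-first search stands in here for the heuristic search described above, and the toy counting problem is invented for illustration:

```python
from collections import deque

def find_plan(initial, goal_test, successors):
    """Search over nodes that pair a state with the (action, parent node)
    backpointer that created it; at a goal, follow backpointers to
    recover the plan."""
    frontier = deque([(initial, None, None)])  # (state, action, parent node)
    seen = {initial}
    while frontier:
        node = frontier.popleft()
        state, _, _ = node
        if goal_test(state):
            plan = []
            while node[1] is not None:   # walk backpointers to the root
                plan.append(node[1])
                node = node[2]
            return list(reversed(plan))
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, action, node))
    return None
```

Only the nodes hold meta-data; the states themselves stay small, which matters when each state is a full interpretation.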
Observe the analogy between planning and the REG task (table 2.1). The actions defined by the
planning domain play an analogous role to the lexicon: each action corresponds to a lexical entry
and is responsible for defining its semantic effects, along with the local syntactic and compositional
constraints that come with its lexical unit. These actions change states, which represent complete
interpretations, until the intension meets the communication goal. The syntactic constraints are
expressed in the action operators that transition between states, and each action has the possibility
of producing a lexical item.
Automated Planning → Generating Reference (REG)

• actions change the state → lexical entries change the intension while adhering to semantic and morphosyntactic constraints among lexical items
• states represent the parts of the environment relevant to the problem → states represent the meaning of the referring expression, which can be viewed either as (1) an intensional description or (2) as its extension
• goal: desired state(s) → communication goal: a state containing a meaning that identifies the targets and none of the distractors
• planning domain: a collection of action descriptions → lexicon: a collection of lexical entries
• plan: a sequence of actions → referring expression: a sequence of lexical items (surface forms of the lexical entries)

Table 2.1: The analogy between automated planning and generating reference.
In separating the problem solver from the instances of problems, these frameworks allow the same
planning domain (i.e., lexical semantics) to be reused across different problem instances—different
REG tasks. Furthermore, the next section shows that by using plan and goal recognition to infer
plans and goals from partially observed action sequences (e.g. lexical items), the same planning
domain can be used for REI tasks as well.
2.2 Interpretation as Plan Recognition
If generating a sentence can be modeled as a planning problem, then it follows that interpretation
can be modeled as plan recognition (Geib and Steedman [2006]; Heeman and Hirst [1995];
Schlangen et al. [2009]), or its subproblem of goal recognition. By treating plan recognition as
an “inversion” of the planning problem, it can be solved using the same techniques used in classical
planners (Ramírez and Geffner [2010]) or optimal decision-theoretic planners (Baker et al. [2007]).
Viewed as plan recognition, the REI task is: given an initial state (the context set) and a sequence of partially observed actions (lexical items), what are the true actions' effects (lexical entries), their intermediate states (incremental meanings) and, most importantly, the intended goal state (the intended meaning)? Many linguistic issues can be rephrased in plan recognition terminology:
Linguistic Concept → Plan Recognition Terminology

• inferring the speaker's intended meaning → goal recognition: given evidence, infer the agent's intended goal
• inferring the speaker's intended meaning and her linguistic decisions → plan recognition: given evidence, infer the goal state and the intermediate actions and states
• lexical ambiguity → partial observability of an action
• vagueness → partial observability of an action
• ellipsis → omitted actions
• implicit inferences → omitted actions
• misspellings → unintended actions

Table 2.2: Linguistic processes and concepts put into plan recognition terminology.
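In the cost-based spirit of Ramírez and Geffner [2010], goal recognition can be sketched as scoring each candidate goal by how cheaply a plan consistent with the observations can reach it. The softmax weighting and the toy cost function below are illustrative assumptions, not the cited formulation:

```python
import math

def recognize_goal(observations, goals, plan_cost):
    """Return a distribution over goals: goals reachable more cheaply via
    plans consistent with the observed actions get higher probability."""
    weights = {g: math.exp(-plan_cost(g, observations)) for g in goals}
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

# Toy cost function: a goal consistent with the observations costs the
# number of observed actions; an inconsistent one pays a detour penalty.
def toy_cost(goal, obs):
    return len(obs) if goal in obs else len(obs) + 5.0
```

With observations ("fish", "bank2") and candidate goals bank1/bank2, the river sense wins because no costly detour is needed to explain the evidence.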
2.3 Representing Lexical Items as Actions
Action operators define transitions between search nodes—small data structures that contain belief
states. In a planning-based approach to REG, actions can correspond to any range of decisions
within the NLG pipeline (figure 6-3). As our goal is to model the meaning of referring expressions,
the actions we describe here correspond to lexical items. Here are some examples of actions, or
lexical entries, that can be stored in AIGRE's heterogeneous lexicon (containing units of different
sizes):
Morphemes: –’s
Words: the, blue, biggest
Compounds: most expensive
Idioms: bigger than a breadbox
AIGRE’s linguistic actions operate entirely in belief-space, so they do not affect the state of the
world. This means that each action can be evaluated hypothetically without incurring costs outside
of computation or producing irrecoverable side-effects. This is an advantage of working in the
belief space.
When solving an instance of a planning problem using search, planners internally generate a
directed graph called a planning graph, where the nodes represent hypothetical states and
the labeled edges correspond to actions that represent valid transitions between the states. A
planning domain and an initial state thus characterize an implicit graph of all the possible states
and transitions between them, which is usually infeasible to fully enumerate.
The result of both REG and REI tasks can be depicted as a planning graph that shows the paths to
possible interpretations. In the case of Figure 2-1, only one distinct path and interpretation was
found. Each edge represents an action, and the actions typically bring a surface lexical item with
them.
[Figure: a chain of four search nodes connected by three actions:
node 1 (size: 7) -[the, cost 0.01]-> node 2 (size: 7) -[blue1, cost 1.31]-> node 3 (size: 1) -[one2, cost 2.37]-> node 4 (size: 1)]

Figure 2-1: The interpretation planning graph for the blue one in the Circles domain, containing three actions: a1: the, a2: blue and a3: one.
Here are a few notes about the planning graph of figure 2-1:
• The blue circles and red diamonds represent search nodes, containing belief states.
• The red diamonds represent goal states.
• In states, the first number represents the order in which the state was expanded.
• In states, the second number represents its size: the number of targets in its extension. The
fact that the goal state is of size 1 means that it has one element in its extension.
• The beige arrows represent actions, which correspond to lexical items.
• In actions, the first text represents the action’s name—generally the same as its lexical item
with some distinguishing information if it is lexically ambiguous. If it is vague, its standard
of comparison will appear below its name.
• In actions, the last property, cost, represents the cumulative path cost.
2.4 Representing Intensions as Belief States
Automated planning is the task of finding a sequence of actions that are expected to reach a goal
state, given a starting state, a set of goal states, and a planning domain that specifies the relevant
aspects of the world and how it changes as a function of the agent's actions. Unlike classical state-based planners, where actions describe how the physical world changes, for reference tasks the actions describe the dynamics of belief states. The previous chapter argued that architectures for reference generation and interpretation should be incremental, meaning that at each decision point the entire extension is available. One straightforward way to do this is to have each state represent an intension (from which its extension can be derived), so that every intermediate state is a full interpretation.
To illustrate how intensional meanings are represented, I will use the Circles domain again; in case you forgot it, it contains the three circles c1, c2, and c3.
In section 1.5.3, we used context sets of Singleton complexity, represented as all members of the referential domain. Here, we will use the Plural extensional complexity class (2.4.2), which is the power set, P(·), of the Circles domain minus the empty set, ∅, and contains 7 members:

P(Circles) \ ∅ = {{c1}, {c2}, {c3}, {c1, c2}, {c1, c3}, {c2, c3}, {c1, c2, c3}}
The power set under the subset operator forms a lattice, which is what we will use to visualize the
possible target sets of a belief state. Given an empty belief state, b = [], about the Circles domain:
generate-targets(b, Circles) → the full lattice of 7 target sets: {c1, c2, c3} on top; {c1, c2}, {c1, c3}, {c2, c3} in the middle; {c1}, {c2}, {c3} on the bottom.
CHAPTER 2. AIGRE, A BELIEF-STATE PLANNING APPROACH TO REFERENCE
In this example, the intension of b is empty. Thus it has no constraints, and the generate-targets function returns all 7 candidate target sets. Through the rest of this thesis, a belief state's intensional meaning will be represented by an attribute-value matrix and the extension will be described by a lattice. These will usually appear side-by-side, as in figure 2-2.
b = [] → the full lattice of 7 target sets, from {c1, c2, c3} at the top down to the singletons {c1}, {c2}, {c3}.
Figure 2-2: Recall from section 1.3.1 the distinction between the referring expression’s two
meanings: the intension (left) and the possible targets, the extension (right). Here is how these
two views are represented. The extension contains 7 sets and is visualized as a lattice for reasons
that will become clear soon.
New information can be added to the belief state, via actions, to change the intensional meaning
of the hypothesis under consideration. This, in turn, places constraints on the state’s extension.
b = [color: green] → the lattice over the green circles: {c1, c2} on top; {c1}, {c2} below.
Figure 2-3: An updated intensional meaning that expresses the constraint green(x), i.e., the meaning of the word “green”, and the corresponding extension.
Because states are complete interpretations, we have achieved the desired incremental property
discussed in section 1.4.1.
2.4.1 Belief states, a representation for uncertainty
The representation being used for states of Plural complexity (namely, the power set of the referential domain) is similar to the AI concept of a belief state about the referential domain. A belief state characterizes a state of uncertainty about some lower layer, such as the world, W, or another belief state. For example, if there are only three relevant distinctions in the world, W = {p1, p2, p3}, and you know that both p1 and p2 are true but do not know about p3, your belief state is: b = {{p1, p2, p3}, {p1, p2}}. This means that either you are in the state {p1, p2, p3}, where all three propositions are true, or you are in the state {p1, p2}, where only p3 is false. (Following
the convention in the literature, I represent a state by a set containing only its true propositions,
and assume all of the remaining propositions that are not in the set are false.)
For reference tasks, the underlying state is not the state of the world; it is the extensional meaning of the referring expression being conveyed. Consequently, the propositions, e.g. p1, correspond to whether or not the speaker intended a particular referent, e.g. c1. For example, in the Circles domain, the referring expression “any circle” corresponds to the belief state b = {{c1}, {c2}, {c3}}.
The standard representation of a belief state is the power set of the states in the lower layer, b = P(W), containing 2^|W| members, or more generally a probability distribution, b = Pr(W), representing degree of belief.
Beliefs are an abstraction of any lower layer, and beliefs can be about beliefs about beliefs.
There is an important difference between information-changing actions, such as accepting the semantic contributions of a word, and world state-changing actions, such as moving your head. Confusing the two can have drastic consequences, because world-changing
actions often cannot be reversed. For example, a valid strategy for answering the question of “How
many files are in this directory?" is to delete it and respond “Zero" (Golden and Weld [1996]).
2.4.2 Quantifying the extensional complexity of meaning
If we focus on referring expressions where the speaker's communicative goal is solely to refer, we can get a handle on the size of the hypothesis space, at least for extensional meanings. Purely referential uses of referring expressions can be quantified in terms of their extensional complexity; however, this is only a grossly underestimated lower bound on the referring expression's actual meaning. In general, referring expressions are used for more than communicating the identity of targets; in many cases, their meaning includes descriptions and attitudes about the entities they denote. And keep in mind that there may be a large number of varied intensional meanings for each extension: recall, from the Circles domain, extension(the blue one) = extension(not the green ones) = {c3}; however, their intensions are presumably very different.
Given a referential domain of size R, REG systems that can refer to sets, i.e. can express plural meanings (Horacek [2004a]; Stone [2000]; van Deemter [2000]), explore a hypothesis space containing 2^R − 1 extensions, which is representationally equivalent to a belief state about the hypothesis space of only singleton referents. If one wants to represent all multiple interpretations about non-empty sets, the hypothesis space contains 2^(2^R − 1) − 1 extensions, which is large enough to express any Boolean combination of the sets.
As we will see in the next chapter, to represent the disjoint interpretations caused by vagueness and ambiguity, the state representation needs to extend into the Boolean complexity class. A Boolean state-space grows large quickly: for the Circles domain, where R = 3, there are 127 extensions; for the Kindles domain, where R = 5, there are over two billion.
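These counts are quick to verify. The sketch below computes the size of each denotational complexity class from Table 2.3; complexity_sizes is an illustrative helper of my own:

```python
def complexity_sizes(R):
    """Sizes of the four denotational complexity classes for a
    referential domain of size R (see Table 2.3)."""
    return {
        'Singleton': R,
        'Plural': 2 ** R - 1,
        'Unspecific': R * 2 ** R - 1,
        'Boolean': 2 ** (2 ** R - 1) - 1,
    }

# Circles (R = 3): 127 Boolean extensions.
# Kindles (R = 5): 2^31 - 1, over two billion.
```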
Fortunately, there are ways to avoid ever having to enumerate this doubly-exponential hypothesis space. First, a belief state uses lazy evaluation to generate its targets. Second, the base exponent is avoided altogether, as we derive it by aggregating states from the planning graph. This will be described in extensive detail in the sections on ambiguity (3.4.1) and vagueness (3.4.2).
Complexity Class | Size | Example | extension(Example)
Singleton | R | the blue circle | {c3}
Plural | 2^R − 1 | the two circles | {c1, c2}
Unspecific (quantifiers and indefinites) | R·2^R − 1 | all circles | {c1, c2, c3}
Boolean | 2^(2^R − 1) − 1 | green or not blue | {{c1}, {c2}, {c1, c2}}
Table 2.3: Denotational complexity classes for referring expressions in referential domains of size
R. The farther down the list, the more expressive the referring expressions.
Nonetheless, the complexity of the extension is only a lower bound on the number of distinct belief states. Any one of them could be the intended meaning for a referring expression.
Although the hypothesis space is not being explicitly enumerated, it is still doubly-exponential
and thus efficient ways are needed to search it. The entirety of chapter 4 is devoted to narrowing
in on the intended meaning given a referring expression (for REI) or finding a path to the intended
extensional meaning (for REG).
2.4.3 Belief state implementation details
The key responsibility of a belief state is to represent and detect equivalent or inconsistent
information at the intensional level. Its function is to aggregate all actions’ intensions and detect
whether a partial information update is inconsistent or would cause the interpretation to be invalid
(i.e., have no members). In AIGRE, belief states are represented as a collection of objects, called
cells,3 which hold partial information and manage the consistency of information updates.
A belief state does not need to explicitly enumerate all 2^R − 1 possible members of its extension; it can lazily generate the extension only when needed (see 3.3).
Actions operate on AIGRE’s belief states, yet the belief state influences much of the behavior of
the action’s effects. As we will see in the next section, the contents of a belief state determine
the number of effects an action will yield, the specific values within the effect’s belief (using late
binding), and whether or not the update is valid.
2.4.4 Action implementation details
AIGRE's lexicon is composed of lexical entries: actions that can change belief states. Each action/word is an instantiation of an action class and has (1) a syntactic category (part of speech tag), (2) a lexical item, (3) an effect generator that yields effect functions that produce specific semantic contributions to belief states, determined in part by its syntactic category, (4) a fixed lexical cost, (5) an estimated effect cost, and (6) a computed, real effect cost.
3 The idea behind cells comes from the propagator framework of Radul and Sussman [2009], and our Python library is available from http://eventteam.github.io/beliefs/
Actions are defined by instantiating class instances, for example:
GradableAdjective(’big’, attribute=’size’, sort_from=’max’)
CrispAdjective(’big’, attribute=’size’, value=IntervalCell(5, np.inf))
When instantiating an action, the first argument is its lexical item in its root form; the class's initialization method uses the root lexical item to also instantiate variant actions for each derivative lexical item when applicable (e.g. the plural, singular, comparative, and superlative forms). The rest of the arguments are specific meaning components of the syntactic category's effect generator function.
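The instantiation pattern above can be sketched with a minimal, hypothetical action class. The attribute names (lexeme, pos, meaning), the default cost, and the naive pluralization below are my own illustrations, not AIGRE's actual API:

```python
class LexicalAction(object):
    """Hypothetical sketch of an AIGRE-style lexical entry/action; the real
    classes also carry effect generators and estimated/computed effect costs."""

    def __init__(self, lexeme, pos, lexical_cost=1.0, **meaning):
        self.lexeme = lexeme              # (2) lexical item, in root form
        self.pos = pos                    # (1) syntactic category (POS tag)
        self.lexical_cost = lexical_cost  # (4) fixed lexical cost
        self.meaning = meaning            # (3) components for the effect generator

    def variants(self):
        """Derivative lexical items; a naive '-s' plural for illustration."""
        return [self.lexeme, self.lexeme + 's']

big = LexicalAction('big', pos='JJ', attribute='size', sort_from='max')
circle = LexicalAction('circle', pos='NN', type='circle')
```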
2.5 Conclusion
This chapter motivated the connection between planning and REG, and plan recognition and REI.
There are still a multitude of design decisions for what goes into the planning domain (analogous
to a theory of lexical semantics) and what kind of search control procedure is used to navigate
the space of planning decisions. The next chapter develops a theory of lexical semantics for a
fragment of English noun phrases, which takes advantage of the belief-state representation. The
next chapters of this thesis are organized as follows:
1. Analyzing linguistic phenomena and making sure the planning domain is capable of representing all relevant choice points that lead to desired output (Chapter 3), which is evaluated
in terms of coverage (5.2 and 5.3).
2. Devising strategies for minimizing the number of choice points the system considers in
producing the desired output (Chapter 4) and evaluating the strategies (5.4).
Chapter 3
Building the Lexicon
This chapter describes a planning-based lexical semantics for English noun modifiers that expresses
the intensional component of a referring expression’s meaning, along with a function for deriving
the extension. The first section describes a compact representation for English determiners, articles
and quantifiers. The second section describes an approach to representing indeterminate meanings
and the context-dependent meanings of lexically ambiguous, vague, and elided words.
3.1 Referring to sets
3.1.1 Plurals and Cardinals
Consider these two referring expressions in the Circles domain:
(3.1) the blue circle
(3.2) the blue circles
These have identical meanings except that (3.1) encodes singular number and (3.2) encodes plural number. With a few exceptions (Horacek [2004a]; Stone [2000]), the majority of research on REG has focused on singular, definite descriptions to the neglect of referring expressions like (3.2). How do we represent number using belief states?
Representing number in belief states
To encode the cardinality of a belief state, a piece of meta-data called targetset_arity is used. It enforces a constraint on the cardinality of all of the possible members of the belief state's extension. This property is implemented as an interval cell, which supports the operations of interval algebra. To keep the belief state's descriptions separate from its meta-data, the description is stored in the target property of the belief state: only referents in the referential domain that entail the description in the belief state's target property can appear in members of its extension. (It is
called target, because from the speaker’s perspective, the members of the extension are the target
set.)
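The interval-cell idea can be sketched in a few lines. IntervalCell below is a simplified stand-in of my own (the cell in the actual beliefs library supports a richer interval algebra); the point is that partial information about a number is refined by intersection, and an empty intersection signals an invalid interpretation:

```python
class IntervalCell(object):
    """Simplified sketch of an interval cell: partial information about a
    number, refined by intersection (hypothetical; the real API may differ)."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def merge(self, other):
        """Intersect two intervals; an empty result signals inconsistency."""
        low, high = max(self.low, other.low), min(self.high, other.high)
        if low > high:
            raise ValueError('inconsistent update: empty interval')
        return IntervalCell(low, high)

singular = IntervalCell(1, 1)             # arity contributed by "circle"
plural = IntervalCell(2, float('inf'))    # arity contributed by "circles"
# Merging `singular` with `plural` raises: a state cannot be both.
```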
The meaning of a singular property looks like this:
b0 = [targetset_arity: 1, target: []] → the lattice over {c1, c2, c3} with only the singleton sets {c1}, {c2}, {c3} admitted.
Figure 3-1: The intension of singularity in the Circles domain. The belief state b0's target property contains the (empty) descriptive constraints on the 3 elements in the referential domain, and the targetset_arity property describes constraints on the size of members of the power set that is generated from those 3 elements.
The cross-hatched areas represent arity constraints. (Note that with intervals, [1, 1] = 1) These
are in operation during the generate-targets stage, when the extension is being computed, and
serve to control the size of members of the power set of the consistent elements of the referential
domain. The meaning of plurality looks like this:
b0 = [targetset_arity: [2, ∞), target: []] → the lattice over {c1, c2, c3} with only sets of two or more members admitted: {c1, c2}, {c1, c3}, {c2, c3}, and {c1, c2, c3}.
Figure 3-2: The intension of plurality in the Circles domain.
In a similar way the meanings of cardinal numbers, such as “three” and “3”, are defined by setting
the value of targetset_arity equal to their value.
3.2 Referring to sets of sets
3.2.1 Representing unspecific meanings
What happens when the speaker wishes to communicate the identity of a set of sets? First, when would this happen? Recall from section 2.4 that a “set of sets” is a common representation for a belief state, and belief states represent uncertainty. Consequently, one reason for a speaker to communicate a belief state is because she lacks knowledge about the target set or wants to avoid revealing the identity of the target set for any reason (perhaps the target's identity is irrelevant for the conversation).
Using the Circles domain, suppose someone asks you to hand him:
(3.3) a circle
The referring expression (3.3) has an extension with three members: extension(a circle) = {c1} ∨ {c2} ∨ {c3}. Any of the three circles is acceptable; they all meet the description. The speaker did not intend to communicate all three circles: part of the meaning of the indefinite article is that the referent is singular in number. In effect, the speaker has communicated a description and a choice between multiple alternative targets.
How do we express the meaning of the indefinite article, a, (and its plural form: some)? To
understand AIGRE’s lexical semantics for indefinite articles, I will need to first describe the lexical
semantics of quantifiers such as all and most.
3.2.2 An unconventional treatment of English quantifiers
English, like many other languages, has several quantifiers. Only a few are expressed in the syntax of FOPC, including some (denoted as ∃) and all (∀). English quantifiers have different interactions of scope and entailment. Following [Cruse, 2011, pp. 36], the meanings of quantifiers are best modeled as constraints on the cardinality properties of the (interacting) sets within the quantifier's (indeterminate) scope. For the referring expressions we are interested in, there is only one set under the quantifier's scope, so we do not need to model scope attachment. However, it is interesting to notice that quantifier attachments, like prepositions, depend on context:
1. Many Americans are computer programmers.
2. Many MIT students are computer programmers.
For these assertions to be informative or relevant, they depend on different contextually-supplied expectations of the extent to which the sets overlap. For example, a hearer who knows MIT is an engineering school may interpret these with a preconception that the sets in (2) overlap more than those in (1).
How does AIGRE represent the meaning of quantifiers like “any”? To do this, I introduce another meta-data property of the belief state called contrast_arity. Like targetset_arity, it is an interval that restricts the size of the targets; however, it is a relational property that constrains the difference between the size of a candidate target and the size of the largest target set. The largest target set is always the total number of entities in the referential domain that are consistent with the description of the belief state.
The meaning of the action any is simply to set the contrast_arity equal to the interval [1, ∞).
Here is an illustration of the meaning of the action any derived from the contrast_arity property:
b0 = [targetset_arity: (unspecified), contrast_arity: [1, ∞), target: []] → the lattice over {c1, c2, c3} with only the top set {c1, c2, c3} ruled out.
Figure 3-3: The intension of any in the Circles domain.
By setting contrast_arity to [1, ∞), the only target ruled out was {c1 , c2 , c3 } because its size
difference with the largest target (itself) was 0, a number outside the interval [1, ∞).
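The joint effect of the two arity intervals on the lattice can be reproduced in a few lines. This is a sketch of the pruning logic, not AIGRE's implementation; the helper extension and its parameter names are illustrative:

```python
from itertools import combinations

def extension(domain, targetset_arity=(1, float('inf')),
              contrast_arity=(0, float('inf'))):
    """Candidate target sets whose size, and whose contrast (difference in
    size from the largest consistent set), fall inside the given intervals."""
    tlow, thigh = targetset_arity
    clow, chigh = contrast_arity
    out = []
    for size in range(1, len(domain) + 1):
        if tlow <= size <= thigh and clow <= len(domain) - size <= chigh:
            out.extend(set(c) for c in combinations(domain, size))
    return out

circles = ['c1', 'c2', 'c3']
# "any": contrast_arity [1, inf) rules out only the full set {c1, c2, c3}.
any_ext = extension(circles, contrast_arity=(1, float('inf')))
```

Setting contrast_arity to (0, 0) instead admits only the largest consistent set, which anticipates the treatment of the definite article below.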
This concept of “contrast arity” simplifies the intensional descriptions of the other determiners (and is a novel formulation, to my knowledge), including much of the meaning of the definite and indefinite articles listed in Table 3.1:
3.2.3 Representing free choice in interpretations
Returning to the discussion of indefinites, the contrast_arity and the interval representation were the missing pieces needed to represent the action for the indefinite article, a. In the treatment from Table 3.1, the meaning of a is the combined meaning of the quantifier any and the cardinal number one.
The consequence of these design decisions is that we have nailed down a semantics for when our belief states represent unspecific extensions: they represent free choice among the available targets. Any time the belief state outputs more than one member, it means that the hearer is justified in picking any one of them at will.
This brings the states into a larger extensional complexity class, Unspecific, which has R·2^R − 1 possible extensions. This class extends the expressiveness of the Plural class by adding the arity constraints. Because the arity constraints only restrict levels of the lattice, of depth R, this only leads to an additional multiplicative factor of R.
3.2. REFERRING TO SETS OF SETS
Lexical Item | targetset_arity Meaning | contrast_arity Meaning
a | 1 | [1, ∞)
all | [2, ∞) | 0
any | — | [1, ∞)
both | 2 | 0
a couple | 2† | [1, ∞)
a few | 5† | [1, ∞)
most | [⌈|b|/2⌉, ∞) | [1, ∞)
several | 7† | [1, ∞)
some | [2, ∞) | [1, ∞)
the | — | 0

Table 3.1: Treatment of determiners' intensional meanings, without scope attachment. †: while the meanings of unspecific determiners could alternatively be modeled with an interval for their targetset_arity, their senses differ in their degree of fit, i.e. a set of size 7 is a better fit for “several” than 8, and 8 is better than 9. These are better modeled as separate senses, each with a different weight (e.g. cost, probability), as we do later in 3.4.2.
As we will see later in the section 3.4.1 on ambiguity, unspecific meanings are different from having
mutually exclusive targets. Distinguishing “a set of sets" from mutually exclusive interpretations
will require additional representation.
A prediction of this representation for the indefinite article is that if there were only one referent provided by the description, using the indefinite article would be infelicitous. So, if Susan asked Hans for “directions to a bank”, this would only be appropriate if there were more than one bank in her referential domain. If Hans had only one bank in his, it is a clue that he is out of sync with Susan, and he should attempt to revise his assumptions.
3.2.4 A simple meaning for the definite article
Table 3.1 specified the meaning of the definite article as setting the contrast_arity value to 0. This outrageously simple solution is worth more discussion: because the is so ubiquitous, we definitely want to get its meaning right.
To explain the meaning of the, I will describe two different cases in turn: when the is followed by a singular NP,1 and when it is followed by a plural NP. I use examples from the Circles domain to illustrate both cases. First, the singular case:
(3.4) the blue circle
1 Noun Phrase. Consult section 1.2.
(3.5) the green circle (*)
Given the above singular descriptions, only (3.4) is valid. This is because using a definite article presupposes that there is a single member denoted by the description, and in the case of (3.5), there are two members denoted by the description. In both cases, the singular constraint comes not from the definite article, but from the noun “circle”. This singular constraint sets the targetset_arity to 1, as was explained in 3.1.1.
b0 = [targetset_arity: 1, contrast_arity: 0, target: [type: circle, color: blue]] → {c3}
Figure 3-4: The intension and extension of the blue circle in the Circles domain.
b0 = [targetset_arity: 1, contrast_arity: 0, target: [type: circle, color: green]] → the lattice over {c1, c2} ({c1, c2}; {c1}, {c2}), with every set ruled out by the arity constraints (empty extension).
Figure 3-5: The intension and extension of the green circle in the Circles domain.
For the unacceptable meaning of (3.5), the green circle, there are two separate size constraints that together make the extension empty. The first is the singular constraint of the targetset_arity, which rules out all target sets with two or more members and is visually represented by the upper cross-marked region. The second is the contrast_arity created by the definite article, which rules out all but the top row of the lattice and is thereby visually represented by the lower cross-marked region. This yields the extensions we intended: a single member for (3.4) and nothing for (3.5).
Now consider the other case: when the definite article precedes a plural noun phrase:
(3.6) the blue circles (*)
(3.7) the green circles
Guided by our intuition again, the meaning of (3.6) should be unacceptable and the meaning of
(3.7) should be {c1 , c2 }. The only difference is that the singular constraint of targetset_arity is
now a plural constraint.
b0 = [targetset_arity: [2, ∞), contrast_arity: 0, target: [type: circle, color: blue]] → the lattice over {c3} (just {c3}), ruled out by the plural constraint (empty extension).
Figure 3-6: The intension and extension of the blue circles in the Circles domain.
b0 = [targetset_arity: [2, ∞), contrast_arity: 0, target: [type: circle, color: green]] → the lattice over {c1, c2}, with only the top set {c1, c2} admitted.
Figure 3-7: The intension and extension of the green circles in the Circles domain.
And the resulting output stands up to intuition.
3.2.5 Representing Negation
Simple cases of negation are handled by adding a negative descriptive counterpart to the target descriptive property, called the distractor. Whatever elements in the referential domain entail the description of the distractor will be ruled out. This representation can express the semantics of referring expressions such as:
(3.8) no blue circle
(3.9) all except the Kindle Fire
(3.10) not the big ones
3.3 A function for deriving the extension from the intension
Now that the four main properties of a belief state have been described, all the pieces are in place to explain the process that generates the extension (possible target sets) given an intension (belief state) and a referential domain. Here is the process implemented in Python 2.7:
from itertools import chain, combinations

def generate_targetset(self, referential_domain):
    """Generates members of the target set that are
    compatible with all available constraints."""
    tlow, thigh = self['targetset_arity'].get_tuple()
    clow, chigh = self['contrast_arity'].get_tuple()
    # build list of all entities consistent with the
    # target and inconsistent with the distractor
    entities = []
    for entity in referential_domain.iter_entities():
        if self['target'].is_entailed_by(entity) and \
           (self['distractor'].empty() or not
            self['distractor'].is_entailed_by(entity)):
            entities.append(entity)
    biggest_set_size = len(entities)
    low = max(1, tlow)
    high = min(biggest_set_size, thigh)
    # iterate through all combinations of entities, largest sizes first
    for target in chain.from_iterable(combinations(entities, size)
                                      for size in reversed(range(low, high + 1))):
        # if within contrast boundaries, yield it
        if clow <= biggest_set_size - len(target) <= chigh:
            yield target
This function is given a referential_domain as input. It then computes all of the entities (R) that are consistent with its target description and not consistent with its distractor description (the distractor check is skipped when the distractor is empty, since everything trivially entails an empty distractor). Then, it proceeds by generating the power set, starting from the top level of the partial ordering and working its way down.
Observe that the size of a belief state’s extension can be computed without explicitly enumerating
and counting its members, using the equation below. Unfortunately, a summation over a range of
binomial coefficients is required and there is no simpler known closed-form solution.
clow = min(b.contrast_arity)    (3.11)
tlow = min(b.targetset_arity),  thigh = max(b.targetset_arity)    (3.12)
R = |{r ∈ referential domain : r ⊨ b.target ∧ r ⊭ b.distractor}|    (3.13)
|b| = Σ_{k=tlow}^{min(R−clow, thigh, R)} (R choose k)    (3.14)
Figure 3-8: The equation for computing the size of a belief state's extension without enumerating all target sets.
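As a sanity check, the binomial-sum formula can be compared against brute-force enumeration. The helpers size_by_formula and size_by_enumeration below are illustrative, not part of AIGRE; note the sum in the formula starts at tlow and, like the displayed equation, does not use the chigh bound:

```python
from itertools import combinations
from math import factorial

def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def size_by_formula(R, tlow, thigh, clow):
    """Size of the extension via the sum of binomial coefficients."""
    upper = min(R - clow, thigh, R)
    return sum(choose(R, k) for k in range(tlow, upper + 1))

def size_by_enumeration(R, tlow, thigh, clow, chigh):
    """Count target sets by explicitly enumerating all combinations."""
    return sum(1 for size in range(1, R + 1)
                 for _ in combinations(range(R), size)
                 if tlow <= size <= thigh and clow <= R - size <= chigh)

# "any" over Circles (R=3, tlow=1, thigh=3, clow=1): both give 6 targets.
```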
3.4 Referring to mutually incompatible sets
Chapter 1 pointed out that across uses, a referring expression’s extension will always fluctuate as
a function of the referential domain. This section focuses on cases where a referring expression’s
intension fluctuates as a function of the referential domain. Such referring expressions are called
indeterminate:
A referring expression is indeterminate in a given context if it gives rise to multiple
intensions.
Because it is impossible to observe the intension directly, we recognize indeterminacy via its
extension: if there is some referential domain such that human subjects interpret a referring
expression in it to denote different extensions, then it is indeterminate.
Indeterminate referring expressions are especially interesting because they give an insight into
human linguistic processing. If the very same surface expression gives rise to different meanings
across uses, it indicates non-linguistically-encoded information is creeping into the linguistic
decoding process. In this chapter, several causes of referential indeterminacy are analyzed, and I
describe how they are represented by AIGRE.
3.4.1 Lexical Ambiguity
Suppose you are a clerk showcasing the Kindles from Fig 1-2 and a customer asks you for:
(3.15) the big one
The problem with the referring expression (3.15) is that it contains lexical ambiguity: in using the word “big,” did the customer intend the sense big1, which modifies the size attribute, or big2, which modifies the hard_drive.size attribute? Although one is much more likely, they are mutually exclusive possibilities: extension(the big one) = {k4} ⊕ {k5}.
If this simple example does not work for you, consider some others. If the two Kindles were replaced with a US nickel and a dime and a 10-year-old asked you to pass her the big one, does “big” refer to value or size? Or consider the story from page 13: what sense of “bank” did Sally intend with “the new bank”: a river bank, a financial institution, or a financial institution broadly construed to include ATMs?
Ambiguity can be found at all levels of linguistic analysis. For automatic speech recognition, acoustic ambiguities from continuous speech make it difficult to identify word boundaries and resolve homophones, words that sound the same but have different meanings and (sometimes) spellings (e.g. “recognize speech” vs. “wreck a nice beach”). In text processing applications like ours, word boundaries and spellings are usually accurate; however, lexical ambiguities still arise often due to homonymy and polysemy. In multi-word utterances, structural ambiguities can arise due to attachment decisions, such as prepositional phrases (e.g. recall the “...boarded the plane with two bags” example) or through issues of scope (e.g., does “the green and blue circles” describe circles that are both green and blue, or two different types of circles?). Most relevant to our investigation of referring expressions are the two forms of lexical ambiguity:
homonymy when identical lexical items carry two or more distinct meanings.
polysemy when identical lexical items have several different but related meanings. This can arise for many reasons (Klepousniotou [2002]), including: metaphorical uses (head to describe the top of the human body, or the top of a beer); the mass/count distinction (turkey meaning a slice of meat, or the animal it came from); and the producer/product distinction (Dali to describe the artist, or a piece of his artwork).
This distinction comes down to whether the same lexical item can take on multiple meanings, or whether there are two different lexical item–meaning pairs whose lexical items happen to be the same. In practice it is difficult to determine whether a lexical item's varying meanings (senses) are truly the same; i.e., many of the cases of homonymy were once cases of polysemy, but eventually the lexical item's meanings grew apart. Furthermore, it is difficult to distinguish the stable contributions from the context-enriched contributions to the intensional meaning, because a lexical item's meanings are often adjusted to meet the constraints of the context (Allen [2011]; Palmer [2000]). Next, I present a computational model that does just that.
Ambiguities are a function of the context set
Observe that the problem of (3.15) disappears if we apply it instead to the Circles domain. This is because there is only one meaningful way “big” could apply to circles: the big one clearly denotes the circle that is the biggest in size, {c3}. Compare a visualization of the interpretation planning graph for the big one in Circles and Kindles side-by-side:
[Plan graph, Circles: state 1 (size 7) —the (cost 0.01)→ state 2 (size 7) —big1(size), standard [70, ∞), cost 1.40→ state 3 (size 1) —one2 (cost 2.55)→ state 6 (size 1).]
Figure 3-9: The interpretation plan for the big one in the Circles domain.
[Plan graph, Kindles: state 1 (size 31) —the (cost 0.01)→ state 2 (size 31), which branches. Top branch: big1(size), standard [9.70, ∞), cost 1.40 → state 3 (size 1) —one2 (cost 2.46)→ state 9 (size 1). Bottom branch: big2(harddrive.size), standard [8, ∞), cost 2.40 → state 6 (size 1) —one2 (cost 3.46)→ state 14 (size 1).]
Figure 3-10: The interpretation plan for the big one in the Kindles domain.
Notice that in the Kindles domain, the top branch corresponds to big1 (size) and the bottom
branch corresponds to big2 (harddrive.size).
Representing lexical items' alternative meanings with non-deterministic actions
Non-deterministic actions provide a convenient way to represent the multiple meanings of lexically ambiguous words. A non-deterministic action is one that is capable of generating more than one successor state.
In traditional planning, non-deterministic actions are used to express actions that have uncertain
effects. For example, the result of a pick-up-block action may be one of three effects: (1) the
desired outcome of the block being in your hand, (2) the block falling on the floor, or (3) the block
not being in your hand for some other reason. Non-deterministic actions allow the planner to come up with hypothetical plans.
Because the linguistic actions discussed here operate entirely in belief-space, they do not affect the world state, and all the information needed to hypothetically evaluate their effects is available without incurring any cost beyond computation.
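A minimal sketch of such a non-deterministic linguistic action follows, using plain dictionaries in place of AIGRE's belief states; the helper big_effects and its attribute-matching rule are hypothetical illustrations, not the system's actual effect generators:

```python
def big_effects(belief, domain):
    """Sketch of a non-deterministic action for 'big': yield one successor
    belief state per size-like attribute found among the domain's entities."""
    attributes = set()
    for entity in domain:
        # hypothetical matching rule: any attribute path ending in 'size'
        attributes.update(a for a in entity if a.endswith('size'))
    for attribute in sorted(attributes):
        successor = dict(belief)           # copy, don't mutate, the input
        successor[attribute] = 'sort-from-max'
        yield successor

kindles = [{'size': 7, 'harddrive.size': 4},
           {'size': 9, 'harddrive.size': 8}]
successors = list(big_effects({}, kindles))  # two mutually exclusive senses
```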
Distinguishing polysemy and homonymy
In the interpretations depicted in Figures 3-9 and 3-10, the Circles domain produced one meaning of “big” and the Kindles domain produced two different meanings. If we were to apply (3.15) to a more information-rich referential domain (e.g. books in a library), there would be many more possible interpretations.
How does this work? First, let us consider a new hypothetical referential domain, Cars, which
contains several automobiles, a1 , a2 . . . a8 . Entities in the referential domain are represented the
same way intensional meanings (belief states) are represented: as an attribute-value matrix whose
values are instances of cell data structures (section 2.4.3). Suppose for example a1 contains this
data:
a1 = [ make:     Kia
       model:    Rio
       color:    gray
       weight:   1093kg
       interior: [ color: beige ] ]
Now suppose there are two different actions in our planning domain: lightweight and lightcolor. Both of these actions produce the same lexical item, "light." But they have an important semantic difference: lightcolor operates on color properties and lightweight operates on weight properties. Because a1 has two different color properties, it has two different meanings for "light" as a color: (1) a light exterior color, and (2) a light interior color. As for weight properties, it has only one, and so lightweight produces only one meaning on account of a1: (1) having a light weight. The fact that lightcolor had multiple meanings I will call polysemy, and the fact that there were two distinct actions with the same surface lexical item, lightcolor and lightweight, I will call homonymy.
To produce these alternative senses of an ambiguous lexical item, AIGRE uses the structure of the entities in the referential domain. Starting at a given position (the top level of the entities' structure), it navigates the entities' properties in breadth-first order—for example, to find all properties whose attribute label matches "light." It then generates an effect function for each of those meanings.
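The breadth-first navigation just described can be sketched in Python. This is a hypothetical illustration, not AIGRE's actual code: the function name find_senses and the dictionary encoding of the a1 entity are my own.

```python
from collections import deque

def find_senses(entity, attribute):
    """Breadth-first search over a nested attribute-value structure,
    collecting every path whose attribute label matches. Each returned
    path stands in for one candidate sense of the lexical item."""
    senses = []
    queue = deque([((), entity)])
    while queue:
        path, node = queue.popleft()
        for attr, value in node.items():
            if attr == attribute:
                senses.append(path + (attr,))
            if isinstance(value, dict):  # descend into nested sub-structures
                queue.append((path + (attr,), value))
    return senses

# The car a1 from the text: two color properties, one weight property.
a1 = {"make": "Kia", "model": "Rio", "color": "gray",
      "weight": "1093kg", "interior": {"color": "beige"}}

color_senses = find_senses(a1, "color")    # exterior and interior color
weight_senses = find_senses(a1, "weight")  # a single sense
```

On this encoding, "light" applied to a1 yields two color senses (polysemy) from lightcolor and one weight sense from the homonymous lightweight.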
3.4. REFERRING TO MUTUALLY INCOMPATIBLE SETS
A note about ambiguity and vagueness
Section 3.2.1 gave a semantics for the case when an extension is a set of sets. This was used to characterize the outcome of communicating uncertain descriptions. Uncertain descriptions are represented using arity constraints that affect the generation of the extension. The arity constraints do not rule out any members of the referential domain, just their combinatoric properties.
Ambiguity and vagueness (discussed next) are different from unspecific meanings. If a speaker uses an ambiguous word, presumably there was one meaning that she intended, although it may have produced different interpretations. Ambiguity and unspecific descriptions can even co-occur. Consider any big one in the Kindles domain: either way the hearer resolves the ambiguity, the result is an unspecific state.
When an ambiguous utterance produces multiple interpretations, the hearer must pick one among the mutually exclusive interpretations. AIGRE keeps ambiguous interpretations separate by representing them as separate states with different intensions (possibly, but not necessarily, with different extensions). Representing ambiguities with different belief states gives a clear way to distinguish unspecific interpretations (when the hearer has a choice over multiple targets) from the other mutually exclusive targets (choices that were artifacts of the interpretation process): if two candidate target sets belong to the same belief state, then they are the result of an unspecific extension; whereas, if they are in different belief states, then they are mutually exclusive.
Using different belief states to represent disjoint alternative interpretations makes our representation more expressive, but pushes it into the doubly-exponential Boolean extensional complexity
class (2.4.2). One representational way to mitigate this complexity is to not represent these alternative interpretations until the search algorithm generates them. A secondary process can reason
about the ambiguous interpretations by reading them off the planning graph: more than one node
in the same graph level (column) means there are alternative disjoint intensions.
Observe AIGRE’s two visualizations of an interpretation: a planning graph and the extensional
view:
CHAPTER 3. BUILDING THE LEXICON
Figure 3-11: Two views on the interpretation of the big ones in the Circles domain.
The extensional view combines the extensions of each column of the state graph (excluding the initial state) and computes their relative likelihoods from the plan's cumulative cost.
3.4.2 Vagueness and Gradability
Another threat to recovering the speaker's intended meaning from her utterance is vagueness. The term "vagueness" itself is lexically ambiguous. Linguists and laypeople typically use it to mean (autologically): insufficiently informative for the current purposes (Elbourne [2011]). An example of vagueness1 (insufficient information) is:

(3.16) Let's meet for dinner at a restaurant

because, in order for people to meet, a definite place needs to be established. This meaning is unspecific, and thus handled by the "set of sets" representation implicitly built into AIGRE's belief states, described in 3.2.1.
The second sense, vagueness2 (borderline cases), is the one I am interested in discussing here.
This sense of vagueness is best known to philosophers of language, and connotes something more
specific: predicates with unclear extensions—i.e. extensions containing borderline cases (Graff [2002]).
In the rest of this thesis, whenever the term “vagueness" is used, this second sense is intended.
This kind of vagueness commonly arises from lexical items whose meanings can be placed on a
scale; and for referring expressions, a common culprit is gradable adjectives (or scalar adjectives)
such as ‘small,’ ‘heavy,’ and ‘expensive.’
(3.17) the big kindles
(3.18) the light kindles
(3.19) the cheap kindles
Figure 3-12: Another example of vagueness2 (borderline cases). Which circles are big?
How are the meanings of gradable adjectives represented? This question has a deep philosophical history,² and is rooted in the problem of determining a cut-off point—a standard of comparison for the applicability of a given gradable term. Values that fall in the middle of the graded domain can lead to uncertainty. For example, "expensive restaurants" may definitely exclude restaurants whose average meal costs $10 or less, definitely include those whose average meal price is more than $40, but lead to uncertainty for restaurants whose average price is in between.
Referents that are "in between" are called borderline cases, and in referring expressions these create vagueness2 that can cause reference failures (van Deemter [2010]). To succeed at interpreting a vague gradable adjective, you must pick (a) a comparison class that defines the set of restaurants relevant to your comparison, (b) a clustering coefficient for grouping the values—allowing you to be "tolerant" of minor differences, and (c) a standard of comparison that delineates expensive(xi) from ¬expensive(xi−1) for the ordered restaurants' prices x in the comparison class. In this thesis, I focus on modeling (c), picking the standard of comparison. Due to the restricted reference task described in 1.5.2, the comparison class (a) can be assumed to be the referents that are consistent with the current interpretation; and, for the simple referential domains in this thesis, the issue of tolerance (b) will not play a significant role. Depending on how the standard is set, the hearer may arrive at different interpretations.
How is the standard of comparison for gradable meanings chosen? The following observations lead to a solution:

• Observation 1. If all borderline cases are potentially valid cut-off points, then the hypothesis space must represent them all.

• Observation 2. Gradable adjectives' meanings are ordered: $100 will always be a better fit for "expensive" than $99.

• Observation 3. The incremental property limits the number of cases.
²The foundation of this philosophical question is the Sorites Paradox: the contradiction that arises if one believes all three of these things: (1) one brick is not a heap: ¬heap(1); (2) a hundred bricks is a heap: heap(100); (3) if n bricks aren't enough to form a "heap," then n + 1 bricks aren't either: ∀n ¬heap(n) → ¬heap(n + 1).
• Solution. During REG, a search for a valid hypothesis needs to be able to consider them all (Observation 1), but should start with the best option (Observation 2), using only values relevant to the elements in the extension of the current interpretation (Observation 3). Then, given AIGRE's search strategy (discussed next in Chapter 4), it should avoid even generating the alternative hypotheses until there are other reasons to do so. In short, structure the search space so that the most common meanings are considered first.
The gradable adjective looks at the relevant values of the attribute for the entities in the current extension. The attribute's values are grouped and sorted. Then effects are generated in order, starting from the strongest meaning and progressing to the weakest. For example, using the 8 circles in Figure 3-12 to interpret the big ones, AIGRE generates the planning graph of Figure 3-13. Its antonym, "small," sorts the values and effects in the other direction, and some adjectives, like "middle" and "center," start from the middle.
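A minimal sketch of this ordering, assuming numeric attribute values and invented circle sizes. The name gradable_effects is hypothetical; each yielded predicate stands in for one of AIGRE's effect functions, with its standard of comparison bound when the effect is generated.

```python
def gradable_effects(values, descending=True):
    """For a gradable adjective, yield one candidate standard of comparison
    per distinct attribute value, ordered from the strongest reading
    (e.g. only the largest circles are 'big') to the weakest."""
    distinct = sorted(set(values), reverse=descending)
    for standard in distinct:
        if descending:                        # 'big': the standard is a lower bound
            yield lambda v, s=standard: v >= s
        else:                                 # 'small': the standard is an upper bound
            yield lambda v, s=standard: v <= s

sizes = [49, 7, 7, 31, 1, 1]                  # made-up circle sizes
extensions = [[v for v in sizes if effect(v)]
              for effect in gradable_effects(sizes)]
# extensions[0] keeps only the largest circle; each weaker reading admits more,
# until the weakest admits every circle.
```

Passing descending=False gives the antonym's ordering, as the text describes for "small."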
Figure 3-13: A demonstration of the problem of gradable adjectives during the interpretation of
the big ones with respect to the 8 circles of figure 3-12.
Notice that the 8-circle referential domain does not give rise to any ambiguity in the meaning of the word "big." All of the divergent interpretations are due to the non-deterministic treatment of vagueness.
AIGRE represents vague modifiers in a way similar to ambiguous ones. Each vague lexical unit creates multiple distinct effects, and the possibilities are generated as a function of the elements in the context set at the time the lexical unit is used in the interpretation.
Gradable adjectives are relational

Gradable adjectives like "big" impose a relational constraint between ordered values of an attribute, and can only apply when the attribute's values vary between entities in the target set and its distractors (Sedivy et al. [1999]). AIGRE models this as a precondition for actions that implement gradable adjectives. This is especially relevant to their related form, the comparative, in which the values are grouped into two classes.
Gradable adjectives can be made more precise using superlatives

A gradable adjective can be made more precise by using its superlative form. Morphologically, this typically involves adding "-est" to the adjective's base form or, if it is longer than two syllables, adding the word "most" before it.
Superlatives can be modified by ordinals such as “second" and “fifth". Just as adjectives modify
nouns, ordinals modify superlatives. In AIGRE, superlatives are modeled using a “skip” variable
that is stored as a temporary meta-data attribute of the belief state, and removed after it is used.
For example, the second would set the skip variable to 1 and then the gradable adjective would
skip over 1 effect, while the action is generating the effect functions in order from most to least
likely.
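The skip-variable mechanics can be illustrated with a small sketch; the function name superlative and the sizes are invented for illustration, and the skip argument plays the role of the temporary meta-data attribute described above.

```python
def superlative(values, skip=0):
    """Pick the referent of a superlative like 'biggest'. An ordinal such as
    'second' sets a skip variable, consumed here by skipping that many of the
    strongest candidate effects (hypothetical sketch of the mechanism)."""
    ordered = sorted(values, reverse=True)  # strongest candidate first
    return ordered[skip]

sizes = [49, 7, 31, 1]
biggest = superlative(sizes)                 # "the biggest"
second_biggest = superlative(sizes, skip=1)  # "the second biggest"
```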
Gradable adjectives lead to borderline cases only when the noun they modify is plural

It is worth noting that if there is (a) only one gradable modifier in its positive/base form and (b) the noun it modifies is singular, then its meaning is equivalent to the superlative form.
3.5 Ellipsis and the problem of missing words
Consider the following examples:
(3.20) the biggest
(3.21) the biggest green shape
(3.22) the second
(3.23) the second circle
(3.20) is missing a noun, and in (3.22) and (3.23), the ordinal "second" appears without a gradable adjective. I take these to be instances of ellipsis: when a lexical entry's meaning is present but its surface lexical item is omitted (and the sentence would still be valid if the lexical item were filled in). These expressions should be interpreted as:

(3.24) the biggest [oneNN]

(3.25) the second [leftmostJJS] [oneNN]

(3.26) the second [leftmostJJS] circle
During REI, AIGRE treats ellipsis as the problem of inferring missing actions—interleaving the partially³ observed actions of the speaker with inferred actions of the hearer (Benotti [2010]; Hobbs et al. [1993]). For REG, ellipsis means that the speaker can decide to elide some surface forms under certain conditions—such as when the listener can be expected to infer them.
3.5.1 Representing syntactic state in the intension
How can we tell when a word has been omitted? This is a function of the syntactic state, which
AIGRE encodes within the intensional description, as a property part_of_speech of the belief state.
b = [ target_arity:    [0, ∞)
      contrast_arity:  [0, ∞)
      target:          []
      distractor:      []
      part_of_speech:  NP ]
Figure 3-14: A complete description of AIGRE’s belief states.
The initial, empty belief state has a dummy starting part of speech, NP, and all goal states must be a noun state, NN. The part_of_speech property is used by the action proposal function (section 4.2.2) to restrict the relevant actions to those whose part of speech is a valid next transition. For example, NP allows any type of action, while the ordinal, ORD, must be followed by a superlative adjective, JJS.
3.5.2 Assuming default actions when needed
Syntax constraints are handled entirely by the search procedure, specifically the action proposal function responsible for filtering the actions depending on the current state. During REI, after considering all actions, if there is no valid transition to explain the rest of the observed text (e.g. what remains of the referring expression), then certain default actions can be assumed at a cost. For example, the language model forbids the ORD→NN transition, and the goal test function requires that all noun phrases terminate with a noun. Consequently, the second is interpreted as theDT secondORD [leftmostJJS] [oneNN], assuming the default actions leftmostJJS and oneNN.

³The actions are not fully observed because of ellipsis and, as we have seen with vagueness and ambiguity, different senses of a word can produce the same surface form of the lexical unit.
3.6 Conclusion
This chapter introduced representational solutions to several linguistic phenomena. Lexical ambiguity and gradable meanings are modeled using non-deterministic actions, whose meanings come from properties of elements in the belief state's current extension. Vagueness and ambiguity were expressed using non-deterministic actions, and the approach was to express all of their possible meanings and only their possible meanings, given the words that came before them. Still, the lexical entry had to be capable of representing all possible meanings of an indeterminate referring expression, however unlikely. As a result, the effects for vague and ambiguous actions proliferate: if the adjective big has s senses, and there are r referents consistent with the belief state, then it can yield as many as s × r effect functions. The next chapter describes ways to navigate a search space with such a high branching factor.
To describe the transition between chapters more pithily: Chapter 3's goal was to design an intensional description, using a planning-based lexicon, capable of covering the entire space of possible extensional meanings. The next chapter discusses ways to find solutions without considering all possibilities.
Chapter 4
Controlling Search
The previous chapter was about representing all of the relevant decision points. This chapter is concerned with making the right decisions as quickly as possible.

First, section 4.1 gives a sketch of the general search framework underlying both the REG and REI tasks. This framework is general enough to support a large variety of heuristic search methods, simply by changing its components. Second, in section 4.2, the task-specific constraints that influence the search components are described.
4.1 AIGRE’s search framework
As is common for planning problems, AIGRE is based on heuristic search. In heuristic search planning (Bonet and Geffner [2001]), states correspond to (beliefs about) world states, and actions represent hypothetical changes to the world states. In our formulation (table 2.1), states correspond to referring expressions' intensional meanings, and actions correspond to linguistic decisions like word choice. The heuristic function, h(s) → [0, 1], estimates a state's distance to a goal state and guides the search algorithm toward states with a lower estimated distance to a goal.
Search processes are recursive, and in the basic search model, the same procedure is applied to each node:

Basic search framework: First, the base case of the recursive process is checked: does the goal-test function consider the current node a goal node? If so, it is returned. If not, the node is added to a hash table (to record that it has been visited), and its successor states are created—by going through a list of the actions and applying each action to the current node. After creating a successor, a labeled directed edge is added that links the current node to its successor. The successor node's score is computed by the heuristic function and it is added to a global ranking called the fringe. Then, the search algorithm uses the global ranking to decide which node to visit next. And the process continues...
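The basic framework can be rendered as a short best-first loop. This is an illustrative sketch, not AIGRE's implementation; the toy integer domain and the function names are mine.

```python
import heapq

def basic_search(start, goal_test, actions, heuristic):
    """Basic search framework: expand the best node on the fringe, apply
    every action to generate successors, rank them by the heuristic."""
    fringe = [(heuristic(start), start)]   # global ranking of nodes
    visited = set()                        # hash table of visited states
    edges = []                             # labeled directed edges of the graph
    while fringe:
        _, node = heapq.heappop(fringe)    # ranking decides which node is next
        if goal_test(node):                # base case: goal found
            return node, edges
        if node in visited:
            continue
        visited.add(node)
        for action in actions:             # apply each action to the node
            for successor in action(node):
                edges.append((node, action.__name__, successor))
                heapq.heappush(fringe, (heuristic(successor), successor))
    return None, edges

# Toy domain: states are integers, the goal is 5.
def increment(n):
    return [n + 1]

goal, edges = basic_search(0, lambda n: n == 5, [increment],
                           lambda n: abs(5 - n))
```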
AIGRE introduces a few changes to this basic search model.
First, I made significant modifications to the traditional planning-as-heuristic-search framework by changing the role of actions. Traditionally, actions generate successor states; in AIGRE's model, however, actions generate effect functions that modify states. It is up to the controller to decide whether to apply a given effect function to a state, and when to do so. Separating effect generation from action generation is beneficial for a number of reasons:

• Copying states is computationally costly, while generating effect functions is cheap.

• It creates a new layer of decision making, and a new opportunity to optimize the search.

• Effects can be deferred—applied to a node's farther descendants rather than only its immediate successor.

• Effects can estimate the cost of their computational side-effects, and communicate this cost to the intermediate decision maker before the effect is actually executed.

• It creates two different times and scopes for variable binding: the effect function can contain variables that are bound to properties of the parent node when it is defined, or to the child state when it is called.
Example of actions generating effect functions

Actions in AIGRE receive a belief state as input and lazily generate zero or more effect functions as output, depending on the contents of the belief state. An action that does not yield any effects is analogous to a traditional action whose preconditions are not satisfied. Unlike in traditional domains, an action's behavior is opaque until it is explicitly applied to a belief state.

Given a parent state, an action in AIGRE will yield ⟨effect-function, estimated-cost⟩ tuples. This deferred binding approach allows the search process to consider both action costs and effect costs before it actually generates a successor state.
For example, let's say we have the noun bachelorNN and a belief state, b, that is initially empty. Given the current belief state, the action bachelorNN will return zero or more effect functions and their estimated costs:

bachelorNN(b) → ⟨e1, 0.021⟩, ⟨e2, 0.138⟩, . . .

Then effect e1 is applied to the belief state b:

e1(b) = 0.22

And the belief state has now been updated to contain the intensional contribution from the first effect (i.e. sense) of the action bachelor:
b = [ target_arity:    [0, ∞)
      contrast_arity:  [0, ∞)
      target:          [ type:    human
                         gender:  male
                         married: False ]
      distractor:      []
      part_of_speech:  NN ]
Applying the effect to the belief state, e1(b), did two things: (1) it updated the state to include a definition of bachelor in its intensional description (stored in the target property), and (2) it returned the effect's actual cost of 0.22.
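The lazy ⟨effect-function, estimated-cost⟩ protocol can be sketched as a Python generator. The dictionary belief state and the numbers mirror the bachelor example above, but the code itself is a hypothetical illustration, not AIGRE's API.

```python
def bachelor_action(belief_state):
    """An action lazily yields (effect-function, estimated-cost) pairs.
    Each effect, when the controller chooses to run it, mutates the belief
    state and returns its actual cost (illustrative sketch)."""
    def effect_1(state):
        # Intensional contribution of the first sense of "bachelor".
        state["target"] = {"type": "human", "gender": "male", "married": False}
        state["part_of_speech"] = "NN"
        return 0.22                  # actual cost, known only on execution
    yield effect_1, 0.021            # estimated cost, known up front

b = {"target": [], "part_of_speech": "NP"}
(effect, estimated_cost), = bachelor_action(b)  # generating effects is cheap...
actual_cost = effect(b)                         # ...mutating the state is deferred
```

The split between the cheap generator call and the deferred effect call is what lets the controller weigh estimated costs before committing to a successor state.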
The second way in which AIGRE differs from the basic search framework is by using local search strategies that allow early commitment to a promising successor before all of the nodes have been expanded. This consequently loses all assurance of completeness, and therefore of optimality. However, it gains in that the search is fast, non-deterministic, and occasionally over-generates—a desirable property in REG approaches that seek psychological plausibility (van Deemter et al. [2011b]).
Third, action sequences are constrained by a language model because the order of the plan’s
surface text corresponds 1:1 with the action operators.
I have extended the basic search framework so that these three additional properties can be added without compromising the framework's ability to perform basic search procedures like best-first and A*. The search framework looks like this (bold represents differences):
AIGRE search framework: First, the base case of the recursive process is checked: does the goal-test function consider the current node a goal node? If so, it is yielded. If not, the node is added to a hash table (to record that it has been visited), and its successor states are created—by going through the actions proposed by the action proposal function and applying each action to the current node, which yields effect functions (and their estimated costs). Effect functions are sorted by the effect sorting function. Then, each is applied to a copy of the current node to yield a successor. After creating a successor, a labeled directed edge is added that links the current node to its successor. The successor node's score is computed by the heuristic function and it is added to a global ranking. If the commit-test function returns true, the current node and its successor are added to the fringe and the search algorithm calls the get-next-node function. And the process continues...
The AIGRE framework is general enough that it subsumes the basic search framework, and can still
implement a variety of standard heuristic search techniques (e.g. best-first and A*) by changing its
various component functions. These component functions are:
goal-test function(state) → T ∨ F: determines whether a goal has been found. This is used for returning a found plan and possibly terminating search.

heuristic function(state) → [0, 1]: estimates a state's distance to the goal. This function is used to rank search nodes and determine the best one to apply the search process to next.

action proposal function(state) → action1, action2 . . . actioni: returns a filtered, ordered list of actions that are relevant to the current state. They are then applied to the current state to yield effects.

effect sorting function(list of effects) → effect1, effect2 . . . effectk: re-ranks the effects according to some criterion.

commit-test function(state, successor) → T ∨ F: determines whether to commit early to the current successor and stop expanding the rest of the node's actions and effects.

get-next-node function() → node: decides which node the search process visits next.
4.2 Defining the search components for both reference tasks
From a high level, both REG and REI can be seen as problems of choosing the best sequence of actions that map the initial state onto a goal state. Although both tasks use the same action libraries and belief states, their search processes are subject to very different constraints. For the generation task, the desired semantic content is fixed and the linguistic choices are open; for interpretation, the linguistic contents are relatively fixed and the semantic possibilities are open. I use these differences to create task-specific heuristic, goal-test, action proposal, effect sorting, and commit-test functions. Here is an overview:
goal-test function (4.2.1)
  REI: state is NN or NNS; all of the input referring expression is accounted for.
  REG: state is NN or NNS; all targets are described and no distractors are.

heuristic function (4.2.4)
  REI: the ratio of the input text accounted for.
  REG: the combined targets identified and distractors ruled out (F-measure).

action proposal function (4.2.2)
  REI: filters actions proposed according to syntactic state; ranks actions by similarity to the remaining observation sequence; includes default actions (assumptions) that can be assumed at a cost.
  REG: filters actions proposed according to syntactic state.

effect sorting function (4.2.3)
  REI: minimize expected cost.
  REG: minimize expected cost.

commit-test function (4.2.5)
  REI: N/A.
  REG: when the successor is a sufficient improvement.

get-next-node function (4.2.6)
  REI: takes the node from the global ranking with the lowest heuristic score.
  REG: takes the node from the global ranking with the lowest heuristic score.

Table 4.1: A summary of the search components.
The constraints of the REI action proposal function—the observed referring expression and the language model—jointly constrain the search space enough that the whole of it can be quickly traversed. For REG tasks, on the other hand, the search space is much larger, so local search techniques (e.g., hill-climbing) are important for expediting the search process. For these, the bias is toward finding a goal as quickly as possible without regard for completeness or optimality, and costs are generally ignored.
4.2.1 Goal-test functions
To guide a search process we minimally need a termination condition called a goal-test function.
This function returns True when we have encountered a goal state and False otherwise.
For REI, a goal state is one in which all observations have been accounted for and the belief state's part of speech is a noun. For REG, a goal state is one in which only the targets are described (i.e. its heuristic, Equation 4.1, returns 0) and the belief state's part of speech is a noun.
Both tasks’ goal-test functions impose a syntactic constraint: the requirement that plans terminate
in a noun state. This all-or-nothing constraint, along with the language model in the action
proposal function, forces the resulting referring expression to be grammatical.
4.2.2 Action proposal functions
The action proposal function does not return every action in the lexicon; it is passed the current state and is constrained in several ways. Constraints fall into one of two categories: hard constraints are used to rule out actions from consideration altogether, and soft constraints are used to rank the actions and give priority to some over others.
REG and REI both share a hard constraint from syntax, and REI has one additional hard constraint in operation. The syntactic constraint works by passing in the current node, whose state's part_of_speech property gives the syntactic category of the last action that changed it. Actions are proposed only if they are consistent with a language model that describes valid transitions between syntactic categories. AIGRE implements a simple language model expressed as a regular language: NP DT? CD? (ORD? JJS)* JJ* (NN|NNS)+.
For example, if the current node's belief state has JJ for its part_of_speech property, then the language model within the action proposal function will restrict the relevant actions to lexical items whose part of speech belongs to: JJ, NN, or NNS.
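Since the language model is a regular language over part-of-speech tags, it can be checked with an ordinary regular-expression engine. A sketch, assuming tags are encoded as space-separated tokens (my encoding, not AIGRE's):

```python
import re

# The language model from the text: NP DT? CD? (ORD? JJS)* JJ* (NN|NNS)+,
# transcribed over space-separated tags ("NNS?" covers both NN and NNS).
LANGUAGE_MODEL = re.compile(r"NP( DT)?( CD)?(( ORD)? JJS)*( JJ)*( NNS?)+$")

def grammatical(tags):
    """True if the dummy start NP followed by the tag sequence is accepted."""
    return LANGUAGE_MODEL.match("NP " + " ".join(tags)) is not None

assert grammatical(["DT", "ORD", "JJS", "NN"])  # "the second biggest one"
assert not grammatical(["DT", "ORD", "NN"])     # ORD must be followed by JJS
```

This also reproduces the behavior described in section 3.5.2: the ORD→NN transition fails, which is what forces the default JJS action to be assumed.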
For REI tasks, in addition to this language model, an additional strong constraint is imposed by the action proposal function: AIGRE proposes only those actions whose lexical units can produce the text that appears in the remaining observation sequence. Because lexical items can be prefixes of each other, this can result in an ambiguity—where two actions of different lengths are proposed.
4.2.3 Effect sorting functions
When an action generates an effect, it also produces an estimate of the effect’s computational cost
(recall the "bachelor" example from 4.1).
A previous implementation of AIGRE relied on the action functions to produce effect functions in order of the commonality of their meanings. By using an expected-cost variable instead, the action is no longer responsible for producing its effects in a particular order. It also allows effects from different actions to be compared with one another, provided they use a commensurable numeric ranking.
4.2.4 Heuristic functions
For REG, the heuristic function characterizes its communicational objective: to describe the target(s) and none of the distractors. For this we use the F1 score (F-measure) from information retrieval, because it rewards inclusion of targets (recall) and penalizes inclusion of distractors (precision). Given a belief state, b, and the intended target set, ṫ:
HREG = { max F1(s, t) ∀s ∈ b,   if t ∈ b
       { 1,                     if t ∉ b          (4.1)
This heuristic iterates over each target set, t, in a belief state to find the biggest set difference according to the F1 score. By taking the worst possible score of any target, it is always greater than or equal to the true distance, meaning the heuristic is admissible. However, the search space of a REG problem is typically too large for an optimal search algorithm (like A*) to find solutions.
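A sketch of an F1-based distance of this kind, where 0 means some candidate target set matches the intended referents exactly. The exact functional form (1 − F1, minimized over the belief state's candidate target sets) is my interpretation of Equation 4.1, not the thesis's definitive formula.

```python
def f1(candidate, intended):
    """F1 score between a candidate target set and the intended one."""
    if not candidate or not intended:
        return 0.0
    tp = len(candidate & intended)            # true positives
    precision = tp / len(candidate)           # penalizes included distractors
    recall = tp / len(intended)               # rewards included targets
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def h_reg(candidate_target_sets, intended):
    """Distance-to-goal estimate for REG: 0 when some candidate set is a
    perfect description, 1 when no candidate overlaps the targets at all."""
    if not candidate_target_sets:
        return 1.0
    return min(1.0 - f1(t, intended) for t in candidate_target_sets)
```

For example, with intended referents {c1, c2}, a belief state whose only candidate set is {c1, c2} scores 0 (a goal), while one whose candidate set is a pure distractor {c3} scores 1.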
For REI, the action proposal function is so restrictive that we can generate and test the entire search space; therefore, no heuristic is necessary. However, because of the heterogeneous lexicon, there may be ties to break between two lexical items, and we will want to go with the longest. Therefore, we want a function that prefers states based on the amount of the input observation sequence they have accounted for. The heuristic function for interpretation is simply:
HREI(s) = 1 − (length of plan's surface text) / (length of observation sequence)

4.2.5 Commit-test function
In standard search algorithms for the basic search framework, all actions are considered whenever a node is being expanded. In local search algorithms (e.g. stochastic hill-climbing), the first promising successor state is immediately visited, without its parent being fully expanded. The commit-test function is only used by local search algorithms:
HREG(parent) − HREG(child) ≥ θ          (4.2)
In the experiments described in this thesis, θ has been set to 0.2. In future work, optimal values for this parameter could be learned from data.
4.2.6 Get-next-node function
For all search algorithms other than stochastic search with backtracking, this simply picks
the node from the fringe that has the lowest estimated distance to the goal (heuristic score).
4.3 Weighing decision factors
This section reviews the factors that are relevant to decisions throughout the search process, and how they are quantified and used to rank alternative decisions. Some represent costs, others represent benefits, which can be standardized via inversion.

I group them into those that are used to guide the search process (e.g. in the heuristic function, action proposal function, or effect sorting function; section 4.3.1) and those that are used to compare complete plans (section 4.3.3).
4.3.1 Using costs for guiding search
In a critique of Hobbs et al. [1993]'s interpretation-by-abduction framework, Norvig and
Wilensky [1990] complained that there was no principled way to separate or aggregate the costs of
the system: the framework overloaded several disparate factors, including inferential cost, resolution
appropriateness costs, and word sense costs, into a single cost. In AIGRE's framework, costs and
benefits are assigned to several factors within the search process:
1. lexical cost: An action's lexical item has a cost that gives preference to more frequently
used forms. It is estimated using the item's inverse lexical frequency in the Open American
National Corpus (Ide and Macleod [2001]).

2. pre-effect cost: Effects yielded by actions come with an estimated cost.

3. post-effect cost: Effect functions, when executed, contain their actual effect cost.

These can be set in various ways and used to represent different meanings. In AIGRE, (1) is used
to bias the search toward more common lexical items, (2) is used to sort the effects by their
more likely meanings and to avoid computationally expensive effects, and (3) conveys the true cost
of using a particular meaning. Only costs (1) and (2) go into the action's overall cost.
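As an illustrative sketch (the frequency table and its values are hypothetical, not the real OANC counts), costs (1) and (2) might be combined like this:

```python
# Hypothetical relative corpus frequencies (illustrative, not real data).
LEXICAL_FREQUENCY = {"ones": 0.020, "circles": 0.002, "spheres": 0.0005}

def lexical_cost(word):
    """Cost (1): inverse lexical frequency, so common words are cheaper."""
    return 1.0 / LEXICAL_FREQUENCY[word]

def action_cost(word, pre_effect_cost):
    """Only costs (1) and (2) enter an action's overall cost; the
    post-effect cost (3) is known only once the effect executes."""
    return lexical_cost(word) + pre_effect_cost
```

Under this scheme the frequent word "ones" is always cheaper than the rarer "spheres," which is exactly the bias discussed above.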
4.3.2 Using benefits for guiding search
The most commonly used metric to guide search in the REG literature is the discriminatory
power of an action (or, in AIGRE's case, of an action's effect). This measures the number of distractors
ruled out by a particular action. One way to compute this is to apply the effect and take
the size difference between the parent and child states:

discriminatory-power(Parent, Child) = |Parent| − |Child|    (4.3)
Discriminatory power is analogous to precision in information retrieval: it maximizes true
negatives (thereby minimizing false positives) at the cost of true positives (thereby risking false
negatives). For example, a word with great discriminatory power may rule out all of the distractors,
and the target set too. A better way to balance this trade-off is to use the F-measure from
information retrieval:
f-measure(Node, Targetset) = 2 × (precision × recall) / (precision + recall)    (4.4)

4.3.3 Using costs for comparing plans
Given a plan, Plan, its cost is defined as the sum of its actions' costs:

plan-cost(Plan) = Σ_{Action ∈ Plan} cost(Action)    (4.5)
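The guidance metrics and the plan cost above can be written down directly. A minimal sketch (set-valued states; the `Action` container with a `cost` attribute is illustrative):

```python
from types import SimpleNamespace

def discriminatory_power(parent, child):
    """Eq. 4.3: how many referents an effect rules out."""
    return len(parent) - len(child)

def f_measure(node, target_set):
    """Eq. 4.4: balances ruling out distractors (precision) against
    keeping the targets (recall)."""
    hits = len(node & target_set)
    if hits == 0:
        return 0.0
    precision = hits / len(node)
    recall = hits / len(target_set)
    return 2 * precision * recall / (precision + recall)

def plan_cost(plan):
    """Eq. 4.5: a plan's cost is the sum of its actions' costs."""
    return sum(action.cost for action in plan)
```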
4.4 Search Strategies

In what follows, four search approaches are used: A* search, Best-first, Stochastic Hillclimbing, and Stochastic Hillclimbing with Backtracking. Each inherits the basic properties
from its respective task in the AIGRE search framework, but with the following differences:
Component             A*
heuristic function    Adds the plan-cost to the heuristic, e.g. HREG or HREI.
commit-test function  N/A, because all actions are always considered.

Table 4.2: A* search's differences from the AIGRE search framework.
Component             Best-first
commit-test function  N/A, because all actions are always considered.

Table 4.3: Best-first search's differences from the AIGRE search framework.
The stochastic hillclimbing approach was an attempt at a search strategy halfway between
standard stochastic search and best-first search. It achieves this by adding a second "fringe," called
the partially-open fringe, that contains nodes whose expansion was terminated early by the
commit-test function.
Component               Stochastic Hillclimbing with Backtracking
get-next-node function  With probability p (a search parameter), pick a node from the open fringe; with probability 1 − p, pick a node from the partially-open fringe, which contains nodes that have not been fully expanded.

Table 4.4: Stochastic Hillclimbing with Backtracking's differences from the AIGRE search framework.
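A minimal sketch of the two-fringe bookkeeping (the node representation and function names are illustrative, not AIGRE's actual API):

```python
import random

def get_next_node(open_fringe, partially_open_fringe, h, p=0.8):
    """Get-next-node for Stochastic Hillclimbing with Backtracking:
    with probability p take the best node from the open fringe;
    otherwise backtrack to a node whose expansion was cut short by
    the commit test. `h` estimates distance to the goal."""
    use_partial = bool(partially_open_fringe) and (
        not open_fringe or random.random() >= p)
    fringe = partially_open_fringe if use_partial else open_fringe
    best = min(fringe, key=h)  # lowest estimated distance to the goal
    fringe.remove(best)
    return best
```

With p = 1.0 this degenerates to best-first behavior over the open fringe; smaller p spends more time backtracking into partially expanded nodes.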
Chapter 5
Evaluation
This chapter describes several evaluations of AIGRE's performance, in both computational terms and
output quality, for both the generation (REG) and interpretation (REI) tasks. Section 5.1 describes how
the data was collected. Then both tasks are evaluated in terms of their coverage. Section 5.2 looks
at how similar the output was between the human data set and the computational model: the
analysis shows that AIGRE produces useful referring expressions, but will require adjusting its
costs in order to better reflect human decisions.
The chapter then moves on to interpretation. Section 5.3 investigates REI's ability to interpret the
referring expressions that were in its vocabulary, with very good results. One finding is
that AIGRE's characterization of gradable adjectives reflected the behavior of the majority of our
subjects in its assumption that the most conservative value for the standard was intended (Section 3.4.2).
The results are followed by a detailed analysis of each error, and suggested next steps are
proposed. The majority of errors came from pragmatic issues, where an underspecified description
receives a more specific meaning.
Algorithmic evaluations are performed in Section 5.4. The speed at which AIGRE generates referring
expressions is compared as the complexity of the task and the size of the lexicon grow. The
results clearly show the superiority of stochastic search techniques, and suggest that optimality
may not be achievable.
The chapter ends with qualitative results: Section 5.5 shows the referring expressions
AIGRE generated for all extensions of Plural extensional complexity in both the Circles and Kindles
referential domains.
5.1 Collecting the Turk Dataset
In order to collect data about how people interpret and produce referring expressions, I developed
an experimental platform. The interface allowed me to rapidly create and deploy experiments
in which human subjects solved tasks of both types: interpretation and generation. For REI,
participants are presented with a set of items and asked to select the ones that were intended by
a particular referring expression. For REG, participants are asked to enter an English description
of a set of indicated objects.
Figure 5-1: An example of what the platform looks like to a human subject solving a generation
(REG) task.
Using Mechanical Turk, I ran a series of experiments to collect data about how humans use
referring expressions. Only data collected from the REG experiments was used to evaluate AIGRE in
this chapter. The REG experiments exhausted the entire 2^R − 1 Plural extensional complexity
class for both domains, yielding 38 total groupings of plural extensions across Kindles and Circles
((2^5 − 1) + (2^3 − 1)).

Why the plural complexity class? For classes of higher complexity, the methodology for collecting
data is not straightforward for generation tasks. It is hard to conceive of a way, other than natural
language, to indicate the set of sets the subject should describe.
Each experiment consisted of 12 REG tasks, randomly chosen from the 38 plural
extensions of either domain. Subjects were allowed to participate in any or all of the four
experiments, and they were compensated 0.25 USD. The HITs (Mechanical Turk vernacular for
"Human Intelligence Task") were restricted to US citizens with a greater than 75% approval history.
The description of the task stated, in English, that it was aimed only at native English speakers.
A total of 604 subjects participated, with most (388) completing at least one REG task in both
domains. On average, subjects completed 7.68 ± 5.2 tasks and spent 3.7 ± 3.6 minutes completing
an experiment. The data is publicly available.[1] Some general aspects of the data are listed in Table 5.1.

                               Circles            Kindles
Unique human subjects          488                504
Mean tasks per subject         7.6                2.19
Total Referring Expressions    1070               3971
Unique Referring Expressions   517                3108
Unique RE from >1 subject      106                215
Total RE from >1 subject       631                762
Accurate and Covered by AIGRE  631                560
Unique Words                   196                891
Most Common Word               "circle" (16.57%)  "the" (11.07%)
Highest Entropy Denotation     {c1, c2, c3}       {k1, k3, k4}
Lowest Entropy Denotation      {c1}               {k4}

Table 5.1: General statistics about the Turk Dataset.
5.2 Coverage evaluation for generation (REG)
Evaluating generation is more challenging than evaluating interpretation because there are several
conflicting notions of what constitutes a good description. In the REG community, there has
been a recent push for empirically evaluating against a corpus of human data (Gatt [2007]).
I have decided to use a simple, but very conservative evaluation question: how much of AIGRE’s
output is the same as the data collected from the human subjects? To answer this, I combined the
output of AIGRE for a REG-Aggregate task with n = 10, and looked at how much this overlapped
with the 1070 total referring expressions collected from people for each plural extension of the
Circles domain.
For example, the results of AIGRE versus the human data on one task (repeated 10 times for
AIGRE), REG-Aggregate(Circles, {c1 , c2 , c3 }, 10), are in Table 5.2:
[1] http://web.media.mit.edu/~dustin/2013-circles_raw.yaml.gz and http://web.media.mit.edu/~dustin/2013-kindles_raw.yaml.gz
Count  AIGRE's RE         plan-cost  |  Count  Human RE
9      the right ones     3.492      |  9      the two largest circles
2      the big ones       3.509      |  5      the two larger circles
2      the large circles  3.512      |  5      two largest circles
2      the large ones     3.509      |  4      the two biggest circles
2      the big circles    3.512      |  3      medium green circle, large blue circle
2      the big balls      3.512      |  3      bigger circles
2      the big dots       3.512      |  2      the larger circles
2      the big spheres    3.512      |  2      medium green circle and big blue circle
2      the large spheres  3.512      |  2      the large green circle and the blue circle
2      the large balls    3.512      |  2      larger circles

Table 5.2: The referring expressions produced by AIGRE (left) and those collected from human
subjects (right) for describing {c1, c2, c3} in Circles. The left table is the first 10 of AIGRE's output
across the 10 trials, and the right table is the 10 most frequent human referring expressions.
Exact string matches did not occur very frequently, in most cases only 1-5 per trial.
Another, less strict measure of coverage is the quality of the referring expressions. This
looks at how far each referring expression AIGRE produced is from the most similar
member of the human corpus, and averages the distances per task:
[Figure 5-2: "REG-Aggregate(Circles, X, 10) Quality." Mean Sørensen–Dice coefficient per target set ({c1}, {c2}, {c3}, {c1, c2}, {c1, c3}, {c2, c3}, {c1, c2, c3}) for Best-first, Stochastic Hillclimbing, and Stochastic Hillclimbing with backtracking.]

Figure 5-2: A measure of the average quality of the produced descriptions. This was derived
by treating each referring expression as a bag of words and scoring each by its closest neighbor
in the human corpus. Perfect matches are 1.0. Closeness is defined by the
Sørensen–Dice coefficient: 2 × |A ∩ B| / (|A| + |B|).
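The quality score behind Figure 5-2 can be sketched as follows (treating each referring expression as a set of word types, as the bag-of-words comparison suggests):

```python
def dice(a, b):
    """Sørensen–Dice coefficient: 2|A ∩ B| / (|A| + |B|).
    A perfect match scores 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def mean_quality(generated, human_corpus):
    """Score each generated RE by its closest human neighbor, then average."""
    best = [max(dice(g.split(), h.split()) for h in human_corpus)
            for g in generated]
    return sum(best) / len(best)
```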
Count  AIGRE's RE         plan-cost  |  Count  Human RE
39     the ones           1.117      |  34     circles
13     the 3 ones         2.191      |  9      the circles
13     the three ones     2.191      |  7      all circles
4      the circles        1.120      |  7      all the circles
3      the three circles  2.207      |  7      all of the circles
3      the 3 circles      2.207      |  5      three circles
3      the three balls    2.207      |  3      the three circles
3      the balls          1.120      |  3      all three circles
3      the spheres        1.120      |  2      a circle
3      the dots           1.120      |  2      the small, medium, and large circles

5.2.1 Analysis of generation errors in coverage
A few observations from comparing AIGRE's output with the humans':

• The human subjects selected a more discriminating noun than was necessary, while AIGRE
frequently ended the referring expression with the low-lexical-cost item "ones." It behaves
this way because it ranks by f-measure, under which both choices are equally undiscriminating, and
by action cost alone, which prefers words that are more frequent. This can be accounted for
by adding a built-in preference for specificity (following Dale and Reiter [1995]) or content
types (Reiter and Dale [1992]).
• AIGRE had no way to conjoin more than one description other than by using the distractor
property, which yielded some awkward expressions, like "the small not $99 one," that
refer to a set of items in terms of its differences. In cases where AIGRE had to resort to such
utterances, most humans used "and" to join two separate noun phrases.
Many of the problems that AIGRE faced in generation can be solved by adjusting the weights of the
actions' costs. My concern, first and foremost, is with AIGRE's ability to represent all possible
meanings that can be conveyed by the lexical items used in the referring expressions, because
AIGRE treats both REG and REI tasks in the same manner: anything AIGRE can interpret,
it can also generate. Although it did not generate much of the desired output in the previous
section, I consider this deficiency to be a matter of setting weights, a natural next step for future
work.
5.3 Coverage evaluation for interpretation (REI)
Using the collected data, I first looked at AIGRE’s ability to accurately interpret the referring
expressions collected from the human subjects.
The data was preprocessed as follows. To reduce the noise, I removed all referring expressions that were contributed by only one person. This brought the Circles domain to 20% of its
original size (106 unique expressions), and the Kindles domain to 6% of its size (215 unique
referring expressions).
Then, after examining the referring expressions, I added several lexical items to AIGRE’s lexicon,
because words that are outside of its lexicon lead to immediate interpretation failures. Importantly,
I did not create any new word classes, add misspelled words, or modify the lexical semantics
approach outlined in Chapter 3.
Because the objects in the Kindle domain have so few differentiating features, many human subjects
found it difficult to refer to them with succinct referring expressions and had to use complex noun
phrases (e.g. "the kindle touch, kindle touch 3g, and kindle dx"). Most of the referring expressions
in the Kindles domain required complex grammatical structures, and AIGRE's model only deals
with individual noun phrases. For a select subset of these that took forms like "X, Y, and Z" and
"W, X, Y and Z", what I call "list form," I created a simple preprocessor that treats them as their
independent (2 or more) component referring expressions (e.g., X; Y; and Z) and afterward combines
their individual interpretations' extensions via the union operation. In this process, I did not
change any of the text in the surface forms, leaving the scope ambiguities in place, so many of
these were not counted among the covered. Eighteen referring expressions in the Circles domain were of the
"list form" and also benefited from this preprocessing step.
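The list-form preprocessing can be sketched in a few lines (the splitting rule and the `interpret` callback stand in for AIGRE's actual preprocessor and REI component):

```python
import re

def split_list_form(expression):
    """Split a "list form" referring expression such as "X, Y, and Z"
    into its component expressions; returns None if it is not a list."""
    parts = [p.strip() for p in
             re.split(r",\s*(?:and\s+)?|\s+and\s+", expression.strip())]
    parts = [p for p in parts if p]
    return parts if len(parts) >= 2 else None

def interpret_list_form(expression, interpret):
    """Interpret each component independently and union the extensions.
    `interpret` maps an RE string to a set of referents."""
    parts = split_list_form(expression)
    if parts is None:
        return None
    extension = set()
    for part in parts:
        extension |= interpret(part)
    return extension
```

Note that, as in the text, no surface text is rewritten: scope ambiguities inside each component are left in place.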
I restricted the study to those referring expressions whose words were covered by AIGRE, and
removed referring expressions that were underspecified. For example, several of the descriptions
were simply "kindles," used to describe any subset of the kindles, not just the case where all the Kindles
were selected. The result was a smaller dataset:
• Kindles data contained 128 unique referring expressions constituting 560 total.
• Circles data contained 94 unique referring expressions constituting 599 total.
                    Circles           Kindles
Total Covered       567/599 (94.66%)  443/560 (80.89%)
Unique REs Covered  86/94 (91.49%)    93/128 (72.66%)

Table 5.3: Coverage on the filtered Turk Dataset.
In the simple geometric Circles domain, AIGRE interpreted the human referring expressions
very well. In the more complex and realistic Kindles domain, it didn’t perform as well. Analyzing
the errors gives us a good insight into how to improve the model.
5.3.1 Ablation analysis for interpretation
After running AIGRE on all referring expressions in the Turk Dataset with a frequency of 2
or greater, I removed some of AIGRE's syntactic classes of words to see which were the most
important in the derivation of meaning. The fractions below are percentages of all of the
referring expressions that were successfully interpreted, and are meant to illustrate the relative
importance of each word class (which varies across domains).
Lexical Class                        Circles  Kindles
Gradable Adj. (Base)                 71.88%   60.87%
Gradable Adj. (Superlative)          29.17%   54.78%
Gradable Adj. (Comparative)          7.29%    15.65%
Crisp Adjectives                     48.96%   57.39%
Quantifiers                          5.21%    20%
Cardinals                            11.46%   28.7%
Negation                             -        13.91%
Subsective Adj. (Gradable or Crisp)  -        -

Table 5.4: Incidence of various lexical classes in successful interpretations.
Table 5.4 shows the importance of the gradable adjective in all of its forms. The most interesting
finding was that the model of gradable adjectives appears to reflect the human data fairly accurately:
the dominant standard of comparison used by the human subjects was the first, most conservative
sense, which is the first effect yielded by the gradable action.
I was surprised to discover that the representational apparatus I created for subsective adjectives
made no positive contribution to interpretation in the REI task, and only slowed
down the REG tasks. I have removed it from AIGRE and moved my discussion of the approach
from Chapter 3 to the conclusion chapter. As a negative finding, it gives insight into the kinds of
syntactic representational apparatus that AIGRE needs in order to extend beyond pre-nominally
modified noun phrases.
5.3.2 Analysis of interpretation errors in coverage
There were 99 referring expressions for which AIGRE was unable to return the correct response,
accounting for 268 errors. I have combined the errors for the two domains and grouped them into semantic
types.
Type                       Occurrences
Bad Grammar                2
Indefinite Errors          9
Scope Ambiguity            12
Post Nominal Modifier      55
Lexical/Semantic Omission  56
Underspecification         145

Table 5.5: Total counts of the types of errors from AIGRE attempting
to interpret all referring expressions with a count greater than 1. The counts sum to 279, not 268, because
some errors belong to multiple error types.
Bad Grammar
This malformed string occurred twice in the dataset:
(5.1) kindle, kindle touch (2)
It was not recognized by the preprocessor as being in "list form" because it used a
comma instead of an "and".
Indefinite errors
The theory of indefinite articles presented in section 3.2.1 came with a strong prediction: that it
would be inappropriate to use the indefinite article “a" when there is only one element that meets
the description. Consequently, AIGRE failed to interpret the following two referring expressions:
(5.2) a large blue circle (4)
(5.3) a small green circle and a large blue circle (5)
These 9 instances were the only cases in the data I analyzed that used an indefinite article. None
of these uses occurred in the Kindles domain; my conjecture is that the geometric objects in the
Circles domain were so ubiquitous that the speakers in these cases did not conceive of them as distinct
entities, but as instances drawn from a broader context set.
Scope Ambiguity
The following referring expressions were recognized to be in “list form" but failed to identify the
intended referents because of distributed/collective ambiguity of scope created by “and":
(5.4) the first and third kindles (2)
78
CHAPTER 5. EVALUATION
(5.5) the least and most expensive kindles (2)
(5.6) the least and two most expensive kindles (2)
Here is an example of why this failed in spite of the preprocessing. Given the referring expression
"small and large circles" with the intended target set {c1, c3}, AIGRE would break it into
two component interpretations: "small" and "large circles". Because of the plural on the latter component, it
would end up with a combined interpretation containing all three circles.
A proper treatment would require understanding that the plurality of "circles" applies to
the combined meanings of the two components on either side of the "and".
Post Nominal Modifiers
There were 55 instances of modifiers coming after the noun. Because AIGRE only handles pre-nominal modifiers, none of these expressions with post-nominal modifiers, such as prepositional phrases, were understood.
(5.7) the circle in the middle (4)
(5.8) the two circles on the right (2)
(5.9) the kindle with the largest screen (3)
(5.10) the kindle with color (2)
(5.11) the kindle with a color screen (2)
(5.12) the kindle that costs $99 (2)
(5.13) the kindles without 3g (2)
Most of them have pre-nominal paraphrases.
Lexical-Semantic Omissions
These referring expressions had words whose meanings would have required extension to the
lexical semantic theory:
(5.14) basic kindle (3)
(5.15) opposites (2)
(5.16) the two outside circles (2)
(5.17) circles of increasing size (2)
(5.18) every other kindle (2)
(5.19) the last kindle (2)
(5.20) the middle priced kindle (3)
(5.21) the three highest priced kindles (2)
(5.22) the black and white kindles (2)
(5.23) largest screen (2)
The first six referring expressions involve group properties of the referent sets. The gradable
adjective "basic" is a combination of several different relational properties (e.g., lacking in technological features). The other examples, such as "every other kindle" and "the two outside circles,"
draw from the speaker's own visual representations of the scene.
For the word “priced," I found it interesting that it was preceded by height modifiers “low,"
“middle" and “high." The word “middle" can have both meanings in operation, either modifying
properties construed vertically or horizontally, as in another common expression (5.24) which can
be interpreted as an elided form of (5.25):
(5.24) the middle kindle
(5.25) the middle [positioned] circle
A simple solution would have been to treat "black and white" as a single lexical unit (a single JJ) whose
meaning is has_color=False; however, this would have precluded its distributed reading.
Underspecification

The leading cause of errors was underspecification (or "vagueness1", insufficient
information): cases where the speaker's description did not uniquely describe the target. There were
17 miscellaneous cases where the information in the descriptions was not sufficient to identify a
single set of referents:
(5.26) a variety of kindle models (2)
(5.27) three different kindles (2)
(5.28) the kindle 3g (2)
(5.29) green circle (3)
(5.30) the 3g kindle (3)
(5.31) circle (5)
The remaining 128 cases belonged to the Kindles domain and were a result of these two problems:
1. One of the kindles, k1, is named "Kindle," which is a sub-string of all the other kindles' names
and is the type of all 5 referents.

2. One of the kindles, k2, is named "Kindle Touch," which is a sub-string of "Kindle Touch
3g" (k3), and "touch" is also an adjective that can be used to describe whether or not a Kindle
has a touch screen (k1, k2, k3 and k5 do).
One hypothesis is that these can be resolved through pragmatic inferences of the kind discussed in
Chapter 1. The referring expression "the kindle touch" (9) can have an unspecific reading; however,
the definite article makes the speaker's intention to communicate a specific target overt, thereby
justifying a pragmatic inference.

What motivates the inference process? One possibility is an argument from the Brevity sub-maxim
of Grice's framework: if the speaker had intended k3, "the Kindle Touch 3g," she would have had to
say more; therefore, she probably intended k2.
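A minimal sketch of this Brevity-based inference over substring-related product names (the resolution rule and the `names` table are illustrative, not AIGRE's implementation):

```python
def resolve_by_brevity(description, names):
    """Among referents whose name contains the description, prefer the
    one with the shortest name: if the speaker had meant a referent
    with a longer name, she would have had to say more.
    `names` maps referent ids to product names."""
    desc = description.lower()
    if desc.startswith("the "):
        desc = desc[4:]
    matches = {ref: name for ref, name in names.items()
               if desc in name.lower()}
    if not matches:
        return None
    return min(matches, key=lambda ref: len(matches[ref]))
```

On the Kindles naming scheme described above, this rule picks the plain "Kindle Touch" for "the kindle touch" and the plain "Kindle" for "the kindle."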
If we can resolve the general problem underlying these two issues, the results for AIGRE’s accuracy
on the Kindle domain will change to:
                    Kindles
Total Covered       540/560 (97.84%)
Unique REs Covered  112/128 (91.18%)

5.4 Computational evaluations of REG performance
In this section, we evaluate AIGRE's computational performance on the REG task by measuring
how computation time changes with both task complexity and lexicon size.
• Task Complexity: A measure of performance as the number of distractors and their
similarity to the target set increases.
• Lexicon Size: A measure of performance as the number of lexical entries increases.
5.4.1 Evaluating task complexity
Following the (p,d) task evaluation metric discussed by Koller and Petrick [2011], I evaluated
the performance of AIGRE as the REG task became increasingly difficult. The difficulty for the
generation task is a function of the semantic distinctions that can be used to rule out the distractors,
and the expressive resources available to describe them.
The (p,d) task systematically explores the task complexity by varying the number of distractors (d) and the number of properties (p) that are required to distinguish the target from the
distractors.
For example, (3, 2) means there are 3 distractors and 2 properties are needed to distinguish the
target from the distractors. Each task also has a single target, so with 3 distractors there are
really 4 elements in the referential domain. p = 2 means that two
properties must be encoded for the target to be distinguished from the distractors. These properties are available
as adjectives, so the resulting referring expressions take the form: DT JJ1 JJ2 . . . JJp NN.
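A (p, d) task instance can be constructed mechanically. This sketch (my own construction, not necessarily the one used in Koller and Petrick's benchmark) represents referents as boolean property vectors:

```python
def make_pd_task(p, d):
    """Construct a (p, d) REG task instance: one target plus d
    distractors over p boolean properties. Each distractor flips one
    of the target's properties, cycling through them, so that (when
    d >= p) all p properties are needed to rule every distractor out."""
    target = tuple([True] * p)
    distractors = []
    for i in range(d):
        props = [True] * p
        props[i % p] = False  # differ from the target in one property
        distractors.append(tuple(props))
    return target, distractors
```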
[Figure: "Comparison of search methods for generating reference." Average search time in seconds over 50 trials for Stochastic Hillclimbing, A* Search, and Best-first search, plotted against (# distractors, # of discriminating properties required) from (1,1) through (10,1).]
I repeated each task 50 times, but stopped A* search after a certain point, when it was clearly
floundering. A* guarantees optimality, but finding the minimal description is
essentially the set covering problem, which is known to be NP-complete. The greedy search
approaches were the clear winners, and were able to solve the (10,10) task in under 2 seconds.

These results do not rule out A* as a viable candidate: a referring
expression with 4 or more adjectives would be highly unnatural, so A* can still apply to the
majority of AIGRE's reference tasks.
5.4.2 Evaluating lexicon size
If models of REG and REI stand any chance of communicating naturally with people, they will
need to use a larger lexicon. The lexical resources FrameNet and WordNet currently contain
20,000+ adjectives. Consequently, a natural measure for a computational model is how it
scales as a function of the size of the lexicon.

[Figure: "Lexicon Size versus Search Time." Search time in seconds as the number of actions grows from 0 to 100, for A* (Generation), Best-first (Generation), Stochastic Hillclimbing (Generation), and A* (Interpretation).]
The results show that stochastic hillclimbing is the clear leader with respect to lexicon size, and
that lexicon size does not influence REI, because its action proposal function is doing all the work.
Lastly, the hidden costs of search approaches that expand all successors, such as best-first, come
into view under this evaluation metric.
5.5 Qualitative evaluations of REG output
This section shows the generative output of AIGRE for all REG tasks in the Circles and Kindles
domains. They were all performed using the default search approach, Stochastic Hillclimbing,
on the REG-Aggregate task run for n = 10 iterations. First, here is every extension in the Circles
domain:
Target Set    Mean Time (sec)  Referring Expressions (and their plan-cost)
{c1}          2.01 ± 5.3       the small one (2.5), the left one (2.5), the smaller one (2.5), the small ball (2.5), the small circle (2.5), the small dot (2.5), the small sphere (2.5)
{c2}          0.35 ± 0.1       the green right one (3.5), the green big one (3.6), the right small one (4.4), the small right one (4.4), the right tiny one (4.4), the left big one (4.5), the right green one (4.5), the big green one (4.6)
{c3}          1.86 ± 4.3       the right one (2.2), the big one (2.3), the large one (2.3), the larger one (2.3), the bigger one (2.3)
{c1, c2}      0.63 ± 0.2       the 2 green ones (3.5), the two green ones (3.5), the 2 small ones (4.4), the 2 small balls (4.4), the 2 small circles (4.4), the 2 small dots (4.4), the 2 small spheres (4.4), the 2 left ones (4.4), the 2 tiny ones (4.4), the two small balls (4.4), the two small circles (4.4), the two small dots (4.4), the two small spheres (4.4)
{c1, c3}      0.54 ± 0.1       the 2 except center one (4.5), the 2 not center one (4.5), the 2 but center one (4.5), the 2 except middle one (4.5), the 2 but middle one (4.5), the 2 not middle one (4.5), the two but center one (4.5), the two not center one (4.5), the two except center one (4.5), the two except medium one (4.5), the two not medium one (4.5), the two but medium one (4.5)
{c2, c3}      0.32 ± 0.2       the right ones (3.3), the right circles (3.3), the big ones (3.3), the 2 right ones (4.4), the 2 right balls (4.4), the 2 right circles (4.4), the 2 right dots (4.4), the 2 right spheres (4.4), the two right ones (4.4)
{c1, c2, c3}  0.69 ± 0.2       the ones (1.1), the 3 ones (2.2), the three ones (2.2), the three balls (2.2), the three circles (2.2), the three dots (2.2), the three spheres (2.2)
And these are the results of using Stochastic Hillclimbing for the REG-Aggregate task run for n = 10
iterations on the Kindles domain:
Target Set  Seconds      Referring Expressions (and their plan-cost)
{k1}        0.52 ± 0.2   the left one (2.5), the light one (2.5), the cheaper one (2.5), the small left one (3.9), the small light one (3.9), the tiny left one (3.9)
{k2}        0.73 ± 0.3   the big left one (5.9), the left large one (5.9), the small right light one (6.2), the biggest small left one (7.3), the light right one (10.8), the right cheap one (10.8), the right cheaper one (10.8), the expensive left one (10.9), the right small inexpensive one (12.2), the right medium light one (13.2), the right medium cheap one (13.2)
{k3}        0.58 ± 0.3   the center one (2.5), the small right one (5.8), the right small one (6.8), the right smaller one (6.8), the cheaper right one (6.8), the medium inexpensive right one (9.2), the large right small one (10.2)
{k4}        0.54 ± 0.3   the big one (2.5), the larger one (2.5), the largest one (2.5), the bigger one (2.5), the black large one (3.8), the dark big one (3.9), the right big one (4.8), the center biggest one (4.9), the small right one (5.8), the smaller right one (5.8)
{k5}        0.66 ± 0.7   the right one (2.4), the big fire (3.5), the big right one (4.8), the large right one (4.8), the light right one (10.8)
{k1, k2}    3.28 ± 6.2   the light ones (3.5), the small light ones (4.9), the small cheap ones (4.9), the small light readers (4.9), the small inexpensive ones (4.9), the smaller left readers (4.9), all light readers (5.5), the left ones (9.5), the medium center light readers (10.3)
{k1, k3}    2.68 ± 4.1   the smaller but middle one (4.9), the smaller except middle one (4.9), the smaller not middle one (4.9), the small but center one (5.9), the small not center one (5.9), the small except center one (5.9), the small but middle one (6.9), the small not middle one (6.9), the small except middle one (6.9), the left except $99 one (7.8), the left not $99 one (7.8), the left but $99 one (7.8), the cheap except center one (7.9), the cheap but center one (7.9), the cheap not center one (7.9), the light but middle one (7.9), the light except middle one (7.9), the light not middle one (7.9), medium center not $99 one (10.2), medium center except $99 one (10.2), medium center but $99 one (10.2), the light not center one (11.9), the light but center one (11.9), the light except center one (11.9)
Targets              Seconds
{k1, k4}             0.70 ± 0.2
{k1, k5}             10.61 ± 5.0
{k2, k3}             4.41 ± 6.5
{k2, k4}             0.74 ± 0.1
{k2, k5}             0.54 ± 0.1
Referring Expressions (and their plan-cost)
the small but center one (5.9), the small not center one (5.9), the small except
center one (5.9), the small but middle one (6.9), the small not middle one
(6.9), the small except middle one (6.9), the medium but center one (6.9),
the medium except center one (6.9), the medium not center one (6.9), the
smaller not center one (6.9), the smaller except center one (6.9), the smaller
but center one (6.9), the tiny but middle one (6.9), the tiny except middle one
(6.9), the tiny not middle one (6.9), the small except middle ones (6.9), the
small but middle ones (6.9), the small not middle ones (6.9), the left except
touch (10.5), the left not touch (10.5), the left but touch (10.5), the left not
touches (10.5), the left but touches (10.5), the left except touches (10.5)
the light not touches (10.5), the light except touches (10.5), the light but
touches (10.5), the light not medium one (12.9), the light but medium one
(12.9), the light except medium one (12.9)
the middle small ones (4.9), the center small tablets (4.9), the small right
ones (7.9), the expensive smaller ones (10.9), the right cheaper ones (11.9),
all right small ones (12.8)
the big left not kindle touch 3g (9.9), the big left except kindle touch 3g
(9.9), the big left but kindle touch 3g (9.9), the bigger left but kindle touch 3g
(9.9), the bigger left except kindle touch 3g (9.9), the bigger left not kindle
touch 3g (9.9), the right small not kindle touch 3g (12.9), the right small
but kindle touch 3g (12.9), the right small except kindle touch 3g (12.9), the
right medium except kindle touch 3g (12.9), the right medium not kindle
touch 3g (12.9), the right medium but kindle touch 3g (12.9), the right smaller
not kindle touch 3g (12.9), the right smaller except kindle touch 3g (12.9),
the right smaller but kindle touch 3g (12.9), the right tiny not kindle touch
3g (12.9), the right tiny except kindle touch 3g (12.9), the right tiny but
kindle touch 3g (12.9), the right smallest not kindle touch 3g (12.9), the right
smallest but kindle touch 3g (12.9), the right smallest except kindle touch 3g
(12.9)
the small right not kindle touch 3g (8.9), the small right but kindle touch
3g (8.9), the small right except kindle touch 3g (8.9), the smaller right but
kindle touch 3g (8.9), the smaller right not kindle touch 3g (8.9), the smaller
right except kindle touch 3g (8.9), large small but center one (9.3), large
small except center one (9.3), large small not center one (9.3), large small but
middle one (9.3), large small not middle one (9.3), large small except middle
one (9.3), big smaller but center one (9.3), big smaller not center one (9.3), big
smaller except center one (9.3), bigger small except center one (9.3), bigger
small not center one (9.3), bigger small but center one (9.3), large smaller
but middle one (9.3), large smaller except middle one (9.3), large smaller not
middle one (9.3), larger medium but center ones (9.3), larger medium except
center ones (9.3), larger medium not center ones (9.3), the right small not
kindle touch 3g (12.9), the right small except kindle touch 3g (12.9), the right
small but kindle touch 3g (12.9), the right lighter not kindle touch 3g (14.9),
the right lighter except kindle touch 3g (14.9), the right lighter but kindle
touch 3g (14.9)
CHAPTER 5. EVALUATION
Target Set           Seconds
{k3, k4}             0.29 ± 0.1
{k3, k5}             1.65 ± 4.4
{k4, k5}             2.28 ± 4.3
{k1, k2, k3}         2.64 ± 3.7
{k1, k2, k4}         0.73 ± 0.1
{k1, k2, k5}         4.75 ± 5.6
Referring Expressions (and their plan-cost)
the center heavier ones (5.9), the smaller right readers (6.9),
the small right ones (7.9), the right tiny ones (7.9), the expensive small ones (7.9), the big right 3g ones (10.2), the
large right 3g ones (10.2), the right smaller ones (11.9), the
left big expensive ones (14.3)
the medium right ones (5.9), the tiny heavy readers (5.9), the
right light ones (7.9), the expensive medium readers (7.9),
the big small right ones (9.3), the largest right small ones
(11.3), the right small ones (11.9), the right small readers
(11.9), the right medium ones (11.9)
the big ones (3.5), the large readers (3.5), the larger ones
(3.5), the largest ones (3.5), the bigger ones (3.5), the bigger
readers (3.5), the right ones (9.5)
the small ones (3.5), the 3 small ones (3.6), the 3 smaller ones
(3.6), the three small ones (3.6), the cheap readers (5.5), the 3
left ones (6.6), the 3 light readers (6.6), the 3 light tablets (6.6),
the 3 light kindles (6.6), the 3 medium cheaper readers (9.0),
the 3 medium cheaper tablets (9.0), the 3 medium cheaper
kindles (9.0)
the small except kindle touch 3g (5.5), the small but kindle
touch 3g (5.5), the small not kindle touch 3g (5.5), the smaller
not kindle touch 3g (5.5), the smaller but kindle touch 3g
(5.5), the smaller except kindle touch 3g (5.5), the 3 small
not kindle touch 3g (6.6), the 3 small except kindle touch 3g
(6.6), the 3 small but kindle touch 3g (6.6), the 3 tiny except
kindle touch 3g (6.6), the 3 tiny not kindle touch 3g (6.6),
the 3 tiny but kindle touch 3g (6.6), the three small except
kindle touch 3g (6.6), the three small but kindle touch 3g
(6.6), the three small not kindle touch 3g (6.6), the 3 left not
kindle touch 3g (11.6), the 3 left but kindle touch 3g (11.6),
the 3 left except kindle touch 3g (11.6)
the small but center one (5.9), the small except center one
(5.9), the small not center one (5.9), 3 small but middle one
(6.9), 3 small except middle one (6.9), 3 small not middle
one (6.9), 3 medium not center one (6.9), 3 medium except
center one (6.9), 3 medium but center one (6.9), 3 smaller not
middle one (6.9), 3 smaller but middle one (6.9), 3 smaller
except middle one (6.9), three small not center one (6.9),
three small but center one (6.9), three small except center
one (6.9), the light but center one (11.9), the light not center
one (11.9), the light except center one (11.9), three light not
center one (13.0), three light but center one (13.0), three
light except center one (13.0)
Targets              Seconds
{k1, k3, k4}         3.53 ± 3.7
{k1, k3, k5}         5.14 ± 9.0
{k1, k4, k5}         14.37 ± 0.6
{k2, k3, k4}         0.52 ± 0.5
{k2, k3, k5}         2.34 ± 4.6
{k2, k4, k5}         1.08 ± 0.5
Referring Expressions (and their plan-cost)
the small not $99 touch (6.8), the small except $99 touch (6.8), the
small but $99 touch (6.8), 3 small but $99 one (6.8), 3 small except
$99 one (6.8), 3 small not $99 one (6.8), three small but $99 one
(6.9), three small except $99 one (6.9), three small not $99 one
(6.9), 3 medium not $99 one (7.8), 3 medium but $99 one (7.8), 3
medium except $99 one (7.8), 3 tiny not $99 one (7.8), 3 tiny but
$99 one (7.8), 3 tiny except $99 one (7.8), three small except $99
ones (7.9), three small not $99 ones (7.9), three small but $99 ones
(7.9), 3 left not $99 one (12.8), 3 left but $99 one (12.8), 3 left except
$99 one (12.8)
the small not $99 one (5.8), the small except $99 one (5.8), the
small but $99 one (5.8), the smaller but $99 one (5.8), the smaller
not $99 one (5.8), the smaller except $99 one (5.8), 3 small not $99
one (6.8), 3 small but $99 one (6.8), 3 small except $99 one (6.8), 3
smaller not $99 one (6.8), 3 smaller except $99 one (6.8), 3 smaller
but $99 one (6.8), three small except $99 one (6.9), three small but
$99 one (6.9), three small not $99 one (6.9), 3 small not $99 ones
(6.9), 3 small but $99 ones (6.9), 3 small except $99 ones (6.9), 3
tiny except $99 ones (6.9), 3 tiny but $99 ones (6.9), 3 tiny not $99
ones (6.9), the cheaper except $99 one (11.8), the cheaper not $99
one (11.8), the cheaper but $99 one (11.8)
the middle ones (3.5), the big left ones (8.9), the larger left readers
(8.9), the right small ones (11.9), the right small readers (11.9), the
right medium ones (11.9), the right tiny readers (11.9), the 3 right
small ones (13.0), the three left largest ones (13.0)
the large small readers (6.9), the small right ones (7.9), the right
small ones (11.9), the right smaller ones (11.9), the three right
medium readers (13.0), the three right medium tablets (13.0), the
three right medium kindles (13.0), the three heavy small ones
(13.0), the right inexpensive ones (13.9), the 3 cheap right ones
(15.0), the three light right ones (15.0)
the big not kindle touch 3g (5.5), the big except kindle touch 3g
(5.5), the big but kindle touch 3g (5.5), the 3 big not kindle touch
3g (6.6), the 3 big except kindle touch 3g (6.6), the 3 big but kindle
touch 3g (6.6), the 3 large not kindle touch 3g (6.6), the 3 large
except kindle touch 3g (6.6), the 3 large but kindle touch 3g (6.6),
the right except kindle touch 3g (10.5), the right not kindle touch
3g (10.5), the right but kindle touch 3g (10.5), the 3 right but
kindle touch 3g (11.6), the 3 right not kindle touch 3g (11.6), the 3
right except kindle touch 3g (11.6), the 3 expensive except kindle
touch 3g (11.6), the 3 expensive not kindle touch 3g (11.6), the 3
expensive but kindle touch 3g (11.6)
Target Set           Seconds
{k3, k4, k5}         3.65 ± 4.7
{k1, k2, k3, k4}     3.82 ± 3.8
{k1, k2, k3, k5}     2.82 ± 2.6
{k1, k2, k4, k5}     8.47 ± 5.3
{k1, k3, k4, k5}     3.60 ± 3.4
{k2, k3, k4, k5}     1.70 ± 2.2
{k1, k2, k3, k4, k5} 5.44 ± 4.3
Referring Expressions (and their plan-cost)
the three right readers (6.6), the three right tablets (6.6), the
three right kindles (6.6), the big right readers (8.9), the large
right ones (8.9), the large right readers (8.9), the right ones
(9.5), the right readers (9.5), the 3 big right ones (10.0), the 3
big right readers (10.0), the 3 big right tablets (10.0), the 3
big right kindles (10.0)
the small ones (3.5), the small readers (4.5), the 4 small ones
(4.6), the 4 small readers (4.6), the 4 small tablets (4.6), the 4
small kindles (4.6), the 4 tiny readers (5.6), the 4 tiny tablets
(5.6), the 4 tiny kindles (5.6), the four smaller ones (5.6), the
left ones (9.5), the 4 left ones (10.6)
the small ones (3.5), the smaller ones (3.5), the 4 small ones
(4.6), the 4 small readers (4.6), the 4 small tablets (4.6), the 4
small kindles (4.6), the 4 medium ones (4.6), the 4 medium
readers (4.6), the 4 medium tablets (4.6), the 4 medium kindles (4.6), the 4 light ones (10.6), the 4 cheap ones (10.6), the
4 light readers (10.6), the 4 inexpensive ones (10.6), the 4
light tablets (10.6), the 4 light kindles (10.6)
the 4 but center one (4.5), the 4 not center one (4.5), the 4
except center one (4.5), the four not center one (4.6), the
four but center one (4.6), the four except center one (4.6),
the 4 not middle ones (4.6), the 4 but middle ones (4.6), the 4
except middle ones (4.6), the four not center reader (4.6), the
four except center reader (4.6), the four but center reader
(4.6), the four but center tablet (4.6), the four not center
tablet (4.6), the four except center tablet (4.6), the four not
center kindle (4.6), the four but center kindle (4.6), the four
except center kindle (4.6)
the 4 except $99 one (4.5), the 4 but $99 one (4.5), the 4 not
$99 one (4.5), the four except $99 one (4.5), the four not $99
one (4.5), the four but $99 one (4.5), the 4 not $99 readers
(4.5), the 4 but $99 readers (4.5), the 4 except $99 readers
(4.5), the 4 but $99 tablets (4.5), the 4 except $99 tablets (4.5),
the 4 not $99 tablets (4.5), the 4 not $99 kindles (4.5), the 4
but $99 kindles (4.5), the 4 except $99 kindles (4.5)
the large ones (4.5), the right ones (9.5), the right readers
(9.5), the expensive ones (9.5), the heavy ones (9.5), the
expensive readers (9.5), the 4 right readers (10.6), the 4 right
tablets (10.6), the 4 right kindles (10.6), the four right ones
(10.6)
the ones (1.1), the 5 ones (2.2), the 5 readers (2.2), the 5
tablets (2.2), the 5 kindles (2.2), the five ones (2.2)
Chapter 6
Related Work
This chapter describes related work in computational models of REG and REI. First, in 6.1, I justify AIGRE’s lexical, planning-based approach, which sets it apart from most other work in NLG and NLU. AIGRE is the first lexical, planning-based approach to both REG and REI. As a planning problem, AIGRE is novel for framing the task in belief space and for focusing on the linguistic phenomena of vagueness and ambiguity. Compared to alternative approaches to generating referring expressions, it is unique in generating multiple referring expressions (as opposed to deterministic output), producing linguistic output (instead of merely selecting content), and referring to sets (instead of just individual referents). Furthermore, it offers a novel approach to representing vague and ambiguous meanings.
6.1 Abandoning serial pipeline architectures
How does a speaker take a context set and a target set and encode a referring expression? And how does a hearer use the context set, along with a referring expression, to decode the intensional description?

In the process of researching previous definitions of the generation and interpretation processes, it became evident that the primary points of contention arose from issues of scope. On one hand, there is pressure to define each process’s scope narrowly enough to facilitate scientific inquiry. On the other hand, each process must be scoped broadly enough to account for all aspects of linguistic meaning. The earliest computational models (Appelt [1985]; Winograd [1972]) were among the most broadly scoped and addressed both NLG and NLU together. I suspect these systems’ various theoretical contributions were too difficult for the research community to identify, and their overall engineering contributions (like those of most large software systems) were too difficult to build upon.
6.1.1 Processing architectures for interpretation
A theory of how meaning is decoded from a referring expression requires: (1) a theory of what constitutes the referring expression’s parts, (2) a theory of what the parts mean, and (3) a theory of how the parts’ meanings are combined with each other (and with context) to produce the whole expression’s meaning.
An extreme view, which has strongly influenced modern linguistics, is that all relevant meaning
is contained in the referring expression itself. As a consequence of this view, natural language
is seen as a self-contained formal language. This I call the coding-theory view, and it has the
following processing architecture:
[Figure 6-1: The coding-theory view of linguistic communication. Generation: Meaning Representation → Linguistic Encoding → Referring Expression. Interpretation: Referring Expression → Linguistic Decoding → Meaning Representation.]
The architecture for reference procedures, according to the coding-theory view, involves the speaker feeding the meaning she intends to communicate into a language-encoding module. This module produces a referring expression as output, which is then decoded by the hearer’s language-decoding module. Despite its lasting influence, the coding-theory view fails to account for several linguistic phenomena, including: ambiguity, compound nominals, ellipsis, figurative language (hyperbole, metaphor, metonymy), implicit assumptions (presuppositions), implicit expected inferences (implicatures), mistakes in spelling or grammar, unspecific descriptions and vagueness. As a testament to the influence of the coding-theory view, many of these issues are considered “phenomena” precisely because their meanings fall outside the scope of its treatment.
Various proposals extend the coding-theory view to account for some of these phenomena. Generally, they do so by receiving an additional input: context. Context is a “suitcase word”1 used to describe “any information that is available to the speaker or hearer that may influence meaning” or, more cynically, “any information not accounted for by the theory at hand.”
Referring expressions were given special attention in the development of theories about how natural language is interpreted, because it was clear that at least part of their meaning depends on context. The coding-theory view (figure 6-1), which treats natural language as a self-contained formal system, needed to be augmented with a mechanism to change the denotations of pronouns, proper names and deictic expressions across different uses. What kind of representation would allow the intension of the leftmost one to stay unchanged across uses while its extension changes?
1. To describe an expression that has been packed with too many imprecise meanings, Minsky [2007] uses the term “suitcase word.”
To address this, Frege developed a formal language, an important predecessor to first-order
predicate calculus (FOPC), which allows descriptions to hold over variables that can be bound
to different objects (a set of entities of any domain of interest) to form True or False statements.
Consequently, a referring expression’s intension could contain free variables which are bound
subsequently during the reference resolution step, extension(·), which interacts with context. The
result is a picture of the interpretation process that looks like the following:
[Figure 6-2: A model-coding-theory view of the interpretation process. Interpretation: Referring Expression → Linguistic Decoding → Encoded Meaning → Reference Resolution (with Context) → Denotation.]
Given a referring expression, such as a red square, a modern model-coding-theory architecture would describe its intension using an expressive logic (e.g. FOPC, Intensional Logic). During linguistic decoding, the intension is constructed by assembling the individual lexical entries’ atomic meanings according to the syntactic structure and compositional rules. For example, the resulting intension in this case may be: ∃x red(x) ∧ square(x). In the model-theoretic approach to reference resolution, a hearer must find the possible world(s) that make the intension True, namely, a model that solves the intension (in this case, by assigning x to an entity in the world that is both red and square).
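This resolution step can be sketched concretely: an intension is a predicate over entities, and the extension function filters the context set by it. The entities, attribute names, and predicate definitions below are invented for illustration; they are not AIGRE’s representations.

```python
# A minimal sketch of model-theoretic reference resolution: the intension
# "exists x. red(x) and square(x)" is satisfied by binding x to an entity
# whose attributes make both predicates true. The entities and attribute
# names are invented for illustration.

def red(entity):
    return entity["color"] == "red"

def square(entity):
    return entity["shape"] == "square"

def a_red_square(entity):
    """Intension of 'a red square': the conjunction of both predicates."""
    return red(entity) and square(entity)

def extension(intension, context_set):
    """Return every entity in the context set that satisfies the intension."""
    return [e for e in context_set if intension(e)]

context = [
    {"id": "o1", "color": "red", "shape": "square"},
    {"id": "o2", "color": "red", "shape": "circle"},
    {"id": "o3", "color": "blue", "shape": "square"},
]

referents = extension(a_red_square, context)  # x can only be bound to o1
```

In this toy context the intension has exactly one satisfying assignment, so the description is distinguishing; with a second red square it would be ambiguous.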
In many cases, the linguistically encoded component of the intension is not all that the speaker intended to communicate. So how does the hearer derive the intended meaning from the intension? Pragmatic theories have proposed that the intended meaning is derived from inferences based on the intension and context, but there are different ideas about (1) what factors drive the inference process,2 (2) what criteria are used to terminate the inference process, and (3) at which stages context is available.
To develop an intuition for how to answer these questions, consider this story demonstrating
non-linguistic referential communication:
2. The criteria for termination are critical, because any intension can have multiple speaker meanings. Consider, for example, irony: given any utterance, a speaker can intend the literal intension or its opposite.
Joan, who is mute, is ordering lunch at a nice restaurant. When the waiter arrives
at her table, she points at the “Albacore Tuna burger” on the menu. The waiter
recognizes the dish Joan was referring to and acknowledges her request.
The waiter observes Joan’s use of an established convention, pointing with her index finger, and its intension: that she intends to direct his attention to something relevant located somewhere along an invisible trajectory radiating from her finger. This trajectory may denote many alternatives: e.g., a blemish on the menu, the menu itself, the specific word under her finger (“brioche”), the font of the word, and so on. The task context helps the waiter to realize that Joan intended to refer to the “Albacore Tuna burger” meal. And although the task context may be consistent with many possible alternative communication goals (e.g., she is attempting to make a comment about the meal, she would like additional information about the meal, etc.), the waiter reasons that Joan’s communication goal is to refer to the single meal she would like to order.
The intension of Joan’s pointing message was evidence that the waiter used, along with the context, to make inferences about what she was intending to communicate. Did the waiter mull endlessly over the possible intended meanings that were consistent with Joan’s communicative act? No, his inferences presumably ceased as soon as he happened upon a satisfactory interpretation of the evidence at hand. What constitutes “satisfactory” behavior, either for Joan or the waiter, depends on the situation at hand,3 and differs for speaker and hearer. For reference, the speaker minimally wants to convey a particular extension to the hearer, so it is useful for her to know what he knows or is capable of inferring from her utterance. Similarly for the hearer, the speaker’s act of reference contains “an implicit assurance that he has enough information to uniquely identify the referent, taking into account the semantic content of the referring expression and information from the context, whether situational (i.e. currently perceivable), linguistic, or mental (i.e. memory and knowledge)” (Cruse [2011]).
6.1.2 Processing architectures for generation
Like its counterparts in the natural language understanding community, the NLG community has historically used modular processing architectures for generation, which are now being questioned for having too narrow a scope. Referring expression generation (REG) has been an especially important topic in the NLG community (see Krahmer and van Deemter [2012] for a good overview); however, the original formulation of the REG task, often attributed to Dale and Reiter [1995], narrowly defines the task as only content determination, a small sub-module in a larger NLG pipeline (see figure 6-3), while deferring the natural language generation step to subsequent modules.
3. In an alternative task context where communication is of a more critical nature, such as when an air traffic controller is attempting to communicate the specific coordinates at which a pilot should land his plane, the strategies would change: the speaker may include more redundant information in his coded message, and the hearer may perform additional inferences to ensure that the intended message was accurately interpreted.
Content determination is the problem of finding the content to include in the text, and for referring
expressions this includes a description that distinguishes the target set (the extension a speaker
intends to communicate) from the distractors (all other possibilities). For example, if the target set
in the Circles domain is {c1 , c2 }, then a valid output from a content determination algorithm is
this descriptive content, usually represented in either an attribute-value matrix or logical form:
"
c = type
color
#
circle
green
c = type(x, circle)
∧ color(x, green)
[Figure 6-3: A traditional NLG pipeline (cf. Mellish et al. [2004]; Reiter [1994]). Communication Goal → Discourse Planning → Discourse Structure → Sentence Planning (Content Determination, Lexical Choice, Aggregation) → Sentence Plan → Surface Realization → Text.]
The content c is then handed to the next step in the NLG pipeline, with the ultimate goal of becoming a referring expression (e.g. the green circles, the two green circles, some green circles), possibly embedded in a larger construct, like a sentence. Research on content determination for REG is concerned with what search process should drive attribute selection and with what constitutes sufficiently descriptive content.
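The attribute-selection step can be sketched in the spirit of Dale and Reiter’s incremental algorithm: walk a fixed preference order over attributes and keep any attribute-value pair that rules out at least one remaining distractor. The Circles-domain entities, the preference order, and the helper function below are illustrative assumptions, not the original algorithm’s exact formulation.

```python
# A sketch of content determination: select attribute-value pairs, in
# preference order, until the targets are distinguished from the
# distractors. The entities and preference order are illustrative.

def distinguishing_content(targets, context, preference_order):
    """Select attribute-value pairs that distinguish targets from distractors."""
    content = {}
    distractors = [e for e in context if e not in targets]
    for attr in preference_order:
        if not distractors:
            break  # every distractor already ruled out
        values = {t[attr] for t in targets}
        if len(values) != 1:
            continue  # attribute does not describe all targets uniformly
        value = values.pop()
        survivors = [d for d in distractors if d[attr] == value]
        if len(survivors) < len(distractors):
            content[attr] = value  # rules out at least one distractor
            distractors = survivors
    return content, distractors

c1 = {"id": "c1", "type": "circle", "color": "green"}
c2 = {"id": "c2", "type": "circle", "color": "green"}
c3 = {"id": "c3", "type": "circle", "color": "red"}
s1 = {"id": "s1", "type": "square", "color": "red"}

content, remaining = distinguishing_content(
    [c1, c2], [c1, c2, c3, s1], ["type", "color"])
# content matches the descriptive content c above: type=circle, color=green
```

Note that the output is still pre-linguistic content; in a pipeline architecture, turning it into the green circles is left to later modules.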
These “pipeline” architectures prevent information from being shared between different layers of linguistic analysis, contrary to evidence that the layers interact (Altmann and Steedman [1988]; Danlos and Namer [1988]; Horacek [2004b]; Krahmer and Theune [2002]; Stone and Webber [1998]; Zarrieß and Kuhn [2013]). As Horacek [2004b] noted, the precise representation of the content may depend on what expressive resources are available to the surface-realization and lexical-choice modules. For example, suppose you are trying to generate a referring expression to identify one of two men, where the target has a full head of hair and the distractor is bald. Instead of producing the logical formula has_hair(x1), the content selection algorithm may prefer the logically equivalent formula ¬bald(x1), because it has a simpler surface form and can be expressed as not bald, whereas English has no succinct modifier for has_hair(x1).

In addition, many of the same arguments for incremental reference interpretation (from 1.4.1) also apply to generation.
For reference generation, the relevant components of the pipeline include content determination, choosing the content to express; lexical choice, choosing the words that express the content; and surface realization, organizing the words into a valid syntactic form. In reaction to the limitations of this modularity, the anti-modular lexical approach was developed (Bauer and Koller [2010]; Garoufi and Koller [2010, 2011]; Koller and Hoffmann [2010]; Koller and Petrick [2011]; Koller and Stone [2007]; Stone et al. [2003]), in which each surface lexical item and its syntactic, semantic, and pragmatic contributions are self-contained in its lexical entry. As discussed in the previous chapter, this can be formulated as a problem of automated planning.
6.1.3 An anti-modular, inferential architecture for both processes
The result of abandoning the serial pipeline architectures is the inferential view that puts all
of the linguistic decisions into a single inferential process. The resulting architecture has fewer
constraints, and allows contextual information to be available at all stages:
[Figure 6-4: The inferential view of communication. Generation: Meaning Representation → all linguistic encoding decisions → Referring Expression. Interpretation: Referring Expression → all linguistic decoding decisions → Meaning Representation. Context is available to both processes throughout.]
As I described in the previous chapter, the serial constraint on the referring expression itself (the output of the REG process and the input of the REI process) makes this architecture well suited to being cast in an automated planning framework.
6.2 Planning-based approaches to generation
REG research can be divided into two largely independent research communities with different objectives: psycholinguistic research, concerned with explaining and describing what a person does when generating a referring expression, and computational work (in AI or NLP), concerned with building models that generate useful referring expressions. My work, like much in the computational area, draws from psycholinguistics for its constraints, but my emphasis has been on the engineering goal of building a working system.
The connection between automated planning and computational models of REG began with Appelt [1985]’s ambitious system KAMP, which used a single planning system for reasoning about the world, its own knowledge and other agents’ knowledge. It showed, by demonstration, an example of generating utterances (including referring expressions) that overloaded multiple communication goals at once. As a running example, KAMP instructed a human, John, to “Remove the pump with the wrench in the tool box,” which simultaneously satisfied the goals of (1) removing the pump, (2) informing John how to remove the pump, and (3) informing John about the location of the wrench. While KAMP addressed a broad range of issues, it did not apply planning techniques to the fine-grained linguistic decisions that this thesis is concerned with. Nonetheless, KAMP contributed a great insight: that action operators from automated planning can be used to describe the dynamics of the speaker’s and hearer’s belief states under communicative actions.
In more recent years, the dominant research focus for computational models of REG has been finding a distinguishing description: given a completely defined set of elements, represented as sets of attribute-value pairs, and a single target to identify, choose a set of attributes that distinguishes the target from the other entities in the context set. This definition avoided any linguistic realization. The problem scope has gradually widened to accommodate preferences over attributes (e.g. a preference for selecting color over shape; Dale and Reiter [1995]) and attribute or value salience (e.g. a preference for blue over yellow, or for the most recently mentioned object; Krahmer and Theune [1998]; Viethen et al. [2011]).
Eventually, this broadening of the REG picture led to the lexical approach, in which a broad variety of linguistic decisions are packaged into the problem of choosing lexical entries (Garoufi and Koller [2011]; Koller and Stone [2007]; Koller et al. [2010]). The exact representational details of these lexical entries leave plenty of design decisions open; however, these approaches require a lexical, incremental theory of grammar to specify how the meanings of individual words (actions) interact. In SPUD and its derivatives, syntactic constraints are expressed using a lexicalized version of tree-adjoining grammar (LTAG), in which larger trees are assembled from smaller elementary trees using only two compositional operations: substitution and adjunction.
Koller and Stone [2007] observed that lexical approaches could be cast as classical STRIPS planning problems, and developed such a system called CRISP. CRISP defines word-actions in PDDL, whereby each action has a precondition that requires an appropriate substitution or adjunction site for the word’s elementary tree, and effects that describe its semantic content and syntactic constraints.4 The goal-test function verifies that the state’s syntax is a proper LTAG tree (a single parse tree without any unbound substitution nodes) and that the desired semantic content is asserted.

4. It should be noted that the SPUD and CRISP systems do not perform content selection; the content to be expressed is given as input in logical form.

Garoufi and Koller [2010] and Koller et al. [2010] showed that the planning approach of CRISP
could also incorporate world context into generation: in addition to manipulating the discourse
context, words can be designed to constrain extra-linguistic context by using preconditions that
hinge upon the state of a simulated world. The idea is to interpret referring expressions such as
“[take] the second door to your left,” which, assuming the door is out of view, requires the hearer to
update his non-linguistic context by performing the physical action of moving to the left. This
captures some of the so-called presuppositions and conventional implicatures, whereby a word’s
meaning constrains the previous context set and requires the hearer to reason backwards to align
her context set with the speaker’s.
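The word-actions described above can be sketched schematically, rendered in Python rather than PDDL. The site names, predicates, and state encoding below are invented for illustration; this is not CRISP’s actual PDDL domain.

```python
# A schematic rendering of CRISP-style word-actions: each word's action
# consumes an open substitution site for its elementary tree
# (precondition) and asserts semantic content while opening new sites
# (effects). Names and encodings here are illustrative, not CRISP's.

from dataclasses import dataclass

@dataclass(frozen=True)
class WordAction:
    word: str
    needs_site: str       # open substitution/adjunction site it consumes
    opens_sites: tuple    # new open sites introduced by its elementary tree
    asserts: frozenset    # semantic content contributed

def applicable(action, state):
    return action.needs_site in state["open_sites"]

def apply_action(action, state):
    open_sites = set(state["open_sites"])
    open_sites.remove(action.needs_site)
    open_sites.update(action.opens_sites)
    return {"open_sites": open_sites,
            "semantics": state["semantics"] | action.asserts,
            "words": state["words"] + [action.word]}

the = WordAction("the", needs_site="NP", opens_sites=("N",),
                 asserts=frozenset({"definite(x)"}))
square = WordAction("square", needs_site="N", opens_sites=(),
                    asserts=frozenset({"square(x)"}))

state = {"open_sites": {"NP"}, "semantics": frozenset(), "words": []}
for action in (the, square):
    assert applicable(action, state)
    state = apply_action(action, state)

# Goal test: no unbound substitution sites, desired content asserted.
goal_reached = (not state["open_sites"]
                and {"definite(x)", "square(x)"} <= state["semantics"])
```

A plan in this space is a word sequence, so finding a plan that passes the goal test simultaneously performs lexical choice and (skeletal) surface realization.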
Despite the apparent advantages of building on existing planning architectures, Koller and Hoffmann [2010] and Koller and Petrick [2011] initially reported that planning-based approaches were “too slow to be useful.” They later overcame some of the inefficiencies by modifying Hoffmann [2001]’s FF planner so that it removed tautologies from the planning domain during preprocessing and restricted actions to those that were beneficial in solving a relaxed version of the planning problem. Still, their approach did not scale well with task complexity. With AIGRE, I made a number of efficiency optimizations that are specific to the problem and, as a consequence, have avoided these efficiency problems.
As for producing non-deterministic output, Di Fabbrizio et al. [2008] present the only other implemented model of non-deterministic REG. Their model produces stochastic outputs by randomly
choosing relevant actions and greedily assuming the next state, counting the number of times it
reaches a particular goal (and the path, a referring expression, that led to it). This is very similar
to AIGRE’s approach. However, unlike Di Fabbrizio et al. [2008]’s model, AIGRE combines
content determination (semantic decisions) and linguistic realization
(syntactic decisions and word choice) in the same actions.
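This Monte Carlo scheme can be sketched as follows. The `Action` class and the toy domain are invented for illustration; neither reflects the internals of Di Fabbrizio et al.'s system or of AIGRE:

```python
import random
from collections import Counter

class Action:
    """A word paired with a filter over candidate referents (an assumption
    made for this sketch)."""
    def __init__(self, word, test):
        self.word, self.test = word, test

    def apply(self, state):
        return frozenset(x for x in state if self.test(x))

    def applicable(self, state):
        new = self.apply(state)
        return bool(new) and new < state  # strictly narrows, stays non-empty

def sample_expressions(initial, actions, goal, n_samples=500, max_steps=4):
    """Randomly choose applicable actions, greedily commit to each next
    state, and count the action paths (referring expressions) that reach
    the goal state."""
    tallies = Counter()
    for _ in range(n_samples):
        state, path = initial, []
        for _ in range(max_steps):
            applicable = [a for a in actions if a.applicable(state)]
            if not applicable:
                break
            action = random.choice(applicable)
            state = action.apply(state)  # greedy: assume this next state
            path.append(action.word)
            if state == goal:
                tallies[" ".join(path)] += 1
                break
    return tallies

# A toy domain: objects are tuples of their properties.
objects = frozenset({("big", "circle"), ("small", "circle"), ("big", "square")})
words = [Action(w, lambda o, w=w: w in o)
         for w in ("big", "small", "circle", "square")]
target = frozenset({("big", "circle")})

counts = sample_expressions(objects, words, target)
# counts tallies expressions such as "big circle" and "circle big";
# their relative frequencies form a stochastic output distribution.
```

Repeated sampling thus yields a distribution over referring expressions rather than a single deterministic output.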
Dale and Reiter’s influential characterization of REG as selecting the set of properties that "rules
out" the distractors implicitly framed the problem around crisp rather than vague semantics. The
meaning of a vague assertion does not definitively "rule out" members, but generates a range of
hypotheses about which members are ruled out. van Deemter [2006]’s model, Vague, generated
referring expressions that include gradable adjectives, but managed to do so in a crisp, deterministic way. The author intentionally avoided using standards of comparison because of their
arbitrariness, and so was required to produce “the two largest ones” rather than “the large ones”
when describing multiple items. With AIGRE, we embrace the uncertainty involved with standards;
it is what we are trying to model! Vague semantics has also been characterized by others
in terms of probabilities (Lassiter [2011]) or graded membership functions (Hersh and
Caramazza [1976]; Zadeh [1975]). AIGRE’s lexical semantics for vague meanings is most similar to
the latter, because it can be seen as the fuzzy-logic operation of α-cut, in which a fuzzy set with a
graded membership function is translated back into an ordered family of crisp sets.
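As a rough illustration of this α-cut view (a sketch with invented membership scores, not AIGRE's actual lexicon), thresholding a graded membership function at progressively lower values of α produces a nested family of crisp candidate sets:

```python
def alpha_cut(membership, alpha):
    """Crisp set of members whose graded membership is at least alpha."""
    return {x for x, mu in membership.items() if mu >= alpha}

# Hypothetical graded membership for "big" over four shapes:
big = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}

# Lowering alpha yields nested crisp sets, mirroring how a vague
# modifier generates a range of hypotheses rather than one extension.
cuts = [alpha_cut(big, a) for a in (0.8, 0.5, 0.2)]
# cuts == [{"a"}, {"a", "b"}, {"a", "b", "c"}]
```

Each choice of α corresponds to one hypothesized standard of comparison, and the nesting orders the hypotheses from strict to permissive.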
6.3
Planning-based approaches to interpretation
Although plan recognition has a long connection with natural language processing (NLP), historically researchers have focused on two high-level problems: (1) inferring a speaker’s communicational and task goals from a speech act Allen and Perrault [1980]; Hußmann and Genzmann [1990];
Litman and Allen [1984] and (2) identifying a plan embedded in the content of text Bruce [1977];
Charniak and Goldman [1993]; Raghavan and Mooney [2011]. My research applies plan recognition at a more fine-grained level of analysis, where observed actions correspond to individual
lexical items and the plans correspond to referring expressions. Researchers working on the
generation problem have applied the planning approach at this level of analysis; however,
there is no prior work on recognizing planned referring expressions, and consequently the
planning and plan-recognition characterizations have never been addressed together.
As I have argued throughout this thesis, viewing REI as plan-recognition allows a number of
linguistic phenomena to be characterized in straightforward ways. This thesis is novel in its explicit
characterization of lexical ambiguity, vagueness and ellipsis as problems of partially observed
actions in plan recognition. As for the modeling of conventional implicature, there has recently
been a surge of related work that also frames the problem as reasoning about the alternative
linguistic decisions that the speaker was faced with (Benotti and Traum [2009]; Bergen et al. [2012];
Goodman and Stuhlmüller [2013]; Vogel et al.). Reasoning about speaker decisions clearly lends
itself to a planning framework. Along these lines, Vogel et al. used an optimal decision-theoretic
framework to model scalar implicatures. However, solving problems in these optimal frameworks
is computationally hard, and thus these approaches have been limited to studying single-word
choices. The engineering goals that shaped the development of AIGRE brought concerns of
computational efficiency and real-world usability to the forefront. Consequently, one of AIGRE’s
major contributions is in the implementation decisions that have made the system efficient and
scalable, including: using belief states to represent uncertainty, lazily generating target sets with
constraints on the sets’ sizes, representing mutually exclusive interpretations implicitly via the
planning graph, using non-deterministic effects, and having actions generate effect functions
instead of successor states.
Chapter 7
Conclusion
The research presented here made two primary contributions: (1) a computational model of the REG and
REI tasks that is fast, incremental, and non-deterministic; and (2) a cost-based approach for integrating
planning and plan recognition to solve interpretation problems. In addition, I have collected a
corpus of human referring expressions and provided a lexical semantics for a fragment of English
noun phrases. In this chapter, I summarize and take stock of these contributions, and then present
unsettled business.
7.1
Computational models of reference production
The main contribution of this thesis was AIGRE: an implemented system that generates and
interprets referring expressions. The development of AIGRE was informed by a curated list
of complicated expressions that I continually added to, which shaped the representation of
belief states, word meanings, and search strategies. The follow-up experiments
using the Turk data set have revealed important next steps.
The crux of AIGRE’s approach is to use representations that structure the search space so that the common
meanings of words are easiest to find, while less common meanings remain possible.
This is achieved using an early-commit search strategy and a backtracking mechanism. Further,
many of the descriptions of lexical semantics take advantage of the implicit representation of the
belief state, which allows the interpretation and generation processes to be incremental. Together
they provide a promising approach for communicating with lexical items that have vague and
ambiguous meanings.
To the future engineers of REG and REI systems, this work should provide some insights into the
kinds of search techniques that can be used to manage the large search space required to express
ambiguity and vagueness.
I have contributed two resources that could benefit researchers in the NLG and NLU communities:
the data set containing human descriptions of the two domains, and a Python library containing
data-structures for representing and managing the consistency of partial information.
7.1.1
Comments about modeling syntax
One glaring deficiency of the lexical semantics outlined in Chapter 3 is the absence of a description of how AIGRE
handles syntactic issues of scope and attachment. Consider the following referring expressions:
(7.1) the biggest green shape
(7.2) the second biggest green circle
(7.3) the biggest
An incremental planning system should be able to handle the non-monotonicity created by
so-called subsective adjectives: (7.1) should yield an interpretation that is not included in (7.3),
even though (7.3) is a prefix of (7.1). Adjectives whose meaning depends on the noun they modify are called
“subsective” adjectives.
I explored one approach to representing subsective adjectives, with mostly negative results. By
treating subsective adjectives as late-blooming words whose meaning is not evaluated until after
the head noun, their effects could be stored, as functions, in the belief state and executed later. In my
implementation, I added these effects to a deferred_effects queue along with a syntactic trigger.
When the belief state’s part_of_speech matched the trigger (e.g., a noun state), the effect was
removed from the queue and applied to the belief state.
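A minimal sketch of this deferral mechanism follows. The class and trigger matching are simplified assumptions made for illustration; only the names deferred_effects and part_of_speech come from the text:

```python
from collections import deque

class BeliefState:
    """Simplified belief state holding candidate referents and a queue of
    late-blooming effects keyed by a syntactic trigger."""
    def __init__(self, candidates):
        self.candidates = candidates
        self.part_of_speech = None
        self.deferred_effects = deque()

    def defer(self, trigger, effect):
        """Store a late-blooming effect with its syntactic trigger."""
        self.deferred_effects.append((trigger, effect))

    def advance(self, part_of_speech):
        """On reaching a new syntactic state, fire any matching effects."""
        self.part_of_speech = part_of_speech
        remaining = deque()
        while self.deferred_effects:
            trigger, effect = self.deferred_effects.popleft()
            if trigger == part_of_speech:
                self.candidates = effect(self.candidates)
            else:
                remaining.append((trigger, effect))
        self.deferred_effects = remaining

# Candidates are (color, size) pairs in this toy example.
state = BeliefState([("green", 3), ("green", 5), ("red", 9)])
# "biggest" defers its superlative meaning until the head noun arrives:
state.defer("noun", lambda cs: [max(cs, key=lambda c: c[1])])
# "green" applies immediately, as an ordinary intersective modifier:
state.candidates = [c for c in state.candidates if c[0] == "green"]
state.advance("noun")   # the head noun arrives; the superlative now fires
# state.candidates == [("green", 5)]: the biggest *green* item,
# not the biggest item overall.
```

The deferred superlative correctly scopes over the green subset, which is exactly the non-monotonic behavior the examples above require.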
This solution worked well for REI, but not for REG. In generation, the deferred actions caused
serious problems for the search. The subsective actions had no immediate effect on the belief state,
and so (in the eyes of the search algorithm) they did not move the belief toward the search goal.
Subsective adjectives are part of a broader class of syntactic concerns: deciding how to
combine the meanings of individual lexical units. In a subsequent version of AIGRE, I would use
a frame structure to represent the components of syntax, which specifies the order in which
they affect the composition of a belief state’s meaning and allows the speaker to bundle syntactic
decisions as a single choice. This is more consistent with psycholinguistic evidence, which suggests
that during generation, the syntactic organization of a noun phrase is determined before the
individual lexical items are chosen Schriefers et al. [1999]. This kind of frame-based representation
of syntactic structure would also allow words and content to be selected in different orders during
the generation process.
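One way such a frame structure might look is sketched below. The slot names and their composition order are illustrative assumptions, not a design taken from AIGRE:

```python
from dataclasses import dataclass, field

@dataclass
class NounPhraseFrame:
    """Hypothetical frame bundling a noun phrase's syntactic slots, so the
    syntactic organization is fixed before lexical items are chosen."""
    slots: tuple = ("determiner", "ordinal", "superlative", "modifier", "head")
    fillers: dict = field(default_factory=dict)

    def fill(self, slot, effect):
        """Attach a meaning (a filter over candidates) to a named slot;
        slots may be filled in any order."""
        assert slot in self.slots
        self.fillers[slot] = effect

    def compose(self, candidates):
        """Apply slot effects in a fixed frame order (head noun first),
        regardless of the order in which words were selected."""
        for slot in reversed(self.slots):
            effect = self.fillers.get(slot)
            if effect:
                candidates = effect(candidates)
        return candidates

# "the biggest green shape", with slots filled out of order:
frame = NounPhraseFrame()
frame.fill("superlative", lambda cs: [max(cs, key=lambda c: c["size"])])
frame.fill("head", lambda cs: [c for c in cs if c["kind"] == "shape"])
frame.fill("modifier", lambda cs: [c for c in cs if c["color"] == "green"])
result = frame.compose([
    {"kind": "shape", "color": "green", "size": 3},
    {"kind": "shape", "color": "green", "size": 5},
    {"kind": "shape", "color": "red", "size": 9},
])
# result == [{"kind": "shape", "color": "green", "size": 5}]
```

Because composition order is a property of the frame rather than of word order, content and words can be selected in whatever order the generator finds convenient.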
7.2
Integrating interpretation and generation
In many cases the linguistically encoded meaning underdetermines the speaker’s intended meaning,
and to arrive at that meaning, the hearer must deploy inferential processes that
draw on information from many disparate sources. This search for the intended meaning is the problem
of goal recognition: recognizing the meaning that the speaker intended to communicate can
be seen as a search for the final state intended by the plan. It is questionable to what extent the
goal recognition problem can be solved without also recognizing the specific hidden decisions
(missing words, hidden senses) along the way—the larger plan recognition problem. The chief
insight from framing the interpretation problem as plan recognition is that we now have a way
to align the assumptions between the speaker and hearer. Each decision along the way gives a
glimpse into the speaker’s decision making process; for example, to enable the hearer to ask the
question, why did the speaker choose to include this modifier? Answers to such questions include
not only the speaker’s communication goal, but the set of assumptions that were in place when
the speaker decided to produce his referring expression. Deriving this answer is not easy, because
the mere presence of a noun modifier underdetermines the speaker’s intention: was it to rule out
a distractor? to inform the hearer of a missing attribute? or was it simply redundant information
included to increase the chance of the communication succeeding? I suspect that the referring
expression alone is not sufficient to justify these inferences; the hearer also needs a model of the
way the speaker typically produces language to justify these deeper inferences. To facilitate this
range of inferences, a computational model must be able to compare the appropriateness of a
referring plan across varying initial conditions and communication goals.
The solution proposed here (and elsewhere) was to use the alternative decisions that the speaker
may have made to reevaluate the meaning that she put forward. This solution, though promising, is highly sensitive to the cost functions of words and their situationally derived meanings.
Subsequent work should model multiple agents’ beliefs in a simulated task context to provide
clearer criteria for driving these meta-psychological inferential processes.
Glossary
cardinal numbers Integers expressed as lexical items: "1", "one", "2", "two", etc. 38, 100
dialogue referents A linguistic component of dialogue context that contains a mapping between
the linguistic forms that have successfully been used as referring expressions and their targets.
100
discourse context A catch-all phrase used to connote all information available to the speaker
and hearer up to (and during) when the utterance is being produced, including the utterance itself.
100
context set The hypothesis space of interpretations, which for reference tasks contains all the
possible valid target sets. 24, 100
distractor All elements of the context set except the target set. 16, 24, 100
goal recognition Given a sequence of observed or partially observed actions, infer the agent’s
intended goal. 28, 29, 98, 100
hearer The agent who interprets a reference expression. In this thesis, we use the masculine pronoun
to indicate the hearer. 12, 100
lexical entry The basic unit of information stored in the lexicon (a mental dictionary), which
contains a single lexical item and a representation of its meaning. 27–29, 34, 55, 89, 92, 100
lexical item The surface form of a lexical entry, usually a single word, but it can also be a phrasal
expression or idiom. 18, 25, 27–30, 34, 48, 55, 62, 72, 73, 92, 94, 100
lexical semantics A theory of what is contained in a language’s lexical entries. 25, 100
plan recognition The inverse of the planning problem: given a sequence of observed or partially
observed actions, infer the agent’s intended goal and complete plan. 27, 29, 35, 99, 100
questions under discussion A pragmatic theory aimed at describing the structure of information
in context that is relevant to discourse. 100
reference The act of establishing that a given target in the common ground is in focus, and
potentially introducing the target to the common ground if it is not already. 12, 100
referring expression A linguistic construction used for the purpose of reference. 11, 12, 20,
24, 28, 33, 61, 92, 100
referring expression generation The task assigned to a speaker in reference. 100
referring expression interpretation The task assigned to a hearer in reference. 100
reference resolution Identification of the referent of the expression once its meaning has been
determined. 100
referential domain The individual entities that can be referred to, given the restrictions of the
context. The context set represents valid groupings of these entities. 100
resolution The process of identifying the referent(s) of a reference expression once its meaning
has been determined. 100
script A task that has been established by convention. 100
speaker The agent who produces a reference expression. In this thesis, we use the feminine pronoun
to indicate the speaker. 12, 100
target set The member of the context set that the speaker intends to communicate. 15, 24, 38, 91,
100
Bibliography
B. Abbott and B. Abbott. Reference. Oxford Surveys in Semantics & Pragmatics No.2. Oxford
University Press, 2010. ISBN 9780199202577.
J. Allen. Word Senses, Semantic Roles and Entailment. Invited Talks., 2011.
J. Allen and C. R. Perrault. Analyzing intention in utterances. Artificial Intelligence, 15(3):143–178,
Dec. 1980.
G. Altmann and M. Steedman. Interaction with context during human sentence processing.
Cognition, 30(3):191–238, 1988.
D. E. Appelt. Planning natural-language utterances to satisfy multiple goals. 1981.
D. E. Appelt. Planning English referring expressions. Artificial Intelligence, 1985.
C. L. Baker, J. B. Tenenbaum, and R. R. Saxe. Goal inference as inverse planning. Proceedings of
the 29th annual meeting of the cognitive science society, 2007.
D. Bauer and A. Koller. Sentence Generation as Planning with Probabilistic LTAG. Proceedings of
the 10th International Workshop on Tree Adjoining Grammar and Related Formalisms, New Haven,
CT, 2010.
L. Benotti. Implicature as an Interactive Process. Ph.D. Thesis, Jan. 2010.
L. Benotti and D. Traum. A computational account of comparative implicatures for a spoken dialogue agent. In IWCS-8 ’09: Proceedings of the Eighth International Conference on Computational
Semantics. Association for Computational Linguistics, Jan. 2009.
L. Bergen, N. D. Goodman, and R. Levy. That’s what she (could have) said: How alternative
utterances affect language use. In Proceedings of the Thirty-Fourth Annual Conference of the
Cognitive Science Society, 2012.
B. Bonet and H. Geffner. Planning as heuristic search: New results. Artificial Intelligence, 129(1-2):
5–33, June 2001.
H. P. Branigan, M. J. Pickering, J. Pearson, and J. F. McLean. Linguistic alignment between people
and computers. Journal of pragmatics, 42(9):2355–2368, Sept. 2010.
S. Brennan. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology,
1996.
B. Bruce. Plans and social action. Technical Report 34, 1977.
E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. Artificial Intelligence, 64(1):
53–79, 1993.
H. H. Clark. Using language. 1996.
H. H. Clark and C. R. Marshall. Definite reference and mutual knowledge. Psycholinguistics:
critical concepts in . . . , 2002.
R. M. Cooper. The control of eye fixation by the meaning of spoken language : A new methodology
for the real-time investigation of speech perception, memory, and language processing. Cognitive
Psychology, 6(1):84–107, Jan. 1974. doi: 10.1016/0010-0285(74)90005-x. URL http://dx.doi.
org/10.1016/0010-0285(74)90005-x.
D. A. Cruse. Meaning in Language: An Introduction to Semantics and Pragmatics. Oxford University
Press, 3 edition, 2011.
R. Dale and E. Reiter. Computational interpretations of the Gricean maxims in the generation of
referring expressions. Cognitive Science, 19(2):233–263, 1995.
L. Danlos and F. Namer. Morphology and cross dependencies in the synthesis of personal pronouns
in Romance languages. In COLING ’88: Proceedings of the 12th conference on Computational
linguistics. Association for Computational Linguistics, Aug. 1988.
G. Di Fabbrizio, A. J. Stent, and S. Bangalore. Referring expression generation using speaker-based
attribute selection and trainable realization (ATTR). In Proceedings of the Fifth International
Natural Language Generation Conference (INLG), pages 211–214, 2008.
P. Elbourne. Meaning: a slim guide to semantics. Oxford University Press, 2011.
P. E. Engelhardt, K. G. Bailey, and F. Ferreira. Do speakers and listeners observe the gricean maxim
of quantity? Journal of Memory and Language, 54(4):554–573, 2006.
G. Frege. On sense and reference. Ludlow (1997), pages 563–584, 1892.
K. Garoufi and A. Koller. Automated planning for situated natural language generation. Proceedings
of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1573–1582,
2010.
K. Garoufi and A. Koller. Combining symbolic and corpus-based approaches for the generation
of successful referring expressions. In ENLG ’11: Proceedings of the 13th European Workshop
on Natural Language Generation, pages 121–131, Nancy, France, Sept. 2011. Association for
Computational Linguistics.
A. Gatt. Generating coherent references to multiple entities. PhD. Thesis. University of Aberdeen,
2007.
C. W. Geib and M. Steedman. On Natural Language Processing and Plan Recognition. Proceedings
of the 20th International Joint Conference on Artificial Intelligence, pages 1612–1617, 2007.
M. Ghallab, D. Nau, and P. Traverso. Automated Planning. Theory & Practice. Morgan Kaufmann,
May 2004.
E. Gibson. A computational theory of human linguistic processing: Memory limitations and
processing breakdown. 1991.
K. Golden and D. Weld. Representing sensing actions: The middle ground revisited. KR, 96:
174–185, 1996.
N. D. Goodman and A. Stuhlmüller. Knowledge and implicature: Modeling language understanding
as social cognition. Topics in Cognitive Science, 5(1):173–184, 2013.
M. Goudbeek and E. Krahmer. Alignment in Interactive Reference Production: Content Planning,
Modifier Ordering, and Referential Overspecification. Topics in Cognitive Science, 4(2):269–289,
Mar. 2012.
D. Graff. Shifting sands: An interest-relative theory of vagueness. Philosophical Topics, 28(1):
45–82, 2002.
P. A. Heeman and G. Hirst. Collaborating on referring expressions. Computational Linguistics, 21
(3):351–382, Sept. 1995.
H. M. Hersh and A. Caramazza. A Fuzzy Set Approach to Modifiers and Vagueness in Natural
Language. 1976.
J. R. Hobbs. Against confusion. Diacritics, 18(3):78–92, 1988.
J. R. Hobbs, M. E. Stickel, D. E. Appelt, and P. Martin. Interpretation as abduction. Artificial
Intelligence, 63, 1993.
J. Hoffmann. FF: The Fast-Forward Planning System. AI Magazine, 22(3):57, Sept. 2001.
H. Horacek. On Referring to Sets of Objects Naturally. In Natural Language Generation, pages
70–79. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004a.
H. Horacek. On referring to sets of objects naturally. In Natural Language Generation, pages 70–79.
Springer, 2004b.
M. J. Hußmann and H. Genzmann. On trying to do things with words: another plan-based
approach to speech act interpretation. In COLING ’90: Proceedings of the 13th conference on
Computational linguistics. Association for Computational Linguistics, Aug. 1990.
N. Ide and C. Macleod. The american national corpus: A standardized resource of american english.
In Proceedings of Corpus Linguistics 2001, volume 3, 2001.
P. Jackson and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval,
Extraction and Categorization. Natural language processing. John Benjamins Pub., 2007. ISBN
9789027249920.
P. Jordan. Learning content selection rules for generating object descriptions in dialogue. Journal
of Artificial Intelligence Research, 2005.
G. Kempen and E. Hoenkamp. Incremental sentence generation: implications for the structure
of a syntactic processor. In COLING ’82: Proceedings of the 9th conference on Computational
linguistics. Academia Praha, July 1982.
A. Kibrik. Reference in Discourse. Oxford University Press, July 2013.
E. Klepousniotou. The Processing of Lexical Ambiguity: Homonymy and Polysemy in the Mental
Lexicon. Brain and Language, 81(1-3):205–223, Apr. 2002.
A. Koller and J. Hoffmann. Waking up a sleeping rabbit: On natural-language sentence generation
with FF. In Proceedings of AAAI 2010, 2010.
A. Koller and R. P. A. Petrick. Experiences with planning for natural language generation.
Computational Intelligence, 27(1):23–40, Feb. 2011.
A. Koller and M. Stone. Sentence generation as a planning problem. Annual Meeting of the
Association of Computational Linguistics, 45(1):336, 2007.
A. Koller, A. Gargett, and K. Garoufi. A scalable model of planning perlocutionary acts. In
Proceedings of the 14th Workshop on the Semantics and Pragmatics of Dialogue, 2010.
E. Krahmer and M. Theune. Context sensitive generation of descriptions. pages 1151–1154, 1998.
E. Krahmer and M. Theune. Efficient generation of descriptions in context. Proceedings of the
ESSLLI workshop on the generation of nominals, pages 223–264, 2002.
E. Krahmer and K. van Deemter. Computational Generation of Referring Expressions: A Survey.
Computational Linguistics, 2012.
D. Lassiter. Vagueness as probabilistic linguistic knowledge. Vagueness in Communication, pages
127–150, 2011.
D. J. Litman and J. F. Allen. A plan recognition model for clarification subdialogues. In ACL ’84:
Proceedings of the 10th International Conference on Computational Linguistics and 22nd annual
meeting on Association for Computational Linguistics. Association for Computational Linguistics,
July 1984.
C. Mellish, M. Reape, D. Scott, L. Cahill, R. Evans, and D. Paiva. A Reference Architecture for
Generation Systems. Natural Language Engineering, 10(3-4), Sept. 2004.
C. S. Mellish. Coping with uncertainty: Noun phrase interpretation and early semantic analysis.
1981.
J. S. Mill. A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of
Evidence, and the Methods of Scientific Investigation: in Two Volumes, volume 1. Parker, 1851.
M. Minsky. The Emotion Machine. Commonsense Thinking, Artificial Intelligence, and the Future
of the Human Mind. Simon and Schuster, Nov. 2007.
P. Norvig. Multiple simultaneous interpretations of ambiguous sentences. 1988.
P. Norvig and R. Wilensky. A critical evaluation of commensurable abduction models for semantic
interpretation. Proceedings of the 13th conference on Computational linguistics-Volume 3, pages
225–230, 1990.
M. Palmer. Consistent criteria for sense distinctions. Computers and the Humanities, 2000.
T. Pechmann. Incremental speech production and referential overspecification. Linguistics, 27
(1):89–110, 1989.
A. Radul and G. J. Sussman. The Art of the Propagator. Technical Report MIT-CSAIL-TR-2009-002,
MIT Computer Science and Artificial Intelligence Laboratory, 2009.
S. Raghavan and R. J. Mooney. Abductive Plan Recognition by Extending Bayesian Logic Programs.
In Machine Learning and Knowledge Discovery in Databases, pages 629–644. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2011.
M. J. Ramírez and H. Geffner. Probabilistic plan recognition using off-the-shelf classical planners.
Proceedings of the Conference of the Association for the Advancement of Artificial Intelligence
(AAAI 2010), 2010.
A. Rayo. A Plea for Semantic Localism. Noûs, 2010.
E. Reiter. Has a consensus NL generation architecture appeared, and is it psycholinguistically
plausible? Proceedings of the Seventh International Workshop on Natural Language Generation
(INLG 1994), 1994.
E. Reiter and R. Dale. A fast algorithm for the generation of referring expressions. pages 232–238,
1992.
C. Roberts. Context in dynamic interpretation. The handbook of pragmatics, pages 197–220, 2004.
D. Schlangen, T. Baumann, and M. Atterer. Incremental reference resolution: the task, metrics
for evaluation, and a Bayesian filtering model that is sensitive to disfluencies. In SIGDIAL ’09:
Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest
Group on Discourse and Dialogue. Association for Computational Linguistics, Sept. 2009.
H. Schriefers, J. P. De Ruiter, and M. Steigerwald. Parallelism in the production of noun phrases:
Experiments and reaction time models. Journal of Experimental Psychology, 25(3):702, 1999.
J. C. Sedivy, M. K. Tanenhaus, C. G. Chambers, and G. N. Carlson. Achieving incremental semantic
interpretation through contextual representation. Cognition, 71(2):109–147, June 1999.
R. C. Stalnaker. Assertion Revisited: On the Interpretation of Two-Dimensional Modal Semantics.
Philosophical Studies, 118(1/2):299–322, 2004.
M. Stone. On identifying sets. In INLG ’00: Proceedings of the first international conference on
Natural language generation. Association for Computational Linguistics, June 2000.
M. Stone and B. Webber. Textual Economy through Close Coupling of Syntax and Semantics.
arXiv.org, June 1998.
M. Stone, C. Doran, B. Webber, T. Bleam, and M. Palmer. Microplanning with communicative
intentions: The SPUD system. Computational Intelligence, 19(4):311–381, 2003.
M. K. Tanenhaus. Spoken language comprehension: Insights from eye movements. Oxford
handbook of psycholinguistics, pages 309–326, 2007.
M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy. Integration
of visual and linguistic information in spoken language comprehension. Science,
268(5217):1632–1634, June 1995.
M. Tomasello. Origins of Human Communication. MIT Press, July 2008.
K. van Deemter. Generating vague descriptions. In INLG ’00: Proceedings of the first international
conference on Natural language generation. Association for Computational Linguistics, June
2000.
K. van Deemter. Generating Referring Expressions that Involve Gradable Properties. Computational
Linguistics, 2006.
K. van Deemter. Not Exactly: In Praise of Vagueness. Oxford University Press, 2010.
K. van Deemter, A. Gatt, I. Sluis, and R. Power. Generation of referring expressions: Assessing the
Incremental Algorithm. Cognitive Science, 2011a.
K. van Deemter, A. Gatt, R. P. G. van Gompel, and E. Krahmer. Toward a Computational Psycholinguistics of Reference Production. Topics in Cognitive Science, 2011b.
J. Viethen and R. Dale. Referring expression generation: what can we learn from human data. In
Proceedings of the Pre-Cogsci Workshop on Production of Referring Expressions: Bridging the Gap
between Computational and Empirical Approaches to Reference, volume 29, 2009.
J. Viethen, R. Dale, and M. Guhe. The impact of visual context on the content of referring
expressions. In ENLG ’11: Proceedings of the 13th European Workshop on Natural Language
Generation. Association for Computational Linguistics, Sept. 2011.
J. Viethen, M. Goudbeek, and E. Krahmer. The Impact of Colour Difference and Colour Codability
on Reference Production. pages 1084–1089, 2012.
A. Vogel, C. Potts, and D. Jurafsky. Implicatures and Nested Beliefs in Approximate Decentralized-POMDPs. nlp.stanford.edu.
T. Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, Jan. 1972.
L. A. Zadeh. Fuzzy logic and approximate reasoning. Synthese, 30(3-4):407–428, 1975.
S. Zarrieß and J. Kuhn. Combining Referring Expression Generation and Surface Realization:
A Corpus-Based Investigation of Architectures. In Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pages 1547–1557, Sofia,
Bulgaria, Aug. 2013. Association for Computational Linguistics.