Knowledge, Skill, and Behavior Transfer in Autonomous Robots: Papers from the 2015 AAAI Workshop

I Spy: An Interactive Game-Based Approach to Multimodal Robot Learning

Natalie Parde¹, Michalis Papakostas²,³, Konstantinos Tsiakas²,³, Maria Dagioglou³, Vangelis Karkaletsis³, and Rodney D. Nielsen¹

¹ Department of Computer Science and Engineering, University of North Texas
² Department of Computer Science and Engineering, University of Texas at Arlington
³ Institute of Informatics and Telecommunications, NCSR Demokritos
{Natalie.Parde, Rodney.Nielsen}@unt.edu, {Michalis.Papakostas, Konstantinos.Tsiakas}@mavs.uta.edu, {mdagiogl, vangelis}@iit.demokritos.gr

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract

Teaching robots about objects in their environment requires a multimodal correlation of images and linguistic descriptions to build complete feature and object models. These models can be created manually by collecting images and related keywords and presenting the pairings to robots, but doing so is tedious and unnatural. This work abstracts the problem of training robots to learn about the world around them by introducing I Spy, an interactive dialogue- and vision-based game in which players place objects in front of a humanoid robot and challenge it to guess which object they have in mind. The robot gradually learns about the objects and the features which describe them through repeated games, by updating its knowledge with newly captured training images. This paper details I Spy's learning and gaming processes, describes the approaches taken to extract information from multiple modalities both before and during gameplay, and finally discusses the results of a study designed to evaluate the game's model accuracy over time, its overall performance, and its appeal to human players.
Introduction
As interactive robots become increasingly ubiquitous in a diverse array of human-centered applications such as healthcare (Nielsen et al. 2010), education (Chang et al. 2010),
and assistive technologies (Galatas et al. 2011), a growing
importance must be placed upon facilitating their ability to
understand and learn about their physical environments in
order to create more advanced, autonomous robotic systems.
To do this, mappings between visual and linguistic input
can be learned by robots in order to build more robust, realistic feature and object models. One way to achieve this
is through multimodal explanatory interaction between humans and robots, which allows a simple and effective means
of linking images of the physical environment with text- or
spoken dialogue-based object descriptions.
Although well-developed object models can be built in this manner by simply and successively placing new objects in front of a robot and describing them (Holzapfel, Neubig, and Waibel 2008), doing so is not particularly interesting for the human participants and thus is also fairly unlikely to be implemented for any prolonged period of time. This work proposes that to better engage users and increase their motivation to further the robot's own learning process, the concept of explaining physical objects to robots can be embedded in a vision- and dialogue-based game—specifically, in a robotic variant of the traditional guessing game, I Spy.

This paper discusses the initial design and construction of a multimodal I Spy game meant to be played between humans and humanoid robots. It provides a brief overview of prior related work, followed by an in-depth explanation of the approach taken to integrate interactive gameplay with measures to extract visual and linguistic features from object descriptions to build and enhance feature and object models. It then presents a study designed to analyze: (1) the learning and gameplay performance of the robot, and (2) users' perceptions of the game itself. Finally, it concludes with an analysis of the results obtained during the study, as well as an outline of planned future work.

Background
Previously, the use of I Spy as a means of improving robot
vision was explored in (Vogel, Raghunathan, and Jurafsky 2010), which focused on asking questions about hardcoded features to improve vision-based category and attribute knowledge. Another system that has integrated multiple modalities for robot learning, albeit not in a game context, is Ripley (Mavridis and Dong 2012). Ripley is a nonhumanoid robotic system that learns color, size, and weight
properties via a mix of computer vision techniques, motor
actions (i.e., weighing objects), and spoken dialogue interactions with users.
Some systems have proven to be very effective at learning despite their lack of a multimodal approach. One of
the most popular of these is the Never-Ending Language
Learner (NELL) (Carlson et al. 2010), which continuously
extracts information from the web to construct "beliefs" regarding different concepts. Another is IBM's Watson (Ferrucci et al. 2010), which generates hypotheses based on information extracted from natural language and learns based on the successes and failures of these hypotheses. I Spy is similar to both of these in the sense that it adds to the robot's knowledge base using information extracted from natural language, and also updates this knowledge over time based on new input.

The work in I Spy primarily builds upon that of (Vogel, Raghunathan, and Jurafsky 2010) by extending the range of possible features to include traits otherwise undetectable by vision algorithms alone, and by learning and utilizing features extracted from players' object descriptions rather than by relying on pre-programmed features. Thus, the system builds multimodal feature models comprised of vision feature vectors correlated to a set of keywords drawn from user descriptions. This allows it to function at its deepest level as a multimodal interactive learning system.
Approach

Game Design

I Spy follows a two-stage process: (1) an initial learning phase, in which the robot populates its initial knowledge base by capturing a sequence of images and pairing the extracted visual feature vectors with object descriptions from players, and (2) a gaming phase, which effectuates the actual game. Both of these stages are further decomposed into sub-tasks requiring natural language processing, computer vision, and machine learning methods. The flow diagrams for these stages are presented in Figures 1 and 2. The robot platform is NAO V4, a humanoid robot created by Aldebaran Robotics (http://www.aldebaran.com/en), running the NAOqi v1.14.5 operating system. The JNAOqi SDK is used to integrate the robot with Java for motion control and image capture operations.

Initial Learning Phase. In the initial learning phase, the robot begins with an entirely empty knowledge base. To learn about an object, the robot captures a series of images of the object from different angles and distances. The captured images are then segmented to isolate the object, and visual feature vectors describing the isolated segments are extracted. In parallel, the player supplies a short description of the object, and the content keywords from this object description are automatically extracted as linguistic features. Finally, these visual and linguistic features are used to train a statistical model which associates keywords with visual feature vectors.

Figure 1: The robot's initial learning phase.

Gaming Phase. In the gaming phase, the robot is placed in front of a set of objects. The robot captures a predefined number of photos of the gamespace, and these images are segmented to isolate each of the objects in the game space. Visual feature vectors are extracted for each of the segmented objects, and classified according to the feature models trained in the initial learning phase. The robot uses the distribution of keywords returned for each visual vector in the gamespace as input to its decision-making module. In turn, the decision-making module uses this information as well as a set of belief probabilities regarding the objects to return the next keyword used in the game. A question is generated based on this keyword, and the player responds to the question with a "yes" or "no" answer. The robot then updates its belief probabilities for each object in the gamespace until it is confident enough to make a guess.

Figure 2: The robot's gaming phase.

Methods

Natural Language Processing Methods

Natural language processing is used in I Spy to extract keywords from users' descriptions, correlate those keywords with other features and numerical object IDs, and automatically generate questions. A brief description of how each of these tasks is achieved follows.

Keyword Extraction from Object Descriptions. Keywords are automatically extracted from users' object descriptions using a custom filter to eliminate common stopwords, prepositions and descriptors indicating relative direction, linking verbs or verbs indicating composition, and quantifiers. Stopwords are taken from the standard Apache Lucene (http://lucene.apache.org/) English stopword list. Prepositions and descriptors indicating spatial proximity are removed due to inconsistency with images acquired by the robot; for example, users viewing a mug from one angle may describe it as having a handle on its left, but if the robot views the mug from its opposite side then the handle will appear on its right. Words indicating composition (e.g., "made of," "includes") are removed because they exist in nearly every description; thus, they do not supply any real content. Quantifiers not providing any concrete information (e.g., "it has some lines") are removed because their definitions are too relative to individual players' perceptions to yield much meaningful content. Thus, the remaining set of extracted keywords for each object description is comprised of content words capable of providing discrete, quantifiable information. Common keywords extracted from various players' object descriptions include attribute words (e.g., "red," "small"), action words (e.g., "bounces," "opens"), and words indicating activities one could perform with the object (e.g., "play," "write").
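As a concrete illustration of this filtering step, the sketch below applies the same kinds of exclusion lists to a raw description. It is a minimal approximation, not the authors' implementation: the word lists are small illustrative stand-ins for the Lucene stopword list and the hand-built filter categories described above.

```python
# Minimal sketch of the keyword filter described above.
# The exclusion lists are illustrative placeholders, not the actual I Spy lists.
STOPWORDS = {"the", "a", "an", "it", "is", "that", "of", "and", "to", "on", "its"}
SPATIAL = {"left", "right", "top", "bottom", "front", "behind", "above", "below", "near"}
COMPOSITION = {"made", "makes", "includes", "contains", "consists"}
QUANTIFIERS = {"some", "several", "few", "many", "plenty"}

def extract_keywords(description: str) -> list:
    """Return the content keywords from a free-text object description."""
    tokens = [w.strip(".,!?\"'").lower() for w in description.split()]
    return [
        w for w in tokens
        if w
        and w not in STOPWORDS
        and w not in SPATIAL
        and w not in COMPOSITION
        and w not in QUANTIFIERS
    ]

print(extract_keywords("It is a small red ball made of rubber that bounces."))
# -> ['small', 'red', 'ball', 'rubber', 'bounces']
```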
Keyword Correlation. Once a new set of keywords is received, a matrix implemented using a dynamic, serialized set of hash tables is updated to record all co-occurrences between keywords existing in the same object description, and between numerical object IDs and all of their associated keywords. These counts are used in updating belief probabilities according to "yes" or "no" responses.
Natural Language Generation. To generate natural-sounding questions during gameplay, subjects and verbs are selected according to the chosen keyword's part-of-speech tag and, if applicable, its subfunction. Part-of-speech tags are acquired using the Stanford Part-Of-Speech Tagger (Toutanova et al. 2003) and are assigned to keywords based on their most frequent usage in input object descriptions, to avoid generating confusing or ambiguous questions (e.g., generating "Can you handle it?" rather than "Does it have a handle?" when most users' descriptions have used "handle" as a noun). The words included in function-based subcategories were manually selected, in order to quickly increase the clarity of questions, but in the future can be expanded to draw members from online sources to allow for a greater chance of matching highly specific keywords. Once a question's subject and verb are selected, they are paired with the keyword, which serves as the object of the question for most part-of-speech categories. The I Spy question is then in most cases constructed using SimpleNLG (Gatt and Reiter 2009), although custom templates are used for a small number of categories with which SimpleNLG's realizer yields unnatural results.
Computer Vision Methods

Computer vision is used in I Spy to segment images and extract visual feature vectors describing objects' texture, shape, and color, both during the initial learning phase and the gaming phase. Image segmentation is performed using the OpenCV (Bradski 2000) Watershed Algorithm. Extracted visual features are linked with all keywords associated with an object, even though the meaning of some of those keywords may not be expressed through texture, shape, or color, because the robot has no initial knowledge regarding which keywords define visual attributes. Likewise, it has no inherent notion of what type of visual attribute might be described by a visually-discernible keyword (to a robot with no knowledge, "red" is just as likely to describe a shape as it is a color). The three visual feature extractors used are described below.

Local Binary Patterns — Texture-based feature extraction. To find the texture of objects, I Spy uses local binary patterns (LBP) feature vectors due to their illumination invariance and efficiency under different environmental setups (Heikkila and Pietikainen 2006). To extract an LBP feature vector, an image segment is first divided into smaller disjoint regions. In these regions, each pixel is labeled with a binary value according to whether its original value is greater than or equal to that of the region's center pixel (1) or less than the central pixel's value (0). The result is a binary word describing all of the pixels in the region, and the concatenation of words from all smaller regions results in the LBP feature vector of the image.
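A per-region LBP descriptor of this kind can be sketched with scikit-image as below. The grid size, neighborhood parameters, and histogram binning are illustrative choices, not the settings used in I Spy, and the LBP variant ("uniform") is likewise an assumption.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature_vector(gray_segment: np.ndarray, grid=(4, 4), n_points=8, radius=1) -> np.ndarray:
    """Concatenate per-region LBP histograms into a single texture descriptor."""
    lbp = local_binary_pattern(gray_segment, n_points, radius, method="uniform")
    h, w = lbp.shape
    rows, cols = grid
    n_bins = n_points + 2  # number of distinct 'uniform' LBP codes
    histograms = []
    for i in range(rows):
        for j in range(cols):
            region = lbp[i * h // rows:(i + 1) * h // rows,
                         j * w // cols:(j + 1) * w // cols]
            hist, _ = np.histogram(region, bins=n_bins, range=(0, n_bins), density=True)
            histograms.append(hist)
    return np.concatenate(histograms)

# A random grayscale "segment" just to show the output dimensionality.
vector = lbp_feature_vector(np.random.randint(0, 255, (64, 64)).astype(np.uint8))
print(vector.shape)  # (160,) = 4 x 4 regions x 10 bins
```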
Histograms of Oriented Gradients — Shape-based feature extraction. To find the shape of objects, I Spy uses Histograms of Oriented Gradients (HOG). HOG works by first counting occurrences of gradient orientation in localized portions of an image segment. Then, it describes the local object shape within an image by the distribution of intensity gradients or edge directions.

RGB-S — Color-based feature extraction. Finally, to extract color-based features, I Spy uses RGB and HSV histograms. RGB is an additive color model based on the human perception of colors, and reproduces a broad array of colors by combining red, green, and blue light. HSV is a color model that describes the hue or tint of colors in terms of their shading (saturation or amount of gray) and brightness (value or luminance). I Spy describes color for a given image segment by extracting an RGB histogram for each separate channel (red, green, and blue), as well as a saturation histogram from the HSV color space to include information about the object's illumination. These histograms are then concatenated into a single color feature vector.

After extracting all of the texture, shape, and color features, the features are concatenated to construct a single visual feature vector containing all of the essential information to describe a given image segment.

Statistical Modeling

Gaussian Mixture Models (GMMs) are used to associate keywords with visual feature vectors in I Spy. GMMs are probabilistic models that assume that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. They have been used extensively as statistical models for image classification, including for classification and segmentation using color and texture features (Permuter, Francos, and Jermyn 2006). Currently, models with tied covariances are used in order to avoid data overfitting. Models are retrained following every five games so that newly captured images for all objects can be added to the training dataset. Each GMM is constructed with two components, and the EM algorithm is used for parameter estimation (component weights, means, and covariances). I Spy uses the Python SciKit-Learn (Pedregosa et al. 2011) library for model training.
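A minimal sketch of this setup with scikit-learn is shown below: one two-component, tied-covariance GMM per keyword, fit with EM, and keywords ranked for a new segment by the likelihood its models assign to the segment's feature vector. The per-keyword layout, the helper names, and the random training data are assumptions for illustration, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_keyword_models(training_data: dict) -> dict:
    """training_data maps keyword -> (n_samples, n_features) array of visual feature vectors."""
    models = {}
    for keyword, vectors in training_data.items():
        gmm = GaussianMixture(n_components=2, covariance_type="tied")  # two components, tied covariance
        gmm.fit(vectors)                                               # EM parameter estimation
        models[keyword] = gmm
    return models

def rank_keywords(models: dict, feature_vector: np.ndarray) -> list:
    """Order keywords by the log-likelihood each model assigns to one segment's features."""
    scores = {kw: gmm.score(feature_vector.reshape(1, -1)) for kw, gmm in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
data = {"red": rng.normal(0, 1, (40, 16)), "round": rng.normal(2, 1, (40, 16))}
models = train_keyword_models(data)
print(rank_keywords(models, rng.normal(2, 1, 16))[0])  # most likely: 'round'
```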
Decision-Making
To select the next I Spy question, the system's decision-making module chooses the feature that can best divide the existing set of highest-probability objects in half. This maximizes the "yes" or "no" answer's impact on the robot's current belief probabilities. At the onset of every I Spy game all belief probabilities are uniform, indicating that all objects are equally likely to be the one the player has in mind. As the gaming procedure progresses, these belief probabilities are updated. The amount that an answer impacts an individual object's belief probability can be found according to Equation 1, where p_new is the new belief probability, p_previous is the previous belief probability, k is a factor that can be raised or lowered to facilitate more rapid or gradual changes in belief probabilities during gameplay (set to 10.0 in this work), f_obj is the number of times the keyword is found in users' descriptions of the object, f_game is the number of times the keyword is found in users' descriptions of all objects in the gamespace, and z is a normalization factor that results in the probabilities summing to 1.0:
p_new = (p_previous × (k × (f_obj / f_game))) / z    (1)
Belief probabilities are continuously updated throughout
the game, until the robot reaches a confidence threshold.
Once this threshold is reached, the robot makes a guess regarding the object’s identity, and subsequently either wins
or loses the game. The confidence threshold is dynamically
defined, depending on the number of candidate objects currently in the game scene. These objects can change from one
round of play to the next.
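Read directly, Equation 1 boosts each candidate object in proportion to how often the asked keyword appears in that object's descriptions, then renormalizes. The sketch below implements that reading; how "yes" versus "no" answers modify the factor is not spelled out in the text, so applying the same multiplicative update to every object is an assumption made for illustration.

```python
# Sketch of the Equation 1 belief update over the objects currently in the gamespace.
K = 10.0  # sensitivity factor; set to 10.0 in this work

def update_beliefs(beliefs: dict, keyword_counts: dict) -> dict:
    """beliefs: object ID -> current probability; keyword_counts: object ID -> f_obj for the asked keyword."""
    f_game = sum(keyword_counts.values())  # keyword occurrences across all objects in the gamespace
    if f_game == 0:
        return beliefs
    unnormalized = {
        obj: p_prev * (K * (keyword_counts.get(obj, 0) / f_game))
        for obj, p_prev in beliefs.items()
    }
    z = sum(unnormalized.values())  # normalization so the probabilities sum to 1.0
    return {obj: p / z for obj, p in unnormalized.items()}

beliefs = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}            # uniform at the onset of a game
print(update_beliefs(beliefs, {1: 4, 2: 1, 3: 0}))  # object 1 becomes the clear favorite (~0.8)
```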
Experiment
To evaluate I Spy's model accuracy over time, its overall performance, and its appeal to human players, two experiments were performed. The first evaluated the system's performance at scale by simulating 510 games based on real human responses to all of the possible questions. In the second, 46 games were played in real time against humans in order to evaluate the game's appeal. The game's feature models were evaluated in the context of playing the games. Both variations of the gameplay were preceded by the same initial learning phase. Data collection methods for the initial learning phase as well as for the simulated and live games are presented below, followed by a description of the design for both experiments and, finally, the results.

Data Collection
Initial Learning Phase. To train the initial models used in both the simulated and live games, 17 objects of varied size, color, shape, and texture were used, including: digital clock, analog clock, red soccer ball, basketball, football, yellow book, yellow flashlight, blue soccer ball, apple, black mug, blue book, blue flashlight, cardboard box, pepper, green mug, polka dot box, and scissors. Forty-three participants (22 female) of various educational levels (4 had a high school diploma, 15 had some college experience, 5 had an associate's degree, 12 had a bachelor's degree, 3 had some post-bachelor college experience, and 4 had a master's degree) provided descriptions for the objects.

Object descriptions were acquired via Amazon Mechanical Turk (AMT) (https://www.mturk.com/mturk/welcome). Participants provided demographic information (age, gender, educational level, and native language) after completing each description to control for eligibility (native English speakers age 18 or over). The first six descriptions meeting eligibility requirements were kept for each object.

To acquire visual information for the objects, each object was placed on a white floor with a white background, and a NAO robot was programmed to stand up and walk in a predefined image capture path to take a series of photos of the object from different distances and angles (see Figure 3a for the image capture path, and Figure 3b for the initial learning phase in progress). Seven images were captured for each object. These images were segmented, and non-relevant segments were discarded manually, since individual image components are not the focus of this work.

Figure 3: Initial learning phase data collection. (a) The robot's image capture path. (b) The robot's initial learning phase in progress.
Simulated Games. To capture game images for the simulated games, all 17 objects were placed on the same white
surface used for the initial learning phase. The robot was
placed at three different locations of the game space and captured one photo at each location, to ensure that each object
was included in one of the images. The items were then rearranged so that they were in different positions and orders,
and the process was repeated. This was done for 30 unique
gamespace configurations, so that 90 images were captured
(see Figure 4).

Figure 4: The robot's game photo capture setup.
To obtain human responses for the simulated games, a
list was automatically generated containing all 289 unique
content keywords extracted from the user descriptions, and
a question was automatically generated for each keyword
in this list. AMT workers were provided one pre-captured
gamespace configuration’s images, and were asked to click
“yes” or “no” for each possible question for a specified object. 30 sets of answers (one per set of images) were acquired
for each of the 17 objects, leading to 510 responses.
Once the responses were collected, they were screened for spam. To do this, questions were first analyzed to decide whether they had a correct answer. Questions tagged as nonsensical by a third-party annotator (8 of the 289 possible questions were tagged as nonsensical) were ignored regardless. The remaining questions were determined to have a correct answer for a given object if the number of their 30 responses having the majority answer (either "yes" or "no") exceeded chance (15) by two standard deviations. Then, using the subset of questions having correct answers for a given object, the average accuracy of that object's 30 AMT workers was determined. All of a worker's answers were discarded if their accuracy was two standard deviations below the average. 38 of the original responses were replaced for this reason. Seven instances in which the same worker submitted more than one set of responses for the same object were also replaced.
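The "exceeded chance by two standard deviations" criterion corresponds to a concrete agreement threshold. The arithmetic below assumes a fair-coin binomial model for chance agreement over the 30 responses, which the paper does not state explicitly.

```python
import math

# For 30 yes/no responses, chance agreement is 15, and one standard deviation under a
# fair-coin binomial assumption is sqrt(30 * 0.5 * 0.5) ~ 2.74.
n_responses = 30
chance = n_responses * 0.5
std_dev = math.sqrt(n_responses * 0.5 * 0.5)
threshold = chance + 2 * std_dev  # ~20.48

def has_correct_answer(majority_count: int) -> bool:
    """A question has a correct answer for an object if the majority count clears the threshold."""
    return majority_count > threshold

print(round(threshold, 2), has_correct_answer(20), has_correct_answer(21))  # 20.48 False True
```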
The final set of 510 responses was collected from 432
AMT workers. 184 were male, 243 were female, and 5 declined to indicate a gender. Once again, educational experience varied: 35 had a high school diploma, 107 had some
college, 40 had an associate’s degree, 149 had a bachelor’s
degree, 10 had some post-bachelor college, 70 had a master’s degree, 10 had a doctoral degree, and 11 declined to
indicate an educational level.
Live Games. 34 different configurations of the 17 objects were photographed by the robot prior to the live games. These images were then segmented and visual features were extracted from the segmented objects, producing ready-to-play versions of I Spy and thus eliminating the need to wait for visual feature extraction during the game. 23 subjects (8 female, age range 19-56) of various educational backgrounds each volunteered to play the game twice, for a total of 46 games. All 34 configurations were played (12 were played twice). Each object was played either two or three times. All games were played on the same day, in the same setting that had been used for the initial learning phase. An online post-game survey was created with six multiple choice Likert scale questions (ranked from 1 to 5, with 5 being best), and two free-response questions (one asking for additional comments, and one asking for suggestions).

Experimental Design

Simulated Games. To run the simulated games, the captured game images were segmented and the resulting segments were manually matched to numerical object IDs for retraining. Then visual feature vectors for all of the segmented objects, for each of the 30 sets of game images, were automatically extracted. Once this was complete, the system simulated each of the 30 sets of 17 games sequentially, leading to a total of 510 games, one for each player-object pairing. Models were retrained after every five sets of games. For each I Spy question asked, the system simply looked up the corresponding answer.

Live Games. To run the human games, players engaged in I Spy with the same NAO robot used to capture the initial learning phase. Early players were given the opportunity to select a target object from among any of the 17 objects, but later players were assigned target objects in order to maintain an even distribution of objects played. Upon beginning the first of a player's two games, the robot stood up to greet the player and introduce itself, briefly surveyed the gamespace, and then sat back down to play. Gameplay proceeded with the robot asking the human player a question. Users responded with a positive or negative answer. The game continued until the robot's confidence in one of the objects was high enough that it could make a guess, at which point it either won or lost the game. The number of questions asked before making this guess, and other gameplay details, were automatically recorded by the system. Players were asked to complete the post-game survey following their second game.
Results
Model Evaluation. To evaluate the performance of I Spy’s
feature models over time, the top ten keywords returned for
three object images were tracked as additional feature vectors from game images were added to the training data. Table
1 presents the top ten keywords obtained from testing these
different object images immediately after the initial learning
phase, after the 34 sets of images from the live games had
been added to the training data, and again after the 30 sets
of images from the simulated games had been added. Keywords that users had provided in their descriptions of these
objects are in bold, and precision is noted for each test.
Table 1: Model Testing Over Time

Pepper
  Initial Learning Phase:  opened, box, square, tan, hard, handle, rectangular, colored, white, brown (Precision: 0%)
  +34 Images:              handle, small, red, round, green, stem, ground, hard, black, yellow (Precision: 40%)
  +30 Images (64 Total):   skinny, hot, grows, jalapeno, long, smooth, seeds, chile, edible, spicy (Precision: 100%)

Apple
  Initial Learning Phase:  closed, air, ball, round, ground, hard, black, rectangular, colored, brown (Precision: 10%)
  +34 Images:              byproduct, eaten, glossy, stem, firm, oval, shiny, red, round, yellow (Precision: 100%)
  +30 Images (64 Total):   glossy, firm, byproduct, tree, sticking, apple, natural, spotting, eaten, fruit (Precision: 100%)

Analog Clock
  Initial Learning Phase:  red, ball, blue, round, hard, yellow, metal, handle, white, black (Precision: 40%)
  +34 Images:              easy, system, alarm, clock, face, time, metal, round, white, black (Precision: 100%)
  +30 Images (64 Total):   old, large, classic, antique, fashioned, easy, striker, hands, numbers, metal (Precision: 100%)
Simulated Games. Out of the 510 simulated games
played, the robot won 259 and lost 251. On average, it asked
12.7 questions before making a winning guess, and 15.4 before making a losing guess. It won the most games (25 out of 30) when playing with the black mug, and the fewest (3 out of 30) when playing with the blue flashlight.
Live Games. The robot won 32 of the 46 games played
(70%). On average, it asked 13 questions before making a
winning guess, and 15 before making a losing guess. It won
all of the time when users had the digital clock, red
soccer ball, football, blue soccer ball, black mug, apple, or
polka dot box in mind, and lost all of the time when users
had the analog clock or blue flashlight in mind. It guessed
the polka dot box most easily, with an average of 7 questions
to win the game. Its longest game was one of the three in
which a player had the pepper in mind, asking 25 questions
before making a guess (the robot won).
Seventeen players responded to the post-game survey, and
the results of the survey are presented in Table 2. Of the
survey respondents, 11 had seen a robot previously, 5 had
interacted with a robot previously, and 1 worked in the field
of robotics. Only 3 had spoken with a robot before, but 16
had spoken with an automated call system. Comments on
the game were quite positive, including: “It’s awesome,” “I
loved it!” and “thanks, that was fun!”
Some comments also lent insight into the variability of
answers to I Spy questions: “I haven’t played I Spy in forever, but I remember not knowing whether to answer yes or
no to plenty of things (I'm indecisive)... I have no idea if
this affected the robot. It guessed right anyway.” Some users
also offered suggestions for future work.
Table 2: Player Perceptions of I Spy

Question                    1    2    3    4    5    Avg.
Naturalness of Questions    1    3    5    6    2    3.3
Personal Enjoyment          1    2    4    4    6    3.7
Platform (NAO) Choice       0    1    3    6    7    4.1
Interaction Level           1    2    3    3    8    3.9
Perceived Intelligence      2    3    5    6    1    3.1
Likability                  0    2    0    6    9    4.3

Discussion

The model evaluation clearly shows that the feature models perform better over time. Although initially only the most general features are correctly linked to keywords provided by the object descriptions, the features returned become more specialized over time. Thus, it is clear that the robot is able to correctly acquire knowledge regarding new features over repeated gameplay.

The robot's fairly low guessing accuracy for the simulated games can be explained by several factors. First, an analysis of the data collected for the simulated games showed that players had relatively low agreement among one another regarding object properties, even when answering questions about the same objects. Moreover, since the robot only knew that an object was described by a particular feature if that feature had been contained in one of the object's six descriptions, some responses that were correct to human observers were viewed differently by the robot.

The robot's guessing accuracy for the live games was noticeably higher than for the simulated games. One explanation for this is that users were primed more to give certain responses when questions were generated in real time, and these responses were more in line with the information found in the object descriptions. Another explanation is that users were simply putting more thought into their answers; after all, it is much more exciting to answer a question when playing a game with a robot than when clicking through a list of multiple choice questions. Despite the differences in accuracy, the number of guesses that the robot took to win or lose a game remained extremely close between simulated and live games.

Responses to the post-game survey were promising, with average rankings for all Likert scale questions scoring above 3. This indicates that I Spy is a feasible way to motivate people to engage in training robots about the world around them.
Conclusion and Future Work

The results of the I Spy experiments and model evaluation indicate that the game is a reliable method of training robots, and that it is appealing to human players. I Spy also offers enormous future research potential. Although so far only tested in a small preliminary trial (12 total participants from a separate study), the robot can be set to learn during gameplay by soliciting additional descriptions for the target object with which it lost. Then, even if the robot has never seen the object previously, it can link the player's description with the visual feature vector extracted from the image of the object. Much more substantial evaluation of this aspect of the game is needed before a conclusion can be made regarding its effects upon the robot's knowledge. However, one interesting question it creates lies in the determination of an optimal trade-off between winning the game and losing, both from the player's and from the robot's perspective—from the robot's learning perspective, losing the game is actually more valuable than winning, since it enables the robot to add new information to its knowledge.

Other interesting research challenges lie in specific subsets of the gameplay process. For example, as the robot learns an increasingly large number of keywords, it must continue to produce natural-sounding gameplay dialogue to query the user during gameplay. To automatically generate questions based on previously unknown keywords, the robot could dynamically search for existing questions that are appropriate for the feature via the web or other large text corpora. This movement toward more organically understanding the meaning of learned words would advance the system's ability to more closely model the human learning process, while simultaneously leading to advances in dynamically generating more diverse, natural questions—both for the purposes of I Spy and, more broadly, for applications requiring automatic question generation in general.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grants No. IIS-1262860,
CNS-1035913, IIS-1409897, and CNS-1338118. This work was partially carried out at the 2014 International Research-Centered Summer School (http://irss.iit.demokritos.gr) and in the context of Roboskel, the robotics activity of the Institute of Informatics and Telecommunications, NCSR Demokritos (http://roboskel.iit.demokritos.gr/).
References
Bradski, G. 2000. The OpenCV library. Dr. Dobb's Journal 25(11):120–126.
Carlson, A.; Betteridge, J.; Kisiel, B.; Settles, B.; Hruschka Jr, E. R.; and Mitchell, T. M. 2010. Toward an
architecture for never-ending language learning. In AAAI,
volume 5, 3.
Chang, C.-W.; Lee, J.-H.; Chao, P.-Y.; Wang, C.-Y.; and
Chen, G.-D. 2010. Exploring the possibility of using humanoid robots as instructional tools for teaching a second
language in primary school. Educational Technology & Society 13(2):13–24.
Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek,
D.; Kalyanpur, A. A.; Lally, A.; Murdock, J. W.; Nyberg,
E.; Prager, J.; et al. 2010. Building Watson: An overview of the DeepQA project. AI Magazine 31(3):59–79.
Galatas, G.; McMurrough, C.; Mariottini, G. L.; and Makedon, F. 2011. eyeDog: An assistive-guide robot for the visually impaired. In Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments, 58. ACM.
Gatt, A., and Reiter, E. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th
European Workshop on Natural Language Generation, 90–
93. Association for Computational Linguistics.
Heikkila, M., and Pietikainen, M. 2006. A texture-based
method for modeling the background and detecting moving
objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4):657–662.
Holzapfel, H.; Neubig, D.; and Waibel, A. 2008. A dialogue approach to learning object descriptions and semantic
categories. Robotics and Autonomous Systems 56(11):1004–
1013.
Mavridis, N., and Dong, H. 2012. To ask or to sense?
Planning to integrate speech and sensorimotor acts. In Ultra Modern Telecommunications and Control Systems and
Workshops (ICUMT), 2012 4th International Congress on,
227–233. IEEE.
Nielsen, R. D.; Voyles, R. M.; Bolanos, D.; Mahoor, M. H.;
Pace, W. D.; Siek, K. A.; and Ward, W. 2010. A platform for
human-robot dialog systems research. In AAAI Fall Symposium: Dialog with Robots.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.;
Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss,
R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.;
Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research 12:2825–2830.
Permuter, H.; Francos, J.; and Jermyn, I. 2006. A study of Gaussian mixture models of color and texture features for image classification and segmentation. Pattern Recognition 39(4):695–706.

Toutanova, K.; Klein, D.; Manning, C. D.; and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, 173–180. Association for Computational Linguistics.

Vogel, A.; Raghunathan, K.; and Jurafsky, D. 2010. Eye spy: Improving vision through dialog. In AAAI Fall Symposium: Dialog with Robots.