NIMBUS: A Hybrid Cloud-Crowd Realtime
Architecture for Visual Learning in Interactive
Domains
Nick DePalma and Cynthia Breazeal
Personal Robots Group
MIT Media Lab
20 Ames Str.
Cambridge, MA 02139
Abstract—Robotic architectures that incorporate cloud-based resources are just now gaining popularity [1], yet few investigations have examined their capabilities closely enough to support claims of their feasibility. We propose a novel method that trades response speed for label quality. We back this proposal with empirical findings from experiments performed with Amazon Mechanical Turk and find that, within our cognitive architecture, the method improves quality at the cost of response time.
I. INTRODUCTION
Real, working robotic systems must incorporate deliberate, system-level architecture designs. The heterogeneity present in autonomous electro-mechanical (and biological) systems means that interfaces between differing subsystems are inevitable consequences of assembly. But while cognitive architecture research has, by definition, been inspired by biology, neuroscience, cognitive science, and psychology, we must also recognize that robots are connected to resources that humans and animals do not have access to. Wireless and network connectivity allows many robots to draw on Internet resources to support an interaction.
Our software framework has been built over the past few decades around the idea of synthetic lifelike agents [2], and was adapted into a more embodied approach that takes sensors and motors into account and incorporates elements of interactive task learning [3] and principles of animation [4]. Recently, our architecture has forked into two separate initiatives: one that emphasizes the ability to take advantage of small, embedded, and ever-improving smartphones [5], [6], [7], and one that emphasizes the ability to operate across small- and large-scale computer systems [8], [9]. This paper focuses on a simple pipeline that unifies these efforts to support lightweight calls to resources not available to the robot alone. With this innovation, the robot may decide to connect with individuals on the Internet based on the availability of resources and of participants not physically present in the agent's interaction with the world.
Our server, Nimbus (always online at http://nimbus.media.mit.edu), operates as an always-available resource for robots in the field and for their interaction partners. Nimbus serves to collect and coordinate information through an always-on and resilient data collection interface, to crowdsource on demand when the robot absolutely needs it, and to act as a computational resource for highly intensive processes. This paper documents underlying questions regarding speed of response and security that are highly relevant to interactive domains like human-robot interaction.
The trade-off between the quality of a response from people on the Internet and the speed at which an autonomous system can obtain it is still not well understood. Over the past few years, our research group has built a resilient cloud infrastructure to support robots on the ground and in the field. One of the higher-level features of the system is to retrieve information from databases in the cloud and, when that information is unavailable, to obtain high-quality labels from crowds of people on systems such as Amazon Mechanical Turk or CrowdFlower. In this paper, we measure round-trip times from the robot to the crowd and back and discuss what one can expect from a system that leverages crowdsourcing as a core cognitive resource. We introduce a novel mechanism that improves the quality of the response at the expense of response speed. We conclude with remarks about the future of technology built for efficient, interactive labeling in support of interaction.
Related Work
Few architectures have integrated real-time crowdsourcing and cloud-based resources into their cognitive architectures, and those that have done so have not been applied to socially intelligent robotics tasks that support dynamic interactions. Many surveys cover the broad spectrum of research that utilizes crowdsourcing in robotic learning tasks, ranging from perception learning [10] to action learning [11]. For a survey of projects that use crowdsourcing in machine learning and robotic tasks, see the comprehensive reviews by Kehoe et al. [12], [13]. The RoboEarth project remains the most mature vision of cloud resources integrating backend architectures to support perception, action, and mapping [14]. Rapyuta, its underlying framework, can be very powerful, but its emphasis on non-interactive tasks makes real-time latency issues secondary at its level of inquiry [15]. Details on many of these architectures are sparse, with
Rapyuta [15] and the work by Osentoski et al. on crowdsourced closed-loop control with a single online participant [16] providing the most concrete details. Other architectures treat crowdsourcing as a secondary or tertiary characteristic of the research process, performed after data collection to support annotation.

Figure 1. Cognitive architecture for the visual learning problem. The interactive robot front-end runs a perception-action loop in which deictic gestures trigger changes in perception and deictic actions trigger behavioral saccades; resource and label requests flow over agent-cloud interactions to the back-end, where NIMBUS (a server farm) performs large-scale data, response, and crowd management and returns cloud or crowd responses. Round-trip loop 1 spans robot and cloud; round-trip loop 2 spans NIMBUS and the crowd.
The research mentioned above (except perhaps [15]) does not treat the cloud as a resource the robot can autonomously decide to use when other resources are unavailable. Rapyuta schedules resources between the cloud and the real-time system using specialized scheduling algorithms, so that large tasks can run on cloud-based hardware or on local embedded systems inside the robot, depending on the needs and demands of the entire architecture. Our system, instead, focuses on the labeling task as a source of information that can be obtained on demand, either through a social interaction between the interaction partner and the robot or through an interaction the robot can request online.
In this work we use this crowdsourcing system to support interactive perceptual learning between the robot and its environment. To support this level of inquiry, we also review real-time perceptual learning systems. Raptor [17] is a recent (and closely related) successful dynamic training process that utilizes the power of ImageNet [18], a large database of object categories, to discover early image priors that it can use as in-situ training instances. ImageNet is used entirely for figure-ground patch discovery. These image patches can then be used to train fast whitened HOG-based object detectors dynamically throughout the interaction. The process takes under 30 seconds for a new image patch to reify into an object detector. Our perception-action loop extends this type of system and allows the robot to take deictic actions toward exchanging figure-ground hypothesis enumeration with a social partner to support object discovery. This paper serves to support this perception-action loop (documented more thoroughly in [19], [20], [21]) and focuses primarily on incorporating crowd-based labels into the cognitive architecture.
II. ON-THE-FLY VISUAL LABELING FROM LOCAL AND REMOTE SOURCES
Our system seeks to realize a simple abstraction: perception-action loops define our visual labeling task interactively, and backend loops between the real-time embedded system and our servers exist to support the needs of the interactive system (see Figure 1). We have chosen a visual labeling task because this is an area where early stimuli need to be reified into abstract representations that can be reused within the interaction. Interactive visual labeling, in which a system takes early visual stimuli and dynamically creates stable object detectors (or, more abstractly, deictic pointers [22], [23]) in real time, is still a large open problem space rich with potential impact.
To demonstrate the power of this approach, we have built a system that learns labels both through an interaction and through our server architecture, which manages an Amazon Mechanical Turk account. Figure 1 shows the cognitive architecture that surrounds our system. At the interaction level, the system uses gaze and deictic gesture to visually guide the robot toward relevant areas in the scene. It uses finite-horizon reasoning and simple low-level cues to extract selected image-patch foregrounds from the scene. This interactive guiding process directs the robot toward stimuli it would not normally attend to if it operated in a purely discretized, object-oriented manner, and additionally lets it use those early stimuli as training instances for socially relevant deictic and lexical referencing. In the event that the user walks away, the robot can still saccade to new stimuli and collect labels from Amazon Mechanical Turk for a small fee.
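This fallback amounts to a small policy wrapped around the perception-action loop. The following is a minimal sketch under assumed, hypothetical interfaces; the `partner` and `nimbus` handles and their methods are illustrative stand-ins, not the actual NIMBUS API:

```python
def acquire_label(image_patch, partner, nimbus, timeout_s=300):
    """Prefer the co-present social partner; fall back to the crowd.

    `partner` and `nimbus` are hypothetical handles standing in for the
    interaction layer and the cloud back-end described in the text.
    """
    if partner is not None and partner.is_engaged():
        # Interactive path: deictic gesture plus verbal query to the partner.
        return partner.ask_for_label(image_patch)
    # Partner walked away: saccade to the stimulus and pay for a crowd label.
    return nimbus.request_crowd_label(image_patch,
                                      reward_usd=0.25,
                                      timeout_s=timeout_s)
```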
The following two sections explore a natural tension in the crowdsourcing architecture: the trade-off between speed and quality. We introduce two methods to document the trade-off one can expect from systems of this type. The first, which we call one-shot labeling, attempts to collect a label as fast as possible. The second, called rollover labeling, attempts to obtain a reliable label in the presence of noisy, error-prone crowd responses by rolling the result from one crowdsourcing participant over to another in an attempt to correct and clarify previous labels.
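To make the two strategies concrete, here is a minimal sketch of how they differ. The `nimbus` helpers are hypothetical stand-ins for the server's Mechanical Turk wrapper, and the fixed number of rollover rounds is an assumption rather than a documented parameter of the system:

```python
def one_shot_label(nimbus, image):
    """One-shot labeling: post a single task and accept the first answer."""
    task = nimbus.post_labeling_task(image)
    return nimbus.wait_for_answer(task)            # fastest, but noisiest

def rollover_label(nimbus, image, max_rounds=3):
    """Rollover labeling: each new worker sees the previous answer and is
    asked to verify or correct it before the label is returned."""
    answer = None
    for _ in range(max_rounds):
        task = nimbus.post_labeling_task(image, previous_answer=answer)
        answer = nimbus.wait_for_answer(task)
        if answer is not None and nimbus.verify(image, answer):
            break                                  # a worker confirmed the label
    return answer
```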
III. TIME-TO-RESPONSE MEASUREMENTS
What’s the shortest latency that one could expect if the
robot needs the label as quickly as possible? And what do
we lose if we want the highest quality response? In our case, our experience is that to use crowdsourcing as an on-demand resource, the robot must:
1) take a snapshot from its sensors,
2) upload the data, wrapped in a visually legible and descriptive manner, to a server where it can be reliably served to a participant online,
3) find a participant to label it for a price (this seems to be the slowest step),
4) wait for the participant to accept the job and read the instructions, and
5) await the response.

We used our architecture to snap a picture to label from a webcam on a simple robot (see Figure 2). The image and the requisite interface details were uploaded to Nimbus, and a process began to synchronously wait for a response. Total round-trip times are reported in Figure 3. Our process used Amazon Mechanical Turk at a competitive rate of $0.25 for a 45-second labeling task. We ran the experiment 20 times for each strategy.

Figure 2. Data collection procedure. After the decision to label, the robot snaps the scene and uploads it through the Mechanical Turk wrapper to one of two labeling strategies, one-shot labeling or labeling with rollover (which adds a verification step), and the resulting label is returned. Workers see the task caption "Can you find the object in my hand?"

Figure 3 shows that one-shot labeling can be a relatively quick process. Response times for one-shot labeling fell in the range [12.9 s, 520 s] with a median of 37.5 s, while the rollover process fell in the range [95 s, 1217.8 s] with a median of 263.58 seconds, or approximately 4.3 minutes. It is clear that one-shot labeling has the potential to be a very fast process, optimistically taking only about 13 seconds!

Figure 3. Response times of the two backend data collection strategies, in seconds.
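A minimal sketch of the round trip enumerated in steps 1)-5) above, written against Amazon Mechanical Turk's boto3 client, is shown below. The HIT title, reward, and polling behavior mirror figures reported in this section; the external task URL served by Nimbus is a placeholder, since the paper does not specify the HIT layout:

```python
import time
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

EXTERNAL_QUESTION = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>{url}</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

def request_crowd_label(image_url, poll_s=5, timeout_s=1800):
    """Post one labeling HIT (one-shot) and block until a worker answers."""
    start = time.time()
    hit = mturk.create_hit(
        Title="Can you find the object in my hand?",
        Description="Draw a bounding box around the object the robot is holding.",
        Keywords="image, labeling, bounding box, robot",
        Reward="0.25",                      # rate used in the experiment
        MaxAssignments=1,                   # one-shot: a single worker
        AssignmentDurationInSeconds=300,    # task advertised as roughly 45 s
        LifetimeInSeconds=timeout_s,
        Question=EXTERNAL_QUESTION.format(url=image_url),
    )["HIT"]
    # Step 5: synchronously await the response, measuring the round trip.
    while time.time() - start < timeout_s:
        submitted = mturk.list_assignments_for_hit(
            HITId=hit["HITId"],
            AssignmentStatuses=["Submitted"])["Assignments"]
        if submitted:
            return submitted[0]["Answer"], time.time() - start
        time.sleep(poll_s)
    return None, time.time() - start
```

Rollover labeling would wrap a call like this in a loop, embedding the previous worker's answer in the task page served at the external URL so the next worker can verify or correct it.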
IV. QUALITY-SPEED TRADEOFF
On the other hand, what kind of data quality can one expect from a quick, roughly 30-second one-shot label acquisition? In previous experiments [8], we found that the quality of online data can vary widely. Using the results acquired from our 20-participant online experiment, we calculated the mean bounding box and used it as a baseline answer. We then calculated the mean squared error (MSE) with respect to this baseline and found that rollover significantly reduced the overall error. Additionally, the rollover responses showed reduced variance. This suggests that the extra time participants spent on the task produced a higher-fidelity bounding box around the training instance and thus a higher-quality result. Table I shows that while the answers can begin with many errors, rollover helps correct them, and the mean squared error of the final answer is significantly better than that of one-shot labeling.
Table I
QUALITY-SPEED TRADEOFF BETWEEN ONE-SHOT LABELING AND LABELING WITH ROLLOVER

                        One-Shot Labeling   With Rollover (Final)
MSE from baseline       518.6 px            341.5 px
Standard Deviation      22.73 px            18.4 px
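For reference, the error measure reported in Table I can be reproduced along the following lines. The exact way the four box coordinates were pooled into a single pixel error is not stated in the paper, so averaging the squared corner offsets here is an assumption:

```python
import numpy as np

def mse_vs_mean_box(boxes):
    """boxes: (n, 4) array of worker answers [x1, y1, x2, y2] in pixels.

    Returns the mean squared error against the mean ("baseline") bounding
    box, plus the spread of the per-answer errors.
    """
    boxes = np.asarray(boxes, dtype=float)
    baseline = boxes.mean(axis=0)                     # mean bounding box
    per_answer = ((boxes - baseline) ** 2).mean(axis=1)
    return per_answer.mean(), per_answer.std()

# Usage (hypothetical data): mse_vs_mean_box(one_shot_boxes)
#                            mse_vs_mean_box(rollover_boxes)
```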
V. DISCUSSION & FUTURE WORK
The conclusions of this experiment show that to unify a cognitive architecture that incorporates interactive perception-action loops with interactions between the robot and crowds of people on the Internet, there is a clear need to reason about both the speed at which a response is expected (especially if the robot is presently interacting with a scene) and the quality of response one can expect. Even in the best circumstance, one-shot labeling takes at least 12.74 seconds and can take upwards of 4 minutes depending on the crowd response latency. Timing and scheduling when, and to which interaction partners, a query should be made is an interesting dynamic challenge for this always-on hybrid architecture.
In performing this experiment, other strategies became salient. For example, we found that while some of the bounding boxes and labels provided were rife with error, Amazon Mechanical Turk workers provided good results on average. Our assumption had been that one-shot labeling would not, on average, provide good results and that rollover would be needed; instead, we found that with many answers, a researcher can statistically remove outliers from the results of many parallel queries. This opens up an alternative to strict rollover: parallel one-shot queries that allow the agent to rely on the wisdom of the many to counter the untrustworthiness of individuals, and then to simultaneously model these workers and weight them appropriately. This parallel process may capitalize on one-shot labeling speed while still keeping results of high quality, as sketched below.
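The sketch below illustrates this parallel alternative: issue several one-shot queries at once and aggregate them while discarding outliers. The 2-sigma cutoff and the median/mean combination are illustrative choices, not parameters taken from the system:

```python
import numpy as np

def aggregate_parallel_answers(boxes, sigma=2.0):
    """boxes: (n, 4) array of bounding boxes from parallel one-shot queries.

    Returns a consensus box and a mask of which workers were kept; the mask
    could later feed a per-worker reliability model.
    """
    boxes = np.asarray(boxes, dtype=float)
    center = np.median(boxes, axis=0)                # robust initial estimate
    dist = np.linalg.norm(boxes - center, axis=1)    # each worker's deviation
    keep = dist <= dist.mean() + sigma * dist.std()  # drop outlier workers
    return boxes[keep].mean(axis=0), keep
```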
As we build social, embodied agents, we stake out a position that emphasizes that many of the "usable representations" must have social utility. These robots will inevitably be left to their own devices at some point in their life cycle, and while we can rely on the hope that unsupervised learning will save us, this work emphasizes that robots are never alone when they are connected to the Internet. Beginning to interact with participants online is a simple trade-off. Additionally, we believe that scaling our research from individual participants to groups of people to crowds of individuals allows us to begin to investigate social reasoning at a larger scale, opening up the idea of relationships between particular robotic agents in the field and particular individuals online, and opening up the possibility that an autonomous agent can begin to model the characteristics of individuals and reason about their reliability at a larger scale than ever thought possible with simple human-robot interaction alone.
ACKNOWLEDGEMENTS
This work was partially funded by the MIT Media Lab Consortia and by NSF Grant CCF-1138986. We also want to thank Hae Won Park for her many comments on earlier drafts of this paper, Jeff Orkin for his creativity, and Rob Miller for his leadership in pushing a crowdsourcing agenda that has been inspiring to everyone involved with the Cambridge Crowdfoo Initiative.
REFERENCES
[1] S. Thielman, "Man behind DARPA's robotics challenge: Robots will soon learn from each other," The Guardian, 2015. Accessed January 2016.
[2] R. Burke, D. Isla, M. Downie, Y. Ivanov, and B. Blumberg, "Creature smarts: The art and architecture of a virtual brain," 2001.
[3] A. L. Thomaz, Socially guided machine learning. PhD thesis, Massachusetts Institute of Technology, 2006.
[4] C. Breazeal, A. Brooks, J. Gray, M. Hancher, J. McBean, D. Stiehl, and
J. Strickon, “Interactive robot theatre,” Communications of the ACM,
vol. 46, no. 7, pp. 76–85, 2003.
[5] K. Williams and C. Breazeal, “Reducing driver task load and promoting
sociability through an affective intelligent driving agent (aida),” in
Human-Computer Interaction–INTERACT 2013, pp. 619–626, Springer,
2013.
[6] K. J. Williams, J. C. Peters, and C. L. Breazeal, “Towards leveraging
the driver’s mobile device for an intelligent, sociable in-car robotic
assistant,” in Intelligent Vehicles Symposium (IV), 2013 IEEE, pp. 369–
376, IEEE, 2013.
[7] G. Gordon, C. Breazeal, and S. Engel, “Can children catch curiosity
from a social robot?,” in Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pp. 91–98, ACM,
2015.
[8] N. DePalma, S. Chernova, and C. Breazeal, “Leveraging online virtual
agents to crowdsource human-robot interaction,” in Proceedings of CHI
Workshop on Crowdsourcing and Human Computation, 2011.
[9] C. Breazeal, N. DePalma, J. Orkin, S. Chernova, and M. Jung, "Crowdsourcing human-robot interaction: New methods and system evaluation in a public environment," Journal of Human-Robot Interaction, vol. 2, no. 1, pp. 82–111, 2013.
[10] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme:
a database and web-based tool for image annotation,” International
journal of computer vision, vol. 77, no. 1-3, pp. 157–173, 2008.
[11] A. Sorokin, D. Berenson, S. S. Srinivasa, and M. Hebert, “People helping
robots helping people: Crowdsourcing for grasping novel objects,” in
Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International
Conference on, pp. 2117–2122, IEEE, 2010.
[12] K. Goldberg and B. Kehoe, “Cloud robotics and automation: A survey
of related work,” EECS Department, University of California, Berkeley,
Tech. Rep. UCB/EECS-2013-5, 2013.
[13] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A survey of research on
cloud robotics and automation,” Automation Science and Engineering,
IEEE Transactions on, vol. 12, no. 2, pp. 398–409, 2015.
[14] “Roboearth | a world wide web for robots.” http://www.roboearth.org.
Accessed January 2016.
[15] D. Hunziker, M. Gajamohan, M. Waibel, and R. D’Andrea, “Rapyuta:
The roboearth cloud engine,” in Robotics and Automation (ICRA), 2013
IEEE International Conference on, pp. 438–444, IEEE, 2013.
[16] S. Osentoski, C. Crick, G. Jay, and O. C. Jenkins, “Crowdsourcing for
closed loop control,” in Proc. of the NIPS Workshop on Computational
Social Science and the Wisdom of Crowds, NIPS, 2010.
[17] D. Goehring, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell,
“Interactive adaptation of real-time object detectors,” in Robotics and
Automation (ICRA), 2014 IEEE International Conference on, pp. 1282–
1289, IEEE, 2014.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–
255, IEEE, 2009.
[19] N. DePalma and C. Breazeal, "Object discovery vs. selection in social action: Benefits of a competitive attention system," International Conference on Intelligent Robots and Systems (IROS), 2015.
[20] N. DePalma and C. Breazeal, "Sensorimotor account of attention sharing in hri: Survey and metric," 2nd Annual Symposium on Artificial Intelligence and Human-Robot Interaction, 2015.
[21] N. DePalma and C. Breazeal, “Towards bootstrapping socially significant representations through robotic interaction alone: the joint guided
search task,” in Fifth International Symposium on New Frontiers in
Human-Robot Interaction at the Artificial Intelligence and Simulation
of Behavior Convention (AISB’16), 2016.
[22] A. G. Brooks and C. Breazeal, “Working with robots and objects:
Revisiting deictic reference for achieving spatial common ground,” in
Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, pp. 297–304, ACM, 2006.
[23] D. H. Ballard, M. M. Hayhoe, P. K. Pook, and R. P. Rao, “Deictic
codes for the embodiment of cognition,” Behavioral and Brain Sciences,
vol. 20, no. 04, pp. 723–742, 1997.