Where is this? - Gesture Based Multimodal
Interaction With An Anthropomorphic Robot
Niklas Beuter, Thorsten Spexard, Ingo Lütkebohle, Julia Peltason, Franz Kummert
Applied Computer Science, Faculty of Technology
Bielefeld University, Germany
Email: {nbeuter, tspexard, iluetkeb, jpeltaso, franz}@techfak.uni-bielefeld.de
Abstract—Traditional visitor guidance often suffers from the
representational gap between 2D map representations and the
real world. Therefore, we propose a robotic information system
that exploits its physical embodiment to present a readily
interpretable interface for visitor guidance. Similar to human
receptionists, it offers a familiar point of reference that can be
approached by visitors and supports intuitive interaction through
both speech and gesture.
We focus on employing an anthropomorphic body to improve
guidance functionality and interpretability of the interaction. The
map, which contains knowledge about the environment, is used
by robot and visitor simultaneously, with the robot translating
its content into gestures. This setting affords disambiguation of
information requests and thus improves robustness. It has been
tested both in a laboratory demonstration setting and in our
university hall, where people asked for information and thereby
used the system in a natural way.
Index Terms—Human Robot Interaction, Anthropomorphic
Robot, Multi-Modal Interaction, BARTHOC
I. INTRODUCTION
We present a robotic information system that assists visitors,
e.g. by giving directions to the next bank on campus or
by explaining an exhibit in a laboratory. It is based on an
anthropomorphic robot platform that affords an intuitive interaction style. For instance, the robot simply points in the right
direction to assist the user's orientation, and it also displays
basic facial expressions to signal interaction state. We suggest
that the physical embodiment of the robot is an advantage
over other, more conventional supports like maps or interactive
terminals, because there is a direct, natural translation from
the gesture to the environment. Additionally, the appearance
and placement of the robot mimics human receptionists and
thus provides a familiar contact point for visitors. Therefore,
we consider this scenario to be ideally suited for real-world
application of social robots.
Exploiting the situated nature of the interaction, we equip
the robot receptionist with a map of the environment. The
user can approach the robot and ask for information about
the environment referring to the map using both speech and
gesture. Thus, the map provides a basis for the multimodal
interaction between user and robot. Moreover, since gestures
on the map are reliably interpreted, it provides a way for
facilitating gesture recognition, which is generally still a hard
problem. Altogether, the use of the map allows for intuitive
and at the same time robust multimodal interaction, interpreted
and translated by the robot into recognisable gestures to give
directions (see figure 1).

Fig. 1. Humanoid robot BARTHOC, pointing in the direction of a desired object.
The system has been evaluated in two settings: firstly, a lab
room, where research prototypes and the involved hardware
are explained, and secondly, the university main hall, representative of a public area. The main hall in particular is huge
and houses the cafeteria, several restaurants, public facilities,
and provides access to all departments. Navigating it efficiently
regularly poses a challenge for visitors and new students alike.
The article is organised as follows: In the next section
related work on anthropomorphic robotics and their interaction
capabilities is presented. Afterwards we introduce our robot
BARTHOC and the basic software system. Subsequently, the
interaction scenario is described in detail, and in the last part
the evaluation results of this scenario are presented, followed
by a short conclusion.
II. RELATED WORK
Early work on social interaction in robots was performed by
Breazeal in the KISMET project [1]. She focused on facilitating communication through the expression of facial emotions and
produced a very expressive robotic head, capable of moving
eyes, eyebrows, eyelids, ears, lips, and the head itself. It has
been shown that such visual feedback is important in social
interaction in general (e.g., cf. [2]). However, KISMET only
communicated about objects in its immediate vicinity and did
not support reference to, or interaction about, places further
away.
A robot explicitly targeted at receptionist functionality
called the “roboceptionist” has been demonstrated by Kirby
et al. [3]. In contrast to KISMET, the roboceptionist uses a
screen as a face, to display a rendered human head. Compared
to BARTHOC, it also lacks arms. The stated goal of the project
is long-term interaction and building rapport, not short-term
information provision and the hardware used does not afford
physical embodiment.
Gestural interaction and pointing has been implemented in
ALBERT, by the University of Karlsruhe [4]. It is a mobile
robot, equipped with stereo vision and a seven-degrees-of-freedom actuator with a mounted hand, which is used to point
at objects. Modalities are not combined, however, and the
scenario is again centred on objects in the immediate vicinity.
A robot built with multiple modalities in mind is
Maggie [5]. The 1.35 m tall robot is equipped with sonar,
infrared, haptic and multiple camera sensors. A tablet PC is
integrated in the chest and used for visual output, and two arm-like tubes without any joints are connected to the sides of the
robot for basic gestures. Maggie can welcome guests and gives
feedback about its internal state with glowing diodes, but the
interaction is not particularly natural.
Robot projects which focus more on “naturalness” and
a human-like look utilise androids. These humanoid robots
copy the human body, which has been suggested to increase
acceptance from the user, because humans are accustomed to
communicate with each other (cf. [6]). One example is the
android Repliee Q2, built at Osaka University in Japan
[7]. The android simulates a human body in appearance and
movement, which is realised by actuators integrated in the
upper body. Repliee Q2 has the ability to imitate motions and
to perform gestures. While the human appearance encourages
natural communication, the abilities of the android are quite
sparse: it does not interact with users on its own and does not
act autonomously, but has to be remote-controlled.
In sum, it can be said that previous projects have focused
either on interaction as such or on emulating appearance.
In contrast, we focus on concrete functionality and
a reproducible scenario. This requires integration of several
capabilities: anthropomorphic appearance, speech and gesture
recognition, and production.

III. SYSTEM OVERVIEW

Fig. 2. The reception scenario. The user sits in front of the robot. Between
them is a table with additional information about the environment.
In this paper a robot system is proposed, which works as an
information system with a natural multimodal communication
interface. Imagine a robot positioned at the main entrance of a
big university. A guest is late, has to find the lecture room, but
has no idea where to go. Because only a few minutes are left,
there is no time to become acquainted with a complicated computer
system; a system like the one proposed here may be more readily
recognisable and usable.
In our scenario the user is expected in front of the robot
and a table is located between user and robot (see figure
2). Placed on the table is a map of the environment, where
all interesting objects or locations are indicated. When
asking a question, the user can indicate the place asked for
by pointing at the map. The map is also helpful for the
user, because it denotes objects known to the robot. The
different ways for the guest to obtain the desired information and the
functionality of the robot system are described in the following.
A. The robot BARTHOC
The proposed work was realised on our anthropomorphic
robot BARTHOC (Bielefeld Anthropomorphic Robot) (see
figure 1) [8], whose design models a human torso. The robot
has 41 degrees of freedom to move its upper body, its
arms and its head in a human-like way. The arms are built
with an elbow link and a wrist, connected to a hand with five
independently moving fingers. With 3 DoF in each link, the
arms are capable of moving similarly to human ones. For the
head, 11 motors are used to move it up and down, to the
sides, and to tilt it. Some actuators in the mouth, the cheeks,
the eyes and the eyebrows give the potential to generate facial
expressions by moving the skin on top of them. On top of the
motors and skull, a latex mask is attached to simulate skin.
With its arms and hands, the robot is able to show iconic,
symbolic and deictic gestures, which are important for our
interaction scenario.
The available sensors are cameras in the eyes and stereo
microphones, arranged similarly to those of a human head. Overall,
the robot is inspired by the human upper body in terms of
movement and sensing.
B. Interaction system
The robot has to perceive its environment and has to act
adequately to the present situation. Therefore, we apply a
combination of software modules, which handle the input,
the processing and the output for controlling an interaction [10]
(see figure 3). For the initialisation of an interaction, reliable
detection and robust tracking of an interaction partner is
crucial. Therefore, the robot BARTHOC uses sound and
face detection, including recognition of the human gaze
direction. If a person speaks to the robot, the speaker’s location
can be determined from the stereo microphone setup. If there
is a face looking at the robot, it most likely belongs to an
interaction partner. If this person initiates a communication by
speaking to the robot (e.g. “hello” or “hello robot”), a dialog
system [11] enables the robot to communicate and behave in
a human way. For speech recognition we use the HMM based
ESMERALDA [12], which delivers the identified words to our
speech understanding module [13]. The speech understanding
module creates a nested frame representation of the utterance.
Based on this structure, the dialog system decides what to
do next. The execution supervisor (ESV) is informed about
all actions and synchronises all involved modules, which is
necessary for a natural multimodal answer. The environment
information system (EIS) gathers all information about the
current action, which means that the spoken utterance and, if
needed, the accompanying pointing gesture have to be connected
(for a detailed description see Section IV). An additional
attention control for interacting with several persons at the
same time is also integrated. Therefore, every person around
the robot is remembered and the person with the highest
potential as an interaction partner is selected. The potential
is calculated through the clues of speaking, standing near the
robot and gazing at it. This attention and dialog system has
been developed for the mobile robot BIRON [14], but can be
used on diverse systems, including the humanoid BARTHOC
presented here. This modular system enables the robot to find,
continuously track, and interact with communication partners
in real time with human-equivalent modalities.

Fig. 3. The overall system for multimodal interaction. At the bottom, the
boxes describe the hardware of the robot BARTHOC and the sensors used.
Above are the modules which generate the behaviour of the robot: the
hardware control manages the movements of the actuators, the person tracking
remembers the recent interaction partner, and the speech recognition delivers
the spoken utterances. A person attention system is used to remember all
persons in the range of the robot and to focus on the person of attention. The
speech understanding analyses the spoken utterances and a dialog system
decides how to answer; the text-to-speech module produces the synthetic
output. The execution supervisor synchronises all actions. The environment
information system includes the gestures and produces the multimodal
movement of the robot.

IV. ENVIRONMENT INFORMATION SYSTEM
The novel component for the proposed reception scenario
is constituted by the environment information system. It integrates speech and gesture information with the scenario-specific information from the map, which is located on the
table between robot and user. The map contains all known
objects, e.g. physical items, special locations or persons. If
the user points at an object on the map, the robot reacts by
pointing at the position of the object in the real world.
The objects are integrated in the map by plotting them with a
specific width and height on a position on the map.
The real-world positions of the objects on the map are
known to the robot and stored in an XML document which
can be dynamically added and updated in the system.
Additionally, names, height and information about the object
itself are added to all objects, which builds the database for
the robot. For editing the map, a graphical user interface is
available, which permits the user to change the map content
at runtime. Using this information, the robot can determine
the positions of all objects in the real world and afterwards
calculate its kinematic movements for pointing at them.
Finally, the following information is included in the XML map
file:
• Map: Size, Scale
• Robot: Rotation, Position, Height
• Objects: ID, Size, Position, Height, Name, Annotation
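To make the structure of such a map description concrete, the following minimal sketch shows how the listed fields might be stored and loaded. The element and attribute names, as well as the scale convention, are illustrative assumptions and not the actual file format used on BARTHOC.

```python
# Minimal sketch of a hypothetical XML map file and loader.
# Element and attribute names are assumptions for illustration only;
# the actual BARTHOC map format is not specified further in this paper.
import xml.etree.ElementTree as ET

EXAMPLE_MAP = """
<map width_cm="80" height_cm="45" scale="0.5">  <!-- assumed: metres per map cm -->
  <robot x_cm="40" y_cm="0" rotation_deg="0" height_m="1.2"/>
  <object id="1" name="cafeteria" x_cm="12" y_cm="30"
          width_cm="3" depth_cm="3" height_m="3.0"
          annotation="Serves lunch from 11:30 to 14:00."/>
</map>
"""

def load_map(xml_text):
    """Parse the map description into plain dictionaries."""
    root = ET.fromstring(xml_text)
    scale = float(root.get("scale"))
    robot = dict(root.find("robot").attrib)
    objects = {o.get("name"): dict(o.attrib) for o in root.findall("object")}
    return scale, robot, objects

scale, robot, objects = load_map(EXAMPLE_MAP)
print(objects["cafeteria"]["annotation"])
```

A graphical editor, as mentioned above, would then only need to rewrite this file at runtime for the changes to become available to the robot.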
To detect a pointed object on the map in an interaction, the
gesture of the user has to be recognised and afterwards the
object has to be assigned. Therefore, we installed a camera
over the table, which observes the gestures and the map itself.
A hand-tracking component [15] is used for detecting the
gestures on the table. It provides the start and the endpoint
of a movement from a skin-coloured region that behaves
similarly to saved template movements. As templates, we stored
the pointing gestures for direct pointing, uncertain pointing and
simple left or right pointing. After a gesture is recognised, the
point at the end of the hand is determined, see figure 4. This
point is the most likely pointed position on the map.
The point results in image coordinates, but the object
position is relative to the map on the table. Consequently, the
position of the map in the image has to be determined for the
calculation of the relative object position. For the detection
of the map a monochrome background is chosen, because it
leads to a faster and more robust detection rate. With the
position of the map the coordinate of the gesture point on
the map can be calculated. Because the corners of the map
define its position, the corners are determined with a Sobel
edge detection algorithm. The edges of the map are searched
with a ray that starts at the centre of each image border;
if an edge is found, it is tested whether the edge belongs to
the map (figure 4b). The verification is done by comparing
the opposite edges in length, which have to be nearly equal.
The found edges are used to locate the corners of the map
by walking along the edges to their ends. The found map is
shown with a red rectangle in figure 4c. Afterwards the pointed
position is calculated in X and Y coordinates relative to the
map by determining the lengths of the left and upper triangles
shown in figure 4d. The side lengths a, b and c of the left and
upper triangles are given, because the corners of the map and
the pointing position are known. The angles needed for calculating X and
Y are determined with the law of sines and cosines, followed
by calculating X and Y themselves. The resulting values are matched
against the known objects in the database and, if there is an object
at the referred position, the corresponding information is extracted
for further processing.
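The triangle construction described above can be written down compactly. The sketch below assumes that the map corners and the gesture point are already given in image coordinates and that the map appears roughly undistorted in the camera image; it uses the law of cosines to obtain the distances along the upper and left map edges, as in figure 4d. It is meant as an illustration of the idea rather than the exact implementation.

```python
import math

def _along_edge(corner, other_corner, point):
    """Distance of `point` along the edge corner -> other_corner,
    obtained via the law of cosines (cf. figure 4d)."""
    a = math.dist(corner, other_corner)   # edge length in the image
    b = math.dist(corner, point)          # corner -> gesture point
    c = math.dist(other_corner, point)    # other corner -> gesture point
    cos_angle = (a * a + b * b - c * c) / (2.0 * a * b)  # angle at `corner`
    return b * cos_angle, a

def image_to_map(corners_img, point_img, map_w_cm, map_h_cm):
    """corners_img: (top_left, top_right, bottom_left, bottom_right) in pixels."""
    tl, tr, bl, _ = corners_img
    x_px, top_len = _along_edge(tl, tr, point_img)    # offset along the upper edge
    y_px, left_len = _along_edge(tl, bl, point_img)   # offset along the left edge
    return x_px / top_len * map_w_cm, y_px / left_len * map_h_cm

# Example: a slightly rotated 800 x 450 pixel map, gesture point near its centre.
corners = ((100, 50), (900, 60), (95, 500), (895, 510))
print(image_to_map(corners, (500, 280), 80.0, 45.0))
```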
An advantage of the automatic map detection is the fact
that the whole setup does not need any calibration. The height
of the table or the position of the camera do not affect the
calculation, and even moving the map or the table does not
influence it. It is, however, required that the map is
completely visible in the image.

Fig. 4. In image a) the recognised gesture point is denoted by a white dot.
Image b) shows the corner search algorithm used to find the correct position of
the map: a ray beginning in the middle of each image border searches for the
edges of the map and walks along the edges to the corners. In image c) the
found map is highlighted with a red rectangle. Image d) shows the calculation
of the gesture point in map coordinates; the lengths X and Y are computed
from the corners and the gesture point, each given in image coordinates.
In addition to the pointed object, the robot has to know which
information the user wants to get. Therefore, the user has to
tell the robot, what he/she is looking for. In our interaction
scenario the user has three possibilities: Firstly, he/she can
use the map and ask what kind of object he/she points at and
the robot delivers its stored information about it. Secondly,
the user can point at an object, whose position he/she wants
to know and thirdly, he/she can ignore the map and directly
ask the robot where he/she can find a specific object. The
robot answers by pointing at the object in the real world and
by instructing the user to look at his pointing direction. In
particular the sentences for getting information from the robot
are:
1) What is this?
2) Where is this?
3) Where is ⟨object⟩?
In the last sentence, ⟨object⟩ is the specific name of an object.
The dialog with the robot works in English and German, which
are the languages the speech recognition engine supports.
The typical process of an interaction is realised by firstly
greeting the robot with a spoken utterance like “welcome” or
simply “hello”. Afterwards the user has the choice of initiating a desired interaction. To start the proposed environment
information system, the user only has to ask one of the three
questions above. The user finishes the conversation by saying, e.g.,
“goodbye”.
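The greeting, question and farewell phases just described can be pictured as a small state machine. The sketch below is a simplified illustration of that flow, not the dialog system of [11]; utterances are read from the console instead of the speech recogniser, and the question handling is delegated to a callback.

```python
# Simplified sketch of the greeting -> question -> goodbye protocol.
# An illustration only, not the dialog system of [11]: utterances are
# read from the console instead of the speech recogniser.
def reception_loop(answer_question):
    engaged = False
    while True:
        utterance = input("> ").strip().lower()
        if not engaged:
            if utterance in ("hello", "hello robot", "welcome"):
                print("Hello! How can I help you?")
                engaged = True
        elif utterance == "goodbye":
            print("Goodbye!")
            return
        else:
            # One of: "what is this?", "where is this?", "where is <object>?"
            print(answer_question(utterance))
```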
In every case the robot has to understand the spoken
utterance and answer in an easily comprehensible way.
Furthermore, the robot movement and the speech output have
to be meaningfully coordinated. Subsequently, the EIS module
executes the following process (assuming a user who is
already interacting with the system):
The environment information module first looks for new
information from the ESV. If there is a new interaction initiated
by the user (see figure 5) it is tested whether an input gesture
is required. If a gesture is required the last gesture of the
user is calculated as mentioned above and the referred object
is estimated. If the spoken command includes a name of the
requested object, the robot determines the object directly by
its name and without any gesture input. The object's name is
then compared against the database of known objects and
its information is extracted. Afterwards, the module creates the
robot's answer, which consists of speech, gesture and facial
expressions.

Fig. 5. The action cycle of the environment information system. The
user initiates the interaction by asking a question, where some questions may
require a gesture. The behaviour of the robot adapts to the current state.
The answer depends on which information the robot can
resort to. If the robot does not know an object, or if the
user points at an unspecified position on the map, the robot
answers verbally that it does not know this object and looks
interrogative (see table I). If the robot finds the object in its
database, it looks friendly and returns the saved information
about the object (see table II). If the user wants to know
where an object is located, the robot calculates the position of
the object in the real world and points at it. Additionally, the
robot looks friendly and instructs the user to look towards the
pointed direction. Though the facial expressions of the robot
are not strictly necessary, their presence supports the
perceivability of the changing internal robot states.

TABLE I: SYSTEM BEHAVIOUR: OBJECT NOT FOUND

User speech         | User gesture | Robot speech         | Robot gesture | Robot mimic
What is this?       | Yes          | Sorry, I do not know | No            | Confused
Where is this?      | Yes          | Sorry, I do not know | No            | Confused
Where is ⟨object⟩?  | No           | Sorry, I do not know | No            | Confused

TABLE II: SYSTEM BEHAVIOUR: OBJECT FOUND

User speech         | User gesture | Robot speech       | Robot gesture | Robot mimic
What is this?       | Yes          | This is ...        | No            | Friendly
Where is this?      | Yes          | Look there it is.  | Yes           | Friendly
Where is ⟨object⟩?  | No           | Look there it is.  | Yes           | Friendly
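The behaviour summarised in tables I and II can be sketched as a simple dispatch over the question type and the database lookup. The function and field names below are assumptions made for illustration; they do not reflect the actual EIS interface.

```python
# Illustrative dispatch of the behaviour summarised in tables I and II.
# The dictionary layout and field names are assumptions for this sketch,
# not the actual EIS interface.
def answer(question_type, obj_name, database):
    obj = database.get(obj_name)
    if obj is None:                      # table I: object not found
        return {"speech": "Sorry, I do not know.", "gesture": None, "mimic": "confused"}
    if question_type == "what_is_this":  # table II, first row
        return {"speech": f"This is {obj['info']}.", "gesture": None, "mimic": "friendly"}
    # table II, second and third row: point into the real world
    return {"speech": "Look there it is.",
            "gesture": {"point_at": obj["position"]},
            "mimic": "friendly"}

# Example call with a hypothetical database entry:
db = {"cafeteria": {"info": "the cafeteria, open until 14:00", "position": (12.0, 30.0)}}
print(answer("where_is", "cafeteria", db))
```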
The calculation of the position of the object in the real
world depends on the position of the robot and its orientation.
The required information is stored in the digital map, so the
position can be estimated as follows:
First the distance on the map can directly be determined,
because the object and robot positions on the map are known.
Additionally, the distance has to be transformed into real-world
coordinates, which is done by multiplying the map distance by
the scale of the map. The next step is the calculation of
the angle to the object depending on the orientation of the
robot (see figure 6). This is important because the robot has
to point at the object, and therefore the angle relative to
the robot's line of sight is needed. First the angle δ from the
position of the robot to the object is calculated, followed by
the consideration of the robot's line of sight, which leads to α.

Fig. 6. The angle from robot to object is calculated depending on the line
of sight of the robot. First the angle from the robot position to the object
position is estimated; afterwards the line of sight is incorporated.

Fig. 7. The position of the centre of the robot is stored in the map. For
calculating the correct angle β, the previously estimated angle α has to be
moved from the centre into the shoulder.
With the information of the distance and the angle α the
movement of the robot can be worked out. The robot should
behave in all its actions in a nearly human-like way. To this
end, the robot chooses the pointing arm depending on the position
of the object: if the object is on the left, the robot points with its
left arm, and vice versa. Because the angle α has
been calculated relative to the position of the robot and not
to its shoulder joint, the angle α has to be corrected (see figure
7). With the angle α, the distance d_S from the
shoulder to the middle of the robot, and the distance d_O from
the robot to the object, the new angle β can be calculated:
b = cos(|α|) · d_O
c = sin(|α|) · d_O
a = c − d_S
β = arctan(|a| / b)
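Written as code, the shoulder correction follows directly from the equations above. The sketch below additionally derives α from δ and the robot's stored rotation, which is an assumption about how the orientation from the map file is used.

```python
import math

def pointing_angles(robot_xy, robot_rotation, obj_xy, shoulder_offset_m, map_scale):
    """Return (alpha, beta) in radians.
    robot_rotation is assumed to be the robot's line of sight in map
    coordinates; shoulder_offset_m is the distance d_S from the body
    centre to the shoulder; map_scale converts map units to metres."""
    dx = (obj_xy[0] - robot_xy[0]) * map_scale   # map units -> metres
    dy = (obj_xy[1] - robot_xy[1]) * map_scale
    d_obj = math.hypot(dx, dy)                   # distance d_O to the object
    delta = math.atan2(dy, dx)                   # angle delta from robot to object
    alpha = delta - robot_rotation               # relative to the line of sight
    # Shoulder correction, following the equations in the text:
    b = math.cos(abs(alpha)) * d_obj
    c = math.sin(abs(alpha)) * d_obj
    a = c - shoulder_offset_m
    beta = math.atan(abs(a) / b)
    return alpha, beta

# Example with hypothetical values: object 3 m ahead and 2 m to the left.
print(pointing_angles((0, 0), math.pi / 2, (-2, 3), 0.25, 1.0))
```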
If the height of the object is saved in the digital map, the
robot also uses the extra information for pointing in 3D. The
angle of the arm lift is determined by the height of the robot,
the height of the object and the distance to the object (see
figure 8). Additionally, the elbow joint angle depends on the
distance to the object. Referencing the pointed object
is thereby easier, because a straighter arm denotes an
object at a larger distance. For a multimodal
conversation, the robot also has to move its head and face to
support the interaction. Finally, the module instructs
the hardware how to behave. In combination with a pointing
gesture, the robot also briefly looks in the corresponding direction.
Its facial expressions change depending on the current state,
which notifies the user about the change in robot behaviour.

Fig. 8. The robot movement is calculated depending on the distance from
robot to object and the height of the object. Using this information and the
knowledge of the arm length L_Arm, the lengths L_RobObj and L_ElbowObj
are calculated; from these the angle of the elbow joint is determined.
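One plausible way to obtain the remaining arm parameters is sketched below: the arm-lift angle follows from the height difference and the distance, while the elbow opening is mapped monotonically onto the distance so that a straighter arm indicates a farther object. This is only an illustration under stated assumptions; it does not reproduce the L_RobObj/L_ElbowObj construction of figure 8.

```python
import math

def arm_pose(distance_m, object_height_m, shoulder_height_m, max_range_m=10.0):
    """Illustrative 3D pointing pose: arm-lift angle plus elbow opening.
    The linear distance-to-elbow mapping is an assumption, chosen only to
    reproduce the property that nearer objects yield a more bent elbow."""
    lift = math.atan2(object_height_m - shoulder_height_m, distance_m)
    # Map the distance onto an elbow opening between 100 deg (near) and 180 deg (far).
    ratio = min(distance_m / max_range_m, 1.0)
    elbow = math.radians(100.0 + 80.0 * ratio)
    return lift, elbow

# Example: pointing at a 3 m high sign that is 6 m away, shoulder at 1.2 m.
print(arm_pose(6.0, 3.0, 1.2))
```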
V. EVALUATION
The proposed system is evaluated in two different scenarios.
We first tested it in our laboratory, where several persons were
present. We also tested the robot system in our university hall
at an open house event, where dozens of people stood around
BARTHOC. In both scenarios the system worked very well,
although the university hall is a more difficult environment
because of its noise level and lighting conditions.
In the first evaluation scenario, humans could readily identify the target objects once the robot pointed in the general direction of
their location. This assumes that objects or places are generally
in sight, can be reached by following just a single direction,
and that the guidance behaviour derives its value primarily from
orienting the user in a large, confusing space. However, we
believe that the general approach is also suitable for
multi-part path instructions, and that a promising approach
would be to transfer the way humans provide spatial guidance
onto an anthropomorphic robot, a topic of ongoing research.
To assess the overall system functionality, all input
modalities and the answers were analysed. The gesture
input was tested for correctness and accuracy; afterwards, the
synchronisation of speech and gesture was analysed,
followed by the multimodal interaction of the overall system.
The map detection itself is reliable; in nearly all cases the
correct position is detected, so no explicit results of the map
detection are presented.
To test the accuracy of the gesture recognition, five persons
pointed at a selected position on the map, and the
difference between the pointed and the calculated position was
determined. Table III shows the accuracy of the gestures. X
and Y are given in map coordinates, which in this case are cm.
The second column lists the target position, and
columns 3 to 5 list the detected positions.

TABLE III: GESTURE ACCURACY

Person & coordinate (cm) | Target | Run 1 | Run 2 | Run 3
1X                       | 20     | 20.26 | 20.3  | 20.14
1Y                       | 10     | 9.1   | 8.96  | 8.9
2X                       | 22     | 22.7  | 22.5  | 22.57
2Y                       | 13     | 12.7  | 12.5  | 12.5
3X                       | 17     | 17.4  | 17.17 | 17.09
3Y                       | 10     | 9.6   | 9.65  | 9.34
4X                       | 1      | 1.08  | 1.14  | 0.9
4Y                       | 1      | 0.6   | 0.5   | 0.5
5X                       | 27     | 27.2  | 26.8  | 27.5
5Y                       | 1      | 0.5   | 0.6   | 0.61
The difference in the X direction is very low, in 87% of the cases below
0.5 cm. In the Y direction, i.e. the direction of the pointing
gesture, the error is a bit higher, but in 74% of the cases also less than
0.5 cm. This difference is explainable, as many people point a
little bit ahead of an object; this outcome will be considered
in future versions of the system. Considering the dependencies
on the map, the gesture detection and the size of the objects,
which in general have a map size of more than one cm, the
gesture recognition is very accurate.
The speech recognition also has very convincing results.
The three sentences of the environment information system
are evaluated during an interaction scenario, and it is noted
how often a sentence is not correctly recognised (see table IV).
The complete interaction with the robot includes many more
spoken utterances, but only the three sentences are important
for the proposed work. Consequently only the mentioned
sentences are evaluated. The first column of table IV shows the
question; the second and third columns give the numbers of correct
and wrong detections, respectively:
TABLE IV: SPEECH ACCURACY

Sentence            | Correct | Wrong
What is this?       | 18      | 2
Where is this?      | 18      | 2
Where is ⟨object⟩?  | 19      | 1
The correct detection rate is reliable with about 90%. The
errors are often caused by very similar utterances like “There
it is” and “Where is this”, which are both included in the
grammar. If the complete corpus is tested, the detection rate
is a little bit lower, but nevertheless over 80%. Finally the
overall system has to be analysed in its correctness and also in
its interaction capabilities. The system was tested by six users
asking the robot to perform each of the three possible actions
10 times. This yields 180 questions, which
are analysed with respect to the correctness of the robot's answer. The system
setup contains the map with 10 different objects, each about 1
to 3 cm in width and length. The map itself has a size of
80 cm x 45 cm. An answer is correct if the robot recognises
the correct object and subsequently returns a correct answer
including all modalities. The results are shown in table V:
TABLE V: SYSTEM ACCURACY

Question            | Correct | Wrong
What is this?       | 55      | 5
Where is this?      | 56      | 4
Where is ⟨object⟩?  | 58      | 2
The correctly recognised sentences and the subsequent behaviour of the robot yield a correct rate of over 90%. This
time the objects were limited to ten, which enhances the
speech recognition rate. The errors were caused both by
imprecise gestures and by incorrectly recognised utterances from the
speech recognition. To verify the results in a less controlled
environment, the system was tested in our university
hall. The map showed the floor plan of the hall, which
in reality is about 300 meters in length and 100 meters in
width, whereas the map measures 1.2 meters
x 0.4 meters. The objects were, e.g., the cafeteria, the library,
the computer centre, some lecture rooms and the lavatory.
The object sizes on the map range from 1.5 cm to 10 cm.
Although the environment had poor lighting conditions and
a high sound intensity caused by the hundreds of visitors, the
system worked with only a few communication errors. These
errors can be traced back to the speech recognition system
and mainly to the speaker localisation, which in about 20%
of the trials had problems identifying the correct speaker. If we
disregard the problem of identifying the speaker, the system
answered more than 60% of the questions correctly and
showed a natural and understandable behaviour. Although this
result seems poor compared to the laboratory tests, it has to
be considered that the trials were made with inexperienced
users in a highly dynamic environment.
VI. CONCLUSIONS
In this paper, we presented a multimodal interaction scenario with an anthropomorphic robot. We focused on intuitive
interaction, realised simultaneously using verbal and nonverbal
modalities. The higher perceptibility and the error reduction
afforded by multiple communication modalities lead to a more
intuitive human-machine interaction. Our aim of replicating the
interaction between humans led us to the realisation of
the proposed environment information system. The environment
information system accurately detects gestures on a map and,
combined with the verbal input, produces an easily and intuitively
understandable answer through pointing gestures supported by
speech and facial expressions.
We showed that our robot system works reliably in both
a laboratory and the university hall with a large number of
people present, and that it can easily be used by naive persons.
The system offers a natural way of human-machine
interaction by combining gestures and speech. Last, but
not least, we have outlined an application scenario that we
consider particularly suitable for the evaluation of social
robots.
Future work will include enhancing the interaction to teach
the robot about objects in its vicinity, thus removing the need
for the explicit map creation. It would also be interesting
to investigate whether the robot could be more helpful by
initiating the interaction instead of waiting for the human.
ACKNOWLEDGEMENT
This work has been supported by the German Research
Society (DFG) within the Collaborative Research Centre 673,
Alignment in Communication.
REFERENCES
[1] C. Breazeal, “Toward sociable robots,” Elsevier Science B.V., 2003.
[2] R. Pfeifer, “On the role of embodiment in the emergence of cognition
and emotion,” in Proceedings of the Toyota Conference on Affective
Minds, 1999.
[3] R. Kirby, J. Forlizzi, and R. Simmons, “Interactions with a moody
robot,” in Proc. of Human-Robot-Interaction, 2006, pp. 186–193.
[4] M. Ehrenmann, R. Becher, B. Giesler, R. Zoellner, O. Rogalla, and
R. Dillmann, “Interaction with robot assistants: Commanding albert,”
2002.
[5] J. F. Gorostiza, R. Barber, A. M. Khamis, M. M. R. Pacheco, R. Rivas,
A. Corrales, E. Delgado, and M. A. Salichs, “Multimodal human-robot
interaction framework for a personal robot,” 2005.
[6] K. F. MacDorman and H. Ishiguro, “The uncanny advantage of using
androids in cognitive and social science research,” Interaction Studies,
pp. 297–337, 2006.
[7] H. Ishiguro and T. Minato, “Development of androids for studying on
human-robot interaction,” in Proceedings of 36th International Symposium on Robotics, December 2005.
[8] M. Hackel, S. Schwope, J. Fritsch, B. Wrede, and G. Sagerer, “Designing
a sociable humanoid robot for interdisciplinary research,” 2006.
[9] P. Ekman, Facial Expressions. John Wiley and Sons Ltd, 1999, ch. 16,
Handbook of Cognition and Emotion.
[10] T. P. Spexard, M. Hanheide, and G. Sagerer, “Human-oriented interaction with an anthropomorphic robot,” IEEE Transactions on
Robotics, Special Issue on Human-Robot Interaction,
December 2007.
[11] S. Li, B. Wrede, and G. Sagerer, “A computational model of multimodal
grounding,” in Proc. ACL SIGdial Workshop on Discourse and Dialog,
in conjunction with COLING/ACL 2006, ACL Press, 2006.
[12] G. A. Fink, “Developing HMM-based recognizers with ESMERALDA,”
in Text, Speech and Dialogue: Second International Workshop, TSD’99,
Plzen, Czech Republic, September 1999, vol. 1692/1999,
pp. 229–234, Springer, Berlin Heidelberg, 1999.
[13] S. Hüwel and B. Wrede, “Situated speech understanding for robust
multi-modal human-robot communication,” in Proceedings of the
International Conference on Computational Linguistics (COLING/ACL),
ACL Press, 2006.
[14] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink,
and G. Sagerer, “Providing the basis for human-robot-interaction: A
multi-modal attention system for a mobile robot,” in Proc. Int. Conf. on
Multimodal Interfaces, Vancouver, Canada: ACM, November
2003, pp. 28–35.
[15] N. Hofemann, “Videobasierte Handlungserkennung für die natürliche
Mensch-Maschine-Interaktion,” Ph.D. dissertation, AG Angewandte Informatik, Technische Fakultät, Universität Bielefeld, December 2006.