
Design of an Interactive Humanoid Character for Multimodal
Communication
Burak Berk Ustundag
Computer Eng. Dep.
Istanbul Technical University
Istanbul, Turkey
[email protected]
Kerem Odabasi
Computer Eng. Dep.
Istanbul Technical University
Istanbul, Turkey
[email protected]
Ali Norouzi
Computer Eng. Dep.
Istanbul Technical University
Istanbul, Turkey
[email protected]
Abstract— The cognitive impact factor is one of the main measures of media-human interaction. In this study, we explain the design and implementation of a semi-autonomous humanoid character display system that provides a high cognitive impact on people. It combines marketing, entertainment, announcement, guidance and service sales in a single device. Unlike the conventional automated teller machine (ATM) approach, the interactive humanoid character (IHC) first detects the people around it and aims to start communication with them. A finite state machine model is used to determine the likely requirement of the person in communication by navigating through predefined behavioral states during the interaction. This kind of auto-controlled telepresence saves marketing, sales and service man-hours while maximizing the media reach factor.
Keywords— Telepresence, Humanoid, Synthetic Character, Finite State Machine Model, Interaction, Human Recognition, Face Recognition, Behavior Classification
I. INTRODUCTION
Virtual intelligent characters are taking on more functional roles in the real world owing to the rapid development of computational intelligence and visualization technologies. Avatar-like synthetic characters were previously proposed to increase the impact of web-based applications [1]. This kind of approach to virtual humans has various applications such as simulation-based training and virtual surgery [2]. In this study, we would like to show that visualization of androids and humanoids through existing display units, incorporated with network-supported information systems and sensory mechanisms, is efficiently applicable with present-day technology. We hereby describe a system that combines advertisement, entertainment, announcement, marketing and sales functions through an interactive humanoid character (IHC) [3]. In order to provide different types of actions together, we developed a new behavior direction model that is based on multimodal human-humanoid interaction. In this new approach, some conditional entry states are defined first. A finite state automaton is used to transfer the behavioral status of the IHC from one situation to another according to the interpreted tendencies or needs of the human in interaction. Unlike automated teller machines, the entry state targets detection and attraction of the people around the device. This behavior is reminiscent of the person called a "Dellal" in oriental culture, who acts as a sort of "teller" to attract and direct passers-by. A sketch of people in communication with the humanoid is shown in
Figure 1, where the humanoid character is visualized on 2D or 3D display platforms equipped with sensory mechanisms. The internal structure and working modes of the IHC system are described in Section II.
Behavioral state transitions are triggered by predefined state transition functions. Time-domain sensor signals form part of the state transition function variables. Different interactive content starts playing according to the assignments of the active state. The finite state machine based direction of the behavioral status is explained in Section III.
Some services, including online sales, may require tele-operated supervised actions. An important benefit of using the state machine model is the automated self-switching capability between supervised tele-operated services and autonomous modes of operation. Practical solutions for some advanced application examples, such as network-based tele-operation, are handled in Section IV.
Figure 1. Interactive humanoid character on audio-visual rotational panel
unit
A virtually unlimited range of creative services can be implemented through the IHC. Person-oriented services based on biometric identification, advertisements depending on environmental and peripheral conditions, retrieval of requested information, taking a photo of the person and sending it to his or her cellular phone, making jokes depending on the type of reaction, supply of conditional media content, statistical measurements, reservations, assistance services and even multilingual navigation can be given as a few of the potential examples. Although people's attitude toward a humanoid, especially for paid services, could be questioned at the beginning, research on people's trust in virtual salespersons has shown that the trust rate in online shopping is even slightly higher than in traditional shopping [4]. On the other hand, cultural differences are clearly expected to affect the acceptance of the IHC as a service provider beyond entertainment and informative purposes.
II. INTERACTIVE HUMANOID CHARACTER
If more than one person is recognized as a human within the scope of vision, audio-based localization of the one in interaction provides better targeting. An example of the signal processing scheme is given in Figure 3 as the audio source localization function. The inputs of this function are the signals from three microphones acting as sensors, and the outputs are the distance and the three-dimensional angular direction of the separated speech source.
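As an illustration only, the following Python sketch shows how pairwise time delays between microphone signals could be estimated with GCC-PHAT and converted into a bearing angle. The array geometry, sampling rate and function names are assumptions for illustration, not the deployed localizer.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay between two microphone signals with GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)          # delay in seconds

def azimuth_from_pair(tau, mic_distance, c=343.0):
    """Convert a pairwise delay into a bearing angle (far-field assumption)."""
    s = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(s))

# Hypothetical usage with three microphone channels m0, m1, m2 sampled at 16 kHz:
# tau01 = gcc_phat(m0, m1, fs=16000, max_tau=0.001)
# tau02 = gcc_phat(m0, m2, fs=16000, max_tau=0.001)
# Bearings from two non-parallel microphone pairs can then be combined into a
# direction estimate for the separated speech source.
```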
The IHC basically consists of an array of sensors, an internal computer, a memory unit and output devices such as a display unit, speakers and, optionally, motion/dispenser drivers. The main components of the sensor array are the stereo cameras. Stereo camera vision is preferred in order to distinguish the exact location of moving objects and to determine their physical properties (shorter and closer or taller and farther away, etc.) [5]. S1..Sn in Figure 2 denote the sensors and i1..in indicate their signal processing units. Video signals from the cameras are first captured and then processed via ic1 and ic2.
Figure 3. 3D audio source localizer for separation of person from the
surrounding group of recognized people
Figure 2. IHC emulation process block diagram
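The stereo processing step could, for example, look like the minimal OpenCV sketch below, which turns a rectified stereo pair into a rough depth map for locating people. The block-matching parameters, focal length and baseline are placeholder assumptions, not the system's calibration.

```python
import cv2
import numpy as np

# Hypothetical calibration values; the real unit would use its own stereo calibration.
FOCAL_PX = 700.0        # focal length in pixels
BASELINE_M = 0.12       # distance between the two cameras in metres

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)

def depth_map(left_bgr, right_bgr):
    """Return a rough per-pixel depth estimate (metres) from a rectified stereo pair."""
    left = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)
    disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM is fixed-point
    disparity[disparity <= 0] = np.nan            # mask invalid matches
    return FOCAL_PX * BASELINE_M / disparity      # depth = f * B / d
```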
A state space model (state machine) is used to select the interactive animation content. For this purpose, the event memory is effective both in choosing the state to be activated and in providing direct input for playing the current content. For example, detection of the face direction and position [6] is a classified event for both the transition function and the animation engine. The dimension "v" of the event vector E(t) is independent of the sensor array size "n".
This first level of signal processing covers signal conditioning, digitizing, normalization/calibration, filtering and basic pattern recognition components. The inputs of the behavioral control functions of the IHC are not the processed signals directly; they are mainly the events. For this reason, the processed signals are fed into an event classifier. The event vector at any time t can be given as

E(t) = [e1(t), e2(t), …, ev(t)]                (1)

where each ei(t), i = 1..v, is an event classified with respect to the signal-based definitions in the event database (Figure 2) as a function of 1 to n dimensional input variables. On the other hand, a decision may depend on past values of the classified events besides the latest one. For this reason, an array of digital delays (z-1) is used to construct the v-dimensional past values of E(t). If the memory has "p" delays and each is shifted in synchronization with the sampling time T, then the event memory becomes a v×(p+1) dimensional matrix,

[E] = [E(t), E(t-T), E(t-2T), …, E(t-pT)]                (2)

Hence the event memory [E] can be used as a variable of the state transition functions and of the directed animation block.
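A minimal Python sketch of the event vector of Eq. (1) and the v×(p+1) event memory of Eq. (2), implemented as a shift register updated once per sampling period T, is given below. The example trigger function and event names are illustrative assumptions.

```python
import numpy as np
from collections import deque

class EventMemory:
    """Keeps the classified event vector E(t) and its p delayed copies, Eq. (1)-(2)."""

    def __init__(self, v, p):
        self.v = v                      # number of classified events
        self.p = p                      # number of unit delays (z^-1 stages)
        self.buffer = deque([np.zeros(v)] * (p + 1), maxlen=p + 1)

    def push(self, event_vector):
        """Shift in the newest E(t) once per sampling period T."""
        e = np.asarray(event_vector, dtype=float)
        assert e.shape == (self.v,)
        self.buffer.appendleft(e)

    def matrix(self):
        """Return [E] = [E(t), E(t-T), ..., E(t-pT)] as a v x (p+1) matrix."""
        return np.column_stack(list(self.buffer))

# Illustrative trigger that looks at current and past values of one classified event,
# e.g. "face detected" held for the last three samples:
def face_held(E, face_idx=0, samples=3):
    return bool(np.all(E[face_idx, :samples] > 0.5))
```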
Gesture recognition and classification parameters can also be used as variables of the event triggering functions [7]. Hence, as an example, the humanoid can track the eyes of the person in front of the unit while it makes an announcement. Meanwhile, it can switch to content that includes an animation related to the options indicated by the direction of the person's arm and fingers. The synchronizer provides continuation of the content during the transition from one state to another. This is achieved by keeping one or more identical frames of the humanoid figure as a coupling set inside a database.
III. FINITE STATE MODEL FOR HUMANOID BEHAVIOR
An important aspect of this study is determining the type of services that might interest the humans nearby. The proposed strategy is built on transition to the most appropriate state through the triggering of the state transition functions defined for each permitted state couple. Classified events, time, occurrence counting and logical inputs are used as the variables of the transition functions. Hence the humanoid can act differently for the same human gestures depending on the requirement of the active state. The first state is the entry state, and it aims to catch the interest of the humans nearby upon detecting their presence. The second-level states can vary according to the type of action, such as a change in walking direction. The third level is used to determine possible interests through some trials and observation of the gestures. The fourth level of states is the first level related to the chosen interest of the human; it can be the initiation of advertisement content or of an online service sale. Later levels depend on the complexity of the active animation content. If the interaction ceases within this sequence, or the last level of content has been played, the IHC returns to one of the first-level states. Random functions, peripheral measurements (light level, time of day, etc.) and occurrence counting are some alternative inputs for choosing the next entry state.
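The following Python sketch illustrates the layered behavioral state machine described above. The state names, levels and trigger conditions are invented for illustration; only the structure (states, transition functions evaluated on the event memory [E]) follows the paper.

```python
class State:
    def __init__(self, name, level, content_id):
        self.name = name            # e.g. "entry", "greeting", "ad_content"
        self.level = level          # depth in the behavior hierarchy (1 = entry level)
        self.content_id = content_id

class BehaviorFSM:
    def __init__(self, states, transitions, entry):
        self.states = states              # dict: name -> State
        self.transitions = transitions    # dict: (src, dst) -> trigger function of [E]
        self.active = entry

    def step(self, E):
        """Evaluate the transition functions of the active state against event memory [E]."""
        for (src, dst), trigger in self.transitions.items():
            if src == self.active and trigger(E):
                self.active = dst
                break
        return self.states[self.active].content_id

# Example wiring (assumed): a person detected near the unit moves the IHC from the
# entry state to a greeting state; losing the person for p+1 samples returns it to entry.
states = {
    "entry":    State("entry", 1, content_id="idle_loop"),
    "greeting": State("greeting", 2, content_id="greeting_clip"),
}
transitions = {
    ("entry", "greeting"): lambda E: E[0, 0] > 0.5,          # person detected now
    ("greeting", "entry"): lambda E: (E[0, :] < 0.5).all(),   # person absent in whole memory
}
fsm = BehaviorFSM(states, transitions, entry="entry")
```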
The generalized form of the finite state machine is shown in figure 5. In the generalized form there are n levels as the depth of the active contents and m variations for each level. The state definition matrix consists of logical variables showing the existence of a content item and the position of a state. The state definition of the design example is extracted by multiplication of the generalized form and the state definition matrices (Figure 4).
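A small numpy sketch of this extraction step is shown below; the matrix contents are made-up placeholders for a 3-level, 7-state design, not the values of Figure 4.

```python
import numpy as np

# Generalized form: n levels (rows) x m variations (columns); 1 marks an available state slot.
generalized = np.ones((3, 7), dtype=int)

# Hypothetical state definition matrix: 1 marks the slots actually given content (7 states total).
definition = np.array([
    [1, 0, 0, 0, 0, 0, 0],   # level 1: a single entry state
    [1, 1, 0, 0, 0, 0, 0],   # level 2: two reaction states
    [1, 1, 1, 1, 0, 0, 0],   # level 3: four content states
])

# Element-wise product leaves only the states used by this particular design (cf. Figure 4).
active_states = generalized * definition
```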
Figure 4. Extraction of active states from the generalized form by the state definition matrix

Figure 5. State transition function matrix example for a 3-layered, 7-state automaton

The state transition function matrix in Figure 5 determines the logical conditions for the state changes. Each logical condition for a state transition represents the logical state of a classified event or of its past values inside the memory. The frame link matrix in Figure 6 is used to prevent discontinuity of the vision during content switching while transitioning from one state to another. It keeps one or more frame couples for each defined set of states. In the given example, while the content corresponding to state (1,1) is being performed, it is switched to the content of (2,1) through frame 3 of (1,1) and frame 46 of (2,1). If more than one couple, e.g. (8,32), is defined for the transition between two states, then the shortest path with respect to the current status is chosen.

Figure 6. State transition frame link matrix for continuous visual coupling from one state to another

The active animation engine in Figure 7 selects and switches to the content determined by the active state. It is connected to the audio, visual and motional content databases recorded for each state. On the other hand, the active content, meaning the humanoid vision, is not only chosen upon the active state but is also played and directed with respect to the classified events during the animation. Audio-based motion synthesis is an option that gives the humanoid a more realistic appearance owing to the synchronization between the mouth, the other affected face portions and the audio content [8].
Figure 7. Audio/visual/motional animation of the humanoid character
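The coupling-frame selection can be sketched as follows; the dictionary contents mirror the (1,1) to (2,1) example above, while the loop length and the selection rule are assumptions made only for illustration.

```python
# Frame link sketch: for each permitted state couple we keep one or more
# (exit_frame, entry_frame) couples so the humanoid image stays visually continuous
# while the content switches.

frame_links = {
    ("1,1", "2,1"): [(3, 46), (8, 32)],   # two alternative coupling frame pairs
}

def choose_coupling(src, dst, current_frame, loop_length=100):
    """Pick the coupling whose exit frame comes up soonest after the current frame."""
    couples = frame_links[(src, dst)]
    return min(couples, key=lambda c: (c[0] - current_frame) % loop_length)

# e.g. switching from state (1,1) to (2,1) while frame 2 of the outgoing content plays:
# choose_coupling("1,1", "2,1", current_frame=2) -> (3, 46)
```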
IV. PRACTICAL IMPLEMENTATION OF THE IHC AND TELEPRESENCE
MAX/MSP software was used to animate, play and switch the content in the first tests for validation of the IHC concept. Its user-friendly graphical interface and communication features provide flexibility in implementing event-triggered audio-visual interactive content. Input data from the external detection, recognition and classification software is received via a MIDI interface.
Figure 8. Three screen-captures during the humanoid character
demonstration
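The event-to-MIDI bridge could be as simple as the sketch below, which forwards classified events to the MAX/MSP patch as control-change messages. The mido library, the port name and the event-to-controller mapping are assumptions for illustration; any MIDI library and mapping would serve.

```python
import mido

# Hypothetical mapping from classified events to MIDI controller numbers.
EVENT_TO_CC = {"face_detected": 20, "pointing_left": 21, "pointing_right": 22}

def send_event(port, event_name, value):
    """Forward one classified event to the MAX/MSP patch as a control-change message."""
    cc = EVENT_TO_CC[event_name]
    port.send(mido.Message('control_change', control=cc, value=int(value) & 0x7F))

# Usage sketch (the port name depends on the local MIDI setup):
# with mido.open_output('IAC Driver Bus 1') as port:
#     send_event(port, "face_detected", 127)
```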
In addition to the autonomous mode of operation, several services still require operator support. One of the achievements of the IHC system has been the automatic detection of states that require an external data content service. In this case, control of the humanoid character is automatically transferred to the related operator while the character visually maintains the continuity of its movements. This supervisory mode is managed remotely over the network (Figure 9). The facial gestures of the supervising operator can also be used as parameters of the same humanoid character [9, 10].
Figure 9. Network connection for Telepresence of the operator
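Only the switching logic of this handover is sketched below; the operator channel object stands in for the real network link of Figure 9, and the state names are assumptions.

```python
# Sketch of the autonomous / supervised mode switch.

OPERATOR_STATES = {"online_sale", "complex_inquiry"}   # assumed state names

class ModeManager:
    def __init__(self, operator_channel):
        self.mode = "autonomous"
        self.operator = operator_channel   # stand-in for the network link to the operator

    def on_state_entered(self, state_name):
        if state_name in OPERATOR_STATES and self.mode == "autonomous":
            self.mode = "supervised"
            self.operator.request_control(state_name)   # hand over to the remote operator
        elif state_name not in OPERATOR_STATES and self.mode == "supervised":
            self.mode = "autonomous"
            self.operator.release_control()
```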
Harmony between the humanoid character's facial animation and the voice of the operator is also important. Direct feature extraction from the operator [11] is used to simplify the synchronization without the use of any dedicated motion models.
V. APPLICATIONS
The range of application areas for the IHC touches on many aspects of computing, and as computing becomes more ubiquitous, practically every aspect of interaction with objects and the environment, as well as human-human interaction (e.g., remote collaboration), will make use of IHC techniques. In the following sections, we describe specific application areas in which interesting progress has been made.
A. Human Spaces
Computing is expanding beyond the desktop, integrating
with everyday objects in a variety of scenarios. As our
discussions show, this implies that the model of user
interface in which a person sits in front of a computer is no
longer the only model. One of the implications of this is that
the actions or events to be recognized by the “interface” are
not necessarily explicit commands. In smart conference
room applications, for instance, multimodal analysis has
been applied mostly for video indexing. Although such
approaches are not meant to be used in real-time, they are
useful in investigating how multiple modalities can be fused
in interpreting communication. It is easy to foresee
applications in which “smart meeting rooms” actually react
to multimodal actions in the same way intelligent homes
should. Projects in the video domain include MVIEWS [13],
a system for annotating, indexing, extracting, and
disseminating information from video streams for
surveillance and intelligence applications. An analyst
watching one or more live video feeds is able to use pen and
voice to annotate the events taking place. The annotation
streams are indexed by speech and gesture recognition
technologies for later retrieval, and can be quickly scanned
using a timeline interface, then played back during review
of the film. Pen and speech can also be used to command
various aspects of the system, with multimodal utterances
such as “Track this” or “If any object enters this area, notify
me immediately.”
B. Ubiquitous devices
The recent drop in costs of hardware has led to an explosion
in the availability of mobile computing devices.
One of the major challenges is that while devices such as
PDAs and mobile phones have become smaller and more
powerful, there has been little progress in developing
effective interfaces to access the increased computational
and media resources available in such devices. Mobile
devices, as well as wearable devices, constitute a very
important area of opportunity for research in IHC because
natural interaction with such devices can be crucial in
overcoming the limitations of current interfaces. Several
researchers have recognized this, and many projects exist on
mobile and wearable IHC [13].
C. Users with Disabilities
People with disabilities can benefit greatly from IHC
technologies. Various authors have proposed approaches for
smart wheelchair systems which integrate different types of sensors. The authors of [17] introduce a system for presenting digital pictures non-visually (multimodal output), and other proposed techniques can be used for interaction using only eye blinks and eyebrow movements. Some of the
approaches in other application areas could also be
beneficial for people with disabilities.
D. Public and Private Spaces
In this category we place applications implemented to
access devices used in public or private spaces. One
example of implementation in public spaces is the use of
IHC in information kiosks. These are challenging
applications for natural multimodal interaction: the kiosks
are often intended to be used by a wide audience, thus there
may be few assumptions about the types of users of the
system. On the other hand, there are applications in private
spaces. One interesting area is that of implementation in
vehicles. This is an interesting application area due to the
constraints: since the driver must focus on the driving task,
traditional interfaces (e.g., GUIs) are not so suitable. Thus, it is an important area of opportunity for IHC research, particularly because, depending on the particular deployment, vehicle interfaces can be considered safety-critical.
E. Virtual Environments
Virtual and augmented reality has been a very active
research area at the crossroads of computer graphics,
computer vision, and human-computer interaction. One of
the major difficulties of VR systems is the interaction
component, and many researchers are currently exploring
the use of interaction analysis techniques to enhance the
user experience. One reason this is very attractive in VR
environments is that it helps disambiguate communication
between users and machines (in some cases virtual
characters, the virtual environment, or even other users
represented by virtual characters).
F. Art
Perhaps one of the most exciting application areas of IHC is
art. Vision techniques can be used to allow audience
participation and influence a performance. In [14], the
authors use multiple modalities (video, audio, pressure
sensors) to output different “emotional states” for Ada, an
intelligent space that responds to multimodal input from its
visitors. In [15], a wearable camera pointing at the wearer’s
mouth interprets mouth gestures to generate MIDI sounds
(so a musician can play other instruments while generating
sounds by moving his mouth). In [16], limb movements are
tracked to generate music. IHC can also be used in museums
to augment exhibitions [16].
VI. CONCLUSION
The interactive humanoid character appears to be an effective tool for improving the cognitive impact of interactive content at public locations. It is also a useful way to provide multiple services while reducing marketing and sales man-hours.
The finite automata approach provides automated selection of the services. Telepresence directed from a central location is also cost-effective, since it homogenizes the use of the resource independently of the geographical location. The savings from this centralization can be calculated using statistical demand frequency and average service time parameters in queuing theory.
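As a hedged worked example of this claim, the sketch below applies the Erlang C formula to an assumed demand frequency and average supervised service time, and compares the resulting number of central operators with one attendant per location; all numbers are illustrative, not measurements.

```python
from math import factorial

def erlang_c(servers, offered_load):
    """Probability that a request has to wait (M/M/c queue, offered_load = lambda/mu)."""
    a, c = offered_load, servers
    top = (a ** c / factorial(c)) * (c / (c - a))
    bottom = sum(a ** k / factorial(k) for k in range(c)) + top
    return top / bottom

# Illustrative assumptions: 40 units, 1.5 operator-assisted requests per hour per unit,
# 5 minutes average supervised service time.
lam = 40 * 1.5            # requests per hour arriving at the central operators
mu = 60 / 5               # services per hour one operator can complete
load = lam / mu           # offered load in Erlangs (= 5.0)

c = int(load) + 1
while erlang_c(c, load) > 0.2:       # target: at most 20% of requests have to wait
    c += 1
print(c, "central operators instead of 40 on-site attendants")
```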
A stability investigation tool is additionally needed for complex state machine designs. The technological level we have reached still requires human operator support, especially for commercial transactions, and this is achieved by the self-switching of the IHC to supervised mode. The multimodal communication of the IHC system and its autonomous mode capabilities enable the implementation of creative services. Preparation of the active IHC media content requires a process automation tool; otherwise, manual editing and indexing of the audio, video and motion data, besides the definition of the triggering transition functions, requires special consideration and is a time-consuming process. Using a real person's image as the humanoid, instead of a synthetic character, is still being worked on; it can help bring services directly associated with the owner of the image. For example, buying a movie ticket from the actress of that movie could become possible in this way.
REFERENCES
[1] Bickmore, T., Cook, L., Churchill, E., Sullivan, J.: Animated
autonomous personal representatives. In: Proc. Second Int. Conf.
Autonomous Agents, pp. 8--15. Minneapolis (1998)
[2] Thalmann, D.: The Role of Virtual Humans in Virtual Environment
Technology and Interfaces. Frontiers of Human-Centred Computing,
Online Communities and Virtual Environments, Earnshaw R., Guedj
R., van Dam A., Vince J., (eds.) pp. 27--38. Springer-Verlag, London
(2001)
[3] Ustundag, B., Odabasi, K., Aksaz, E.: Interactive Humanoid Teller,
Patent Application, Turkish Patent Institute Application no.2009-G22168 (2009)
[4] Komiak, S., Wang, W., Benbasat, I.: Comparing Customer Trust in
Virtual Salespersons With Customer Trust in Human Salespersons. In:
38th Hawaii International Conference on System Sciences (2005)
[5] Dankers, A., Barnes, N., Zelinsky, A.: Bimodal Active Stereo Vision. In: Proc. 5th Int. Conf. on Field and Service Robotics (FSR05) (2005)
[6] Bartlett, M. S., Littlewort, G., Fasel, I., Movellan, J. R.: Real Time Face
Detection and Facial Expression Recognition: Development and
Applications to Human Computer Interaction. In: IEEE Workshop on
Face Processing in Video, Washington (2004)
[7] Nickel, K., Stiefelhagen, R.: Pointing Gesture Recognition based on
3D-Tracking of Face, Hands and Head Orientation. In: ICMI’03,
ACM, Vancouver (2003)
[8] Ma, J., Cole, R.: Animating visible speech and facial expressions. The
Visual Computer vol. 80, pp. 86--105. Springer-Verlag (2004)
[9] Chai, J., Xiao, J., Hodgins, J.: Vision-based Control of 3D Facial
Animation. In: Eurographics/SIGGRAPH Symposium on Computer
Animation’03 Breen D., Lin M. (eds.) ACM Press, pp. 193--206
(2003)
[10] Darrell, T., Basu, S., Wren, C., Pentland, A.: Perceptually-driven
Avatars and Interfaces: active methods for direct control. Technical
report, MIT Media Lab Perceptual Computation Section, TR 416
(1997)
[11] Deng, Z., Busso, C., Narayanan, S., Neumann U.: Audio-based Head
Motion Synthesis for Avatar based Telepresence Systems. In: ACM
SIGMM 2004 Workshop on Effective Telepresence, pp. 24--30, ACM
Press, New York (2004)
[12] Maestri, G.: 3D sample character design, http://www.lynda.com
(2008)