Design of an Interactive Humanoid Character for Multimodal Communication

Burak Berk Ustundag, Computer Eng. Dep., Istanbul Technical University, Istanbul, Turkey, [email protected]
Kerem Odabasi, Computer Eng. Dep., Istanbul Technical University, Istanbul, Turkey, [email protected]
Ali Norouzi, Computer Eng. Dep., Istanbul Technical University, Istanbul, Turkey, [email protected]

Abstract— The cognitive impact factor is one of the main measures of media-human interaction. In this study, we describe the design and implementation of a semi-autonomous humanoid character display system that provides a high cognitive impact on people. It combines marketing, entertainment, announcement, guidance and service sales in a single device. Unlike the conventional automated teller machine (ATM) approach, the interactive humanoid character (IHC) first detects the people around it and then initiates communication with them. A finite state machine model is used to determine the type of service the person in communication is likely to require, by navigating through pre-defined behavioral states as the interaction unfolds. This form of auto-controlled telepresence saves marketing, sales and service man-hours while maximizing the media reach factor.

Keywords— Telepresence, Humanoid, Synthetic Character, Finite State Machine Model, Interaction, Human Recognition, Face Recognition, Behavior Classification

I. INTRODUCTION

Virtual intelligent characters are taking on increasingly functional roles in the real world owing to the rapid development of computational intelligence and visualization technologies. Avatar-like synthetic characters were previously proposed to increase the impact of web based applications [1]. This kind of approach to virtual humans has various applications such as simulation based training and virtual surgery [2]. In this study, we show that visualizing androids and humanoids through existing display units, incorporated with network supported information systems and sensory mechanisms, is efficiently applicable even today. We describe a system that combines advertisement, entertainment, announcement, marketing and sales functions through an interactive humanoid character (IHC) [3].

In order to provide different types of actions together, we developed a new behavior direction model based on multimode human-humanoid interaction. In this approach, some conditional entry states are defined first. A finite state automaton is then used to transfer the behavioral status of the IHC from one situation to another according to the interpreted tendencies or needs of the human in interaction. Unlike automated teller machines, the entry state targets the detection and attraction of the people around the device. This role recalls the person called a "Dellal" in oriental culture, who acts as a sort of "Teller", attracting and directing people passing by. A sketch of people communicating with the humanoid is shown in Figure 1, where the humanoid character is visualized on a 2D or 3D platform equipped with sensory mechanisms.

The internal structure and working modes of the IHC system are described in Section II. Behavioral state transitions are triggered by predefined state transition functions, and time domain sensor signals form part of the state transition function variables. Different interactive contents begin to play according to the assignments of the active state. The finite state machine based direction of behavioral status is explained in Section III. Some services, including online sales, may require teleoperated, supervised actions.
An important benefit of using the state machine model is the automated self-switching capability between supervised, tele-operated services and autonomous modes of operation. Practical solutions for some advanced application examples, such as network based tele-operation, are handled in Section IV.

Figure 1. Interactive humanoid character on an audio-visual rotational panel unit

A virtually unlimited range of creative services can be implemented through the IHC. Person oriented services based on biometric identification, advertisements dependent on environmental and peripheral conditions, retrieval of demanded information, taking a photo of the person and sending it to their cellular phone, making jokes depending on the type of reaction, supply of conditional media content, statistical measurements, reservations, assistance services and even multi-lingual navigation are a few of the potential examples. Although people's attitude toward a humanoid, especially for paid services, could be questioned at first, research on people's trust in virtual salespersons has shown that the trust rate for online shopping is even slightly higher than for traditional shopping [4]. On the other hand, cultural differences are clearly expected to affect the acceptance of the IHC as a service provider beyond entertainment and informative purposes.

II. INTERACTIVE HUMANOID CHARACTER

The IHC basically consists of an array of sensors, an internal computer, a memory unit and output devices such as a display unit, speakers and, optionally, motion/dispenser drivers. The main components of the sensor array are the stereo cameras. Stereo camera vision is preferred in order to determine the exact location of moving objects and their physical properties (shorter and closer, or taller and farther away, etc.) [5]. S1..Sn in Figure 2 denote the sensors and i1..in their signal processing units. Video signals from the cameras are first captured and then processed via ic1 and ic2.

Figure 2. IHC emulation process block diagram

This first level of signal processing covers signal conditioning, digitizing, normalization/calibration, filtering and basic pattern recognition components. The inputs to the behavioral control functions of the IHC are not the processed signals directly; they are mainly events. For this reason, the processed signals are fed into an event classifier. The dimension "v" of the event vector E(t) is independent of the sensor array size "n".

A state space model (state machine) is used to select the interactive animation content. The event memory is therefore effective both in choosing the state to be activated and as a direct input to the currently playing content. For example, detection of face direction and position [6] is a classified event for both the transition function and the animation engine.

If more than one person is recognized as human within the field of view, audio based localization of the person in interaction provides better targeting. An example signal processing scheme is given in Figure 3 as the audio source localization function. The inputs of this function are the signals from three microphones acting as the sensors, and the outputs are the distance and the three dimensional angular direction of the separated speech source.

Figure 3. 3D audio source localizer for separating the person in interaction from the surrounding group of recognized people
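As a rough illustration of the audio source localization block of Figure 3, the following Python sketch estimates the planar direction of a speech source from three microphones using pairwise GCC-PHAT delays and a least-squares far-field fit. The microphone layout, function names and sampling assumptions are illustrative only; the distance output and the full 3D angular estimation of the actual system are not reproduced here.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed constant

def gcc_phat_delay(sig, ref, fs):
    """Delay of `sig` relative to `ref` (seconds) via GCC-PHAT."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cross /= np.abs(cross) + 1e-12                    # PHAT weighting
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def estimate_azimuth(frames, mic_xy, fs):
    """Least-squares far-field azimuth (degrees) from three microphone frames.

    frames : list of three 1-D numpy arrays (one per microphone)
    mic_xy : (3, 2) array of microphone positions in metres
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    baselines = np.array([mic_xy[j] - mic_xy[i] for i, j in pairs])
    # far field: (r_j - r_i) . u  ~  -c * tau_ij, with u the unit vector toward the source
    path_diffs = np.array([-SPEED_OF_SOUND * gcc_phat_delay(frames[j], frames[i], fs)
                           for i, j in pairs])
    u, *_ = np.linalg.lstsq(baselines, path_diffs, rcond=None)
    return float(np.degrees(np.arctan2(u[1], u[0])))

# Example call with a hypothetical 10 cm equilateral triangle array
if __name__ == "__main__":
    fs = 16000
    mics = np.array([[0.0, 0.0], [0.10, 0.0], [0.05, 0.0866]])
    rng = np.random.default_rng(0)
    frames = [rng.standard_normal(4096) for _ in range(3)]  # placeholder audio frames
    print(estimate_azimuth(frames, mics, fs))
```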
Formally, the event vector at any time t can be given as

E(t) = [e1(t), e2(t), ..., ev(t)]                    (1)

where each ei(t), i = 1..v, is a classified event defined with respect to the signal based definitions in the event database (Figure 2) as a function of 1 to n dimensional input variables. On the other hand, a decision may also depend on past values of the classified events, not only the latest ones. For this reason, an array of digital delays (z^-1) is used to construct the v dimensional past values of E(t). If the memory has "p" delays, each shifted in synchronization with the sampling time T, the event memory becomes the v x (p+1) dimensional matrix

[E] = [E(t), E(t-T), E(t-2T), ..., E(t-pT)]          (2)

Hence the event memory [E] can be used as the variable set of both the state transition functions and the directed animation block. Gesture recognition and classification parameters can also be used as variables of the event triggering functions [7]. As an example, the humanoid can track the eyes of the person in front of the unit while it makes an announcement; meanwhile, it can switch to content that includes an animation related to the option indicated by the direction of the person's arm and fingers. The synchronizer provides continuity of the content during the transition from one state to another. This is achieved by keeping one or more identical frames of the humanoid figure as a coupling set inside a database.

III. FINITE STATE MODEL FOR HUMANOID BEHAVIOR

An important aspect of this study is determining the type of service that might interest the human nearby. The proposed strategy is built on transitions to the most appropriate state, triggered by the state transition functions defined for each permitted state couple. Classified events, time, occurrence counts and logical inputs are used as the variables of the transition functions. Hence the humanoid can act differently on the same human gesture depending on the requirements of the active state.

The first state is the entry state; it aims to catch the interest of the human nearby upon detecting their presence. The second level states can vary according to the type of action, such as a directional change of walk. The third level is used to determine a possible interest through some trials and observation of gestures. The fourth level of states is the first level related to the chosen interest of the human; it can be the initiation of advertisement content or of an online service sale. Later levels depend on the complexity of the active animation content. If the interaction disappears within this sequence, or the last level of content has been played, the IHC returns to one of the first level states. Random functions, peripheral measurements (light level, time of day, etc.) and occurrence counts are some alternative inputs for choosing the next entry state.

The generalized form of the finite state machine is shown in Figure 5. In the generalized form there are n levels as the depth of the active contents and m variations at each level. The state definition matrix consists of logical variables that indicate the existence of a content and the position of a state. The state definition of the design example is extracted by multiplying the generalized form and the state definition matrices (Figure 4). The active animation engine in Figure 7 selects and switches to the content determined by the active state. It is connected to the audio, visual and motional content database recorded for each state.
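To make the state transition mechanism concrete, the following is a minimal Python sketch of an event-driven finite state machine operating on the event memory [E] of equation (2). The state names, event labels and the example transition condition are hypothetical illustrations; the actual system defines its transition functions through the matrices of Figures 4 and 5.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List

# A classified event vector E(t), e.g. {"face_detected": 1.0, "arm_points_left": 0.0}
Event = Dict[str, float]

@dataclass
class Transition:
    target: str
    condition: Callable[[Deque[Event]], bool]  # evaluated over the event memory [E]

class BehaviorFSM:
    """Sketch of the behavioral state machine of Section III."""

    def __init__(self, entry_state: str, memory_depth: int):
        self.state = entry_state
        # index 0 holds E(t), index k holds E(t - kT), up to p = memory_depth
        self.memory: Deque[Event] = deque(maxlen=memory_depth + 1)
        self.transitions: Dict[str, List[Transition]] = {}

    def allow(self, source: str, target: str,
              condition: Callable[[Deque[Event]], bool]) -> None:
        """Define a transition function for a permitted state couple."""
        self.transitions.setdefault(source, []).append(Transition(target, condition))

    def step(self, event_vector: Event) -> str:
        """Shift the event memory and fire the first satisfied transition."""
        self.memory.appendleft(event_vector)
        for tr in self.transitions.get(self.state, []):
            if tr.condition(self.memory):
                self.state = tr.target   # the animation engine would switch content here
                break
        return self.state

# Example wiring: leave the entry state once a face has been seen for 3 samples in a row
fsm = BehaviorFSM(entry_state="entry", memory_depth=8)
fsm.allow("entry", "greeting",
          lambda mem: len(mem) >= 3 and
          all(e.get("face_detected", 0.0) > 0.5 for e in list(mem)[:3]))

for _ in range(3):
    print(fsm.step({"face_detected": 1.0}))   # entry, entry, greeting
```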
On the other hand, "active content" means that the humanoid visualization is not only chosen upon entry to the active state; it is also played and directed with respect to the classified events occurring during the animation. Audio-based motion synthesis is an option that gives the humanoid a more realistic appearance, thanks to the synchronization between the mouth, the other affected facial regions and the audio content [8].

Figure 4. Extraction of the active states from the generalized form by the state definition matrix

Figure 5. Example state transition function matrix for a three-level, seven-state automaton

The state transition function matrix in Figure 5 determines the logical conditions for state changes. Each logical condition for a state transition represents the logical state of a classified event or of its past values inside the memory.

Figure 6. State transition frame link matrix for continuous visual coupling from one state to another

The frame link matrix in Figure 6 is used to prevent discontinuity of the visualization while the content is switched during a transition from one state to another. It keeps one or more frame couples for each defined set of states. In the given example, while the content corresponding to state (1,1) is being played, the system switches to the content of (2,1) through frame 3 of (1,1) and frame 46 of (2,1). If more than one couple, such as (8,32), is defined for the transition between two states, the couple giving the shortest path with respect to the current status is chosen.
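One plausible reading of the frame link mechanism is sketched below in Python: for each permitted state couple the table stores candidate (exit frame, entry frame) couples, and the couple whose exit frame can be reached soonest from the current playback position is chosen. The table contents and this "soonest exit frame" interpretation of the shortest path are assumptions for illustration, not specifications taken from the paper.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]               # (level, variation), e.g. (1, 1)
FrameCouple = Tuple[int, int]         # (exit frame in source content, entry frame in target)

# Hypothetical frame link table in the spirit of Figure 6; the (3, 46) and (8, 32)
# couples come from the example in the text.
FRAME_LINKS: Dict[Tuple[State, State], List[FrameCouple]] = {
    ((1, 1), (2, 1)): [(3, 46), (8, 32)],
}

def pick_coupling(source: State, target: State, current_frame: int) -> FrameCouple:
    """Choose the frame couple reachable soonest from the current playback frame."""
    couples = FRAME_LINKS[(source, target)]
    reachable = [c for c in couples if c[0] >= current_frame]
    pool = reachable if reachable else couples          # wrap to the next playback cycle
    return min(pool, key=lambda c: c[0] - current_frame)

# While content (1,1) plays at frame 2, switch to (2,1) through frames 3 -> 46
exit_frame, entry_frame = pick_coupling((1, 1), (2, 1), current_frame=2)
print(exit_frame, entry_frame)        # 3 46
```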
Figure 7. Audio/visual/motional animation of the humanoid character

IV. PRACTICAL IMPLEMENTATION OF THE IHC AND TELEPRESENCE

MAX/MSP software was used to animate, play and switch the content in the first tests validating the IHC concept. Its user friendly graphical environment and communication features provide flexibility in implementing event triggered audio-visual interactive content. Input data from the external detection, recognition and classification software is received via a MIDI interface.

Figure 8. Three screen captures taken during the humanoid character demonstration

In addition to the autonomous mode of operation, several services still require operator support. One achievement of the IHC system has been the automatic detection of states that require an external data content service. In that case, control of the humanoid character is automatically transferred to the related operator while the visualization continues its movements without interruption. This supervisory mode is managed remotely over the network (Figure 9). The facial gestures of the supervising operator can also be used as parameters of the same humanoid character [9, 10].

Figure 9. Network connection for telepresence of the operator

Harmony between the humanoid character's facial animation and the voice of the operator is also important. Direct feature extraction from the operator [11] is used to simplify the synchronization without the use of any dedicated motion models.

V. APPLICATIONS

The range of application areas for the IHC touches on many aspects of computing, and as computing becomes more ubiquitous, practically every aspect of interaction with objects and the environment, as well as human-human interaction (e.g., remote collaboration), will make use of IHC techniques. In the following sections, we describe specific application areas in which interesting progress has been made.

A. Human Spaces

Computing is expanding beyond the desktop, integrating with everyday objects in a variety of scenarios. As our discussion shows, this implies that the model of user interface in which a person sits in front of a computer is no longer the only model. One implication is that the actions or events to be recognized by the "interface" are not necessarily explicit commands. In smart conference room applications, for instance, multimodal analysis has been applied mostly to video indexing. Although such approaches are not meant to be used in real time, they are useful in investigating how multiple modalities can be fused when interpreting communication. It is easy to foresee applications in which "smart meeting rooms" actually react to multimodal actions in the same way intelligent homes should. Projects in the video domain include MVIEWS [13], a system for annotating, indexing, extracting and disseminating information from video streams for surveillance and intelligence applications. An analyst watching one or more live video feeds is able to use pen and voice to annotate the events taking place. The annotation streams are indexed by speech and gesture recognition technologies for later retrieval, and can be quickly scanned using a timeline interface and then played back during review of the footage. Pen and speech can also be used to command various aspects of the system, with multimodal utterances such as "Track this" or "If any object enters this area, notify me immediately."

B. Ubiquitous Devices

The recent drop in hardware costs has led to an explosion in the availability of mobile computing devices. One of the major challenges is that while devices such as PDAs and mobile phones have become smaller and more powerful, there has been little progress in developing effective interfaces to access the increased computational and media resources available on such devices. Mobile devices, as well as wearable devices, constitute a very important area of opportunity for IHC research because natural interaction with such devices can be crucial in overcoming the limitations of current interfaces. Several researchers have recognized this, and many projects exist on mobile and wearable IHC [13].

C. Users with Disabilities

People with disabilities can benefit greatly from IHC technologies. Various authors have proposed approaches for smart wheelchair systems that integrate different types of sensors. The authors of [17] introduce a system for presenting digital pictures non-visually (multimodal output), and related techniques can be used for interaction using only eye blinks and eyebrow movements. Some of the approaches in other application areas could also be beneficial for people with disabilities.

D. Public and Private Spaces

In this category we place applications implemented to access devices used in public or private spaces. One example of implementation in public spaces is the use of the IHC in information kiosks. These are challenging applications for natural multimodal interaction: the kiosks are often intended to be used by a wide audience, so few assumptions can be made about the types of users of the system. On the other hand, there are applications in private spaces. One interesting area is implementation in vehicles. This is an interesting application area due to its constraints: since the driver must focus on the driving task, traditional interfaces (e.g., GUIs) are not suitable. Thus, it is an important area of opportunity for IHC research, particularly because, depending on the deployment, vehicle interfaces can be considered safety-critical.
E. Virtual Environments

Virtual and augmented reality has been a very active research area at the crossroads of computer graphics, computer vision and human-computer interaction. One of the major difficulties of VR systems is the interaction component, and many researchers are currently exploring the use of interaction analysis techniques to enhance the user experience. One reason this is very attractive in VR environments is that it helps disambiguate communication between users and machines (in some cases virtual characters, the virtual environment, or even other users represented by virtual characters).

F. Art

Perhaps one of the most exciting application areas of the IHC is art. Vision techniques can be used to allow audience participation to influence a performance. In [14], the authors use multiple modalities (video, audio, pressure sensors) to output different "emotional states" for Ada, an intelligent space that responds to multimodal input from its visitors. In [15], a wearable camera pointing at the wearer's mouth interprets mouth gestures to generate MIDI sounds, so a musician can play other instruments while generating sounds by moving his mouth. In [16], limb movements are tracked to generate music. The IHC can also be used in museums to augment exhibitions [16].

VI. CONCLUSION

The interactive humanoid character appears to be an effective tool for improving the cognitive impact of interactive content at public locations. It is also a useful way to provide multiple services while reducing the man-hours spent on marketing and sales. The finite automata approach provides automated selection of the services. Telepresence directed from a central location is also cost effective, since it homogenizes the use of resources independently of geographical location. The savings due to this centralization can be calculated by using statistical demand frequency and average service time parameters in queuing theory. There is an additional need for a stability investigation tool for complex state machine designs.

The current technological level still requires human operator support, especially for commercial transactions; this is achieved by the self-switching of the IHC to supervised mode. The multimode communication of the IHC system and its autonomous mode capabilities enable the implementation of creative services. Preparation of the active IHC media content requires a process automation tool; otherwise, the manual editing and indexing of audio, video and motion data, together with the definition of the triggering transition functions, requires special consideration and is a time consuming process. Using the appearance of real people for the humanoid, instead of a synthetic character, is still being worked on; it can help to offer services directly associated with the owner of that appearance. For example, buying a ticket from the actress of the movie could become possible in this way.

REFERENCES

[1] Bickmore, T., Cook, L., Churchill, E., Sullivan, J.: Animated Autonomous Personal Representatives. In: Proc. Second Int. Conf. on Autonomous Agents, pp. 8-15, Minneapolis (1998)

[2] Thalmann, D.: The Role of Virtual Humans in Virtual Environment Technology and Interfaces. In: Earnshaw, R., Guedj, R., van Dam, A., Vince, J. (eds.) Frontiers of Human-Centred Computing, Online Communities and Virtual Environments, pp. 27-38. Springer-Verlag, London (2001)
[3] Ustundag, B., Odabasi, K., Aksaz, E.: Interactive Humanoid Teller. Patent Application, Turkish Patent Institute, Application no. 2009-G22168 (2009)

[4] Komiak, S., Wang, W., Benbasat, I.: Comparing Customer Trust in Virtual Salespersons with Customer Trust in Human Salespersons. In: Proc. 38th Hawaii International Conference on System Sciences (2005)

[5] Dankers, A., Barnes, N., Zelinsky, A.: Bimodal Active Stereo Vision. In: Proc. 5th Int. Conf. on Field and Service Robotics (FSR05) (2005)

[6] Bartlett, M. S., Littlewort, G., Fasel, I., Movellan, J. R.: Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction. In: IEEE Workshop on Face Processing in Video, Washington (2004)

[7] Nickel, K., Stiefelhagen, R.: Pointing Gesture Recognition Based on 3D-Tracking of Face, Hands and Head Orientation. In: ICMI'03, ACM, Vancouver (2003)

[8] Ma, J., Cole, R.: Animating Visible Speech and Facial Expressions. The Visual Computer, vol. 20, pp. 86-105. Springer-Verlag (2004)

[9] Chai, J., Xiao, J., Hodgins, J.: Vision-Based Control of 3D Facial Animation. In: Breen, D., Lin, M. (eds.) Eurographics/SIGGRAPH Symposium on Computer Animation '03, pp. 193-206. ACM Press (2003)

[10] Darrell, T., Basu, S., Wren, C., Pentland, A.: Perceptually-Driven Avatars and Interfaces: Active Methods for Direct Control. Technical Report TR 416, MIT Media Lab Perceptual Computation Section (1997)

[11] Deng, Z., Busso, C., Narayanan, S., Neumann, U.: Audio-Based Head Motion Synthesis for Avatar-Based Telepresence Systems. In: ACM SIGMM 2004 Workshop on Effective Telepresence, pp. 24-30. ACM Press, New York (2004)

[12] Maestri, G.: 3D Sample Character Design, http://www.lynda.com (2008)