Pogany: A Tangible Cephalomorphic Interface for Expressive Facial Animation

Christian Jacquemin
LIMSI-CNRS & Univ. Paris 11, BP 133, 91403 ORSAY, France. [email protected]

Abstract. A head-shaped input device is used to produce expressive facial animations. The physical interface is divided into zones, and each zone controls an expression on a smiley or on a virtual 3D face. Through contacts with the interface, users can generate basic or blended expressions. To evaluate the interface and to analyze the behavior of the users, we performed a study consisting of three experiments in which subjects were asked to reproduce simple or more subtle expressions. The results show that the subjects easily accept the interface and get engaged in a pleasant affective relationship that makes them feel as if they were sculpting the virtual face. This work shows that anthropomorphic interfaces can be used successfully for intuitive affective expression.

1 Anthropomorphic Devices for Affective Communication

We have designed and built a head-shaped tangible interface for the generation of facial expressions through intuitive contact or proximity gestures. Our purpose is to offer a new medium of communication that can involve the user in an affective loop [1]. The input to the interface consists of intentional and natural affective gestures, and the output is an embodiment of the emotional content of the input gestures. The output, either a facial expression or a smiley, is used as feedback to the user so that she can both tune her interactions with the interface according to the output (cognitive feedback) and feel the emotions expressed by the virtual actor or the smiley (affective feedback).

The input device is a hollow resin head with holes and an internal video camera that captures the positions of the fingers on the interface. The output is an interactive smiley or an expressive virtual 3D face (see figure 1). The user can control a wide range of expressions of the virtual avatar through correspondences between finger contacts and a set of basic expressions of emotions. The interface is used both as a means to display one's own expressions of emotions and as a means to convey emotions through the virtual face. We take advantage of the anthropomorphic shape of the input, a stylized human head, to establish easily learnable correspondences between users' contacts and expressed emotions. Even though a doll head was used in an “early” design of tangible interfaces in the mid-90s [2], human shapes are more widely used as output interfaces (e.g. Embodied Conversational Agents) than as input devices. Through our study we show that anthropomorphic input interfaces are experienced as an engaging and efficient means for affective communication, particularly when they are combined with a symmetric output that mirrors the emotions conveyed by the input interface.

Fig. 1. Experimental setup.

2 Anthropomorphic Input/Output Device

We now examine in turn the two components of the interface: the physical tangible input device, and the virtual animated face together with the mapping between gestures and expressions. Three experimental setups have been proposed: two setups in which strongly marked expressions can be generated on an emoticon or a 3D virtual face, and a third, more attention-demanding setup in which subtle and flexible expressions of the 3D face are controlled by the interface.
At this stage of development, no social interaction is involved in our study, so as to focus first on the usability of the interface and on the ease of control of the virtual agent's expressions.

2.1 Input: Cephalomorphic Tangible Interface

The physical input device is based on the following constraints:
1. it should be able to capture intuitive gestures of the hands and fingers, as if the user were approaching someone's face,
2. direct contacts as well as gestures in the vicinity of the head should be captured, in order to allow for a wide range of subtle inputs through distant interactions,
3. as suggested by the design study of SenToy [3], the shape of the physical interface should not have strongly marked traits that would make it look like a familiar face, or that would suggest predefined expressions,
4. the most expressive facial parts of the interface should be easily identified without the visual modality in order to allow for contact interaction: eyes, eyebrows, mouth, and chin should have clearly marked shapes.

The first constraint has oriented us towards multi-touch interaction techniques that can detect several simultaneous contacts. Since the second constraint rules out pressure-sensitive capture, which cannot report gestures made without contact, we have chosen a vision-based capture device that is both multi-touch and proximity-sensitive. The interface is equipped with a video camera, and 43 holes are used to detect the positions of the fingers in the vicinity of the face (figure 2). In order to detect the positions and gestures of both hands, right and left symmetric holes play the same role in the mapping between interaction and facial animation. The holes are located at a subset of the 84 MPEG-4 key points used for standard facial animation [4]. Only the mobile key points of this formalism are retained; for instance, points 10.* and 11.* (ears and hair) are ignored. The underlying hypothesis for selecting these points is that, since they correspond to places in the face with high mobility, they also make sensible capture points for animation control.

Fig. 2. Cross section of the physical interface and list of capture holes.

The third constraint has oriented us towards an abstract facial representation that would hardly suggest a known human face. Since we nevertheless wanted the interface to be appealing for contact, caress, or nearby gestures, its aesthetics was a concern. Its design is deliberately soft and non-angular; it is loosely inspired by Mademoiselle Pogany, a series of sculptures by the 20th-century artist Constantin Brancusi (figure 3). The eye and mouth reliefs are prominent enough to be detected by contact with the face (fourth constraint). The device (14 cm high) is about the size of a joystick, roughly three times smaller than a human face.

Fig. 3. Overview of the physical interface and bimanual interaction.

All the tests have been done with bare hands and under normal lighting conditions (during the day with natural light and in the evening with regular office lighting).

2.2 Output: Expressive Smiley or Virtual 3D Face

A straightforward way to provide users with feedback on the use of the interface for affective communication is to associate their interactions with expressions of emotions on an animated face. We have used two types of faces: a smiley and a realistic 3D face with predefined or blended expressions.
Of course, other types of correspondences can be established, and we do not claim that the physical interface should be restricted to controlling facial animation. Other mappings, such as the use of the interface for musical composition, are under development. As a first step, however, we found it necessary to check that literal associations could work before turning to more elaborate applications.

The association of interactions with facial animations is performed in two steps. First, the video image is captured with the ffmpeg library (http://ffmpeg.mplayerhq.hu/) and transformed into a bitmap of gray pixels. After a calibration phase, the bitmap is analyzed at each frame around each hole by comparing the luminosity at calibration time with its current value. The activation of a capture hole is derived from the ratio between its current luminosity and its luminosity at calibration time: the more a hole is darkened by an approaching finger, the higher its activation. The activation of a zone made of several holes is its highest hole activation. In a second step, zone activations are associated with facial expressions. Each expression is a table of key-point transformations, a Face Animation Table (FAT) in MPEG-4. The choice of the output expression depends on the rendering mode. In the non-blended mode, the expression associated with the most activated zone is chosen. In the blended mode, a weighted interpolation is made between the expressions associated with each activated zone. Facial animation is implemented in Virtual Choreographer (VirChor, http://virchor.sf.net/), an open-source interactive 3D rendering tool. VirChor stores the predefined FATs, receives expression weights from the video analysis module, and produces the corresponding animations.

2.3 Basic and Blended Facial Expressions

The mapping between interactions and expressions relies on a partitioning of the face into six symmetrical zones shown in the center part of the two images in figure 4. Each zone is associated with a single basic expression, and the level of activation of a zone is the percentage of occlusion of the most occluded key point in this zone. Thus hole occlusion by the fingers is used to control expressions on the virtual faces (smiley or 3D face). All the zones are symmetrical so that right- and left-handed subjects are offered the same possibilities of interaction.

Two sets of 6 basic facial expressions, which the users could identify and reproduce quickly, were designed for the smiley and for the 3D face. For the smiley, the 6 expressions correspond to 5 basic emotions and a non-expressive face with closed eyes: angry face, surprised eyebrows, surprised mouth, happy mouth, sad mouth, closed eyes (see upper part of figure 4). Only the angry face expression involves both the upper and the lower part of the face. Each basic expression of the 3D face (lower part of figure 4) is associated with an Action Unit (AU) of Ekman and Friesen's Facial Action Coding System [5]: a contraction of one or several muscles that can be combined to describe the expressions of emotions on a human face. Only 6 of the 66 AUs in this system are used; they are chosen so that they have simple and clear correspondences with expressions of the smiley. The only noticeable difficulty is the correspondence between the angry face smiley, which involves modifications of the upper, lower, and central parts of the face, and the associated 3D expression of AU4 (Brow Lowerer), which only involves the upper part of the face.
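As an illustration of this mapping, the minimal Python sketch below derives hole and zone activations from luminosity ratios and selects expression weights in the non-blended and blended modes. The zone names and numerical values are hypothetical; the actual system captures frames with ffmpeg and sends the resulting weights to VirChor, which applies the FATs.

# Minimal sketch of the activation pipeline described above
# (hypothetical zone names and values).

from typing import Dict, List


def hole_activation(current_lum: float, calibrated_lum: float) -> float:
    """Occlusion level of a capture hole in [0, 1]: a finger darkens the
    hole, so activation grows as the current luminosity drops."""
    if calibrated_lum <= 0.0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - current_lum / calibrated_lum))


def zone_activation(hole_activations: List[float]) -> float:
    """A zone is as active as its most occluded hole."""
    return max(hole_activations, default=0.0)


def expression_weights(zones: Dict[str, float], blended: bool) -> Dict[str, float]:
    """Non-blended mode: the most activated zone wins (weight 1, others 0).
    Blended mode: each zone activation is used directly as its weight."""
    if blended:
        return dict(zones)
    winner = max(zones, key=zones.get)
    return {name: (1.0 if name == winner else 0.0) for name in zones}


# Example: the mouth zone is strongly occluded, the eye zone only slightly.
zones = {"eyebrows": 0.0, "eyes": 0.2, "nose": 0.0,
         "mouth": 0.9, "lips": 0.0, "jaw": 0.0}
print(expression_weights(zones, blended=False))  # winner takes all
print(expression_weights(zones, blended=True))   # interpolation weights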
3D basic face expressions are deliberately associated with AUs instead of more complex expressions in order to facilitate the recognition of blended expressions in the third task of the experiment. In this task, the users have to guess which basic expressions are involved in the synthesis of complex expressions resulting from the weighted interpolation of AUs. Through this design, only a small subset of facial expressions can be produced. They are chosen so that they can be easily distinguished. More subtle expressions could be obtained by increasing the number of zones, either through a larger resin cast with more holes or by assigning fewer holes to each zone in the current version of the interface.

The 3D animation of each basic expression is made by displacing the MPEG-4 key points. Since these expressions are restricted to specific parts of the human face, they only involve a small subset of the 84 MPEG-4 key points. For example, the basic expression associated with AU2 (Outer Brow Raiser) is based on the displacement of key points 4.1 to 4.6 (eyebrows), while the expression of AU12 relies on key points 2.2 to 2.9 (inner mouth) and 8.1 to 8.8 (outer mouth).

Fig. 4. Mapping between face zones and emoticons or virtual face expressions.

We now turn to the usability study, in which the users were asked to reproduce basic or blended expressions of emotions on a virtual face through interactions with the physical interface.

3 Usability Study: Control of Emoticons or Facial Expressions through Pogany

As for the SenToy design experiment [3], our purpose is to check whether a user can control a virtual character's expressions (here the face) through a tangible interface that represents the same part of the body. Our usability study is intended to verify (1) that users can quickly recognize facial expressions from a model, and (2) that they can reproduce them at various levels of difficulty. Last, we wanted to let the users express themselves about their feelings during the experiment and their relationship to the device.

3.1 Experiment

22 volunteers participated in the experiment: 12 men and 10 women, aged between 15 and 58 (average 29.1). Each session lasts between 30 and 50 minutes, depending on the time taken by the subject to train and to accomplish each task. Each subject is first introduced by the experimenter to the purpose of the experiment, and then the tasks are explained with the help of the two zone/expression association schemas of figure 4. The experiment consists of three tasks in which users must use the physical interface to reproduce models of facial expressions. Each task corresponds to a different mapping between the users' interactions and facial expressions. In task T1, the visual output is a smiley, and in tasks T2 and T3, the output is a 3D animated face. In task T2, only basic expressions of the virtual face are controlled, while in task T3, blended expressions are produced through the interactions of the user. The tasks are defined as follows:
1. Task T1: The face zones on the interface are associated with smileys as shown in the upper image of figure 4. If several zones are occluded by the user's hand(s) and finger(s), the zone with the strongest occlusion wins.
2. Task T2: The face zones of the interface are associated with basic expressions as shown in the bottom image of figure 4. The same rule as in 1. applies for zone-based occlusion: the most occluded zone wins.
3. Task T3: The face zones are associated with the same expressions as in 2, but zone activation is now gradual (from 0, inactive, to 1, maximally active) and several zones can be activated simultaneously. Each zone weight is equal to the percentage of occlusion of the most occluded key point. The resulting facial animation is made of blended expressions as described in section 2.3.

Tasks T1 and T2 could easily be implemented through keyboard input. We have chosen a tangible interface as the input device because we wanted the user to get used to the interface on the simple tasks T1 and T2 before experiencing the more complex task T3. We also wanted to observe the user, and to check the usability and the quality of affective communication through this interface even on simple tasks. To make a simple parallel with other types of interaction: even though firing bullets in an FPS game could also be performed by pressing a keyboard key, most players certainly prefer to use the joystick trigger.

In tasks T1 and T2, the subject faces a screen on which a target expression is shown, and she must reproduce the same expression on the smiley (task T1) or on the virtual face (task T2) by touching the interface on the corresponding zones. 20 expressions are randomly chosen among the 6 basic ones. They are shown in turn, and the target changes each time the user holds, for at least 0.5 second, the finger positions that display the target expression. Before each task begins, the subject can practice for as long as she needs until she feels ready. The experimental setup for task T1 is shown in figure 1 at the beginning of the article.

For tasks T2 and T3, the output is a 3D face instead of a smiley. In task T3, the generated expressions are composed by blending the expressions associated with each zone (weights are computed from zone activations as explained above). The blended expressions are implemented as weighted combinations of elementary expressions as in [6]. Two such blended expressions, Face11 and Face14, are shown in figure 5. They can be described as 6-float weight vectors based on the 6 basic expressions and associated AUs of figure 4: AU2, AU4, AU12, AU15, AU26, and AU43. Each vector coordinate lies in [0, 1]. The vector of Face11 is (1, 0, 0, 0, 0, 1) because it is a combination of AU2 and AU43 fully expressed. Similarly, the vector of Face14 is (0, 0.5, 0.5, 0, 0.5, 0) because it is a combination of AU4, AU12, and AU26 partially expressed.

Fig. 5. Two blended expressions of emotions (Face11 and Face14 of task T3).

The 15 target blended expressions are designed as follows: 6 of them are basic expressions (weight 1.0), 6 are combinations of 2 basic expressions (weights 1.0 and 1.0, see Face11 in figure 5), and 3 are combinations of 3 partially weighted expressions (weights 0.5, 0.5, and 0.5, see Face14 in figure 5). Blended expressions are obtained by pressing several zones of the face simultaneously. For example, Face14 is obtained through the semi-activation of three zones: the nose (AU4), the lips (AU12), and the jaws (AU26). The difficulty of task T3 comes from the combination of expressions, possibly in the same part of the face, and from the need to control several zones of the face simultaneously. In task T3, we let the user tell the experimenter when she is satisfied with an expression before turning to the next one.
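To illustrate how such a weight vector could drive the animation, the sketch below blends per-key-point displacements of the six basic AU expressions, using Face14's vector as input. The key points and displacement values are illustrative only and do not reproduce the actual FAT data of the system.

# Hypothetical sketch of blending Face Animation Tables: each basic AU
# expression maps a few MPEG-4 key points to displacement vectors, and a
# blended face is their weighted sum. Values are made up for illustration.

AU_ORDER = ("AU2", "AU4", "AU12", "AU15", "AU26", "AU43")

FATS = {
    "AU2":  {"4.1": (0.0, 0.4, 0.0), "4.2": (0.0, 0.4, 0.0)},    # outer brow raiser
    "AU4":  {"4.1": (0.0, -0.3, 0.0), "4.2": (0.0, -0.3, 0.0)},  # brow lowerer
    "AU12": {"8.3": (0.2, 0.2, 0.0), "8.4": (-0.2, 0.2, 0.0)},   # lip corner puller
    "AU15": {"8.3": (0.0, -0.2, 0.0), "8.4": (0.0, -0.2, 0.0)},  # lip corner depressor
    "AU26": {"2.1": (0.0, -0.5, 0.0)},                           # jaw drop
    "AU43": {"3.1": (0.0, -0.1, 0.0), "3.2": (0.0, -0.1, 0.0)},  # eyes closed
}


def blend(weights):
    """Weighted sum of key-point displacements for a 6-float weight vector
    ordered as AU_ORDER."""
    displacements = {}
    for au, weight in zip(AU_ORDER, weights):
        if weight == 0.0:
            continue
        for keypoint, (dx, dy, dz) in FATS[au].items():
            px, py, pz = displacements.get(keypoint, (0.0, 0.0, 0.0))
            displacements[keypoint] = (px + weight * dx,
                                       py + weight * dy,
                                       pz + weight * dz)
    return displacements


# Face14 = (0, 0.5, 0.5, 0, 0.5, 0): half-weighted AU4, AU12, and AU26.
print(blend((0.0, 0.5, 0.5, 0.0, 0.5, 0.0)))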
We use self-evaluation because automatic success detection in this task is more delicate than for T1 and T2, and because we are interested in letting the user report her own evaluation of the quality of her output. Contrary to previous experiments on conversational agents, in which the users are asked to tell which expressions they recognize from an animation sequence [7], we evaluate here the capacity to reproduce an expression at a subsymbolic level, without any explicit verbalization (no labelling of the expression is required). Table 1 summarizes the definition of the tasks and gives the quantities measured during these tasks: the time to reproduce an expression for the three tasks and, for T3, the error between the expression produced on the virtual face and the target expression.

Table 1. Task definition.

Task | Avatar  | Target              | Measures     | Success detection
T1   | Smiley  | Emoticons           | Time         | Hold target emoticon 0.5 sec.
T2   | 3D face | Basic expressions   | Time         | Hold target expression 0.5 sec.
T3   | 3D face | Blended expressions | Time & error | Self-evaluation

Before starting the experiment, the experimenter insists on three points: the subject should feel comfortable and take as much time as she needs to practice before starting, speed is not an issue, and the final, anonymous questionnaire is an important part of the experiment. The subject can handle the interface as she wants, either with its face turned toward her or away from her. She is warned that when the interface faces her, the color of her clothes can slightly disturb finger contact recognition because of the video analysis technique.

3.2 Quantitative Results

Tasks T1 and T2: The average time taken by the subjects to complete the tasks is very similar for the first two tasks: 7.4 and 7.5 sec. per target expression, with standard deviations of 3.6 and 3.3. Even though the target expressions are defined on very different face models (smiley vs. 3D face), the similar modes of interaction (the winning zone defines the expression) make the two tasks very similar for the subjects. They rate these two tasks as easy: 1.6 and 1.7, with standard deviations of 0.7 and 0.6, on a scale of 1 (easy) to 3 (difficult).

Task T3: The figures are very different for task T3, in which the difficulty is due to the blended combination of expressions. It requires (1) analyzing an expression and guessing its ingredients, and (2) progressively tuning the weight of each ingredient in order to obtain an expression as close as possible to the proposed target. For T3, the subjects have taken an average time of 26.9 sec. to reproduce each expression, with a high standard deviation of 13.7. Half a minute is, however, not very long for such a complex task when compared with the complexity of the input (43 holes and several sites for finger positioning on the physical interface) and of the output (interpolations between basic expressions). The error between an achieved face and the required target is the sum of the absolute differences between the coordinates of the face produced by the user and those of the target face in the 6-dimensional space of facial expressions. For example, if the user produces a blended face weighted by (0.2, 0.6, 0.5, 0.1, 0.8, 0.0), its distance to Face14 (0.0, 0.5, 0.5, 0.0, 0.5, 0.0) is 0.2 + 0.1 + 0 + 0.1 + 0.3 + 0 = 0.7. Surprisingly, the time taken to make an expression in task T3 does not depend on the composition of the expression (28 sec. for basic expressions, 26 sec. for dual expressions, and 25 sec. for triple expressions).
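Before discussing the error figures, here is a minimal sketch of this error measure, reproducing the numerical example above (the function name is ours, not part of the system):

# Sketch of the task T3 error measure: the sum of absolute coordinate
# differences between the produced and target 6-float weight vectors.

def expression_error(produced, target):
    """L1 distance in the 6-dimensional space of facial expressions."""
    return sum(abs(p - t) for p, t in zip(produced, target))


# Example from the text: a produced face compared with Face14.
produced = (0.2, 0.6, 0.5, 0.1, 0.8, 0.0)
face14 = (0.0, 0.5, 0.5, 0.0, 0.5, 0.0)
print(expression_error(produced, face14))  # ~0.7 (up to floating-point error)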
The average error on binary expressions made of two fully weighted expressions such as Face11 (1.79) is higher than for single expressions (1.23) or for partially weighted triple expressions such as Face14 (1.31). This result suggests that mildly blended expressions are easier to reproduce than heavily blended ones. Task T3 has been rated as difficult by almost all subjects (2.77 on a 1-3 scale, with only one subject rating it as easy).

Role of Expertise: In the questionnaire, the subjects were asked about their level of expertise: average use of a computer, musical instrument practice, use of 3D devices, and gaming. We now investigate whether expert subjects perform better or faster than non-expert ones. The two leftmost histograms of figure 6 analyze the average time taken to accomplish tasks T1 and T2 for each level of expertise. Each group of bars is associated with a time range, and each bar is associated with a level of expertise (low, medium, or high). The height of a bar represents the percentage of subjects with this level of expertise who have performed the experiment in this time range. The histograms show that the expert subjects are quicker in the smiley experiment than the two other categories. In the second experiment, all the subjects take approximately the same time. This is probably because the time needed to identify the 3D facial expressions does not vary with the expertise of the subject, and this recognition time increases the overall duration of expression recognition and reproduction for expert subjects, evening out the differences between groups.

Fig. 6. Duration and error as a function of expertise (histograms of task duration for T1 and T2, and of error by task duration for T3, broken down by expertise level: low, average, high).

In the histogram for the third task (rightmost histogram in figure 6), the height of a bar represents the average error of the blended faces produced by the subjects with respect to the target face. Bars are grouped by time duration intervals, as for the two other tasks. This histogram shows that expert subjects are not quicker than non-expert ones for task T3, but they achieve better quality: the average error is 1.67 for subjects with low expertise, 1.40 for those with average expertise, and 1.15 for highly expert subjects. Incidentally, this histogram also shows that the slowest subjects do not perform significantly better than the fastest ones.

3.3 Subjective Evaluation

The questionnaire contained several fields in which subjects could write extended comments about the experiment. All but one of the subjects have rated the interface positively: 3 or 4 on a scale of 1 (very unpleasant) to 4 (very pleasant). Their comments concern the tactile interaction, and the naturalness and softness of the correlation between kinesthetics and animation. They appreciate that the contact is soft and progressive: natural, touch-friendly, reactive, and simple are among the words used by the subjects to describe the positive aspects of this interface. Users also appreciate the emotional transfer to the virtual avatars, which makes the smiley and the face more “human”. Some of the subjects have even talked to the interface during T3, rather kindly: “Come on! Close your eyes.” The questionnaire asked the subjects whether the face was conveying expressions of emotions and whether they would have preferred it with another shape.
They qualify the face as expressionless, calm, placid, quiet, passive, neutral... and have positive comments about the aesthetics of the interface. One of them finds that it looks like an ancient divinity. This confirms our hypothesis that a neutral face is appreciated by the users for this type of interface. Some users feel uncomfortable with the interface because it requires a tactile engagement. But for the subjects who agree to enter into such an “intimate” experience with the interface, the impression can become very pleasant, as reported by a 43-year-old male subject: The contact of fingers on a face is a particular gesture that we neither often nor easily make. [...] Luckily, this uncomfortable impression does not last very long. After a few trials, you feel like a sculptor working with clay...

Negative comments concern the difficulty of accurately controlling the output of the system because of undesirable shadows cast on neighboring zones, and the small size of the tactile head, which makes positioning the fingers on the capture holes difficult. Some subjects have, however, noticed the benefit of the video capture by using distant hand positions to obtain mild and blended expressions. Criticism also concerns the limits of the experimental setup: some users would have liked to go a step further in modifying the facial expressions and to control an output device as innovative as the input interface. To sum up, negative comments mainly concern technical limitations of the interface that should be overcome by using gestures (sequences of hole occlusions) instead of static contacts. Positive comments mostly concern the perspectives for affective communication opened by this new type of interface.

4 Conclusion and Future Developments

The usability evaluation reported here shows that a head-shaped interface can be successfully used by expert and non-expert subjects for affective expression. Comments in the questionnaire show that, for most of the users, this type of interaction is a very positive and pleasant experience. Our future work on this interface will follow three complementary directions. At the technical level, Hidden Markov Models are currently being implemented so that gestures can be recognized in addition to static interactions. Since tactile communication relies on a wide palette of caresses and contacts, it is necessary to capture the pressure, speed, and direction of gestures. At the application level, we intend to design new experiments involving more than one subject, in order to study the communicative properties of this interface in situations of intimate or social relationships. Last, we will improve the quality of facial rendering to generate more realistic expressions [8].

5 Acknowledgements

Many thanks to Clarisse Beau, Vincent Bourdin, Laurent Pointal and Sébastien Rieublanc (LIMSI-CNRS) for their help in the design of the interface; Jean-Noël Montagné (Centre de Ressources Art Sensitif), Francis Bras, and Sandrine Chiri (Interface Z) for their help on sensitive interfaces; Catherine Pelachaud (Univ. Paris 8) for her help on ECAs and for her detailed comments on a draft version of this article. This work is supported by the LIMSI-CNRS Talking Head action coordinated by Jean-Claude Martin.

References

1. Sundström, P., Ståhl, A., Höök, K.: In situ informants exploring an emotional mobile messaging system in their everyday practice. Int. J. Hum.-Comput. Stud. 65(4) (2007) 388–403. Special issue of IJHCS on Evaluating Affective Interfaces.
2. Hinckley, K., Pausch, R., Goble, J.C., Kassell, N.F.: Passive real-world interface props for neurosurgical visualization. In: CHI '94: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, ACM Press (1994) 452–458
3. Paiva, A., Andersson, G., Höök, K., Mourão, D., Costa, M., Martinho, C.: SenToy in FantasyA: Designing an affective sympathetic interface to a computer game. Personal Ubiquitous Comput. 6(5-6) (2002) 378–389
4. Ostermann, J.: Face animation in MPEG-4. In Pandzic, I.S., Forchheimer, R., eds.: MPEG-4 Facial Animation. Wiley, Chichester, UK (2002) 17–55
5. Ekman, P., Friesen, W.V.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, CA, USA (1978)
6. Tsapatsoulis, N., Raouzaiou, A., Kollias, S., Cowie, R., Douglas-Cowie, E.: Emotion recognition and synthesis based on MPEG-4 FAPs. In Pandzic, I.S., Forchheimer, R., eds.: MPEG-4 Facial Animation. Wiley, Chichester, UK (2002) 141–167
7. Pelachaud, C.: Multimodal expressive embodied conversational agents. In: MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia, New York, NY, USA, ACM Press (2005) 683–689
8. Albrecht, I.: Faces and Hands: Modeling and Animating Anatomical and Photorealistic Models with Regard to the Communicative Competence of Virtual Humans. Ph.D. thesis, Universität des Saarlandes (2005)