Pogany: A Tangible Cephalomorphic Interface for Expressive Facial Animation
Christian Jacquemin
LIMSI-CNRS & Univ. Paris 11, BP 133, 91403 ORSAY, France.
[email protected]
Abstract. A head-shaped input device is used to produce expressive facial animations. The physical interface is divided into zones, and each zone controls an expression on a smiley or on a virtual 3D face. Through contacts with the interface, users can generate basic or blended expressions. To evaluate the interface and to analyze the behavior of the users, we performed a study made of three experiments in which subjects were asked to reproduce simple or more subtle expressions. The results show that the subjects readily accept the interface and engage in a pleasant affective relationship that makes them feel as if they were sculpting the virtual face. This work shows that anthropomorphic interfaces can be used successfully for intuitive affective expression.
1 Anthropomorphic Devices for Affective Communication
We have designed and built a head-shaped tangible interface for the generation of facial expressions through intuitive contact or proximity gestures. Our purpose is to offer a new medium of communication that can involve the user in an affective loop [1]. The input to the interface consists of intentional and natural affective gestures, and the output is an embodiment of the emotional content of the input gestures. The output, either a facial expression or a smiley, is used as feedback to the user so that she can both tune her interactions with the interface according to the output (cognitive feedback) and feel the emotions expressed by the virtual actor or the smiley (affective feedback). The input device is a hollow resin head with holes and an internal video camera that captures the positions of the fingers on the interface. The output is an interactive smiley or an expressive virtual 3D face (see figure 1). The user can control a wide range of expressions of the virtual avatar through correspondences between finger contacts and a set of basic expressions of emotions. The interface is used both as a means to display one's own expressions of emotions and as a means to convey emotions through the virtual face.
We take advantage of the anthropomorphic shape of the input, a stylized human head, to establish easily learnable correspondences between the users' contacts and the expressed emotions. Even though a doll head was used in an "early" design of tangible interfaces in the mid-90s [2], human shapes are more widely used as output interfaces (e.g. Embodied Conversational Agents) than as input devices. Through our study we show that anthropomorphic input interfaces are experienced as an engaging and efficient means for affective communication, particularly when they are combined with a symmetric output that mirrors the emotions conveyed by the input interface.
Fig. 1. Experimental setup.
2 Anthropomorphic Input/Output Device
We now examine in turn the two components of the interface: the physical tangible input device, and the virtual animated face together with the mapping between gestures and expressions. Three experimental setups have been proposed: two setups in which strongly marked expressions can be generated on an emoticon or on a 3D virtual face, and a third, more attention-demanding experiment in which subtle and flexible expressions of the 3D face are controlled by the interface. At this stage of development, no social interaction is involved in our study, in order to focus first on the usability of the interface and on the ease of control of the virtual agent's expressions.
2.1 Input: Cephalomorphic Tangible Interface
The physical input part of the interface is based on the following constraints:
1. it should be able to capture intuitive gestures of hands and fingers, as if the user were approaching someone's face,
2. direct contacts as well as gestures in the vicinity of the head should be captured, in order to allow for a wide range of subtle inputs through distant interactions,
3. as suggested by the design study of SenToy [3], the shape of the physical interface should not have strongly marked traits that would make it look like a familiar face or that would suggest predefined expressions,
4. the most expressive facial parts of the interface should be easily identifiable without the visual modality in order to allow for contact interaction: eyes, eyebrows, mouth, and chin should have clearly marked shapes.
The first constraint has oriented us towards multi-touch interaction techniques that can detect several simultaneous contacts. Since the second constraint rules out pressure-sensitive capture, which cannot report gestures made without contact, we have chosen a vision-based capture device that is both multi-touch and proximity-sensitive. The interface is equipped with a video camera, and 43 holes are used to detect the positions of the fingers in the vicinity of the face (figure 2). In order to detect the positions and gestures of both hands, symmetric right and left holes play the same role in the mapping between interaction and facial animation.
The holes are chosen among the 84 MPEG-4 key points used for standard facial animation [4]. The points are chosen among the mobile key points of this formalism; for instance, points 10.* and 11.* for ears and hair are ignored. The underlying hypothesis for selecting these points is that, since they correspond to places in the face with high mobility, they also make sensible capture points for animation control.
Fig. 2. Cross section of the physical interface and list of capture holes.
The third constraint has oriented us towards an abstract facial representation that would hardly suggest a known human face. Since we nevertheless wanted the interface to be appealing for contact, caress, or nearby gestures, its aesthetics was a concern. Its design is deliberately soft and non-angular; it is loosely inspired by Mademoiselle Pogany, a series of sculptures by the 20th-century artist Constantin Brancusi (figure 3). The eye and mouth reliefs are prominent enough to be detected by contact with the face (fourth constraint). The size of the device (14 cm high) is similar to that of a joystick, about three times smaller than a human face.
Fig. 3. Overview of the physical interface and bimanual interaction.
All the tests have been done with bare hands and under normal lighting conditions (during daytime with natural light and in the evening with regular office lighting).
2.2 Output: Expressive Smiley or Virtual 3D Face
A straightforward way to provide users with feedback on the use of the interface for affective communication is to associate their interactions with expressions of emotions on an animated face. We have used two types of faces: a smiley and a realistic 3D face with predefined or blended expressions. Of course, other types of correspondences can be established, and we do not claim that the physical interface should be restricted to the control of facial animation. Other mappings are under development, such as the use of the interface for musical composition. As a first step, however, we found it necessary to check that literal associations could work before turning to more elaborate applications.
The association of interactions with facial animations is performed in two steps. First, the video image is captured with the ffmpeg library (http://ffmpeg.mplayerhq.hu/) and transformed into a bitmap of gray pixels. After a calibration phase, the bitmap is analyzed at each frame around each hole by comparing the luminosity at calibration time with its current value. The activation of a capture hole is derived from the ratio between its current luminosity and its luminosity at calibration time. The activation of a zone made of several holes is its highest hole activation.
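As an illustration, here is a minimal Python sketch of this two-level activation computation. It assumes the current and calibration frames are available as 2D grayscale arrays and that hole positions are known in image coordinates; all names are ours, and mapping the luminosity ratio to a 0-1 activation (dark hole = fully activated) is our reading of the description above, not the paper's exact formula.

```python
import numpy as np

def hole_activation(frame, calibration, hole_xy, radius=3):
    """Luminosity-based activation of one capture hole (illustrative sketch).

    The activation is derived from the ratio between the current luminosity
    around the hole and the luminosity recorded at calibration time; here the
    ratio is mapped so that a fully occluded (dark) hole yields 1.0.
    """
    x, y = hole_xy
    current = frame[y - radius:y + radius + 1, x - radius:x + radius + 1].mean()
    reference = calibration[y - radius:y + radius + 1, x - radius:x + radius + 1].mean()
    reference = max(float(reference), 1e-6)  # avoid division by zero
    return float(np.clip(1.0 - current / reference, 0.0, 1.0))

def zone_activation(frame, calibration, zone_holes):
    """Activation of a zone made of several holes = its highest hole activation."""
    return max(hole_activation(frame, calibration, xy) for xy in zone_holes)
```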
In a second step, zone activations are associated with facial expressions. Each expression is a table of key point transformations, a Face Animation Table (FAT) in MPEG-4. The choice of the output expression depends on the rendering mode. In the non-blended mode, the expression associated with the most activated zone is chosen. In the blended mode, a weighted interpolation is made between the expressions associated with each activated zone. Facial animation is implemented in Virtual Choreographer (VirChor, http://virchor.sf.net/), an open-source interactive 3D rendering tool. VirChor stores the predefined FATs, receives expression weights from the video analysis module, and produces the corresponding animations.
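The sketch below illustrates, under the same caveats, how zone weights could drive the two rendering modes: a winner-takes-all choice in the non-blended mode, and a weighted interpolation of FAT key-point displacements in the blended mode. The FAT contents and all names are placeholders; the actual data exchange with VirChor is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical Face Animation Tables: for each basic expression, a displacement
# (dx, dy, dz) per MPEG-4 key point it moves. The values are placeholders only.
FATS: Dict[str, Dict[str, Tuple[float, float, float]]] = {
    "AU2":  {"4.1": (0.0, 0.02, 0.0), "4.2": (0.0, 0.02, 0.0)},
    "AU43": {"3.1": (0.0, -0.01, 0.0), "3.2": (0.0, -0.01, 0.0)},
}

def select_expression(zone_weights: Dict[str, float]) -> str:
    """Non-blended mode: the expression of the most activated zone wins."""
    return max(zone_weights, key=zone_weights.get)

def blend_expressions(zone_weights: Dict[str, float]) -> Dict[str, Tuple[float, float, float]]:
    """Blended mode: weighted interpolation of the key-point displacements
    of every activated expression."""
    blended: Dict[str, List[float]] = {}
    for expr, weight in zone_weights.items():
        if weight <= 0.0:
            continue
        for kp, (dx, dy, dz) in FATS.get(expr, {}).items():
            acc = blended.setdefault(kp, [0.0, 0.0, 0.0])
            acc[0] += weight * dx
            acc[1] += weight * dy
            acc[2] += weight * dz
    return {kp: (v[0], v[1], v[2]) for kp, v in blended.items()}
```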
2.3 Basic and Blended Facial Expressions
The mapping between interactions and expressions relies on a partitioning of the face into six symmetrical zones, shown in the center part of the two images in figure 4. Each zone is associated with a single basic expression, and the level of activation of a zone is the percentage of occlusion of the most occluded key point in this zone. Thus hole occlusion by the fingers is used to control expressions on the virtual faces (smiley or 3D face). All the zones are symmetrical so that right- and left-handed subjects are offered the same possibilities of interaction.
Two sets of 6 basic facial expressions, one for the smiley and one for the 3D face, were designed so that users could identify and reproduce them quickly. For the smiley, the 6 expressions correspond to 5 basic emotions and a non-expressive face with closed eyes: angry face, surprised eyebrows, surprised mouth, happy mouth, sad mouth, and closed eyes (see upper part of figure 4). Only the angry face expression involves both the upper and the lower part of the face.
Each basic expression of the 3D face (lower part of figure 4) is associated with an Action Unit (AU) of Ekman and Friesen's Facial Action Coding System [5]: a contraction of one or several muscles that can be combined with others to describe the expressions of emotions on a human face. Only 6 of the 66 AUs in this system are used; they are chosen so that they have simple and clear correspondences with the expressions of the smiley. The only noticeable difficulty is the correspondence between the angry face smiley, which involves modifications of the upper, lower, and central parts of the face, and the associated 3D expression of AU4 (Brow Lowerer), which only involves the upper part of the face.
The basic 3D face expressions are deliberately associated with AUs rather than with more complex expressions in order to facilitate the recognition of blended expressions in the third task of the experiment. In this task, the users have to guess which basic expressions are involved in the synthesis of complex expressions resulting from the weighted interpolation of AUs. With this design, only a small subset of facial expressions can be produced. They are chosen so that they can be easily distinguished. More subtle expressions could be obtained by increasing the number of zones, either through a larger resin cast with more holes or through this version of the interface with fewer holes in each zone.
The 3D animation of each basic expression is made by displacing the MPEG-4 key points. Since these expressions are restricted to specific parts of the human face, they only involve a small subset of the 84 MPEG-4 key points. For example, the basic expression associated with AU2 (Outer Brow Raiser) is based on the displacement of key points 4.1 to 4.6 (eyebrows), while the expression of AU12 relies on key points 2.2 to 2.9 (inner mouth) and 8.1 to 8.8 (outer mouth).
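For illustration, the key-point groups mentioned above can be written down as the following hypothetical data sketch; the per-point displacement values stored in the actual FATs are not reproduced here.

```python
# Key-point groups moved by two of the basic expressions, as listed in the text.
AU_KEYPOINTS = {
    "AU2":  [f"4.{i}" for i in range(1, 7)],    # Outer Brow Raiser: eyebrows 4.1 to 4.6
    "AU12": [f"2.{i}" for i in range(2, 10)]    # AU12: inner mouth 2.2 to 2.9
            + [f"8.{i}" for i in range(1, 9)],  # and outer mouth 8.1 to 8.8
}
```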
Fig. 4. Mapping between face zones and emoticons or virtual face expressions.
We now turn to the usability study, in which the users were asked to reproduce basic or blended expressions of emotions on a virtual face through interactions with the physical interface.
3 Usability Study: Control of Emoticons or Facial Expressions through Pogany
As in the SenToy design experiment [3], our purpose is to check whether a user can control a virtual character's expressions (here, the face) through a tangible interface that represents the same part of the body. Our usability study is intended to verify that (1) users can quickly recognize facial expressions from a model, and (2) they can reproduce them at various levels of difficulty. Last, we wanted to let the users express themselves about their feelings during the experiment and their relationship to the device.
3.1 Experiment
22 volunteers participated in the experiment: 12 men and 10 women aged between 15 and 58 (average 29.1). Each session lasts between 30 and 50 minutes depending on the time taken by the subject to train and to accomplish each task. Each subject is first introduced by the experimenter to the purpose of the experiment, and then the tasks are explained with the help of the two zone/expression association schemas of figure 4.
The experiment consists of three tasks in which users must use the physical interface to reproduce models of facial expressions. Each task corresponds to a different mapping between users' interactions and facial expressions. In task T1, the visual output is a smiley, and in tasks T2 and T3, the output is a 3D animated face. In task T2, only basic expressions of the virtual face are controlled, while in task T3, blended expressions are produced through the interactions of the user.
The tasks are defined as follows:
1. Task T1: The face zones on the interface are associated with smileys as shown in the upper image of figure 4. If several zones are occluded by the user's hand(s) and finger(s), the zone with the strongest occlusion wins.
2. Task T2: The face zones of the interface are associated with basic expressions as shown in the bottom image of figure 4. The same occlusion rule as in task T1 applies: the most occluded zone wins.
3. Task T3: The face zones are associated with the same expressions as in task T2, but zone activation is now gradual (from 0, inactive, to 1, maximally active) and several zones can be activated simultaneously. Each zone weight is equal to the percentage of occlusion of the most occluded key point. The resulting facial animation is made of blended expressions as described in section 2.3.
Tasks T1 and T2 could easily be implemented through keyboard input. We have chosen a tangible interface as input device because we wanted the user to get used to the interface on the simple tasks T1 and T2 before experiencing the more complex task T3. We also wanted to observe the user and check the usability and the quality of affective communication through this interface even on simple tasks. To draw a simple parallel with other types of interaction: even though firing bullets in an FPS game could also be performed by pressing a keyboard key, most players certainly prefer to use the joystick trigger.
In tasks T1 and T2, the subject faces a screen on which a target expression is shown, and she must reproduce the same expression on the smiley (task T1) or on the virtual face (task T2) by touching the interface on the corresponding zones. 20 expressions are randomly chosen among the 6 basic ones. They are shown in turn and change each time the user holds, for at least 0.5 second, the finger positions that display the target expression. Before each task begins, the subject can practice for as long as she needs, until she feels ready. The experimental setup for task T1 is shown in figure 1 at the beginning of the article. For tasks T2 and T3, the output is a 3D face instead of a smiley.
In task T3, the generated expressions are composed by blending the expressions associated with each zone (weights are computed from zone activations as explained above). The blended expressions are implemented as weighted combinations of elementary expressions, as in [6]. Two such blended expressions, Face11 and Face14, are shown in figure 5. They can be described as 6-float weight vectors over the 6 basic expressions and associated AUs of figure 4: AU2, AU4, AU12, AU15, AU26, and AU43. Each vector coordinate lies in [0, 1]. The vector of Face11 is (1, 0, 0, 0, 0, 1) because it is a combination of AU2 and AU43 fully expressed. Similarly, the vector of Face14 is (0, 0.5, 0.5, 0, 0.5, 0) because it is a combination of AU4, AU12, and AU26 partially expressed.
Fig. 5. Two blended expressions of emotions (Face11 and Face14 of task T3).
The 15 target blended expressions are designed as follows: 6 of them are basic
expressions (weight 1.0), 6 are combinations of 2 basic expressions (weights 1.0
and 1.0, see Face11 in figure 5), and 3 are combinations of 3 partially weighted
expressions (weights 0.5, 0.5 and 0.5, see Face14 in figure 5). Blended expressions
are obtained by pressing simultaneously on several zones in the face. For example,
Face14 is obtained through the semi-activation of three zones: the nose (AU4),
the lips (AU12), and the jaws (AU26).
The difficulty of task T3 comes from the combination of expressions, possibly within the same part of the face, and from the necessity to control several zones of the face simultaneously. In task T3, we let the user tell the experimenter when she is satisfied with an expression before turning to the next one. We use self-evaluation because automatic success detection in this task is more delicate than for T1 and T2, and because we are interested in letting the user report her own evaluation of the quality of her output. Contrary to previous experiments on conversational agents in which the users are asked to tell which expressions they recognize from an animation sequence [7], we evaluate here the capacity to reproduce an expression at a subsymbolic level, without any explicit verbalization (no labelling of the expression is required).
Table 1 summarizes the definition of the tasks and gives the quantities measured: the time to reproduce an expression for the three tasks and, for T3, the error between the expression produced on the virtual face and the target expression.
Table 1. Task definition.

Task | Avatar  | Target              | Measures     | Success detection
T1   | Smiley  | Emoticons           | Time         | Hold target emoticon for 0.5 sec.
T2   | 3D face | Basic expressions   | Time         | Hold target expression for 0.5 sec.
T3   | 3D face | Blended expressions | Time & error | Self-evaluation
Before starting the experiment, the experimenter insists on three points: the subject should feel comfortable and take as much time as she needs to practice before starting, speed is not an issue, and the final, anonymous questionnaire is an important part of the experiment. The subject can orient the interface as she wants: either facing her or the other way round. She is warned that when the interface faces her, the color of her clothes can slightly disturb finger contact recognition because of the video analysis technique.
3.2 Quantitative Results
Tasks T1 and T2: The average time taken by the subjects is very similar for the first two tasks: 7.4 and 7.5 sec. per target expression, with standard deviations of 3.6 and 3.3. Even though the target expressions are defined on very different face models (smiley vs. 3D face), the similar modes of interaction (the winning zone defines the expression) make the two tasks very similar for the subjects. They rate these two tasks as easy: 1.6 and 1.7, with standard deviations of 0.7 and 0.6, on a scale of 1 (easy) to 3 (difficult).
Task T3: The figures are very different for task T3, in which the difficulty comes from the blended combination of expressions. The task requires (1) analyzing an expression and guessing its ingredients, and (2) progressively tuning the weights of each ingredient in order to obtain a resulting expression as close as possible to the proposed target. For T3, the subjects took an average of 26.9 sec. to reproduce each expression, with a high standard deviation of 13.7. Half a minute is however not very long for such a complex task, given the complexity of the input (43 holes, several sites for finger positioning on the physical interface) and of the output (interpolations between basic expressions).
The error between an achieved face and the required target is the sum of the absolute differences between the coordinates of the face produced by the user and those of the target face in the 6-dimensional space of facial expressions. For example, if the user produces a blended face weighted by (0.2, 0.6, 0.5, 0.1, 0.8, 0.0), its distance to Face14 (0.0, 0.5, 0.5, 0.0, 0.5, 0.0) is 0.2 + 0.1 + 0 + 0.1 + 0.3 + 0 = 0.7.
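This error can be computed as in the following minimal sketch; the component order of the weight vectors follows the list of AUs given above (AU2, AU4, AU12, AU15, AU26, AU43), and the example reproduces the distance to Face14 just computed.

```python
# Target vector of Face14 in the order (AU2, AU4, AU12, AU15, AU26, AU43).
FACE14_TARGET = (0.0, 0.5, 0.5, 0.0, 0.5, 0.0)

def expression_error(produced, target):
    """Error of task T3: sum of absolute coordinate differences between the
    produced and the target expression vectors."""
    return sum(abs(p - t) for p, t in zip(produced, target))

# Worked example from the text: distance of the produced face to Face14.
print(round(expression_error((0.2, 0.6, 0.5, 0.1, 0.8, 0.0), FACE14_TARGET), 6))  # 0.7
```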
Surprisingly, the time taken to produce an expression in task T3 does not depend on the composition of the expression (28 sec. for basic expressions, 26 sec. for dual expressions, and 25 sec. for triple expressions). The average error on binary expressions made of two fully weighted expressions such as Face11 (1.79) is higher than for single expressions (1.23) or for partially weighted triple expressions such as Face14 (1.31). This result suggests that mildly blended expressions are easier to reproduce than heavily blended ones. Task T3 has been almost unanimously rated as difficult (2.77 on the 1-3 scale, with only one subject rating it as easy).
Role of Expertise: In the questionnaire, the subjects were asked questions about their level of expertise: average computer use, musical instrument practice, use of 3D devices, and gaming. We now investigate whether expert subjects perform better or faster than non-expert ones.
The two leftmost histograms of figure 6 analyze the average time taken to accomplish tasks T1 and T2 for each level of expertise. Each group of bars is associated with a time range, and each bar is associated with a level of expertise (low, medium, or high). The height of a bar represents the percentage of subjects with this level of expertise who performed the task within this time range. The histograms show that the expert subjects are quicker in the smiley task than the two other categories. In the second task, all the subjects take approximately the same time. This is probably because the time needed to identify a facial expression on the 3D face does not vary with the expertise of the subject, which lengthens the overall recognition and reproduction time for expert subjects and levels out the differences.
Fig. 6. Duration and error as a function of expertise. (Left and center panels: tasks T1 (smiley) and T2 (basic expressions), percentage of subjects per range of average task duration; right panel: task T3 (blended expressions), average error per range of task duration; in each panel, bars are grouped by level of expertise: low, average, high.)
In the histogram for the third task (rightmost histogram in figure 6), the height of a bar represents the error of the blended faces realized by the subjects with respect to the target face. Bars are grouped by duration intervals as for the two other tasks. This histogram shows that expert subjects are not quicker than non-expert ones for task T3, but they achieve better quality: the average error is 1.67 for subjects with low expertise and 1.40 for average expertise, against 1.15 for highly expert subjects. Incidentally, this histogram also shows that the slowest subjects do not perform significantly better than the fastest ones.
3.3 Subjective Evaluation
The questionnaire contained several fields in which the subjects could write extended comments about the experiment. All but one of the subjects rated the interface positively: 3 or 4 on a scale of 1 (very unpleasant) to 4 (very pleasant). Their comments concern the tactile interaction, and the naturalness and softness of the correlation between kinesthetics and animation. They appreciate that the contact is soft and progressive: natural, touch-friendly, reactive, and simple are among the words used by the subjects to describe the positive aspects of the interface. Users also appreciate the emotional transfer to the virtual avatars, which makes the smiley and the face more "human". Some of the subjects have even talked to the interface during T3, rather kindly, saying for instance: Come on! Close your eyes.
The questionnaire also asked the subjects whether the physical face conveyed expressions of emotions and whether they would have preferred another shape. They describe the face as expressionless, calm, placid, quiet, passive, neutral... and have positive comments about the aesthetics of the interface. One of them finds that it looks like an ancient divinity. This confirms our hypothesis that a neutral face is appreciated by the users for this type of interface.
Some users feel uncomfortable with the interface because it requires a tactile engagement. But for the subjects who accept to enter into such an "intimate" experience with the interface, the impression can become very pleasant, as reported by a 43-year-old male subject: The contact of fingers on a face is a particular gesture that we neither often nor easily make. [. . . ] Luckily, this uncomfortable impression does not last very long. After a few trials, you feel like a sculptor working with clay. . .
Negative comments concern the difficulty of accurately controlling the output of the system because of undesirable shadows cast on neighboring zones, and the small size of the tactile head, which makes positioning the fingers on the capture holes difficult. Some subjects have nevertheless noticed the benefit of the video-based capture by using distant hand positions for mild and blended expressions. Criticism also concerns the limits of the experimental setup: some users would have liked to go a step further in modifying the facial expressions and to control an output device as innovative as the input interface.
To sum up, the negative comments mainly concern technical limitations of the interface that should be overcome by using gestures (sequences of hole occlusions) instead of static contacts. The positive comments mostly concern the prospects for affective communication opened up by this new type of interface.
4 Conclusion and Future Developments
The usability evaluation reported here shows that a head-shaped interface can be used successfully by expert and non-expert subjects for affective expression. Comments in the questionnaire show that, for most of the users, this type of interaction is a very positive and pleasant experience.
Our future work on this interface will follow three complementary directions. At the technical level, Hidden Markov Models are currently being implemented so that gestures can be recognized in addition to static interactions. Since tactile communication relies on a wide palette of caresses and contacts, it is necessary to capture the pressure, speed, and direction of gestures. At the application level, we intend to design new experiments in which more than one subject will be involved, in order to study the communicative properties of this interface in situations of intimate or social relationships. Last, we will improve the quality of facial rendering to generate more realistic expressions [8].
5 Acknowledgement
Many thanks to Clarisse Beau, Vincent Bourdin, Laurent Pointal and Sébastien Rieublanc (LIMSI-CNRS) for their help in the design of the interface; Jean-Noël Montagné (Centre de Ressources Art Sensitif), Francis Bras, and Sandrine Chiri (Interface Z) for their help on sensitive interfaces; Catherine Pelachaud (Univ. Paris 8) for her help on ECAs and for her detailed comments on a draft version of this article. This work is supported by the LIMSI-CNRS Talking Head action coordinated by Jean-Claude Martin.
References
1. Sundström, P., Ståhl, A., Höök, K.: In situ informants: exploring an emotional mobile messaging system in their everyday practice. Int. J. Hum.-Comput. Stud. 65(4) (2007) 388–403. Special issue of IJHCS on Evaluating Affective Interfaces.
2. Hinckley, K., Pausch, R., Goble, J.C., Kassell, N.F.: Passive real-world interface props for neurosurgical visualization. In: CHI '94: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, ACM Press (1994) 452–458
3. Paiva, A., Andersson, G., Höök, K., Mourão, D., Costa, M., Martinho, C.: SenToy in FantasyA: Designing an affective sympathetic interface to a computer game. Personal Ubiquitous Comput. 6(5-6) (2002) 378–389
4. Ostermann, J.: Face animation in MPEG-4. In Pandzic, I.S., Forchheimer, R., eds.: MPEG-4 Facial Animation. Wiley, Chichester, UK (2002) 17–55
5. Ekman, P., Friesen, W.V.: Facial Action Coding System: A technique for the measurement of facial movement. Consulting Psychologists Press, Palo Alto, CA, USA (1978)
6. Tsapatsoulis, N., Raouzaiou, A., Kollias, S., Cowie, R., Douglas-Cowie, E.: Emotion recognition and synthesis based on MPEG-4 FAPs. In Pandzic, I.S., Forchheimer, R., eds.: MPEG-4 Facial Animation. Wiley, Chichester, UK (2002) 141–167
7. Pelachaud, C.: Multimodal expressive embodied conversational agents. In: MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia, New York, NY, USA, ACM Press (2005) 683–689
8. Albrecht, I.: Faces and Hands: Modeling and Animating Anatomical and Photorealistic Models with Regard to the Communicative Competence of Virtual Humans. Ph.D. thesis, Universität des Saarlandes (2005)