COGNITIVE SCIENCE 17, 589-622 (1993)

Machine Interpretation of Emotion: Design of a Memory-Based Expert System for Interpreting Facial Expressions in Terms of Signaled Emotions

GARRETT D. KEARNEY
SATI MCKENZIE
University of Greenwich
As a first step in involving user emotion in human-computer interaction, a memory-based expert system (JANUS; Kearney, 1991) was designed to interpret facial expression in terms of the signaled emotion. Anticipating that a VDU-mounted camera will eventually supply face parameters automatically, JANUS now accepts manually made measurements of face geometry on a digitized full-face photograph and returns emotion labels used by college students. An intermediate representation in terms of face actions (e.g., mouth open) is also used. Production rules convert the geometry into these. A dynamic memory (Kolodner, 1984; Schank, 1982) interprets the face actions in terms of emotion labels. The memory is dynamic in the sense that new emotion labels can be learned with experience. A prototype system has been implemented on a Sun 2020 using POPLOG. Validation studies of the prototype suggest that the interpretations achieved are generally consistent with those of college students without formal instruction in emotion signals.
We thank Professor M. Bramer, presently with the University of Portsmouth, and Geoffrey D. A. Sullivan of the University of Reading for several useful discussions, and the reviewers of this article for their valuable criticism. Garrett Kearney acknowledges financial support from the Science and Engineering Research Council.

Correspondence and requests for reprints should be sent to Garrett Kearney, Department of Computing and Information Technology, University of Greenwich, Wellington Street, Woolwich, London, SE18 6PF, England.

¹ At an advanced stage in this project, we came across references to other systems called JANUS: Day (1987) conceived a hybrid system of neural networks and a production system concerned with integrating automatic and controlled problem solving; and Fischer, Lemke, Mastaglio, and Morch (1991) described an integration of hypertext with a knowledge-based design environment. There is also the CAO software package for electrotechnical systems (Colombani, Sabonnadiere, Auriol, & Pardo-Gibson, 1988), the Sydney University Library researchers' facility (Brodie, 1989), the decision support system (Raghavan & Chand, 1989), and the BBN & ISI NL system (Hinrichs, 1988). None of these have any bearing on the research reported in this article.

1. INTRODUCTION

JANUS¹ is a memory-based expert system capable of interpreting facial expressions in terms of the emotions signaled. It was developed as an experiment in making computers sensitive to the body language of users. The possibility of using nonverbal communication as a means of human-computer
interaction has attracted some attention recently and several systems have
been reported (Mase, Suenaga, & Akimoto, 1987; Sheehy, 1989). The problem of recognition and recall of facial features is also of interest to psychologists and raises a number of fundamental questions relating to the structure,
organization, and functioning of human memory (see Bruce, 1988, for a
critical overview; also, among others, Baddeley, 1979; Bower, Gilligan, &
Monteiro, 1981; Bower & Karlin, 1974; Patterson & Baddeley, 1977; Strnad
& Meuller, 1977; Wells & Hryciw, 1984; Winograd, 1976); bearing on the
way faces are perceived (Courtois & Mueller, 1979; Ellis, Jeeves, Newcombe,
& Young, 1986; Galper & Hochberg, 1971; Jensen, 1986; Sergent, 1984),
and the importance of context effects (Bower & Karlin, 1974; Watkins, Ho,
& Tulving, 1976).
Although there has been a wealth of research over the past century in
specifying the facial actions signaling emotions, the problem of how these
are represented in memory and the strategies enabling their recognition and
recall have received less attention than the related question of face recognition. Despite the considerable theorizing linking the role of emotions to
goals, motives, and plans in humans (Izard, 1971; Izard & Tomkins, cited in
Izard, 1971; Oatley & Johnson-Laird, 1985; Sloman, 1986), only Sloman
and Croucher (1981a, 1981b) appeared to accept that robots, too, will have
emotions. So, also, with the obverse: There have been few attempts to equip
computers with the means of recognizing and acting upon the signaled emotions of their users. Sheehy (1989) planned to detect a user’s eyebrow lift in
surprise as a telling communication in computer-user dialogue.
As a first step in computer recognition of user emotion, the JANUS system converts face geometry into static face action format and classifies an
expression by matching it to the typical expressions of six universal emotions.
Use of the word “static” makes explicit that only the end state of the movement is measured in comparison to the neutral position, and not the movement itself. Atypicahties are further labeled by analogy to expressions on
which the system has already been trained. The output is one or more emotion labels. Such labels were acquired from college students without formal
training in face perception. The direction of future work will attempt to make
these meaningful in the context of the goals pursued in the user-machine
interaction.
JANUS lacks a vision “front end” and does not attempt a solution to the
automatic measurement of face emotion parameters but is designed to accept
a facial description from a human source and return an emotion label. The
input description may be geometric (coordinate positions of 34 selected
landmarks currently obtained from manual measurements on a digitized
full-face photograph) or syntactic (a list of verbal face actions, e.g., "mouth open," "nose flared"). The geometric description, if used, is converted into syntactic form prior to interpretation. The conversion is done by a rule base.

[Figure 1. Basic components of JANUS: geometric description or face actions as input; interpret and learn modes; emotion-label interpretation as output.]
The interpretation is in the form of an emotion label, such as “happy” or
“angry,” and is accomplished by a dynamic memory based on Schank’s
(1982) memory organization packets and his theory of reminding and learning and Kolodner’s (1984) computer implementation. In addition to offering
interpretations by analogy to those accompanying similar expressions experienced in the past, JANUS is capable of learning new emotion labels and
associated face actions, thereby increasing its expertise with use. This allows
memory to be trained before use in accordance with the intended purpose.
The basic components of JANUS are shown in Figure 1.
JANUS differs from conventional expert systems in incorporating a
dynamic memory. The advantage of memory-based systems is that, like
human beings, they develop their expertise through experience. They also
offer the possibility of successfully tackling a problem at a more generalized
level if no specific rules apply. Human beings do this when faced with new
situations (Schank, 1984).
Validation and evaluation studies play an important role in the development of expert systems. Validation studies on JANUS have been aimed at
testing both the rule-base and the dynamic memory components. Both the
interpretation and learning functions have been considered. The conclusions
of JANUS were compared with those of human “lay experts” (i.e., without
formal training in emotion signals) drawn from college personnel. An additional gold standard used to assess the capability of these personnel was
provided by the descriptions given in Ekman and Friesen (1976b, 1984).
[Figure 2. Facial "landmarks."]
Both informal qualitative assessments and quantitative comparisons using
standard statistical techniques were carried out. The results of these studies
appear to support the claim that JANUS performs at least as well as the lay
experts. The level of expertise also appears acceptable, though more extensive
field trials will be necessary to confirm this. The design and operation of the
basic components of JANUS, namely, the rule base and the dynamic memory
are discussed in Sections 2 and 3. Section 4 covers the validation studies
on JANUS. A discussion of related theoretical issues forms the subject of
Section 5. Our general conclusions are presented in Section 6.
2. THE RULE BASE
The rule base performs the task of converting the geometric description of a
face into a list of static “face actions.” The geometric description consists
of the positions of 34 “landmarks” (Figure 2) measured manually on images
with respect to the tip of the nose. These measurements are made on a digitized full-face photograph. The measured distances are normalized to take
into account differences in size and scale. Thus, horizontal distances are
divided by the distance between the inner angles of the eyes and vertical
distances are divided by the length of the nose. The verbal description is in
the form of a list of feature-action pairs such as "eyes-wide" or "nose-flared." Actions have been defined for six features, namely, brow, eyes,
nose, mouth, cheeks, and jaw. A full list of face actions is given in Table 1.
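For illustration, the normalization step described above can be written as a minimal Python sketch. This is not the original POP-11 implementation; the landmark names and the data layout are assumptions made for the example.

# Illustrative sketch of the normalization described in the text (Python, not the
# original POP-11 code). Landmark names and the dictionary layout are hypothetical;
# the paper specifies only that horizontal distances are divided by the inter-ocular
# distance and vertical distances by the length of the nose.

def normalize_landmarks(landmarks):
    """landmarks: dict mapping a landmark name to an (x, y) position measured in
    pixels relative to the tip of the nose on a digitized full-face photograph."""
    # Horizontal unit: distance between the inner angles of the two eyes.
    x_unit = abs(landmarks["right_eye_inner_angle"][0] -
                 landmarks["left_eye_inner_angle"][0])
    # Vertical unit: length of the nose (bridge to tip).
    y_unit = abs(landmarks["nose_bridge"][1] - landmarks["nose_tip"][1])
    return {name: (x / x_unit, y / y_unit) for name, (x, y) in landmarks.items()}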
The choice of features and actions was influenced by the work of Ekman
and Friesen (1984) who published a comprehensive account of such facial
cues and associated emotions together with illustrative photographs. The
landmarks were chosen with regard to their potential in registering change
in the position of these features. The representation adopted is more in line
with their earlier Facial Affect Scoring Technique (FAST; Ekman, Friesen,
& Tomkins, 1971), not their more precise anatomically based Facial Action
Code approach (FACS; Ekman & Friesen, 1976a, 1978), which is, in terms
of muscle “action units,” capable of distinguishing among all visible facial
behavior and free of any theoretical bias about the possible meaning of
facial behaviors. Their analysis of the muscles underlying all facial movements allowed them to define 44 action units in terms of which they attained
their target. These have been used, among other uses, to define the groupings
underlying the expressions of facial emotion, which can be done very precisely. Training is required to correctly define face expressions in terms of their causative muscle actions; "experienced" scorers in Ekman, Friesen, and O'Sullivan's (1988) report had from 1 to 4 years' experience. In Ekman,
Davidson, and Friesen (1990), the scorers had more than 1 year of experience
using FACS, and their reliability had proven satisfactory against a standard.
However, the earlier syntactic approach was considered more suitable
both for training memory by lay experts and as a first attempt at a computer
facility that draws on the expertise of everyday college students. Admittedly,
this decision arose because of the conditions specific to the project. There
were no resources to train subjects in the FACS technique and the paradigm
followed was that fundamental to many expert systems. In these computer
systems the expert knowledge, which is applied when the expert in a domain
tackles a representative set of problems, is extracted from his or her self-report. Another expert, this time in knowledge engineering, traditionally
represents the expert’s reasoning in the form of a rule base, and creates an
inference engine to access the pertinent knowledge for an input query. This
is the paradigm for JANUS. Assuming that most adults have amassed some
expertise in “reading” faces according to their experience, I interviewed
groups of college students, showing face photographs and recording their
interpretations and reasons. No subject gave reasons other than in descriptive
terms. None were aware of the scientific research in the domain. In all cases
their reasons were in terms used in FAST. Ekman and Friesen (1984) supplied
the typical expressions and descriptions relevant to the templates with which
such acquired knowledge could be compared. For these reasons JANUS is
more influenced by FAST. It must not, however, be thought that a similar
system could not be constructed using FACS if the experts could be trained in the technique. Other workers have utilized FACS to good effect, for example, Mase (1991). JANUS allows all input face actions to be stored. Memory is organized around typicalities and anomalies, with the result that its interpretations are very much a function of its experience in training. In a solely interpretative use, for example, when fed by a video camera (not implemented at present), its output would draw on its experience. A total of 38 rules were defined. Most of these (26) are independent of other face actions (apart from comparison with a neutral face of the same person), whereas the rest (12) use the context of other actions on the same face. For instance, the rule for "brows contracted" is context-free and uses only the distance between the inner ends of the eyebrows. Tension in the lower eyelid, however, cannot be defined directly and is inferred from a combination of "lower eyelid raised," "cheeks not raised," and the "mouth turned down." An example rule is given in Figure 3.

Table 1: Face actions used in JANUS

Brows: raised, lowered, contracted, centre raised
Eyes: wide, narrowly open, upper lid raised, upper lid lowered, upper lid tensed, inner lid raised, lower lid raised, lower lid lowered, lower lid tensed
Nose: screwed up, nostrils flared
Mouth: open, slightly open, widely open, square, compressed, pulled, up, down, upper lip raised, lower lip raised, lower lip lowered, upper lip tensed, lower lip tensed, upper lip everted, lower lip everted, teeth bared
Cheeks: raised, vertical nose-mouth grooves
Jaw: drop

define eyes-l-lid-raised (mug) -> i;
    ;;; comment: the difference between the y-values of the internal angle of the
    ;;; right eye (eria(2)) and the lower lid below the pupil (erll(2)) is less
    ;;; than that of the neutral face: nerll(2), neria(2).
    ;;; The value of "i" is true or false.
    ( (erll(2) - eria(2)) < (nerll(2) - neria(2)) ) -> i;
enddefine;

Natural language equivalent:
If the vertical distance between the centre of the lower lid margin and the level of the inner angle of the right eye is less than that of the neutral face,
Then the lower eyelid is "raised".

Figure 3. An example rule in POP11 and the natural language equivalent.
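Figure 3 shows the POP-11 form of one such rule. The two kinds of rule described above can also be sketched in Python (illustrative only; the measurement names and the way the derived actions are represented are assumptions, not the system's code):

# Sketch of a context-free rule and a context-dependent rule (Python, illustrative).
# "face" and "neutral" are assumed to be dicts of normalized measurements for the
# current and neutral face; "actions" is the list of face actions already derived.

def brows_contracted(face, neutral):
    # Context-free: uses only the distance between the inner ends of the eyebrows,
    # compared with the same distance on the neutral face of the same person.
    return face["inner_brow_gap"] < neutral["inner_brow_gap"]

def lower_eyelid_tensed(actions):
    # Context-dependent: inferred from a combination of other face actions on the
    # same face, as described in the text.
    return ("lower eyelid raised" in actions
            and "cheeks raised" not in actions
            and "mouth down" in actions)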
3. THE DYNAMIC MEMORY
The dynamic memory performs two functions. In interpret mode, it accepts
a syntactic description of the face and returns the appropriate emotion label.
In learn mode, it accepts both a syntactic description and the attributed
emotion label and adds them to its repertoire for future use. How it performs
these functions will be discussed in this section.
The organization of the dynamic memory brings together Schank's (1982)
theorizing on the functional organization of human autobiographical memories, Kolodner’s (1984) computer representations of some of these conceptual structures in the design of a fact retrieval system, and the theory of
universal face expressions of emotion (e.g., Ekman & Friesen, 1971; Ekman,
Sorenson, & Friesen, 1969).
Schank's (1982) model of the human memory is an attempt to explain
how memories of autobiographical social events are stored, organized, and
remembered. For one event to remind one spontaneously of another, both
must be represented within the same dynamic chunking memory structure,
which organizes episodes according to their contextual or thematic similarities and differences. Both must be indexed by a similar explanatory theme
that has sufficient salience in the individual’s experience to have merited
such atypical indexing in the past. His memory organization packets (MOPs)
organize sets of scenes (general actions) related by a shared goal. These
organize memories in terms of typicalities and atypicalities.
Schank’s (1982) ideas were applied to the domain of diplomatic events by
Kolodner (1984). She developed a system (Cyrus) that organized on-line
news reports concerning the diplomatic activities of two U.S. Secretaries of
State within an incrementally self-organizing event-content-addressable
computer memory. This allowed fact retrieval and elaboration of incomplete
events, making use of generalizations formed from prior input. Some of
her ideas and representations have been used in the design of JANUS, for
example, representing the typical generalities in the frame of the MOP and
indexing differences below the frame by their atypical features and the
methods of promotion and demotion of these.
In JANUS the events are the micro-events of a set of co-existing static
face actions displayed in the service of the goal of communicating the emotion of the person. The dynamic memory is initially endowed with six basic
expression pools housed in the frames of six FACE-MOPs.
Each is composed
of a set of face actions from which the different typical expressions of a basic
emotion (e.g., happiness, sadness, anger, disgust, fear, and surprise) can be
composed. The choice of these was influenced by Ekman and Friesen’s
(1971; Ekman, Sorenson, & Friesen, 1969) theory of face actions of those
emotions, which are universal. The associated face actions were adapted
from Ekman and Friesen (1984) with such modifications as were necessitated
by design constraints, which were discussed in Section 2. A list of the basic emotions and face actions is given in Table 2.

Table 2: Face actions associated with basic emotions

Happy: corners of the mouth are raised, mouth is open at all, mouth is widely open, the teeth are showing, cheeks are raised making a fullness below the eyes, lower eyelids are upturned.

Sad: medial ends of the brows are drawn closer to one another, medial aspect of brows are raised straightening the usual downcurve, brows are raised inwards towards the mid-line producing brow corrugations mostly in centre of forehead, lower eyelids are raised, inner parts of the eyelids are raised higher than the outer parts, corners of the mouth are down-turned, upper eyelids are lowered.

Angry: brows are lowered onto the eyes, medial ends of the brows are drawn closer to one another, eyes narrowed, inner parts of the eyelids are raised higher than the outer parts, lower eyelids are raised, lower eyelids are tensed, upper eyelids are lowered, upper eyelids are tensed, nostrils are flared, lips are compressed together, lower lips are tensed, upper lips are tensed, mouth is open at all, teeth are showing, mouth is widely open, mouth is widely open and assumes a squarish aperture, grooves between the wings of nose and mouth corners are deep and more vertical.

Afraid: both brows raised corrugating the forehead all across, medial aspect of brows are raised straightening the usual downcurve, medial ends of the brows are drawn closer to one another, lower eyelids are tensed, eyes widely open, upper eyelids are raised, lower eyelids are raised, inner parts of the eyelids are raised higher than the outer parts, mouth is widely open, mouth is open at all, mouth pulled back widening it horizontally, upper lips are tensed, teeth are showing.

Disgusted: brows are lowered onto the eyes, lower eyelids are raised, eyes narrowed, nose screwed up producing transverse wrinkles at the bridge, lower lip is turned out, teeth are showing, upper lip is turned out, upper lip raised, lower lip is lowered, nostrils are flared, lower lips raised, grooves between the wings of nose and mouth corners are deep and more vertical, mouth open at all, cheeks are raised making a fullness below the eyes, lips are compressed together, lower lip is tensed, upper lip is tensed.

Surprised: both brows raised corrugating the forehead all across, eyes widely open, upper eyelids raised, lower eyelids are lowered, mouth is open at all, mouth is slightly open, mouth widely open, jaw is dropped.
JANUS organizes facial expressions of emotion. Each of the six FACE-MOPs is essentially a tree with typical universal expressions stored at the root
(frame) and related, but atypical, face actions forming subtrees below
the frame. Any recurring expression is channeled down the tree until it reaches
an identical event previously encountered. This results in “reminding”
whereby the emotion attributed to the previous expression is made available.
Expressions that have not been encountered before are automatically incorporated into new branches of the tree. Frequently occurring events are recognized as being “typical” and the memory is restructured to reflect this.
The computer organization of JANUS's dynamic memory is a tree of nodes and links (represented as recordtypes). The nodes contain in their information field a variety of input components (a single face-feature: "brows"; a binary face action: "brows raised"; or an input event identification: "ev0"); or they may reference by name an object that may contain: (a) typical face-actions of an emotion (FACE-MOP content frame); (b) a typical face-feature abstracted from experience (fsub-MOP); (c) a typical face-action abstracted from experience (sub-MOP); or (d) in the case of leaf nodes: identifiers of complete input events. An input event is a list of syntactic face-action pairs and perhaps an interpretation. These objects (a-d) are also represented as instances of Poplog's Flavor Object Classes. Details of (a-d) such as lists of face actions, related interpretations, and references to other objects are stored in these data structures. Attached procedures (demons) are used to carry out the dynamic restructuring of memory to accommodate new input events. The links may be of three types (a feature, an action, or an "event"), depending on the type of object pointed to. Each link is composed of two items: type of link and node pointer. The initial state of the tree is shown in Figure 4, where some of the links have been omitted for clarity.

[Figure 4. Basic tree of pre-defined action components; only a few links are shown.]
The root node (m0) is linked by six "feature" links to the first-level nodes (m1-m6). These are in turn connected by action links to Level 2 nodes (m7-m12), each of which contains the typical face-action pool of one of the six basic emotions in the content frames. New events are incorporated as shown in Figure 5. An event, ev0, related to m12 but differing from it in two respects ("mouth pulled," "cheeks raised") is entered. The differences are indexed below m12 and two branches are created, each with:

feature - <feature> - action - <feature-action> - event - <ev0>

where <...> represents nodes in the tree. All subtrees below second-rank nodes (FACE-MOP frames) have this sequence. The same event is indexed twice (at nodes m15 and m18), and could be accessed (remembered) if either of the two actions occurs in a subsequent event traversing the same path.
[Figure 5. A new event, ev0, differing in two actions from those in m12, is indexed below m12; each new branch indexes ev0.]
[Figure 6a. Identical events differing from the typical FACE-MOP in "cheeks raised" are indexed below m12.]
[Figure 6b. Formation of the fsub-MOP ("cheeks") and sub-MOP ("cheeks raised").]
[Figure 6c. Promotion of the sub-MOP.]

The dynamic reorganization is further illustrated in Figures 6a, 6b, and 6c. Identical events, ev1 and ev2, which differ from the typical (FACE-MOP) in having "cheeks raised," are indexed below that node (Figure 6a). The two are then collapsed into a single branch (sub-MOP) in Figure 6b. After six occurrences of the same event (an arbitrary number adopted from Kolodner, 1984), JANUS decides that this is a "typical" situation and promotes the action "cheeks raised" to FACE-MOP level and indexes the events directly off that node (Figure 6c).
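The indexing and promotion behaviour just described can be summarized in a simplified Python sketch. The real system uses POP-11 recordtypes and flavour objects with attached demons; the class below, its names, and its treatment of promotion per atypical action are illustrative assumptions, with only the threshold of six taken from the text.

# Simplified sketch of a FACE-MOP node: a frame of typical actions, atypical
# actions indexed as sub-branches, and promotion into the frame after repeated
# occurrences (six, following Kolodner, 1984). Python, illustrative only.

PROMOTION_THRESHOLD = 6

class FaceMop:
    def __init__(self, emotion, frame_actions):
        self.emotion = emotion
        self.frame = set(frame_actions)   # typical face-action pool
        self.branches = {}                # atypical action -> list of learned events

    def learn(self, actions, label):
        event = (frozenset(actions), label)
        atypical = set(actions) - self.frame
        if not atypical:
            # No atypical actions: index the event directly off the frame.
            self.branches.setdefault(None, []).append(event)
            return
        for action in atypical:
            # Each atypical action indexes the same event (multiple references).
            events = self.branches.setdefault(action, [])
            events.append(event)
            if len(events) >= PROMOTION_THRESHOLD:
                # A frequently recurring difference becomes "typical": promote it
                # into the frame and index its events directly off this node.
                # (A simplification of the "six occurrences of the same event"
                # rule described in the text.)
                self.frame.add(action)
                self.branches.setdefault(None, []).extend(events)
                del self.branches[action]

    def remind(self, actions):
        # Return learned interpretations reached via any atypical input action.
        labels = []
        for action in set(actions) - self.frame:
            for _ev_actions, label in self.branches.get(action, []):
                labels.append(label)
        return labels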
Table 3: Heuristic rules for FACE-MOP selection

Selection depends on specific face actions in the input face expression event, viz.:

IF       eyes lower-lid-lowered
   and   brows raised
THEN     MOP_surprised_gen_Face

ELSE IF  eyes lower-lid-tensed
   and   eyes lower-lid-raised
   and   brows raised
THEN     MOP_afraid_gen_Face

ELSE IF  mouth up
   and   not (nose screwed)
THEN     MOP_happy_gen_Face

ELSE IF  brows lowered
   and   brows contracted
   and   eyes inlid-raised
   and   (mouth compressed or mouth wide)
   and   not (mouth upper-lip-raised)
THEN     MOP_angry_gen_Face

ELSE IF  (mouth upper-lip-raised and mouth upper-lip-tensed and
             (mouth lower-lip-raised or mouth lower-lip-lowered))
   or    (nose screwed and cheeks raised and eyes lower-lid-raised and brows lowered)
   or    (nose screwed and cheeks raised and eyes lower-lid-raised and mouth upper-lip-raised)
   or    (mouth upper-lip-raised and mouth lower-lip-everted and cheeks raised and nose screwed)
   or    (mouth upper-lip-raised and (mouth lower-lip-raised or mouth lower-lip-lowered)
             and (nose screwed or cheeks raised))
THEN     MOP_disgusted_gen_Face

ELSE IF  (brows centre-raised and eyes inlid-raised)
   or    mouth down
   or    (brows centre-raised and eyes lower-lid-raised)
THEN     MOP_sad_gen_Face
A face-action list entered for interpretation is first assigned to one of the six basic emotions. The face actions in the frame of a FACE-MOP are not definitional. Together, they are a pool drawn from typical expressions of that emotion. The face actions of an input event are matched against each of these pools to decide automatically the one that includes most of it. The largest ratio of matched face actions in the input to the number of face actions in each of the six FACE-MOP pools decides the issue. If a tie results, a heuristic of salient features (see Table 3) is applied. If still tied, both competing emotions are output on the assumption that the expression shows both emotions. If all the input actions are consumed by the chosen emotion, then that emotion label is returned. If some of the input actions are atypical,
the subtree is traversed with these in search of similar learned events and, if any are found, the corresponding leaf interpretations along with the FACE-MOP emotion are returned. If none are found, the FACE-MOP emotion is returned.
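A compact Python sketch of this selection step follows (illustrative only; the names pools and salience_heuristic stand in for the FACE-MOP frame pools and the Table 3 rules, and are not part of the published code):

# Sketch of FACE-MOP selection: the pool with the largest quotient (matched input
# actions / pool size) wins; ties fall back to the Table 3 heuristic, and a remaining
# tie returns both emotions. Python, illustrative.

def select_face_mop(input_actions, pools, salience_heuristic):
    input_actions = set(input_actions)
    quotients = {emotion: len(input_actions & set(pool)) / len(pool)
                 for emotion, pool in pools.items()}
    best = max(quotients.values())
    tied = [e for e, q in quotients.items() if q == best]
    if len(tied) > 1:
        tied = salience_heuristic(input_actions, tied) or tied
    return tied   # one emotion, or several if the tie could not be broken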
When a face-action list is entered to be learned, it is accompanied by an
interpretation and the complete event is listed in a leaf. If no atypical feature
actions are present, this leaf is indexed directly off the FACE-MOP frame,
but if there are atypical face actions, subtrees are traversed or forged with
each of these as before, but any atypical actions not already present in the
tree are now included as separate branches and a new event leaf node is
added. A new instance of the event object class is also created. Thus, an
input event may be referenced several times within the tree, once for each
atypical face action that is present.
4. VALIDATION
Validation studies on JANUS addressed the question of whether its interpretations were acceptable to human beings judging the same photographs.
Other than that describing the basic emotions, the knowledge JANUS acquires
rests on the discriminations made by human beings to series of face expressions that have not been systematically validated. In many cases the judges’
competence was tested using “gold standards” of systematically validated
expressions obtained from two sources. Ekman and Friesen (1984) described
in detail the face actions associated with basic emotion classes. They also
published a set of validated photographic slides (Ekman & Friesen, 1976b)
of faces exhibiting these basic emotions. The set includes neutral faces for
comparison. With the permission of the publishers, a selection of these slides
were digitized and used for purposes of validation.
College personnel and other lay experts were presented with the same
photographs and asked to identify face actions and emotional state. These
were then compared with the interpretations JANUS obtained from the same
faces by passing the 34 face coordinates of each digitized image through the
rule base, thereby deriving the set of face actions to be entered into memory.
Lay experts were also used to teach JANUS about new emotional states in
order to test the learning function. Human experts also played other roles as
judges or arbiters in deciding how they rated the interpretations of JANUS
against those of other humans in a blind comparison.
4.1 Quantitative Validation of the Rule Base
The aim here was to obtain a precise estimate of the measure of agreement
between the conclusions of JANUS and those of human beings. The rule
base was tested using four experts (A-D; different from those used for the preliminary investigation) and 17 photographs. The questionnaire for eliciting face actions was divided into six sections corresponding to the six features: brows, eyes (7), nose (4), mouth (14), cheeks (2), and jaw (1). The numbers in parentheses represent the number of face actions for each feature. The number of agreements and disagreements for each feature over 17 non-gold-standard, unvalidated faces of six basic emotions in varied intensity posed by one of the authors were computed for all possible pairs involving the experts A-D and JANUS (J) and tested for significance using the chi-square test. The scores for pairs without JANUS were also tested for significance using the chi-square test. The comparisons for "brows" took the form shown in Table 4 and were of two classes: with and without JANUS. The results are given in Table 4a.

[Table 4 / Table 4a: With- and without-JANUS pairwise comparisons over six features (χ² = 4.72; d.f. = 9; p > 0.95). 17 photos x 5 choices per photo = 85 possible agreements (A) and disagreements (85 - A) per judge. Key: A = agree, D-A = disagree, exp. = expected frequencies.]
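One way to reproduce this kind of pairwise comparison is sketched below in Python; the exact contingency layout used in the study is not recoverable from the published tables, so the arrangement here (judge pairs by agree/disagree counts, tested with scipy) is an assumption.

# Sketch of a chi-square test over judge-pair agreement counts for one feature
# (e.g., "brows"). Each pair contributes (agreements, disagreements) out of the
# 85 possible per-photo choices. Illustrative; requires scipy.

from itertools import combinations
from scipy.stats import chi2_contingency

def pairwise_agreement_chi2(judgments):
    """judgments: dict judge -> list of 85 booleans (face action judged present)."""
    table = []
    for a, b in combinations(judgments, 2):
        agree = sum(x == y for x, y in zip(judgments[a], judgments[b]))
        table.append([agree, len(judgments[a]) - agree])
    chi2, p, dof, _expected = chi2_contingency(table)
    return chi2, p, dof

With five judges there are ten pairs, so a 10 x 2 table of this kind has 9 degrees of freedom, which is consistent with the d.f. = 9 shown above.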
It is apparent that the near-significant result involving the eyes is not part
of a general trend in the direction predicted by the alternative hypothesis. A
breakdown of the disagreements in the eye section from the point of view of
JANUS incriminates, in particular, judgments of “eyes narrowly open,”
“upper eyelid raised,” and “lower eyelid raised” in this order as producing
most dissent. All of these reached full agreement on some faces but produced dissent on certain others (Numbers 1, 4, 5, and 12). Differences of 1 pixel can be decisive, so that
clear delineation of the face points is essential. Definition is not good in the eye region because of the natural shadows cast therein. Considering the quality of the images available, JANUS's agreement, or lack of it, might be considered as a lower bound on the potential of this approach using state-of-the-art imaging techniques.

[Table 5: With- and without-JANUS pairwise comparisons over six features, as described for Table 4a; χ² values include 9.32, 14.12, 7.27, 24.07, 6.03, and 11.13.]
Although the rule base is thus in fair agreement with lay experts' judgments of face actions present, it would be gratifying to find that both agreed
when standard photographs were used. It was decided, therefore, to test the
rule base using expressions that had been well validated as showing basic
emotions. The pictures used for this purpose were taken from Ekman and
Friesen (1976b; Pictures of Facial Affect, PFA): 84 (happy), 91 (disgusted), 90 (surprised), 92 (neutral), 41 (neutral), 38 (angry), and 37 (afraid). No conscious bias dictated this choice except that the expressions seemed very well
defined. Such photographs would undoubtedly depict the typical face actions
for these emotions although these are, unfortunately, not detailed for each
photograph in the published material. Agreement as to the depicted emotion
of these expressions was very high among the human judges who made the
original validation of this published source.
The test of the rule base proceeded along the same lines as before, namely,
comparing human judgments of these photographs with those produced by
the rule base. The experts in this case were five clinical psychologists. The
results, enabling a with- and without-JANUS comparison, are displayed in
Table 5. The results do not reveal significant differences in the six feature
areas although the comparison of the mouth with JANUS approaches significance (.05). The without JANUS counterpart is a little less (.12). The
mouth area is therefore judged with some disparity and the striking characteristic of JANUS’s performance in this respect is that fewer face actions in
this area are judged present in many cases.
Comparing JANUS’s performance with human judges is not enough. It
is necessary to compare the human judges’ performance with a validated
standard. A better test of the rule base might then be a comparison of the
results of the rule base on the five preceding photos with descriptions of
very similar expressions of the same models, pictured and described in some
detail in "Unmasking the Face" (UTF; Ekman & Friesen, 1984). [PFA (Ekman & Friesen, 1976b) is a collection of face transparencies without details of face actions; UTF is a book tutor that describes photographed expressions in detail.] Some of the models in the two sources are the same.
The results for the five PFA faces are shown in Table 6: JANUS performed
slightly better overall than the human judges as reflected in the total face
actions in agreement with the UTF descriptions, but varied from face to
face as was true of the human judges.
4.2 Validation of the Dynamic Memory
This involves testing that the FACE-MOP basic emotions are accessed correctly by their component face actions when these are used as input to the
system and, also, that the basic emotions output by an untrained JANUS
in response to a test set of digitized images do not, as a block, differ significantly from those adjudged by human judges to be present. A blind study,
in which the emotional labels output by a trained JANUS were analyzed
with regard to their acceptability to human observers, was made.
4.2.1 Interpretation of Basic Emotion Category: A Qualitative Study.
JANUS has “given” knowledge about what face actions may be commonly
associated with each basic emotion. These expressions are held in the frame
of the FACE-MOP and, because there can be many typical expressions for
a basic emotion, the knowledge is represented as a pool of face actions to
which a particular user-input list of face actions can be compared. The
number of matches in each of the six FACE-MOP pools divided by the number of face actions in the respective pool gives in each case a quotient; the greatest
of these is used to select the FACE-MOP under which the input will be classified. To validate this function, that is, whether the “correct” FACE-MOP is
selected, the frame face actions of the basic emotion under investigation
were input in increasing random combinations starting with singles, then
pairs, then triples, then quadruples, and so on. Because of the potential
combinatorial explosion, just seven random combinations were tried at
each level. The level at which 100% success for these seven was achieved
was used as an estimate of the sensitivity of the ability of frame face actions
to select the “right” Face-Mop. In Ekman and Friesen (1984) the separate
combinations of face actions and appearances typifying each basic emotion
are described, but what is being validated here is that a purely numerical
measure (a quotient: input matches/tally of pool) will select that pool rather
than any other FACE-MOP pool. The results suggest that an input of a single
given face action varies in its ability to access the FACE-MOP under test.
Singles from the “surprised” pool are exceptional (all correct). Pairs and
triples are all correct also. Thus, one of the correct emotions (“surprised”)
achieved 100% hit rate (over seven consecutive randomly selected inputs) at
the one-grouping level, “happy” and “sad,” at the two-groupings level,
“afraid” and “disgusted,” at the four-grouping level, and “angry” at the
five-grouping level. Bearing in mind that the number of face actions in the
associated pool was 8, 9, 7, 14, 17, & 17 respectively, we concluded that the
sensitivity of the FACE-MOP selection function was acceptable. The learning
capability was investigated to ensure that new input face actions and emotion
labels were learned and correctly retrieved in subsequent interpretations.
Two experts were asked to view six photographs, one for each basic category
and supply lists of face actions together with their own interpretations.
These were entered into JANUS. Subsequent input of the same face actions
in retrieve mode, that is, without an accompanying emotion, did retrieve the
correct interpretations. Thus, the learning function appeared satisfactory.
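The combination-sampling procedure used above to estimate the sensitivity of FACE-MOP selection can be written down compactly. The sketch below (Python, illustrative) reuses the hypothetical select_face_mop function from the earlier sketch and the seven-samples-per-level figure from the text.

# Sketch of the sensitivity test: for increasing combination sizes k, draw seven
# random k-subsets of an emotion's frame pool and record the smallest k at which
# all seven select the intended FACE-MOP. Illustrative only.

import random

def sensitivity_level(emotion, pools, salience_heuristic, trials=7):
    pool = list(pools[emotion])
    for k in range(1, len(pool) + 1):
        hits = 0
        for _ in range(trials):
            sample = random.sample(pool, k)
            if select_face_mop(sample, pools, salience_heuristic) == [emotion]:
                hits += 1
        if hits == trials:
            return k          # level at which 100% success was achieved
    return None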
4.2.2 Validation of the Basic Emotion Output by JANUS by Comparison
with Human Judges: A Quantitative Study. A more stringent test would be
whether JANUS returns the same basic emotions as do human experts when
presented with an arbitrary set of face photographs. Four college computer
staff (Experts A-D), aged 21 (male), 24 (female), 38 (male) and 38 (female)
years, without formal training in facial expressions were presented with 17
full-face black and white A6-sized photographs of one of the authors posing
various intensities of each of the emotions: happy, sad, angry, disgusted,
afraid, and surprised, and asked to select, in each case, from the emotion
terms, “happy,” “sad,” “disgusted,” “afraid,” “angry,” and “surprised,”
the term that described the emotion signaled. The face actions obtained by
passing the geometric descriptions of these same photographs through the
rule base were input to JANUS and the returned emotion was noted. All
attributions are presented in Table 7.
Photograph 2 was used as the neutral expression for comparison and was
omitted from the analysis. In order to assess the capability of these judges,
the four judges, in a separate trial, interpreted 24 PFA pictures randomly
selected over the six basic emotion labels. Their performance in this task
(79.2, 95.8, 91.6, and 91.6% correct compared to the published emotion labels for the same faces) indicates that they are in no way a deviant group when it comes to judging faces. However, there is clearly a spread of agreement among the five sources in the table, and statistical tests were applied to test for the significance of these differences overall.

[Table 7: Interpretation of basic emotion category. For each of the 17 photographs, the emotion label (happy, sad, angry, disgusted, afraid, or surprised) attributed by JANUS and by Experts A-D.]
The kappa statistic (Cohen, 1960, 1968) was used to test for agreement among the five raters on the results in Table 7, with the following results: κ = .467, var(κ) = .0013, Z = 13.13. This value of Z exceeds the .01% significance level and we concluded that the five raters, including JANUS, exhibit significant agreement. Additional cases gave the following results:
With JANUS omitted, κ = .45, var(κ) = .0093, Z = 9.46;
With Expert A omitted, κ = .45, var(κ) = .00212, Z = 9.83;
With Expert B omitted, κ = .44, var(κ) = .00254, Z = 8.8;
With Expert C omitted, κ = .46, var(κ) = .00228, Z = 9.63;
With Expert D omitted, κ = .469, var(κ) = .00139, Z = 12.56;
With Experts C and D omitted, κ = .59, var(κ) = .00043, Z = 9.1.
The κ values suggest a moderate agreement in all of these cases, and this agreement does not vary to the extent that would suggest any particular expert was markedly deviant. All these Z values exceed the .01 significance level (Z = 2.32). This would argue for rejecting the hypothesis that the
agreement is due to chance, suggesting, instead, significant agreement between
the judgments.
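For reference, a common multi-rater generalization of kappa (Fleiss' formulation) can be computed as sketched below. The paper cites Cohen (1960, 1968), and the exact variant and the variance used to obtain the Z values above are not reproduced here, so this is an approximation rather than the authors' own computation.

# Sketch of a multi-rater kappa (Fleiss' generalization), Python, illustrative.
# counts[i][j] = number of raters assigning photo i to emotion category j;
# every photo is rated by the same number of raters n (5 in this study).

def multi_rater_kappa(counts):
    n = sum(counts[0])                      # raters per photo
    N = len(counts)                         # number of photos
    # Per-photo agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (N * n) for t in totals]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)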
A better test of the ratings of JANUS vis-a-vis the human experts may
be obtained by using the Williams (1976) “In” statistic on the results of
Table 7. This test is specially designed to compare the joint agreement of
several raters (human experts) with another rater (say, JANUS). For purposes
of this discussion, we merely state that a statistic “In” (where n is the number
of reference raters, not including JANUS) can be derived by:
In = P0/Pn

where P0 represents the overall agreement of the isolated rater with the reference raters, and Pn represents the overall group agreement among raters 1 to n.
The results are given in Table 8.
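Under the definition just given, the statistic can be sketched in Python as follows. Reading "overall agreement" as the average pairwise proportion of identical labels is an assumption about a detail the text leaves open, and the upper-bound calculation reported in Table 8 is not reproduced.

# Sketch of the Williams (1976) In statistic: In = P0 / Pn. Illustrative Python.

from itertools import combinations

def proportion_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def williams_In(test_ratings, reference_ratings):
    """test_ratings: list of labels; reference_ratings: list of such lists (n raters)."""
    n = len(reference_ratings)
    P0 = sum(proportion_agreement(test_ratings, r) for r in reference_ratings) / n
    Pn = (sum(proportion_agreement(a, b)
              for a, b in combinations(reference_ratings, 2))
          / (n * (n - 1) / 2))
    return P0 / Pn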
Five sets of calculations were done, with JANUS and the four human experts being selected in turn for scrutiny. Two additional cases were considered:
1. The ratings of JANUS were all replaced by the fixed emotion “disgusted,” and
2. The ratings of JANUS were all replaced by randomly selected emotions.
These contrived situations were used to assess the sensitivity of the test.
In addition to calculating “In” for each case, it is important to be able to
estimate the error limits for the “In.” Following Williams (1976), upper
bounds were calculated for the population "In" at the .05 significance level. The results are presented in Table 8. A value of "In" close to 1 would suggest that the ratings of the test judge are as consistent with those of the reference judges as the ratings of the latter are mutually consistent. An upper bound of 1 or more would confirm this at the .05 confidence level. An "In" significantly less than 1 (and an upper bound of less than 1 at the chosen confidence level) would imply that the ratings of the test judge are not consistent with those of the reference judges. The results suggest that there is practically no difference between the joint ratings of the test judge and the reference judges. Thus, Experts A-D and JANUS agree, although Expert C is slightly anomalous. This is in marked contrast with the last two (contrived) cases, where JANUS (with tailored ratings) is clearly inconsistent with the human experts.

[Table 8: I4 comparisons in the test cases of Section 4.2: I4 and its upper bound at the 5% significance level, computed with JANUS and each of Experts A-D in turn as the focused expert, and for the two contrived cases (fixed "disgusted" and random responses from JANUS).]
Another approach is to use “meta-judges” to rate the interpretations in
Table 7 in a blind comparison. The meta-judges (aged 30-35 years) had no
formal training in recognizing facial expressions. Two were brothers who
had lived apart many years and one was married to the third rater so these
as a group could have developed some perceptual commonality. The photographs that they were to judge were those upon which the judgments of
Table 7 were made. They were not aware that one of the sources of those
judgments was a computer. As an index of their prowess in such a task, they
scored, respectively, 40 (80%), 42 (84%), and 45 (90%) out of 50 (100%) in agreement with a gold standard set of expressions (PFA; Ekman & Friesen, 1976b).
It was felt that their ratings of the judgments in Table 7 would have credibility. Each meta-judge was asked to indicate whether each of the lay experts’
interpretations was (a) good, (b) fair, or (c) poor. These were compared using
the Friedman (1937) ANOVA (analysis of variance) test. This is appropriate
where the same group of subjects is studied under different treatments and
the outcomes are to be compared. In this case the outcomes are the number
of “a” and “b” grades given to the interpretations of Table 7 by meta-judges
observing the same set of photographs. We wished to find out whether or
not the grades obtained by the five differed significantly. The data are prepared as a two-way table of five columns and 17 rows, in which each row
contains the rank positions across the row of the number of "a"s and "b"s accredited to each expert (including JANUS). A column tabulates these over 17 face photographs. The test statistic, χr², is distributed approximately as the chi-square with degrees of freedom equal to the number of columns minus 1. A value equal to or greater than that at the .05 level of significance (9.49) implies that the hypothesis that all the samples came from the same population may be rejected. The results of this test argue for accepting the five samples as coming from the same population (χr² = 2.5694, df = 4, p > .5). The ratings are shown in Table 9.
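The test itself is routine; the sketch below shows how the comparison could be run with scipy's implementation, assuming the per-photo grade counts from Table 9 have been tabulated (the variable names are illustrative).

# Sketch of the Friedman test over the per-photo grade counts, using scipy.
# grade_counts would map each rater (JANUS and Experts A-D) to a list of 17
# per-photo scores (number of "a" and "b" grades from the meta-judges).

from scipy.stats import friedmanchisquare

def friedman_over_raters(grade_counts):
    statistic, p_value = friedmanchisquare(*grade_counts.values())
    return statistic, p_value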
[Table 9: Three meta-judges' ratings of the interpretations of the 17 photographs posed by one of the authors. The rating (r) shown for each lay expert (including JANUS) is the better of the first and second (if any) interpretations, compared with the meta-judge's own interpretation of the same photograph; ratings are a (good), b (fair), or c (poor).]

[Table 10: Interpretations of JANUS after training: for each photo, the basic emotion label and the alternative learned labels, such as "Cheerful, anticipating pleasure"; "Puzzled, fearful, uncomprehending"; "Receptive to argument, interested"; "Depressed, unhappy, having distaste"; and "Disliking, displeased".]

Although it is desirable that posed expressions should be recognizable for the intended emotion, it is possible that we do not always signal our true
feelings, or present them in an idiosyncratic way. The accuracy problem (whether the face expression is recognized as signaling the intended emotion) cannot be addressed because we do not know what emotion was felt, but we do know what was desired to be signaled, and the model's success in signaling this can be calculated for the photographs on which Table 7 rests. The intended
emotions can be compared with the interpretations made by the experts and JANUS to see whether the intended message got through. The number of judgments in accord with the intended emotion over the 17 non-gold-standard photographs was as follows: JANUS (13), Expert A (12), Expert B (14), Expert C (8), and Expert D (11).
4.2.3 Validation of the Learning and Recall Functions. The learning and
recall functions of the dynamic memory were tackled next. A set of face
photographs different from those used in previous tests was presented to a
group of 30 people and a total of 50 event descriptions (face actions and
nonstandard emotion labels) were obtained from them. These were entered
into JANUS in the learn mode. This resulted in an experienced (“trained”)
memory. The question addressed was: How acceptable are these labels when
output again in response to a face description input in interpret mode? The
following procedure was designed to discover this. Geometric descriptions
of the 17 photographs used in the “untrained”
validation (but not in the
training session) were converted to face actions by the rule base and input to
the dynamic memory. The interpretations returned are given in Table 10.
Table 11: Ratings of learned and basic emotion labels

rating    Learned emotion (no.)   %        Basic emotion label (no.)   %
good      103                     30.66     78                         52.35
fair      117                     34.82     47                         31.54
poor      116                     34.52     24                         16.11
total     336                     100.0    149                         100.0
There are several learned emotions but only one basic emotion
Fifty-five independent judges (untrained people and college students)
were asked to judge the same faces but their judgments were discarded.
They were told that “other people” had interpreted them as showing this or
that emotion and were asked to rate these as good, fair, or poor. Each judge
rated up to three photographs each, both for the basic emotion category
and the new “learned” emotion labels, yielding 149 basic and 336 learned
ratings. The results are given in Table 11. The results show a clear preference
for the basic (Face-Mop) emotion label. This is in line with the view of
emotions from a prototype perspective rather than as classical concepts
(Fehr & Russell, 1984) if we view the FACE-MOP labels as the more typical
"core" and the learned labels as the fuzzier perimeter. This is a different analysis from that shown in Table 11.
A practical validation of JANUS’s interpretative power is seen if the choice
given to the user is addressed directly. Remembering that the FACE-MOP
basic emotion is output as well as the learned interpretations for each face
description entered in search of an interpretation, validation may proceed
with reference to the highest grade obtained per face, “given” or “learned”
regardless. This analysis showed that 94% of the interpretations were approved
in some measure: good, 105 (70.5%); fair, 35 (23.5%); poor, 9 (6%).
An attempt was made to assess the meta-judges’ prowess in interpreting
such photographs by asking them to interpret 24 gold standard PFAs (Ekman
& Friesen, 1976b). The 24 pictures were randomly selected within the six
basic emotions, and each had been well validated by Ekman and Friesen
(1976) for the emotion signaled. Unfortunately,
there was a considerable
drop-out in this endeavor with 39 of 55 (70.9%) being ultimately tested. The
task was to judge which basic emotion-happy,
sad, angry, afraid, surprised,
or disgusted-each picture depicted. Their responses were compared with
the published validation of these pictures. On average, the 39 meta-judges scored 20.5 correct out of 24 (85.42%; range = 13-24). Their accuracy appears
satisfactory in comparison with this published classification.
4.3 Discussion of System Validation
The basis of the validation was comparison of JANUS with humans studying
the same face photographs either directly or through the agency of metajudges. The criterion underlying this approach is that one cannot expect
JANUS to agree with the humans to a greater extent than the latter agree
among themselves. The validation is only as sound as the capabilities of the human lay experts, and although, a priori, we assume that any adult is an expert at identifying emotions, it turns out that the consensus of the adults used is only moderate. However, as a criterial policy, this is
still not enough: It is necessary to have some measure of the humans’ prowess
in “reading” faces. That the particular quartets of lay experts were somewhat varied in both their interpretations of the face features present, and
emotion signaled in judging face photographs, may suggest that these are
not tasks on which there is close agreement between humans generally, or
that it happened to be a discordant group. The problem is that there is no
normative test of capability for recognizing individual face actions.
One measure available for objectively assessing prowess in detecting the
emotion signaled was the gold standard set of transparencies: PFA (Ekman
& Friesen, 1976b). Each transparency features a face, on which the expression
has been classified as characteristic of one of the basic emotions (in contradistinction to the other basic emotions) by a high level of consensus of observers
in the original validation of the set. The consensus classification for each
transparency is taken to be the “correct” basic emotion for validating
JANUS. Where possible, the judges involved in the validation of JANUS
were asked to judge the emotion signaled by a randomly chosen set of these
transparencies in order to give some inkling as to their capabilities.
In relation to this gold standard, JANUS and the meta-judges perform
well enough, but what is an acceptable passing mark? The 4 interpreters of
the basic emotions depicted in the 17 photographs (posed by one of the
authors on which JANUS was validated) scored 79.2%, 95.8%, 91.6%, and 91.6% correct on 24 PFA transparencies that had produced, in the canonical validation, an average consensus agreement of 91.21% for the emotion depicted in each case (range = 71-100%); the 3 persons who meta-judged the interpretations of JANUS achieved 80%, 84%, and 90% PFA "correct" interpretations over 50 transparencies (with an average canonical consensus of 91.78%); and 38 persons meta-judging the aptness of the JANUS learned interpretations averaged 85.42% correct over 24 transparencies (with an average canonical consensus of 91.21%). If
one accepts that this degree of capability on the gold standard commands
respect, one will have faith in their assessment of the system output emotion
(trained and basic emotions together): Only 7 judges of 55 judging the JANUS
learned interpretations decided a face was not acceptably described either
by the basic or the learned interpretations. Their adverse judgments involved
5 different faces of the total shown.
There is no comparable gold standard for recognizing face actions, but
use was made of UTF (Ekman & Friesen, 1984) as reported in Table 6. Of 29
face actions described in detail in UTF with comparable expressions on the
same model in PFA, JANUS derived 26 from geometrical measures. The
validation of the rule base showed that JANUS often disagreed with the
human experts as to which eye actions were present. This was particularly evident in assessing the degree of eye opening. JANUS decides this on arbitrary numerical intervals, and more work is needed if these are to grasp human
perceptual distinctions. The relatively greater rise of the inner part of the
upper eyelid also was a feature that caused disagreement. Although indicated
plainly by a diagram on the validation questionnaire, the comparison with
the neutral face seemed to have been overlooked by the experts. There are
two aspects to JANUS that would be improved upon in a field model; because
both affect validation, they may, with advantage, be mentioned here. First,
a well-validated set of face expressions should form the basis on which the
measurements are undertaken for refining the rule base and for validating
system function. Acquiring models, training them, producing images with
good definition and without shadow and procuring representative samples
of people to judge these is a major project. With the constraints in time,
equipment, and budget available to the JANUS project, we would consider
the results achieved to be a lower bound on what can be achieved in this
methodology with ample resources. Second, there are trained experts in the
facial recognition of emotional expressions of the face, but they are few and
far between. A definitive system could only gain in interpretative capability
by drawing its expert knowledge from such sources. The problem with using
everyday people as knowledge sources is the uncertainty about their capability.
5. DISCUSSION
JANUS brings together two diverse psychological theories: Schank’s (1982)
theory of reminding and the Ekman and Friesen (1976b, 1984) explicit theory
of which face expressions signal which basic emotions. The product is an
emotion retrieval system that learns from its experience of input faceexpression events and applies this learning, as an individual view of its
world, to subsequent input. What has been achieved is a small research
prototype that produces an emotional label or a choice of several of these
from an input description of a facial expression. Because the intended applications of this technological approach would involve data extractable
from video camera frames, the input face description is in the form of the x- and y-coordinates of 34 standard face locations.
Several assumptions are fundamental to the approach, and are discussed
briefly.
• An event approach has been adopted: face expressions of emotion are
treated as autobiographical events, although JANUS uses events from
many people. Events are constrained to a list of face actions with or
without an emotion label. This is a very constrained type of event with
context limited to the accompanying face actions in the expression.
• It is assumed that the FACE-MOP frame pool forms part of the context-free autobiographical knowledge (cf. Conway & Bekerian, 1987a) about
emotions, which is updated by abstraction from the input over time.
However, the co-existence of a set of these face actions overlapping on
a face at the one time implies a mutual context.
• Because there are only six FACE-MOPS and the input is constrained only
by the necessity of having one face action from the union of these
pools, there is the implicit assumption that all emotions that can be
registered on a face can be classified under these six basic emotion
Face-Mops that compete for the input event.
• It is not clear whether JANUS models human neurological behavior. It
is not at all certain that all facial expressions and, by implication, all
emotions able to be displayed on the face can be classified under the six
basic emotions. It is not intuitive to us that we have abstractive conceptual structures for typical face actions, or that atypicalities from them
are abstracted from experience and that perception is organized around
them. Does one smile that does not reach the eyes remind one of
another? Faced with an expression, do alternative emotion labels come
to mind?
• Oatley and Johnson-Laird (1985), Sloman (1986), and Sloman and
Croucher (1981a, 1981b) emphasized the crucial role that emotions may
play in intelligent systems. Thus, within a changing world, they may act
as essential interrupts in a system with multiple motives but limited
resources or, released at particular junctures of multigoal planning
sequences, act as global communicators and coordinators maintaining
transition modes and preparing for action by focusing attention on certain goals. The emphasis in JANUS is rather the obverse function of
giving computers an awareness of the emotion that the user might be
feeling so that inferences may be drawn about the motivations implied
by them. Perceived emotions, however, are open to more than one interpretation. Human expressions should be of some communicatory
value to computers and robots but their interpretation in terms of
motives and plans would require severely restricted contexts and evidence from other sources. To rely on facial expression alone to communicate junctures of plans would produce only generalities. Speech
is precise, but the face may convey discordant information that casts
doubt on the veracity of the words, for example, in sarcasm. Humans
make use of multiple channels of communication. Body language is one
channel only. Petajan (1985) combined acoustic recognition with automatic lip-reading and found that the latter always improves the recognition rate of digits, letters, and words compared to acoustic recognition alone. Happily, within human-computer interaction the user can be prompted to confirm the body language verbally or in terms of keystrokes in response to screen queries.
• JANUS was conceived as having a potential role to play within human-computer interaction: as a step in the direction of making the computer
more sensitive to the cognitive states of the user. Frames grabbed from
a video camera scanning the user would be processed by an automatic
feature-finding algorithm [not implemented, but Bromley (1977, cited
in Laughery, Rhodes, & Batten, 1981), Craw, Ellis, & Lishman (1987),
Petajan (1985), and Sakai, Nagao, & Kanade (1972), among others, have
done work along these lines] into the required x- and y-coordinates which,
input to JANUS in turn, would produce the emotion. The emotion, along
with other information, would be used to infer the user's motives and plans.
These are interpreted within the context of the user’s coding goals and
plans that have already been communicated to the dialogue coordinator’s
user model and may prompt automated messages to screen requesting
further amplification. Useful information from other monitoring sources
(e.g., natural language input, keyboard posture, and lip-reading) needs
to be coordinated in the dialogue. For a system that learns its expertise
incrementally from the same user with whom it has to interact in daily
use, the user’s repertoire of facial expressions and validated associations could represent a much more informed personal knowledge source
when the present expression is interpreted in reference to events that
caused it in the past. In order to focus exclusively on the coding task,
the context needs to be constrained so that the cognitions associated with
the facial expression are interpreted in terms of the coding alone, to the
exclusion of incidental causes (e.g., indigestion). But this is for
the future. Without such constraints, the cognitive associations can only
be indicated in far more general terms (e.g., see Roseman, 1982).
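A minimal sketch of this envisaged monitoring loop follows; every component except the central interpretation step was unimplemented at the time of writing, so all of the functions below are stubs with invented names and dummy data.

```python
# Placeholder pipeline for the envisaged use; the components other than the
# JANUS interpretation step did not exist, so each stage here is a stub.

def grab_frame():
    """Stand-in for a frame grabbed from the VDU-mounted camera."""
    return "frame-0001"

def find_face_locations(frame):
    """Stand-in for automatic feature finding: the 34 (x, y) landmarks."""
    return {"upper_lip_bottom": (50.0, 88.0), "lower_lip_top": (50.0, 96.0)}

def janus_interpret(landmarks):
    """Stand-in for JANUS proper: geometry -> face actions -> emotion label(s)."""
    return ["surprise"]

def update_user_model(emotions, dialogue_state):
    """Combine the perceived emotion with the user's stated goals and plans;
    if the inference matters to the dialogue, prompt the user to confirm on screen."""
    dialogue_state["last_perceived_emotion"] = emotions
    if emotions and emotions != dialogue_state.get("expected_emotion"):
        dialogue_state["prompt"] = "You look surprised - shall I explain that message?"
    return dialogue_state

state = {"expected_emotion": ["neutral"]}
print(update_user_model(janus_interpret(find_face_locations(grab_frame())), state))
```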
In practice, none of these steps would be without great problems. Automatic feature measurement algorithms are far from perfect and have not, to
our knowledge, been proved capable of detecting the fine distances required
for JANUS’s needs. Such algorithms, if developed, would need to be hardwired to effect real-time processing. Contour tracing requiring processing
of frames buffered in memory is too time consuming (Petajan, 1985). The
need for real-time video analysis can be relaxed and measurement can be
carried out on random or sampled grabbed frames. Much of the behavioral
measurement on people is carried out in this way. The measurement is indirect insofar as it is made by observers or some technical device (Wallbott,
1980 discussed the various techniques and their advantages and disadvantages). Digital time codes are required for computer analysis (Ekman,
Friesen, & Taussig, 1969). One problem with this approach for the use
referred to before is not knowing in what stage of the expression the grab
has occurred, and conversely, how to locate the beginning frame of a movement. Expressions have an onset, a peak, and a decline, and serial frames in
succession would be required to differentiate these. The problem would be
compounded for blends.
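The point about onset, peak, and decline can be made in a few lines of Python; the per-frame intensity scores are hypothetical, since no such measure is defined in this article.

```python
# Illustrative only: a hypothetical per-frame "expression intensity" score is
# assumed; this article defines no such measure.

def label_phases(intensity):
    """Label each frame in a sequence as onset, peak, or decline."""
    peak_index = intensity.index(max(intensity))
    labels = []
    for i, _ in enumerate(intensity):
        if i < peak_index:
            labels.append("onset")
        elif i == peak_index:
            labels.append("peak")
        else:
            labels.append("decline")
    return labels

# A single grabbed frame (say the fourth) cannot be assigned a phase without
# its neighbours, which is the difficulty noted above.
print(label_phases([0.1, 0.4, 0.9, 1.0, 0.7, 0.3]))
```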
In view of these difficulties, direct measurement might appear a useful
alternative, at least in experimental situations. Johansson (1973) attached
small lights in strategic positions on the body and filmed the subject in
motion without any other source of light. Observers viewed the film, which
depicted moving spots of light, and interpreted them correctly. Bassili
(1975) used a related technique on the face to show that emotional expressions could be interpreted. There are several techniques described by
Mitchelson (1975) for measuring body motion in real time automatically by
fixing miniature radiation emitters or transducers to parts in motion. These
include sources of infrared and polarized light. It would seem probable that
a safe, unobtrusive technique will be forthcoming for facial measurement.
Even so, this would be no solution for real-world automatic monitoring of expressions.
Our methodology may be criticized on grounds that a connectionist approach would provide a higher recognition rate. We have not come across
any published evidence supporting such a claim. The single-layered,
Perceptron-like Wisard (Aleksander & Burnett, 1983; Aleksander, Thomas,
& Bowden, 1984; Stonham, 1986) can distinguish smiles from frowns, but it is
uncertain whether it would be able to generalize this learning to all comers
over all the basic emotions. Although faces have been used as patterns in
connectionist network models (Kohonen, 1977; Kohonen, Oja, & Lehtio,
1981; McClelland & Rumelhart, 1985) and the emergent properties of such
networks can simulate an approximation of the function ascribed to face
recognition units (see Bruce, 1988), it remains uncertain how they would
perform in this domain.
MOPS were chosen to represent JANUS's memory because of their
crucial role in Schank's (1982) theory of dynamic memory and reminding.
There is no separate concept of “working memory” in the system. No new
FACE-MOPS are formed in the course of classifying input, but the six
FACE-MOPS are not static structures in memory; they are dynamic, constantly
monitored, and liable to change in content. In the course of the system's
use, “ad hoc categories” (see Barsalou, 1990) are formed within the
organization of the FACE-MOPS. In JANUS, these are called sub-MOPS.
Sub-MOPS are provisional categories on formation because they may be the
result of some temporary regularity in the environment, which is not maintained over ensuing experience. FACE-MOPS characteristically group
together, in their content frames, face actions with a shared goal: that of
signaling an emotion at the one time. The conception of MOPS as dynamic
memory structures in human memory has its proponents and critics, but MOPS
deserve and receive consideration in explaining experimental findings (cf. Conway & Bekerian's A-MOPS, 1987b). They are useful as knowledge representation structures in knowledge-based systems (cf. Kolodner's
E-MOPS, 1984; Lebowitz’s Spec-Mops, 1980). JANUS demonstrates their
use in this respect at a much lower level of specialized knowledge than
usually met with. MOPS, as originally conceived, have social, personal, and
physical aspects, and these aspects are distinguishable in FACE-MOPS.
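A compressed sketch of this organization is given below; the class layout and the grouping rule (indexing events by their exact set of face actions) are simplifications for illustration, not the abstraction mechanism JANUS actually implements.

```python
# Simplified sketch of a FACE-MOP with sub-MOPs formed as "ad hoc categories";
# the grouping rule used here (exact repetition of an action set) is an
# illustrative assumption, not the mechanism JANUS actually uses.

class FaceMop:
    def __init__(self, emotion):
        self.emotion = emotion          # one of the six basic emotions
        self.content = set()            # face actions characteristically grouped here
        self.sub_mops = {}              # provisional categories keyed by action set

    def add_event(self, face_actions):
        """Index an input face-expression event and update the MOP's content."""
        key = frozenset(face_actions)
        self.sub_mops.setdefault(key, 0)
        self.sub_mops[key] += 1         # a regularity may or may not persist
        self.content |= key             # dynamic content, liable to change

happiness = FaceMop("happiness")
happiness.add_event(["lip corners raised", "cheeks raised"])
happiness.add_event(["lip corners raised", "mouth open"])
print(happiness.content, len(happiness.sub_mops))
```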
There are a number of ways in which the scope of JANUS could be improved:
• JANUS might be fooled by false and masked emotions because one of
the discriminators between these is not measured. Smiling is not a
unitary class of behavior (Ekman et al., 1988; Ekman et al., 1990;
Ekman & Friesen, 1984). One would have to represent the action of orbicularis oculi pars lateralis (which both raises the cheek, tightens the
peri-orbital ring muscle and draws in the periocular skin) by some linear
vertical distance. Although there is a heuristic for raised cheeks, the
dropping of skin under the brow (which is a telling sign of genuinely
happy eyes) has a fullness that cannot be conveyed by a linear measure.
One would expect more sophisticated systems to make use of brightness
intensity data analysis to supplement distances in representing fullness.
As a heuristic, the distance from below the eyebrow to the upper eyelid
might be considered, but this will vary also with movements of the
brow, say in “surprise,” and so lacks specificity (see the sketch following this list).
• A further limitation is evident in the exclusion from JANUS of face
signals that control, emphasize, punctuate, and give shades of meaning
to speech. These would need to be allowed for, depending on the use to
which they are put.
• Intensity of emotion has not been implemented. The capability to rate
the intensity of facial actions would be desirable in JANUS. However,
the level of precision that could be achieved in measuring distances on
digitized photographs was not sufficient to put this into effect.
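The brow-to-eyelid heuristic mentioned in the first point above, and its lack of specificity, can be illustrated with hypothetical landmark values (the coordinates are invented; image y is taken to increase downward):

```python
# Hypothetical landmarks; the point is only that the brow-to-lid distance
# shrinks both when the skin under the brow fills toward the lid and when
# the brow itself moves, so the measure lacks specificity on its own.

def brow_to_lid(under_brow_y, upper_lid_y):
    """Vertical distance from just below the eyebrow to the upper eyelid."""
    return upper_lid_y - under_brow_y

neutral      = brow_to_lid(under_brow_y=100.0, upper_lid_y=112.0)   # 12.0
felt_smile   = brow_to_lid(under_brow_y=103.0, upper_lid_y=112.0)   # 9.0: skin under the brow drops toward the lid
brow_lowered = brow_to_lid(under_brow_y=104.0, upper_lid_y=112.0)   # 8.0: a brow action alone gives a similar value

print(neutral, felt_smile, brow_lowered)
```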
6. CONCLUSIONS
The methodology described is, on the whole, capable of mapping face
geometry to emotion labels, although the correspondence with some human
judgments may be less than perfect and the rule base may benefit from
further refinement, perhaps from a statistical analysis of the most efficient
parameters for representing a face action (Pilowski, Thornton, & Stokes,
1985, 1986), and might also be improved to take account of blends and intensities of emotion. It
is envisaged that such a system could form the basis of a perceptual front
end, providing input to the computer’s user model. It is possible that the
classificatory and learning tasks required to monitor human facial expressions will be easier with a connectionist approach. We are not aware of any
system that perfects this at present, but we are looking at the possibility. In
this task we seek a macrostructural approximation to model such mappings.
REFERENCES
Aleksander, I., & Burnett. P. (1983). Reinventing man. London: Kogan Page.
Aleksander, I., Thomas, W.V., & Bowden, P.A. (1984). Wisard: A radical step forward in
image recognition. Sensor Review, 4, 120-124.
Baddeley, A. (1979). Applied cognitive and cognitive applied psychology: The case of face
recognition. In L. Nilsson (Ed.), Perspectives on memory research. Hillsdale, NJ:
Erlbaum.
Barsalou, L.W. (1990). Are there static category representations in long-term memory? Behavioral and Brain Sciences, 9, 6X-652.
Bassili, J.N. (1975). Facial motion in the perception of faces and of emotional expression.
Journal of Experimental Psychology: Human Perception and Performance, 4, 373-379.
Bower, G., Gilligan, S., & Monteiro, K. (1981). Selectivity of learning caused by affective
states. Journal of Experimental Psychology: General, 110, 451-473.
Bower, G., & Karlin, M. (1974). Depth of processing pictures of faces and recognition
memory. Journal of Experimental Psychology, 103, 751-757.
Brodie, M. (1989). Making it work: An overview of the Janus Project. LASIE, 19(5), 104-112.
Bruce, V. (1988). Recognising faces. London: Erlbaum.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Colombani, D., Sabonnadiere, E., Auriol, P., & Pardo-Gibson, O. (1988). Janus: A CAO
software package for calculating the electromagnetic susceptibility of industrial electrotechnical systems. Actes du Colloque International ‘Les RFI et EMI en Electronique de
Puissance’, 71-78. Paris, France: Electronique de Puissance.
Conway, M.A., & Bekerian, D.A. (1987a). Situational knowledge and emotion. Cognition and
Emotion, 1(2), 145-191.
Conway, M.A., & Bekerian, D.A. (1987b). Organization in autobiographical memory.
Memory & Cognition, U(2), 119-132.
Courtois, M.R., & Mueller, J.H. (1979). Processing multiple physical features in facial recognition. Bulletin of the Psychonomic Society, 14, 74-76.
Craw, I., Ellis, H., & Lishman, J.R. (1987). Automatic extraction of face features. Pattern
Recognition Letters, 5(2), 183-187.
Ekman, P., Davidson, R.J., & Friesen, W.V. (1990). The Duchenne smile: Emotional expression and brain physiology. Journal of Personality and Social Psychology, 58, 342-352.
Ekman, P., & Friesen, W. (1971). Constants across cultures in the face and emotion. Journal
of Personality and Social Psychology, 17(2), 124-129.
Ekman, P., & Friesen, W. (1976a). Measuring facial movement. Journul of Environmentul
Psychology and Nonverbal Behavior, I, 56-57.
Ekman, P., & Friesen, W. (1976b). Pictures of facial 4ffect. Palo Alto, CA: Consulting Psychologists Press.
Ekman, P., & Friesen, W. (1978). The facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
Ekman, P., & Friesen, W. (1984). Unmasking theface. A guide to recognizing emotions from
facial cues. Englewood Cliffs, NJ: Prentice Hall.
Ekman, P., Friesen, W.V., & O'Sullivan, M. (1988). Smiles when lying. Journal of Personality
and Social Psychology, 54, 414-420.
Ekman, P., Friesen, W., & Taussig, T.G. (1969). VID-R and SCAN: Tools and methods for
the automated analysis of visual records. In G. Gerbner, D.R. Holsti, K. Krippendorf,
W.J. Paisley, & P.J. Stone (Eds.), The analysis of communication content. New York:
Wiley.
Ekman, P., Friesen, W., & Tomkins, S. (1971). Facial affect scoring technique: A first validity
study. Semiotica, 3, 37-58.
Ekman, P., Sorenson, E., & Friesen, W. (1969). Pan-cultural elements in facial displays of
emotion. Science, 164, 86-88.
Ellis, H.D., Jeeves, M.A., Newcombe, F., & Young, A. (Eds.). (1986). Aspects of face processing.
Dordrecht, Netherlands: Nijhoff.
Fehr, B., & Russell, J.A. (1984). Concept of emotion viewed from a prototype perspective.
Journal of Experimental Psychology: General, 113, 464-486.
Fischer, G., Lemke, A.C., Mastaglio, T., & March, A.I. (1991). Critics: An emerging approach to knowledge-based human-computer interaction. International Journal of
Man-Machine Studies, 35(5), 695-721.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the
analysis of variance. Journal of the American Statistical Association, 32, 675-701.
Galper, R.E., & Hochberg, J. (1971). Recognition memory for photographs of faces. American
Journal of Psychology, 84, 351-359.
Hinrichs, E.W. (1988). Tense, quantifiers, and contexts. Computational Linguistics, 14(2),
3-14.
Izard, C.E. (1971). The face of emotion. New York: Appleton-Century Crofts.
Jensen, D. (1986). Facial perception: Holistic or feature analytic? Proceedings of the Human
Factors Society, 30 (Pt. 1), 729-733.
Johansson, G. (1973). Visual perception of biological motion and a model for its analysis.
Perception and Psychophysics, 14, 201-211.
Kearney, G.D. (1991). Design of a memory based expert system for interpreting facial expressions in terms of signalled emotions. Unpublished doctoral dissertation, Thames Polytechnic, London, England.
Kohonen, T. (1977). Associative memory: A system-theoretical approach. Berlin: Springer-Verlag.
Kohonen, T., Oja, E., & Lehtio, P. (1981). Storage and processing of information in distributed
associative memory systems. In G. Hinton & J.A. Anderson (Eds.), Parallel models of
associative memory. Hillsdale, NJ: Erlbaum.
Kolodner, J.L. (1984). Retrieval and organizational strategies in conceptual memory: A computer model. Hillsdale, NJ: Erlbaum.
Laughery, K., Rhodes, B., Jr., & Batten, G.W., Jr. (1981). Computer-guided recognition and
retrieval of facial images. In G.M. Davies, H.D. Ellis, & J.W. Shephard (Eds.), Perceiving and remembering faces. London: Academic.
Lebowitz, M. (1980). Generalization and memory in an integrated understanding system (Tech.
Rep. No. 186). New Haven, CT: Yale University, Department of Computer Science.
Mase, K. (1991). Recognition of facial expression from optical flow. IEICE Transactions,
E74(10), 3474-3483.
Mase, K., Suenaga, Y., & Akimoto, T. (1987). Head Reader-A head motion understanding
system for better man-machine interaction. Proceedings of the 1987 IEEE International
Conference on Systems, Man, and Cybernetics, 3, 970-974.
McClelland, J.L., & Rumelhart, D.E. (1985). Distributed memory and the representation of
general and specific information. Journal of Experimental Psychology: General, 114,
159-188.
Mitchelson, D.L. (1975). Recording of movement without photography. In D.W. Grieve
(Ed.), Techniques for the analysis of human movement. London: Lepus Books (an imprint of A. & C. Black).
Oatley, K., & Johnson-Laird, P.N. (1985). Sketch for a cognifive theory of the emotions
(Cognitive Science Research Paper CSRP.045). Falmer, England: University of Sussex.
Patterson, K.E., & Baddeley, A. (1977). When face recognition fails. Journal of Experimental
Psychology: Human Learning and Memory, 3, 406-417.
Petajan, E. (1985). Automatic lipreading to enhance speech recognition. Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 40-47. ISBN
0818606339.
Pilowski, I., Thornton, M., & Stokes, B. (1985). A microcomputer based approach to the
quantification of facial expressions. Australasian Physical & Engineering Sciences in
Medicine, 8, 70-75.
Pilowski, I., Thornton, M., & Stokes, B. (1986). Towards the quantification of facial expressions with the use of a mathematical model of the face. In H. Ellis, M.A. Jeeves, F.
Newcombe, & A. Young, A. (Eds.), Aspects of face processing. Lancaster, England:
Martinus Nijhoff.
Raghavan, S.A., & Chand, D.R. (1989). Exploring active decision support: The JANUS project.
In R. Blanning & D. King (Eds.), Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, Volume III: Decision Support and Knowledge
Based Systems Track (Cat. No. 89TH0244-4), 33-35. Washington, DC: IEEE Computer Society Press.
Roseman, I. (1982). Cognitive aspects of discrete emotions. Unpublished doctoral dissertation, Yale University, New Haven, CT.
Sakai, T., Nagao, M., & Kanade, T. (1972). Computer analysis and classification of photographs of human faces. Proceedings of the First USA-Japan Computer Conference,
Tokyo, 1, 55-62. AFIPS and Information Processing Society of Japan, Montvale, NJ.
Schank, R.C. (1982). Dynamic memory: A theory of reminding and learning in computers and
people. Cambridge, England: Cambridge University Press.
Schank, R.C. (1984). Memory-based expert systems (Interim Report AFOSR-TR-84-0814).
New Haven, CT: Yale University, Computer Science Department.
Sergent, J. (1984). An investigation into component and configural processes underlying face
perception. British Journal of Psychology, 75, 221-242.
Sheehy, N.P. (1989). Non-verbal behaviour in the demonstrator. In Communication Failure in
dialogue techniques for detection and repair. Deliverable 9. Implementation of
Dialogue System (Esprit Project 527, Ref. CFID.Dg.2). Leeds, England: University of
Leeds, Department of Psychology.
Sloman, A. (1986). Motives, mechanisms and emotions (Cognitive Science Research Reports,
Serial No. CSRP 0620). Falmer, Brighton, England: University of Sussex, School of
Social Studies.
Sloman, A., & Croucher. M. (1981a). Why robots will have emotions. Cognitive Science Research Paper, No. 176. University of Sussex, School of Social Sciences, Fahner,
England.
Sloman, A., & Croucher, M., (1981b). You don’t need a soft skin to have a warm heart
(Cognitive Science Research Paper, Serial No. CSRP 004). Falmer, Brighton, England:
University of Sussex, School of Social Sciences.
Stonham, T.J. (1986). Practical face recognition and verification with Wisard. In H. Ellis,
M.A. Jeeves, F. Newcombe, & A. Young (Eds.), Aspects of face processing. Lancaster,
England: Martinus Nijhoff.
Strnad, B., &Mueller, J.H. (1977). Levels of processing in facial recognition memory. Bulletin
of the Psychonomic Society, 9, 17-18.
Wallbott, H.G. (1980). The measurement of human expression. In W. von Raffler-Engel
(Ed.), Aspects of nonverbal communication. Lisse: Swets & Zeitlinger.
Watkins, M.J., Ho, E., & Tulving, E. (1976). Context effects in recognition memory for faces.
Journal of Verbal Learning and Verbal Behavior, 15, 505-517.
Wells, G.L., & Hryciw, B.A. (1984). Memory for faces: Encoding and retrieval operations.
Memory and Cognition, 12, 338-344.
Williams, G.W. (1976). Comparing the joint agreement of several raters with another rater.
Biometrics, 32, 619-627.
Winograd, E. (1976). Recognition memory for faces following nine different judgements.
Bulletin of the Psychonomic Society, 8, 419-421.