Audiovisual prosody in problematic dialogue situations

Marc Swerts
Communication & Cognition
Tilburg University
General problem

Spoken dialogue systems (SDS) are prone to errors, especially errors in the
automatic speech recognition (ASR) component of such systems

Errors will remain a problem for future systems, e.g. when they have to
operate in noisy conditions, with non-native speakers or when the
domain of the system becomes larger

Therefore, a key task for most dialogue managers in SDS is error
handling:
– Prevent errors (e.g. optimal dialogue strategies)
– Detect errors (e.g. acoustic and semantic confidence scores; a minimal sketch follows this list)
– Correct errors (e.g. feedback cues, system prompts)
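As a rough illustration of the detection step above, the sketch below combines an acoustic and a semantic confidence score into a simple accept/reject decision; the function name, weighting and threshold are assumptions for illustration, not part of any system described in this talk.

    # Hypothetical confidence-based error detection (illustrative only).
    def flag_misrecognition(acoustic_conf, semantic_conf, weight=0.6, threshold=0.5):
        """Flag a turn as a likely ASR error from two confidence scores in [0, 1]."""
        combined = weight * acoustic_conf + (1 - weight) * semantic_conf
        return combined < threshold  # True -> treat as suspect, re-prompt or verify

    # Example: a low semantic score pushes the combined score below the threshold.
    print(flag_misrecognition(0.55, 0.30))  # True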
Prosody and error handling

Recent interest in the use of speech prosody for error handling
– To detect misrecognized utterances, which have been shown to be
prosodically different from correctly recognized utterances (e.g. Hirschberg
et al. 2004)
– To distinguish positive from negative feedback cues about the smoothness
of the interaction (e.g. Krahmer et al. 2002)
– To locate places where speakers try to correct a prior utterance (corrections
tend to be hyperarticulated, which often leads to ‘spiral’ errors) (Oviatt et al.
1998)

Previous research focused only on verbal features; in this talk we
concentrate on the effect of errors on visual features as well (audiovisual
prosody)
This talk

Report on analyses of interactions between speakers and their dialogue
partners (both humans and machines)

Study audiovisual features of speakers
– When speakers notice they themselves have a problem (Part 1)
– When speakers notice their dialogue partners have a problem (Part 2)
Part 1

What are audiovisual features of a speaker who experiences
communication problems?
Uncertainty

Speakers are not always equally confident about or committed to what
they are saying

Suppose someone asks a question (Who wrote Hamlet? What is the
capital of Switzerland?)
– Speakers may be sure about their answer, or rather uncertain
– Speakers may not know the answer, though it may be on the tip of their
tongue

These differences in confidence level are reflected in the way speakers
present themselves; this is useful for their addressees
Questions to be addressed

How can visual cues from a speaker’s face be used as signals of level of
uncertainty? How important are such cues compared to auditory cues?

Are there significant differences between different kinds of speakers in
their use of visual cues for uncertainty? (here: age differences)
Experiment 1: Production of Uncertainty
(based on Smith and Clark 1993)

Experiment in three stages (Hart 1965):
1. Answers to factual questions (WISC, WAIS, Trivial Pursuit).
2. Test how certain the subject is that (s)he would recognize the correct answer in a
multiple-choice test (Feeling of Knowing (FOK) scores).
3. Recognition test (multiple choice).

“Tip of the tongue”: non-answer (“I don’t know”) with a high FOK.

Subjects were filmed during the first test; they could not see the experimenter.

Adults: person with highest score got a small reward.

Children all got a small reward.
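As a hedged sketch of how a "tip of the tongue" case (a non-answer with a high FOK score) could be picked out of such data; the column names and the FOK cut-off below are illustrative assumptions, not the coding actually used in the study.

    import pandas as pd

    # Hypothetical per-question records: whether an answer was given, plus the FOK rating.
    trials = pd.DataFrame([
        {"subject": 1, "question": "capital of Switzerland", "answered": False, "fok": 0.9},
        {"subject": 1, "question": "degrees in a circle",    "answered": True,  "fok": 1.0},
        {"subject": 2, "question": "who wrote Hamlet",       "answered": False, "fok": 0.2},
    ])

    # "Tip of the tongue": a non-answer combined with a high FOK score (cut-off assumed).
    tot_cases = trials[(~trials["answered"]) & (trials["fok"] >= 0.8)]
    print(tot_cases)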
Subjects and questions

20 adults
– Students and colleagues [20–50 years]
– 40 questions, n = 800
– e.g. Who wrote Hamlet? How many degrees in a circle? What is the capital of Switzerland? ...

20 children
– Group 4 [7–8 years]
– 30 questions, n = 600
– e.g. Who is the president of the U.S.? Where can you buy a Happy Meal? What is the color of peanut butter? ...
Labelling

All 1400 utterances were manually labelled by 4 independent
judges.

Consensus labelling of presence/absence of different audiovisual features.

Verbal: high intonation, filled pauses, delay, number of words.

Visual: eyebrow, smile, “funny face”, gaze [adults only]
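As a sketch of what one consensus-labelled utterance might look like as a data record; the feature names follow the lists above, but the encoding itself is an assumption for illustration.

    from dataclasses import dataclass

    @dataclass
    class UtteranceLabel:
        """One labelled response: presence/absence of each audiovisual cue."""
        high_intonation: bool
        filled_pause: bool
        delay: bool
        n_words: int
        eyebrow: bool
        smile: bool
        funny_face: bool
        diverted_gaze: bool  # labelled for adults only

    example = UtteranceLabel(high_intonation=True, filled_pause=True, delay=False,
                             n_words=4, eyebrow=True, smile=False,
                             funny_face=False, diverted_gaze=True)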
[Example stills of visual cues: eyebrow raising, smile, diverted gaze, "funny face"]
Results adults
FOK correlations:
                      Answers    Non-answers
  Words               -.344      .401
  Gaze acts           -.309      .347
  Marked features     -.422      .462

Answers: Presence of filled pause, delay, high intonation, eyebrow,
smile, funny face and different gaze acts corresponds with a significantly
lower FOK score.

Non-answers: Presence of filled pause, delay, high intonation, eyebrow,
smile, funny face and different gaze acts corresponds with a significantly
higher FOK score.
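A minimal sketch of how correlations of this kind could be computed, assuming hypothetical per-utterance counts of marked features and the corresponding FOK scores; the numbers below are made up for illustration.

    from scipy.stats import pearsonr

    # Hypothetical data: marked audiovisual features per answer, and the FOK score.
    marked_features = [0, 1, 3, 2, 0, 4, 1, 2]
    fok_scores      = [1.0, 0.8, 0.3, 0.5, 0.9, 0.2, 0.7, 0.6]

    r, p = pearsonr(marked_features, fok_scores)
    print(f"r = {r:.3f}, p = {p:.3f}")  # a negative r mirrors the pattern found for answers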
Results children

Answers: Presence of eyebrow, funny face and delay corresponds with
a significantly lower FOK score.

Non-answers: Presence of smile corresponds with a significantly higher
FOK score.

Other than that, there were no significant findings.

In general: children are much less expressive than adults, occasionally
use very long delays, and produce hardly any filled pauses.
Conclusion experiment 1

Speakers express their level of uncertainty via various audiovisual cues.

Adults do this much more than children (‘self-presentation’)

Opposite findings for answers and non-answers.

How is uncertainty perceived? What are the important features?
– In different modalities?
– By different judges?
Experiment 2: Perception of uncertainty
(based on Brennan and Williams 1995)



Stimuli: 60 adult responses from Experiment 1.

              Answers    Non-answers
  High FOK    15         15
  Low FOK     15         15

120 subjects participated:

  Vision+sound    Sound only    Vision only
  40              40            40

Task: judge level of uncertainty of speaker (FOAK scores).
FOAK scores for answers and non-answers

[Figure: FOAK scores for answers vs. non-answers, for high-FOK and low-FOK items]

Different conditions

[Figure: FOAK scores per condition (Vision+Sound, Sound only, Vision only), for high-FOK and low-FOK items]
Conclusion experiment 2

Observers can estimate a speaker’s level of uncertainty on the basis of
audiovisual cues.

Answers are "easier" than non-answers.

Scores for unimodal stimuli are good (both sound only and vision only),
but those for bimodal stimuli are best.
Experiment 3: Perception of uncertainty




For different speakers/judges: adults vs. children
Same task: judge level of (un)certainty
Stimuli: only answers, selected from Experiment 1.

              Child answers    Adult answers
  High FOK    15               15
  Low FOK     15               15

80 subjects participated:

                 Adult speaker    Child speaker
  Adult judge    20               20
  Child judge    20               20
FOAK scores for children and adults

[Figure: FOAK scores for adult and child speakers, as rated by adult and child judges, for high-FOK and low-FOK answers]
Conclusion experiment 3

Adults are “better” judges than children.
(Detecting behaviour one does not display is more difficult.)

Adults are “better” judged than children.
(What is not signalled cannot be detected.)
Part 2

What are audiovisual features of a speaker who notices that his/her
dialogue partner has communication problems?
Feedback cues

Dialogue partners continuously send and receive signals about the status of
the information that is being exchanged
– Positive feedback cues (‘go on’) when there are no problems
– Negative feedback cues (‘go back’) when there are problems

Previous research revealed that negative feedback cues are
prosodically ‘marked’ (e.g. higher, louder, longer) (e.g. Krahmer et al.
2002, Shimojima et al. 2002)

Here: series of experiments to investigate whether speakers use visual
cues as well as auditory ones for distinguishing positive from negative
cues
Data

Taken from an audiovisual corpus of 9 subjects engaged in telephone
conversations with a speaker-independent train timetable information
system; they had to query the system about 7 train journeys (63
interactions)

Subjects were video-taped during their interactions; they were led to
believe the data collection was for the development of a new video-phone

76% of the dialogues were successfully completed; 374 out of 1183
speaker turns were misunderstood by the system (32%)
Set-up of perception experiment

We performed three perception experiments in which 66 subjects were
shown selected video-clips from these recorded human-machine
interactions

The clips constituted ‘minimal pairs’, in that they consisted of
comparable utterances that had originally occurred either in a
problematic or in an unproblematic dialogue exchange

The subjects’ task was to guess whether the presented clip came from a
problematic or unproblematic context
Study 1: verification questions

Subjects saw users listening to verification questions from the system
(so users are silent), which can be unproblematic (such as in 1), or
problematic (such as in 2)
1. User: Amsterdam
System: So you want to travel to Amsterdam?
2. User: Amsterdam
System: So you want to travel to Rotterdam?
[Video stills: users listening to system questions, no-problem vs. problem contexts]
Study 2: Destination utterances

Subjects saw speakers uttering a destination; this could be the speaker's
first attempt (unproblematic) (as in 1), or a correction in response to a
verification question containing misrecognized or misunderstood
information (as in 2)
1. System: To which station do you want to travel?
User: Rotterdam
2. System: So you want to travel to Amsterdam?
User: Rotterdam
[Video stills: slot filling (speakers utter a destination), no-problem vs. problem contexts]
Study 3: negations

Subjects saw speakers uttering a negation (“nee”, no), which could be a
response to a general yes-no question (like in 1), or a response to a
verification question which contains incorrect information (like in 2)
1. System: Do you want me to repeat the connection?
User: No
2. System: So you want to travel to Amsterdam?
User: No
[Video stills: negations, no-problem vs. problem contexts; increasing level of frustration...]
Findings

In all three studies, subjects were able to correctly distinguish
problematic from unproblematic fragments above chance level (the task was
easier for the verification stimuli and the slot fillers); a minimal significance-test sketch follows this list

In order to gain insight into the audiovisual features that may have
functioned as cues, we labelled the data in terms of level of
hyperarticulation (6 levels) and presence or absence of a number of
visual features (most important: smile, head movement, diverted gaze,
frown, brow raise)

Both level of hyperarticulation and relative number of visual cues were
correlated with perceived and actual problems
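A minimal sketch of such an above-chance test, assuming a hypothetical subject who classifies 40 clips and gets 28 correct; the numbers are illustrative, not the study's actual results.

    from scipy.stats import binomtest

    # Hypothetical: 28 of 40 clips classified correctly, tested against the 50% chance level.
    result = binomtest(k=28, n=40, p=0.5, alternative="greater")
    print(result.pvalue)  # a small p-value indicates discrimination above chance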
[Figure: degree of hyperarticulation for perceived and for actual problems]

[Figure: amount of visual variation for perceived and for actual problems]
General conclusion

Dialogue problems have been shown to have consequences for
audiovisual characteristics of a speaker who experiences problems
him/herself or who notices that the dialogue partner has communication
problems

In general, it appears that problematic dialogue situations lead to more
dynamic facial expressions and marked prosodic behaviour
More information

Research reported here is joint work with Emiel Krahmer, Pashiera
Barkhuysen (PhD project) and Lennard van de Laar (technical assistant)
within the FOAP (“Functions of audiovisual prosody”) project:
foap.uvt.nl

Other interests: audiovisual cues to end-of-utterance, focus, emotion,
deceptive speech, and personality; incorporation of findings in ECAs
through collaborations
Data collection
Adults:
                        n      FOK
  Correct answers       575    0.94
  Incorrect answers     129    0.76
  Non-answers           96     0.42

Children:
                        n      FOK
  Correct answers       371    0.96
  Incorrect answers     125    0.74
  Non-answers           131    0.50
Contrary to adults, children have few high FOK non-answers.
Manipulated data

Goal: gain more insight into the relevance of visual and auditory cues; because of
ceiling effects it was difficult to establish the relative strength of
these two types of cues

Answers (1 HighFOK, 1 LowFOK) from 5 speakers were selected; words
had to have a similar sound shape (e.g. Goethe-Goofy; Zurich-Zorro, …)

Sound and image were separated to create mixed stimuli (e.g. HighFOK
vision combined with LowFOK sound)

Both original and mixed stimuli were presented to 120 subjects who had
to rate the FOK level (7-point scale) of each stimulus
Four stimulus conditions:
  Face: sure,   Voice: sure
  Face: unsure, Voice: unsure
  Face: sure,   Voice: unsure
  Face: unsure, Voice: sure
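As a hedged sketch of the splicing step described above (e.g. a sure face combined with an unsure voice), using an ffmpeg call from Python; the file names are placeholders and this is not necessarily the toolchain used in the study.

    import subprocess

    # Take the video stream from the "sure" clip and the audio stream from the
    # "unsure" clip of the same speaker and word (file names are placeholders).
    subprocess.run([
        "ffmpeg", "-i", "speaker1_highFOK.mp4", "-i", "speaker1_lowFOK.mp4",
        "-map", "0:v", "-map", "1:a", "-c", "copy",
        "face_sure_voice_unsure.mp4",
    ], check=True)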
Conclusions experiment 4

Overall: bias towards uncertainty in FOK ratings

FOK ratings are significantly influenced by verbal (intonation pattern)
and visual cues from the face; some speaker effect

However, facial information has much stronger cue value
Series of studies

Production of uncertainty (based on Smith and Clark 1993):
“Feeling of Knowing” (FOK)
– Experiment 1: Adults + children

Perception of uncertainty (based on Brennan and Williams 1995):
“Feeling of Another’s Knowing” (FOAK)
– Experiment 2: Unimodal vs multimodal
– Experiment 3: Adults x children
Future goals

Integrate the findings in Embodied Conversational Agents in order to
make these more natural and believable, in particular for error handling
strategies (working hypothesis: Users are more likely to tolerate
incorrect answers if the system signals its uncertainty)

Explore whether visual features can be used as an additional resource
for error detection (growing interest in incorporating visual information in
automatic recognition process)
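As a sketch of the working hypothesis above (a system that hedges its prompt when its own confidence is low); the prompt wording and the threshold are illustrative assumptions, not part of the systems discussed here.

    def verification_prompt(recognized_city, confidence, threshold=0.6):
        """Return a verification prompt; hedge it when recognition confidence is low."""
        if confidence >= threshold:
            return f"So you want to travel to {recognized_city}?"
        # Explicitly signal uncertainty, the behaviour hypothesised to raise tolerance.
        return f"I am not sure I understood you. Did you say {recognized_city}?"

    print(verification_prompt("Amsterdam", 0.85))
    print(verification_prompt("Rotterdam", 0.40))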
Audiovisual prosody

Prosody defined as those features that do not determine what a speaker
says, but rather how he or she says it
– Verbal: intonation, tempo, loudness, voice quality, pauses, ….
– Visual: facial expressions, hand and arm gestures, body language, …

Audiovisual prosody = verbal + visual prosody
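A minimal sketch of measuring some of the verbal prosodic features listed above from a recording, using the parselmouth (Praat) library; the file name is a placeholder.

    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("utterance.wav")      # placeholder file name

    pitch = snd.to_pitch()                        # F0 contour (intonation)
    intensity = snd.to_intensity()                # loudness contour

    mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")
    mean_db = call(intensity, "Get mean", 0, 0, "energy")

    print("duration (s):", snd.duration)
    print("mean F0 (Hz):", mean_f0, "mean intensity (dB):", mean_db)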
Self-presentation

Auditory cues (Smith and Clark, 1993; Brennan and Williams, 1995):
– Linguistic hedges (“I am not sure, but…”, “I think..”)
– Filled pauses (uh and uhm)
– Prosody (question intonation)

This study: possible visual cues (which are a natural and important ingredient of
daily conversations as well)