Centrum för talteknologi
Multi-modal expression of
Swedish prominence
Björn Granström
Centre for Speech Technology, Department of
Speech, Music and Hearing, KTH, Stockholm, Sweden
Historical background
• Prosody for speech synthesis at KTH,
together with Rolf Carlson
• The Lund intonation model – Gösta
Bruce et al.
Several joint projects
Profs – Prosodic phrasing in Swedish ~1989-1992
Gösta Bruce, Björn Granström and more
First reference: G. Bruce and B. Granström. Modelling
Swedish intonation in a text-to-speech system. STL-QPSR, 30(1):17-21, 1989. (on the KTH web)
Potentially ambiguous sentences, varying in
phrase boundary location
Entering greve Piper's humble
residence
Several joint projects, cont.
Prosodiag - Prosodic Segmentation and Structuring of Dialogue
(HSFR + NUTEK) 1993–1996
Gösta Bruce, Björn Granström, Kjell Gustafson, David House,
Paul Touati
Project Description
The object of study is the prosody of dialogue in a language
technology framework. The primary goal of the project is to
increase our understanding of how prosodic aspects of
speech are exploited interactively in dialogue and on the
basis of this increased knowledge to be able to create a
more powerful prosody model.
Late reference: Gösta Bruce, Johan Frid, Björn Granström,
Kjell Gustafson, Merle Horne, and David House. Prosodic
segmentation and structuring of dialogue. TMH-QPSR,
37(3):1-6, 1996.
More than 20 joint publications – and then?
Much in the context of the annual
phonetics meetings – next:
Project meetings in
inspiring surroundings
..probing many different cultures
Is prosody more than sound?
• Our bias: communication is multi-modal
• Traditionally prosodic functions are
signaled by “gestures”, perceived by “eye
and ear”
• This concerns both body and face
gestures
• Preliminary hypothesis: F0~eyebrow
height - e.g. Cavé et al. (1996)
• Easy to put to a test with multimodal
speech synthesis
Eyebrow vs intonation
1 No eyebrow motion
2 Eyebrow motion
controlled by the
fundamental frequency
of the voice
3 Eyebrow motion at
focal accents +
4 Eyebrow motion at
the first focal accent +
“Jag heter Axel, inte Axell” (translation: “My name
is Axel, not Axell”). In Sweden Axel is a first name
as opposed to Axell, which is a family name.
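Condition 2 above (eyebrow motion controlled by the fundamental frequency of the voice) can be illustrated with a minimal sketch. The slides do not say how F0 was mapped to eyebrow height, so everything here is an assumption made for illustration: the function name `eyebrow_from_f0`, the linear scaling between a speaker's F0 floor and ceiling, the clamping, and the choice to hold the last value during unvoiced frames (F0 given as 0).

```python
from typing import List, Sequence


def eyebrow_from_f0(
    f0_hz: Sequence[float],
    f0_floor: float,
    f0_ceiling: float,
    max_raise: float = 1.0,
) -> List[float]:
    """Map an F0 track (Hz, one value per frame) to an eyebrow-raise track.

    Hypothetical mapping, not the method used in the study:
    - voiced frames are scaled linearly so that f0_floor -> 0 and
      f0_ceiling -> max_raise, and clamped to that range;
    - unvoiced frames (F0 <= 0) keep the previous eyebrow value.
    """
    if f0_ceiling <= f0_floor:
        raise ValueError("f0_ceiling must be greater than f0_floor")

    track: List[float] = []
    previous = 0.0  # eyebrow at rest before the first voiced frame
    for f0 in f0_hz:
        if f0 > 0:
            scaled = (f0 - f0_floor) / (f0_ceiling - f0_floor)
            previous = max_raise * min(1.0, max(0.0, scaled))
        track.append(previous)
    return track


if __name__ == "__main__":
    # Four frames: low, mid, unvoiced, high.
    print(eyebrow_from_f0([100.0, 150.0, 0.0, 200.0], 100.0, 200.0))
```

A real talking head would probably smooth the resulting track before animating it; that step is left out to keep the mapping itself visible.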
Goals and research context
• How are visual expressions used to convey
and strengthen prosodic functions?
• Understand interactions between visual
expressions, dialog functions and speech
acoustics
• Context: animated talking agent
– Realistic communicative behavior using
multimodal speech synthesis
Visual prosodic functions
• Prominence
– stress
– focus
• Phrasing
• Utterance type
– question
– statement
• Dialogue functions
– back channeling
– turntaking
• Attitudes
• Emotions
Visual prosody cont.
• What is underlying?
• How tight is the AV connection?
• What are the important visual
gestures?
• More optional than acoustic prosodic
parameters?
• Individual and cultural variation
• Reinforcing or qualifying acoustics?
Formal experiment
Prominence due to eyebrow rise
5 content words: ”När pappa fiskar stör piper Putte”
When dad is fishing sturgeon, Putte is whimpering
Example of stimuli
Task: “which word is most prominent”
(identical acoustics – varied location of eyebrow movement)
No eyebrow
movement (neutral)
Eyebrow movement
Prominence increase due to
eyebrow movement
[Figure: Influence of eyebrow movement on judged prominence. Y-axis: % prominence due to eyebrow movement (0 to 50); groups: Swedish, Foreign, All.]
Feedback experiment
• Mini dialogues (two turns)
• Travel agent application
• Both visual and acoustic feedback cues
• Affirmative cues – agent
understands/accepts the request
• Negative cues – agent is unsure about the
request (seeks confirmation)
• Six cues hypothesised
Granström, House & Swerts (2002)
Pos/Neg feedback experiment
Cue             Affirmative setting       Negative setting
Smile           Head smiles               Head has neutral expression
Head movement   Head nods                 Head leans back
Eyebrows        Eyebrows rise             Eyebrows frown
Eye closure     Eyes close a bit          Eyes open widely
F0 contour      Declarative intonation    Interrogative intonation
Delay           Immediate reply           Slow reply
(Granström, House & Swerts 2002)
[Figure: Cue strength. Y-axis: average response value (0 to 3); cues: Smile, F0 contour, Eyebrow, Eye closure, Head movement, Delay.]
Recording of communicative
interactions
Automatic tracking of reflective spots in 3D (Qualisys)
Interactions: emotion and
articulation (resynthesis)
(from AV speech database –
EU/PF_STAR project)
Measurement points
for lip coarticulation
analysis
left mouth
corner
Vertical
distance
Lateral
distance
The expressive mouth
”left mouth corner”
• All vowels (sentences)
– Encouraging
– Happy
– Angry
– Sad
– Neutral
(Svanfeldt et al. 2003)
Prompted read speech database
• Expressive modes:
– Confirming, questioning, certain, uncertain, happy,
(angry)
• 39 short, content neutral sentences with three
possible focal accent positions each, e.g.
• Båten seglade förbi (The boat sailed by)
• Dom flyttade möblerna (They moved the furniture)
• Nonsense words (VCV, VCCV, CVC)
• Digits
Mean eyebrow positions for one speaker
Nose marker traces with automatic (blue) and two human (red)
annotated head nods (adapted from Cerrato & Svanfeldt 2006)
Happy
Confirming
Examples from the database
Båten
Focal accent on:
seglade
förbi
Exploitation of visual parameters
• Visual cues exploited at focal accent
• Mouth cues
– Happy, encouraging
• Eyebrow cues
– Happy, questioning
• Vertical head nods
– Confirming
Analysis in terms of FAP and FMQ
MPEG-4 Facial Animation Parameter (FAP)
A subset of 31 FAPs out of the 68 FAPs defined in the
MPEG-4 standard, including only the ones that we were
able to calculate directly from our measured point data
Focal Motion Quotient, FMQ, defined as the
standard deviation of a FAP parameter taken over a
word in focal position, divided by the average standard
deviation of the same FAP in the same word in non-focal
position.
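The FMQ definition above can be written out as a small sketch. It assumes each FAP is available as a list of per-frame values over the word in question. The function name `focal_motion_quotient`, the input layout, and the use of the population standard deviation are assumptions for illustration; the slides do not state which estimator was used.

```python
from statistics import fmean, pstdev
from typing import Sequence


def focal_motion_quotient(
    focal_track: Sequence[float],
    nonfocal_tracks: Sequence[Sequence[float]],
) -> float:
    """FMQ for one FAP and one word.

    focal_track:     FAP values over the word when it carries focal accent.
    nonfocal_tracks: FAP values over the same word in each non-focal
                     reading (one sequence per reading).

    Returns SD(focal) / mean(SD(non-focal)). Population SD is assumed here.
    """
    if not nonfocal_tracks:
        raise ValueError("at least one non-focal reading is required")

    focal_sd = pstdev(focal_track)
    nonfocal_sd = fmean(pstdev(track) for track in nonfocal_tracks)
    if nonfocal_sd == 0:
        raise ValueError("no variation in non-focal readings; FMQ undefined")
    return focal_sd / nonfocal_sd


if __name__ == "__main__":
    # Toy numbers, not data from the study.
    focal = [0.0, 2.0, 0.0, 2.0]
    nonfocal = [[0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
    print(focal_motion_quotient(focal, nonfocal))
```

Under this definition an FMQ above 1 means the parameter moves more when the word is in focus than when it is not, and an FMQ near 1 means focus has little effect on that parameter.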
The focal motion quotient, FMQ, averaged across all sentences,
for all measured MPEG-4 FAPs for several expressive modes
[Figure: FMQ (y-axis, 0 to 4.5) for each measured FAP (x-axis), one series per expressive mode: Angry, Happy, Confirming, Questioning, Certain, Uncertain, Neutral. The FAPs are grouped as head, brows, smile and articulation.]
FAPs shown:
50: head roll
49: head yaw
48: head pitch
38: squeeze right eyebrow
37: squeeze left eyebrow
36: raise right outer eyebrow
35: raise left outer eyebrow
34: raise right mid eyebrow
33: raise left mid eyebrow
32: raise right inner eyebrow
31: raise left inner eyebrow
60: raise right cornerlip
59: raise left cornerlip
54: stretch right cornerlip
53: stretch left cornerlip
56: lower top lip right mid
55: lower top lip left mid
51: lower top midlip
17: push top lip
58: raise bottom lip right mid
57: raise bottom lip left mid
52: raise bottom midlip
16: push bottom lip
42: lift right cheek
41: lift left cheek
40: puff right cheek
39: puff left cheek
18: depress chin
15: shift jaw
14: thrust jaw
3: open jaw
The effect of focus on the variation of several
groups of MPEG-4 FAPs,
for different expressive modes
[Figure: FMQ (Focal Motion Quotient, y-axis, 0 to 3) for the FAP groups articulation, smile, brows and head in each expressive mode: Neutral, Uncertain, Certain, Questioning, Confirming, Happy, Angry.]
The effect of focal accent on selected parameter
variations in Certain and Uncertain readings
[Figure: FMQ (Focal Motion Quotient, y-axis, 0 to 4) in Certain and Uncertain readings for 31: raise left inner eyebrow, 32: raise right inner eyebrow, 33: raise left mid eyebrow, 34: raise right mid eyebrow, 48: head pitch and 49: head yaw.]
What's next?
• Better recordings
• Detailed analysis of the eye region:
”Gaze and wrinkles”
• Use in applications, e.g. spoken
dialogue systems
• And more audible prosody ...
New cooperative project
SIMULEKT - Simulering av svenskans
prosodiska dialekttyper (Simulating
intonational varieties of Swedish)
VR 2007-2009
And finally ...
Congratulations!
Well done Gösta!