CTT Centrum för talteknologi

Multi-modal expression of Swedish prominence
Björn Granström
Centre for Speech Technology, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden

Historical background
• Prosody for speech synthesis at KTH, together with Rolf Carlson
• The Lund intonation model – Gösta Bruce et al.

Several joint projects
Profs – Prosodic phrasing in Swedish, ~1989–1992
Gösta Bruce, Björn Granström and more
First reference: G. Bruce and B. Granström. Modelling Swedish intonation in a text-to-speech system. STL-QPSR, 30(1):17-21, 1989. (on the KTH web)
Potentially ambiguous sentences, varying in phrase boundary location
Entering greve Piper's humble residence

Several joint projects, cont.
Prosodiag – Prosodic Segmentation and Structuring of Dialogue (HSFR + NUTEK), 1993–1996
Gösta Bruce, Björn Granström, Kjell Gustafson, David House, Paul Touati
Project description: The object of study is the prosody of dialogue in a language technology framework. The primary goal of the project is to increase our understanding of how prosodic aspects of speech are exploited interactively in dialogue and, on the basis of this increased knowledge, to be able to create a more powerful prosody model.
Late reference: Gösta Bruce, Johan Frid, Björn Granström, Kjell Gustafson, Merle Horne, and David House. Prosodic segmentation and structuring of dialogue. TMH-QPSR, 37(3):1-6, 1996.

More than 20 joint publications – and then?
Much in the context of the annual phonetics meetings – next:

Project meetings in inspiring surroundings
... probing many different cultures

Is prosody more than sound?
• Our bias: communication is multi-modal
• Traditionally, prosodic functions are signaled by “gestures”, perceived by “eye and ear”
• This concerns both body and face gestures
• Preliminary hypothesis: F0 ~ eyebrow height, e.g. Cavé et al. (1996)
• Easy to put to a test with multimodal speech synthesis

Eyebrow vs intonation
1. No eyebrow motion
2. Eyebrow motion controlled by the fundamental frequency of the voice
3. Eyebrow motion at focal accents +
4. Eyebrow motion at the first focal accent +
“Jag heter Axel, inte Axell” (translation: “My name is Axel, not Axell”). In Sweden, Axel is a first name, as opposed to Axell, which is a family name.

Goals and research context
• How are visual expressions used to convey and strengthen prosodic functions?
• Understand interactions between visual expressions, dialogue functions and speech acoustics
• Context: animated talking agent
– Realistic communicative behavior using multimodal speech synthesis

Visual prosodic functions
• Prominence
– stress
– focus
• Phrasing
• Utterance type
– question
– statement
• Dialogue functions
– back-channeling
– turn-taking
• Attitudes
• Emotions

Visual prosody cont.
• What is underlying?
• How tight is the AV connection?
• What are the important visual gestures?
• More optional than acoustic prosodic parameters?
• Individual and cultural variation
• Reinforcing or qualifying acoustics?
Formal experiment
Prominence due to eyebrow rise
5 content words: “När pappa fiskar stör piper Putte” (When dad is fishing sturgeon, Putte is whimpering)

Example of stimuli
Task: “which word is most prominent” (identical acoustics – varied location of eyebrow movement)
• No eyebrow movement (neutral)
• Eyebrow movement

Prominence increase due to eyebrow movement
[Figure: Influence of eyebrow movement on judged prominence. Y-axis: % prominence due to eyebrow movement (0–50); groups: Swedish, Foreign, All]

Feedback experiment
• Mini dialogues (two turns)
• Travel agent application
• Both visual and acoustic feedback cues
• Affirmative cues – agent understands/accepts the request
• Negative cues – agent is unsure about the request (seeks confirmation)
• Six cues hypothesised
Granström, House & Swerts (2002)

Pos/Neg feedback experiment
Cue: affirmative setting / negative setting
• Smile: head smiles / head has neutral expression
• Head movement: head nods / head leans back
• Eyebrows: eyebrows rise / eyebrows frown
• Eye closure: eyes close a bit / eyes open widely
• F0 contour: declarative intonation / interrogative intonation
• Delay: immediate reply / slow reply
(Granström, House & Swerts 2002)

Cue strength
[Figure: Average response value (0–3) for each cue: Smile, F0 contour, Eyebrow, Head movement, Eye closure, Delay]

Recording of communicative interactions
Automatic tracking of reflective spots in 3D (Qualisys)

Interactions: emotion and articulation (resynthesis)
(from AV speech database – EU/PF_STAR project)

Measurement points for lip coarticulation analysis
• Left mouth corner
• Vertical distance
• Lateral distance

The expressive mouth: “left mouth corner”
• All vowels (sentences)
– Encouraging
– Happy
– Angry
– Sad
– Neutral
(Svanfeldt et al. 2003)

Prompted read speech database
• Expressive modes:
– Confirming, questioning, certain, uncertain, happy, (angry)
• 39 short, content neutral sentences with three possible focal accent positions each, e.g.
– Båten seglade förbi (The boat sailed by)
– Dom flyttade möblerna (They moved the furniture)
• Nonsense words (VCV, VCCV, CVC)
• Digits

Mean eyebrow positions for one speaker

Nose marker traces with automatic (blue) and two human (red) annotated head nods
(adapted from Cerrato & Svanfeldt 2006)

Examples from the database
Panels: Happy, Confirming
Focal accent on: Båten / seglade / förbi

Exploitation of visual parameters
• Visual cues exploited at focal accent
• Mouth cues – Happy, encouraging
• Eyebrow cues – Happy, questioning
• Vertical head nods – Confirming

Analysis in terms of FAP and FMQ
MPEG-4 Facial Animation Parameter (FAP): a subset of 31 FAPs out of the 68 FAPs defined in the MPEG-4 standard, including only the ones that we were able to calculate directly from our measured point data.
Focal Motion Quotient (FMQ): defined as the standard deviation of a FAP taken over a word in focal position, divided by the average standard deviation of the same FAP in the same word in non-focal position.
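The FMQ definition above can be illustrated with a minimal sketch. It assumes each word realisation is available as a plain list of sampled FAP values, and it uses the population standard deviation; the slides do not say which variant was used. The function name `focal_motion_quotient` and the sample values are hypothetical and serve only as illustration.

```python
from statistics import mean, pstdev


def focal_motion_quotient(focal_track, nonfocal_tracks):
    """Return the FMQ for one FAP and one word.

    focal_track: FAP samples over the word when it carries focal accent.
    nonfocal_tracks: list of FAP sample lists for the same word in
        non-focal position (one list per recording).
    Assumption: population standard deviation (pstdev); the source does
    not specify population vs. sample standard deviation.
    """
    if not nonfocal_tracks:
        raise ValueError("need at least one non-focal realisation")
    # Average standard deviation of the FAP in non-focal position.
    baseline = mean(pstdev(track) for track in nonfocal_tracks)
    if baseline == 0:
        raise ValueError("non-focal tracks show no variation")
    # Standard deviation in focal position, relative to that baseline.
    return pstdev(focal_track) / baseline


# Hypothetical eyebrow-raise samples for one word (arbitrary units).
focal = [0.0, 4.0, 8.0, 4.0, 0.0]
nonfocal = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 2.0, 4.0, 2.0, 0.0],
]
fmq = focal_motion_quotient(focal, nonfocal)
print(f"FMQ = {fmq:.2f}")  # about 2.67 for these made-up values
```

An FMQ above 1 means the parameter varies more when the word is in focus than when it is not, which is how the values in the following figures can be read.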
The focal motion quotient, FMQ, averaged across all sentences, for all measured MPEG-4 FAPs for several expressive modes
[Figure: FMQ (0–4.5) per FAP. Legend: Angry, Happy, Confirming, Questioning, Certain, Uncertain, Neutral. FAP axis groups: articulation, smile, brows, head]
FAPs shown:
• 50: head roll; 49: head yaw; 48: head pitch
• 38: squeeze right eyebrow; 37: squeeze left eyebrow; 36: raise right outer eyebrow; 35: raise left outer eyebrow; 34: raise right mid eyebrow; 33: raise left mid eyebrow; 32: raise right inner eyebrow; 31: raise left inner eyebrow
• 60: raise right cornerlip; 59: raise left cornerlip; 54: stretch right cornerlip; 53: stretch left cornerlip
• 56: lower top lip right mid; 55: lower top lip left mid; 51: lower top midlip; 17: push top lip; 58: raise bottom lip right mid; 57: raise bottom lip left mid; 52: raise bottom midlip; 16: push bottom lip
• 42: lift right cheek; 41: lift left cheek; 40: puff right cheek; 39: puff left cheek
• 18: depress chin; 15: shift jaw; 14: thrust jaw; 3: open jaw

The effect of focus on the variation of several groups of MPEG-4 FAP parameters, for different expressive modes
[Figure: FMQ (Focal Motion Quotient, 0–3) for the parameter groups articulation, smile, brows and head. Expressive modes: Neutral, Uncertain, Certain, Questioning, Confirming, Happy, Angry]

The effect of focal accent on selected parameter variations in Certain and Uncertain readings
[Figure: FMQ (Focal Motion Quotient, 0–4) in Certain and Uncertain readings. Legend: 31: raise left inner eyebrow; 32: raise right inner eyebrow; 33: raise left mid eyebrow; 34: raise right mid eyebrow; 48: head pitch; 49: head yaw]

What's next?
• Better recordings
• Detailed analysis of the eye region: “Gaze and wrinkles”
• Use in applications, e.g. spoken dialogue systems
• And more audible prosody ...

New cooperative project
SIMULEKT – Simulering av svenskans prosodiska dialekttyper (Simulating intonational varieties of Swedish)
VR 2007–2009

And finally ...
Congratulations!
Well done Gösta!