
VOICE USER INTERFACES II
CS160: User Interfaces
John Canny
UPDATES
PROG 03 (Recipes) due Thursday 3/16 – you need your
group’s Fire Tablet for this.
DESIGN 05 (Contextual Inquiry) due on 3/14
REMINDER
Amazon Fire Tablets are available (one per team):
• Please collect from a GSI during office hours this week
(posted on bCourses page).
• Bring a deposit check for $50, made out to UC Regents
• Try out PROG 02 right away!
LAST TIME: VUI DESIGN I
• When to Use Voice
• Conversational Design Principles
• Personae
• Writing Sample Dialogs
• Testing and iterating
• Building out
• Best Practices
LAST TIME: GRICE’S MAXIMS
Quality: Say what you truly believe.
Quantity: Say as much information as is needed, but not more.
Relevance: Talk about what is relevant to the conversation at hand.
Manner: Try to be clear and explain in a way that makes sense to others.
THE COOPERATIVE PRINCIPLE
A conversational utterance should be: “such as is required, at the stage
it occurs, by the accepted purpose or direction of the talk exchange in
which it occurs”
i.e. Conversation is purposeful.
Speakers have intents that they wish to fulfill. Cooperative conversation
is a process in which speakers attempt to satisfy each other’s perceived
intents.
PRAGMATICS – SPEECH ACTS
Speech has both syntactic and pragmatic meanings, which are often
different due to social conventions (politeness).
“Can you close the window?”
is both a question (syntax) and a request (pragmatics).
Aaron: “Can we meet tomorrow at noon?”
Aaron’s intent is to request-suggest a meeting.
Brenda: “Sorry, I have a weekly scrum meeting then.”
Brenda replies that she is busy, and politely explains the conflict. This
cooperative reply facilitates scheduling the event at another time.
COMMON GROUND (STALNAKER)
Common ground is often established from shared context (e.g. location),
and shared visual and auditory scenery:
“Nice sunset”
“It handles the corners really well”
“The new menu is really good”
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register
• Explore!
WORD CATEGORIES
Often a user’s utterance will contain words that are synonymous, or equivalent as far
as the agent’s response is concerned.
These words can be grouped into categories of similar meaning.
Rather than the word itself, the intent should be driven by the category.
Examples:
• Car, auto, vehicle,…
• Phone, cellphone, mobile, smartphone,…
• Yes, yeah, yup, sure, fine,….
WORD CATEGORIES
There are widely-used lists of category-word pairs.
See e.g. Wordnet: https://wordnet.princeton.edu/
You can use this to augment your dialogs with multiple options for
related words.
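As a rough sketch of how this might be done in code (assuming NLTK is installed and its WordNet corpus has been downloaded; the function name is just illustrative), a category seed word can be expanded into related surface forms:

# Sketch: expand a seed word into a category of related surface forms via WordNet.
# Assumes: pip install nltk, then nltk.download('wordnet') has been run once.
from nltk.corpus import wordnet as wn

def category_words(seed):
    """Collect lemma names from every synset of the seed word."""
    words = set()
    for synset in wn.synsets(seed):
        for lemma in synset.lemma_names():
            words.add(lemma.replace("_", " ").lower())
    return sorted(words)

print(category_words("car"))
# includes e.g. 'auto', 'automobile', 'motorcar', ... which can all map to the same intent slot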
EMOTION CATEGORIES
A related phenomenon is the sentiment of the user’s utterances, which
can be estimated from the choice of words they make:
See also the MPQA lexicon (http://mpqa.cs.pitt.edu/), and also try SentiWordNet.
Positive word examples:
Absolutely, Abundant, Accept, Bloom, Bountiful, Bounty, Brave, Bright, Brilliant,
Bubbly, Bunch, Burgeon, Calm, Care, Celebrate
EMOTION CATEGORIES
A related phenomenon is the sentiment of the user’s utterances, which
can be estimated from the choice of words they make:
Negative word examples:
abysmal, adverse, alarming, angry, annoy, anxious, apathy, appalling, can't, clumsy,
coarse, cold, cold-hearted, collapse, confused, contradictory, contrary, corrosive,
corrupt, crazy, creepy, deny, despicable, detrimental
EMOTION CATEGORIES
From the positive, negative, and other emotion words in the user’s speech, the
system can infer their general sentiment.
Based on this, the system can adapt its own sentiment accordingly, e.g. express
empathy if the user seems sad or depressed, or tell jokes and say something
inspiring when they are upbeat.
An agent also emits a sentiment through the choice of words in its dialogs. Be
aware of this; it may be a good idea to measure it.
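A minimal sketch of this kind of lexicon-based sentiment estimate (the tiny word sets below just reuse the examples above; a real system would use the MPQA lexicon or SentiWordNet):

# Sketch: estimate utterance sentiment by counting positive and negative lexicon hits.
import re

POSITIVE = {"absolutely", "brave", "bright", "calm", "celebrate", "accept"}
NEGATIVE = {"abysmal", "angry", "annoy", "anxious", "clumsy", "deny"}

def sentiment(utterance):
    words = re.findall(r"[a-z']+", utterance.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I'm angry and this keeps making me anxious"))   # -> negative

The agent can then choose its own wording (empathetic vs. upbeat) based on the returned label.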
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register
• Explore!
DIALOG MANAGEMENT
Consider a Pizza-ordering app. The information it needs comprises 6 “slots”:
• Number of pizzas
• Size(s)
• Toppings
• Street address
• Phone number
• Zip code
Dialog management leads the user through the process of giving this information.
Consider these sample dialogs:
DIALOG MANAGEMENT
PIZZA APP: Hi there, welcome to Pearl’s Pizza, home of the famous gingerbread pizza
crust. What kind of pizza can I get you?
USER: Um, I’d like a large pepperoni, please.
[At this point, we have filled in the number of pizzas (1), the size (large), and the
toppings (pepperoni). All that remains is the delivery details: address, phone
number, and zip code.]
PIZZA APP: That’s one large pepperoni. Great. What’s the delivery address?
USER: Cherry Lane.
PIZZA APP: And the street number?
USER: 321.
PIZZA APP: And finally, what’s your zip code?
USER: 94019.
DIALOG MANAGEMENT
The information needed is:
• Number of pizzas
• Size(s)
• Toppings
• Street address
• Phone number
• Zip code
But note that the prompts were:
• “What kind of pizza?”
• “What’s the delivery address?”
These open prompts encourage the user to fill in multiple slots with each response,
streamlining the interaction. It’s also more natural.
DIALOG MANAGEMENT
PIZZA APP: Hi there, welcome to Pearl’s Pizza, home of the famous gingerbread pizza
crust. What kind of pizza can I get you?
USER: Uhh... yeah, I wanna order some pizzas.
PIZZA APP: Great, that’s what I’m here for. How many would you like?
USER: Two, please.
PIZZA APP: And what sizes?
USER: One small, one large.
PIZZA APP: OK, let’s start with the small. What ingredients would you like on that one?
USER: Olives, mushrooms, and sun-dried tomatoes…
DIALOG MANAGEMENT
Here the user was less cooperative.
Since they gave only minimal information, the system leads them through the missing
slots, one at a time.
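A minimal sketch of slot-filling dialog management for this example (slot names, prompts, and the pre-filled values are illustrative; real NLU would extract the slots from the user’s utterance): start with an open prompt, then ask only for whatever is still missing, one slot at a time.

# Sketch: slot-filling dialog manager for the pizza example.
SLOTS = ["number", "size", "toppings", "street_address", "phone", "zip"]

PROMPTS = {
    "number": "How many pizzas would you like?",
    "size": "And what sizes?",
    "toppings": "What would you like on that one?",
    "street_address": "What's the delivery address?",
    "phone": "What's a good phone number?",
    "zip": "And finally, what's your zip code?",
}

def next_prompt(filled):
    """Return the prompt for the first unfilled slot, or None when all slots are filled."""
    for slot in SLOTS:
        if slot not in filled:
            return PROMPTS[slot]
    return None

# The open prompt "What kind of pizza can I get you?" may fill several slots at once
# ("a large pepperoni" -> number, size, toppings); afterwards we only ask for what's missing.
filled = {"number": 1, "size": "large", "toppings": ["pepperoni"]}
print(next_prompt(filled))   # -> "What's the delivery address?"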
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register
• Explore!
SSML AND PROSODY
SSML is Speech Synthesis Markup Language.
It is a W3C XML standard that supports control of:
• Voice (speaker characteristics)
• Pronunciation
• Lexicon
• Prosody
SSML AND VOICES
Voice (speaker characteristics): Nationality, age, formality
<voice> element
Pronunciation:
<say-as> element – style of speaking
<phoneme> element – phonetic spelling
Lexicon:
<lexicon> element - custom lexicon, e.g. chemistry jargon
Emphasis
<emphasis> element - emphasize a word or phrase
Prosody
<prosody> element
<prosody pitch=""> attribute – control pitch
<prosody rate=""> attribute – control speaking rate
<prosody volume=""> attribute – control volume
<prosody contour=""> attribute – control pitch contour
SPEAKER CHARACTERISTICS
Amazon Polly - advanced Text-To-Speech synthesizer:
https://console.aws.amazon.com/polly/home/SynthesizeSpeech
Speaker characteristics:
• Nationality
• Gender
• Age
• Formality
• Naturalness
SAY-AS
<speak>
Hello, how are you?
Hello, how are <say-as interpret-as="spell-out">you</say-as>?
The number is 5551212
The number is <say-as interpret-as="digits">5551212</say-as>
</speak>
SAY-AS provides control over speaking style.
PHONETICS
<speak>
You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
Each phoneme element specifies a phonetic alphabet (here IPA or International Phonetic
Alphabet), and then a string giving the spelling.
Polly also supports x-sampa — The Extended Speech Assessment Methods Phonetic
Alphabet (X-SAMPA)
Note: IPA contains non-ASCII characters, while X-SAMPA is plain text.
https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/speech-synthesis-markup-language-ssml-reference#supported-ssml-tags
EMPHASIS
<speak>
Hello, how are you?
Hello, how <emphasis level="strong">are</emphasis> you?
Hello, how are <emphasis level="strong">you</emphasis>?
</speak>
Emphasis can be used to mark the focus of the sentence. Emphasis is supported in
Polly, but not Alexa.
PITCH
<speak>
Hello, how are you?
<prosody pitch="x-high">Hello, how are you?</prosody>
<prosody pitch="x-low">Hello, how are you?</prosody>
</speak>
Note: the prosody element is supported in Polly, but not Alexa.
PITCH CONTOUR
<speak>
Hello, how are you?
<prosody contour="(0%,+0%) (70%,+30%) (100%,+0%)">Hello, how are you?</prosody>
</speak>
[Contour illustration: the pitch rises late in the sentence – “how ARE you?”]
Note: contour is parsed but not implemented at this time in Polly.
USING PROSODY
Many limitations right now.
Be sure to test the target voice carefully – not all voices respond to prosodic shaping.
Best to use these capabilities to fix prosody errors rather than to optimize pronunciation.
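A sketch of auditioning an SSML string with Amazon Polly through boto3 (assumes AWS credentials are configured; the voice and the markup are illustrative). Alexa skills take a similar <speak> string in their responses, but support only a subset of these tags:

# Sketch: synthesize an SSML string with Amazon Polly via boto3.
import boto3

ssml = """<speak>
  Hello, how <emphasis level="strong">are</emphasis> you?
  <prosody rate="slow" pitch="low">I'd like to record your account number.</prosody>
</speak>"""

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",        # parse the markup instead of reading the tags aloud
    VoiceId="Joanna",       # audition several voices; each responds differently to prosody tags
    OutputFormat="mp3",
)

with open("audition.mp3", "wb") as f:
    f.write(response["AudioStream"].read())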
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register
• Explore!
COHESION
Consider this utterance:
SYSTEM: You have five bookmarks. Here is the first bookmark,…
next bookmark,… that was the last bookmark.
Versus:
SYSTEM: You have five bookmarks. Here is the first one,…
next one,… that was the last one.
The second form sounds more like a coherent message. Cohesion is helped by
pronouns, “this,” “that,” etc., and other pointer words like “one.”
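A small sketch of a list readout that uses “one” rather than repeating the noun (the function and its wording are illustrative, not from any toolkit):

# Sketch: cohesive list readout using "one" instead of repeating "bookmark".
def readout(items, noun="bookmark"):
    lines = [f"You have {len(items)} {noun}s."]
    for i, item in enumerate(items):
        if i == 0:
            lines.append(f"Here is the first one: {item}.")
        elif i < len(items) - 1:
            lines.append(f"Next one: {item}.")
        else:
            lines.append(f"And the last one: {item}.")
    return " ".join(lines)

print(readout(["CS160 syllabus", "Fire Tablet setup", "SSML reference"]))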
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register and Jargon
• Explore!
INFORMATION STRUCTURE
Imagine an Oscars award presentation:
SYSTEM: And the winner is… <emphasis>La La Land</emphasis>… for best production design.
Vs:
SYSTEM: And the winner for best production design is… <emphasis>La La
Land</emphasis>.
INFORMATION STRUCTURE
Also:
USER: Hey Google, what is the highest mountain in North America?
SYSTEM: Highest mountains in North America include Denali at 20,310 feet,…
Vs:
SYSTEM: Denali is the highest mountain in North America.
END-FOCUS PRINCIPLE
END FOCUS: Place new or focal information near the end of a sentence.
End focus minimizes memory load – we don’t have to remember a datum that was buried
earlier in the system’s response.
It also reduces cognitive load – the response typically contains a query (old information)
that the focal datum satisfies. We can check the query before we receive the answer:
USER: What’s tomorrow’s weather look like?
SYSTEM: Tomorrow’s weather should be partly cloudy with a high near 60.
(“Tomorrow’s weather” restates the query – old information; “partly cloudy with a high
near 60” is the answer – new information.)
END-FOCUS PRINCIPLE
Versus:
USER: What’s tomorrow’s weather look like?
SYSTEM: Partly cloudy with a high near 60 should be tomorrow’s weather.
Also:
USER: Who was the thirtieth president of the US?
GOOGLE: Calvin Coolidge was the thirtieth president of the US.
ALEXA: The thirtieth president of the US was Calvin Coolidge.
i.e. Alexa implements end-focus more consistently.
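A sketch of response templates that implement end focus (the template wording is illustrative): restate the query first as old information, then place the answer at the end.

# Sketch: response templates that put the old information first and the answer last.
def weather_response(day, conditions, high):
    # "{day}'s weather" restates the query (old); the forecast is the new information at the end.
    return f"{day}'s weather should be {conditions} with a high near {high}."

def fact_response(query_subject, answer):
    # e.g. the query subject is the old information, the answer comes last
    return f"{query_subject} was {answer}."

print(weather_response("Tomorrow", "partly cloudy", 60))
print(fact_response("The thirtieth president of the US", "Calvin Coolidge"))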
ACTIVE VS PASSIVE VOICE
USER: Who discovered North America?
SYSTEM: Leif Erikson discovered North America. [active voice]
SYSTEM: North America was discovered by Leif Erikson. [passive voice]
Passive voice is usually considered less compelling, but it is more effective here because
it places the answer (the new information) at the end.
Interestingly, Alexa avoids the passive voice with its answer:
ALEXA: North America is a discovery by Leif Erikson.
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register
• Explore!
CONTRACTIONS
Use “you’re”, “can’t”, “it’s”, “don’t” freely in dialog. They sound informal in written text,
but natural in spoken language.
SYSTEM: Finally, if you’re finished with your bookmark list, just say…
“I’m done”… and I’ll put it away.
SYSTEM: Finally, if you are finished with your bookmark list, just say…
“I am done”… and I will put it away
Btw, notice how prosody is influenced by word choice:
SYSTEM: And last of all, if you’re done with your bookmark list, just say…
“I’m done”… and I’ll put it away
MUST AND MAY
Must is a commonly-used written device to indicate a requirement (you can find it in
CS160 assignment descriptions).
But it sounds overbearing in speech:
SYSTEM: You must say your PIN one digit at a time –
for example, two one zero zero.
Try “go ahead and” instead:
SYSTEM: Go ahead and say your PIN one digit at a time –
for example, two one zero zero.
WILL AND “GOING TO”
Will is the preferred form for the future tense in written English, but it sounds stilted in
spoken language:
SYSTEM: I will now record your account number –
say the number one digit at a time.
SYSTEM: I’m going to record your account number –
say the number one digit at a time.
Try “I’d like to” instead, and add some dialog management and a pause:
SYSTEM: I’d like to record your account number –
go ahead and say the number - one digit at a time.
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register
• Explore!
REGISTER
Register is the level of formality in the dialog. Note that spoken language is naturally
much more informal than written text, to the point where it may be “ungrammatical”.
formal → colloquial
acquire → get
activate an account → set up an account
create an account → set up an account
create a bookmark → make, add a bookmark
encounter (information) → find, come across
encounter (difficulty) → have problems, have trouble with
exit list → be done with a list
experience difficulties → have problems/trouble
obtain → get
pause → take a break
provide → give
receive → get
…
REGISTER
Generally, spoken dialog should be informal. There are a few reasons for this:
• It better matches “normal” conversation.
• Informal language is typically easier to interpret for most users. Less-educated users
may have trouble with formal language.
• Informal language is usually more efficient.
Exceptions:
• Formal invitations/responses, e.g. wedding invitations.
• Serious/somber communication.
• Communication where maximum clarity is needed (business/contracts).
REGISTER
In formal (written) grammar, this is correct:
A. SYSTEM: to whom would you like to speak?
And this is not:
B. SYSTEM: who would you like to speak to?
But B is overwhelmingly more common in spoken discourse, and A sounds pretentious.
REGISTER
Halliday’s dimensions of register:
Mode: refers to the channel of communication – written vs. spoken
Field: has to do with the content of the discourse as well as the social setting, e.g. a
wedding invitation, or a notice of a clothing vendor holding a sale in their store,…
Tenor: involves the roles and relationships between user and system. This is where we
have design freedom in VUIs.
REGISTER AND TENOR
Tenor: involves the roles and relationships between user and system. This is where we have
design freedom in VUIs. Compare:
SYSTEM: you must visit the registration web site at…
- sounds like a supervisor/subordinate relationship
SYSTEM: please visit the registration web site at…
- sounds like a sales assistant/customer
SYSTEM: why don’t you visit the registration web site at…
- advice from a peer?
SYSTEM: you might want to visit the registration web site at…
- advice from a (passive-aggressive) peer?
THIS TIME
• Word categories/sentiment analysis
• Dialog Management
• SSML and Prosody
• Cohesion
• Information Structure
• Spoken vs Written English
• Register
• Explore!
EXPLORE!
This is a young and very fast-growing field.
It leverages and enables many other new technologies:
• Smart vehicles
• Wearable computers (smartwatches, glasses, VR gear)
• IoT
• Smart agent back-ends (IBM Watson)
• Robotics
By working on VUIs, you are by definition a pioneer.
Don’t take the design principles we give you as “finished” or anywhere near final.
EXPLORE! - OBSERVATIONS
The field is also in a technical transition because of
the shift from classical to deep learning AI technologies.
Recognizers and TTS systems capture much more
“real-world” knowledge about context and even
individual traits (from users’ training data).
OBSERVATION 1 - AUDITIONS
The current generation of voices (e.g. Polly voices) is very complex (they are trained on
large amounts of data from real people).
• You can’t get a feeling for their traits from a short listen.
• Be prepared to “audition” them on substantial tracts of text.
• Their affect (prosody) depends on local context.
So the material has a big effect on what they can emote.
Just like a human actor, they will be better in some
roles than others.
• It’s likely that in the future voices will be directly “tunable” for style.
But it’s unclear whether this will be better than a large set of
discrete voices with realistic personas.
OBSERVATION 2
It’s very hard to shape utterances with current (direct) prosody controls.
• Low level controls (pitch, volume, pitch contour)
have “shallow” and unrealistic effects.
• They are also very costly to tune
(there are several parameters to adjust).
On the other hand, the voices themselves have strong (and voice-specific) ties between
the text content and the prosody they emit while speaking.
Therefore you have a lot of freedom to control prosody and register by careful choice of
the text to be spoken – note that this will only work for the specific voice you are using.
This is like writing a part for an actor.
• Affect words have a big influence.
• Phrases carry more context than individual words.