WOZError

What can humans do when faced with ASR errors?
Dan Bohus
Dialogs on Dialogs Group, October 2003
Question



We’re trying to build systems that can deal with a noisy recognition channel
Q: How good are humans at that?
More importantly, how do they do it?
  What strategies do they use?
  How do they decide which one to use when?
  What kind of knowledge is used in the process?
WOZ experiments

Modify the WOZ setting so that the wizard does not hear the user, but rather receives the recognition result (text, in these cases)

Exploring Human Error Handling Strategies [Gabriel Skantze]

A Study of Human Dialogue Strategies in the Presence of Speech Recognition Errors [Teresa Zollo]
Domain/Task, Experiments


Problem-solving task: the wizard guides the user through a campus
  Wizard has a detailed map
  User has only a small fragment of the map, showing their current surroundings
Experiments
  8 users, 8 operators, balanced male/female
  5 scenarios per user → 40 dialogs
WOZ / Experimental Setting

Wizard receives recognition results on a GUI
  Not parsed (the operator also plays the parser)
  Confidence denoted by color intensity (sketched below)
Users know they are talking to a human
  A standard hidden-wizard setup is more costly
  Hard to maintain subjects for longitudinal studies
  Conflicting evidence on whether linguistic patterns change when speaking to a machine vs. to a human
  Operators are naïve; they are also subjects of the study
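
A purely illustrative sketch of the confidence-as-color-intensity idea (not the study's actual GUI; the word/confidence pairs and the ANSI grayscale mapping are my own assumptions), in Python:

    def render_hypothesis(words):
        """Render (word, confidence) pairs so low-confidence words look fainter.

        Uses the ANSI 256-color grayscale ramp (232 = near black, 255 = white);
        on a dark terminal, low-confidence words fade toward the background.
        """
        parts = []
        for word, conf in words:
            level = 232 + int(conf * 23)            # map [0, 1] -> [232, 255]
            parts.append(f"\033[38;5;{level}m{word}\033[0m")
        return " ".join(parts)

    # Example: the recognizer is unsure about the middle of the utterance.
    hyp = [("walk", 0.93), ("past", 0.88), ("the", 0.81),
           ("old", 0.30), ("library", 0.95)]
    print(render_hypothesis(hyp))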
Results


43% WER, 7.3% OOV
Manual labeling of operator understanding:
  Full understanding
  Partial understanding
  Non-understanding
  Misunderstanding
Very few misunderstandings
Operators were good at rejecting (signaling non-understanding rather than accepting errors)
Users thought they were almost always understood
Results (continued)

3 main operator strategies (approximately equally distributed) for dealing with non- and partial understandings:
  Continuation of the route description
  Signal of non-understanding
  Task-related question

A PARADISE-like regression indicates that strategy 2 (signaling non-understanding) is inversely correlated with users' answers to "how well do you think you did?"
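
For concreteness, a PARADISE-style regression just fits a linear model from per-dialog features to a user rating. The sketch below uses invented feature names and numbers, not the study's data:

    import numpy as np

    # Hypothetical per-dialog features: counts of the three operator strategies
    # plus WER, and the user's "how well do you think you did?" rating (1-5).
    X = np.array([
        # continue  signal-non-und  task-question  WER
        [4,         1,              2,             0.35],
        [2,         5,              1,             0.52],
        [3,         0,              4,             0.28],
        [1,         6,              2,             0.61],
        [5,         2,              3,             0.40],
    ])
    y = np.array([4.5, 2.0, 4.8, 1.5, 3.9])

    # Standardize features (as PARADISE does) so the weights are comparable,
    # then fit ordinary least squares with an intercept term.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    A = np.hstack([np.ones((len(Xz), 1)), Xz])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)

    print("intercept:", round(w[0], 2))
    print("weights [continue, signal, task-question, WER]:", np.round(w[1:], 2))
    # A negative weight on the "signal non-understanding" count corresponds to
    # the inverse correlation with perceived success reported on the slide.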
WOZ experiments

Modify the WOZ setting so that the wizard does not hear the user, but rather receives the recognition result

Exploring Human Error Handling Strategies [Gabriel Skantze]

A Study of Human Dialogue Strategies in the Presence of Speech Recognition Errors [Teresa Zollo]
Domain / Experiments

TRIPS-Pacifica: planning the evacuation of the fictitious island Pacifica
  Construct a plan to transport all the civilians on Pacifica to Barnacle by 5 am so that they can be evacuated from the island (the plan will be deployed at midnight)
  + the road between Calypso and Ocean Beach is impassable
Only 7 dialogs (September ’99)
WOZ / Experimental Setting





Wizard assisted by a GUI for quick information access and for generating synthesized responses
Sphinx-2 (CMU) for recognition, TrueTalk (Entropics) for synthesis
Wizard receives a string of words (the paper does not mention confidence scores)
User debriefing questionnaire
Wizard annotates the interaction transcript with knowledge sources used in decisions, etc.
Results

Small corpus
  7 dialogs
  348 utterances
Manually labeled misunderstandings
Overall WER: 30%
Looked at positive and negative feedback
Looked at positive and negative feedback
Negative feedback

Request for full repetition: 33/80
  24/33 cases users complied and repeated/rephrased
WH-replacement of missing or erroneous word: 12/80
  8/12 cases users responded with the precise info
Attempt to salvage correct word: 20/80
  Possibly increase user satisfaction?
  Similar responses to requests for repetition
Request for verification: 15/80
  10/15 responded by explicit affirmations
What if we wanted to do these?

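One way a system could choose among the four negative-feedback moves listed on the previous slide is to key off word confidences and what the parse is missing. The thresholds, slot names, and helper class below are invented for illustration; neither study prescribes such a policy:

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class Hypothesis:
        words: List[Tuple[str, float]]      # (word, ASR confidence in [0, 1])
        missing_slot: Optional[str] = None  # a concept the parse needs but lacks

    def negative_feedback(hyp: Hypothesis) -> str:
        confs = [c for _, c in hyp.words]
        avg = sum(confs) / len(confs)
        low = [w for w, c in hyp.words if c < 0.4]

        if avg < 0.3:
            # Nothing reliable survived: request a full repetition.
            return "Sorry, could you say that again?"
        if hyp.missing_slot is not None:
            # One required item is missing or garbled: WH-replacement question.
            return f"Which {hyp.missing_slot} did you mean?"
        if low and len(low) < len(hyp.words):
            # Salvage the trusted words and ask only about the rest.
            kept = " ".join(w for w, c in hyp.words if c >= 0.4)
            return f"I got '{kept}' ... what was the rest?"
        # Plausible overall but not certain: request verification.
        return "Did you say '" + " ".join(w for w, _ in hyp.words) + "'?"

    print(negative_feedback(Hypothesis([("go", 0.9), ("to", 0.8),
                                        ("ocean", 0.2), ("bleach", 0.1)])))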
More negative feedback results


Wizards gave negative feedback in 80 (35%) of the 227 incorrectly recognized utterances
Compensation for ASR errors:
  Ignoring words that are not salient in the TRIPS domain
  Hypothesizing correct words based on phonetic similarity
Q: So, what does that say? Better parsing?
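
Approximating those two wizard behaviors in software might look like the sketch below; the domain lexicon, the filler list, and the use of difflib string similarity as a cheap stand-in for real phonetic distance are all assumptions of mine:

    import difflib

    # Hypothetical TRIPS-Pacifica domain lexicon and filler list (illustrative only).
    DOMAIN_LEXICON = ["truck", "helicopter", "civilians", "barnacle", "calypso",
                      "ocean", "beach", "road", "evacuate", "plan", "midnight"]
    FILLERS = {"uh", "um", "oh", "well"}

    def compensate(words):
        repaired = []
        for w in (w.lower() for w in words):
            if w in FILLERS:
                continue                     # ignore words with no domain salience
            if w in DOMAIN_LEXICON:
                repaired.append(w)
                continue
            # Hypothesize the intended word from the closest-sounding domain term;
            # difflib's string ratio is a rough proxy for phonetic similarity.
            close = difflib.get_close_matches(w, DOMAIN_LEXICON, n=1, cutoff=0.6)
            repaired.append(close[0] if close else w)
        return repaired

    # "clips" standing in for a misrecognition of "Calypso" (purely illustrative).
    print(compensate(["um", "the", "road", "from", "clips",
                      "to", "ocean", "beach", "is", "impassable"]))
    # -> ['the', 'road', 'from', 'calypso', 'to', 'ocean', 'beach', 'is', 'impassable']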
Positive feedback






Using an acknowledgement term (okay, right)
Simple response to a question (next relevant contribution)
Conversational/social response, e.g. greetings/thanks
Providing an unsolicited next relevant contribution
Clarifying or correcting
Paraphrasing
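
If a system were to give positive evidence of understanding in these forms, even a crude confidence-banded policy illustrates the idea; the bands and phrasings below are invented, not taken from the paper:

    def positive_feedback(understood: str, confidence: float) -> str:
        """Pick a form of positive evidence of understanding (sketch only)."""
        if confidence > 0.9:
            # High confidence: a bare acknowledgement, then move on with the task
            # (the "next relevant contribution" itself is the strongest evidence).
            return "Okay."
        if confidence > 0.6:
            # Medium confidence: paraphrase what was understood (implicit confirmation).
            return f"So, {understood}."
        # Below that, negative feedback (previous slides) is the better move.
        return "Sorry, I didn't catch that."

    print(positive_feedback("the road to Ocean Beach is blocked", 0.7))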
Conclusions

Observations consistent with theoretical grounding models (Clark et al.)
  Negative feedback only when really needed
  Unless ASR is perfect (and sometimes even then), wizards give explicit indications of their understanding
Discussion…


WOZ setting…
  In both studies, Wizard = Parser + Dialog Manager
  It seems humans can extract more information from the recognized text than current parsers do
    Do we need better, more robust parsers?
  How about a setting where Wizard = Dialog Manager only?
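
To make the two configurations concrete, here is a toy sketch (the stand-in parser, its keyword list, and the example string are purely illustrative):

    def system_parser(asr_text: str) -> dict:
        # Stand-in for a real parser: keep only words it knows about.
        known = {"road", "calypso", "ocean", "beach", "impassable", "truck", "evacuate"}
        return {"concepts": [w for w in asr_text.lower().split() if w in known]}

    def wizard(view) -> str:
        # Stand-in for the human wizard's decision, given whatever they see.
        return f"[wizard decides based on: {view!r}]"

    asr_text = "the wrote between clips oh and ocean beach is impossible"

    # Configuration used in both studies: Wizard = Parser + Dialog Manager
    # (the wizard sees the raw recognized string and interprets it themselves).
    print(wizard(asr_text))

    # Proposed variant: Wizard = Dialog Manager only, working from the parser's
    # (possibly impoverished) output; this isolates the dialog-management question.
    print(wizard(system_parser(asr_text)))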
Domain choice
  Skantze's results make sense in the chosen domain
  Would such results hold across other domains?