VUI Evaluation (PARADISE)

ITCS 6010
VUI Evaluation
PARADISE & SUM
PARADISE
Paradigm for Dialogue System Evaluation

Goal: Maximize User Satisfaction

Performance is modeled as a weighted function of a
task-based success measure and dialogue-based
cost measures, where weights are computed by
correlating user satisfaction with performance.
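
As a rough illustration of the weighted function described above, the sketch below uses the form commonly given for PARADISE: performance = alpha * N(task success) - sum of w_i * N(cost_i), where N is a z-score normalization. The function names, cost names, weight values, and sample numbers are all made up for illustration; in practice the weights would be fitted by relating user satisfaction to the normalized measures.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Normalize values to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def paradise_performance(task_success, costs, alpha, weights):
    """Per-dialogue performance: alpha * N(success) - sum_i w_i * N(cost_i).

    task_success: list of task-success scores, one per dialogue
    costs: dict mapping cost name -> list of values, one per dialogue
    alpha, weights: hypothetical weights (normally fitted, not hand-picked)
    """
    n_success = z_scores(task_success)
    n_costs = {name: z_scores(vals) for name, vals in costs.items()}
    scores = []
    for d in range(len(task_success)):
        penalty = sum(weights[name] * n_costs[name][d] for name in costs)
        scores.append(alpha * n_success[d] - penalty)
    return scores

# Illustrative data for three dialogues (all numbers are invented).
success = [1.0, 0.5, 0.0]
costs = {"turns": [7, 12, 17], "elapsed_s": [60, 90, 120]}
scores = paradise_performance(
    success, costs, alpha=0.5, weights={"turns": 0.3, "elapsed_s": 0.2}
)
```

With this invented data, the dialogue that succeeds with the fewest turns scores highest and the failed, longest dialogue scores lowest.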

Dialogue tasks are represented as Attribute Value
Matrix (AVM) pairs.
Attribute Value Matrix (AVM)

AVM represents the dialogue task

Information exchange between agent and user
Represented as ordered pairs of attributes and their possible values
AVM instantiation of task

Same regardless of dialogue
AVM (cont’d)
A1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to go from Torino to Milano. [DC, AC]
A2: Do you want to go from Trento to Milano? Yes or No? [DC, AC]
U2: No. [DC, AC]
A3: Do you want to leave from Trento? Yes or No? [DC]
U3: No. [DC]
A4: Where do you want to leave from? [DC]
U4: I want to leave from Torino. [DC]
A5: Do you want to leave from Torino? Yes or No? [DC]
U5: Yes. [DC]
A6: Do you want to go to Milano? [AC]
U6: Yes. [AC]
A7: At which time do you want to leave? [DR]
U7: I want to travel in the evening. [DR]
A8: Do you want to leave between 6 and 9pm? Yes or No? [DR]
U8: Yes. [DR]
A9: There is a train leaving at 8:00pm. [DT]
….
B1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to travel from Torino to Milano. [DC, AC]
B2: At which time do you want to leave from Merano to Milano? [DC, AC, DR]
U2: No, I want to leave from Torino in the evening. [DC, DR]
B3: Do you want to leave from Torino between 6 and 11pm? Please answer Yes or No. [DC, DR]
U3: Yes. [DC, DR]
B4: A train leaves at 8:00pm. [DT]
….

attribute            actual value
depart-city (DC)     Torino
arrival-city (AC)    Milano
depart-range (DR)    Evening
depart-time (DT)     8pm
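
The attribute/value listing above can be written down as a small mapping and compared with the values a dialogue actually ends up with. The sketch below is only a simplified per-dialogue check, not the task-success statistic used in PARADISE itself; the function names and the "observed" outcome are hypothetical.

```python
# Reference AVM ("key") for the train-enquiry task, copied from the listing above.
AVM_KEY = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "Evening",
    "depart-time": "8pm",
}

def attribute_matches(key, observed):
    """Map each attribute to True if the dialogue's final value equals the key."""
    return {attr: observed.get(attr) == value for attr, value in key.items()}

def match_rate(key, observed):
    """Fraction of key attributes that the dialogue got right."""
    matches = attribute_matches(key, observed)
    return sum(matches.values()) / len(matches)

# Hypothetical outcome: the system ended with the wrong departure city.
observed = {
    "depart-city": "Trento",
    "arrival-city": "Milano",
    "depart-range": "Evening",
    "depart-time": "8pm",
}
rate = match_rate(AVM_KEY, observed)  # 3 of 4 attributes match
```

Because the key depends only on the task, the same AVM_KEY applies to both dialogues, however many turns each agent needs.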
PARADISE
Paradigm for Dialogue System Evaluation

Advantages

PARADISE approach addresses both performance and user satisfaction
Disadvantages

Too complex to compute
Needs a large sample size up front
Alternative Approaches

What’s important?


Maximize User Satisfaction
Maximize Task Success
User Satisfaction

How do we measure user satisfaction?

Questionnaires

Interviews

Focus Groups
Task Success

How do we measure task success?

Logging Actual Use

Performance Measurement

Walkthroughs

Pilot Testing
Task Success (cont’d)

Establish AVMs for each dialogue and for the entire conversation.

Measure task success with respect to:

Task completion time
Accuracy or errors (e.g., misinterpretations)
Conclusions

PARADISE is good, but too complex!

Measure user satisfaction and task success.

What if user satisfaction is not the most relevant aspect?
Speech Usability Metric (SUM)

Uses 3 metrics:

User satisfaction
Accuracy
Task completion time
Eliminates the restriction of relying on one factor to determine usability
Speech Usability Metric (SUM)

SUM = X * User Satisfaction + Y * Accuracy + Z * Completion Time

X + Y + Z = 1
X, Y, Z > 0
Weights determined by evaluator
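
A minimal sketch of the SUM formula and its weight constraints follows. It assumes, which the slides do not state, that each of the three inputs has already been scaled to the range 0 to 1 with higher meaning better; the function name and the sample values are hypothetical.

```python
import math

def speech_usability_metric(satisfaction, accuracy, time_score, x, y, z):
    """SUM = X * satisfaction + Y * accuracy + Z * time_score.

    Assumption: every input is pre-scaled to [0, 1], higher is better.
    """
    if min(x, y, z) <= 0:
        raise ValueError("X, Y and Z must all be greater than 0")
    if not math.isclose(x + y + z, 1.0):
        raise ValueError("X + Y + Z must equal 1")
    return x * satisfaction + y * accuracy + z * time_score

# Example: an evaluator who weights satisfaction most heavily.
sum_score = speech_usability_metric(0.8, 0.9, 0.5, x=0.5, y=0.3, z=0.2)
```

Changing X, Y, and Z lets the evaluator emphasize whichever factor matters most for the application while keeping scores comparable.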
User Satisfaction

Surveys

Questionnaires

Interviews
Accuracy

Misinterpretations: system recognizes wrong word
Out-of-vocabulary errors: words not in system grammar
Wrong choice: correct word recognized, wrong path chosen
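
One way the three error types above could be tallied from a logged session is sketched below. The label strings, the log format, and the definition of accuracy as correct turns divided by all turns are assumptions made for illustration, not something the slides specify.

```python
from collections import Counter

# Hypothetical labels for the three error types listed above.
ERROR_TYPES = ("misinterpretation", "out_of_vocabulary", "wrong_choice")

def accuracy_from_log(turn_outcomes):
    """Return (accuracy, error_counts) for a list of per-turn outcome labels.

    Each entry is "ok" or one of ERROR_TYPES.
    Assumed definition: accuracy = correct turns / all turns.
    """
    if not turn_outcomes:
        raise ValueError("need at least one logged turn")
    counts = Counter(turn_outcomes)
    unknown = set(counts) - set(ERROR_TYPES) - {"ok"}
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    errors = {t: counts.get(t, 0) for t in ERROR_TYPES}
    accuracy = counts.get("ok", 0) / len(turn_outcomes)
    return accuracy, errors

# Invented session log: 8 user turns, one error of each type.
log = ["ok", "misinterpretation", "ok", "ok",
       "out_of_vocabulary", "ok", "wrong_choice", "ok"]
accuracy, errors = accuracy_from_log(log)
```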
Task Completion Time




Time to complete task
Time for expert to complete task (ETCT)
Maximum time to complete task (MTCT)
Expected time to complete task (ExTCT)
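
The slides list these time measures without saying how they combine into a single score. The sketch below shows one possible normalization, purely as an assumption: a user as fast as the expert (ETCT) scores 1, a user at or beyond the maximum time (MTCT) scores 0, and times in between are interpolated linearly. ExTCT is left out because its role is not specified; the function name and sample times are hypothetical.

```python
def time_score(actual_s, etct_s, mtct_s):
    """Map a completion time in seconds to a score in [0, 1].

    Assumed scheme (not from the source): linear between ETCT and MTCT.
    """
    if not 0 < etct_s < mtct_s:
        raise ValueError("expected 0 < ETCT < MTCT")
    if actual_s <= etct_s:
        return 1.0
    if actual_s >= mtct_s:
        return 0.0
    return (mtct_s - actual_s) / (mtct_s - etct_s)

# Invented example: expert takes 60 s, the cutoff is 180 s, the user took 90 s.
score = time_score(90, etct_s=60, mtct_s=180)
```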
Conclusion

SUM determines usability of a speech application

Utilizes 3 pre-defined metrics
Allows for greater flexibility