How much data is enough?
Generating reliable policies with MDPs
Joel Tetreault
University of Pittsburgh
LRDC
July 14, 2006
Problem
Problems with designing spoken dialogue systems:
How to handle noisy data or miscommunications?
Hand-tailoring policies for complex dialogues?
What features to use?
Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., '02; Walker, '00; Henderson et al., '05]
However, there has been very little empirical work [Paek et al., '05; Frampton, '05] comparing the utility of adding specialized features to construct a better dialogue state
Goal
How does one choose which features best
contribute to a better model of dialogue state?
Goal: show the comparative utility of adding four different features to a dialogue state
4 features: concept repetition, frustration, student performance, student moves
All are important to tutoring systems, but also are
important to dialogue systems in general
Previous Work
In complex domains, annotation and testing are time-consuming, so it is important to choose the best features beforehand
Developed a methodology for using Reinforcement
Learning to determine whether adding complex
features to a dialogue state will beneficially alter
policies [Tetreault & Litman, EACL ’06]
Extensions:
Methodology to determine which features are the best
Also show our results generalize over different action
choices (feedback vs. questions)
Outline
Markov Decision Processes (MDP)
MDP Instantiation
Experimental Method
Results
Policies
Feature Comparison
Markov Decision Processes
What is the best action an agent should take
at any state to maximize reward at the end?
MDP Input:
States
Actions
Reward Function
MDP Output
Policy: optimal action for system to take in
each state
Calculated using policy iteration which
depends on:
Propagating final reward to each state
the probabilities of getting from one state to the next
given a certain action
Additional output: V-value: the worth of each
state
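A minimal policy iteration sketch (in Python, assuming NumPy and a tabular MDP estimated from the corpus; the function name and data layout are illustrative, not the system's actual code):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Tabular policy iteration (illustrative sketch).
    P[a][s, s'] : estimated probability of moving from state s to s' under action a
    R[s]        : reward attached to state s (e.g. +/-100 at final states, 0 elsewhere)
    Returns the policy (best action per state) and the V-values (worth of each state)."""
    n_actions, n_states = len(P), len(R)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve V = R + gamma * P_pi V for the current policy
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: pick the action with the highest expected value in each state
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```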
MDP’s in Spoken Dialogue
[Diagram: training data is fed to the MDP, which works offline and produces a policy for the dialogue system; the dialogue system then interacts online with a user simulator or a human user]
ITSPOKE Corpus
100 dialogues with ITSPOKE spoken
dialogue tutoring system [Litman et al. ’04]
All possible dialogue paths were authored by
physics experts
Dialogues informally follow question-answer
format
60 turns per dialogue on average
Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned
Corpus Annotations
Manual annotations:
Tutor Moves (similar to Dialog Acts)
[Forbes-Riley et al., ’05]
Student Frustration and Certainty
[Litman et al. ’04] [Liscombe et al. ’05]
Automated annotations:
Correctness (based on student’s response to last question)
Concept Repetition (whether a concept is repeated)
%Correctness (past performance)
MDP State Features
Feature | Values
Correctness | Correct (C), Incorrect (I)
Certainty | Certain (cer), Neutral (neu), Uncertain (unc)
Concept Repetition | New Concept (0), Repeated (R)
Frustration | Frustrated (F), Neutral (N)
% Correctness | 50-100% (H)igh, 0-49% (L)ow
MDP Action Choices
Action | Example Turn
SAQ (Short Answer Question) | "What is the direction of that force relative to your fist?"
CAQ (Complex Answer Question) | "What is the definition of Newton's Second Law?"
Mix | "If it doesn't hit the center of the pool what do you know about the magnitude of its displacement from the center of the pool when it lands? Can it be zero? Can it be nonzero?"
NoQ | "So you can compare it to my response…"
MDP Reward Function
Reward Function: use normalized learning gain to
do a median split on corpus:
NLG = (posttest − pretest) / (1 − pretest)
10 students are “high learners” and the other 10 are
“low learners”
High learner dialogues had a final state with a
reward of +100, low learners had one of -100
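A small sketch of the reward assignment, assuming NumPy and that pretest/posttest scores are normalized to [0, 1]; the helper name is hypothetical:

```python
import numpy as np

def final_rewards(pretests, posttests):
    """Median split on normalized learning gain (NLG); hypothetical helper.
    NLG = (posttest - pretest) / (1 - pretest), with test scores in [0, 1]."""
    pre, post = np.asarray(pretests, float), np.asarray(posttests, float)
    nlg = (post - pre) / (1.0 - pre)
    median = np.median(nlg)
    # "High learners" (at or above the median) get +100 at their final state, "low learners" -100
    return np.where(nlg >= median, 100, -100)
```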
Methodology
Construct MDP’s to test the inclusion of new state
features to a baseline:
Develop baseline state and policy
Add a feature to the baseline and compare policies
A feature is deemed important if adding it results in a
change in policy from a baseline policy given 3 metrics:
# of Policy Differences (Diff’s)
%Policy Change (%PC)
Expected Cumulative Reward (ECR)
For each MDP: verify policies are reliable (V-value
convergence)
Hypothetical Policy Change Example
B1 State | B1 Policy | B1+Certainty State | +Cert 1 Policy (0 Diffs) | +Cert 2 Policy (5 Diffs)
1 [C] | CAQ | [C,Cer] [C,Neu] [C,Unc] | CAQ CAQ CAQ | Mix CAQ Mix
2 [I] | SAQ | [I,Cer] [I,Neu] [I,Unc] | SAQ SAQ SAQ | Mix CAQ Mix
Tests
[Diagram of the test progression: Baseline 1 (Correctness) → B1 + Certainty (= Baseline 2) → B2 + Concept, B2 + Frustration, B2 + %Correct]
Actions: {SAQ, CAQ, Mix, NoQ}
Baseline State: {Correctness}
Baseline network
[Diagram: states [C] and [I] linked to each other and to FINAL by the actions SAQ|CAQ|Mix|NoQ]
Baseline 1 Policies
# | State | State Size | Policy
1 | [C] | 1308 | NoQ
2 | [I] | 872 | Mix
Trend: if student correctness is the only model of student state, respond to a correct answer with a hint or other non-question act (NoQ), and otherwise give a Mix of complex and short answer questions
But are our policies reliable?
The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work
Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus
Method: run the MDP on subsets of our corpus (incrementally add a student (5 dialogues) to the data and rerun the MDP on each subset); a sketch follows below
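A sketch of that convergence check, where build_mdp and solve stand in for corpus-specific code (e.g. the policy iteration sketched earlier); both names are placeholders:

```python
def v_value_convergence(students, build_mdp, solve):
    """Re-solve the MDP on growing subsets of the corpus, adding one
    student (5 dialogues) at a time, and record the V-values so we can
    check whether they flatten out as data is added."""
    history = []
    for k in range(1, len(students) + 1):
        P, R = build_mdp(students[:k])   # transition probabilities + rewards from the subset
        policy, V = solve(P, R)          # e.g. policy_iteration(P, R)
        history.append((k, policy, V))
    return history
```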
Baseline Convergence Plot
Methodology: Adding more Features
Create more complicated baseline by adding
certainty feature (new baseline = B2)
Add other 4 features (concept repetition, frustration,
performance, student move) individually to new
baseline
Check V-value and policy convergence
Analyze policy changes
Use the Feature Comparison Metrics to determine the relative utility of the four features
Tests
[Diagram of the test progression: Baseline 1 (Correctness) → B1 + Certainty (= Baseline 2) → B2 + Concept, B2 + Frustration, B2 + %Correct]
Certainty
Previous work (Bhatt et al., ’04) has shown the
importance of certainty in ITS
A student who is certain and correct may require a harder question since he or she is doing well, but a student who is correct yet showing some doubt may be becoming confused and should be given an easier question
B2: Baseline + Certainty Policies
B1 State | B1 Policy | B1+Certainty State | +Certainty Policy
1 [C] | NoQ | [C,Cer] [C,Neu] [C,Unc] | Mix SAQ Mix
2 [I] | Mix | [I,Cer] [I,Neu] [I,Unc] | Mix NoQ Mix
Trend: if neutral, give SAQ or NoQ, else give Mix
Baseline 2 Convergence Plots
Baseline 2 Diff Plots
Diff: for each subset of the corpus, compare its policy with the policy generated from the full corpus
Tests
[Diagram of the test progression: Baseline 1 (Correctness) → B1 + Certainty (= Baseline 2) → B2 + Concept, B2 + Frustration, B2 + %Correct]
Feature Comparison (3 metrics)
# Diff’s
Number of new states whose policies differ from the
original
Insensitive to how frequently a state occurs
% Policy Change (%P.C.)
Take into account the frequency of each state-action
sequence
%P.C. = (# occurrences of states whose policy differs) / (total # of state occurrences)
Feature Comparison
Expected Cumulative Reward (E.C.R.)
One issue with %P.C. is that frequently occurring states
have low V-values and thus may bias the score
Use the expected value of being at the start of the
dialogue to compare features
ECR = average V-value of all start states
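A sketch of the three metrics, assuming NumPy; the data structures (dicts keyed by the refined states, with baseline policies mapped onto them) are illustrative assumptions:

```python
import numpy as np

def compare_policies(base_policy, new_policy, state_counts, start_V):
    """Three feature-comparison metrics (illustrative sketch).
    base_policy  : new state -> action the baseline assigns to that state's parent
    new_policy   : new state -> action chosen by the extended model
    state_counts : new state -> number of occurrences in the corpus
    start_V      : V-values of the dialogue start states under the extended model."""
    diff_states = [s for s in new_policy if new_policy[s] != base_policy[s]]
    n_diffs = len(diff_states)                                    # # Diff's
    pct_change = 100.0 * (sum(state_counts[s] for s in diff_states)
                          / sum(state_counts.values()))           # % Policy Change
    ecr = float(np.mean(start_V))                                 # Expected Cumulative Reward
    return n_diffs, pct_change, ecr
```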
Feature Comparison Results
State Feature | #Diff's | %P.C. | E.C.R.
Student Move | 10 | 82.2% | 43.21
Concept Repetition | 10 | 80.2% | 39.52
Frustration | 8 | 66.4% | 31.30
Percent Correctness | 4 | 44.3% | 28.47
Trend of SMove > Concept Repetition > Frustration > Percent
Correctness stays the same over all three metrics
Sanity check: also tested the effect of a binary random feature
Given enough data, a random feature should not alter policies
Average diff of 5.1
How reliable are policies?
[Convergence plots for the Frustration and Concept Repetition models]
Possibly the data size is small, and with increased data we may see more fluctuations
Confidence Bounds
Hypothesis: instead of looking at the V-values
and policy differences directly, look at the
confidence bounds of each V-value
As data increases, the confidence interval around each V-value should shrink, reflecting a better model of the world
Additionally, the policies should converge as
well
Confidence Bounds
CB’s can also be used to distinguish how
much better an additional state feature is
over a baseline state space
That is, if the lower bound of the new state space's ECR is greater than the upper bound of the baseline state space's ECR, the added feature is reliably better
Crossover Example
[Plot: ECR confidence bounds vs. amount of data for the Baseline and a more complicated model; the more complicated model's bounds eventually cross above the Baseline's]
Confidence Bounds: App #2
Automatic model switching
If a model at its worst (i.e., its lower bound) is better than another model's upper bound, then you can automatically switch to the more complicated model
Good for online RL applications (a minimal sketch follows)
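The switching rule itself is just a bound comparison; a minimal sketch with illustrative names:

```python
def should_switch(current_bounds, candidate_bounds):
    """Switch to the more complicated model once its lower ECR bound
    exceeds the current model's upper bound (the crossover criterion).
    Each argument is a (lower, upper) pair of ECR confidence bounds."""
    _, current_upper = current_bounds
    candidate_lower, _ = candidate_bounds
    return candidate_lower > current_upper
```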
Confidence Bound Methodology
For each data slice, calculate upper and lower
bounds on the V-value
Take the transition matrix for the slice and sample from each row using the Dirichlet distribution 1000 times
Run the MDP on all 1000 sampled transition matrices to get a range of ECR's
We do this because the observed transition matrix only approximates the true dynamics of the world, but should be close
So we get 1000 new transition matrices that are all very similar
Rows with little data are volatile, so we expect a large range of ECR's; as data increases, the transition matrices should stabilize so that most of the sampled matrices produce policies and values similar to the original
Take the upper and lower bounds at the 2.5% percentile on each side (see the sketch below)
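A sketch of the resampling procedure, assuming NumPy; counts, start_states, and solve are placeholders for the corpus transition counts, the dialogue-initial states, and an MDP solver such as the policy iteration sketched earlier:

```python
import numpy as np

def ecr_confidence_bounds(counts, R, start_states, solve, n_samples=1000, alpha=0.025, seed=None):
    """Upper/lower confidence bounds on ECR via Dirichlet resampling (illustrative sketch).
    counts : array of shape (A, S, S) with observed transition counts for the data slice
    R      : reward vector over states; solve(P, R) returns (policy, V)."""
    rng = np.random.default_rng(seed)
    ecrs = []
    for _ in range(n_samples):
        P = np.empty(counts.shape, dtype=float)
        for a in range(counts.shape[0]):
            for s in range(counts.shape[1]):
                # Resample each row from a Dirichlet whose parameters are the counts (+ small prior)
                P[a, s] = rng.dirichlet(counts[a, s] + 1e-3)
        _, V = solve(P, R)
        ecrs.append(float(np.mean(V[start_states])))   # ECR = average V-value of the start states
    return np.percentile(ecrs, 100 * alpha), np.percentile(ecrs, 100 * (1 - alpha))
```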
Experiment
Original action/state setup did not show
anything promising
State/action space too large for data?
Not best MDP instantiation
Looked at a variety of MDP configurations
Refined reward metric
Adding discourse segmentation
+essay Instantiation with ’03+’05 data
+essay Baseline1
+essay Baseline2
+essay B2+SMove
Feature Comparison Results
State Feature | #Diff's | %P.C. | E.C.R.
Student Move | 5 | 43.4% | 49.17
Concept Repetition | 3 | 25.5% | 42.56
Frustration | 1 | 0.03% | 32.99
Percent Correctness | 3 | 11.19% | 28.50
Reduced state size: Certainty = {Cert+Neutral, Uncert}
Trend that SMove and Concept Repetition are the best features
B2 ECR = 31.92
Baseline 1: Upper = 23.65, Lower = 0.24
Baseline 2: Upper = 57.16, Lower = 39.62
B2 + Concept Repetition: Upper = 64.30, Lower = 49.16
B2 + Percent Correctness: Upper = 48.42, Lower = 32.86
B2 + Student Move: Upper = 61.36, Lower = 39.94
Discussion
Baseline 2 has the crossover effect and policy stability
More complex features (B2 + X) have the crossover effect, but it is not clear whether their policies are stable (some stabilize at 17 students)
Indicates that 100 dialogues isn't enough for even this simple MDP (though it is enough to feel confident about Baseline 2?)