
Using Natural Language Processing to
Analyze Tutorial Dialogue Corpora Across
Domains and Modalities
Diane Litman,
University of Pittsburgh, Pittsburgh, PA
Johanna Moore, Myroslava Dzikovska, Elaine Farrow
University of Edinburgh, Edinburgh, Scotland
Outline
 Introduction
 Dialogue Data and Prior Results
 A Common Research Framework
 Predicting Learning from Student Dialogue
 Summary and Future Work

Motivation
 An empirical basis for designing tutorial dialogue systems
 What aspects of dialogue are predictive of learning?
– Student behaviors
 Do results generalize across tutoring situations?
– Domain (mechanics versus electricity in physics)
– Modality (spoken versus typed)
– Tutor (computer versus human)
 Can natural language processing be used for automation?
Tutorial Dialogue Research
 Many correlations between dialogue and learning
– e.g. [Chi et al. 2001, Katz et al. 2003, Rose et al. 2003, Craig et al. 2004, Boyer et al. 2007, Ohlsson et al. 2007]
 Difficult to generalize findings due to different
– annotation schemes
– learning measures
– statistical approaches
– software tools
Two Prior Tutorial Dialogue Corpora
 ITSPOKE
– Spoken dialogue with a computer tutor
– Conceptual mechanics
– 100 dialogues, 20 students
– Back-end is Why2-Atlas (VanLehn, Jordan, Rose et al., 2002)
– Sphinx2 speech recognition and Cepstral text-to-speech
 BEETLE
– Typed dialogue with human tutors
– Basic electricity and electronics
– 60 dialogues, 30 students
BEETLE Student Screen
[Screenshot: lesson slides, simulation-based circuit workspace, and chat window]
Common Experimental Aspects
 Data Collection
– Students take a multiple choice pretest
– Students work problems with dialogue tutor
– Students take a (non-identical) posttest
 Data Analysis
– Dialogues are annotated with various tagsets
– Quantitative measures of dialogue behavior are examined for correlations with learning
Prior Correlation Results
 Student domain content and novel dialogue content positively predict learning
– both ITSPOKE and BEETLE
 However, measures such as domain content are computed differently across systems
– ITSPOKE: # of student lexical items in an online physics dictionary
» Tutor: What is the definition of Newton’s second law?
» Student: an object in motion tends to stay in motion until its act by an outside force
– BEETLE: # of student segments containing information relevant to lesson topics
» Tutor: If bulb B is damaged, what do you think will happen to bulbs A and C?
» Student: A and C will not light up.
 Would other findings have generalized with a more uniform analysis?
– affect/attitudes (ITSPOKE only)
– words or turns, accuracy, impasses (BEETLE only)
A Common Research Framework: I
 Map related but non-identical annotations to identical tagsets
– Word tokenizer
– Dictionary-based domain content tagger
 Additional BEETLE experiments
– Impact of domain dictionary
– Impact of automated content tagging
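The dictionary-based domain content tagger can be sketched as follows. This is an illustration, not the authors' code: the word list, the tokenizer regex, and the function names are all stand-ins (the actual physics dictionary is not given in the talk).

```python
import re

# Toy stand-in for the online physics dictionary used by ITSPOKE;
# the real word list is not given in the talk.
PHYSICS_DICT = {"force", "motion", "object", "newton", "acceleration"}

def tokenize(turn):
    """Lowercase word tokenizer (a simplification of the real one)."""
    return re.findall(r"[a-z']+", turn.lower())

def domain_word_counts(turn, dictionary):
    """Return (# dictionary words, # words) for one student turn."""
    words = tokenize(turn)
    hits = [w for w in words if w in dictionary]
    return len(hits), len(words)

# Paraphrase of the student turn shown earlier in the deck
turn = ("an object in motion tends to stay in motion "
        "until acted on by an outside force")
hits, total = domain_word_counts(turn, PHYSICS_DICT)
ratio = hits / total  # summing per student gives StuPhysicsDictWords / Words
```

Summing the two counts over all of a student's turns yields the normalized measures used in the analyses.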
Methods: I
 Extract quantitative measures of student dialogue behavior from tagged data (normalized)
– StuWords / Words
– StuPhysicsDictWords / Words
– StuBeetleDictWords / Words
– StuContentSegmentWords / Words
 Correlate measures with learning gain
– partial correlations with Posttest, after regressing out Pretest
Results I: Correlations with Posttest (controlled for Pretest)

Measure                         BEETLE          ITSPOKE
                                R      p        R      p
StuWords / Words                .34    .08      .17    .48
StuPhysicsDictWords / Words     .22    .26      .60    .01
StuBeetleDictWords / Words      .38    .04      NA     NA
StuContentSegmWords / Words     .43    .02      NA     NA

• Student talk is not significantly correlated with learning (although trend in BEETLE)
• Domain talk is significantly correlated with learning
• Domain dictionary matters
• Using natural language processing for domain tagging is a viable alternative to manual annotation of contentful discourse segments
A Common Research Framework: II
 Map related but non-identical annotations to common higher-level theoretical constructs
– DeMAND coding scheme [Campbell et al. poster]
– Dialogues uniformly represented/queried using NXT, NLTK
DeMAND: Coding Utterances for Significant Events
 Consider common theories of effective learning events
– Constructivism / generative learning [Duffy & Jonassen, 1992]: student produces a lot of new information → NOVELTY
– Impasses [VanLehn et al., 2003]: student is incorrect, or correct with low confidence → ACCURACY & DOUBT
– Accountable talk [Wolf, Crosson & Resnick, 2006]: student is both accurate & deep → ACCURACY & DEPTH
– Deep processing / cognitive effort [Nolen, 1988]: student utterances are deep (regardless of accuracy) → DEPTH
Mapping Tag Dimensions to Constructs

Construct     BEETLE                                      ITSPOKE
Effort        Depth=Yes                                   Answer=Deep
Accountable   Depth=Yes & Accuracy=(Correct V Missing)    Answer=Deep & Accuracy=Correct
Impasses      (Doubt=Yes & Accuracy=(Correct V Missing))  (Certainness=(Uncertain V Mixed) & Accuracy=Correct)
              V Accuracy=(Incorrect V Errors)             V Accuracy=(Incorrect V Partial)

• Some learning event constructs map directly from the tagging dimensions
– cognitive effort: student utterances that are deep
• Other constructs map tag values from multiple dimensions
– accountable talk: utterances that are accurate and deep
– student impasses: utterances that are correct with doubt, or incorrect
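One column of the mapping above can be transcribed literally into predicates over an utterance's tag dictionary. This sketch covers the BEETLE-style tags (Depth, Doubt, Accuracy); the exact tag-value spellings are assumptions, since only the table's shorthand is given.

```python
# Predicates mapping DeMAND-style tags to learning constructs.
# Tag names follow the mapping table; value spellings are assumed.

def effort(tags):
    """Effort: Depth=Yes."""
    return tags.get("Depth") == "Yes"

def accountable(tags):
    """Accountable talk: Depth=Yes and Accuracy in {Correct, Missing}."""
    return (tags.get("Depth") == "Yes"
            and tags.get("Accuracy") in {"Correct", "Missing"})

def impasse(tags):
    """Impasse: (Doubt=Yes and Accuracy in {Correct, Missing})
    or Accuracy in {Incorrect, Errors}."""
    return ((tags.get("Doubt") == "Yes"
             and tags.get("Accuracy") in {"Correct", "Missing"})
            or tags.get("Accuracy") in {"Incorrect", "Errors"})

utt = {"Depth": "Yes", "Accuracy": "Correct", "Doubt": "No"}
is_accountable = accountable(utt)
```

Counting utterances that satisfy each predicate, divided by total utterances, yields the normalized construct measures used in the next slides.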
Methods: II
 Extract quantitative measures of student dialogue behavior from tagged data (normalized)
– Tag Dimensions
» % depth, % novelty, % accuracy, % doubt
– Learning Constructs
» % effort, % knowledge construction, % impasses, % accountable
 Predict learning gain from dialogue measures
– multivariate linear regression
– dependent measure: posttest
– independent measures: pretest and sets of dialogue measures
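The regression setup above can be sketched as ordinary least squares via the normal equations. This is an illustration, not the authors' code: all scores below are invented, and only one dialogue measure is used where the real analysis used sets of measures.

```python
# Illustrative OLS: predict posttest from pretest plus one normalized
# dialogue measure. Invented data; not the study's actual scores.

def ols(X, y):
    """Solve (X^T X) b = X^T y by Gaussian elimination with pivoting.
    X is a list of rows; include a column of 1s for the intercept."""
    n, k = len(X), len(X[0])
    # Build the augmented normal-equation system [X^T X | X^T y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for col in range(k):                          # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
    b = [0.0] * k
    for i in reversed(range(k)):                  # back substitution
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

pretest   = [0.2, 0.4, 0.5, 0.6, 0.8]            # proportion correct
pct_right = [0.3, 0.5, 0.4, 0.7, 0.9]            # e.g. %Right per student
posttest  = [0.35, 0.55, 0.55, 0.75, 0.95]
X = [[1.0, p, r] for p, r in zip(pretest, pct_right)]
intercept, b_pretest, b_right = ols(X, posttest)
```

In practice one would use a statistics package (stepwise selection included); the point here is only the shape of the model: posttest regressed on pretest plus dialogue measures.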
Results II: Regressions with Posttest

Measures        BEETLE                            ITSPOKE
                Predictors    R2     p            Predictors              R2     p
4 Tags          %Right        .46    .00          %Right                  .23    .03
4 Constructs    %Impasses     .22    .01          %Accountable, %Effort   .50    .01

• The same tag dimension is selected as most predictive of learning across corpora, after unifying methods (%Right)
– BEETLE: Accuracy = Correct V Missing
– ITSPOKE: Accuracy = Correct
• When using stepwise regression, different learning constructs are selected as best predictors across corpora

Results II: Regressions with Posttest

Measures            BEETLE                                ITSPOKE
                    Predictors              R2     p      Predictors              R2     p
4 Tags              %Right                  .46    .00    %Right                  .23    .03
ITSPOKE Constructs  %Accountable, %Effort   .18    .07    %Accountable, %Effort   .50    .01

• However, constructs trained from ITSPOKE predict learning when tested on BEETLE
– both predictors individually significant (p < .03) for BEETLE
Summary
 Methods for uniformly annotating and statistically analyzing previously collected dialogue corpora
 Enhancement of original findings
– Replication of positive correlations with student domain content
– Impact of dictionary
– Use of natural language processing for automated tagging
 Emergence of new results across corpora
– Positive correlations with student accuracy
– Accountable talk and student effort together predict learning
Future Directions
 Further generalization of prior results
– tutor behaviors (e.g., questioning, restating)
– additional corpora
 More sophisticated natural language processing for content tagging
Thank You!
Questions?
Graphical User Interface
Annotated Human-Human Excerpt
T: Which one will be faster? [Short Answer Question]
S: The feathers. [Novel/Single Answer]
T: The feathers - why? [Restatement, Deep Answer Question]
S: Because there’s less matter. [Deep Answer]

All turns in both corpora were manually coded for dialogue acts (Kappa > .6)
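The inter-annotator agreement figure above (Kappa > .6) can be computed as in this minimal sketch of Cohen's kappa for two coders; the talk does not specify which kappa variant was used, and the labels below are invented.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders
    who labeled the same items."""
    n = len(coder_a)
    p_obs = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    ca, cb = Counter(coder_a), Counter(coder_b)
    # Expected agreement under independent coding with each coder's marginals
    p_exp = sum(ca[lab] * cb[lab] for lab in ca.keys() | cb.keys()) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Invented dialogue-act labels for four turns
a = ["ShortAnswerQ", "DeepAnswer", "ShortAnswerQ", "DeepAnswer"]
b = ["ShortAnswerQ", "DeepAnswer", "DeepAnswer", "DeepAnswer"]
kappa = cohens_kappa(a, b)
```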
Question: If bulb B is damaged, what do you think will happen to bulbs A and C?

Accountable Talk:
utt83a: student: both bulbs A and C will go out because this scenario would act the same as if there was an open circuit
Accuracy = Correct; Cognitive Processing = Present

Non-Accountable Talk:
utt69: student: A and C will not light up
Accuracy = Correct; Cognitive Processing = Absent

Cognitive Effort and Potential Impasse:
utt122a: student: bulb a will light but b and c won't since b is damaged and breaks the closed path circuit
Accuracy = Incorrect; Cognitive Processing = Present

Potential Impasse:
utt97: student: both would be either dim or not light I would think
Accuracy = Partially Correct; Cognitive Processing = Absent; Signs of Low Confidence = Yes