Using Natural Language Processing to Analyze Tutorial Dialogue Corpora Across Domains and Modalities

Diane Litman, University of Pittsburgh, Pittsburgh, PA
Johanna Moore, Myroslava Dzikovska, Elaine Farrow, University of Edinburgh, Edinburgh, Scotland

Outline
• Introduction
• Dialogue Data and Prior Results
• A Common Research Framework
• Predicting Learning from Student Dialogue
• Summary and Future Work

Motivation
An empirical basis for designing tutorial dialogue systems
• What aspects of dialogue are predictive of learning?
  – Student behaviors
• Do results generalize across tutoring situations?
  – Domain (mechanics versus electricity in physics)
  – Modality (spoken versus typed)
  – Tutor (computer versus human)
• Can natural language processing be used for automation?

Tutorial Dialogue Research
• Many correlations between dialogue and learning
  – e.g. [Chi et al. 2001, Katz et al. 2003, Rose et al. 2003, Craig et al. 2004, Boyer et al. 2007, Ohlsson et al. 2007]
• Difficult to generalize findings due to different
  – annotation schemes
  – learning measures
  – statistical approaches
  – software tools

Two Prior Tutorial Dialogue Corpora
ITSPOKE
  – Spoken dialogue with a computer tutor
  – Conceptual mechanics
  – 100 dialogues, 20 students
  – Back-end is Why2-Atlas (VanLehn, Jordan, Rose et al., 2002)
  – Sphinx2 speech recognition and Cepstral text-to-speech
BEETLE
  – Typed dialogue with human tutors
  – Basic electricity and electronics
  – 60 dialogues, 30 students

BEETLE Student Screen
[Figure: lesson slides, simulation-based circuit workspace, and chat window]

Common Experimental Aspects
Data Collection
  – Students take a multiple-choice pretest
  – Students work problems with the dialogue tutor
  – Students take a (non-identical) posttest
Data Analysis
  – Dialogues are annotated with various tagsets
  – Quantitative measures of dialogue behavior are examined for correlations with learning

Prior Correlation Results
• Student domain content and novel dialogue content positively predict learning
  – in both ITSPOKE and BEETLE
• However,
measures such as domain content are computed differently across systems
  – ITSPOKE: # of student lexical items in an online physics dictionary
    » Tutor: What is the definition of Newton's second law?
    » Student: an object in motion tends to stay in motion until its act by an outside force
  – BEETLE: # of student segments containing information relevant to lesson topics
    » Tutor: If bulb B is damaged, what do you think will happen to bulbs A and C?
    » Student: A and C will not light up.
• Would other findings have generalized with a more uniform analysis?
  – affect/attitudes (ITSPOKE only)
  – words or turns, accuracy, impasses (BEETLE only)

A Common Research Framework: I
• Map related but non-identical annotations to identical tagsets
  – Word tokenizer
  – Dictionary-based domain content tagger
• Additional BEETLE experiments
  – Impact of domain dictionary
  – Impact of automated content tagging

Methods: I
• Extract quantitative measures of student dialogue behavior from tagged data (normalized by total words)
  – StuWords / Words
  – StuPhysicsDictWords / Words
  – StuBeetleDictWords / Words
  – StuContentSegmentWords / Words
• Correlate measures with learning gain
  – partial correlations with Posttest, after regressing out Pretest

Results I: Correlations with Posttest (controlled for Pretest)

  Measure                        BEETLE        ITSPOKE
                                 R     p       R     p
  StuWords / Words               .34   .08     .17   .48
  StuPhysicsDictWords / Words    .22   .26     .60   .01
  StuBeetleDictWords / Words     .38   .04     NA    NA
  StuContentSegmWords / Words    .43   .02     NA    NA

• Student talk is not significantly correlated with learning (although a trend in BEETLE)
• Domain talk is significantly correlated with learning
• The domain dictionary matters
• Using natural language processing for domain tagging is a viable alternative to manual annotation of contentful discourse segments

A Common Research Framework: II
• Map related but non-identical annotations to common higher-level theoretical constructs
  – DeMAND coding scheme [Campbell et al. poster]
  – Dialogues uniformly represented/queried using NXT, NLTK

DeMAND: Coding Utterances for Significant Events
Consider common theories of effective learning events:
• Constructivism / generative learning – Duffy & Jonassen, 1992
  » NOVELTY: student produces a lot of new information
• Impasses – VanLehn et al., 2003
  » ACCURACY & DOUBT: student is incorrect, or correct with low confidence
• Accountable talk – Wolf, Crosson & Resnick, 2006
  » ACCURACY & DEPTH: student is both accurate & deep
• Deep processing / cognitive effort – Nolen, 1988
  » DEPTH: student utterances are deep (regardless of accuracy)

Mapping Tag Dimensions to Constructs
  Effort
    – BEETLE: Depth=Yes
    – ITSPOKE: Answer=Deep
  Accountable
    – BEETLE: Depth=Yes & Accuracy=(Correct V Missing)
    – ITSPOKE: Answer=Deep & Accuracy=Correct
  Impasses
    – BEETLE: (Doubt=Yes & Accuracy=(Correct V Missing)) V Accuracy=(Incorrect V Errors)
    – ITSPOKE: (Certainness=(Uncertain V Mixed) & Accuracy=Correct) V Accuracy=(Incorrect V Partial)

• Some learning event constructs map directly from the tagging dimensions
  – cognitive effort: student utterances that are deep
• Other constructs map tag values from multiple dimensions
  – accountable talk: utterances that are accurate and deep
  – student impasses: utterances that are correct with doubt, or incorrect

Methods: II
• Extract quantitative measures of student dialogue behavior from tagged data (normalized)
  – Tag Dimensions
    » % depth, % novelty, % accuracy, % doubt
  – Learning Constructs
    » % effort, % knowledge construction, % impasses, % accountable
• Predict learning gain from dialogue measures –
multivariate linear regression
  – dependent measure: posttest
  – independent measures: pretest and sets of dialogue measures

Results II: Regressions with Posttest

  Measures            BEETLE                           ITSPOKE
                      Predictors             R2   p    Predictors             R2   p
  4 Tags              %Right                 .46  .00  %Right                 .23  .03
  4 Constructs        %Impasses              .22  .01  %Accountable, %Effort  .50  .01
  ITSPOKE Constructs  %Accountable, %Effort  .18  .07

• The same tag dimension (%Right) is selected as most predictive of learning across corpora, after unifying methods
  – BEETLE Accuracy = Correct V Missing
  – ITSPOKE Accuracy = Correct
• When using stepwise regression, different learning constructs are selected as best predictors across corpora
• However, constructs trained from ITSPOKE predict learning when tested on BEETLE
  – both predictors individually significant (p < .03) for BEETLE

Summary
Methods for uniformly annotating and statistically analyzing previously collected dialogue corpora
• Enhancement of original findings
  – Replication of positive correlations with student domain content
  – Impact of dictionary
  – Use of natural language processing for automated tagging
• Emergence of new results across corpora
  – Positive correlations with student accuracy
  – Accountable talk and student effort together predict learning

Future Directions
• Further generalization of prior results
  – tutor behaviors (e.g., questioning, restating)
  – additional corpora
• More sophisticated natural language processing for content tagging

Thank You! Questions?

Graphical User Interface
[Figure: tutoring interface screenshot]

Annotated Human-Human Excerpt
  T: Which one will be faster? [Short Answer Question]
  S: The feathers. [Novel/Single Answer]
  T: The feathers - why? [Restatement, Deep Answer Question]
  S: Because there's less matter. [Deep Answer]
All turns in both corpora were manually coded for dialogue acts (Kappa > .6)

Question: If bulb B is damaged, what do you think will happen to bulbs A and C?

Accountable Talk
  utt83a: student: both bulbs A and C will go out because this scenario would act the same as if there was an open circuit
  Accuracy = Correct; Cognitive Processing = Present

Non-Accountable Talk
  utt69: student: A and C will not light up
  Accuracy = Correct; Cognitive Processing = Absent

Cognitive Effort and Potential Impasse
  utt122a: student: bulb a will light but b and c won't since b is damaged and breaks the closed path circuit
  Accuracy = Incorrect; Cognitive Processing = Present

Potential Impasse
  utt97: student: both would be either dim or not light I would think
  Accuracy = Partially Correct; Cognitive Processing = Absent; Signs of Low Confidence = Yes
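The Methods I analysis (dictionary-based domain content tagging, normalized word measures, and partial correlations with posttest after regressing out pretest) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the dictionary entries are invented, and real analyses would use a full tokenizer and a statistics package.

```python
# Sketch of the Methods I pipeline: tag student words against a domain
# dictionary, compute a normalized measure, and take a partial correlation.
# PHYSICS_DICT is a hypothetical stand-in for the online physics dictionary.
from math import sqrt

PHYSICS_DICT = {"force", "motion", "acceleration", "velocity", "mass"}

def content_measure(student_turns):
    """StuPhysicsDictWords / Words: fraction of student words found in the dictionary."""
    words = [w.strip(".,?!").lower() for turn in student_turns for w in turn.split()]
    if not words:
        return 0.0
    return sum(w in PHYSICS_DICT for w in words) / len(words)

def partial_corr(x, y, z):
    """Correlation of x (dialogue measure) and y (posttest) after regressing z (pretest) out of both."""
    def residuals(a, b):
        # Residuals of a after simple linear regression on b.
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        slope = sum((bi - mb) * (ai - ma) for ai, bi in zip(a, b)) / sum((bi - mb) ** 2 for bi in b)
        return [ai - (ma + slope * (bi - mb)) for ai, bi in zip(a, b)]
    rx, ry = residuals(x, z), residuals(y, z)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
```

Normalizing by total words, as on the Methods slides, keeps students with longer dialogues from dominating the measure.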
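The tag-to-construct mapping can likewise be sketched as boolean predicates over per-utterance tags. This is a hypothetical sketch using the ITSPOKE-side definitions (effort: deep; accountable: deep and correct; impasse: uncertain-or-mixed-but-correct, or incorrect/partial); the tag names and value strings are assumptions, not the authors' NXT/NLTK representation.

```python
# Map per-utterance tag dimensions to learning constructs (ITSPOKE-side rules),
# then compute normalized %-measures over a student's utterances.
def constructs(utt):
    """utt: dict with assumed keys 'answer', 'accuracy', 'certainness'."""
    deep = utt["answer"] == "deep"
    return {
        "effort": deep,                                        # cognitive effort: deep answer
        "accountable": deep and utt["accuracy"] == "correct",  # accurate and deep
        "impasse": (utt["certainness"] in ("uncertain", "mixed")
                    and utt["accuracy"] == "correct")
                   or utt["accuracy"] in ("incorrect", "partial"),
    }

def percent_measures(utts):
    """Fraction of student utterances exhibiting each construct."""
    totals = {"effort": 0, "accountable": 0, "impasse": 0}
    for u in utts:
        for name, present in constructs(u).items():
            totals[name] += present
    return {name: totals[name] / len(utts) for name in totals}
```

These normalized construct measures are the kind of independent variables the Results II regressions take alongside pretest score.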