
INTERSPEECH 2013, 25-29 August 2013, Lyon, France
Markers of confidence and correctness in spoken medical narratives
Kathryn Womack¹, Cecilia Ovesdotter Alm², Cara Calvelli³, Jeff B. Pelz⁴, Pengcheng Shi⁵, Anne Haake⁵
¹Dept. of ASL & Interpreting Edu., ²Dept. of English, ³College of Health Sciences & Tech., ⁴Center for Imaging Science, ⁵Computing & Information Sciences
Rochester Institute of Technology, Rochester, NY, USA
Abstract
This study explores diagnostic correctness and physicians' self-assessed feelings of confidence in spoken medical narratives.
Dermatologists were shown images of dermatological cases and
asked to narrate their diagnostic thought processes by providing
a description of the case, a list of differential diagnoses, a final
diagnosis, and the percent confidence (from 0 to 100%) in their
final diagnosis. We describe this novel corpus and present a case
study from the dataset. We then report on predictive models for
diagnostic accuracy and physician-reported confidence as a way
of studying how these extralinguistic features affect the speech
of the physicians, comparing use of narrative features, prosodic
features, and disfluency features for classification.
Index Terms: spoken medical narratives, diagnostic correctness, diagnostic confidence, medical reasoning

1. Introduction

In the medical domain, the rate of diagnostic errors is estimated to be around 15% [1, 2]. Overconfidence is a significant cause of clinical misdiagnoses [1, 2, 3, 4, 5, 6]. Being able to recognize the speech patterns of physicians who are confident versus unsure, and correct versus incorrect in their final diagnosis, would be useful, alongside eye-movement information [7], to better understand the cognitive reasoning processes of domain experts. This can serve as a first step toward mapping out where the diagnostic reasoning and decision-making processes break down when errors occur.

Specifically, there is a need for the development of physician support systems and training systems which are user-centered and adaptive to the individual. We envision systems that can identify if users' diagnostic reasoning strategies take a wrong turn, potentially leading toward an incorrect diagnosis. Tracking speech patterns that act as flags of decision-making problems or over- vs. underconfidence would be useful as one form of real-time behavioral sensor in such systems.

How uncertainty is affected by and plays a role in complex problem-solving is a prominent focus of current research. Highlighting the importance of accurately monitoring one's self-perceived versus actual performance, residents (experts in training) tend to show less correlation between self-reported confidence and diagnostic correctness [1, 8]. Physicians who are more confident are also less likely to use decision support systems [9], which means that they could miss useful available information due to overconfidence. Systems that effectively contribute to training and accurate self-monitoring are one way to work toward reducing misdiagnoses.

We make several contributions in this work. First, we review the literature relating to confidence, both in and out of the clinical domain, and to the development of medical expertise. Second, we present and analyze a novel dataset, comprising spoken narratives by dermatologists who were asked to narrate their medical reasoning process during medical image inspection tasks. Third, we report and compare predictive models that automatically infer diagnostic correctness or confidence using prosodic features, narrative features, and disfluency features. This allows us to better understand which features help predict confidence and diagnostic correctness, respectively, and to better understand the confidence-correctness relationship.
2. Previous work
Characterizing how speech contributes to conveying confidence
remains an unresolved topic, and describing the expression of
confidence is a multi-faceted problem. Studies have looked at
the speaker’s reported confidence as well as the perceptual rating by listeners of the speaker’s confidence. There is a lack of
consensus on the relationship between actual versus perceived
confidence and how they are expressed in speech.
Speech features play an important role in the way that
listeners judge confidence. Intonation, the use of hedges or
fillers, silent pause duration, and utterance duration have all
been shown to inform automatic confidence classification or
show statistical correlation [10, 11, 12, 13, 14]. Smith and
Clark found in a question-answering study that respondents
took longer and used more hedges when they were unsure of
the correct answer [15]. This study considers medical decision
making, which depends on expertise and reasoning as opposed
to simple memory recall, but similar phenomena likely apply.
Importantly, judgments by listeners of speakers’ confidence
also have effects on discourse and implications for learning and,
accordingly, for designing learning systems. For example, Barr found in a teaching study that listeners learned more effectively from speakers who sounded confident (e.g. used few hesitations), and postulated that this is related to the listeners' judgment of confidence in the speaker [16]. Moreover, Forbes-Riley and Litman found that a computer tutoring system that adjusted its responses based on its estimate of the human user's confidence resulted in more efficient learning by the student [17, 18], showing the importance both of automatic confidence detection and of listeners' perception of speakers' confidence.
Although listener perception is important, many studies assume that listener perception accurately reflects the actual confidence level of the speaker, which is not necessarily the case.
Pon-Barry and Shieber report that, in their spoken question-answering study, self-reported confidence by participants was
on average lower than the annotators’ rating of the speaker’s
confidence [13]. Additionally, only 73% of their participants
were labeled as “self-aware” ([13], p. 5), meaning that they were
confident when correct and unsure when incorrect.
In the legal domain, facial recognition studies examining
the accuracy and confidence of eyewitnesses have found weak
or variable correlation between participants’ confidence and
correctness [19, 20] and that the correlation can be changed
by manipulating factors such as lighting and viewing duration
[21, 22]. Similar findings hold true in the equally high-stakes
domain of medical reasoning. Studies in both domains have
shown stronger intra-subject correlation of accuracy and confidence and weak inter-subject correlation [8, 23], meaning that
an individual might give consistent confidence ratings, but individual judgments of “high” and “low” confidence differ.
How decision makers’ confidence and diagnostic correctness interact in medical diagnostic decision making is also an
ongoing subject of research. Many studies have found that there
is weak or variable correlation between physician confidence
and diagnostic accuracy (e.g. [1, 2, 24, 25]). Understanding
how these two factors interact, which factors influence them,
and how they are expressed in speech, would be useful to improve the human-centered design of physician training systems
and, broadly, to further our understanding of how these extralinguistic characteristics are expressed.
The current study builds on previous findings by using a
spoken corpus, in which we have both diagnostic correctness
and self-reported diagnostic confidence ratings, in order to better understand how the two compare with respect to specific
speech features. McCoy et al. postulated that indicators of
speaker certainty help mirror the correctness of a medical expert
speaker, with encouraging initial prediction results performing
around or above tough baselines, despite a modest dataset size
and notable class imbalance [26].
3. Data characterization and analysis
We use a novel corpus consisting of 868 medical narratives¹, involving 29 dermatologists: 11 participants were practicing attending dermatologists and 18 were dermatology residents in training. Each participant was individually shown 30 images of dermatological conditions in a random order and asked to provide a description of the case; a list of differential diagnoses to consider; a final diagnosis; and a confidence score, a percentage from 0 to 100% of how confident they felt in their final diagnosis, with 100% being completely sure. The conditions included a variety of types of lesions and diseases that ranged from common to rare, so as to include images across a range of difficulty.

All narratives are in English and were collected in an elicitation scenario in North America. Speech and eye movements were simultaneously recorded; this study focuses on the spoken narratives, transcripts, and annotations. Each narrative was manually time-aligned by a single transcriber with experience in this domain, using Praat [27]. All spoken tokens were transcribed, including silent pauses over 0.3 seconds and disfluencies (filled pauses, edits, etc.). A domain expert (co-author Dr. Calvelli) annotated narratives for diagnostic correctness.

The corpus contained over 92,000 tokens, with an average of about 100 tokens per narrative and approximately 1 minute duration per narrative. The average confidence score over all narratives was 63%. Boxplots comparing confidence scores and total narrative duration are in Figure 1. The mean confidence of participants on a specific image ranges from 44% to 98% (see images A and D in section 3.1), and physicians' mean confidence scores over all images range from 25% to 84%. Table 1 shows the numbers of narratives with high vs. medium vs. low confidence, equally binned by score, and correct vs. incorrect diagnosis. More than two-thirds of correct narratives have a high confidence score. In addition, over three-quarters of incorrect narratives have a medium or high confidence score; estimating diagnostic accuracy appears in many cases to be problematic, with a tendency toward overconfidence².

¹ Two narratives were lost in data collection due to technical issues.
² We recognize that other divisions of confidence levels are possible and that physicians were not given the opportunity to administer follow-up tests, as would generally be the case in a diagnostic scenario.

                      Confidence
             High (≥66)   Med (33-65)   Low (<33)
  Correct       235           73            33
  Incorrect     220          195           112

Table 1: Number of narratives with correct vs. incorrect final diagnosis and high, medium, or low confidence.

[Figure 1: Fundamental descriptive statistics of narratives in this corpus, with median (red bar) and interquartile range (boxed). (a) Confidence when correct; (b) Confidence when incorrect; (c) Narrative duration (sec) when confidence is high (≥66%); (d) Narrative duration (sec) when confidence is low (<33%).]
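To make the binning used in Table 1 concrete, here is a minimal Python sketch (our own illustration, not the authors' code; the data tuples are toy values) that maps 0-100% confidence scores to the three bins and cross-tabulates them against diagnostic correctness:

```python
# Minimal sketch: 3-scale confidence binning and cross-tabulation, as in Table 1.
from collections import Counter

def confidence_bin(score: float) -> str:
    """Map a 0-100 confidence score to the paper's 3-scale bins."""
    if score >= 66:
        return "high"
    elif score >= 33:
        return "med"
    return "low"

def cross_tab(narratives):
    """narratives: iterable of (confidence_score, is_correct) pairs."""
    counts = Counter()
    for score, correct in narratives:
        counts[("correct" if correct else "incorrect", confidence_bin(score))] += 1
    return counts

# Example usage with toy values, not the actual corpus:
print(cross_tab([(98, True), (44, False), (70, False)]))
```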
3.1. Case study
To provide an example of the narratives in this corpus and the effects of diagnostic correctness and participants' confidence, we here examine four images, each having 29 narratives. Basic information about these four images is in Table 2, with examples from spoken narratives in Table 3 and images in Figure 2.

Image A had the correct final diagnosis given by all participants, and the average confidence score was 98%. The pattern followed by most participants was to describe the lesion and immediately name the final diagnosis. Some participants listed one or two other possible diagnoses in the differential while others omitted the differential. This explains in part why the average duration of narratives for this image is shorter.

All participants gave the correct final diagnosis for image B, while none gave the correct final diagnosis for image C. Nonetheless, these two images have similar typical narrative structure patterns. For both, the average confidence score was high: 77%. Most participants in both images gave several possible diagnoses in the differential. Commonly, they hedged that one diagnosis was ‘more likely’ while other diagnoses ‘could be’ included in the differential as a ‘possibility’. The key difference between these two images is that the ‘most likely’ diagnosis in image B was correct, while the ‘most likely’ diagnosis in image C was incorrect (usually the same incorrect choice).

In contrast, image D had no correct final diagnoses given, and the average confidence score was 44%. Almost all participants gave an extensive differential with three to five different possible diagnoses, but only three participants listed the correct diagnosis in the differential. Final diagnoses for image D were varied, and most participants made different decisions.

There are two methods of clinical decision-making: the intuitive process, which relies on exemplar-based pattern recognition and is typically shorter; and the analytic process, which steps through all available information and is typically longer [6, 28]. The intuitive process is less reliable because it relies on heuristics and pattern-matching, and is therefore more subjective, whereas the analytic process takes longer but is more objective [28, 29, 30]. These case studies could show different patterns because of the distinctiveness of the diagnosis for image A, combined with the rarity of the diagnosis for image D. Additionally, the diagnoses for images C and D rely in part on patient history and could be confirmed with clinical tests. The physicians are missing vital diagnostic information, which explains the high rate of incorrect diagnoses, but this also provides us with valuable observations of how much information the physicians can (or cannot) glean from the image alone.

Many of the physicians seem to follow the intuitive process for image A, giving less evidence and a short (or omitted) differential. In contrast, for image D, many physicians seem to use the analytic process, discussing the case in depth and giving several options. Narratives for images B and C seem to be mixed, and it is not easy to tell exactly which method the physician might be using. These results provide additional evidence that physicians might resort to the more cautious analytic method when they are unsure or have insufficient experience to use the intuitive process, and that confidence and the dual reasoning methods are related. In addition, certain clues in the structure of spoken medical narratives (e.g. the narrative's duration) could suggest which type of reasoning the physician is using.
  ID   % Corr.   Conf.   Dur.   # Tokens   # in Diff.
  A     100%      98%    42 s      75          1
  B     100%      77%    58 s     101          3
  C       0%      77%    62 s     103          4
  D       0%      44%    92 s     140          5

Table 2: For four images: the percent of participants who gave the correct diagnosis, mean confidence scores, mean narrative duration, mean number of tokens in a narrative, and mean number of diagnoses listed in the differentials.

[Figure 2: Case study images A, B, and D. (Image C not shown for privacy.) Images used with permission from Logical Images, Inc.]

A: “Um. Numerous tan to gray-brown, um, verrucous stuck-on plaques. Differential diagnosis [is] seborrheic keratosis. Diagnosis [is] seborrheic keratosis. Percent certain [is] hundred percent.”

B: “Okay, localized to the right lower cutaneous lip and right mucosal lip of an apparently young white male, there’s a depigmented patch. [It is] very well-located delina– de– uh, delineated. And, uh, this is vitiligo. Differential diagnosis might include post-inflammatory hypopigmentation, but I think one-hundred percent sure, this is vitiligo.”

C: “So [this is the] face of a older [patient]. Some – some prominent erythema periocularly. There’s some serum crust as well. Top thing I would consider is allergic contact dermatitis. Other consideration[s] might be dermatomyositis, but I favor allergic contact dermatitis and I’d rate it probably at, um, eighty-five percent.”

D: “So we got [what] looks to be a extremity, probably a n– leg and thigh in a Caucasian. There’s multiple erythematous papules, uh, some of which have a central crust, such as the one in the center of the view. Um, there’s one large e– erythematous annular plaque on the lower edge of the view [but] it’s not in good focus. Differential’s broad here as well. You could think LyP, PLEVA, uh, psoriasis with, uh, impetigo. You could think, um, prurigo nodularis. You could think, um, yeah. You could think disseminated [actinic] porokeratosis, DSAP. But, uh, I would say either LyP or PLEVA, that would be my top two differential. And I would say, given that look, maybe – maybe PLEVA, final answer. [I’m] fifty percent confident.”

Table 3: Examples to illustrate common reasoning patterns for each of the 4 images; punctuation and capitalization added for readability. The correct diagnoses for images C and D are italicized in the differential (dermatomyositis and LyP, respectively).
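The hedging observed in the narratives for images B and C (‘more likely’, ‘could be’, ‘possibility’) can be quantified with a simple lexicon count. The sketch below is illustrative only; the hedge list is our assumption, not an inventory from the paper:

```python
# Illustrative sketch (assumed hedge lexicon): count hedging phrases of the
# kind observed in the Table 3 narratives.
import re

HEDGES = ["more likely", "most likely", "could be", "might", "probably",
          "possibility", "i would say", "maybe"]  # assumed list, not from the paper

def count_hedges(transcript: str) -> dict:
    text = transcript.lower()
    return {h: len(re.findall(r"\b" + re.escape(h) + r"\b", text)) for h in HEDGES}

# Example on a fragment of narrative D from Table 3:
narrative_d = "You could think LyP, PLEVA... maybe PLEVA, final answer."
print(count_hedges(narrative_d))
```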
4. Prediction models
The J48 decision tree in Weka³ was used for classification because of its ease of human readability and of evaluating individual features. Three groups of features, and their combinations, were evaluated at the narrative level: prosody features, disfluency features, and narrative features, listed in Table 4.

For development, parameter tuning was performed with 60% of the corpus as a development training set and 20% as a development testing set. Following the tuning phase, given the small size of the corpus, we report on evaluation that used 15-fold cross-validation on the whole corpus. This partitioning ensured that training and testing sets differed compared to the development tuning phase and also allowed a more balanced understanding of the classification performance across test folds.

³ See http://www.cs.waikato.ac.nz/ml/weka/.
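As a concrete sketch of this protocol, the snippet below uses scikit-learn's decision tree as a stand-in for Weka's J48 (an analogue, not the authors' implementation; the feature matrix and labels are random placeholders): a 60%/20% development split for tuning, followed by 15-fold cross-validation over the full data.

```python
# Sketch of the evaluation protocol; scikit-learn's CART tree stands in for
# Weka's J48 (an assumed analogue), and the data arrays are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(868, 30)          # placeholder feature matrix
y = np.random.randint(0, 2, 868)     # placeholder correctness labels

# Development phase: tune on a 60% training / 20% testing split
# (the remaining 20% is left untouched during tuning).
X_dev, X_rest, y_dev, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_devtest, _, y_devtest, _ = train_test_split(X_rest, y_rest, train_size=0.5, random_state=0)
clf = DecisionTreeClassifier(min_samples_leaf=8)  # rough analogue of J48's min instances
clf.fit(X_dev, y_dev)
print("dev accuracy:", clf.score(X_devtest, y_devtest))

# Final evaluation: 15-fold cross-validation over the whole corpus.
scores = cross_val_score(DecisionTreeClassifier(min_samples_leaf=8), X, y, cv=15)
print("mean CV accuracy:", scores.mean())
```

Note that J48's confidence-factor pruning has no direct scikit-learn equivalent; cost-complexity pruning (ccp_alpha) is only a loose counterpart.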
4.1. Predicting diagnostic correctness

Experiments⁴ for diagnostic correctness (i.e. correct or incorrect) were run for each of the feature group combinations, with and without the narratives' confidence score as a feature. The majority-class baseline was 61%. Most of the experimental runs had above-baseline performance (see results in Table 5). Experiments that included confidence as a feature had on average 5% higher accuracy than trials that did not include the confidence score. In fact, prediction using only confidence as a feature was as good as or better than other feature combinations (except, marginally, prosody with disfluency features and confidence), revealing confidence as a fair indicator of correctness. Decision tree inspection shows that very high confidence tends to be used to label correct diagnoses, but also results in substantial erroneous predictions, mirroring the complex link between confidence and correctness.

⁴ Parameters – confidence factor: 0.15, min instances to branch: 8.
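The confidence-only comparison can be sketched as follows, against a majority-class dummy baseline; the data here are random placeholders and scikit-learn's tree is again only an assumed analogue of J48:

```python
# Sketch of the 4.1 comparison: majority-class baseline vs. a tree trained
# on the confidence score alone (toy data; names are our assumptions).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

confidence = np.random.uniform(0, 100, 868).reshape(-1, 1)  # placeholder scores
correct = np.random.randint(0, 2, 868)                      # placeholder labels

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           confidence, correct, cv=15).mean()
conf_only = cross_val_score(DecisionTreeClassifier(min_samples_leaf=8),
                            confidence, correct, cv=15).mean()
print(f"majority baseline: {baseline:.2f}, confidence-only tree: {conf_only:.2f}")
```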
4.2. Predicting confidence

For confidence score prediction⁵, we binned the confidence scores equally into 2-scale, 3-scale, 4-scale, and 5-scale bins, to compare the accuracy of each. Because the average confidence score was quite high, all trials had high majority-class baselines. All 2-scale prediction trials performed within ±1% of baseline, likely because of class imbalance, as the majority-class baseline was 78%. Most trials for 3-scale and 4-scale predictions were above baseline, with 3-scale predictions (high, medium, and low; see Table 1) having the highest overall accuracy relative to its baseline of 52% (see Table 5).

Across the different scales for confidence, trials which included narrative features and correctness had the highest accuracy. This finding, that narrative features (such as duration) relate to confidence, mirrors the discussion in section 3.1 that the method of clinical decision-making (i.e. analytic vs. intuitive) and confidence also seem to be expressed via narrative features.

The best performance in the 3-scale scenario was obtained with prosodic and narrative features, with diagnostic correctness (+7% above baseline). Including correctness as a feature improved most experiments, but not all, and correctness as the only feature did not score above baseline for the 3-scale case (see Table 5). This contrasts with the usefulness of confidence for predicting correctness, which could reflect self-awareness issues. The binning approach was ad hoc and yielded unbalanced classes; alternative binning or sampling methods might capture more useful class boundaries.

⁵ Parameters – confidence factor: 0.45, min instances to branch: 8.
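A minimal sketch of equal-width binning into k scales follows (our illustration; the paper's exact integer boundaries, e.g. 33-65 for medium on the 3-scale, may differ by a point from strict equal width):

```python
# Sketch of equal-width binning of 0-100% confidence into k bins (k = 2..5),
# as in the 4.2 experiments; function name and boundary handling are assumptions.
def bin_confidence(score: float, k: int) -> int:
    """Return the bin index (0 = lowest) for a 0-100 score under k equal bins."""
    width = 100 / k
    return min(int(score // width), k - 1)  # clamp 100% into the top bin

for k in (2, 3, 5):
    print(k, [bin_confidence(s, k) for s in (0, 33, 66, 100)])
```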
4.3. Important features for classification

Weka's CFS subset evaluator [31] was used after testing to see which features were valuable for classification. 15-fold cross-validation was used, providing the percentage of folds in which each feature was useful. A feature was considered important if it was useful in more than 50% of test folds (see Table 4).

For confidence predictions, the top features for each scale were narrative duration (the total time the physician viewed the image) and duration of speech (from the time they started talking until they finished), which have shown differences in other analyses (see Figure 1, section 3.1). This explains in part why experiments involving narrative features out-performed other experiments. Correctness was useful for each confidence scale except 2-scale (but correctness as the only feature was uninformative, resulting only in majority-class prediction, as seen in Table 5).

Few features were useful for diagnostic correctness prediction; the confidence score was the only feature that was useful in all of the testing folds. Useful features appear in each feature group, which could explain why there was low variation in accuracy between experiments.

Predicting diagnostic confidence and correctness, and comprehensively mapping out the confidence-correctness relationship, may benefit from more sophisticated algorithms and features, including semantic, clinical, or contextual features, as well as syntactic parsing or segmentation (e.g. [32]). Error analysis and other evaluation metrics, outside the scope of this work, may also yield new information.
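The more-than-50%-of-folds criterion can be mimicked as below; scikit-learn has no CFS implementation, so a univariate selector stands in for Weka's CfsSubsetEval (an assumption on our part), and the data are placeholders:

```python
# Sketch of the 4.3 criterion: run feature selection per CV fold and keep
# features chosen in more than 50% of folds.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import KFold

X = np.random.rand(868, 30)        # placeholder feature matrix
y = np.random.randint(0, 2, 868)   # placeholder labels
hits = np.zeros(X.shape[1])

folds = KFold(n_splits=15, shuffle=True, random_state=0)
for train_idx, _ in folds.split(X):
    sel = SelectKBest(mutual_info_classif, k=10).fit(X[train_idx], y[train_idx])
    hits += sel.get_support()      # boolean mask of selected features

important = np.where(hits / folds.get_n_splits() > 0.5)[0]
print("features useful in >50% of folds:", important)
```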
Prosody features:
  Min. pitch (Hz)
  Min. intensity (dB)
  Time of min. pitch (s) *
  Time of min. intensity (s) •
  Max. pitch (Hz)
  Max. intensity (dB)
  Time of max. pitch (s)
  Time of max. intensity (s)
  Mean pitch (Hz) *
  Mean intensity (dB)
  St. deviation of pitch
  St. deviation of intensity *
  Pitch range (max − min)
  Intensity range (max − min)
Disfluency features:
  % Silent pauses (s)
  # Silent pauses / # tokens
  % Filled pauses (s)
  # Filled pauses / # tokens
  % False starts (s)
  # False starts / # tokens
  % Repetitions (s)
  # Repetitions / # tokens
  # Disfluent tokens / # non-disfluent tokens • *
Narrative features:
  # Tokens
  # Tokens per second *
  Total duration (s) *
  Total duration of speech (s) • *
  Type-token ratio (TTR)
  TTR (without disfluencies)
  Length of initial silent pause (s)

Table 4: List of features considered for classification. Features marked with • were useful for correctness prediction while features marked with * were useful for confidence prediction.
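To make a few of these feature definitions concrete, this sketch computes some prosody and narrative features for one narrative; the input format (a Praat-style pitch track plus a token list) is our assumption, not the authors' pipeline:

```python
# Sketch computing a subset of the Table 4 features for one narrative.
import numpy as np

def prosody_features(pitch_hz: np.ndarray, times_s: np.ndarray) -> dict:
    voiced = pitch_hz > 0                        # ignore unvoiced frames
    p, t = pitch_hz[voiced], times_s[voiced]
    return {
        "min_pitch": p.min(), "max_pitch": p.max(), "mean_pitch": p.mean(),
        "sd_pitch": p.std(), "pitch_range": p.max() - p.min(),
        "time_of_min_pitch": t[p.argmin()], "time_of_max_pitch": t[p.argmax()],
    }

def narrative_features(tokens: list[str], duration_s: float) -> dict:
    types = {tok.lower() for tok in tokens}
    return {
        "n_tokens": len(tokens),
        "tokens_per_sec": len(tokens) / duration_s,
        "ttr": len(types) / len(tokens),         # type-token ratio
    }

# Example usage on a toy token sequence:
print(narrative_features("um numerous tan to gray-brown um plaques".split(), 5.2))
```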
            P    N    D   P+N  P+D  N+D  All    ø
  + C      66   67   68   65   69   68   67    68
  – C      60   63   61   62   60   64   64     —
  + Dx     54   56   50   59   50   56   58    52
  – Dx     52   57   52   55   51   56   56     —

Table 5: Top: Correctness prediction accuracy (%) with confidence as a feature (+C) and removed (–C); baseline 61%. Bottom: 3-scale confidence prediction accuracy (%) with correctness as a feature (+Dx) and removed (–Dx); baseline 52%. P = prosody, N = narrative, D = disfluency features; ø = no feature-group features.
5. Conclusion
We have described a novel dataset of medical narratives detailing dermatologists' reasoning and decision-making processes while examining dermatological cases. Having physicians' self-reported confidence in their final diagnosis provides a new facet for examining the relationship between confidence and correctness, and how they are encoded in spoken medical expert tasks.

Case studies of four images show a possible link among narrative-level features (e.g. narrative duration), diagnostic confidence, diagnostic correctness, and method of reasoning (i.e. analytic vs. intuitive). Automatic prediction classifiers for correctness and confidence show that confidence scores inform diagnostic correctness rather than the reverse, and that certain narrative features help predict confidence above baseline. The complex confidence-correctness relationship deserves more attention, e.g. how individuals' self-awareness or decision-makers' risk-taking profiles may play a role.

Results using shallow features are a good starting point, but they highlight the need for deep-level meaningful features, including medical concepts [33]. This study has provided initial, tentative encouragement for predicting diagnostic correctness and confidence with some success, with potential for beneficial integration into multi-modal support systems that also consider other information and modalities, such as eye gaze.
6. Acknowledgements
This work was supported by NIH grant 1 R21 LM010039-01A1. We appreciate the contributions of the reviewers, participants, the transcriber, and the research group; Dr. Lowell Goldsmith for helpful suggestions; and Logical Images, Inc. for images.
7. References
[1] E. S. Berner and M. L. Graber, “Overconfidence as a cause of diagnostic error in medicine,” The American Journal of Medicine, vol. 121, no. 5A, pp. S2–S23, 2008.
[2] M. Graber, “Diagnostic errors in medicine: A case of neglect,” The Joint Commission Journal on Quality and Patient Safety, pp. 106–113, 2005.
[3] K. H. Hall, “Reviewing intuitive decision-making and uncertainty: The implications for medical education,” Medical Education, vol. 36, pp. 216–224, 2002.
[4] G. R. Norman and K. W. Eva, “Diagnostic error and clinical reasoning,” Medical Education, vol. 44, pp. 94–100, 2010.
[5] P. Croskerry, “The importance of cognitive errors in diagnosis and strategies to minimize them,” Academic Medicine, vol. 78, pp. 775–780, 2003.
[6] P. Croskerry, “Perspectives on diagnostic failure and patient safety,” Healthcare Quarterly, pp. 50–56.
[7] R. Li, J. Pelz, P. Shi, and A. R. Haake, “Learning image-derived eye movement patterns to characterize perceptual expertise,” in CogSci, 2012, pp. 190–195.
[8] C. P. Friedman, G. G. Gatti, T. M. Franz, G. C. Murphy, F. M. Wolf, P. S. Heckerling, P. L. Fine, T. M. Miller, and A. S. Elstein, “Do physicians know when their diagnoses are correct? Implications for decision support and error reduction,” Journal of General Internal Medicine, vol. 20, no. 4, pp. 334–339, 2005.
[9] S. Dreiseitl and M. Binder, “Do physicians value decision support? A look at the effect of decision support systems on physician opinion,” Artificial Intelligence in Medicine, vol. 33, no. 1, pp. 25–30, 2005.
[10] S. E. Brennan and M. Williams, “The feeling of another’s knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers,” Journal of Memory and Language, vol. 34, pp. 383–398, 1995.
[11] C. Dijkstra, E. Krahmer, and M. Swerts, “Manipulating uncertainty: The contribution of different audiovisual prosodic cues to the perception of confidence,” Proceedings of Speech Prosody, 2006.
[12] J. Liscombe, J. Hirschberg, and J. Venditti, “Detecting certainness in spoken tutorial dialogues,” Proceedings of Interspeech, pp. 1837–1840, 2005.
[13] H. Pon-Barry and S. M. Shieber, “Recognizing uncertainty in speech,” EURASIP Journal on Advances in Signal Processing, 2011.
[14] H. Pon-Barry, “Prosodic manifestations of confidence and uncertainty in spoken language,” Proceedings of Interspeech, pp. 74–77, 2008.
[15] V. L. Smith and H. H. Clark, “On the course of answering questions,” Journal of Memory and Language, vol. 32, pp. 25–38, 1993.
[16] D. J. Barr, “Paralinguistic correlates of conceptual structure,” Psychonomic Bulletin & Review, vol. 10, no. 2, pp. 462–467, 2003.
[17] K. Forbes-Riley and D. Litman, “Designing and evaluating a wizarded uncertainty-adaptive spoken dialogue tutoring system,” Computer Speech and Language, 2009.
[18] K. Forbes-Riley and D. Litman, “Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor,” Speech Communication Special Issue: Sensing Emotion and Affect, 2011.
[19] T. A. Busey, J. Tunnicliff, G. R. Loftus, and E. F. Loftus, “Accounts of the confidence-accuracy relation in recognition memory,” Psychonomic Bulletin & Review, vol. 7, no. 1, pp. 26–48, 2000.
[20] K. Krug, “The relationship between confidence and accuracy: Current thoughts of the literature and a new area of research,” Applied Psychology in Criminal Justice, vol. 3, no. 1, pp. 7–41, 2007.
[21] D. S. Lindsay, J. D. Read, and K. Sharma, “Accuracy and confidence in person identification: The relationship is strong when witnessing conditions vary widely,” Psychological Science, vol. 9, no. 3, pp. 215–218, 1998.
[22] J. S. Shaw and T. K. Zerr, “Extra effort during memory retrieval may be associated with increases in eyewitness confidence,” Law and Human Behavior, vol. 27, no. 3, pp. 315–329, 2003.
[23] B. A. Spellman and E. R. Tenney, “Credible testimony in and out of court,” Psychonomic Bulletin & Review, vol. 17, no. 2, pp. 168–173, 2010.
[24] D. A. Davis, P. E. Mazmanian, M. Fordis, R. V. Harrison, K. E. Thorpe, and L. Perrier, “Accuracy of physician self-assessment compared with observed measures of competence,” Journal of the American Medical Association, vol. 296, no. 9, pp. 1094–1102, 2006.
[25] A. Baumann, “Overconfidence among physicians and nurses: The ‘micro-certainty, macro-uncertainty’ phenomenon,” Social Science & Medicine, vol. 32, no. 2, pp. 167–174, 1991.
[26] W. McCoy, C. O. Alm, C. Calvelli, J. B. Pelz, P. Shi, and A. Haake, “Linking uncertainty in physicians’ narratives to diagnostic correctness,” in Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics. Jeju, Republic of Korea: Association for Computational Linguistics, July 2012, pp. 19–27.
[27] P. Boersma, “Praat, a system for doing phonetics by computer,” Glot International, pp. 341–345, 2001.
[28] P. Croskerry, “A universal model of diagnostic reasoning,” Academic Medicine, pp. 1022–1028, 2009.
[29] P. Croskerry, “The theory and practice of clinical decision-making,” Canadian Journal of Anesthesia, vol. 52, no. 6, pp. R1–R8, 2005.
[30] P. Croskerry, “Clinical cognition and diagnostic error: Applications of a dual process model of reasoning,” Advances in Health Science Education, vol. 14, pp. 27–35, 2009.
[31] M. A. Hall, “Correlation-based feature selection for machine learning,” Ph.D. dissertation, University of Waikato, 1999.
[32] E. Shriberg, A. Stolcke, D. Hakkani-Tür, and G. Tür, “Prosody-based automatic segmentation of speech into sentences and topics,” Speech Communication, vol. 32, no. 1-2, pp. 127–154, 2000.
[33] X. Guo et al., 2013. Work in preparation.