Using linguistic analysis to characterize conceptual units of thought in spoken medical narratives

INTERSPEECH 2013, 25-29 August 2013, Lyon, France
Kathryn Womack¹, Cecilia Ovesdotter Alm², Cara Calvelli³,
Jeff B. Pelz⁴, Pengcheng Shi⁵, Anne Haake⁵
¹Dept. of ASL & Interpreting Edu., ²Dept. of English, ³College of Health Sciences & Tech.,
⁴Center for Imaging Science, ⁵Computing & Information Sciences
Rochester Institute of Technology, Rochester, NY, USA
[email protected], [email protected], [email protected]
[email protected], [email protected], [email protected]
Abstract
This study explores spoken medical narratives in which dermatologists were shown images of dermatological conditions and
asked to explain their reasoning process while working toward
a diagnosis. This corpus has been annotated by a domain expert for information-rich conceptual units of thought, providing
an opportunity for analysis of the link between diagnostic reasoning steps and speech features. We explore these annotations with regard to speech disfluencies, prosody, and type-token ratios, finding that speech tagged within thought units differs from non-tagged speech in each of these respects. Additionally, we discuss differences in the temporal distribution of thought units based on diagnostic correctness.
Index Terms: conceptual units of thought, speech features, disfluencies, diagnostic correctness, medical reasoning

1. Introduction

In expert domains such as the medical field, high-stakes cognitive reasoning and decision-making take place. This study aims to further our understanding of dermatological diagnostic decision making as expressed in speech. We use a spoken corpus with manual annotations by an expert for diagnostic units of thought (henceforth, thought units), i.e. semantic units which capture specific conceptual steps of the diagnostic reasoning process, such as describing the form or distribution of a lesion. We look at patterns of speech disfluencies, prosody, and type-token ratios with respect to diagnostic thought units in the medical narratives. We also describe differences in the temporal trajectory of thought unit tags based on whether the given final diagnosis is correct or incorrect.

Analyzing linguistic phenomena such as speech disfluencies in this specialized medical domain helps clarify how they vary across situations and whether they can serve as a meaningful linguistic window onto the medical reasoning process. For example, studies have shown that disfluencies indicate cognitive processing (e.g. [1, 2, 3]), but such research has thus far largely focused on casual or conversational discourse. We investigate whether there is linguistic evidence that the information coded by thought units can be distinguished from other speech in the narratives, since thought units reflect critical steps in the medical decision-making process that may require greater cognitive demand. Identifying these patterns is also a step towards automatic labelling of diagnostic thought units or other similar semantic phenomena [4] and spoken language understanding [5, 6].

We find evidence that patterns of speech within these information-rich thought units differ in character from the surrounding speech and from patterns reported in previous work on casual speech corpora. This finding is a step toward being able to identify, with the help of linguistic evidence from speech, where reasoning errors such as misdiagnoses occur.
2. Previous work
Researchers have considered the relevance of patterns of disfluencies for examining cognitive load from a linguistic perspective, but more work is needed that takes advantage of such
speech features, and other patterns of speech, as a window to
understanding the medical reasoning process. For example,
filled pauses tend to precede topics that are being introduced
to the discourse or are relatively complex [1, 7, 8, 9]. Disfluencies have also been found useful in information extraction and
segmentation of speech [4, 10].
In related work, silent pauses have typically not been considered speech disfluencies. Some studies have examined the distribution and patterns of silent pauses [11, 12] and listeners' perception of pauses [13, 14], but silent pauses are generally not treated as belonging to the same category as filled pauses or repairs.
However, it has been suggested that listeners seem to better
comprehend words or phrases following disfluencies because of
the pause before the word, not necessarily because of the disfluency itself; and that this effect holds equally for filled or silent
pauses of the same duration [15, 16]. Comparison of event-related potential (ERP) work by MacGregor et al. [17] suggests that the cognitive effects of disfluent silent pauses on listeners are the same as those of filled pauses [18], but that the effects of disfluent
repetitions differ [19]. While it is unlikely that silent pauses
function identically to other disfluencies, their production and
reception need to be further examined and better understood. For
this reason, this study treats silent pauses as a type of disfluency.
Improving our understanding of the relationship between linguistic features and medical reasoning offers a new view of medical reasoning and can help identify, from speech features, where this process breaks down and leads to an incorrect diagnosis. Speech features in this corpus have been able to predict diagnostic correctness with encouraging initial accuracy, as reported by McCoy et al. [20], but further analysis is necessary to improve our understanding of the underlying processes.

The current model of medical decision making describes two types of reasoning: the “intuitive” process, which relies on pattern matching based on the physician's experience, and the step-by-step “analytic” process, which carefully considers all information [21]. These systems can break down, or the system used may be inadequate, for a number of reasons. This study's findings can lead toward predicting which reasoning process a physician is using and where in the reasoning process they might make a wrong turn that leads to an incorrect diagnosis.
3. Data and annotation
Thought unit             Abbr.   Example
Demographics             DEM     elderly
Configuration            CON     annular
Distribution             DIS     bilaterally
Location                 LOC     elbow
Primary                  PRI     plaque
Secondary                SEC     crusted
Differential Diagnosis   DIF     this could be a or b
Final Diagnosis          DX      this is a
Recommendation           REC     needs a biopsy

Table 1: List of considered thought unit labels. Primary and Secondary refer to medical lesion morphology [25].
The corpus used in this paper consists of transcripts acquired
from a data elicitation scenario involving 16 dermatologists.
Each dermatologist was shown 50 images of dermatological
conditions and asked to describe the image while working toward a diagnosis. This study uses a subset of the narratives
which have been specially annotated and are discussed below.
The data elicitation used a modification of the Master-Apprentice scenario described by Beyer and Holtzblatt [22]. A
student “apprentice” sat silently with an expert dermatologist
“master” while the “master” narrated his or her thoughts. The
narratives are monologues; however, this scenario creates the
feeling of dialogue in the experts and encourages narration of
their thoughts in rich detail.
This study considers the narratives’ speech, transcripts, and
annotations. The narratives were manually single-annotated and
time-aligned at the word level using Praat [23]; Figure 1 shows
an example. Four transcribers worked on time-alignment with
supervision. Every speech token was transcribed, including disfluencies and silent pauses over 0.3 seconds. For methodological reasons, analysis considers silent pauses that occur after the
physician started speaking and before they finished speaking.

Figure 1: Screenshot of Praat, used to time-align each narrative's speech with its transcription and annotations, and to extract speech information for analysis.
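As an illustration of the silent-pause criterion just described, the following sketch (not the project's actual tooling) filters time-aligned tokens so that only silent pauses of at least 0.3 seconds occurring between the physician's first and last speech tokens are kept. The CSV column names and the load_tokens helper are hypothetical; the time alignment itself would come from the Praat TextGrids.

```python
# Minimal sketch, assuming tokens exported from the Praat TextGrids as
# one (label, start, end) row per token; this file layout is hypothetical.
import csv

MIN_PAUSE = 0.3  # seconds; threshold used for transcribed silent pauses

def load_tokens(csv_path):
    """Read time-aligned tokens as (label, start, end) tuples."""
    with open(csv_path, newline="") as f:
        return [(row["label"], float(row["start"]), float(row["end"]))
                for row in csv.DictReader(f)]

def analysis_pauses(tokens):
    """Return silent pauses that qualify for analysis: >= 0.3 s and strictly
    between the first and last speech tokens of the narrative."""
    speech_idx = [i for i, (label, _, _) in enumerate(tokens) if label != "SIL"]
    if not speech_idx:
        return []
    first, last = speech_idx[0], speech_idx[-1]
    return [tok for tok in tokens[first + 1:last]
            if tok[0] == "SIL" and (tok[2] - tok[1]) >= MIN_PAUSE]
```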
The thought unit annotation scheme encodes narratives using nine thought units, listed in Table 1, which attempt to identify common, important steps in the diagnostic process; this
annotation scheme is discussed in detail by McCoy et al. [24].
Initially, co-author Calvelli and a second dermatologist annotated 60 narratives. This allowed evaluation of inter- and intra-annotator agreement, which was moderate to good [24]. For
this study, the annotated subcorpus was expanded to include an
additional 157 narratives annotated by Dr. Calvelli, leading to a
total of 217 annotated narratives. Current analysis accordingly
considers Dr. Calvelli’s initial annotation and the extended annotated subcorpus (N=217).
There were two differences between the initial and the more
recent annotation scenarios. The first is that the transcripts
given to the annotators for the original 60 narratives had been
mostly cleaned of disfluencies and given punctuation for readability, while the second set of transcripts included all tokens
and had no punctuation; cleaning was not needed since no
project-external annotator was included. The second change
was a slight redefining of the primary and secondary medical
lesion morphologies. Thus, for comparability across the two annotation rounds, this work considers these two tags as one merged unit: Morphology (MOR).
The considered subset of narratives with thought units totals
almost 3 hours of audio or about 18,200 tokens; the narratives
are, on average, 45 seconds. Slightly more than one-third of
tokens are tagged in thought units. Individual tags, on average,
span 2.6 tokens and 1.3 seconds. Labels vary in frequency, with
MOR tags being the most frequent (∼1,000 tags over ∼2,000
tokens), then DIF (∼400 tags over ∼1,300 tokens), and REC
being the least frequent (11 tags over 62 tokens).
Additional annotation for correctness was done on the full
corpus by three annotators [24]. In this study, we consider a
narrative to have a correct diagnosis if 2 or 3 of the annotators
agree that the final diagnosis is correct.
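The correctness rule above (at least two of the three annotators judging the final diagnosis correct) amounts to a simple majority vote; a minimal sketch, with an assumed boolean-vote representation, is:

```python
# Sketch of the 2-of-3 majority rule; the list-of-booleans input is assumed.
def narrative_is_correct(votes):
    """votes: iterable of three booleans, one per annotator."""
    return sum(bool(v) for v in votes) >= 2

assert narrative_is_correct([True, True, False])
assert not narrative_is_correct([True, False, False])
```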
4. Results and discussion
4.1. Disfluencies and thought units
Several common types of speech disfluencies are examined here
(with examples in Table 2): silent pauses (“SIL”) (1), non-nasal
filled pauses (ah, er, and uh) (2), nasal filled pauses (hm and um)
(3), repetitions (4), and edits (5). For this study, a repetition is
a one- or two-word phrase that is repeated with a silent pause,
filled pause, or nothing between the words; e.g. they uh they
show. An edit is a word that was cut off; e.g. on the r- left wrist.
... [um]3 so given the distribution [my [SIL]1 my]4 differential
would be [uh]2 like [a a]4 contact either [allergi-]5 allergic
contact dermatitis maybe ...
Table 2: Example of disfluency types in the corpus. (Taken from
a similar elicitation situation and used with permission.)
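As a rough sketch of how these disfluency categories might be labelled automatically on a token sequence, the code below applies the definitions given above; the transcription conventions it assumes (the literal token "SIL" for silent pauses and a trailing hyphen marking a cut-off word) follow the example, and the category names are illustrative, not the authors' annotation scheme.

```python
# Rough sketch only; label names and token conventions are assumptions.
NON_NASAL = {"ah", "er", "uh"}
NASAL = {"hm", "um"}

def label_disfluencies(tokens):
    """Return a parallel list of labels: SIL, FILLED_NN, FILLED_N, EDIT, REP, or None."""
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok == "SIL":                      # silent pause (>= 0.3 s in this corpus)
            labels[i] = "SIL"
        elif tok.lower() in NON_NASAL:
            labels[i] = "FILLED_NN"           # non-nasal filled pause
        elif tok.lower() in NASAL:
            labels[i] = "FILLED_N"            # nasal filled pause
        elif tok.endswith("-"):
            labels[i] = "EDIT"                # cut-off word, e.g. "allergi-"
    # Repetitions: a one- or two-word phrase repeated with nothing, a silent
    # pause, or a filled pause in between (e.g. "they uh they").
    content = [(i, t.lower()) for i, t in enumerate(tokens)
               if t != "SIL" and t.lower() not in NON_NASAL | NASAL]
    for n in (1, 2):
        for k in range(len(content) - 2 * n + 1):
            first = [w for _, w in content[k:k + n]]
            second = [w for _, w in content[k + n:k + 2 * n]]
            if first == second:
                for idx, _ in content[k:k + 2 * n]:
                    labels[idx] = labels[idx] or "REP"
    return labels

# e.g. label_disfluencies(["they", "uh", "they", "show"])
#      -> ["REP", "FILLED_NN", "REP", None]
```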
The disfluency types and their proportions are shown in
Figure 2. Silent pauses comprised the majority of disfluencies,
followed by filled pauses, with repetitions and edits having the
smallest proportions.
A total of 25% of tokens in the subcorpus were disfluent;
17% of all tokens were silent pauses and about 6.5% filled
pauses. This percentage of disfluencies is higher than that reported for several corpora of casual conversations or interviews (e.g. in the range of 3.5 to 6 disfluencies/100 words [26, 27, 28, 29]). This difference could be due to how each study defines disfluencies and could also indicate that the narratives come from a technical expert domain.

For the analysis of disfluencies, the spans of thought unit tags were expanded by one token to the left and right because the first annotation had been done on cleaned transcripts (cf. section 3).¹ Thought unit tags varied in frequency and span, so the analysis considers disfluencies as a proportion of tagged tokens.

¹ The span of non-tagged text was not expanded, to avoid overlap. If two tags were adjacent, tokens were assigned only to their actual tag. If one token fell between two tags (e.g. "SIL" in Figure 1), the token was assigned to the left tag.

As shown in Figure 3, the only two categories with higher-than-usual disfluency percentages (compared with the disfluency percentage over all tokens) were Morphology-tagged tokens and non-tagged tokens. The Morphology tag had the highest percentage of silent pauses (22%), while other disfluencies were more prominent outside of thought unit tags.

The high percentage of silent pauses in Morphology tags could indicate physicians' thoughtfulness. Participants tended to identify patient demographics, location, and distribution of the lesion before or while discussing the medical morphology,
meaning that the description of morphology is the first cognitively demanding clinical reasoning step they will take. Importantly, in the full corpus of 800 narratives, 62% of narratives in
which the physician provided a correct medical morphology had
a correct final diagnosis, while only 37% of narratives with an
incorrect or missing morphology had a correct final diagnosis.
Correctly identifying the morphology is a vital step that guides
clinicians to home in on specific potential diagnoses [25].
Thus, analysis indicates that tagged and non-tagged text behave differently. This could be partly because thought unit-tagged text often consists of short noun phrases. Additionally, disfluencies are likely uttered earlier in the narrative as cognitive
clues related to a concept mentioned later on. Disfluencies tend
to precede cognitively demanding steps (cf. section 2), so one
would expect that more disfluencies would be found preceding
tagged speech. Gaze analysis of eye-tracking data co-collected with speech in this corpus has found that people tend to look at
lesion features before they discuss them [30]. Similarly, speech
phenomena that occur earlier in the narrative may be indicative
of cognitive reasoning that is related to utterances later on.
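A minimal sketch of the proportion analysis described in this subsection is given below: tag spans are widened by one token on each side (only into untagged tokens, with ties going to the left tag as in footnote 1), and disfluencies are counted as a fraction of the tokens falling under each label. The (tags, disfluent) parallel-list representation is an assumption for illustration.

```python
# Sketch only; the parallel-list token representation is assumed.
from collections import Counter

def disfluency_proportions(tags, disfluent):
    """tags[i]: thought unit label or None; disfluent[i]: True if token i is disfluent."""
    assert len(tags) == len(disfluent)
    widened = list(tags)
    for i, tag in enumerate(tags):
        if tag is None:
            continue
        if i > 0 and tags[i - 1] is None and widened[i - 1] is None:
            widened[i - 1] = tag              # widen one token to the left (left tag wins)
        if i + 1 < len(tags) and tags[i + 1] is None:
            widened[i + 1] = tag              # widen one token to the right
    totals, counts = Counter(), Counter()
    for tag, is_disf in zip(widened, disfluent):
        key = tag if tag is not None else "NON-TAGGED"
        totals[key] += 1
        counts[key] += int(is_disf)
    return {key: counts[key] / totals[key] for key in totals}
```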
Figure 2: The percentages (of all disfluencies) of each disfluency type in the subcorpus. The number of instances of each:
silent pause (3067); non-nasal filled pause (779); nasal filled
pause (436); edits (156); repetitions (102).
4.2. Type versus token ratio and thought units
The type/token ratio (TTR) can indicate a text's stylistic properties. In this data, the average TTR of the narratives is 0.6, or 0.8 if disfluencies are removed; the average TTR of non-tagged text only is very similar. In comparison, the average TTR over each type of thought unit tag² is 0.9 and did not change when disfluencies were removed; as discussed above, disfluencies are less prevalent in thought unit spans. The high TTR of thought unit tags indicates that they are rich and dense in vocabulary, likely because the vocabulary is specialized and thought unit tags typically span only a few tokens. There was little variation in the average TTR across specific thought units, ranging from 0.9 to ∼1. These findings indicate that thought units and non-tagged text are qualitatively different, supporting the hypothesis that thought unit tags capture specific information and behave differently from non-tagged text.

² TTR measures the number of unique types divided by the number of total tokens. For each narrative, all words within each thought unit tag over the whole narrative were considered for TTR measurement, even if the tags were split; e.g. "[elderly male]DEM ... [elderly patient]DEM" would have a TTR of 3/4 = 0.75 (not 1 and 1).
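A minimal sketch of the TTR measure defined in footnote 2, including its split-tag example, might look as follows; the list-of-(word, tag) input format is an illustrative assumption.

```python
# Sketch of per-tag TTR for one narrative; the (word, tag) pair format is assumed.
from collections import defaultdict

def ttr_per_tag(tokens):
    """tokens: list of (word, tag) pairs for one narrative; tag is None when untagged."""
    pooled = defaultdict(list)
    for word, tag in tokens:
        if tag is not None:
            pooled[tag].append(word.lower())   # pool all words per label, even across split spans
    return {tag: len(set(words)) / len(words) for tag, words in pooled.items()}

# Footnote 2's example: "[elderly male]DEM ... [elderly patient]DEM" -> 3 types / 4 tokens.
example = [("elderly", "DEM"), ("male", "DEM"), ("elderly", "DEM"), ("patient", "DEM")]
assert abs(ttr_per_tag(example)["DEM"] - 0.75) < 1e-9
```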
Figure 3: The proportion of non-disfluent words vs. each type of
disfluency in each thought unit. For example, 12% of all tokens
tagged as REC were silent pauses.
4.3. Prosody and thought units
                        Tagged    Non-tagged
Duration (mean)         0.48 s    0.50 s
Duration (st. dev.)     0.31      0.74
Pitch (mean)            66 Hz     63 Hz
Pitch (st. dev.)        28        36
Intensity (mean)        148 dB    135 dB
Intensity (st. dev.)    55        75

Table 3: Prosody features in tokens tagged with thought units and non-tagged tokens. Each test was significant, p<0.01.
Additional analysis tested the duration, mean pitch, and mean intensity of individual tokens tagged with thought units against non-tagged tokens.³ As shown in Table 3,
mean pitch and mean intensity were both higher for thought
unit-tagged tokens, which could be an indication of prominence
or importance in tagged speech [31], although further analysis
would be needed. The typically short span of thought unit tags and the specific parts of speech captured by them also play a role
in these results. Variance in the data was higher for every test
in non-tagged tokens, showing additional evidence that thought
unit-tagged tokens have less variability than non-tagged tokens.

³ Features were extracted with Praat and tested with two-sample t-tests assuming unequal variances.
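The comparison reported in Table 3 can be reproduced in outline with Welch's two-sample t-test, as stated in footnote 3. The sketch below assumes per-token duration, mean pitch, and mean intensity values have already been extracted (e.g. with Praat); the dictionary-of-lists layout is illustrative only.

```python
# Sketch of the unequal-variance (Welch) t-tests over per-token prosodic features.
from scipy.stats import ttest_ind

def compare_groups(tagged, non_tagged, features=("duration", "pitch", "intensity")):
    """tagged / non_tagged: dicts mapping feature name -> list of per-token values."""
    results = {}
    for feat in features:
        stat, p = ttest_ind(tagged[feat], non_tagged[feat], equal_var=False)
        results[feat] = {"t": stat, "p": p}
    return results
```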
5. Diagnostic correctness and thought units
We visually examined the number of each thought unit tag
along the narratives’ temporal progress, comparing narratives
in which the observer gave the correct final diagnosis against
narratives with an incorrect final diagnosis, shown in Figure
4. There are several differences in the temporal trajectories of
thought units that could provide clues as to the thought processes of the physicians in these contexts.
In correct narratives, the number of MOR tags peaks between 25% and 60% of the way through the narrative. In contrast, MOR tags in incorrect narratives peak around 20% and
then trail off. In incorrect narratives, physicians may have paid less attention to the Morphology description, giving it earlier and being less likely to revisit it than in correct narratives, where they might have been more likely to provide evidence for their favored diagnosis.
The temporal distribution of DX vs. DIF tags also differs. In
correct narratives, few DIF tags are present and many DX tags
are given at the end. In incorrect narratives, there are many DIF tags during the last quarter of the narrative and few DX tags. If participants were more confident when they were correct, they may have chosen to either omit or give a smaller differential and provided a final diagnosis more readily.⁴ When incorrect, participants might have been unsure and given a longer differential instead of settling on one diagnosis.

⁴ Instructions to the participants during data elicitation were flexible to allow more natural descriptions, and physicians were not specifically asked to provide a differential diagnosis.

No real difference was found in the duration of each thought unit in correct versus incorrect narratives, with the exception of MOR tags. Morphology tags in correct narratives on average lasted for 9% of the narrative's duration, while they lasted for only 5% of incorrect narratives' duration (p<0.01).⁵ This again indicates the importance of identifying the medical morphology as a critical medical reasoning step, although it is difficult to determine the relationship between morphology and diagnostic correctness. It could be that physicians who spend more time and provide more detail on morphology notice more, giving them additional clues for the final diagnosis; or it could be that physicians who give the correct diagnosis are more familiar with that diagnosis' morphology and can explain more or draw on this experience.

⁵ A two-sample t-test assuming unequal variances was used.

Figure 4: The location of thought units as percent temporal progression through the narratives, comparing correct narratives (top, N=123) against incorrect narratives (bottom, N=94). The peak of each thought unit is labelled; REC and CON are not labelled because there are few occurrences.
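A rough sketch of how temporal trajectories like those in Figure 4 could be computed: each tag occurrence is placed at its midpoint as a fraction of the narrative's duration and binned separately for correct and incorrect narratives. The narrative dictionary format is an assumed representation, not the project's data format.

```python
# Sketch only; the per-narrative dict layout is an assumption for illustration.
from collections import defaultdict

def tag_trajectories(narratives, n_bins=10):
    """narratives: list of dicts with keys 'correct' (bool), 'duration' (s),
    and 'tags' (list of (label, start, end) in seconds)."""
    counts = {True: defaultdict(lambda: [0] * n_bins),
              False: defaultdict(lambda: [0] * n_bins)}
    for nar in narratives:
        for label, start, end in nar["tags"]:
            midpoint = (start + end) / 2.0 / nar["duration"]   # 0..1 temporal progression
            bin_idx = min(int(midpoint * n_bins), n_bins - 1)
            counts[nar["correct"]][label][bin_idx] += 1
    return counts
```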
6. Conclusions
This study examines information-rich thought unit annotation
in spoken medical narratives, with the finding that speech tokens tagged in thought units linguistically behave distinctly
from non-tagged tokens in various ways, and that the temporal
patterns of thought units are different in narratives with a correct diagnosis vs. an incorrect diagnosis. These findings are a
step toward characterizing information-rich speech in clinical
decision-making, information extraction of semantic thought
units, and identification of misdiagnoses.
In particular, several intriguing results have surfaced relating to the medical lesion morphology step in the diagnostic process, which describes the form of the lesion. Morphology
tags have the highest percentage of silent pauses and an above-average percentage of total disfluencies. The duration of Morphology tags in correct narratives is significantly longer than in
incorrect narratives, and Morphology tags follow a different temporal trajectory depending on diagnostic correctness. Moreover, based on the full corpus, we know that physicians who
give the correct medical morphology are more likely to provide
the correct diagnosis. These findings point toward the need for
further analysis of this critical medical reasoning step.
Further work will continue to explore the relationship between medical reasoning and linguistic patterns in medical narratives based on diagnostic correctness, physician expertise, and
image characteristics, such as perceived vs. actual difficulty of
a diagnostic case or type of lesion. We have also, more recently,
collected a similar, larger corpus in which physicians were additionally asked to provide their level of confidence in their final
diagnosis, which provides the opportunity to explore another
facet of diagnostic reasoning.
7. Acknowledgements
This work was partially supported by NIH grant 1 R21
LM010039-01A1 and NSF grant IIS-0941452. We appreciate
the contributions of the participants, annotators, reviewers, and
the research group; and Logical Images, Inc. for images.
8. References
[1] J. E. Arnold, C. L. Hudson Kam, and M. K. Tanenhaus, "If you say thee uh you are describing something hard: The on-line attribution of disfluency during reference comprehension," Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 33, no. 5, pp. 914–930, 2007.
[2] D. J. Barr and M. Seyfeddinipur, "The role of fillers in listener attributions for speaker disfluency," Language and Cognitive Processes, vol. 25, no. 4, pp. 441–455, 2010.
[3] V. L. Smith and H. H. Clark, "On the course of answering questions," Journal of Memory and Language, vol. 32, pp. 25–38, 1993.
[4] E. Shriberg, A. Stolcke, D. Hakkani-Tür, and G. Tür, "Prosody-based automatic segmentation of speech into sentences and topics," Speech Communication, vol. 32, no. 1-2, pp. 127–154, 2000.
[5] R. De Mori, F. Béchet, D. Hakkani-Tür, M. McTear, G. Riccardi, and G. Tür, "Spoken language understanding: Interpreting the signs given by a speech signal," IEEE Signal Processing Magazine, vol. 25, pp. 50–58, 2008.
[6] Y.-Y. Wang, L. Deng, and A. Acero, "Spoken language understanding," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 16–31, 2005.
[7] J. E. Arnold, M. Fagnano, and M. K. Tanenhaus, "Disfluencies signal theee, um, new information," Journal of Psycholinguistic Research, vol. 32, no. 1, pp. 25–36, 2003.
[8] S. E. Brennan and M. Williams, "The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers," Journal of Memory and Language, vol. 34, pp. 383–398, 1995.
[9] M. Corley and O. W. Stewart, "Hesitation disfluencies in spontaneous speech: The meaning of um," Language and Linguistics Compass, vol. 2, no. 4, pp. 589–602, 2008.
[10] P. A. Heeman, "Modeling spontaneous speech events during recognition," in First International Conference on Universal Access in Human-Computer Interaction, 2001.
[11] E. Campione and J. Véronis, "A large-scale multilingual study of silent pause duration," in Speech Prosody, 2002.
[12] K. Kirsner, J. Dunn, and K. Hird, "Fluency: Time for a paradigm shift," in Proceedings of DiSS'03, Disfluency in Spontaneous Speech Workshop, pp. 13–16, 2003.
[13] B. Megyesi and S. Gustafson-Čapková, "Pausing in dialogues and read speech in Swedish: Speakers' production and listeners' interpretation," in Proceedings of the Workshop on Prosody in Speech Recognition and Understanding, pp. 107–113, 2001.
[14] T. Lövgren and J. van Doorn, "Influence of manipulation of short silent pause duration on speech fluency," in Proceedings of DiSS'05, Disfluency in Spontaneous Speech Workshop, pp. 123–126, 2005.
[15] K. G. Bailey and F. Ferreira, "Disfluencies affect the parsing of garden-path sentences," Journal of Memory and Language, vol. 49, pp. 183–200, 2003.
[16] S. E. Brennan and M. F. Schober, "How listeners compensate for disfluencies in spontaneous speech," Journal of Memory and Language, vol. 44, pp. 274–296, 2001.
[17] L. J. MacGregor, M. Corley, and D. I. Donaldson, "Listening to the sound of silence: Disfluent silent pauses in speech have consequences for listeners," Neuropsychologia, vol. 48, pp. 3982–3992, 2010.
[18] M. Corley, L. J. MacGregor, and D. I. Donaldson, "It's the way that you, er, say it: Hesitations in speech affect language comprehension," Cognition, vol. 105, pp. 658–668, 2007.
[19] L. J. MacGregor, M. Corley, and D. I. Donaldson, "Not all disfluencies are are equal: The effects of disfluent repetitions on language comprehension," Brain and Language, vol. 111, pp. 36–45, 2009.
[20] W. McCoy, C. O. Alm, C. Calvelli, J. B. Pelz, P. Shi, and A. Haake, "Linking uncertainty in physicians' narratives to diagnostic correctness," in Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics. Jeju, Republic of Korea: Association for Computational Linguistics, July 2012, pp. 19–27.
[21] P. Croskerry, "A universal model of diagnostic reasoning," Academic Medicine, pp. 1022–1028, 2009.
[22] H. Beyer and K. Holtzblatt, Contextual Design: Defining Customer-Centered Systems. Morgan Kaufmann, 1997.
[23] P. Boersma, "Praat, a system for doing phonetics by computer," Glot International, pp. 341–345, 2001.
[24] W. McCoy, C. O. Alm, C. Calvelli, R. Li, J. Pelz, P. Shi, and A. Haake, "Annotation schemes to encode domain knowledge in medical narratives," in Proceedings of the Sixth Linguistic Annotation Workshop. Jeju, Republic of Korea: Association for Computational Linguistics, July 2012, pp. 95–103.
[25] K. Wolff, T. B. Fitzpatrick, R. A. Johnson, and D. Suurmond, Fitzpatrick's Color Atlas and Synopsis of Clinical Dermatology, 6th ed. McGraw-Hill Medical Pub. Division, 2005.
[26] H. Bortfeld, S. D. Leon, J. E. Bloom, M. F. Schober, and S. E. Brennan, "Disfluency rates in conversation: Effects of age, relationship, topic, role, and gender," Language and Speech, vol. 44, no. 2, pp. 123–147, 2001.
[27] E. Shriberg, "To 'errrr' is human: Ecology and acoustics of speech disfluencies," Journal of the International Phonetic Association, vol. 31, no. 1, 2001.
[28] S. Oviatt, "Predicting and managing spoken disfluencies during human-computer interaction," Computer Speech and Language, vol. 9, pp. 19–35, 1995.
[29] S. Schachter, N. Christenfeld, B. Ravina, and F. Bilous, "Speech disfluency and the structure of knowledge," Journal of Personality and Social Psychology, vol. 60, no. 3, pp. 362–367, 1991.
[30] P. Vaidyanathan, J. Pelz, W. McCoy, C. Calvelli, C. O. Alm, P. Shi, and A. Haake, "Visualinguistic approach to medical image understanding," in Proceedings of the AMIA 2012 Annual Symposium, p. 1661, 2012.
[31] D. G. Watson, J. E. Arnold, and M. K. Tanenhaus, "Tic tac toe: Effects of predictability and importance on acoustic prominence in language production," Cognition, vol. 106, pp. 1548–1557, 2008.