INTERSPEECH 2013, 25-29 August 2013, Lyon, France

Using linguistic analysis to characterize conceptual units of thought in spoken medical narratives

Kathryn Womack (1), Cecilia Ovesdotter Alm (2), Cara Calvelli (3), Jeff B. Pelz (4), Pengcheng Shi (5), Anne Haake (5)
(1) Dept. of ASL & Interpreting Edu., (2) Dept. of English, (3) College of Health Sciences & Tech., (4) Center for Imaging Science, (5) Computing & Information Sciences, Rochester Institute of Technology, Rochester, NY, USA
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

This study explores spoken medical narratives in which dermatologists were shown images of dermatological conditions and asked to explain their reasoning process while working toward a diagnosis. This corpus has been annotated by a domain expert for information-rich conceptual units of thought, providing opportunity for analysis of the link between diagnostic reasoning steps and speech features. We explore these annotations with regard to speech disfluencies, prosody, and type-token ratios, finding that speech tagged within thought units differs from non-tagged speech in each of these respects. Additionally, we discuss pattern differences in temporal thought unit distribution based on diagnostic correctness.

Index Terms: conceptual units of thought, speech features, disfluencies, diagnostic correctness, medical reasoning

1. Introduction

In expert domains such as the medical field, high-stakes cognitive reasoning and decision-making take place. This study aims to further our understanding of dermatological diagnostic decision making as expressed in speech. We use a spoken corpus with manual annotations by an expert for diagnostic units of thought (henceforth, thought units), i.e. semantic units which capture specific conceptual steps of the diagnostic reasoning process, such as describing the form or distribution of a lesion. We look at the patterns of speech disfluencies, prosody, and type-token ratios with respect to diagnostic thought units in the medical narratives. We also describe findings of differences in patterns of temporal trajectory of thought unit tags based on whether the given final diagnosis is correct or incorrect.

Analysis of linguistic phenomena, such as speech disfluencies, in this specialized medical domain is useful for understanding how they vary across situations and whether they can serve as a meaningful linguistic window onto the medical reasoning process. For example, studies have shown that disfluencies indicate cognitive processing (e.g. [1, 2, 3]), but such research thus far has largely focused on casual or conversational discourse. We investigate whether there is linguistic evidence suggesting that the information coded by thought units can be distinguished from other speech in the narratives, since they reflect critical steps in the medical decision-making process that may require greater cognitive demand. Identifying these patterns is also a step towards automatic labelling of diagnostic thought units or other similar semantic phenomena [4] and spoken language understanding [5, 6]. We find evidence that patterns of speech within these information-rich thought units differ in character from the surrounding speech and are unique compared with previous work with casual speech corpora. This finding is a step toward being able to identify, with the help of linguistic evidence from speech, where reasoning errors such as misdiagnoses occur.

2. Previous work

Researchers have considered the relevance of patterns of disfluencies for examining cognitive load from a linguistic perspective, but more work is needed that takes advantage of such speech features, and other patterns of speech, as a window to understanding the medical reasoning process. For example, filled pauses tend to precede topics that are being introduced to the discourse or are relatively complex [1, 7, 8, 9]. Disfluencies have also been found useful in information extraction and segmentation of speech [4, 10].

In related work, silent pauses are typically not considered a speech disfluency. Some studies have looked at silent pauses' distribution and patterns [11, 12] and the perception of pauses by listeners [13, 14], but silent pauses are not typically treated as being in the same category as filled pauses or repairs. However, it has been suggested that listeners seem to better comprehend words or phrases following disfluencies because of the pause before the word, not necessarily because of the disfluency itself, and that this effect holds equally for filled or silent pauses of the same duration [15, 16]. Comparison of event-related potential (ERP) work by MacGregor et al. [17] suggests that the cognitive effects of disfluent silent pauses on listeners are the same as those of filled pauses [18], but that the effects of disfluent repetitions differ [19]. While it is unlikely that silent pauses function identically to other disfluencies, their production and reception need to be further examined and better understood. For this reason, this study treats silent pauses as a type of disfluency.

Improving our understanding of the relationship between linguistic features and medical reasoning has positive implications for understanding medical reasoning in a new way and for helping to identify, with speech features, where this process breaks down and leads to an incorrect diagnosis. Speech features in this corpus have been able to predict diagnostic correctness to an initial encouraging degree of accuracy, as reported by McCoy et al. [20], but further analysis is necessary to improve our understanding of the underlying processes.

The current model of medical decision making describes two types of reasoning: the "intuitive" process, which relies on pattern matching based on the physician's experience, and the step-by-step "analytic" process, which carefully considers all information [21]. These systems can break down, or the system used may be inadequate, for a number of reasons. This study's findings can lead toward predicting which reasoning process physicians are using and where in the reasoning process they might make a wrong turn that leads to an incorrect diagnosis.
3. Data and annotation

The corpus used in this paper consists of transcripts acquired from a data elicitation scenario involving 16 dermatologists. Each dermatologist was shown 50 images of dermatological conditions and asked to describe the image while working toward a diagnosis. This study uses a subset of the narratives that has been specially annotated, as discussed below.

The data elicitation used a modification of the Master-Apprentice scenario described by Beyer and Holtzblatt [22]. A student "apprentice" sat silently with an expert dermatologist "master" while the "master" narrated his or her thoughts. The narratives are monologues; however, this scenario creates the feeling of dialogue in the experts and encourages narration of their thoughts in rich detail.

This study considers the narratives' speech, transcripts, and annotations. The narratives were manually single-annotated and time-aligned at the word level using Praat [23]; Figure 1 shows an example. Four transcribers worked on time-alignment with supervision. Every speech token was transcribed, including disfluencies and silent pauses over 0.3 seconds. For methodological reasons, analysis considers only silent pauses that occur after the physician started speaking and before they finished speaking.

Figure 1: Screenshot of Praat, used to time-align each narrative's speech with its transcription and annotations, and to extract speech information for analysis.
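To make the analyses below concrete, the following minimal Python sketch shows one possible in-memory representation of such word-level, time-aligned and tagged narratives. The class and field names (Token, Narrative, word, start, end, tag, correct_diagnosis) are illustrative assumptions for this paper's discussion, not the project's actual data format or tooling.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Token:
        """One transcribed speech token with its Praat time alignment (seconds)."""
        word: str                    # e.g. "plaque", "um", or "SIL" for a silent pause
        start: float
        end: float
        tag: Optional[str] = None    # thought unit label (e.g. "MOR", "DIF") or None

        @property
        def duration(self) -> float:
            return self.end - self.start

    @dataclass
    class Narrative:
        """One physician's narration for one image."""
        physician_id: str
        image_id: str
        tokens: List[Token] = field(default_factory=list)
        correct_diagnosis: Optional[bool] = None   # majority vote of three annotators

        @property
        def duration(self) -> float:
            return self.tokens[-1].end - self.tokens[0].start if self.tokens else 0.0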
The thought unit annotation scheme encodes narratives using nine thought units, listed in Table 1, which attempt to identify common, important steps in the diagnostic process; this annotation scheme is discussed in detail by McCoy et al. [24]. Initially, co-author Calvelli and a second dermatologist annotated 60 narratives. This allowed evaluation of inter- and intra-annotator agreement, which was moderate to good [24]. For this study, the annotated subcorpus was expanded to include an additional 157 narratives annotated by Dr. Calvelli, leading to a total of 217 annotated narratives. Current analysis accordingly considers Dr. Calvelli's initial annotation and the extended annotated subcorpus (N=217).

    Thought unit             Abbr.   Example
    Demographics             DEM     elderly
    Configuration            CON     annular
    Distribution             DIS     bilaterally
    Location                 LOC     elbow
    Primary                  PRI     plaque
    Secondary                SEC     crusted
    Differential Diagnosis   DIF     this could be a or b
    Final Diagnosis          DX      this is a
    Recommendation           REC     needs a biopsy

Table 1: List of considered thought unit labels. Primary and Secondary refer to medical lesion morphology [25].

There were two differences between the initial and the more recent annotation scenarios. The first is that the transcripts given to the annotators for the original 60 narratives had been mostly cleaned of disfluencies and given punctuation for readability, while the second set of transcripts included all tokens and had no punctuation; cleaning was not needed since no project-external annotator was included. The second change was a slight redefining of the primary and secondary medical lesion morphologies. Thus, for comparability across the two annotation rounds, this work considers these two tags as one merged unit: Morphology (MOR).

The considered subset of narratives with thought units totals almost 3 hours of audio, or about 18,200 tokens; the narratives are, on average, 45 seconds long. Slightly more than one-third of tokens are tagged in thought units. Individual tags, on average, span 2.6 tokens and 1.3 seconds. Labels vary in frequency, with MOR tags being the most frequent (~1,000 tags over ~2,000 tokens), then DIF (~400 tags over ~1,300 tokens), and REC being the least frequent (11 tags over 62 tokens).

Additional annotation for correctness was done on the full corpus by three annotators [24]. In this study, we consider a narrative to have a correct diagnosis if 2 or 3 of the annotators agree that the final diagnosis is correct.

4. Results and discussion

4.1. Disfluencies and thought units

Several common types of speech disfluencies are examined here (with examples in Table 2): silent pauses ("SIL") (1), non-nasal filled pauses (ah, er, and uh) (2), nasal filled pauses (hm and um) (3), repetitions (4), and edits (5). For this study, a repetition is a one- or two-word phrase that is repeated with a silent pause, filled pause, or nothing between the words, e.g. they uh they show. An edit is a word that was cut off, e.g. on the r- left wrist.

    ... [um](3) so given the distribution [my [SIL](1) my](4) differential would be [uh](2) like [a a](4) contact either [allergi-](5) allergic contact dermatitis maybe ...

Table 2: Example of disfluency types in the corpus. (Taken from a similar elicitation situation and used with permission.)
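A rough sketch of how these five disfluency categories could be detected over a transcribed token sequence is given below. It assumes silent pauses are transcribed as a dedicated "SIL" token and cut-off words end in a hyphen, as in the examples above; the heuristics and the function name classify_disfluencies are simplifications for illustration, not the corpus's actual annotation procedure.

    NON_NASAL_FILLED = {"ah", "er", "uh"}
    NASAL_FILLED = {"hm", "um"}

    def classify_disfluencies(words):
        """Return per-token labels: SIL, FILLED_NON_NASAL, FILLED_NASAL, EDIT,
        REPETITION, or None for fluent tokens. A rough approximation of the
        categories described in the text."""
        labels = [None] * len(words)
        for i, w in enumerate(words):
            lw = w.lower()
            if lw == "sil":                    # silent pause >= 0.3 s, transcribed as a token
                labels[i] = "SIL"
            elif lw in NON_NASAL_FILLED:
                labels[i] = "FILLED_NON_NASAL"
            elif lw in NASAL_FILLED:
                labels[i] = "FILLED_NASAL"
            elif lw.endswith("-"):             # cut-off word, e.g. "allergi-"
                labels[i] = "EDIT"
        # Repetitions: a one- or two-word phrase repeated, optionally with a silent
        # or filled pause in between (e.g. "my SIL my", "a a").
        content = [(i, w.lower()) for i, w in enumerate(words) if labels[i] is None]
        for k in range(len(content)):
            for n in (1, 2):
                if k + 2 * n <= len(content):
                    first = [w for _, w in content[k:k + n]]
                    second = [w for _, w in content[k + n:k + 2 * n]]
                    if first == second:
                        for idx, _ in content[k:k + 2 * n]:
                            labels[idx] = labels[idx] or "REPETITION"
        return labels

    # Example (cf. Table 2):
    # classify_disfluencies("my SIL my differential would be uh like a a contact".split())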
The disfluency types and their proportions are shown in Figure 2. Silent pauses comprised the majority of disfluencies, followed by filled pauses, with repetitions and edits having the smallest proportions. A total of 25% of tokens in the subcorpus were disfluent; 17% of all tokens were silent pauses and about 6.5% were filled pauses. This percentage of disfluencies is higher than that reported for several corpora of casual conversations (e.g. in a range of 3.5 to 6 disfluencies per 100 words in casual conversations or interviews [26, 27, 28, 29]). This difference could be due to how each study defines disfluencies and could also be an indication that the narratives are from a technical expert domain.

Figure 2: The percentages (of all disfluencies) of each disfluency type in the subcorpus. The number of instances of each: silent pause (3067); non-nasal filled pause (779); nasal filled pause (436); edits (156); repetitions (102).

For analysis of disfluencies, the spans of thought unit tags were expanded by one token to the left and right because the first annotation had been done on cleaned transcripts (cf. section 3; see footnote 1). Thought unit tags varied in frequency and span, so analysis looks at disfluencies as a proportion of tagged tokens. As shown in Figure 3, the only two thought unit categories with higher-than-usual disfluency percentages (compared with the disfluency percentage over all tokens) were tokens tagged as Morphology and non-tagged tokens. The Morphology tag had the highest percentage of silent pauses (22%), while other disfluencies were more prominent outside of thought unit tags.

Footnote 1: The span of non-tagged text was not expanded, to avoid overlap. If two tags were adjacent, tokens were assigned only to their actual tag. If one token was between two tags (e.g. "SIL" in Figure 1), the token was assigned to the left tag.

The high percentage of silent pauses in Morphology tags could indicate physicians' thoughtfulness. Participants tended to identify patient demographics, location, and distribution of the lesion before or while discussing the medical morphology, meaning that the description of morphology is the first cognitively demanding clinical reasoning step they will take. Importantly, in the full corpus of 800 narratives, 62% of narratives in which the physician provided a correct medical morphology had a correct final diagnosis, while only 37% of narratives with an incorrect or missing morphology had a correct final diagnosis. Correctly identifying the morphology is a vital step that guides clinicians to hone in on specific potential diagnoses [25].

Thus, analysis indicates that tagged and non-tagged text behave differently. This could be in part because thought unit-tagged text often consists of short noun phrases. Additionally, disfluencies are likely uttered earlier in the narrative as cognitive clues related to a concept mentioned later on. Disfluencies tend to precede cognitively demanding steps (cf. section 2), so one would expect that more disfluencies would be found preceding tagged speech. Gaze analysis of eye-tracking data co-collected with speech in this corpus has found that people tend to look at lesion features before they discuss them [30]. Similarly, speech phenomena that occur earlier in the narrative may be indicative of cognitive reasoning that is related to utterances later on.

Figure 3: The proportion of non-disfluent words vs. each type of disfluency in each thought unit. For example, 12% of all tokens tagged as REC were silent pauses.
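The per-tag disfluency proportions summarized in Figure 3 could be approximated along the following lines, reusing the hypothetical Token/Narrative representation and the classify_disfluencies sketch from above and applying the one-token span expansion and left-tag tie-breaking described in footnote 1. This is an illustrative sketch, not the study's actual analysis code.

    from collections import defaultdict

    def disfluency_proportions(narratives, expand=1):
        """Proportion of disfluent tokens for each thought unit tag and for
        non-tagged speech. Tag spans are widened by `expand` tokens on either
        side; untagged tokens between two tags end up with the left tag because
        lower-indexed expansions fill them first."""
        disfluent = defaultdict(int)   # tag -> number of disfluent tokens
        total = defaultdict(int)       # tag -> total number of tokens
        for narr in narratives:
            labels = classify_disfluencies([t.word for t in narr.tokens])
            effective = [t.tag for t in narr.tokens]
            for i, tok in enumerate(narr.tokens):
                if tok.tag is not None:
                    for j in range(max(0, i - expand), min(len(effective), i + expand + 1)):
                        if effective[j] is None:      # never overwrite an actual or earlier tag
                            effective[j] = tok.tag
            for lab, tag in zip(labels, effective):
                key = tag if tag is not None else "NON-TAGGED"
                total[key] += 1
                disfluent[key] += 1 if lab is not None else 0
        return {key: disfluent[key] / total[key] for key in total}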
4.2. Type versus token ratio and thought units

The type/token ratio (TTR) can indicate a text's stylistic properties. In this data, the average TTR of the narratives is 0.6, or 0.8 if disfluencies are removed; the average TTR of non-tagged text only is very similar. In comparison, the average TTR over each type of thought unit tag is 0.9 (see footnote 2), and did not change when disfluencies were removed; as discussed above, disfluencies are less prevalent in thought unit spans. The high TTR of thought unit tags indicates they are rich and dense in vocabulary, likely because the vocabulary is specialized and thought unit tags typically span only a few tokens. There was little variation in the average TTR for each specific thought unit, ranging from 0.9 to ~1. These findings indicate that thought units and non-tagged text are qualitatively different, supporting the hypothesis that thought unit tags capture specific information and behave differently than non-tagged text.

Footnote 2: TTR measures the number of unique types divided by the number of total tokens. For each narrative, all words within each thought unit tag over the whole narrative were considered for TTR measurement, even if the tags were split; e.g. "[elderly male]DEM ... [elderly patient]DEM" would have a TTR of 3/4 = 0.75 (not 1 and 1).
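Following the definition in footnote 2, a minimal sketch of the TTR computation, pooling split spans of the same tag within a narrative, might look as follows (again assuming the hypothetical Token/Narrative representation from section 3):

    from collections import defaultdict

    def type_token_ratio(words):
        """Number of unique types divided by the number of tokens (case-folded)."""
        words = [w.lower() for w in words]
        return len(set(words)) / len(words) if words else 0.0

    def per_tag_ttr(narrative, disfluency_labels=None):
        """TTR over all words sharing a thought unit tag within one narrative,
        pooling split spans of the same tag (cf. footnote 2). If disfluency
        labels are supplied, disfluent tokens are excluded first."""
        pooled = defaultdict(list)
        for i, tok in enumerate(narrative.tokens):
            if tok.tag is None:
                continue
            if disfluency_labels is not None and disfluency_labels[i] is not None:
                continue
            pooled[tok.tag].append(tok.word)
        return {tag: type_token_ratio(words) for tag, words in pooled.items()}

    # Footnote 2's example, with the two DEM spans pooled:
    # type_token_ratio(["elderly", "male", "elderly", "patient"])  ->  0.75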
4.3. Prosody and thought units

Additional analysis tested the duration, mean pitch, and mean intensity of individual tokens tagged with thought units against non-tagged tokens (see footnote 3).

                            Tagged     Non-tagged
    Duration (mean, s)      0.48       0.50
    Duration (st. dev.)     0.31       0.74
    Pitch (mean)            66 Hz      63 Hz
    Pitch (st. dev.)        28         36
    Intensity (mean)        148 dB     135 dB
    Intensity (st. dev.)    55         75

Table 3: Prosody features in tokens tagged with thought units and non-tagged tokens. Each test was significant, p<0.01.

As shown in Table 3, tokens' mean pitch and mean intensity were both higher for thought unit-tagged tokens, which could be an indication of prominence or importance in tagged speech [31], although further analysis would be needed. The typical short span of thought unit tags and the specific parts of speech captured by them also play a role in these results. Variance in the data was higher for every test in non-tagged tokens, providing additional evidence that thought unit-tagged tokens have less variability than non-tagged tokens.

Footnote 3: Features were extracted with Praat and tested with two-sample t-tests assuming unequal variances.
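A sketch of the statistical comparison described in footnote 3, using SciPy's Welch option for two-sample t-tests with unequal variances, is shown below. It operates on per-token feature values assumed to have been extracted beforehand (the paper used Praat for the extraction itself), and the dictionary layout is an illustrative assumption.

    from scipy.stats import ttest_ind

    def compare_prosody(tagged, nontagged):
        """Welch's two-sample t-test (unequal variances) per prosodic feature.

        `tagged` and `nontagged` map feature names (e.g. "duration", "pitch",
        "intensity") to lists of per-token values."""
        results = {}
        for feature in tagged:
            stat, p = ttest_ind(tagged[feature], nontagged[feature], equal_var=False)
            results[feature] = (stat, p)
        return results

    # Example with made-up numbers (not the corpus values):
    # compare_prosody({"duration": [0.48, 0.52, 0.45]}, {"duration": [0.50, 0.61, 0.40]})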
5. Diagnostic correctness and thought units

We visually examined the number of each thought unit tag along the narratives' temporal progress, comparing narratives in which the observer gave the correct final diagnosis against narratives with an incorrect final diagnosis, shown in Figure 4. There are several differences in the temporal trajectories of thought units that could provide clues as to the thought processes of the physicians in these contexts. In correct narratives, the number of MOR tags peaks between 25% and 60% of the way through the narrative. In contrast, MOR tags in incorrect narratives peak around 20% and then trail off. In incorrect narratives, it could be that the physicians paid less attention to the Morphology description, giving the descriptions earlier and being less likely to revisit them, than in the correct narratives, where they might have been more likely to provide evidence for their favored diagnosis.

Figure 4: The location of thought units as percent temporal progression through the narratives, comparing correct narratives (top, N=123) against incorrect narratives (bottom, N=94). The peak of each thought unit is labelled; REC and CON are not labelled because there are few occurrences.

The temporal distribution of DX vs. DIF tags also differs. In correct narratives, few DIF tags are present and many DX tags are given at the end. In incorrect narratives, there are many DIF tags during the last quarter of the narrative and few DX tags. If participants were more confident when they were correct, they may have chosen to either omit or give a smaller differential and provided a final diagnosis more readily (see footnote 4). When incorrect, participants might have been unsure and given a longer differential instead of settling on one diagnosis.

No real difference was found in the duration of each thought unit in correct versus incorrect narratives, with the exception of MOR tags. Morphology tags in correct narratives on average lasted for 9% of the narrative's duration, while they lasted for only 5% of incorrect narratives' duration (p<0.01; see footnote 5). This again indicates the importance of identifying the medical morphology as a critical medical reasoning step, although it is difficult to determine the relationship between morphology and diagnostic correctness. It could be that physicians who spend more time and provide more detail on morphology notice more, giving them additional clues for the final diagnosis; or it could be that physicians who give the correct diagnosis are more familiar with that diagnosis' morphology and can explain more or draw on this experience.

Footnote 4: Instructions to the participants during data elicitation were flexible to allow more natural descriptions, and physicians were not specifically asked to provide a differential diagnosis.

Footnote 5: A two-sample t-test assuming unequal variances was used.
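The temporal trajectories in Figure 4 could be tabulated roughly as follows, binning each tagged span by its percent progression through the narrative and splitting by diagnostic correctness. The choice of ten bins and of locating a span at its first token's midpoint are assumptions for this sketch, not the paper's exact procedure.

    from collections import defaultdict

    def tag_trajectories(narratives, n_bins=10):
        """Count thought unit tag spans per bin of narrative progress (0-100%),
        split by correctness: {True/False: {tag: [counts per bin]}}."""
        counts = {True: defaultdict(lambda: [0] * n_bins),
                  False: defaultdict(lambda: [0] * n_bins)}
        for narr in narratives:
            if narr.correct_diagnosis is None or len(narr.tokens) < 2:
                continue
            t0, t1 = narr.tokens[0].start, narr.tokens[-1].end
            if t1 <= t0:
                continue
            prev_tag = None
            for tok in narr.tokens:
                # A span starts where the tag differs from the previous token's tag.
                if tok.tag is not None and tok.tag != prev_tag:
                    progress = ((tok.start + tok.end) / 2.0 - t0) / (t1 - t0)
                    b = min(int(progress * n_bins), n_bins - 1)
                    counts[narr.correct_diagnosis][tok.tag][b] += 1
                prev_tag = tok.tag
        return counts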
6. Conclusions

This study examines information-rich thought unit annotation in spoken medical narratives, with the finding that speech tokens tagged in thought units behave linguistically in ways distinct from non-tagged tokens, and that the temporal patterns of thought units differ between narratives with a correct diagnosis and those with an incorrect diagnosis. These findings are a step toward characterizing information-rich speech in clinical decision-making, information extraction of semantic thought units, and identification of misdiagnoses.

In particular, several intriguing results have surfaced relating to the medical lesion morphology step, which describes the form of the lesion, in the diagnostic process. Morphology tags have the highest percentage of silent pauses and an above-average percentage of total disfluencies. The duration of Morphology tags in correct narratives is significantly longer than in incorrect narratives, and they appear to show a difference in temporal trajectory between correct and incorrect narratives. Moreover, based on the full corpus, we know that physicians who give the correct medical morphology are more likely to provide the correct diagnosis. These findings point toward the need for further analysis of this critical medical reasoning step.

Further work will continue to explore the relationship between medical reasoning and linguistic patterns in medical narratives based on diagnostic correctness, physician expertise, and image characteristics, such as perceived vs. actual difficulty of a diagnostic case or type of lesion. We have also, more recently, collected a similar, larger corpus in which physicians were additionally asked to provide their level of confidence in their final diagnosis, which provides the opportunity to explore another facet of diagnostic reasoning.

7. Acknowledgements

This work was partially supported by NIH grant 1 R21 LM010039-01A1 and NSF grant IIS-0941452. We appreciate the contributions of the participants, annotators, reviewers, and the research group; and Logical Images, Inc. for images.

8. References

[1] J. E. Arnold, C. L. Hudson Kam, and M. K. Tanenhaus, "If you say thee uh you are describing something hard: The on-line attribution of disfluency during reference comprehension," Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 33, no. 5, pp. 914-930, 2007.
[2] D. J. Barr and M. Seyfeddinipur, "The role of fillers in listener attributions for speaker disfluency," Language and Cognitive Processes, vol. 25, no. 4, pp. 441-455, 2010.
[3] V. L. Smith and H. H. Clark, "On the course of answering questions," Journal of Memory and Language, vol. 32, pp. 25-38, 1993.
[4] E. Shriberg, A. Stolcke, D. Hakkani-Tür, and G. Tür, "Prosody-based automatic segmentation of speech into sentences and topics," Speech Communication, vol. 32, no. 1-2, pp. 127-154, 2000.
[5] R. De Mori, F. Béchet, D. Hakkani-Tür, M. McTear, G. Riccardi, and G. Tür, "Spoken language understanding: Interpreting the signs given by a speech signal," IEEE Signal Processing Magazine, vol. 25, pp. 50-58, 2008.
[6] Y.-Y. Wang, L. Deng, and A. Acero, "Spoken language understanding," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 16-31, 2005.
[7] J. E. Arnold, M. Fagnano, and M. K. Tanenhaus, "Disfluencies signal theee, um, new information," Journal of Psycholinguistic Research, vol. 32, no. 1, pp. 25-36, 2003.
[8] S. E. Brennan and M. Williams, "The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers," Journal of Memory and Language, vol. 34, pp. 383-398, 1995.
[9] M. Corley and O. W. Stewart, "Hesitation disfluencies in spontaneous speech: The meaning of um," Language and Linguistics Compass, vol. 2, no. 4, pp. 589-602, 2008.
[10] P. A. Heeman, "Modeling spontaneous speech events during recognition," First International Conference on Universal Access in Human-Computer Interaction, 2001.
[11] E. Campione and J. Véronis, "A large-scale multilingual study of silent pause duration," Speech Prosody, 2002.
[12] K. Kirsner, J. Dunn, and K. Hird, "Fluency: Time for a paradigm shift," Proceedings of DiSS'03, Disfluency in Spontaneous Speech Workshop, pp. 13-16, 2003.
[13] B. Megyesi and S. Gustafson-Čapkovà, "Pausing in dialogues and read speech in Swedish: Speakers' production and listeners' interpretation," Proceedings of the Workshop on Prosody in Speech Recognition and Understanding, pp. 107-113, 2001.
[14] T. Lövgren and J. van Doorn, "Influence of manipulation of short silent pause duration on speech fluency," Proceedings of DiSS'05, Disfluency in Spontaneous Speech Workshop, pp. 123-126, 2005.
[15] K. G. Bailey and F. Ferreira, "Disfluencies affect the parsing of garden-path sentences," Journal of Memory and Language, vol. 49, pp. 183-200, 2003.
[16] S. E. Brennan and M. F. Schober, "How listeners compensate for disfluencies in spontaneous speech," Journal of Memory and Language, vol. 44, pp. 274-296, 2001.
[17] L. J. MacGregor, M. Corley, and D. I. Donaldson, "Listening to the sound of silence: Disfluent silent pauses in speech have consequences for listeners," Neuropsychologia, vol. 48, pp. 3982-3992, 2010.
[18] M. Corley, L. J. MacGregor, and D. I. Donaldson, "It's the way that you, er, say it: Hesitations in speech affect language comprehension," Cognition, vol. 105, pp. 658-668, 2007.
[19] L. J. MacGregor, M. Corley, and D. I. Donaldson, "Not all disfluencies are are equal: The effects of disfluent repetitions on language comprehension," Brain and Language, vol. 111, pp. 36-45, 2009.
[20] W. McCoy, C. O. Alm, C. Calvelli, J. B. Pelz, P. Shi, and A. Haake, "Linking uncertainty in physicians' narratives to diagnostic correctness," in Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics. Jeju, Republic of Korea: Association for Computational Linguistics, July 2012, pp. 19-27.
[21] P. Croskerry, "A universal model of diagnostic reasoning," Academic Medicine, pp. 1022-1028, 2009.
[22] H. Beyer and K. Holtzblatt, Contextual Design: Defining Customer-Centered Systems. Morgan Kaufmann, 1997.
[23] P. Boersma, "Praat, a system for doing phonetics by computer," Glot International, pp. 341-345, 2001.
[24] W. McCoy, C. O. Alm, C. Calvelli, R. Li, J. Pelz, P. Shi, and A. Haake, "Annotation schemes to encode domain knowledge in medical narratives," in Proceedings of the Sixth Linguistic Annotation Workshop. Jeju, Republic of Korea: Association for Computational Linguistics, July 2012, pp. 95-103.
[25] K. Wolff, T. B. Fitzpatrick, R. A. Johnson, and D. Suurmond, Fitzpatrick's Color Atlas and Synopsis of Clinical Dermatology, 6th ed. McGraw-Hill Medical Pub. Division, 2005.
[26] H. Bortfeld, S. D. Leon, J. E. Bloom, M. F. Schober, and S. E. Brennan, "Disfluency rates in conversation: Effects of age, relationship, topic, role, and gender," Language and Speech, vol. 44, no. 2, pp. 123-147, 2001.
[27] E. Shriberg, "To 'errrr' is human: Ecology and acoustics of speech disfluencies," Journal of the International Phonetic Association, vol. 31, no. 1, 2001.
[28] S. Oviatt, "Predicting and managing spoken disfluencies during human-computer interaction," Computer Speech and Language, vol. 9, pp. 19-35, 1995.
[29] S. Schachter, N. Christenfeld, B. Ravina, and F. Bilous, "Speech disfluency and the structure of knowledge," Journal of Personality and Social Psychology, vol. 60, no. 3, pp. 362-367, 1991.
[30] P. Vaidyanathan, J. Pelz, W. McCoy, C. Calvelli, C. O. Alm, P. Shi, and A. Haake, "Visualinguistic approach to medical image understanding," Proceedings of AMIA 2012 Annual Symposium, p. 1661, 2012.
[31] D. G. Watson, J. E. Arnold, and M. K. Tanenhaus, "Tic tac toe: Effects of predictability and importance on acoustic prominence in language production," Cognition, vol. 106, pp. 1548-1557, 2008.