ISCA Archive http://www.isca-speech.org/archive
7th International Conference on Spoken Language Processing [ICSLP 2002], Denver, Colorado, USA, September 16-20, 2002

SPEECH PAUSES AND GESTURAL HOLDS IN PARKINSON'S DISEASE

Francis Quek, †Mary Harper, Yonca Haciahmetoglu, †Lei Chen, and ‡Lorraine Ramig
Vision Interfaces & Sys. Lab. (VISLab), CSE Dept., Wright State U., OH; †ECE Dept., Purdue University, IN; ‡University of Colorado, Boulder
Correspondence: [email protected]

Abstract

Parkinson's disease (PD) belongs to a class of neurodegenerative diseases that affect the patient's speech, motor, and cognitive capabilities. All three deficits affect the multimodal communication channels of speech and gesture. We present a study of the changes in speech pause patterns and gesture holds before and after treatment, reporting the results of a pilot study of two idiopathic PD patients who underwent Lee Silverman Voice Treatment (LSVT). We show that there was a consistent change in the location of pauses with respect to semantic sentential utterance units. After treatment, the number and duration of pauses within sentential units decreased while inter-sentential pauses increased. This indicates a reduction in hesitation and an increase in speech phrasing. We also found a decrease in the number of sentence repairs and the time spent in repairs. For gesture, we found that non-rest holds intersecting with within-sentence pauses appear to decline after treatment, as does the ratio of rest holds during speech to rest holds between sentences. While the work is preliminary, these patterns suggest that multimodal discourse characteristics might provide access to the underlying cognitive state under the load of narrative discourse.

1. INTRODUCTION

Parkinson's disease (PD) belongs to a class of neurodegenerative diseases that affect both the patient's speech and motor capabilities. Currently there are one and a half million sufferers of the disease, and this number is expected to rise fourfold by 2040 [1]. PD is a progressive neurodegenerative disorder in which death of dopaminergic cells in the substantia nigra results in a variety of changes in motoric function such as delay in initiation of movement, increase in resting muscle tone, slowness of movement, and resting tremor [2, 3]. Cognitive changes, such as slowed information processing (bradyphrenia) [4], and loss of postural reflexes may also occur [5]. Speech changes are common; approximately 70% of Parkinson's patients have speech problems. Parkinson's patients often exhibit monotonous pitch and loudness, reduced stress, variable speech rate, short rushes of speech, and imprecise consonant articulation. They also manifest hypophonia, start-hesitation or stuttering, or delay in the production of speech, which is often difficult to distinguish from bradyphrenia [6, 7].

This research has been supported by the U.S. National Science Foundation STIMULATE program, Grant No. IRI-9618887, "Gesture, Speech, and Gaze in Discourse Segmentation", and the National Science Foundation KDI program, Grant No. BCS-9980054, "Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research". This research was also supported in part by NIH-NIDCD R01 DC-01150. Appreciation is expressed to the subjects who participated in this study.
Earlier studies of voice onset time [8] and of the effects of PD on naming deficits and language comprehension [9] suggest that language deficits may provide a handle on understanding the neuropathology of the disease. Such studies seek to understand these component effects in isolation. We believe that there are effects of PD that reveal themselves in individuals only under the load and complexity of natural discourse. This paper tests the proposition that these effects are accessible via the analysis of multimodal narrative discourse involving gesture and speech. Specifically, we study the effects of the disease on the distribution of speech pauses and gestural holds. We analyzed the before- and after-treatment videos of two PD patients who underwent a treatment program known as the Lee Silverman Voice Treatment (LSVT) [10]. This treatment applies voice therapy methods to improve the speech intelligibility of PD patients. Subjective observations indicate that not only did speech qualitatively improve, but the accompanying gesticulation appeared to improve as well. Our earlier studies have shown that there is an improvement in overall 'liveliness' of motion quality [11] and in the dynamics of both motion and speech intensity variation [12, 13].

2. BACKGROUND

The reduced ability to communicate is considered to be one of the most difficult aspects of Idiopathic Parkinson's Disease (IPD) by many IPD sufferers and their families. Soft voice, monotone, breathiness, hoarse voice quality, and imprecise articulation, together with lessened (masked) facial expression, contribute to limitations in communication in the vast majority of individuals with IPD [14, 15]. The initial treatment aim of LSVT is to improve the phonatory source in individuals with IPD. The treatment typically results in significant, long-term improvement in laryngeal valving and post-treatment changes in thyroarytenoid muscle activity, subglottal air pressure, maximum flow declination rate, voice sound pressure level (SPL), loudness, and voice quality [16, 17, 18]. LSVT uses phonation as a trigger to increase effort and coordination across the speech production system by stimulating the global variable "loud." Speech production is a learned, highly practiced motor behavior, with many of its movements regulated in a quasi-automatic fashion [19, 20]; loudness scaling is a task that humans engage in all their lives [21, 22]. For example, it is common to increase loudness to improve speech intelligibility when speaking against noise or when the listener is far away. By targeting loudness in treatment, well-established, centrally stored motor patterns for speech may be triggered; that is, intensive loudness training may provide the stimulation needed for the individual with IPD to activate and modulate appropriate speech motor programs that are still intact. Such multi-level upscaling across the speech system is likely to involve common central neural pathways. Further, post-treatment observations of increased facial expression accompanying improved loudness and intonation [23], and the results of a PET brain imaging study of individuals with IPD pre- and post-LSVT [24], suggest that training loud phonation may also promote the recruitment of the right insular cortex and the anterior cingulate cortex. Taken together, these findings suggest that the gains accrued in improved phonation spread to other emotional and motoric activity at a neuronal level.
This suggests that the observed gesticulatory 'improvement' may be a result of this neuronal-level spreading. This is congruent with the basic assertion in our gesture research that gestural behavior and speech share a common semantic source and are tightly integrated [25, 26].

3. THE EXPERIMENT

Our pilot dataset was obtained from two patients, both male, who underwent an 8-week LSVT protocol involving therapy for one hour, two days a week, over the period. We designate our two pilot study subjects P1 and P2. They were 61 and 60 years of age, respectively. P1 and P2 have Hoehn & Yahr stage II and stage III diagnoses, respectively. Both men had been diagnosed with IPD for 5 years, and both had been stabilized on their anti-Parkinsonian medication prior to LSVT so that behavioral changes are not attributable to medication adjustments. Each patient performed a videotaped narration before and after the treatment. The experimental protocol employed the standard 'Tweety and Sylvester' narrations that are a staple of gesture-speech research [27]. Each patient viewed a series of Tweety-and-Sylvester cartoon clips and was instructed to convey the contents of the clips to an interlocutor so that she could then retell them to someone else. The narration was videotaped with a single Super-VHS camera for analysis. The audio was recorded using an AKG C451 EB boom-mounted microphone approximately 2 feet from the subject, through an amplifier connected to a Panasonic S-VHS AG1960 VCR. The video was digitized on a Silicon Graphics O2 in SGI-MJPEG format. Audio was digitized at the same time at 44.1 kHz and then downsampled to 14.7 kHz for analysis. P1 and P2 produced 275.3 sec (8259 frames) and 336.033 sec (10081 frames) of before-treatment narration video, respectively, and 219.1 sec (6573 frames) and 291.867 sec (8756 frames) of after-treatment narration video.

4. SPEECH PAUSE ANALYSIS

We first transcribe the speech data manually and run the Entropics word aligner on it. The alignments are imported into the Praat phonetics analysis tool [28] and adjusted to obtain accurate word- and syllable-level time alignment. We also perform an analysis for sentential boundaries, disfluencies, and speech repairs. Hesitation and silence in speech have been applied to the study of the cognitive aspects of speech production [29]. We posit that the distribution of empty pauses within and between sentential units constitutes a metric of the degree of hesitation and deliberation. While intra-sentential pauses exist in normal speech, such pauses per unit time or utterance should remain relatively constant for an individual discussing a particular topic. Increased intra-sentential pauses would indicate greater hesitation and disfluency in the speech. On the other hand, since PD sufferers tend to compensate for speech deficits by rushing their utterances, we expect that better speech phrasing and deliberation will manifest themselves in increased inter-sentential pause times.
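To make these measures concrete, the following Python sketch (not the tooling used in this study; all function and variable names are illustrative, and the EP_S / EP_SP notation is defined formally in the next paragraph) counts and times empty pauses within and between sentential units, given a word-level time alignment and a list of sentence boundaries, using the 150 ms empty-pause threshold adopted below.

PAUSE_THRESHOLD = 0.150  # 150 ms empty-pause threshold, as used in this paper

def empty_pauses(words, threshold=PAUSE_THRESHOLD):
    """Return (start, end) gaps between consecutive words longer than threshold.

    words: list of (word, start, end) tuples in seconds, in time order.
    """
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:]):
        if next_start - prev_end > threshold:
            pauses.append((prev_end, next_start))
    return pauses

def within_sentence(pause, sentences):
    """Simplified test: the pause midpoint falls inside some sentential unit."""
    mid = 0.5 * (pause[0] + pause[1])
    return any(s <= mid <= e for s, e in sentences)

def pause_metrics(words, sentences):
    """Per-sentence pause indices; sentences is a list of (start, end) tuples."""
    pauses = empty_pauses(words)
    ep_s = [p for p in pauses if within_sentence(p, sentences)]       # EP_S
    ep_sp = [p for p in pauses if not within_sentence(p, sentences)]  # EP_SP
    n_sent = len(sentences)                                           # N(S)
    t_sent = sum(e - s for s, e in sentences)                         # T(S)
    dur = lambda ps: sum(e - s for s, e in ps)
    return {
        "N(EP_S)/N(S)": len(ep_s) / n_sent,
        "T(EP_S)/T(S)": dur(ep_s) / t_sent,
        "N(EP_SP)/N(S)": len(ep_sp) / n_sent,
        "T(EP_SP)/T(S)": dur(ep_sp) / t_sent,
    }

The four returned ratios correspond to the per-sentence indices reported in Table 1.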
We define EP to be the set of all empty pauses between words greater than some pause determination threshold (we use a value of 150 ms). N(EP) is the number of such pauses, and T(EP) is the total duration of such EP. We define EP_S to be the set of empty pauses within sentences, and EP_SP to be the inter-sentential pauses; N(S) and T(S) denote the number of sentences and their total duration.

Table 1: PD 1 and 2 Speech Characteristics

                                 Pre-TX    Post-TX   Change %
  Percent of time speaking  P1   55.70%    63.61%    14.20%
                            P2   79.05%    80.08%    1.30%
  N(EP_S)/N(S)              P1   1.2037    0.7556    -37.23%
                            P2   0.4701    0.4048    -13.89%
  T(EP_S)/T(S)              P1   0.2752    0.1999    -27.36%
                            P2   0.1866    0.1354    -27.44%
  N(EP_SP)/N(S)             P1   0.5926    0.6889    16.25%
                            P2   0.4530    0.4881    7.75%
  T(EP_SP)/T(S)             P1   0.6782    0.9521    40.39%
                            P2   0.3435    0.4406    28.27%

From Table 1 we can see that the percent of time speaking (i.e., word utterance time / total narration time) increased by 14.20% and 1.30% for PD1 and PD2, respectively. N(EP_S)/N(S), the number of within-sentence empty pauses per sentence, decreased by 37.23% and 13.89% for PD1 and PD2 respectively, as did the duration ratio T(EP_S)/T(S) (by 27.36% for PD1 and 27.44% for PD2). Both of these measures indicate a reduction in hesitation and an increase in fluency post-TX. The indices for inter-sentence pauses showed the reverse trend: N(EP_SP)/N(S) increased by 16.25% for PD1 and by 7.75% for PD2, and, concomitantly, the duration ratio T(EP_SP)/T(S) increased by 40.39% for PD1 and 28.27% for PD2. Hence phrasing and deliberateness in both subjects appear to improve. These consistent trends were surprising since our two subjects had rather different speech rates. PD1 had a pre-TX speech rate of 2.725 words per second that increased post-TX to 2.97 words per second. PD2 was a naturally faster speaker, and his speech rate slowed slightly from 4.19 to 4.16 words per second.

5. SPEECH REPAIRS

We marked disfluencies in the aligned speech, labeling speech repairs, aborted sentences (utterances that do not terminate and are not repaired), and filled pauses (such as "um"). Repairs are marked as tuples of reparandum (the region of the mistake), editing words (words that indicate a repair will happen), and alteration (the repaired speech). An example of a disfluent sentence taken from our PD data is: "[third time](1) [uh](2) [second time](3) [excuse me](4) [second time](5) he goes up the drainpipe". Here (1) is the reparandum; (2) is an editing word; (3) is the alteration for (1) (i.e., the fixup), but it also becomes the reparandum for a second repair; (4) contains the editing words for that second repair; and (5) is the final alteration. Note that the "uh" in (2) would have been marked as a filled pause if it were not judged to perform the role of an editing word in the utterance. We measured the number of repairs pre-TX and post-TX and their durations. As can be seen in Table 2, both the total number of repairs and the time taken in repairs improved post-TX.

Table 2: PD 1 and 2 Speech Repairs

  a. Total Number of Repairs
             P1                    P2
  Pre-TX     77 (0.279 per sec)    83 (0.247 per sec)
  Post-TX    43 (0.196 per sec)    43 (0.147 per sec)

  b. Total Duration of Repairs
             P1                            P2
  Pre-TX     38.39 sec (14.05% of speech)  48.69 sec (14.5% of speech)
  Post-TX    24.13 sec (11.3% of speech)   48.69 sec (7.8% of speech)

6. GESTURE ANALYSIS

We apply a fuzzy image processing approach known as Vector Coherence Mapping (VCM) [30] to track the hand motion. VCM applies spatial coherence, momentum (temporal coherence), speed limit, and skin color constraints in the vector field computation using a fuzzy-combination strategy, and produces good results for hand gesture tracking. We apply an iterative clustering algorithm that minimizes spatial and temporal vector variance to extract the moving hands. The positions of the hands in the stereo images are used to produce 3D motion traces describing the gestures.
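The clustering step can be illustrated with a brief sketch. The following Python code is a simplified stand-in under assumed inputs, not the VCM implementation or the variance-minimizing algorithm used in this work; all names are illustrative. It clusters one frame's motion vectors (those surviving the skin-color and speed constraints) into two hand regions with a magnitude-weighted two-cluster assignment and returns a centroid per hand, which can then be chained across frames into motion traces.

import numpy as np

def extract_hand_centroids(frame_vectors, n_hands=2, iters=20):
    """Toy per-frame clustering of motion vectors into hand regions.

    frame_vectors: (N, 4) array with rows (x, y, dx, dy) -- motion vectors
    for one frame after skin-color and speed-limit filtering.
    Returns an (n_hands, 2) array of estimated hand positions.
    """
    pts = frame_vectors[:, :2].astype(float)
    weights = np.linalg.norm(frame_vectors[:, 2:], axis=1) + 1e-6
    # initialize centroids at the most energetic vectors
    centroids = pts[np.argsort(weights)[-n_hands:]].copy()
    for _ in range(iters):
        # assign each vector to its nearest centroid (spatial coherence)
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re-estimate each centroid as the magnitude-weighted mean of its members
        for k in range(n_hands):
            members = labels == k
            if members.any():
                w = weights[members]
                centroids[k] = (pts[members] * w[:, None]).sum(axis=0) / w.sum()
    return centroids

A fuller treatment would also penalize frame-to-frame centroid jumps (the temporal-variance term described above); the sketch omits that for brevity.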
To detect holds, we employed a motion energy-based detector to locate places where there was low motion energy [31]. We have observed that there is a high degree of cohesion between holds and pauses [32]. We divided holds into three kinds: rest holds, where both hands are at the rest position; pre-stroke and post-stroke holds and stationary points in a motion sequence, which are holds with durations shorter than 200 msec (pre- and post-stroke holds typically take place before or after a motion stroke, and stationary points are typical of oscillatory gestures where the hand stops momentarily at the stroke extrema); and long hold strokes (non-rest holds longer than 200 msec). We posit that long non-rest holds have different meanings depending on whether they intersect with pauses or words. If they intersect with words, they are more likely to be 'hold-strokes' (information-laden parts of a gesture that include an effortful hold). If they intersect with empty pauses within sentences, they are likely to reinforce the concept of hesitation. Some silences occur within sentences while the hand is moving; these may serve a coordinating function between gesture and speech (as when one says "When you [pause while the hand moves up to prepare a throwing action] pass the ball [hand thrusts forward in throwing action] down the field"). We posit that within-sentence pauses that intersect with non-rest holds are likely to signal word search or hesitation behavior. For rest holds, we posit that a predominance of these holds during speech indicates a decrease in ideational imagery. We computed the intersection of these hold phenomena with the between-sentence pauses, within-sentence pauses, and sentence durations.

We define NRH to be the set of all non-rest holds greater than our 0.2 sec threshold, and RH to be the set of all rest holds. T(NRH) and T(RH) are the total durations of such NRH and RH, respectively. We define NRH∩EP_S to be the set of intersections of NRH with empty pauses within sentences, RH∩S to be the set of intersections between rest holds and sentences, and RH∩EP_SP to be the set of intersections between rest holds and inter-sentence pauses.

For PD1, T(NRH∩EP_S) as a fraction of total discourse time was 34.025 msec per sec pre-TX. This declined to 9.57 msec/sec (a decrease of 71.87%) post-TX. For PD2, this value was 21.43 msec/sec pre-TX, but it increased slightly, by 3 msec/sec, to 24.43 msec/sec post-TX. To obtain the relationship of rest holds within and between sentences, we took the quotient T(RH∩EP_SP) / T(RH∩S). For PD1, this quotient increased from 0.181 pre-TX to 0.337 post-TX (an increase of 86.19%). For PD2, this quotient increased similarly, from 0.364 pre-TX to 0.770 post-TX (an increase of 111.54%).

The strong decrease in T(NRH∩EP_S) as a fraction of total discourse time in PD1 matches our finding of improved fluency in the speech. The problem with PD2 is that there was a shift in the experimental conditions between pre-TX and post-TX. The subjects were seated in a therapy chair with adjustable armrests (they could be engaged or not). In the pre-TX case, the armrests were not engaged, and the subjects gestured with the rest position on their laps. Hence each excursion from rest required greater effort. In the post-TX case, the armrests were erroneously engaged. PD1 still used his lap as the rest position, while PD2 rested his elbows on the armrests and clasped his hands in front of his mid-section. This resulted in a markedly different dynamic in his gesticulation.
Since the gesture and rest spaces were very close, it was difficult to tell when he was at rest and whether the mid-length holds were rest or non-rest holds. Furthermore, because of the reduced effort required, PD2 post-TX engaged in many more beat gestures directly from the rest position. Hence, we have to conclude that his results were inconclusive for the non-rest hold comparison. The consistency of the pre-TX to post-TX change in our second rest hold measure indicates that there was more gesticulatory activity post-TX than before.

7. CONCLUSIONS

We have presented the results of our pilot work on speech and vision-based gesture metrics to assess cognitive change in PD patients before and after LSVT. We employed a speech pause and gesture hold analysis for this purpose. We stress that these results are preliminary, but they do suggest that the study of change in multimodal language characteristics in narrative discourse may be of utility in understanding the broad cognitive changes in individuals with PD. The temporal relation between pauses and sentences was consistent across the two subjects in our study even though there was a large disparity between the speech rates and disease conditions of the two patients. The gestural analysis of rest and non-rest holds with respect to speech pauses and sentences in PD1 agreed with our basic hypotheses. In PD2, the confounding situation of the armrests contributed to mixed results. The purpose of this pilot work was to show the promise of multimodal linguistic analysis for studying cognition and language in PD patients. In our ongoing work, we are extending our study to more subjects and devising more robust metrics for such cognitive evaluation.

8. REFERENCES

[1] http://www.pdindex.com/, "PD INDEX: A directory of Parkinson's disease information on the internet", Internet publication.
[2] J. Jankovic, "Pathophysiology and clinical assessment of motor symptoms in Parkinson's disease", in W. Koller, editor, Handbook of Parkinson's Disease, pp. 99–126. Marcel Dekker, New York, 1987.
[3] M. Hallett and S. Khoshbin, "A physiological mechanism of bradykinesia", Brain, vol. 103, pp. 301–314, 1980.
[4] B. Dubois and B. Pillon, "Cognitive deficits in Parkinson's disease", J. Neurol., vol. 244, pp. 2–8, Jan. 1997.
[5] M. Morris, R. Iansek, F. Smithson, and F. Huxham, "Postural instability in Parkinson's disease: a comparison with and without a concurrent task", Gait Posture, vol. 12, pp. 205–216, Dec. 2000.
[6] K. Forrest, G. Weismer, and G. Turner, "Kinematic, acoustic, and perceptual analyses of connected speech produced by parkinsonian and normal geriatric adults", Journal of the Acoustical Society of America, vol. 85, pp. 2608–2622, 1989.
[7] J. Logemann, H. Fisher, B. Boshes, and E. Blonsky, "Frequency and cooccurrence of vocal tract dysfunctions in the speech of a large sample of Parkinson individuals", Journal of Speech and Hearing Disorders, vol. 42, pp. 47–57, 1978.
[8] P. Lieberman, E. Kako, J. Friedman, G. Tajchman, L.S. Feldman, and E.B. Jiminez, "Speech production, syntax comprehension, and cognitive deficits in Parkinson's disease", Brain and Language, vol. 43, pp. 169–189, 1992.
[9] F.M. Lewis, L.L. Lapointe, B.E. Murdoch, and H.J. Chenery, "Language impairment in Parkinson's disease", Aphasiology, vol. 12, pp. 193–206, 1998.
[10] L. Ramig, S. Countryman, L. Thompson, and Y. Horii, "A comparison of two forms of intensive speech treatment for Parkinson disease", Journal of Speech and Hearing Research, vol. 38, pp. 1232–1251, 1995.
[11] F. Quek, R. Bryll, and L. Ramig, "Toward vision and gesture based evaluation of Parkinson disease from discourse video", in IEEE Wksp Cues in Comm., Kauai, Hawaii, Dec. 9, 2001.
[12] F. Quek, R. Bryll, M. Harper, L. Chen, and L. Ramig, "Audio and vision-based evaluation of Parkinson's disease from discourse video", in IEEE International Symposium on Bio-Informatics and Bio-Engineering, pp. 246–253, Bethesda, MD, Nov. 4–6, 2001.
[13] F. Quek, R. Bryll, M. Harper, L. Chen, and L. Ramig, "Speech and gesture analysis for evaluation of progress in LSVT treatment in Parkinson disease", in Eleventh Biennial Conference on Motor Speech: Motor Speech Disorders, Williamsburg, VA, Mar. 14–17, 2002.
[14] T. Pitcairn, S. Clemie, J. Gray, and B. Pentland, "Non-verbal cues in the self-presentation of parkinsonian patients", British Journal of Clinical Psychology, vol. 29, pp. 177–184, 1990.
[15] T. Pitcairn, S. Clemie, J. Gray, and B. Pentland, "Impressions of parkinsonian patients from their recorded voices", British Journal of Disorders of Communication, vol. 25, pp. 85–92, 1990.
[16] C. Baumgartner, S. Sapir, and L. Ramig, "Perceptual voice quality changes following phonatory-respiratory effort treatment (LSVT) vs. respiratory effort treatment for individuals with Parkinson disease", Journal of Voice, in press, 2001.
[17] L. Ramig and C. Dromey, "Aerodynamic mechanisms underlying treatment-related changes in vocal intensity in patients with Parkinson disease", Journal of Speech and Hearing Research, vol. 39, pp. 798–807, 1996.
[18] L. Ramig, S. Sapir, K. Baker, and M. Smith, "Electromyographic changes in laryngeal muscle activity following LSVT in idiopathic Parkinson disease", Journal of Speech, Language, and Hearing Research, in review, 2001.
[19] H. Ackermann, D. Wildgruber, I. Daum, and W. Grodd, "Does the cerebellum contribute to cognitive aspects of speech production? A functional magnetic resonance imaging (fMRI) study in humans", Neuroscience Letters, vol. 247, pp. 187–190, 1998.
[20] S. Hirano, H. Kojima, Y. Naito, I. Honjo, Y. Kamoto, H. Okazawa, K. Ishizu, Y. Yonekura, Y. Nagahama, H. Fukuyama, and J. Konishi, "Cortical processing mechanism for vocalization with auditory verbal feedback", Neuroreport, vol. 8, pp. 2379–2382, 1997.
[21] A. Ho, J. Bradshaw, R. Iansek, and R. Alfredson, "Speech volume regulation in Parkinson's disease: effects of implicit cues and explicit instructions", Neuropsychologia, vol. 37, pp. 1453–1460, 1999.
[22] A.K. Ho, J.L. Bradshaw, and T. Iansek, "Volume perception in parkinsonian speech", Movement Disorders, vol. 15, pp. 1125–1131, 2000.
[23] J. Spielman, L. Ramig, and J. Borod, "Preliminary effects of voice therapy on facial expression in Parkinson's disease", Journal of the International Neuropsychological Association, vol. 7, p. 244, 2001.
[24] M. Liotti, D. Vogel, L. Ramig, P. New, C. Cook, and P. Fox, "Functional reorganization of speech-motor function in Parkinson disease following LSVT: A PET study", Neurology, in review, 2001.
[25] F. Quek, D. McNeill, R. Ansari, X. Ma, R. Bryll, S. Duncan, and K.-E. McCullough, "Multimodal human discourse: Gesture and speech", ToCHI, in press, 2001; VISLab, Wright State U., Tech. Report VISLab-01-01, http://vislab.cs.wright.edu/Publications/Queetal01.html.
[26] D. McNeill, F. Quek, K.-E. McCullough, S. Duncan, N. Furuyama, R. Bryll, X.-F. Ma, and R. Ansari, "Catchments, prosody and discourse", Gesture, in press, 2001.
[27] D. McNeill, Hand and Mind: What Gestures Reveal about Thought, U. Chicago Press, Chicago, 1992.
[28] P. Boersma and D. Weenik, "Praat, a system for doing phonetics by computer", Technical Report 132, Institute of Phonetic Sciences of the University of Amsterdam, 1996.
[29] B. Butterworth and G. Beattie, "Gesture and silence as indicators of planning in speech", in R.N. Campbell and P.N. Smith, editors, Recent Advances in the Psychology of Language: Formal and Experimental Approaches. Plenum Press, New York, 1978.
[30] F. Quek, X. Ma, and R. Bryll, "A parallel algorithm for dynamic gesture tracking", in ICCV'99 Wksp on RATFG-RTS, pp. 119–126, Corfu, Greece, Sep. 26–27, 1999.
[31] R. Bryll, F. Quek, and A. Esposito, "Automatic hold detection in natural conversation", in IEEE Wksp Cues in Comm., Kauai, Hawaii, Dec. 9, 2001.
[32] A. Esposito, K.-E. McCullough, and F. Quek, "Disfluencies in gesture: Gestural correlates to speech silent and filled pauses", in IEEE Wksp Cues in Comm., Kauai, Hawaii, Dec. 9, 2001.