ISCA Archive
http://www.isca-speech.org/archive
7th International Conference on Spoken Language Processing (ICSLP 2002)
Denver, Colorado, USA, September 16–20, 2002
SPEECH PAUSES AND GESTURAL HOLDS IN PARKINSON’S DISEASE
Francis Quek, †Mary Harper, Yonca Haciahmetoglu, †Lei Chen, and ‡Lorraine Ramig
Vision Interfaces & Sys. Lab. (VISLab), CSE Dept., Wright State U., OH
†ECE Dept., Purdue University, IN; ‡University of Colorado, Boulder
Correspondence: [email protected]
Abstract
Parkinson’s disease (PD) belongs to a class of neurodegenerative
diseases that affect the patient’s speech, motor, and cognitive capabilities. All three deficits affect the multimodal communication
channels of speech and gesture. We present a study on the changes
in speech pause patterns and gesture holds before and after treatment. We present the results of a pilot study of two Idiopathic
PD patients who have undergone Lee Silverman Voice Treatment
(LSVT). We show that there was a consistent change in the location
of pauses with respect to semantic sentential utterance units. After treatment, the number and duration of pauses within sentential
units decreased while the inter-sentential pauses increased. This
indicates a reduction in hesitation and an increase in speech phrasing.
We also found a decrease in the number of sentence repairs and
the time spent in repairs. For gesture, we found that non-rest holds
intersecting with within-sentence pauses appear to decline after
treatment, as does the ratio of rest holds during speech to rest
holds between sentences. While the work is preliminary, these patterns suggest that multimodal discourse characteristics might provide access to the underlying cognitive state under the load of narrative discourse.
1. INTRODUCTION
Parkinson’s disease (PD) belongs to a class of neurodegenerative
diseases that affect both the patient’s speech and motor capabilities.
Currently there are one and a half million sufferers of the disease
and this number is expected to rise fourfold by 2040 [1]. Parkinson’s Disease (PD) is a progressive neurodegenerative disorder in
which death of dopaminergic cells in the substantia nigra results in
a variety of changes in motoric function such as delay in initiation of
movement, increase in resting muscle tone, slowness of movement,
and resting tremor [2, 3]. Cognitive changes, such as slowed information processing (bradyphrenia) [4], and loss of postural reflexes
may also occur [5]. Speech changes are common; approximately
70% of Parkinson’s patients have speech problems. Parkinson’s patients often exhibit monotonous pitch and loudness, reduced stress,
variable speech rate, short rushes of speech, and imprecise consonant articulation. They also manifest hypophonia, start-hesitation
or stuttering, or delay in the production of speech, which is often
difficult to distinguish from bradyphrenia [6, 7].
This research has been supported by the U.S. National Science Foundation STIMULATE program, Grant No. IRI-9618887, “Gesture, Speech,
and Gaze in Discourse Segmentation” and the National Science Foundation
KDI program, Grant No. BCS-9980054, “Cross-Modal Analysis of Signal
and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze
Research”. This research was also supported in part by NIH-NIDCD R01
DC-01150. Appreciation is expressed to subjects who participated in this
study.
Earlier studies of voice onset time [8] and the effects of PD
on naming deficits and language comprehension [9] suggest that
language deficits may provide a handle on understanding the neuropathology of the disease. Such studies seek to understand these
component effects in isolation. We believe that there are effects of
PD that reveal themselves in individuals only under the load and complexity of natural discourse. This paper tests the proposition that
these effects are accessible via the analysis of multimodal narrative
discourse involving gesture and speech. Specifically, we study the
effects of the disease on speech pause and gestural hold distribution.
We analyzed the before and after treatment videos of two PD patients who underwent a treatment program known as Lee Silverman
Voice Treatment (LSVT) [10]. This treatment applies voice therapy
methods to improve the speech intelligibility of PD patients. Subjective observations indicate that not only did speech qualitatively
improve, but the accompanying gesticulation appeared to improve as
well. Our earlier studies have shown that there is an improvement
in overall motion ‘liveliness’ [11] and in the dynamics of
both motion and speech intensity variation [12, 13].
2. BACKGROUND
The reduced ability to communicate is considered to be one of the
most difficult aspects of Idiopathic Parkinson’s Disease (IPD) by
many IPD sufferers and their families. Soft voice, monotone, breathiness, hoarse voice quality, and imprecise articulation, together with
lessened (masked) facial expression, contribute to limitations in communication in the vast majority of individuals with IPD [14, 15].
The initial treatment aim of LSVT is to improve the phonatory source in individuals with IPD. The treatment typically results in significant, long-term improvement in laryngeal valving and
post-treatment changes in thyroarytenoid muscle activity, subglottal
air pressure, maximum flow declination rate, voice sound pressure
level (SPL), loudness and voice quality [16, 17, 18].
LSVT uses phonation as a trigger to increase effort and coordination across the speech production system through stimulating the
global variable “loud.” Speech production is a learned, highly practiced motor behavior, with many of its movements regulated in a
quasiautomatic fashion [19, 20]; loudness scaling is a task that humans engage in all their lives [21, 22]. For example, it is common
to increase loudness to improve speech intelligibility when speaking against noise or when the listener is far away. By targeting
loudness in treatment, well-established, centrally stored motor patterns for speech may be triggered; that is, intensive loudness training may provide the stimulation needed for the individual with IPD
to activate and modulate appropriate speech motor programs that
are still intact. Such multi-level upscaling across the speech system is likely to involve common central neural pathways. Further,
post-treatment observations of increased facial expression accompanying improved loudness and intonation [23], and the results of
a PET brain imaging study of individuals with IPD pre- and postLSVT [24] suggest that training loud phonation may also promote
the recruitment of the right insular cortex and the anterior cingulate
cortex.
Taken together, these findings suggest that the gains accrued in improved phonation spread to other emotional and motoric activity at a neuronal level, and that the observed gesticulatory ‘improvement’ may be a result of this neuronal-level spreading. This is congruent with the basic assertion in our
gesture research that gestural behavior and speech share a common
semantic source and are tightly integrated [25, 26].
3. THE EXPERIMENT
Our pilot dataset was obtained from two patients, both male, who underwent an 8-week LSVT protocol involving therapy for 1 hour, 2 days a week, over the period. We shall designate our two pilot-study subjects P1 and P2. They were 61 and 60 years of age respectively. P1 and P2 had Hoehn & Yahr stage II and stage III diagnoses respectively. Both men had been diagnosed with IPD for 5 years. Both had been stabilized on their anti-Parkinsonian medication prior to LSVT so that behavioral changes were not attributable to medication adjustments. Each patient performed a videotaped narration before and after the treatment. The experimental protocol employed was the standard ‘Tweety and Sylvester’ narrations that are a staple of gesture-speech research [27]. Each patient viewed a series of Tweety-and-Sylvester cartoon clips and was instructed to convey the contents of the clips to an interlocutor so that she could then go and tell it to someone else. The narration was videotaped with a single Super-VHS camera for analysis. The audio was recorded using an AKG C451 EB boom-mounted microphone approximately 2 feet from the subject. The audio was recorded through an amplifier connected to a Panasonic S-VHS AG-1960 VCR. The video was digitized on a Silicon Graphics O2 in SGI-MJPEG format. Audio was digitized at the same time at 44.1 kHz and then downsampled to 14.7 kHz for analysis. P1 and P2 produced 275.3 sec (8259 frames) and 336.033 sec (10081 frames) of pre-treatment narration video respectively, and 219.1 sec (6573 frames) and 291.867 sec (8756 frames) of post-treatment narration video respectively.
4. SPEECH PAUSE ANALYSIS
We first transcribe the speech data manually and run the Entropics word aligner on it. The alignments are imported into the Praat phonetics analysis tool [28] and adjusted to obtain an accurate word- and syllable-level time alignment. We also perform an analysis for sentential boundaries, disfluencies, and speech repairs.
Hesitation and silence in speech have been applied to the study
of the cognitive aspects of speech production [29]. We posit that
the distribution of empty pauses within and between sentential units constitutes a metric of the degree of hesitation and deliberation. While
intra-sentential pauses exist in normal speech, such pauses per unit
time or utterance should remain relatively constant for an individual discussing a particular topic. Increased intra-sentential pauses
would indicate greater hesitation and disfluency in the speech. On
the other hand, since PD sufferers tend to compensate for speech
deficits by rushing their utterances, we expect that better speech
phrasing and deliberation will manifest themselves in increased inter-sentential pause times.
Table 1: PD 1 and 2 Speech Characteristics

                                  Pre-TX    Post-TX   Change %
Percent of time speaking    P1    55.70%    63.61%     14.20%
                            P2    79.05%    80.08%      1.30%
N(EP_S)/N(S)                P1    1.2037    0.7556    -37.23%
                            P2    0.4701    0.4048    -13.89%
T(EP_S)/T(S)                P1    0.2752    0.1999    -27.36%
                            P2    0.1866    0.1354    -27.44%
N(EP_SP)/N(S)               P1    0.5926    0.6889     16.25%
                            P2    0.4530    0.4881      7.75%
T(EP_SP)/T(S)               P1    0.6782    0.9521     40.39%
                            P2    0.3435    0.4406     28.27%
a. Total Number of Repairs
          Pre-TX                   Post-TX
P1        77 (0.279 per sec)       43 (0.196 per sec)
P2        83 (0.247 per sec)       43 (0.147 per sec)

b. Total Duration of Repairs
          Pre-TX                         Post-TX
P1        38.39 sec (14.05% of speech)   24.13 sec (11.3% of speech)
P2        48.69 sec (14.5% of speech)    48.69 sec (7.8% of speech)

Table 2: PD 1 and 2 Speech Repairs
We define EP to be the set of all pauses between words greater than some pause-determination threshold (we use a value of 150 ms). N(EP) is the number of such pauses, and T(EP) is the total duration of such pauses. We define EP_S to be the set of empty pauses within sentences, and EP_SP to be the set of inter-sentential pauses.
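To make these definitions concrete, the following sketch (our illustration, not the authors' tooling) computes the Table 1 ratios from a word-level alignment grouped into sentential units; the (word, start, end) representation and the reading of T(S) as the summed sentence-span duration are assumptions:

# Illustrative sketch (not the authors' code) of the empty-pause metrics.
# Words are (word, start_sec, end_sec) tuples grouped into sentential units;
# the 150 ms pause-determination threshold follows the text.
from typing import Dict, List, Tuple

Word = Tuple[str, float, float]   # (word, start, end) in seconds
PAUSE_THRESHOLD = 0.150           # empty-pause threshold (150 ms)

def empty_pauses(words: List[Word]) -> List[float]:
    """Durations of inter-word gaps longer than the threshold."""
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
    return [g for g in gaps if g > PAUSE_THRESHOLD]

def pause_metrics(sentences: List[List[Word]]) -> Dict[str, float]:
    # EP_S: empty pauses between words inside a sentence.
    ep_s = [p for sent in sentences for p in empty_pauses(sent)]
    # EP_SP: empty pauses between the last word of one sentence and the
    # first word of the next.
    ep_sp = [nxt[0][1] - cur[-1][2]
             for cur, nxt in zip(sentences, sentences[1:])
             if nxt[0][1] - cur[-1][2] > PAUSE_THRESHOLD]
    n_s = len(sentences)                                       # N(S)
    t_s = sum(sent[-1][2] - sent[0][1] for sent in sentences)  # T(S), assumed sentence-span sum
    return {
        "N(EP_S)/N(S)":  len(ep_s) / n_s,
        "T(EP_S)/T(S)":  sum(ep_s) / t_s,
        "N(EP_SP)/N(S)": len(ep_sp) / n_s,
        "T(EP_SP)/T(S)": sum(ep_sp) / t_s,
    }

The percent of time speaking in Table 1 then corresponds to the summed word durations divided by the total narration time.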
From Table 1 we can see that the percent of time speaking (i.e.
word utterance time / total narration time) increased by 14.20% and
1.30% for PD1 and PD2 respectively.
The number of within-sentence pauses per sentence, N(EP_S)/N(S), decreased by 37.23% and 13.89% for
PD1 and PD2 respectively, as did the duration ratio T(EP_S)/T(S)
(by 27.36% for PD1 and 27.44% for PD2). Both of these measures
indicate a reduction in hesitation and an increase in fluency post-TX.
The indices for inter-sentence pauses showed a reverse trend.
N(EP_SP)/N(S) increased by 16.25% for PD1 and by 7.75%
for PD2. Concomitantly, the duration ratio T(EP_SP)/T(S)
increased by 40.38% for PD1 and 28.25% for PD2. Hence phrasing
and deliberateness in both subjects appear to improve.
These consistent trends were surprising since our two subjects
had rather different speech rates. PD1 had a pre-TX speech rate
of 2.725 words per second that increased post-TX to 2.97 words
per second. PD2 was a naturally faster speaker, and his speech rate
slowed slightly from 4.19 to 4.16 words per second.
5. SPEECH REPAIRS
We marked disfluencies in the aligned speech, labeling speech repairs, aborted sentences (utterances that do not terminate and
are not repaired), and filled pauses (such as um). Repairs are
marked as tuples of reparandum (the region of the mistake), editing
words (words that indicate a repair will happen), and alteration
(the repaired speech). An example of a disfluent sentence taken
from our PD data is:
“[third time](A) [uh](B) [second time](C) [excuse me](D) [second time](E)
[he goes up the drainpipe]”
A: This is the reparandum; B: This is an editing word; C: This is the
alteration for A (i.e. the fixup), but it also becomes the reparandum
for the second repair; D: These are the editing words for the C–E repair; and E: This is
the final alteration. Note that the ‘uh’ in B would have been
marked as a filled pause if it were not judged to perform the role of
an editing word in the utterance.
We measured the number of repairs pre-TX and post-TX and
their durations. As can be seen in Table 2, both the total number of
repairs and time taken in repairs improved post-TX.
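As an illustration, a repair could be represented and tallied along the lines of the following sketch (ours; the field names and the normalization by narration duration are assumptions, not the study's annotation format):

# Hypothetical representation of a speech repair and Table 2 style tallies.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Repair:
    reparandum: str      # region of the mistake
    editing_words: str   # words signalling the repair (e.g. "uh", "excuse me")
    alteration: str      # the repaired speech
    start: float         # onset of the reparandum (sec)
    end: float           # offset of the alteration (sec)

def repair_totals(repairs: List[Repair], narration_dur: float) -> Dict[str, float]:
    """Number of repairs, repairs per second, and total time spent in repairs."""
    total_time = sum(r.end - r.start for r in repairs)
    return {
        "count": float(len(repairs)),
        "repairs_per_sec": len(repairs) / narration_dur,
        "repair_time_sec": total_time,
        "repair_time_fraction": total_time / narration_dur,  # denominator assumed
    }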
6. GESTURE ANALYSIS
We apply a fuzzy image processing approach known as Vector Coherence Mapping (VCM) [30] to track the hand motion. VCM
applies spatial coherence, momentum (temporal coherence), speed
limit, and skin color constraints in the vector field computation
using a fuzzy-combination strategy, and produces good results for
hand gesture tracking. We apply an iterative clustering algorithm
that minimizes spatial and temporal vector variance to extract the
moving hands. The positions of the hands in the stereo images are
used to produce 3D motion traces describing the gestures. To detect
holds, we employed a motion energy-based detector to locate places
where there was low motion energy [31]. We have observed that
there is a high degree of cohesion between holds and pauses [32].
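A minimal sketch of such a low-motion-energy hold detector follows (ours, in the spirit of [31], not the detector's actual implementation; the energy measure, frame rate, and thresholds are illustrative assumptions):

# Hedged sketch of a low-motion-energy hold detector; thresholds are assumptions.
import numpy as np

def detect_holds(positions: np.ndarray, fps: float = 30.0,
                 energy_thresh: float = 1e-3, min_dur: float = 0.1):
    """positions: (T, 3) array of 3D hand positions per frame.
    Returns (start_sec, end_sec) intervals where motion energy stays low."""
    vel = np.diff(positions, axis=0) * fps        # finite-difference velocity
    energy = np.sum(vel ** 2, axis=1)             # simple kinetic-energy proxy
    low = energy < energy_thresh
    holds, start = [], None
    for i, is_low in enumerate(low):
        if is_low and start is None:
            start = i
        elif not is_low and start is not None:
            if (i - start) / fps >= min_dur:
                holds.append((start / fps, i / fps))
            start = None
    if start is not None and (len(low) - start) / fps >= min_dur:
        holds.append((start / fps, len(low) / fps))
    return holds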
We divided holds into three kinds: rest holds, where both hands
were at the rest position; pre-stroke holds, post-stroke holds, and stationary points in a motion sequence, which are non-rest holds with durations
shorter than 200 msec (pre- and post-stroke holds typically take place
before or after a motion stroke, and stationary points are typical in
oscillatory gestures where the hand stops momentarily at the stroke
extrema); and long hold strokes (non-rest holds longer than 200 msec).
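This three-way split could be expressed as in the following sketch (ours; whether both hands are in the rest position is assumed to come from a separate rest-zone test):

# Illustrative classification of a detected hold into the three kinds above.
def classify_hold(start: float, end: float, at_rest: bool) -> str:
    """start/end in seconds; at_rest: both hands in the rest position
    (assumed to be determined by a separate rest-zone test)."""
    if at_rest:
        return "rest_hold"
    if (end - start) < 0.200:          # pre-/post-stroke holds, stationary points
        return "short_non_rest_hold"
    return "long_hold_stroke"          # non-rest hold >= 200 ms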
We posit that long non-rest holds have different meanings depending on whether they intersect with pauses or words. If they intersect with words, they are more likely to be ‘hold-strokes’ (information-laden parts of a gesture that include an effortful hold). If
they intersect with empty pauses within sentences, they are likely
to reinforce the concept of hesitation. Some silences occur within
sentences while the hand is moving; these may serve a coordinating
function between gesture and speech (as when one says “When you
[pause, while the hand moves up to prepare a throwing action] pass
the ball [hand thrusts forward in throwing action] down the
field”). We posit that within-sentence pauses that intersect with non-rest holds are likely to signal word search or hesitation behavior.
For rest holds, we posit that a predominance of these holds during speech indicates a decrease in ideational imagery. We computed the intersection of these hold phenomena with the between-sentence pauses, within-sentence pauses, and sentence durations.
We define NRH to be the set of all non-rest holds greater than our 0.2 sec threshold, and RH to be the set of all rest holds. T(NRH) and T(RH) are the total durations of such NRH and RH respectively. We define NRH∩EP_S to be the set of intersections of NRH with empty pauses within sentences, RH∩S to be the set of intersections between rest holds and sentences, and RH∩EP_SP to be the set of intersections between rest holds and inter-sentence pauses.
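These measures reduce to summed pairwise interval overlaps, as in the following sketch (our illustration; the interval representation is an assumption, and the final quotient mirrors the rest-hold ratio discussed below):

# Sketch of the interval-intersection measures; intervals are (start, end) in sec.
from typing import Dict, List, Tuple

Interval = Tuple[float, float]

def overlap(a: Interval, b: Interval) -> float:
    """Duration of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def total_intersection(xs: List[Interval], ys: List[Interval]) -> float:
    return sum(overlap(x, y) for x in xs for y in ys)

def hold_pause_measures(nrh: List[Interval], rh: List[Interval],
                        ep_s: List[Interval], ep_sp: List[Interval],
                        sentences: List[Interval],
                        discourse_dur: float) -> Dict[str, float]:
    t_nrh_eps = total_intersection(nrh, ep_s)      # T(NRH ∩ EP_S)
    t_rh_s = total_intersection(rh, sentences)     # T(RH ∩ S)
    t_rh_epsp = total_intersection(rh, ep_sp)      # T(RH ∩ EP_SP)
    return {
        # msec of non-rest-hold / within-sentence-pause overlap per sec of discourse
        "T(NRH∩EP_S) per sec (msec)": 1000.0 * t_nrh_eps / discourse_dur,
        # quotient relating rest holds between sentences to rest holds during sentences
        "T(RH∩EP_SP)/T(RH∩S)": t_rh_epsp / t_rh_s,
    }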
For PD1, T(NRH∩EP_S) as a fraction of total discourse
time was 34.025 msec per sec pre-TX. This declined to 9.57 msec/sec
(a 71.87% reduction) post-TX. For PD2, this value was 21.43 msec/sec pre-TX, but it increased slightly to 24.43 msec/sec (an increase of 3 msec/sec) post-TX.
To obtain the relationship of rest holds within and between sentences, we took the quotient

    Q = T(RH ∩ EP_SP) / T(RH ∩ S).

For PD1, this quotient increased from 0.181 pre-TX to 0.337 post-TX (an increase of 86.19%). For PD2, this quotient increased similarly from 0.364 pre-TX to 0.770 post-TX (an increase of 111.54%).
The strong decrease in T(NRH∩EP_S) as a fraction of total
discourse time in PD1 matches our finding of improved fluency in
the speech. The problem with PD2 is that there was a shift in the
experimental conditions pre-TX and post-TX. The subjects were
seated in a therapy chair with adjustable armrests (they could be
engaged or not). In the pre-TX case, the arm rests were not engaged, and the subjects gestured with the rest position on their laps.
Hence each excursion from rest required greater effort. In the post-TX case, the arm rests were erroneously engaged. PD1 still used
his lap as the rest position while PD2 rested his elbows on the arm
rests and clasped his hands in front of his mid-section. This resulted
in a markedly different dynamic in his gesticulation. Since the gesture and rest spaces were very close, it was difficult to tell when
he was at rest and when the mid-length holds were rest or non-rest
holds. Furthermore, because of the reduced effort required, PD2
post-TX engaged in many more beat gestures directly from rest position. Hence, we have to conclude that his results were inconclusive for the non-rest hold comparison.
The consistency of the pre-TX to post-TX change in our second rest-hold measure indicates that there was more gesticulatory activity post-TX than before.
7. CONCLUSIONS
We have presented the results of our pilot work on speech and
vision-based gesture metrics to access the cognitive change in PD
patients before and after LSVT. We employed a speech pause and
gesture hold analysis for this purpose. We stress that these results
are preliminary, but they do suggest that the study of change in multimodal language characteristics in narrative discourse may be of
utility in understanding the broad cognitive changes in individuals
with PD.
The temporal relation between pauses and sentences was consistent across the two subjects in our study even though there was
a large disparity between the speech rates and disease conditions of
the two patients. The gestural analysis of rest and non-rest holds
with respect to speech pause and sentences in PD1 agreed with our
basic hypotheses. In PD2, the confounding situation of the arm rests
contributed to mixed results.
The purpose of this pilot work was to show the promise of multimodal linguistic analysis to study cognition and language in PD
patients. In our ongoing work, we are extending our study to more
subjects and devising more robust metrics for such cognitive evaluation.
8. REFERENCES
[1] http://www.pdindex.com/,
“PD INDEX: A directory of
parkinson’s disease information on the internet”, Internet Publication.
[2] J. Jankovic, “Pathophysiology and clinical assessment of motor symptoms in parkinsons disease”, in W. Koller, editor,
Handbook of Parkinsons disease, pp. 99–126. Marcel Dekker,
New York, 1987.
[3] M. Hallett and S. Khoshbin, “A physiological mechanism of
bradykinesia”, Brain, vol. 103, pp. 301–314, 1980.
[4] B. Dubois. and B. Pillon., “Cognitive deficits in Parkinson’s
disease”, J Neurol., vol. 244, pp. 2–8, Jan. 1997.
[5] M. Morris, R. Iansek, F. Smithson, and F. Huxham, “Postural
instability in parkinson’s disease: a comparison with and without a concurrent task”, Gait Posture, vol. 12, pp. 205–216,
Dec. 2000.
[6] K. Forrest, G. Weismer, and G. Turner, “Kinematic, acoustic, and perceptual analyses of connected speech produced
by parkinsonian and normal geriatric adults”, Journal of the
Acoustical Society of America, vol. 85, pp. 2608–2622, 1989.
[7] J. Logemann, H. Fisher, B. Boshes, and E. Blonsky, “Frequency and cooccurrence of vocal tract dysfunctions in the
speech of a large sample of parkinson individuals”, Journal
of Speech and Hearing Disorders, vol. 42, pp. 47–57, 1978.
[8] P. Lieberman, E. Kako, J. Friedman, G. Tajchman, L.S. Feldman, and E.B. Jiminez, “Speech production, syntax comprehension, and cognitive deficits in parkinson’s disease”, Brain
and Language, vol. 43, pp. 169–189, 1992.
[9] F.M. Lewis, L.L. Lapointe, B.E. Murdoch, and H.J. Chenery,
“Language impairment in parkinson’s disease”, Aphasiology,
vol. 12, pp. 193–206, 1998.
[10] L. Ramig, S. Countryman, L. Thompson, and Y. Horii, “A
comparison of two forms of intensive speech treatment for
parkinson disease”, Journal of Speech and Hearing Research,
vol. 38, pp. 1232–1251, 1995.
[11] F. Quek, R. Bryll, and L. Ramig, “Toward vision and gesture
based evaluation of parkinson disease from discourse video”,
in IEEE Wksp Cues in Comm., Kauai, Hawaii, Dec.9 2001.
[12] F. Quek, R. Bryll, M. Harper, L. Chen, and L. Ramig, “Audio and vision-based evaluation of parkinsons disease from
discourse video”, in IEEE International Symposium on Bio-Informatics and Bio-Engineering, pp. 246–253, Bethesda,
MD, Nov. 4–6 2001.
[13] F. Quek, R. Bryll, M. Harper, L. Chen, and L. Ramig, “Speech
and gesture analysis for evaluation of progress in lsvt treatment in parkinson disease”, in Eleventh Biennial Conference
on Motor Speech: Motor Speech Disorders, Williamsburg,
VA, Mar.14-17 2002.
[14] T. Pitcairn, S. Clemie, J. Gray, and B. Pentland, “Non-verbal
cues in the self-presentation of parkinsonian patients”, British
Journal of Clinical Psychology, vol. 29, pp. 177–184, 1990.
[15] T. Pitcairn, S. Clemie, J. Gray, and B. Pentland, “Impressions
of parkinsonian patients from their recorded voices”, British
Journal of Disorders of Communication, vol. 25, pp. 85–92,
1990.
[16] C. Baumgartner, S. Sapir, and L. Ramig, “Perceptual voice
quality changes following phonatory-respiratory effort treatment (LSVT) vs. respiratory effort treatment for individuals
with parkinson disease”, Journal of Voice, vol. in press, 2001.
[17] L. Ramig and C. Dromey, “Aerodynamic mechanisms underlying treatment-related changes in vocal intensity in patients
with parkinson disease”, Journal of Speech and Hearing Research, vol. 39, pp. 798–807, 1996.
[18] L. Ramig, S. Sapir, K. Baker, and M. Smith, “Electromyographic changes in laryngeal muscle activity following LSVT
in idiopathic parkinson disease”, Journal of Speech, Language, and Hearing Research, vol. in review, 2001.
[19] H. Ackermann, D. Wildgruber, I. Daum, and W. Grodd, “Does
the cerebellum contribute to cognitive aspects of speech production? a functional magnetic resonance imaging (fMRI)
study in humans”, Neuroscience letters, vol. 247, pp. 187–
90, 1998.
[20] S. Hirano, H. Kojima, Y. Naito, I. Honjo, Y. Kamoto,
H. Okazawa, K. Ishizu, Y. Yonekura, Y. Nagahama,
H. Fukuyama, and J. Konishi, “Cortical processing mechanism for vocalization with auditory verbal feedback”, Neuroreport, vol. 8, pp. 2379–82, 1997.
[21] A. Ho, J. Bradshaw, R. Iansek, and R. Alfredson, “Speech
volume regulation in parkinson’s disease: effects of implicit
cues and explicit instructions”, Neuropsychologia, vol. 37,
pp. 1453–60, 1999.
[22] AK Ho, JL Bradshaw, and T Iansek, “Volume perception
in parkinsonian speech”, Movement Disorders, vol. 15, pp.
1125–31, 2000.
[23] J. Spielman, L. Ramig, and J. Borod, “Preliminary effects
of voice therapy on facial expression in parkinsons disease”,
Journal of the International Neuropsychological Association,
vol. 7, pp. 244, 2001.
[24] M. Liotti, D. Vogel, L. Ramig, P. New, C. Cook, and
P. Fox, “Functional reorganization of speech-motor function
in parkinson disease following LSVT: A PET study”, Neurology, vol. in review, 2001.
[25] F. Quek, D. McNeill, R. Ansari, X. Ma, R. Bryll, S. Duncan, and K-E. McCullough,
“Multimodal human discourse: Gesture and speech”,
ToCHI, vol. In Press,
2001, VISLab, Wright State U., Tech. Report VISLab-01-01,
http://vislab.cs.wright.edu/Publications/Queetal01.html.
[26] D. McNeill, F. Quek, K.-E. McCullough, S. Duncan, N. Furuyama, R. Bryll, X.-F. Ma, and R. Ansari, “Catchments,
prosody and discourse”, in in press: Gesture, 2001.
[27] D. McNeill, Hand and Mind: What Gestures Reveal about Thought, U. Chicago Press, Chicago, 1992.
[28] P. Boersma and D. Weenink, “Praat, a system for doing phonetics by computer”, Technical Report 132, Institute
of Phonetic Sciences of the University of Amsterdam, 1996.
[29] B. Butterworth and G. Beattie, “Gesture and silence as indicators of planning in speech”, in R.N. Campbell and P.N. Smith,
editors, Recent Advances in the Psychology of Language: Formal and Experimental Approaches. Plenum Press, New York,
1978.
[30] F. Quek, X. Ma, and R. Bryll, “A parallel algorithm for dynamic gesture tracking”, in ICCV’99 Wksp on RATFG-RTS.,
pp. 119–126, Corfu, Greece, Sep.26–27 1999.
[31] R. Bryll, F. Quek, and A. Esposito, “Automatic hold detection in natural conversation”, in IEEE Wksp Cues in Comm.,
Kauai, Hawaii, Dec.9 2001.
[32] A. Esposito, K-E. McCullough, and F. Quek, “Disfluencies in
gesture: Gestural correlates to speech silent and filled pauses”,
in IEEE Wksp Cues in Comm., Kauai, Hawaii, Dec.9 2001.