ICPhS XVII, Regular Session, Hong Kong, 17-21 August 2011

ACOUSTIC FEATURES OF FOUR TYPES OF LAUGHTER
IN NATURAL CONVERSATIONAL SPEECH

Hiroki Tanaka (a) & Nick Campbell (a, b)

(a) Nara Institute of Science and Technology, Japan; (b) Trinity College Dublin, Ireland
[email protected]; [email protected]
ABSTRACT
This paper presents the results of an analysis of the representative sounds of human laughter from a large corpus of naturally-occurring conversational speech. Two contrasting manners of laughter were categorized for the study, polite formal laughs and sincere mirthful laughs, and a formant analysis was performed on four phonetic classes of laugh therein. Laughing speech was also common in the corpus but is not addressed in this work. Statistical analysis of the acoustic features of each laugh was performed, and formant parameters were compared for each call type within a laughter bout. The paper details the formant characteristics and shows how polite laughter can be distinguished from sincere laughter on this basis using a trained statistical recogniser.
Keywords: spoken discourse, natural interaction,
laughter, formant analysis, polite vs sincere
laughs
1. INTRODUCTION
Laughter forms an essential component of human
spoken interaction but has not been the subject of
mainstream phonetic research until quite recently.
An analysis of a large corpus of spontaneous conversational speech recorded in highly natural situations revealed that laughter played an important role in all dialogues, but the laughs only occasionally resulted from deliberate humour; polite laughter and nervous social laughter accounted for more than half the number of laugh bouts [2].
There is evidence that laughter is deeply rooted in human biology. “Based on acoustic form and likely phylogenetic history, laughter is argued to have evolved primarily as a vehicle of emotional conditioning. In this view, human laughter emerged because it helps foster and maintain positive, mutually beneficial relationships among individuals with genuine liking for one another. It is predicted to as easily have the opposite role among those who do not” [5].

The most extensive study of the spectral properties of humorous laughter was performed by Bachorowski and colleagues [1] and was based on laugh bouts recorded from 97 young adults as they watched funny video clips. The research revealed a consistent lack of articulation effects in supralaryngeal filtering, and reported formant-related filtering effects which were found to be disproportionately important as acoustic correlates of laughter, sex, and individual identity.

At a satellite workshop of the 16th ICPhS in Saarbrücken, Szameitat et al. [6] reported a similar acoustic analysis of several kinds of laughter, but employed professional actors to produce the laughs. They confirmed the Bachorowski et al. finding that laughter syllables are predominantly formed with central vowels, and showed that, compared to speech production, the first formant of laughter vowels is occasionally characterized by exceptionally high frequencies, which may be the result of a wide jaw opening and/or pharyngeal changes in “pressed” voice, while the vowel elements during laughter showed a relatively stable individual pattern.

In a seminal study of the segmentation of laughs, Trouvain [8] suggests that we consider laughter as articulated speech: at the lowest level there are sound segments that are either vowels or consonants; at the next higher level there are syllables consisting of sound segments; and the next higher level deals with larger units, such as phrases, which are made up of several syllables. Owren [5] recommends the term ‘bout’ for the longer sequence and ‘call’ for the individual syllables; we adopt that terminology in this paper.

Our principal goal in this study is to help autistic children distinguish types of laughter, in terms of phonetic classes of laugh in conversational speech.
2. TYPES OF LAUGHTER
Table 1 shows the counts of laughs extracted from
a 30-minute conversation from the ESP speech
corpus [4]. We determined five categories of laugh in all (1: mirthful, 2: politeness, 3: embarrassment, 4: derision, and 5: other) from independent labeling by eighteen people; the kappa statistic, using Cohen’s method, was 0.37 (p = 5.179e-10, significant).
Table 1: Counts of laughs and non-laughs in a representative thirty-minute conversation between two Japanese males (JMA and JMB).

Type                Count   Prop   Cumulative Prop
not laughs           6999   none   none
2 (politeness)        472    54%    54%
1 (mirthful)          244    28%    82%
3 (embarrassment)     107    12%    94%
5 (other)              49     5%    99%
4 (derision)            4     1%   100%
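The agreement figure above can be checked with standard tools. The following Python sketch (not part of the paper) computes an average pairwise Cohen's kappa over any number of raters; the label sequences shown are invented, and with eighteen labellers Fleiss' kappa would be a common alternative to averaging pairwise scores.

# Sketch: average pairwise Cohen's kappa over multiple raters.
# Each rater's labels are assumed to be a sequence of category codes (1-5)
# aligned over the same set of candidate laugh events; the data are made up.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(ratings):
    """ratings: list of equal-length label sequences, one per rater."""
    scores = [cohen_kappa_score(a, b) for a, b in combinations(ratings, 2)]
    return float(np.mean(scores))

# Example with three hypothetical raters labelling six laugh events:
raters = [
    [1, 2, 2, 3, 1, 2],
    [1, 2, 2, 2, 1, 2],
    [2, 2, 1, 3, 1, 2],
]
print(f"mean pairwise kappa = {mean_pairwise_kappa(raters):.2f}")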
The mirthful and polite laughs accounted for more than 80 percent of all laughs. Embarrassed laughs are difficult to distinguish acoustically from polite laughs, so we conflate them with ‘polite’ here. The table shows counts of labels for both laughs and laughing speech, but we omit laughing speech because of its linguistic complexity. We therefore categorize the data into two main types: hearty or mirthful laughs, labeled ‘m’, and polite laughs, labeled ‘p’, in the figures below.
Figure 1: Formant estimation in WaveSurfer, showing labels for each call (nasal, voiced, chuckle, or ingressive). These raw estimates were post-processed as explained in Section 3.2 of the text.
3. ANALYSIS OF LAUGHTER
3.1. Data

The ESP-C corpus was recorded over several months, with paid volunteers coming in once a week to talk informally, with no specific instruction as to content, to designated partners on a different floor of the same building over an office telephone. While talking, they wore a head-mounted Sennheiser HMD-410 close-talking dynamic microphone and recorded their speech directly to DAT (digital audio tape) at a sampling rate of 48 kHz. They did not see their partners or socialize with them outside of the recording sessions. Partner combinations were controlled for sex, age, and familiarity, and all recordings were transcribed and time-aligned for subsequent analysis. Recordings continued for a maximum of ten sessions between each pair, and each conversation lasted thirty minutes. In all, ten people took part as speakers in the corpus recordings, five male and five female; six were Japanese, two Chinese, and two native speakers of American English.

The speech data were transferred to a computer, segmented into separate files, and transcribed manually. Laughs were marked with a special diacritic, and laughing speech was also bracketed with the same diacritic to show which sections were spoken with a laughing voice. Laughs were transcribed using the Japanese phonetic Katakana orthography wherever possible, alongside the use of the symbol described in [3].

3.2. Formant feature extraction

For this study, we selected one typical conversation and analysed the voice of one male speaker (JMA). Figure 1 shows a sample of his labeled speech, identifying different kinds of laugh. Four phonetic manners of laugh were identified for further annotation in this conversation, to distinguish nasal, ingressive, voiced and chuckle subtypes at the call level. Formant analysis was performed using the Tcl/Tk ‘Snack’ speech processing toolkit [7], which includes legacy code from David Talkin’s Entropic Signal Processing System (ESPS), including get_f0 and formant.
Figure 1 shows a sample of the formant
estimation. It should be clear from the figure that
the estimation is good for strongly voiced portions
of the speech but can be almost random at other
parts. We therefore devised a filter based on rms
signal energy and thresholded the estimates to
disregard all sections where speech power was
lower than the overall mean determined for the
signal.
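The original processing used the Snack/ESPS trackers named in Section 3.2; purely as an illustration of the gating step described above, the following Python sketch extracts formant tracks with Praat via the parselmouth library and keeps only frames whose RMS energy exceeds the overall mean. The file name, frame settings and library choice are assumptions, not the paper's actual pipeline.

# Sketch: formant tracking with an RMS-energy gate, keeping only frames whose
# power exceeds the overall mean (the filtering step described in Section 3.2).
# Uses Praat via parselmouth as a stand-in for the Snack/ESPS tools in the paper.
import numpy as np
import parselmouth  # pip install praat-parselmouth

def gated_formants(wav_path, n_formants=4, frame_step=0.01, frame_len=0.025):
    """Return formant estimates (Hz) for frames above the mean RMS energy."""
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg(time_step=frame_step)

    samples = snd.values[0]              # mono waveform as a numpy array
    sr = snd.sampling_frequency
    hop, win = int(frame_step * sr), int(frame_len * sr)

    rows, energies = [], []
    for start in range(0, len(samples) - win, hop):
        frame = samples[start:start + win]
        t = (start + win / 2) / sr        # frame-centre time in seconds
        energies.append(np.sqrt(np.mean(frame ** 2)))
        rows.append([formant.get_value_at_time(i + 1, t) for i in range(n_formants)])

    rows, energies = np.array(rows), np.array(energies)
    keep = energies > energies.mean()         # discard frames below the overall mean power
    keep &= ~np.isnan(rows).any(axis=1)       # also drop frames with no formant estimate
    return rows[keep]                         # shape: (n_kept_frames, n_formants)

# Hypothetical usage on one labelled laugh call:
# f = gated_formants("jma_laugh_call_017.wav")
# print(f.mean(axis=0), f.std(axis=0))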
3.3. Analysis results
The average fundamental frequency for each
laugh was: 159 Hz (sd 30) for voiced laughs, 234
Hz (sd 64) for ingressives, and 197 Hz (sd 76) for
chuckles. Fundamental frequency estimation was
not reliable for the nasal grunts.
The difference between the second and first formants (a useful measure of fronting) was 801 Hz (sd 189) for voiced (v), 825 Hz (sd 207) for ingressive (i), 714 Hz (sd 195) for chuckle (c), and 863 Hz (sd 244) for nasal (n) calls. For comparison, the values for the three vowels were /a/ 714 Hz (sd 204), /i/ 1440 Hz (sd 352), and /u/ 954 Hz (sd 196).
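For completeness, the fronting measure reported above is simply F2 minus F1 aggregated per category; a minimal pandas sketch, with invented measurements rather than the paper's data, is:

# Sketch: per-call-type mean and sd of the F2 - F1 fronting measure.
# The data frame contents here are illustrative, not the paper's measurements.
import pandas as pd

df = pd.DataFrame({
    "call_type": ["v", "v", "i", "c", "n"],
    "F1": [640.0, 655.0, 490.0, 505.0, 610.0],
    "F2": [1440.0, 1452.0, 1310.0, 1220.0, 1475.0],
})
df["F2_minus_F1"] = df["F2"] - df["F1"]
print(df.groupby("call_type")["F2_minus_F1"].agg(["mean", "std", "count"]))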
The summary statistics obtained from the formant analysis are plotted in Figures 3 and 4. Figure 3 shows estimates of the formant averages for all instances of three vowels (/a/, /i/, /u/, extracted from the conversational speech) in the various formant spaces, alongside equivalent plots for the four types of laughter. The figures confirm, with data from spontaneous conversational speech, the findings of previous researchers using acted or humorous laughter: these laughs tend to cluster in the ‘schwa’ space, in spite of being perceptually more vocalic (i.e., often transcribed as ‘haha’, ‘hihi’ or ‘hoho’).
Table 2: Counts and formant values in Hz (mean with sd in brackets) for each phonetic type described in the text.

id     count   F1          F2           F3           F4
/a/     47     600 (71)    1315 (181)   2511 (84)    3889 (327)
/i/     38     443 (98)    1883 (280)   2760 (278)   3821 (260)
/u/     53     423 (60)    1378 (184)   2356 (206)   3692 (336)
v       88     643 (138)   1445 (165)   2575 (215)   3559 (208)
i       58     487 (72)    1312 (186)   2414 (221)   3308 (261)
c       69     500 (204)   1215 (239)   2136 (378)   3022 (435)
n       59     617 (299)   1481 (280)   2346 (450)   3384 (529)
Figure 3: Plots of three vowels (/a/, /i/, /u/) extracted from the same conversational speech and four types of laugh (i: ingressive, n: nasal, v: voiced, c: chuckle) in the various formant spaces (left: F1-F2, middle: F1-F3, right: F3-F4).
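Plots of this kind can be reproduced from per-call mean formant values; a minimal matplotlib sketch of the leftmost (F1-F2) panel is given below. The input file and column names are hypothetical, not the paper's data files.

# Sketch: scatter of per-call mean formants in the F1-F2 plane, one colour per
# category, mirroring the leftmost panel of Figure 3. Input columns are assumed.
import matplotlib.pyplot as plt
import pandas as pd

calls = pd.read_csv("call_formant_means.csv")   # hypothetical file: id, F1, F2, F3, F4
fig, ax = plt.subplots()
for label, group in calls.groupby("id"):        # id in {/a/, /i/, /u/, v, i, c, n}
    ax.scatter(group["F2"], group["F1"], label=label, alpha=0.6)
ax.invert_xaxis()   # conventional vowel-chart orientation
ax.invert_yaxis()
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
ax.legend()
plt.show()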
Table 3: Confusion matrix (predictions) from HMM prediction trained on spectral features, testing unseen data.

              voiced   ingr.   chuckle   nasal   % correct
voiced            39       0         2       0       95.1%
ingressive         0      10         7       0       58.8%
chuckle            1       1        24       3       82.8%
nasal              0       0         0      19      100.0%
Figure 2 shows how the polite laughs are distinguished from mirthful laughter in duration and number of calls. Polite laughter includes voiced laughs that sound similar to mirthful laughs, as well as nasal grunts (‘hmmh’), but rarely includes chuckles and never the ingressive laugh [3]. Mirthful laughter tends to be longer and to include more calls per bout.
Figure 2: Plot of log duration and number of calls per bout.
Figure 4 shows, by means of boxplots, the distribution within each formant space for the three vowels and four laugh call types. Figure 5 shows the distribution of each call type in the two manners of laugh. A Student’s t-test comparing speakers JMA and JMB, carried out to check how well the results generalize across speakers, gave significant differences in the formant values of voiced laughs (F1: p = 2.2e-16, F2: p = 3.0e-15, F3: p = 1.1e-15, F4: p = 2.9e-08).
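That comparison is a standard two-sample t-test on formant values; a small SciPy sketch, with placeholder F1 values standing in for the two speakers' voiced-laugh measurements, is:

# Sketch: Student's t-test comparing one formant of voiced laughs across speakers.
# f1_jma and f1_jmb would hold F1 estimates (Hz) for each speaker's voiced laughs;
# the values below are placeholders, not the paper's measurements.
import numpy as np
from scipy import stats

f1_jma = np.array([640.0, 655.0, 630.0, 660.0, 648.0])
f1_jmb = np.array([705.0, 690.0, 712.0, 698.0, 720.0])

t, p = stats.ttest_ind(f1_jma, f1_jmb)  # add equal_var=False for Welch's variant
print(f"t = {t:.2f}, p = {p:.3g}")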
4. STATISTICAL MODELLING
The discrimination of four phonetic classes of laugh enables us to classify the manner of laughter (i.e., a bout containing ingressive laughs is likely to be a sincere mirthful laugh). We applied HMM statistical modeling, based on a conventional phone recognition model, to the automatic classification of each call type. To test the degree to which this spectral information can be used to distinguish between the different types of laugh, we trained an HMM on a subset of the laughs using the following features: MFCCs, RMS power, and their deltas, and tested on a different subset, achieving a prediction accuracy of 86.79% (train: v 48, i 42, c 44, n 47; test: v 41, i 17, c 29, n 19). Table 3 shows the confusion matrix and results from the prediction.
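The paper does not give the recogniser configuration beyond the feature set and the phone-recognition framing, so the following Python sketch is only one plausible realisation: a Gaussian HMM per call type over MFCC, RMS-power and delta frames, with an unseen call assigned to the model scoring the highest likelihood. The libraries (librosa, hmmlearn) and all settings are assumptions.

# Sketch: per-class Gaussian HMMs over MFCC + RMS-power (+ delta) frames,
# classifying an unseen call by maximum model likelihood. Settings are illustrative.
import librosa
import numpy as np
from hmmlearn import hmm

def features(wav_path, sr=16000):
    """Frame-level MFCC + RMS features with their deltas (assumes calls >~ 0.1 s)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
    rms = librosa.feature.rms(y=y)
    feats = np.vstack([mfcc, rms])
    feats = np.vstack([feats, librosa.feature.delta(feats)])
    return feats.T                      # shape: (n_frames, n_features)

def train_models(train_files):
    """train_files: dict mapping call type ('v','i','c','n') to a list of wav paths."""
    models = {}
    for label, paths in train_files.items():
        seqs = [features(p) for p in paths]
        X = np.concatenate(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, wav_path):
    """Assign the call to the class whose HMM gives the highest log-likelihood."""
    X = features(wav_path)
    return max(models, key=lambda label: models[label].score(X))

# Hypothetical usage:
# models = train_models({"v": v_paths, "i": i_paths, "c": c_paths, "n": n_paths})
# print(classify(models, "unseen_call.wav"))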
Figure 4: Distribution within each formant space for the three vowels and four laugh types plotted in Figure 3.
5. CONCLUSION
Previous work reported formant frequency data
for laughs using either acted samples or elicited
humorous laughs. The present study reports
findings from an analysis of natural spontaneous
laughter in conversational speech which reinforce
the findings of earlier studies. The present work extended the previous results to distinguish four phonetic types of laugh and two social manners of laughing. We were able to distinguish between
these at levels much better than chance with a
statistical detector trained on spectral features.
This laughter detection is currently being
integrated into a device to help autistic children
distinguish between the two types of laughter.
Figure 5: Proportion of calls of each laugh type in polite and mirthful variants, from a total of 26 bouts of laughter.
6. REFERENCES

[1] Bachorowski, J.A., Smoski, M.J., Owren, M.J. 2001. The acoustic features of human laughter. J. Acoust. Soc. Am. 110, 1581-1597.
[2] Campbell, N. 2007. Changes in voice quality due to social conditions. Proc. 16th ICPhS Saarbrücken, 2093-2096.
[3] Campbell, N., Kashioka, H., Ohara, R. 2005. No laughing matter. Proc. Interspeech, 465-478.
[4] ESP (Expressive Speech Processing) corpus: http://www.speech-data.jp
[5] Owren, M. 2007. Understanding acoustics and function in spontaneous human laughter. In Trouvain, J., Campbell, N. (eds.), Interdisciplinary Workshop on the Phonetics of Laughter, Saarbrücken.
[6] Szameitat, D.P., Darwin, C.J., Szameitat, A.J., Wildgruber, D., Alter, K. 2011. Formant characteristics of human laughter. Journal of Voice 25(1), 32-37.
[7] Tcl/Tk Snack Toolkit, www.speech.kth.se/snack/.
[8] Trouvain, J. 2003. Segmenting phonetic units in laughter. Proc. 15th ICPhS Barcelona, 2793-2796.