Acoustic Analysis of Emotional Speech in Mandarin Chinese
Sheng Zhang1, P.C. Ching2, Fanrang Kong1
1 Dept. of Precision Machinery and Precision Instrumentation,
University of Science & Technology of China, 230027
2 Dept. of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong
Abstract. In this paper, the vocal expressions of the human emotions of anger, fear,
joy and sadness are acoustically analyzed in relation to neutral speech. The
features under investigation include duration, short-time amplitude, and pitch,
at both the word and the sentence level. The study focuses in particular on word
stress in Mandarin speech signals with emotion arousal. Based on the analytical
results, an overall quantitative measure of the distinctive characteristics of
emotional speech is presented.
Keywords: Mandarin Emotional Speech, Acoustic Analysis, Word Stress
1. Introduction
Spoken language is one of the most natural forms of communication among human
beings. Natural speech carries not only linguistic but also paralinguistic information,
such as the speaker’s social, physiological, and emotional status. Many researchers
consider this topic challenging, and it has received considerable attention lately [1, 2].
From an application perspective, there is strong motivation to conduct emotion
recognition, that is, to determine the emotional state of a particular speaker from
uttered speech samples.
Emotion recognition is achievable only if the acoustic characteristics of the speech
signal contain reliable correlates of emotion. Many researchers have investigated the
possibility of recognizing emotion from speech [1, 3, 4], and essentially all agree that
the most crucial aspects are those related to prosody: the pitch (or F0) contour, the
intensity contour, and the timing of utterances. However, the literature on emotion
recognition in Mandarin Chinese is sparse. Among the existing work, Chen conducted
spectral, tonal and durational analyses at different phonetic levels based on
perceptually classified friendly and neutral speech data [5]. Yuan suggested that there
are two dimensions in the acoustic realization of anger, fear, joy and sadness in
Chinese, namely phonation and prosody: anger and fear are mainly realized through
phonation, joy is mainly realized through the prosody of F0, and sadness is realized on
both dimensions [6]. The acoustic characteristics of Mandarin Chinese speech that are
relevant to emotion recognition therefore still need to be explored.
The remainder of this paper is organized as follows. Section 2 describes the emotional
speech samples. In Section 3, we first analyze the acoustic features, including
duration, short-time amplitude, and pitch, at both the sentence level and the word
level, and then compare the results between the two levels. In Section 4, we analyze
word stress in emotional Mandarin speech and present the abovementioned features
for stressed words. In Section 5, the collected information about the distinctive
characteristics of each emotion is discussed. The discussion also shows that
frequency-domain acoustic analysis should be pursued in future work for an in-depth
understanding of emotional expressions such as joy and fear in Mandarin Chinese
speech.
This study provides a better understanding of how speech differs across emotions in
terms of its acoustic features, and therefore contributes to emotion recognition
research on Mandarin Chinese speech signals. As there is no convention yet, the
emotional states investigated are limited to four: anger, fear, joy, and sadness. This
set is often supplemented by a neutral state that serves as a non-emotional reference,
and this selection is noted to offer a certain degree of international comparability [7].
2. Emotional Speech Samples
As it is difficult to find a common set of emotional speech data, we collected the
emotional speech signals for our experiments in an acoustically isolated room, and we
also designed the speech text materials. The speech data were recorded at a sampling
rate of 16 kHz with 16-bit resolution. Twenty sentences (uttered three times each)
were analyzed for every emotion. Of the 20 sentences, five were common to all
emotions, while 15 were specific to an individual emotion. Four males and four
females participated in the data collection process.
Table 1. Target sentences
Serial  Sentence
1       Zen3 Me0 Hui4 Zhe4 Yang0 (怎么会这样): What’s up?
2       Ni3 Hui2 Lai2 Le0 (你回来了): You’ve come back.
3       Zhen1 Mei2 Xiang3 Dao4 Ni3 Hui4 Zhe4 Yang0 (真没想到你会这样): Never thought that you’d do this.
4       Zhe4 Li3 Bian4 De0 Yue4 Lai2 Yue4 Re4 Le0 (这里变得越来越热了): It’s becoming hotter and hotter here.
5       Wo3 Men2 Yao4 Tian1 Tian1 Dou1 Chi1 Pi1 Sa0 (我们要天天都吃批萨): We are going to have pizza every day.
To test the validity of the data, all of the sound clips were played in random order.
Listeners (not the speakers) rated how strong the emotion was in each utterance,
based on their subjective judgment, using three levels: very obvious, obvious, and not
obvious. The utterances were then grouped according to their rated emotional level.
In addition, we used a spontaneous speech sample produced by a single female
speaker for discriminant analysis. The target sentences were the five common
sentences mentioned above, which were designed to be consistent with all four
emotions, as shown in Table 1.
3. Acoustic Analysis
3.1 Duration
Sentence and word lengths were selected as duration features. To eliminate the effect
of the differing number of words per sentence, we define LRS, the vector of ratios of
the emotional sentence length (sec) to the length of the same sentence in the neutral
state, as:
$$
\mathrm{LRS} = \left[ \frac{LS_{E11}}{LS_{N1}}, \frac{LS_{E12}}{LS_{N1}}, \frac{LS_{E13}}{LS_{N1}}, \ldots, \frac{LS_{Em1}}{LS_{Nm}}, \frac{LS_{Em2}}{LS_{Nm}}, \frac{LS_{Em3}}{LS_{Nm}} \right] \tag{1}
$$
where LS is the length of the sentence; E denotes a sentence uttered with the emotion
of anger, fear, joy or sadness; N denotes the corresponding neutral sentence; and m is
the number of sentences. Each sentence was uttered three times (here m = 5, so there
is a total of 15 elements in one LRS vector for one emotion).
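Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `length_ratio_vector`, the array layout, and the durations below are hypothetical and not from the study, and a single neutral length per sentence is assumed, as Eq. (1) suggests.

```python
import numpy as np

def length_ratio_vector(emotional, neutral):
    """Return the LRS vector of Eq. (1).

    emotional: shape (m, 3), length in seconds of each of the three
               emotional repetitions of sentences 1..m.
    neutral:   shape (m,), length in seconds of the neutral rendition
               of the same sentence (assumed: one value per sentence).
    """
    emotional = np.asarray(emotional, dtype=float)
    neutral = np.asarray(neutral, dtype=float)
    # Divide every repetition by the neutral length of its own sentence,
    # then flatten sentence by sentence to match the ordering in Eq. (1).
    return (emotional / neutral[:, None]).ravel()

# Hypothetical durations (seconds), for illustration only.
emotional = [[0.9, 1.0, 1.1], [1.5, 1.6, 1.4]]
neutral = [1.0, 2.0]
lrs = length_ratio_vector(emotional, neutral)
```

With m = 5 sentences the same call yields the 15-element vector described above.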
Similarly, we define the LRW (vector of the ratio of emotional word length (sec)
to the length of the same word in neutral state):
$$
\mathrm{LRW} = \left[ \frac{LW_{E1}}{LW_{N1}}, \frac{LW_{E2}}{LW_{N2}}, \ldots, \frac{LW_{Ei}}{LW_{Ni}} \right] \tag{2}
$$
where i is the total number of words in the 15 emotional sentences. The tested
statistics of LRS and LRW for each emotion are shown in Table 2:
Table 2. Statistics of LRS and LRW for each emotion
Statistic   Anger   Fear    Joy     Sadness
LRS Mean    0.74    0.78    0.80    1.17
LRS STD     0.11    0.12    0.11    0.16
LRS Max     0.95    0.98    0.99    1.51
LRS Min     0.57    0.58    0.62    0.92
LRW Mean    0.75    0.81    0.85    1.21
LRW STD     0.19    0.19    0.18    0.27
LRW Max     1.44    1.38    1.31    1.87
LRW Min     0.39    0.43    0.43    0.54
We can see that, for the same context, the order of the sentence lengths appears to be
{anger < fear < joy < sadness}, and the mean value of LRW follows the same order.
The order of the maximum of LRW, however, is {joy < fear < anger < sadness}, and the
maximum values of LRW are all larger than 1. This means that some words in anger,
fear or joy may be longer than the corresponding words in neutral speech, even if the
other words in the same sentence are all much shorter. This is discussed further in
Section 4, where the changes of the stress features under emotion arousal are
considered.
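The summary statistics reported in Table 2, and the check for words that are lengthened relative to neutral speech, can be sketched as follows. The helper names and the example ratios are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the paper does not state which estimator was used.

```python
import numpy as np

def ratio_statistics(ratios):
    """Mean, STD, max and min of a ratio vector such as LRS or LRW."""
    r = np.asarray(ratios, dtype=float)
    return {
        "mean": float(r.mean()),
        "std": float(r.std(ddof=1)),  # assumption: sample standard deviation
        "max": float(r.max()),
        "min": float(r.min()),
    }

def lengthened_indices(ratios):
    """Indices of words whose ratio exceeds 1, i.e. longer than in neutral speech."""
    r = np.asarray(ratios, dtype=float)
    return np.flatnonzero(r > 1.0).tolist()

# Hypothetical LRW values for one short sentence, for illustration only.
lrw = [0.6, 0.7, 1.4, 0.5]
stats = ratio_statistics(lrw)
longer = lengthened_indices(lrw)
```

In this made-up example the mean ratio is below 1 even though one word is lengthened, which is the pattern the maximum of LRW reveals for anger, fear and joy.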
3.2 Short-time Amplitude
Short time amplitude is used to represent the intensity contour, which is one of the
most crucial aspects of emotion recognition. Here, we define the ARS (vector of the
ratio of emotional sentence amplitude to the corresponding neutral sentence amplitude)
and the ARW (vector of the ratio of emotional word amplitude to the amplitude of the
same word in neutral state).
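The paper does not specify how the short-time amplitude is computed. A minimal sketch, assuming it is the mean absolute sample value over 25 ms frames with a 10 ms hop at the 16 kHz sampling rate of the recordings, is given below; the frame sizes, function names and test signal are all assumptions made for illustration.

```python
import numpy as np

def short_time_amplitude(signal, frame_len=400, hop=160):
    """Mean absolute amplitude of each frame.

    Assumed framing: 400 samples (25 ms) with a 160-sample (10 ms) hop at 16 kHz.
    """
    x = np.asarray(signal, dtype=float)
    if len(x) < frame_len:
        # Segment shorter than one frame: treat it as a single frame.
        return np.array([np.abs(x).mean()])
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.abs(x[s:s + frame_len]).mean() for s in starts])

def amplitude_ratio(emotional, neutral):
    """One element of ARS or ARW: emotional over neutral average amplitude."""
    return short_time_amplitude(emotional).mean() / short_time_amplitude(neutral).mean()

# Synthetic one-second signals, for illustration only.
t = np.arange(16000) / 16000.0
neutral = np.sin(2 * np.pi * 200 * t)
emotional = 2.0 * neutral  # same signal at twice the amplitude
ratio = amplitude_ratio(emotional, neutral)
```

Applying `amplitude_ratio` to whole sentences gives the elements of ARS, and applying it to word segments gives the elements of ARW.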
The statistics of ARS and ARW of the test data for each emotion are shown in
Table 3:
Table 3. Statistics of ARS and ARW for each emotion
Statistic   Anger   Fear    Joy     Sadness
ARS Mean    2.30    1.31    1.95    0.92
ARS STD     0.38    0.21    0.21    0.25
ARS Max     2.88    1.84    2.45    1.50
ARS Min     1.47    0.92    1.73    0.60
ARW Mean    2.45    1.35    1.94    0.94
ARW STD     1.12    0.58    0.72    0.45
ARW Max     6.85    3.08    3.65    2.54
ARW Min     0.53    0.25    0.54    0.26
From Table 3, we can see that at the sentence level the amplitudes for anger and joy
are all higher than those of the corresponding neutral sentences (the minimum ARS
values are 1.47 and 1.73), while at the word level some word amplitudes turn out to be
lower than those of the corresponding neutral words (the minimum ARW values are
0.53 and 0.54).
Table 4 shows the word amplitude statistics (averaged) in each sentence for each
emotion. The maximum, mean, minimum, range and STD of the word amplitudes in
each sentence were calculated. We can see that the “Mean” values for anger and joy
are close, while those of the other three states (neutral, fear and sadness) are close to
each other. However, the “Range” and “Std” values for anger are much higher than
the others, which helps to distinguish anger from joy (their difference is not that
obvious at the sentence level, as shown in Table 3).
Table 4. Statistics of word amplitude in each sentence for each emotion
Statistic   Neutral   Anger   Fear    Joy     Sadness
Max         9.51      23.93   12.48   17.24   8.61
Mean        5.43      12.78   6.87    10.16   4.83
Min         2.51      4.72    3.23    4.62    2.15
Range       7.00      19.22   9.24    12.62   6.46
Std         2.59      7.33    3.62    4.93    2.45
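The per-sentence statistics of Table 4 can be sketched as follows: each statistic is computed over the words of one sentence and then averaged across sentences. The function name and the example values are hypothetical, and the sample standard deviation is an assumption.

```python
import numpy as np

def sentence_word_statistics(sentences):
    """Average per-sentence statistics of a word-level feature.

    sentences: list of lists, one inner list per sentence holding one
               value per word (for example word amplitude or word F0).
    """
    rows = []
    for words in sentences:
        w = np.asarray(words, dtype=float)
        rows.append([w.max(), w.mean(), w.min(), w.max() - w.min(), w.std(ddof=1)])
    averaged = np.mean(rows, axis=0)  # average each statistic over sentences
    return dict(zip(["max", "mean", "min", "range", "std"], averaged.tolist()))

# Hypothetical word amplitudes for two sentences, for illustration only.
word_stats = sentence_word_statistics([[2, 4, 6], [1, 3, 5, 7]])
```

Because averaging is linear, the averaged range equals the averaged maximum minus the averaged minimum. Table 4 satisfies this up to rounding (for example 9.51 - 2.51 = 7.00 for neutral), which is consistent with this reading of how the table was produced.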
3.3 Pitch (F0)
The pitch (F0) contour is another important factor in emotion recognition. Thus, we
define the FRS (vector of the F0 ratio of emotional sentence to the corresponding
neutral sentence) and the FRW (vector of F0 ratio of emotional word to the same
word in neutral state).
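The paper does not state which pitch tracker was used. As a rough illustration of how an F0 value could be obtained before forming the ratios, the sketch below uses a basic autocorrelation estimator on a single voiced frame. The method, the search range of 75 to 600 Hz (the upper bound chosen because F0 values near 600 Hz are reported for anger), and the synthetic test tones are all assumptions, not the authors' procedure.

```python
import numpy as np

def estimate_f0(frame, sample_rate=16000, f0_min=75.0, f0_max=600.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak.

    Unvoiced or silent frames are not handled in this sketch.
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(ac) - 1)
    # The strongest peak inside the allowed lag range gives the pitch period.
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / lag

# Synthetic 64 ms tones, for illustration only.
t = np.arange(1024) / 16000.0
f0_neutral = estimate_f0(np.sin(2 * np.pi * 200 * t))
f0_emotional = estimate_f0(np.sin(2 * np.pi * 300 * t))
f0_ratio = f0_emotional / f0_neutral  # one element of FRS or FRW
```

A production pitch tracker would add voicing detection and octave-error correction; the sketch only shows where the ratio elements come from.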
The statistics of FRS and FRW for each emotion are presented in Table 5:
Table 5. Statistics of FRS and FRW for each emotion
Statistic   Anger   Fear    Joy     Sadness
FRS Mean    1.45    1.39    1.39    0.91
FRS STD     0.11    0.09    0.06    0.08
FRS Max     1.57    1.50    1.52    1.04
FRS Min     1.22    1.21    1.30    0.82
FRW Mean    1.43    1.39    1.35    0.90
FRW STD     0.32    0.15    0.13    0.11
FRW Max     2.15    1.76    1.67    1.10
FRW Min     0.69    1.00    0.99    0.56
We can see that, for the same context, the order of the sentence F0 is {sadness <
fear, joy and anger}, and the same holds for the mean value of FRW. The maximum
value of FRW for anger is 2.15. However, the minimum value of FRW for anger is 0.69,
which is lower than the minima for fear (1.00) and joy (0.99). This means that the F0 of
some words in anger may be lower than that of the corresponding neutral words, even
if the F0 of the other words in the same sentence is much higher. For fear and joy, the
F0 of every word is higher than, or approximately equal to, that of the corresponding
neutral word.
Table 6. Statistics of word F0 (Hz) in each sentence for each emotion
Emotion   Max      Mean     Min      Range    STD
Neutral   293.11   244.44   199.87   93.24    33.36
Anger     474.18   350.13   196.61   277.57   99.16
Fear      396.43   336.90   274.71   121.72   45.35
Joy       415.94   333.12   247.24   168.70   61.24
Sadness   251.58   221.84   175.63   75.95    24.39
Table 6 shows the word F0 statistics (averaged) in each sentence for each emotion.
The maximum, mean, minimum, range and STD of the word F0 in each sentence were
likewise calculated.
We can see that the “Mean” values for anger, fear and joy are close, while those for
neutral and sadness are close to each other. The “Range” value, however, is a
distinguishing feature that roughly divides the emotions into three groups: neutral and
sadness; fear and joy; and anger.
4. Feature Changes of “Word Stress” with Emotion
The results of the analysis in Section 3 indicate that word-level analysis provides
information for emotion recognition beyond that of sentence-level analysis. Our
experimental results show that “word stress”, through the acoustic features mentioned
above, plays a special role in emotion recognition. Details of the word-level analysis
of “word stress” are presented below.
As in many other languages, F0, duration, and amplitude are acoustic correlates
closely related to stress in Mandarin [9]. Stress affects both F0 and duration: higher
F0 values become even higher, while lower ones become even lower. Durations are
longer for stressed tones, whereas tones that are sufficiently short in duration tend to
be perceived as completely unstressed and toneless [10]. Mandarin Chinese has the
characteristic of “stress on the left”: a multi-word phrase can be divided from left to
right into several bi-time phrases, with the word stress set on the odd time of every
phrase, while stress beyond the word level is decided by the syntactic structure [11].
In general, the acoustic characteristics of word stress are:
1. Wider F0 range;
2. Longer duration.
However, the studies above are all based on speech with neutral emotion. Word stress
may change under the emotions of anger, fear, joy or sadness. To identify the stressed
words in each sentence for the different emotions, all of the sound clips were played
one by one, and a word was marked as stressed if it was chosen by all listeners (not
the speakers). For the samples with neutral emotion, stressed words were identified
based on the acoustic characteristics of stressed words listed above. The test results
are as follows:
Table 7. Ratio of stressed word number to total word number (SN/TN), and ratio of correctly
stressed word number (CN)
Measure     Neutral   Anger   Fear    Joy     Sadness
SN/TN (%)   48.57     51.43   25.71   34.29   14.29
CN (%)      100       72.22   77.77   83.33   100
From the “SN/TN” values in Table 7, we can see that the largest share of stressed
words (51.43%) is marked under anger, and very few words (14.29%) are marked as
stressed under sadness. The “CN” value is the proportion of correctly stressed words,
that is, words marked as stressed under both the emotional state and the neutral
state. This shows that most stressed words remain stressed even when emotion is
present in the speech, especially for sadness.
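The two measures in Table 7 can be written down as a short sketch. The function name and the word indices are hypothetical. The total of 35 words is an inference rather than a figure stated in the paper: it is the syllable count of the five target sentences in Table 1 (5 + 4 + 8 + 9 + 9), and the reported percentages are consistent with it (for example 18/35 = 51.43% and 13/18 = 72.22% for anger).

```python
def stress_ratios(stressed_emotional, stressed_neutral, total_words):
    """Return (SN/TN, CN) in percent.

    SN/TN: share of all words marked as stressed in the emotional speech.
    CN:    share of those stressed words that are also stressed in neutral speech.
    """
    stressed_emotional = set(stressed_emotional)
    stressed_neutral = set(stressed_neutral)
    sn_tn = 100.0 * len(stressed_emotional) / total_words
    shared = stressed_emotional & stressed_neutral
    cn = 100.0 * len(shared) / len(stressed_emotional)
    return sn_tn, cn

# Hypothetical word indices chosen only to reproduce the counts 17, 18 and 13.
neutral_stressed = set(range(17))                      # 17 of 35 words
anger_stressed = set(range(13)) | set(range(30, 35))   # 18 words, 13 shared
sn_tn, cn = stress_ratios(anger_stressed, neutral_stressed, total_words=35)
```

Rounded to two decimals, these hypothetical sets give 51.43 and 72.22, matching the anger column of Table 7.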
Tables 8 to 11 compare the word length, amplitude, F0 and F0 range of stressed words
with those of unstressed words. The “Ratio” is the ratio of the word length, amplitude,
F0 or F0 range of an emotional word to that of the word in the corresponding neutral
sentence. We can see from Tables 8 to 11 that:
Table 8. Word length (sec) of stressed word compared with unstressed word
Emotion   Stressed Average   Stressed Ratio   Unstressed Average   Unstressed Ratio
Neutral   0.21               -                0.18                 -
Anger     0.15               0.76             0.14                 0.75
Fear      0.17               0.86             0.15                 0.80
Joy       0.17               0.86             0.16                 0.85
Sadness   0.26               1.28             0.22                 1.19
Table 9. Amplitude of stressed word compared with unstressed word
Emotion   Stressed Average   Stressed Ratio   Unstressed Average   Unstressed Ratio
Neutral   5.73               -                5.02                 -
Anger     16.03              2.79             9.66                 2.18
Fear      10.18              1.60             5.70                 1.26
Joy       12.95              2.12             8.46                 1.85
Sadness   6.61               1.06             4.40                 0.92
Table 10. F0 (Hz) of stressed word compared with unstressed word
Emotion   Stressed Average   Stressed Ratio   Unstressed Average   Unstressed Ratio
Neutral   253.90             -                243.44               -
Anger     407.98             1.55             327.47               1.35
Fear      383.63             1.40             327.63               1.38
Joy       375.29             1.39             318.19               1.33
Sadness   244.27             0.87             216.90               0.90
Table 11. F0 range (Hz) of stressed word compared with unstressed word
Emotion   Stressed Average   Stressed Ratio   Unstressed Average   Unstressed Ratio
Neutral   49.22              -                32.48                -
Anger     100.53             2.94             77.87                2.44
Fear      54.20              1.45             48.17                1.44
Joy       62.05              1.75             57.57                1.72
Sadness   46.89              1.05             30.29                0.92
- Word length: (1) The stressed word is not significantly longer than the unstressed
word, except for neutral and sadness. (2) To differentiate “neutral and sadness”
from “anger, fear and joy”, stressed words are more helpful than unstressed words,
because the difference in average word length between emotions is more obvious
for stressed words than for unstressed words.
- Amplitude: (1) For anger, fear, and joy, the amplitude of stressed words is much
higher than that of unstressed words. (2) Anger and joy are much easier to separate
from sadness and neutral by the amplitude of stressed words. (3) Specifically, the
amplitude of unstressed words for fear is quite close to that for sadness and neutral,
while the amplitude of stressed words for fear can be more easily differentiated
from that for sadness and neutral.
- F0: (1) For anger, fear, and joy, the F0 values are much higher in stressed words.
(2) For anger, the highest F0 value of a stressed word is approximately 600 Hz;
inflexion can be heard in the utterances in such cases. (3) Using stressed words
makes it easier to differentiate anger, fear and joy from sadness and neutral.
- F0 range: A wider F0 range is one aspect that can identify stressed words in neutral
speech. (1) For anger, however, although the F0 range is wider in stressed words,
the STD of the F0 range is very high (62.19), which means that the F0 range of
stressed words for anger is not always wide. (2) For fear and joy, the F0 range is
not much wider in stressed words. (3) For sadness, the F0 range is quite similar to
that of the neutral emotion.
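A note on reading the “Ratio” columns of Tables 8 to 11: they do not equal the quotient of the two “Average” entries (for anger in Table 8, 0.15 / 0.21 is about 0.71, while the reported ratio is 0.76). One reading, offered here as an assumption, is that the ratio is computed word by word against the corresponding neutral word and then averaged. The sketch below contrasts the two computations with made-up word lengths; both function names are hypothetical.

```python
import numpy as np

def mean_of_ratios(emotional, neutral):
    """Average of per-word ratios (assumed reading of the "Ratio" column)."""
    e = np.asarray(emotional, dtype=float)
    n = np.asarray(neutral, dtype=float)
    return float(np.mean(e / n))

def ratio_of_means(emotional, neutral):
    """Quotient of the two averages (what dividing the "Average" entries gives)."""
    return float(np.mean(emotional) / np.mean(neutral))

# Hypothetical word lengths in seconds, for illustration only.
emotional_len = [0.10, 0.30]
neutral_len = [0.10, 0.40]
per_word = mean_of_ratios(emotional_len, neutral_len)
pooled = ratio_of_means(emotional_len, neutral_len)
```

For these values the two results differ (0.875 against 0.8), so a gap between the “Ratio” and the quotient of the “Average” entries is expected under this reading.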
From the analysis above, we summarize the characteristics of word stress for each
emotion:
Anger: Stressed words have higher amplitude and F0 values, and word stress is
obvious. The stressed word with the highest energy may also have the largest mean
F0, with its F0 maintained in the high-frequency range, while duration is not found to
be successful at determining stress.
Fear: Word stress is not obvious. A higher F0 value is the primary acoustic correlate
for deciding word stress by perception, followed by higher amplitude.
Joy: As with fear, word stress is not obvious. A higher F0 value is the primary
acoustic correlate for deciding word stress by perception, followed by higher
amplitude.
Sadness: Word stress is not obvious. A wider F0 range and longer word length are the
acoustic correlates for determining word stress by perception.
Furthermore, the determination of word stress is affected by syntax.
Mandarin has a contour tone system. The distinguishing features of the tones are their
shifts in pitch (their pitch shapes or contours, such as rising, falling, dipping, or
peaking). Word stress affects how accurately and completely the tonal contour is
realized. This is especially true for anger: if the “stressed word” carries tone3
(low-dipping F0), the word is often long and the F0 range is fully expanded. If an
“unstressed word” carries tone3, however, the tone contour is not fully expanded, and
the word sounds like just the first part of the whole tone3 contour. If the tone is tone1
(high-level F0), tone2 (high-rising F0) or tone4 (high-falling F0), by contrast, the word
is often short and F0 remains in the high-frequency range.
As shown above, we conducted a dedicated study of word stress in Mandarin speech
signals with emotion arousal and summarized the characteristics of word stress. The
experimental results indicate that stressed words are more helpful than unstressed
words for differentiating among emotions. The analysis of word stress for each
emotion is discussed in the following section.
5. Summary and discussion
In the present study, the vocal expressions of anger, fear, joy and sadness show
various specific characteristics concerning the pitch (or F0) contour, the intensity
contour, and the timing of utterances.
The utterances expressing anger showed the highest F0, highest F0 variance,
shortest sentence length, and highest short-time amplitude at the sentence level.
These outcomes match well with other studies [1, 2, 6]. However, the analysis of the
acoustic features at the word level shows that, although almost all words in anger are
shorter than the corresponding neutral words, some words, especially “stressed
words” with tone3, are still longer. The short-time amplitudes of most words are much
higher than those of the corresponding neutral words, but some words have lower
amplitude. This alternation of long and short lengths and of high and low amplitudes at
the word level composes the rhythm of anger. Likewise, the mean F0 values of words
in anger are not always higher than those of the corresponding neutral words. From
the per-sentence word amplitude and F0 statistics for each emotion (Tables 4 and 6),
we can see that the variance of the word amplitude and F0 within each sentence is
much higher for anger than for the other emotions.
Word stress is most obvious in anger, and it is easy to hear because of the high
energy associated with it. Moreover, the larger mean F0 and the word F0 being
maintained in the high-frequency range give the stressed word some kind of inflexion.
Duration is not found to be successful at determining whether stress is present. A
possible explanation is that very high energy and F0 attract listeners’ attention more
than long duration does. A larger number of stressed words can also be a component
of anger.
For joy, the short-time amplitude and F0 are a little lower than those for anger at the
sentence level. However, the variances of the word amplitude and F0 within each
sentence are much lower for joy than for anger, which helps to distinguish joy from
anger. Word stress is not obvious for joy. A possible explanation is that a “joy effect”
reduces the effect of short-time amplitude and F0.
For fear, the short-time amplitude is lower than that for anger and joy at the sentence
level. The variances of the word amplitude and F0 within each sentence are lower for
fear than for joy, although the difference is not pronounced. The amplitude of
unstressed words is much lower for fear than for joy, so fear can be more easily
differentiated from joy through this feature. Word stress is likewise not obvious for
fear. Similarly to joy, some kind of “warble tone” reduces the effect of short-time
amplitude and F0.
For sadness, the duration is the longest, and the short-time amplitude and F0 are the
lowest at the sentence level. At the word level, however, words are not always longer
than the corresponding neutral words: stressed words turn out to be longer, while
unstressed words may be shorter. Amplitude and F0 are clear acoustic features that
set sadness apart from anger, fear and joy: the word amplitude and F0 for sadness are
all considerably lower, and their variances are also considerably lower.
This research attempts to establish a framework for constructing models that describe
each of the emotions studied, for emotion recognition from speech signals. Since the
results of this study are based on a small sample size, caution should be taken when
drawing general conclusions. The described analysis will therefore be expanded using
more speech data. Besides more detailed acoustic and segmental analysis in the time
domain, future work may consider the frequency domain in order to characterize the
“joy effect” for joy and the “warble tone” for fear.
References
1. Oudeyer Pierre-Yves, “The production and recognition of emotions in speech:
features and algorithms”, Int. J. Human-Computer Studies 59 (2003) 157-183
2. Tsang-Long Pao, et al., “Detecting Emotions in Mandarin Speech”, Computational
Linguistics and Chinese Language Processing, Vol. 10, No. 3, Sep. 2005, pp. 347-362
3. Ilkka Linnankoski, et al., “Conveyance of emotional connotations by a single word
in English”, Speech Communication 45 (2005) 27-39
4. Paeschke, A., Sendlmeier, W.F., “Prosodic characteristics of emotional speech:
measurements of fundamental frequency movements”, In: Proceedings of the ISCA
ITRW on Speech and Emotion, Newcastle, 5-7 September 2000, Belfast, Textflow,
pp. 75-80
5. Fangxin Chen, Aijun Li, Haibo Wang, Tianqing Wang and Qiang Fang, “Acoustic
Analysis of Friendliness”, in Proceedings of ICASSP 2004.
6. Yuan J., Shen L., Chen F., “The acoustic realization of anger, fear, joy and sadness
in Chinese”, Speech Prosody 2002, France
7. Valery A. Petrushin, “Emotion Recognition in Speech Signal: Experimental Study,
Development, and Application”, In Proceedings of ICSLP-2000, vol. 2, pp. 222-225
8. N. Campbell, “Databases of Emotional Speech”, In: Proc. of ISCA Workshop on
Speech and Emotion, 2000.
9. Kratochvil, P., “Syllabic volume as acoustic correlate of perceptual prominence in
Peking dialect”, Unicorn 5, 1-28
10. Jongman, A., Wang, Y., Moore, C. and Sereno, J. (in press). “Perception and
production of Mandarin tone”, In Bates, E., Tan, L. and Tzeng, O. (Eds.),
Handbook of Chinese Psycholinguistics, Cambridge University Press
11. Duanmu S., “Theorem of word stress and the selection of word length in Mandarin
Chinese”, Chinese in China, Vol. 4, pp. 246-254