2005 IEEE International Workshop on Robots and Human Interactive Communication

Improvement of Emotion Recognition by Bayesian Classifier Using Non-zero-pitch Concept *

Kyung Hak Hyun
Department of Mechanical Engineering
Korea Advanced Institute of Science and Technology
Guseong Dong, Yuseong Gu, Daejeon, Republic of Korea
[email protected]

Eun Ho Kim
Department of Mechanical Engineering
Korea Advanced Institute of Science and Technology
Guseong Dong, Yuseong Gu, Daejeon, Republic of Korea
[email protected]

Yoon Keun Kwak
Department of Mechanical Engineering
Korea Advanced Institute of Science and Technology
Guseong Dong, Yuseong Gu, Daejeon, Republic of Korea
[email protected]
Abstract – Emotion recognition is an important factor in the development of human-robot interaction (HRI), and pitch information in particular is considered an important feature for emotion recognition from speech. In this paper, we propose an improved method of recognizing emotion from pitch based on the "non-zero-pitch" concept, that is, a pitch contour from which the zero values have been removed. We apply this concept to a Bayesian classifier and obtain better emotion recognition results than those attained with the previous pitch contour. In this study, we explain the concept of "non-zero-pitch" precisely and show its superiority over the previous pitch concept.
Moreover, it is also important to determine which emotions should be classified. Many researchers in emotion recognition use a set of primary emotions such as anger, joy, and so on. However, they differ on the number and kind of primary emotions to use and generally do not explain the rationale for their choice. Psychologists have also debated the topic of primary emotions. In the present study we therefore propose a classification of primary emotions suited to the field of HRI.
Index Terms – Primary emotion, Voice, HRI, Pitch,
Bayesian Classifier
I. INTRODUCTION
In the past, robots were primarily used in dangerous
work domains or in the field of factory automation.
However, more recently, robots have been developed for
more diverse purposes; examples include ASIMO [1], a biped robot that can simulate human walking, Paro, which is used for psychotherapy, and Kismet [2], a head robot that represents emotions through facial expressions. In particular, many researchers are interested in developing human-friendly robots that can interact with humans, recognize human emotions, and express emotions of their own. In this regard, there have been many studies on HRI and on emotion.
The main feature of these human-friendly robots is their
capacity to interact with humans. In the past, robots only
communicated unilaterally with humans. For example, if a
human issued an order, the robot followed the order. In the
future, however, robots will be able to consider not only
commands but also other information such as the emotional
and health status of the human commander. In following an
order, the robot will be able to modify its actions using
feedback from the human’s reaction. That is, human-robot
interaction will be possible.
In this paper, we address the recognition of emotion, which is a fundamental technology in the design of human-friendly robots. Human emotion is subjective.
Furthermore, recognizing the emotion of an unknown or
unfamiliar person in a real situation is not easy. For this
reason, we have restricted our discussion to a speaker-dependent system. Moreover, we consider a text-independent system, in which emotion is recognized independently of the spoken content.
Finally, the target of our study is a system that recognizes emotion from speech. We rely on speech as the interface between the human and the robot because speech is the fundamental mode of human communication.
II. PRIMARY EMOTION FOR HRI
A. Previous Works
Primary emotions refer to the basic elements that
organize the set of emotions that describe complex human
emotions. In other words, an arbitrary emotion is composed of a combination of primary emotions. However, there is little agreement among scholars and psychologists on this definition, and there is even controversy over whether primary emotions exist at all.
For engineers, a clear and discrete classification of emotions would be convenient. Accordingly, many engineers have based their research on the assumption that primary emotions exist.
TABLE I
PREVIOUS WORK ON PRIMARY EMOTIONS
Study Group        Primary Emotions
Mozziaconacci S.   Joy, Anger, Sadness, Neutrality, Fear, Boredom, Indignation
Klasmeyer G.       Happiness, Anger, Sadness, Neutrality, Fear, Boredom, Disgust
Sheren KR.         Joy, Anger, Sadness, Fear, Disgust
McGilloway S.      Happiness, Anger, Sadness, Fear
Kang, Bong-Seok    Happiness, Anger, Sadness, Neutrality
* This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Science and Technology of Korea.
Fig. 1 A circumplex model of affect (quadrant labels: Joy, Anger, Sadness, Neutrality).
However, the nature and number of primary emotions remain in doubt; as with psychologists, engineers have posed several arguments about which emotions are primary and how many should be considered.
In 1962, Tomkins proposed the existence of eight
primary emotions: fear, anger, anguish, joy, disgust,
surprise, interest, and shame [3]. In 1980, Plutchik also said
there were eight emotions: fear, anger, sorrow, joy, disgust,
acceptance, anticipation, and surprise [4]. In 1988, Ortony
et al. proposed six emotions: fear, anger, sadness, joy,
disgust, and surprise [5]. Table I shows other classification
systems of primary emotions.
Nicholson noted that those who research the recognition
of emotion use both a varying number of categories and
various kinds of categories [6]. These works are of limited use for the present study, however, because the rationale of their classifications rests on human psychology rather than on considerations of human-robot interaction.
Because we deal with the design of human-friendly
robots, however, we have to consider not just human
emotions per se, but the interaction between humans and
robots. To design a robot’s actions for human-robot
interaction, we therefore need to examine the type of
emotions that are needed and establish a new approach to
the classification of primary emotions.
B. Classification of primary emotions
As mentioned in section A, there are some human
emotions that can give feedback to robots and enable robots
to modify their own actions for human-robot interaction.
Conversely, there are emotions that do not play an important
role in the design of a robot’s interaction with humans.
These unimportant emotions are not useful to designers and,
as a result, do not need to be classified. In fact, classifying
such emotions may hinder the correct recognition of
emotions, causing the robot to act improperly.
However, it is difficult to define and verify the
necessary emotions. Thus, in this study, we start by defining
a minimum number of emotions. In 1982, Kemper, a
psychologist, contended that the number of primary
emotions, which were proposed by other psychologists,
ranges from three to eleven [7]. In 1954, Schlosberg proposed an emotion structure consisting of a pleasantness axis and an attention axis, and Russell later described a two-dimensional structure for affect words [8] (Fig. 1). We therefore take the minimal dimension of the mental space to be two, with pleasantness and activation as its axes, and define four primary emotions, each the representative state of one quadrant.
We define the four emotions as follows:
1) State of Joy: The state of joy is a representative state
of the first quadrant of Fig. 1, and includes emotional states
such as happy, delighted, excited, astonished, aroused, and
so on. In the HRI field, if robots recognize the human’s
emotion to be in this state, the robot could interpret this to
mean that the master or human commander is satisfied or
content with the robot’s actions. Thus, if the master requests
the performance of a similar job in the future, the robot will
tend to follow a similar pattern of action.
2) State of Anger: The state of anger is a representative
state of the second quadrant of Fig. 1 and includes
emotional states such as tense, alarmed, angry, afraid,
annoyed, distressed, frustrated, and so on. In the HRI field,
anger is a state that generally occurs when a human is
dissatisfied with a robot’s actions. In addition, if external
elements make the human angry, the robot is obliged to
ameliorate the human’s emotional state and to modify its
own current actions.
3) State of Sadness: The state of sadness is a
representative state of the third quadrant of Fig. 1 and
includes emotional states such as sad, miserable, gloomy,
depressed, bored, droopy, tired, and so on. Sadness is a
state that indicates that the robot needs to modify its actions,
similar to anger. However, if the human is in this state, the
robot must cease its action and perform another kind of
action instead of simply modifying its reaction.
4) State of Neutrality: Finally, we define the state of
neutrality as a primary emotion. Neutrality is a
representative state of the fourth quadrant of Fig. 1 and
includes emotional states such as sleepy, calm, relaxed,
satisfied, content, and so on. When confronted with a
neutral state, the robot understands that it need not modify
its own actions or design a new pattern of reaction because
the human’s emotional state is stable. That is, the robot’s
actions are correct. Moreover, because the human’s emotion
intensity is generally not important, we could regard this
state as neutral.
Although further verification is required to confirm that joy, anger, sadness, and neutrality appropriately represent the emotions of each quadrant, we adopt these four as the primary emotions in this study.
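The quadrant structure of Fig. 1 can be summarized in a short sketch. The following Python fragment only illustrates the classification just described; the axis scaling, the handling of points that fall exactly on an axis, and the function name are assumptions added here, not part of the proposed method.

# Illustrative mapping from a point in the two-dimensional mental space
# (x = pleasantness, y = activation) to the four proposed primary emotions,
# one per quadrant of Fig. 1. Boundary handling (>= 0) is an assumption.
def primary_emotion(pleasantness: float, activation: float) -> str:
    if pleasantness >= 0 and activation >= 0:
        return "Joy"          # 1st quadrant: happy, delighted, excited, ...
    if pleasantness < 0 and activation >= 0:
        return "Anger"        # 2nd quadrant: angry, afraid, annoyed, ...
    if pleasantness < 0 and activation < 0:
        return "Sadness"      # 3rd quadrant: sad, bored, tired, ...
    return "Neutrality"       # 4th quadrant: calm, relaxed, content, ...

if __name__ == "__main__":
    print(primary_emotion(0.7, 0.8))    # Joy
    print(primary_emotion(-0.5, -0.3))  # Sadness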
III. EXTRACTION OF FEATURES
A. Prosodic Features
In the field of emotion recognition, prosodic features are considered important indicators of emotion. Some of these features are as follows:
1) Pitch: Pitch, which is sometimes called the fundamental frequency, corresponds to the period of the wave pulses generated as air from the lungs is forced through the glottis. It is a factor to which the auditory sense is very sensitive.
Fig. 2 Pitch contour (pitch in Hz versus frame index).

Fig. 4 Non-zero-pitch contour (pitch in Hz versus frame index).
2) Energy: The energy of a voice can be measured physically as sound pressure or perceived subjectively as loudness. This factor is generally effective for distinguishing joy and anger, but not neutrality or sadness (a short computational sketch follows the next item).
3) Tempo: A voice recognition algorithm can be used to measure the tempo of an utterance. With this type of algorithm, tempo is expressed as the number of phonemes uttered per unit of time. However, this method requires a huge database of each person's utterances.
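As a hedged illustration of the energy feature mentioned in item 2 above, the sketch below computes a conventional short-time energy (the sum of squared samples per frame). The frame length, hop size, and the 16 kHz sampling rate used in the demonstration are assumptions for the sketch, not values taken from the paper.

import numpy as np

def short_time_energy(signal, frame_len=320, hop=160):
    # Sum of squared samples per analysis frame; 320/160 samples correspond
    # to 20 ms frames with a 10 ms hop at an assumed 16 kHz sampling rate.
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = np.asarray(signal[start:start + frame_len], dtype=np.float64)
        energies.append(float(np.sum(frame ** 2)))
    return np.array(energies)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = 0.5 * np.sin(2 * np.pi * 220 * t)   # 1 s synthetic tone
    print(short_time_energy(tone)[:5])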
B. Previous Works
Considerable research has already shown that human
emotion is related to the prosodic features of speech [9]. Lin
et al., in particular, show the relationship between pitch and
emotion; they used the pitch of human speech to attempt to
recognize the emotion of a speaker [10]. In addition, Kostov
and Fukuda used prosodic features such as pitch, energy,
and tempo to develop a text-independent system and
reported simulation results [11].
In particular, pitch is frequently used for emotion
recognition. Accordingly, many pitch extraction algorithms
have been developed. These algorithms can be grouped into
two classes, pitch extraction from the time domain and pitch
extraction from the frequency domain.
In the time-domain approach, an autocorrelation method is generally used: the autocorrelation of the speech signal is computed to find the pitch period. In the frequency-domain approach, the cepstrum method is often used; in practice, however, it is less common because it is computationally expensive owing to its dependence on the FFT (Fast Fourier Transform).
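As a rough illustration of the time-domain approach, the sketch below estimates the pitch of a single frame from the peak of its normalized autocorrelation. The 50-400 Hz search range, the voicing threshold, and the convention of returning zero for frames with weak correlation are assumptions made for this sketch; the extractor actually used later in this paper is the SIFT algorithm [12].

import numpy as np

def pitch_autocorr(frame, fs, fmin=50.0, fmax=400.0, voicing_threshold=0.3):
    # Estimate the pitch (Hz) of one frame by the autocorrelation method.
    # Returns 0.0 when the normalized autocorrelation peak is weak, mirroring
    # extractors that emit zero for unvoiced or low-correlation frames.
    frame = np.asarray(frame, dtype=np.float64)
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0.0:
        return 0.0
    ac = ac / ac[0]                               # normalize so ac[0] == 1
    lag_min = int(fs / fmax)                      # shortest candidate period
    lag_max = min(int(fs / fmin), len(ac) - 1)    # longest candidate period
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag if ac[lag] >= voicing_threshold else 0.0

if __name__ == "__main__":
    fs = 16000
    t = np.arange(int(0.02 * fs)) / fs            # one 20 ms frame
    frame = np.sin(2 * np.pi * 200 * t)           # synthetic 200 Hz tone
    print(pitch_autocorr(frame, fs))              # close to 200.0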
C. Proposed Method
1) Assumption for Bayesian Classification: In order to apply the Bayesian classifier, we need to know the probability density function (pdf) of pitch; however, this cannot be determined exactly. Hence, we have to assume a pdf of pitch for each class. In the general approach, a Gaussian distribution is simply assumed for an unknown distribution. However, if we use the previous concept of the pitch contour, which takes a zero value whenever the autocorrelation is low (Fig. 2), the Gaussian assumption is not appropriate. This is a consequence of the zero values in the pitch contour and is known as the "zero effect". The zero effect distorts the pdf; more precisely, it shifts the mean value to the left. Therefore, we have to eliminate the zero effect to obtain a more appropriate curve for the pdf.
2) Non-zero-pitch: Because the zero values of the pitch contour introduce errors under the Gaussian assumption, they must be eliminated. We therefore construct the non-zero-pitch contour, that is, the pitch contour with the zero values removed. Many researchers have studied pitch extraction algorithms based on autocorrelation, the cepstrum, and so on, and previous methods such as the SIFT algorithm [12] extract the pitch contour shown in Fig. 2. Its zero values cause the zero effect, resulting in a distorted pdf, as shown in Fig. 3. We thus need the non-zero-pitch contour (Fig. 4) to prevent the zero effect in the pdf. Hence, we omit the zero values from the original pitch contour obtained by the SIFT algorithm. As a result, we obtain a more appropriate pdf from the non-zero-pitch contour, as shown in Fig. 5.
Although the non-zero-pitch contour loses some content information, such as unvoiced sounds, the emotion recognition result is improved (see Section IV). We therefore conclude that, in contrast with speech recognition, the zero values of the pitch contour are not important in the field of emotion recognition.

Fig. 3 PDF from the previous pitch contour (probability versus pitch in Hz; one panel per emotion: N, J, S, A).
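To make the zero effect concrete, the following sketch removes the zero-valued (unvoiced) frames from a pitch contour and fits a Gaussian to the remaining samples by moment matching. The toy contour values are invented for illustration only; with the zeros included, the fitted mean is pulled toward zero, exactly the distortion described above.

import numpy as np

def non_zero_pitch(pitch_contour):
    # Drop the zero-valued (unvoiced) frames from a pitch contour.
    contour = np.asarray(pitch_contour, dtype=np.float64)
    return contour[contour > 0]

def gaussian_fit(values):
    # Moment-matching Gaussian fit: returns (mean, standard deviation).
    values = np.asarray(values, dtype=np.float64)
    return float(np.mean(values)), float(np.std(values))

if __name__ == "__main__":
    # Toy contour: voiced frames near 200 Hz interleaved with unvoiced zeros.
    contour = [0, 0, 190, 205, 0, 210, 198, 0, 0, 202, 195, 0]
    print(gaussian_fit(contour))                  # mean pulled toward zero
    print(gaussian_fit(non_zero_pitch(contour)))  # mean near 200 Hz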
3) Bayesian Classifier: From the assumption that the pdf of the pitch contour is Gaussian, we can use the Bayesian classifier for emotion recognition. We want to determine the probability of each emotion given the observed features. Thus, we only need to know the a priori probability of each emotion and the conditional probability of the features for a given emotion, which are combined through Bayes' theorem as follows:

P(E|x) = P(x|E) P(E) / P(x),

where E is the emotion and x is the feature.

Fig. 5 PDF from the non-zero-pitch contour (probability versus pitch in Hz; one panel per emotion: N, J, S, A).
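A minimal sketch of such a classifier is given below, assuming that each observation is a single pitch value and that the class-conditional density of each emotion is a Gaussian estimated from its non-zero-pitch training samples. The class names follow Section II; the pitch statistics in the usage example and the use of relative class frequencies as priors are illustrative assumptions, not values taken from the paper.

import math
import numpy as np

class GaussianBayesEmotionClassifier:
    # Single-feature Bayesian classifier with one Gaussian likelihood per emotion.

    def __init__(self):
        self.params = {}   # emotion -> (prior, mean, std)

    def fit(self, training_data):
        # training_data maps each emotion to an array of non-zero-pitch samples (Hz).
        total = sum(len(v) for v in training_data.values())
        for emotion, samples in training_data.items():
            samples = np.asarray(samples, dtype=np.float64)
            prior = len(samples) / total
            self.params[emotion] = (prior, float(np.mean(samples)),
                                    float(np.std(samples)) + 1e-6)
        return self

    def predict(self, x):
        # argmax_E P(E|x); P(x) is common to all emotions and cancels.
        def posterior_score(emotion):
            prior, mu, sigma = self.params[emotion]
            likelihood = (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                          / (sigma * math.sqrt(2.0 * math.pi)))
            return prior * likelihood
        return max(self.params, key=posterior_score)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = {                                   # illustrative pitch statistics only
        "Anger":      rng.normal(260, 40, 200),
        "Joy":        rng.normal(240, 45, 200),
        "Sadness":    rng.normal(170, 25, 200),
        "Neutrality": rng.normal(150, 20, 200),
    }
    clf = GaussianBayesEmotionClassifier().fit(train)
    print(clf.predict(265.0), clf.predict(155.0))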
TABLE II
CORRECT CLASSIFICATION RATES BY SUBJECTS (%)
(rows: presented emotion; columns: recognized emotion)
             Neutrality   Joy    Sadness   Anger
Neutrality      83.9       3.1      8.9      4.1
Joy             26.6      57.8      3.5     12.0
Sadness          6.4       0.6     92.2      0.8
Anger           15.1       5.4      1.0     78.5
Overall         78.2
IV. EXPERIMENTS
A. Database
Given that in many languages the fundamental
tendencies of sounds are expressed in similar ways, our
results in recognizing the emotions of Korean language
speakers can generally be applied to speakers of other
languages. For this reason, we used a database produced by
Professor C. Y. Lee of the Media and Communication Signal Processing Laboratory of Yonsei University in Korea, with
the support of the Korea Research Institute of Standards and
Science. The data cover the four emotions of neutrality, joy, sadness, and anger, and were constructed according to the following principles [13]:
- easy pronunciation in a neutral, joyful, sad and angry
state;
- 45 dialogic sentences that express natural emotion.
The original data are stored at 16 kHz and 32 bits with a signal-to-noise ratio above 30 dB, and are padded with about 50 ms of silence at the beginning and end of each utterance. To use the data in MATLAB, we converted the training and simulation data into a 16-bit format through quantization with a pulse-code-modulation filter.
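The conversion described above was done in MATLAB; for reference, a comparable requantization step outside MATLAB might look like the sketch below. The peak normalization and the file names are assumptions added for the sketch and are not part of the original preprocessing.

import numpy as np
from scipy.io import wavfile

def to_int16_pcm(in_path, out_path):
    # Requantize a wave file (e.g., 32-bit samples) to 16-bit PCM,
    # leaving the sampling rate unchanged.
    rate, data = wavfile.read(in_path)
    data = data.astype(np.float64)
    peak = np.max(np.abs(data)) if data.size else 0.0
    if peak > 0:
        data = data / peak                      # normalize to [-1, 1]
    wavfile.write(out_path, rate, (data * 32767.0).astype(np.int16))

# to_int16_pcm("utterance_32bit.wav", "utterance_16bit.wav")  # hypothetical names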
The database contains emotional voice samples from five male and five female speakers. In reference [13], an experiment was conducted to measure human performance in detecting the underlying emotional state of a speaker: 30 subjects listened to the utterances of one speaker played back in random order and were asked to choose one emotion out of four (joy, anger, sadness, or neutrality). Human performance was approximately 78% correct, as indicated in Table II; note that the baseline for random guessing is 25%.
TABLE III
CORRECT CLASSIFICATION RATES FOR FEMALE SUBJECTS USING PREVIOUS PITCH CONTOUR
Female (%)   Neutrality   Joy    Sadness   Anger
Neutrality      43.3       5.0     21.7     30.0
Joy             18.3      51.7      1.7     28.3
Sadness         13.3       0       86.7      0
Anger           10.0      23.3      0       66.7
Overall         62.08

TABLE IV
CORRECT CLASSIFICATION RATES FOR FEMALE SUBJECTS USING NON-ZERO-PITCH CONTOUR
Female (%)   Neutrality   Joy    Sadness   Anger
Neutrality      63.3       1.7     25.0     10.0
Joy              5.0      66.7      8.3     20.0
Sadness         26.7       0       73.3      0
Anger            8.3      15.0      8.3     68.3
Overall         67.92

TABLE V
CORRECT CLASSIFICATION RATES FOR MALE SUBJECTS USING PREVIOUS PITCH CONTOUR
Male (%)     Neutrality   Joy    Sadness   Anger
Neutrality      61.7      23.3     13.3      1.7
Joy             31.7      15.0     11.7     41.7
Sadness         13.3       5.0     81.7      0
Anger            6.7      10.0      0       83.3
Overall         60.42

TABLE VI
CORRECT CLASSIFICATION RATES FOR MALE SUBJECTS USING NON-ZERO-PITCH CONTOUR
Male (%)     Neutrality   Joy    Sadness   Anger
Neutrality      75.0      10.0     13.3      1.7
Joy             18.3      30.0     13.3     38.3
Sadness         28.3       5.0     66.7      0
Anger            5.0      10.0      0       85.0
Overall         64.17

(In each table, rows give the presented emotion and columns the recognized emotion.)
TABLE VII
THE AVERAGE RECOGNITION RATE
Female 1   Female 2   Female 3   Female 4   Female 5
 57.08%     67.92%     58.33%     60.42%     54.58%
Male 1     Male 2     Male 3     Male 4     Male 5
 64.17%     59.58%     58.33%     55.83%     66.67%
Overall: 71.10%
B. Result
We estimate the conditional probability function P(x|E) from the training data, and the recognition results are obtained on the test data. For each person, the training set consists of 25 sentences per emotion, and the test set consists of 20 further sentences per emotion that are excluded from the training set.
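The per-emotion rates reported in the following tables can be assembled from such predictions with a small evaluation helper. In the sketch below, each row of the confusion matrix is normalized to percentages, and the overall rate is taken as the average of the diagonal, which is how the "Overall" rows of Tables II through VI appear to be computed; the label ordering and function names are assumptions.

import numpy as np

EMOTIONS = ["Neutrality", "Joy", "Sadness", "Anger"]

def confusion_matrix_percent(true_labels, predicted_labels):
    # Row = presented emotion, column = recognized emotion, values in percent.
    counts = np.zeros((len(EMOTIONS), len(EMOTIONS)))
    for t, p in zip(true_labels, predicted_labels):
        counts[EMOTIONS.index(t), EMOTIONS.index(p)] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(row_sums, 1.0)

def overall_rate(matrix_percent):
    # Average of the diagonal (the correct-classification rates).
    return float(np.mean(np.diag(matrix_percent)))

# Example (with hypothetical label lists y_true and y_pred):
#   overall_rate(confusion_matrix_percent(y_true, y_pred))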
Table III shows the recognition results for females
using the previous pitch and Table IV shows the recognition
results for females using the non-zero-pitch contour.
Similarly, Table V and Table VI show the recognition
results for male subjects.
As shown in Tables III through VI, use of the non-zero-pitch contour results in better performance than that obtained with the previous pitch contour.
Finally, we have appended the average recognition rate
for each subject in Table VII.
From these results, we conclude that the non-zero-pitch concept improves the average recognition rate by about 5%; however, joy and sadness remain difficult to classify, as they were with the previous pitch concept.
The low rates for joy and sadness can be explained by the characteristics of pitch together with the mental model of Fig. 1. As Tables III through VI show, pitch information tends to cause joy to be misclassified as anger and sadness to be misclassified as neutrality. In Fig. 1, joy and anger are distinguished along the x-axis, as are sadness and neutrality. Classification along the y-axis is therefore accurate while classification along the x-axis is poor, which indicates that pitch information reflects the y-axis (activation) more strongly than the x-axis (pleasantness).

V. CONCLUSION
On the basis of the interactions between humans and robots discussed in Section II, we proposed four primary emotions: neutrality, joy, anger, and sadness, each representing one quadrant of Fig. 1. The labels of these primary emotions, however, require further assessment.
In Section III, prosodic features such as pitch, energy, and tempo were shown to have a major effect on the ability to recognize emotion from a speaker's utterance, whereas phonetic features do not have a significant effect.
The pitch contour can be extracted by numerous algorithms such as SIFT. In the field of emotion recognition, however, the zero values of the pitch contour can be neglected because of the zero effect. We therefore proposed the non-zero-pitch contour and demonstrated that it yields better performance than the previous pitch contour.
In the future, we need to verify the assumption made in this study that the probability density function of the pitch contour can be modeled by a Gaussian distribution. If this assumption does not hold, a more appropriate distribution model for the pitch contour must be identified.
Moreover, we need a recognition algorithm that is robust to individual variation. Although the prosodic features generally contain characteristics common to each emotion, these features vary with gender and between individuals, so the same recognition rate cannot be obtained for every subject. This can be explained in part by the per-subject variance of pitch, since the Bayesian classifier performs better when the variance of the pdf is low. Further study of the differences between subjects is needed.
REFERENCES
[1] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fujimura, "The intelligent ASIMO: System overview and integration," Proc. Int. Conf. on Intelligent Robots and Systems, vol. 3, pp. 2478-2483, 2002.
[2] C. Breazeal, "Emotion and sociable humanoid robots," International Journal of Human-Computer Studies, vol. 59, no. 1/2, pp. 119-155, 2003.
[3] S. Tomkins, Affect, Imagery, Consciousness, Springer Publishing Company, New York, 1962.
[4] R. Plutchik, "A general psychoevolutionary theory of emotion," in Emotion: Theory, Research, and Experience, vol. 1, Academic Press, 1980.
[5] A. Ortony, G. L. Clore, and A. Collins, The Cognitive Structure of Emotions, Cambridge University Press, Cambridge, 1988.
[6] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Computing & Applications, vol. 9, no. 4, pp. 290-296, 2000.
[7] T. D. Kemper, "How many emotions are there? Wedding the social and the autonomic components," American Journal of Sociology, vol. 93, p. 269, 1987.
[8] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, pp. 1161-1178, 1980.
[9] C. E. Williams and K. N. Stevens, "Emotions and speech: Some acoustical correlates," Journal of the Acoustical Society of America, vol. 52, no. 4, pp. 1238-1250, 1972.
[10] Xiao Lin, Yanqiu Chen, Soonleng Lim, and Choonban Lim, "Recognition of emotional state from spoken sentences," Proc. IEEE 3rd Workshop on Multimedia Signal Processing, pp. 469-473, 1999.
[11] V. Kostov and S. Fukuda, "Emotion in user interface, voice interaction system," Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics, vol. 2, pp. 798-803, 2000.
[12] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Transactions on Audio and Electroacoustics, vol. AU-20, no. 5, pp. 367-377, 1972.
[13] Bong-Seok Kang, "Text Independent Emotion Recognition Using Speech Signals," Yonsei University, 2000.