2005 IEEE International Workshop on Robots and Human Interactive Communication

Improvement of Emotion Recognition by Bayesian Classifier Using Non-zero-pitch Concept *

Kyung Hak Hyun, Eun Ho Kim, Yoon Keun Kwak
Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Guseong Dong, Yuseong Gu, Daejeon, Republic of Korea
[email protected], [email protected], [email protected]

Abstract – Emotion recognition is an important factor in the development of human-robot interaction (HRI), and pitch information in particular is widely used as an emotion-related feature in speech emotion recognition. In this paper we propose an improved method of recognizing emotion from pitch, the "non-zero-pitch" concept, in which the pitch contour contains no zero values. We apply this concept to a Bayesian classifier and obtain better emotion recognition results than those attained with the previous pitch contour. We explain the non-zero-pitch concept precisely and show its superiority over the previous pitch concept. It is also important to determine which emotions are to be classified. Many researchers of emotion recognition classify primary emotions such as anger, joy, and so on, but they differ on the number and kind of primary emotions to use and generally do not explain the rationale for their classification; psychologists have likewise debated the topic of primary emotions. In the present study we therefore also propose a classification of primary emotions suited to the field of HRI.

Index Terms – Primary emotion, Voice, HRI, Pitch, Bayesian classifier

I. INTRODUCTION

In the past, robots were primarily used in dangerous work domains or in the field of factory automation. More recently, however, robots have been developed for more diverse purposes; examples include ASIMO [1], a biped robot that can simulate human walking, Paro, which is used for psycho-therapy, and Kismet [2], a head-robot that expresses emotions through facial expressions. In particular, many researchers are interested in developing human-friendly robots that can interact with a human, recognize human emotions, and express emotions, and there have accordingly been many studies on HRI and emotion.

The main feature of these human-friendly robots is their capacity to interact with humans. In the past, robots only communicated unilaterally with humans: if a human issued an order, the robot followed the order. In the future, however, robots will be able to consider not only commands but also other information such as the emotional and health status of the human commander. In following an order, the robot will be able to modify its actions using feedback from the human's reaction; that is, genuine human-robot interaction will be possible.

In this paper, we address the recognition of emotion, which is a fundamental technology in the design of human-friendly robots. Human emotion is subjective, and recognizing the emotion of an unknown or unfamiliar person in a real situation is not easy. For this reason, we have restricted our discussion to a speaker-dependent system.
Moreover, we consider a text-independent system, one that detects emotion independently of the spoken content. Finally, the target of our study is the recognition of emotion from speech: we rely on speech to establish the interface between human and robot because speech is the fundamental mode of human communication.

II. PRIMARY EMOTION FOR HRI

A. Previous Works

Primary emotions refer to the basic elements from which the full set of complex human emotions is organized; in other words, an arbitrary emotion is composed of a combination of primary emotions. However, scholars and psychologists do not agree on this definition, and there is controversy over whether primary emotions even exist. For engineers, a clear and discrete classification of emotions would be convenient. Accordingly, many engineers have based their research on the assumption that primary emotions exist, though the nature and number of primary emotions remain in doubt. As with psychologists, however, engineers have also disagreed about the nature of primary emotions and how many should be considered. In 1962, Tomkins proposed the existence of eight primary emotions: fear, anger, anguish, joy, disgust, surprise, interest, and shame [3]. In 1980, Plutchik also proposed eight emotions: fear, anger, sorrow, joy, disgust, acceptance, anticipation, and surprise [4]. In 1988, Ortony et al. proposed six emotions: fear, anger, sadness, joy, disgust, and surprise [5]. Table I shows other classification systems of primary emotions. Nicholson noted that those who research the recognition of emotion use both a varying number of categories and various kinds of categories [6].

TABLE I
PREVIOUS WORK ON PRIMARY EMOTIONS
Study Group        Primary Emotions
Mozziaconacci S.   Joy, Anger, Sadness, Neutrality, Fear, Boredom, Indignation
Klasmeyer G.       Happiness, Anger, Sadness, Neutrality, Fear, Boredom, Disgust
Scherer K. R.      Joy, Anger, Sadness, Fear, Disgust
McGilloway S.      Happiness, Anger, Sadness, Fear
Kang, Bong-Seok    Happiness, Anger, Sadness, Neutrality

* This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Science and Technology of Korea.

[Fig. 1 A circumplex model of affect (quadrants: Joy, Anger, Sadness, Neutrality)]

There is, however, a limitation in these works with respect to the aims of the present study, because the researchers based the rationale of their classifications on human psychology rather than on human-robot interaction. Because we deal with the design of human-friendly robots, we have to consider not just human emotions per se but the interaction between humans and robots. To design a robot's actions for human-robot interaction, we therefore need to examine which emotions are needed and establish a new approach to the classification of primary emotions.

B. Classification of primary emotions

As mentioned in section A, there are some human emotions that can give feedback to robots and enable robots to modify their own actions for human-robot interaction. Conversely, there are emotions that do not play an important role in the design of a robot's interaction with humans. These unimportant emotions are not useful to designers and, as a result, do not need to be classified. In fact, classifying such emotions may hinder the correct recognition of emotions, causing the robot to act improperly.
However, it is difficult to define and verify exactly which emotions are necessary. Thus, in this study, we start by defining a minimum number of emotions. In 1982, the psychologist Kemper observed that the numbers of primary emotions proposed by other psychologists range from three to eleven [7]. In 1954, Schlosberg proposed an emotion structure consisting of a pleasantness axis and an attention axis, and in the 1970s Russell proposed a two-dimensional structure for affect words [8] (Fig. 1). We therefore take the minimal dimension of the mental space to be two, with pleasantness and activation as its axes. We then define four primary emotions, each of which is a representative state of one quadrant, as follows:

1) State of Joy: The state of joy is representative of the first quadrant of Fig. 1 and includes emotional states such as happy, delighted, excited, astonished, and aroused. In the HRI field, if a robot recognizes that the human is in this state, it can interpret this to mean that the master or human commander is satisfied or content with the robot's actions. Thus, if the master requests a similar job in the future, the robot will tend to follow a similar pattern of action.

2) State of Anger: The state of anger is representative of the second quadrant of Fig. 1 and includes emotional states such as tense, alarmed, angry, afraid, annoyed, distressed, and frustrated. In the HRI field, anger generally occurs when a human is dissatisfied with a robot's actions. In addition, if external factors make the human angry, the robot is obliged to ameliorate the human's emotional state and to modify its current actions.

3) State of Sadness: The state of sadness is representative of the third quadrant of Fig. 1 and includes emotional states such as sad, miserable, gloomy, depressed, bored, droopy, and tired. Like anger, sadness indicates that the robot needs to modify its actions. However, if the human is in this state, the robot must cease its current action and perform a different kind of action rather than simply adjusting its reaction.

4) State of Neutrality: Finally, we define the state of neutrality as a primary emotion. Neutrality is representative of the fourth quadrant of Fig. 1 and includes emotional states such as sleepy, calm, relaxed, satisfied, and content. When confronted with a neutral state, the robot understands that it need not modify its actions or design a new pattern of reaction, because the human's emotional state is stable; that is, the robot's actions are correct. Moreover, because the intensity of the human's emotion is generally not important in this region, we can regard the state as neutral.

Although the use of joy, anger, sadness, and neutrality as representatives of the four quadrants still requires verification, we can use these four emotions as a minimal set of primary emotions.
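To make the quadrant-based definition above concrete, the following minimal Python sketch (not from the original paper) maps a point in the two-dimensional affect space of Fig. 1 to one of the four proposed primary emotions. The normalization of the coordinates to [-1, 1] and the function name are illustrative assumptions.

# Minimal sketch: mapping a point in the 2-D affect space of Fig. 1
# (pleasantness on x, activation on y) to the four primary emotions of
# Section II.B. The [-1, 1] normalization is an illustrative assumption.

def primary_emotion(pleasantness: float, activation: float) -> str:
    """Return the quadrant label of Fig. 1 for a normalized affect point."""
    if activation >= 0:
        # upper half: high activation
        return "joy" if pleasantness >= 0 else "anger"
    # lower half: low activation
    return "neutrality" if pleasantness >= 0 else "sadness"

if __name__ == "__main__":
    print(primary_emotion(0.7, 0.6))    # first quadrant  -> joy
    print(primary_emotion(-0.4, 0.8))   # second quadrant -> anger
    print(primary_emotion(-0.5, -0.3))  # third quadrant  -> sadness
    print(primary_emotion(0.6, -0.7))   # fourth quadrant -> neutrality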
III. EXTRACTION OF FEATURES

A. Prosodic Features

In the field of emotion recognition, prosodic features are considered important carriers of emotion. The main features are as follows:

1) Pitch: Pitch, sometimes called the fundamental frequency, refers to the period of the wave pulses generated when air from the lungs is compressed through the glottis. It is a factor to which the auditory sense is very sensitive.

2) Energy: The energy of a voice can be detected physically through sound pressure or subjectively as a level of loudness. Generally, this factor is effective for distinguishing joy and anger but not neutrality or sadness.

3) Tempo: A speech recognition algorithm can be used to measure the tempo of an utterance; tempo is then expressed as the number of phonemes per unit time. However, this method requires a huge database of each person's utterances.

B. Previous Works

Considerable research has already shown that human emotion is related to the prosodic features of speech [9]. Lin et al., in particular, showed the relationship between pitch and emotion; they used the pitch of human speech to attempt to recognize the emotion of a speaker [10]. In addition, Kostov and Fukuda used prosodic features such as pitch, energy, and tempo to develop a text-independent system and reported simulation results [11].

Pitch in particular is frequently used for emotion recognition, and many pitch extraction algorithms have accordingly been developed. These algorithms can be grouped into two classes: pitch extraction in the time domain and pitch extraction in the frequency domain. In the time domain, an autocorrelation method is generally used: the pitch period is found from the autocorrelation of the speech signal. In the frequency domain, the cepstrum method is often used; however, it is computationally expensive because of its dependence on the FFT (Fast Fourier Transform), and is therefore less commonly adopted.
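The paper uses the SIFT algorithm [12] for pitch extraction; the sketch below is not SIFT but a generic frame-wise autocorrelation estimator of the kind just described, and it shows how unvoiced or low-correlation frames end up as zero values in the pitch contour. The sampling rate, frame and hop sizes, search range, and voicing threshold are illustrative assumptions, not values taken from the paper.

import numpy as np

def pitch_contour(signal, fs=16000, frame_len=400, hop=160,
                  fmin=50.0, fmax=500.0, voicing_thresh=0.3):
    """Frame-wise autocorrelation pitch estimate (Hz per frame).

    Frames whose normalized autocorrelation peak is weak are treated as
    unvoiced and assigned 0 Hz -- exactly the zero values that the
    non-zero-pitch concept of Section III.C later removes.
    """
    pitches = []
    lag_min = int(fs / fmax)   # smallest lag searched (highest pitch)
    lag_max = int(fs / fmin)   # largest lag searched (lowest pitch)
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        if ac[0] <= 0:                       # silent frame
            pitches.append(0.0)
            continue
        ac = ac / ac[0]                      # normalize so ac[0] == 1
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        # low autocorrelation -> unvoiced frame -> zero pitch
        pitches.append(fs / lag if ac[lag] >= voicing_thresh else 0.0)
    return np.array(pitches)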
C. Proposed Method

1) Assumption for Bayesian classification: In order to apply the Bayesian classifier, we need to know the probability density function (pdf) of pitch; however, this cannot be determined exactly, so we have to assume a form of the pdf of pitch for each class. In the general approach, a Gaussian distribution is simply assumed for an unknown distribution. However, if we use the previous concept of the pitch contour, which takes the value zero whenever the autocorrelation is low, the Gaussian assumption is not appropriate, as shown in Fig. 2. This is a consequence of the zero values in the pitch contour and is known as the "zero effect". The zero effect distorts the pdf; more precisely, it shifts the mean to the left. Therefore, we have to eliminate the zero effect to obtain a more appropriate pdf.

2) Non-zero-pitch: Because the zero values of pitch cause errors in the Gaussian model, they must be eliminated. We therefore obtain the non-zero-pitch contour, that is, a pitch contour without zero values. Many researchers have studied pitch extraction algorithms based on autocorrelation, the cepstrum, and so on, and previous methods such as the SIFT algorithm [12] extract a pitch contour of the kind shown in Fig. 2. The zero values in such a contour cause the zero effect and distort the pdf estimated from the pitch contour, as shown in Fig. 3. We therefore need the non-zero-pitch contour to prevent the zero effect in the pdf: we simply omit the zero values from the original pitch contour obtained by the SIFT algorithm, giving the contour shown in Fig. 4. As a result, we obtain a more appropriate pdf from the non-zero-pitch contour, as shown in Fig. 5. Although the non-zero-pitch contour loses some content information, such as the locations of unvoiced sounds, the emotion recognition result is improved (see Section IV). We therefore conclude that, in contrast with speech recognition, the zero values of the pitch contour are not important in the field of emotion recognition.

[Fig. 2 Pitch contour and Fig. 4 Non-zero-pitch contour: pitch (Hz) versus frame index. Fig. 3 PDF from previous pitch contour and Fig. 5 PDF from non-zero-pitch contour: probability versus pitch (Hz), one panel per emotion (N, J, S, A).]

3) Bayesian classifier: Under the assumption that the pdf of the pitch contour is Gaussian, we can use the Bayesian classifier for emotion recognition. We want to determine the probability of an emotion given the observed features, so we only need the a priori probability of each emotion and the conditional probability of the features given the emotion; from the Bayesian theorem,

P(E|x) = P(x|E) P(E) / P(x),

where E denotes the emotion and x the feature.
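As a rough illustration of Section III.C, the sketch below fits one Gaussian per emotion to the non-zero pitch values of the training contours and classifies a test contour by the posterior P(E|x). The paper does not specify how frame-level likelihoods are combined into an utterance-level decision; treating the non-zero frames as independent and using equal priors are assumptions made only for this sketch.

import numpy as np

# Minimal sketch of the Section III.C classifier: one Gaussian per emotion,
# fitted to non-zero pitch values only, with classification by posterior.
# Frame independence and equal priors are illustrative assumptions.

def fit_gaussians(train_contours):
    """train_contours: dict emotion -> list of pitch contours (np arrays)."""
    params = {}
    for emo, contours in train_contours.items():
        pitch = np.concatenate([c[c > 0] for c in contours])  # non-zero-pitch
        params[emo] = (pitch.mean(), pitch.std())
    return params

def log_gauss(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def classify(contour, params, priors=None):
    """Return the emotion maximizing log P(E) + sum_t log p(x_t | E)."""
    x = contour[contour > 0]                     # drop zero-valued frames
    priors = priors or {e: 1.0 / len(params) for e in params}
    scores = {e: np.log(priors[e]) + log_gauss(x, m, s).sum()
              for e, (m, s) in params.items()}
    return max(scores, key=scores.get)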
IV. EXPERIMENTS

A. Database

Given that in many languages the fundamental tendencies of sounds are expressed in similar ways, our results in recognizing the emotions of Korean speakers can generally be applied to speakers of other languages. We therefore used a Korean database produced by Professor C. Y. Lee of the Media and Communication Signal Processing Laboratory of Yonsei University, with the support of the Korea Research Institute of Standards and Science. The data cover the four emotions of neutrality, joy, sadness, and anger, and were recorded according to the following principles [13]:
- easy pronunciation in a neutral, joyful, sad, and angry state;
- 45 dialogic sentences that express natural emotion.

The original data are stored at 16 kHz and 32 bits with over 30 dB S/N, with about 50 ms of silence at the beginning and end of each utterance. To use the data in MATLAB, we converted the training and simulation data to a 16-bit format through quantization with a pulse code modulation filter. The database contains emotional voice samples from five male and five female speakers.

In reference [13], an experiment was conducted to measure human performance in detecting the underlying emotional state of a speaker. Thirty subjects listened to the utterances of one speaker played back in random order and were asked to choose one emotion out of four (joy, anger, sadness, or neutrality). Human performance was approximately 78% accuracy, as indicated in Table II. Note that the baseline is 25% (random guessing).

TABLE II
CORRECT CLASSIFICATION RATES BY HUMAN SUBJECTS (%)
(rows: presented emotion; columns: recognized emotion)
            Neutrality  Joy    Sadness  Anger
Neutrality  83.9        3.1    8.9      4.1
Joy         26.6        57.8   3.5      12.0
Sadness     6.4         0.6    92.2     0.8
Anger       15.1        5.4    1.0      78.5
Overall: 78.2

B. Result

We obtain the conditional probability function, i.e., P(x|E), from the training data, and the recognition results are obtained from the test data. For each speaker, the training data consist of 25 sentences per emotion and the test data of 20 sentences per emotion, excluding the training data. Table III shows the recognition results for female subjects using the previous pitch contour, and Table IV the results using the non-zero-pitch contour. Similarly, Tables V and VI show the recognition results for male subjects.

TABLE III
CORRECT CLASSIFICATION RATES FOR FEMALE SUBJECTS USING PREVIOUS PITCH CONTOUR (%)
            Neutrality  Joy    Sadness  Anger
Neutrality  43.3        5.0    21.7     30.0
Joy         18.3        51.7   1.7      28.3
Sadness     13.3        0      86.7     0
Anger       10.0        23.3   0        66.7
Overall: 62.08

TABLE IV
CORRECT CLASSIFICATION RATES FOR FEMALE SUBJECTS USING NON-ZERO-PITCH CONTOUR (%)
            Neutrality  Joy    Sadness  Anger
Neutrality  63.3        1.7    25.0     10.0
Joy         5.0         66.7   8.3      20.0
Sadness     26.7        0      73.3     0
Anger       8.3         15.0   8.3      68.3
Overall: 67.92

TABLE V
CORRECT CLASSIFICATION RATES FOR MALE SUBJECTS USING PREVIOUS PITCH CONTOUR (%)
            Neutrality  Joy    Sadness  Anger
Neutrality  61.7        23.3   13.3     1.7
Joy         31.7        15.0   11.7     41.7
Sadness     13.3        5.0    81.7     0
Anger       6.7         10.0   0        83.3
Overall: 60.42

TABLE VI
CORRECT CLASSIFICATION RATES FOR MALE SUBJECTS USING NON-ZERO-PITCH CONTOUR (%)
            Neutrality  Joy    Sadness  Anger
Neutrality  75.0        10.0   13.3     1.7
Joy         18.3        30.0   13.3     38.3
Sadness     28.3        5.0    66.7     0
Anger       5.0         10.0   0        85.0
Overall: 64.17

As shown in Tables III through VI, use of the non-zero-pitch contour results in better performance than the previous pitch contour. Finally, the average recognition rate for each subject is given in Table VII.

TABLE VII
THE AVERAGE RECOGNITION RATE
Female 1  Female 2  Female 3  Female 4  Female 5
57.08%    67.92%    58.33%    60.42%    54.58%
Male 1    Male 2    Male 3    Male 4    Male 5
64.17%    59.58%    58.33%    55.83%    66.67%
Overall: 71.10%

From these results, we conclude that the non-zero-pitch concept improves the average recognition rate by about 5%; however, joy and sadness remain difficult to classify, as with the previous pitch concept. The low rates for joy and sadness may be explained by the characteristics of pitch together with the mental model (Fig. 1). From Tables III-VI, pitch information can cause joy to be misclassified as anger and sadness to be confused with neutrality, and in Fig. 1 joy and anger are distinguished along the x-axis, as are sadness and neutrality. Classification along the y-axis therefore performs well while classification along the x-axis performs poorly, which indicates that pitch information carries more information about the y-axis (activation) than about the x-axis (pleasantness).
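The reported "Overall" values above are consistent, up to rounding, with the mean of the diagonal of each confusion matrix, e.g. (83.9 + 57.8 + 92.2 + 78.5)/4 = 78.1 for Table II; the paper does not state this formula explicitly, so treating "Overall" as the diagonal mean is our inference. The sketch below builds such a row-normalized confusion matrix from per-utterance predictions and computes that overall rate.

import numpy as np

EMOTIONS = ("neutrality", "joy", "sadness", "anger")

def confusion_matrix(true_labels, predicted_labels):
    """Row-normalized confusion matrix in %, rows = true, columns = recognized."""
    idx = {e: i for i, e in enumerate(EMOTIONS)}
    counts = np.zeros((len(EMOTIONS), len(EMOTIONS)))
    for t, p in zip(true_labels, predicted_labels):
        counts[idx[t], idx[p]] += 1
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

def overall_rate(confusion):
    """Mean of the diagonal, matching the 'Overall' rows up to rounding."""
    return float(np.mean(np.diag(confusion)))

# Example with the human-listener matrix of Table II:
table2 = np.array([[83.9, 3.1, 8.9, 4.1],
                   [26.6, 57.8, 3.5, 12.0],
                   [6.4, 0.6, 92.2, 0.8],
                   [15.1, 5.4, 1.0, 78.5]])
print(overall_rate(table2))   # 78.1, reported as 78.2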
V. CONCLUSION

On the basis of the interactions between humans and robots discussed in Section II, we proposed four primary emotions: neutrality, joy, anger, and sadness, each representing one quadrant of Fig. 1. The labels of these primary emotions, however, require further assessment. In Section III, prosodic features such as pitch, energy, and tempo were shown to have a major effect on the ability to recognize emotions from a speaker's utterance, whereas phonetic features do not have a significant effect. The pitch contour can be extracted by numerous algorithms such as SIFT; however, in the field of emotion recognition, the zero values in the pitch contour can be neglected because of the zero effect. Thus, we suggested a non-zero-pitch contour and demonstrated that it yields better performance than the previous algorithm.

In the future, we need to verify the assumptions made in this study, specifically that the probability density function of the pitch contour can be modeled by a Gaussian distribution; if this assumption does not hold, a more appropriate distribution model for the pitch contour must be identified. Moreover, we need to find an algorithm that is robust to individual variation. Although prosodic features generally contain characteristics common to each emotion, they vary with gender and individual differences, so we cannot obtain the same recognition rate for every subject. This can be explained by the variance of pitch for each subject, because the Bayesian classifier performs better when the pdf has low variance than when it has high variance. The differences between subjects, however, require further study.

REFERENCES

[1] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fujimura, "The intelligent ASIMO: System overview and integration," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, vol. 3, pp. 2478-2483, 2002.
[2] C. Breazeal, "Emotion and sociable humanoid robots," International Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 119-155, 2003.
[3] S. Tomkins, Affect, Imagery, Consciousness, Springer Publishing Company, New York, 1962.
[4] R. Plutchik, "A general psychoevolutionary theory of emotion," in Emotion: Theory, Research and Experience, vol. 1, Academic Press, 1980.
[5] A. Ortony, G. L. Clore, and A. Collins, The Cognitive Structure of Emotions, Cambridge University Press, Cambridge, 1988.
[6] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Computing & Applications, vol. 9, no. 4, pp. 290-296, 2000.
[7] T. D. Kemper, "How many emotions are there? Wedding the social and the autonomic components," American Journal of Sociology, vol. 93, p. 269, 1987.
[8] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, pp. 1161-1178, 1980.
[9] C. E. Williams and K. N. Stevens, "Emotions and speech: Some acoustical correlates," Journal of the Acoustical Society of America, vol. 52, no. 4, pp. 1238-1250, 1972.
[10] X. Lin, Y. Chen, S. Lim, and C. Lim, "Recognition of emotional state from spoken sentences," in Proc. IEEE 3rd Workshop on Multimedia Signal Processing, pp. 469-473, 1999.
[11] V. Kostov and S. Fukuda, "Emotion in user interface, voice interaction system," in Proc. IEEE Int. Conf. on Systems, Man and Cybernetics, vol. 2, pp. 798-803, 2000.
[12] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Transactions on Audio and Electroacoustics, vol. AU-20, no. 5, pp. 367-377, 1972.
[13] B.-S. Kang, "Text Independent Emotion Recognition Using Speech Signals," Yonsei University, 2000.