Facial Emoticons

Sandra Gama
Instituto Superior Tecnico
Visualisation and Multimodal Interfaces Group
[email protected]

ABSTRACT
Facial expressions are a highly effective non-verbal means of sharing emotions. On the Internet, emoticons are used to compensate for the lack of non-verbal forms of communication. The objective of this study is to generate emoticons through facial recognition and to reproduce this information on user interfaces. Face tracking is done through the OpenCV library, which relies on a Haar cascade classifier. The result is a facial image that is processed for feature detection. This stage, although it also uses OpenCV, consists of a series of face-adapted algorithms for edge detection. Bayesian classifiers are then used to infer facial expressions. Although test results show room for improvement, they also point out the viability of an interface which turns facial expressions into emoticons as an improvement to communication. Methodologies which rely on low computational cost algorithms, such as the ones proposed, yield quite satisfactory results and low latency. In the area of user interfaces, it is necessary to find a balance between temporal response and the success rate of the classifier.

Author Keywords
facial recognition, emoticons, facial expressions, pattern recognition, user interfaces

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: Miscellaneous—Facial Expression Recognition

INTRODUCTION
For human beings, the process of recognising facial expressions is quite straightforward. In fact, facial expressions consist of a series of features which often make them easy to understand, even among people of different social and ethnical backgrounds.

With the advent and massification of the Internet, people often have online textual conversations. This form of communication, lacking non-verbal cues, soon brought the necessity for artifacts which could enrich the dialogue. Emoticons, which reinforce the meaning of textual messages by simulating human facial expressions, have been enthusiastically adopted in this context.

At present, however, to use an emoticon, users have to look it up in a list and select it, or type a specific combination of characters to insert it in the textual message. In an ideal situation, communication should be as similar as possible to human-human interaction. Consequently, computer systems should be able to reproduce facial information in a more natural way.

Automatic expression recognition has been an interesting challenge, involving image processing and pattern recognition. The further development of these fields may bring significant improvements to human-machine interaction, as well as to many other fields such as medicine or psychology.

The main objective of this study is to gather facial information and process it, in order to reproduce, on a user interface, the corresponding facial emoticon. Three prototypes have been created for evaluating success rates and showing the functionality and applicability of the algorithms which were developed. Also, automated tests have been run. As a metric for success, we defined that a classifier which pointed out a clear distinction between the expression corresponding to the emotion shown by the user and the remaining expressions would be satisfactory.

This paper is organised as follows: firstly we analyse recent studies in the area, taking into account the different stages of facial expression recognition; then we present the developed work and describe, in detail, the processes of facial detection, feature extraction and expression classification; after that, we refer to the prototypes which have been created; we then present and discuss the results of automated tests and, lastly, we state the conclusions and point out directions for future investigation.
RELATED WORK
Classifying facial expressions implies following a coherent methodology. The studies which have been done in this field sub-divide facial expression recognition into three main steps: facial detection on an arbitrary image, facial feature detection and expression classification. Since this study's main focus is classification, the remaining steps are presented here as an introduction.

Facial Detection
Concerning facial detection in arbitrary images, there are two main approaches. The first one is feature-based and focuses on the colour of the human face; Saber et al. [16] and Vezhnevets et al. [19] both use these principles. According to the second approach, which is appearance-based, a face is defined as a pattern of pixel intensities, as suggested by Sung et al. [18]. Facial patterns are distinguished from non-facial patterns by sub-dividing an image into smaller windows. Viola and Jones' method [20] has been a strong reference for facial detection systems: it subdivides the main image into windows, which are processed by cascade classifiers able to find the candidate window containing a facial image. This method, being computationally light, has served as a basis for studies such as those of Bartlett et al. [10] and Zhan et al. [1], which present real-time performance; the first uses Haar functions, which are computationally light, and the second adopts the boosting method. More recently, both Cao and Tong [2] and Lu et al. [11] have adopted Viola and Jones' method. Also, Intel's OpenCV library, which emphasises real-time computer vision and, among other aspects, provides tools for facial detection, is based on this method.

Facial Feature Detection
There are two main approaches to facial feature detection: holistic, which is associated with model-based methods; analytic, which is commonly used with feature-based methods; or a combination of both. For holistic detection, not only did Cohn et al. [6] and Zhan et al. [1] suggest the use of models some years ago, but models are also present in recent studies, such as Cao and Tong's [2], who adopt an operator, defined as a measure of invariant grayscale texture, for detection; along with being invariant to grayscale variation, it presents good temporal performance. As for analytic approaches, Zhan et al. [1] apply Gabor filters to a set of reference points (normalised from tests run on facial images) for facial feature extraction. Lu et al. [11] present a new feature representation for an analytic approach, which sees a facial image as a combination of micro-patterns, each pixel of the image being associated with a class of model patterns; it is a fast and illumination-robust method.

Classification
There are three main approaches to classification: based on models, on neural networks and on rules. Model-based studies include Edwards et al. [4], who present an AAM framework for illumination- and position-independent systems, Hong et al. [5], who combine Gabor filters with Elastic Graph Matching and use a gallery to improve the classification of a given expression, with a success rate of about 89%, and, more recently, Lu et al. [11] and Kotsia et al. [9], who adopt SVM (Support Vector Machines) for classification; the latter developed an evolution of SVM which includes useful statistical information. Also, Cao and Tong [2] developed a recent study on HMM (Hidden Markov Models), which they name EHMM (Embedded Hidden Markov Model) because it expands each HMM state into a new HMM, obtaining a super-state for the exterior model and an embedded state which corresponds to the interior model.

Some methods based on neural networks are the studies of Kobayashi et al. [8], who apply a Back-Propagation Neural Network and present real-time response, Yoneyama et al. [12], who adopt discrete Hopfield networks, Feitosa et al. [15], who study the use of two different neural networks for classification, and Stathopoulou et al. [17], who apply the network to partial areas of the face, obtaining faster classification. Some rule-based methods are the works of Pantic et al. [14], who use the principal points of the face, extract its features and calculate the difference between model features (which correspond to a neutral state) and the present ones, and Khanam et al. [7], who present a Mamdani-type fuzzy system which uses two knowledge bases: one for data and another for (fuzzy) rules.
DEVELOPED WORK
A top-down approach to facial expression recognition was followed. The initial problem was divided into three sub-problems: Facial Detection, which implies tracking the face and extracting its coordinates; Feature Extraction, which determines, using the previous step's output, the coordinates of the main facial features; and Classification, which, using the facial features inferred in the previous step, determines the corresponding facial expression.

The internal processing is thus sub-divided into five main modules, illustrated in Figure 1:

1. Facial Detection Module: at this stage, an arbitrary image is used as input and processed in order to track the face;
2. Normalisation Module: facial images, which may have different dimensions according to the original image resolution and the size of the face in the image, are normalised to enhance further processing;
3. Feature Extraction Module: the normalised facial image goes through a series of algorithms, whose outputs are the coordinates of facial features;
4. Feature Transformation Module: features which have been extracted in the previous stage are processed into a series of more expressive features for classification;
5. Classification Module: using the output of the previous step as input, the classifier produces the expression corresponding to the original image.

Figure 1. Stages of Facial Expression Classification

All these stages will now be described in detail.
Facial Detection
In the context of the present work, the user's environment consists of an arbitrary room, and his or her only instrument for interaction with the system is a computer with a webcam. As such, the system must be able to process context-independent images: not only facial images, but also images where other elements may be present. It was thus necessary to detect a face in an arbitrary image. Viola and Jones' method [20], since it uses Haar classifier cascades (which are computationally light) and provides a means for real-time facial detection, seemed to be a proper choice. Thus, not only is the problem of facial detection solved, but it is also solved in real time, which is an important factor.

Normalisation
So that the system may interact with any user who owns a webcam, and since devices present different resolutions and users may be at any distance from the camera, it is important to consider that facial areas may present a variety of dimensions. For facial feature extraction to be done with enough precision, and since the algorithms are sensitive to image dimensions, it was necessary to normalise them. This stage follows face detection; as such, only the facial region is normalised, to make the best use of resources. Images are resized to standard dimensions before being processed for feature extraction. A facial feature model has been adopted, which consists of the facial features of an average face, to infer unknown facial values. The model dimensions are taken into account when normalising a facial image, in order to grant high coherence and flexibility. Consequently, if the model is altered, the image's dimensions will accompany this transformation.
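As an illustration of these two stages, the following minimal Python sketch shows face detection with an OpenCV Haar cascade followed by normalisation to a fixed size. It uses the modern cv2 interface, and the cascade file name and the 200x200 target size are illustrative assumptions rather than the exact values used in this work.

import cv2

# Frontal-face Haar cascade shipped with OpenCV (file name is illustrative).
CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_normalise(frame, size=(200, 200)):
    """Return the largest detected face region, resized to a standard size."""
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = CASCADE.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                        # no face: nothing to normalise
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])     # keep the largest candidate window
    return cv2.resize(frame[y:y + h, x:x + w], size)

Resizing every detected face to the same dimensions keeps the subsequent edge-detection parameters and the face model consistent across cameras and viewing distances.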
Feature Extraction

Considered Features
FACS, created by Ekman and Friesen in 1978 [3], is a technical norm that classifies facial behaviours according to the muscles which are responsible for each facial action. Many other studies have taken FACS as a basis for their work. This method defines a number of AUs (action units; originally 46), which consist of the actions that correspond to one or more facial muscles. The combination of different AUs defines a wide set of facial expressions. According to this norm, there is a vast spectrum of parameters that must be taken into account. These are predominantly related to the eyes, eyebrows and mouth; other features, such as the chin and cheeks, are responsible for only 4 of the AUs.

Model of Facial Features
In order to maximise the performance of feature detection, through a series of standard values for extraction, a model of the human face has been developed. This model allows not only restricting the region of interest, but also estimating the position of facial features when they are not detected (because of extremely poor lighting conditions, exaggerated face rotation or the impossibility of detecting certain parts of the face). The model is dynamically adjusted to the detected features and, if it is impossible to obtain a certain feature, the model will compensate for it. This grants the system increased robustness.

Researchers from the University of Regensburg, in Germany, developed two average faces (one for each gender) for their studies in psychology. Here, we created a hybrid model by merging these two faces, which resulted in a realistic starting point, well adapted to this study's needs. The model consists of a vector in which every facial feature is represented, according to the standard dimensions defined by the face model. This element has an elastic behaviour, being progressively adapted to the facial features which are localised. Additionally, the use of this model as a means for detection is based on the substitution of coordinates in case these are not correctly detected, preventing possible errors. In these cases, and since it is an elastic model, the coordinates corresponding to undetected features take the values given by the model. The model is illustrated in Figure 2.

Figure 2. Face model
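The elastic model can be pictured as a set of default feature coordinates that are overridden whenever detection succeeds. The sketch below uses made-up default positions in a normalised face; it illustrates the fallback behaviour only and is not the model actually derived from the Regensburg faces.

# Hypothetical default (x, y) positions in the normalised face; the real
# model is the merge of the two Regensburg average faces.
FACE_MODEL = {
    "left_eye":  (60, 80),  "right_eye":  (140, 80),
    "left_brow": (60, 60),  "right_brow": (140, 60),
    "mouth":     (100, 150),
}

def apply_elastic_model(detected):
    """Replace every feature that detection failed to locate (None or missing)
    with the model's default position; detected features take precedence."""
    return {name: detected.get(name) or default
            for name, default in FACE_MODEL.items()}

# Example: the mouth was not found, so the model coordinates are used instead.
features = apply_elastic_model({"left_eye": (58, 82), "mouth": None})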
Facial Feature Detection
The methodology used for facial feature detection comprises five stages:

Detection of a region of interest: Here we take advantage of the OpenCV library functions; this step helps reduce the computational weight of the following steps. In the case of the eyes, four fundamental points are considered: the leftmost, rightmost, topmost and bottommost. These make it possible to find out the width and height of the eye, which are fundamental for expression classification. In order to improve performance, only the upper half of the face has been considered. When analysing the eyebrows, we did not use any cascade classifier; instead, a tighter region of interest has been defined which processes only the area located right above the uppermost point of the left eye, taking advantage of the symmetry of the human face to maximise the efficiency of detection. This decision was considered carefully. On the one hand, it implies a slight decrease in the system's robustness for people who have suffered an injury or disease which prevents them from moving their frontalis muscle (the one responsible for eyebrow raising). On the other hand, given the objective of developing a light, simple and fast method for feature detection, it seemed a proper decision. Mouth detection followed the same steps as eye detection; however, since the cascade classifier did not perform satisfactorily, it was used only as a basis for detection, and special attention has been paid to defining the mouth's region of interest based on the elastic model which has been adopted.

Grayscale conversion: This is a pre-processing step for the application of the Canny operator, which manipulates single-channel images. It also simplifies Gaussian blur processing, since applying its convolution matrix to one channel is more efficient than applying it to the three colour channels (RGB).

Gaussian blur: This algorithm reduces noise and other artifacts which are unnecessary for Canny edge detection. Although Canny, by default, applies a Gaussian blur filter before detection, many artifacts are of large dimensions; these are eliminated by this additional Gaussian blur operator. In preliminary tests, it proved its worth, especially in the eye region, where dark circles or lines represent additional elements, and in the mouth area, where teeth or the junction between the lips may hinder the detection process.

Canny: A Canny algorithm is invoked to track facial features with more precision. This operator was chosen for several reasons. Firstly, it made sense to use an algorithm which was fast and implemented by OpenCV; this constraint reduced our choices to Canny, Sobel and Laplace. Secondly, there was a need for a noise-robust algorithm, and the Sobel and Laplace algorithms presented great sensitivity to noise. Lastly, the position of the detected edges should be as precise as possible and, on this point too, Canny presented better results than Sobel or Laplace.

Coordinates extraction: By analysing the previous step's results, this stage obtains the values corresponding to the detected facial feature. For the eyes, extreme points are calculated based on the topmost, bottommost, leftmost and rightmost tracked points, and normalised so that the final result is symmetric, which is done through the average of the corresponding coordinates. The process is very similar for the mouth: extreme points define the maximum and minimum values on both axes, and the average between them makes it possible to generate a geometric form. For the eyebrows, their average point is calculated, since their height is the most important factor; the rest of this feature is then approximated through the use of the model adopted throughout this study.

Figure 3 summarises these steps.

Figure 3. Stages of feature detection
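For a single region of interest, the chain of grayscale conversion, additional Gaussian blur, Canny edge detection and extraction of extreme points could be sketched as below; the blur kernel and the Canny thresholds are illustrative choices, not the values tuned for this study.

import cv2
import numpy as np

def feature_extremes(roi_bgr):
    """Return the leftmost, rightmost, topmost and bottommost edge points of a
    feature's region of interest, following the chain described above."""
    grey = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)   # single channel for Canny
    blurred = cv2.GaussianBlur(grey, (5, 5), 0)        # removes larger artifacts
    edges = cv2.Canny(blurred, 50, 150)                # noise-robust edge map
    ys, xs = np.nonzero(edges)
    if len(xs) == 0:
        return None                                    # let the elastic model fill in
    return ((xs.min(), ys[xs.argmin()]),               # leftmost
            (xs.max(), ys[xs.argmax()]),               # rightmost
            (xs[ys.argmin()], ys.min()),               # topmost
            (xs[ys.argmax()], ys.max()))               # bottommost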
Feature Transformation
The features used for classification were selected in order to maximise the information available to this process. These features are based on the concept of action units, defined in the FACS system. Although AUs provide a wide set of information, if looked at separately little can be inferred about the associated expression; therefore, a combination of several AUs, as defined by Ekman and Friesen [3], is used. In order to improve the expressiveness of the features collected in the former stages, these were transformed into another set of features: (1) vertical distance between the eyes and the eyebrows; (2) eye aperture; (3) mouth aperture; (4) mouth width; (5) average vertical distance from the mouth corners to the eyes; and (6) average vertical distance from the mouth corners to the mouth centre.

The vertical distance between eyes and eyebrows, as pictured in Figure 4, determines eyebrow elevation and depression, thus providing valuable information for the detection of facial expressions such as surprised, angry or sad.

Figure 4. Vertical distance between eyes and eyebrows

Eye aperture, as illustrated in Figure 5, helps distinguish facial expressions such as surprised, where the eyes are wide open, from angry and happy, in which the eyes are more closed.

Figure 5. Eye aperture

Mouth aperture, measured as in Figure 6, helps in determining facial expressions such as surprised or angry: in the first, the mouth is usually open and, in the latter, it is usually completely closed.

Figure 6. Mouth aperture

Mouth width, as seen in Figure 7, provides additional information for several facial expressions, like angry, happy or sad. While in the case of angry the mouth is usually compressed, for happy or sad it is usually more extended.

Figure 7. Mouth width

The average vertical distance from the mouth corners to the eyes is measured as shown in Figure 8. Its main goal is to distinguish between sad and happy facial expressions: in a sad face, this value is considerably higher than in a happy face.

Figure 8. Average vertical distance from mouth corners to eyes

As for the average vertical distance from the mouth corners to the mouth centre, which is determined as seen in Figure 9, this feature is used in conjunction with the previous one to differentiate between sad and happy expressions.

Figure 9. Average vertical distance from mouth corners to its centre
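Given the extracted landmark coordinates, the six features reduce to simple vertical and horizontal distances. A minimal sketch, assuming a dictionary of (x, y) points whose key names are hypothetical:

def transform_features(p):
    """Compute the six classification features from landmark points p
    (a dict of (x, y) tuples; image y grows downwards)."""
    eye_y = (p["left_eye_top"][1] + p["right_eye_top"][1]) / 2.0
    brow_y = (p["left_brow"][1] + p["right_brow"][1]) / 2.0
    corner_y = (p["mouth_left"][1] + p["mouth_right"][1]) / 2.0
    return [
        eye_y - brow_y,                                   # (1) eye-eyebrow distance
        p["left_eye_bottom"][1] - p["left_eye_top"][1],   # (2) eye aperture
        p["mouth_bottom"][1] - p["mouth_top"][1],         # (3) mouth aperture
        p["mouth_right"][0] - p["mouth_left"][0],         # (4) mouth width
        corner_y - eye_y,                                 # (5) mouth corners to eyes
        corner_y - p["mouth_center"][1],                  # (6) mouth corners to mouth centre
    ]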
Classification
Bayesian classifiers were adopted, mainly because they make it possible to obtain good results despite their quite low computational weight. Two classifiers have been considered. Firstly, a more conventional approach to the Bayesian classifier, using discrete decision intervals, was chosen. These intervals are associated with the values of the samples' features, the whole classification process being based on a discrete set of intervals. However, since its performance was not sufficiently satisfactory, we afterwards adopted a Gaussian Bayesian classifier, in which Gaussians take the place of intervals (in this case, a single Gaussian per feature) to represent the training values of each feature.

Bayesian Classifiers using Discrete Decision Intervals
The classification process begins by testing many facial images in order to obtain feature reference values and understand their distribution along the domain. Using these values, we estimated that 5 intervals would be sufficient for discretising the domain, taking the reference values into account, in order to then proceed to Bayesian classification. These intervals were created so that each one would contain about 20% of the reference samples. When trying to adjust the number of intervals, it was verified that a decrease lowered the classifier's quality, while an increase did not bring consistent improvements.

The training process was automated through the development of an application for this purpose. During training, 10 positive samples and 10 negative ones are selected. Each one of these samples is introduced into the classifier's training module together with additional data (the name of the class being trained and whether the samples are positive or negative). Using this data, the training module fills in, for each class of emotion, a set of positive and negative structures belonging to each interval of each feature. The distribution of samples along the feature intervals gives the probabilities presented to the Bayes classifier.

Subsequently, classification consists of calculating the likelihood of a sample's features corresponding to a given emotion. This value is calculated based on the formula

L_c = \prod_{f=1}^{N} \frac{HP_f}{TP_f},    (1)

in which L_c is class c's likelihood, HP_f is the number of positive hits for feature f and TP_f is the total number of training samples for feature f.

This classification method has a disadvantage, which is related to the continuous nature of the features. In the process of assigning sample values to intervals it sometimes happens that, while interval k has a large number of samples and interval k+2 also has a large number of samples, interval k+1 has zero samples. This greatly influences the classification process. As such, we decided to model an infinite set of training samples through the use of a Gaussian, i.e., an approximation to the normal distribution.

Gaussian Bayesian Classifiers
A Gaussian Bayesian classifier is very similar to the previous one. However, it does not use the values of the training samples directly for classification. Instead, it uses an estimation of the values of infinitely many samples, assuming that these follow a normal distribution. As such, the likelihood of a certain value belonging to a given class is calculated through the cumulative distribution function (c.d.f.) of the Gaussian distribution generated from the training samples. Gaussian distributions are estimated through the mean and standard deviation of the training samples. Thus, the c.d.f. is given by

cdf(x) = \frac{1}{2}\left[1 + erf\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right],    (2)

where \mu is the sample values' mean and \sigma is their standard deviation. erf(z) is the error function associated with the integration of the normalised form of the Gaussian function, given by

erf(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt.    (3)

The training of the classifier is, once more, done through both positive and negative samples, so that it is possible to determine the likelihood of a sample belonging or not to a given class. In this stage, two main values are stored for each feature: the sum of all samples' values, which is used to calculate the samples' mean, and the sum of the squares of all samples' values, which is used to calculate the samples' standard deviation. The number of samples is not known until the training process has been concluded; as such, both mean and standard deviation are calculated at the classification stage. Consequently, the standard deviation is computed using the equation

\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2}
       = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2}
       = \frac{1}{N}\sqrt{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2},    (4)

where N is the number of samples, x_i is sample i's value and \bar{x} is the samples' mean.

The classification of a sample begins by calculating the likelihood of each feature belonging to a given class. This computation is done through the value of the c.d.f. of a normal distribution, generated given the mean and standard deviation of the training samples. The likelihood of the sample not belonging to the class is computed as well. The classification process ends with the attribution of the sample to the class with the higher likelihood, and is summarised as

L_c = \prod_{f=1}^{N} \frac{0.5 - cdf\left(Z(x_f, \mu_f^p, \sigma_f^p)\right)}{0.5 - cdf\left(Z(x_f, \mu_f^n, \sigma_f^n)\right)},    (5)

where L_c is the likelihood of class c, x_f the value of feature f for sample x, \mu_f^p and \sigma_f^p the mean and standard deviation of feature f's values for positive samples, \mu_f^n and \sigma_f^n the mean and standard deviation of feature f's values for negative samples, and Z(x, \mu, \sigma) the fit to the standard normal, given by

Z = \frac{X - \mu}{\sigma}.    (6)

The classifier associates the sample with the class c which presents the highest L_c value for that sample.
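A compact sketch of the Gaussian variant is given below: training only accumulates, per feature and per polarity, the sums described above, and classification combines the per-feature c.d.f. values of equations (2) to (6). Read literally, the factors of equation (5) can become negative, so the sketch uses the two-sided tail mass 0.5 - |cdf - 0.5|, which keeps every factor positive; this is an interpretation of the method, not the library's actual implementation.

import math

class GaussianBayes:
    """One binary (positive vs. negative) Gaussian Bayesian classifier;
    one instance per emotion class."""

    def __init__(self, n_features):
        # Per feature: [sum of values, sum of squares, sample count].
        self.sums = {"pos": [[0.0, 0.0, 0] for _ in range(n_features)],
                     "neg": [[0.0, 0.0, 0] for _ in range(n_features)]}

    def train(self, sample, positive):
        for f, x in enumerate(sample):
            acc = self.sums["pos" if positive else "neg"][f]
            acc[0] += x        # used for the mean
            acc[1] += x * x    # used for the standard deviation, equation (4)
            acc[2] += 1

    @staticmethod
    def _tail(x, acc):
        total, sq_total, n = acc
        mean = total / n
        sigma = math.sqrt(max(sq_total / n - mean * mean, 1e-9))   # equation (4)
        z = (x - mean) / sigma                                     # equation (6)
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))           # equation (2)
        return 0.5 - abs(cdf - 0.5)                                # two-sided tail mass

    def likelihood(self, sample):
        """Product over features of tail_pos / tail_neg, cf. equation (5)."""
        l = 1.0
        for f, x in enumerate(sample):
            l *= self._tail(x, self.sums["pos"][f]) / \
                 max(self._tail(x, self.sums["neg"][f]), 1e-9)
        return l

In this reading, each of the five emotion classes gets its own classifier, trained with 10 positive and 10 negative samples, and a new sample is assigned to the class whose classifier reports the highest likelihood.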
PROTOTYPES
The implemented functionalities have been encapsulated in the reusable and versatile Facial Emoticons DLL, in order to provide a basis for application development. This library exports an interface which allows the usage of the classifier, as well as the manipulation of its knowledge base, by any external application. Three prototypes have been developed in order to show the functionality of this library. The Python language was adopted, as well as the wxPython, pyHook and pywin32 modules, for the graphical interface, the interception of Windows global events and the interaction with other windows and applications, respectively.

Facial Expression Classification
This prototype shows the result of the user's face processing, considering all possible expressions. It consists of two main elements: real-time video capture and a button which triggers facial expression classification. Whenever this event is activated, an image is gathered and processed, and the corresponding emoticon is presented as a result. Additionally, a series of charts is shown, presenting the probability of the face belonging to each class of emotion. This prototype also allows feedback from the user on the classifier's success or failure. Classification results, as well as users' feedback and the gathered facial images, are stored for further analysis. The main motivation for the creation of this application was to run user tests and evaluate the behaviour of our classifier. Its implementation, in Python, was based on the Facial Emoticons library.

Figure 10. Facial Expression Classification Prototype

Insertion of Emoticons in the Active Window
The main objective of this application is to illustrate the usage of the Facial Emoticons library as a means for inserting emoticons in any active window. The prototype runs in the background and visually consists of a video capture window. Whenever the user wishes to insert an emoticon, he or she only needs to press the F12 key. After gathering image information and processing it, the application inserts a combination of keyboard symbols, corresponding to the user's facial expression, in the active window. It may be used with instant messaging applications or any other software. This prototype, extremely minimalist, shows the simplicity of inserting an emoticon through a new modality, without the user having to look through a list or know keyboard shortcuts by heart.

Figure 11. Facial Expression Classification Prototype

E-motional Jukebox
In the context of a Multimodal Interfaces course, a project was developed which consists of the adoption of non-conventional modalities for interaction with an audio player. A multimodal interface was created which combines gestures to control basic audio functions (such as play, pause, etc.) with facial expression recognition for track classification, so that the application behaves intelligently when selecting tracks to play. This application uses two cameras to capture the user's hand and face simultaneously. Hand gesture recognition is done by invoking the HandVU library functions. Facial expression recognition is done through the Facial Emoticons library functions. Since only happy and sad expressions are considered in the context of this application, the knowledge base only contemplates these two emotions. However, since the library is exactly the same, as it is independent of the number of considered expressions, it simply recognises whichever expressions are present in the knowledge base, which turns out to be extremely versatile. Facial expressions are captured at intervals of 5 seconds and cumulatively classified, so that the track's classification reflects the user's global appreciation throughout its whole duration.

Figure 12. Facial Expression Classification Prototype
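To give a feel for how a client application can use the library, the sketch below grabs one webcam frame, classifies it and maps the result to a textual emoticon, in the spirit of the second prototype. The classify_expression callable is a stand-in for the Facial Emoticons interface, whose exact signature is not reproduced here, and the emoticon strings are illustrative.

import cv2

# Illustrative textual emoticons for the five supported classes.
EMOTICONS = {"happy": ":)", "sad": ":(", "angry": ">:(",
             "surprised": ":O", "neutral": ":|"}

def emoticon_for_current_frame(classify_expression):
    """Capture one frame from the default webcam, classify it through the
    injected classifier function and return the matching emoticon string."""
    capture = cv2.VideoCapture(0)
    ok, frame = capture.read()
    capture.release()
    if not ok:
        return None
    label = classify_expression(frame)   # e.g. "happy"
    return EMOTICONS.get(label)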
RESULTS AND DISCUSSION

Results of tests using static images (testsets)
In order to evaluate the behaviour of the developed system, and with the objective of optimising its performance, a set of tests was conducted. These tests used a facial database from the University of Dallas [13], to which access was granted for this study. 10 samples associated with each facial expression class considered in the scope of this study (neutral, happy, sad, angry and surprised) were randomly selected to train the classifier. All remaining samples were used for testing the quality of the classifier: 203 samples represented the class "happy", 41 "angry", 570 "neutral", 55 "sad" and 67 "surprised".

Two testing scenarios were set up. One of them consisted in the detection of all five emotions. The other, created because of its wide practical application, dealt with the differentiation between the classes "happy" and "sad".

So that the classification results could be optimised, these tests were conducted several times. They were crucial for deciding which features to select for classification, since they provided the information necessary to balance the number of features against performance. In this preliminary testing stage, which intended to show the classifier's performance with static images, the success rate was 80.76% for the 2-class case ("happy" versus "sad"), as seen in Table 1. There is, however, a divergence in classification: sad faces presented a much higher hit-rate, 87.27% against 74.26% for happy faces. The quality of the samples seems to be the major cause since, as verified throughout the development of this study, different test samples correspond to different biases of the classifier towards one class or another.

Table 1. Hit-rate for "happy" vs "sad" using static images
             Happy   Sad     Average
Hit-rate (%) 74.26   87.27   80.76

The results of classification for all classes of emotions, presented in Table 2, show a tendency of the classifier towards the expression "happy". Despite this, for each group of samples the predominant class corresponds to the correct emotion, the hit-rate being about 55%.

Table 2. Hit-rate with five emotion classes using static images
             Angry   Happy   Neutral   Sad     Surprised   Average
Hit-rate (%) 56.67   87.19   61.18     44.44   29.85       55.87

Analysing the results of training and classification using testsets, a hit-rate of about 81% for the two-class case is quite satisfactory, despite leaving some room for improvement. The all-class results are also positive, considering the large increase in the number of classes.

Results of tests using dynamic images
The results presented here try to measure and illustrate the behaviour of the classifier in a real-world situation, using an ordinary webcam. As in the previous section, several tests were run in order to classify expressions in a "happy" vs. "sad" scenario and in an all-expressions scheme. Samples were collected using one of the applications developed in the course of this research: 67 participants used the application to automatically classify their representation of the five emotional classes referred to in the section above. The application logged not only the captured images, but also the result of the classification and the user's feedback on the success or failure of the process. The data were later analysed in order to produce the results shown below.

The quality of the collected images was influenced not only by the usual low signal-to-noise ratio of computer webcams, but also by poor illumination. Also, in some cases the faces appearing in the images are at such an angle with the camera as to hinder facial feature extraction, and in other cases participants were unable to properly represent the intended emotion. These situations are pictured in Figure 13.

Figure 13. Factors which influence classification: (a) poor illumination conditions; (b) face angled with the camera; (c) unrealistic "angry" facial expression

The most significant tests considered: 1. 30 randomly-chosen training samples from live images; 2. 10 randomly-chosen training samples from the Dallas database. For the "happy" vs "sad" scenario, the results obtained are considerably higher in the second case, as can be seen in Table 3.
In this case, the classifier does not show a strong bias towards any class. The results obtained are somewhat below the ones presented in the former section; from the analysis of the tests conducted, these lower hit-rates are mainly due to the poorer quality of live-captured images.

Table 3. Hit-rate for "happy" vs "sad" using live images
                        Happy   Sad     Average
Training scheme 1 (%)   67.57   64.86   66.22
Training scheme 2 (%)   75.00   70.91   72.95

Regarding the results obtained from the tests with the five expression classes, shown in Table 4, it seems that, with the first training scheme, the classifier was over-specialised. Tests were also performed with only 10 random samples used to train the classifier; although results improved, they were still sub-par. Using training samples taken from the University of Dallas database, the results improved to a hit-rate of about 38%. This improvement, as well as the difference in results between still-image and live-image tests, strongly suggests that results are influenced by image quality.

Table 4. Hit-rate with five emotion classes using live images
                        Angry   Happy   Neutral   Sad     Surprised   Average
Training scheme 1 (%)   2.70    0.00    72.97     5.40    10.81       18.38
Training scheme 2 (%)   36.62   50.00   37.04     29.09   33.33       37.82

Some tests were also performed with a different classification process, in which several video frames were used to classify an expression. The final classification was obtained by classifying 10 video frames and picking the most common classification. In this case the hit-rate for the "happy" vs. "sad" scenario was 85% and the hit-rate for the scenario with all classes of emotions was 59%.
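The multi-frame scheme amounts to a majority vote over consecutive classifications. A minimal sketch, with the frame count of 10 taken from the tests above:

from collections import Counter

def classify_by_vote(frames, classify_expression, n=10):
    """Classify up to n frames individually and return the most common label,
    mirroring the 10-frame voting scheme used in these tests."""
    labels = [classify_expression(f) for f in frames[:n]]
    return Counter(labels).most_common(1)[0][0]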
CONCLUSIONS AND FUTURE WORK

Summary of the Developed Work
The main motivation for this study was the development of an interface which allows facial recognition and expression classification, with the objective of creating a new interaction channel between the user and the computer system. As such, it was necessary to find a solution for facial recognition and classification with a significant hit-rate which was not compromised by a long response time.

In this context, a library has been created which allows the inclusion of the whole functionality of the facial expression recognition module in any application with minimum effort. This library is responsible for training the classifier, and consequently populating its knowledge base, as well as for classifying samples. It allows querying the knowledge base and gathering information on the classifier's internal processing.

Three prototypes have been developed in order to exemplify the functionalities of the library and also to run user tests. The first performs automatic classification of facial expressions, indicating the likelihood levels for each expression and allowing feedback from the user on whether the classification is correct. The second application runs in the background and inserts emoticons, automatically generated by facial expression recognition, into the active window. Lastly, the third prototype was created in the context of a project for the Multimodal Intelligent Interfaces course at the University and consists of an audio player which allows cumulative track classification through periodic analysis of the user's expression.

Final Conclusions and Discussion
This study explores a possibility of human-computer interaction based on facial expression recognition as a non-conventional interaction modality. The development of an easy-to-use library allows the adoption, by any kind of application, of the functionalities which have been exposed in this paper.

In the case of the distinction between happy and sad facial expressions, a hit-rate of over 80% has been achieved. Among five different expressions (angry, happy, neutral, sad and surprised), it was about 55%. Despite the source code not being optimised and the library having been generated without any compiler-level optimisations, the algorithms which have been implemented are light and allow good temporal performance: the current implementation took slightly less than 1 second to classify a facial image. With optimised code and compilation, it will be possible to achieve real-time performance.

Tests performed with the classification of multiple frames have also shown promising results. The hit-rate for "happy" vs. "sad" was 85% and, for the scenario with the five emotion classes, it was almost 60%. Further studies are needed in order to better understand the correct number of frames to use in this case.

There are certain aspects which may be improved in the future. It would be desirable to achieve a 90% hit-rate for the distinction between two facial expressions, and it will be necessary to adjust some parameters in order to achieve a more robust system. In the process of feature extraction, eye and mouth detection are the least reliable. In the first case, the solution will be to train a new Haar cascade classifier, which was outside the scope of this study. In the second case, it is imperative to improve the mouth-detection algorithm. It would also be interesting to use an approach which responds to variations in illumination or colour; such an approach may bring new robustness to the system. The use of the differentials of the face associated with video capture was also outside the scope of this study: the whole classification process is based on the information of a single image. The use of information related to the changes of facial features throughout a video capture may not only simplify the feature extraction process, but also provide additional information and, thus, improve performance.
REFERENCES
1. C. Zhan, W. Li, P. Ogunbona, and F. Safaei. Facial expression recognition for multiplayer online games. In Proc. of the 3rd Australasian Conf. on Interactive Entertainment, volume 207, pages 452–458, 2006.
2. J. Cao and C. Tong. Facial expression recognition based on LBP-EHMM, 2008.
3. P. Ekman and W. Friesen. Facial Action Coding System (FACS): Manual. 1978.
4. G. J. Edwards, T. F. Cootes, and C. J. Taylor. Face recognition using active appearance models. In Proc. European Conf. Computer Vision, volume 2, pages 581–595, 1998.
5. H. Hong, H. Neven, and C. von der Malsburg. Online facial expression recognition based on personalized galleries. In Proc. International Conf. Automatic Face and Gesture Recognition, pages 354–359, 1998.
6. J. F. Cohn, A. J. Zlochower, J. J. Lien, and T. Kanade. Feature-point tracking by optical flow discriminates subtle differences in facial expression. In Proc. International Conf. Automatic Face and Gesture Recognition, pages 396–401, 1998.
7. A. Khanam, M. Shafiq, and M. Akram. Fuzzy based facial expression recognition, 2008.
8. H. Kobayashi and F. Hara. Facial interaction between animated 3D face robot and human beings. In Proc. International Conf. Systems, Man, Cybernetics, pages 3732–3737, 1997.
9. I. Kotsia, N. Nikolaidis, and I. Pitas. Facial expression recognition in videos using a novel multi-class support vector machines variant, 2007.
10. G. Littlewort, M. Bartlett, C. Fasel, T. Kanda, H. Ishiguro, and J. Movellan. Towards social robots: Automatic evaluation of human-robot interaction by face detection and expression classification.
11. H.-C. Lu, Y.-J. Huang, C. Y.-W., and Y. D.-I. Real-time facial expression recognition based on pixel-pattern-based texture feature, 2007.
12. M. Yoneyama, Y. Iwano, A. Ohtake, and K. Shirai. Facial expressions recognition using discrete Hopfield neural networks. In Proc. International Conf. Image Processing, volume 3, pages 117–120, 1997.
13. M. Minear and D. C. Park. A lifespan database of adult facial stimuli, 2004.
14. M. Pantic and L. Rothkrantz. Expert system for automatic analysis of facial expression. Image and Vision Computing, 18:881–905, 2000.
15. R. Feitosa, M. Vellasco, D. Oliveira, D. Andrade, and S. Maffra. Facial expression classification using RBF and back-propagation neural networks, 2000.
16. E. Saber and A. Tekalp. Frontal-view face detection and facial feature extraction using color, shape and symmetry based cost functions, 1998.
17. I.-O. Stathopoulou and G. A. Tsihrintzis. An improved neural-network-based face detection and facial expression classification system. In SMC (1), pages 666–671, 2004.
18. K. K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51, 1998.
19. V. Vezhnevets. Method for localization of human faces in color-based face detectors and trackers, 2002.
20. P. Viola and M. Jones. Robust real-time object detection. Technical report, Cambridge Research Laboratory, 2001.