Facial Emoticons

Sandra Gama
Instituto Superior Tecnico
Visualisation and Multimodal Interfaces Group
[email protected]

ABSTRACT
Facial expressions are a highly effective non-verbal means for sharing emotions. On the Internet, emoticons are used to compensate for the lack of non-verbal forms of communication. The objective of this study is to generate emoticons through facial recognition and to reproduce this information on user interfaces.

Face tracking is done through the OpenCV library functionalities, which rely on a Haar cascade classifier. The result is a facial image that is processed for feature detection. This stage, although it also uses OpenCV, consists of a series of face-adapted algorithms for edge detection. Bayesian classifiers are used to infer facial expressions.

Although test results show room for improvement, they also point out the viability of an interface which turns facial expressions into emoticons as an improvement for communication. Methodologies which rely on algorithms of low computational cost, such as the ones proposed, yield quite satisfactory results and low latency.

Author Keywords
facial recognition, emoticons, facial expressions, pattern recognition, user interfaces

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: Miscellaneous—Facial Expression Recognition

INTRODUCTION
For human beings, the process of recognising facial expressions is quite straightforward. Facial expressions consist of a series of features which often make them easy to understand, even among people of different social and ethnic backgrounds.

With the advent and massification of the Internet, people often have online textual conversations. This form of communication, lacking non-verbal cues, soon brought the need for artifacts which could emphasise the dialogue. Emoticons, which reinforce the meaning of textual messages by simulating human facial expressions, have been enthusiastically adopted in this context.

To use an emoticon, however, users have to look it up in a list and select it, or type a specific combination of characters to insert it into the textual message. In an ideal situation, communication should be as similar as possible to human-human interaction. Consequently, computer systems should be able to reproduce facial information in a more natural way.

Automatic expression recognition has been an interesting challenge, which involves image processing and pattern recognition. The further development of these fields may bring significant improvements to human-machine interaction, as well as to many other fields such as medicine or psychology. In the area of user interfaces, it is necessary to find a balance between temporal response and the success rate of the classifier.

The main objective of this study is to gather facial information and process it, in order to reproduce, on a user interface, the corresponding facial emoticon. Three prototypes have been created to evaluate success rates and to show the functionality and applicability of the algorithms which were developed. Automated tests have also been run. As a metric for success, we defined that a classifier which points out a clear distinction between the expression corresponding to the emotion shown by the user and the remaining expressions would be satisfactory.

This paper is organised as follows: firstly, we analyse recent studies in the area, taking into account the different stages of facial expression recognition; then we present the developed work and describe, in detail, the processes of facial detection, feature extraction and expression classification; after that, we describe the prototypes which have been created; we then present and discuss the results of automated tests and, lastly, we state the conclusions and point out a direction for future investigation.
RELATED WORK
Classifying facial expressions implies following a coherent methodology. The studies which have been done in this field sub-divide facial expression recognition into three main steps: facial detection on an arbitrary image, facial feature detection and expression classification. Since this study's main focus is classification, the remaining steps are introduced here only briefly.
Facial Detection
In what concerns facial detection in arbitrary images, there
are two main approaches. The first is feature-based and focuses on the colour of the human face; Saber et al. [16] and Vezhnevets et al. [19] both use these principles. According to the second approach, which is appearance-based, a face is defined as a pattern of pixel intensities, as suggested by Sung et al. [18]. Facial patterns are distinguished from non-facial patterns by sub-dividing an image into smaller windows. Viola and Jones' [20] method has
been a strong reference basis for facial detection systems,
since it subdivides the main image into windows. These are
processed by cascade classifiers, which are able to find the
candidate window containing a facial image. This method,
being of light computational weight, has served as a basis
for studies such as the ones of Bartlett et al. [10] and Zhan
et al. [1], which present real-time performances. The first
one uses Haar functions, computationally light, and the second adopts the boosting method. More recently, both Cao
and Tong [2], and Lu et al. [11] have adopted Viola and
Jones’ method. Also, Intel’s OpenCV libray, which emphasises real-time computational vision and, among other aspects, provides tools for facial detection, is based on this
method.
Facial Feature Detection
There are two main approaches to facial feature detection: holistic, which is associated with model-based methods, and analytic, which is commonly used with feature-based methods; a combination of both is also possible. For holistic detection, not only did Cohn et al. [6] and Zhan et al. [1] suggest the use of models some years ago, but models are also present in recent studies, such as Cao and Tong's [2], who adopt an operator defined as a measure of invariant grayscale texture for detection. Along with being invariant to grayscale variation, it presents a good temporal performance. As for analytic approaches, Zhan et al. [1] apply Gabor filters to a set of reference points (normalised from tests run on facial images) for facial feature extraction. Lu et al. [11] present a new feature representation for an analytic approach, which sees a facial image as a combination of micro-patterns, each pixel of the image being associated with a class of model patterns. It is a fast and illumination-robust method.
Classification
There are three main approaches to classification: based on models, on neural networks and on rules. Model-based studies include the work of Edwards et al. [4], which presents an AAM framework for illumination- and position-independent systems, and Hong et al. [5], which combines Gabor filters with Elastic Graph Matching and uses a gallery to improve the classification of a given expression, with a success rate of about 89%. More recently, Lu et al. [11] and Kotsia et al. [9] adopt SVM (Support Vector Machines) for classification. The latter developed an evolution of SVM which includes useful statistical information. Also, Cao and Tong [2] developed a recent study on HMM (Hidden Markov Models), which they name EHMM (Embedded Hidden Markov Model) because it expands each HMM state into a new HMM, obtaining a super-state for the exterior model and an embedded state which corresponds to the interior model.

Methods based on neural networks include the studies of Kobayashi et al. [8], which applies a Back-Propagation Neural Network and presents a real-time response, Yoneyama et al. [12], which adopts discrete Hopfield networks, Feitosa et al. [15], which studies the use of two different neural networks for classification, and Stathopoulou et al. [17], which applies the network to partial areas of the face, obtaining faster classifications.

Rule-based methods include the work of Pantic et al. [14], which uses the principal points of the face, extracts their features and calculates the difference between model features (which correspond to a neutral state) and the current ones, and Khanam et al. [7], which presents a Mamdani-type fuzzy system using two knowledge bases, one for data and another for (fuzzy) rules.
DEVELOPED WORK
A top-down approach to facial expression recognition was followed. As such, the initial problem was divided into three sub-problems: Facial Detection, which implies tracking the face and extracting its coordinates; Feature Extraction, which determines, using the previous step's output, the coordinates of the main facial features; and Classification which, using the facial features inferred in the previous step, determines the corresponding facial expression.
As such, the internal processing is sub-divided into five main modules, illustrated in Figure 1:

1. Facial Detection Module: at this stage, an arbitrary image is used as input and processed in order to track the face;

2. Normalisation Module: facial images, which may have different dimensions according to the original image resolution and the size of the face in the image, are normalised to enhance further processing;

3. Feature Extraction Module: the normalised facial image goes through a series of algorithms. The outputs are the coordinates of facial features;

4. Feature Transformation Module: features which have been extracted in the previous stage are now processed into a series of more expressive features for classification;

5. Classification Module: using as input the output of the previous step, the classifier produces the expression corresponding to the original image.

Figure 1. Stages of Facial Expression Classification

All these stages will now be described in detail.
Facial Detection
In the context of the present work, the user’s environment
consists of an arbitrary room and his or her only instrument
for interaction with the system is a computer with a webcam. As such, the system must be able to process contextindependent images; not only facial images, but also images
where other elements may be present.
It was thus necessary to detect a face on an arbitrary image.
Viola and Jones’ method [20], since it uses Haar classifier
cascades (which are computationally light), along with providing a means for real-time facial detection, seemed to be a
proper choice.
Thus, not only the problem of facial detection was solved,
but also it is done in real time, which is an important factor.
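To make this step concrete, the sketch below shows how such a Haar-cascade detection call can look with the OpenCV Python bindings; the cascade file, the parameter values and the largest-window heuristic are illustrative assumptions rather than the exact configuration used in this work.

```python
import cv2

# Load OpenCV's stock frontal-face Haar cascade (an assumption; the original
# work relied on the equivalent cascade shipped with the OpenCV library).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(frame):
    """Return the (x, y, w, h) of the largest detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest candidate window, assuming it is the user's face.
    return max(faces, key=lambda r: r[2] * r[3])
```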
Normalisation
So that the system may interact with any user who owns a webcam, and since devices present different resolutions and users may be at any distance from the camera, it is important to consider that facial areas may present a variety of dimensions. For facial feature extraction to be done with enough precision, and since the algorithms are sensitive to image dimensions, it was necessary to normalise them. This stage follows face detection; as such, only the facial region is normalised, in order to make the best use of resources. Images are resized to standard dimensions before being processed for feature extraction.

A facial feature model has been adopted, consisting of the facial features of an average face, to infer unknown facial values. Model dimensions are taken into account when normalising a facial image, in order to grant high coherence and flexibility. Consequently, if the model is altered, the image's dimensions will accompany this transformation.
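A minimal sketch of this normalisation step is shown below; the 200x200 target size is an assumed placeholder, since the actual dimensions are defined by the face model rather than stated here.

```python
import cv2

# Assumed normalised face size; purely illustrative, the real values
# follow the dimensions of the adopted face model.
MODEL_WIDTH, MODEL_HEIGHT = 200, 200

def normalise_face(frame, face_rect):
    """Crop the detected face and resize it to the model's standard dimensions."""
    x, y, w, h = face_rect
    face = frame[y:y + h, x:x + w]
    return cv2.resize(face, (MODEL_WIDTH, MODEL_HEIGHT),
                      interpolation=cv2.INTER_LINEAR)
```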
Feature Extraction
Considered Features
FACS, created by Ekman and Friesen in 1978 [3], is a technical standard that classifies facial behaviours according to the muscles which are responsible for each facial action. Many other studies have used FACS as a basis for their work. This method defines a number of AUs (action units, originally 46), which correspond to the actions of one or more facial muscles. The combination of different AUs defines a wide set of facial expressions. According to this standard, there is a vast spectrum of parameters that must be taken into account. These are predominantly related to the eyes, eyebrows and mouth; other features, such as the chin and cheeks, are responsible for only 4 of the AUs.
Model of Facial Features
In order to maximise the performance of feature detection through a series of standard values for extraction, a model of the human face has been developed. This model makes it possible not only to restrict the region of interest but also to estimate the position of facial features when they are not detected (because of extremely poor lighting conditions, exaggerated face rotation or the impossibility of detecting certain parts of the face).
The model is dynamically adjusted to the detected features
and, if it is impossible to obtain a certain feature, the model
will compensate for it. This grants the system an increased
robustness.
However, since the cascade classifier didn’t have quite a satisfactory performance, it was used only as a basis for detection. In this case, special attention has been paid to defining the mouth’s region of interest based on the elastic model
which has been adopted.
Researchers from the University of Regensburg, in Germany, developed two average faces (one for each gender) for their studies in psychology. Here, we created a hybrid model by merging these two faces, which resulted in a realistic starting point, well adapted to this study's needs. The model consists of a vector in which every facial feature is represented, according to the standard dimensions defined by the face model. This element has an elastic behaviour, being progressively adapted to the facial features that are located. Additionally, the use of this model as a means for detection is based on the substitution of coordinates when these are not correctly detected, preventing possible errors. In these cases, and since it is an elastic model, the coordinates corresponding to undetected features take the values given by the model. The model is illustrated in figure 2.
Figure 2. Face model

Facial Feature Detection
The methodology used for facial feature detection comprises five stages:

Detection of a region of interest: Here we take advantage of the OpenCV library functions; this step helps reduce the computational weight of the following steps. In the case of the eyes, four fundamental points are considered: the leftmost, rightmost, topmost and bottommost. These make it possible to find the width and height of the eye, which are fundamental for expression classification. In order to improve performance, only the upper half of the face is considered. When analysing the eyebrows, we did not use any cascade classifier. Instead, a tighter region of interest has been defined, which processes only the area located right above the uppermost point of the left eye, taking advantage of the symmetry of the human face to maximise the efficiency of detection. This decision has been considered carefully. On one hand, it implies a slight decrease in the system's robustness for people who have suffered an injury or disease which prevents them from moving the frontalis muscle (the one responsible for eyebrow raising). On the other hand, given the objective of developing a light, simple and fast method for feature detection, this seemed to be a proper decision. Mouth detection followed the same steps as eye detection. However, since the cascade classifier did not perform satisfactorily, it was used only as a basis for detection; in this case, special attention has been paid to defining the mouth's region of interest based on the elastic model which has been adopted.

Grayscale conversion: This is a pre-processing step for the application of the Canny operator, which manipulates single-channel images. It also simplifies the Gaussian blur processing, since applying its convolution matrix to one channel is more efficient than applying it to the three colour channels (RGB).

Gaussian blur: This algorithm reduces noise and other artifacts which are unnecessary for Canny edge detection. Although Canny, by default, applies a Gaussian blur filter before detection, many of these artifacts are large; they are eliminated by this additional Gaussian blur operator. In preliminary tests it proved its worth, especially in the eye region, where dark circles or lines represent additional elements, and in the mouth area, where teeth or the junction between the lips may hinder the detection process.

Canny: The Canny algorithm is invoked to track facial features with more precision. This operator was chosen for several reasons. Firstly, it made sense to use an algorithm which was fast and implemented by OpenCV; this constraint reduced our choices to Canny, Sobel and Laplace. Secondly, there was a need for a noise-robust algorithm, and the Sobel and Laplace algorithms presented a great sensitivity to noise. Lastly, the position of the detected edges should be as precise as possible; on this point as well, Canny presented better results than Sobel or Laplace.

Coordinate extraction: By analysing the previous step's results, this stage obtains the values corresponding to the detected facial feature. For the eyes, extreme points are calculated based on the topmost, bottommost, leftmost and rightmost tracked points, and normalised so that the final result is symmetric, which is done through the average of the corresponding coordinates. The process is very similar for the mouth: extreme points define the maximum and minimum values on both axes, and the average between them makes it possible to generate a geometric form. For the eyebrows, their average point is calculated, since their height is the most important factor; the rest of this feature is then approximated through the model we have adopted throughout this study.

Figure 3 summarises these steps.

Figure 3. Stages of feature detection
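As an illustration of the edge-based stages above, the following sketch runs grayscale conversion, an additional Gaussian blur and Canny edge detection inside a region of interest and returns the extreme edge points; the kernel size and Canny thresholds are illustrative guesses, not the values tuned in this study.

```python
import cv2
import numpy as np

def feature_extremes(face, roi):
    """Locate a feature inside a region of interest of the normalised face.

    Follows the pipeline described above: grayscale conversion, an extra
    Gaussian blur, Canny edge detection, then extraction of the extreme
    edge points. Parameters are assumptions for illustration only.
    """
    x, y, w, h = roi
    patch = cv2.cvtColor(face[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    patch = cv2.GaussianBlur(patch, (5, 5), 0)      # suppress larger artifacts
    edges = cv2.Canny(patch, 50, 150)               # edge map of the feature
    ys, xs = np.nonzero(edges)
    if len(xs) == 0:
        return None                                 # fall back to the face model
    # Extreme points give the feature's width and height in face coordinates.
    return {"left": x + xs.min(), "right": x + xs.max(),
            "top": y + ys.min(), "bottom": y + ys.max()}
```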
Feature Transformation
The features used for classification were selected in order to maximise the information available to this process. These features are based on the concept of action units, as defined in the FACS system. Although AUs provide a wide set of information, little can be inferred about the associated expression if they are looked at separately. Therefore, a combination of several AUs, as defined by Ekman and Friesen [3], is used.

In order to improve the expressiveness of the features collected in the former stages, these were transformed into another set of features: (1) vertical distance between the eyes and the eyebrows; (2) eye aperture; (3) mouth aperture; (4) mouth width; (5) average vertical distance from the mouth corners to the eyes; and (6) average vertical distance from the mouth corners to the mouth centre.

The vertical distance between eyes and eyebrows, as pictured in figure 4, determines eyebrow elevation and depression, thus providing valuable information for the detection of facial expressions such as surprised, angry or sad.

Figure 4. Vertical distance between eyes and eyebrows

Eye aperture, as illustrated in figure 5, helps distinguish facial expressions such as surprised, where the eyes are wide open, from angry and happy, in which the eyes are more closed.

Figure 5. Eye aperture

Mouth aperture, measured in accordance with figure 6, gives information for determining facial expressions such as surprised or angry: in the former, the mouth is usually open and, in the latter, it is usually completely closed.

Figure 6. Mouth aperture

Mouth width, as seen in figure 7, gives additional information for several facial expressions, such as angry, happy or sad. While in the case of angry the mouth is usually compressed, for happy or sad it is usually more extended.

Figure 7. Mouth width

The average vertical distance from the mouth corners to the eyes is measured as shown in figure 8. Its main goal is to distinguish between sad and happy facial expressions: in a sad face, this value is considerably higher than in a happy face.

Figure 8. Average vertical distance from mouth corners to eyes

As for the average vertical distance from the mouth corners to the mouth centre, which is determined as seen in figure 9, this feature is used in conjunction with the previous one to differentiate between sad and happy expressions.

Figure 9. Average vertical distance from mouth corners to the mouth centre
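The sketch below illustrates how the six derived features might be computed from the extracted coordinates; the landmark names and the simple pixel differences are assumptions for illustration, not the library's actual data structures.

```python
def transform_features(p):
    """Map raw landmark coordinates to the six classification features.

    p is assumed to be a dict of pixel coordinates in the normalised face
    (y grows downwards), e.g. p["eye_top_y"], p["brow_y"], p["mouth_left_x"].
    The names and the plain differences below are illustrative, not the
    exact arithmetic of the original implementation.
    """
    return {
        "brow_to_eye":       p["eye_top_y"] - p["brow_y"],            # (1)
        "eye_aperture":      p["eye_bottom_y"] - p["eye_top_y"],      # (2)
        "mouth_aperture":    p["mouth_bottom_y"] - p["mouth_top_y"],  # (3)
        "mouth_width":       p["mouth_right_x"] - p["mouth_left_x"],  # (4)
        "corners_to_eyes":   p["mouth_corner_y"] - p["eye_centre_y"], # (5)
        "corners_to_centre": p["mouth_corner_y"] - p["mouth_centre_y"],  # (6)
    }
```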
Classification
Bayesian classifiers were adopted, mainly because they make it possible to obtain good results despite their quite low computational weight. Two classifiers have been considered. Firstly, a more conventional approach to the Bayesian classifier, using discrete decision intervals, was chosen. These intervals are associated with the values of the samples' features, the whole classification process being based on a discrete set of intervals. However, since this classifier's performance was not sufficiently satisfactory, we afterwards adopted a Gaussian Bayesian classifier, in which Gaussians take the place of the intervals (in this case, a single Gaussian per feature) to represent the training values of each feature.
Bayesian Classifiers using Discrete Decision Intervals
The process of classification begins by testing many facial images in order to obtain reference values for the features and to understand their distribution over the domain. Using these values, we estimated that 5 intervals would be sufficient for discretising the domain, taking the reference values into account, in order to then proceed to the Bayesian classification. These intervals were created so that each one would contain about 20% of the reference samples. When trying to adjust the number of intervals, it was verified that a decrease degraded the classifier's quality, while an increase did not bring consistent improvements.

The training process was automated through the development of an application for this purpose. During training, 10 positive samples and 10 negative ones are selected. Each of these samples is introduced into the classifier's training module together with additional data (the name of the class being trained and information on whether the samples are positive or negative). Using this data, and for each class of emotion, the training module fills in a set of positive and negative structures belonging to each interval of each feature. The distribution of samples along the feature intervals yields the probabilities that are presented to the Bayes classifier.
Subsequently, classification consists of calculating the likelihood of a sample's features corresponding to a given emotion. This value is calculated with the formula

L_c = \prod_{f=1}^{N} \frac{HP_f}{TP_f},    (1)

in which L_c is class c's likelihood, HP_f is the number of positive hits for feature f and TP_f is the total number of training samples for feature f.
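For illustration, the sketch below implements one plausible reading of this interval-based scheme, with five equal-width bins per feature and the product of per-feature ratios from equation (1); the binning strategy and class handling are assumptions, since the original intervals were derived from reference samples rather than equal widths.

```python
import numpy as np

N_INTERVALS = 5  # five discrete decision intervals, as in the text

class IntervalBayes:
    """One plausible reading of the interval-based classifier of equation (1)."""

    def __init__(self, feature_ranges):
        # feature_ranges: list of (min, max) per feature, used to build five
        # equal-width bins; the paper instead built intervals holding roughly
        # 20% of the reference samples each.
        self.bins = [np.linspace(lo, hi, N_INTERVALS + 1) for lo, hi in feature_ranges]
        n_features = len(feature_ranges)
        self.pos_hits = np.zeros((n_features, N_INTERVALS))
        self.totals = np.zeros((n_features, N_INTERVALS))

    def _interval(self, f, value):
        return int(np.clip(np.digitize(value, self.bins[f]) - 1, 0, N_INTERVALS - 1))

    def train(self, sample, positive):
        """Add one training sample (positive or negative) to the histograms."""
        for f, value in enumerate(sample):
            k = self._interval(f, value)
            self.totals[f, k] += 1
            if positive:
                self.pos_hits[f, k] += 1

    def likelihood(self, sample):
        """Product over features of positive hits / training samples, as in (1)."""
        lc = 1.0
        for f, value in enumerate(sample):
            k = self._interval(f, value)
            lc *= self.pos_hits[f, k] / max(self.totals[f, k], 1.0)
        return lc
```

One such classifier is trained per emotion class, and the class with the highest likelihood is assigned to the sample.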
This classification method has a disadvantage, related to the continuous nature of the features. When assigning sample values to intervals, it sometimes happens that, while an interval k has a large number of samples and an interval k+2 also has a large number of samples, the interval k+1 between them has zero samples. This greatly influences the classification process. As such, we decided to model an infinite set of training samples through the use of a Gaussian, i.e., an approximation to the normal distribution.
Gaussian Bayesian Classifiers
A Gaussian Bayesian classifier is very similar to the previous one. However, it does not use the values of the training samples directly for classification. Instead, it uses an estimate of the values of infinitely many samples, assuming that these follow a normal distribution. As such, the likelihood of a certain value belonging to a given class is calculated through the cumulative distribution function (c.d.f.) of the Gaussian distribution generated from the training samples. Gaussian distributions are estimated through the mean and standard deviation of the training samples. Thus, the c.d.f. is given by

cdf(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right],    (2)

where \mu is the sample values' mean and \sigma is their standard deviation. erf(z) is the error function associated with the integration of the normalised form of the Gaussian function, given by

\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-t^2}\, dt.    (3)

The training of the classifier is, once more, done with both positive and negative samples, so that it is possible to determine the likelihood of a sample belonging or not to a given class. At this stage, two main values are stored for each feature: the sum of all sample values, which is used to calculate the samples' mean, and the sum of the squares of all sample values, which is used to calculate the samples' standard deviation. The number of samples is not known until the training process has been concluded; as such, both mean and standard deviation are calculated at the classification stage. Consequently, the standard deviation is calculated using the equation

\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2}
       = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2}
       = \frac{1}{N}\sqrt{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2},    (4)

where N is the number of samples, x_i is sample i's value and \bar{x} is the samples' mean.

The classification of a sample begins by calculating the likelihood of each feature belonging to a given class. This computation is done through the value of the c.d.f. of a normal distribution, which is generated from the mean and standard deviation of the training samples. The likelihood of the sample not belonging to the given class is computed as well. The classification process ends with the attribution of the sample to the class with the higher likelihood, summarised as

L_c = \prod_{f=1}^{N} \frac{0.5 - cdf\left(Z(x_f, \mu_f^p, \sigma_f^p)\right)}{0.5 - cdf\left(Z(x_f, \mu_f^n, \sigma_f^n)\right)},    (5)

where L_c is the likelihood of class c, x_f the value of feature f for sample x, \mu_f^p the mean of feature f's values for positive samples, \sigma_f^p the standard deviation of feature f's values for positive samples, \mu_f^n the mean of feature f's values for negative samples, \sigma_f^n the standard deviation of feature f's values for negative samples, and Z(x, \mu, \sigma) the standardisation to the standard normal, given by

Z = \frac{X - \mu}{\sigma}.    (6)

The classifier associates the sample with the class c which presents the highest L_c value for that sample.
PROTOTYPES
The implemented functionalities have been encapsulated in the reusable and versatile Facial Emoticons DLL, in order to provide a basis for application development. This library
exports an interface which allows the usage of the classifier,
as well as the manipulation of its knowledge base, by any
external application.
Three prototypes have been developed in order to show the
functionality of this library. The Python language was adopted, as well as the wxPython, pyHook and pywin32 modules, for the graphical interface, the interception of global Windows events, and the interaction with other windows and applications, respectively.
Facial Expression Classification
This prototype shows the result of the user’s face processing,
considering all possible expressions. It consists of two main
elements: real-time video capture and a button which triggers facial expression classification. Whenever this event is
active, an image is gathered and processed, being the corresponding emoticon presented as a result. Additionally, a series of charts is shown, which present the probability of the
face belonging to each set of emotions. This prototype also
allows feedback from the user, corresponding to the classifier’s success or failure. Classification results, as well as
users’ feedback and gathered facial images, are stored for
further analysis. The main motivation for the creation of
this application was to run user tests and evaluate the behaviour of our classifier. Its implementation, using Python,
was based on Facial Emoticons libary.
Insertion of Emoticons in the Active Window
The main objective of this application is to illustrate the use of the Facial Emoticons library as a means of inserting emoticons in any active window. The prototype runs in the background and visually consists of a video capture window. Whenever the user wishes to insert an emoticon, he or she only needs to press the F12 key. After gathering and processing the image, the application inserts a combination of keyboard symbols, corresponding to the user's facial expression, in the active window. It may be used with instant messaging applications or any other software.

This prototype, extremely minimalist, demonstrates the simplicity of inserting an emoticon through a new modality, without the user having to look through a list or know keyboard shortcuts by heart.
Figure 11. Facial Expression Classification Prototype
E-motional Jukebox
In the context of a Multimodal Interfaces course, a project
was developed which consists of the adoption of non-conventional modalities for interaction with an audio player. A
multimodal interface was created which consists of gestures
to control basic audio functions (such as play, pause, etc.),
and facial expression recognition for track classification, so
that the application had an intelligent behaviour when selecting tracks to play. This application uses two cameras to
capture the user’s hand and face simultaneously. Hand gesture recognition is done through invocation of the HandVU
library functions.
Facial expression recognition is done through the Facial Emoticons library functions. Since only happy and sad expressions are considered in the context of this application, the knowledge base only contemplates these two emotions. The library itself is exactly the same, as it is independent of the number of expressions considered: it simply recognises whichever expressions are present in the knowledge base it is given, which makes it extremely versatile.
Facial expressions are captured at a time interval of 5 seconds and cumulatively classified, so that the track’s classification consists of the user’s global appreciation throughout
its whole duration.
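A small sketch of this cumulative track rating is shown below; classify_frame is a hypothetical stand-in for the library's "happy"/"sad" classifier, and the camera handling is an assumption for illustration.

```python
import time
from collections import Counter
import cv2

def rate_track(track_duration_s, classify_frame, camera_index=0, period_s=5):
    """Cumulatively classify the listener's expression while a track plays.

    classify_frame is assumed to map a single image to "happy" or "sad";
    the final rating is the most common label observed over the track.
    """
    cap = cv2.VideoCapture(camera_index)
    votes = Counter()
    deadline = time.time() + track_duration_s
    while time.time() < deadline:
        ok, frame = cap.read()
        if ok:
            votes[classify_frame(frame)] += 1
        time.sleep(period_s)          # one sample every five seconds
    cap.release()
    return votes.most_common(1)[0][0]
```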
Figure 12. Facial Expression Classification Prototype
RESULTS AND DISCUSSION
Results of tests using static images (testsets)
In order to evaluate the behaviour of the developed system, and with the objective of optimising its performance, several tests have been conducted. These tests used a facial database from the University of Dallas [13], to which access has been granted for this study. 10 samples associated with each facial expression class considered in the scope of this study (neutral, happy, sad, angry and surprised) were randomly selected to train the classifier. All remaining samples were used for testing the quality of the classifier: 203 samples represented the class "happy", 41 "angry", 570 "neutral", 55 "sad" and 67 "surprised".

Two testing scenarios have been set up. One of them consisted in the detection of all five emotions. The other, created due to its wide practical application, dealt with the differentiation between the classes "happy" and "sad".

So that the classification results could be optimised, these tests were conducted several times. They were crucial to the decision of which features to select for classification, since they provided the information necessary to balance the number of features against performance.

In this preliminary testing stage, which intended to show the classifier's performance with static images, the success rate was 80.76% for 2 classes of emotions ("happy" versus "sad"), as seen in table 1. There is, however, a divergence in classification: sad faces presented a much higher hit-rate, 87.27%, against 74.26% for happy faces. The quality of the samples seems to be the major cause since, as verified throughout the development of this study, different test samples correspond to different biases of the classifier towards one class or another.

Table 1. Hit-rate (%) for "happy" vs "sad" using static images

            Hit-rate
Happy          74.26
Sad            87.27
Average        80.76

The results of classification for all classes of emotions, presented in table 2, show a tendency of the classifier towards the expression "happy". Despite this, for each group of samples the predominant class corresponds to the correct emotion, the overall hit-rate being about 55%.

Table 2. Hit-rate (%) with five emotion classes using static images

            Hit-rate
Angry          56.67
Happy          87.19
Neutral        61.18
Sad            44.44
Surprised      29.85
Average        55.87

Analysing these results concerning training and classification with test sets, a hit rate of about 81% for the two-class case is quite satisfactory, despite pointing out some room for improvement. The all-class results are also positive, given the large increase in the number of classes.
Results of tests using dynamic images
The results presented here try to measure and illustrate the behaviour of the classifier in a real-world situation, using an ordinary webcam. As in the previous section, several tests have been run in order to classify expressions in a "happy" vs. "sad" scenario and in an all-expressions scheme.
Samples have been collected using one of the applications
developed in the course of this research. 67 participants used
the application to automatically classify their representation of the five emotional classes referred to in the section above. The application logged not only the captured images, but also the result of the classification and the user's feedback on the success or failure of the process. The data were later analysed in order to provide the results shown below.
The most significant tests considered:
1. 30 randomly-chosen training samples from live images;
2. 10 randomly-chosen training samples from Dallas database.
As for the ”happy” vs ”sad” scenario, the results obtained
are considerably higher in the second case, as can be seen
in table 3. In this case, the classifier does not show a strong
bias towards any class. The results obtained are a bit below
the ones presented in the former section. From the analysis
of the tests conducted, these lower hit-rates are mainly due
to the poorer quality of live-captured images.
Table 3. Hit-rate (%) for "happy" vs "sad" using live images

            Training scheme 1   Training scheme 2
Happy               67.57               75.00
Sad                 64.86               70.91
Average             66.22               72.95

The quality of the collected images has been influenced not only by the usual low signal-to-noise ratio of computer webcams, but also by poor illumination. Also, in some cases the faces appearing in the images are at such an angle with the camera as to hinder facial feature extraction, and in other cases participants were unable to properly represent the intended emotion. These situations are pictured in figure 13.

Figure 13. Factors which influence classification: (a) poor illumination conditions; (b) face angled with the camera; (c) unrealistic "angry" facial expression

Regarding the results obtained from the tests with the five expression classes, shown in table 4, it seems that, with the first training scheme, the classifier was over-specialised. Tests were also performed with only 10 random samples used to train the classifier; although the results improved, they were still sub-par. Using training samples taken from the University of Dallas database, the results improved to a hit-rate of about 38%. This improvement, as well as the difference in results between still-image and live-image tests, strongly suggests that the results are influenced by image quality.

Table 4. Hit-rate (%) with five emotion classes using live images

            Training scheme 1   Training scheme 2
Angry                2.70               36.62
Happy                0.00               50.00
Neutral             72.97               37.04
Sad                  5.40               29.09
Surprised           10.81               33.33
Average             18.38               37.82

Some tests were also performed with a different classification process, in which several video frames were used to classify an expression. The final classification was obtained by classifying 10 video frames and picking the most common classification. In this case the hit-rate for the "happy" vs. "sad" scenario was 85% and the hit-rate for the scenario with all classes of emotions was 59%.
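As a sketch of this multi-frame scheme, the snippet below classifies 10 frames and keeps the most common label; classify_frame is a hypothetical stand-in for the per-image classifier described above.

```python
from collections import Counter

def classify_clip(frames, classify_frame, n_frames=10):
    """Majority vote over the first n_frames classifications.

    classify_frame is assumed to map a single image to an emotion label,
    standing in for the per-image classifier described in the text.
    """
    labels = [classify_frame(f) for f in frames[:n_frames]]
    return Counter(labels).most_common(1)[0][0]
```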
CONCLUSIONS AND FUTURE WORK
Summary of the Developed Work
The main motivation for this study was the development of an interface which allows facial recognition and expression classification, with the objective of creating a new interaction channel between the user and the computer system.
As such, it was necessary to find a solution for facial recognition and expression classification with a significant hit-rate that was not compromised by a long response time.
In this context, a library has been created which allows the inclusion of the whole functionality of the facial expression recognition module in any application with minimum effort. This library is responsible for training the classifier, and consequently populating the knowledge base, as well as for classifying samples. It also allows querying the knowledge base and gathering information on the classifier's internal processing.
Three prototypes have been developed in order to exemplify
the functionalities of the library and also to run user tests.
The first of them performs automatic classification of facial
expressions, indicating the likelihood levels for each expression, and allows feedback from the user on whether the classification is correct or incorrect. The second application runs in the background and inserts emoticons, automatically generated by facial expression recognition, into the active window. Lastly, the third prototype has been created in the context of a project for the Multimodal Intelligent Interfaces course at the university and consists of an audio
player which allows cumulative track classification through
periodic analysis of the user’s expression.
Final Conclusions and Discussion
This study explores a possibility for human-computer interaction, based on facial expression recognition as a non-conventional interaction modality. The development of an easy-to-use library allows the adoption, by any kind of application, of the functionalities which have been exposed in this paper.
In the case of the distinction between happy and sad facial expressions, a hit-rate of over 80% has been achieved; among five different expressions (angry, happy, neutral, sad and surprised), it was about 55%. Despite the source code
not being optimised and the library having been generated
without any optimisations at the compiler level, the algorithms which have been implemented are light and allow a
good temporal performance. The current implementation
took slightly less than 1 second to classify a facial image.
With an optimised code and compilation process, it will be
possible to achieve real-time performances.
Tests performed with the classification of multiple frames have also shown promising results. The hit-rate for "happy" vs. "sad" was 85%, and for the scenario with the five emotion classes it was almost 60%. Further studies are needed in order to better understand the correct number of frames to use in this case.
There are certain aspects which may be improved in the future. It would be desirable to achieve a 90% hit-rate for the distinction between two facial expressions, and it will be necessary to adjust some parameters in order to achieve a more robust system. In the feature extraction process, eye and mouth detection are the least reliable steps. In the first case, the solution will be to train a new Haar cascade classifier, which was outside the scope of this study. In the second case, it is imperative to improve the mouth-detection algorithm; it would be interesting to use an approach which responds to variations in illumination or colour, which may bring new robustness to the system. The use of facial differentials associated with video capture was also outside the scope of this study; indeed, the whole classification process is based on the information of a single image. The use of information related to the changes of facial features throughout a video capture may not only simplify the feature extraction process, but also provide additional information and, thus, improve its performance.
REFERENCES
1. C. Zhan, W. Li, P. Ogunbona and F. Safaei. Facial expression recognition for multiplayer online games. In Proc. of the 3rd Australasian Conf. on Interactive Entertainment, volume 207, pages 452-458, 2006.

2. J. Cao and C. Tong. Facial expression recognition based on LBP-EHMM, 2008.

3. P. Ekman and W. Friesen. Facial Action Coding System (FACS): Manual, 1978.

4. G. J. Edwards, T. F. Cootes and C. J. Taylor. Face recognition using active appearance models. In Proc. European Conf. Computer Vision, volume 2, pages 581-695, 1998.

5. H. Hong, H. Neven and C. von der Malsburg. Online facial expression recognition based on personalized galleries. In Proc. International Conf. Automatic Face and Gesture Recognition, pages 354-359, 1998.

6. J. F. Cohn, A. J. Zlochower, J. Lien and T. Kanade. Feature-point tracking by optical flow discriminates subtle differences in facial expression. In Proc. International Conf. Automatic Face and Gesture Recognition, pages 396-401, 1998.

7. A. Khanam, M. Shafiq and M. Akram. Fuzzy based facial expression recognition, 2008.

8. H. Kobayashi and F. Hara. Facial interaction between animated 3D face robot and human beings. In Proc. International Conf. Systems, Man, and Cybernetics, pages 3732-3737, 1997.

9. I. Kotsia, N. Nikolaidis and I. Pitas. Facial expression recognition in videos using a novel multi-class support vector machines variant, 2007.

10. G. Littlewort, M. Bartlett, C. Fasel, T. Kanda, H. Ishiguro and J. Movellan. Towards social robots: Automatic evaluation of human-robot interaction by face detection and expression classification.

11. H.-C. Lu, Y.-J. Huang, Y.-W. Chen and D.-I. Yang. Real-time facial expression recognition based on pixel-pattern-based texture feature, 2007.

12. M. Yoneyama, Y. Iwano, A. Ohtake and K. Shirai. Facial expressions recognition using discrete Hopfield neural networks. In Proc. International Conf. Information Processing, volume 3, pages 117-120, 1997.

13. M. Minear and D. C. Park. A lifespan database of adult facial stimuli, 2004.

14. M. Pantic and L. Rothkrantz. Expert system for automatic analysis of facial expression. Image and Vision Computing, volume 18, pages 881-905, 2000.

15. R. Feitosa, M. Vellasco, D. Oliveira, D. Andrade and S. Maffra. Facial expression classification using RBF and back-propagation neural networks, 2000.

16. E. Saber and A. Tekalp. Frontal-view face detection and facial feature extraction using color, shape and symmetry based cost functions, 1998.

17. I.-O. Stathopoulou and G. A. Tsihrintzis. An improved neural-network-based face detection and facial expression classification system. In SMC (1), pages 666-671, 2004.

18. K. K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39-51, 1998.

19. V. Vezhnevets. Method for localization of human faces in color-based face detectors and trackers, 2002.

20. P. Viola and M. Jones. Robust real-time object detection. Technical report, University of Cambridge, 2001.