Facial Expression Recognition System
Final Year Project
Author: Andreea Pascu
Supervisor: Prof. Ross King
BSc Artificial Intelligence
University of Manchester
School of Computer Science
April 2015
Abstract
The problem of automatic recognition of facial expressions is still an area of ongoing research, and it relies on advances in Image Processing and Computer Vision techniques. Such systems have a variety of interesting applications, from human-computer interaction to robotics and computer animation. Their aim is to provide robustness and high accuracy, but also to cope with variability in the environment and adapt to real-time scenarios.
This paper proposes an automatic facial expression recognition system, capable of distinguishing the six universal emotions: disgust, anger, fear, happiness, sadness and surprise. It is designed to be person independent and tailored only for static images. The system integrates a face detection mechanism using the Viola-Jones algorithm, uses uniform Local Binary Patterns for feature extraction and performs classification using a multi-class Support Vector Machine model.
Acknowledgements
Firstly, I would like to thank my supervisor, Prof. Ross King, for his constant support,
feedback and guidance throughout the entire development of this project.
I would also like to thank my family for all the support they have provided me during
my years of university.
Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Problem outline
  1.2 Proposed solution
  1.3 Aims and deliverables
  1.4 Report structure

2 Background and literature survey
  2.1 Chapter overview
  2.2 Universal emotions and benchmark systems
  2.3 Facial Expression Recognition Systems
      2.3.1 Characteristics of an ideal system
      2.3.2 Current state-of-the-art approaches

3 Requirements
  3.1 Chapter overview
  3.2 Requirements Elicitation
      3.2.1 Functional and non-functional requirements
      3.2.2 Use case diagrams

4 Design
  4.1 Chapter overview
  4.2 Design methodologies
  4.3 Architecture design
  4.4 Interface design

5 Implementation
  5.1 Chapter overview
  5.2 Implementation tools
  5.3 Dataset
      5.3.1 Cohn-Kanade dataset
      5.3.2 Selected data for the project
  5.4 Description of methodology
      5.4.1 Face detection
            5.4.1.1 Viola-Jones algorithm
            5.4.1.2 Detection of faces using Viola-Jones
      5.4.2 Feature extraction
            5.4.2.1 Pre-processing
            5.4.2.2 Uniform Local Binary Patterns
            5.4.2.3 Constructing the feature vector
      5.4.3 Classification
            5.4.3.1 Support Vector Machines
            5.4.3.2 Using SVM to classify emotions
  5.5 System walkthrough
  5.6 Overview of the prototypes

6 Testing and evaluation of results
  6.1 Chapter overview
  6.2 Evaluation methods
      6.2.1 Cross validation
            6.2.1.1 Method description
            6.2.1.2 Analysis of results
      6.2.2 Confusion matrix
            6.2.2.1 Method description
            6.2.2.2 Analysis of results
  6.3 Comparison of results with Random Forests
      6.3.1 Random Forests
      6.3.2 Comparative testing and results

7 Conclusions
  7.1 Chapter overview
  7.2 Project achievements
  7.3 Challenges
  7.4 Future improvements
  7.5 Concluding remarks and reflections

A Contrast-limited adaptive histogram equalization
B Additional testing
Bibliography
List of Figures

3.1  Use case diagram
4.1  System architecture
4.2  GUI components
5.1  Dataset examples
5.2  System modules
5.3  Haar-like features
5.4  Extracted area for the face
5.5  Computing LBP codes
5.6  Examples of the extended LBP operator
5.7  Face division design
5.8  Feature extraction process
5.9  Separation boundaries for 2D data
5.10 Emotion posterior probabilities
5.11 Panel for uploading or taking a new picture
5.12 GUI face detection
5.13 System's response for emotion classification
5.14 Methods for partitioning the face
6.1  Data partitioning with cross validation
6.2  10-fold cross validation results
6.3  SVM vs RF
A.1  Pre-processing results
B.1  Results of prototype 1
B.2  Results of prototype 2
B.3  Results of final prototype
List of Tables

3.1  System requirements
5.1  Dataset and emotion distribution
6.1  Confusion matrix format
6.2  Confusion matrix result
6.3  McNemar test
6.4  Results of McNemar test
Abbreviations

FER  Facial Expression Recognition
LBP  Local Binary Patterns
SVM  Support Vector Machines
RF   Random Forests
ASM  Active Shape Models
AAM  Active Appearance Models
GUI  Graphical User Interface
Chapter 1: Introduction

1.1 Problem outline
Expression of feelings through facial emotions has been an object of interest since the time of Aristotle. The topic grew only after 1960, when a list of universal emotions was established and several parametrised systems were proposed to assess them. Facilitated by advances in Machine Learning and Computer Vision, the idea of building automated recognition systems has received a lot of attention within the Computer Science community.
One of the pioneers of understanding communication, Prof. Albert Mehrabian, concluded in his study [15] that in face-to-face communication, 55% of emotional content is transmitted through facial expressions. That means that if the computer could capture and understand the emotions of its "interlocutor", communication would be more natural and appropriate, especially if we think of scenarios where a computer would play the role of a tutor.
Developing such a Facial Expression Recognition system (also referred to as a FER system) is not a trivial task, due to the high variability of data. Images are captured under varying conditions of resolution, quality, illumination and size. All these constraints have to be taken into consideration when selecting appropriate methods, in order to deliver a system that is robust, person independent and that ideally works in real-time scenarios.
1.2 Proposed solution
This paper proposes a system capable of performing automatic recognition of six emotions, considered to be universal across cultures: disgust, anger, fear, happiness, sadness and surprise. Such a system analyses the image of a face and produces a calculated prediction of the expression.
The approach integrates a module for automatic face detection, using one of the most popular algorithms for this task, known as Viola-Jones. Given the extracted face, the system derives discriminant features with a method called Local Binary Patterns, chosen for its robustness to illumination changes and its speed of computation. Lastly, the solution performs expression classification by incorporating the widely used Machine Learning model called Support Vector Machines, which is trained on a standard dataset of examples.
1.3 Aims and deliverables
The main goal of this project is researching existing methods for performing automatic facial expression recognition. This is reinforced by developing a system which is capable of classifying an image into one of the six basic emotions.
The project should deliver a face detection mechanism, a scheme for feature extraction, a
trained classifier and a Graphical User Interface that provides user access to the system’s
functionality.
1.4 Report structure
The report is structured into 7 chapters, which start by introducing the current problem and existing solutions, continue by describing the design and implementation of the proposed system, and conclude with an analysis of results and some final remarks.
These are organised as follows:
1. Chapter 1 introduces the problem and the proposed solution.
2. Chapter 2 provides a literature survey and background information on the existing approaches.
3. Chapter 3 presents the requirements elicitation and the system's use case diagram.
4. Chapter 4 outlines the design technique and how it has been applied to the system's architecture and user interface.
5. Chapter 5 describes the implementation tools, the dataset and the development process. It also presents a brief system walkthrough and the prototype stages.
6. Chapter 6 outlines the methods used for testing and presents an analysis of the results.
7. Chapter 7 summarises the project's achievements and challenges, and illustrates future improvements.
Chapter 2: Background and literature survey

2.1 Chapter overview
This chapter starts by presenting a brief overview of the studies of facial expressions. It then continues from a technological perspective, presenting the characteristics of the ideal automatic FER system and the techniques that are used in the literature.
2.2 Universal emotions and benchmark systems
Fundamental studies on this topic can be traced back to the 17th century, but the most significant contribution for today's research was the influential work of Charles Darwin [9]. He sought to understand the evolution of facial expressions and outlined that a certain set of expressions is universal across all humans, regardless of race.
After 1960, Darwin's writings were rediscovered and the focus shifted to the idea of universality. Silvan Tomkins provided an initial list of the so-called "basic" emotions, which inspired many researchers in the psychology community, especially Paul Ekman, a pioneer in this field. His work now represents a significant milestone in the study of human emotions and forms the basis of modern facial expression recognisers.
Paul Ekman and his colleagues conducted numerous cross-cultural studies on non-verbal communication and especially on facial expressions. His findings led to defining six categories of emotions, considered to be universal: happiness, sadness, anger, disgust, fear and surprise.
Towards the end of the 20th century, researchers from the fields of computer science, robotics and computer graphics started to gain interest in constructing systems that could automatically recognise human emotions.
Benchmark parametrised systems
For classifying facial emotions, researchers used to rely on observers, but the reliability of the results was seriously questioned, as each person has their own interpretation. In trying to overcome this problem, in 1977 Ekman et al. proposed a parametrised system, known as FACS (Facial Action Coding System, [18]), which became one of the best known studies on facial activity. This scheme labels movements of face muscles, under the term of Action Units, which can be either independent or dependent on one another. Using this framework, facial expressions are defined as a combination of various such Action Units, providing a more reliable methodology for emotion classification.
A similar system, developed by the MPEG-4 group, is the Facial Animation Parameters, or FAP, which takes into account facial movements. The idea behind it relies on defining feature points on a face that is in a neutral state. With this as a starting point, any deformation from the original state, and its magnitude, is captured as a FAP parameter.
2.3 Facial Expression Recognition Systems
Research on automatic facial expression recognition systems was initiated by Suwa et al. (1978), but the lack of robust face detection and tracking algorithms led to little progress. After 1990, when advances in Computer Vision and Image Processing were made, numerous studies on such systems emerged and a large variety of solutions have been proposed.
All image-based approaches to recognition systems are mainly composed of three steps: detection of the region of interest, extraction of features and classification. In the context of FER systems, we are interested in detecting or tracking the face, deriving relevant characteristics and applying fitting classification algorithms, in order to get a reliable emotion prediction.
Once the face is successfully captured, it has to be effectively processed to extract meaningful properties. However, this process does not have a perfect solution for any given image, because face detection is sensitive to scale, orientation and changes in illumination. Also, occluding elements, such as hair or sunglasses, make the problem even more challenging.
2.3.1 Characteristics of an ideal system
We humans find it trivial to look at an image and immediately identify faces, recognise them or even distinguish between some of the emotions. Researchers have been trying to mimic the way our visual system works and to formulate the characteristics of an ideal automatic FER system. The following properties summarise the list provided by Tian et al. in Chapter 11 of [23]:
• work in real-time scenarios, with any type of images (static or video)
• detect faces regardless of orientation, resolution or occluding elements
• be invariant to changes in illumination conditions
• be person independent, as well as invariant to gender or age
• recognise both posed and spontaneous expressions
2.3.2 Current state-of-the-art approaches
As briefly mentioned before, an automatic FER system is composed of three major
components. According to Pantic [14], depending on the type of images that are being
used, we can talk about extracting the facial information as ”localizing the face and its
features”, in the context of static images, and ”tracking the face and its features” for
videos.
1. Face detection algorithms
For solving the first step, which is face identification, various methods have been
proposed. In the case of static images, the most commonly used technique is
called Viola-Jones, which achieves fast and reliable detection for frontal faces [24].
Among other localization techniques, there is a neural network-based face detection solution by Rowley et al. [11] and a statistical method for 3D object detection, applied to faces and cars, by Schneiderman [12].
Face tracking in image sequences uses other types of approaches, which rely on constructing 3D face models. Some popular examples are the 3D Candide face model [1] and the Piecewise Bezier Volume Deformation tracker (PBVD) [22].
2. Feature extraction algorithms
The next and most important step is feature extraction, which can determine the
performance, efficiency and scalability of the system. The main goal is mapping
the face pixels into a higher level representation, in order to capture the most
relevant properties of the image and reduce the dimension of data.
There are three types of approaches that appear in the literature, which depend on the data and the goal of the system.
(a) Firstly, geometric or feature-based techniques are concerned with identifying specific areas or landmarks of the face. They are more computationally expensive, but they can also be more robust and accurate, especially if there is variation in size or orientation. An example would be Active Shape Models, also known as ASM, which are popular for face and medical imaging applications. They are statistical models that learn the shape of objects and iteratively get adjusted to a new example, in this case a face. However, they can be highly sensitive to image brightness or noise. Improved results are achieved with Active Appearance Models (AAM), a more elaborate version of ASM which also incorporates texture information for building an object model.
(b) The second approach does not treat the face as individual parts, but analyses the face as a whole. These are known as appearance or holistic methods. One of the most popular algorithms in the literature is Gabor wavelets, which can achieve excellent results in recognising facial expressions. An interesting system developed by Bartlett et al. in 2003 [3] uses this method and has been deployed on several platforms. A major downside for real-time applications is the high computational complexity and memory storage, even though it is usually combined with a dimension reduction technique.
An alternative approach, originally used for texture analysis but which recently gained popularity in the current context, is Local Binary Patterns (LBP). This technique has a great advantage in terms of time complexity, while exhibiting high discriminating capabilities and tolerance to illumination changes.
(c) There is also a third approach, perhaps the best in tackling feature extraction, which consists of a combination of the previous methods, usually known in the literature as hybrid techniques. Geometric procedures such as AAM are used for automatic identification of important facial areas, on which holistic methods, such as LBP, are applied.
3. Classification algorithms
Once a higher representation of the face is obtained, a classification process is applied. A set of faces and their corresponding labels are fed into a classifier, which, upon training, learns and predicts the emotion class for a new face. There is a large variety of classifiers used in the literature, and choosing which one to use depends on criteria such as: type and size of data, computational complexity, importance of robustness and overall outcome.
One of the most popular methods is Support Vector Machines, widely used for their results and high generalisation capabilities, but suited for binary classification problems. Alternatively, Artificial Neural Networks are powerful, flexible and capable of learning complex functions, and they are naturally multi-class algorithms, as are Random Forests.
Cohen et al. [7] suggest using dynamic classifiers, such as Hidden Markov Models. This method is proposed for person-dependent systems, as it is more sensitive to temporal pattern changes in the case of videos. Studies by the same authors also recommend using static classifiers, such as Tree Augmented Naïve Bayes, for person-independent scenarios.
Chapter 3: Requirements

3.1 Chapter overview
This chapter presents the first step taken to create the proposed system, which is focused on understanding what is expected from the system, who the stakeholders are and how the user is meant to interact with the delivered services. To support this process, a list of requirements and a use case diagram are provided.
3.2 Requirements Elicitation
It is very important to grasp the scope of the system: what is the core functionality and what represents a 'nice to have' feature.
The elicitation step involves gathering clear and precise requirements, in order to model the system and its characteristics, a process that can be very complex in software development. Because this project is mainly focused on research and less on providing a user-oriented tool, it only uses the main techniques for analysing the system's requirements.
3.2.1 Functional and non-functional requirements
When collecting and analysing the requirements of a software system, there are two aspects that need to be considered. The functional one refers to the features that the system needs to deliver, while the non-functional aspect takes into account constraints and how the system should behave.
Table 3.1 lists the requirements of the proposed solution.

Functional requirements:
• The system should classify an image into one of 6 emotions.
• The system should include an automatic face detection algorithm.
• The system should allow for manual face extraction.
• The system should include techniques for extraction of meaningful facial features.
• The system should deliver a trained classifier.
• The system should deliver a simple GUI.
• The system should allow the user to upload new images.

Non-functional requirements:
• The system should be implemented in Matlab.
• The system should produce graphs showing performance.
• The system's GUI should be simple and clear.

Table 3.1: Functional and non-functional requirements.
3.2.2 Use case diagrams
Use case diagrams depict a high level representation of the system’s behaviour. They
graphically capture the existing actors, the system’s functionality and the interaction
between the two, without explicitly showing the sequence of steps within processes.
The use case diagram that captures the proposed system's behaviour is illustrated in Figure 3.1.
Figure 3.1: Use case diagram illustrating the system’s behaviour.
Chapter 4: Design

4.1 Chapter overview
This chapter presents the FER system from a high-level perspective, depicting the design process and its constituent parts, which have all led to the current implementation of the system.
4.2 Design methodologies
Software design is one of the most important steps of the software life-cycle, as it provides the processes for transitioning the user requirements into the system's implementation. There are a series of methodologies that can be adopted, which highly depend on the type of project, the team and the available resources.
The proposed FER system has been developed by following an approach known as "prototyping". According to Beaudouin-Lafon [4], a prototype is "a tangible artefact, not an abstract description that requires interpretation". The idea behind this is creating a series of incomplete versions of the software, until the expected final solution is achieved.
From the various approaches of this design principle, I have chosen to use evolutionary prototyping. It implies creating prototypes that will emerge as building blocks for the final software; therefore, the system is re-evaluated and enhanced after each version to provide more functionality or more accurate performance.
4.3 Architecture design
The proposed FER system has been developed as a standalone application, with no communication with other services or applications. The most important component incorporates all the functionality, which is itself divided into 3 modules, while the second one is a simple Graphical User Interface that allows user access to the system's features. The elements of the system are illustrated in Figure 4.1.
Figure 4.1: External and internal components of the system.
Dependencies
The external component, consisting of a dataset of images, is used for training a model to recognise facial emotions and for testing its performance. This acts as a dependency for the system because if the data is scarce, of poor quality or varies significantly, the system will be poorly trained, and hence will achieve inaccurate results.
Some of the solution’s functionality is achieved through available libraries. This leads
to another important dependency, created by the use of a face detection library, whose
accuracy has an impact on the overall performance.
4.4 Interface design
The system's GUI has the role of allowing the user to use the available features, rather than providing an extensive and modern software front-end. Being a secondary component, it has been built as a simple, yet user-friendly interface, which integrates all the major functionalities.
Both the design and development of the GUI were performed in Matlab's GUI design environment, known as GUIDE. The very first scheme was primitive in terms of features, consisting only of two buttons which allowed the user to upload a new image and to request the system to perform the classification, which would result in displaying the name of the predicted label.
Figure 4.2 presents the final version GUI, along with brief descriptions of its elements.
Figure 4.2: The Graphical User Interface of the system and its components.
Chapter 5: Implementation

5.1 Chapter overview
This chapter presents the process of developing the proposed Facial Expression Recognition System. It firstly introduces the tools that were used and the database, followed by a detailed description of the implementation of the functionality and the interface. It concludes with a system walkthrough and a brief presentation of the prototype stages.
5.2 Implementation tools
As the goal of the project was highly targeted towards research, the entire system was developed in Matlab, a high-level language and scientific environment. Its capabilities are enhanced through integration with OpenCV, a library of functions mainly aimed at Computer Vision usage.
5.3 Dataset

5.3.1 Cohn-Kanade dataset
The system classifies images of people expressing one of the six basic emotions: disgust, anger, fear, happiness, sadness or surprise. The dataset used for training and testing the system was chosen out of the free and publicly available datasets on the web, namely the Cohn-Kanade AU-Coded Facial Expression Database (more accurately, Version 2, also known as CK+).
It contains images of 210 people exhibiting both posed and spontaneous expressions, along with their corresponding meta-data, which specifies the validated labels. In terms of data diversity, the subjects include both females and males, aged from 18 to 50 years old, who come from different ethnic backgrounds such as Euro-Americans, Afro-Americans and other groups.
There are 327 images of posed emotions, which come from 123 participants, and they have been labelled into 7 validated categories: anger, contempt, disgust, fear, happy, sadness and surprise. Data varies in quality, the majority being 8-bit grayscale images of 640x490 pixels, while others are 24-bit colour images with sizes of 640x480 or 720x480. A detailed description of the dataset can be found in Lucey et al. [13].
5.3.2 Selected data for the project
As the project's scope is limited to the six basic emotions (anger, disgust, fear, happy, sadness and surprise), the dataset was reduced to contain only the images corresponding to the classes mentioned above, resulting in a total of 309 examples, used for training and testing the system. Moreover, to ensure consistency of format, all images have been converted to 8-bit grayscale and resized to 640x490 pixels. A detailed representation of the data distribution is given in Table 5.1 and examples of images from the dataset are depicted in Figure 5.1.
No.  Emotion     Number of examples   Proportion in dataset
1    Surprise    83                   26.86%
2    Happiness   69                   22.33%
3    Disgust     59                   19.09%
4    Anger       45                   14.53%
5    Sadness     28                    9.06%
6    Fear        25                    8.09%

Table 5.1: Database distribution of the six emotions
Figure 5.1: Examples from the Cohn-Kanade dataset. The emotions starting from
the left hand side are: happiness, anger, fear and disgust.
5.4 Description of methodology
The proposed solution for the automatic Facial Expression Recognition system is composed of a series of modules, with well defined properties and actions, that follow sequential processes. Looking at the system from a high-level perspective, its main tasks are identifying the face in a given image, mapping the face pixels into a higher-level representation and ultimately deciding the emotion class. The sequence of steps undertaken by the system is depicted in Figure 5.2.
Figure 5.2: The constituent modules of the system
5.4.1 Face detection
Detecting the region of interest represents an essential part of any recognition system. Ideally, this process has to be performed automatically and with a very low false positive rate. One of the most famous frameworks for object detection currently in use is Viola-Jones.
5.4.1.1 Viola-Jones algorithm
In [24], Viola and Jones proposed a new algorithm for object detection, widely used
for face detection. Their novel approach attained better results compared to previous
methodologies, achieving fast detection and a low false positive rate.
The first stage of the system consists of computing and extracting the so-called Haar-like features, which correspond to the rectangle patches illustrated in Figure 5.3a. These templates are applied on top of a 24x24 image of a face (as depicted in Figure 5.3b), under all scales and locations.
(a) The 5 types of Haar-like templates; the
value of each rectangle feature is computed
by subtracting the sum of the black area,
from the white area (image adapted from
Figure 1 of [24])
(b) Method of applying the rectangle features on the 24x24 pixels image of the face.
(image is taken from Figure 5 of [24])
Figure 5.3: Haar-like features
Because computing the feature values directly would be an expensive operation, a new concept called the 'Integral Image' was introduced, which allowed for constant-time computations. This intermediate representation enables a fast and easy way of obtaining the feature values.
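As an illustration of this idea, the Matlab sketch below (not taken from the project's code; the image file name is hypothetical) builds an integral image with two cumulative sums and evaluates a rectangle sum, and hence a two-rectangle Haar-like feature, with a constant number of look-ups.

% Integral image sketch: ii(y,x) holds the sum of all pixels in img(1:y,1:x).
img = double(imread('face_24x24.png'));          % hypothetical grayscale input
ii  = cumsum(cumsum(img, 1), 2);

% Sum of the rectangle with top-left (y1,x1) and bottom-right (y2,x2),
% using only four table look-ups (terms vanish on the image border).
rectSum = @(y1,x1,y2,x2) ii(y2,x2) ...
    - (x1 > 1) * ii(y2, max(x1-1,1)) ...
    - (y1 > 1) * ii(max(y1-1,1), x2) ...
    + (y1 > 1 && x1 > 1) * ii(max(y1-1,1), max(x1-1,1));

% Two-rectangle Haar-like feature: white area minus black area.
haarValue = rectSum(5,3,20,8) - rectSum(5,9,20,14);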
However, deriving all the possible features would be very expensive. Therefore, a feature
selection process was proposed, which applies a modified version of the AdaBoost technique. This machine learning boosting algorithm was used to create a strong classifier,
out of a series of weak classifiers (models which perform slightly better than a random
guess) and a scheme of associated weights.
Lastly, a cascade of classifiers was used, in which the first classifiers are simple and discard obvious non-faces, while the stronger classifiers are applied only to the sub-windows that might contain faces.
5.4.1.2 Detection of faces using Viola-Jones
Due to its efficiency and universality, I have chosen the Viola-Jones algorithm for this
project, in order to detect and extract the faces. For this, I have used the OpenCV library
(namely CascadeClassifier) integrated with Matlab, which offers an implementation of
the algorithm.
For detecting frontal faces, I have used one of the provided trained Haar classifiers, called 'haarcascade_frontalface_alt.xml', which successfully extracted 306 faces out of the total 309 images of the dataset.
To ensure that the extracted faces are positioned in the same location, I have used an additional classifier from the same OpenCV library, called 'haarcascade_mcs_eyepair_big.xml'. This detects the region of the eyes, which is then used to adjust the left and right margins of the face window, to ensure equal distance between the eyes and the sides of the face. In this way, unnecessary information (such as hair, ears, background) is discarded and the extracted faces will have normalised positions.
The first two pictures of Figure 5.4 show the face and eyes regions returned by the
Viola-Jones detectors, outlining the side areas with non-essential elements, while the
third image displays the area which is ultimately extracted.
(a) Two examples of the regions detected for face and eyes.
(b) The red rectangles depict the original
face and eyes areas; the green rectangle outlines the final extracted region.
Figure 5.4: Face and eye detection areas, using Viola-Jones
The extracted faces are then resized to a standard dimension of 120x100 pixels in 8-bit grayscale and stored in a new face dataset, which is used in the next modules for feature extraction and classification.
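A minimal sketch of this face-extraction step is given below. It assumes the mexopencv-style Matlab binding of OpenCV's CascadeClassifier (the exact binding, helper names and file names other than the two cascades quoted above are assumptions, not the project's code).

% Load the two Haar cascades named in the text.
faceDetector = cv.CascadeClassifier('haarcascade_frontalface_alt.xml');
eyeDetector  = cv.CascadeClassifier('haarcascade_mcs_eyepair_big.xml');

img = imread('subject01.png');                    % hypothetical input image
if size(img,3) == 3, img = rgb2gray(img); end

faces = faceDetector.detect(img);                 % cell array of [x y w h] boxes
if ~isempty(faces)
    f = faces{1};                                 % take the first detected face
    faceImg = img(f(2)+1 : f(2)+f(4), f(1)+1 : f(1)+f(3));

    % Use the eye-pair box to trim left/right margins so the eyes sit at
    % equal distance from the sides of the crop.
    eyes = eyeDetector.detect(faceImg);
    if ~isempty(eyes)
        e = eyes{1};
        margin  = min(e(1), size(faceImg,2) - (e(1) + e(3)));
        faceImg = faceImg(:, e(1)-margin+1 : e(1)+e(3)+margin);
    end

    faceImg = imresize(faceImg, [120 100]);       % normalised 120x100 face
end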
5.4.2 Feature extraction
Feature extraction is one of the most important stages for any classification system. The choice of algorithm depends not only on the computational properties, but also on the type of data. As a result, the algorithm that I have chosen to perform feature extraction is Local Binary Patterns, or LBP, which is widely known not only for its computational efficiency, but also for its robustness against illumination changes. To increase its performance, the images are first taken through a pre-processing step.
5.4.2.1 Pre-processing
Raw image data can be corrupted by noise or other unwanted effects, even if the camera or environment remain unchanged. Therefore, before doing any processing to extract meaningful information, the quality of the images has to be improved through a series of operations, known under the term of pre-processing.
This solution applies a pre-processing technique called Contrast-limited adaptive histogram equalization, using Matlab's built-in function 'adapthisteq', chosen for its property of improving the local contrast in the face images. A description of how this method works is provided in Appendix A.
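For illustration, this step reduces to a single call (the file name is hypothetical, not part of the project):

% Contrast-limited adaptive histogram equalisation of a cropped face.
faceImg   = imread('face_120x100.png');   % hypothetical 8-bit grayscale face
ppFaceImg = adapthisteq(faceImg);         % improves local contrast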
5.4.2.2 Uniform Local Binary Patterns
Local Binary Patterns
Local Binary Patterns is a powerful method, first introduced by Ojala et al. [17]. Widely used for texture analysis, it has more recently been applied to tasks such as face or facial emotion recognition. Its advantages consist in the tolerance against changes in illumination and a rather fast computation of features.
In the generic version, LBP computes a label for every pixel in the image by comparing the pixel with its neighbours from a 3x3 neighbourhood, based on the following function s(x) (taken from Eq. 3 of [25]):

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0.    (5.1)
Therefore, comparing a pixel at position (x_c, y_c) with all its 8 neighbours will result in an 8-bit binary number, whose decimal value is directly computed as follows:

LBP_N(x_c, y_c) = Σ_{i=0}^{N−1} s(n_i − n_c) · 2^i    (5.2)

where N is the total number of neighbour pixels (8 in this case), n_i is the pixel value of the i-th neighbour and n_c is the pixel value of the centre pixel. (Equation taken from Eq. 1 of [25])
Figure 5.5 outlines the methodology described above.
Figure 5.5: Process of computing the LBP code for a pixel: each of the 8 neighbours
is compared to the value of the central pixel ( nc = 4 ) based on the function s(x),
resulting in the 8-bit binary label 11111110; this is then transformed into its decimal
equivalent: 254
This method allows capturing the image features even if there are changes in illumination. After applying LBP on each pixel, a histogram is built using the pixel-associated codes, which acts as a texture description for the image. This histogram is constructed by simply counting the occurrences of each possible pattern and storing its frequency in an associated bin.
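The Matlab sketch below (a straightforward illustration, not the project's code) applies Eq. 5.1 and 5.2 on the basic 3x3 neighbourhood and builds the corresponding 256-bin histogram.

% Generic 3x3 LBP: one code per interior pixel, then a 256-bin histogram.
function hist256 = basicLbp(img)
    img = double(img);
    [rows, cols] = size(img);
    dy = [-1 -1 -1  0  1  1  1  0];     % offsets of the 8 neighbours,
    dx = [-1  0  1  1  1  0 -1 -1];     % taken in a fixed order
    codes = zeros(rows-2, cols-2);
    for y = 2:rows-1
        for x = 2:cols-1
            c = img(y, x);
            code = 0;
            for i = 1:8
                % s(n_i - n_c) contributes 2^(i-1) when the neighbour >= centre
                code = code + (img(y+dy(i), x+dx(i)) >= c) * 2^(i-1);
            end
            codes(y-1, x-1) = code;
        end
    end
    hist256 = histcounts(codes(:), 0:256);   % one bin per possible pattern
end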
The biggest limitation of the generic LBP is that it cannot capture large features, due to its very small neighbourhood of 3x3 pixels. To overcome this problem, there is a variant of the algorithm, called the Extended LBP, that uses a circular neighbourhood of a variable radius R and P equally spaced neighbours. This is referred to as LBP_{P,R}, following the very same principle of comparing each neighbour with the central pixel's value, n_c, and deriving its decimal label. To achieve this, the first thing that needs to be done is computing the coordinates of the neighbours, as follows (equations taken from Eq. 2.2 and Eq. 2.3 of [20]):

x_i = x_c + R cos(2πi/P)
y_i = y_c − R sin(2πi/P)    (5.3)
Not all coordinates will correspond to actual positions within the image grid; therefore, the values of those pixels have to be approximated, using a method called bilinear interpolation. This looks at the closest 4 pixels and uses a weighted scheme to linearly interpolate in each of the 2 dimensions: vertical and horizontal.
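A small sketch of this interpolation step (illustrative, not the project's code), assuming the fractional coordinates fall strictly inside the image grid:

% Bilinear interpolation of img at fractional position (y, x).
function v = bilinear(img, y, x)
    x0 = floor(x); x1 = x0 + 1;        % the four surrounding grid pixels
    y0 = floor(y); y1 = y0 + 1;
    wx = x - x0;  wy = y - y0;         % interpolation weights in each direction
    v = (1-wy) * ((1-wx)*img(y0,x0) + wx*img(y0,x1)) ...
      +    wy  * ((1-wx)*img(y1,x0) + wx*img(y1,x1));
end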
Figure 5.6 shows a number of examples of different LBP operators, with a variable number of neighbours P and radius R.
Figure 5.6: Examples of different extended LBP operators: LBP_{4,1}, LBP_{8,2}, LBP_{8,2} (from left to right). Image is taken from Figure 2 of [25]
Using this algorithm, an LBP operator with P neighbourhood pixels, regardless of the radius, will output 2^P different labels, which will ultimately lead to a 2^P-bin histogram. However, it has been shown that most of these patterns do not store discriminative information, hence most of them can be discarded. Ojala et al. [16] showed that there exists a significantly smaller subset of these patterns which is still able to capture the texture information, with a very small error margin. This variant is called Uniform Local Binary Patterns, and it is the version that has been used in this project.
Uniform Local Binary Patterns

Uniform Local Binary Patterns, referred to as LBP^{u2}_{P,R}, selects only a small subset of labels, which are good texture descriptors because they represent the responses of edges or corners. A label is called uniform if it contains at most two bitwise transitions: that means it has at most two changes from 1 to 0, or from 0 to 1. For example, 11111101 is a uniform pattern as it contains two transitions (1 to 0 and 0 to 1), while 11011011 is not, as it has four transitions (1 to 0, 0 to 1, 1 to 0 and 0 to 1).
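A possible uniformity test, counting circular bit transitions exactly as described above (a sketch, not the project's code):

% A code is uniform if its circular P-bit string has at most two 0/1 transitions.
function u = isUniform(code, P)
    bits = bitget(code, 1:P);                       % P-bit representation
    transitions = sum(bits ~= circshift(bits, 1));  % circular adjacent differences
    u = transitions <= 2;
end

% isUniform(bin2dec('11111101'), 8) returns true;
% isUniform(bin2dec('11011011'), 8) returns false.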
The computational complexity is also reduced, as the subset will only contain P(P−1)+3 patterns, instead of the 2^P of the original version. In this case, when building the corresponding histogram, each unique uniform pattern is assigned to a separate bin, while all non-uniform codes are assigned to a single bin (which is usually the last one).
Algorithm 1 describes the implemented algorithm for uniform LBP, through high level
pseudocode.
Algorithm 1 Uniform Local Binary Pattern Algorithm

function uLbp(img)
    P = 8, R = 2
    ▷ Create a lookup table for all possible uLBP codes
    for each possible uLBP code do
        if code is uniform then
            lookupTable(code) = next available bin number
        else
            lookupTable(code) = last bin
        end if
    end for

    for each pixel in img do
        compute coordinates of the P neighbours
        for each neighbour do
            if its coordinates do not correspond to grid positions then
                compute the value with bilinear interpolation
            else
                get value from corresponding position
            end if
        end for
        compare pixel with neighbours and compute decimal value
        get corresponding value from the lookup table
    end for
    return ulbp values
end function
5.4.2.3 Constructing the feature vector
In order to extract meaningful information from the faces, I have used the Uniform Local Binary Patterns algorithm described above. More specifically, the implemented version uses 8 neighbour pixels and a radius of 2 (P=8 and R=2), which results in a total of 59 different binary patterns. These parameter values are commonly used in other similar systems and they have also provided the highest accuracy for the current system.
A face represents a very complex texture, and treating it as a whole would more likely capture general face features, rather than particular characteristics. Therefore, the first step of feature extraction is dividing the image into relevant areas. For every region, we apply LBP^{u2}_{8,2} on each pixel and build its 59-bin histogram. The final stage consists of concatenating all the histograms, creating a concatenated histogram that describes the face.
I have designed and tried several schemes for image splitting, ultimately using the one which achieved the highest results. Figure 5.7 illustrates this method, which divides the face into 25 regions.
Figure 5.7: Method of dividing the face into 25 regions, corresponding to important
facial areas.
Concatenating all the 25 LBP^{u2}_{8,2} histograms gives us a 1475-bin (25 regions x 59 bins per region) histogram descriptor for the face, which represents the feature vector. The process is illustrated in Figure 5.8 and Algorithm 2.

Figure 5.8: Process of feature extraction: the image is divided into 25 regions, LBP^{u2}_{8,2} is applied on each area and the global histogram is constructed by concatenating the resulting 25 histograms.
Algorithm 2 Feature extraction

function featureExtraction(img)
    ppImg = preprocess(img)
    regions = extractRegions(ppImg)
    for each region in regions do
        ulbpCodes = uLbp(region)
        create histogram from ulbpCodes
        concatenate histogram to globalHistogram
    end for
    return globalHistogram
end function
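To make Algorithm 2 concrete, the Matlab sketch below splits the 120x100 face into a uniform 5x5 grid for illustration only; the project's final scheme uses the hand-designed 25-region division of Figure 5.7, and uniformLbpHistogram stands for a hypothetical helper implementing Algorithm 1.

% Feature-vector construction sketch: 25 regions x 59-bin uLBP histograms = 1475 bins.
function featureVec = extractFeatures(faceImg)
    faceImg    = adapthisteq(faceImg);             % pre-processing
    featureVec = [];
    rowEdges = round(linspace(0, size(faceImg,1), 6));
    colEdges = round(linspace(0, size(faceImg,2), 6));
    for r = 1:5
        for c = 1:5
            region = faceImg(rowEdges(r)+1:rowEdges(r+1), ...
                             colEdges(c)+1:colEdges(c+1));
            h = uniformLbpHistogram(region);       % hypothetical 59-bin LBP_{8,2}^{u2} helper
            featureVec = [featureVec, h];          % concatenate region histograms
        end
    end
end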
5.4.3 Classification
The last stage of the system consists of a model that is trained to perform emotion
classification on new images. It uses a Machine Learning classifier called Support Vector
Machines (SVMs), which takes the output of the feature extraction module, the feature
vectors, and learns the patterns that differentiate one emotion from the other. This subsection firstly introduces the concept of SVM and continues with a detailed explanation
of how they are used in the system.
5.4.3.1 Support Vector Machines
The concept of Support Vector Machines was first introduced by Vapnik et al. [8], and presently they are one of the most widely used methods for pattern classification. An SVM is a supervised learning model, because it uses labelled examples in its training process, examples which correspond to only two categories. This property means that the basic algorithm can only tackle binary classification tasks.
The model analyses the training examples and tries to derive a boundary that will linearly separate the data points into their corresponding classes. One of the most important features of this method is that it does not only look for a separation boundary, but for the 'best' boundary. This is done by maximizing the margin, which is the width by which the separation boundary can be increased until it hits a data point. The difference between separating data with any border and separating it with SVM's optimal boundary is shown in Figure 5.9.
(a) Example of possible boundaries to separate 2D data into two classes. (Image is
taken from Figure 9.2, of [10])
(b) Separating 2D data with the maximummargin boundary. (Image adapted from
Figure 2, of [8])
Figure 5.9: Separation boundaries for 2D data
As depicted in Figure 5.9b, the linear classifier that separates the data has the following mathematical form (equation adapted from Figure 5.7 of [26]):

f(x) = w^T x + b    (5.4)

where w is the normal to the separation hyperplane, known as the weight vector, b is the bias and x is the vector of training examples, corresponding to the classes y = {1, -1}. The goal is finding the best values for w and b that correspond to the maximum-margin boundary, such that each training example x_i can be described as (equations adapted from Eq. 5.43 of [26]):

x_i · w + b ≥ +1 if y_i = +1
x_i · w + b ≤ −1 if y_i = −1    (5.5)
Finding such a function is not trivial because, most of the time, the data is simply not linearly separable. For this, SVM uses so-called 'kernels': functions which map the data points into a higher dimensional space where, eventually, a hyperplane is able to separate the examples. The chosen kernel can be, for example, a polynomial kernel or a radial basis function.
5.4.3.2 Using SVM to classify emotions
Support Vector Machines are not only one of the most popular choices for facial emotion classification, but also a robust model with great generalization properties, less prone to over-fitting. Based on such characteristics, I have chosen this algorithm to support the prediction module of my FER system. The feature extraction stage, described earlier, produces a set of feature vectors for a subset of images that are used as training examples. These are fed into the linear SVM model, for which I have used Matlab's implementation of the algorithm.
The major problem with this is that the proposed solution has to classify emotions into six categories, whereas this Machine Learning algorithm can only deal with 2-class tasks. To overcome this issue, the system implements a multi-class SVM model, using a 'one-against-one' strategy.
The ’one-against-one’ method constructs K(K − 1)/2 SVM classifiers, where K is the
number of classes that the data exhibits, and each model is trained on examples that
belong to the associated pair of classes. For example, assuming the class set is A, B, C,
there are 3 models which are built as follows:
• Model 1 classifies data from class A and B
• Model 2 classifies data from class B and C
• Model 3 classifies data from class A and C
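A sketch of the one-against-one scheme for the six emotions is shown below. The report only states that Matlab's SVM implementation was used, so fitcsvm, as well as the variables labels, features and x, are assumptions rather than the project's actual code.

% One-against-one multi-class SVM: one binary classifier per pair of classes.
classes = {'anger','disgust','fear','happiness','sadness','surprise'};
K = numel(classes);
pairs = nchoosek(1:K, 2);                  % 15 class pairs for K = 6
models = cell(size(pairs,1), 1);

for p = 1:size(pairs,1)
    idx = ismember(labels, classes(pairs(p,:)));   % labels: cellstr, one per example
    models{p} = fitcsvm(features(idx,:), labels(idx), ...
                        'KernelFunction', 'linear');
end

% Max-win voting for a new 1-by-1475 feature vector x:
votes = zeros(1, K);
for p = 1:size(pairs,1)
    pred = predict(models{p}, x);
    votes(strcmp(classes, pred)) = votes(strcmp(classes, pred)) + 1;
end
[~, best] = max(votes);
predictedEmotion = classes{best};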
Following this strategy, a total of 15 SVM classifiers are built for the 6 basic emotions. This kind of model is also referred to as a max-win voting SVM, because it uses a voting strategy to perform the classification. When a new example is fed into the system, each of the 15 classifiers casts a vote, and the class that gets the highest number of votes is chosen as the predicted label. In fact, the current system makes use of all the 'election' information, such that it not only predicts a label, but also computes the probabilities of the image belonging to each of the six categories. These posterior probabilities are computed using Bayes' theorem.
Computing emotion probabilities
Bayes' theorem is a method widely used in statistics for predicting event probabilities, based on knowledge of past and conditional events. For example, the conditional (or posterior) probability of an event Y, given that event X has already happened, is defined as follows (equation taken from Eq. 1.12 of [5]):

P(Y | X) = P(X | Y) P(Y) / P(X)    (5.6)
For each new image, the system uses the knowledge of the class distribution (how many examples from each emotion are in the dataset) and the votes previously accumulated, to compute the posterior probability of each emotion, using the following formula:

P(C_i | x) = P(x | C_i) P(C_i) / P(x)    (5.7)

where P(C_i | x) is the posterior probability for emotion C_i, P(x | C_i) is the vote percentage for class C_i and P(x) is the total probability, defined as P(x) = Σ_{i=1}^{6} P(x | C_i) P(C_i).
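Continuing the voting sketch above, the posterior of Eq. 5.7 can be obtained by combining the vote percentages with the class priors from Table 5.1 (illustrative code, not the project's):

% Posterior emotion probabilities from the max-win votes and the dataset priors.
priors  = [45 59 25 69 28 83];              % anger, disgust, fear, happiness, sadness, surprise
priors  = priors / sum(priors);             % P(C_i): class proportions (Table 5.1)
votePct = votes / sum(votes);               % P(x | C_i): vote percentages
posterior = votePct .* priors;
posterior = posterior / sum(posterior);     % normalise by P(x)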
Figure 5.10 shows a graphical representation of applying the above formulas, for computing the probability that a new input image represents the emotion of happiness.
Figure 5.10: This shows how to apply Bayes’ theorem for computing the posterior
probability of the emotion being “happiness”, given the image x
5.5 System walkthrough
The main functionality of the system is the ability to upload an image of a person expressing one of the six basic emotions and to request a prediction of its category. Therefore, the first feature that is enabled is the upload panel. The user has the choice of loading an image from the file drive or, alternatively, they can use the computer's camera to take a picture of themselves. The choice is shown in Figure 5.11.
(a) Uploading a new image from the file drive.
(b) Using the computer's camera to take a new picture.
Figure 5.11: Panel for uploading or taking a new picture
Once the image is uploaded into the system, it can perform face detection, displayed in
Figure 5.12. This is done automatically, as described in the ’Face Detection’ section, but
in some scenarios, due to poor image quality (usually in the case of using the computer’s
camera to take a picture), the algorithm might fail. In order to cope with such situations,
the user has the option to manually select the face region and still use the system to
classify the image.
(a) Automatic face detection.
(b) Manual face detection, performed by
user.
Figure 5.12: GUI Face detection
Finally, the 'Classify emotion' button performs classification of the face, displaying the predicted label and its description in the "Emotion Details" panel. The "Probability Estimates" panel offers a graphical visualisation of the posterior probabilities of each class, outlining the system's belief in all the emotions, not only in the predicted one (Figure 5.13).
Figure 5.13: Output of the system for classifying the given image.
5.6 Overview of the prototypes
The FER system described above is the result of a series of prototype stages, designed to improve the methodologies and ultimately increase the classification accuracy.
The first fully working prototype could only classify examples from 2 classes, 'Happiness' and 'Disgust', and included the following functionality:
1. Face detection, using the Viola-Jones algorithm
2. Feature extraction, using LBP^{u2}_{8,2} on a face divided into 3 large areas
3. Classification, using a binary SVM model
The next improvement consisted in extending the third module to be able to distinguish
all six emotions, by implementing a multi-class SVM classifier.
The following prototypes focused on creating alternative designs for dividing the face into chunks, in order to capture the most important areas that carry discriminant information for the emotions. The different partition methods that I have tried are illustrated in Figure 5.14, starting with the very first design and ending with the current scheme that the system uses.
Figure 5.14: Different schemes for dividing the face into regions, displayed in the
order in which they have been tried. The last one is the method implemented in the
final version of the FER system.
Chapter 6: Testing and evaluation of results

6.1 Chapter overview
This chapter presents a detailed analysis of the system from the perspective of its performance and robustness. It firstly introduces the methods that were used to test the
system and it then presents the obtained results. Finally, it presents an alternative
classification method and offers a comparison of the outcomes.
6.2 Evaluation methods

6.2.1 Cross validation

6.2.1.1 Method description
Adopting a validation technique is essential not only for estimating the performance of the system, but also for comparing different models, or versions of the same model obtained by modifying its parameters. Moreover, one of the biggest problems of systems that use Machine Learning algorithms is getting a large enough dataset. Such a dataset would ensure proper training of the model while still leaving enough held-out examples for a robust performance evaluation.
To overcome this limitation, a widely used validation technique in Machine Learning is cross-validation. This method relies on repeated divisions of the dataset into training and testing subsets, in order to avoid over-fitting and to capture the prediction error that the model exhibits. There are various adaptations of this technique, with different partitioning schemes; this project uses the so-called k-fold cross validation.
With this method, the dataset is randomly split into K equal subsets, out of which K-1 subsets are used for the training phase, while the remaining fold is used for testing. This process is repeated K times, such that in every round of testing a different subset is used for validation. Figure 6.1 illustrates the process of partitioning the dataset into 10 folds using the described technique.
Figure 6.1: Process of dividing the data into training and testing subsets, using 10-fold cross validation
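A sketch of this procedure using Matlab's cvpartition is given below; the report does not name the function it used, and the two training/prediction helpers are hypothetical stand-ins for the modules described in Chapter 5.

% 10-fold cross validation over the 306 extracted faces.
cv = cvpartition(labels, 'KFold', 10);      % stratified split by emotion label
accuracy = zeros(cv.NumTestSets, 1);

for k = 1:cv.NumTestSets
    trainIdx = training(cv, k);
    testIdx  = test(cv, k);
    model = trainMultiClassSvm(features(trainIdx,:), labels(trainIdx)); % hypothetical helper
    pred  = predictEmotions(model, features(testIdx,:));                % hypothetical helper
    accuracy(k) = mean(strcmp(pred, labels(testIdx)));
end
fprintf('Mean accuracy: %.1f%%\n', 100 * mean(accuracy));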
6.2.1.2 Analysis of results
The proposed system uses a 10-fold cross validation method for splitting the 306 examples, therefore it performs testing on approximately 30 examples each round. Following the validation phase, an average accuracy of 86% is achieved for the implemented model. Figures 6.2a and 6.2b present the outcome generated by the 10-fold cross validation process, during one and ten rounds of testing respectively. The graphs depict the average accuracies and the associated error bars, which outline how much the results vary. The difference in values for each test, observed in Figure 6.2a, is due to the random division of the examples into folds. In some partitionings, the training set might contain too few examples from one class, and hence the model performs more poorly when classifying examples from that specific category.
A more detailed analysis of the results obtained in the prototype stages is presented in Appendix B.
(a) Graph showing the average accuracy and
standard error bars for a round of testing using 10-fold cross validation.
(b) Graph showing the average accuracy and
standard error bars for 10 rounds of 10-fold
cross validation.
Figure 6.2: Cross-validation results
6.2.2 Confusion matrix

6.2.2.1 Method description
It is very important to analyse the model not only from a performance perspective, but also to investigate how it behaves for each individual class. Does it predict perfectly for class A but always misclassify class C, or does it have a fair performance for all classes? A very common approach to detecting such issues is using a confusion matrix. This is a table-like representation of the predictions that the model outputs during testing, illustrating how the examples have been classified. Table 6.1 outlines the general format of this type of representation, for a binary problem.
                 Predicted Class
True class       Positive           Negative
Positive         true positives     false negatives
Negative         false positives    true negatives

Table 6.1: General format for a confusion matrix (adapted from Table 19.1 of [2])
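For a multi-class problem the same idea applies. A sketch using Matlab's confusionmat is shown below, where trueLabels and predictedLabels are assumed variables holding the labels collected over the testing rounds.

% Build the 6x6 confusion matrix from collected predictions.
order = {'anger','disgust','fear','happiness','sadness','surprise'};
C = confusionmat(trueLabels, predictedLabels, 'Order', order);
% C(i,j) = number of examples with true class order{i} predicted as order{j}.
perClassRecall = diag(C) ./ sum(C, 2);   % fraction correctly classified per class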
6.2.2.2 Analysis of results
Table 6.2 shows the confusion matrix obtained during 10 rounds of testing the system, where the rows correspond to the true labels and the columns to the
predicted ones. Therefore, looking at the rows, we can observe not only how many instances
have been correctly classified, but also which classes were assigned to the misclassified
examples. For example, 'surprise' is labelled well (80 times, out of 82), as is 'disgust'
(54 out of 58). On the other hand, 'sadness' is often confused with 'anger', while 'fear'
is mostly mislabelled.
                                     Predicted label
Actual label    Anger   Disgust   Fear   Happiness   Sadness   Surprise
Anger             38       3        0        1          3          0
Disgust            3      54        0        1          0          0
Fear               2       2        9        3          3          4
Happiness          0       0        1       68          0          0
Sadness            9       0        0        0         17          2
Surprise           1       0        2        0          0         80

Table 6.2: Confusion matrix obtained after testing
6.3 Comparison of results with Random Forests
Choosing a suitable classifier is an essential part of any classification problem. The
choice of model for this project was based on high discriminative power, robustness and
the ability to achieve high accuracy. As the main goal of this project consisted in
researching techniques for solving the facial expression recognition problem, an alternative
solution, Random Forests, was also investigated.
6.3.1 Random Forests
Random Forests is among the most popular algorithms adopted in classification tasks,
due to its efficiency on large datasets, high accuracy and native multi-class capabilities.
It consists of building a collection of decision trees, each trained on a bootstrap
sample (a set of examples randomly chosen with replacement) drawn from the available
training examples.
A decision tree is normally constructed by recursively partitioning the examples, selecting
the best variable from the feature set at each node split. In a random forest, growing
each decision tree consists of first selecting a subset of attributes at random and,
within that subset, picking the best variable to perform the split. The idea behind this
random sampling is to reduce the correlation between the trees and, in turn, the variance
of the ensemble.
A classification for a new example is performed by collecting the votes of all the decision
trees and then choosing the class that has the majority of votes.
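A minimal sketch of this procedure, using the RandomForestClassifier from scikit-learn rather than the project's own code, is given below; the number of trees and the sqrt feature-subset rule are illustrative choices, not values taken from the project.

    from sklearn.ensemble import RandomForestClassifier

    def classify_with_forest(X_train, y_train, X_test, n_trees=100):
        """Train a random forest and label new examples by majority vote."""
        # Each tree is grown on a bootstrap sample of the training data; at each
        # node split only a random subset of the features (sqrt of the total
        # here) is considered, which decorrelates the individual trees.
        forest = RandomForestClassifier(n_estimators=n_trees,
                                        max_features='sqrt',
                                        bootstrap=True,
                                        random_state=0)
        forest.fit(X_train, y_train)
        # predict() aggregates the votes of all trees and returns, for every
        # test example, the class with the majority of votes.
        return forest.predict(X_test)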
6.3.2 Comparative testing and results
In order to decide which algorithm to select for the final implementation of the system,
I have performed several tests to compare the performance of each method.
Performance comparison using cross-validation
Firstly, I have run both algorithms on the same data, corresponding to 10 rounds of
10-fold cross validation. Figure 6.3 illustrates the obtained results and graphically
highlights that SVM outperforms the alternative approach in every round of testing.
Figure 6.3: Performance of Support Vector Machines versus Random Forest
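A comparison along these lines could be scripted as in the sketch below, where RepeatedKFold provides the 10 rounds of 10-fold cross validation; the two scikit-learn classifiers are assumed stand-ins for the models used in the project rather than the actual implementation.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RepeatedKFold, cross_val_score
    from sklearn.svm import SVC

    def compare_classifiers(X, y):
        """Return mean accuracy of SVM and Random Forest over 10 x 10-fold CV."""
        cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
        svm_scores = cross_val_score(SVC(kernel='linear'), X, y, cv=cv)
        rf_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=cv)
        return svm_scores.mean(), rf_scores.mean()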
Based on these results, I have also tried to demonstrate that the difference in performance
between the better algorithm, in this case SVM, and the alternative is statistically
significant, using a test called the McNemar test.
McNemar test
McNemar is a non-parametric statistical test which can be used to assess whether the
performance of two different classifiers is significantly different. The goal is to reject
the null hypothesis, which assumes the two classifiers have the same error rate, and hence
to show that one method is better than the other. The outputs of the algorithms on the
test set can be organised in a 2x2 table, which is then used to compute a statistical
measure called the z-score, shown in Equation 6.1 (taken from Eq. 1 of [6]). According to
Bostanci et al. in [6], if the z-score is 0, then the performance of the two algorithms is
the same; otherwise they differ and we can reject the null hypothesis.
The general form of the output table and the results I have obtained are presented in
Tables 6.3 and 6.4.
                         Algorithm B failed    Algorithm B succeeded
Algorithm A failed              Nff                     Nfs
Algorithm A succeeded           Nsf                     Nss

Table 6.3: General form of output arrangement for the two algorithms. Table is taken from Table 3 of [6]
                       Random Forest failed    Random Forest succeeded
SVM failed                      32                        8
SVM succeeded                   54                       212

Table 6.4: Obtained table for SVM and Random Forest output, for all test folds in a 10-fold cross validation
z = (|Nsf − Nfs| − 1) / √(Nsf + Nfs)                                            (6.1)
The computed value of the z-score for the above output is 5.71. Based on this result,
we can conclude that the two classifiers do not have the same error rate; hence we can
reject the null hypothesis and state that SVM performs significantly better than Random
Forest in the current context.
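For reference, the z-score of Equation 6.1 can be reproduced directly from the counts in Table 6.4; the short sketch below is only a numerical check of that computation.

    import math

    def mcnemar_z(n_sf, n_fs):
        """z-score of Equation 6.1; n_sf and n_fs are the two discordant counts."""
        return (abs(n_sf - n_fs) - 1) / math.sqrt(n_sf + n_fs)

    # From Table 6.4: SVM succeeded while Random Forest failed on 54 examples,
    # and failed while Random Forest succeeded on 8 examples.
    print(mcnemar_z(54, 8))  # ~5.715, consistent with the reported value of 5.71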
Chapter 7
Conclusions
7.1 Chapter overview
This final chapter summarises the achievements of the project as well as the challenges
that were faced during its development. It also provides an outline of possible
improvements and their applicability, concluding with final remarks.
7.2 Project achievements
The proposed solution delivers a recogniser system for facial expressions. The most
important achievements consist in the integrated functionalities and the obtained results.
The system includes an automatic face detection mechanism and implements feature
extraction techniques tailored to the current problem. A Support Vector Machine model is
trained on examples of faces and extended to support multi-class classification.
This successfully equips the system with the capability of classifying all six emotions,
ultimately achieving an accuracy of 86%.
The functionality can be easily accessed by the user through a simple, yet intuitive GUI,
which provides the ability to upload an image and request a classification.
7.3 Challenges
Prior to implementing the system, one of the first challenges of the project was choosing
the algorithms for each individual module, because the selection had to consider the
following: integration of techniques, the time allowed for project development, speed of
computation and, ultimately, achieving good system performance.
Other difficulties were met during the feature extraction phase, while implementing
the uniform LBP algorithm and finding a suitable way of partitioning the face into
meaningful regions. Extending the binary SVM model into a multi-class agent and
deriving a mechanism for computing class probabilities were also non-trivial tasks.
7.4 Future improvements
There are a number of approaches that would either increase the performance of the
system or extend its functionality. A major improvement would be replacing the current
method of face division with one of the geometric techniques, Active Shape Models or
Active Appearance Models. These techniques allow the identification of landmark points
surrounding important face regions, such as the eyes, nose and mouth. This enables feature
extraction to be applied only on key areas, which would improve the results and could
even eliminate the need for a separate face detection mechanism.
One of the limitations of the proposed system is that it only supports recognition on
static images. Therefore, a significant advancement would be adding the ability to model
the temporal component of expressions, hence analysing video input as well. This would
imply using alternative algorithms, such as the Piecewise Bezier Volume Deformation
(PBVD) tracker for face tracking and Hidden Markov Models for classification.
Moreover, the solution can predict only a restricted number of classes. To overcome this,
a more appropriate approach would be to use the FACS parametrised system, hence regarding
each emotion as a set of Action Units describing movements of the face muscles. This would
allow both the extension of the set of classes beyond six, and a more in-depth
characterisation of emotions.
7.5 Concluding remarks and reflections
Finally, this report demonstrates the achievements of the project, but also presents
an assessment of its performance and reliability. Overall, the proposed solution has
delivered a system capable of classifying the six basic universal emotions, with an average
accuracy of 86%. It makes extensive use of Image Processing and Machine Learning
techniques to evaluate still images and derive suitable features, such that, presented
with a new example, it is able to recognise the expressed emotion.
Personally, this project contributed significantly to improving my knowledge of
Computer Vision and Machine Learning methodologies and to understanding the challenges
and limitations of image interpretation. Moreover, it helped me develop a systematic
approach to building such a system, including planning, design and documentation, within
a restricted amount of time.
To conclude, I believe that the current solution has succeeded in meeting the project's
requirements and deliverables. Even though it has a number of limitations, it allows
for further extensions, which would enable a more in-depth analysis and understanding
of human behaviour through facial emotions.
Appendix A
Contrast-limited adaptive histogram equalization
Pre-processing is a very important step in image processing, because the input data is
often noisy or of poor quality. Such techniques can greatly improve the result when the
methods being used are sensitive to noise or other artefacts, but they can also influence
the computational speed.
Contrast-limited adaptive histogram equalization, or CLAHE, is used in order to amplify
the local contrast within the face images. It has the advantage of achieving very good
results without being computationally expensive. Its original version, Adaptive Histogram
Equalization, is based on subdividing an image into regions and applying histogram
equalization to each of them. Histogram equalization uses a mapping function to spread
out the pixels of the original image histogram. Therefore, the pixels are translated into
a new histogram, achieving a global contrast enhancement of the image.
Artefacts that might appear on the edges of the divided blocks are usually managed
by using bilinear interpolation. However, Adaptive Histogram Equalization is prone to
amplifying noise in some small regions. CLAHE tries to overcome this problem by using
a mechanism called clipping, by which each histogram bin is limited in the number of
pixels that can be assigned to it. The pixels that remain unassigned are equally
scattered over the histogram. According to Karel Zuiderveld in [27], the clipping factor
is chosen as ”a multiple of the average histogram contents”, defining how much the local
contrast is limited.
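As an illustration, OpenCV exposes a CLAHE implementation which could be applied to the extracted grayscale face as sketched below; the clip limit and tile grid size shown are example values, not necessarily the ones used in this project.

    import cv2

    def enhance_contrast(gray_face, clip_limit=2.0, grid_size=(8, 8)):
        """Apply CLAHE to an 8-bit grayscale face image (assumed input format)."""
        # clipLimit bounds how many pixels a histogram bin may receive before the
        # excess is redistributed; tileGridSize defines the local regions.
        clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=grid_size)
        return clahe.apply(gray_face)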
Figure A.1 illustrates the effect of applying the contrast-limited adaptive histogram
equalization method on the extracted face.
Figure A.1: Result of applying Contrast-limited adaptive histogram equalization.
I have also performed tests of 10 rounds of 10-fold cross validation, to compare the
system’s performance using raw images against pre-processed data. The results outlined
an increase in accuracy of 2.5%, from an average of 84% to 86.5%.
Appendix B
Additional testing
Evaluation of prototypes
The proposed system has been developed through a series of prototypes which gradually
expanded the functionality and improved the achieved results. During the feature
extraction phase, various methods were tried for partitioning the face into relevant areas.
Figures B.1, B.2 and B.3 illustrate the results obtained with the different experimental
designs, and outline the reasoning behind the design adopted in the final system,
which achieved an accuracy of approximately 86% - higher than any of the presented
prototypes.
Figure B.1: Performance of prototype using a 3x1 face division grid.
Average accuracy: 66.6%
Figure B.2: Performance of prototype using a 4x5 face division grid.
Average accuracy: 83.9%
Figure B.3: Performance of the final prototype (the implemented version), using a
variable face division grid.
Average accuracy: 86.5%
Bibliography
[1] Jörgen Ahlberg. Candide-3 – an updated parameterized face. Technical Report
LiTH-ISY-R-2326, Dept. of Electrical Engineering, Linköping University, Sweden,
2001.
[2] Ethem Alpaydin. Introduction to Machine Learning, second edition. 2010.
[3] Marian Stewart Bartlett, Gwen Littlewort, Ian Fasel, and Javier R. Movellan. Real
time face detection and facial expression recognition: Development and applications to human computer interaction. In Computer Vision and Pattern Recognition
Workshop, 2003. CVPRW ’03. Conference on, volume 5, pages 53–53, 2003.
[4] Michel Beaudouin-Lafon and Wendy Mackay. The human-computer interaction
handbook. chapter Prototyping Tools and Techniques, pages 1006–1031. 2003.
[5] Christopher Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.
[6] Betul Bostanci and Erkan Bostanci. An evaluation of classification algorithms using
McNemar’s test. In Proceedings of Seventh International Conference on Bio-Inspired
Computing: Theories and Applications (BIC-TA 2012), volume 201, pages 15–26.
2013.
[7] Ira Cohen, Nicu Sebe, Larry Chen, Ashutosh Garg, and Thomas S. Huang. Facial
expression recognition from video sequences: Temporal and static modelling. In
Computer Vision and Image Understanding, pages 160–187, 2003.
[8] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[9] Charles Darwin. The expression of the emotions in man and animals. AMS Press,
1972.
[10] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction
to Statistical Learning with Applications in R. Springer-Verlag New York, 2013.
[11] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):23–38, 1998.
[12] Henry Schneiderman and Takeo Kanade. A statistical method for 3D object detection
applied to faces and cars. In Computer Vision and Pattern Recognition, 2000. IEEE
Conference on, volume 1, pages 746–751, 2000.
[13] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The
extended cohn-kanade dataset (ck+): A complete dataset for action unit and
emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 94–101, 2010.
[14] Maja Pantic and Leon J.M. Rothkrantz. Automatic analysis of facial expressions: the
state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence,
pages 1424–1445, 2000.
[15] Albert Mehrabian. Communication without words. Psychology Today, 2(9):52–55,
1968.
[16] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 24(7):971–987, 2002.
[17] Timo Ojala, Matti Pietikäinen, and David Harwood. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29(1):51–59, 1996.
[18] Paul Ekman and Wallace V. Friesen. Facial Action Coding System. A Human Face, 2002.
[19] Rosalind W. Picard. Affective computing for hci. Proceedings HCI, 1999.
[20] Matti Pietikäinen, Abdenour Hadid, Guoying Zhao, and Timo Ahonen. Local
binary patterns for still images. In Computer Vision Using Local Binary Patterns,
volume 40, pages 13–47. Springer London, 2011.
[21] James A. Russell and José Miguel Fernández-Dols. The psychology of facial expression. In Studies in Emotion and Social Interaction. 1997.
[22] Hai Tao and Thomas S. Huang. A piecewise Bézier volume deformation model and
its applications in facial motion capture. Series in Machine Perception and Artificial
Intelligence, 52:39–56, 2002.
[23] Ying-Li Tian, Takeo Kanade, and Jeffrey F. Cohn. Facial expression analysis. In
Handbook of Face Recognition, pages 247–275. Springer New York, 2005.
[24] Paul Viola and Michael J. Jones. Robust real-time face detection. International
Journal of Computer Vision, 57(2):137–154, 2004.
[25] Wencheng Wang, Faliang Chang, Jianguo Zhao, and Zhenxue Chen. Automatic
facial expression recognition using local binary pattern. In Intelligent Control and
Automation (WCICA), 2010 8th World Congress on, pages 6375–6378, 2010.
[26] Andrew R. Webb and Keith D. Copsey. Statistical Pattern Recognition. Wiley,
2011.
[27] Karel Zuiderveld. Graphics Gems IV, chapter Contrast Limited Adaptive Histogram
Equalization, pages 474–485. 1994.