
Dynamic Facial Expression Recognition with Atlas
Construction and Sparse Representation
Yimo Guo, Guoying Zhao, Senior Member, IEEE, and Matti Pietikäinen, Fellow, IEEE
Abstract—In this paper, a new dynamic facial expression recognition method is proposed. Dynamic facial expression recognition
is formulated as a longitudinal groupwise registration problem.
The main contributions of this method lie in the following aspects:
(1) subject-specific facial feature movements of different expressions are described by a diffeomorphic growth model; (2) a salient longitudinal facial expression atlas is built for each expression
by a sparse groupwise image registration method, which can
describe the overall facial feature changes among the whole
population and can suppress the bias due to large inter-subject
facial variations; (3) both the image appearance information in
spatial domain and topological evolution information in temporal
domain are used to guide recognition by a sparse representation
method. The proposed framework has been extensively evaluated on five databases for different applications: the extended
Cohn-Kanade, MMI, FERA, and AFEW databases for dynamic
facial expression recognition, and UNBC-McMaster database for
spontaneous pain expression monitoring. This framework is also
compared with several state-of-the-art dynamic facial expression
recognition methods. The experimental results demonstrate that
the recognition rates of the new method are consistently higher
than those of the other methods under comparison.
Index Terms—Dynamic Facial Expression Recognition, Diffeomorphic Growth Model, Groupwise Registration, Sparse Representation.
I. INTRODUCTION
Automatic facial expression recognition (AFER) has essential real-world applications, including, but not limited to, human-computer interaction (HCI), psychology and telecommunications. It remains a challenging problem and an active research topic in computer vision, and many novel methods have been proposed to tackle it.
Intensive studies have been carried out on the AFER problem in static images during the last decade [1], [2]: given a query facial image, estimate the correct facial expression type, such as anger, disgust, happiness, sadness, fear or surprise. The task mainly consists of two steps: feature extraction and classifier design. For feature extraction, Gabor wavelets [3], the local binary pattern (LBP) [4], and geometric features such as the active appearance model (AAM) [5] are in common use. For classification, the support vector machine is frequently used. Joint alignment of facial images under unconstrained conditions has also become an active research topic in AFER [6].
In recent years, dynamic facial expression recognition has
become a new research topic and has received increasing
attention [7], [8], [9], [10], [11], [12]. Different from the
recognition problem in static images, the aim of dynamic
facial expression recognition is to estimate facial expression
type from an image sequence captured during physical facial
expression process of a subject. The facial expression image
sequence contains not only image appearance information in
the spatial domain, but also evolution details in the temporal domain. The image appearance information together
with the expression evolution information can further enhance
recognition performance. Although the dynamic information
provided is useful, there are challenges regarding how to capture this information reliably and robustly. For instance, a facial expression sequence normally consists of one or more onset, apex and offset phases. In order to capture temporal information and make the temporal information of training and query sequences comparable, correspondences between different temporal phases need to be established. As facial actions
over time are different across subjects, it remains an open issue
how a common temporal feature for each expression among
the population can be effectively encoded while suppressing
subject-specific facial shape variations.
In this paper, a new dynamic facial expression recognition
method is presented. It is motivated by the fact that facial
expression can be described by diffeomorphic motions of muscles beneath the face [13], [14]. Intuitively, ‘diffeomorphic’
means the motion is topologically preserved and reversible
[15]. The formal definition of ‘diffeomorphic’ transformation
is given in Section II. Different from previous works [10], [16], which use pairwise registration to capture the temporal motion, this method considers both subject-specific and population information through a groupwise diffeomorphic registration
scheme. Moreover, both the spatial and temporal information
are captured with a unified sparse representation framework.
Our method consists of two stages: atlas construction stage
and recognition stage. Atlases, which are unbiased images, are
estimated from all the training images belonging to the same
expression type with groupwise registration. Atlases capture
general features of each expression across the population and
can suppress differences due to inter-subject facial shape variations. In the atlas construction stage, a diffeomorphic growth
model is estimated for each image sequence to capture subject-specific facial expression characteristics. To reflect the overall
evolution process of each expression among the population,
longitudinal atlases are then constructed for each expression
with groupwise registration and sparse representation. In the
recognition stage, we first register the query image sequence
to the atlas of each expression. Then, the comparison is conducted
from two aspects: image appearance information and temporal
evolution information. The preliminary work has been reported
in [17].
For the proposed method, there are three main contributions
and differences compared to the preliminary work in [17]:
(1) A more advanced atlas construction scheme is used.
In the previous method [17], the atlases are constructed using the conventional groupwise registration method; thus, many subtle and important anatomical details are lost due to the naive mean operation. To overcome this shortcoming, a sparse
representation based atlas construction method is proposed
in this paper. It is capable of capturing subtle and salient
image appearance details to guide recognition, and preserving
common expression characteristics. (2) In the recognition
stage, the previous method in [17] compared image differences
between the warped query sequence and atlas sequence, which
is based on image appearance information only. In this paper,
the temporal evolution information is also taken into account
to drive the recognition process. It has been shown to provide
complementary information to image appearance information
and can significantly improve the recognition performance.
(3) The proposed method has been evaluated in a systematic
manner on five databases whose applications vary from posed
dynamic facial expression recognition to spontaneous pain
expression monitoring. Moreover, possible alternatives have
been carefully analyzed and studied with different experimental settings.
The rest of the paper is organized as follows: Section
II gives an overview of related work on dynamic facial expression recognition and diffeomorphic image registration. Section III
describes the proposed method. Section IV analyzes experimental results. Section V concludes the paper.
II. RELATED WORK
A. Methods for Dynamic Facial Expression Recognition
Many novel approaches have been proposed for dynamic
facial expression recognition [18], [19], [20], [10], [21]. They
can be broadly classified into three categories: shape based
methods, appearance based methods and motion based methods.
Shape based methods describe facial component shapes
based on salient landmarks detected on facial images, such
as corners of eyes and mouths. The movement of those
landmarks provides discriminant information to guide the
recognition process. For instance, the active appearance model
(AAM) [22] and the constrained local model (CLM) [23] are
widely used. Also, Chang et al. [18] inferred the facial expression manifold by applying an active wavelets network (AWN) to a facial shape model defined by the 58 landmarks used by Pantic and Patras [19].
Appearance based methods extract image intensity or other
texture features from facial images to characterize facial expressions. Commonly used feature extraction methods include
LBP-TOP [7], Gabor wavelets [3], HOG [24], SIFT [25]
and subspace learning [20]. For a more thorough review of
appearance features, readers may refer to the survey paper
[26].
Motion based methods aim to model the spatial-temporal evolution process of facial expressions and are usually developed
by virtue of image registration techniques. For instance, Koelstra et al. [10] used the free-form deformation (FFD) [27] to
capture motions between frames. The optical flow was adopted
by Yeasin et al. [21]. The recognition performance of motion based methods is highly dependent on the face alignment method used. Many advanced techniques have been proposed for face
alignment, such as the supervised descent method developed
by Xiong and De la Torre [28], the parameterized kernel principal component analysis based alignment method proposed by
De la Torre and Nguyen [29], the FFT-based scale invariant
image registration method proposed by Tzimiropoulos et al.
[30], and the explicit shape regression based face alignment
method proposed by Cao et al. [31].
B. Diffeomorphic Image Registration
Image registration is an active research topic in computer
vision and medical image analysis [32], [33]. The goal of
image registration is to transform a set of images, which may be acquired at different spatial positions, times, or with different imaging protocols, into a common coordinate system, namely the template space. Image
registration can be formulated as an optimization problem.
Figure 1 illustrates the flow chart of pairwise registration
between two images.
Fig. 1. A typical flow chart for pairwise image registration (components: moving image, fixed image, similarity measure, optimizer, interpolator and transformation model): the moving image is transformed into the fixed image space. In each iteration, the optimizer minimizes the similarity measure between the two images and calculates the corresponding optimal transformation parameters.
Equation 1 summarizes the general optimization process of the pairwise registration problem:
$$ T_{opt} = \arg\min_{T \in \Phi} E(I_{fix}, T(I_{mov})), \qquad (1) $$
where $I_{fix}$ is the fixed image, $I_{mov}$ is the moving image, $\Phi$ denotes the space of all possible transformations and $E(\cdot)$ is the similarity metric. This equation aims to find the optimal transformation $T_{opt}$ that minimizes $E(\cdot)$ between $I_{fix}$ and $I_{mov}$.
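As a concrete (and deliberately simplified) illustration of Equation 1, the following Python sketch registers two images with a small discrete set of translations as the transformation space and a sum-of-squared-differences (SSD) metric; the transformation model, metric and search strategy are stand-ins chosen for brevity, not the diffeomorphic machinery discussed below, and all function names are our own.

# Illustration of Equation 1: T_opt = argmin_{T in Phi} E(I_fix, T(I_mov)).
# Phi is a small discrete set of 2-D translations; E is the SSD metric.
import numpy as np
from scipy.ndimage import shift

def ssd(fixed, moved):
    # Similarity measure E(.,.): sum of squared differences.
    return float(np.sum((fixed - moved) ** 2))

def register(fixed, moving, max_shift=5):
    # Exhaustively search the transformation space Phi (integer translations).
    candidates = [(dy, dx) for dy in range(-max_shift, max_shift + 1)
                           for dx in range(-max_shift, max_shift + 1)]
    energies = [ssd(fixed, shift(moving, t, order=1, mode='nearest')) for t in candidates]
    return candidates[int(np.argmin(energies))]   # optimal transformation T_opt

# Toy usage: recover the translation between an image and its shifted copy.
fixed = np.zeros((64, 64)); fixed[20:40, 20:40] = 1.0
moving = shift(fixed, (3, -2), order=1, mode='nearest')
print(register(fixed, moving))  # (-3, 2): the shift that maps 'moving' back onto 'fixed'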
The transformation model for registration is application specific. It can be a rigid or affine transformation, which contains only three or six degrees of freedom, respectively, or a deformable transformation, such as the B-spline [27] and diffeomorphic transformations [15], [34], which contain thousands of degrees of freedom.
In this paper, the diffeomorphic transformation is used due
to its excellent properties, such as topology preservation and
Fig. 2. Illustration of the whole facial expression process. The neutral face gradually evolves to the apex state, and then facial muscles get it back to another neutral state. Therefore, this can be considered as a diffeomorphic transformation process (i.e., topologically preserved and reversible).
reversibility [15]. These properties are essential and required
to model facial feature movements and suppress registration
errors. Otherwise, unrealistic deformations (e.g., twisted facial
expressions) may occur and introduce large registration errors.
The formal definition of a diffeomorphic transformation is: given two manifolds $\Upsilon_1$ and $\Upsilon_2$ and a mapping function $F: \Upsilon_1 \to \Upsilon_2$, $F$ is a diffeomorphic transformation if it is differentiable and its inverse mapping $F^{-1}: \Upsilon_2 \to \Upsilon_1$ is also differentiable. $F$ is a $C^{\xi}$ diffeomorphic transformation if $F$ and $F^{-1}$ are $\xi$ times differentiable. For a registration task, $F$ is often built on an infinite-dimensional manifold [15]. It should be noted that the group of diffeomorphic transformations $F$ is also a manifold.
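As a simple one-dimensional illustration (our own example, not taken from the cited references): the map $F(x) = x + 0.5\sin(x)$ on $\mathbb{R}$ is a $C^{\infty}$ diffeomorphism, since $F'(x) = 1 + 0.5\cos(x) \geq 0.5 > 0$ everywhere, so $F$ is smooth, strictly increasing, and its inverse is differentiable by the inverse function theorem. In contrast, $F(x) = x^{3}$ is smooth and invertible but not a diffeomorphism, because its inverse $y \mapsto y^{1/3}$ is not differentiable at $y = 0$.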
C. Groupwise Registration
As the facial expression process is topologically preserved and reversible, as illustrated in Figure 2, it can be considered as a diffeomorphic transformation of facial muscles. Therefore,
the diffeomorphic transformation during the evolution process
of facial expression can be used to reconstruct facial feature
movements and further guide the recognition task.
Given P facial expression images $I_1, ..., I_P$, a straightforward way to transform them into a common space is to select one image as the template and then register the remaining P - 1 images to the template by applying P - 1 pairwise registrations. However, the registration quality is sensitive to the selection of the template. Therefore, the idea of groupwise registration was proposed [35], [36], where the template is estimated as the Fréchet mean on the Riemannian manifold whose geodesic distances are measured based on diffeomorphisms.
The diffeomorphic groupwise registration problem can be
formulated as the optimization problem by minimizing:
$$ \hat{I}^{opt}, \psi_1^{opt}, ..., \psi_P^{opt} = \arg\min_{\hat{I}, \psi_1, ..., \psi_P} \sum_{i=1}^{P} \left[ d(\hat{I}, \psi_i(I_i))^2 + \lambda R(\psi_i) \right], \qquad (2) $$
where both the template $\hat{I}^{opt}$ and the optimal diffeomorphic transformations $\psi_i^{opt}$ ($i = 1, ..., P$) that transform $I_i$ to $\hat{I}^{opt}$ are variables to be estimated. $d(\cdot)$ is the similarity function that measures the matching degree between two images, $R(\cdot)$ denotes the regularization term that controls the smoothness of the transformation, and $\lambda$ is a parameter to control the weight of
$R(\cdot)$. $\hat{I}^{opt}$ and $\psi_i^{opt}$ can be estimated by a greedy iterative estimation strategy [35]: first, initialize $\hat{I}$ as the mean image of the $I_i$; fix $\hat{I}$ and estimate $\psi_i$ by registering $I_i$ to $\hat{I}$ in the current iteration; then, fix $\psi_i$ and update $\hat{I}$ as the mean image of the $\psi_i(I_i)$. In this way, $\psi_i$ and $\hat{I}$ are iteratively updated until they converge.
Fig. 3. Illustration of diffeomorphic groupwise registration, where the template is estimated to be the Fréchet mean on the Riemannian manifold. $\psi_i$ ($i = 1, 2, 3, 4$) denotes the diffeomorphic transformation from $I_i$ to the template (solid black arrows), while $\psi_i^{-1}$ denotes the reversed transformation (dashed black arrows).
Figure 3 illustrates an example of diffeomorphic groupwise
registration. The estimated template, which is also called the atlas, represents the overall facial feature changes of a specific expression among the population. The atlas is unbiased towards any individual subject and reflects the general expression information. Our dynamic facial expression recognition framework is
based on diffeomorphic groupwise registration. The details are
given in Section III.
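A minimal Python sketch of the greedy template estimation strategy described above is given below. The pairwise registration step is a deliberately crude stand-in (an exhaustive small integer-translation search under an SSD metric, with wrap-around shifts), not the diffeomorphic registration used in this paper, and the function names are our own.

# Greedy groupwise template (atlas) estimation in the spirit of Equation 2 (illustrative sketch).
import numpy as np

def pairwise_register(template, image, max_shift=5):
    # Stand-in registration: align 'image' to 'template' with a small integer translation.
    best, best_err = image, np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            candidate = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
            err = np.sum((candidate - template) ** 2)
            if err < best_err:
                best, best_err = candidate, err
    return best  # image warped into the template space

def groupwise_template(images, n_iters=10):
    template = np.mean(images, axis=0)            # initialize with the mean image
    for _ in range(n_iters):
        registered = [pairwise_register(template, img) for img in images]
        template = np.mean(registered, axis=0)    # update the template (Frechet-mean surrogate)
    return template

# Usage: images = list of equally sized grayscale frames; atlas = groupwise_template(images)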
III. METHODOLOGY
We propose a new dynamic facial expression recognition method that consists of two main stages: an atlas construction stage and a recognition stage. In the atlas construction stage, an atlas sequence is built in which salient and common features of each expression among the population are extracted, while variations due to inter-subject facial shapes are suppressed. In the recognition stage, the expression type is determined by comparing the query sequence with each atlas sequence.
A. Atlas Construction by Sparse Groupwise Registration
The flow chart of the atlas construction stage is illustrated in Figure 4. In this stage, longitudinal facial expression atlases are constructed to capture the salient facial feature changes during an expression process.
Given K types of facial expressions of interest and C subject image sequences for each expression, denote the image at the jth time point of the ith subject (i = 1, ..., C) as $I_{t_j^i}$. Assume each image sequence begins at time point 0 and ends at time point 1 (i.e., $t_j^i \in [0, 1]$). For each expression, to construct N atlases at given time points $T = \{t_1, ..., t_N\}$, where $t_k \in [0, 1]$ (k = 1, ..., N), we formulate atlas construction as the following energy minimization problem:
$$ M_t, \phi^i = \arg\min_{\tilde{M}_t, \tilde{\phi}^i} \sum_{t \in T} \sum_{i=1}^{C} \left\{ d\big(\tilde{M}_t, \tilde{\phi}^i_{(t_0^i \to t)}(I_{t_0^i})\big)^2 + \lambda_{\phi^i} R(\tilde{\phi}^i) \right\}, \qquad (3) $$
where $M_t$ is the longitudinal atlas at time point t and $\phi^i$ is the diffeomorphic growth model that models the facial expression process of subject i. $\phi^i_{(t_0^i \to t)}(I_{t_0^i})$ denotes the warping of subject i's image at the first time point, $I_{t_0^i}$, to time point t, and $R(\cdot)$ is the regularization constraint.
In the atlas construction stage, training sequences are carefully constrained and pre-segmented to make sure that they begin with the neutral expression and end with the apex expression. Thus, the beginning and ending stages of all training sequences are aligned. Given the growth model of each sequence, the intermediate states between the neutral and apex expressions are estimated by uniformly dividing the time interval between the beginning and ending stages. Therefore, the intermediate states are also aligned across training sequences, and each state corresponds to one specific time point at which an atlas is constructed. The more states (i.e., the larger the number of time points N) are used, the more accurately the atlas sequence can describe the facial expression process, but the computational burden also increases. Finally, the images belonging to the same time point are used to initialize and iteratively refine the atlas.
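As a small illustration of this temporal alignment (a sketch under the stated assumption that each training clip runs from the neutral expression at its first frame to the apex at its last frame, with uniform spacing as one natural choice), each frame can be assigned a normalized time in [0, 1], and the N atlas time points can be chosen in the same interval:

import numpy as np

def normalized_frame_times(n_frames):
    # Frame j of an n-frame neutral-to-apex clip is assigned time t = j / (n - 1).
    return np.linspace(0.0, 1.0, n_frames)

def atlas_time_points(N):
    # N time points t_1, ..., t_N uniformly covering [0, 1] for atlas construction.
    return np.linspace(0.0, 1.0, N)

# e.g. a 9-frame clip gets times 0.0, 0.125, ..., 1.0; with N = 12 atlas time points in [0, 1].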
In this paper, the Sobolev norm [15] is used as the regularization function. $\lambda_{\phi^i}$ is the parameter that controls the weight of the regularization term, and $d(\cdot)$ is the distance metric defined on the non-Euclidean Riemannian manifold, expressed by:
$$ d(I_1, I_2)^2 = \min_{v} \left[ \int_0^1 \|v_s\|_U^2 \, ds + \frac{1}{\sigma^2} \|I_1(\varphi^{-1}) - I_2\|_2^2 \right], \qquad (4) $$
where $\varphi(\cdot)$ denotes the diffeomorphic transformation that matches image $I_1$ to $I_2$. In this paper, $\varphi(\cdot)$ is estimated based on the large deformation diffeomorphic metric mapping (LDDMM) framework [15]. $\|\cdot\|_U^2$ is the Sobolev norm, which controls the smoothness of the deformation field, and $\|\cdot\|_2$ denotes the $L_2$ norm. $v_s$ is the velocity field associated with $\varphi(\cdot)$. The relationship between $\varphi(\cdot)$ and $v_s$ is defined by:
$$ \varphi(\vec{x}) = \vec{x} + \int_0^1 v_s(\varphi_s(\vec{x})) \, ds, \qquad (5) $$
where $\varphi_s(\vec{x})$ is the displacement of pixel $\vec{x}$ at time $s \in [0, 1]$.
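Equation 5 can be approximated numerically by forward-Euler integration of the time-varying velocity field. The sketch below is illustrative only (it is not the LDDMM solver of [15]); it samples the velocity at the current particle positions with bilinear interpolation and accumulates the displacement over S time steps.

# Forward-Euler integration of Equation 5: phi(x) = x + integral_0^1 v_s(phi_s(x)) ds.
import numpy as np
from scipy.ndimage import map_coordinates

def integrate_velocity(v_seq):
    # v_seq: array of shape (S, 2, H, W) holding the velocity field v_s at S time steps
    # (components are displacements along the row and column axes, in pixels).
    S, _, H, W = v_seq.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    phi = np.stack([rows, cols]).astype(float)    # particle positions phi_s(x); phi_0 = x
    dt = 1.0 / S
    for s in range(S):
        # Sample v_s at the current particle positions (bilinear interpolation).
        v_r = map_coordinates(v_seq[s, 0], phi, order=1, mode='nearest')
        v_c = map_coordinates(v_seq[s, 1], phi, order=1, mode='nearest')
        phi = phi + dt * np.stack([v_r, v_c])
    return phi  # final map phi(x): one (row, col) position per pixel

def warp(image, phi):
    # Backward warp: output[x] = image[phi(x)], using bilinear interpolation.
    return map_coordinates(image, phi, order=1, mode='nearest')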
Equation 3 can be interpreted as follows. First, the subject-specific growth model $\phi^i$ is estimated for each subject i. Then, the subject-specific information is propagated to each time point $t \in T$ to construct the atlases.
Given a subject i with $n_i$ images in his/her facial expression image sequence, let $I_{t_j^i}$ denote the image taken at the jth time point of subject i. The growth model $\phi^i$ of subject i can be estimated by minimizing the energy function:
$$ J(\phi^i) = \int_0^1 \|v_s^i\|_U^2 \, ds + \frac{1}{\sigma^2} \sum_{j=0}^{n_i - 1} \big\|\phi^i_{(t_0^i \to t_j^i)}(I_{t_0^i}) - I_{t_j^i}\big\|_2^2. \qquad (6) $$
The first term of Equation 6 controls the smoothness of the growth model. In the second term, the growth model is applied to $I_{t_0^i}$ to warp it to the other time points $t_j^i$; the results are then compared with the existing observations $I_{t_j^i}$ at those time points. A smaller difference between the warped result and the observation indicates that the growth model describes the expression more accurately. With the LDDMM framework [15] used in this paper, the velocity field $v_s^i$ is non-stationary and varies over time. The variational gradient descent method in [15] is adopted to estimate the optimal velocity field, with the regularization constraint represented by the Sobolev norm. The Sobolev norm $\|v_s^i\|_U^2$ in Equation 6 is defined as $\|D v_s^i\|_2^2$, where D is a differential operator. The selection of the best operator D in diffeomorphic image registration is still an open question [37]. In this paper, the diffusive model is used as the differential operator [15], which restricts the velocity field to a space of Sobolev class two.
In Equation 6, the variables to be estimated are the displacements of each pixel in image $I_{t_0^i}$, which represent the growth model as a diffeomorphic deformation field. For each subject, there is one growth model to be estimated. Equation 6 estimates the growth model by considering the differences at all available time points, which is reflected by the summation in the second term. The minimum number of images $n_i$ in a subject-specific facial expression sequence needed to estimate the growth model is two; in this case, the problem reduces to a pairwise image registration problem. The larger the number of images available in the sequence, the more precisely the growth model describes the dynamic process of the expression. We use a Lagrange multiplier based optimization strategy similar to [15] to perform the minimization of Equation 6. The growth model $\phi^i$ is represented as a deformation field, based on which the facial expression images of subject i at any time point $t \in [0, 1]$, denoted as $\phi^i_{(t_0^i \to t)}(I_{t_0^i})$, are interpolated, as shown in Figure 4 (a).
Given the estimated $\phi^i$, we are able to construct the facial expression atlas at any time point of interest. Assume there are N time points of interest, $T = \{t_1, ..., t_N\}$, at which to construct the facial expression atlas.
Fig. 4. Illustration of the two main steps of atlas construction: (a) growth model estimation for each facial expression image sequence; (b) facial expression atlas construction from image sequences of the whole population based on longitudinal (i.e., temporal) atlas construction and sparse representation.
Based on the estimated growth model $\phi^i$, subject i's facial expression image can be interpolated at time point $t \in T$ with the operation $\phi^i_{(t_0^i \to t)}(I_{t_0^i})$. Moreover, the optimization of Equation 3 with respect to the variable $M_t$ becomes:
$$ J(M_t) = \sum_{t \in T} \sum_{i=1}^{C} d\big(M_t, \phi^i_{(t_0^i \to t)}(I_{t_0^i})\big)^2. \qquad (7) $$
The optimization of Equation 7 can be formulated as a
groupwise registration problem by estimating the Fréchet
mean on the Riemannian manifold defined by diffeomorphisms
[35]. That is, the atlas Mt at each time point t ∈ T is estimated
by a greedy iterative algorithm, summarized by Algorithm 1
[17].
Algorithm 1 Estimate atlas $M_t$ at time point t with the conventional groupwise registration strategy.
Input: Images $\phi^i_{(t_0^i \to t)}(I_{t_0^i})$ of each subject i (i = 1, ..., C), interpolated at time point t with the growth model $\phi^i$.
Output: Atlas $M_t$ constructed at time point t.
1. Initialize $M_t = \frac{1}{C}\sum_{i=1}^{C} \phi^i_{(t_0^i \to t)}(I_{t_0^i})$.
2. Initialize $\hat{I}_i = \phi^i_{(t_0^i \to t)}(I_{t_0^i})$.
3. FOR i = 1 to C: perform diffeomorphic image registration, i.e., register $\hat{I}_i$ to $M_t$ by minimizing the image metric defined in Equation 4 between $\hat{I}_i$ and $M_t$; denote the registered image as $R_i$. END FOR
4. Update $M_t = \frac{1}{C}\sum_{i=1}^{C} R_i$.
5. Repeat Steps 3 and 4 until $M_t$ converges.
6. Return $M_t$.
Taking the CK+ dynamic facial expression database as an example, the 'fear' longitudinal atlases constructed by Algorithm 1 are shown in Figure 5 (a). It can be observed that although the constructed atlases present most of the facial expression characteristics, they fail to include details of the expression (e.g., muscle movements around the cheeks and eyes). This is due to the updating rule of $M_t$ in Steps 3 and 4 of Algorithm 1: (1) align all the images to the $M_t$ obtained in the previous iteration, and (2) update $M_t$ by taking the average of the aligned images obtained in Step (1). Since $M_t$ is initialized to the average of all registered images, it is oversmoothed and lacks salient details. Furthermore, the alignment of all images to this fuzzy image in Step (1) leads to the same problem in the next iteration.
Fig. 5. Longitudinal atlases constructed at four time points for the 'fear' expression on the extended Cohn-Kanade database using (a) the conventional groupwise registration strategy, and the proposed sparse representation method with sparseness parameters (b) λs = 0.01 and (c) λs = 0.1, respectively. For comparison purposes, significant differences in (a) and (b) are highlighted by green circles.
Therefore, to preserve salient expression details during atlas construction and provide a high-quality atlas, we present a new atlas construction scheme based on sparse representation, owing to its saliency and robustness [38]. Given the C registered subject images $R_i$ (i = 1, ..., C) obtained by Step 3 in Algorithm 1, the atlas $M_t$ is estimated based on the sparse representation of $R_i$ by minimizing:
$$ E(\vec{\delta}) = \frac{1}{2}\|R\vec{\delta} - \vec{m}_t\|_2^2 + \lambda_s \|\vec{\delta}\|_1, \qquad (8) $$
where $R = [\vec{r}_1, ..., \vec{r}_C]$, $\vec{r}_i$ (i = 1, ..., C) is a column vector corresponding to the vectorization of $R_i$, and $\vec{m}_t$ is the
vectorization of $M_t$. $\|\cdot\|_1$ is the $L_1$ norm and $\lambda_s$ is the parameter that controls the sparseness degree of the reconstruction coefficient vector $\vec{\delta}$.
The optimization of Equation 8 is a LASSO sparse representation problem [39], which can be addressed by Nesterov's method [40]. With the optimal solution of Equation 8, denoted as $\vec{\delta}_{opt}$, the atlas $M_t$ can be updated to $\vec{m}_t = R\vec{\delta}_{opt}$. The initialization of $M_t$ is also improved by Equation 8, where the matrix R is the collection of $\phi^i_{(t_0^i \to t)}(I_{t_0^i})$ (i = 1, ..., C). This procedure is summarized by Algorithm 2.
Algorithm 2 Estimate atlas $M_t$ at time point t with groupwise registration and sparse representation.
Input: Images $\phi^i_{(t_0^i \to t)}(I_{t_0^i})$ of each subject i (i = 1, ..., C), interpolated at time point t with the growth model $\phi^i$.
Output: Atlas $M_t$ constructed at time point t.
1. Initialize $M_t = \frac{1}{C}\sum_{i=1}^{C} \phi^i_{(t_0^i \to t)}(I_{t_0^i})$.
2. Refine the initialization of $M_t$ based on the sparse representation of $\phi^i_{(t_0^i \to t)}(I_{t_0^i})$ expressed by Equation 8.
3. Initialize $\hat{I}_i = \phi^i_{(t_0^i \to t)}(I_{t_0^i})$.
4. FOR i = 1 to C: perform diffeomorphic image registration, i.e., register $\hat{I}_i$ to $M_t$ by minimizing the image metric defined in Equation 4 between $\hat{I}_i$ and $M_t$; denote the registered image as $R_i$. END FOR
5. Update $M_t$ by optimizing Equation 8 with the sparse representation of $R_i$.
6. Repeat Steps 4 and 5 until $M_t$ converges.
7. Return $M_t$.
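As an illustration of the sparse updates in Steps 2 and 5 of Algorithm 2, the following sketch solves the LASSO problem of Equation 8 with scikit-learn's coordinate-descent Lasso as a stand-in for Nesterov's method [40]; the function and variable names are our own, and the alpha rescaling accounts for the 1/(2n) factor in scikit-learn's objective.

# Sparse atlas update (Equation 8): m_t <- R @ delta_opt, where
#   delta_opt = argmin_delta 0.5 * ||R delta - m_t||_2^2 + lambda_s * ||delta||_1.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_atlas_update(registered_images, current_atlas, lambda_s=0.01):
    # registered_images: list of C aligned images R_i (H x W); current_atlas: H x W.
    R = np.stack([img.ravel() for img in registered_images], axis=1)  # pixels x C
    m_t = current_atlas.ravel()
    # scikit-learn minimizes (1/(2*n_samples))*||y - Xw||^2 + alpha*||w||_1,
    # so divide lambda_s by the number of pixels to match Equation 8.
    lasso = Lasso(alpha=lambda_s / R.shape[0], fit_intercept=False, max_iter=10000)
    lasso.fit(R, m_t)
    delta_opt = lasso.coef_                                # sparse reconstruction coefficients
    return (R @ delta_opt).reshape(current_atlas.shape)    # updated atlas M_t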
To compare the performance of Algorithm 1 in [17] and Algorithm 2, the longitudinal atlases of the 'fear' expression constructed by Algorithm 2 on the CK+ database are shown in Figure 5 (b), with sparseness parameter λs = 0.01 in Equation 8. It can be observed that the atlases constructed by the proposed sparse representation method preserve more anatomical details, especially in the areas around the cheeks and eyes, which are critical for facial expression recognition. It should be noted that there is a tradeoff between the data matching term and the sparseness term in Equation 8. As λs increases, the sparseness term begins to dominate the data matching term, which affects the quality of the constructed atlas. Figure 5 (c) shows atlases constructed with sparseness parameter λs = 0.1 in Equation 8. It can be observed that although the constructed atlases become even sharper than those shown in Figure 5 (b), some facial features such as the mouth are distorted in an unrealistic manner. In this paper, we have empirically found that λs = 0.01 achieves a good balance between the data matching and sparseness terms; this setting is therefore used throughout all experiments in this paper.
B. Recognition of Query Sequences
In this paper, a new recognition scheme based on image
appearance and expression evolution information is proposed,
as shown in Figure 6.
Without loss of generality, assume that there are K different expressions of interest. Let N denote the number of time points used to build the longitudinal facial expression atlas sequence, as in Section III-A. The larger N is, the more precisely the atlas sequence describes the dynamic facial expression, but the computational burden also increases. We denote the N time points as $T = \{t_1, ..., t_N\}$ and $M_t^k$ as the atlas of the kth (k = 1, ..., K) facial expression at time point t ($t \in T$).
In the sparse atlas construction stage, training image sequences can be constrained or pre-segmented to ensure that they begin with the neutral expression and gradually reach the apex expression. In this way, the constructed longitudinal atlases of different expressions also follow the same trend, as illustrated in Figure 5. However, in the recognition stage, a new query image sequence does not necessarily begin with the neutral expression and end with the apex expression. Moreover, abrupt transitions between two expressions may be observed within one sequence.
Given a new facial expression sequence that consists of $n_{new}$ images $I_i^{new}$ ($i = 0, ..., n_{new} - 1$), correct temporal correspondences should be established between the constructed atlas sequences and the query image sequence. This is because the facial expression sequence to be classified does not necessarily follow the same temporal correspondence as the constructed longitudinal atlas. First, we determine the temporal correspondence of the first atlas image of each facial expression k in the query image sequence, which is described by:
$$ b = \arg\min_{j} \big\{ d(M_{t_1}^k, I_j^{new})^2 \big\}, \qquad (9) $$
where $d(\cdot)$ is the distance metric defined in Equation 4 based on diffeomorphisms. The physical meaning of Equation 9 is: (1) perform diffeomorphic image registration between $M_{t_1}^k$ and each image $I_j^{new}$ ($j = 1, ..., n_{new}$) in the query sequence; and (2) take the time point that gives the smallest registration error between $M_{t_1}^k$ and the query image as the temporal correspondence of $M_{t_1}^k$. Similarly, we can determine the temporal correspondence of each atlas of each expression in the query image sequence. Denote by e the index of the temporal correspondence in the query image sequence to the last image $M_{t_N}^k$ in the atlas sequence.
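A minimal sketch of this correspondence search is given below; diffeo_distance is a hypothetical placeholder for the diffeomorphism-based metric d(.,.) of Equation 4 and is approximated here by a plain SSD for illustration.

# Temporal correspondence (Equation 9): find the query frames that best match the
# first and last atlas images of an expression.
import numpy as np

def diffeo_distance(atlas_image, query_image):
    # Placeholder for the registration-based metric d of Equation 4 (plain SSD here).
    return float(np.sum((atlas_image - query_image) ** 2))

def temporal_correspondence(atlas_sequence, query_sequence):
    # atlas_sequence: [M_t1, ..., M_tN]; query_sequence: [I_0, ..., I_{n_new - 1}].
    dists_first = [diffeo_distance(atlas_sequence[0], frame) for frame in query_sequence]
    dists_last = [diffeo_distance(atlas_sequence[-1], frame) for frame in query_sequence]
    b = int(np.argmin(dists_first))   # query index matching the first atlas image
    e = int(np.argmin(dists_last))    # query index matching the last atlas image
    return b, e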
It should be noted that for query sequences with multiple
expression transitions, only one neutral → onset → apex clip
will be detected and used to establish the correspondence to
the atlas sequence. Intuitively, this single neutral → onset →
apex clip should already contain sufficient information for
accurate expression recognition. This will be further justified
in the experiments conducted on the MMI, FERA, and AFEW
databases in which multiple facial expression transitions exist
and expression sequences are obtained under real-life conditions.
Then, we construct the growth model $\phi^{new}$ for the query image sequence and interpolate facial expression images at time points $t \in \{t_b, t_{b+1}, ..., t_e\}$ by the operation $\phi^{new}_{(t_b \to t)}(I_b^{new})$, where b and e are the indices of the temporal correspondences in the query image sequence to the first and last images in the atlas sequence, respectively.
Fig. 6. Both the image appearance information and the dynamic evolution information are used to guide the recognition. Image appearance similarity is measured with respect to the registration error between the query image sequence and the atlas sequence. The temporal process is compared by calculating the deformation field reconstruction error.
With the established temporal correspondence between the query image sequence and the longitudinal atlas, we can register the interpolated facial expression images of the query image sequence to their corresponding images in the atlas sequence. The registration errors are compared to determine the expression type [17]. This is described by:
$$ L_{opt} = \arg\min_{L} \left\{ \frac{\sum_{i=0}^{e-b} d\big(M_{t_{1+i}}^L, \phi^{new}_{(t_b \to t_{b+i})}(I_b^{new})\big)^2}{e - b + 1} \right\}, \qquad (10) $$
where $L_{opt} \in \{1, ..., K\}$ is the estimated facial expression label for the query image sequence.
The dynamic process provides complementary information to the image appearance to guide recognition. Given the growth model $\phi^i$ (i = 1, ..., C), the deformation field $\phi^i_{(t_j \to t_{j+1})}$ (j = b, ..., e - 1) that represents the temporal evolution from time point $t_j$ to $t_{j+1}$ can be calculated. $\phi^i_{(t_j \to t_{j+1})}$ is represented as a $2 \times h \times w$ dimensional vector $\vec{F}^i_{t_j \to t_{j+1}}$, where h and w are the height and width of each facial expression image (i.e., there are $h \times w$ pixels), and each pixel's displacement is determined by its movements in the horizontal and vertical directions. Similarly, for the new image sequence, we can obtain $\vec{F}^{new}_{t_j \to t_{j+1}}$ (j = b, ..., e - 1).
For each expression k (k = 1, ..., K), the training image sequences are used to construct a dictionary $D^k_{t_j \to t_{j+1}}$, which represents the temporal evolution of this expression from time point $t_j$ to $t_{j+1}$ (j = b, ..., e - 1), denoted as $D^k_{t_j \to t_{j+1}} = \big[\vec{F}^1_{t_j \to t_{j+1}}, ..., \vec{F}^C_{t_j \to t_{j+1}}\big]$.
We reconstruct $\vec{F}^{new}_{t_j \to t_{j+1}}$ from the basis (i.e., the columns) of $D^k_{t_j \to t_{j+1}}$ using sparse representation [38] for each expression type k, as shown in Figure 7. The accuracy of the reconstruction indicates the similarity between the temporal processes, which serves as an important clue to determine the expression type of the new image sequence.
Therefore, the overall energy function that drives the recognition is described by:
$$ L_{opt} = \arg\min_{L} \left\{ \frac{\sum_{i=0}^{e-b} d\big(M_{t_{1+i}}^L, \phi^{new}_{(t_b \to t_{b+i})}(I_b^{new})\big)^2}{e - b + 1} + \beta \cdot \sum_{j=b}^{e-1} \big\|\vec{F}^{new}_{t_j \to t_{j+1}} - D^L_{t_j \to t_{j+1}} \cdot \vec{\alpha}^{opt}_{t_j, L}\big\|_2^2 \right\}, \qquad (11) $$
where $\beta$ is the parameter that controls the weight of the temporal information, and $\vec{\alpha}^{opt}_{t_j, L}$ is estimated by:
$$ \vec{\alpha}^{opt}_{t_j, L} = \arg\min_{\vec{\alpha}} \frac{1}{2}\big\|\vec{F}^{new}_{t_j \to t_{j+1}} - D^L_{t_j \to t_{j+1}} \cdot \vec{\alpha}\big\|_2^2 + \lambda_{\vec{\alpha}} \|\vec{\alpha}\|_1. \qquad (12) $$
The optimization of Equation 12 can be performed by
Nesterov’s method [40] as it is a LASSO sparse representation
problem [39].
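Putting Equations 10-12 together, the decision rule can be sketched as follows. The appearance distance and the per-expression atlases, dictionaries and query deformation fields are supplied by the caller (hypothetical placeholders for the diffeomorphic metric of Equation 4 and the growth-model deformation fields), and scikit-learn's Lasso again stands in for Nesterov's method when solving Equation 12.

# Recognition by combining appearance (first term of Eq. 11) and temporal evolution
# (second term of Eq. 11, with alpha from the LASSO of Eq. 12).
import numpy as np
from sklearn.linear_model import Lasso

def temporal_term(F_new, D, lambda_alpha=0.01):
    # F_new: query deformation field between two consecutive time points, flattened to
    # a (2*h*w,) vector; D: dictionary of training fields, shape (2*h*w, C).
    lasso = Lasso(alpha=lambda_alpha / D.shape[0], fit_intercept=False, max_iter=10000)
    lasso.fit(D, F_new)                                  # Equation 12 (stand-in solver)
    alpha_opt = lasso.coef_
    return float(np.sum((F_new - D @ alpha_opt) ** 2))   # reconstruction error

def classify(query_warps, query_fields, atlases, dictionaries, appearance_dist, beta=0.5):
    # query_warps[i]: query image warped to atlas time point t_{b+i};
    # query_fields[j]: query deformation field between consecutive time points;
    # atlases[L], dictionaries[L]: per-expression atlas images and dictionaries.
    energies = {}
    for L in atlases:
        appearance = np.mean([appearance_dist(atlases[L][i], query_warps[i]) ** 2
                              for i in range(len(query_warps))])       # Eq. 10 term
        temporal = sum(temporal_term(query_fields[j], dictionaries[L][j])
                       for j in range(len(query_fields)))              # Eq. 11 term
        energies[L] = appearance + beta * temporal
    return min(energies, key=energies.get)  # L_opt: label with the smallest energy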
IV. EXPERIMENTS
To evaluate the performance of the proposed method, it has been extensively tested on four benchmark databases for dynamic facial expression recognition: the extended Cohn-Kanade [41], MMI [42], FERA [43] and AFEW databases. The proposed method has also been evaluated on one spontaneous expression database: the UNBC-McMaster database [44].
A. Experiments on the Extended Cohn-Kanade Database
The extended Cohn-Kanade (CK+) database [41] contains
593 facial expression sequences from 123 subjects. Similar
to [41], 325 sequences from 118 subjects are selected. Each
sequence is categorized into one of seven expressions: anger, contempt, disgust, fear, happiness, sadness and surprise.
Training image sequences
Subject 1
Subject 2
Subject C
New Subject
…
…
…
…
…
Sparse
Representation
…
Fig. 7. Illustration of reconstructing deformation field between consecutive time points by sparse representation for a new subject from the deformation field
dictionary Dktj →tj+1 learnt from training images (Subject 1 to C).
Each image in the facial expression sequences is digitized to a resolution of 240 × 210. For each selected sequence, we follow
the same preprocessing step as in [7]. Specifically, the eye
positions in the first frame of each sequence were manually
labeled. These positions were used to determine the facial area
for the whole sequence and to normalize facial images. Figure
8 shows some examples from the CK+ database.
In all experiments, the following parameter settings are used for our method: N = 12 as the number of time points of interest to construct the longitudinal atlas, as we found that N = 12 is a good tradeoff between recognition accuracy and computational burden; λφi = 0.02 as the parameter that controls the smoothness of the diffeomorphic growth model for each subject i; λs = 0.01 as the sparseness parameter for atlas construction in Equation 8; β = 0.5 as the weighting parameter associated with the temporal evolution information in Equation 11; and λα = 0.01 as the sparse representation parameter of the growth model for the query image sequence in Equation 12.
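For reference, these settings can be collected in one place; the snippet below is merely an illustrative grouping of the values stated above, not a configuration file from the authors.

# Parameter settings used in the experiments (as stated in the text).
PARAMS = {
    "N": 12,              # number of atlas time points
    "lambda_phi": 0.02,   # smoothness weight of the diffeomorphic growth model
    "lambda_s": 0.01,     # sparseness parameter for atlas construction (Eq. 8)
    "beta": 0.5,          # weight of the temporal evolution term (Eq. 11)
    "lambda_alpha": 0.01  # sparse representation parameter for the query (Eq. 12)
}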
Fig. 8. Images from the CK+ database.
Our method is evaluated in a leave-one-subject-out manner
similar to [41]. Figure 9 shows the constructed longitudinal
atlases of seven different expressions on the CK+ database. It
can be visually observed that the constructed atlases are able to
capture the overall trend of facial expression evolution. Table
I shows the confusion matrix for the CK+ database. It can be
observed that high recognition accuracies are obtained by the
proposed method (i.e., the average recognition rate of each
expression is higher than 90%).
Figure 10 shows the recognition rates of different expressions obtained by: the sparse atlas construction with new
recognition scheme, sparse atlas construction with recognition
scheme in [17] (i.e., use image appearance information only),
conventional atlas construction with recognition scheme in
[17], atlas construction with the collection flow algorithm [45],
[46], and atlas construction with the standard RASL algorithm
[47]. For the collection flow algorithm [45], [46], we adopted
the flow estimation algorithm from Ce Liu’s implementation
[48] similar to [45], [46]. For the RASL algorithm, we
followed the same settings as in [47]. The affine transformation
model is used which is the most complex transformation model
supported by the standard RASL package 1 . It can be observed
that the sparse atlas construction with the new recognition
scheme consistently achieves the highest recognition rates,
which is consistent with the qualitative results obtained in
Section III-A. The main reason for the improvement is that the
enforced sparsity constraint can preserve salient information
that discriminates different expressions and can simultaneously
suppress subject-specific facial shape variations. In addition,
it is demonstrated that the recognition performance can be further improved by referring to both the image appearance information in the spatial domain and the temporal information.
It is also observed that the recognition accuracies obtained by using the RASL algorithm are slightly worse than or comparable to those of the conventional atlas construction scheme. There are two reasons: First, as long as the mean operation is used to construct the atlas during the groupwise registration process, subtle and important anatomical details are inevitably lost, which leads to inferior recognition accuracies. Second, the global affine transformation cannot model deformable facial muscle movements sufficiently; therefore, the corresponding recognition accuracies are worse than those obtained with diffeomorphic transformations. The collection flow algorithm achieves slightly higher recognition accuracies than the conventional groupwise registration scheme, but its accuracies are slightly inferior to those of the sparse representation based atlas construction scheme. The reason is probably that the sparse representation based atlas construction scheme explicitly enforces the sparseness constraint in the energy function to build sharp and salient atlases.
1 http://perception.csl.illinois.edu/matrix-rank/rasl.html
Fig. 9. The longitudinal facial expression atlas constructed on the CK+ database at 12 time points with respect to seven expressions: ’anger’ (the first row),
’contempt’ (the second row), ’disgust’ (the third row), ’fear’ (the fourth row), ’happiness’ (the fifth row), ’sadness’ (the sixth row) and ’surprise’ (the seventh
row).
TABLE I
CONFUSION MATRIX OF THE PROPOSED METHOD FOR CK+ DATABASE (VALUES IN %).

            Anger  Contempt  Disgust  Fear  Happiness  Sadness  Surprise
Anger        96.1     0         0      1.5      0        2.4       0
Contempt      0      91.8       7.3    0        0        0.9       0
Disgust       0       0.8      98.8    0        0.4      0         0
Fear          0       0         0     95.5      3.4      1.1       0
Happiness     0       0         0      0.8     99.2      0         0
Sadness       2.2     0         0      1.0      0       96.8       0
Surprise      0       0         0      0        0        0.7      99.3
Fig. 10. The average recognition rates of seven different facial expressions
on the CK+ database by using different schemes. “RASL” is the standard
RASL algorithm with affine transformation model, “Collection Flow” is
the collection flow algorithm, “Conventional” is the conventional groupwise
registration method used to construct atlas, and “Sparse” is the proposed sparse
representation scheme used to construct atlas. “A” denotes image appearance
information, and “T” denotes temporal evolution information.
To further understand the roles that the image appearance and temporal evolution information play in the recognition process, we plot the average recognition accuracies of the proposed method for different values of β in Equation 11 in Figure 11. β controls the weighting of the image appearance and
temporal evolution information in the recognition step. The
smaller the value of β, the more image appearance information
the recognition relies on, and vice versa. It can be observed
from Figure 11 that when β is set to 0 (i.e., relies on the
image appearance information only), the recognition accuracy
drops to 92.4%. As β increases, temporal evolution information becomes more important and the recognition accuracy
increases to 97.2% when β = 0.5. When β further increases,
the temporal evolution information begins to dominate image
appearance information and recognition accuracy tends to
decline. This implies that both the image appearance information and the temporal evolution information play important roles in the recognition step, as they are complementary to each other; it is therefore beneficial to consider both of them for the recognition performance.
Fig. 12. Average recognition accuracies (in %) obtained by the proposed method with different numbers of time points N used to construct the atlas sequences on the CK+ database.
Fig. 11. Average recognition accuracies obtained by the proposed method with different values of β in Equation 11 on the CK+ database.
Fig. 13. Average recognition rates of different approaches on the CK+ database: our method, Guo's [17], Zhao's [7], ITBN [50], HMM [50], and Gizatdinova's [49].
Another important parameter in our method is the number of time points N used to construct the atlas sequence. Intuitively, the
larger the value of N , the more precisely the atlas sequence
can represent the physical facial expression evolution process,
while computational burden will also increase. To study the
effects of different values of N , average recognition accuracies
with respect to different values of N are shown in Figure
12. It can be seen that when N is small (e.g., N = 4),
inferior recognition accuracies are obtained because the atlas
sequence cannot describe the expression evolution process
sufficiently. As N increases, the representation power of the
atlas sequence becomes stronger and higher recognition accuracies are obtained. For instance, when N = 12, satisfactory
recognition accuracies are obtained (i.e., 97.2%). However,
when N further increases, the recognition accuracy begins
to saturate because the atlas sequence has almost reached
its maximum description capacity and the gain in recognition
accuracies becomes marginal.
Figure 13 provides further comparisons on the CK+ database between our method and several state-of-the-art dynamic facial expression methods: those proposed by Guo et al. [17], Zhao and Pietikäinen [7], and Gizatdinova and Surakka [49], and the HMM and ITBN models proposed by Wang et al. [50]. Although the experimental protocols of the compared methods are not exactly the same, due to different numbers of sequences and cross-validation setups, the effectiveness of our method is still indicated by its recognition rate being the highest among the compared methods.
Our method takes 21.7 minutes in the atlas construction stage and, on average, 1.6 seconds in the recognition stage for each query sequence (Matlab, 4-core 2.5 GHz processor and 6 GB RAM).
B. Experiments on the MMI Database
Our method is evaluated on the MMI database [42], which
is known as one of the most challenging facial expression
recognition databases due to its large inter-subject variations.
For the MMI database, 175 facial expression sequences from
different subjects were selected. The selection criterion is that each sequence can be labeled as one of the six basic emotions: anger, disgust, fear, happiness, sadness and surprise. The facial expression images in each sequence were digitized at a resolution of 720 × 576. Some sample images from the MMI database are shown in Figure 14. Each facial image was normalized based on eye coordinates, similar to the processing on the CK+ database.
To evaluate our method on the MMI database, 10-fold cross validation is adopted, similar to [17]. The
confusion matrix of the proposed method is listed in Table II.
It can be observed that this method achieves high recognition
rates for different expressions (i.e., all above 90%).
Fig. 14. Sample images from the MMI database.
TABLE II
CONFUSION MATRIX OF THE PROPOSED METHOD ON MMI DATABASE (VALUES IN %).

            Anger  Disgust  Fear  Happiness  Sadness  Surprise
Anger        95.6    1.2     0       0         3.2       0
Disgust       0.2   97.8     0       2.0       0         0
Fear          0      0.5    96.4     3.1       0         0
Happiness     0      0       1.8    98.2       0         0
Sadness       4.8    0       0.8     0        94.4       0
Surprise      0      0       2.5     0.6       0        96.9

To investigate the performance of the sparse representation based atlas construction, the average recognition rates of different expressions obtained with and without using sparse representation in atlas construction are shown in Figure 15. Figure 15 also shows the average recognition rates obtained with and without temporal information, which indicates the importance of incorporating image appearance with temporal information in the recognition stage.
It is also interesting to study the robustness of our method with respect to the length of the query sequence. The most challenging case is when the query sequence contains only one image and no temporal information is available. The proposed method is evaluated under this condition: the image selected to guide the recognition is the one with temporal correspondence to the last image in the atlas sequence (i.e., the image with the expression at its apex). Figure 17 shows the average recognition
rates. For comparison purposes, the recognition rates obtained
by using all images in the query sequence are also shown. It
can be observed that recognition rates resulting from a single
input image drop consistently, which reflects the significance
of temporal information in the recognition task. On the other
hand, the proposed method still achieves acceptable recognition accuracy (i.e., on average 89.8%) even in this challenging
case.
Fig. 15. The average recognition rates of six different facial expressions on
MMI database with different schemes of the proposed method.
Fig. 17. The average recognition rates of six different expressions on MMI
database under conditions of single input image and full sequence.
It can be observed that recognition rates of sparse representation based atlas construction are consistently higher than
those obtained by the conventional scheme in [17]. Moreover,
the recognition rates can be further improved by incorporating
temporal information with image appearance information.
We also study the impact of training set size. Figure 16
shows the recognition rates obtained by the proposed method
with different numbers of training samples. The horizontal axis is the number of 'folds' serving as the training set. For
comparison purposes, the results of Guo’s method in [17] are
also computed and shown.
It can be observed from Figure 16 that the recognition rates of our method quickly converge to stable values as the size of the training set increases. Specifically, for all expressions, the proposed method achieves recognition rates of more than 90% when using 4 folds as the training set and the remaining 6 folds as the testing set. It is also shown that the proposed method consistently outperforms Guo's method [17].
C. Experiments on the FERA Database
To further investigate the robustness of the proposed
method, it is evaluated on the facial expression recognition and
analysis challenge (FERA2011) data: the GEMEP-FERA dataset [43]. The FERA dataset consists of ten different subjects displaying five emotion categories: anger, fear, joy, relief and
sadness. FERA is one of the most challenging dynamic
facial expression recognition databases. First, the input facial
expression sequence does not necessarily start with neutral and
end with apex expressions. Second, there are various head
Fig. 16. The average recognition rates of different expressions for the proposed method with different training set sizes on the MMI database: (a) Anger, (b) Disgust, (c) Fear, (d) Happiness, (e) Sadness, (f) Surprise. The recognition rates of Guo's method in [17] are calculated for comparison. In each panel, the horizontal axis is the number of folds used as training sets and the vertical axis is the average recognition rate (in %).
movements and unpredictable facial occlusions. Some sample images are shown in Figure 18.
The FERA training set contains 155 image sequences from seven subjects, and the testing set contains 134 image sequences from six subjects. Three of the subjects in the testing set are not present in the training set. To evaluate the proposed method, we follow the standard FERA protocol [43] and construct atlases from the training set. Then, the estimated facial expression labels of the testing set are sent to the FERA organizers to calculate scores.
Fig. 18. Sample images from the FERA database.
In this paper, we adopted similar preprocessing steps as
in [43] for the purposes of fair comparison. Specifically, the
Viola and Jones face detector [51] was first used to extract
the facial region. To determine the eye locations in the facial image, cascaded classifiers trained to detect the left and right eyes, as implemented in OpenCV, are applied. A normalization is then performed based on the detected eye locations.
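This preprocessing can be approximated with standard OpenCV components, as sketched below. The cascade files, the target eye positions and the output size are illustrative choices, not the exact parameters of [43].

import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def normalize_face(gray, out_size=128, eye_row=0.35, eye_dist=0.40):
    # Detect the facial region and both eyes, then warp the image with a
    # similarity transform so the eye centers land on fixed positions.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])       # largest face
    eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w],
                                        scaleFactor=1.1, minNeighbors=5)
    if len(eyes) < 2:
        return None
    # two eye detections inside the face region, ordered left to right
    (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = sorted(eyes[:2], key=lambda e: e[0])
    src = np.float32([[x + ex1 + ew1 / 2, y + ey1 + eh1 / 2],
                      [x + ex2 + ew2 / 2, y + ey2 + eh2 / 2]])
    dst = np.float32([[out_size * (0.5 - eye_dist / 2), out_size * eye_row],
                      [out_size * (0.5 + eye_dist / 2), out_size * eye_row]])
    M, _ = cv2.estimateAffinePartial2D(src.reshape(-1, 1, 2), dst.reshape(-1, 1, 2))
    if M is None:
        return None
    return cv2.warpAffine(gray, M, (out_size, out_size))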
The person-specific and person-independent recognition rates obtained by our method are listed in Table III. It achieves promising recognition rates in both settings. It is also interesting to observe that for the 'Anger' and 'Joy' expressions, the proposed method achieves higher recognition accuracies under the person-independent setting than under the person-specific setting. There are two reasons for this. First, one main strength of the proposed method is its capability of building unbiased facial expression atlases to guide the recognition process. The facial shape variations due to inter-person differences can therefore be suppressed, and the proposed method achieves robust recognition performance under the person-independent condition. Second, in the challenging FERA database, intra-person expression variations are not necessarily smaller than inter-person expression variations, as illustrated in Figure 19. In Figure 19, each row shows images of the same facial expression sequence with expression type 'Joy' from the training set of the FERA database. The second and third rows are sequences of the same person, while the first row is a sequence of another person. It can be seen that facial features such as the eyes, brows, and mouth are quite similar between the sequences in the first and second rows even though they come from different persons. On the other hand, there are large facial feature variations between the sequences in the second and third rows even though they come from the same person and show the same expression type 'Joy'.
Fig. 19. An example that shows one of the challenging properties of the FERA database, where the intra-person expression variations are not necessarily smaller than the inter-person expression variations. Each row shows an expression sequence from the FERA training set with expression type 'Joy'. The second and third rows are sequences of the same person, while the first row is a sequence obtained from another person. It is visually observed that intra-person expression variations are larger than inter-person expression variations in this case.
TABLE III
RECOGNITION RATES OBTAINED BY THE PROPOSED METHOD ON FERA DATABASE.

Expression   Person-independent   Person-specific   Overall
Anger        1.00                 0.923             0.963
Fear         0.867                1.00              0.920
Joy          1.00                 0.727             0.903
Relief       0.563                0.900             0.692
Sadness      0.933                1.00              0.960
Average      0.873                0.910             0.888
The overall recognition rate obtained by our method is
0.888. It is higher than that of other methods reported in [43],
where the highest overall recognition rate is 0.838 achieved
by Yang and Bhanu [52], and it is also significantly higher
than that of the baseline method (i.e., 0.560) [43].
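As a consistency check (assuming the 'Average' row of Table III is the unweighted mean of the five per-class rates), the reported averages follow directly from the table:

\[
\tfrac{1}{5}(1.00+0.867+1.00+0.563+0.933)\approx 0.873,\qquad
\tfrac{1}{5}(0.923+1.00+0.727+0.900+1.00)=0.910,\qquad
\tfrac{1}{5}(0.963+0.920+0.903+0.692+0.960)\approx 0.888.
\]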
D. Experiments on the AFEW Database
The proposed method has also been evaluated on the Acted Facial Expressions in the Wild (AFEW) database [53] to study its performance when facial expression sequences are captured under real-life, in-the-wild conditions. The AFEW database was collected from movies and depicts or simulates spontaneous expressions in uncontrolled environments. Some samples are shown in Figure 20. The task is to classify each sequence into one of the seven basic expression types: neutral (NE), happiness (HA), sadness (SA), disgust (DI), fear (FE), anger (AN), and surprise (SUR). In this paper, we follow the protocol of the Emotion Recognition in the Wild Challenge 2014 (i.e., EmotiW 2014) [53] to evaluate the proposed method. The training set defined in the EmotiW 2014 protocol is used to build the atlas sequences. The recognition accuracies on the validation set are listed in Table IV, similarly to [54], [55].
Fig. 20. Sample images from the AFEW database.
TABLE IV
RECOGNITION RATES OBTAINED BY THE PROPOSED METHOD ON THE VALIDATION SET OF THE AFEW DATABASE AND COMPARISONS WITH OTHER STATE-OF-THE-ART RECOGNITION METHODS.

Method                            Recognition Accuracies (in %)
Baseline [56]                     34.4
Multiple Kernel Learning [57]     40.2
Improved STLMBP [58]              45.8
Multiple Kernel + Manifold [59]   48.5
Our Method                        48.3
From Table IV, it can be seen that our method achieves a significantly higher recognition accuracy than the baseline algorithm [56] (i.e., LBP-TOP [7]) under the EmotiW 2014 protocol. Moreover, our method achieves a recognition accuracy comparable to that of the winner of the EmotiW 2014 challenge (i.e., Multiple Kernel + Manifold [59]), with a difference of only 0.2 percentage points, and outperforms the other state-of-the-art methods under comparison. This indicates the robustness of the proposed method under in-the-wild conditions.

E. Experiments on the UNBC Database
Our method is evaluated on the UNBC-McMaster shoulder
pain expression archive database [44] for spontaneous pain
expression monitoring. It consists of 200 dynamic facial
expression sequences from 25 subjects with 48,398 frames,
where each subject was self-identified as having a problem
with shoulder pain. Each sequence was obtained when the
subject was instructed by physiotherapists to move his/her
limb as far as possible. For each sequence, observers who had
considerable training in identification of pain expression rated
it on a 6-point scale that ranged from 0 (no pain) to 5 (strong
pain). Each frame was manually FACS coded, and 66-point
active appearance model (AAM) landmarks were provided
[44]. Figure 21 shows some images from UNBC-McMaster
database.
Fig. 21. Images from the UNBC-McMaster database.
We follow the same experimental settings as in [44], where
leave-one-subject-out cross validation was adopted. Referring
to observers’ 6-point scale ratings, all sequences were grouped
into three classes [44]: 0-1 as class one, 2-3 as class two
and 4-5 as class three. Similar to [44], rough alignment and
initialization are performed for our method with 66 AAM
landmarks provided for each frame. Figure 22 shows the
constructed facial expression atlases for the different classes. It can be observed that the constructed atlases successfully capture subtle and important details, especially in areas that reflect the degree of pain, such as the eyes and mouth.
Fig. 22. The longitudinal atlases of class 1 (observer score 0-1), class 2 (observer score 2-3) and class 3 (observer score 4-5), listed in the first, second and third rows, respectively.
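A minimal sketch of this evaluation setup (the three-class grouping of the observer ratings and the leave-one-subject-out splitting) is given below; classify_with_atlases is again a hypothetical placeholder for the atlas-based pipeline, and per-sequence observer scores and subject identifiers are assumed to be available as arrays.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def group_rating(score):
    # Map the 6-point observer rating (0-5) to the three classes used in [44].
    return 1 if score <= 1 else (2 if score <= 3 else 3)

def leave_one_subject_out_accuracy(sequences, scores, subjects, classify_with_atlases):
    labels = np.array([group_rating(s) for s in scores])
    subjects = np.asarray(subjects)
    correct, total = 0, 0
    for train_idx, test_idx in LeaveOneGroupOut().split(sequences, labels, groups=subjects):
        pred = np.asarray(classify_with_atlases(
            [sequences[i] for i in train_idx], labels[train_idx],
            [sequences[i] for i in test_idx]))
        correct += int(np.sum(pred == labels[test_idx]))
        total += len(test_idx)
    return correct / total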
The classification accuracies of our method are compared with those of Lucey's method [44], as shown in Figure 23. The classification accuracies for class 1, class 2 and class 3 obtained by our method are 88%, 63% and 59%, respectively. It can be seen that a significant improvement is achieved compared to the accuracies obtained by Lucey's method: 75%, 38% and 47% [44]. This demonstrates the effectiveness of the proposed method in characterizing spontaneous expressions.
Fig. 23. The average recognition rates of Lucey's method [44] and the proposed method on the UNBC-McMaster database.
The proposed method is also compared with a state-of-the-art pain classification method [60]. For the purposes of a fair comparison, we adopted the same protocol as in [60]: pain class labels are binarized into 'pain' and 'no pain' by defining instances with pain intensities larger than or equal to 3 as
the positive class (pain) and pain intensities equal to 0 as the negative class (no pain); intermediate pain intensities of 1 and 2 are omitted. The average classification accuracy obtained by the proposed method under this setting is 84.8%, which is comparable to, and slightly higher than, the 83.7% reported in [60]. It should be noted that the experimental settings may not be exactly the same as in [60]; nevertheless, the result supports the effectiveness of the proposed method.
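The label binarization used for this comparison can be written compactly as follows (assuming one pain-intensity value per instance):

import numpy as np

def binarize_pain_labels(intensities):
    # Protocol of [60]: intensity >= 3 -> 1 (pain), intensity == 0 -> 0 (no pain);
    # intensities 1 and 2 are omitted from the evaluation.
    intensities = np.asarray(intensities)
    keep = (intensities >= 3) | (intensities == 0)
    labels = (intensities[keep] >= 3).astype(int)
    return np.flatnonzero(keep), labels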
F. Experiments on Cross Dataset Evaluation
The cross-database generalization ability of the proposed method has also been studied. We constructed the six basic dynamic expression atlas sequences (i.e., anger, disgust, fear, happiness, sadness and surprise) from the CK+ database following the same setting as in Section IV-A. These atlas sequences are then used to guide the recognition process on the MMI database. All 175 facial expression sequences selected from the MMI database in Section IV-B serve as the testing set. Table V lists the confusion matrix obtained by our method. The recognition accuracies are consistently lower than those obtained under the within-dataset validation condition listed in Table II. This is mainly due to the larger variations in illumination conditions, pose and facial shapes across different databases. However, it can be observed from Table V that our method still achieves high recognition performance (i.e., above 90% average recognition rate) and outperforms some well-known methods, such as Shan's method (i.e., 86.9%) [61].

TABLE V
CONFUSION MATRIX OF THE PROPOSED METHOD ON MMI DATABASE WITH ATLAS SEQUENCES TRAINED ON CK+ DATABASE.

             Anger (%)   Disgust (%)   Fear (%)   Happiness (%)   Sadness (%)   Surprise (%)
Anger        90.8        1.5           0          0               7.7           0
Disgust      0.5         93.2          0          5.3             1.0           0
Fear         0.4         0.7           93.5       5.4             0             0
Happiness    0           2.1           5.3        92.6            0             0
Sadness      7.6         0             2.3        0               89.7          0.4
Surprise     0.7         0             4.9        2.8             0             91.6
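For reference, the quoted average recognition rate is simply the mean of the diagonal of Table V:

import numpy as np

# Per-class recognition rates (in %) from the diagonal of Table V.
diag = np.array([90.8, 93.2, 93.5, 92.6, 89.7, 91.6])
print(round(float(diag.mean()), 1))   # 91.9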
V. CONCLUSION
In this paper, we propose a new way to tackle the dynamic
facial expression recognition problem. It is formulated as
a longitudinal atlas construction and diffeomorphic image
registration problem. Our method mainly consists of two
stages, namely an atlas construction stage and a recognition stage. In the atlas construction stage, longitudinal atlases of different facial expressions are constructed by sparse-representation-based groupwise registration. The constructed atlases capture the overall facial appearance movements of a certain expression among the population. In the recognition stage, both the image appearance and temporal information are considered and integrated by diffeomorphic registration and sparse representation.
Our method has been extensively evaluated on five dynamic
facial expression recognition databases. The experimental
results show that this method consistently achieves higher
recognition rates than other compared methods. One limitation
of the proposed method is that it is still not robust enough to
overcome the challenges of strong illumination changes. The main reason is that the LDDMM registration algorithm used in this paper may not compensate for strong illumination changes. One possible solution is to use more sophisticated image matching metrics within the LDDMM framework, such as the localized correlation coefficient and localized mutual information, which have some degree of robustness against illumination changes. This is one
possible future direction for this study.
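As a rough illustration of the kind of matching metric mentioned above, the sketch below computes a localized (windowed) correlation coefficient between two images; because it is locally mean- and variance-normalized, it is less sensitive to smooth illumination changes than a plain sum-of-squared-differences term. This is a generic sketch only, not the metric of any particular LDDMM implementation.

import numpy as np
from scipy.ndimage import uniform_filter

def local_correlation(a, b, win=9, eps=1e-6):
    # Mean localized correlation coefficient between 2-D images a and b,
    # computed over win x win neighbourhoods.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mu_a, mu_b = uniform_filter(a, win), uniform_filter(b, win)
    var_a = uniform_filter(a * a, win) - mu_a ** 2
    var_b = uniform_filter(b * b, win) - mu_b ** 2
    cov = uniform_filter(a * b, win) - mu_a * mu_b
    cc = cov / np.sqrt(np.maximum(var_a, eps) * np.maximum(var_b, eps))
    return float(cc.mean())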
REFERENCES
[1] B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey,”
Pattern Recognition, vol. 36, pp. 259–275, 2003.
[2] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, “A survey of affect
recognition methods: Audio, visual, and spontaneous expressions,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp.
39–58, 2009.
[3] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Recognizing facial expression: machine learning and application
to spontaneous behavior,” in IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2005, pp. 568–573.
[4] T. Ojala, M. Pietikäinen, and T. Maenpaa, “Multiresolution gray-scale
and rotation invariant texture classification with local binary patterns,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 24, pp. 971–987, 2002.
[5] S. Lucey, A. Ashraf, and J. Cohn, “Investigating spontaneous facial
action recognition through aam representations of the face,” in Face
Recognition Book, 2007, pp. 275–286.
[6] E. Learned-Miller, “Data driven image models through continuous
joint alignment,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 2, pp. 236–250, 2006.
[7] G. Zhao and M. Pietikäinen, “Dynamic texture recognition using local
binary patterns with an application to facial expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 915–
928, 2007.
[8] L. Cohen, N. Sebe, A. Garg, L. Chen, and T. Huang, “Facial expression recognition from video sequences: temporal and static modeling,”
Computer Vision and Image Understanding, vol. 91, pp. 160–187, 2003.
[9] Y. Zhang and Q. Ji, “Active and dynamic information fusion for facial
expression understanding from image sequences,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 27, pp. 699–714, 2005.
[10] S. Koelstra, M. Pantic, and I. Patras, “A dynamic texture-based approach
to recognition of facial actions and their temporal models,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp.
1940–1954, 2010.
[11] A. Ramirez and O. Chae, “Spatiotemporal directional number transitional graph for dynamic texture recognition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, p. In Press, 2015.
[12] H. Fang, N. Parthalain, A. Aubrey, G. Tam, R. Borgo, P. Rosin, P. Grant,
D. Marshall, and M. Chen, “Facial expression recognition in dynamic
sequences: An integrated approach,” Pattern Recognition, vol. 47, no. 3,
pp. 1271–1281, 2014.
[13] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. Metaxas, “Learning
active facial patches for expression analysis,” in IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2012, pp.
2562–2569.
[14] Z. Wang, S. Wang, and Q. Ji, “Capturing complex spatio-temporal
relations among facial muscles for facial expression recognition,” in
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2013, pp. 3422–3429.
[15] M. Beg, M. Miller, A. Trouve, and L. Younes, “Computing large
deformation metric mappings via geodesic flows of diffeomorphisms,”
International Journal of Computer Vision, vol. 61, pp. 139–157, 2005.
[16] S. Yousefi, P. Minh, N. Kehtarnavaz, and C. Yan, “Facial expression
recognition based on diffeomorphic matching,” in International Conference on Image Processing, 2010, pp. 4549–4552.
[17] Y. Guo, G. Zhao, and M. Pietikäinen, “Dynamic facial expression
recognition using longitudinal facial expression atlases,” in European
Conference on Computer Vision, 2012, pp. 631–644.
[18] Y. Chang, C. Hu, R. Feris, and M. Turk, “Manifold based analysis of
facial expression,” Image and Vision Computing, vol. 24, no. 6, pp. 605–
614, 2006.
[19] M. Pantic and I. Patras, “Dynamics of facial expression: Recognition
of facial actions and their temporal segments form face profile image
sequences,” IEEE Transactions on Systems, Man, and Cybernetics Part
B, vol. 36, no. 2, pp. 433–449, 2006.
[20] P. Yang, Q. Liu, X. Cui, and D. Metaxas, “Facial expression recognition
using encoded dynamic features,” in IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[21] M. Yeasin, B. Bullot, and R. Sharma, “From facial expression to level
of interests: A spatio-temporal approach,” in IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2004, pp. 922–
927.
[22] G. Edwards, C. Taylor, and T. Cootes, “Interpreting face images using
active appearance models,” in IEEE FG, 1998, pp. 300–305.
[23] J. Saragih, S. Lucey, and J. Cohn, “Deformable model fitting by
regularized landmark mean-shift,” International Journal of Computer
Vision, vol. 91, no. 2, pp. 200–215, 2011.
[24] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[25] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[26] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[27] D. Rueckert, L. Sonoda, C. Hayes, D. Hill, M. Leach, and D. Hawkes, “Nonrigid registration using free-form deformations: Application to breast MR images,” IEEE Transactions on Medical Imaging, vol. 18, pp. 712–721, 1999.
[28] X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[29] F. De la Torre and M. Nguyen, “Parameterized kernel principal component analysis: Theory and applications to supervised and unsupervised image alignment,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[30] G. Tzimiropoulos, V. Argyriou, S. Zafeiriou, and T. Stathaki, “Robust FFT-based scale-invariant image registration with image gradients,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1899–1906, 2010.
[31] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 2887–2894.
[32] S. Liao, D. Shen, and A. Chung, “A Markov random field groupwise registration framework for face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 657–669, 2014.
[33] J. Maintz and M. Viergever, “A survey of medical image registration,” Medical Image Analysis, vol. 2, pp. 1–36, 1998.
[34] M. Miller and L. Younes, “Group actions, homeomorphisms, and matching: A general framework,” International Journal of Computer Vision, vol. 41, pp. 61–84, 2001.
[35] S. Joshi, B. Davis, M. Jomier, and G. Gerig, “Unbiased diffeomorphic atlas construction for computational anatomy,” NeuroImage, vol. 23, pp. 151–160, 2004.
[36] T. Cootes, C. Twining, V. Petrovic, K. Babalola, and C. Taylor, “Computing accurate correspondences across groups of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 11, pp. 1994–2005, 2010.
[37] M. Hernandez, S. Olmos, and X. Pennec, “Comparing algorithms for diffeomorphic registration: Stationary LDDMM and diffeomorphic demons,” in 2nd MICCAI Workshop on Mathematical Foundations of Computational Anatomy, 2008, pp. 24–35.
[38] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[39] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B, vol. 58, pp. 267–288, 1996.
[40] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[41] P. Lucey, J. Cohn, T. Kanade, J. Saragih, and Z. Ambadar, “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 94–101.
[42] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in IEEE International Conference on Multimedia and Expo, 2005, pp. 317–321.
[43] M. Valstar, M. Mehu, B. Jiang, M. Pantic, and S. K., “Meta-analysis
of the first facial expression recognition challenge,” IEEE Transactions
on Systems, Man, and Cybernetics Part B, vol. 42, no. 4, pp. 966–979,
2012.
[44] P. Lucey, J. Cohn, K. Prkachin, P. Solomon, S. Chew, and I. Matthews, “Painful monitoring: Automatic pain monitoring using the UNBC-McMaster shoulder pain expression archive database,” Image and Vision Computing, vol. 30, pp. 197–205, 2012.
[45] I. Kemelmacher-Shlizerman and S. Seitz, “Collection flow,” in IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 1792–1799.
[46] I. Kemelmacher-Shlizerman, S. Suwajanakorn, and S. Seitz,
“Illumination-aware age progression,” in IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2014, pp.
3334–3341.
[47] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, “Rasl: Robust
alignment by sparse and low-rank decomposition for linearly correlated
images,” in IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2010, pp. 763–770.
[48] C. Liu, Beyond Pixels: Exploring New Representations and Applications
for Motion Analysis. PhD thesis, MIT, 2009.
[49] Y. Gizatdinova and V. Surakka, “Feature-based detection of facial landmarks from neutral and expressive facial images,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 28, pp. 135–139, 2006.
[50] Z. Wang, S. Wang, and Q. Ji, “Capturing complex spatio-temporal
relations among facial muscles for facial expression recognition,” in
IEEE Conference on Computer Vision and Pattern Recognition, 2013,
pp. 3422–3429.
[51] P. Viola and M. Jones, “Robust real-time object detection,” International
Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2002.
[52] S. Yang and B. Bhanu, “Facial expression recognition using emotion
avatar image,” in IEEE Int. Conf. Autom. Face Gesture Anal., 2011, pp.
866–871.
[53] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Collecting large, richly
annotated facial-expression databases from movies,” IEEE MultiMedia,
vol. 19, pp. 34–41, 2012.
[54] J. Chen, T. Takiguchi, and Y. Ariki, “Facial expression recognition
with multithreaded cascade of rotation-invariant hog,” in International
Conference on Affective Computing and Intelligent Interaction, 2015,
pp. 636–642.
[55] M. Liu, S. Shan, R. Wang, and X. Chen, “Learning expressionlets on
spatio-temporal manifold for dynamic facial expression recognition,” in
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2014, pp. 1749–1756.
[56] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon, “Emotion
recognition in the wild challenge 2014: Baseline, data and protocol,”
in ACM International Conference on Multimodal Interaction, 2014.
[57] J. Chen, Z. Chen, Z. Chi, and H. Fu, “Emotion recognition in the wild
with feature fusion and multiple kernel learning,” in ACM International
Conference on Multimodal Interaction, 2014, pp. 508–513.
[58] X. Huang, Q. He, X. Hong, G. Zhao, and M. Pietikäinen, “Emotion
recognition in the wild with feature fusion and multiple kernel learning,”
in ACM International Conference on Multimodal Interaction, 2014, pp.
514–520.
[59] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, “Combining
multiple kernel methods on riemannian manifold for emotion recognition in the wild,” in ACM International Conference on Multimodal
Interaction, 2014, pp. 494–501.
[60] K. Sikka, A. Dhall, and M. Bartlett, “Classification and weakly supervised pain localization using multiple segment representation,” Image
and Vision Computing, vol. 32, no. 10, pp. 659–670, 2014.
[61] C. Shan, S. Gong, and P. McOwan, “Facial expression recognition based
on local binary patterns: a comprehensive study,” Image and Vision
Computing, vol. 27, pp. 803–816, 2009.
Yimo Guo received her B.Sc. and M.Sc. degrees in
computer science in 2004 and 2007, and received the
Ph.D. degree in computer science and engineering
from the University of Oulu, Finland, in 2013. Her
research interests include texture analysis and video
synthesis.
Guoying Zhao received the Ph.D. degree in computer science from the Chinese Academy of Sciences, Beijing, China, in 2005. She is currently an
Associate Professor with the Center for Machine
Vision Research, University of Oulu, Finland, where
she has been a researcher since 2005. In 2011, she
was selected to the highly competitive Academy
Research Fellow position. She has authored or coauthored more than 140 papers in journals and
conferences, and has served as a reviewer for many
journals and conferences. She has lectured tutorials at ICPR 2006, ICCV
2009, and SCIA 2013, and authored/edited three books and four special issues
in journals. Dr. Zhao was a Co-Chair of seven International Workshops at
ECCV, ICCV, CVPR and ACCV, and two special sessions at FG13 and FG15.
She is editorial board member for Image and Vision Computing Journal,
International Journal of Applied Pattern Recognition and ISRN Machine
Vision. She is IEEE Senior Member. Her current research interests include
image and video descriptors, gait analysis, dynamic-texture recognition, facial expression recognition, human motion analysis, and person identification.
Matti Pietikäinen received his Doctor of Science in
Technology degree from the University of Oulu, Finland. He is currently a professor, Scientific Director
of Infotech Oulu and Director of Center for Machine
Vision Research at the University of Oulu. From
1980 to 1981 and from 1984 to 1985, he visited
the Computer Vision Laboratory at the University
of Maryland. He has made pioneering contributions,
e.g. to local binary pattern (LBP) methodology,
texture-based image and video analysis, and facial
image analysis. He has authored over 335 refereed papers. His papers have
currently about 31,500 citations in Google Scholar (h-index 63), and six of
his papers have over 1,000 citations. Dr. Pietikäinen was Associate Editor of
IEEE Transactions on Pattern Analysis and Machine Intelligence and Pattern
Recognition journals, and currently serves as Associate Editor of Image and
Vision Computing and IEEE Transactions on Forensics and Security journals.
He was President of the Pattern Recognition Society of Finland from 1989
to 1992, and was named its Honorary Member in 2014. From 1989 to 2007
he served as Member of the Governing Board of International Association
for Pattern Recognition (IAPR), and became one of the founding fellows of
the IAPR in 1994. He is IEEE Fellow for contributions to texture and facial
image analysis for machine vision. In 2014, his research on LBP-based face
description was awarded the Koenderink Prize for Fundamental Contributions
in Computer Vision.