Dynamic Facial Expression Recognition with Atlas Construction and Sparse Representation

Yimo Guo, Guoying Zhao, Senior Member, IEEE, and Matti Pietikäinen, Fellow, IEEE

The authors are with the Center for Machine Vision Research, Department of Computer Science and Engineering, University of Oulu, Finland. Email: [email protected], [email protected], [email protected]

Abstract—In this paper, a new dynamic facial expression recognition method is proposed. Dynamic facial expression recognition is formulated as a longitudinal groupwise registration problem. The main contributions of this method lie in the following aspects: (1) subject-specific facial feature movements of different expressions are described by a diffeomorphic growth model; (2) a salient longitudinal facial expression atlas is built for each expression by a sparse groupwise image registration method, which can describe the overall facial feature changes among the whole population and can suppress the bias due to large inter-subject facial variations; (3) both the image appearance information in the spatial domain and the topological evolution information in the temporal domain are used to guide recognition by a sparse representation method. The proposed framework has been extensively evaluated on five databases for different applications: the extended Cohn-Kanade, MMI, FERA, and AFEW databases for dynamic facial expression recognition, and the UNBC-McMaster database for spontaneous pain expression monitoring. This framework is also compared with several state-of-the-art dynamic facial expression recognition methods. The experimental results demonstrate that the recognition rates of the new method are consistently higher than those of the other methods under comparison.

Index Terms—Dynamic Facial Expression Recognition, Diffeomorphic Growth Model, Groupwise Registration, Sparse Representation.

I. INTRODUCTION

Automatic facial expression recognition (AFER) has essential real-world applications. Its applications include, but are not limited to, human computer interaction (HCI), psychology and telecommunications. It remains a challenging problem and an active research topic in computer vision, and many novel methods have been proposed to tackle the automatic facial expression recognition problem.
Intensive studies have been carried out on the AFER problem in static images during the last decade [1], [2]: given a query facial image, estimate the correct facial expression type, such as anger, disgust, happiness, sadness, fear or surprise. Recognition from static images mainly consists of two steps: feature extraction and classifier design. For feature extraction, Gabor wavelets [3], the local binary pattern (LBP) [4], and geometric features such as the active appearance model (AAM) [5] are in common use. For classification, the support vector machine is frequently used. Joint alignment of facial images under unconstrained conditions has also become an active research topic in AFER [6].
In recent years, dynamic facial expression recognition has become a new research topic and has received more and more attention [7], [8], [9], [10], [11], [12].
Different from the recognition problem in static images, the aim of dynamic facial expression recognition is to estimate the facial expression type from an image sequence captured during the physical facial expression process of a subject. The facial expression image sequence contains not only image appearance information in the spatial domain, but also evolution details in the temporal domain. The image appearance information together with the expression evolution information can further enhance recognition performance.
Although the dynamic information provided is useful, there are challenges regarding how to capture this information reliably and robustly. For instance, a facial expression sequence normally consists of one or more onset, apex and offset phases. In order to capture temporal information and make the temporal information of training and query sequences comparable, correspondences between different temporal phases need to be established. As facial actions over time differ across subjects, it remains an open issue how a common temporal feature for each expression among the population can be effectively encoded while suppressing subject-specific facial shape variations.
In this paper, a new dynamic facial expression recognition method is presented. It is motivated by the fact that facial expressions can be described by diffeomorphic motions of the muscles beneath the face [13], [14]. Intuitively, 'diffeomorphic' means the motion is topologically preserved and reversible [15]. The formal definition of a 'diffeomorphic' transformation is given in Section II. Different from previous works [10], [16], which use pairwise registration to capture the temporal motion, this method considers both the subject-specific and the population information through a groupwise diffeomorphic registration scheme. Moreover, both the spatial and temporal information are captured within a unified sparse representation framework.
Our method consists of two stages: an atlas construction stage and a recognition stage. Atlases, which are unbiased images, are estimated from all the training images belonging to the same expression type with groupwise registration. Atlases capture general features of each expression across the population and can suppress differences due to inter-subject facial shape variations. In the atlas construction stage, a diffeomorphic growth model is estimated for each image sequence to capture subject-specific facial expression characteristics. To reflect the overall evolution process of each expression among the population, longitudinal atlases are then constructed for each expression with groupwise registration and sparse representation. In the recognition stage, we first register the query image sequence to the atlas of each expression. Then, the comparison is conducted from two aspects: image appearance information and temporal evolution information. The preliminary work has been reported in [17].
For the proposed method, there are three main contributions and differences compared to the preliminary work in [17]: (1) A more advanced atlas construction scheme is used. In the previous method [17], the atlases are constructed using the conventional groupwise registration method, so many subtle and important anatomical details are lost due to the naive mean operation. To overcome this shortcoming, a sparse representation based atlas construction method is proposed in this paper. It is capable of capturing subtle and salient image appearance details to guide recognition, while preserving common expression characteristics. (2) In the recognition stage, the previous method [17] compared image differences between the warped query sequence and the atlas sequence, which is based on image appearance information only. In this paper, the temporal evolution information is also taken into account to drive the recognition process. It has been shown to provide information complementary to the image appearance information and can significantly improve the recognition performance. (3) The proposed method has been evaluated in a systematic manner on five databases whose applications vary from posed dynamic facial expression recognition to spontaneous pain expression monitoring. Moreover, possible alternatives have been carefully analyzed and studied with different experimental settings.
The rest of the paper is organized as follows: Section II gives an overview of related works on dynamic facial expression recognition and diffeomorphic image registration. Section III describes the proposed method. Section IV analyzes the experimental results. Section V concludes the paper.

II. RELATED WORK

A. Methods for Dynamic Facial Expression Recognition

Many novel approaches have been proposed for dynamic facial expression recognition [18], [19], [20], [10], [21]. They can be broadly classified into three categories: shape based methods, appearance based methods and motion based methods.
Shape based methods describe facial component shapes based on salient landmarks detected on facial images, such as the corners of the eyes and mouth. The movement of those landmarks provides discriminative information to guide the recognition process. For instance, the active appearance model (AAM) [22] and the constrained local model (CLM) [23] are widely used. Also, Chang et al. [18] inferred a facial expression manifold by applying an active wavelet network (AWN) to a facial shape model defined by the 58 landmarks used by Pantic and Patras [19].
Appearance based methods extract image intensity or other texture features from facial images to characterize facial expressions. Commonly used feature extraction methods include LBP-TOP [7], Gabor wavelets [3], HOG [24], SIFT [25] and subspace learning [20]. For a more thorough review of appearance features, readers may refer to the survey paper [26].
Motion based methods aim to model the spatial-temporal evolution process of facial expressions, and are usually developed by virtue of image registration techniques. For instance, Koelstra et al. [10] used free-form deformations (FFD) [27] to capture motions between frames. Optical flow was adopted by Yeasin et al. [21]. The recognition performance of motion based methods is highly dependent on the face alignment method used.
Many advanced techniques have been proposed for face alignment, such as the supervised descent method developed by Xiong and De la Torre [28], the parameterized kernel principal component analysis based alignment method proposed by De la Torre and Nguyen [29], the FFT-based scale invariant image registration method proposed by Tzimiropoulos et al. [30], and the explicit shape regression based face alignment method proposed by Cao et al. [31].

B. Diffeomorphic Image Registration

Image registration is an active research topic in computer vision and medical image analysis [32], [33]. The goal of image registration is to transform a set of images, which are obtained from different spaces, times or imaging protocols, into a common coordinate system, namely the template space. Image registration can be formulated as an optimization problem. Figure 1 illustrates the flow chart of pairwise registration between two images.
Fig. 1. A typical flow chart for pairwise image registration: the moving image is transformed to the fixed image space. In each iteration, the optimizer minimizes the similarity measure function between the two images and calculates the corresponding optimal transformation parameters.
Equation 1 summarizes the general optimization process of the pairwise registration problem:

T_{opt} = \arg\min_{T \in \Phi} E(I_{fix}, T(I_{mov})),   (1)

where I_fix is the fixed image, I_mov is the moving image, Φ denotes the whole possible transformation space and E(·) is the similarity measure metric. This equation aims to find the optimal transformation T_opt that minimizes E(·) between I_fix and I_mov.
The transformation model for registration is application specific. It can be a rigid or affine transformation, which contains only three or six degrees of freedom, respectively; or a deformable transformation, such as a B-spline [27] or diffeomorphic transformation [15], [34], which contains thousands of degrees of freedom.
In this paper, the diffeomorphic transformation is used due to its excellent properties, such as topology preservation and reversibility [15]. These properties are essential and required to model facial feature movements and suppress registration errors. Otherwise, unrealistic deformations (e.g., twisted facial expressions) may occur and introduce large registration errors.
Fig. 2. Illustration of the whole facial expression process. The neutral face gradually evolves to the apex state, and then the facial muscles bring it back to another neutral state. Therefore, this can be considered a diffeomorphic transformation process (i.e., topologically preserved and reversible).
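To make the loop of Figure 1 and Equation 1 concrete, a minimal sketch is given below. The affine transformation model and the sum-of-squared-differences similarity are illustrative choices only; the method in this paper uses diffeomorphic transformations and the metric of Equation 4 instead.

```python
# Minimal sketch of the pairwise registration loop (Figure 1, Equation 1):
# an optimizer searches transformation parameters that minimize a similarity
# measure between the fixed image and the warped moving image. The affine
# model and SSD metric are illustrative stand-ins.
import numpy as np
from scipy.ndimage import affine_transform
from scipy.optimize import minimize

def warp_affine(moving, params):
    """params = [a11, a12, a21, a22, ty, tx]; the identity is [1, 0, 0, 1, 0, 0]."""
    A = np.array([[params[0], params[1]], [params[2], params[3]]])
    return affine_transform(moving, A, offset=params[4:6], order=1)

def ssd(params, fixed, moving):
    """Similarity measure E(I_fix, T(I_mov)): sum of squared differences."""
    return float(np.sum((fixed - warp_affine(moving, params)) ** 2))

def register_pairwise(fixed, moving):
    """Search for T_opt of Equation 1, starting from the identity transform."""
    x0 = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
    res = minimize(ssd, x0, args=(fixed, moving), method="Nelder-Mead",
                   options={"maxiter": 2000})
    return res.x

# Usage: params = register_pairwise(fixed_img, moving_img)
#        warped = warp_affine(moving_img, params)
```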
The formal definition of a diffeomorphic transformation is: given two manifolds Υ1 and Υ2 and a mapping function F : Υ1 → Υ2, F is a diffeomorphic transformation if it is differentiable and its inverse mapping F^{-1} : Υ2 → Υ1 is also differentiable. F is a C^ξ diffeomorphic transformation if F and F^{-1} are ξ times differentiable. For a registration task, F is often built in an infinite-dimensional manifold [15]. It should be noted that the group of diffeomorphic transformations F is also a manifold.

C. Groupwise Registration

As the facial expression process is topologically preserved and reversible, as illustrated in Figure 2, it can be considered a diffeomorphic transformation of the facial muscles. Therefore, the diffeomorphic transformation during the evolution process of a facial expression can be used to reconstruct facial feature movements and further guide the recognition task.
Given P facial expression images I_1, ..., I_P, a straightforward solution to transform them to a common space is to select one image as the template, and then register the remaining P − 1 images to the template by applying P − 1 pairwise registrations. However, the registration quality is sensitive to the selection of the template. Therefore, the idea of groupwise registration was proposed [35], [36], where the template is estimated to be the Fréchet mean on the Riemannian manifold whose geodesic distances are measured based on diffeomorphisms. The diffeomorphic groupwise registration problem can be formulated as the optimization problem:

\hat{I}_{opt}, \psi_1^{opt}, \ldots, \psi_P^{opt} = \arg\min_{\hat{I}, \psi_1, \ldots, \psi_P} \sum_{i=1}^{P} \left[ d(\hat{I}, \psi_i(I_i))^2 + \lambda R(\psi_i) \right],   (2)

where both the template Î_opt and the optimal diffeomorphic transformations ψ_i^opt (i = 1, ..., P) that transform I_i to Î_opt are variables to be estimated. d(·) is the similarity function that measures the matching degree between two images, R(·) denotes the regularization term that controls the smoothness of the transformation, and λ is a parameter that controls the weight of R(·).
Î_opt and ψ_i^opt can be estimated by a greedy iterative estimation strategy [35]: First, initialize Î as the mean image of I_i. Fix Î and estimate ψ_i by registering I_i to Î in the current iteration. Then, fix ψ_i and update Î as the mean image of ψ_i(I_i). In this way, ψ_i and Î are iteratively updated until they converge. Figure 3 illustrates an example of diffeomorphic groupwise registration.
Fig. 3. Illustration of diffeomorphic groupwise registration, where the template is estimated to be the Fréchet mean on the Riemannian manifold. ψ_i (i = 1, 2, 3, 4) denotes the diffeomorphic transformation from I_i to the template (solid black arrows), while ψ_i^{-1} denotes the reversed transformation (dashed black arrows).
The estimated template, which is also named the atlas, represents the overall facial feature changes of a specific expression among the population. The atlas is unbiased to any individual subject and reflects the general expression information. Our dynamic facial expression recognition framework is based on diffeomorphic groupwise registration. The details are given in Section III.
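A compact sketch of this greedy iterative strategy is given below. `register(moving, fixed)` is an assumed helper that returns the moving image warped onto the fixed image (a diffeomorphic registration in the paper), so this is a template of the procedure rather than the authors' implementation.

```python
# Sketch of the greedy groupwise strategy: alternately fix the template,
# register every image to it, and re-estimate the template as the mean of
# the registered images.
import numpy as np

def groupwise_template(images, register, max_iters=10, tol=1e-3):
    """images: list of same-sized 2-D arrays I_1..I_P. Returns the template I_hat."""
    template = np.mean(images, axis=0)                         # initialize I_hat as the mean image
    for _ in range(max_iters):
        warped = [register(img, template) for img in images]   # estimate psi_i with I_hat fixed
        new_template = np.mean(warped, axis=0)                 # update I_hat as mean of psi_i(I_i)
        if np.linalg.norm(new_template - template) < tol * np.linalg.norm(template):
            break                                              # converged
        template = new_template
    return template
```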
III. METHODOLOGY

We propose a new dynamic facial expression recognition method. There are mainly two stages: an atlas construction stage and a recognition stage. In the atlas construction stage, an atlas sequence is built in which salient and common features for each expression among the population are extracted. Meanwhile, the variations due to inter-subject facial shapes can be suppressed. In the recognition stage, the expression type is determined by comparing the query sequence with each atlas sequence.

A. Atlas Construction by Sparse Groupwise Registration

The flow chart of the atlas construction stage is illustrated in Figure 4. In this stage, the longitudinal facial expression atlases are constructed to obtain salient facial feature changes during an expression process. Given K types of facial expressions of interest, and C different subject image sequences for each expression, denote the image at the jth time point of the ith subject (i = 1, ..., C) as I_{t_j^i}. Assume each image sequence begins at time point 0 and ends at time point 1 (i.e., t_j^i ∈ [0, 1]). For each expression, to construct N atlases at given time points T = {t_1, ..., t_N}, where t_k ∈ [0, 1] (k = 1, ..., N), we formulate it as an energy minimization problem:

M_t, \phi^i = \arg\min_{\tilde{M}_t, \tilde{\phi}^i} \sum_{t \in T} \sum_{i=1}^{C} \left\{ d(\tilde{M}_t, \tilde{\phi}^i_{(t_0^i \to t)}(I_{t_0^i}))^2 + \lambda_{\phi^i} R(\tilde{\phi}^i) \right\},   (3)

where M_t is the longitudinal atlas at time point t and φ^i is the diffeomorphic growth model that models the facial expression process of subject i. φ^i_{(t_0^i → t)}(I_{t_0^i}) denotes the warping of subject i's image at the first time point, I_{t_0^i}, to time point t, and R(·) is the regularization constraint.
In the atlas construction stage, training sequences are carefully constrained and pre-segmented to make sure that they begin with the neutral expression and end with the apex expression. Thus, the beginning and ending stages of all training sequences are aligned. Given the growth model of each sequence, the intermediate states between the neutral and apex expressions are estimated by uniformly dividing the time interval between the beginning and ending stages. Therefore, the intermediate states are also aligned across training sequences, and each state corresponds to one specific time point used to construct the atlas sequence. The more states (i.e., the larger the number of time points N) are used, the more accurately the atlas sequence can describe the facial expression process, while the computational burden also increases. Finally, images belonging to the same time point are used to initialize and iteratively refine the atlas.
In this paper, the Sobolev norm [15] is used as the regularization function. λ_{φ^i} is the parameter that controls the weight of the regularization term, and d(·) is the distance metric defined on the non-Euclidean Riemannian manifold, expressed by:

d(I_1, I_2)^2 = \min \int_0^1 \|v_s\|_U^2 \, ds + \frac{1}{\sigma^2} \|I_1(\varphi^{-1}) - I_2\|_2^2,   (4)

where ϕ(·) denotes the diffeomorphic transformation that matches image I_1 to I_2. In this paper, ϕ(·) is estimated based on the large deformation diffeomorphic metric mapping (LDDMM) framework [15]. ||·||_U^2 is the Sobolev norm, which controls the smoothness of the deformation field, and ||·||_2 denotes the L2 norm. v_s is the velocity field associated with ϕ(·). The relationship between ϕ(·) and v_s is defined by:

\varphi(\vec{x}) = \vec{x} + \int_0^1 v_s(\varphi_s(\vec{x})) \, ds,   (5)

where ϕ_s(x⃗) is the displacement of pixel x⃗ at time s ∈ [0, 1].
Equation 3 can be interpreted as follows. First, the subject-specific growth model φ^i is estimated for each subject i. Then, the subject-specific information is propagated to each time point t ∈ T to construct the atlas.
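The displacement integral of Equation 5 can be sketched numerically as below, assuming the velocity field has already been discretized at a set of times in [0, 1]; the forward Euler stepping and the simple backward warping are simplifications for illustration, not the LDDMM solver used in the paper.

```python
# Sketch of Equation 5: integrate a non-stationary velocity field v_s into a
# deformation by forward Euler steps, then warp an image with it. Shapes and
# step counts are illustrative assumptions.
import numpy as np
from scipy.ndimage import map_coordinates

def integrate_velocity(velocity, num_steps=None):
    """velocity: array (S, 2, H, W) giving v_s at S discrete times s in [0, 1].
    Returns the accumulated pixel displacements, shape (2, H, W)."""
    S, _, H, W = velocity.shape
    num_steps = num_steps or S
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    disp = np.zeros((2, H, W))                     # phi_s(x) - x, starts at zero
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity[min(step * S // num_steps, S - 1)]
        # Sample v at the currently displaced positions phi_s(x) (integrand of Eq. 5).
        coords = np.stack([ys + disp[0], xs + disp[1]])
        disp[0] += dt * map_coordinates(v[0], coords, order=1, mode="nearest")
        disp[1] += dt * map_coordinates(v[1], coords, order=1, mode="nearest")
    return disp

def warp_image(image, disp):
    """Warp an image with the accumulated displacement (backward warping is used
    here as a common simplification of applying phi to the image)."""
    H, W = image.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    return map_coordinates(image, [ys + disp[0], xs + disp[1]], order=1, mode="nearest")
```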
Given a subject i, there are n_i images in his/her facial expression image sequence. I_{t_j^i} denotes the image taken at the jth time point of subject i. The growth model φ^i of subject i can be estimated by minimizing the energy function:

J(\phi^i) = \int_0^1 \|v_s^i\|_U^2 \, ds + \frac{1}{\sigma^2} \sum_{j=0}^{n_i - 1} \|\phi^i_{(t_0^i \to t_j^i)}(I_{t_0^i}) - I_{t_j^i}\|_2^2.   (6)

The first term of Equation 6 controls the smoothness of the growth model. In the second term, the growth model is applied to I_{t_0^i} to warp it to the other time points t_j^i; the warped results are then compared with the existing observations I_{t_j^i} at those time points. A smaller difference between the warped result and the observation indicates that the growth model describes the expression more accurately. With the LDDMM framework [15] used in this paper, the velocity field v_s^i is non-stationary and varies over time. The variational gradient descent method in [15] is adopted to estimate the optimal velocity field, with the regularization constraint represented by the Sobolev norm. The Sobolev norm ||v_s^i||_U^2 in Equation 6 is defined as ||D v_s^i||_2^2, where D is a differential operator. The selection of the best operator D in diffeomorphic image registration is still an open question [37]. In this paper, the diffusive model is used as the differential operator [15], which restricts the velocity field to a Sobolev space of class two.
In Equation 6, the variables to be estimated are the displacements of each pixel in image I_{t_0^i}, which represent the growth model as a diffeomorphic deformation field. For each subject, there is one growth model to be estimated. Equation 6 estimates the growth model by considering the differences at all available time points, which is reflected by the summation in the second term. The least number of images n_i in the subject-specific facial expression sequence needed to estimate the growth model is two; in this case, the problem reduces to a pairwise image registration problem. The larger the number of images available in the sequence, the more precisely the growth model describes the dynamic process of the expression. We use a Lagrange multiplier based optimization strategy similar to [15] to perform the minimization of Equation 6. The growth model φ^i is represented as a deformation field, based on which the facial expression images at any time point t ∈ [0, 1] of subject i, denoted as φ^i_{(t_0^i → t)}(I_{t_0^i}), are interpolated, as shown in Figure 4 (a).
Fig. 4. Illustration of the two main steps of atlas construction: (a) growth model estimation for each facial expression image sequence; (b) facial expression atlas construction from the image sequences of the whole population based on longitudinal (i.e., temporal) atlas construction and sparse representation.
Given the estimated φ^i, we are able to construct the facial expression atlas at any time point of interest. Assume there are N time points of interest T = {t_1, ..., t_N} at which to construct the facial expression atlas.
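For clarity, the sketch below writes out how the energy of Equation 6 is evaluated for a candidate growth model: a Sobolev-type smoothness term (approximated here with finite-difference gradients as the operator D) plus the data term summed over the observed frames. The helper `warp_from_t0` is an assumed placeholder for the warping φ_{(t_0 → t)}; the actual minimization in the paper is carried out by variational gradient descent.

```python
# Sketch of evaluating the growth-model energy of Equation 6 for a candidate
# non-stationary velocity field. All helper names are illustrative assumptions.
import numpy as np

def sobolev_norm_sq(velocity_seq):
    """Approximate int_0^1 ||D v_s||_2^2 ds with a finite-difference gradient as
    the differential operator D and a Riemann sum over the discrete times."""
    total = 0.0
    for v in velocity_seq:                               # v: (2, H, W) at one time s
        gy, gx = np.gradient(v[0]), np.gradient(v[1])
        total += sum(np.sum(g ** 2) for g in (*gy, *gx))
    return total / len(velocity_seq)

def growth_model_energy(velocity_seq, I_t0, observations, times, sigma, warp_from_t0):
    """observations[j] is the frame observed at times[j] in [0, 1];
    warp_from_t0(I_t0, velocity_seq, t) returns phi_{(t0->t)}(I_t0)."""
    data_term = 0.0
    for I_obs, t in zip(observations, times):
        warped = warp_from_t0(I_t0, velocity_seq, t)     # predicted frame at time t
        data_term += np.sum((warped - I_obs) ** 2)       # compare with the observation
    return sobolev_norm_sq(velocity_seq) + data_term / sigma ** 2
```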
Based on the estimated growth model φ^i, subject i's facial expression image can be interpolated at time point t ∈ T with the operation φ^i_{(t_0^i → t)}(I_{t_0^i}). Moreover, the optimization of Equation 3 with respect to the variable M_t becomes:

J(M_t) = \sum_{t \in T} \sum_{i=1}^{C} \left\{ d(M_t, \phi^i_{(t_0^i \to t)}(I_{t_0^i}))^2 \right\}.   (7)

The optimization of Equation 7 can be formulated as a groupwise registration problem by estimating the Fréchet mean on the Riemannian manifold defined by diffeomorphisms [35]. That is, the atlas M_t at each time point t ∈ T is estimated by a greedy iterative algorithm, summarized in Algorithm 1 [17].

Algorithm 1 Estimate atlas M_t at time point t with the conventional groupwise registration strategy.
Input: Images φ^i_{(t_0^i → t)}(I_{t_0^i}) of each subject i (i = 1, ..., C), interpolated at time point t with the growth model φ^i.
Output: Atlas M_t constructed at time point t.
1. Initialize M_t = (1/C) Σ_{i=1}^{C} φ^i_{(t_0^i → t)}(I_{t_0^i}).
2. Initialize Î_i = φ^i_{(t_0^i → t)}(I_{t_0^i}).
3. FOR i = 1 to C
   Perform diffeomorphic image registration: register Î_i to M_t to minimize the image metric defined in Equation 4 between Î_i and M_t. Denote the registered image as R_i.
   END FOR
4. Update M_t = (1/C) Σ_{i=1}^{C} R_i.
5. Repeat Steps 3 and 4 until M_t converges.
6. Return M_t.

Taking the CK+ dynamic facial expression dataset as an example, the longitudinal fear atlases constructed by Algorithm 1 are shown in Figure 5 (a). It can be observed that although the constructed atlases present most of the facial expression characteristics, they fail to include details regarding the expression (e.g., muscle movements around the cheeks and eyes). This is due to the updating rule of M_t in Steps 3 and 4 of Algorithm 1: (1) align all the images to the M_t obtained in the previous iteration, and (2) update M_t by taking the average of the aligned images obtained in Step (1). Since M_t is initialized to the average image of all registered images, it is over-smoothed and lacks salient details. Furthermore, the alignment of all images to this fuzzy image in Step (1) leads to the same problem in the next iteration.
Fig. 5. Longitudinal atlases constructed at four time points for the 'fear' expression on the extended Cohn-Kanade database using (a) the conventional groupwise registration strategy, and the proposed sparse representation method with sparseness parameters (b) λ_s = 0.01 and (c) λ_s = 0.1, respectively. For comparison purposes, significant differences between (a) and (b) are highlighted by green circles.
Therefore, to preserve salient expression details during atlas construction and provide a high-quality atlas, we are motivated to present a new atlas construction scheme based on a sparse representation method, due to its saliency and robustness [38]. Given the C registered subject images R_i (i = 1, ..., C) obtained by Step 3 in Algorithm 1, the atlas M_t is estimated based on the sparse representation of R_i by minimizing:

E(\vec{\delta}) = \frac{1}{2} \|R\vec{\delta} - \vec{m}_t\|_2^2 + \lambda_s \|\vec{\delta}\|_1,   (8)

where R = [r⃗_1, ..., r⃗_C], r⃗_i (i = 1, ..., C) is a column vector corresponding to the vectorization of R_i, and m⃗_t is the
vectorization of M_t. ||·||_1 is the L1 norm and λ_s is the parameter that controls the sparseness of the reconstruction coefficient vector δ⃗. The optimization of Equation 8 is a LASSO sparse representation problem [39], which can be solved by Nesterov's method [40]. With the optimal solution of Equation 8, denoted as δ⃗_opt, the atlas M_t can be updated as m⃗_t = R δ⃗_opt. The initialization of M_t is also improved by Equation 8, where the matrix R is the collection of φ^i_{(t_0^i → t)}(I_{t_0^i}) (i = 1, ..., C). This procedure is summarized in Algorithm 2.

Algorithm 2 Estimate atlas M_t at time point t with groupwise registration and sparse representation.
Input: Images φ^i_{(t_0^i → t)}(I_{t_0^i}) of each subject i (i = 1, ..., C), interpolated at time point t with the growth model φ^i.
Output: Atlas M_t constructed at time point t.
1. Initialize M_t = (1/C) Σ_{i=1}^{C} φ^i_{(t_0^i → t)}(I_{t_0^i}).
2. Refine the initialization of M_t based on the sparse representation of φ^i_{(t_0^i → t)}(I_{t_0^i}) expressed by Equation 8.
3. Initialize Î_i = φ^i_{(t_0^i → t)}(I_{t_0^i}).
4. FOR i = 1 to C
   Perform diffeomorphic image registration: register Î_i to M_t to minimize the image metric defined in Equation 4 between Î_i and M_t. Denote the registered image as R_i.
   END FOR
5. Update M_t by optimizing Equation 8 with the sparse representation of R_i.
6. Repeat Steps 4 and 5 until M_t converges.
7. Return M_t.

To compare the performance of Algorithm 1 in [17] and Algorithm 2, the longitudinal atlases of the fear expression constructed by Algorithm 2 on the CK+ database are shown in Figure 5 (b), with the sparseness parameter λ_s = 0.01 in Equation 8. It can be observed that the atlases constructed by the proposed sparse representation method preserve more anatomical details, especially in the areas around the cheeks and eyes, which are critical parts for facial expression recognition. It should be noted that there is a tradeoff between the data matching term and the sparseness term in Equation 8. As λ_s increases, the sparseness term begins to dominate the data matching term, which affects the quality of the constructed atlas. Figure 5 (c) shows atlases constructed with the sparseness parameter λ_s = 0.1 in Equation 8. It can be observed that although the constructed atlases become even sharper than those shown in Figure 5 (b), some facial features such as the mouth are distorted in an unrealistic manner. In this paper, we have empirically found that λ_s = 0.01 strikes a good balance between the data matching and sparseness terms, so this setting is used throughout all experiments in this paper.
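The sparse update of Steps 2 and 5 in Algorithm 2 amounts to solving the LASSO problem of Equation 8 for the coefficient vector δ⃗ and replacing the atlas by Rδ⃗_opt. A sketch is given below; it uses scikit-learn's coordinate-descent Lasso as a stand-in for Nesterov's method, so the scaling of the data term differs slightly from Equation 8.

```python
# Sketch of the sparse atlas update of Equation 8: represent the current mean
# atlas m_t as a sparse combination of the registered subject images R_i and
# replace it with the reconstruction R * delta_opt.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_atlas_update(registered_images, current_atlas, lambda_s=0.01):
    """registered_images: list of C arrays (H, W); current_atlas: array (H, W)."""
    H, W = current_atlas.shape
    R = np.stack([img.ravel() for img in registered_images], axis=1)   # (H*W, C)
    m_t = current_atlas.ravel()
    lasso = Lasso(alpha=lambda_s, fit_intercept=False, max_iter=5000)
    lasso.fit(R, m_t)                      # delta_opt = argmin 0.5||R d - m||^2 + lam ||d||_1 (up to scaling)
    delta_opt = lasso.coef_
    return (R @ delta_opt).reshape(H, W)   # refined atlas m_t = R * delta_opt
```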
B. Recognition of Query Sequences

In this paper, a new recognition scheme based on image appearance and expression evolution information is proposed, as shown in Figure 6. Without loss of generality, assume that there are K different expressions of interest. Let N denote the number of time points used to build the longitudinal facial expression atlas sequence as in Section III-A. The larger the value of N, the more precisely the atlas sequence describes the dynamic facial expression, but this also increases the computational burden. We denote the N time points as T = {t_1, ..., t_N} and M_t^k as the atlas of the kth (k = 1, ..., K) facial expression at time point t (t ∈ T).
In the sparse atlas construction stage, training image sequences can be constrained or pre-segmented to ensure that they begin with the neutral expression and gradually reach the apex expression. In this way, the constructed longitudinal atlases of different expressions all follow the same trend, as illustrated in Figure 5. However, in the recognition stage, a new query image sequence does not necessarily begin with the neutral expression and end with the apex expression, and it is possible that abrupt transitions between two expressions are observed in one sequence. Given a new facial expression sequence that consists of n_new images I_i^new (i = 0, ..., n_new − 1), correct temporal correspondences should be established between the constructed atlas sequences and the query image sequence. This is because the facial expression sequence to be classified does not necessarily follow the same temporal correspondence as the constructed longitudinal atlas. First, we determine the temporal correspondence in the query image sequence of the first atlas image of each facial expression k, which is described by:

b = \arg\min_j \left\{ d(M_{t_1}^k, I_j^{new})^2 \right\},   (9)

where d(·) is the distance metric defined in Equation 4 based on diffeomorphisms. The physical meaning of Equation 9 is: (1) perform diffeomorphic image registration between M_{t_1}^k and each image I_j^new (j = 1, ..., n_new) in the query sequence; and (2) take the time point that gives the least registration error between M_{t_1}^k and the image in the query sequence as the temporal correspondence of M_{t_1}^k.
Similarly, we can determine the temporal correspondence of each atlas of each expression in the query image sequence. Denote by e the index of the temporal correspondence in the query image sequence to the last image M_{t_N}^k in the atlas sequence. It should be noted that for query sequences with multiple expression transitions, only one neutral → onset → apex clip is detected and used to establish the correspondence to the atlas sequence. Intuitively, this single neutral → onset → apex clip should already contain sufficient information for accurate expression recognition. This is further justified by the experiments conducted on the MMI, FERA, and AFEW databases, in which multiple facial expression transitions exist and expression sequences are obtained under real-life conditions.
Then, we construct the growth model φ^new for the query image sequence and interpolate facial expression images at time points t ∈ {t_b, t_{b+1}, ..., t_e} by the operation φ^new_{(t_b → t)}(I_b^new), where b and e are the indices of the temporal correspondences in the query image sequence to the first and last images in the atlas sequence, respectively. With the established temporal correspondence between the query image sequence and the longitudinal atlas, we can register the interpolated facial expression images of the query image sequence to their corresponding images in the atlas sequence.
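The temporal-correspondence search of Equation 9 reduces to registering each query frame against an atlas image and keeping the frame with the smallest error; the sketch below assumes a `registration_distance` helper standing in for the metric d(·,·) of Equation 4.

```python
# Sketch of the temporal-correspondence search in Equation 9.
import numpy as np

def temporal_correspondence(atlas_image, query_frames, registration_distance):
    """query_frames: list of images I_j^new; returns the index b of Equation 9."""
    errors = [registration_distance(atlas_image, frame) ** 2 for frame in query_frames]
    return int(np.argmin(errors))

def align_query_to_atlas(atlas_sequence, query_frames, registration_distance):
    """Returns (b, e): indices of the query frames corresponding to the first and
    last atlas images, used to clip one neutral -> onset -> apex segment."""
    b = temporal_correspondence(atlas_sequence[0], query_frames, registration_distance)
    e = temporal_correspondence(atlas_sequence[-1], query_frames, registration_distance)
    return b, e
```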
Fig. 6. Both the image appearance information and the dynamic evolution information are used to guide the recognition. Image appearance similarity is measured with respect to the registration error between the query image sequence and the atlas sequence. The temporal process is compared by calculating the deformation field reconstruction error.
The registration errors are compared to determine the expression type [17]. This is described by:

L_{opt} = \arg\min_L \left\{ \frac{\sum_{i=0}^{e-b} d(M_{t_{1+i}}^L, \phi^{new}_{(t_b \to t_{b+i})}(I_b^{new}))^2}{e - b + 1} \right\},   (10)

where L_opt ∈ {1, ..., K} is the estimated facial expression label for the query image sequence.
The dynamic process provides information complementary to the image appearance to guide recognition. Given the growth model φ^i (i = 1, ..., C), the deformation field φ^i_{(t_j → t_{j+1})} (j = b, ..., e − 1) that represents the temporal evolution from time point t_j to t_{j+1} can be calculated. φ^i_{(t_j → t_{j+1})} is represented as a 2 × h × w dimensional vector F⃗^i_{t_j → t_{j+1}}, where h and w are the height and width of each facial expression image (i.e., there are h × w pixels); each pixel's displacement is determined by its movements in the horizontal and vertical directions. Similarly, for the new image sequence, we can obtain F⃗^new_{t_j → t_{j+1}} (j = b, ..., e − 1). For each expression k (k = 1, ..., K), the training image sequences are used to construct a dictionary D^k_{t_j → t_{j+1}} = [F⃗^1_{t_j → t_{j+1}}, ..., F⃗^C_{t_j → t_{j+1}}], which represents the temporal evolution of this expression from time point t_j to t_{j+1} (j = b, ..., e − 1).
We reconstruct F⃗^new_{t_j → t_{j+1}} from the basis vectors (i.e., the columns) of D^k_{t_j → t_{j+1}} using sparse representation [38] for each expression type k, as shown in Figure 7. The accuracy of the reconstruction indicates the similarity between the temporal processes, which serves as an important clue to determine the expression type of the new image sequence. Therefore, the overall energy function that drives the recognition is described by:

L_{opt} = \arg\min_L \left\{ \frac{\sum_{i=0}^{e-b} d(M_{t_{1+i}}^L, \phi^{new}_{(t_b \to t_{b+i})}(I_b^{new}))^2}{e - b + 1} + \beta \cdot \sum_{j=b}^{e-1} \|\vec{F}^{new}_{t_j \to t_{j+1}} - D^L_{t_j \to t_{j+1}} \cdot \vec{\alpha}^{opt}_{t_j, L}\|_2^2 \right\},   (11)

where β is the parameter that controls the weight of the temporal information, and α⃗^opt_{t_j, L} is estimated by:

\vec{\alpha}^{opt}_{t_j, L} = \arg\min_{\vec{\alpha}} \left\{ \frac{1}{2} \|\vec{F}^{new}_{t_j \to t_{j+1}} - D^L_{t_j \to t_{j+1}} \cdot \vec{\alpha}\|_2^2 + \lambda_{\vec{\alpha}} \|\vec{\alpha}\|_1 \right\}.   (12)

The optimization of Equation 12 can be performed by Nesterov's method [40], as it is a LASSO sparse representation problem [39].
Fig. 7. Illustration of reconstructing the deformation field between consecutive time points by sparse representation for a new subject, from the deformation field dictionary D^k_{t_j → t_{j+1}} learnt from the training images (Subjects 1 to C).
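Putting Equations 10-12 together, the recognition rule can be sketched as below: for each candidate expression L, the mean appearance registration error against its atlas sequence is combined with the sparse-reconstruction error of the query deformation fields on that expression's dictionaries, weighted by β. The appearance errors and dictionaries are assumed to be precomputed inputs, and scikit-learn's Lasso again stands in for Nesterov's method.

```python
# Sketch of the recognition rule in Equations 10-12.
import numpy as np
from sklearn.linear_model import Lasso

def temporal_term(F_new, dictionaries, lambda_alpha=0.01):
    """F_new[j]: flattened query deformation field between t_j and t_{j+1};
    dictionaries[j]: matrix D^L_{t_j->t_{j+1}} with one training field per column."""
    err = 0.0
    for f, D in zip(F_new, dictionaries):
        lasso = Lasso(alpha=lambda_alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(D, f)                              # Equation 12 (up to solver scaling)
        err += np.sum((f - D @ lasso.coef_) ** 2)    # reconstruction error for this interval
    return err

def classify(appearance_errors, F_new, dictionaries_per_class, beta=0.5):
    """appearance_errors[L]: mean registration error of Equation 10 for class L;
    dictionaries_per_class[L]: list of D^L matrices over the time intervals."""
    scores = {L: appearance_errors[L] + beta * temporal_term(F_new, dicts)
              for L, dicts in dictionaries_per_class.items()}
    return min(scores, key=scores.get)               # Equation 11: label with least energy
```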
IV. EXPERIMENTS

To evaluate the performance of the proposed method, it has been extensively tested on three benchmark databases for dynamic facial expression recognition: the extended Cohn-Kanade [41], MMI [42] and FERA [43] databases. The proposed method has also been evaluated on one spontaneous expression database: the UNBC-McMaster database [44].

A. Experiments on the Extended Cohn-Kanade Database

The extended Cohn-Kanade (CK+) database [41] contains 593 facial expression sequences from 123 subjects. Similar to [41], 325 sequences from 118 subjects are selected. Each sequence is categorized as one of the seven basic expressions: anger, contempt, disgust, fear, happy, sadness and surprise. Each image in the facial expression sequences is digitized to a resolution of 240 × 210. For each selected sequence, we follow the same preprocessing steps as in [7]. Specifically, the eye positions in the first frame of each sequence were manually labeled. These positions were used to determine the facial area for the whole sequence and to normalize the facial images. Figure 8 shows some examples from the CK+ database.
Fig. 8. Images from the CK+ database.
In all experiments, the following parameter settings are used for our method: N = 12 as the number of time points of interest to construct the longitudinal atlas, as we found that N = 12 is a good tradeoff between recognition accuracy and computational burden; λ_{φ^i} = 0.02 as the parameter that controls the smoothness of the diffeomorphic growth model for each subject i; λ_s = 0.01 as the sparseness parameter for atlas construction in Equation 8; β = 0.5 as the weighting parameter associated with the temporal evolution information in Equation 11; and λ_{α⃗} = 0.01 as the sparse representation parameter of the growth model for the query image sequence in Equation 12.
Our method is evaluated in a leave-one-subject-out manner similar to [41]. Figure 9 shows the constructed longitudinal atlases of the seven different expressions on the CK+ database. It can be visually observed that the constructed atlases are able to capture the overall trend of facial expression evolution. Table I shows the confusion matrix for the CK+ database. It can be observed that high recognition accuracies are obtained by the proposed method (i.e., the average recognition rate of each expression is higher than 90%).
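For reference, the parameter values listed above are gathered into a single configuration below, together with a sketch of the leave-one-subject-out protocol; `build_atlases` and `classify_sequence` are assumed placeholders for the atlas construction stage (Section III-A) and the recognition stage (Section III-B).

```python
# The hyperparameter settings reported above, plus a leave-one-subject-out loop.
PARAMS = {
    "N": 12,               # time points in each longitudinal atlas sequence
    "lambda_phi": 0.02,    # smoothness of the diffeomorphic growth model (Eq. 3/6)
    "lambda_s": 0.01,      # sparseness for atlas construction (Eq. 8)
    "beta": 0.5,           # weight of the temporal term in recognition (Eq. 11)
    "lambda_alpha": 0.01,  # sparseness for deformation-field coding (Eq. 12)
}

def leave_one_subject_out(sequences_by_subject, build_atlases, classify_sequence):
    """sequences_by_subject: {subject_id: [(sequence, label), ...]}."""
    correct = total = 0
    for held_out in sequences_by_subject:
        train = {s: v for s, v in sequences_by_subject.items() if s != held_out}
        atlases = build_atlases(train, **PARAMS)            # one atlas sequence per expression
        for sequence, label in sequences_by_subject[held_out]:
            total += 1
            correct += (classify_sequence(sequence, atlases, **PARAMS) == label)
    return correct / total
```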
Figure 10 shows the recognition rates of the different expressions obtained by: sparse atlas construction with the new recognition scheme, sparse atlas construction with the recognition scheme in [17] (i.e., using image appearance information only), conventional atlas construction with the recognition scheme in [17], atlas construction with the collection flow algorithm [45], [46], and atlas construction with the standard RASL algorithm [47]. For the collection flow algorithm [45], [46], we adopted the flow estimation algorithm from Ce Liu's implementation [48], similar to [45], [46]. For the RASL algorithm, we followed the same settings as in [47]. The affine transformation model is used, which is the most complex transformation model supported by the standard RASL package (http://perception.csl.illinois.edu/matrix-rank/rasl.html). It can be observed that the sparse atlas construction with the new recognition scheme consistently achieves the highest recognition rates, which is consistent with the qualitative results obtained in Section III-A. The main reason for the improvement is that the enforced sparsity constraint can preserve salient information that discriminates different expressions and can simultaneously suppress subject-specific facial shape variations. In addition, it is demonstrated that the recognition performance can be further improved by referring to both the image appearance information in the spatial domain and the temporal information.
It is also observed that the recognition accuracies obtained by using the RASL algorithm are slightly worse than or comparable to those of the conventional atlas construction scheme. There are two reasons: First, as long as the mean operation is used to construct the atlas during the groupwise registration process, subtle and important anatomical details are inevitably lost, which leads to inferior recognition accuracies. Second, the global affine transformation cannot model deformable facial muscle movements sufficiently. Therefore, the corresponding recognition accuracies are worse than those obtained with diffeomorphic transformations. The collection flow algorithm achieves slightly higher recognition accuracies than the conventional groupwise registration scheme. However, its accuracies are slightly inferior to those of the sparse representation based atlas construction scheme. The reason is probably that the sparse representation based atlas construction scheme explicitly enforces the sparseness constraint to build sharp and salient atlases in the energy function.
Fig. 9. The longitudinal facial expression atlases constructed on the CK+ database at 12 time points with respect to seven expressions: 'anger' (first row), 'contempt' (second row), 'disgust' (third row), 'fear' (fourth row), 'happiness' (fifth row), 'sadness' (sixth row) and 'surprise' (seventh row).

TABLE I
CONFUSION MATRIX OF THE PROPOSED METHOD FOR THE CK+ DATABASE.

            Anger (%)  Contempt (%)  Disgust (%)  Fear (%)  Happiness (%)  Sadness (%)  Surprise (%)
Anger         96.1        0             0           1.5         0              2.4          0
Contempt       0         91.8           7.3         0           0              0.9          0
Disgust        0          0.8          98.8         0           0.4            0            0
Fear           0          0             0          95.5         3.4            1.1          0
Happiness      0          0             0           0.8        99.2            0            0
Sadness        2.2        0             0           1.0         0             96.8          0
Surprise       0          0             0           0           0              0.7         99.3

Fig. 10. The average recognition rates of seven different facial expressions on the CK+ database using different schemes. "RASL" is the standard RASL algorithm with the affine transformation model, "Collection Flow" is the collection flow algorithm, "Conventional" is the conventional groupwise registration method used to construct the atlas, and "Sparse" is the proposed sparse representation scheme used to construct the atlas. "A" denotes image appearance information, and "T" denotes temporal evolution information.
To further understand the roles that image appearance and temporal evolution information play in the recognition process, we plot the average recognition accuracies of the proposed method for different values of β in Equation 11 in Figure 11. β controls the weighting between the image appearance and temporal evolution information in the recognition step. The smaller the value of β, the more the recognition relies on image appearance information, and vice versa. It can be observed from Figure 11 that when β is set to 0 (i.e., the recognition relies on image appearance information only), the recognition accuracy drops to 92.4%. As β increases, the temporal evolution information becomes more important and the recognition accuracy increases to 97.2% when β = 0.5. When β further increases, the temporal evolution information begins to dominate the image appearance information and the recognition accuracy tends to decline. This implies that both the image appearance information and the temporal evolution information play important roles in the recognition step, as they are complementary to each other; it is therefore beneficial to consider both of them for the recognition performance.
Fig. 11. Average recognition accuracies obtained by the proposed method with different values of β in Equation 11 on the CK+ database.
Another important parameter in our method is the number of time points N used to construct the atlas sequence. Intuitively, the larger the value of N, the more precisely the atlas sequence can represent the physical facial expression evolution process, while the computational burden will also increase. To study the effects of different values of N, the average recognition accuracies with respect to different values of N are shown in Figure 12. It can be seen that when N is small (e.g., N = 4), inferior recognition accuracies are obtained because the atlas sequence cannot describe the expression evolution process sufficiently. As N increases, the representation power of the atlas sequence becomes stronger and higher recognition accuracies are obtained. For instance, when N = 12, satisfactory recognition accuracies are obtained (i.e., 97.2%). However, when N further increases, the recognition accuracy begins to saturate because the atlas sequence has almost reached its maximum description capacity and the gain in recognition accuracy becomes marginal.
Fig. 12. Average recognition accuracies obtained by the proposed method with different numbers of time points N used to construct atlas sequences on the CK+ database.
Fig. 13. Average recognition rates of different approaches on the CK+ database.
Figure 13 provides further comparisons on the CK+ database between our method and some state-of-the-art dynamic facial expression methods proposed by Guo et al.
[17], Zhao and Pietikäinen [7], Gizatdinova and Surakka [49], and the HMM and ITBN models proposed by Wang et al. [50]. Although the experimental protocols of the compared methods are not exactly the same, due to different numbers of sequences and cross-validation setups, the effectiveness of our method can still be implied by its recognition rate being the highest among the compared methods. On average, our method takes 21.7 minutes in the atlas construction stage and 1.6 seconds per query sequence in the recognition stage (Matlab, 4-core 2.5 GHz processor and 6 GB RAM).

B. Experiments on the MMI Database

Our method is evaluated on the MMI database [42], which is known as one of the most challenging facial expression recognition databases due to its large inter-subject variations. From the MMI database, 175 facial expression sequences from different subjects were selected. The selection criterion is that each sequence can be labeled as one of the six basic emotions: anger, disgust, fear, happy, sadness and surprise. The facial expression images in each sequence were digitized at a resolution of 720 × 576. Some sample images from the MMI database are shown in Figure 14. Each facial image was normalized based on eye coordinates, similar to the processing on the CK+ database. To evaluate our method on the MMI database, 10-fold cross validation is adopted, similar to [17]. The confusion matrix of the proposed method is listed in Table II. It can be observed that the method achieves high recognition rates for the different expressions (i.e., all above 90%).
Fig. 14. Sample images from the MMI database.

TABLE II
CONFUSION MATRIX OF THE PROPOSED METHOD ON THE MMI DATABASE.

            Anger (%)  Disgust (%)  Fear (%)  Happiness (%)  Sadness (%)  Surprise (%)
Anger         95.6        1.2          0           0             3.2           0
Disgust        0.2       97.8          0           2.0           0             0
Fear           0          0.5         96.4         3.1           0             0
Happiness      0          0            1.8        98.2           0             0
Sadness        4.8        0            0.8         0            94.4           0
Surprise       0          0            2.5         0.6           0            96.9

To investigate the performance of the sparse representation based atlas construction, the average recognition rates of the different expressions obtained with and without using sparse representation in atlas construction are shown in Figure 15. The figure also shows the average recognition rates obtained with and without temporal information, which indicates the importance of incorporating image appearance with temporal information in the recognition stage. It can be observed that the recognition rates of the sparse representation based atlas construction are consistently higher than those obtained by the conventional scheme in [17]. Moreover, the recognition rates can be further improved by incorporating temporal information with image appearance information.
Fig. 15. The average recognition rates of six different facial expressions on the MMI database with different schemes of the proposed method.
We also study the impact of the training set size. Figure 16 shows the recognition rates obtained by the proposed method with different numbers of training samples. The horizontal axis is the number of 'folds' serving as the training set. For comparison purposes, the results of Guo's method in [17] are also computed and shown. It can be observed from Figure 16 that the recognition rates of our method converge quickly as the size of the training set increases. Specifically, for all expressions, the proposed method achieves more than 90% recognition rates when using 4 folds as the training set and the remaining 6 folds as the testing set. It is also shown that the proposed method consistently outperforms Guo's method [17].
It is also interesting to study the robustness of our method to the length of the query sequence. The most challenging case is when the query sequence contains only one image and temporal information is not available. The proposed method is evaluated under this condition. The image selected to guide the recognition is the one that has a temporal correspondence to the last image in the atlas sequence (i.e., the image with the expression at its apex). Figure 17 shows the average recognition rates. For comparison purposes, the recognition rates obtained by using all images in the query sequence are also shown.
It can be observed that the recognition rates resulting from a single input image drop consistently, which reflects the significance of temporal information in the recognition task. On the other hand, the proposed method still achieves acceptable recognition accuracy (i.e., 89.8% on average) even in this challenging case.
Fig. 16. The average recognition rates of different expressions ((a) anger, (b) disgust, (c) fear, (d) happiness, (e) sadness, (f) surprise) for the proposed method with different training set sizes on the MMI database. The recognition rates of Guo's method in [17] are calculated for comparison.
Fig. 17. The average recognition rates of six different expressions on the MMI database under the single input image and full sequence conditions.

C. Experiments on the FERA Database

To further investigate the robustness of the proposed method, it is evaluated on the facial expression recognition and analysis challenge (FERA2011) data: the GEMEP-FERA dataset [43]. The FERA dataset consists of ten different subjects displaying five basic emotions: anger, fear, joy, relief and sadness. FERA is one of the most challenging dynamic facial expression recognition databases. First, the input facial expression sequence does not necessarily start with the neutral expression and end with the apex expression. Second, there are various head movements and unpredicted facial occlusions.
Some sample images are shown in Figure 18. The FERA training set contains 155 image sequences from seven subjects, and the testing set contains 134 image sequences from six subjects. Three of the subjects in the testing set are not present in the training set. To evaluate the proposed method, we follow the standard FERA protocol [43] and construct the atlases from the training set. Then, the estimated facial expression labels of the testing set are sent to the FERA organizers to calculate the scores.
Fig. 18. Sample images from the FERA database.
In this paper, we adopted preprocessing steps similar to those in [43] for the purpose of fair comparison. Specifically, the Viola and Jones face detector [51] was first used to extract the facial region. To determine the eye locations in each facial image, a cascaded classifier trained for detecting the left and right eyes, as implemented in OpenCV, is applied. Then, normalization is performed based on the detected eye locations.
The person-specific and person-independent recognition rates obtained by our method are listed in Table III. The method achieves promising recognition rates in both settings. It is also interesting to observe that for the 'Anger' and 'Joy' expressions, the proposed method achieves higher recognition accuracies under the person-independent setting than under the person-specific setting. There are two reasons: First, one main strength of the proposed method is its capability of building unbiased facial expression atlases to guide the recognition process. The facial shape variations due to inter-person differences can be suppressed; therefore, the proposed method achieves robust recognition performance under the person-independent condition. Second, in the challenging FERA database, intra-person expression variations are not necessarily smaller than inter-person expression variations, as illustrated in Figure 19. In Figure 19, each row shows images of one facial expression sequence with expression type 'Joy' from the training set of the FERA database. The second and third rows are sequences of the same person, while the first row is a sequence of another person. It can be seen that facial features such as the eyes, brows, and mouth are quite similar between the sequences in the first and second rows, even though they are from different persons. On the other hand, there are large facial feature variations between the sequences in the second and third rows, even though they are from the same person and have the same expression type 'Joy'.
Fig. 19. An example that shows one of the challenging properties of the FERA database, where intra-person expression variations are not necessarily smaller than inter-person expression variations. Each row shows an expression sequence from the FERA training set with expression type 'Joy'. The second and third rows are sequences of the same person, while the first row is a sequence obtained from another person. It is visually observed that the intra-person expression variations are larger than the inter-person expression variations in this case.
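A sketch of the preprocessing described above (Viola-Jones face detection, OpenCV eye cascades, and eye-based normalization) is given below. The specific cascade files, output size and target eye geometry are illustrative assumptions rather than the exact settings of [43] or of this paper.

```python
# Sketch of face detection, eye detection and eye-based normalization with OpenCV.
import cv2
import numpy as np

FACE_CASCADE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
EYE_CASCADE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def normalize_face(gray, out_size=(210, 240), eye_row=0.35, eye_dist=0.45):
    """gray: single-channel frame. Returns a cropped face with the eyes moved
    to fixed positions, or None when detection fails."""
    W, H = out_size
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])          # keep the largest face
    eyes = EYE_CASCADE.detectMultiScale(gray[y:y + h, x:x + w], 1.1, 5)
    if len(eyes) < 2:
        return None
    # Order two detected eye boxes left to right and use their centers.
    (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = sorted(eyes[:2], key=lambda e: e[0])
    left = np.array([x + ex1 + ew1 / 2.0, y + ey1 + eh1 / 2.0])
    right = np.array([x + ex2 + ew2 / 2.0, y + ey2 + eh2 / 2.0])
    # Rotate about the eye midpoint so the eye line is horizontal, scale so the
    # inter-ocular distance matches the target, then shift the midpoint into place.
    center = ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
    angle = np.degrees(np.arctan2(right[1] - left[1], right[0] - left[0]))
    scale = (W * eye_dist) / np.linalg.norm(right - left)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    M[0, 2] += W / 2.0 - center[0]
    M[1, 2] += H * eye_row - center[1]
    return cv2.warpAffine(gray, M, (W, H))
```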
TABLE III
RECOGNITION RATES OBTAINED BY THE PROPOSED METHOD ON THE FERA DATABASE.

                      Anger   Fear    Joy     Relief   Sadness   Average
Person-independent    1.00    0.867   1.00    0.563    0.933     0.873
Person-specific       0.923   1.00    0.727   0.900    1.00      0.910
Overall               0.963   0.920   0.903   0.692    0.960     0.888

The overall recognition rate obtained by our method is 0.888. It is higher than those of the other methods reported in [43], where the highest overall recognition rate, 0.838, is achieved by Yang and Bhanu [52], and it is also significantly higher than that of the baseline method (0.560) [43].
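As a quick arithmetic check (not part of the original evaluation), the 'Average' column of Table III is consistent with the unweighted mean of the five per-expression rates in each row:

# Reproduce the 'Average' column of Table III as the unweighted mean of the
# per-expression rates (Anger, Fear, Joy, Relief, Sadness). Values are copied
# from the table; this is an illustrative check only.
import numpy as np

rates = {
    "Person-independent": [1.00, 0.867, 1.00, 0.563, 0.933],
    "Person-specific":    [0.923, 1.00, 0.727, 0.900, 1.00],
    "Overall":            [0.963, 0.920, 0.903, 0.692, 0.960],
}
for protocol, r in rates.items():
    print(f"{protocol}: {np.mean(r):.3f}")   # 0.873, 0.910, 0.888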
D. Experiments on the AFEW Database

The proposed method has also been evaluated on the Acted Facial Expressions in the Wild (AFEW) database [53] to study its performance when facial expression sequences are captured under wild, real-life conditions. The AFEW database was collected from movies showing real-life conditions, which depict or simulate spontaneous expressions in uncontrolled environments. Some samples are shown in Figure 20. The task is to classify each sequence into one of seven expression types: neutral (NE), happiness (HA), sadness (SA), disgust (DI), fear (FE), anger (AN), and surprise (SUR). In this paper, we follow the protocol of the Emotion Recognition in the Wild Challenge 2014 (EmotiW 2014) [53] to evaluate the proposed method. The training set defined in the EmotiW 2014 protocol is used to build the atlas sequences. The recognition accuracies on the validation set are listed in Table IV, similar to [54], [55].

Fig. 20. Sample images from the AFEW database.

TABLE IV
RECOGNITION RATES OBTAINED BY THE PROPOSED METHOD ON THE VALIDATION SET OF THE AFEW DATABASE AND COMPARISONS WITH OTHER STATE-OF-THE-ART RECOGNITION METHODS.

Method                            Recognition Accuracy (in %)
Baseline [56]                     34.4
Multiple Kernel Learning [57]     40.2
Improved STLMBP [58]              45.8
Multiple Kernel + Manifold [59]   48.5
Our Method                        48.3

From Table IV, it can be seen that our method achieves significantly higher recognition accuracy than the baseline algorithm [56] (i.e., LBP-TOP [7]) in the EmotiW 2014 protocol. Moreover, our method achieves recognition accuracy comparable to that of the winner of the EmotiW 2014 challenge (i.e., Multiple Kernel + Manifold [59]), with a difference of only 0.2%, and outperforms the other state-of-the-art methods under comparison. This implies the robustness of the proposed method under wild conditions.

E. Experiments on the UNBC Database

Our method is evaluated on the UNBC-McMaster shoulder pain expression archive database [44] for spontaneous pain expression monitoring. It consists of 200 dynamic facial expression sequences (48,398 frames) from 25 subjects, each of whom self-identified as having a problem with shoulder pain. Each sequence was obtained while the subject was instructed by physiotherapists to move his or her limb as far as possible. For each sequence, observers who had considerable training in the identification of pain expression rated it on a 6-point scale ranging from 0 (no pain) to 5 (strong pain). Each frame was manually FACS coded, and 66-point active appearance model (AAM) landmarks were provided [44]. Figure 21 shows some images from the UNBC-McMaster database.

Fig. 21. Images from the UNBC-McMaster database.

We follow the same experimental settings as in [44], where leave-one-subject-out cross validation was adopted. Referring to the observers' 6-point scale ratings, all sequences were grouped into three classes [44]: 0-1 as class one, 2-3 as class two and 4-5 as class three. Similar to [44], rough alignment and initialization are performed for our method using the 66 AAM landmarks provided for each frame. Figure 22 shows the constructed facial expression atlases for the different classes. It can be observed that the constructed atlases successfully capture subtle and important details, especially in areas that can reflect the degree of pain, such as the eyes and mouth.

Fig. 22. The longitudinal atlases of class 1 (observer score 0-1), class 2 (observer score 2-3) and class 3 (observer score 4-5), listed in the first, second and third rows, respectively.

The classification accuracies of our method are compared with those of Lucey's method [44], as shown in Figure 23. The classification accuracies for class 1, class 2 and class 3 obtained by our method are 88%, 63% and 59%, respectively. A significant improvement is achieved compared to the accuracies obtained by Lucey's method: 75%, 38% and 47% [44]. This demonstrates the effectiveness of the proposed method in characterizing spontaneous expressions.

Fig. 23. The average recognition rates of Lucey's method [44] and the proposed method on the UNBC-McMaster database.

The proposed method is also compared with a state-of-the-art pain classification method [60]. For the purpose of fair comparison, we adopted the same protocol as in [60]: pain class labels are binarized into 'pain' and 'no pain' by defining instances with pain intensities larger than or equal to 3 as the positive class (pain) and pain intensities equal to 0 as the negative class (no pain); intermediate pain intensities of 1 and 2 are omitted. The average classification accuracy obtained by the proposed method under this setting is 84.8%, which is slightly higher than or comparable to the 83.7% reported in [60]. It should be noted that the experimental settings may not be exactly the same as in [60]; nevertheless, the result indicates the effectiveness of the proposed method.
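The class grouping and the pain/no-pain binarization used above are simple label mappings; the short sketch below makes them explicit. It is illustrative only: the example ratings are invented, and the helper functions are not taken from the authors' code.

# Two labelling protocols used in this subsection:
# (1) group the 0-5 observer ratings into three classes (0-1, 2-3, 4-5);
# (2) binarize as in [60]: rating >= 3 -> 'pain', rating == 0 -> 'no pain',
#     ratings 1 and 2 are omitted.

def three_class_label(opi):
    """Map a 0-5 observer rating to class 1 (0-1), class 2 (2-3) or class 3 (4-5)."""
    return 1 if opi <= 1 else 2 if opi <= 3 else 3

def binary_pain_label(opi):
    """Return 'pain' for ratings >= 3, 'no pain' for 0, and None for the omitted 1-2 range."""
    if opi >= 3:
        return "pain"
    if opi == 0:
        return "no pain"
    return None  # intensities 1 and 2 are excluded from the binary protocol

example_ratings = [0, 1, 2, 3, 4, 5]   # hypothetical per-sequence observer ratings
print([three_class_label(r) for r in example_ratings])  # [1, 1, 2, 2, 3, 3]
print([binary_pain_label(r) for r in example_ratings])  # ['no pain', None, None, 'pain', 'pain', 'pain']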
F. Experiments on Cross-Dataset Evaluation

The cross-database generalization ability of the proposed method has also been studied. We constructed six basic dynamic expression atlas sequences (anger, disgust, fear, happiness, sadness and surprise) from the CK+ database following the same setting as in Section IV-A. These atlas sequences are then used to guide the recognition process on the MMI database, with all 175 facial expression sequences selected from the MMI database in Section IV-B serving as the testing set. Table V lists the confusion matrix obtained by our method. The recognition accuracies are consistently lower than those obtained under the within-dataset validation condition listed in Table II. This is mainly due to larger variations in illumination conditions, pose and facial shapes across different databases. However, it can be observed from Table V that our method still achieves high recognition performance (above 90% average recognition rate) and outperforms some well-known methods, such as Shan's method (86.9%) [61].

TABLE V
CONFUSION MATRIX OF THE PROPOSED METHOD ON THE MMI DATABASE WITH ATLAS SEQUENCES TRAINED ON THE CK+ DATABASE.

             Anger (%)   Disgust (%)   Fear (%)   Happiness (%)   Sadness (%)   Surprise (%)
Anger        90.8        1.5           0          0               7.7           0
Disgust      0.5         93.2          0          5.3             1.0           0
Fear         0.4         0.7           93.5       5.4             0             0
Happiness    0           2.1           5.3        92.6            0             0
Sadness      7.6         0             2.3        0               89.7          0.4
Surprise     0.7         0             4.9        2.8             0             91.6

V. CONCLUSION

In this paper, we propose a new way to tackle the dynamic facial expression recognition problem. It is formulated as a longitudinal atlas construction and diffeomorphic image registration problem. Our method consists of two main stages, namely the atlas construction stage and the recognition stage. In the atlas construction stage, longitudinal atlases of different facial expressions are constructed based on sparse representation groupwise registration. The constructed atlases can capture the overall facial appearance movements of a certain expression among the population. In the recognition stage, both the image appearance and the temporal information are considered and integrated by diffeomorphic registration and sparse representation. Our method has been extensively evaluated on five dynamic facial expression recognition databases. The experimental results show that this method consistently achieves higher recognition rates than the other compared methods.

One limitation of the proposed method is that it is still not robust enough to overcome the challenge of strong illumination changes. The main reason is that the LDDMM registration algorithm used in this paper may not compensate for strong illumination changes. One possible solution is to use more complex image matching metrics in the LDDMM framework, such as the localized correlation coefficient and localized mutual information, which have some degree of robustness against illumination changes. This is one possible future direction for this study.

REFERENCES

[1] B. Fasel and J. Luettin, "Automatic facial expression analysis: a survey," Pattern Recognition, vol. 36, pp. 259–275, 2003.
[2] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 39–58, 2009.
[3] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, "Recognizing facial expression: machine learning and application to spontaneous behavior," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 568–573.
[4] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 971–987, 2002.
[5] S. Lucey, A. Ashraf, and J. Cohn, "Investigating spontaneous facial action recognition through AAM representations of the face," in Face Recognition Book, 2007, pp. 275–286.
[6] E. Learned-Miller, "Data driven image models through continuous joint alignment," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 2, pp. 236–250, 2006.
[7] G. Zhao and M. Pietikäinen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 915–928, 2007.
[8] L. Cohen, N. Sebe, A. Garg, L. Chen, and T. Huang, "Facial expression recognition from video sequences: temporal and static modeling," Computer Vision and Image Understanding, vol. 91, pp. 160–187, 2003.
[9] Y. Zhang and Q. Ji, "Active and dynamic information fusion for facial expression understanding from image sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 699–714, 2005.
[10] S. Koelstra, M. Pantic, and I. Patras, "A dynamic texture-based approach to recognition of facial actions and their temporal models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 1940–1954, 2010.
[11] A. Ramirez and O. Chae, "Spatiotemporal directional number transitional graph for dynamic texture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 2015.
[12] H. Fang, N. Parthalain, A. Aubrey, G. Tam, R. Borgo, P. Rosin, P. Grant, D. Marshall, and M. Chen, "Facial expression recognition in dynamic sequences: An integrated approach," Pattern Recognition, vol. 47, no. 3, pp. 1271–1281, 2014.
[13] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. Metaxas, "Learning active facial patches for expression analysis," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 2562–2569.
[14] Z. Wang, S. Wang, and Q. Ji, "Capturing complex spatio-temporal relations among facial muscles for facial expression recognition," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013, pp. 3422–3429.
[15] M. Beg, M. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, pp. 139–157, 2005.
[16] S. Yousefi, P. Minh, N. Kehtarnavaz, and C. Yan, "Facial expression recognition based on diffeomorphic matching," in International Conference on Image Processing, 2010, pp. 4549–4552.
[17] Y. Guo, G. Zhao, and M. Pietikäinen, "Dynamic facial expression recognition using longitudinal facial expression atlases," in European Conference on Computer Vision, 2012, pp. 631–644.
[18] Y. Chang, C. Hu, R. Feris, and M. Turk, "Manifold based analysis of facial expression," Image and Vision Computing, vol. 24, no. 6, pp. 605–614, 2006.
[19] M. Pantic and I. Patras, "Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 36, no. 2, pp. 433–449, 2006.
[20] P. Yang, Q. Liu, X. Cui, and D. Metaxas, "Facial expression recognition using encoded dynamic features," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[21] M. Yeasin, B. Bullot, and R. Sharma, "From facial expression to level of interests: A spatio-temporal approach," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 922–927.
[22] G. Edwards, C. Taylor, and T. Cootes, "Interpreting face images using active appearance models," in IEEE FG, 1998, pp. 300–305.
[23] J. Saragih, S. Lucey, and J. Cohn, "Deformable model fitting by regularized landmark mean-shift," International Journal of Computer Vision, vol. 91, no. 2, pp. 200–215, 2011.
[24] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[25] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[26] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
[27] D. Rueckert, L. Sonoda, C. Hayes, D. Hill, M. Leach, and D. Hawkes, "Nonrigid registration using free-form deformations: Application to breast MR images," IEEE Transactions on Medical Imaging, vol. 18, pp. 712–721, 1999.
[28] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[29] F. De la Torre and M. Nguyen, "Parameterized kernel principal component analysis: Theory and applications to supervised and unsupervised image alignment," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[30] G. Tzimiropoulos, V. Argyriou, S. Zafeiriou, and T. Stathaki, "Robust FFT-based scale-invariant image registration with image gradients," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1899–1906, 2010.
[31] X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 2887–2894.
[32] S. Liao, D. Shen, and A. Chung, "A Markov random field groupwise registration framework for face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 657–669, 2014.
[33] J. Maintz and M. Viergever, "A survey of medical image registration," Medical Image Analysis, vol. 2, pp. 1–36, 1998.
[34] M. Miller and L. Younes, "Group actions, homeomorphisms, and matching: A general framework," International Journal of Computer Vision, vol. 41, pp. 61–84, 2001.
[35] S. Joshi, B. Davis, M. Jomier, and G. Gerig, "Unbiased diffeomorphic atlas construction for computational anatomy," NeuroImage, vol. 23, pp. 151–160, 2004.
[36] T. Cootes, C. Twining, V. Petrovic, K. Babalola, and C. Taylor, "Computing accurate correspondences across groups of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 11, pp. 1994–2005, 2010.
[37] M. Hernandez, S. Olmos, and X. Pennec, "Comparing algorithms for diffeomorphic registration: Stationary LDDMM and diffeomorphic demons," in 2nd MICCAI Workshop on Mathematical Foundations of Computational Anatomy, 2008, pp. 24–35.
[38] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[39] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B, vol. 58, pp. 267–288, 1996.
[40] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[41] P. Lucey, J. Cohn, T. Kanade, J. Saragih, and Z. Ambadar, "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 94–101.
[42] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in IEEE International Conference on Multimedia and Expo, 2005, pp. 317–321.
[43] M. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer, "Meta-analysis of the first facial expression recognition challenge," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 42, no. 4, pp. 966–979, 2012.
[44] P. Lucey, J. Cohn, K. Prkachin, P. Solomon, S. Chew, and I. Matthews, "Painful monitoring: Automatic pain monitoring using the UNBC-McMaster shoulder pain expression archive database," Image and Vision Computing, vol. 30, pp. 197–205, 2012.
[45] I. Kemelmacher-Shlizerman and S. Seitz, "Collection flow," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012, pp. 1792–1799.
[46] I. Kemelmacher-Shlizerman, S. Suwajanakorn, and S. Seitz, "Illumination-aware age progression," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014, pp. 3334–3341.
[47] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 763–770.
[48] C. Liu, Beyond Pixels: Exploring New Representations and Applications for Motion Analysis, Ph.D. thesis, MIT, 2009.
[49] Y. Gizatdinova and V. Surakka, "Feature-based detection of facial landmarks from neutral and expressive facial images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 135–139, 2006.
[50] Z. Wang, S. Wang, and Q. Ji, "Capturing complex spatio-temporal relations among facial muscles for facial expression recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3422–3429.
[51] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2002.
[52] S. Yang and B. Bhanu, "Facial expression recognition using emotion avatar image," in IEEE International Conference on Automatic Face and Gesture Recognition, 2011, pp. 866–871.
[53] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting large, richly annotated facial-expression databases from movies," IEEE MultiMedia, vol. 19, pp. 34–41, 2012.
[54] J. Chen, T. Takiguchi, and Y. Ariki, "Facial expression recognition with multithreaded cascade of rotation-invariant HOG," in International Conference on Affective Computing and Intelligent Interaction, 2015, pp. 636–642.
[55] M. Liu, S. Shan, R. Wang, and X. Chen, "Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014, pp. 1749–1756.
[56] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon, "Emotion recognition in the wild challenge 2014: Baseline, data and protocol," in ACM International Conference on Multimodal Interaction, 2014.
[57] J. Chen, Z. Chen, Z. Chi, and H. Fu, "Emotion recognition in the wild with feature fusion and multiple kernel learning," in ACM International Conference on Multimodal Interaction, 2014, pp. 508–513.
[58] X. Huang, Q. He, X. Hong, G. Zhao, and M. Pietikäinen, "Improved spatiotemporal local monogenic binary pattern for emotion recognition in the wild," in ACM International Conference on Multimodal Interaction, 2014, pp. 514–520.
[59] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, "Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild," in ACM International Conference on Multimodal Interaction, 2014, pp. 494–501.
[60] K. Sikka, A. Dhall, and M. Bartlett, "Classification and weakly supervised pain localization using multiple segment representation," Image and Vision Computing, vol. 32, no. 10, pp. 659–670, 2014.
[61] C. Shan, S. Gong, and P. McOwan, "Facial expression recognition based on local binary patterns: a comprehensive study," Image and Vision Computing, vol. 27, pp. 803–816, 2009.

Yimo Guo received her B.Sc. and M.Sc. degrees in computer science in 2004 and 2007, respectively, and the Ph.D. degree in computer science and engineering from the University of Oulu, Finland, in 2013. Her research interests include texture analysis and video synthesis.

Guoying Zhao received the Ph.D. degree in computer science from the Chinese Academy of Sciences, Beijing, China, in 2005. She is currently an Associate Professor with the Center for Machine Vision Research, University of Oulu, Finland, where she has been a researcher since 2005. In 2011, she was selected to the highly competitive Academy Research Fellow position. She has authored or co-authored more than 140 papers in journals and conferences, and has served as a reviewer for many journals and conferences. She has lectured tutorials at ICPR 2006, ICCV 2009, and SCIA 2013, and authored or edited three books and four special issues in journals. Dr. Zhao was a co-chair of seven international workshops at ECCV, ICCV, CVPR and ACCV, and of two special sessions at FG13 and FG15. She is an editorial board member of the Image and Vision Computing journal, the International Journal of Applied Pattern Recognition, and ISRN Machine Vision. She is an IEEE Senior Member. Her current research interests include image and video descriptors, gait analysis, dynamic-texture recognition, facial-expression recognition, human motion analysis, and person identification.

Matti Pietikäinen received his Doctor of Science in Technology degree from the University of Oulu, Finland. He is currently a Professor, Scientific Director of Infotech Oulu, and Director of the Center for Machine Vision Research at the University of Oulu. From 1980 to 1981 and from 1984 to 1985, he visited the Computer Vision Laboratory at the University of Maryland.
He has made pioneering contributions, e.g., to local binary pattern (LBP) methodology, texture-based image and video analysis, and facial image analysis. He has authored over 335 refereed papers. His papers currently have about 31,500 citations in Google Scholar (h-index 63), and six of his papers have over 1,000 citations. Dr. Pietikäinen was an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and Pattern Recognition journals, and currently serves as an Associate Editor of the Image and Vision Computing and IEEE Transactions on Information Forensics and Security journals. He was President of the Pattern Recognition Society of Finland from 1989 to 1992, and was named its Honorary Member in 2014. From 1989 to 2007 he served as a member of the Governing Board of the International Association for Pattern Recognition (IAPR), and became one of the founding fellows of the IAPR in 1994. He is an IEEE Fellow for contributions to texture and facial image analysis for machine vision. In 2014, his research on LBP-based face description was awarded the Koenderink Prize for Fundamental Contributions in Computer Vision.