Automatic Pose Correction for Local
Feature-based Face Authentication
Daniel González-Jiménez¹, Federico Sukno², José Luis Alba-Castro¹, and Alejandro Frangi²
¹ Departamento de Teoría de la Señal y Comunicaciones, Universidad de Vigo, Spain
{danisub,jalba}@gts.tsc.uvigo.es
² Departamento de Tecnología, Universidad Pompeu Fabra, Barcelona, Spain
{federico.sukno,alejandro.frangi}@upf.edu
Abstract. In this paper, we present an automatic face authentication
system. Accurate segmentation of prominent facial features is accomplished by means of an extension of the Active Shape Model (ASM)
approach, the so-called Active Shape Model with Invariant Optimal Features (IOF-ASM). Once the face has been segmented, a pose correction
step is applied, so that frontal face images are synthesized. For the generation of these virtual images, we make use of a subset of the shape
parameters extracted from a training dataset and Thin Plate Splines
texture mapping. Afterwards, sets of local features are computed from
these virtual images. The performance of the system is demonstrated on
configurations I and II of the XM2VTS database.
Keywords: Face Authentication, Automatic Segmentation, Pose Correction.
1 Introduction
Although many algorithms have been proposed during the last decade, the general face recognition problem remains unsolved due to several factors that affect the performance of face-based biometric approaches, such as illumination and pose variations, expression changes, etc. [19]. Moreover, face recognition algorithms must be supplied with cropped images that ideally contain only face pixels, i.e. a previous step must locate the face (and perhaps a set of facial features) within the input image. Face authentication contests such as [17] have shown a general degradation in performance when moving from manual registration of faces to automatic detection before authentication. In this paper, we address two aspects of the face authentication problem: automatic face modelling from still images and pose correction.
Among the most popular approaches for statistical modelling are the active models of shape and appearance, introduced by Cootes et al. [11, 12]. These
techniques allow for detailed modelling of a wide range of objects, as long as
an appropriate training set is available. Their application to facial images has
been previously exploited [16, 15] to locate the main facial features (e.g. eyes,
nose, lips) and recover shape and texture parameters. In this work we use the
Active Shape Models with Invariant Optimal Features (IOF-ASM), an extension
of Active Shape Models (ASM) that improves segmentation accuracy by means
of a non-linear texture model based on local image structure [21].
As stated above, the presence of pose differences within the input images
is one of the main factors that degrades the performance of face recognition
systems. Up to now, the most practical and successful algorithms dealing with
pose-invariant face recognition are those which make use of prior knowledge of
the class of faces such as [1], where an individual eigenspace is constructed for
each pose. Another approach is presented in [2], where virtual views of the subject under different poses are synthesized from a single image using face class information; these views are then used in a view-based recognizer. In [3],
a morphable 3D face model was fitted to the input images. Among others, the
parameters that account for pose are subject to modification, so that virtual
images under the adequate pose can be synthesized. The main drawbacks of this method are the need for a 3D face training database and the high computational complexity. Using a training dataset of face images, we built a Point Distribution Model and, from its main modes of variation, identified the parameters responsible for the pose of the face (namely the pose parameters). Using the segmentation results provided by the IOF-ASM approach, our system compensates for pose variations by normalizing these pose parameters and synthesizing virtual frontal images through texture mapping. Sets of local features are then extracted from these virtual images by means of a two-stage approach. Experiments on the XM2VTS database show that this simple strategy softens moderate pose effects, achieving error rates comparable to the state of the art.
The paper is organized as follows: Section 2 presents the statistical modelling
of the face and the approach used for segmenting facial features. In Section 3, the
synthesis of pose-corrected face images is addressed, while Section 4 explains the
two stages of feature extraction. In Section 5, we show our experimental results
over the XM2VTS database [18]. Finally, conclusions are drawn in Section 6.
2 Statistical Face Modelling
2.1 A Point Distribution Model for faces
A Point Distribution Model (PDM) of a face is generated from a set of training examples. For each training image I_i, N landmarks are located and their coordinates are stored, forming a vector X_i = (x_{1i}, x_{2i}, ..., x_{Ni}, y_{1i}, y_{2i}, ..., y_{Ni}). The pair (x_{ji}, y_{ji}) represents the coordinates of the j-th landmark in the i-th training image. After aligning all training examples, a Principal Component Analysis is performed in order to find the most important modes of shape variation. As a consequence, any training shape X_i can be approximately reconstructed as

X_i = X̄ + P b,    (1)

where X̄ stands for the mean shape, P is a matrix whose columns are the unit eigenvectors of the first t modes of variation found in the training set, and b is the vector of parameters that defines the actual shape of X_i.
Fig. 1. Effect of varying pose parameters: rotation-in-depth parameter (first row) and elevation parameter (second row). The middle column shows the average face shape, while the left and right columns are generated by displacing the corresponding parameters by ±5 times the standard deviation of the training set.
Notice that the k-th component of b (b_k, k = 1, 2, ..., t) weights the k-th mode of variation. Examining the shapes generated by varying b_k within suitable limits, we find the parameters responsible for pose, as indicated in Figure 1. Note that, although a given eigenvector should not be assigned to a unique mode of facial variation, it is clear that the eigenvectors shown in this figure are mainly related to pose changes. Let b_pose be the set of parameters which accounts for pose variation.
Since P^T P = I, then

b = P^T (X_i − X̄),    (2)

i.e. given any shape, it is possible to obtain its vector of parameters b and, in particular, we are able to find its pose (i.e. b_pose).
We built a 62-point PDM using the set of manually annotated landmarks³ from the training images shared by configurations I and II [9] of the XM2VTS database [18].
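To make the PDM construction concrete, the following sketch builds the model via PCA and recovers the shape parameters of equations (1) and (2). It is a minimal illustration in Python/NumPy, assuming the aligned training shapes are stacked row-wise in a matrix; the function names and the `pose_idx` list of pose-related mode indices are illustrative, not part of the original implementation.

```python
# Minimal sketch of building a PDM and recovering shape/pose parameters (eqs. 1-2).
# Assumes `shapes` is an (M, 2N) array of aligned landmark vectors
# (x_1..x_N, y_1..y_N) for M training images; all names are illustrative.
import numpy as np

def build_pdm(shapes, t):
    """Return mean shape, the first t modes of variation (columns of P) and their eigenvalues."""
    mean_shape = shapes.mean(axis=0)
    cov = np.cov(shapes - mean_shape, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalue order
    order = np.argsort(eigvals)[::-1][:t]        # keep the t largest modes
    return mean_shape, eigvecs[:, order], eigvals[order]

def shape_parameters(x, mean_shape, P):
    """Eq. (2): b = P^T (x - mean_shape), valid because P^T P = I."""
    return P.T @ (x - mean_shape)

def reconstruct(b, mean_shape, P):
    """Eq. (1): x = mean_shape + P b."""
    return mean_shape + P @ b

# Example: zeroing the (manually identified) pose parameters of a fitted shape.
# pose_idx lists the indices of the modes found to control pose (cf. Fig. 1).
# mean_shape, P, lam = build_pdm(shapes, t=20)
# b = shape_parameters(x_fitted, mean_shape, P)
# b[pose_idx] = 0.0
# x_frontal = reconstruct(b, mean_shape, P)
```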
2.2 IOF-ASM
When a new image containing a face is presented to the system, the vector
of shape parameters that fits the data, b, should be computed automatically.
Active Shape Models with Invariant Optimal Features (IOF-ASM) is a statistical
modelling method specifically designed and tested to handle the complexities of
facial images. The algorithm learns the shape statistics as in the original ASMs
³ http://www-prima.inrialpes.fr/FGnet/data/07-XM2VTS/xm2vts-markup.html
Algorithm 1 IOF-ASM matching to a new image
 1: Compute invariants for the whole image
 2: T = Initial transformation guess for face position and size
 3: X = X̄ (modelShape = meanShape)
 4: for i = 1 to number of iterations do
 5:   Project shape to image coordinates: Y = T X
 6:   for l = 1 to number of landmarks do
 7:     Sample invariants around l-th landmark
 8:     Determine best candidate point to place the landmark
 9:     if the best candidate is good enough then
10:       Move the landmark to the best candidate point
11:     else
12:       Keep previous landmark position (do not move)
13:     end if
14:   end for
15:   Let the shape with new positions be Y
16:   Update T and PDM parameters: b = P^T (T^{-1} Y − X̄)
17:   Apply PDM constraints: b = PdmConstrain(b, β)
18:   Get new model shape: X = X̄ + P b
19: end for
[11] but improves the local texture description by using a set of differential
invariants combined with non-linear classifiers. As a result, IOF-ASM produces
a more accurate segmentation of the facial features [21].
The matching procedure is summarized in Algorithm 1. In line 1 the image
is preprocessed to obtain a set of differential invariants. These invariants are
the core of the method: they consist of combinations of partial derivatives that are invariant to rigid transformations [22, 20]. Moreover, IOF-ASM uses
a minimal set of order K so that any other algebraic invariant up to order K
can be reduced to a linear combination of elements of this minimal set [13].
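As an illustration of the kind of quantities involved, the sketch below computes a few low-order, rotation-invariant combinations of Gaussian image derivatives. The exact minimal invariant set of order K used by IOF-ASM is the one derived in [13, 20]; this example only shows representative invariants and is not the actual feature set of the method.

```python
# Illustrative sketch of low-order differential invariants built from Gaussian
# derivatives of the image. Each output channel is invariant to in-plane rotation.
import numpy as np
from scipy.ndimage import gaussian_filter

def differential_invariants(image, sigma=2.0):
    img = image.astype(float)
    L   = gaussian_filter(img, sigma)                 # smoothed intensity
    Lx  = gaussian_filter(img, sigma, order=(0, 1))   # d/dx (columns)
    Ly  = gaussian_filter(img, sigma, order=(1, 0))   # d/dy (rows)
    Lxx = gaussian_filter(img, sigma, order=(0, 2))
    Lyy = gaussian_filter(img, sigma, order=(2, 0))
    Lxy = gaussian_filter(img, sigma, order=(1, 1))
    grad2 = Lx**2 + Ly**2                             # squared gradient magnitude
    lap   = Lxx + Lyy                                 # Laplacian
    hess  = Lx**2 * Lxx + 2 * Lx * Ly * Lxy + Ly**2 * Lyy  # gradient^T Hessian gradient
    return np.stack([L, grad2, lap, hess], axis=-1)
```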
The other key point of the algorithm is between lines 6 and 14. For each
landmark, an image-driven search is performed to determine the best position
for it to be placed. The process starts by sampling the invariants in a neighborhood of the landmark (line 7). In IOF-ASM this neighborhood is represented by
a rectangular grid, whose dimensions are parameters of the model. A non-linear
texture classifier analyzes the sampled data to determine if the local structure
of the image is compatible with the one learnt during training for this landmark. A predefined number of displacements are allowed for the position of the
landmark (perpendicularly to the boundary, as in [11]), so that the texture classifier analyzes several candidate positions. Once the best candidate is found, say
(xB , yB ), the matching between its local image structure and the one learnt during training is verified (line 9) by means of a robust metric [14]. The applied
metric consists of evaluating the sampled data grouped according to its perpendicular distance to the shape boundary. Grouped this way, the samples
can be organized in a one-dimensional profile of length lP . Based on the output
from the texture classifier, each position on this profile results in either a supporting point or an outlier (supporting points are those profile points suggesting that (x_B, y_B) is the best position for the landmark to be placed, while outliers indicate a different position and therefore suggest that (x_B, y_B) is incorrect). If the supporting points amount to at least two thirds of l_P, the matching is considered accurate and the landmark is moved to the new position. Otherwise the matching is not trustworthy (i.e. the image structure does not clearly suggest a landmark) and the landmark position is kept unchanged (see [21] for details).
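A minimal sketch of this acceptance rule, assuming the texture classifier has already produced a boolean support/outlier decision for each of the l_P profile positions:

```python
# Sketch of the two-thirds acceptance rule described above. `supports` is assumed
# to be a boolean array of length l_P (True = supporting point, False = outlier).
import numpy as np

def accept_candidate(supports):
    """Move the landmark only if at least 2/3 of the profile supports the candidate."""
    supports = np.asarray(supports, dtype=bool)
    return supports.sum() >= (2.0 / 3.0) * supports.size

# Example: accept_candidate([True] * 8 + [False] * 4) -> True (8/12 = 2/3)
```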
The PDM constraints of line 17 ensure that the obtained shape is plausible according to the learnt statistics (i.e. it looks like a face). For this purpose, each component of b is limited so that |b_k| ≤ β √λ_k (1 ≤ k ≤ t), where t is the number of modes of variation of the PDM, λ_k is the eigenvalue associated with the k-th mode and β is a constant, usually set between 1 and 3, that controls the degree of flexibility of the PDM (see [11]).
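The constraint of line 17 can be sketched as a simple per-component clamping (illustrative Python, assuming the eigenvalues λ_k of the PDM are available):

```python
# Sketch of the PDM constraint of line 17: clamp each parameter to |b_k| <= beta * sqrt(lambda_k).
import numpy as np

def pdm_constrain(b, eigvals, beta=3.0):
    limits = beta * np.sqrt(eigvals)     # one limit per mode of variation
    return np.clip(b, -limits, limits)
```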
3 Correcting pose variations in face images
Once the flexible shape model (with coordinates X) has been fitted to the face
image I, the shape parameters b are extracted using equation (2). In particular,
we are interested in the subset of parameters describing the pose (b_pose). In order to generate a frontal mesh, these parameters are set to zero⁴. Hence, we obtain a new vector of parameters b̂ and, through equation (1), the frontal face
mesh X̂.
Given the original face I, the coordinates of its fitted flexible shape model, X, and the new set of coordinates, X̂, a virtual face image Î must be synthesized by warping the original face onto the new shape. For this purpose, we used the method developed in [4], based on thin plate splines. Given the set of correspondences between X and X̂, the original face I is deformed so that the original landmarks are moved to fit the new shape. The full procedure of pose normalization is shown in Figure 2.
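A minimal sketch of this pose-normalization step is given below. It zeroes the pose parameters, rebuilds the frontal mesh through equation (1), and warps the texture with a thin plate spline. The paper follows the TPS formulation of [4]; here SciPy's thin-plate-spline radial basis interpolator is used as a stand-in for the backward mapping, and all function and variable names are illustrative.

```python
# Sketch of pose normalization: normalize pose parameters, rebuild the frontal
# mesh (eq. 1) and warp the grayscale texture with a thin plate spline.
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def frontal_mesh(b, pose_idx, mean_shape, P):
    b_hat = b.copy()
    b_hat[pose_idx] = 0.0                 # set pose parameters to zero
    return mean_shape + P @ b_hat         # eq. (1): frontal coordinates X_hat

def tps_warp(image, src_pts, dst_pts):
    """Warp `image` so that landmarks src_pts (N, 2) move to dst_pts (N, 2).

    Backward mapping: for every output pixel we interpolate its source position
    with a thin-plate-spline interpolator fitted on the (dst -> src) pairs.
    """
    tps = RBFInterpolator(dst_pts, src_pts, kernel='thin_plate_spline')
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    out_coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)
    src_coords = tps(out_coords)          # (h*w, 2) source (x, y) positions
    warped = map_coordinates(image.astype(float),
                             [src_coords[:, 1], src_coords[:, 0]],  # rows, cols
                             order=1, mode='nearest')
    return warped.reshape(h, w)

# Usage sketch (landmark vectors reshaped from (2N,) into (N, 2) point arrays):
# X_hat = frontal_mesh(b, pose_idx, mean_shape, P)
# I_hat = tps_warp(I, src_pts=points_of(X), dst_pts=points_of(X_hat))
```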
3.1 Advantages over warping onto a mean shape
When warping an image onto the average shape (X̄) of a training set, all shape
parameters are set to zero. In other words, the fitted flexible shape model is
forced to be moved to the coordinates of X̄. Holistic approaches such as PCA
need all images to be embedded into a given reference frame (an average shape
for instance), in order to represent these images as vectors of ordered pixels. The
problem arises when the subject’s shape differs enough from the average shape,
as the warped image may appear geometrically distorted, and subject-specific
information may be removed. Given that our method is not holistic but uses local
features instead, the reference-frame constraint is avoided and the distortion is
minimized by modifying only pose parameters rather than the whole shape.
⁴ We use the term frontal when referring to the pose of the mean shape of the PDM. However, the only requirement of the method is that all shapes can be mapped to a common view, so a strictly frontal mean shape is not needed.
Fig. 2. Block diagram for pose normalization. TPS stands for Thin Plate Splines.
4 Feature extraction
Once the normalization process has finished, features must be extracted from the virtual frontal images Î. Up to now, most algorithms encoding local information have been based on localizing a pre-defined set of landmarks and extracting features from the regions surrounding those points. The key idea behind our approach is to select a client-specific, discriminative set of points from which features should be extracted. The choice of this set is accomplished through a two-layer strategy, whose stages are explained below.
Layer I: Shape-driven selection and matching. In the first step, a preliminary selection of facial points is accomplished through the use of shape information [5]. Lines depicting face structure are extracted by thresholding the response of a ridge and valley detector, and a set of points P = {p_1, p_2, ..., p_n} is chosen automatically by sampling from these lines. Figure 3-A illustrates this procedure. Then, a set of multi-scale and multi-orientation Gabor features (a so-called jet) is computed at each shape-driven point. Let J_{p_i} be the jet obtained from point p_i. Given the two faces to be compared, say Î_train and Î_test, their respective sets of points are computed: P_train = {p_1, p_2, ..., p_n} and P_test = {q_1, q_2, ..., q_n}, and a shape matching algorithm based on shape contexts [6] is used to calculate the correspondences between the two sets of points, ξ(i): p_i ⟹ q_{ξ(i)}. Hence, jet J_{p_i} will be compared to J_{q_{ξ(i)}}. The comparison between J_{p_i} and J_{q_{ξ(i)}} is given by the normalized dot product ⟨J_{p_i}, J_{q_{ξ(i)}}⟩, taking into account that only the moduli of the jet coefficients are used.
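The jet comparison can be sketched as follows (illustrative Python; the jets are assumed to be complex coefficient vectors produced by a multi-scale, multi-orientation Gabor filter bank, whose construction is omitted here):

```python
# Sketch of the Layer I jet comparison: normalized dot product of the moduli
# of two Gabor jets, each a 1-D complex array of filter responses at one point.
import numpy as np

def jet_similarity(jet_p, jet_q):
    a = np.abs(np.asarray(jet_p))        # only the moduli of the coefficients are used
    b = np.abs(np.asarray(jet_q))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```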
Layer II: Accuracy-based selection. Some previous approaches have focused on identifying which features are the most important for authentication purposes. Among others, [8] and [7] select and weight the nodes of a rectangular grid based on Linear Discriminant Analysis (LDA). This kind of analysis is possible because a given node represents the same facial region in every image. In our case, we cannot assume this, so a different method is needed in order to select the most discriminative points.
Fig. 3. A) Layer I: A ridge and valley detector is applied to the original image (top
left), and its response is shown on the right. Thresholding this representation leads to
a set of lines depicting face structure (bottom left). The set of points P is obtained by
sampling from these lines (bottom right). B) Layers I+II: Final set of points after
layer II is applied.
The problem can be formulated as follows. Given:
– a training image for client C, say Î_train,
– a set of images of the same client, Î_j^c, j = 1, ..., N_c, and
– a set of imposter images, Î_j^im, j = 1, ..., N_im,
we want to find which subset, P* ⊂ P_train, is the most discriminative. Since each p_i from P_train has a corresponding point in any other image, we evaluate the individual classification accuracy of its associated jet J_{p_i}, so that only the locations whose jets are good at discriminating between clients and imposters are preserved. With the set of images given above, we have N_c client accesses and N_im imposter trials for jet J_{p_i} to classify. We measure the False Acceptance Rate (FAR_i) and the False Rejection Rate (FRR_i) for this jet and, if the Total Error Rate (TER_i = FAR_i + FRR_i) exceeds a threshold τ, jet J_{p_i} is discarded. Finally, only a subset of points, P*, is chosen per image, and the score between Î_train and Î_test is given by:
S = f_{n*}( { ⟨J_{p_i}, J_{q_{ξ(i)}}⟩ }_{p_i ∈ P*} ),    (3)

where f_{n*} stands for a generic combination rule of the n* dot products. Figure 3-B presents the set of points chosen after both selection layers.
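A sketch of the Layer II selection and of the final score of equation (3) is given below. It assumes that, for each point of the client template, the Layer-I similarities against the client and imposter evaluation images have already been computed; the per-jet decision threshold and the value of τ are illustrative, since they are not fixed in this description.

```python
# Sketch of Layer II (accuracy-based selection) and of the final score (eq. 3).
# client_scores[i] / imposter_scores[i]: Layer-I similarities of the jet at point
# p_i against the N_c client and N_im imposter evaluation images.
import numpy as np

def select_points(client_scores, imposter_scores, jet_threshold, tau=0.5):
    """Keep point i only if TER_i = FAR_i + FRR_i <= tau for its jet."""
    keep = []
    for i, (cs, ims) in enumerate(zip(client_scores, imposter_scores)):
        frr = np.mean(np.asarray(cs) < jet_threshold)    # clients wrongly rejected
        far = np.mean(np.asarray(ims) >= jet_threshold)  # imposters wrongly accepted
        if far + frr <= tau:
            keep.append(i)
    return keep

def verification_score(similarities, selected, f=np.median):
    """Eq. (3): combine the n* selected local similarities into a single score."""
    return float(f([similarities[i] for i in selected]))
```

The median (as used in the experiments of Section 5) is one possible choice of the combination rule f_{n*}.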
5 Experimental results on the XM2VTS database
The proposed method was tested using the XM2VTS database on configurations
I and II of the Lausanne protocol [9]. The XM2VTS database contains image
data recorded on 295 subjects (200 clients, 25 evaluation imposters, and 70
test imposters). The database is divided into three sets: training, evaluation and test. The training set was used to build the client models, the PDM, and the IOF-ASM⁵, while the evaluation set was used to select the best features and estimate
thresholds. Finally, the test set was employed to assess system performance.
In all the experiments, n = 130 shape-driven points are computed for every
image. However, only n* ≤ 130 local scores are computed, because of the feature selection explained in Section 4. The median rule [10] was used to fuse these scores, i.e. f_{n*} ≡ median. Configurations I and II of the Lausanne protocol differ in the distribution of client training and client evaluation data, with configuration II representing the most realistic case. In configuration I, there are 3 training images per client, while in configuration II, 4 training images are available.
Hence, for a given test image, we get 3 and 4 scores respectively, which can
be fused in order to obtain better results. Again, the median rule was used to
combine these values, obtaining a final score ready for verification.
Table 1 shows a comparison between the proposed method (Pose Corr.(Auto))
and a set of algorithms that entered the competition held in conjunction with the
Audio- and Video-based Biometric Person Authentication (AVBPA) conference
in 2003 [17]. All these algorithms are automatic. In this table, 90% confidence intervals for the TER measures, derived from the work in [24], are also given.
As we can see, our approach offers competitive error rates in both configurations
(with no statistically significant differences between methods). Furthermore, the
last three rows from this table show baseline results:
– Pose Corr.(Manual): The automatic segmentation provided by IOF-ASM is
replaced by manual annotation of landmarks.
– No Pose Corr.(Auto): Automatic segmentation without pose correction (only
in-plane rotations are corrected).
– No Pose Corr.(Manual): Manual segmentation without pose correction (only
in-plane rotations are corrected).
It is clear that the use of IOF-ASM offers accurate results for our task, as the
degradation between the error rates with manual and automatic segmentation is
small. Moreover, the comparison between rows 4 and 6-7 of Table 1 shows that the use of
pose-corrected images improves the performance of the system (even if manual
landmarks are used to segment the original faces).
6 Conclusions
We have presented an automatic face authentication system that reduces the
effect of pose variations by synthesizing frontal face images. The segmentation
of the face in the original image is accomplished by means of the IOF-ASM
approach. A set of discriminative points and features is then selected in two
steps: the shape-driven location stage and the accuracy-based selection step.
⁵ The IOF-ASM was built with the same parameters detailed in [21].
Table 1. False Acceptance Rate (FAR), False Rejection Rate (FRR) and Total Error
Rate (TER) over the test set for our method and automatic approaches from [17].
                       |           Conf. I            |           Conf. II
                       | FAR(%)  FRR(%)  TER(%)       | FAR(%)  FRR(%)  TER(%)
UPV                    |  1.23    2.75   3.98 ± 1.35  |  1.55    0.75   2.30 ± 0.71
UNIS-NC                |  1.36    2.5    3.86 ± 1.29  |  1.36    2      3.36 ± 1.15
IDIAP                  |  1.95    2.75   4.70 ± 1.35  |  1.35    0.75   2.10 ± 0.71
Pose Corr.(Auto)       |  0.83    2.75   3.58 ± 1.35  |  0.85    2      2.85 ± 1.15
Pose Corr.(Manual)     |  0.46    2.75   3.21 ± 1.35  |  0.72    1.50   2.22 ± 1.00
No Pose Corr.(Auto)    |  0.65    3.75   4.40 ± 1.56  |  0.74    2.5    3.24 ± 1.28
No Pose Corr.(Manual)  |  0.89    4      4.89 ± 1.61  |  0.75    2.5    3.25 ± 1.28
The quality of the synthesized face (and thus, system performance) mainly
depends on the segmentation accuracy, which is intimately related to the degree
of pose variation in the input image and the dataset used for training. The
results achieved on the XM2VTS database demonstrate the usefulness of the method within a limited range of pose variations, offering state-of-the-art error rates.
As a main future research line, we plan to work on video sequences in which facial features will be tracked on a frame-by-frame basis through the combination of
IOF-ASM segmentation and a robust face tracker [25].
Acknowledgments
This work is framed within the RAVIV project from Biosecure NoE, and has
also been partially funded by grants TEC2005-07212, TIC2002-04495-C02 and
FIT-390000-2004-30 from the Spanish Ministry of Science and Technology. FS
is supported by a BSCH grant. AF holds a Ramón y Cajal Research Fellowship.
References
1. Pentland, A. et al. View-based and Modular Eigenspaces for Face Recognition. In
Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp.
84–91.
2. Beymer, D.J. and Poggio, T. Face Recognition from One Example View. In Proc.
International Conference on Computer Vision, 1995, pp. 500–507.
3. Blanz, V. and Vetter, T. A Morphable model for the synthesis of 3D faces. In
Proc. SIGGRAPH, 1999, pp. 187-194.
4. Bookstein, Fred L. Principal Warps: Thin-Plate Splines and the Decomposition of
Deformations. In IEEE Transactions on Pattern Analysis and Machine Intelligence
11, 6 (1989), 567–585.
5. González-Jiménez, D., Alba-Castro, J.L., “Shape Contexts and Gabor Features for
Face Description and Authentication,” in Proc. IEEE ICIP 2005, pp. 962-965.
6. Belongie, S., Malik, J., Puzicha J. Shape Matching and Object Recognition Using
Shape Contexts. In IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 4 (2002), 509–522.
7. Duc, B., Fischer, S., and Bigun, S. Face authentication with sparse grid gabor
information. In IEEE Proc. ICASSP, (Munich 1997), vol. 4, pp. 3053–3056.
8. Argones-Rúa, E., Kittler, J., Alba-Castro, J.L., González-Jiménez, D. Information
fusion for local Gabor features based frontal face verification. In Proc. International
Conference on Biometrics (ICB), Hong Kong 2006, (Springer), pp. 173–181.
9. Luttin, J. and Maître, G. Evaluation protocol for the extended M2VTS database
(XM2VTSDB). Technical report RR-21, IDIAP, 1998.
10. Kittler, J., Hatef, M., Duin, R., and Matas, J. On Combining Classifiers. In IEEE
Transactions on Pattern Analysis and Machine Intelligence 20, 3 (1998), 226–239.
11. Cootes, T., Taylor, C., Cooper, D., and Graham, J. Active shape models - their
training and application. Computer Vision and Image Understanding 61, 1 (1995),
38–59.
12. Cootes, T., Edwards, G., and Taylor, C. Active appearance models. In Proc.
European Conference on Computer Vision (Springer, 1998), vol. 2, pp. 484–498.
13. Florack, L. The Syntactical Structure of Scalar Images. PhD thesis, Utrecht University, Utrecht, The Netherlands, 2001.
14. Huber, P. Robust Statistics. Wiley, New York, 1981.
15. Kang, H., Cootes, T., and Taylor, C. A comparison of face verification algorithms
using appearance models. In Proc. British Machine Vision Conference (Cardiff,
UK, 2002), vol. 2, pp. 477–486.
16. Lanitis, A., Taylor, C., and Cootes, T. Automatic interpretation and coding of
face images using flexible models. IEEE Transactions on Pattern Analysis and
Machine Intelligence 19, 7 (1997), 743–756.
17. Messer, K., Kittler, J., Sadeghi, M., Marcel, S., Marcel, C., Bengio, S., Cardinaux,
F., Sanderson, C., Czyz, J., Vandendorpe, L., and al. Face verification competition
on the XM2VTS database. In Proc. 4th International Conference on Audio- and
Video-based Biometric Person Authentication (AVBPA) Guildford, UK (2003), pp.
964–974.
18. Messer, K., Matas, J., Kittler, J., Luettin, J., and Maitre, G. XM2VTSDB: The
extended M2VTS database. In Proc. International Conference on Audio- and
Video-Based Person Authentication (1999), pp. 72–77.
19. Phillips, P., Moon, H., Rizvi, S., and Rauss, P. The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and
Machine Intelligence 22(10) (2000), 1090–1104.
20. Schmid, C., and Mohr, R. Local greyvalue invariants for image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence 19(5) (1997), 530–535.
21. Sukno, F., Ordas, S., Butakoff, C., Cruz, S., and Frangi, A. Active shape models with invariant optimal features IOF-ASMs. In Proc. Audio- and Video-Based
Biometric Person Authentication (New York, USA, 2005), Springer, pp. 365–375.
22. Walker, K., Cootes, T., and Taylor, C. J. Correspondence using distinct points
based on image invariants. In British Machine Vision Conference (1997), vol. 1,
pp. 540–549.
23. Wiskott, L., Fellous, J.-M., Krüger, N., and von der Malsburg, C. Face recognition
by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and
Machine Intelligence 19, 7 (1997), 775–779.
24. Bengio, S. Mariéthoz, J. A statistical significance test for person authentication.
In Proc. Odyssey, 2004, pp. 237–244.
25. Baker, S. and Matthews, I. Equivalence and Efficiency of Image Alignment Algorithms. In Proc. IEEE Conference on Computer Vision and Pattern Recognition,
2001, vol. 1, pp. 1090–1097.