Non-Cooperative Persons Identification at a Distance with 3D Face
Modeling
Gérard Medioni, Jongmoo Choi, Cheng-Hao Kuo, Anustup Choudhury, Li Zhang, Douglas Fidaleo
Institute for Robotics and Intelligent Systems
Viterbi School of Engineering
University of Southern California
Abstract—We present an approach to identify non-cooperative
individuals at a distance from a sequence of images using 3D
face models. Most biometric features (such as fingerprints,
hand shape, iris or retinal scans) require cooperative subjects
in close proximity to the biometric system. We process images
acquired with an ultra-high resolution video camera, infer the
location of the subjects’ head, use this information to crop the
region of interest, build a 3D face model, and use this 3D
model to perform biometric identification. To build the 3D
model, we use an image sequence, as natural head and body
motion provides enough viewpoint variation to perform
stereo-motion for 3D face reconstruction. Experiments using a
3D matching engine suggest the feasibility of the proposed
approach for recognition against 3D galleries.
I. INTRODUCTION
The field of biometrics has seen rapid growth in the last
few years, both in advances in scientific knowledge and
in commercial applications. Many biometric features that
are highly distinctive and have permanence (such as
fingerprints, iris or retinal scans) require a cooperative
subject in close proximity to the system [1]. Such features
become unusable when we must deal with a non-cooperative
individual whom we wish to observe unobtrusively and at a
distance, as required for many security applications.
Facial features can be measured at a distance, and without
cooperation, or even notice, by the observed individuals.
Unfortunately, even the best 2D face recognition systems
today are neither reliable nor accurate enough for arbitrary
lighting and pose in unconstrained environments [2, 3]. 3D
face recognition is receiving substantial attention because it is
commonly thought that the use of 3D shape matching might
overcome the fundamental limitations of 2D recognition. The
main advantages of using 3-D for recognition are pose and
lighting variation compensation. It appears that recognition
using 3D, especially combined with 2D, holds significant
promise, and could reach accuracy comparable to other
biometric features such as fingerprints and iris.
The majority of 3D face recognition research and
commercial 3D face recognition systems use range sensors.
Stereo cameras, laser scanners, and structured light are the
typical range sensors that recover Euclidean 3D shape
information from a face. Bowyer et al. [4] point out the
desirable properties of an ideal 3D sensor for face
recognition applications, based on image acquisition time,
depth of field, robust operation under varied lighting
conditions, eye safety, and space/depth resolution; none of the
currently available 3D sensors meets these requirements. It
seems that 3D face recognition using active 3D range sensors
is appropriate only at a close distance.

Fig. 1. 3D reconstruction results in an indoor environment. (Top)
Sequences of input images, reconstructed 3D models with texture
mapping, and 3D shape models. (Bottom) Reconstructed 3D models.
We propose instead to perform 3D face recognition using a
3D face model generated from a sequence of images captured at a
distance. To build the 3D model, we use an image sequence,
as natural head and body motion provides enough viewpoint
variation to perform stereo-motion for 3D face reconstruction.
We also use the images in the sequence to perform traditional
2D image based face recognition.
A significant contribution is the inference of a 3D face
model from natural head motion in a sequence of images. The
inference of a dense 3-D surface model of a human head from
a monocular video sequence is a very difficult computer
vision problem for which no textbook solution currently
applies. We propose a three-step approach that consists of
keyframe detection, camera motion estimation by a head
tracker, and multiple view dense stereo matching. The
evaluation results indicate that the 3D face model from video
can provide the identity of a person at a distance.
Consequently, this allows the use of true shape invariants for
recognition, and circumvents difficulties associated with pose
and lighting. Fig. 1 shows the reconstructed 3D face models
from video sequences.
Fig. 2. Overview of the proposed approach
II. OVERVIEW OF OUR SYSTEM
The overall architecture of the system is described in Fig. 2.
A single fixed ultra-high-resolution camera can be used to
detect people and locate their faces. In practice, it is also
possible to use a two-camera system consisting of an
inexpensive large-field-of-view video camera (e.g., a webcam)
and a narrow-focus high-resolution camera. To find faces we
detect and track people in the video sequence. Face location is
then a simple corollary to the person detection module. After
finding the face, the region of interest is extracted from the
image. We generate a 3D dense face model from a sequence
of facial images [5], which is acquired from a subject moving
through the camera field of view. Computing the pose of the
face with respect to the camera in each frame is a typical
structure from motion problem. We can then perform
multi-baseline extended stereo to generate a dense model.
The reconstructed 3D face models can be matched with either
a 3D gallery or a 2D gallery.
B. Camera Control and Face Capture
To infer the 3D facial characteristics of detected individuals,
a sufficient number of pixels must be acquired from the face.
We have found that at least 100 pixels between the outer eye
corners of a face are necessary for 3D modeling and recognition.
We propose to use a system of two fixed cameras, consisting of
a high-resolution camera (e.g., a Redlake ES11000,
4008 × 2672 [8]) and a web camera (640 × 480). While we
detect and track the full body of a person with the web
camera, we capture the facial images including a sufficient
number of pixels to construct a 3D model with the high
resolution camera. The face localization in the ROI of the
high-resolution image is performed by the OpenCV face
detector [9].
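As a concrete illustration of this resolution requirement, the following minimal sketch (our own illustration, not part of the system's code; the landmark coordinates are hypothetical outputs of a face detector) checks whether a detected face carries roughly 100 pixels between the outer eye corners:

```python
import math

# Minimal sketch (our assumption, not the system's code): decide whether
# a detected face has enough resolution for 3D modeling, i.e. roughly
# 100 pixels between the outer eye corners, as reported above.

MIN_INTEROCULAR_PIXELS = 100.0  # threshold stated in the text

def interocular_pixels(left_corner, right_corner):
    """Euclidean distance in pixels between the outer eye corners."""
    return math.hypot(right_corner[0] - left_corner[0],
                      right_corner[1] - left_corner[1])

def usable_for_3d(left_corner, right_corner):
    """True if the face region meets the 3D modeling requirement."""
    return interocular_pixels(left_corner, right_corner) >= MIN_INTEROCULAR_PIXELS

# Hypothetical eye corners found in a 4008 x 2672 frame.
print(usable_for_3d((1800.0, 1300.0), (1930.0, 1305.0)))  # True: ~130 px
```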
III. FACE LOCALIZATION BY DETECTING AND TRACKING
PEOPLE
To obtain a set of facial images including various viewpoints,
we propose to locate a human head by finding the entire body
of a person. We can track and gather several face images of a
person while the person exhibits behaviors commonly observed
in natural environments (e.g., covering the face with the hands,
turning around, or gazing at another person or place), as shown in
Fig. 3.
A. Detection and Tracking of Humans
We use a model-based detection module which employs
edgelet features [6] to detect the full bodies of pedestrians.
We perform human tracking based on the results of human
detection. The human hypotheses are tracked in the
subsequent frames using an elliptic shape mask and an
appearance map to deal with difficult crowd conditions [7].
An object is tracked by data association if its corresponding
detection response can be found; otherwise it is tracked by a
mean-shift tracker.
Fig. 3. People detection and tracking
Fig. 4. Overview of the approach for extracting a 3-D face model from a
video sequence
Fig. 5. Detected facial feature points, which lie along the contours of
the eyes, eyebrows, nose, mouth, and the face boundary
IV. 3D FACE MODELING FROM AN IMAGE SEQUENCE
A key contribution of the proposed approach is the inference
of a 3D face model from a sequence of facial images. We
propose a three-step approach that is outlined in Fig. 4. The
first critical step is head tracking, which provides the initial
pose of a face with respect to the camera: we use a
state-of-the-art real-time face tracker [10] to provide a good
initial estimate of head pose using a generic, sparsely sampled
3D model of the face. After recovering poses, a subset of the
video frames is selected for reconstruction. Dense feature
matching is performed across rectified image pairs, and
disparity and depth maps are computed. The 3D point clouds
are merged to produce the final dense 3D face model.
A. Keyframe detection and initialization
A non-cooperative subject may not necessarily look into the
camera. However, a carefully designed surveillance zone can
provide a set of images which includes a frontal face. Hence,
we assume that there exists at least one near-frontal face in
the sequences. Since facial feature points are a strong cue to
estimate facial pose, we employ a facial features extraction
algorithm [11] and set the frontal face frame as a keyframe.
1) Feature Points Extraction
Given the location of a face in an image, our face alignment
algorithm can extract 42 feature points corresponding to the
different components of the face [11] as shown in Fig. 5. The
algorithm consists of 3 main modules – the shape model, the
texture model and the shape parameter optimization module.
The shape model is the initial step in the alignment technique,
which is based on the Active Shape Model technique [12].
Principal Component Analysis is applied to a labelled set of
feature images to obtain the mean shape, which constrains the
parameters of each of the feature points. In the
texture model, an AdaBoost classifier using simple 2-D
Haar-like features computed from the integral image [13] is
applied in a local search area around every feature point to
distinguish feature points from non-feature points. The
classifier gives a confidence value for every point in the
local search area, and the point with the maximum confidence
value is chosen as the best match for the feature point.
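The selection rule itself is simple; the following minimal sketch (our illustration, with a random confidence map standing in for the AdaBoost classifier output) shows the maximum-confidence choice over a local search area:

```python
import numpy as np

# Minimal sketch of the selection rule described above: for one feature
# point, a classifier assigns a confidence to every pixel of a local
# search area, and the maximum-confidence location wins. The confidence
# map here is random, standing in for the real classifier output [13].

rng = np.random.default_rng(0)

def best_match_in_search_area(confidence_map, top_left):
    """Return image coordinates of the maximum-confidence pixel.

    confidence_map: 2D array of classifier scores over the search area.
    top_left: (row, col) of the search area within the full image.
    """
    r, c = np.unravel_index(np.argmax(confidence_map), confidence_map.shape)
    return (top_left[0] + r, top_left[1] + c)

confidences = rng.random((21, 21))          # stand-in for classifier output
print(best_match_in_search_area(confidences, (540, 812)))
```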
Fig. 6. Pose estimation result: pose value plotted against the index of
the images in the sequence.
2) Keyframe detection by Frontal Pose Estimation
The pose we are interested in here is the horizontal rotation
angle of the face. Any rotation of the face from -(5 to 10)
degrees to +(5 to 10) degrees is considered frontal (or
near-frontal). We have found that there is a very strong
correlation between the second component of the shape
parameter vector and the horizontal rotation of the face.
Hence, we use this component value to estimate the pose of the
face. We can also exploit a temporal-coherence constraint,
since the frames come from a video. In the ideal scenario, the
value at the most frontal pose will be close to 0, growing
positive as the face turns right and negative as it turns left.
Hence, to find the most frontal pose, we locate the
zero-crossover of the pose signal and choose the frame closest
to that crossing.
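A minimal sketch of this zero-crossover rule (our illustration, assuming the pose signal has already been filtered as described next; the synthetic signal mimics a head turning from left to right):

```python
import numpy as np

# Minimal sketch of the keyframe rule described above: the per-frame pose
# value crosses zero at the most frontal view, so we look for a sign
# change in the (already smoothed) pose signal and pick the frame whose
# value is closest to zero.

def frontal_keyframe(pose_values):
    """Index of the frame nearest the zero-crossover of the pose signal."""
    pose = np.asarray(pose_values, dtype=float)
    signs = np.sign(pose)
    # Frames where the sign changes between consecutive samples.
    crossings = np.where(signs[:-1] * signs[1:] < 0)[0]
    if crossings.size == 0:
        return int(np.argmin(np.abs(pose)))  # no crossing: closest to zero
    i = crossings[0]
    # Of the two frames straddling the crossing, keep the one nearer zero.
    return int(i if abs(pose[i]) <= abs(pose[i + 1]) else i + 1)

# Synthetic pose signal: head turning from left to right over 250 frames.
poses = np.linspace(-1.5, 1.5, 250) + 0.02 * np.sin(np.arange(250))
print(frontal_keyframe(poses))  # near the middle of the sequence
```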
For frontal pose estimation, we ran the algorithm on an image
sequence and computed the pose for each frame. Fig. 6 shows an
experimental result. The curve obtained directly from the
alignment algorithm (points marked 'x', connected by red lines)
contains many jumps, caused by false alarms and occasional bad
eye detection. The curve marked with blue squares is obtained
after applying a RANSAC step (to remove the instances due to
bad eye detection) and some additional filtering. As seen, this
yields a much smoother curve and removes most of the false
alarms. The point at which the curve crosses the zero line is
the point at which we expect to observe the frontal face.

Fig. 7. 3D reconstruction results with directional lighting (Top) and
glasses (Bottom)
To evaluate the keyframe detection performance, the frames
were labeled as frontal or non-frontal by three different
subjects. The first filtering step gave a 60.6% detection rate.
After applying the second filtering step, in which we eliminate
frontal frames that occur within p frames of each other, the
detection rate improved to 74.6%. Since this algorithm is used
as an initialization step for the reconstruction system, we
must supply one key frontal frame from the detected frames. To
do so, we evaluate the pose values of the detected keyframes
and choose the most frontal one (the pose value closest to
0.2). This filters out all the false alarms and gives us the
most frontal pose with 100% accuracy.
B. Camera Motion Estimation
Estimating head pose relative to the camera is a critical step
for accurate reconstruction. Ignoring the neck, head motion
can be decoupled into two components, rotation and
translation, whose inverse transformation is equivalent to the
motion of the virtual camera required for dense
reconstruction.
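For concreteness, this inverse relationship can be sketched as follows (our illustration; the example rotation and translation are made up):

```python
import numpy as np

# Minimal sketch of the decoupling described above: if the head moves
# rigidly by rotation R and translation t in the camera frame, the
# equivalent *virtual camera* motion (with the head held static) is the
# inverse transform: R' = R^T, t' = -R^T t.

def virtual_camera_motion(R, t):
    """Invert a rigid head motion (R, t) into the equivalent camera motion."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float).reshape(3)
    return R.T, -R.T @ t

# Example: head yawed by 10 degrees and shifted 5 cm along x.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
Rv, tv = virtual_camera_motion(R, np.array([0.05, 0.0, 0.0]))
print(np.round(Rv, 3), np.round(tv, 3))
```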
Two fundamentally different approaches to online rigid
object tracking exist. Using the nomenclature from [10],
recursive trackers begin with an initial estimate of the head
pose. The pose estimate at a given time is dependent on the
estimate at the previous frame. Unfortunately, because of the
concatenation of motion estimates, errors accumulate and can
result in considerable tracking drift after several frames. To
eliminate drift, keyframe approaches perform tracking by
detection, utilizing information obtained offline such as the
known pose of the head in specific frames (keyframes) of the
tracking sequence. Input images are matched to existing
keyframes and provide accurate pose estimates at or near key
poses. Such approaches suffer from tracking jitter and require
several keyframes for robust tracking. In an uncontrolled
environment, it may not be possible to accurately establish
multiple keyframes.
For our purposes, the class of tracked objects is restricted
to faces. Therefore a priori knowledge of expected 3D face
structure can be leveraged to improve tracking accuracy and
resolve pose ambiguities. We derive initial pose estimates
using the tracker by Fua et al. that combines a recursive and
keyframe based approach to minimize tracking drift and jitter,
and reduce the number of keyframes required for stable
tracking [10]. A keyframe consists of a set of 2D feature
locations detected on the face with a Harris corner detector
and their 3D positions estimated by back-projecting onto a
registered tracking model. The keyframe accuracy therefore
depends on both the model alignment in the keyframe
image, as well as the geometric structure of the tracking
mesh. Especially when the face is far from the closest
keyframe, there may be several newly detected feature points
not present in any keyframe that are useful to determine
inter-frame motion. These points are matched to similar
points detected in the previous frame and combined with
feature points matched to the closest keyframe. The current
head pose estimate (or closest keyframe pose) serves as the
starting point for a local bundle adjustment.
C. Dense Model Inference by Multi-view Stereo
We use a two-stage approach for the reconstruction of the
mesh of the face from the 3-D reconstruction of the feature
points. The first stage involves automatic selection and
reconstruction of multiple stereo pairs. When the camera
moves very little, correspondence is easy, but 3D accuracy is
poor. When the baseline is large, accuracy is good, but
matching is hard. We have found experimentally that an
angular baseline of 8-12° is the best compromise. Optimal
camera poses are selected from the poses recovered in the
previous stage and dense disparity maps are computed using
the selected image pairs.
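A minimal sketch of such a pair-selection rule (our illustration, with the simplifying assumption that each recovered pose is summarized by a single yaw angle):

```python
import numpy as np
from itertools import combinations

# Minimal sketch of the pair-selection rule described above: from the
# recovered per-frame head poses, keep frame pairs whose angular baseline
# falls in the 8-12 degree window found to be the best compromise between
# matching difficulty and depth accuracy.

def select_stereo_pairs(yaw_degrees, lo=8.0, hi=12.0):
    """Return index pairs whose angular baseline lies in [lo, hi] degrees."""
    yaw = np.asarray(yaw_degrees, dtype=float)
    return [(i, j) for i, j in combinations(range(len(yaw)), 2)
            if lo <= abs(yaw[j] - yaw[i]) <= hi]

yaws = np.linspace(-45.0, 45.0, 19)      # e.g. one frame every 5 degrees
print(select_stereo_pairs(yaws)[:5])     # [(0, 2), (1, 3), ...]
```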
In the second phase, the individual pair-wise
reconstructions are integrated into a single 3D point cloud.
The obtained surface displays a thickness corresponding to
inaccuracies in the camera position computation, or the
inherent noise of the disparity maps. Outliers are rejected
using a tensor voting framework that enforces surface
self-consistency [14].
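Tensor voting itself is beyond a short example; purely as a hypothetical stand-in for a self-consistency test, the sketch below rejects points whose mean distance to their nearest neighbors is anomalously large. This is not the method of [14], only a simple illustration of outlier rejection on a merged cloud:

```python
import numpy as np

# Crude illustrative stand-in (NOT the tensor voting of [14]): drop points
# of the merged cloud whose mean distance to their k nearest neighbors is
# anomalously large compared with the median over all points.

def reject_outliers(points, k=8, thresh=2.0):
    """Keep points whose mean k-NN distance is < thresh * median value."""
    P = np.asarray(points, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)  # (N, N)
    knn = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)          # skip self
    return P[knn < thresh * np.median(knn)]

rng = np.random.default_rng(3)
surface = rng.normal(scale=0.01, size=(300, 3)) + [0, 0, 1]  # thin slab
noise = rng.uniform(-1, 1, size=(20, 3)) + [0, 0, 1]         # stray points
merged = np.vstack([surface, noise])
print(len(reject_outliers(merged)))  # close to 300: most strays removed
```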
D. Results of 3D face modeling
Our modeling approach can generate 3D face models
representing accurate 3D geometry, as shown in Fig. 1. We
examine the performance variations of face reconstruction
with various factors including outdoor lighting conditions
and glasses. The reconstruction results in Fig. 7 (Top) were
generated from outdoor images acquired under typical natural
lighting conditions with a 10 megapixel digital camera at 3m.
Fig. 8. Cumulative rank accuracy (a) and ROC (b) of 3D face recognition
with distance changes at 3 m, 6 m, and 9 m.
V. FACE RECOGNITION
In this paper, we present both a 3D-3D and a 2D-2D
recognition approach. In the first method, a 3D probe face is
matched against a 3D database when only a 3D gallery is
available. In the second method, the collection of 2D facial
images is matched against a 2D gallery.
A. 3D-3D Recognition
We tested the recognition module independently, by taking
image sequences in controlled environments, using the
narrow angle camera only. The face image sequences were
obtained from the Redlake ES11000 [8], producing 4008 x
2672 images at 4 fps. We also acquired image sequences with
a digital SLR (single lens reflex) camera (Canon Digital
Rebel XTi) which produces similar images (3888 x 2592) at 4
fps. We tested different distances (3m, 6m, and 9m) and
adjusted the focal length accordingly. We also conducted
accuracy tests related to the length of the sequence.
Our 3D Gallery contains 358 3D face models of 358
persons. A stereo camera (Active ID, Geometrix [15]) was
used to acquire the geometric data and the texture. The system
has a standard single stereo sensor with one camera pair. The
subjects are asked to stand in front of the camera pair for
capture. The models are metrically accurate. After computing
the 3D face model with shape and texture, a 3D face
enrollment file is generated for recognition.
In order to generate a 3D face model with a single camera,
the subjects rotated their faces horizontally from 45° left to
45° right under different conditions such as
pitch angle (0°, 30°), glasses, and outdoor lighting. Each
sequence has near frontal face images. For the probe set, we
captured 23 facial sequences of 23 different subjects who
were enrolled in the gallery.
Since the quality of the 3D reconstruction depends on the
quality of 2D images, we evaluated the recognition
performance with respect to the capture distance at 3m, 6m,
and 9m. To compare gallery and probe, we used Geometrix
3D face recognition SDK. The 3D (shape-only) component of
the Geometrix facial authentication engine, which is
described in detail in [16], is based on registration of two
candidate 3-D facial shapes followed by explicit
point-by-point difference assessment using iterative closest
point techniques [17]. It first cleans each mesh, then
automatically extracts a relevant mask, automatically aligns
the two meshes, computes a distance map between the two
aligned meshes, and finally performs the classification based
on statistics derived from the distance map. Note that this
3D matching engine can be replaced with another 3D
recognition engine.
Given a probe image p and the gallery data G = {G(1),
G(2), …, G(N)}, the identity is decided by
$\mathrm{ID} = \arg\min_{i} d(s(p), G(i))$

where d(·) and s(·) represent the distance and scaling
functions, respectively. The gallery contains Euclidean models
while the reconstruction module produces metric ones; to
handle this discrepancy, we applied a simple scaling method.
Fig. 8 shows the cumulative rank curve and ROC.
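A minimal sketch of this decision rule under strong simplifying assumptions (point clouds with known point-wise correspondences, an RMS-based scaling standing in for s(·), and a crude mean point distance standing in for the ICP-based engine of [16, 17]):

```python
import numpy as np

# Minimal sketch of the decision rule ID = argmin_i d(s(p), G(i)).
# Faces are reduced to point clouds with known correspondences; s(.)
# rescales the metric probe to the Euclidean gallery scale, and d(.,.)
# is a crude stand-in for the real ICP-based shape distance.

def scale_to_gallery(probe_points, gallery_points):
    """s(.): match overall scale via RMS distance from the centroid."""
    p = probe_points - probe_points.mean(axis=0)
    g = gallery_points - gallery_points.mean(axis=0)
    return p * (np.sqrt((g ** 2).sum(1).mean()) /
                np.sqrt((p ** 2).sum(1).mean()))

def shape_distance(a, b):
    """d(.,.): mean point-wise distance; assumes 'a' is already centered."""
    return float(np.sqrt(((a - (b - b.mean(axis=0))) ** 2).sum(1)).mean())

def identify(probe, gallery):
    """Return the index of the gallery model closest to the scaled probe."""
    return int(np.argmin([shape_distance(scale_to_gallery(probe, g), g)
                          for g in gallery]))

rng = np.random.default_rng(1)
gallery = [rng.normal(size=(500, 3)) for _ in range(5)]
probe = 0.8 * gallery[3] + rng.normal(scale=0.01, size=(500, 3))  # metric copy
print(identify(probe, gallery))  # 3
```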
The quality of 3D reconstruction depends on the head
motion of a subject. In real situations, environmental
conditions may constrain the head motion. We compared the
recognition performance between complete sequences (-45° to
+45°) and partial sequences (-45° to 0°). For the
half-sequence probe, we used the same probe data as for the
complete sequence, but selected only the images within that
angular range. We obtained an identification rate of 70%
(rank 1) with the whole sequence and 60% (rank 1) with the
half sequence.
B. 2D-2D Recognition
In the event that the reconstruction module fails to
produce a 3D model, possibly due to lack of sufficient motion
in the input sequence or very rapid expression/pose change,
the system may revert to pure 2D recognition against a 2D
gallery. We used the Neven Vision [18] 2D face recognition
SDK for 2D-2D face recognition. This system is based on the
detection of a set of landmark points on a frontal face and
matching the extracted “face-template” to a gallery of
templates extracted from 2D face images. The sparseness of the
points and the relative insensitivity of point detection to
global changes in illumination afford the system some
robustness to variations in environmental conditions. Neven
Vision recognition technology was a top-3 performer in FRVT
2002.

For the gallery, we used the 358 2D images captured by the
Active ID system (358 persons). For the probe, we used the same
2D images that generated the 3D probe (23 sequences of 21
individuals). Given a set of probe images P = {P(1), P(2), ...,
P(M)} and the gallery data G = {G(1), G(2), ..., G(N)}, the
identity is decided by

$\mathrm{ID} = \arg\min_{i} d(P(j), G(i))$

where d(·) represents the distance. Fig. 9 shows the
recognition results with different distances and pitch angles.

Fig. 9. 2D recognition results with different distances (a) and pitch
angles (b). Recognition rate is plotted against false acceptance rate at
3 m, 6 m, and 9 m (a), and at pitch angles 0° and 30° (b).
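A minimal sketch of this rule (our illustration: templates reduced to plain feature vectors, Euclidean distance standing in for the matcher [18], and the minimum over probe frames as one natural reading of the aggregation over j):

```python
import numpy as np

# Minimal sketch of the 2D decision rule: every probe frame's template is
# compared against every gallery template, and the identity minimizing
# the distance wins. Templates here are plain feature vectors; Euclidean
# distance is a stand-in for the real matcher.

def identify_from_sequence(probe_templates, gallery_templates):
    """ID = argmin_i min_j d(P(j), G(i)) over all probe frames j."""
    P = np.asarray(probe_templates, dtype=float)    # (M, D)
    G = np.asarray(gallery_templates, dtype=float)  # (N, D)
    # Pairwise distances between all probe frames and gallery entries.
    d = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=2)  # (M, N)
    return int(np.argmin(d.min(axis=0)))  # best identity over all frames

rng = np.random.default_rng(2)
gallery = rng.normal(size=(358, 64))                # one template per person
probe = gallery[42] + rng.normal(scale=0.05, size=(9, 64))  # 9 noisy frames
print(identify_from_sequence(probe, gallery))       # 42
```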
VI. CONCLUSION
We have presented an approach to identify non-cooperative
individuals at a distance. The approach is accomplished by
processing images acquired with an ultra-high-resolution
video camera, inferring the location of the subjects' heads,
using this information to crop the region of interest, building
a 3D face model, and using this 3D model to perform
biometric identification. We have simulated and validated
this approach on real data in an end-to-end system.
The evaluation results suggest that video frames captured at
a distance can yield the 3D face shape of a person.
Consequently, this allows the use of true shape invariants for
recognition, and circumvents difficulties associated with pose
and lighting.
Our approach seems to constitute a considerable step forward
in solving the challenging problem of people recognition. The
method provides a robust system for inferring biometric
characteristics to identify non-cooperative individuals at a
distance. The development of such capabilities should
significantly increase indoor and outdoor monitoring
capabilities near high-profile facilities.

ACKNOWLEDGMENT
This research was funded by the United States Department of
Justice Grant 2006-DE-BX-K006.

REFERENCES
[1] IEEE Computer, Special Issue on Biometrics, Feb. 2000.
[2] P. J. Phillips, P. Grother, R. J. Michaels, D. M. Blackburn, E. Tabassi, and J. M. Bone, FRVT 2002: Evaluation Report, March 2003.
[3] P. J. Phillips, W. T. Scruggs, A. J. O'Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe, FRVT 2006 and ICE 2006 Large-Scale Results, March 2007 (http://www.frvt.org).
[4] K. W. Bowyer, K. Chang, and P. Flynn, "A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition," Computer Vision and Image Understanding, vol. 101, no. 1, pp. 1-15, 2006.
[5] G. Medioni and B. Pesenti, "Generation of a 3-D face model from one camera," Int. Conf. on Pattern Recognition, Quebec City, Canada, vol. 3, pp. 667-671, August 2002.
[6] T. Zhao, R. Nevatia, and F. Lv, "Segmentation and tracking of multiple humans in complex situations," Proc. CVPR, Kauai, vol. 2, pp. 194-201, December 2001.
[7] B. Wu and R. Nevatia, "Tracking of multiple, partially occluded humans based on static body part detection," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2006), New York, June 2006.
[8] Redlake, http://www.redlake.com/
[9] OpenCV, http://www.intel.com/technology/computing/opencv/
[10] L. Vacchetti, V. Lepetit, and P. Fua, "Stable real-time 3D tracking using online and offline information," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 10, pp. 1385-1391, 2004.
[11] L. Zhang, H. Ai, S. Xin, C. Huang, S. Tsukiji, and S. Lao, "Robust face alignment based on local texture classifiers," IEEE Int. Conf. on Image Processing (ICIP-05), Genoa, Italy, September 2005.
[12] T. F. Cootes, "Statistical models of appearance for computer vision," online technical report, http://www.isbe.man.ac.uk/~bim/refs.html, Sept. 2001.
[13] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, pp. 297-336, 1999.
[14] P. Mordohai and G. Medioni, "Perceptual grouping for multiple view stereo using tensor voting," Int. Conf. on Pattern Recognition, Quebec City, Canada, vol. 3, pp. 639-644, August 2002.
[15] Geometrix, http://www.geometrix.com
[16] G. Medioni and R. Waupotitsch, "Face modeling and recognition in 3-D," IEEE Int. Workshop on Analysis and Modeling of Faces and Gestures (AMFG 2003), pp. 232-233, 2003.
[17] P. J. Besl and N. D. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 239-256, 1992.
[18] NevenVision, http://www.nevenvision.com/