Pattern Recognition 34 (2001) 2459–2466
Automatic extraction of eye and mouth fields from a face image
using eigenfeatures and multilayer perceptrons
Yeon-Sik Ryu, Se-Young Oh*
Department of Electrical Engineering, Intelligent Systems Laboratory, Pohang University of Science and Technology (POSTECH), Pohang, KyungBuk 790-784, South Korea
* Corresponding author. Tel.: +82-54-279-2214; fax: +82-54-279-2903. E-mail address: [email protected] (S.-Y. Oh).
Received 22 September 1999; received in revised form 18 September 2000; accepted 6 October 2000
Abstract
This paper presents a novel algorithm for the extraction of the eye and mouth (facial feature) fields from 2-D gray-level face images. The fundamental philosophy is that eigenfeatures, derived from the eigenvalues and eigenvectors of the binary edge data set constructed from the eye and mouth fields, are very good features to locate these fields efficiently. The eigenfeatures extracted from the positive and negative training samples of the facial features are used to train a multilayer perceptron whose output indicates the degree to which a particular image window contains an eye or a mouth. It turns out that only a small number of frontal faces are sufficient to train the networks. Furthermore, they lend themselves to good generalization to non-frontal pose and even other people's faces. It has been experimentally verified that the proposed algorithm is robust against facial size and slight variations of pose. © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Facial feature; Eye and mouth fields; Eigenfeature; Multilayer perceptron; Positive (Negative) sample
1. Introduction
It is known that viewing a person's eyes and mouth is
essential in understanding the information and feeling
they convey. Because of this, automatic extraction of eyes
and mouths from a person's face can be very useful in
many applications. Many researchers have proposed
methods to find the eye and mouth regions [1–3] or to
locate the face region [4–8] in an image. These methods
can be classified by their use of three types of information: template matching, intensity and geometrical
features. In general, template matching requires many
templates to accommodate varying pose, whereas the
intensity method requires good lighting conditions.
In his pioneering work, Kanade [5] used a Laplacian
operator to find an edge image. Since the Laplacian
operator does not give directional information about edges, Brunelli [7]
used horizontal gradients to detect the head top, eyes,
nose base and mouth. He found the location of the eyes using
template matching. However, this approach requires many templates
to correctly locate the eyes while accommodating varying
pose. Beymer [6] used a five-level hierarchical system to
locate the eyes and nose lobes. He used many facial
feature templates covering different poses and different
people. Juell et al. [8] proposed a neural network structure to detect human faces in an image. They used three
child-level neural networks to locate the eyes, nose and
mouth. But they used enhanced gray images as input to
the neural networks, thus requiring too many input neurons and training samples to work well with various
poses.
In this paper, we present a novel algorithm for the
extraction of the eye and mouth fields from 2-D gray-level face images. The fundamental philosophy is that
eigenfeatures, derived from the eigenvalues and eigenvectors of the binary edge data set constructed from the eye
and mouth fields, are very good features to locate these
fields. The eigenfeatures are extracted from the positive
and negative training samples of the facial features and
are used to train a multilayer perceptron (MLP) whose
output indicates the degree to which a particular image
window contains the complete portion of eyes or the
mouth within itself. The window template within a test
image is then shifted in image space and the one with the
maximum output for the MLP represents the eye or
mouth field. It turns out that only a small number of
frontal faces are sufficient to train the networks since they
lend themselves to good generalization to non-frontal
pose and even other people's faces.
Section 2 describes coarse extraction of the eye and
mouth fields using geometrical constraints. Section 3
presents fine extraction of the facial features using
neural networks. Experimental results are reported in
Section 4.
2. Coarse extraction of the eye and mouth fields from
geometrical constraints
The facial images used in this research consist of
92×112 pixels and have been obtained from the Olivetti
Research Lab (ORL) database. They consist of frontal
views as well as side views containing both eyes and the
mouth. The basic assumption in the database is that the
eyes and mouth reside around the center of the image
against a uniform background. Fig. 1 shows the overall
system for extraction of the facial features. Fig. 2 shows
some of the original images from which to extract the
three facial features of interest. The first five images of
Fig. 2(a) were used to extract the eye and mouth regions
while the last was used only to extract the mouth. Note
that only frontal images were used to train the networks
since otherwise there would be too many pose variations
to consider. Fig. 2(b) shows a portion of the test set used
to estimate the generalization performance.

Fig. 2. The face images used for training and testing the proposed algorithm: (a) the images used to extract the training set for the MLP detector and (b) the images used to extract the test set.
Fig. 1. The system block diagram.

2.1. Extraction of the face region
It was decided not to go through an a priori normalization and edge-enhancement process to make the eye
locations and facial size fixed, since the locations of the
eyes and mouth all differ according to the facial size and
pose. Instead, a binary edge dominance map [7] was
used first to roughly identify the face region, excluding the
hair and ears within which the coarse locations of the
eyes and mouth are to be found. This limits the size of the
coarse regions containing these three features thus easing
the computational burden.
The horizontal and vertical projection information
obtained from the edge map will be used to estimate the
approximate locations of the eyes and mouth [7]. From
the gray image I(i, j), the vertical and horizontal edge
maps I_{VE} and I_{HE} are obtained first. These are binary
images which have been computed from the original
image through the application of the vertical (horizontal)
gradients followed by proper thresholding. Then, the
vertical and horizontal projections of these maps, shown
in Fig. 3, are calculated as

H_v(i) = \sum_{j=1}^{N_R} I_{VE}(i, j),  1 \le i \le N_C,    (1)

H_h(j) = \sum_{i=1}^{N_C} I_{HE}(i, j),  1 \le j \le N_R,    (2)

where N_R = 112 and N_C = 92.

Fig. 3. Determination of a face boundary: (a) vertical edge dominance map I_{VE}(i, j); (b) horizontal edge dominance map I_{HE}(i, j); (c) vertical integral projection of (a), H_v(i); and (d) horizontal integral projection of (b), H_h(j).

From these histograms, the face region R(x_L, y_T, x_R, y_B) is established, where

y_T = \arg\max_j H_h(j),  1 \le j \le N_R/3,    (3)

y_B = \arg\max_j H_h(j) + \Delta y,  N_R/2 \le j \le N_R,    (4)

x_L = \arg\max_i H_v(i),  1 \le i \le N_C/2,    (5)

x_R = \arg\max_i H_v(i),  N_C/2 \le i \le N_C,    (6)

and R(x_L, y_T, x_R, y_B) is the rectangle with (x_L, y_T), (x_R, y_T), (x_L, y_B) and (x_R, y_B) as its four vertices. The search boundary y_T in Eq. (3) is determined from the geometrical observation that the hair boundary line exists in the upper region of the guessed face region. The margin \Delta y in Eq. (4) has been set to allow enough room around y_B to enclose the entire mouth, since otherwise y_B could pass through the upper, middle or lower part of the lip according to different poses of the face.
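As a rough illustration of Eqs. (1)-(6), the following Python/NumPy sketch computes the binary edge dominance maps, their integral projections and the coarse face rectangle. The gradient kernels, the binarization threshold and the margin \Delta y are illustrative placeholders rather than the values used by the authors.

```python
import numpy as np

def edge_maps(img, thr=40.0):
    """Binary vertical and horizontal edge dominance maps I_VE, I_HE (threshold is a placeholder)."""
    gx = np.zeros_like(img, dtype=float)   # horizontal intensity differences -> vertical edges
    gy = np.zeros_like(img, dtype=float)   # vertical intensity differences   -> horizontal edges
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    return np.abs(gx) > thr, np.abs(gy) > thr

def face_region(img, dy=10):
    """Coarse face rectangle (x_L, y_T, x_R, y_B) from the projections of Eqs. (1)-(6)."""
    n_r, n_c = img.shape                    # N_R = 112 rows, N_C = 92 columns
    i_ve, i_he = edge_maps(img)
    h_v = i_ve.sum(axis=0)                  # Eq. (1): H_v(i), column-wise sum of I_VE
    h_h = i_he.sum(axis=1)                  # Eq. (2): H_h(j), row-wise sum of I_HE
    y_t = int(np.argmax(h_h[: n_r // 3]))                    # Eq. (3): hair line in the upper third
    y_b = n_r // 2 + int(np.argmax(h_h[n_r // 2:])) + dy     # Eq. (4): plus a margin dy
    x_l = int(np.argmax(h_v[: n_c // 2]))                    # Eq. (5)
    x_r = n_c // 2 + int(np.argmax(h_v[n_c // 2:]))          # Eq. (6)
    return x_l, y_t, x_r, min(y_b, n_r - 1)

if __name__ == "__main__":
    face = np.random.rand(112, 92) * 255.0   # stand-in for a 92x112 ORL image
    print(face_region(face))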
2.2. Extraction of the eyes and mouth

The following heuristic assumptions are made in order
to simplify the process (see Fig. 4(a)):

• The eyes remain within the upper half of the face region. Their exact positions, however, can vary according to the height of the forehead and the facial pose.

• The mouth remains within the lower half of the face region.
In order to locate the eyes, it is necessary to separate
the eyebrow from the eye using the information contained in the vertical edge map I_{VE}. This is facilitated by
observing that although the horizontal edge components
dominate in both the eye and the eyebrow, the eye
also contains more vertical edge components than the
eyebrow. Therefore, the horizontal integration of I_{VE} is
performed and the row index E_c resulting in the maximum value is found as

H_{hv}(j) = \sum_{i=1}^{N_C} I_{VE}(i, j),  1 \le j \le \tfrac{2}{3}(y_B - y_T),    (7)

E_c = \arg\max_j H_{hv}(j).    (8)

Since E_c represents the row with the strongest vertical
edge component within the region containing the two
eyes, the eye region is centered around that row.
However, even this statement may not be correct when
the eyes are not horizontal due to the changes in pose
or the lighting conditions. Therefore, the eye region is
tentatively set as the one centered around E_c with
a ±15 pixel safety margin to prevent the eye or mouth from
being cut.
The "nal eye region is then found by assuming that the
eyes are more likely to reside in the upper or lower eye
Fig. 4. The 3-stage process of determining the left eye, right eye
and the mouth regions.
2462
Y.-S. Ryu, S.-Y. Oh / Pattern Recognition 34 (2001) 2459}2466
regions with respect to the center line E . Under this
A
assumption, the Eye "eld is determined from the horizontal edge components H ( j) de"ned in Eq. (2) as
F
R(x , E !5, x , E #10) if A*B,
* A
0 A
Eye" R(x , E !10, x , E #5) if A(()B,
* A
0 A
?
R(x , E !10, x , E #10) otherwise,
* A
0 A
where
(9)
#A
#A >
A" H ( j), B" H ( j)
F
F
H#A \
H#A
and is set to 1.5. The size of the eye "eld thus found
becomes (x !x );(15 or 20) and the left and right eye
0
*
"elds, Left}Eye and Right}Eye, are obtained by vertically
splitting the Eye region in half (Fig. 4(c)).
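A sketch of Eqs. (7)-(9) in the same spirit: the eye row E_c is taken from the vertical-edge projection, and the A/B comparison shifts the Eye field up or down. The indexing of Eq. (7) relative to the face region and the offsets (±5/10/15) follow the reconstruction above and are assumptions of this sketch; \alpha = 1.5 as in the text.

```python
import numpy as np

def eye_fields(i_ve, i_he, x_l, y_t, x_r, y_b, alpha=1.5):
    """Coarse Eye field split into its Left_Eye / Right_Eye halves around the row E_c."""
    h_h = i_he.sum(axis=1)                          # Eq. (2): horizontal-edge projection
    depth = int(2 * (y_b - y_t) / 3)                # Eq. (7): search the upper two-thirds
    h_hv = i_ve[y_t:y_t + depth, x_l:x_r].sum(axis=1)
    e_c = y_t + int(np.argmax(h_hv))                # Eq. (8): row of strongest vertical edges
    a = h_h[max(e_c - 15, 0):e_c + 1].sum()         # horizontal-edge strength just above E_c
    b = h_h[e_c:e_c + 16].sum()                     # horizontal-edge strength just below E_c
    if a >= alpha * b:                              # Eq. (9), first case
        top, bot = e_c - 5, e_c + 10
    elif a < b / alpha:                             # Eq. (9), second case
        top, bot = e_c - 10, e_c + 5
    else:                                           # Eq. (9), otherwise
        top, bot = e_c - 10, e_c + 10
    mid = (x_l + x_r) // 2                          # split the Eye field vertically in half
    return (x_l, top, mid, bot), (mid, top, x_r, bot)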
Finally, the mouth region is obtained by searching for
strong horizontal edge components. Similar to the eyes, the
row M_c with the strongest sum of the horizontal edge
components H_h(j) is searched within the coarse mouth
region. Then the mouth field is established as follows:

M_c = \arg\max_j H_h(j),   (y_B - y_T) \le j \le y_B,    (10)

Mouth = R(x_L, M_c - 15, x_R, M_c + 15),    (11)

with the size of the mouth field being (x_R - x_L) × 30
(Fig. 4(c)).
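Eqs. (10) and (11) follow the same pattern for the mouth; the handling of the search range (y_B - y_T) \le j \le y_B is again an assumption of this sketch.

```python
import numpy as np

def mouth_field(i_he, x_l, y_t, x_r, y_b):
    """Coarse Mouth field of Eqs. (10)-(11): a 30-pixel band around the strongest row M_c."""
    h_h = i_he.sum(axis=1)                      # Eq. (2): horizontal-edge projection
    lo = max(y_b - y_t, 0)                      # Eq. (10): search (y_B - y_T) <= j <= y_B
    m_c = lo + int(np.argmax(h_h[lo:y_b + 1]))
    return x_l, m_c - 15, x_r, m_c + 15         # Eq. (11): field of size (x_R - x_L) x 30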
3. Fine tuning of the eye and mouth fields using neural
networks
In general, two approaches exist to search for objects:
(1) matching against the stored database and (2) eliminating non-matching regions [9]. One way to combine these
approaches is the use of positive and negative samples
for training. The positive samples contain the complete
objects whereas the negative samples do not. In this
research, the positive samples are the eye and mouth
regions extracted from the frontal faces while the
negative samples are obtained from the surrounding
areas of the positive samples. Fig. 5 shows these examples. The eye window is 20×10 and the mouth window is
40×20.
The facial feature extractor consists of 3 MLPs for the
left and right eyes along with the mouth. The MLPs are
then trained with a resilient backpropagation algorithm
[10] to produce a linear output whose value represents
the degree of similarity to the positive samples. That is,
the output is +1 for positive and −1 for negative
samples. Physically, the output will be +1 if the candidate window contains the full set of the relevant facial
features.
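The detector just described can be sketched as a small NumPy MLP with the (9, 50, 30, 10, 1) structure reported later in Section 4, trained toward +1 on positive and −1 on negative samples. The paper trains with RPROP [10]; plain full-batch gradient descent is substituted here only to keep the sketch short, and the sample data are random placeholders.

```python
import numpy as np

class FeatureMLP:
    """Minimal MLP with tanh hidden layers and a linear output unit."""
    def __init__(self, sizes=(9, 50, 30, 10, 1), seed=0):
        rng = np.random.default_rng(seed)
        self.w = [rng.normal(0, 1 / np.sqrt(a), (a, b))
                  for a, b in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]

    def forward(self, x):
        acts = [x]
        for i, (w, b) in enumerate(zip(self.w, self.b)):
            z = acts[-1] @ w + b
            acts.append(z if i == len(self.w) - 1 else np.tanh(z))  # linear output layer
        return acts

    def train(self, x, t, epochs=2000, lr=0.01):
        """Full-batch gradient descent on the mean squared error (RPROP substitute)."""
        for _ in range(epochs):
            acts = self.forward(x)
            delta = (acts[-1] - t) / len(x)              # d(MSE)/d(output)
            for i in reversed(range(len(self.w))):
                gw = acts[i].T @ delta
                gb = delta.sum(axis=0)
                if i > 0:                                # backpropagate through tanh layers
                    delta = (delta @ self.w[i].T) * (1 - acts[i] ** 2)
                self.w[i] -= lr * gw
                self.b[i] -= lr * gb

    def predict(self, x):
        return self.forward(x)[-1].ravel()

# Placeholder data: positive samples labelled +1, negatives -1.
x = np.random.randn(40, 9)
t = np.where(np.arange(40) < 20, 1.0, -1.0).reshape(-1, 1)
net = FeatureMLP()
net.train(x, t)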
3.1. Eigenfeature extraction
The input and output representation has a major influence on the neural network performance [11]. Hence, it
would be more efficient to present relevant features
rather than the raw image in order to reduce dimensionality as well as to facilitate training of the network.
Fig. 6 shows the binary edge map representing a 2-D
elliptic data distribution in (u, v) space. The edge pixels in
a search window may be considered as a 2-D pattern in
(u, v). The distribution of these pixels for the eye and
mouth regions is distinctive in nature compared to other
regions of the face. Herein, the Canny edge extractor was
used to precisely locate the eyes while the Sobel horizontal operator was used for the mouth.
In the human face, the eyes and mouth have distinct
shapes compared to other features. That is, the general
shape of the eyes and mouth resides within a certain
range of the space spanned by the eigenvectors and the
eigenvalues describing their shapes. These important features lead to a facial feature detector that not only uses
a relatively small number of training samples, but is also
robust against pose variations.
Fig. 5. Examples of the training set for the MLPs: (a) the positive eye samples; (b) some negative eye samples; (c) the positive mouth samples and (d) some negative mouth samples.

Fig. 6. The binary edge map and the corresponding eigenvectors for each region.

The features used for the neural network training are
then obtained as follows.
The correlation matrix M is obtained first from the
N_p edge point samples:

P = (p_1, \ldots, p_i, \ldots, p_{N_p}),    (12)

where p_i = (u_i, v_i)^T is the coordinate value of the ith sample point, and

M = P \cdot P^T.    (13)

Let the eigenvalues and the eigenvectors of M be
\lambda_1, \lambda_2, e_1, e_2, where \lambda_1 \ge \lambda_2. The eigenvectors represent
the mutually orthogonal principal directions of the data
distribution while the eigenvalues represent the strength
of variation along the directions of the eigenvectors [12].
From these, the following nine eigenfeatures are derived:

x_1 = \lambda_1/\bar{\lambda}_1,  x_2 = \lambda_2/\bar{\lambda}_2,  x_3 = \lambda_2/\lambda_1,
x_4 = e_{1u},  x_5 = e_{1v},  x_6 = e_{2u},  x_7 = e_{2v},    (14)
x_8 = \frac{1}{N_u}\Bigl(\sum_{i=1}^{N_p} u_i/N_p\Bigr),  x_9 = \frac{1}{N_v}\Bigl(\sum_{i=1}^{N_p} v_i/N_p\Bigr),

where N_u and N_v are the horizontal and vertical sample
sizes for the eye or the mouth respectively, and \bar{\lambda}_1 and \bar{\lambda}_2
are the maxima of \lambda_1 and \lambda_2 over the entire
training set. All these features are significant in some
sense for representing the shape of the facial features. The
first and second eigenfeatures are the normalized eigenvalues while the third is their ratio. These three
represent the overall shape of the facial features. The
fourth to seventh components represent the orientations
of the facial features. Finally, the eighth and ninth components are the two coordinates of the centroid of the
candidate region containing the facial features. Section
4 will examine the influence of using different sets of
eigenfeatures on the extraction performance.
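A sketch of Eqs. (12)-(14) for a single candidate window: build the 2×2 correlation matrix of the edge-pixel coordinates, take its eigendecomposition, and assemble the nine features. The (u, v) convention (u along the window width, v along the height) and the handling of an empty window are assumptions of this sketch; \bar{\lambda}_1 and \bar{\lambda}_2 must be collected from the training set beforehand.

```python
import numpy as np

def eigenfeatures(edge_window, lam1_max, lam2_max):
    """Nine eigenfeatures of Eq. (14) for a binary edge window (10x20 eye, 20x40 mouth)."""
    v, u = np.nonzero(edge_window)             # Eq. (12): edge point samples p_i = (u_i, v_i)
    if len(u) == 0:                            # no edge pixels: caller should skip this window
        return None
    p = np.stack([u, v]).astype(float)         # 2 x N_p matrix P
    m = p @ p.T                                # Eq. (13): correlation matrix M = P P^T
    lam, vecs = np.linalg.eigh(m)              # eigenvalues in ascending order
    lam1, lam2 = lam[1], lam[0]                # lambda_1 >= lambda_2
    e1, e2 = vecs[:, 1], vecs[:, 0]
    n_v, n_u = edge_window.shape               # vertical / horizontal window sizes
    return np.array([
        lam1 / lam1_max,                       # x1: normalized largest eigenvalue
        lam2 / lam2_max,                       # x2: normalized smallest eigenvalue
        lam2 / lam1,                           # x3: eigenvalue ratio
        e1[0], e1[1],                          # x4, x5: first eigenvector components
        e2[0], e2[1],                          # x6, x7: second eigenvector components
        u.mean() / n_u,                        # x8: normalized centroid, u
        v.mean() / n_v,                        # x9: normalized centroid, v
    ])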
3.2. Fine localization of the facial features from the neural
networks based on eigenfeatures
A small window of either 20×10 for the eyes or 40×20
for the mouth is shifted across and down within the
coarse Left_Eye, Right_Eye, and Mouth regions found in
Section 2. While being shifted, the eigenfeatures of these
small windows are formed and input to the MLP. As
stated earlier, the output of the MLP denotes the similarity between the eigenfeatures of the current window
and those of the typical facial features. Then for each
facial feature, the index of the sliding window resulting in
maximum similarity is determined and identified as the
finer eye or mouth field.
Fig. 7(a) shows the output distribution of MLP_Mouth
versus the amount of window shift, while
Fig. 7(b) shows the results of extraction of the three facial
features. Zero outputs occur because the test windows
that have a number of edge pixels below a threshold
(20 for the eye and 40 for the mouth) have been skipped,
since in that case the possibility of finding the facial
features of interest is quite low. Finally, these finer regions are re-centered as shown in Fig. 8, thereby reducing
the load of learning many different shifts of the same
feature images.

Fig. 7. (a) The MLP output profile map and (b) the resulting regions of interest found.

Fig. 8. The centering process for the region detected by the MLP: (a) before the centering and (b) after the centering.
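Putting the pieces together, the fine search of this section slides the template over the coarse field, skips windows with too few edge pixels (20 for the eyes, 40 for the mouth), scores the remaining windows with the trained MLP and keeps the best one. `eigenfeatures` and `FeatureMLP` refer to the earlier sketches; the raster-scan order and the tie handling are incidental choices of this sketch.

```python
import numpy as np

def fine_localize(edge_map, coarse, net, lam1_max, lam2_max, win=(20, 10), min_edges=20):
    """Best-scoring (x1, y1, x2, y2) window inside the coarse field, or None if all are skipped."""
    x_l, y_t, x_r, y_b = coarse
    w, h = win
    best_score, best_box = -np.inf, None
    for y in range(y_t, y_b - h + 1):
        for x in range(x_l, x_r - w + 1):
            window = edge_map[y:y + h, x:x + w]
            if window.sum() < min_edges:              # too few edge pixels: skip (zero output)
                continue
            feats = eigenfeatures(window, lam1_max, lam2_max)
            score = net.predict(feats[None, :])[0]    # similarity to the positive samples
            if score > best_score:
                best_score, best_box = score, (x, y, x + w, y + h)
    return best_box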
4. Experimental results and discussion
The structure of the neural nets used is (9, 50, 30, 10, 1):
five-layer neural nets having an input layer of 9 nodes,
an output layer of 1 node showing the degree of
similarity between the learned features and the input features, and
three intermediate layers with 50, 30 and 10 nodes respectively, with a learning rate of 0.01. As for the facial
database, 61 images taken from 14 persons in total were
used. The database contains the training set as well as the
test set containing faces with different poses (looking left,
right or down) and/or from persons other than those in
the training set. The subjects did not wear glasses nor have
beards (Fig. 2(b)). The positive samples were taken from
the eyes and mouths in Fig. 2(a). The negative samples
(70 for the eyes and 29 for the mouth), partially shown in
Fig. 5(b) and (d), were taken from around the eyes and the
mouth, plus the ones that resulted in misrecognition by
the neural network.
To save time, only those test windows with more than
a certain number of edge pixels were processed; when
the features were not found in the coarse feature window,
this threshold was lowered and the search retried. Sometimes the
eyes could not be found when the coarse eye region was
erroneously set, when the region was too dark, or when
the eyes were too small. Fig. 9, due to lack of space, shows
only some of the results of finding the three facial features
of the test faces using the full set of 9 eigenfeatures.

Fig. 9. Experimental results of extracting the regions of interest. The adjacent pairs of images represent the results of coarse finding before matching with the neural nets (on the left) and those of finer finding after matching (on the right).

Table 1
The detection performance (%) according to the number of eigenfeatures used

MLP              Case 1        Case 2        Case 3        Case 4
MLP_Left-Eye     77.1 (39.3)   83.6 (24.6)   95.1 (24.6)   96.8 (16.4)
MLP_Right-Eye    72.1 (37.7)   70.5 (36.0)   72.2 (34.4)   96.8 (19.6)
MLP_Mouth        88.6 (55.7)   88.6 (40.9)   96.8 (22.9)   100  (11.4)

Case 1: six eigenfeatures (x_1, x_2, x_4–x_7) used without re-centering.
Case 2: seven eigenfeatures (x_1–x_7) used without re-centering.
Case 3: nine eigenfeatures (x_1–x_9) used without re-centering.
Case 4: nine eigenfeatures (x_1–x_9) used with re-centering.
Table 1 compares the performance of the four cases of
different combinations of input features selected for the
MLP. Successful localization of the facial features is
defined as the case when at least 4/5 of these features falls
within the region, namely when the features are reasonably centered. In Table 1, the numbers within parentheses show the
ratios of the non-centered cases over all cases. The differences in these non-centered ratios of 14.7% for the left eye, 1.7% for the right eye and
14.8% for the mouth between Cases 1 and 2, for example,
signify that the eigenvalue ratio is important for accurate
localization of the eyes and the mouth. In addition, the
use of the normalized centroid of the data in the image
space is important, locating the mouth 18% better.
Finally, the final centering process also improves the
localization performance by 8.2, 14.8 and 11.5% for the
left eye, right eye and mouth, respectively. The experimental data suggest that the use of eigenfeatures and the
neural network leads to accurate finding of the three
facial features, and further, that each eigenfeature plays a
certain role in the process.
5. Summary and conclusions
A facial feature extraction algorithm for black-and-white
facial images has been presented, based on eigenfeatures
and neural networks. The eigenfeatures were obtained from an expanded set of the eigenvectors and
eigenvalues calculated from the binary edge data set for
each of the facial features. The three MLPs for the facial
features were trained with positive and negative samples
so that the MLP can make a decision as to whether the
current test window contains the facial feature of interest.
The particular window resulting in maximum output for
the MLP is designated as the final facial feature location
after image centering. As a preprocessing step, the vertical and horizontal projections of the binary edge map
from the face image have been used to narrow down the
search. The salient features of the proposed methodology
are as follows:
• No a priori image normalization is necessary with respect to intensity distribution, size, and position.

• The coarse eye and mouth regions are determined within the facial image. While existing research finds the approximate eye position using the horizontal edge map, the proposed research utilizes the horizontal projection of the vertical edge map to capture the eye region within a smaller window.

• The eigenfeatures extracted from the 2-D data distribution of the eye and mouth edge features have been introduced as important and significant features to locate the eye and mouth.

• The need for a large training set in template matching has been avoided by using eigenfeatures and sliding windows within the coarse feature fields.

• The MLPs have a linear output neuron whose output represents the similarity to any of the positive samples.

• The feature localization experiments have been performed on 61 facial images of 14 people with varying facial size and pose, and the recognition rate is 96.8% for the eyes and 100% for the mouth.

• Good generalization towards the same or different persons with largely varying size and pose was achieved despite using a small training data set taken from only five or six persons.
Facial feature extraction has a wide spectrum of
applications, from user verification and recognition to
emotion recognition in the framework of man–machine
interfaces. Future work includes the development of
a robust pose- and lighting-invariant face recognition
system as an extension of the current research.
Acknowledgements
The authors wish to acknowledge the financial support
of the Korea Research Foundation made in the program
year of 1998 and also in part by the Ministry of Education of Korea toward the Electrical and Computer
Engineering Division at POSTECH through its BK21
program.
References
[1] Yankang Wang, H. Kuroda, M. Fujumura, A. Nakamura, Automatic extraction of eye and mouth fields from monochrome face image using fuzzy technique, Proceedings of the 1995 Fourth IEEE International Conference on Universal Personal Communications Record, Tokyo, Japan, 1995, pp. 778–782.
[2] R. Pinto-Elias, J.H. Sossa-Azuela, Automatic facial feature detection and location, Proceedings of the 14th International Conference on Pattern Recognition, Vol. 2, 1998, pp. 1360–1364.
[3] Weimin Huang, Q. Sun, C.P. Lam, J.K. Wu, A robust approach to face and eyes detection from images with cluttered background, Proceedings of the 14th International Conference on Pattern Recognition, Vol. 1, 1998, pp. 110–113.
[4] Kin Choong Yow, R. Cipolla, Feature-based human face detection, Image Vision Comput. 15 (9) (1997) 713–735.
[5] Takeo Kanade, Picture processing by computer complex and recognition of human faces, Technical Report, Department of Information Science, Kyoto University, 1973.
[6] D.J. Beymer, Face recognition under varying pose, A.I. Memo No. 1461, 1993.
[7] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Trans. PAMI 15 (10) (1993) 1042–1052.
[8] P. Juell, R. Marsh, A hierarchical neural network for human face detection, Pattern Recognition 29 (5) (1996) 781–787.
[9] D.F. McCoy, V. Devarajan, Artificial immune systems and aerial image segmentation, IEEE Int. Conf. Systems Man Cybernet. 1 (1997) 867–872.
[10] M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, IEEE Int. Conf. Neural Networks 1 (1993) 586–591.
[11] E. Fiesler, R. Beale, Handbook of Neural Computation, Oxford University Press, Oxford, 1997.
[12] S. Romdhani, Face recognition using principal components analysis, MS Thesis, http://www.elec.gla.ac.uk/%romdhani/pca.htm, 1996.
About the Author: YEON-SIK RYU received the BS degree in Electrical Engineering from HanYang University, Seoul, Korea in 1988,
and the MS degree from Pohang University of Science and Technology (POSTECH), Pohang, Korea in 1990. He is currently a research
engineer at LG Electronics Inc. and is working toward a Ph.D. in Electrical Engineering. His research interests are in the areas of neural
networks, immune systems, and evolutionary computation and its application to automation systems.
About the Author: SE-YOUNG OH received the BS degree in Electronics Engineering from Seoul National University, Seoul, Korea in
1974, and the MS and Ph.D. degrees in Electrical Engineering from Case Western Reserve University, Cleveland, OH, USA, in 1978 and
1981 respectively. From 1981 to 1984, he was an Assistant Professor in the Department of Electrical Engineering and Computer Science,
University of Illinois at Chicago, Chicago, IL. From 1984 to 1988, he was an Assistant Professor in the Department of Electrical
Engineering at the University of Florida, Gainesville, FL. In 1988, he joined the Department of Electrical Engineering, Pohang
University of Science and Technology, Pohang, Korea, where he is currently a Professor. His research interests include soft computing
technology including neural networks, fuzzy logic, and evolutionary computation and its applications to robotics, control, and
intelligent vehicles.