
Comparison of skin color detection and tracking methods under varying illumination
Birgitta Martinkauppi*^, Maricor Soriano**, and Matti Pietikäinen*
*Department of Electrical and Information Engineering, P.O. Box 4500, FIN-90014, University of Oulu, Finland.
^Currently at: InFotonics Center, P.O. Box 111, FIN-80101, University of Joensuu, Finland.
**National Institute of Physics, University of the Philippines, 1101 Diliman, Quezon City, Philippines.
Abstract— When skin areas such as faces and hands are imaged in natural environments, their color appearance is frequently affected by variations in illumination intensity and chromaticity. In color-based skin detection and tracking, changing intensity is often handled either by using normalized, intensity-invariant color coordinates or by additionally modeling possible skin intensities. Chromaticity variations are rarely considered, although they are common in practice. In most approaches that do consider chromaticity, the experiments are done with a small or undefined variation range. For this reason, it is difficult to compare different approaches and assess their applicability range. To improve the situation, we evaluate the performance of four state-of-the-art methods under drastic but practically common illumination changes. The effect of illumination chromaticity on skin is clearly defined, and based on it we draw conclusions about the performance of these approaches.
Index terms—color vision, RGB face images, face detection, face tracking, performance evaluation
1. INTRODUCTION
Faces and hands are important for human social interaction: faces are used to recognize other persons and hand gestures can be
used as a means of communication. They also provide information about actions, for example when a person is moving towards
a door or lifts an object. It is therefore not surprising that many machine vision applications, such as human-computer interaction, seek to extract data about them. In the first stage, faces and hands need to be found and separated from the rest of the image content. Although many different cues have been utilized in detection, color information is among the most powerful and popular ones; the color cue offers low computational cost, simplicity, high discriminative power and robustness against geometrical transformations such as rotation, scaling, translation and shape changes under a uniform, stable illumination field.
When using color as a cue, one must take into account its relationship to illumination conditions, as color is sensitive both to these and to the white balancing of the camera. First, the same scene may appear with slightly different colors between images taken under
illuminants with different spectral power distribution even though the camera was white balanced each time to the prevailing
illuminant. Second, if the camera is not calibrated to the illumination variation, especially in the case of changing illumination
chromaticity, the color appearance can vary drastically. For example, skin color can appear reddish or bluish and the dynamic
range of the camera can be more easily exceeded [1]. In some cases, even the surroundings can produce strong color casts.
The effect of varying illumination chromaticity is rarely considered by skin detection and tracking algorithms, possibly
because it is also generally ignored in the test databases used for them. Because image or video databases considering illumination chromaticity generally have a small and often unspecified variation, we have created a Face Video Database [2] which contains videos with time-varying illumination chromaticity taken with several cameras, face images taken under different but known white balancing and prevailing illumination conditions, and spectral information of skin and light sources. The spectral sensitivities of the cameras and their settings are assumed to be unknown, as this is common in practice. Therefore, a camera-independent skin color database (e.g. where skin reflectances and SPDs of light sources are known, as in [1]) would be useless. When
the videos were taken, automatic white balancing was turned off and the adjustment of the camera was fixed after initial settings.
The face areas are manually marked from frames by xy-coordinates (“bounding areas”) which can be used as ground truths for
the evaluation of an algorithm. The database is available via the Internet.
To tackle the problems of varying appearance, a large number of studies have examined the relations between color appearance and illumination. The suggested classic “solution” for varying color appearances is color constancy, which means
either correcting colors of the image or using invariant color features. The color constancy algorithms are often only for limited
use because of their restricting assumptions, like unsaturated channels, and requirements, like an always present reference color.
In addition, they are usually designed for flat surfaces and a uniform, global illumination field, an exception being Retinex (see
about Retinex drawbacks in [3]). However, real scenes are perceived in three-dimensional space and the illumination field may
vary over the scene and its objects. The suitability of color constancy algorithms for machine vision purposes is more dubious
because they have often been designed to produce pleasing color appearances for humans. This is clearly demonstrated by Funt
et al. [4], who showed the failure of these algorithms in color indexing. Despite this, some authors claim illumination-invariant or robust techniques even when their technique only tolerates changes in illumination direction, intensity or both. In some papers it is not even clear to the reader which kinds of illumination changes the paper is supposed to consider.
Quite typically, skin detection and tracking papers take into account intensity variations of an object’s color. In some papers, the intensity information is part of the skin model. However, excluding it from the model seems to be a more interesting approach because it reduces the dimensionality of the model (limits need to be found for only two components) and, in most cases, the variation in intensity is larger than the variation in either chromaticity coordinate. A common method to
suppress the varying illumination intensity is the use of only chromaticity coordinates. These can be simply obtained by
converting RGB to a color space with separated intensity and chromaticity components. Of course, the discrimination between
colors separated by their intensities is now lost (see for example [5]). Furthermore, this method solves the problem of varying
illumination only partially. A more difficult problem is left intact: how to handle changes in illumination chromaticity. Again, some improvement can be made by quantizing chromaticity values, which increases the robustness against noise and small changes and speeds up the program, but obviously reduces the discrimination between colors. Despite the fact that these
illumination chromaticity changes are common, they are still considered only in a few papers related to skin detection and
tracking.
In this paper, we classify the algorithms used into three groups: “frame-based”, “sequence-based” and “mixed” approaches.
The frame-based methods can only be applied for still images or single frames of videos, because they do not utilize the
information provided by the sequential dependence of frames. The sequence-based ones, on the other hand, need information
from previous frames to aid detection in the current frame and therefore cannot be applied directly to still images. The mixed
approach can be tuned for still images or videos when needed. It has knowledge of all possible skin colors under different
illumination conditions but the range of possible skin colors can be further limited if the information from previous frames is
available.
The frame-based approaches can be implemented via two alternative strategies to determine if a pixel belongs to skin: 1) the
color of the pixel is inside the defined region of possible skin colors, and 2) the color of the pixel has a probability of belonging to a skin color distribution higher than a selected threshold. Typically the region or probability distribution is obtained off-line from a set of representative images with labeled skin pixels, but on-line methods have also been suggested [6]. The complexity of the first strategy varies from simple fixed values (for example [7]) to more complex shape functions [8]. For the second strategy, the methods can be nonparametric like histograms (see, e.g. [9], [10] or [11]), semiparametric like the self-organizing map (SOM) [12] and neural networks [13], or parametric, assuming a certain distribution like a Gaussian or Gaussian mixture (for instance [11], [14] and [15]). Not all frame-based approaches are static; for example, adaptations that refine
the skin color model for the skin color appearance of the image have been proposed [16][17][18]. However, none of these
algorithms have been tested under significantly changing lighting chromaticity.
The sequence-based approaches use knowledge from previous frames for skin detection in subsequent frames, for example
pixels from face localization in the current frame are used to update the skin color histogram. To our knowledge, localization
information has been used in two adaptive schemes so far: Raja et al. [19] updated the skin color histogram using pixels inside
the face localization and Yoo and Oh [20] refreshed the skin color model from the area obtained inside elliptical face
localization. They both are sensitive to localization quality because the pixels for the model update are determined from the
localization. Therefore, it is not surprising that they easily adapt to non-skin objects (see for example [21]).
The mixed approach is based on the knowledge of possible skin chromaticities under a certain limited illumination variation.
This knowledge itself does not assume any probability for certain colors to belong to skin; it only defines the area for possible
chromaticities and therefore its applicability for a range of illumination variation and white balancing conditions is clearly
defined. It is inherently tolerant to a non-uniform illumination field. Of course, instead of modeling the skin locus as a constraint
in a color space, the constraint could be implemented as a look-up table. In normalized color coordinates (NCC) it is called a
skin locus after Störring et al. [22], who also showed that it follows the Planckian locus (a curve of Planckian illuminants’
chromaticities) for one white balancing condition with manual adjustment of lens opening and shutter speed to avoid channel
clipping. The skin locus is useful for selecting pixels in probability based face tracking as shown in [21]. However, in this case
the skin locus was created using several white balancing conditions which also makes it robust against different white balancing
conditions. Unlike the frame-based and sequence-based approaches, the skin locus has a physical background; it can be obtained
not only from images but also calculated from skin reflectances and the illumination spectral power distribution. Martinkauppi
and Soriano [23] demonstrated that the skin locus using different white balancing conditions can be reconstructed using a few
basis functions of the skin color signal and known camera responses. The skin color signal is the product of the skin reflectance
and spectral power distribution of a light source, and is useful for modeling a mixture of illumination and is, at least in principle,
device-independent. Skin modeling under mixture illumination is also studied by Störring et al. [22].
There are already some comparison studies of skin detection methods, like [24] and [25], but in their data the skin appears in or near skin tones. In this paper, we mainly consider problems caused by varying illumination, especially changing chromaticity, which is very common in practice. The main goal is the comparison of different skin detection methods under challenging lighting conditions. A high skin pixel detection rate is sought because good skin pixel selection is essential for further processing. Color spaces have also been tested for skin detection, for both challenging and unchallenging cases, but they are not the focus of this paper. A review of color space studies can be found in [26]. The first experiment compares
frame-based approaches and skin locus with two video sequences taken by two different cameras. The first video contains a
varying background (a person is moving around) while the head size and position are quite fixed [27]. In the second video,
persons are moving in front of the camera so the head size and position are also changing. Both video sequences contain real,
drastic illumination chromaticity changes. The second experiment compares skin locus and sequence-based approaches using the
first video [28]. The algorithms used in the experiments were compared with the following measures: true positives, false positives and, additionally for tracking, error in localization. True positives are those pixels found by the algorithm inside the ground truth; they are considered here as skin pixels. Those pixels which are also labeled as skin but lie outside the ground truth bounding box are false positives. Both measures are normalized by all pixels of their class, that is, true positives by the number of all pixels inside the ground truth and false positives by the number of all pixels outside the ground truth. With these two measures, it is possible to evaluate algorithm performance for correctly labeling skin and non-skin pixels. Localization error is calculated from the relationship between the computed bounding box and the ground truth bounding box.
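The two normalized measures can be sketched in code as follows; the boolean-mask representation and the toy 4x4 example are our own illustrative choices, not the exact implementation used in the experiments.

```python
import numpy as np

def detection_rates(skin_mask, gt_mask):
    """Normalized true/false positive rates for a skin-detection result.

    skin_mask: boolean array, True where the algorithm labeled skin.
    gt_mask:   boolean array, True inside the ground-truth bounding area.
    """
    # True positives normalized by all pixels inside the ground truth:
    tp = np.logical_and(skin_mask, gt_mask).sum() / gt_mask.sum()
    # False positives normalized by all pixels outside the ground truth:
    fp = np.logical_and(skin_mask, ~gt_mask).sum() / (~gt_mask).sum()
    return tp, fp

# Toy 4x4 example: ground truth covers the left half of the image,
# the detector labels the two middle columns as skin.
gt = np.zeros((4, 4), dtype=bool)
gt[:, :2] = True
detected = np.zeros((4, 4), dtype=bool)
detected[:, 1:3] = True

tp, fp = detection_rates(detected, gt)
print(tp, fp)  # 0.5 0.5: half the skin found, half the background mislabeled
```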
The paper is organized in the following way: Section 2 presents first some skin properties, the general camera model
and problems caused by varying illumination conditions. After this, a procedure to obtain skin locus from images is presented in
detail. Then, in Section 3, we introduce different skin detection schemes and select example methods from these schemes. In the
next section, the selected methods are applied on real data taken under realistic conditions. In Section 5 we discuss the results
and finally the conclusions are drawn in Section 6.
2. DETERMINING A SKIN LOCUS FROM COLOR FACE IMAGES
First, some properties of skin need to be described briefly. Then we introduce the problem related to changing illumination, which is highly relevant for imaging skin under real world conditions. After this, the procedure for creating the skin locus is described and applied to real camera data.
Many papers have demonstrated that different skin tones are separated mainly by their intensity. Naturally this is due to
the similarity in skin spectra which is primarily formed from three colorants, melanin, carotene and hemoglobin. According to
[1], skin reflectances have a similarity in shape and smoothness and they are separated at most by their level. The similarity of
skin reflectances between persons has been also investigated in [29][30] which showed the reconstruction of skin spectra with
high quality using only three basis vectors. As these results indicate, it is obviously attractive to make a model for skin detection
using only intensity independent chromaticity coordinates; this kind of model would be able to detect skin robustly despite its
tone. It would also intrinsically tolerate intensity variations in skin caused, for example by shadows or by light intensity changes.
However, it is not invariant to the color shifts due to changes in the prevailing illumination chromaticity. Additional procedures are needed to tolerate changing chromaticity values, because chromaticity changes cannot be cancelled like intensity variations, and color constancy or color correction often tends to be unsuccessful [4].
First, one must study how the illumination affects image formation in cameras. Theoretically, the effect of illumination can be simulated using Eq. 1, which presents a simplified but commonly used camera channel model. The channel output for light reflected from an object is weighted by the channel response and normalized by the channel response for a white reference:
C_i = [Σ_{λ=400 nm}^{700 nm} R_i(λ)·I(λ)·S(λ)] / [Σ_{λ=400 nm}^{700 nm} R_i(λ)·I_wb(λ)],    (1)

where
i = red, green or blue, C_i = camera’s output for channel i, λ = wavelength,
I = spectral power distribution of the light source, wb = white balancing,
S = reflectance of an object (e.g. skin), and R_i = relative spectral response of the camera.
The numerator describes the response of the camera to the prevailing illumination whereas the denominator indicates the effect
of normalization or the white balancing of the channel to an illuminant. The illumination affects the output via two different
mechanisms. In the first case, white balancing cannot totally remove the effect of illumination chromaticity to the color
appearance of objects. A successful white calibration can guarantee only a white target appearing similarly across images taken
under different conditions. If the camera is linear, the similarity is also preserved for different shades of gray. Another case of
color mismatch is caused by the difference between the prevailing and calibration lighting. The change in intensity may distort the colors, for example via the limited dynamic range. Disparity between the chromaticities of the lights also causes chromaticity shifts in the color appearance of objects.
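Eq. 1 can be used to illustrate this chromaticity-shift mechanism numerically. The sketch below uses invented Gaussian channel sensitivities, a synthetic skin-like reflectance and two hypothetical illuminants (none of these spectra are measured data), so only the qualitative behavior should be read from it: a reddish prevailing light with mismatched white balancing shifts the skin chromaticity toward red.

```python
import numpy as np

wl = np.arange(400, 701, 10).astype(float)  # nm, matching the 400-700 nm sum in Eq. 1

def channel(R, I, S, I_wb):
    """Eq. 1: camera output for one channel, normalized by the response
    to the white-balancing illuminant."""
    return np.sum(R * I * S) / np.sum(R * I_wb)

def red_chromaticity(I, I_wb, S, sensitivities):
    """NCC red chromaticity r = C_R / (C_R + C_G + C_B) for given spectra."""
    C = np.array([channel(R, I, S, I_wb) for R in sensitivities])
    return C[0] / C.sum()

# Synthetic spectra for illustration only (invented, not measured data):
R_rgb = [np.exp(-((wl - mu) ** 2) / (2 * 40.0 ** 2)) for mu in (610, 550, 460)]
S_skin = 0.2 + 0.5 * (wl - 400) / 300      # skin-like reflectance, rising toward red
I_flat = np.ones_like(wl)                  # equal-energy illuminant
I_reddish = wl / 700.0                     # low color temperature: more red power

r_matched = red_chromaticity(I_flat, I_flat, S_skin, R_rgb)      # balanced to prevailing light
r_mismatch = red_chromaticity(I_reddish, I_flat, S_skin, R_rgb)  # reddish light, flat balancing
print(r_mismatch > r_matched)  # True: the mismatch shifts skin toward red
```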
The relationship between white balancing lighting and prevailing lighting is also demonstrated in Fig. 1; the 16 images are
taken with four different calibrations and under four prevailing illumination conditions with a USB web camera (Nogatech). The
images were taken under quite ideal conditions: a dark room with a whitish background of photographic canvas and using
Macbeth SpectraLight II light sources (Planckian type Horizon 2300 K, Planckian type Incandescent A 2856 K, fluorescent
TL84 and simulated daylight 6500 K). One can note from Fig. 1 that the white balancing is not always successful, which suggests limited capabilities of the camera used. Even when it was successful, it was still not enough to produce the same color appearance, because the spectrum of the lighting differs and because of camera properties. But if the illumination is different from
the white balancing conditions, the color appearance changes in more drastic ways and the direction of color shift can be directly
linked to the color temperature change. If the color temperature of the prevailing illumination increases, this results in more
bluish content in the spectral power distribution and the appearance of color shifts toward blue. In the opposite case, the reddish
content of the illumination rises and causes a reddish color shift. Störring et al. [22] showed that the modeled skin chromaticities occupy only part of the whole color space and that their curve follows the Planckian locus. Hence, the area of skin chromaticities was named the skin locus. Their test was done with one camera white balance and a couple of other prevailing light sources. The camera lens and shutter speed were adjusted manually, which is not done in our procedure. In addition, we use image data to determine the skin locus instead of simulations.
When the images of possible illumination/camera calibration cases are available, they can be used to determine the range of
skin chromaticity variations (skin locus). The locus is camera specific and it is sometimes difficult and cumbersome to take all
the needed images under all cases. If spectral sensitivities are available, one may also simulate the locus (see Eq. 1). However,
this requires knowledge of the properties and settings of the camera.
The procedure for creating the skin locus constraint is simple: 1) skin colored areas are extracted manually, 2)
their values are converted to normalized color coordinates NCC, also marked as rgbI, and 3) the limiting boundary functions are
approximated. The chromaticity coordinates of NCC were selected because different skin color groups overlap quite well in
them; skin tones form a small compact cluster [31] and skin colors of different ethnicities largely overlap [2]. In addition, earlier research has shown that NCC chromaticities produced top performance when compared to other
color spaces for the spread of skin chromaticities under varying illumination [2]. The formula for NCC rgb is
n(l) = i(l) / (R(l) + G(l) + B(l)),    (2)

where
n = normalized color coordinate, i.e. r, g or b, l = location of the point, and i = R, G or B corresponding to n.
It is enough to use two NCC coordinates, which are here r and b, because of the redundancy: r+g+b=1. NCC rb chromaticities
from two 16 image series (yellowish and pale persons) are plotted in Fig. 2 (a). This selection of rb chromaticities allows
modeling with straight lines whereas the modeling with rg chromaticities would require curves [2]. As can be observed, all
possible skin chromaticities for four different white balances form a compact cluster while the prevailing lighting conditions
vary between sunset or sunrise and daylight. This cluster is, of course, camera-dependent as shown in (1). Due to this, different
cameras may have different clusters. Based on the cluster for the Nogatech camera, it is possible to create a filter to select skin
colored pixels specifically for this camera. For example, a pixel can be labeled as skin (marked as 1) or non-skin (marked as 0)
using the following equations (in Eq. 3a, the first line defines upper bound and the other two the lower bound for skin locus):
Skin(r, b) = 1 if:
  b < 0.71 − r/1.15  when 0.22 < r < 0.71,
  b > 0.61 − r/1.15  when 0.22 < r ≤ 0.34,
  b > 0.58 − r/1.15  when 0.34 < r < 0.71    (3a)

Skin(all other cases) = 0    (3b)
These equations clearly define the possible skin chromaticities in the used color temperature range. However, a fluorescent lamp
with a strong green component within the range and not used in making the locus might produce skin chromaticities outside the
locus. We did not use a statistical method for determining the filter in order to avoid giving a preference to certain skin colors.
Some skin colors do occur more often in the 16 image series, but this does not mean that this is so in a video frame or an image.
One should also remember that many cameras produce nonlinear responses to lightness in order to compensate for the nonlinear relationship between lightness and voltage introduced by display devices. Partly because of this, chromaticity is not totally independent of intensity, and this should be taken into account when applying the skin locus and other methods.
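A direct implementation of the filter defined by Eqs. 2 and 3 might look as follows; the function names and the two test pixels are our own, but the bounds are those of Eq. 3.

```python
def ncc(R, G, B):
    """Eq. 2: normalized color coordinates; only r and b are needed (r + g + b = 1)."""
    s = float(R + G + B)
    if s == 0:
        return 0.0, 0.0
    return R / s, B / s

def skin_locus_nogatech(r, b):
    """Eq. 3: whole-locus filter for the Nogatech camera (all four white
    balancings). The first condition is the upper bound for the locus,
    the other two the lower bound."""
    if not (0.22 < r < 0.71):
        return 0
    if b >= 0.71 - r / 1.15:            # above the upper bound
        return 0
    if r <= 0.34:
        return 1 if b > 0.61 - r / 1.15 else 0
    return 1 if b > 0.58 - r / 1.15 else 0

# A typical reddish skin tone falls inside the locus; saturated green does not:
print(skin_locus_nogatech(*ncc(180, 120, 90)))   # 1
print(skin_locus_nogatech(*ncc(30, 200, 30)))    # 0
```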
Fig. 2 (b) shows a skin color cluster for a single white balancing condition. This sublocus is smaller than the locus which
covers all four white balancing conditions. Once again, the skin locus can be modeled using straight lines. For Fig. 2 (b) the
following equations were used to determine the class (skin/non-skin) of a pixel (in Eq. 4a, the first line defines upper bound and
the other two the lower bound for skin locus):
Skin(r, b) = 1 if:
  b < 0.69 − r/1.15  when 0.22 < r < 0.6,
  b > 0.62 − r/1.15  when 0.22 < r ≤ 0.34,
  b > 0.60 − r/1.15  when 0.34 < r < 0.6    (4a)

Skin(all other cases) = 0    (4b)
This kind of subcluster, which defines the possible skin chromaticities for lighting variations under one white balancing, is specific to that balancing case, whereas the use of the large cluster can offer robustness against different white balancings. A common region for all subloci seems to be the region at white balancing conditions, where the skin color has its most natural appearance. There is a trade-off between the generality of the locus and its discrimination capability; the whole locus offers better detection of possible skin colors but weakens the separation between the background and possible skin chromaticities in the frame.
Skin loci of different cameras may differ. Fig. 3 shows two skin loci, one for a Sony 1CCD camera and another for a
Nogatech 1CCD web camera. Although there are some regions which overlap (the region around skin tone), they differ
especially in chromaticities imaged under uncalibrated conditions. The skin locus for the 1CCD Sony camera can be modeled
using the following equations (in Eq. 5a, the first line defines upper bound and the other two the lower bound for skin locus):
Skin(r, b) = 1 if:
  b < 0.87 − 1.2·r  when 0.09 < r < 0.57,
  b > 0.75 − 1.1·r  when 0.09 < r ≤ 0.32,
  b > 0.81 − 1.3·r  when 0.32 < r < 0.57    (5a)

Skin(all other cases) = 0    (5b)
A skin locus is also able to handle mixed illumination cases, which are common in practice but often ignored by many algorithms. For example, in Fig. 4 a person is walking in a corridor. The left side of the face is illuminated by the fluorescent lamp on the ceiling while the window casts daylight over the right side. This causes multiple facial skin colors to appear, even with unnatural shades. No global color correction/constancy scheme is able to work with this kind of image.
3. METHODS FOR SKIN DETECTION
Skin detection methods can be roughly classified into three groups: frame-based, sequence-based and mixed approaches. The
frame-based approaches can be either static or adaptive. In the static case, the possible skin colors (usually skin tones) are fixed a priori and applied to a single frame or frame sequence. In frame-based approaches, adaptation usually means fine-tuning within the fixed skin colors using some kind of constraint, for example that the skin area is bigger than other similarly colored objects. In contrast, the sequence-based approaches rely, after initialization, on the color information obtained from the
previous frame. They typically use some spatial constraint for selecting the pixels for skin color model update. The mixed
approach is skin locus which can be used with or without information from previous frames. The information from previous
frames can be used to limit the possible skin colors if the illumination conditions during imaging do not change too radically
between adjacent frames.
A. Frame-based skin detection
The following state-of-the-art algorithms were selected to represent different types of frame-based methods: 1) the adaptive
skin-color filter proposed by Cho et al. [18], 2) the probability based approach of Jones and Rehg [11], and 3) skin color
modeling with color correction by Hsu et al. [8]. These methods are applied on single images and frames, in which they classify pixels as skin or non-skin based on their skin definition. No information from previous frames of videos is used.
1) Adaptive skin-color filter
The adaptive skin-color filter has two operational stages: rough skin detection and finer tuning toward the actual skin color values. Its color space is a modified version of HSV; the values of the hue (H) component are shifted by 0.5 to move pure red to the center of the H axis. In rough detection, those pixels with values inside preset limits in HSV coordinates are labeled as skin. The
preset limits are determined off-line from a representative image set. The threshold values for the H component are fixed and are
assumed to be absolute. Table 1 shows the original values given for oriental persons [18]. After the selection of skin candidates,
an adaptive method is applied to more closely redefine the skin colored area. The basic requirement for adaptation is that the
region of skin color is at least comparable to the area of skin-like background colors.
In [18], the algorithm was tested with Internet images which were divided into four categories: normal, reddish, dark and
light. The detection rate was high. However, the number of reddish images was small, and no skin chromaticity range was
shown or defined for these images.
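The rough-detection stage amounts to a box test in the shifted-hue HSV space and can be sketched as below. Since Table 1’s actual limits are not reproduced here, the `demo_limits` values are hypothetical placeholders, not the thresholds of [18].

```python
import colorsys

def rough_skin_candidate(R, G, B, limits):
    """Rough detection in the spirit of the adaptive skin-color filter [18]:
    a pixel is a skin candidate if its shifted-hue HSV values fall inside
    preset limits. `limits` = (h_lo, h_hi, s_lo, s_hi, v_lo, v_hi); the real
    values come from Table 1 / off-line training and are NOT hard-coded here."""
    h, s, v = colorsys.rgb_to_hsv(R / 255.0, G / 255.0, B / 255.0)
    h = (h + 0.5) % 1.0  # shift hue by 0.5 so pure red sits mid-axis
    h_lo, h_hi, s_lo, s_hi, v_lo, v_hi = limits
    return h_lo <= h <= h_hi and s_lo <= s <= s_hi and v_lo <= v <= v_hi

# Illustrative (hypothetical) limits centered on the shifted red region:
demo_limits = (0.4, 0.6, 0.1, 0.8, 0.2, 1.0)
print(rough_skin_candidate(200, 120, 100, demo_limits))  # True (reddish pixel)
print(rough_skin_candidate(50, 200, 50, demo_limits))    # False (green pixel)
```

Fixing the candidates in hue first, then refining, mirrors the two-stage structure described above; the adaptive refinement stage is omitted from this sketch.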
2) Statistical skin color model
Jones and Rehg [11] collected a large database from the Internet and derived from it a statistically reliable color model for skin
and non-skin classes. They modeled skin and non-skin classes with two different statistical approaches: histogram (non-parametric approach) and Gaussian mixture (parametric approach). The modeling was done in RGB color space. These color
models were obtained from over 1 billion labeled pixels, therefore making it the most reliable probabilistic model so far.
According to their research, the non-skin color model seems to contain mainly those colors which appear between black and
white axes whereas skin color seems to have more reddish components. The histogram model performed slightly better than the
Gaussian mixture model from the standpoints of accuracy and computational cost.
Skin and non-skin class models can be used to select only those values which have more support in the skin class than in the
non-skin class. This is done by taking the ratio of a pixel’s support of skin and non-skin classes and comparing it to a preset
threshold. If the ratio is at least equal to the threshold, the pixel is labeled as skin. The threshold is determined off-line from a
representative image set using ROC (Receiver Operating Characteristics). The ROC curve describes how the number of true skin
pixels and falsely classified non-skin pixels behave as a function of different threshold values. The threshold value selection is
based on the user’s criteria for allowed correct detection and false classification rates. For our experiments, the Gaussian mixture
model was implemented using the parameters given in [11] (Appendix A). A keen reader might notice that the sum of weights is
smaller than one. According to Jones and Rehg (via email communications), this is caused either due to rounding errors or
numerical errors in the expectation-maximization procedure.
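The ratio test can be sketched as follows; the 32-bin histograms, the synthetic "models" and the bin-quantization scheme are our own illustrative stand-ins for the full models trained in [11].

```python
import numpy as np

def skin_by_ratio(pixel_rgb, skin_hist, nonskin_hist, theta, bins=32):
    """Histogram-based ratio test in the spirit of Jones and Rehg [11]:
    label a pixel as skin if P(c|skin) / P(c|non-skin) >= theta.
    The histograms here are placeholders for models trained off-line."""
    idx = tuple(min(int(c * bins / 256), bins - 1) for c in pixel_rgb)
    p_skin = skin_hist[idx]
    p_nonskin = nonskin_hist[idx]
    if p_nonskin == 0:
        return p_skin > 0
    return (p_skin / p_nonskin) >= theta

# Tiny synthetic "models": skin mass concentrated in reddish bins,
# non-skin mass spread uniformly (both invented for illustration).
bins = 32
skin_hist = np.full((bins,) * 3, 1e-6)
skin_hist[20:28, 10:18, 8:16] = 1.0
skin_hist /= skin_hist.sum()
nonskin_hist = np.full((bins,) * 3, 1.0)
nonskin_hist /= nonskin_hist.sum()

print(skin_by_ratio((190, 110, 90), skin_hist, nonskin_hist, theta=1.0))  # reddish pixel
print(skin_by_ratio((30, 30, 220), skin_hist, nonskin_hist, theta=1.0))   # bluish pixel
```

Sweeping `theta` and recording the resulting true/false detection rates traces out the ROC curve used for threshold selection.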
3) Skin detection based on color correction
The main idea of the method proposed by Hsu et al. [8] is to use color correction before skin detection. The color correction and
skin detection are done in the YCbCr space, which here follows the international standard for studio-quality component digital video (Rec. ITU-R BT.601-4).
The color correction algorithm is one version of the white patch method. It starts with a search of the pixels with a
luminance value belonging to the top 5 % of all image values. The detected pixels are assumed to belong to a white object. If the
number of these top pixels exceeds 100 and the mean of the whole image is not “skin color”, then color correction will be
performed. However, “skin color” was not described in further detail. The correction coefficients are calculated so that the
average of the selected top pixels becomes 255 for each channel.
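A minimal sketch of this correction step, assuming Rec. 601 luminance weights and omitting the unspecified "mean of the image is not skin color" check; the function name and the flat test image are our own.

```python
import numpy as np

def white_patch_correction(img, top_fraction=0.05, min_pixels=100):
    """Sketch of the reference-white correction step of [8]: take the pixels
    in the top 5 % of luminance, assume they belong to a white object, and
    scale each channel so their average becomes 255. The paper's additional
    'mean is not skin color' test is omitted here, as it is not specified."""
    img = img.astype(float)
    # Rec. 601 luminance approximation:
    luma = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    thresh = np.quantile(luma, 1.0 - top_fraction)
    mask = luma >= thresh
    if mask.sum() < min_pixels:
        return img  # too few reference-white pixels: leave image unchanged
    gains = 255.0 / img[mask].mean(axis=0)       # per-channel correction
    return np.clip(img * gains, 0, 255)

# A flat grayish image is mapped to pure white by the correction:
img = np.full((20, 20, 3), 128.0)
corrected = white_patch_correction(img, min_pixels=10)
print(corrected[0, 0])  # [255. 255. 255.]
```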
After the correction stage, a skin filter is applied on the images. It is designed for YCbCr. The nonlinear transformation to
Cb and Cr should reduce luminance dependency (i.e. shadowing effects) of the detection. The skin color cluster itself is modeled
in this space with a shape function (see Appendix B). Those pixel values which fall inside the function are labeled into the skin
class and other ones into the non-skin class.
B. Sequence-based skin detection
For our comparison test, we chose the sequence-based method of Raja et al., in which the skin color model is updated using
pixels inside the face localization [19]. It is less sensitive to the quality of localization than the method of Yoo and Oh [20],
which uses all pixels inside the elliptical localization.
A visualization of the chosen pixel selection method is shown in Fig. 5. The outer box represents the face localization
while the inner box, which is 1/3 of the outer box size and 1/3 from the outer box boundaries, displays the pixels selected. Using
these pixels, there are many possible ways to refresh the skin color model. The moving average was used for the
update in all adaptive sequence-based cases:
M̂_t = ((1 − α)·M_t + α·M_{t−1}) / max((1 − α)·M_t + α·M_{t−1}),   (6)
where
M̂_t = new, refreshed model, M = model, t = frame number, and α = weighting factor (set to 0.5).
In Eq. 6, the divisor is used to scale the new model to the probability range of 0-1. The moving average method was used
because it provides a smooth transition of the skin model between frames and reduces noise in images taken by a low quality
camera. The noise can cause changes in pixel color even when there is no change in the scene. Therefore, it is advisable not to
use a color model obtained just from the current frame.
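The Eq. 6 update can be sketched in a few lines of NumPy (our own illustration; the function name is an assumption):

```python
import numpy as np

def refresh_model(m_curr, m_prev, alpha=0.5):
    """Eq. 6 sketch: moving-average refresh of the skin color model.

    m_curr, m_prev: histogram models for the current and previous frames
    (NumPy arrays of equal shape). The blend is rescaled by its maximum so
    the refreshed model stays in the 0-1 probability range.
    """
    blended = (1.0 - alpha) * m_curr + alpha * m_prev
    return blended / blended.max()
```

For example, with α = 0.5 and bin counts [2, 4] and [6, 0], the blend [4, 2] is rescaled to [1.0, 0.5].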
C. Mixed approach
It is possible to use the skin locus in a frame-based or a sequence-based fashion; the latter is an option when knowledge about
skin color is available from previous frames. In the frame-based version, the whole locus is needed to select possible skin pixels
because no prior knowledge about the illumination conditions is available. This reduces the discrimination between background and
face, but there are applications in which a high detection rate is more important than this loss of discrimination. If the white
balancing condition is known, only a sub-locus needs to be used instead of the whole locus, which further reduces the number of
false positives. In the sequence-based approach, the target histogram can be
updated using the pixels selected by the skin locus. The knowledge from previous frames can be utilized to refine a smaller area
inside the locus, because only a limited color shift between adjacent frames is assumed.
4. COMPARATIVE EXPERIMENTS
The detection schemes were compared with two image sequences: one taken with a 1CCD Nogatech camera and another with a
1CCD Sony dfw-x700 camera. These sequences contain considerable but natural illumination changes. For example, the
Nogatech sequences (video 1) show a person walking in a corridor in which the illumination varies from pure fluorescent
light to mixed light (fluorescent and daylight from a window). In this video, the skin chromaticity and its distribution change over
time, as can be seen from Figs. 6(a) and (b). The head size and position changes are minor. The Sony sequences have a fixed
camera setting and one or two persons move in front of the camera with varying distances (Fig. 6 (c)). The head size, orientation
and position changes are much larger than in the Nogatech video. This video is also more challenging than the Nogatech one
because the skin color rarely appears as skin tone. The videos of the results can be seen at
http://www.cs.joensuu.fi/~jbm/Comparisons.html (ID: DetectionVideos / password: skindetmovies4). This webpage also
contains tests with another sequence taken under outdoor conditions and additional material concerning applications in face
detection, including a technical report with more details.
A. Experiment 1: skin detection
The main goal of this experiment is to find the facial skin. We tested the two video sequences described earlier with the four
methods presented in Section 3.
i. Adaptive skin-color filter
In our experiments, we only used the initial H range for skin detection, because if it fails, no detection or adaptation in the S and V
components can improve the results. In addition, the S and V ranges are already “generous”, so the limitation on possible pixels is
imposed by the H range. The underlying assumption also seems to be that non-skin objects with skin-like colors are rare.
The H range used was the same as in [18]; see Table 1. Even though Cho et al. obtained the limiting thresholds for Asiatic
persons, it has been shown that the skin chromaticities of Asiatic and Caucasian persons overlap quite well [2]. If we used
a hue range twice as wide (an estimate based on [2]), detection would certainly increase, but the performance of
the subsequent adaptive stage would deteriorate. The hue range is limited because of the original assumption that the skin color area
should be at least equal in size to the non-skin color area.
Fig. 7 shows a few selected frames after thresholding the H range. The results are good when the skin color is close to
skin tone. The results are much worse for the Sony video because the faces less often have a skin-tone appearance. One can
observe that the thresholds of this algorithm have been set mainly for skin tones. When the illumination deviates from
the calibrated conditions, the selected thresholds work much worse. In fact, for some frames even the adaptive stage would fail.
ii. Statistical skin color model
Jones and Rehg [11] have suggested statistical models obtained from a large dataset for skin pixel selection. Their Gaussian
mixture models were implemented for testing as detailed in their paper (see Appendix A). We squared the covariance
values because otherwise the results were extremely poor for our material.
Before applying their model to the images, a threshold value for determining the pixel class (skin / non-skin) must be
calculated for the data. This threshold value indicates the minimum allowed ratio between a pixel’s skin and non-skin class
probabilities for the pixel to be labeled as skin. It is obtained from the ROC diagram, which requires knowledge of the face
positions in a small set of images. The selection of the threshold value proved to be difficult for the videos we used: Fig. 8 shows
ROC curves for 10 frames (Nogatech: the first 10 frames and frames 496-505; Sony: the first 10 frames). While for the middle case
it is easy to set the threshold value to 0.8 (criteria: true positives > 90% and false positives <= 20%), the other two cases are more
problematic. Setting the threshold based on their curves would give a large number of false positives. One reason for the poor ROC
curve of the starting frames might be the unnatural skin color; in frames 496-505 the skin appears more natural to the human eye.
We did experiments with the threshold value 0.8 (Fig. 9) and with 0.4 as given in the original paper (Fig. 10). Fig. 9 shows
that the skin tones are separated quite well from the background in some images, but in others the facial area is not
detected so well; in some frames it is not detected at all. This is especially true for the more challenging Sony video.
The higher threshold value lowers not only the number of wrongly detected non-skin pixels but also the number of correctly
detected skin pixels. From Fig. 10 we can see that in many cases not only the face pixels are selected, but many background
pixels as well. Obviously non-skin-colored pixels, like the green wall, are also detected. These results might also be due to the fact
that the probability distributions were obtained from Internet images, in which illumination changes are usually less drastic and
the skin color often appears skin-tone like. Because the non-skin distribution lay mainly on the black-white axis, it might not be
suitable for all natural backgrounds. Therefore, we may conclude that this algorithm is more suitable for Internet-type images
than for videos taken under changing illumination.
iii. Skin detection based on color correction
Hsu et al. suggested color correction so that skin would appear naturally with skin tones (i.e. belong to their skin color model).
The color correction part of their method [8] was modified to take into account saturated pixels, because in many video frames
the lamps are visible and their pixels are saturated to white (255, 255, 255). One should note that the description of the original
method [8] does not mention any rejection of pixels. After the saturated pixels are excluded, the algorithm
“corrects” the colors in 762 frames out of 980 in total, or 78%. Fig. 11 displays a couple of segmented images after
color correction. We also added a step that excludes a frame from color correction if its average color is skin tone
(“skin tone restriction”). We assumed that the average color is calculated from unsaturated pixels having the top 5% luminance
values. With this step, the color correction phase is applied to only 176 frames, or 18%. Using both exclusion criteria
slightly improved the results in some cases where the skin color appeared near skin tones, but for most frames
there was little difference. The quality of correction is still sometimes erroneous, as can be seen from Fig. 12. This
figure shows one of the frames in which the color correction leads to worse results than not using any correction at all. As
can be seen from Table 2, the filtering selects skin tones quite well with a low false positive rate. Once again, the results were
much poorer for the Sony video. The filtering results of this algorithm depend heavily on the color correction part, which
unfortunately very often fails to produce a good correction.
iv. Skin locus
The skin loci for the Sony and Nogatech cameras were calculated using images taken under different calibration and
prevailing conditions. More details about the procedure for taking the images and creating the locus can be found in Section 2. The
loci are different because different cameras can have different characteristics and parameters.
The skin locus method for the Sony was implemented with only one white balancing (see Fig. 3). The white point
(r=1/3, b=1/3) was excluded because the lamps at the ceilings were visible in many images and they appeared as white patches
due to saturation. Fig. 13 shows four different segmented frames. Because the skin locus contains a wide range of colors, there
are sometimes quite many false positives on the background (see Fig. 13, middle right image). In Fig. 13, some parts have a poor
detection rate, perhaps because they are too dark and affected by the camera's gamma function, or because the pixels are saturated
to white. Unlike the other methods, the skin locus performed very well on the Sony video. This is not surprising, since it is
specifically designed for changing illumination chromaticities.
Fig. 14 summarizes the true positives obtained with the different methods for the Nogatech video. We can see that the skin
locus provided the best overall detection rate. The same is true for the Sony video, where the other methods do not succeed well
because the skin color rarely appears as skin tone.
B. Experiment 2: skin tracking and segmentation
In the second experiment, different approaches were applied for tracking and segmenting facial skin. The goal of this
experiment is to show the advantage of using the skin locus over a fixed color model and an adaptive model using a spatial
constraint. The tracking methods were initialized using a manually selected face area (see Fig. 5 for an example). This kind of
“perfect” initial detection was selected because otherwise the comparison of tracking results would be difficult. The same
initialization for a video sequence was used for all methods.
The mean shift algorithm ([10], [32], Appendix C) for face detection and tracking provides a good state-of-the-art
approach for comparing different skin models [28]. In our experiments it was used with three different skin models: 1) the fixed
probability model obtained from the first frame, 2) the adaptive model based on the spatial constraint of Raja et al. [19], and 3)
the adaptive model based on the skin locus constraint. The size of the bounding box was fixed after initialization
because tracking with the first two models would not otherwise reach the end of the video. This also allows for easy comparison
of different skin color models. In the adaptive cases, the skin color model was updated using the moving average method (Eq.
6). The constraints were used to select the pixels for calculating the model for the current frame. For the skin locus constraint,
usually only a part of the locus was used for updating the skin color model at a time. From the current color model, the used
range inside the skin locus was determined and slightly enlarged (by 10%) to cover possible illumination change. This partial
skin locus was then used to filter the next frame, after which the tracking part was applied. The NCC chromaticities were
quantized to 65 levels to model possible real-time requirements. Because of the quantization, the white point was automatically
excluded from the skin locus.
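The 65-level NCC quantization mentioned above can be sketched like this (our own illustration; the helper name and the [0, 1] bin range are assumptions):

```python
import numpy as np

def ncc_histogram(rgb, bins=65):
    """Build a bins x bins histogram of NCC chromaticities (r, g), where
    r = R/(R+G+B) and g = G/(R+G+B), quantized as in the 65-level model."""
    pix = rgb.reshape(-1, 3).astype(np.float64)
    s = pix.sum(axis=1)
    valid = s > 0                         # skip pure-black pixels (r, g undefined)
    r = pix[valid, 0] / s[valid]
    g = pix[valid, 1] / s[valid]
    hist, _, _ = np.histogram2d(r, g, bins=bins, range=[[0.0, 1.0], [0.0, 1.0]])
    return hist
```

A neutral gray pixel lands near the white point r = g = 1/3; with 65 bins over [0, 1] that is bin (21, 21).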
The main idea of this experiment is to track a face and to segment it out from its background based on the skin color model.
Fig. 15 shows selected frames for tracking with fixed color model (frame-based approach), Fig. 16 with adaptive model based on
information from previous frames (sequence-based approach) and Fig. 17 with adaptive model using skin locus constraint
(mixed approach), respectively. The left side of the figures shows the face localization while the right side displays the
segmentation results using the model chromaticities. The segmented result (right side) shows those pixels which belong to the
skin color model. The segmented results could be used for further processing, e.g. using cues other than color to find the face
areas more accurately and to verify which of the found blobs are indeed faces. Table 3 contains the numerical evaluation of the results.
The localization error E is defined as the complement of the tracking success, as in Eq. (7):
E = 1 − |Ac ∩ Ag| / |Ac ∪ Ag|,   (7)
in which Ac ∩ Ag is the intersection between the calculated localization Ac and the manually selected ground-truth localization Ag,
normalized by the total area of both localizations (their union).
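For axis-aligned boxes, Eq. 7 can be computed as below. This is a sketch of ours; the (x0, y0, x1, y1) box format is our assumption, since the paper defines E directly on pixel areas.

```python
def localization_error(box_c, box_g):
    """Eq. 7 sketch: E = 1 - |Ac intersect Ag| / |Ac union Ag|.

    box_c: calculated localization, box_g: ground truth, both as
    (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
    """
    ax0, ay0, ax1, ay1 = box_c
    bx0, by0, bx1, by1 = box_g
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # intersection width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # intersection height
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return 1.0 - inter / union
```

A perfect localization gives E = 0, while completely disjoint boxes give E = 1.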
Based on the results, the fixed probability model is valid as long as the illumination change is relatively small. The color
model with geometrical adaptation works better, but it can adapt to something which is obviously not skin. It also seems to be
sensitive to non-uniform changes in the illumination field because of the way the pixels are selected for updating the model.
These two models do not take the background into account. With the skin locus based approach, the background is taken into
account by excluding those skin colors which are also common in the background. It also produced the most stable overall
performance. We have used other videos in our earlier work with similar results (see for example [21]).
5. DISCUSSION
Many color constancy schemes have been proposed (see for example [33][34][35]), but so far color constancy can fail in
machine vision applications [4]. Since that study, new algorithms have been developed (like [36][37]), but they have
severe limitations for general use. For example, the method of Finlayson et al. [36] requires either knowledge of the spectral
sensitivities of the camera or a large number of surfaces imaged with the camera, which is often not possible in practice. In
addition, channel saturation may be a problem. The type of the illumination SPD changes in many videos and images, and this
restricts the use of methods that work with a single illumination type [37]. Therefore, new techniques are needed to cope with
drastic and challenging illumination conditions. One new method is the skin locus. We have compared the skin locus and three
other state-of-the-art methods under challenging illumination conditions. All these methods have been suggested for general skin
detection, but with the exception of the skin locus, their limitations are not clearly stated.
In the first experiment, we tested skin detection on single frames. For the method of Cho et al. [18], the basic assumption
seems to be that, for the given HSV range, the face covers the biggest uniformly colored blob in the image. We tested the H
range given in the paper, and it seems to work mainly for closely calibrated cases. The H range is fixed at initialization, and if
it fails, further processing (adaptation) will also fail. Of course, a broader H range could be used (based on, for example,
knowledge of skin color changes under varying illumination [2]), but it would also select a large part of a frame, which in our
cases contains big uniformly colored non-facial areas and smaller face areas. The statistical skin detection method proposed by Jones
and Rehg [11] also selects skin tones well. It was created using image material from the Internet, in which the published face images
have a relatively good skin-tone appearance. Its non-skin class was clearly not so representative of our video frames, because the
non-facial colors were concentrated on the black-white axis [11]. For example, a green wall was classified as skin with a low
threshold value. Squaring the covariance values improved the situation. The threshold value for segmentation and detection with
this method was determined directly from the video frames according to the method suggested in [11]. For colors other than skin
tones, the method did not produce good detection and segmentation results; the determination of the threshold value was also
problematic for skin colors with non-skin tones. In the method suggested by Hsu et al. [8], color correction is combined with
skin detection, and the correction is also the weakest part of the method: it can lead to worse results than doing nothing. It
often did not work well in our videos and caused problems in the detection. In fact, it has been shown by Funt et al. [4] that
current color correction methods do not necessarily provide results suitable for machine vision. The detection stage found the
skin areas very well only when the illumination was near the one used in the camera calibration.
In sequence-based facial skin tracking, it is quite obvious that the fixed model is limited to stable or nearly stable illumination
conditions. The spatial constraints are susceptible to the localization quality and to how representative the selected pixels are of
the object.
The skin locus provides the most reliable selection of skin-colored pixels, but of course it can also have problems with a
skin-colored background. The results show that the skin locus provided the best and most stable detection rate for skin pixels in
both experiments, because it has been specifically designed for changing illumination conditions. Of course, there is some trade-off
between the rates of false and true positives, because skin-colored pixels may belong to non-skin objects. In many cases, the main
goal of using the color cue for skin detection is to find possible candidate areas and limit the search space, and after this to use
other cues to determine whether an area is, e.g., a face. In any case, the false positive rate was reasonable for the skin locus. It is
also clear that information from consecutive frames should be utilized whenever possible to obtain the best results: Fig. 18 shows
skin chromaticities from two adjacent frames. The chromaticities overlap significantly because the illumination over the face has
not changed much between these two frames. Therefore, only a part of the skin locus, slightly extended, is needed to “guess”
quite reliably the possible skin chromaticities in the next frame.
The averages and standard deviations of the results are collected in Tables 2 (experiment 1) and 3 (experiment 2) for true
positives, false positives, and localization error. Note that the frequency of illumination changes is not uniform. Our tests clearly
indicate that the other three single-frame methods are tuned for skin-tone appearances, and that tracking based on a fixed model
or spatial-constraint adaptation has problems under normal illumination changes.
6. CONCLUSION
Problems of detecting and tracking skin color in drastically varying illumination conditions were considered and different
suggested solutions were compared.
Most of the proposed methods seem to be designed only for skin tone filtering. According to our experiments, as long as they
are used for images with good calibration, they can achieve quite good filtering for faces and separation from the background.
We propose the use of skin locus for face tracking and segmentation under real world conditions because it provided clearly
better results than the other methods used in our comparisons.
If color is used as a preliminary cue for detecting and tracking faces or hands, skin locus based approaches are the most useful
when the skin is subjected to illumination changes. Their drawback as a frame-based method is decreased discrimination capability
in some frames. However, the other methods discussed have been shown to be successful on their own image test sets, which
contained only minor illumination variations.
ACKNOWLEDGEMENTS
The funding obtained from the Academy of Finland is gratefully acknowledged. We also thank Mr. P. Sangi for his help with
the implementation of the mean shift algorithm, and Mr. A. Hadid, Mr. S. Huovinen, M. Laaksonen, and Mr. E. Galloix for their
participation in making the test material.
APPENDIX A: GAUSSIAN MIXTURE
The class probability for a pixel can be calculated as the sum of Gaussian kernels:
P(x) = Σ_{i=1}^{N} w_i (2π)^{−3/2} |Σ_i|^{−1/2} exp(−(1/2)(x − µ_i)^T Σ_i^{−1} (x − µ_i)),   (8)
where
i = index of a Gaussian, N = total number of Gaussians, w_i = weight of the ith Gaussian, µ_i = mean of the ith Gaussian,
Σ_i = diagonal covariance matrix of the ith Gaussian, and x = RGB vector of the pixel.
This model is also called a Gaussian mixture. For the Gaussian mixtures of skin and non-skin classes, Jones and Rehg [11]
provide the following values: mean values (Table 4), diagonal covariance matrix values (Table 5) and weights (Table 6).
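Evaluating Eq. 8 with diagonal covariances can be sketched as below. This is our own NumPy illustration, not Jones and Rehg's code; the actual parameter values come from Tables 4-6 and are not reproduced here.

```python
import numpy as np

def mixture_prob(x, weights, means, variances):
    """Eq. 8 sketch: probability of an RGB pixel under a Gaussian mixture
    with diagonal covariances.

    weights: (N,) mixture weights, means: (N, 3) component means,
    variances: (N, 3) diagonal entries of each Sigma_i.
    """
    x = np.asarray(x, dtype=np.float64)
    diff = x - means                                  # (N, 3) residuals
    mahal = np.sum(diff * diff / variances, axis=1)   # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2.0 * np.pi) ** 1.5 * np.sqrt(np.prod(variances, axis=1))
    return float(np.sum(weights * np.exp(-0.5 * mahal) / norm))
```

A pixel is then labeled skin when the ratio of its skin-class to non-skin-class probability exceeds the chosen threshold.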
APPENDIX B: METHOD BASED ON COLOR CORRECTION
Hsu et al. [8] use a nonlinear transformation for Cb and Cr to reduce their luminance dependency as shown in (9-12):
C̄_b(Y) = 108 + (K_l − Y)·(118 − 108)/(K_l − Y_min)   if Y < K_l,   (9a)
C̄_b(Y) = 108 + (Y − K_h)·(118 − 108)/(Y_max − K_h)   if Y > K_h,   (9b)
C̄_r(Y) = 154 − (K_l − Y)·(154 − 144)/(K_l − Y_min)   if Y < K_l,   (10a)
C̄_r(Y) = 154 − (Y − K_h)·(154 − 132)/(Y_max − K_h)   if Y > K_h,   (10b)
Wc_i(Y) = WLc_i + (Y − Y_min)·(Wc_i − WLc_i)/(K_l − Y_min)   if Y < K_l,   (11a)
Wc_i(Y) = WHc_i + (Y_max − Y)·(Wc_i − WHc_i)/(Y_max − K_h)   if Y > K_h,   (11b)
C′_i(Y) = (C_i(Y) − C̄_i(Y))·Wc_i/Wc_i(Y) + C̄_i(K_h)   if Y < K_l or Y > K_h,   (12a)
C′_i(Y) = C_i(Y)   if Y ∈ [K_l, K_h],   (12b)
where
Y = luminance, C_b, C_r = chrominance coordinates, C̄_b, C̄_r = centers of the skin color cluster,
Wc_b = 46.97, WLc_b = 23, WHc_b = 14, Wc_r = 38.76, WLc_r = 20, WHc_r = 10, K_l = 125,
K_h = 188, Y_min = 16, and Y_max = 235.
After the transformation, the skin pixels are selected on the basis of an elliptical model described in (13-14):
x = cos(θ)·(C′_b − c_x) + sin(θ)·(C′_r − c_y),   (13a)
y = −sin(θ)·(C′_b − c_x) + cos(θ)·(C′_r − c_y),   (13b)
(x − ec_x)²/a² + (y − ec_y)²/b² ≤ 1,   (14)
where
c_x = 109.38, c_y = 152.02, θ = 2.53 (in radians), ec_x = 1.60, ec_y = 2.41, a = 25.39, and b = 14.03.
One should note that Eq. 10(b) differs from the one given in [8] (a − sign after the number 154 instead of a + sign; also in their
errata web page http://www.cse.msu.edu/~hsureinl/facloc/index_facloc.err.html), but it is consistent with the filter image in the
same paper and produces much better results.
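The elliptical test of Eqs. 13-14 can be written directly from the listed parameters; a sketch of ours in Python (the function name is an assumption):

```python
import math

# Cluster parameters from Eqs. 13-14 (Hsu et al. [8])
CX, CY, THETA = 109.38, 152.02, 2.53
ECX, ECY, A, B = 1.60, 2.41, 25.39, 14.03

def is_skin(cb, cr):
    """Return True if the transformed chrominance pair (Cb', Cr') falls
    inside the rotated elliptical skin cluster of Eq. 14."""
    x = math.cos(THETA) * (cb - CX) + math.sin(THETA) * (cr - CY)
    y = -math.sin(THETA) * (cb - CX) + math.cos(THETA) * (cr - CY)
    return (x - ECX) ** 2 / A ** 2 + (y - ECY) ** 2 / B ** 2 <= 1.0
```

The cluster center (109.38, 152.02) maps close to the ellipse center and is accepted, while chrominance pairs far from the cluster are rejected.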
APPENDIX C: MEAN SHIFT ALGORITHM
The mean shift algorithm presented here was used by Comaniciu et al. [32] and Comaniciu and Ramesh [10] for face tracking.
They used a multivariate density estimate to weight colors based on their appearance inside the face localization. Its formula is
f̂(x) = 1/(n·h^d) · Σ_{i=1}^{n} K((x − x_i)/h),   (15)
where
x = a point, n = number of points, K = selected kernel, d = dimension of the space, and h = window radius or bandwidth.
The kernel used is the Epanechnikov kernel, and its formula is
K_E(x) = (1/2)·C_d^{−1}·(d + 2)·(1 − ||x||²)   if ||x|| < 1, and 0 otherwise,   (16)
where
C_d = the volume of the unit d-dimensional sphere.
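A direct transcription of Eq. 16 (our own sketch; the kernel takes the point as a plain sequence of coordinates):

```python
import math

def epanechnikov(x, d=2):
    """Eq. 16 sketch: Epanechnikov kernel at a d-dimensional point x.

    c_d is the volume of the unit d-dimensional sphere (pi for d = 2),
    computed here via the standard gamma-function formula.
    """
    norm_sq = sum(v * v for v in x)                         # ||x||^2
    c_d = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)  # unit d-ball volume
    if norm_sq < 1.0:                                       # ||x|| < 1
        return 0.5 * (d + 2) * (1.0 - norm_sq) / c_d
    return 0.0
```

For d = 2 the kernel peaks at 2/π at the origin and vanishes outside the unit disk.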
The color distribution of the object is converted to a 1D distribution. In the original version, the target model is fixed and
compared against a new, calculated distribution. The distance between them was evaluated using a Bhattacharyya coefficient:
δ(y) = √(1 − Σ_{u=1}^{m} √(p_u(y)·q_u)),   (17)
where
q = target model, p = calculated distribution, and y = location.
The tracking starts with an initialization from which the target histogram is set. In the next frame, the surroundings of the former
localization are searched to find a position at which the distance δ is smaller than a certain threshold.
The skin locus method is used to select the pixels for updating the target color model, thus making it an adaptive mean
shift. The skin locus to be implemented depends on the selected camera and on whether one or several calibrations are used (see Eqs.
3-6). For selecting the threshold values, see ref. [28].
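The Bhattacharyya-based distance of Eq. 17 can be sketched as follows (our own illustration):

```python
import math

def bhattacharyya_distance(p, q):
    """Eq. 17 sketch: distance between the target model q and a candidate
    distribution p (1-D histograms, each summing to 1)."""
    rho = sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - rho))                  # clamp tiny negative rounding
```

Identical histograms give distance 0 and disjoint ones give 1; the tracker searches the neighborhood of the previous localization for a position minimizing this distance.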
7. REFERENCES
[1] E. Marszalec, B. Martinkauppi, M. Soriano, and M. Pietikäinen, “A physics-based face database for color research,” Journal of Electronic Imaging 9(1), 32-38 (2000).
[2] B. Martinkauppi, “Face colour under varying illumination – analysis and applications,” Ph.D. thesis, University of Oulu (2002). Available at: http://herkules.oulu.fi/isbn9514267885/.
[3] D. A. Brainard and B. A. Wandell, “Analysis of the retinex theory of color vision,” Journal of the Optical Society of America A 3(10), 1651-1661 (1986).
[4] B. Funt, K. Barnard, and L. Martin, “Is machine colour constancy good enough?,” in Proc. 5th European Conference on Computer Vision, 445-459 (1998).
[5] M. C. Shin, K. I. Chang, and L. V. Tsap, “Does colorspace transformation make any difference on skin detection?,” in Proc. 6th IEEE Workshop on Applications of Computer Vision, 275-279 (2002).
[6] D. Saxe and R. Foulds, “Toward robust skin identification in video images,” in Proc. 2nd International Conference on Automatic Face and Gesture Recognition, 379-384 (1996).
[7] Y. Dai and Y. Nakano, “Face-texture model based on SGLD and its application in face detection in a color scene,” Pattern Recognition 29(6), 1007-1017 (1996).
[8] R. L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, “Face detection in color images,” IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 696-706 (2002).
[9] B. Schiele and A. Waibel, “Gaze tracking based on face-color,” in Proc. International Workshop on Automatic Face- and Gesture-Recognition, 344-348 (1995).
[10] D. Comaniciu and V. Ramesh, “Robust detection and tracking of human faces with an active camera,” in Proc. 3rd IEEE International Workshop on Visual Surveillance, 11-18 (2000).
[11] M. J. Jones and J. M. Rehg, “Statistical color models with application to skin detection,” International Journal of Computer Vision 46(1), 81-96 (2002).
[12] T. Piirainen, O. Silvén, and V. Tuulos, “Layered self-organizing maps based video content classification,” in Proc. Workshop on Real-time Image Sequence Analysis, 89-98 (2000).
[13] L. M. Son, D. Chai, and A. Bouzerdoum, “A universal and robust human skin color model using neural networks,” in Proc. IJCNN '01 International Joint Conference on Neural Networks 4, 2844-2849 (2001).
[14] J. C. Terrillon, M. N. Shirazi, H. Fukamachi, and S. Akamatsu, “Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images,” in Proc. 4th IEEE International Conference on Automatic Face and Gesture Recognition, 54-61 (2000).
[15] M. H. Yang and N. Ahuja, “Detecting human faces in color images,” in Proc. 1998 International Conference on Image Processing 1, 127-130 (1998).
[16] H. Sahbi and N. Boujemaa, “From coarse to fine skin and face detection,” in Proc. 8th ACM International Conference, 432-434 (2000).
[17] L. M. Bergasa, M. Mazo, A. Gardel, M. A. Sotelo, and L. Boquete, “Unsupervised and adaptive Gaussian skin-color model,” Image and Vision Computing 18(12), 987-1003 (2000).
[18] K.-M. Cho, J.-H. Jang, and K.-S. Hong, “Adaptive skin-color filter,” Pattern Recognition 34(5), 1067-1073 (2001).
[19] Y. Raja, S. J. McKenna, and S. Gong, “Tracking and segmenting people in varying lighting conditions using colour,” in Proc. 3rd IEEE International Conference on Automatic Face and Gesture Recognition, 228-233 (1998).
[20] T. W. Yoo and I. S. Oh, “A fast algorithm for tracking human faces based on chromaticity histograms,” Pattern Recognition Letters 20(10), 967-978 (1999).
[21] M. Soriano, J. B. Martinkauppi, S. Huovinen, and M. Laaksonen, “Adaptive skin color modeling using the skin locus for selecting training pixels,” Pattern Recognition 36(3), 681-690 (2003).
[22] M. Störring, H. J. Andersen, and E. Granum, “Physics-based modelling of human skin colour under mixed illuminants,” Journal of Robotics and Autonomous Systems 35(3-4), 131-142 (2001).
[23] B. Martinkauppi and M. Soriano, “Basis functions of color signal of skin under different illuminants,” in Proc. 3rd International Conference on Multispectral Color Science (MSC'01), 21-24 (2001).
[24] J. Brand and J. S. Mason, “A comparative assessment of three approaches to pixel-level human skin-detection,” in Proc. International Conference on Pattern Recognition 1, 1056-1059 (2000).
[25] J. Y. Lee and S. I. Yoo, “An elliptical boundary model for skin color detection,” in Proc. 2002 International Conference on Imaging Science, Systems, and Technology (CISST'02), 579-584 (2002).
[26] B. Martinkauppi and M. Pietikäinen, “Facial skin color modeling,” invited chapter in: S. Z. Li and A. K. Jain (eds.), Handbook of Face Recognition, Springer, 109-131 (2005).
[27] J. B. Martinkauppi, M. Soriano, and M. Pietikäinen, “Detection of skin color under changing illumination: a comparative study,” in Proc. 12th International Conference on Image Analysis and Processing (ICIAP 2003), 652-657 (2003).
[28] B. Martinkauppi, P. Sangi, M. Soriano, M. Pietikäinen, S. Huovinen, and M. Laaksonen, “Illumination-invariant face tracking with mean shift and skin locus,” in Proc. IEEE International Workshop on Cues in Communication (Cues 2001), 44-49 (2001).
[29] F. H. Imai, N. Tsumura, H. Haneishi, and Y. Miyake, “Principal component analysis of skin color and its application to colorimetric reproduction on CRT display and hardcopy,” The Journal of Imaging Science and Technology 40(5), 422-430 (1996).
[30] H. Nakai, Y. Manabe, and S. Inokuchi, “Simulation and analysis of spectral distribution of human skin,” in Proc. 14th International Conference on Pattern Recognition 2, 1065-1067 (1998).
[31] J. Yang and A. Waibel, “A real-time face tracker,” in Proc. 3rd IEEE Workshop on Applications of Computer Vision, 142-147 (1996).
[32] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2, 142-149 (2000).
[33] G. Finlayson, B. Funt, and K. Barnard, “Color constancy under varying illumination,” in Proc. 5th International Conference on Computer Vision, 720-725 (1995).
[34] E. H. Land and J. J. McCann, “Lightness and retinex theory,” Journal of the Optical Society of America 61(1), 1-11 (1971).
[35] L. T. Maloney and B. A. Wandell, “Color constancy: a method for recovering surface spectral reflectance,” Journal of the Optical Society of America A 3(1), 29-33 (1986).
[36] G. D. Finlayson, S. D. Hordley, and P. M. Hubel, “Color by correlation: a simple, unifying framework for color constancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11), 1209-1221 (2001).
[37] J. A. Marchant and C. M. Onyango, “Shadow-invariant classification for scenes illuminated by daylight,” Journal of the Optical Society of America A 17(11), 1952-1961 (2000).
Figure and table captions
Table 1. Threshold values given by Cho et al. [18].
Table 2. Results for experiment 1.
Table 3. Results for experiment 2.
Table 4. Mean values for the Gaussian mixture.
Table 5. Diagonal covariance matrix values.
Table 6. Weights for different classes.
Fig. 1. A series of 16 images taken under different white balancing and prevailing lighting conditions.
Fig. 2. Skin chromaticities are plotted from (a) two 16-image series with all four white calibrations and (b) two 4-image series with one white balancing calibration (TL84).
Fig. 3. The skin locus of the Sony 1CCD camera (black lines) and Nogatech 1CCD web camera (gray dots) differ. Both skin
loci are for one fixed calibration.
Fig. 4. A person located in a corridor under a nonuniform illumination field; the left side of the face is lit by daylight and the right side by ceiling fluorescent lamps. Notice that the background colors are quite close to those under white-balanced conditions, whereas the facial colors are clearly distorted.
Fig. 5. Visualization of Raja et al.’s selection for refreshing the object histogram: the outer bounding box represents the localization of the face and the inner one the area to be used for color model update. The inner box is determined from the outer box.
Fig. 6. (a) and (b) show images with manual face selection, and the corresponding facial chromaticities and their distribution for
Nogatech frames 150 and 500, respectively. Three selected frames for the Sony video are displayed in (c). These frames show
changes in head size, number of persons (one or two) and more drastic head movements than in the Nogatech video.
Fig. 7. The first two rows are from the Nogatech video. The results of skin pixel detection using the hue component vary from good (upper left) to failure (lower right). The upper left shows good skin pixel selection, whereas in the upper right the wall pixels are also activated. The lower ones display failures. The third row shows segmented images for the Sony camera. These two images are among the best ones, but they are still not good enough for further processing.
Fig. 8. ROC curves for a set of 10 Nogatech frames: frames 1-10 (left) and frames 496-505 (middle). The threshold values (marked with *) vary from 1 to 0.05 in steps of 0.05. The right picture shows a ROC curve for the Sony camera. As can be seen from this “curve”, it is very difficult to determine a threshold value for the Sony video. This may be due to the small number of skin tone pixels.
Fig. 9. Segmentation using statistical models with threshold value 0.8. The upper left image shows good detection of skin pixels and separation from the background. The upper right and middle left images demonstrate poorer detection and segmentation, whereas the middle right one shows total failure. The two lowest images represent typical segmentation results for the Sony video, which were poor.
Fig. 10. Statistical model based segmentation with threshold 0.4. The upper left image displays good selection of skin pixels but also an unexpected selection of the green wall. The rest of the segmented images show poor separation between skin and the background.
Fig. 11. Good skin pixel selection for Hsu et al.’s method is obtained when skin appears in natural tones. The first two rows
show some segmented images of the Nogatech camera while the third row displays the best segmented images of the Sony
camera.
Fig. 12. Color correction can lead to wrong results. The skin areas are detected using Hsu et al.’s filter from the original, unchanged images (upper left), and the results are shown in the upper right. The color correction used by Hsu et al. may lead to poorer results: the lower left image displays the color corrected image and the lower middle the detection results. The lower right one shows the areas used for color correction (yellow curtains).
Fig. 13. The upper four selected frames of the Nogatech video were segmented using a skin locus. The two topmost images show good skin pixel detection and separation; by applying morphological operators to these images, the segmentation can be improved further. In the middle left image, the skin pixels are detected more poorly. The middle right image has poor separation between the background and skin, as some skin and background pixels have similar chromaticities. The lowest two segmented images are from the Sony video. They represent average results.
Fig. 14. The four different methods are compared by their true positives for the Nogatech video. All algorithms seem to fare well when the skin appears in nearly natural skin tones and there are no drastic intensity changes. The performance of the skin locus method is the most stable over the entire video.
Fig. 15. Segmentation results for the static color model (frames 100, 200, 300, 500 and 900). The left frame of an image pair shows the tracking results and the right one displays the image after segmentation. The localization of the face is indicated with an ellipse on the left frame.
Fig. 16. A few frames of geometrically adapting skin tracking: frames 100, 200, 300, 500 and 900. (Note: in some frames the results were better.) The left frame of an image pair shows the tracking results and the right one displays the image after segmentation. The localization of the face is indicated with a white ellipse on the left image. The inner magenta ellipse indicates the area used for model update. In the right image, the magenta box shows the exact area used for model adaptation.
Fig. 17. Skin locus based tracking results for frames 100, 200, 300, 500 and 900. The left frame of an image pair shows the tracking results and the right one displays the image after segmentation. The localization of the face is indicated with a white ellipse on the left image.
Fig. 18. The overlap of skin chromaticities between two consecutive frames is high if the illumination change between these two
frames is not too drastic. The red area in the left and middle images indicates the skin pixel selection. In the right image, the skin
chromaticities from two frames are marked with different colors.
Table 1. Threshold values given by Cho et al. [18].

Limit     H      S      V
Upper     0.7    0.75   0.95
Lower     0.4    0.15   0.35
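The limits in Table 1 amount to a simple box classifier in HSV space. A minimal sketch of how such limits are applied per pixel (the function name is ours, and the assumption that H, S, and V are normalized to [0, 1] with all three components required to fall inside their ranges is ours as well):

```python
def is_skin_hsv(h, s, v):
    """Box classifier using the Table 1 limits of Cho et al.

    Assumes h, s, v are normalized to [0, 1]; a pixel is labeled
    skin only if every component lies within its [lower, upper] range.
    """
    return (0.4 <= h <= 0.7) and (0.15 <= s <= 0.75) and (0.35 <= v <= 0.95)

# A pixel well inside the box is accepted; one with too low a V is rejected.
print(is_skin_hsv(0.55, 0.40, 0.60))  # True
print(is_skin_hsv(0.55, 0.40, 0.20))  # False
```

Because each component is thresholded independently, the decision region is an axis-aligned box; such a fixed box cannot follow the chromaticity shifts discussed in the paper unless the thresholds are adapted.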
Table 2. Results for experiment 1.

Method         Video                               True positives [%]      False positives [%]
                                                   Mean    Standard dev.   Mean    Standard dev.
Hsu et al.     Nogatech                            54.3    34.4            9.0     3.9
               Nogatech (skin tone restriction)    57.6    35.0            9.3     4.0
               Sony                                8.5     8.6             0.4     0.5
Jones & Rehg   Nogatech (threshold 0.8)            77.2    16.1            54.8    14.2
               Nogatech (threshold 0.4)            89.7    11.0            74.4    11.4
               Sony (threshold 0.8)                21.5    14.2            18.6    6.1
Skin locus     Nogatech                            93.4    7.5             38.8    12.4
               Sony                                83.1    6.3             4.8     3.3
Cho et al.     Nogatech                            73.5    31.3            42.7    15.5
               Sony                                22.5    15.0            1.6     1.4
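The true and false positive percentages reported in Tables 2 and 3 can be computed per frame from a detection mask and a manually labeled ground-truth skin mask. A hypothetical sketch (the flat boolean-mask representation and the denominators, ground-truth skin pixels for true positives and non-skin pixels for false positives, are our assumptions):

```python
def tp_fp_rates(detected, ground_truth):
    """Per-frame detection rates as percentages.

    detected, ground_truth: equal-length sequences of booleans,
    one entry per pixel (True = skin).
    TP% = correctly detected skin pixels / all ground-truth skin pixels,
    FP% = wrongly detected pixels / all ground-truth non-skin pixels.
    """
    skin = sum(1 for g in ground_truth if g)
    non_skin = len(ground_truth) - skin
    tp = sum(1 for d, g in zip(detected, ground_truth) if d and g)
    fp = sum(1 for d, g in zip(detected, ground_truth) if d and not g)
    return 100.0 * tp / skin, 100.0 * fp / non_skin

# Toy 8-pixel frame: 4 skin pixels with 3 detected, 4 background pixels
# with 1 falsely detected.
det = [True, True, True, False, True, False, False, False]
gt = [True, True, True, True, False, False, False, False]
print(tp_fp_rates(det, gt))  # (75.0, 25.0)
```

The per-video means and standard deviations in the tables would then be taken over these per-frame percentages.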
Table 3. Results for experiment 2.

Method          Error in localization      True positives          False positives
                Mean    Standard dev.      Mean    Standard dev.   Mean    Standard dev.
Skin locus      44.8    10.3               75.4    13.2            7.3     1.7
Static          72.4    18.5               66.6    26.3            74.9    12.6
Geom. adapt.    51.7    17.3               69.2    26.2            23.7    18.3
Table 4. Mean values for the Gaussian mixture.

       Skin color model             Non-skin color model
       R        G        B          R        G        B
1      73.53    29.94    17.76      254.37   254.41   253.82
2      249.71   233.94   217.49     9.39     8.09     8.52
3      161.68   116.25   96.95      96.57    96.95    91.53
4      186.07   136.62   114.40     160.44   162.49   159.06
5      189.26   98.37    51.18      74.98    62.23    46.33
(Rows 6-16 were garbled in extraction; their values, with column assignments lost, are: 247.00, 37.76, 206.85, 212.78, 212.78, 151.19, 152.20, 72.66, 171.09, 152.82, 152.82, 97.74, 90.84, 150.10, 156.34, 120.04, 120.04, 74.59, 121.83, 202.18, 193.06, 51.88, 30.88, 44.97, 60.88, 154.88, 201.93, 57.14, 26.84, 85.96, 18.31, 91.04, 206.55, 61.55, 25.32, 131.95, 120.52, 192.20, 214.29, 99.57, 238.88, 77.55, 119.62, 136.08, 54.33, 203.08, 59.82, 83.32, 87.24, 38.06, 176.91, 236.02, 207.86, 99.83, 135.06, 135.96, 236.27, 191.20, 148.11, 131.92, 103.89, 230.70, 164.12, 188.17, 123.10, 66.88.)
Table 5. Diagonal covariance matrix values.

       Skin color model             Non-skin color model
       R        G        B          R         G         B
1      765.40   121.44   112.80     2.77      2.81      5.46
2      39.94    154.44   396.05     46.84     33.59     32.48
3      291.03   60.48    162.85     280.69    156.79    436.58
4      274.95   64.60    198.27     355.98    115.89    591.24
5      633.18   222.40   250.69     414.84    245.95    361.27
6      65.23    691.53   609.92     2502.24   1383.53   237.18
7      408.63   200.77   257.57     957.42    1766.94   1582.52
8      530.08   155.08   572.79     562.88    190.23    447.28
9      160.57   84.52    243.90     344.11    191.77    433.40
10     163.80   121.57   279.22     222.07    118.65    182.41
11     425.40   73.56    175.11     651.32    840.52    963.67
12     330.45   70.34    151.82     225.03    117.29    331.95
13     152.76   92.14    259.15     494.04    237.69    533.52
14     204.90   140.17   270.19     955.88    654.95    916.70
15     448.13   90.18    151.29     350.35    130.30    388.43
16     178.38   156.27   404.99     806.44    642.20    350.36

Table 6. Weights for different classes.

i      Skin color model   Non-skin color model
       Weight             Weight
1      0.0294             0.0637
2      0.0331             0.0516
3      0.0654             0.0864
4      0.0756             0.0636
5      0.0554             0.0747
6      0.0314             0.0365
7      0.0454             0.0349
8      0.0469             0.0649
9      0.0956             0.0656
10     0.0763             0.1189
11     0.1100             0.0362
12     0.0676             0.0849
13     0.0755             0.0368
14     0.0500             0.0389
15     0.0667             0.0943
16     0.0749             0.0477
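Tables 4-6 parameterize the Jones & Rehg mixture-of-Gaussians model: each class is a weighted sum of 3-D Gaussians with diagonal covariance. The sketch below shows how such parameters yield a per-pixel likelihood; restricting it to the first two skin components of the tables is purely for brevity, whereas the full model uses 16 skin and 16 non-skin components and classifies by comparing the two class likelihoods.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a 3-D Gaussian with diagonal covariance at RGB point x."""
    p = 1.0
    for xi, mu, v in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - mu) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return p

def mixture_density(x, components):
    """Weighted sum of diagonal-covariance Gaussians."""
    return sum(w * gaussian_diag(x, mean, var) for w, mean, var in components)

# First two skin components from Tables 4-6: (weight, mean RGB, variance RGB).
skin = [
    (0.0294, (73.53, 29.94, 17.76), (765.40, 121.44, 112.80)),
    (0.0331, (249.71, 233.94, 217.49), (39.94, 154.44, 396.05)),
]

# A dark brownish pixel scores far higher than a mid-gray pixel,
# since it lies close to the first component's mean.
print(mixture_density((75, 30, 18), skin) > mixture_density((128, 128, 128), skin))  # True
```

In the full method, a pixel is labeled skin when the ratio of the skin-class likelihood to the non-skin-class likelihood exceeds a threshold (e.g., the 0.8 and 0.4 operating points of Table 2).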
Fig. 1. A series of 16 images taken under different white balancing and prevailing lighting conditions.
Fig. 2. Skin chromaticities are plotted from (a) two 16-image series with all four white calibrations and (b) two 4-image series with one white balancing calibration (TL84).
(a)
(b)
Fig. 3. The skin locus of the Sony 1CCD camera (black lines) and Nogatech 1CCD web camera (gray dots) differ. Both skin
loci are for one fixed calibration.
[Plot: skin loci of the Nogatech web camera (gray dots) and the Sony 1CCD camera (black lines) in NCC chromaticity space, r on the horizontal axis and g on the vertical axis, both from 0 to 1.]
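The skin loci of Figs. 2, 3 and 18 are plotted in normalized color coordinates (NCC), where each RGB triplet is divided by its component sum so that intensity cancels out and only chromaticity remains. A minimal sketch of the mapping (the zero-division convention for black pixels is our choice):

```python
def ncc_chromaticity(r, g, b):
    """Map an RGB triplet to normalized color coordinates (NCC).

    r_ncc = R / (R + G + B), g_ncc = G / (R + G + B); the third
    coordinate is redundant since the three sum to one. Black has
    undefined chromaticity and is mapped to (0.0, 0.0) here.
    """
    total = r + g + b
    if total == 0:
        return 0.0, 0.0
    return r / total, g / total

# A typical skin-tone pixel lands in the reddish part of the r-g plane.
print(ncc_chromaticity(180, 120, 100))  # (0.45, 0.3)
```

Scaling the same pixel by an intensity factor leaves these coordinates unchanged, which is why the locus is robust to intensity variation but still moves under illumination chromaticity changes.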
Fig. 4. A person located in a corridor under a nonuniform illumination field; the left side of the face is lit by daylight and the right side by ceiling fluorescent lamps. Notice that the background colors are quite close to those under white-balanced conditions, whereas the facial colors are clearly distorted.
Fig. 5. Visualization of Raja et al.’s selection for refreshing the object histogram: the outer bounding box represents the localization of the face and the inner one the area to be used for color model update. The inner box is determined from the outer box.
Fig. 6. (a) and (b) show images with manual face selection, and the corresponding facial chromaticities and their distribution for Nogatech frames 150 and 500, respectively. Three selected frames for the Sony video are displayed in (c). These frames show changes in head size, number of persons (one or two) and more drastic head movements than in the Nogatech video.
[Plots: facial chromaticities (r-b plane) and their histograms for (a) frame 150 and (b) frame 500; (c) three selected frames from the Sony video.]
Fig. 7. The first two rows are from the Nogatech video. The results of skin pixel detection using the hue component vary from good (upper left) to failure (lower right). The upper left shows good skin pixel selection, whereas in the upper right the wall pixels are also activated. The lower ones display failures. The third row shows segmented images for the Sony camera. These two images are among the best ones, but they are still not good enough for further processing.
Fig. 8. ROC curves for a set of 10 Nogatech frames: frames 1-10 (left) and frames 496-505 (middle). The threshold values (marked with *) vary from 1 to 0.05 in steps of 0.05. The right picture shows a ROC curve for the Sony camera. As can be seen from this “curve”, it is very difficult to determine a threshold value for the Sony video. This may be due to the small number of skin tone pixels.
[Plot: ROC curve for the Sony camera, true positives [%] versus false positives [%].]
Fig. 9. Segmentation using statistical models with threshold value 0.8. The upper left image shows good detection of skin pixels and separation from the background. The upper right and middle left images demonstrate poorer detection and segmentation, whereas the middle right one shows total failure. The two lowest images represent typical segmentation results for the Sony video, which were poor.
Fig. 10. Statistical model based segmentation with threshold 0.4. The upper left image displays good selection of skin pixels but also an unexpected selection of the green wall. The rest of the segmented images show poor separation between skin and the background.
Fig. 11. Good skin pixel selection for Hsu et al.’s method is obtained when skin appears in natural tones. The first two rows
show some segmented images of the Nogatech camera while the third row displays the best segmented images of the Sony
camera.
Fig. 12. Color correction can lead to wrong results. The skin areas are detected using Hsu et al.’s filter from the original, unchanged images (upper left), and the results are shown in the upper right. The color correction used by Hsu et al. may lead to poorer results: the lower left image displays the color corrected image and the lower middle the detection results. The lower right one shows the areas used for color correction (yellow curtains).
Fig. 13. The upper four selected frames of the Nogatech video were segmented using a skin locus. The two topmost images show good skin pixel detection and separation; by applying morphological operators to these images, the segmentation can be improved further. In the middle left image, the skin pixels are detected more poorly. The middle right image has poor separation between the background and skin, as some skin and background pixels have similar chromaticities. The lowest two segmented images are from the Sony video. They represent average results.
Fig. 14. The four different methods are compared by their true positives for the Nogatech video. All algorithms seem to fare well when the skin appears in nearly natural skin tones and there are no drastic intensity changes. The performance of the skin locus method is the most stable over the entire video.
[Plot: true positives [%] versus frame number (0-900) for the skin locus, Cho et al., Hsu et al., and Jones & Rehg (0.8) methods.]
Fig. 15. Segmentation results for the static color model (frames 100, 200, 300, 500 and 900). The left frame of an image pair shows the tracking results and the right one displays the image after segmentation. The localization of the face is indicated with an ellipse on the left frame.
Fig. 16. A few frames of geometrically adapting skin tracking: frames 100, 200, 300, 500 and 900. (Note: in some frames the results were better.) The left frame of an image pair shows the tracking results and the right one displays the image after segmentation. The localization of the face is indicated with a white ellipse on the left image. The inner magenta ellipse indicates the area used for model update. In the right image, the magenta box shows the exact area used for model adaptation.
Fig. 17. Skin locus based tracking results for frames 100, 200, 300, 500 and 900. The left frame of an image pair shows the tracking results and the right one displays the image after segmentation. The localization of the face is indicated with a white ellipse on the left image.
Fig. 18. The overlap of skin chromaticities between two consecutive frames is high if the illumination change between these two frames is not too drastic. The red area in the left and middle images indicates the skin pixel selection. In the right image, the skin chromaticities from two frames are marked with different colors.
[Images: skin pixel selections for frames 164 and 165, and the chromaticities of the two adjacent frames plotted in NCC r-g coordinates (r ≈ 0.25-0.35, g ≈ 0.28-0.36).]