Privileged Information-based Conditional Regression Forest
for Facial Feature Detection
Heng Yang, Student Member, IEEE, Ioannis Patras, Senior Member, IEEE
Abstract— In this paper we propose a method that utilises
privileged information, that is, information that is available only
at the training phase, in order to train Regression Forests for
facial feature detection. Our method chooses the split functions
at some randomly chosen internal tree nodes according to the
information gain calculated from the privileged information,
such as head pose or gender. In this way the training patches
arrive at leaves that tend to have low variance both in
displacements to facial points and in privileged information. At
each leaf node, we learn both the probability of the privileged
information and regression models conditioned on it. During
testing, the marginal probability of privileged information is
estimated and the facial feature locations are localised using
the appropriate conditional regression models. The proposed
model is validated by comparing with very recent methods on
two challenging datasets, namely Labelled Faces in the Wild
and Labelled Face Parts in the Wild.
I. INTRODUCTION
A random forest is an ensemble of randomized decision
trees, a classic method of inductive inference. It is easy to
implement and performs very well both in terms of prediction
accuracy and in terms of computational efficiency. In the last
few years it has become increasingly popular and has been successfully applied to various high-level computer vision tasks
such as action recognition [1] and image classification [2].
In particular, there are some promising real-time applications,
e.g. human pose estimation [3] and facial feature detection
[4].
Very recent works attempt to build regression forests that
are conditioned on some global/additional property. Sun et
al. [3] propose a conditional regression forest model for
human pose estimation. During training, at each leaf node,
the probabilistic vote is decomposed into the distribution
of 3D body joint locations for each codeword (leaf IDs)
and the codeword mapping probability. The latent variable
can encode both known and unknown/uncertain property of
the pose estimation problems. When the global property is
unknown, they propose to jointly estimate the body joint
locations and the global property. A similar work, proposed
by Dantone et al. [4], uses a regression forest model conditioned on head pose for facial feature detection. In their
method, they divide the training set into subsets according
to head pose yaw angle. An individual regression forest is
trained on each subset and during testing a set of regression
trees is selected according to the estimated probability of
Yang and Patras are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK. {heng.yang,
i.patra}@eecs.qmul.ac.uk
This work has been funded in part by an EPSRC grant (EP/G033935/1).
Heng Yang is funded by a CSC/Queen Mary joint PhD scholarship.
the head pose. The latter is given by an additional forest
trained to perform pose estimation. Both models
have shown that learning the probabilities on the parameter
space conditioned on global properties can dramatically
increase the detection accuracy while still maintaining a
low computational cost. However, from the perspective of
decision tree induction, neither [3] nor [4] has exploited the
information provided by global properties to improve the
quality of decision trees, i.e. in the procedure of tree growing,
the additional information is inoperative.
Reflecting on this point, in this paper we ask one basic
question: Can we improve the quality of learned decision
trees with the help of additional information? The additional
information (such as global properties described above) we
are going to consider is only available at the training stage
but not available for testing, e.g., the head pose information
as in [4]. To be consistent with the LUPI (learning using privileged information) paradigm proposed by Vapnik [5], this
kind of additional information is called privileged information. The LUPI paradigm has been successfully implemented
in an SVM type algorithm and was shown to significantly
increase the rate of convergence. Inspired by their paradigm,
we propose a mechanism for regression forests which allows
one to take advantage of the privileged information when
training trees.
To avoid the high computational expense of training an
individual forest per state, we adapt the method in [3], to
share the tree structure and learn conditional models for
different states at each leaf node. So as to integrate different
types of privileged properties, we propose to use hybrid
regression forests. Specifically, a sub-forest is trained based
on yaw angle, roll angle and gender respectively and a simple
ensemble scheme is proposed to combine them for the facial
feature detection.
The main contributions of this paper are the following: i)
we exploit privileged information for learning better decision
trees in regression forests; we do so by selecting the splitting
function at randomly selected internal nodes according to
privileged information-based criteria, by learning regression
models that are conditioned on the privileged information
at leaf nodes and by choosing the appropriate conditional
models according to estimates of the privileged information that the
trees provide; ii) we propose using a hybrid forest to capture
the information from different privileged properties such as
head pose and gender.
II. RELATED WORK ON LOCALISATION
Facial feature detection, or face parts localisation, is a
well-studied problem in computer vision. Earlier works can
be classified into two categories: holistic shape-based and
local feature-based.
A typical holistic method is the Active Appearance Model
(AAM) [6] approach which reconstructs the entire face using
an appearance model and estimates the shape by minimizing
the texture residual. Such methods have difficulties with
large variations in facial appearance due to head pose,
illumination or expression. Their localisation accuracy also
degrades drastically on unseen faces [7] and low-resolution
images. Another issue of AAM is the sensitivity to the
initialization of the gradient descent optimization method.
Several improvements have been proposed, e.g. [8] uses a
non-parametric representation of the landmark distribution
for optimization.
Local feature-based methods, or part-based modeling, have
been very popular in recent years. Instead of learning a
holistic model for all the parts of a face, one can learn
descriptions for individual parts and then combine them. In
[9], an independent GentleBoost classifier for each of the
20 facial points is learned separately based on Gabor filter
responses. Closest to our work, [4] proposed conditional
regression forests to detect individual parts and reported
close-to-human performance in real time. In their work,
conditional regression forests are proposed to deal with the
large variation caused by head poses.
Recent works also focus on incorporating global spatial
models with local part detection such as the Constrained
Local Models (CLM) [8]. Depending on how shape information is modeled, these methods can be classified into
two categories. The methods in the first category model the
shape information as a prior learned from all training samples
such as the BoRMaN point detector proposed by Valstar et
al. [10], which combines Support Vector Regressors (for
local part detection) with Markov Random Fields (for global
shape constraints). In a recent work [11], a tree-structured part
model was introduced for combining the local appearance
evidence and the spatial arrangement of different parts. The
second type is data-driven, in which the shape information
is always represented in a non-parametric way. A recent
work proposed in [12] combines the output of local detectors
with a non-parametric set of global models for part locations
in a Bayesian framework. This algorithm shows very good
performance even on images recorded in uncontrolled conditions like those in the LFPW dataset [12]. However, this method
requires the availability of a large number of exemplars, and it
is computationally expensive to choose the global models for
each test image. A recent work [13] proposed an “Explicit
Shape Regression” approach for face alignment. Instead of
using a fixed shape model, the shape constraint is encoded
into two-level boosted regression based on fern regressor
[14].
III. PRIVILEGED INFORMATION-BASED CONDITIONAL
REGRESSION FORESTS (PI-CRF)
The classical paradigm of supervised regression forests
(and of other standard supervised machine learning paradigms
such as SVM) is described as follows: given a set of
input/output pairs (training data)
(x_1, y_1), ..., (x_n, y_n),   x_i ∈ X, y_i ∈ Y
generated according to a fixed but unknown probability
P (x, y), the goal is to find a mapping function f : x → y
from a set of mapping functions F : X → Y that reduces
an error measure on the prediction y = f(x). Similar to [5],
in our PI-CRF paradigm, additional privileged information
y* ∈ Y* is available during training. That is, the training set
consists of triplets (x, y*, y) instead of pairs (x, y). The privileged
information y ∗ ∈ Y ∗ is available only during training and
belongs to a space which is different from the space Y. The
goal is the same as in the classical paradigm, that is to find
the best function f : x → y in the set of admissible functions.
A. PI-CRF Tree Induction
A regression forest T = {T_t} is an ensemble of regression
trees T_t. Each regression tree is most often induced greedily
based on a randomly selected subset of the training data
set {(x_i, y_i)}_{i=1}^N in the following manner [15]. An empty
tree starts with only one root node. Then a number of
split function candidates are sampled from a predefined
distribution. Each split function partitions the training set
into a left and right subset by applying a certain test on
x. Each candidate split function is evaluated according to a
certain scoring function, e.g. information gain, so that high
scores are assigned to splits that aid in predicting the output
y well, i.e. those that reduce the average uncertainty about
the target. The best split function, that is the one with highest
score, is selected and the test function is stored at this node.
Then, the training set is partitioned according to this test into
two subsets that are propagated to the two children nodes.
The same procedure is recursively applied at each subsequent
child node. The procedure stops when certain criteria are met,
typically, when there are fewer than a minimum number of
examples or a maximum tree depth is reached.
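The greedy induction loop described above can be sketched as follows; the function names, the stopping thresholds and the scoring interface are illustrative assumptions, not the authors' implementation:

```python
import random

def build_tree(samples, depth, score_fn, sample_splits,
               max_depth=20, min_samples=20):
    """Greedy induction of a single regression tree (illustrative sketch).

    samples       -- list of training elements
    sample_splits -- callable returning a pool of candidate split functions,
                     each mapping a sample to 0 (left) or 1 (right)
    score_fn      -- score_fn(left, right, parent): higher is better,
                     e.g. information gain
    """
    # Stopping criteria: maximum depth reached or too few samples.
    if depth >= max_depth or len(samples) < min_samples:
        return {"leaf": True, "samples": samples}

    best = None
    for split in sample_splits():                       # candidate pool
        left = [s for s in samples if split(s) == 0]
        right = [s for s in samples if split(s) == 1]
        if not left or not right:                       # degenerate split
            continue
        score = score_fn(left, right, samples)
        if best is None or score > best[0]:
            best = (score, split, left, right)

    if best is None:                                    # no valid split found
        return {"leaf": True, "samples": samples}

    _, split, left, right = best
    child = lambda subset: build_tree(subset, depth + 1, score_fn,
                                      sample_splits, max_depth, min_samples)
    return {"leaf": False, "split": split,
            "left": child(left), "right": child(right)}
```

With a variance-reduction score this grows an ordinary regression tree; as described next, PI-CRF only changes which uncertainty measure the scoring function evaluates at randomly chosen internal nodes.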
As stated before, in our proposed method, privileged
information is provided at the training stage and the training
set is {(x_i, y_i^*, y_i)}_{i=1}^N. Following the procedure above for
tree induction, at each non-leaf node, a set of split function
candidates are generated, however, at some randomly picked
internal nodes, the best split function is selected according
to its ability of predicting the privileged information y ∗ .
The split functions at other internal nodes are still selected
based on y. A similar concept is used in Hough Forests
[1]: by randomly selecting the uncertainty measure, nodes
decreasing class uncertainty and nodes decreasing the displacement uncertainty are interleaved throughout the tree.
However, our model is capable of capturing more informative
properties such as head pose and gender which is beyond
object detection. Following our proposed procedure, training
elements (e.g. image patches) which end at the leaves tend
to have low variance both in privileged information and in
displacements to the facial points.

Fig. 1: (Left) Tree induction of PI-RF vs. RF. (Right) Conditional models learned at each leaf. The training patches come from face images with large variation w.r.t. the privileged information (PI), such as the head pose. A classical RF attempts to gather patches around the same facial point at each leaf; however, as the example shows, the visual features vary considerably with changes in the PI and it is therefore difficult to gather them into the same leaf. By contrast, in the PI-RF framework the best split function at some random internal nodes (in red) is selected directly according to the PI. As such, patches stored at the leaves tend to have low variation both in the global properties and in the displacements. The information gain IG_y at dark nodes is calculated based on the entropy H_y (4), while at the coloured nodes the information gain IG_{y*} is calculated based on the entropy H_{y*} defined in (6). At each leaf, we group the patches according to their discrete PI-class labels and store 1) the probability of each label p(y^{k*}|l), and 2) the relative offsets Δ_{il}^k and weights ω_{il}^k to individual points, calculated from each cluster (indicated by different shapes) using a mean-shift algorithm.
We use the standard linear split functions in input feature
space, similar to [4]. From each training image, a set of
square patches {Pi = (xi , yi∗ , yi )} are randomly extracted.
In our facial feature localisation case, x = {x_1, x_2, ..., x_F}
represents F channels of image features. y* represents the
privileged information including the head pose and gender
status. y contains N displacement vectors from the patch
center to each of the N facial feature points. Then, the split
function compares the mean value of the f -th feature channel
in two regions p and q as follows:
t_{f,p,q,τ}(P) = 0 if x_f(p) < x_f(q) + τ, and 1 otherwise.   (1)
According to the binary test function of (1), the set of patches
P is divided into two subsets P_L and P_R. At each internal
node, a pool of such binary test candidates φ = (f, p, q, τ)
is generated and the candidate that maximizes the scoring
function is selected. The information gain (IG) is used as the
split scoring function in our method:

φ* = arg max_φ IG(φ).   (2)

The information gain is a popular criterion used to determine the quality of a split and has been used for both classification and regression [16]. As stated in [17], the information gain is simply the mutual information between the local node decision (left or right) and the predictive output, which is defined as:

IG(φ) = H(P) − (|P_L(φ)|/|P|) · H(P_L(φ)) − (|P_R(φ)|/|P|) · H(P_R(φ)).   (3)

Depending on the nature of Y and Y*, H(P) can either be a
discrete entropy or a differential entropy. |P| represents the
number of elements in the set P.
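The split test (1) and the selection of Eqs. (2)-(3) can be sketched as below; the patch layout, the region representation and the interleaving probability `p_pi` are illustrative assumptions:

```python
import random

def binary_test(patch, f, p, q, tau):
    """Eq. (1): compare the mean of feature channel f over two
    sub-regions p and q (each a (r0, r1, c0, c1) box) of the patch."""
    def region_mean(box):
        r0, r1, c0, c1 = box
        vals = [patch[f][r][c] for r in range(r0, r1) for c in range(c0, c1)]
        return sum(vals) / len(vals)
    return 0 if region_mean(p) < region_mean(q) + tau else 1

def information_gain(parent, left, right, entropy):
    """Eq. (3): IG = H(P) - |PL|/|P| H(PL) - |PR|/|P| H(PR)."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def select_split(patches, candidates, entropy_y, entropy_ystar, p_pi=0.5):
    """Eq. (2): return the candidate with maximal IG.  At a PI-CRF node
    the entropy is computed on the privileged labels y* with some
    probability (p_pi is an assumed parameter), otherwise on the target y."""
    entropy = entropy_ystar if random.random() < p_pi else entropy_y
    best, best_ig = None, float("-inf")
    for cand in candidates:
        left = [pt for pt in patches if cand(pt) == 0]
        right = [pt for pt in patches if cand(pt) == 1]
        if not left or not right:
            continue
        ig = information_gain(patches, left, right, entropy)
        if ig > best_ig:
            best, best_ig = cand, ig
    return best, best_ig
```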
In our case, since Y and Y ∗ are different spaces, with
different properties, we need to devise appropriate entropy
functions. For Y, we use the class-affiliation method proposed
by [4] to measure the uncertainty which is defined as:
H_Y(P) = − Σ_{n=1}^{N} (Σ_i p(c_n|P_i) / |P|) · log(Σ_i p(c_n|P_i) / |P|),   (4)

p(c_n|P_i) ∝ exp(−|y_i^n| / λ),   (5)
where p(cn |Pi ) indicates the probability that the patch Pi
is informative about the location of the feature point n.
The class affiliation assignment is based on the Euclidean
distance to the feature point. The variable λ is used to control
the steepness of this function. In this way, we can avoid
making a multivariate Normal distribution assumption on
multiple feature points and calculate the differential entropy
as in [16].
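The class-affiliation entropy of Eqs. (4)-(5) can be sketched as below; the per-patch normalisation of the affiliations is our reading of the proportionality in (5):

```python
import math

def class_affiliation_entropy(dists, lam=0.1):
    """Eq. (4) with the soft class affiliations of Eq. (5).

    dists[i][n] -- distance |y_i^n| from patch i to facial point n
    lam         -- steepness parameter lambda
    """
    n_patches, n_points = len(dists), len(dists[0])
    # Eq. (5): p(c_n | P_i) decays with the distance to point n;
    # normalised per patch (the paper states proportionality only).
    affil = []
    for row in dists:
        w = [math.exp(-d / lam) for d in row]
        z = sum(w)
        affil.append([v / z for v in w])
    # Eq. (4): entropy of the average affiliation over the patch set.
    h = 0.0
    for n in range(n_points):
        p = sum(a[n] for a in affil) / n_patches
        if p > 0:
            h -= p * math.log(p)
    return h
```

A set of patches that all lie near the same facial point yields low entropy, which is exactly what the split selection rewards.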
So far as Y ∗ is concerned, we only consider discrete
privileged information because: 1) it is difficult to obtain
the ground truth of continuous head pose for each face
image; 2) learning the model conditioned on continuous
variable is still not well studied, as stated in [3]. Therefore
we discretise the head pose information by partitioning the
pose space. In this context, head pose estimation becomes a
multi-class classification problem. The finite set of privileged
information classes is represented as Y ∗ = {1, 2, ..., K}.
For each class, let h_k be the number of occurrences of
the class, i.e. h_k = Σ_{P_i ∈ P} δ(y_i^* = k). The sum of all
counts is |P| = Σ_k h_k. The empirical class probabilities
p̂_k(P) = h_k / |P| are often used to calculate the naive entropy
estimate, e.g. H_N(P) = − Σ_{k=1}^{K} p̂_k(P) log p̂_k(P) (see e.g.
[16] and references therein). However, it is pointed out by
Nowozin [17] that the naive entropy estimate is not a good
estimator for calculating information gain since it is biased
and universally underestimates the true entropy. Therefore, as
suggested in [17], we use the Grassberger entropy estimator
[18], given as:

H_{Y*}(P) = log|P| − (1/|P|) Σ_{k=1}^{K} h_k G(h_k),   (6)

where the function G is given by
G(h_k) = ψ(h_k) + (1/2)(−1)^{h_k} (ψ((h_k + 1)/2) − ψ(h_k/2)),
and ψ is the digamma function. For large h_k the function G behaves like a logarithm, and (6) coincides with the naive entropy estimate as the counts grow. For small
h_k, the estimate given by (6) is shown to be more accurate.
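The estimator (6) takes only a few lines; the digamma implementation below is a standard recurrence-plus-asymptotic-series approximation (an assumption, since the paper does not discuss implementation details):

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an
    asymptotic series, accurate enough for this illustration."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x \
        - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def grassberger_entropy(counts):
    """Eq. (6): H = log|P| - (1/|P|) sum_k h_k G(h_k), where
    G(h) = psi(h) + 0.5 * (-1)**h * (psi((h + 1) / 2) - psi(h / 2))."""
    n = sum(counts)
    acc = 0.0
    for h in counts:
        if h == 0:
            continue
        g = digamma(h) + 0.5 * (-1) ** h * (digamma((h + 1) / 2)
                                            - digamma(h / 2))
        acc += h * g
    return math.log(n) - acc / n
```

For large counts the estimate agrees with the naive plug-in entropy; for small counts it corrects the plug-in bias.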
So far we have discussed how to estimate the entropy for
both Y and Y ∗ . During tree induction, at each internal node,
we randomly select one of the entropy estimators (4) and (6) to
calculate the information gain (3).
B. Conditional Regression Model
The purpose of using the above procedure of tree construction is to gather training patches with low variance both
in displacements to feature points and in the status of privileged
information at leaf nodes. This section provides a description
of our conditional regression model based on [3]. Firstly, at
each leaf node, we calculate the probability of each state
(or class) of the privileged information. Let n be the total
number of training patches that arrived at a leaf node l, and
let nk be the number of patches belonging to class k. Then
the probability for the class k at leaf l is
p(y^{k*}|l) = n_k / n,   (7)
where y k∗ ∈ Y ∗ is a shorthand notation that y ∗ belongs to
the class k. Note that, in order to balance different classes
when their distribution in the original training set is not
uniform, the probabilities should be divided by the classes'
original proportions.
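Eq. (7) with the re-balancing remark can be sketched as follows (the final renormalisation is our assumption about how the division is applied):

```python
def leaf_class_probs(leaf_counts, train_props=None):
    """Eq. (7): p(y^{k*} | l) = n_k / n for the n_k patches of class k
    among the n patches that reached leaf l.  If train_props gives a
    non-uniform class distribution of the original training set, each
    probability is divided by that proportion and renormalised."""
    n = sum(leaf_counts)
    probs = [c / n for c in leaf_counts]
    if train_props is not None:
        probs = [p / q for p, q in zip(probs, train_props)]
        z = sum(probs)
        probs = [p / z for p in probs]
    return probs
```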
We use the same algorithm as [3], [19] to learn a compact regression model for each facial point conditioned on
the privileged information. More specifically, a mean-shift
algorithm with a Gaussian kernel of fixed bandwidth is
applied on the subset of patches belonging to each of the
privileged information classes in order to cluster the relative
votes. Then, at each leaf l, for each point i in class k, the
largest cluster is stored. More specifically, we store the
relative vote Δ_{il}^k, given by the mean-shift mode, and the
weight ω_{il}^k, that is, the relative size of the
cluster. Then, when a test patch centred at position (pixel) z_x
ends at leaf l, the conditional distribution of votes for the
i-th feature point in class k is approximated as

p(y_i | y^{k*}, l) ∝ ω_{il}^k · exp(−‖(y_i − (Δ_{il}^k + z_x)) / b‖_2^2) · δ(‖Δ_{il}^k‖_2^2 ≤ γ),   (8)
where b is the kernel bandwidth and γ is the threshold that controls distant voting.
For notational clarity we will drop the subscript i in the
subsequent equations.
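At each leaf, the per-class model of Eq. (8) therefore stores a mean-shift mode Δ and its relative weight ω. A minimal fixed-bandwidth Gaussian mean-shift sketch (starting the iteration at the densest sample is our simplification):

```python
import math

def mean_shift_mode(points, bandwidth, iters=30):
    """Find (approximately) the largest mode of a set of 2-D offsets
    with a fixed-bandwidth Gaussian kernel."""
    b2 = 2.0 * bandwidth ** 2
    def density(c):
        return sum(math.exp(-((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) / b2)
                   for p in points)
    m = list(max(points, key=density))     # start at the densest sample
    for _ in range(iters):
        num0 = num1 = den = 0.0
        for p in points:
            w = math.exp(-((p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2) / b2)
            num0 += w * p[0]; num1 += w * p[1]; den += w
        m = [num0 / den, num1 / den]
    return m

def leaf_model(offsets_by_class, bandwidth=1.0):
    """For each PI class k: Delta = mode of the class's offsets,
    omega = fraction of the class's patches within one bandwidth."""
    model = {}
    for k, pts in offsets_by_class.items():
        mode = mean_shift_mode(pts, bandwidth)
        near = sum(1 for p in pts if (p[0] - mode[0]) ** 2
                   + (p[1] - mode[1]) ** 2 <= bandwidth ** 2)
        model[k] = {"delta": mode, "omega": near / len(pts)}
    return model
```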
In this work we consider different types of privileged
information, where Y* can contain discrete and/or continuous dimensions, for example continuous 3-dimensional head
pose and discrete gender information. In order to avoid
partitioning this heterogeneous space into K classes and
training a single forest, we train several sub-forests, one for
each of the dimensions. More specifically, we discretise the
yaw angle into 5 classes and the roll angle into 3 classes, use
the gender information (2 classes), and train three sub-forests.
The discretization / partitioning could be performed in an
unsupervised manner (e.g. by clustering of the training data).
Here we use uniform binning.
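Uniform binning of a continuous privileged property can be sketched as below (the angle ranges are assumptions for illustration):

```python
def discretise(value, lo, hi, n_bins):
    """Map a continuous value in [lo, hi] to one of n_bins uniform bins,
    clamping values outside the range."""
    if value <= lo:
        return 0
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

# e.g. yaw angle (assumed range [-90, 90] degrees) -> 5 classes,
# roll angle -> 3 classes; gender is already a 2-class label.
```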
IV. PI-CRF INFERENCE
In what follows, we will use X to denote the set of
voting elements (image patches) from a test image. A random
forest trained as described in Section III-A is utilised to
discriminatively map each voting element to class labels
and specific conditional models. For each test image, we
calculate a scoring function defined over the class labels
of the privileged information, while for each facial feature
point, we calculate a scoring function S defined over the
facial point location on image, y ∈ Y. These functions can
be written as a sum of probabilistic votes contributed from
all voting elements. The scoring function on each class is
defined as:
S(y^{k*}|X) = Σ_{x∈X} Σ_{l∈L_x} p(y^{k*}|l) p(l|x),   (9)
where L_x = {l} is the set of the IDs of the leaves at which
the voting element x ends up (in the different trees of the
forest).
Then, we estimate the most likely class label of the
privileged information ŷ ∗ as:
ŷ* = arg max_{y^{k*} ∈ Y*} S(y^{k*}|X).   (10)
The probability for each state of the privileged information
can be incorporated in the scoring function as a prior
probability over the corresponding state. Then, the voting
score for each facial point can be written as
S_{y^{k*}}(y|X) = Σ_{x∈X} Σ_{l∈L_x} p(y|y^{k*}, l) · p(y^{k*}|l) p(l|x).   (11)
Based on the estimate of y* from (10), the best scoring
candidate for each facial point is selected as:
Ŝ(y|X) = S_{y^{k*}}(y|X) |_{y^{k*} = ŷ*}.   (12)
In order to combine different types of privileged information we propose a very simple scheme: the best scoring
candidates for the different types of privileged information,
as obtained above, are normalized and summed into a new
voting map. Finally, feature points are localised using a
mean-shift algorithm as in [4].
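The class-estimation step of Eqs. (9)-(10) can be sketched as below; the data layout and the uniform weighting p(l|x) over trees are assumptions:

```python
def estimate_privileged_class(leaf_sets, leaf_probs):
    """Eqs. (9)-(10): S(y^{k*}|X) = sum_x sum_{l in L_x} p(y^{k*}|l) p(l|x),
    with p(l|x) taken uniform over the trees, followed by the argmax.

    leaf_sets[x]     -- leaf IDs reached by voting element x, one per tree
    leaf_probs[l][k] -- p(y^{k*} | l) stored at leaf l (Eq. (7))
    """
    scores = {}
    for leaves in leaf_sets:
        for l in leaves:
            for k, p in enumerate(leaf_probs[l]):
                scores[k] = scores.get(k, 0.0) + p / len(leaves)
    best = max(scores, key=scores.get)
    return best, scores
```

The facial-point scores of Eq. (11) are accumulated in the same double loop, weighted by the selected class's conditional vote model, and the per-property voting maps are then normalised, summed and fed to mean-shift.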
V. EXPERIMENTS
A. Dataset
We evaluate our method in the problem of detecting facial
features in images recorded in uncontrolled conditions. There
are a few typical datasets, such as Labeled Face Parts in the Wild
(LFPW1) [12] and Labeled Faces in the Wild (LFW2). The
LFPW dataset shares only image URLs on the web, and some of
them are no longer valid. The LFW dataset contains facial
images of 5,749 individuals, 1,680 of whom have more than one
image, exhibiting a large variety of face appearances (e.g.,
pose, expression, ethnicity, age, gender) as well as general
imaging and environmental conditions.
1 http://www.kbvt.com/LFPW
2 http://vis-www.cs.umass.edu/lfw/
TABLE I: Estimation accuracy of privileged information.

Property            Accuracy
yaw (5 classes)     68.25%
roll (3 classes)    85.10%
gender (2 classes)  87.5%
Therefore, the LFW dataset with the annotations from [4] is
utilised for training the forest, and the trained model is applied on both LFW and LFPW. In detail, 13,233 face images
from the LFW database are annotated with the locations
of 10 facial feature points. For the privileged information,
[4] provides discrete head pose labels for the yaw angle
(left profile, left, front, right, right profile). Based on the
locations of the facial points, we roughly estimate the roll
angles of head poses using the POSIT algorithm [20] and
discretise them into 3 labels (left tilt, upright, right tilt). We
discard the pitch angle because it is difficult to get the ground
truth for the face images in the wild. We also annotate the
gender status (male, female) for each face image and use it
as another type of privileged information.
B. Training and Testing Setup
In our experiments, the forest consists of 3 sub-forests
each of which utilises one of the three sources of privileged
information: the yaw and the roll angles of the head pose
and the gender class. Our experimental setup follows the default
setting of the facial feature detector (called C-RF3) from [4].
The key parameters are: maximum depth of each
tree (20), test candidates at each split node (2500), resized face
box (125 × 125), patch size (0.25 × face box size), and image
features (one channel of normalized gray values, 35 channels
of Gabor features and 2 channels of Sobel features). Also,
1500 randomly selected image samples are used for training
each tree. The same parameters of the mean-shift algorithm are used in the testing phase.
C. Results
As in previous work [4], [10], we report the localisation
errors as a fraction of the inter-ocular distance DIO . For
the test images from LFW dataset, the locations of eye
centers are not annotated so the inter-ocular distance is
calculated as the distance between the midpoints of the
ground truth eye corners. An estimate is regarded as a
correct detection if the localisation error is less than 0.1 DIO,
and this criterion is used to calculate the accuracy. We also
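The evaluation metric can be sketched as follows; here left_eye and right_eye stand for the eye-centre proxies (on LFW, the midpoints of the eye-corner annotations):

```python
import math

def localisation_metrics(pred, gt, left_eye, right_eye, threshold=0.1):
    """Per-point localisation error as a fraction of the inter-ocular
    distance D_IO, and the fraction of points detected correctly
    (error below threshold * D_IO)."""
    d_io = math.dist(left_eye, right_eye)
    errors = [math.dist(p, g) / d_io for p, g in zip(pred, gt)]
    accuracy = sum(e < threshold for e in errors) / len(errors)
    return errors, accuracy
```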
evaluate the performance of our model to estimate the latent
privileged properties. Surprisingly, the estimation accuracy of
the privileged information (discrete labels) is quite high, as
shown in Table I. Even though the accuracy of the estimation
of the yaw angle is not as good as the roll angle and the
gender, it almost reaches the level reported in [4]
(i.e. 72.15%), where a dedicated additional forest is trained
explicitly for head pose estimation.
In order to show the efficacy of tree induction of PI-CRF,
we conducted the experiments for both PI-CRF and CRF. The
only difference between them is in the tree induction process:
3 http://blog.gimiatlicho.webfactional.com/?page_
id=38
Fig. 2: Overall performance against the number of trees per privileged information (1 to 4), plotted as the fraction of fiducial points detected against the fraction of inter-ocular distance.
PI-CRF takes advantage of the privileged information
while CRF grows the trees in the regular way. During
testing, we randomly select 1000 images each time and
record the performance of PI-CRF, CRF and C-RF. This is
repeated four times. We report results on C-RF using
the trained trees publicly provided by the authors of [4].
Note that the results that we obtained are different from
those reported in their paper, which might be due to
different image features, parameter settings, or forest
size. For fairness, in our experiments, we use the same
experimental setup as their on-line code and the same number
(10) of trees; more specifically, 4 trees use the yaw angle as
privileged information, 3 the roll angle and 3 the gender.
The performance is shown in Fig. 3. Our method does not reach
the reported close-to-human performance; however, under the
same experimental setup, our PI-CRF outperforms the C-RF
and CRF in terms of mean error, detection accuracy and also
the successful detection against error thresholds. Besides the
performance, our training phase is also more efficient than that of C-RF: C-RF trains an additional forest for head pose estimation
and also one forest for each head pose subset, while only
one forest is trained in our method. In the meantime, the
improvement over CRF strongly validates the efficacy of
our proposed tree induction method. We also report how
the performance improves as the number of trees for each
privileged information increases in Fig. 2.
We also carried out the comparison on testing images of
LFPW dataset with higher image quality. Again the same
experimental setting is utilised for fair comparison and the
performance is shown in Fig. 4. Compared with the reported
result of [12] ([13] reports results in comparison with
it), our detector is slightly worse, but: 1) our detector is not
trained on this dataset; 2) both [13] and [12] have considered
the shape constraints, while our method, as well as [4], is
only a local detector. Local detection sometimes results in
failure cases with large localisation errors as shown in Fig.
4 (right). Even a very small number of such outliers will
heavily influence the average when the estimation is very far
away from the correct location. This almost never happens
on detectors with global shape constraints. Therefore we
also calculate the detection accuracy, as shown in Fig. 4
(middle). Unfortunately, neither [13] nor [12] has reported
this accuracy measurement for comparison.
Fig. 3: Overall performance and comparison to other methods (Human, C-RF (reported), our PI-CRF, C-RF, CRF) per facial point on the LFW dataset. (Left) Mean error / inter-ocular distance. (Middle) Accuracy. (Right) Accuracy plotted against the success thresholds.
Fig. 4: Overall performance on the LFPW dataset (Human, Belhumeur et al. [12], PI-CRF, C-RF). (Left) Mean error / inter-ocular distance. (Middle) Accuracy. (Right) Some qualitative results, of which the second row shows error cases caused by occlusion or low image quality.
VI. CONCLUSION
In this paper we have presented an algorithm for facial
feature detection called privileged information-based conditional regression forests. We show how our method can
utilise privileged information that is available only at the
training stage by selecting the best split functions according
to privileged information gain at some random internal
nodes. At each leaf node, conditional models are learned
based on state clusters of the privileged information, with
one per state. The latent privileged information and locations
of feature points are jointly estimated during testing. We
demonstrate that, using the yaw and roll angles of the head
pose and the gender status as privileged information, our model
outperforms the conditional regression forests method on two
challenging datasets, namely LFW and LFPW.
REFERENCES
[1] Gall, J., Yao, A., Razavi, N., Van Gool, L., Lempitsky, V.: Hough
forests for object detection, tracking, and action recognition. Pattern
Analysis and Machine Intelligence, IEEE Transactions on 33 (2011)
2188–2202
[2] Leistner, C., Saffari, A., Santner, J., Bischof, H.: Semi-supervised
random forests. In: Computer Vision, 2009 IEEE 12th International
Conference on, IEEE (2009) 506–513
[3] Sun, M., Kohli, P., Shotton, J.: Conditional regression forests for
human pose estimation, CVPR (2012)
[4] Dantone, M., Gall, J., Fanelli, G., Van Gool, L.: Real-time facial
feature detection using conditional regression forests. In: CVPR. (2012)
[5] Vapnik, V., Vashist, A.: A new learning paradigm: Learning using
privileged information. Neural Networks 22 (2009) 544–557
[6] Cootes, T., Wheeler, G., Walker, K., Taylor, C.: View-based active
appearance models. Image and Vision Computing 20 (2002)
[7] Gross, R., Matthews, I., Baker, S.: Generic vs. person specific active
appearance models. Image and Vision Computing 23 (2005)
[8] Saragih, J., Lucey, S., Cohn, J.: Face alignment through subspace
constrained mean-shifts. In: ICCV. (2009)
[9] Vukadinovic, D., Pantic, M.: Fully automatic facial feature point
detection using gabor feature based boosted classifiers. In: Systems,
Man and Cybernetics,IEEE International Conference on. (2005)
[10] Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection
using boosted regression and graph models. In: CVPR. (2010)
[11] Zhu, X., Ramanan, D.: Face detection, pose estimation and landmark
localization in the wild. In: CVPR. (2012)
[12] Belhumeur, P., Jacobs, D., Kriegman, D., Kumar, N.: Localizing parts
of faces using a consensus of exemplars. In: CVPR. (2011)
[13] Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit
shape regression. In: CVPR. (2012)
[14] Dollar, P., Welinder, P., Perona, P.: Cascaded pose regression. In:
CVPR. (2010)
[15] Breiman, L.: Random forests. Machine learning 45 (2001) 5–32
[16] Criminisi, A.: Decision forests: A unified framework for classification,
regression, density estimation, manifold learning and semi-supervised
learning. Foundations and Trends in Computer Graphics and Vision
7 (2011) 81–227
[17] Nowozin, S.: Improved information gain estimates for decision tree
induction. ICML (2012)
[18] Grassberger, P.: Entropy estimates from insufficient samplings. Arxiv
preprint physics/0307138 (2003)
[19] Girshick, R., Shotton, J., Kohli, P., Criminisi, A., Fitzgibbon, A.: Efficient regression of general-activity human poses from depth images.
In: ICCV. (2011)
[20] DeMenthon, D., Davis, L.: Model-based object pose in 25 lines of
code. International Journal of Computer Vision 15 (1995) 123–141