Face Detection Using
Large Margin Classifiers
Ming-Hsuan Yang Dan Roth Narendra Ahuja
Presented by Kiang “Sean” Zhou
Beckman Institute
University of Illinois at Urbana-Champaign
Urbana, IL 61801
Overview
Large margin classifiers have demonstrated success
in visual learning
Support Vector Machine (SVM)
Sparse Network of Winnows (SNoW)
Aim to present a theoretical account for their
success and suitability in visual recognition
Theoretical and empirical analysis of these two
classifiers within the context of face detection
Generalization error: expected error on test data
Efficiency: how efficiently the classifier can represent expressive (nonlinear) features
Face Detection
Goal: Identify and locate human faces in an
image (usually gray scale) regardless of
their position, scale, in-plane rotation,
orientation, pose, and illumination
The first step for any automatic face
recognition system
A very difficult problem! First aim to detect
upright frontal faces, with some ability to
detect faces under varying pose, scale, and
illumination
See “Detecting Faces in Images: A Survey”,
by M.-H. Yang, D. Kriegman, and N. Ahuja,
to appear in IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2002.
http://vision.ai.uiuc.edu/mhyang/face-detection-survey.html
Where are the faces, if any?
Large Margin Classifiers
Based on linear decision surface (hyperplane)
f: w^T x + b = 0
Compute w and b from samples
SNoW: based on Winnow with multiplicative update
rule
SVM: based on Perceptron with additive update
rule
Though SVMs can be developed independently of
any relation to the Perceptron, we view both as
large margin classifiers for the purpose of the
theoretical analysis (the shared decision rule is sketched below)
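Both classifiers share the same linear decision rule and differ only in how w and b are updated. Below is a minimal illustrative Python sketch (not from the original slides): the shared decision rule plus the Perceptron-style additive update underlying the SVM view; the Winnow-style multiplicative update follows on the next slide.

    import numpy as np

    def predict(w, b, x):
        # Shared linear decision rule: sign of w^T x + b
        return 1 if np.dot(w, x) + b >= 0 else 0

    def perceptron_update(w, b, x, y, lr=1.0):
        # Additive update (Perceptron / SVM side): shift w toward or away from x on a mistake
        if predict(w, b, x) != y:
            s = 1.0 if y == 1 else -1.0
            w = w + lr * s * x
            b = b + lr * s
        return w, b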
Sparse Network of Winnows (SNoW)
[Architecture diagram: target nodes connected by weighted links to an input feature vector]
Online, mistake-driven algorithm based on Winnow
Attribute (feature) efficient
Allocation of nodes and links is data-driven
time complexity depends on the number of active features
Mechanisms for discarding irrelevant features
Allows for combining tasks hierarchically
Winnow Update Rule
Multiplicative weight update algorithm:
Prediction is 1 iff w · x ≥ θ
If Class = 1 but w · x < θ, then w_i ← α w_i (if x_i = 1) (promotion)
If Class = 0 but w · x ≥ θ, then w_i ← β w_i (if x_i = 1) (demotion)
Usually α = 2, β = 0.5
Number of mistakes in training is O(k log n), where k is the
number of relevant features of the concept and n is the
number of features
Tolerates a large number of features
Mistake bound is logarithmic in the number of features
Advantageous when function space is sparse
Robust in the presence of noisy features
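A minimal sketch of the Winnow training loop described above, with promotion factor α = 2 and demotion factor β = 0.5 over binary feature vectors (an illustration only; the threshold choice θ = d/2 is an assumption, not specified on the slide).

    import numpy as np

    def winnow_train(X, y, alpha=2.0, beta=0.5):
        # X: (n, d) binary feature matrix, y: labels in {0, 1}
        n, d = X.shape
        w = np.ones(d)            # all weights start at 1
        theta = d / 2.0           # threshold (assumed choice)
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) >= theta else 0
            if yi == 1 and pred == 0:      # missed a positive: promote active features
                w[xi == 1] *= alpha
            elif yi == 0 and pred == 1:    # false alarm: demote active features
                w[xi == 1] *= beta
        return w, theta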
Support Vector Machine (SVM)
Can be viewed as a perceptron with maximum
margin
Based on statistical learning theory
Extends to nonlinear SVMs via the kernel trick
Computational efficiency
Expressive representation with nonlinear features
Have demonstrated excellent empirical results in
visual recognition tasks
Training can be time-consuming, though fast
algorithms have been developed
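For concreteness, a hedged sketch of the two SVM variants evaluated later, using scikit-learn; the library choice and the regularization constant C are assumptions, not specified on the slides.

    from sklearn.svm import SVC

    # Linear SVM on raw intensity vectors
    linear_svm = SVC(kernel="linear", C=1.0)

    # Nonlinear SVM via the kernel trick: 2nd-order polynomial kernel
    poly_svm = SVC(kernel="poly", degree=2, C=1.0)

    # X_train: (n_samples, 400) intensity vectors from 20x20 windows
    # y_train: 1 for face, 0 for nonface
    # linear_svm.fit(X_train, y_train)
    # poly_svm.fit(X_train, y_train)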
Generalization Error Bounds: SVM
Theorem 1: If the data is L2 norm bounded as
||x||_2 ≤ b, and the family of hyperplanes w is
such that ||w||_2 ≤ a, then for any margin
γ > 0, with probability 1 − δ over n random
samples, the misclassification error err(w) satisfies

err(w) ≤ k/n + sqrt( (C/n) · ( (a^2 b^2 / γ^2) ln(nab/γ + 2) + ln(1/δ) ) )

where C is a constant and k = |{i : w^T x_i y_i < γ}| is the number of
samples with margin less than γ
Generalization Error Bounds: SNoW
Theorem 2: If the data is L∞ norm bounded as
||x||_∞ ≤ b, and the family of hyperplanes w is such
that ||w||_1 ≤ a and Σ_j (w_j/||w||_1) ln(w_j/||w||_1) ≥ −c, then
for any margin γ > 0, with probability 1 − δ over n
random samples, the misclassification error err(w) satisfies

err(w) ≤ k/n + sqrt( (C/n) · ( (b^2 (a^2 + ac) / γ^2) ln(nab/γ + 2) + ln(1/δ) ) )

where C is a constant and k = |{i : w^T x_i y_i < γ}| is the number of
samples with margin less than γ
Generalization Error Bounds
In summary
SNoW has a lower generalization error bound if
the data is L∞ norm bounded and there is a small L1 norm
hyperplane
SVM has a lower generalization error bound if
the data is L2 norm bounded and there is a small L2 norm
hyperplane
Comparing the corresponding mistake bounds:
SVM (additive): E_a ∝ ||w||_2^2 · max_i ||x_i||_2^2
SNoW (multiplicative): E_m ∝ 2 ln(2n) · ||w||_1^2 · max_i ||x_i||_∞^2
SNoW performs better than SVM if the data has a
small L∞ norm but a large L2 norm
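A small illustrative sketch that evaluates the two quantities above for a given weight vector w and data matrix X, to check which classifier's bound is smaller (symbols as in the summary; the code itself is an assumption, not part of the slides).

    import numpy as np

    def compare_bounds(w, X):
        # E_a ~ ||w||_2^2 * max_i ||x_i||_2^2               (additive / Perceptron / SVM)
        # E_m ~ 2 ln(2n) * ||w||_1^2 * max_i ||x_i||_inf^2  (multiplicative / Winnow / SNoW)
        n = X.shape[0]
        E_a = np.linalg.norm(w, 2) ** 2 * np.max(np.linalg.norm(X, axis=1)) ** 2
        E_m = 2 * np.log(2 * n) * np.linalg.norm(w, 1) ** 2 * np.max(np.abs(X)) ** 2
        return E_a, E_m

    # Sparse w with bounded (e.g. binary) features gives the data a small L_inf norm
    # but a large L2 norm, so E_m tends to be smaller and SNoW is favored.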
Efficiency
Features in nonlinear SVMs are more
expressive than linear features (and
efficient to compute as a result of the kernel trick)
Can use conjunctive features in SNoW
as nonlinear features
Represent the occurrence (conjunction)
of intensity values of m pixels within a
window by a new feature value
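A hedged sketch of one way such conjunctive features could be generated for m = 2: each new binary feature encodes the positions and quantized intensity values of a pair of neighboring pixels, yielding the sparse set of active feature indices that SNoW expects. The pairing scheme and quantization are assumptions for illustration.

    import numpy as np

    def conjunctive_features(patch, levels=4):
        # patch: 2-D array of gray values in [0, 255]
        q = (patch.astype(int) * levels) // 256     # quantize intensities to `levels` bins
        h, w = q.shape
        active = []
        for r in range(h):
            for c in range(w):
                for dr, dc in ((0, 1), (1, 0)):     # right and down neighbors
                    rr, cc = r + dr, c + dc
                    if rr < h and cc < w:
                        pair_pos = (r * w + c) * (h * w) + (rr * w + cc)
                        pair_val = q[r, c] * levels + q[rr, cc]
                        active.append(pair_pos * levels * levels + pair_val)
        return active   # indices of active (value 1) features; all others are 0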
Experiments
Training set:
6,977 20×20 upright, frontal images: 2,429 faces and
4,548 nonfaces
Appearance-based approach:
Histogram equalize each image
Convert each image to a vector of intensity values (a preprocessing sketch follows below)
Test set:
24,045 images: 472 faces and 23,573 nonfaces
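A minimal preprocessing sketch matching the appearance-based pipeline above: histogram-equalize each 20×20 window, then flatten it to a 400-dimensional intensity vector. The NumPy-based equalization routine is an assumption; any standard implementation would do.

    import numpy as np

    def preprocess(patch):
        # patch: 20x20 array of gray values in [0, 255]
        flat = patch.ravel()
        hist, _ = np.histogram(flat, bins=256, range=(0, 256))
        cdf = hist.cumsum().astype(float)
        cdf = (cdf - cdf.min()) * 255.0 / max(cdf.max() - cdf.min(), 1.0)  # equalize
        equalized = cdf[flat.astype(int)]          # map each pixel through the CDF
        return equalized / 255.0                   # 400-D vector of normalized intensities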
Empirical Results
Methods compared (ROC curves):
SVM with linear features
SNoW with local features
SNoW with conjunctive features
SVM with 2nd-order polynomial kernel
SNoW with local features performs better than SVM with linear features
SVM with a 2nd-order polynomial kernel performs better than SNoW with conjunctive features
Discussion
[ROC curves again compare SVM with linear features, SNoW with local features, SNoW with conjunctive features, and SVM with a 2nd-order polynomial kernel]
Studies have shown that the target hyperplane
function in visual pattern recognition is usually
sparse, i.e., the L2 and L1 norms of w are usually small
Perceptron therefore does not have any theoretical
advantage over Winnow (or SNoW)
In the experiments, the L2 norm of the data is
on average 10.2 times larger than its L∞ norm
Empirical results conform to the theoretical analysis
Conclusion
Theoretical and empirical arguments suggest the
SNoW-based learning framework has important
advantages for visual learning tasks
SVMs have nice computational properties for
representing nonlinear features as a result of the
kernel trick
Future work will focus on efficient methods (i.e.,
analogous to the kernel trick) to represent nonlinear
features within the SNoW-based learning framework